Author: James McPherson
One of the key features of Oracle Solaris 11.4 is Solaris Analytics. This is a radical redesign of the way we look at how our Solaris systems are performing. To set the scene for why and how it is useful to you, let’s take a short walk down memory lane.
In my years as a customer and working in support for Sun, whenever we had an issue with how our systems were performing we would crack out the old favourites: iostat, vmstat, lockstat. We would run those tools for a while as the problem was occurring and then post-process the information.
At some point during my support career, somebody realised that if we gathered a lot more information about a problematic system, then we would have a better chance to zero in on where the problem lay. The Explorer tool quickly became a recommended component, and engineers within the support organisation spent countless hours figuring out ways to interrogate this data.
All of which was fine until the explosion of virtualisation and the invention of DTrace. DTrace enables you to dig very deeply into your system with a singular focus. What it does not do, however, is pull other sources of relevant information together to help you truly answer your questions about what the system is actually doing.
So how do we solve these problems? How do we help the duty engineer at 3am who is wondering why a business-critical service has failed or is just performing badly? How can we give you a way to say “Actually, it’s not the hardware or the operating system which is misbehaving, it’s a specific application”?
With some inspiration from the ZFS Storage Appliance interface, and taking note of our engineering culture of designing observability features in, the Solaris Analytics project was born. Architected by Liane Praza and Bart Smaalders, it pulls together information from kstats, hardware configuration, audit and FMA events, service status, processes and certain utilities. This information is stored in the Solaris Stats Store (svc:/system/sstore), and made visible to you via the Solaris Analytics Web UI (“the bui”, svc:/system/webui/server) as well as via a command-line tool, sstore(1).
For the rest of this post I’ll discuss our nomenclature and give some simple examples of how you can immediately start seeing what your system is doing.
The Stats Store makes use of providers in order to gather stats. These providers include the Solaris Audit facility, FMA, SMF, the process table and kstats. There is also a userspace provider, so that you can provide your own statistics. (I’ll cover userspace stats in a later post.)
This brings us to the concept of the Stats Store namespace, which is composed of classes, resources, stats and partitions. Classes are groupings of resources which share common statistics. For example, each resource in //:class.disk has these statistics:
jurassic:jmcp $ sstore list //:class.disk//:stat.*
IDENTIFIER
//:class.disk//:stat.read-bytes
//:class.disk//:stat.read-ops
//:class.disk//:stat.write-bytes
//:class.disk//:stat.write-ops
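To see which classes your system exposes, the same wildcard works one level up. The listing below is a sketch, trimmed to the classes that appear elsewhere in this post; your system will report many more:
$ sstore list //:class.*
IDENTIFIER
//:class.disk
//:class.event
//:class.zpool
...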
Some statistics can be partitioned, too:
jurassic:jmcp $ sstore list //:class.disk//:stat.read-bytes//:part.*
IDENTIFIER
//:class.disk//:stat.read-bytes//:part.controller
//:class.disk//:stat.read-bytes//:part.device
//:class.disk//:stat.read-bytes//:part.devid
//:class.disk//:stat.read-bytes//:part.disk
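A partition becomes useful when you export a statistic through it. As a sketch, reusing the -p -1 form from the disk-error example later in this post, you could ask for the most recent read-bytes value broken down by disk (the output follows the same layout as that example):
$ sstore export -p -1 //:class.disk//:stat.read-bytes//:part.disk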
We also have the concept of topology and mappings, so that you can see, for example, the zpool that your disk is part of:
jurassic:jmcp $ sstore list //:class.disk//:res.name/sd5//:class.zpool//:*
IDENTIFIER
//:class.disk//:res.name/sd5//:class.zpool//:res.name/rpool
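I would expect the mapping to be walkable in the other direction as well, so that you can ask which disks sit under a given zpool. The form below is inferred from the symmetry of the namespace rather than taken from captured output, so treat it as an assumption:
$ sstore list //:class.zpool//:res.name/rpool//:class.disk//:*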
Each of these names is an SSID, or Stats Store IDentifier, and the delimiter is “//:”. You can read all about SSIDs in the ssid(7) man page.
The sstore(1) command is our command-line interface to the Stats Store, and lets you list, capture, export and read information about statistics. (sstore also has an interactive shell mode, which lets you explore the namespace).
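If I remember the interactive mode correctly, you get to it by running sstore with no arguments, and inside it the same verbs work without the sstore prefix. The prompt and session below are a rough sketch from memory rather than captured output:
$ sstore
sstore> list //:class.disk//:stat.*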
I’ll finish this post with two simple examples of how we tie information together to make this feature more than just a bunch of numbers.
The first example demonstrates how we answer the question “How many failed login attempts have there been on this system?”. To do this we need to look at two types of Solaris Audit events: AUE_login and AUE_ssh. The first type refers to login attempts on any of the virtual or physical console devices; the second covers logins made over ssh. Let’s see how many login failures have been recorded since 1 January 2017 on a convenient example system:
$ sstore export -t 2017-01-01T00:00:00 \
"//:class.event//:event.adm-action//:op.filter(event=AUE_login,return/errval=failure)//:op.count" \
"//:class.event//:event.adm-action//:op.filter(event=AUE_ssh,return/errval=failure)//:op.count"
TIME VALUE IDENTIFIER
2018-01-27T10:19:57 7 //:class.event//:event.adm-action//:op.filter(event=AUE_login,return/errval=failure)//:op.count
2018-01-27T10:13:13 15 //:class.event//:event.adm-action//:op.filter(event=AUE_ssh,return/errval=failure)//:op.count
The filter operator lets you search on string or numeric values, or on elements of an nvlist (which is the format in which we receive events from the audit service). The count operator simply counts the data points provided to it and reports the current timestamp. The “-t 2017-01-01T00:00:00” option limits the query to data points that occurred after that time.
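You can vary the filter terms to ask related questions. As a sketch built from the same pieces as the query above (not captured output), dropping the return/errval condition should count every ssh login event in that period, successful or otherwise:
$ sstore export -t 2017-01-01T00:00:00 \
    "//:class.event//:event.adm-action//:op.filter(event=AUE_ssh)//:op.count"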
The second example looks at two aspects of disk errors: absolute numbers, and the rate at which they are occurring. Rather than wade through iostat(8) output and try to figure out whether any of the disks in my media server were dying, I can use sstore(1) instead. Let’s start by looking at the current disk error count, partitioned by device id:
$ sstore export -p -1 //:class.disk//:stat.errors//:part.devid
TIME VALUE IDENTIFIER
2018-01-27T10:30:04 //:class.disk//:stat.errors//:part.devid
id1,sd@SATA_____ST1000DM003-1CH1____________Z1D5K89L: 1731.0
id1,sd@SATA_____ST2000DM001-1ER1____________Z4Z0EQF2: 1729.0
id?: 0.0
id1,sd@n5000039fe2df1c15: 0.0
id1,sd@n5000cca248e72f12: 0.0
id1,sd@n5000cca39cd93711: 0.0
id1,sd@n5000cca242c08f82: 0.0
id1,sd@n5000c50067485f33: 0.0
id1,sd@n5000cca35dea1ed5: 0.0
id1,sd@n5000039ff3f0d2f0: 0.0
id1,sd@n5000cca248e728b6: 0.0
id1,sd@n5000039ff3f0d8d9: 0.0
id1,sd@n5000cca35de9c580: 0.0
id1,sd@SATA_____WDC_WD30EFRX-68E_____WD-WCC4N7CNYH0S: 1729.0
id1,sd@SATA_____SPCC_Solid_StateP1701137000000018623: 1730.0
id1,sd@n50014ee26441fe70: 0.0
id1,sd@n50014ee2b951e6ae: 0.0
We’ve got four devices with a non-zero error count. That might be bad if those counts are increasing quickly. Let’s have a look at the rate at which the errors are occurring:
$ sstore export -p -10 //:class.disk//:stat.errors//:part.devid//:op.rate
...
2018-01-27T10:35:14 //:class.disk//:stat.errors//:part.devid//:op.rate
id1,sd@SATA_____ST1000DM003-1CH1____________Z1D5K89L: 0.0
id1,sd@SATA_____ST2000DM001-1ER1____________Z4Z0EQF2: 0.0
id?: 0.0
id1,sd@n5000039fe2df1c15: 0.0
id1,sd@n5000cca248e72f12: 0.0
id1,sd@n5000cca39cd93711: 0.0
id1,sd@n5000cca242c08f82: 0.0
id1,sd@n5000c50067485f33: 0.0
id1,sd@n5000cca35dea1ed5: 0.0
id1,sd@n5000039ff3f0d2f0: 0.0
id1,sd@n5000cca248e728b6: 0.0
id1,sd@n5000039ff3f0d8d9: 0.0
id1,sd@n5000cca35de9c580: 0.0
id1,sd@SATA_____WDC_WD30EFRX-68E_____WD-WCC4N7CNYH0S: 0.0
id1,sd@SATA_____SPCC_Solid_StateP1701137000000018623: 0.0
id1,sd@n50014ee26441fe70: 0.0
id1,sd@n50014ee2b951e6ae: 0.0
This is the tenth data point, and you will observe that the rate of errors in each case is 0. A quick check of zpool status shows that my zpools are fine, leaving me with one less thing to worry about.
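For completeness, the zpool check is the usual one; zpool status -x prints a single line when every pool is healthy:
$ zpool status -x
all pools are healthy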
In my next post I’ll talk about how we bring this sort of information together in the WebUI. You can start exploring this yourself by uttering
# svcadm enable webui/server
and then pointing your browser at https://127.0.0.1:6787.
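If the page doesn’t load straight away, check that the service has come online; svcs will show its state (the instance name and timestamp below are illustrative):
$ svcs webui/server
STATE          STIME    FMRI
online         10:42:13 svc:/system/webui/server:default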
We have a manual for Analytics in the Oracle Solaris 11.4 Information Library, entitled Using Oracle® Solaris 11.4 Analytics, which covers this feature in much more detail than this overview.