
January 30, 2018

Solaris Analytics: An Overview

James McPherson
Principal Software Engineer


One of the key features of Oracle Solaris 11.4 is Solaris Analytics. This is a radical redesign of the way that we look at how our Solaris systems are performing. To set the scene for why and how it is useful to you, let's take a short walk down memory lane.

In my years as a customer and working in support for Sun, whenever we had an issue with how our systems were performing we would crack out the old favourites: iostat, vmstat and lockstat. We would run those tools for a while as the problem was occurring and then post-process the information.

At some point during my support career, somebody realised that if we gathered a lot more information about a problematic system, we would have a better chance of zeroing in on where the problem lay. The Explorer tool quickly became a recommended component, and engineers within the support organisation spent countless hours figuring out ways to interrogate the data it gathered.

All of which was fine until the explosion of virtualisation and the invention of DTrace. DTrace enables you to dig very deeply into your system with a singular focus. What it does not do, however, is pull other sources of relevant information together to help you truly answer your questions about what the system is actually doing.

So how do we solve these problems? How do we help the duty engineer at 3am who is wondering why a business-critical service has failed or is just performing badly? How can we give you a way to say "Actually, it's not the hardware or the operating system which is misbehaving, it's a specific application"?

With some inspiration from the ZFS Storage Appliance interface, and building on our engineering culture of designing observability features in, the Solaris Analytics project was born. Architected by Liane Praza and Bart Smaalders, it pulls together information from kstats, hardware configuration, audit and FMA events, service status, processes and certain utilities. This information is stored in the Solaris Stats Store (svc:/system/sstore) and made visible to you via the Solaris Analytics Web UI ("the BUI", svc:/system/webui/server) as well as via the command-line tool sstore(1).

For the rest of this post I'll discuss our nomenclature and give some simple examples of how you can immediately start seeing what your system is doing.

The Stats Store makes use of providers in order to gather stats. These providers include the Solaris Audit facility, FMA, SMF, the process table and kstats. There is also a userspace provider, so that you can provide your own statistics. (I'll cover userspace stats in a later post).

This brings us to the concept of the Stats Store namespace, which is composed of classes, resources, stats and partitions. Classes are groupings of resources which share common statistics. For example, each resource in //:class.disk has these statistics:

 

    jurassic:jmcp $ sstore list //:class.disk//:stat.*
    IDENTIFIER
    //:class.disk//:stat.read-bytes
    //:class.disk//:stat.read-ops
    //:class.disk//:stat.write-bytes
    //:class.disk//:stat.write-ops

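You can list the resources within a class using the same sort of wildcard. A quick sketch (the resource names are from my system and will differ on yours, and the listing is trimmed):

    jurassic:jmcp $ sstore list //:class.disk//:*
    IDENTIFIER
    //:class.disk//:res.name/sd0
    //:class.disk//:res.name/sd5
    ...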

Some statistics can be partitioned, too:
 
    jurassic:jmcp $ sstore list //:class.disk//:stat.read-bytes//:part.*
    IDENTIFIER
    //:class.disk//:stat.read-bytes//:part.controller
    //:class.disk//:stat.read-bytes//:part.device
    //:class.disk//:stat.read-bytes//:part.devid
    //:class.disk//:stat.read-bytes//:part.disk


We also have the concept of topology and mappings, so that you can see, for example, the zpool that your disk is part of:
 
    jurassic:jmcp $ sstore list //:class.disk//:res.name/sd5//:class.zpool//:*
    IDENTIFIER
    //:class.disk//:res.name/sd5//:class.zpool//:res.name/rpool

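The mapping should be walkable in the other direction as well; assuming it is (this is a sketch rather than captured output), you can start from the zpool and find its member disks:

    jurassic:jmcp $ sstore list //:class.zpool//:res.name/rpool//:class.disk//:*
    IDENTIFIER
    //:class.zpool//:res.name/rpool//:class.disk//:res.name/sd5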

Each of these names is an SSID, or Stats Store IDentifier, and the delimiter is "//:". You can read all about SSIDs in the ssid(7) man page.
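Given an SSID, you can also ask the Stats Store what it represents: sstore's info subcommand prints the metadata (description, type and so on) for a statistic. A sketch, with the output elided; see sstore(1) for the exact fields:

    jurassic:jmcp $ sstore info //:class.disk//:stat.read-bytes
    ...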

The sstore(1) command is our command-line interface to the Stats Store, and lets you list, capture, export and read information about statistics. (sstore also has an interactive shell mode, which lets you explore the namespace).
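For instance, a minimal interactive session might look something like this (a sketch; the prompt and the full class listing will vary):

    jurassic:jmcp $ sstore
    sstore> list //:class.*
    IDENTIFIER
    //:class.disk
    //:class.zpool
    ...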

I'll finish this post with two simple examples of how we tie information together to make this feature more than just a bunch of numbers.

The first example demonstrates how we answer the question "How many failed login attempts have there been on this system?". To do this we need to look at two types of Solaris Audit events: AUE_login and AUE_ssh. The first covers login attempts on any of the virtual or physical console devices; the second covers logins over ssh. Let's see how many login failures have been recorded since 1 January 2017 on a convenient example system:
 
    $ sstore export -t 2017-01-01T00:00:00 \
	"//:class.event//:event.adm-action//:op.filter(event=AUE_login,return/errval=failure)//:op.count" \
	"//:class.event//:event.adm-action//:op.filter(event=AUE_ssh,return/errval=failure)//:op.count"
    TIME                VALUE IDENTIFIER
    2018-01-27T10:19:57 7 //:class.event//:event.adm-action//:op.filter(event=AUE_login,return/errval=failure)//:op.count
    2018-01-27T10:13:13 15 //:class.event//:event.adm-action//:op.filter(event=AUE_ssh,return/errval=failure)//:op.count


The filter operator lets you search on string or numeric values, or on elements of an nvlist (which is the format in which we receive events from the audit service). The count operator simply counts the number of data points provided to it and reports the current timestamp. The "-t 2017-01-01T00:00:00" option limits the query to data points that occurred after that time.
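As a variation on the same query, and assuming the audit return value is recorded as "success" for successful attempts (a sketch whose output I haven't shown here), you could count the successful ssh logins over the same period:

    $ sstore export -t 2017-01-01T00:00:00 \
        "//:class.event//:event.adm-action//:op.filter(event=AUE_ssh,return/errval=success)//:op.count"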

The second example looks at two aspects of disk errors: absolute numbers, and the rate at which they are occurring. Rather than wade through iostat(8) output and try to figure out whether any of the disks in my media server were dying, I can use sstore(1) instead. Let's start by looking at the current disk error count, partitioned by device id:
 
    $ sstore export -p -1  //:class.disk//:stat.errors//:part.devid  
    TIME                VALUE IDENTIFIER
    2018-01-27T10:30:04  //:class.disk//:stat.errors//:part.devid
                    id1,sd@SATA_____ST1000DM003-1CH1____________Z1D5K89L: 1731.0
                    id1,sd@SATA_____ST2000DM001-1ER1____________Z4Z0EQF2: 1729.0
                    id?: 0.0
                    id1,sd@n5000039fe2df1c15: 0.0
                    id1,sd@n5000cca248e72f12: 0.0
                    id1,sd@n5000cca39cd93711: 0.0
                    id1,sd@n5000cca242c08f82: 0.0
                    id1,sd@n5000c50067485f33: 0.0
                    id1,sd@n5000cca35dea1ed5: 0.0
                    id1,sd@n5000039ff3f0d2f0: 0.0
                    id1,sd@n5000cca248e728b6: 0.0
                    id1,sd@n5000039ff3f0d8d9: 0.0
                    id1,sd@n5000cca35de9c580: 0.0
                    id1,sd@SATA_____WDC_WD30EFRX-68E_____WD-WCC4N7CNYH0S: 1729.0
                    id1,sd@SATA_____SPCC_Solid_StateP1701137000000018623: 1730.0
                    id1,sd@n50014ee26441fe70: 0.0
                    id1,sd@n50014ee2b951e6ae: 0.0


We've got four devices with a non-zero error count. That might be bad if those counts are increasing quickly. Let's have a look at the rate at which the errors are occurring:
 
    $ sstore export -p -10 //:class.disk//:stat.errors//:part.devid//:op.rate
    ...
	2018-01-27T10:35:14  //:class.disk//:stat.errors//:part.devid//:op.rate
                    id1,sd@SATA_____ST1000DM003-1CH1____________Z1D5K89L: 0.0
                    id1,sd@SATA_____ST2000DM001-1ER1____________Z4Z0EQF2: 0.0
                    id?: 0.0
                    id1,sd@n5000039fe2df1c15: 0.0
                    id1,sd@n5000cca248e72f12: 0.0
                    id1,sd@n5000cca39cd93711: 0.0
                    id1,sd@n5000cca242c08f82: 0.0
                    id1,sd@n5000c50067485f33: 0.0
                    id1,sd@n5000cca35dea1ed5: 0.0
                    id1,sd@n5000039ff3f0d2f0: 0.0
                    id1,sd@n5000cca248e728b6: 0.0
                    id1,sd@n5000039ff3f0d8d9: 0.0
                    id1,sd@n5000cca35de9c580: 0.0
                    id1,sd@SATA_____WDC_WD30EFRX-68E_____WD-WCC4N7CNYH0S: 0.0
                    id1,sd@SATA_____SPCC_Solid_StateP1701137000000018623: 0.0
                    id1,sd@n50014ee26441fe70: 0.0
                    id1,sd@n50014ee2b951e6ae: 0.0


This is the tenth data point, and you will observe that the rate of errors in each case is 0. A quick check of zpool status shows that my zpools are fine, leaving me with one less thing to worry about.
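(That zpool check is just the standard one; zpool status -x prints a single reassuring line when nothing needs attention.)

    $ zpool status -x
    all pools are healthy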

In my next post I'll talk about how we bring this sort of information together in the WebUI. You can start exploring this yourself by uttering
 
    # svcadm enable webui/server



and then pointing your browser at https://127.0.0.1:6787.
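If the page doesn't load, it's worth confirming that both the Stats Store and the WebUI services are online first. The timestamps and instance names below are illustrative:

    $ svcs sstore webui/server
    STATE          STIME    FMRI
    online         10:12:01 svc:/system/sstore:default
    online         10:12:05 svc:/system/webui/server:default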

We have a manual in the Oracle Solaris 11.4 Information Library, entitled Using Oracle® Solaris 11.4 Analytics, which covers this feature in much more detail than this overview.

Comments (1)
  • DavidH Thursday, February 8, 2018
    I remember using the original near real-time perfmeter tools under SunTools - at the time, it was ground-breaking!

    I was wondering when Oracle was going to release a replacement for the old-school SVR4 analytics tools from pre-Solaris 11 that were discontinued.

    One-liners with XTerm in TEK mode with sar, awk, plot, spline, and graph were helpful for plotting bitmapped graphs for decades, but the output was quite dated... the advantage was that once the sys cron was set up, historical data was always on hand to be graphed (in addition to near real-time data graphed directly out of sar).

    I am looking forward to seeing the new facilities!