Thursday Jul 28, 2011

nicstat update - version 1.90

Yes! A new version is now available with some long-awaited features. Many thanks to those who suggested improvements and helped with testing.

Changes for Version 1.90, April 2011

Common

  • nicstat.sh script, to provide for automated multi-platform deployment. See the Makefiles for details.
  • Added "-x" flag, to display extended statistics for each interface.
  • Added "-t" and "-u" flags, to include TCP and UDP (respectively) statistics. These come from tcp:0:tcpstat and udp:0:udpstat on Solaris, or from /proc/net/snmp and /proc/net/netstat on Linux.
  • Added "-a" flag, which equates to "-tux".
  • Added "-l" flag, which lists interfaces and their configuration.
  • Added "-v" flag, which displays nicstat version.

Solaris

  • Added use of libdladm.so:dladm_walk_datalink_id() to get list of interfaces. This is better than SIOCGLIFCONF, as it includes interfaces given exclusively to a zone.
    NOTE: this library/routine can be (and by default is) linked in to nicstat in "lazy" mode. This means a Solaris 11 binary built with knowledge of the routine will also run on Solaris 10, where the routine or library is not found; in that case nicstat falls back to the SIOCGLIFCONF method (a rough sketch of the fallback idea appears after this list).
  • Added search of kstat "link_state" statistics as a third method for finding active network interfaces. See the man page for details.
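
nicstat itself gets this behaviour from lazy linking, but the same "use it if it is there" idea can be sketched with an explicit runtime check. This is illustrative only - the library path is assumed, and actually calling the routine once found is glossed over:

/* Rough sketch of falling back when libdladm is unavailable.
 * nicstat actually relies on lazy linking; dlopen()/dlsym() is shown
 * here only to illustrate the fallback idea. */
#include <dlfcn.h>
#include <stdio.h>

static void *dladm_lib;

static int have_dladm(void)
{
    if (dladm_lib == NULL)
        dladm_lib = dlopen("libdladm.so.1", RTLD_LAZY);
    return (dladm_lib != NULL &&
            dlsym(dladm_lib, "dladm_walk_datalink_id") != NULL);
}

static void find_interfaces(void)
{
    if (have_dladm()) {
        /* ... walk the datalinks via dladm_walk_datalink_id() ... */
        printf("using libdladm\n");
    } else {
        /* ... fall back to the SIOCGLIFCONF ioctl ... */
        printf("falling back to SIOCGLIFCONF\n");
    }
}

int main(void)
{
    find_interfaces();
    return 0;
}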

Linux

  • Added support for the SIOCETHTOOL ioctl, so that nicstat can look up interface speed/duplex itself (i.e. the "-S" flag is not necessarily needed any longer); see the sketch after this list.
  • Removed need for LLONG_MAX, improving Linux portability.
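
For reference, here is a minimal sketch of that lookup via the legacy ETHTOOL_GSET command (interface name hard-coded and error handling trimmed - not nicstat's actual code):

/* Minimal sketch: query speed/duplex with the SIOCETHTOOL ioctl on Linux.
 * Illustrative only - interface name and error handling are simplified. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
    struct ifreq ifr;
    struct ethtool_cmd ecmd;
    int sock = socket(AF_INET, SOCK_DGRAM, 0);

    if (sock < 0) {
        perror("socket");
        return 1;
    }
    memset(&ifr, 0, sizeof(ifr));
    memset(&ecmd, 0, sizeof(ecmd));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);   /* interface to query */
    ecmd.cmd = ETHTOOL_GSET;                        /* legacy "get settings" */
    ifr.ifr_data = (char *)&ecmd;

    if (ioctl(sock, SIOCETHTOOL, &ifr) == 0)
        printf("speed %u Mb/s, %s-duplex\n", (unsigned)ecmd.speed,
            ecmd.duplex == DUPLEX_FULL ? "full" : "half");
    else
        perror("SIOCETHTOOL");
    close(sock);
    return 0;
}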

Availability

nicstat source and binaries are available from SourceForge.

History

For more history on nicstat, see my earlier entry.

Wednesday Dec 08, 2010

Which package delivers the file/directory I can not find?

Back in the day (ahem, Solaris 10 and earlier...) you could use a command like the following to figure out which package was responsible for, e.g., "/usr/include/math.h" - or any other "include/math.h", just in case it is somewhere less obvious:

mashie$ fgrep include/math.h /var/sadm/install/contents
/usr/include/math.h f none 0644 root bin 10514 7356 1249334889 SUNWlibm

In other words, /var/sadm/install/contents was a flat-file index of every file, directory, link, etc. installed on the system. The last field is the package name - "SUNWlibm" in this case.

Things are a little more tricky with the new packaging system in Solaris 11 Express. Here is the equivalent "pkg contents" command:

mashie$ pkg contents -o pkg.name,path -a path='*/include/math.h'
PKG.NAME                        PATH
system/library/math/header-math usr/include/math.h

Thursday Sep 17, 2009

querystat - DTrace script to monitor your queries, query cache and server thread pre-emption

I was recently helping some colleagues check what was happening with their MySQL queries, and wrote a DTrace script to do it. Time to share that script.

First of all, a look at some output from the script:

mashie[bash]# ./querystat.d -p `pgrep mysqld`
Tracing started at 2009 Sep 17 16:28:35
2009 Sep 17 16:28:38   throughput 3 queries/sec
2009 Sep 17 16:28:41   throughput 4 queries/sec
2009 Sep 17 16:28:44   throughput 528 queries/sec
2009 Sep 17 16:28:47   throughput 1603 queries/sec
2009 Sep 17 16:28:50   throughput 1676 queries/sec
^C
Tracing ended   at 2009 Sep 17 16:28:51
Average latency, all queries: 107 us
Latency distribution, all queries (us): 
           value  ------------- Distribution ------------- count    
              16 |                                         0        
              32 |@@                                       170      
              64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        3728     
             128 |@@@@@                                    533      
             256 |                                         26       
             512 |                                         18       
            1024 |                                         2        
            2048 |                                         1        
            4096 |                                         0        
            8192 |                                         1        
           16384 |                                         1        
           32768 |                                         0        
Query cache statistics:
    count             hit: 6
    count            miss: 4474
    avg latency      miss: 107 (us)
    avg latency       hit: 407 (us)
Latency distribution, for query cache hit (us): 
           value  ------------- Distribution ------------- count    
              64 |                                         0        
             128 |@@@@@@@@@@@@@                            2        
             256 |@@@@@@@                                  1        
             512 |@@@@@@@@@@@@@@@@@@@@                     3        
            1024 |                                         0        
Latency distribution, for query cache miss (us): 
           value  ------------- Distribution ------------- count    
              16 |                                         0        
              32 |@@                                       170      
              64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        3728     
             128 |@@@@@                                    531      
             256 |                                         25       
             512 |                                         15       
            1024 |                                         2        
            2048 |                                         1        
            4096 |                                         0        
            8192 |                                         1        
           16384 |                                         1        
           32768 |                                         0        
Average latency when query WAS NOT pre-empted: 73 us
Average latency when query     WAS pre-empted: 127 us
Pre-emptors:
[...]
   mysql                                     6
   Xorg                                     18
   sched                                    25
   firefox-bin                              44
   sysbench                               3095

You can see that while the script is running (prior to pressing <Ctrl>-C), we get a throughput count every 3 seconds.

Then we get some totals, some averages, and even some distribution histograms - first covering all queries, then broken down by whether the query cache was used, and by whether the thread executing the query was pre-empted.

This may be useful for determining things like:

  • Do I have some queries in my workload that consume a lot more CPU than others?
  • Is the query cache helping or hurting?
  • Are my database server threads being pre-empted (kicked off the CPU) by (an)other process(es)?

Things have become easier since I first tried this, when I had to use the PID provider to trace functions in the database server.

If you want to try my DTrace script, get it from here. NOTE: You will need a version of MySQL with DTrace probes for it to work.

Friday Sep 04, 2009

nicstat - the Solaris and Linux Network Monitoring Tool You Did Not Know You Needed

Update - Version 1.95, January 2014

Added "-U" option, to display separate read and write utilization. Simplified display code regarding "-M" option. For Solaris, fixed fetch64() to check type of kstats andf ixed memory leak in update_nicdata_list(). Full details at the entry for version 1.95

Update - Version 1.92, October 2012

Added "-M" option to display throughput in Mbps (Megabits per second). Fixed some bugs. Full details at the entry for version 1.92

Update - Version 1.90, July 2011

Many new features available, including extended NIC, TCP and UDP statistics. Full details at the entry for version 1.90

Update - February 2010

Nicstat can now produce parseable output if you add the "-p" flag. This is compatible with System Data Recorder (SDR). Links below are for the new version - 1.22.

Update - October 2009

Just a little one - nicstat now works on shared-ip Solaris zones.

Update - September 2009

OK, this is heading toward overkill...

The more I publish updates, the more I get requests for enhancement of nicstat. I have also decided to complete a few things that needed doing.

The improvements for this month are:

  • Added support for a "fd" or "hd" (in reality anything starting with an upper- or lower-case F or H) suffix to the speed settings supplied via the "-S" option. This advises nicstat that the interface is full-duplex or half-duplex. The Linux version now calculates %Util the same way as the Solaris version.
  • Added a script, enicstat, which uses ethtool to get speeds and duplex modes for all interfaces, then calls nicstat with an appropriate -S value.
  • Made the Linux version more efficient.
  • Combined the Solaris and Linux source into one nicstat.c. This is a little ugly due to #ifdef's, but that's the price you pay.
  • Wrote a man page.
  • Wrote better Makefiles for both platforms.
  • Wrote a short README.
  • Licensed nicstat under the Artistic License 2.0.

All source and binaries will from now on be distributed in a tarball. This blog entry will remain the home of nicstat for the time being.

Lastly, I have heard the requests for easier availability in OpenSolaris. Stay tuned.

Update - August 2009

That's more like it - we should get plenty of coverage now :)

A colleague pointed out to me that nicstat's method of calculating utilization for a full-duplex interface is not correct.

Now nicstat will look for the kstat "link_duplex" value, and if it is 2 (which means full-duplex), it will use the greater of rbytes or wbytes to calculate utilization.
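
As a rough sketch of that calculation (my own illustration, not nicstat's actual code - rates are bytes per second over the interval, speed is in megabits per second):

/* Sketch of the utilization idea described above - not nicstat's code.
 * rbps/wbps are read/write rates in bytes/sec; full_duplex reflects the
 * kstat "link_duplex" value being 2. */
#include <stdio.h>

static double util_pct(double rbps, double wbps, double speed_mbps, int full_duplex)
{
    double capacity = speed_mbps * 1000000.0 / 8.0;     /* bytes/sec each way */
    double busiest;

    if (full_duplex)
        busiest = (rbps > wbps) ? rbps : wbps;          /* each direction has its own capacity */
    else
        busiest = rbps + wbps;                          /* both directions share one channel */

    return (capacity > 0.0) ? 100.0 * busiest / capacity : 0.0;
}

int main(void)
{
    /* e.g. writing 6 MB/s on a full-duplex 100 Mbit/s link is 48% utilized */
    printf("%.2f%%\n", util_pct(0.0, 6.0e6, 100.0, 1));
    return 0;
}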

No change to the Linux version. Use the links in my previous post for downloading.

Update - July 2009

I should probably do this at least once a year, as nicstat needs more publicity...

A number of people have commented to me that nicstat always reports "0.00" for %Util on Linux. The reason for this is that there is no simple way an unprivileged user can get the speed of an interface in Linux (quite happy for someone to prove me wrong on that however).

Recently I got an offer of a patch from David Stone, to add an option to nicstat that tells it what the speed of an interface is. Pretty reasonable idea, so I have added it to the Linux version. You will see this new "-S" option explained if you use nicstat's "-h" (help) option.

I have made another change which makes nicstat more portable, hence easier to build on Linux.

History

A few years ago, a bloke I know by the name of Brendan Gregg wrote a Solaris kstat-based utility called nicstat. In 2006 I decided I needed to use this utility to capture network statistics in the testing I do. Then I got a request from a colleague in PAE to do something about nicstat not being aware of "e1000g" interfaces.

I have spent a bit of time adding to nicstat since then, so I thought I would make the improved version available.

Why Should I Still Be Interested?

nicstat is to network interfaces as "iostat" is to disks, or "prstat" is to processes. It is designed as a much better version of "netstat -i". Its differences include:

  • Reports bytes in & out as well as packets.
  • Normalizes these values to per-second rates.
  • Reports on all interfaces (while iterating).
  • Reports Utilization (rough calculation as of now).
  • Reports Saturation (also rough).
  • Prefixes statistics with the current time.

How about an example?

eac-t2000-3[bash]# nicstat 5
    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
17:05:17      lo0    0.00    0.00    0.00    0.00    0.00    0.00  0.00   0.00
17:05:17  e1000g0    0.61    4.07    4.95    6.63   126.2   628.0  0.04   0.00
17:05:17  e1000g1   225.7   176.2   905.0   922.5   255.4   195.6  0.33   0.00
    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
17:05:22      lo0    0.00    0.00    0.00    0.00    0.00    0.00  0.00   0.00
17:05:22  e1000g0    0.06    0.15    1.00    0.80   64.00   186.0  0.00   0.00
17:05:22  e1000g1    0.00    0.00    0.00    0.00    0.00    0.00  0.00   0.00
eac-t2000-3[bash]# nicstat -i e1000g0 5 4
    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
17:08:49  e1000g0    0.61    4.07    4.95    6.63   126.2   628.0  0.04   0.00
17:08:54  e1000g0    0.06    0.04    1.00    0.20   64.00   186.0  0.00   0.00
17:08:59  e1000g0   239.2    2.33   174.4   33.60  1404.4   71.11  1.98   0.00
17:09:04  e1000g0    0.01    0.04    0.20    0.20   64.00   186.0  0.00   0.00

For more examples, see the man page.

References & Resources

Friday Aug 14, 2009

nicstat - Update for Solaris only

Update - August 2009

That's more like it - we should get plenty of coverage now :)

A colleague pointed out to me that nicstat's method of calculating utilization for a full-duplex interface is not correct.

Now nicstat will look for the kstat "link_duplex" value, and if it is 2 (which means full-duplex), it will use the greater of rbytes or wbytes to calculate utilization.

No change to the Linux version. Use the links in my previous post for downloading.

Monday Apr 27, 2009

pstime - a mash-up of ps(1) and ptime(1)

I have done some testing in the past where I needed to know the amount of CPU consumed by a process more accurately than I can get from the standard set of operating system utilities.

Recently I hit the same issue - I wanted to collect CPU consumption of mysqld.

To capture process CPU utilization over an interval on Solaris, about the best I can get is the output from a plain "prstat" command, which might look like:

mashie ) prstat -c -p `pgrep mysqld` 5 2
Please wait...
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP       
  7141 mysql     278M  208M cpu0    39    0   0:38:13  40% mysqld/45
Total: 1 processes, 45 lwps, load averages: 0.63, 0.33, 0.18
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP       
  7141 mysql     278M  208M cpu1    32    0   0:38:18  41% mysqld/45
Total: 1 processes, 45 lwps, load averages: 0.68, 0.34, 0.18

I am after data from the second sample only (still not sure exactly how prstat gets data for the first sample, which comes out almost instantaneously), so you can guess I will need some sed/perl that is a little more complicated than I would prefer.

pstime reads PROCFS (i.e., the virtualized file-system mounted on /proc) and captures CPU utilization figures for processes. It will report the %USR and %SYS either for a specific list of processes, or for every process running on the system (i.e., running at both sample points). The sample time is recorded at high resolution when each process' data is captured, and then again after N seconds, where N is the first parameter supplied to pstime.
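
On Solaris those figures come from the prusage_t structure in /proc/<pid>/usage. A minimal sketch of taking one sample (not pstime's actual code; error handling is trimmed):

/* Minimal sketch of sampling a process' CPU times from Solaris PROCFS.
 * Not pstime's actual code - error handling is trimmed for brevity. */
#include <fcntl.h>
#include <procfs.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char path[64];
    prusage_t u;
    int fd;
    pid_t pid = (argc > 1) ? (pid_t)atoi(argv[1]) : getpid();

    (void) snprintf(path, sizeof(path), "/proc/%d/usage", (int)pid);
    if ((fd = open(path, O_RDONLY)) < 0 ||
        read(fd, &u, sizeof(u)) != sizeof(u)) {
        perror(path);
        return 1;
    }
    (void) close(fd);
    /* pr_utime/pr_stime are cumulative; %USR and %SYS come from the deltas
     * between two such samples, divided by the elapsed time (pr_tstamp). */
    printf("usr %ld.%09ld s, sys %ld.%09ld s\n",
        (long)u.pr_utime.tv_sec, u.pr_utime.tv_nsec,
        (long)u.pr_stime.tv_sec, u.pr_stime.tv_nsec);
    return 0;
}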

The default output of pstime is expressed as either a percentage of whole system CPU, or CPU seconds, with four significant digits. Solaris itself records the original figures in nanosecond resolution, although we do not expect today's hardware to be that accurate.

Here is an example:

mashie ) pstime 10 `pgrep sysbench\|mysqld`
  UID    PID  %USR  %SYS COMMAND
mysql   7141 44.17 3.391 /u/dist/mysql60-debug/bin/mysqld --defaults-file=/et
mysql  19870 2.517 2.490 sysbench --test=oltp --oltp-read-only=on --max-time=
mysql  19869 0.000 0.000 /bin/sh -p ./run-sysbench

Downloads

Wednesday Apr 08, 2009

Testing the New Pool-of-Threads Scheduler in MySQL 6.0

I have recently been investigating a new feature of MySQL 6.0 - the "Pool-of-Threads" scheduler. This feature is a fairly significant change to the way MySQL completes tasks given to it by database clients.

To begin with, be advised that the MySQL database is implemented as a single multi-threaded process. The conventional threading model is that there are a number of "internal" threads doing administrative work (including accepting connections from clients wanting to connect to the database), then one thread for each database connection. That thread is responsible for all communication with that database client connection, and performs the bulk of database operations on behalf of the client.

This architecture exists in other RDBMS implementations. Another common implementation is a collection of processes all cooperating via a region of shared memory, usually with semaphores or other synchronization objects located in that shared memory.

The creation and management of threads can be said to be cheap, in a relative sense - it is usually significantly cheaper to create or destroy a thread than a process. These overheads are not free, however. Also, scheduling a thread is not much different from scheduling a process: a single operating system instance scheduling several thousand threads on and off the CPUs is not doing much less work than one scheduling several thousand processes doing the same work.

Pool-of-Threads

The theory behind the Pool-of-Threads scheduler is to provide an operating mode which supports a large number of clients that will be maintaining their connections to the database, but will not be sending a constant stream of requests to the database. To support this, the database will maintain a (relatively) small pool of worker threads that take a single request from a client, complete the request, return the results, then return to the pool and wait for another request, which can come from any client. The database's internal threads still exist and operate in the same manner.

In theory, this should mean less work for the operating system to schedule threads that want CPU. On the other hand, it should mean some more overhead for the database, as each worker thread needs to restore the context of a database connection prior to working on each client request.

A smaller pool of threads should also consume less memory, as each thread requires a minimum amount of memory for a thread stack, before we add what is needed to store things like a connection context, or working space to process a request.

You can read more about the different threading models in the MySQL 6.0 Reference Manual.
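
As a generic illustration of the pattern being described (this is not MySQL's scheduler code - the queue, names and sizes here are all made up for the example):

/* Generic illustration of the pool-of-threads idea - not MySQL's code.
 * A small, fixed set of workers pulls requests from a shared queue, so any
 * worker can service a request from any client connection. */
#include <pthread.h>
#include <stdio.h>

#define NWORKERS  4
#define NREQUESTS 16
#define QLEN      (NREQUESTS + NWORKERS)    /* work items plus shutdown sentinels */

static int queue[QLEN];
static int head, tail;                      /* both protected by lock */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;

static void enqueue(int req)
{
    pthread_mutex_lock(&lock);
    queue[tail++] = req;
    pthread_cond_signal(&nonempty);
    pthread_mutex_unlock(&lock);
}

static void *worker(void *arg)
{
    long id = (long)arg;

    for (;;) {
        int req;

        pthread_mutex_lock(&lock);
        while (head == tail)
            pthread_cond_wait(&nonempty, &lock);
        req = queue[head++];
        pthread_mutex_unlock(&lock);

        if (req < 0)                        /* sentinel: time to shut down */
            return NULL;
        /* This is where a worker would restore the connection's context,
         * execute the query and return the result, before going back to
         * the pool for the next request - from any connection. */
        printf("worker %ld handled request %d\n", id, req);
    }
}

int main(void)
{
    pthread_t tid[NWORKERS];
    long i;

    for (i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (i = 0; i < NREQUESTS; i++)
        enqueue((int)i);                    /* requests arriving from many clients */
    for (i = 0; i < NWORKERS; i++)
        enqueue(-1);                        /* one sentinel per worker */
    for (i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}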

Testing the Theory

Mark Callaghan of Google has recently had a look at whether this theory holds true. He has published his results under "No new global mutexes! (and how to make the thread/connection pool work)". Mark has identified (via this bug he logged) that the overhead for using Pool-of-Threads seems quite large - up to 63 percent.

So, my first task is to see if I get the same results. I will note here that I am using Solaris, whereas Mark was no doubt using a Linux distro. We probably have different hardware as well (although both are Intel x86).

Here is what I found when running sysbench read-only (with the sysbench clients on the same host). The "conventional" scheduler inside MySQL is known as the "Thread-per-Connection" scheduler, by the way.

This is in contrast to Mark's results - I am only seeing a loss in throughput of up to 30%.

What about the bigger picture?

These results do show there is a definite reduction in maximum throughput if you use the pool-of-threads scheduler.

I believe it is worth looking at the bigger picture however. To do this, I am going to add in two more test cases:

  • sysbench read-only, with the sysbench client and MySQL database on separate hosts, via a 1 Gb network
  • sysbench read-write, via a 1 Gb network

What I want to see is what sort of impact the pool-of-threads scheduler has for a workload that I expect is still the more common one - where our database server is on a dedicated host, accessed via a network.

As you can see, the impact on throughput is far less significant when the client and server are separated by a network. This is because we have introduced network latency as a component of each transaction and increased the amount of work the server and client need to do - they now need to perform Ethernet driver, IP and TCP tasks.

This reduces the relative overhead - in CPU consumed and latency - introduced by pool-of-threads.

This is a reminder that if you are conducting performance tests on a system prior to implementing or modifying your architecture, you would do well to choose a test architecture and workload that are as close as possible to what you intend to deploy. The same is true if you are trying to extrapolate someone else's performance testing to your own architecture.

The Converse is Also True

On the other hand, if you are a developer or performance engineer conducting testing in order to test a specific feature or code change, a micro-benchmark or simplified test is more likely to be what you need. Indeed, Mark's use of the "blackhole" storage engine is a good idea to eliminate that processing from each transaction.

In this scenario, if you fail to make the portion of the software you have modified a significant part of the work being done, you run the risk of seeing performance results that are not significantly different, which may lead you to assume your change has negligible impact.

In my next posting, I will compare the two schedulers using a different perspective.

Wednesday Dec 17, 2008

MySQL 5.1 Memory Allocator Bake-Off

After getting sysbench running properly with a scalable memory allocator (see last post), I can now return to what I was originally testing - what memory allocator is best for the 5.1 server (mysqld).

This stems out of studies I have made of some patches that have been released by Google. You can read about the work Google has been doing here.

I decided I wanted to test a number of configurations based on the MySQL community source, 5.1.28-rc, namely:

  • The baseline - no Google SMP patch, default memory allocator (5.1.28-rc)
  • With Google SMP patch, mem0pool enabled, no custom malloc (pool)
  • With Google SMP patch, mem0pool enabled, linked with mtmalloc (pool-mtmalloc)
  • With Google SMP patch, mem0pool disabled, linked with tcmalloc (TCMalloc)
  • With Google SMP patch, mem0pool disabled, linked with umem (umem)
  • With Google SMP patch, mem0pool disabled, linked with mtmalloc (mtmalloc)

Here are some definitions, by the way:

  • mem0pool - InnoDB's internal "memory pools" feature, found in mem0pool.c (NOTE: even if this is enabled, other parts of the server will not use this memory allocator - they will use whatever allocator is linked with mysqld)
  • tcmalloc - the "libtcmalloc_minimal.so.0.0.0" that is built from google-perftools-0.99.2
  • Hoard - the Hoard memory allocator, version 3.7.1
  • umem - the libumem library (included with Solaris)
  • mtmalloc - the mtmalloc library (included with Solaris)

My test setup was a 16-CPU Intel system, running Solaris Nevada build 100. I chose to use only an x86 platform, as I was not able to build tcmalloc on SPARC. I also chose to run with the database in TMPFS, and with an InnoDB buffer pool smaller than the database size. This was to ensure that we would be CPU-bound if possible, rather than slowed by I/O.

Where I built a package myself (not needed for mtmalloc or umem, which ship with Solaris), I used GCC 4.3.1 - except for Hoard, which seemed to prefer the Sun Studio 11 C compiler (over Sun Studio 12 or GCC).

My test was a sysbench OLTP read-write run, of 10 minutes. Each series of runs at different thread counts is preceded by a database re-build and 20 minute warmup. Here are my throughput results for 1-32 SysBench threads, in transactions per second:

These results show that while the Google SMP changes are a benefit, the disabling of InnoDB's mem0pool does not seem to provide any further benefit for my configuration. My results also show that TCMalloc is not a good allocator for this workload on this platform, and Hoard is particularly bad, with significant negative scaling above 16 threads.

The remaining configurations are pretty similar, with mtmalloc and umem a little ahead at higher thread counts.

Before I get a ton of comments and e-mails, I would like to point out that I did some verification of my TCMalloc builds, as the results I got surprised me. I verified that it was using the supplied assembler for atomic routines, and I built it with optimization (-O3) and without.

I also discovered that TCMalloc was emitting this diagnostic when mysqld was starting up:

src/tcmalloc.cc:151] uname failed assuming no TLS support (errno=0)

I rectified this with a change in tcmalloc.cc, and called this configuration "TCMalloc -O3, TLS". It is shown against the other two configurations below.

I often like to have a look at what the CPU cost of different configurations are. This helps to demonstrate headroom, and whether different throughput results may be due to less efficient code or something else. The chart below lists what I found - note that this is system-wide CPU (user & system) utilization, and I was running my SysBench client on the same system.

Lastly, I did do one other comparison, which was to measure how much each memory allocator affected the virtual size of mysqld. I did not expect much difference, as the most significant consumer - the InnoDB buffer pool - should dominate with large, long-lived allocations. This was indeed the case, and memory consumption grew little after the initial start-up of mysqld. The only allocator that then caused any noticeable change was mtmalloc, which for some reason made the heap grow by 35MB following a 5 minute run (it was originally 1430 MB).

References

Friday Dec 12, 2008

Scalability and Stability for SysBench on Solaris

My mind is playing "Suffering Succotash..."

I have been working on MySQL performance for a while now, and the team I am in have discovered that SysBench could do with a couple of tweaks for Solaris.

Sidebar - sysbench is a simple "OLTP" benchmark which can test multiple databases, including MySQL. Find out all about it here, but go to the download page to get the latest version.

To simulate multiple users sending requests to a database, sysbench uses multiple threads. This leads to two issues we have identified with SysBench on Solaris, namely:

  • The implementation of random() is explicitly identified as unsafe in multi-threaded applications on Solaris. My team has found this is a real issue, with occasional core dumps in our multi-threaded SysBench runs.
  • SysBench does quite a bit of memory allocation, and could do with a more scalable memory allocator.

Neither of these issues is necessarily relevant only to Solaris, by the way.

Luckily there are simple solutions. We can fix the random() issue by using lrand48() - in effect a drop-in replacement. Then we can fix the memory allocator by simply choosing to link with a better allocator on Solaris.
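
The swap can be as simple as a pair of macros (the sb_ wrapper names here are hypothetical, for illustration - the actual sysbench change may differ):

/* Sketch of the drop-in replacement idea - the actual sysbench patch may
 * differ. The sb_ wrapper names are made up for this example. */
#include <stdio.h>
#include <stdlib.h>

#ifdef __sun
#define sb_srandom(seed)  srand48((long)(seed))   /* lrand48() is MT-Safe on Solaris */
#define sb_random()       lrand48()
#else
#define sb_srandom(seed)  srandom(seed)
#define sb_random()       random()
#endif

int main(void)
{
    sb_srandom(1234);
    printf("%ld\n", (long)sb_random());
    return 0;
}

Linking with a better allocator is then just a build-time choice - on Solaris, adding -lumem to the link line (or interposing it at runtime with LD_PRELOAD=libumem.so.1) is enough.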

To help with a decision on memory allocator, I ran a few simple tests to check the performance of the two best-known scalable allocators available in Solaris. Here are the results ("libc" is the default memory allocator):

Throughput

To see the differences more clearly, let's do a relative comparison, using "umem" (A.K.A. libumem) as the reference:

Relative Throughput

So - around 20% less throughput with the default allocator at 16 or 32 threads. Very little difference at 1 thread, too (where the default memory allocator should be the one with the lowest synchronization overhead).

Where you see another big difference is CPU cost per transaction:

CPU Cost

I will just point out two other reasons why I would recommend libumem:

I have logged these two issues as sysbench bugs:

However, if you can't wait for the fixes to be released, try these:

About

Tim Cook's Weblog The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.
