Friday Jan 18, 2008

Averaging performance data

When you are optimizing benchmarks, the typical process involves running the same benchmark N times and picking one of these N runs, more or less arbitrarily, as the representative run. Another option is to average these N runs (creating a new run N') and pick that one as the representative run. In Fenxi, we have discussed automatically averaging a bunch of runs. Performance data can be of two types:
  • Numerical Data (Throughput, Response time, etc.)
  • Textual Data (OS Patch level, syslog messages, etc.)
Averaging numerical data is very easy. Averaging textual data is neither possible nor desirable. However, since we are creating a new run N', we need to select textual data to be part of this new run. Which run do we pick it from? We are trying to solve this via the Fenxi project. If you have any thoughts or suggestions regarding this, please feel free to contact us.
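
As a rough sketch of the idea (not how Fenxi does it), the numerical part of N' can be the per-metric mean across the runs. For the textual part, one possible policy, purely an assumption here, is to copy it from the run whose numbers sit closest to that mean. The run layout (`numeric` and `text` dictionaries), the metric names and the function name are all made up for illustration.

```python
from statistics import mean


def average_runs(runs):
    """Build a synthetic run N' from a list of runs.

    Each run is a dict with two hypothetical keys:
      "numeric": {metric name: value}
      "text":    {item name: string}
    """
    metrics = runs[0]["numeric"].keys()
    # Numerical data: plain per-metric arithmetic mean.
    averaged = {m: mean(r["numeric"][m] for r in runs) for m in metrics}

    # Textual data cannot be averaged. Assumed policy: take it from the
    # run whose metrics are closest (relative distance) to the mean.
    def distance(run):
        return sum(
            abs(run["numeric"][m] - averaged[m]) / abs(averaged[m])
            for m in metrics
            if averaged[m] != 0
        )

    representative = min(runs, key=distance)
    return {"numeric": averaged, "text": dict(representative["text"])}


runs = [
    {"numeric": {"throughput": 950.0, "response_ms": 12.0}, "text": {"os_patch": "A"}},
    {"numeric": {"throughput": 1000.0, "response_ms": 10.0}, "text": {"os_patch": "B"}},
    {"numeric": {"throughput": 1080.0, "response_ms": 9.5}, "text": {"os_patch": "C"}},
]
n_prime = average_runs(runs)
print(n_prime)
```

Other policies (first run, last run, keeping the text of all runs side by side) are just as defensible; the point is only that the choice has to be made explicitly.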

Monday Jan 07, 2008

Fenxi - Performance analysis made easy

We just open sourced a nice performance analysis tool called Fenxi. Fenxi is a pluggable, Java-based post-processing and performance analysis tool that parses and loads the data from a variety of tools into a database, and then allows you to query and compare different sets of performance data. Fenxi can also be used to graph data from performance tools. Fenxi (Mandarin for analyze) is the successor to the Sun-internal tool called Xanadu. It is integrated with the Faban Benchmark harness.

If you have ever worked with performance data, you pretty soon realize that:
Performance Data can get huge.
Consider a benchmark running for 30 minutes on a 64 core system with hundreds of disks attached and multiple network interfaces. If you collect mpstat at 10 second intervals for the whole run, you end up with more than 11,000 lines of data! (That is 400 Ctrl-F's if you are using vi in a regular sized terminal.) If you collect data from more tools like vmstat, iostat, trapstat, busstat, cpustat, etc., you will end up with much more! Going through each of them line by line is not a scalable approach.
Performance Data is interrelated.
The tool outputs are just different views of the system behavior. We want to look at the system as a whole, rather than at its individual views. If your incoming network packets peak, the interrupts in your mpstat output most likely peak too. We may want to see if throughput was impacted as a result of a burst of writes to our disks, etc.
Some performance data makes sense visually.
For large data, a visual view gives a quick summary of the data. As Tim Cook puts it, "the human brain is a powerful pattern-recognition machine - graphs allow you to spot things you would never see in numbers (like waves of CPU migrations moving across different cores)". Look at the bottom of the blog for more details.
Performance Data should be queryable.
We want to be able to query, or ask questions of, the performance data. For example, you might want to know "What are my hot disks?". Traditionally, people have answered such questions by writing custom scripts using sed/awk/perl. This can get tedious very fast. We need a better way of asking questions. In Fenxi, we store the data in the database, and questions are formulated in SQL.
Performance Data should be comparable, averageable, etc.
Since I work in the performance group at Sun, we run a lot of benchmarks. Since the goal of [most] benchmarks is to maximize the performance of a system, we are constantly trying out new changes to the system. Typically, we change a parameter, repeat the benchmark, and see if performance has improved.
Performance Data should be sharable.
We rarely work in isolation. We should be able to share data with our peers and collaborate on finding performance fixes.
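
To make the "hot disks" question from the queryable point above concrete, here is a minimal sketch using an in-memory SQLite database. The `iostat` table, its columns, the sample values and the 80% cutoff are all hypothetical; Fenxi's real schema and queries may look quite different.

```python
import sqlite3

# Hypothetical table of iostat samples; Fenxi's real schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE iostat (ts INTEGER, device TEXT, busy_pct REAL, svc_ms REAL)"
)
samples = [
    (0, "sd0", 12.0, 4.1), (10, "sd0", 15.0, 4.3),
    (0, "sd1", 91.0, 22.5), (10, "sd1", 97.0, 30.2),
    (0, "sd2", 55.0, 9.0), (10, "sd2", 45.0, 8.4),
]
conn.executemany("INSERT INTO iostat VALUES (?, ?, ?, ?)", samples)

# "What are my hot disks?": devices whose average busy time exceeds a
# threshold (80% is an arbitrary cutoff), busiest first.
HOT_DISKS_SQL = """
    SELECT device, AVG(busy_pct) AS avg_busy
    FROM iostat
    GROUP BY device
    HAVING AVG(busy_pct) > ?
    ORDER BY avg_busy DESC
"""
hot_disks = conn.execute(HOT_DISKS_SQL, (80.0,)).fetchall()
print(hot_disks)
```

The same question answered with sed/awk would need a new script for every tool output format; once the samples are in a table, changing the question is a one-line change to the query.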

Fenxi tries to solve all of the above problems.

Sample Graph

Sample Text

Fenxi text view

You can see a sample database run processed by Fenxi. I urge you to check it out!

Thursday Feb 08, 2007

ZFS and OLTP workloads: Time for some numbers

My last entry provided some recommendations regarding the use of ZFS with databases. Time now to share some updated numbers.

Before we go to the numbers, it is important to note that these results are for the OLTP/Net workload, which may or may not represent your workload. These results are also specific to our system configuration, and may not be true for all system configurations. Please test your own workload before drawing any conclusions. That said, OLTP/Net is based on well-known standard benchmarks, and we use it quite extensively to study performance on our rigs.

[Results table; only the "UFS Directio" rows (N/A) survive]
1 Both block checksumming as well as block checking
2 Bigger is better

Databases usually checksum their blocks to maintain data integrity. Oracle, for example, uses a per-block checksum, and checksum checking is on by default. This is typically recommended, as most filesystems do not have a checksumming feature. With ZFS, checksums are enabled by default. Since databases are not tightly integrated with the filesystem/volume manager, a checksum error is handled by the database. Since ZFS includes volume manager functionality, a checksum error will be transparently handled by ZFS (i.e., if you have some kind of redundancy like mirroring or raidz), and the situation is corrected instead of returning a read error to the database. Moreover, ZFS will repair corrupted blocks via self-healing. While RAS experts will note that end-to-end checksum at the database level is slightly better than end-to-end checksum at the ZFS level, ZFS checksums give you unique advantages while providing almost the same level of RAS.

If you do not penalize ZFS with double checksums, you can note that we are within 6% of our best UFS number. So 6% gives you provable data integrity, unlimited snapshots, no fsck, and all the other good features. Quite good in my book :-) Of course, this number is only going to get better as more performance enhancements make it into the ZFS code.

More about the workload.
The tests were done with OLTP/Net with a 72 CPU Sun Fire E25K connected to 288 15k rpm spindles. We ran the test with around 50% idle time to simulate real customers. The test was done on Solaris Nevada build 46. Watch this space for numbers with the latest build of Nevada.

Wednesday Aug 02, 2006

Solaris Internals

There are very few books that let people understand and admire the complexities of Solaris. Richard McDougall and Jim Mauro have written two such masterpieces, titled Solaris Internals 2nd Edition and Solaris Performance and Tools. I highly recommend you get your copy fast. Both Richard and Jim are colleagues of mine at PAE, so I am sure to get my book autographed!

Monday Jul 31, 2006

Real-World Performance

Performance for the real-world, where it matters the most.

A major portion of my job (@ PAE) is spent trying to optimize Solaris for real customer workloads. We tend to focus on databases, but work with other applications too. We have tons (both weight-wise and dollar-wise :-)) of equipment in our labs, where we try to replicate a real enterprise data center. Of course, "real customer workload" is a loaded term. Since most big customers are rarely willing to share their workloads, we have to simulate them or write something close to them in-house. Trying to rewrite every customer's workload is not a scalable approach. Hence we have developed a workload called OLTP/Net that can be adapted to fit most customer workloads. Using several tuning knobs, we can control the number of reads, writes, network packets per transaction, connects, disconnects, etc. Think of it like a super workload! We have used it quite effectively to simulate several customer workloads.

There is a big difference between trying to get the best numbers for a benchmark and replicating a customer's setup. PAE has traditionally focused on getting the most out of the system. Our machines typically run at 100% utilization, run the latest and greatest Solaris builds, and have a lot of tuning applied. We believe fully in Cary Millsap's statement:

"Each CPU cycle that passes by unused is a cycle that you will never have a chance to use again; it is wasted forever. Time marches irrevocably onward."
(Performance Management: Myths & Facts, Cary V. Millsap, Oracle Corp, June 28, 1999)

However, many customers run their machines at less than 100% utilization to leave enough headroom for growth. When machines are not running at 100% utilization, things like idle loop performance matter a lot. If you have followed Solaris releases closely, there were several enhancements to the idle loop performance that increase the efficiency of lightly loaded systems by quite a bit. Similarly we have seen quite a few UFS + Database performance enhancements over the past few releases of Solaris.

So while benchmark numbers do matter, real performance also matters, and we are working on it!

Monday Dec 12, 2005

Six OS's on a disk? Wait I can do seven!!

Update: In my previous blog entry I showed how to install 6 OS's on a disk. Well, actually you can have seven (7). Disk partitions are numbered from 0 to 7. Ignoring slice 2, that leaves us with 7 free slices on which to install an OS. Although I have yet to log on to a machine with 7 OS's on a disk!!

Richard Elling pointed out that you could also use slice 2 (the loopback/backup/overlap slice). So that's 8. He also mentions that some SCSI devices support 16 slices, so you could fit quite a few more OS installations! Maybe we should have a competition to see how many OS's you can install on a single disk :-) My personal best is 6.

Friday Dec 02, 2005

Six OS's in one disk? Yes it is possible

Six (6) OS's in one disk

Do you want to install 6 OS's on a single disk? If so, read on.

The goal is to have 6 bootable OS's on a single disk. Why would you do it? For better sharing, more reliability, easier comparisons between OS versions, quicker recovery, and so on. BTW, I have only tried this on SPARC.

Although I am sure that people have been doing this for ages, I first heard it from Charles Suresh, who encouraged me to go ahead and give it a try.

Create Partitions

Disk partitions are usually numbered 0 - 7, with 2 being the overlap slice. For our experiment, we set 1 to be the swap. We sized the other partitions equally, with 0 being a little smaller than the others. On my 36G disk, the partition table looks like the following:

Part      Tag    Flag     Cylinders         Size            Blocks
  0       root    wm    2178 -  5655        4.79GB    (3478/0/0)  10047942
  1       swap    wu       0 -  2177        3.00GB    (2178/0/0)   6292242
  2     backup    wm       0 - 24619       33.92GB    (24620/0/0) 71127180
  3       root    wm    5656 -  9285        5.00GB    (3630/0/0)  10487070
  4       root    wm    9286 - 12915        5.00GB    (3630/0/0)  10487070
  5       root    wm   12916 - 16545        5.00GB    (3630/0/0)  10487070
  6       root    wm   16546 - 20175        5.00GB    (3630/0/0)  10487070
  7       root    wm   20176 - 24619        6.12GB    (4444/0/0)  12838716
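
Before installing onto a layout like this, it is worth checking that the slices really do not overlap. The sketch below is only an illustration: it hard-codes the table above, derives the blocks-per-cylinder figure from the backup slice row (the post does not state it), and the function name is made up.

```python
# Slice table from the post: slice -> (first cylinder, last cylinder, blocks).
SLICES = {
    0: (2178, 5655, 10047942),
    1: (0, 2177, 6292242),
    2: (0, 24619, 71127180),   # backup slice, overlaps everything
    3: (5656, 9285, 10487070),
    4: (9286, 12915, 10487070),
    5: (12916, 16545, 10487070),
    6: (16546, 20175, 10487070),
    7: (20176, 24619, 12838716),
}


def check_layout(slices, overlap_slice=2):
    """Return a list of problems found in the slice table (empty if none)."""
    problems = []
    first, last, blocks = slices[overlap_slice]
    blocks_per_cyl = blocks // (last - first + 1)

    # Every slice's block count should match its cylinder count.
    for num, (lo, hi, blk) in slices.items():
        if (hi - lo + 1) * blocks_per_cyl != blk:
            problems.append(f"slice {num}: block count does not match cylinders")

    # Apart from the backup slice, no two slices may share a cylinder.
    real = sorted((lo, hi, num) for num, (lo, hi, _) in slices.items()
                  if num != overlap_slice)
    for (lo1, hi1, n1), (lo2, hi2, n2) in zip(real, real[1:]):
        if lo2 <= hi1:
            problems.append(f"slices {n1} and {n2} overlap")
    return problems


problems = check_layout(SLICES)
print(problems or "layout is consistent")
```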

Install The OS

Install Solaris from any source. I typically download the images from nana.eng, and use my jumpstart server. You can also install from CD, DVD, etc. Once you install on a slice, you can dd(1) it to other slices, and fix /etc/vfstab. This is the fastest way of installing multiple Solaris instances on a disk. If you want another version, or a different build, bfu is your friend. You can also save off these slices to some /net/... place and restore an OS at will (again using dd both ways, since you need to preserve the boot blocks). If you slice multiple machines this way, you can even copy slices across machines (assuming same architecture etc.); more scripts are needed to change /etc/hosts, hostname, net/*/hosts, etc.
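
The "fix /etc/vfstab" step is needed because a slice copied with dd still names the original slice as its root device. A minimal sketch of that fix is below; it works on the file contents as a string, the function name is made up, and the device names are borrowed from the jumpstart profile later in this post. In practice you would mount the copied slice somewhere and apply this to its etc/vfstab.

```python
def retarget_vfstab(text, old_slice, new_slice):
    """Point the root (/) entry of a copied vfstab at the new slice.

    Only the line whose mount point is "/" is changed; swap and other
    entries are left alone.
    """
    out = []
    for line in text.splitlines():
        fields = line.split()
        is_root = (len(fields) >= 3 and not line.startswith("#")
                   and fields[2] == "/")
        if is_root:
            # Covers both /dev/dsk/... (mount) and /dev/rdsk/... (fsck).
            line = line.replace(old_slice, new_slice)
        out.append(line)
    return "\n".join(out) + "\n"


sample = (
    "#device device mount FS fsck mount mount\n"
    "/dev/dsk/c1t1d0s1 - - swap - no -\n"
    "/dev/dsk/c1t1d0s0 /dev/rdsk/c1t1d0s0 / ufs 1 no -\n"
)
fixed = retarget_vfstab(sample, "c1t1d0s0", "c1t1d0s3")
print(fixed)
```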

Install via Jumpstart: Setup Profile

If you like things automated, you could perform a hands-off install via custom jumpstart. The first step is to set up the profile for your server. Since you want to preserve the existing partitions, you have to use the preserve keyword. The profile for my machine looks like the following:
$ cat zeeroh_class
install_type    initial_install
system_type    server
partitioning    explicit
dontuse        c1t0d0
filesys        c1t1d0s0 existing /
filesys        c1t1d0s1 existing swap
filesys        c1t1d0s3 existing /s3 preserve
filesys        c1t1d0s4 existing /s4 preserve
filesys        c1t1d0s5 existing /s5 preserve
filesys        c1t1d0s6 existing /s6 preserve
filesys        c1t1d0s7 existing /s7 preserve
cluster        SUNWCall

To install an OS on another slice, just change the root slice (c1t1d0s0 above).
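
If you rotate through the slices often, the profile can be generated instead of edited by hand. The sketch below simply reproduces the profile shown above with a different root slice; the function name and parameters are made up, and mounting the old root slice at /s0 when it is preserved is an assumption that follows the /sN naming pattern.

```python
def make_profile(root_slice, disk="c1t1d0", skip_disk="c1t0d0",
                 swap_slice=1, os_slices=(0, 3, 4, 5, 6, 7)):
    """Return JumpStart profile text that installs onto root_slice and
    preserves every other OS slice, mirroring the profile above."""
    if root_slice not in os_slices:
        raise ValueError(f"slice {root_slice} is not one of the OS slices")
    lines = [
        "install_type    initial_install",
        "system_type    server",
        "partitioning    explicit",
        f"dontuse        {skip_disk}",
        f"filesys        {disk}s{root_slice} existing /",
        f"filesys        {disk}s{swap_slice} existing swap",
    ]
    # All remaining OS slices are kept intact and mounted under /sN.
    for s in os_slices:
        if s != root_slice:
            lines.append(f"filesys        {disk}s{s} existing /s{s} preserve")
    lines.append("cluster        SUNWCall")
    return "\n".join(lines) + "\n"


profile = make_profile(root_slice=3)
print(profile)
```

Calling `make_profile(0)` gives back the same lines as the hand-written profile, which is a quick way to confirm the generator has not drifted from it.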

Make sure that the directory where the profiles are stored is shared read-only.

Also ensure that you have a sysidcfg file set up correctly.
[neel@slc-olympics] config > cat sysidcfg
[neel@slc-olympics] config >

Run the check script.

Note that these profiles can be stored on any server. That machine does not need to have anything special installed. You only need to make sure that the location of the profile, and other custom jumpstart scripts are shared via NFS in a "read-only" mode.


On the jumpstart server (abc.yyy in my case), we added our machine to the list of clients as follows

./add_install_client -i -e a:b:c:d:e:f -c slc-olympics:/export/config -p slc-olympics:/export/config zorrah sun4u

Now reboot your machine as follows

$ reboot -- net - install

Booting via multiple disks/partitions

  1. Find the path (ls -l /dev/rdsk/..)
  2. At the ok prompt, type show-disks and select disk
  3. Type nvalias diskX # this pastes the selected path
  4. init 0
  5. boot diskX


Tuesday Nov 15, 2005


I guess an introduction is necessary!

I am Neelakanth Nadgir and I am part of the PA2E (Performance Architecture, and Availability Organization) group. I work out of Menlo Park, CA. My professional interests include scalability, networking, filesystems, distributed systems, etc.

Before joining PA2E, I worked at Sun's Market Development Engineering, where I spent 4 years working on Performance tuning, Porting, Sizing, and ISV account management.

I was also involved with several open source projects. I am an active member of the JXTA community and jointly started two projects, viz. the Ezel Project and the JNGI Project. I also served as webmaster for the GNU project for 2 years, and contributed to the Mozilla project in the past by providing SPARC binaries and misc performance fixes.

Before working at Sun, I graduated with a masters in Computer Sc from Texas Tech University at Lubbock, TX (GO Raiders!). My thesis was on the Reliability of distributed systems, where I devised a faster algorithm for calculating minimal file spanning trees. I have a Bachelor's degree in Computer Sc from Karnatak University, India.

My other interests include cricket and tropical aquarium fish (African cichlids in particular). My favorite fish is known as Pseudotropheus demasoni. My wife got me hooked on the aquarium hobby after we got married, and before I knew it, we had more than 60 fish in 6 tanks :-)

I plan to use this blog to share the knowledge that I have gained from working with lots of cool people here at Sun. Stay tuned for more insights!



