Recent Posts


Mercurial says "nothing changed", but it did. Sometimes my software is too clever.

It seems I have found a "bug" in Mercurial. It takes a shortcut when checking for differences in tracked files. If the file's size and modification time are unchanged, it assumes its contents are unchanged:

    $ hg init .
    $ cp -p .sccs2hg/2005-06-05_00\:00\:00\,nicstat.c nicstat.c
    $ ls -ogE nicstat.c
    -rw-r--r-- 1 14722 2012-08-24 11:22:48.819451726 -0700 nicstat.c
    $ hg add nicstat.c
    $ hg commit -m "added nicstat.c"
    $ cp -p .sccs2hg/2005-07-02_00\:00\:00\,nicstat.c nicstat.c
    $ ls -ogE nicstat.c
    -rw-r--r-- 1 14722 2012-08-24 11:22:48.819451726 -0700 nicstat.c
    $ hg diff
    $ hg commit
    nothing changed
    $ touch nicstat.c
    $ hg diff
    diff -r b49cf59d431d nicstat.c
    --- a/nicstat.c Fri Aug 24 11:21:27 2012 -0700
    +++ b/nicstat.c Fri Aug 24 11:22:50 2012 -0700
    @@ -2,7 +2,7 @@
      * nicstat - print network traffic, Kb/s read and written. Solaris 8+.
      * "netstat -i" only gives a packet count, this program gives Kbytes.
      *
    - * 05-Jun-2005, ver 0.81 (check for new versions, http://www.brendangregg.com)
    + * 02-Jul-2005, ver 0.90 (check for new versions, http://www.brendangregg.com)
      *
    [...]

Now, before you agree or disagree with me on whether this is a bug, I will also say that I believe it is a feature. Yes, I feel it is an acceptable shortcut because in "real" situations an edit to a file will change the modification time by at least one second (the resolution that hg diff or hg commit is looking for). The benefit of the shortcut is greatly improved performance of operations like "hg diff" and "hg status", particularly where your repository contains a lot of files.

Why did I have no change in modification time? Well, my source file was generated by a script that I have written to convert SCCS change history to Mercurial commits. If my script can generate two revisions of a file within a second, and the files are the same size, then I run afoul of this shortcut.

Solution - I will just change my script to apply the modification time from the SCCS history to the file prior to commit. A "touch -t " will do that easily.
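
The shortcut and its blind spot can be modelled in a few lines. This is a minimal Python sketch, not Mercurial's actual code: the helper names (fingerprint, looks_unchanged, write_revision) and both timestamps are hypothetical, and it assumes a change check that compares only file size and whole-second modification time, as described in the post.

```python
import os
import tempfile


def fingerprint(path):
    """Return (size, mtime in whole seconds): all that the cheap check compares."""
    st = os.stat(path)
    return (st.st_size, int(st.st_mtime))


def looks_unchanged(path, recorded):
    """Model of the shortcut: same size and same mtime means 'nothing changed'."""
    return fingerprint(path) == recorded


def write_revision(path, text, mtime):
    """Write one revision and stamp it with an explicit modification time."""
    with open(path, "w") as f:
        f.write(text)
    os.utime(path, (mtime, mtime))


def demo():
    same_second = 1345832568   # arbitrary timestamp, reused for both revisions
    history_time = 1120262400  # hypothetical time taken from the SCCS history
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "nicstat.c")
        write_revision(path, " * 05-Jun-2005, ver 0.81\n", same_second)
        recorded = fingerprint(path)  # what was seen at "commit" time

        # Second revision: same length, same timestamp, different contents.
        write_revision(path, " * 02-Jul-2005, ver 0.90\n", same_second)
        missed = looks_unchanged(path, recorded)

        # The fix: give the revision its own timestamp before committing.
        os.utime(path, (history_time, history_time))
        detected = not looks_unchanged(path, recorded)
    return missed, detected


if __name__ == "__main__":
    print(demo())
```

Here demo() returns (True, True): the same-size, same-second edit is missed, and it is noticed once the file carries its own timestamp, which is the role "touch -t" plays in the conversion script.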



Analyzing Interrupt Activity with DTrace

This article is about interrupt analysis using DTrace. It is also available on the Solaris Internals and Performance FAQ Wiki, as part of the DTrace Topics collection.

Interrupt Analysis

Interrupts are events delivered to CPUs, usually by external devices (e.g. FC, SCSI, Ethernet and Infiniband adapters). Interrupts can cause performance and observability problems for applications.

Performance problems are caused when an interrupt "steals" a CPU from an application thread, halting its process while the interrupt is serviced. This is called pinning - the interrupt will pin an application thread if the interrupt was delivered to a CPU on which an application was executing at the time.

This can affect other threads or processes in the application if, for example, the pinned thread was holding one or more synchronization objects (locks, semaphores, etc.).

Observability problems can arise if we are trying to account for work the application is completing versus the CPU it is consuming. During the time an interrupt has an application thread pinned, the CPU it consumes is charged to the application.

Strategy

The SDT provider offers the following probes that indicate when an interrupt is being serviced:

    interrupt-start
    interrupt-complete

The first argument (arg0) to both probes is the address of a struct dev_info (AKA dev_info_t *), which can be used to identify the driver and instance for the interrupt.

Pinning

If the interrupt has indeed pinned a user thread, the following will be true:

    curthread->t_intr != 0;
    curthread->t_intr->t_procp->p_pidp->pid_id != 0

The pid_id field will correspond to the PID of the process that has been pinned. The thread will be pinned until either sdt:::interrupt-complete or fbt::thread_unpin:return fire.

DTrace Scripts

Attached are some scripts that can be used to assess the effect of pinning. These have been tested with Solaris 10 and Solaris 11.

Probe effect will vary.
De-referencing four pointers then hashing against a character string device name each time an interrupt fires, as some of the scripts do, can be expensive. The last two scripts are designed to have a lower probe effect if your application or system is sensitive to this.

The scripts and their outputs are:

- pin_by_drivers.d: How much drivers are pinning processes. Does not identify the PID(s) affected.
- pids_by_drivers.d: How much each driver is pinning each process.
- pid_cpu_pin.d: CPU consumption for a process, including pinning per driver, and time waiting on run queues.
- intr_flow.d: Identifies the interrupt routine name for a specified driver.

The following scripts are designed to have a lower probe effect:

- pid_pin_devi.d: Pinning on a specific process - shows drivers as raw "struct dev_info *" values.
- pid_pin_any.d: Lowest probe effect - shows pinning on a specific process without identifying the driver(s) responsible.

Resolving Pinning Issues

The primary technique used to improve the performance of an application experiencing pinning is to "fence" the interrupts from the application. This involves the use of either processor binding or processor sets (sets are usually preferable) to either dedicate CPUs to the application that are known to not have the high-impact interrupts targeted at them, or to dedicate CPUs to the driver(s) delivering the high-impact interrupts.

This is not the optimal solution for all situations. Testing is recommended.

Another technique is to investigate whether the interrupt handling for the driver(s) in question can be modified. Some drivers allow for more or less work to be performed by worker threads, reducing the time during which an interrupt will pin a user thread. Other drivers can direct interrupts at more than a single CPU, usually depending on the interface on which the I/O event has occurred.
Some network drivers can wait for more or fewer incoming packets before sending an interrupt.

Most importantly, only attempt to resolve these issues yourself if you have a good understanding of the implications, preferably one backed up by testing. An alternative is to open a service call with Oracle asking for assistance to resolve a suspected pinning issue. You can reference this article and include data obtained by using the DTrace scripts.

Exercise For The Reader

If you have identified that your multi-threaded or multi-process application is being pinned, but the stolen CPU time does not seem to account for the drop in performance, the next step in DTrace would be to identify whether any critical kernel or user locks are being held during any of the pinning events. This would require marrying information gained about how long application threads are pinned with information gained from the lockstat and plockstat providers.

References

- Solaris Processor Sets Made Easy
- Oracle Solaris 11 Network Virtualization Technology - and specifically the article that includes detail on Network Performance.



nicstat update - version 1.90

Yes! A new version is now available with some long-awaited features. Many thanks to those who suggested improvements and helped with testing.

Changes for Version 1.90, April 2011

Common

- nicstat.sh script, to provide for automated multi-platform deployment. See the Makefiles for details.
- Added "-x" flag, to display extended statistics for each interface.
- Added "-t" and "-u" flags, to include TCP and UDP (respectively) statistics. These come from tcp:0:tcpstat and udp:0:udpstat on Solaris, or from /proc/net/snmp and /proc/net/netstat on Linux.
- Added "-a" flag, which equates to "-tux".
- Added "-l" flag, which lists interfaces and their configuration.
- Added "-v" flag, which displays nicstat version.

Solaris

- Added use of libdladm.so:dladm_walk_datalink_id() to get list of interfaces. This is better than SIOCGLIFCONF, as it includes interfaces given exclusively to a zone. NOTE: this library/routine can be (by default is) linked in to nicstat in "lazy" mode, meaning that a Solaris 11 binary built with knowledge of the routine will also run on Solaris 10 without failing when the routine or library is not found - in this case nicstat will fall back to the SIOCGLIFCONF method.
- Added search of kstat "link_state" statistics as a third method for finding active network interfaces. See the man page for details.

Linux

- Added support for SIOCETHTOOL ioctl, so that nicstat can look up interface speed/duplex (i.e. "-S" flag not necessarily needed any longer).
- Removed need for LLONG_MAX, improving Linux portability.

Availability

nicstat source and binaries are available from sourceforge.

History

For more history on nicstat, see my earlier entry
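
To illustrate the Linux data source named for the "-t" and "-u" flags, here is a minimal Python sketch of reading counters in the /proc/net/snmp layout, where each protocol has a line of field names followed by a line of values. This is not nicstat's implementation (nicstat is written in C); the function name parse_proc_net_snmp and the SAMPLE text, with its shortened field list and made-up values, are assumptions for illustration only.

```python
# Shortened, made-up sample in the /proc/net/snmp layout:
# one "Proto: names..." line followed by one "Proto: values..." line.
SAMPLE = (
    "Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens "
    "InSegs OutSegs RetransSegs\n"
    "Tcp: 1 200 120000 -1 15 7 1200 1100 3\n"
    "Udp: InDatagrams NoPorts InErrors OutDatagrams\n"
    "Udp: 100 2 0 90\n"
)


def parse_proc_net_snmp(text):
    """Return {protocol: {counter_name: int_value}} from snmp-style text."""
    stats = {}
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Lines come in pairs: names first, then the matching values.
    for header, values in zip(lines[0::2], lines[1::2]):
        proto, names = header.split(":", 1)
        value_proto, numbers = values.split(":", 1)
        if proto != value_proto:
            raise ValueError("mismatched header/value lines for " + proto)
        stats[proto] = dict(zip(names.split(), (int(n) for n in numbers.split())))
    return stats


if __name__ == "__main__":
    stats = parse_proc_net_snmp(SAMPLE)
    print(stats["Tcp"]["InSegs"], stats["Udp"]["OutDatagrams"])
```

On a Linux system the same function could be fed the contents of /proc/net/snmp; statistics such as rates would then come from the difference between two samples taken an interval apart.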



Oracle RAC Cache Fusion Testing on SPARC Enterprise T5440

Table of Contents

RAC Cache Fusion on SPARC Enterprise T5440
Summary of Findings
Configuration Under Test
10 Gigabit versus 1 Gigabit Interconnect
Scaling up Number of LMS Processes
Theoretical Peak of 10 Gigabit Interconnect
CPU Tuning – Processor Sets and Interrupt Fencing
Network Tuning
Choice of Block Size
Impact of Additional 10 Gigabit Interface
How About if We Use Remote SQL*Plus Clients?
Bi-Directional Cache Fusion test
CPU Impact of Select Tunings
Appendix A – Notes on the Benchmark
Appendix B – How to Set Network Tunings

Summary of Findings

- The minimum latency of an 8KB block transfer can be reduced by 25% with network tuning.
- Latency increases with block size but not dramatically; for example, latency of a 16KB block transfer is around 50% more than for 2KB block transfers.
- Latency increases almost linearly with load (number of blocks transferred per second) - as expected - till it reaches the "knee" of the curve, where the latency starts to increase very fast. The "knee" for 4KB block size is about 2 times that for 16KB block size. This happens when the CPU(s) running LMS process(es) become saturated.
- With no tunings, 4 LMS processes are needed to saturate a 1 Gigabit NIC using 8KB block size. The default number of LMS servers configured by 11gR2 on the SPARC Enterprise T5440 is 10.
- With or without network tuning, we are not able to saturate a single 1 Gigabit NIC using 2KB block size. Tuning for low latency will sometimes see a 1 Gigabit NIC saturated, but throughput degrades significantly at loads beyond the saturation point.
- Using a 10 Gigabit NIC, throughput increases up to 2.33 times by increasing the number of LMS processes from the default 10 to the maximum 35.
- The response time was marginally better for 10 Gigabit than for 1 Gigabit.
- Provisioning a larger-than-optimal number of LMS processes does not cause a significant increase in latency at low to medium throughput. Further, increasing the number of LMS servers only increases server and client CPU consumption by a small amount.
- Placing LMS processes into a processor set, and fencing that processor set from interrupts, provides up to 40% more throughput than Out-Of-the-Box when using a 10 Gigabit interconnect.
- Jumbo Frames is confirmed as a Best Practice for RAC clusters. They offer around 20% better throughput.
- On the network side, enabling UDP check-sum offload, disabling RX soft rings and disabling interrupt blanking can improve latency by up to 32%. This also has a side benefit of reduced CPU consumption (normalized by throughput).
- Use of two 10 Gigabit interfaces can potentially provide the best of both worlds – low latency and high throughput.

Configuration Under Test

The following components were used in testing RAC interconnect performance on the T5440:

- Sun SPARC Enterprise T5440, 1.4 GHz, 256 MB RAM (two)
- 4 x 1 Gigabit Ethernet (built-in)
- 2 x 10 Gigabit PCIe Ethernet (Neptune)
- Solaris 10 Update 9 (development build)
- Oracle release (11gR2)
- Sun StorEdge 6140 Arrays

Overview of Benchmark

This testing uses a simple test to exercise Cache Fusion. The framework for the test includes a single table, with one row per database block. The method involved is to:

1. Load all rows to be used in a test on one node, to be known as the “block-serving node”.
2. Update these rows then do a checkpoint.
3. Query the rows using a simple query via SQL*Plus from the second node, to be known as the “client” node.

With the correct settings to your database, we reach a steady state where all the SQL*Plus clients are requesting blocks which need to be satisfied via Global Cache transfers from the block-serving node to the client node.
Very little disk I/O is happening, just network traffic.

For further details on the benchmark, see Appendix A – Notes on the Benchmark.

10 Gigabit versus 1 Gigabit Interconnect

The Sun SPARC Enterprise T5440 includes 4 x 1 Gigabit Ethernet connections. A common choice to both increase available throughput and reduce latency is to add one or more 10 Gigabit Ethernet links. Let's compare the performance of these choices.

- 1 Gigabit has reached saturation at 24 clients. 10 Gigabit offers much more throughput.
- 10 Gigabit does offer better latency.
- These charts do not yet explore the maximum throughput available for 10 Gigabit – see 10 Gigabit below.

Scaling up Number of LMS Processes

1 Gigabit

What happens if we increase the number of LMS processes?

- Here we can see the effects of the LMS processes becoming saturated, which is the case for 1, 2 and 3 LMS.
- Once we have more LMS processes than we need to saturate the NIC, we can get much greater throughput, as seen for 4 and 8 LMS.

LMS CPU Consumption as LMS Processes are Increased

- This is not normalized to the number of LMS processes, so the 2 LMS maximum is 200%, 4 LMS is 400%, etc.
- Again we can see that for 1 and 2 LMS, we are reaching CPU saturation, but not for greater numbers of LMS, where the figures show the NIC is saturated.
- No significant increase in CPU consumption for 12 LMS over 8, even though we have established 12 LMS offers no other advantage in throughput.

10 Gigabit

Throughput - Can we Reach NIC Saturation?

Out of the box, the default number of LMS processes for 11gR2 on the SPARC Enterprise T5440 is 10. This does not change if you change the number or type of interconnects. The maximum number of LMS processes that can be started is 35.

- Using the default 10 LMS processes significantly limits throughput for 1 x 10 Gigabit.
- Using the maximum of 35 is similar to using 32.
- At the peak, the throughput for 35 LMS is 2.33 times the throughput for 10 LMS servers.

Response Time

- At low throughput, the difference in latency as the number of LMS is increased is negligible.
- As the throughput is increased, the larger number of LMS servers is able to respond to the greater throughput with less degradation.

Theoretical Peak of 10 Gigabit Interconnect

Observing that we have been able to easily saturate a 1 Gigabit interconnect, but unable to saturate a 10 Gigabit interconnect, a colleague suggested that it would be unlikely we could saturate a 10 Gigabit interconnect due to the average packet size of Cache Fusion traffic – approximately 4,300 bytes for a raw, 8KB unidirectional Cache Fusion test workload on the side requested blocks are sent.

We decided to test the 10 Gigabit network link to see just what bandwidth we could get if there were no database or Cache Fusion test involved. The tool for this is uperf[1], which was configured to replicate and scale up the packet flow as observed from a one-client run.

- These tests used an “Out-of-the-Box” tuning for the network, with the exception of Jumbo Frames being enabled for both tests.
- uperf peak is 87.5% versus 70.6% for our Cache Fusion test.
- Our Cache Fusion test used 64 LMS processes.

We can also test how our Cache Fusion test compares to uperf when the nodes have their networks tuned for low latency. This is a tuning we have called “N3” that is described later in Tuning for Network Latency.

- The theoretical maximum bandwidth is lower for this tuning, as we have observed for our Cache Fusion test.
- uperf peak is 70.1% versus 54.5% for Cache Fusion test.
- Cache Fusion test used 35 LMS processes.

CPU Tuning – Processor Sets and Interrupt Fencing

Processor Sets – 1 Gigabit

For these studies, the “T1” tuning implements a processor set of 16 CPUs (2 cores) on the block-serving node, with interrupts disabled on these CPUs.
The “T2” tuning is the same as T1, but in addition we have all CPUs in three of the four sockets turned off.

Throughput – Out of Box versus T1, T2

- Very small improvement for T1 psrset (e.g. 1.7% at 32 clients)
- Same for T2 - 3.1% higher than baseline at 32 clients
- Once the interconnect is saturated, all perform the same.

Response Time by CR Block Throughput – Out-of-Box versus T1, T2

- As for throughput, very slight improvement; no difference once we saturate the interconnect.

Processor Sets – 10 Gigabit

For 10 Gigabit, let's examine runs with 32 LMS's, with and without processor sets for the block-server node LMS processes. Processor sets can be used to isolate processes from each other, fence processes from the effects of interrupts, limit the cores or sockets processes run on (which in turn affects their cache usage and memory locality efficiency), or combinations of these.

The processor sets used were:

- T3: 64 CPUs, on 8 cores (each core 100% dedicated to set), across all 4 sockets, with interrupts disabled for the processor set
- T7: 32 CPUs, on 8 cores (each core 50% dedicated to set), across all 4 sockets, with interrupts disabled for the processor set

Throughput – Out of Box versus T3 and T7 Processor Sets

- Effective? Yes. That is +33% and +35% at 240 clients for T3 and T7, respectively.
- Both variants offer similar improvements. T7 potentially offers greater co-operation with other workloads on the same system – hard to tell from our Cache Fusion test as there is little other load.

Latency – Out of Box versus T3 Processor Set

In this case, let's use our network tuning, so that we have already improved latency as much as we can that way.

- You have to take my word – there are two lines in that chart. So, I think we can conclude the difference is negligible.

Network Tuning

Jumbo Frames

Jumbo Frames is a feature of some NICs that allows the use of Ethernet Maximum Transmission Units (MTU) greater than 1500. This is an established Best Practice for Oracle RAC.
Let's see why.

Throughput – Jumbo Frames versus No Jumbo Frames

- No Jumbo is MTU of 1500
- Jumbo frames offer 20% more throughput

Response Time by CR Block Throughput, Jumbo Frames v No Jumbo

- Both workloads reach a peak then retrograde
- Jumbo frames obviously better

LMS CPU Consumption by Throughput, Jumbo Frames v No Jumbo

- As expected, use of Jumbo Frames entails lower CPU consumption by the LMS processes
- There is a similar but less pronounced gap in system-wide CPU consumption on both nodes

Tuning for Network Latency

Here we will study a set of network tunings that are intended to improve network latency. There are usually a number of trade-offs involved in network tuning; between latency of individual packets, peak packet throughput, peak bandwidth, etc.

The following configurations are compared against the Out-Of-the-Box (OOB) network configuration – although we are using Jumbo Frames in all configurations, which is already a recognized best practice for RAC and other large packet workloads:

- N1 – use of hardware offload for UDP check-sums. This is normally disabled for nxge interfaces, as it can sometimes fail. In the case of RAC, block transfers are check-summed by the database, so we do not risk data corruption by enabling this feature.
- N2 – N1 plus disabling of RX soft rings. Soft rings are enabled by default to provide scalability to very high packet rates per interface. RAC does not involve very high packet rates, so can we make the trade-off the other way and use simpler delivery via a single CPU?
- N3 – N2 plus disabling of interrupt blanking. Interrupt blanking is used to reduce the number of times a network interface interrupts a CPU to deliver packets. Again, we have a low packet rate, so can we afford one interrupt per packet?
- N4 – N3 plus disabling of TX serialization. This is similar to soft rings, but on the transmit side.

For details on how to enable these network tunings, see Appendix B – How to Set Network Tunings.

- All tunings up to N3 appear to provide incrementally better response time
- N3 offers a 14-32% reduction in latency
- Similar improvements for throughput
- N3 looks like the best choice of tunings overall

Broader Comparisons – OOB versus N3 Tuning

Seeing N3 appears to be a likely recommendation, let's look closer at the numbers.

- Data indicates the N3 tuning has saturated the network interface at 32 clients, whereas OOB peaks around 90% saturation
- A different view of RT – normalized to the throughput
- N3 better than OOB across range, with higher peak throughput capability
- Useful to consider – does the disabling of many of these features mean there is higher CPU consumption? There may be at the system level (I can dig up the numbers from these experiments), but the impact on the LMS processes is a reduction across the range.
- Network utilization is also plotted so that you can see how the OOB configuration does not manage to saturate the network

Choice of Block Size

There are a range of block sizes available with Oracle RDBMS. The most common choices for SPARC platforms are 8 KB and 4 KB. Here we examine the impact of block size for a single 1 Gigabit interface.

- Larger block sizes reach saturation earlier
- It took a larger number of LMS processes to reach saturation (or near saturation) as the block size decreased – 4 for 16 and 8 KB, 16 for 4KB and 32 for 2KB
- 2KB blocks were unable to saturate 1 x 1 Gigabit. Average packet size is 1240 bytes on the side blocks are sent.

Is There a Bottleneck For 2 KB?

The above chart shows that for 2 KB, we were not able to saturate a single 1 Gigabit interface.
This was investigated further with some different configurations and the results are as follows:

- Configurations using 4, 16 and 32 LMS were tried, with and without network tuning intended to improve latency (see discussion of “N3” later in this report).
- The maximum achieved was 39,081 blocks/second, which corresponds to 97% utilization of the network interface.

Response Time for Various Block Sizes

Note: some samples beyond peak throughput have been omitted for legibility

- Here we can see the response time increases as we increase the block size.
- As each workload reaches or approaches saturation, it is able to stay close to peak throughput as the response time degrades

LMS CPU% by Throughput, for each Block Size

- This is normalized to the number of LMS processes (2), so the maximum, indicating saturation of the 2 LMS processes, is 100%
- This verifies we are saturating the LMS processes before the 1 Gigabit NIC is saturated

Impact of Additional 10 Gigabit Interface

The next step beyond increasing your bandwidth to 10 Gigabits might be – use two 10 Gigabit interfaces. So, we then want to know, do we get any scalability out of this? The following configurations were done with 35 LMS processes – the maximum.

- For the N3 (better latency) configurations, there is little difference between 1 and 2 interfaces below 64 clients, but then there is an advantage of up to 20% as we head toward 192 clients. Then there is some degradation for the 2 x 10 Gigabit configuration.
- For the non-N3 configurations, we see only a slight advantage for 2 interfaces through the entire range. We do see the peak throughput advantage return to the non-N3 configurations at 192 clients and above; or around 85,000 CR blocks/second.
- Peak for 1 x 10 Gigabit is 89,729 at 224 clients
- Peak for 2 x 10 Gigabit is 93,817 at 240 clients, an increase of 4.6%
- As expected, latency is better for the N3 configurations at lower throughput, but little different for the extra interface.
- The N3 configuration with 2 interfaces nearly matches the non-N3 configurations in terms of scalability before degrading.

How About Impact for 4KB Blocks?

For these experiments, we were using 35 block-serving LMS's, 4KB blocks (raw), with _fast_cursor_reexecute=true, N3 and T3 tunings.

- Very little difference
- Notice also that 4KB throughput peaks at a lower number of clients than for 8KB.
- Same as for throughput – little difference; 2 x 10 Gigabit slightly worse

Conclusions

So, an additional 10 Gigabit interface does not provide a 100% increase in maximum throughput, but we can conclude:

- In the absence of extreme throughput requirements, the “best of both worlds” configuration for 8 KB database blocks would be two 10 Gigabit interfaces with the N3 tuning.
- In general, just adding a second 10 Gigabit interface and making no other changes offers little additional scalability.

There is one other issue with multiple network interfaces (of any kind) with RAC at present – which is that there is no fail-over capability that you might expect with NIC bonding at the OS level. This OS feature is to be tested in a future study.

How About if We Use Remote SQL*Plus Clients?

The normal configuration for our Cache Fusion test is to use SQL*Plus clients on the “client” RAC node. We could instead use SQL*Plus on remote node(s) and see if that makes a difference.
The differences would potentially be:

- Reduced CPU/memory consumption
- Introduction of a small amount of think time between queries

We can also investigate if there is any effect on client network traffic from our N3 network tuning.

The first thing discovered was that remote client network traffic is very small (less than one packet/sec), so an effect of N3 network tuning on the remote client will likely be immeasurable.

This may be because the database foreground processes still run on the client RAC node, and they are responsible for nearly all the CPU and memory consumption also. Here are the results:

- Remote clients show slightly less throughput than local clients
- Runs at 240 & 256 clients were inconsistent
- Local clients get slightly better response time than remote – note that the response time is measured between RAC nodes, not from the perspective of the client.

One thing that did change with the move to remote clients was the profile of Oracle's Top 5 Timed Events, as reported via statspack. We went from 16.0% on “gc cr block lost”, to 7.7% “gc cr block lost” and 8.0% “gc buffer busy acquire”.
See below for details.

Local Clients – 240 Client run, 97,555 blocks/second

    Top 5 Timed Events                                          Avg %Total
    ~~~~~~~~~~~~~~~~~~                                         wait   Call
    Event                                  Waits    Time (s)   (ms)   Time
    ------------------------------- ------------ ----------- ------ ------
    gc cr block 2-way                 57,970,653      71,341      1   43.3
    CPU time                                          65,674          39.9
    gc cr block lost                      48,849      26,341    539   16.0
    gc cr block congested                443,140         695      2     .4
    ges message buffer allocation     52,587,636         384      0     .2

Remote Clients – 240 Client run, 91,039 blocks/second

    Top 5 Timed Events                                          Avg %Total
    ~~~~~~~~~~~~~~~~~~                                         wait   Call
    Event                                  Waits    Time (s)   (ms)   Time
    ------------------------------- ------------ ----------- ------ ------
    CPU time                                          69,851          42.4
    gc cr block 2-way                 54,072,799      67,608      1   41.1
    gc buffer busy acquire             6,477,349      13,139      2    8.0
    gc cr block lost                      23,729      12,664    534    7.7
    gc cr block congested                449,769         739      2     .4

Bi-Directional Cache Fusion test

The conventional Cache Fusion test involves loading up a table on one RAC node (by updating each row), then querying it from another node in the same RAC cluster.
This produces lop-sided network traffic – for example, at a throughput of 97,824 8KB blocks/second, we see 805 MB/s in the direction the blocks are being sent, but only 54 MB/s in the other direction.

Let us now look at what we get if we split the table into two ranges, and have queries originating on both nodes.

Bi-directional – 1 x 10 Gigabit

Here we compare a 16 + 16 LMS bi-directional configuration to both 16 and 32 LMS uni-directional setups.

- Significant advantage to either 16 + 16 LMS bi-directional or 32 LMS uni-directional configuration – each around 65% higher peak throughput.
- Bi-directional workload has slightly higher throughput for same number of total LMS server processes – 2.0% higher at the peak
- The peak on this chart corresponds to 51.8% utilization on the single 10 Gigabit NIC in use, so it is unlikely that network bandwidth or packet rate is the bottleneck

Bi-directional – 1 x 1 Gbit

How about if we do the same where we know we have saturated (at least one direction of) the interconnect?

- As we might expect, the bi-directional Cache Fusion test on a single 1 Gigabit link sees quite a bit more Cache Fusion transfers. At a 79% improvement, we do not see a doubling however.

CPU Impact of Select Tunings

An important question is – What is the effect in CPU consumption of tuning my system in these ways? This is important because our Cache Fusion test exercises a small part of RAC, using a very simple query. It would be good to know how much CPU is available to a “real world” workload after we tune for best Cache Fusion performance.

“N3” Network Tuning for best Latency

First, let's look at the impact of the “N3” network tuning. It might be expected that these tunings will increase CPU load, although the offloading of UDP check-sums to network hardware in theory should reduce CPU consumption.
Let's look at an “entry-level” comparison – a single 1 Gigabit interface and 3 LMS server processes.

- The simple answer is that the N3 tuning actually consumes less CPU, both on the server node (called “1” here) and on the client node generating the Cache Fusion requests (called “2” here).
- Is there a suggestion of a bottleneck on the client node, hinted at by the near-vertical increase in CPU consumption?

How about if we are generating much more throughput? Perhaps with 1 x 10 Gigabit and 64 LMS processes?

- Again, N3 tuning consumes less CPU. We do forgo a higher peak throughput for this configuration, however.
- The “base” configuration went considerably retrograde in throughput, and this is not shown.

Impact of Increasing Number of LMS Processes

How about if we scale up the number of LMS processes on the block-serving node? We do this so that we can handle more throughput – do we suffer in terms of CPU consumption?

- Here we see the CPU consumption on the block-serving node is little different as we add more LMS servers
- There is a more noticeable increase in CPU consumption on the client side, although it is still small

Appendix A – Notes on the Benchmark

We use several non-standard tunings to achieve the desired behavior where we see a constant stream of database blocks from the buffer cache of the block-serving node to the client node. These include:

Buffer sizing

We size the database buffers on the block-serving node large enough to hold the range of blocks we will use, but not large enough on the client node, so that we can iterate through the range of blocks and, by the time we query any block that was previously transferred to the client node, it will have since been evicted from the client node's cache.

_fairness_threshold

This is set to 0, which means that the block-serving node will retain blocks in its cache even in the face of the client node's requests and absence of any accesses local to the block-serving node.
Without this setting, the block-serving node would relinquish ownership of the blocks that it has cached.

_fast_cursor_reexecute

This setting advises the database to maintain cursor context for queries, so that they can be re-executed more efficiently. Without this setting we hit a bottleneck with many simultaneous SQL*Plus clients.

Fusion Compression

Fusion Compression is a feature, on by default, where “empty” space in a database block is not transmitted in response to the GC block request. Instead, the block is transmitted as just the meta data and rows (or other data), then re-constituted on the receiving side.

For our Cache Fusion test framework, we have only one row per DB block, for each block size. This is so that we can better control the GC block caching and transfer. This means we want to disable GC Fusion Compression, so that our network utilization is more in line with real world experience. Experiments with Compression disabled will be referred to as “Raw”.

The impact of using Fusion Compression, for various block sizes, is illustrated below:

- The size of a transfer, when there is a single row in a block, is fairly constant regardless of block size.
- The effect of different block sizes on throughput is therefore masked by compression when we have only one row per block.
- The effect on response time is also masked.

Hence the decision to disable Fusion Compression for these tests.

Appendix B – How to Set Network Tunings

The simple details on how to set the network tunings used in these tests.
Some of these are specific to the “nxge” interface.

Jumbo Frames

This depends on your Solaris version. You may need to use:

    # dladm set-linkprop -p mtu=9000 nxge0

or you may need to add this to /platform/sun4v/kernel/drv/nxge.conf:

    accept_jumbo = 1;

Hardware offload for UDP checksums:

Add this to /etc/system:

    set nxge:nxge_cksum_offload = 1

Disable RX soft rings:

Add this to /etc/system:

    set ip:ip_squeue_fanout = 0
    set ip:ip_soft_rings_cnt = 0

Disable TX serialization:

Add this to /etc/system:

    set nxge:nxge_tx_scheme = 1

Set RX interrupt blanking:

Add this to /platform/sun4v/kernel/drv/nxge.conf:

    rxdma-intr-time = 1;
    rxdma-intr-pkts = 8;



querystat - DTrace script to monitor your queries, query cache and server thread pre-emption

I was recently helping some colleagues check what was happening with their MySQL queries, and wrote a DTrace script to do it. Time to share that script.

First of all, a look at some output from the script:

mashie[bash]# ./querystat.d -p `pgrep mysqld`
Tracing started at 2009 Sep 17 16:28:35
2009 Sep 17 16:28:38 throughput 3 queries/sec
2009 Sep 17 16:28:41 throughput 4 queries/sec
2009 Sep 17 16:28:44 throughput 528 queries/sec
2009 Sep 17 16:28:47 throughput 1603 queries/sec
2009 Sep 17 16:28:50 throughput 1676 queries/sec
^C
Tracing ended at 2009 Sep 17 16:28:51
Average latency, all queries: 107 us

Latency distribution, all queries (us):
           value  ------------- Distribution ------------- count
              16 |                                         0
              32 |@@                                       170
              64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        3728
             128 |@@@@@                                    533
             256 |                                         26
             512 |                                         18
            1024 |                                         2
            2048 |                                         1
            4096 |                                         0
            8192 |                                         1
           16384 |                                         1
           32768 |                                         0

Query cache statistics:
    count hit: 6
    count miss: 4474
    avg latency miss: 107 (us)
    avg latency hit: 407 (us)

Latency distribution, for query cache hit (us):
           value  ------------- Distribution ------------- count
              64 |                                         0
             128 |@@@@@@@@@@@@@                            2
             256 |@@@@@@@                                  1
             512 |@@@@@@@@@@@@@@@@@@@@                     3
            1024 |                                         0

Latency distribution, for query cache miss (us):
           value  ------------- Distribution ------------- count
              16 |                                         0
              32 |@@                                       170
              64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        3728
             128 |@@@@@                                    531
             256 |                                         25
             512 |                                         15
            1024 |                                         2
            2048 |                                         1
            4096 |                                         0
            8192 |                                         1
           16384 |                                         1
           32768 |                                         0

Average latency when query WAS NOT pre-empted: 73 us
Average latency when query WAS pre-empted: 127 us

Pre-emptors:
[...]
  mysql          6
  Xorg           18
  sched          25
  firefox-bin    44
  sysbench       3095

You can see that while the script is running (prior to pressing <Ctrl>-C), we get a throughput count every 3 seconds. Then we get some totals, some averages, and even some distribution histograms, covering all queries, then with breakdowns on whether we used the query cache, and whether the thread executing the query was pre-empted.

This may be useful for determining things like:

  - Do I have some queries in my workload that consume a lot more CPU than others?
  - Is the query cache helping or hurting?
  - Are my database server threads being pre-empted (kicked off the CPU) by (an)other process(es)?

Things have become easier since I first tried this, and had to use the PID provider to trace functions in the database server.

If you want to try my DTrace script, get it from here. NOTE: You will need a version of MySQL with DTrace probes for it to work.
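For readers unfamiliar with the histogram layout above, here is a rough Python sketch of how latencies end up in power-of-two rows. It assumes the histograms come from DTrace's quantize() aggregation, which the output format suggests but the post does not state. The function names and the sample latencies are made up for illustration and are not part of querystat.d.

```python
from collections import Counter


def quantize_bucket(value_us: int) -> int:
    """Return the row label a positive latency would fall under.

    Each row holds values from its label up to (but excluding) the next
    power of two, so 107 us lands in the row labelled 64.
    """
    if value_us < 1:
        return 0
    return 1 << (value_us.bit_length() - 1)


def summarize(latencies_us):
    """Return (integer average, {row label: count}) for a list of latencies."""
    buckets = Counter(quantize_bucket(v) for v in latencies_us)
    average = sum(latencies_us) // len(latencies_us)
    return average, dict(sorted(buckets.items()))


if __name__ == "__main__":
    # Hypothetical latencies in microseconds, not taken from the run above.
    avg, hist = summarize([40, 70, 100, 127, 128, 300])
    print("average:", avg, "us")
    for label, count in hist.items():
        print(f"{label:>8} | {count}")
```

This is why an average of 107 us sits comfortably alongside a distribution whose tallest row is labelled 64: that row covers everything from 64 to 127 us.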



nicstat - the Solaris and Linux Network Monitoring Tool You Did Not Know You Needed

Update - Version 1.95, January 2014

Added "-U" option, to display separate read and write utilization. Simplified display code regarding "-M" option. For Solaris, fixed fetch64() to check type of kstats and fixed memory leak in update_nicdata_list(). Full details at the entry for version 1.95

Update - Version 1.92, October 2012

Added "-M" option to display throughput in Mbps (Megabits per second). Fixed some bugs. Full details at the entry for version 1.92

Update - Version 1.90, July 2011

Many new features available, including extended NIC, TCP and UDP statistics. Full details at the entry for version 1.90

Update - February 2010

Nicstat now can produce parseable output if you add a "-p" flag. This is compatible with System Data Recorder (SDR). Links below are for the new version - 1.22.

Update - October 2009

Just a little one - nicstat now works on shared-ip Solaris zones.

Update - September 2009

OK, this is heading toward overkill...

The more I publish updates, the more I get requests for enhancement of nicstat. I have also decided to complete a few things that needed doing.

The improvements for this month are:

  - Added support for a "fd" or "hd" (in reality anything starting with an upper or lower-case F or H) suffix to the speed settings supplied via the "-S" option. This advises nicstat the interface is half-duplex or full-duplex.
  - The Linux version now calculates %Util the same way as the Solaris version.
  - Added a script, enicstat, which uses ethtool to get speeds and duplex modes for all interfaces, then calls nicstat with an appropriate -S value.
  - Made the Linux version more efficient.
  - Combined the Solaris and Linux source into one nicstat.c. This is a little ugly due to #ifdef's, but that's the price you pay.
  - Wrote a man page.
  - Wrote better Makefiles for both platforms
  - Wrote a short README
  - Licensed nicstat under the Artistic License 2.0

All source and binaries will from now on be distributed in a tarball. This blog entry will remain the home of nicstat for the time being.

Lastly, I have heard the requests for easier availability in OpenSolaris. Stay tuned.

Update - August 2009

That's more like it - we should get plenty of coverage now :)

A colleague pointed out to me that nicstat's method of calculating utilization for a full-duplex interface is not correct. Now nicstat will look for the kstat "link_duplex" value, and if it is 2 (which means full-duplex), it will use the greater of rbytes or wbytes to calculate utilization.

No change to the Linux version. Use the links in my previous post for downloading.

Update - July 2009

I should probably do this at least once a year, as nicstat needs more publicity...

A number of people have commented to me that nicstat always reports "0.00" for %Util on Linux. The reason for this is that there is no simple way an unprivileged user can get the speed of an interface in Linux (quite happy for someone to prove me wrong on that however).

Recently I got an offer of a patch from David Stone, to add an option to nicstat that tells it what the speed of an interface is. Pretty reasonable idea, so I have added it to the Linux version. You will see this new "-S" option explained if you use nicstat's "-h" (help) option.

I have made another change which makes nicstat more portable, hence easier to build on Linux.

History

A few years ago, a bloke I know by the name of Brendan Gregg wrote a Solaris kstat-based utility called nicstat. In 2006 I decided I needed to use this utility to capture network statistics in testing I do. Then I got a request from a colleague in PAE to do something about nicstat not being aware of "e1000g" interfaces.

I have spent a bit of time adding to nicstat since then, so I thought I would make the improved version available.

Why Should I Still Be Interested?
nicstat is to network interfaces as "iostat" is to disks, or "prstat" is to processes. It is designed as a much better version of "netstat -i". Its differences include:

  - Reports bytes in & out as well as packets.
  - Normalizes these values to per-second rates.
  - Reports on all interfaces (while iterating)
  - Reports Utilization (rough calculation as of now)
  - Reports Saturation (also rough)
  - Prefixes statistics with the current time

How about an example?

eac-t2000-3[bash]# nicstat 5
    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
17:05:17      lo0    0.00    0.00    0.00    0.00    0.00    0.00  0.00   0.00
17:05:17  e1000g0    0.61    4.07    4.95    6.63   126.2   628.0  0.04   0.00
17:05:17  e1000g1   225.7   176.2   905.0   922.5   255.4   195.6  0.33   0.00
    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
17:05:22      lo0    0.00    0.00    0.00    0.00    0.00    0.00  0.00   0.00
17:05:22  e1000g0    0.06    0.15    1.00    0.80   64.00   186.0  0.00   0.00
17:05:22  e1000g1    0.00    0.00    0.00    0.00    0.00    0.00  0.00   0.00
eac-t2000-3[bash]# nicstat -i e1000g0 5 4
    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
17:08:49  e1000g0    0.61    4.07    4.95    6.63   126.2   628.0  0.04   0.00
17:08:54  e1000g0    0.06    0.04    1.00    0.20   64.00   186.0  0.00   0.00
17:08:59  e1000g0   239.2    2.33   174.4   33.60  1404.4   71.11  1.98   0.00
17:09:04  e1000g0    0.01    0.04    0.20    0.20   64.00   186.0  0.00   0.00

For more examples, see the man page.

References & Resources

You can get source and binaries from sourceforge

Note - the Solaris binaries will work on later releases, and probably on earlier releases of Solaris - as Solaris is just like that...

  - brendangregg.com - Downloads, the original nicstat
  - solarisinternals.com - Performance Tool List
  - OpenSolaris Forums - Posting from Brendan about recent updates to nicstat
  - Blog O' Matty - Viewing NIC throughput with nicstat
  - Weak Focus - Nicstat, Solaris network utilization
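The duplex-aware utilization rule described in the August 2009 update can be sketched in a few lines of Python. This is only an illustration, not nicstat's actual C code: the function name and parameters are invented here, summing both directions for half-duplex is an assumption (the post only states the full-duplex rule), and capping at 100% is a choice made for the sketch.

```python
def utilization_pct(rbytes_per_s: float, wbytes_per_s: float,
                    speed_bits_per_s: float, full_duplex: bool) -> float:
    """Estimate interface utilization as a percentage of link speed."""
    if speed_bits_per_s <= 0:
        # Speed unknown: nothing sensible to report (compare the "0.00"
        # seen on Linux when no speed is supplied via -S).
        return 0.0
    if full_duplex:
        # Each direction has the full link speed, so the busier one counts.
        busiest = max(rbytes_per_s, wbytes_per_s)
    else:
        # ASSUMPTION: half-duplex shares one channel, so add both directions.
        busiest = rbytes_per_s + wbytes_per_s
    # Bytes to bits, then scale to a percentage; cap is a sketch-only choice.
    return min(100.0, 100.0 * busiest * 8 / speed_bits_per_s)


if __name__ == "__main__":
    # Hypothetical traffic on a 1 Gbit/s link: 50 MB/s in, 25 MB/s out.
    print(utilization_pct(50e6, 25e6, 1e9, full_duplex=True))
    print(utilization_pct(50e6, 25e6, 1e9, full_duplex=False))
```

With the same traffic, the half-duplex figure comes out higher than the full-duplex one, which is the over-reporting the colleague's correction was about.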



pstime - a mash-up of ps(1) and ptime(1)

I have done some testing in the past where I needed to know the amount of CPU consumed by a process more accurately than I can get from the standard set of operating system utilities.

Recently I hit the same issue - I wanted to collect CPU consumption of mysqld.

To capture process CPU utilization over an interval on Solaris, about the best I can get is the output from a plain "prstat" command, which might look like:

mashie ) prstat -c -p `pgrep mysqld` 5 2
Please wait...
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  7141 mysql     278M  208M cpu0    39    0   0:38:13  40% mysqld/45
Total: 1 processes, 45 lwps, load averages: 0.63, 0.33, 0.18
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  7141 mysql     278M  208M cpu1    32    0   0:38:18  41% mysqld/45
Total: 1 processes, 45 lwps, load averages: 0.68, 0.34, 0.18

I am after data from the second sample only (still not sure exactly how prstat gets data for the first sample, which comes out almost instantaneously), so you can guess I will need some sed/perl that is a little more complicated than I would prefer.

pstime reads PROCFS (i.e. the virtualized file-system mounted on /proc) and captures CPU utilization figures for processes. It will report the %USR and %SYS either for a specific list of processes, or every process running on the system (i.e., running at both sample points). The start sample time is recorded in high resolution at the time a process' data is captured, and then again after N seconds, where N is the first parameter supplied to pstime.

The default output of pstime is expressed as either a percentage of whole system CPU, or CPU seconds, with four significant digits. Solaris itself records the original figures in nanosecond resolution, although we do not expect today's hardware to be that accurate.

Here is an example:

mashie ) pstime 10 `pgrep sysbench\|mysqld`
     UID    PID  %USR  %SYS COMMAND
   mysql   7141 44.17 3.391 /u/dist/mysql60-debug/bin/mysqld --defaults-file=/et
   mysql  19870 2.517 2.490 sysbench --test=oltp --oltp-read-only=on --max-time=
   mysql  19869 0.000 0.000 /bin/sh -p ./run-sysbench

Downloads

  - Source - pstime.c
  - Binary - pstime.i386, built on Solaris 9
  - Binary - pstime.sparc, built on Solaris 9
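The two-sample arithmetic behind figures like %USR and %SYS can be sketched as follows. This is a hedged illustration in Python, not pstime's source: the Sample type and cpu_percent function are hypothetical names, and dividing by the number of CPUs is an assumption based on the phrase "percentage of whole system CPU".

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One reading of a process's cumulative CPU time, all in nanoseconds."""
    wall_ns: int  # high-resolution timestamp when the reading was taken
    usr_ns: int   # cumulative user-mode CPU time
    sys_ns: int   # cumulative system-mode CPU time


def cpu_percent(start: Sample, end: Sample, ncpus: int = 1):
    """Return (%USR, %SYS) consumed between two samples.

    ASSUMPTION: "whole system CPU" is taken to mean elapsed time multiplied
    by the number of CPUs.
    """
    elapsed = end.wall_ns - start.wall_ns
    if elapsed <= 0:
        raise ValueError("end sample must be later than start sample")
    scale = 100.0 / (elapsed * ncpus)
    return (end.usr_ns - start.usr_ns) * scale, (end.sys_ns - start.sys_ns) * scale


if __name__ == "__main__":
    # Hypothetical readings taken 10 seconds apart on a single CPU.
    before = Sample(wall_ns=0, usr_ns=0, sys_ns=0)
    after = Sample(wall_ns=10_000_000_000, usr_ns=4_417_000_000, sys_ns=339_100_000)
    usr, sys_ = cpu_percent(before, after)
    print(f"%USR={usr:.4g} %SYS={sys_:.4g}")
```

Using the actual elapsed time between the two readings, rather than the requested interval, keeps the percentages honest when the second reading is taken slightly late.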



Expanding Google's InnoDB Synchronization Improvements to Solaris

There is much excitement today at the launch of MySQL 5.4, so I will relate my story about a project I contributed to this new version.

When we started looking at performance improvements for MySQL, we were interested in "low hanging fruit", or fixes and changes that could reap measurable benefits for users in the short term.

An obvious candidate at that time was the now well-known Google SMP patch. I had seen Mark Callaghan present on this at the MySQL User Conference in 2008, and was interested to investigate.

I was pretty new to InnoDB at that time, and was soon to discover that InnoDB was possibly experiencing poor scalability around its mutexes and read-write locks because InnoDB had a private implementation of adaptive mutexes and read-write locks, and this was probably not the best implementation on all or even most platforms MySQL is available on.

Now InnoDB's "private" mutexes and rw-locks were a good way to get spin-locks on all platforms, which may be a win in many cases, but as the Google team had demonstrated, it could be improved on. Indeed, I knew that adaptive spin-locks are available on Solaris, and they offer an extra advantage - if the holder of a lock is found to be off CPU, we don't bother spinning, but instead put the thread wanting the lock straight to sleep.

So, I decided to undertake a couple of performance studies of InnoDB's locking, being:

  1. Apply the Google SMP patch to MySQL 5.1 and test
  2. Modify InnoDB in 5.1 to use POSIX mutexes and RW-locks and test

The second step turned out to be quite complicated. I could not even change all of InnoDB's RW-locks to POSIX ones, as the InnoDB synchronization objects offer functionality not available via POSIX. It also meant we would be diverging more significantly from the InnoDB in 5.1, so this option - although looking promising - was shelved.

[...] within how InnoDB is licensed to MySQL. Phew.

This left the Google SMP patch. It also looked promising. It was a less dramatic change, and offered scaling benefits in all the testing I did.

There was one last snag though - the mutex and RW-lock improvements in the Google SMP patch would only be applied if you were building on x86/x64 with GCC 4.1 or later, as they relied on GCC's atomic built-ins.

You can consider that we have a two-dimensional matrix of platforms that MySQL supports, being a compiler, then an Operating System. To make a feature portable across this matrix, you need to find a portable API, write code that is portable, or write code that uses a choice of different portable APIs depending on what is available.

Now we definitely wanted to get a similar benefit for InnoDB on SPARC, and not necessarily just with GCC. In any case, GCC did not offer all of the built-in atomics for SPARC at the time. Happily, there are atomic functions available in Solaris that fit the job fine. MySQL 5.4 uses the functions if you build on Solaris without a version of GCC that supports built-in atomics.

Just so you understand though, here is (a simplified version of) what happens when you build MySQL 5.4 on your chosen platform with your chosen compiler:

    IF (compiler has GCC built-in atomics)
        use GCC built-in atomics
    ELSE IF (OS has atomic functions)
        use atomic functions
    ELSE
        use traditional InnoDB synchronization objects, based on pthread_mutex*.

Summary

As Neel points out in his blog, it was an exercise we learnt something from, even if we did develop functionality that will not be used. The important thing is we know we have improved the performance of MySQL, by extending the Google SMP improvements to all Solaris users, regardless of chosen compiler.



Testing the New Pool-of-Threads Scheduler in MySQL 6.0, Part 2

In my last blog, I introduced my investigation of the "Pool-of-Threads" scheduler in MySQL 6.0. Read on to see where I went next.

I now want to take a different approach to comparing the two schedulers. It is one thing to compare how the schedulers work "flat out" - with a transaction request rate that is limited only by the maximum throughput of the system under test. I would like to instead look at how the two schedulers compare when I drive mysqld at a consistent transaction rate, then vary only the number of connections over which the transaction requests are arriving. I will aim to come up with a transaction rate that sees CPU utilization somewhere in the 40-60% range.

This is more like how real businesses use MySQL every day, as opposed to the type of benchmarking that computer companies usually engage in. This will also allow me to look at how the schedulers run at much higher connection counts - which is where the pool-of-threads scheduler is supposed to shine.

Now, I will let you all know that I first conducted my experiments with mysqld and the load generator (sysbench) on the same system. I was again not sure this would be the best methodology, primarily because I would end up having one operating system instance scheduling in some cases a very large number of sysbench threads along with the mysqld threads.

It turned out the results from this mode threw up some issues (like not being able to get my desired throughput with 2048 connections in pool-of-threads mode), so I repeated my experiments - the second set of results have the load generation coming from two remote systems, each with a dedicated 1 Gbit ethernet link to the DB server.

The CPU utilization I have captured was just the %USR plus %SYS for the mysqld process. This makes the two sets of metrics comparable.

Here are my results. First for experiments where sysbench ran on the same host as mysqld:

Then for experiments where sysbench ran on two remote hosts, each with a dedicated Gigabit Ethernet link to the database server:

As you can see, the pool-of-threads model does incur an overhead, both in terms of CPU consumption and response time, at low connection counts. As hoped though, the advantage swings in pool-of-threads' favour. This is particularly noticeable in the case where our clients are remote. It is arguable that an architecture involving many hundreds or thousands of client connections is more likely to have those clients located remote from the DB server.

Now, the first issue I have is that while pool-of-threads starts to win on response time, the response time is still increasing in a similar fashion to thread-per-connection's response time (note - the scale is logarithmic). This is not what I expected, so we have a scalability problem in there somewhere.

The second issue is where I have to confess - I only got one "lucky" run where my target transaction rate was achieved for pool-of-threads at 2048 connections. For many other runs, the target rate could not be achieved, as these raw numbers show:

connections      tps  mysqld%usr  mysqld%sys  mysqld%cpu  avg-resp  95%-resp
       2048   962.22       25.23       14.93       40.16   1943.78   2368.78
       2048  1197.00       30.59       11.20       41.79    317.98    435.19
       2048   836.50       21.98       11.09       33.07   2259.36   2287.03
       2048   963.00       26.49       12.07       38.56   1333.67   1128.93
       2048   992.25       25.81       15.08       40.89   1851.17   2280.50
       2048   915.71       24.16       15.05       39.21   2220.45   2342.06
       2048   919.54       24.25       15.05       39.30   2210.95   2331.45
       2048   917.09       24.15       15.05       39.20   2217.86   2321.40
       2048   875.09       23.20       13.29       36.49   2188.69   2344.91
       2048  1180.62       31.35       14.57       45.92   1439.96   1772.86
       2048  1185.80       30.74       14.24       44.98   1185.71   1814.24
       2048  1146.90       30.34       15.23       45.57   1602.85   1842.14
       2048  1141.47       30.20       15.22       45.42   1612.34   1873.95
       2048  1158.74       30.47       12.99       43.46    999.76   1870.35
       2048  1177.59       30.67       14.97       45.64   1403.22   1838.84

This indicates we have some sort of bottleneck right at or around the 2048 thread point. This is not what we want with pool-of-threads, so I will continue my investigation.



Testing the New Pool-of-Threads Scheduler in MySQL 6.0

I have recently been investigating a new feature of MySQL 6.0 - the "Pool-of-Threads" scheduler. This feature is a fairly significant change to the way MySQL completes tasks given to it by database clients.

To begin with, be advised that the MySQL database is implemented as a single multi-threaded process. The conventional threading model is that there are a number of "internal" threads doing administrative work (including accepting connections from clients wanting to connect to the database), then one thread for each database connection. That thread is responsible for all communication with that database client connection, and performs the bulk of database operations on behalf of the client.

This architecture exists in other RDBMS implementations. Another common implementation is a collection of processes all cooperating via a region of shared memory, usually with semaphores or other synchronization objects located in that shared memory.

The creation and management of threads can be said to be cheap, in a relative sense - it is usually significantly cheaper to create or destroy a thread rather than a process. However these overheads do not come for free. Also, the operations involved in scheduling a thread as opposed to a process are not significantly different. A single operating system instance scheduling several thousand threads on and off the CPUs is not much less work than one scheduling several thousand processes doing the same work.

Pool-of-Threads

The theory behind the Pool-of-Threads scheduler is to provide an operating mode which supports a large number of clients that will be maintaining their connections to the database, but will not be sending a constant stream of requests to the database. To support this, the database will maintain a (relatively) small pool of worker threads that take a single request from a client, complete the request, return the results, then return to the pool and wait for another request, which can come from any client. The database's internal threads still exist and operate in the same manner.

In theory, this should mean less work for the operating system to schedule threads that want CPU. On the other hand, it should mean some more overhead for the database, as each worker thread needs to restore the context of a database connection prior to working on each client request.

A smaller pool of threads should also consume less memory, as each thread requires a minimum amount of memory for a thread stack, before we add what is needed to store things like a connection context, or working space to process a request.

You can read more about the different threading models in the MySQL 6.0 Reference Manual.

Testing the Theory

Mark Callaghan of Google has recently had a look at whether this theory holds true. He has published his results under "No new global mutexes! (and how to make the thread/connection pool work)". Mark has identified (via this bug he logged) that the overhead for using Pool-of-Threads seems quite large - up to 63 percent.

So, my first task is to see if I get the same results. I will note here that I am using Solaris, whereas Mark was no doubt using a Linux distro. We probably have different hardware as well (although both are Intel x86).

Here is what I found when running sysbench read-only (with the sysbench clients on the same host). The "conventional" scheduler inside MySQL is known as the "Thread-per-Connection" scheduler, by the way.

This is in contrast to Mark's results - I am only seeing a loss in throughput of up to 30%.

What about the bigger picture?

These results do show there is a definite reduction in maximum throughput if you use the pool-of-threads scheduler.

I believe it is worth looking at the bigger picture however. To do this, I am going to add in two more test cases:

  - sysbench read-only, with the sysbench client and MySQL database on separate hosts, via a 1 Gb network
  - sysbench read-write, via a 1 Gb network

What I want to see is what sort of impact the pool-of-threads scheduler has for a workload that I expect is still the more common one - where our database server is on a dedicated host, accessed via a network.

As you can see, the impact on throughput is far less significant when the client and server are separated by a network. This is because we have introduced network latency as a component of each transaction and increased the amount of work the server and client need to do - they now need to perform ethernet driver, IP and TCP tasks.

This reduces the relative overhead - in CPU consumed and latency - introduced by pool-of-threads.

This is a reminder that if you are conducting performance tests on a system prior to implementing or modifying your architecture, you would do well to choose a test architecture and workload that is as close as possible to that you are intending to deploy. The same is true if you are trying to extrapolate performance testing someone else has done to your own architecture.

The Converse is Also True

On the other hand, if you are a developer or performance engineer conducting testing in order to test a specific feature or code change, a micro-benchmark or simplified test is more likely to be what you need. Indeed, Mark's use of the "blackhole" storage engine is a good idea to eliminate that processing from each transaction.

In this scenario, if you fail to make the portion of the software you have modified a significant part of the work being done, you run the risk of seeing performance results that are not significantly different, which may lead you to assume your change has negligible impact.

In my next posting, I will compare the two schedulers using a different perspective.
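To make the pool-of-threads idea from this post concrete, here is a toy Python sketch: a small, fixed set of worker threads pulls requests off one shared queue, and each request may come from any connection. It is only an illustration of the general pattern and says nothing about how mysqld implements it. The names run_pool, conn_id and payload are invented for the example, and the "query" is a stand-in that just upper-cases a string.

```python
import queue
import threading


def run_pool(requests, pool_size=4):
    """Serve (conn_id, payload) requests with a fixed pool of worker threads."""
    work = queue.Queue()
    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            item = work.get()
            if item is None:      # sentinel: no more work for this worker
                break
            conn_id, payload = item
            # A real server would restore this connection's context here,
            # which is the extra per-request overhead the post mentions.
            answer = payload.upper()  # stand-in for executing a query
            with results_lock:
                results.setdefault(conn_id, []).append(answer)

    threads = [threading.Thread(target=worker) for _ in range(pool_size)]
    for t in threads:
        t.start()
    for request in requests:
        work.put(request)
    for _ in threads:
        work.put(None)            # one sentinel per worker
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # 100 hypothetical connections, one request each, served by 4 workers.
    out = run_pool([(i, f"q{i}") for i in range(100)], pool_size=4)
    print(len(out), "connections served; connection 7 got", out[7])
```

The thread-per-connection model would instead start one thread per entry in that list, so the number of threads the operating system has to schedule grows with the number of connections rather than staying at pool_size.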



New Feature for Sysbench - Generate Transactions at a Steady Rate

Perhaps I am becoming a regular patcher of sysbench...

I have developed a new feature for sysbench - the ability to generate transactions at a steady rate determined by the user.

This mode is enabled using the following two new options:

--tx-rate

Rate at which sysbench should attempt to send transactions to the database, in transactions per second. This is independent of num_threads. The default is 0, which means to send as many as possible (i.e., do not pause between the end of one transaction and the start of another). It is also independent of other options like --oltp-user-delay-min and --oltp-user-delay-max, which add think time between individual statements generated by sysbench.

--tx-jitter

Magnitude of the variation in time to start transactions at, in microseconds. The default is zero, which asks each thread to vary its transaction period by up to 10 percent (i.e. 10^6 / tx-rate * num-threads / 10). A standard pseudo-random number generator is used to decide each transaction start time.

My need for these options is simple - I want to generate a steady load for my MySQL database. It is one thing to measure the maximum achievable throughput as you change your database configuration, hardware, or num-threads. I am also interested in how the system (or just mysqld's) utilization changes, at the same transaction rate, when I change other variables.

An upcoming post will demonstrate a use of sysbench in this mode.

For the moment my new feature can be added to sysbench 0.4.12 (and probably many earlier versions) via this patch.

These changes are tested on Solaris, but I did choose only APIs that are documented as also available on Linux. I have also posted my patch on sourceforge as a sysbench feature enhancement request.
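The arithmetic behind --tx-rate and the default --tx-jitter can be sketched in Python. The per-thread period and the default jitter follow the formula quoted above. How the patch actually applies the jitter (for example, whether it can start a transaction early as well as late) is not described in the post, so delaying each start by a random amount up to the jitter is an assumption, and the function names are hypothetical.

```python
import random


def thread_period_us(tx_rate: float, num_threads: int) -> float:
    """Microseconds between transaction starts on one thread.

    Each thread carries tx_rate / num_threads transactions per second,
    so its period is 10^6 / tx_rate * num_threads.
    """
    return 1_000_000 / tx_rate * num_threads


def start_offsets_us(tx_rate, num_threads, count, jitter_us=0.0, seed=None):
    """Planned start times (us from t=0) for one thread's first transactions."""
    period = thread_period_us(tx_rate, num_threads)
    if jitter_us == 0:
        jitter_us = period / 10   # default: vary by up to 10 percent
    rng = random.Random(seed)
    # ASSUMPTION: jitter only delays a start, by a uniform random amount.
    return [i * period + rng.uniform(0, jitter_us) for i in range(count)]


if __name__ == "__main__":
    # Hypothetical run: 1000 transactions/sec spread over 16 threads.
    print("per-thread period:", thread_period_us(1000, 16), "us")
    for t in start_offsets_us(1000, 16, count=5, seed=1):
        print(f"{t:10.1f}")
```

At 1000 transactions per second over 16 threads, each thread starts a transaction roughly every 16 ms, with a default jitter window of 1.6 ms.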



MySQL 5.1 Memory Allocator Bake-Off

After getting sysbench running properly with a scalable memory allocator (see last post), I can now return to what I was originally testing - what memory allocator is best for the 5.1 server (mysqld).

This stems out of studies I have made of some patches that have been released by Google. You can read about the work Google has been doing here.

I decided I wanted to test a number of configurations based on the MySQL community source, 5.1.28-rc, namely:

  - The baseline - no Google SMP patch, default memory allocator (5.1.28-rc)
  - With Google SMP patch, mem0pool enabled, no custom malloc (pool)
  - With Google SMP patch, mem0pool enabled, linked with mtmalloc (pool-mtmalloc)
  - With Google SMP patch, mem0pool disabled, linked with tcmalloc (TCMalloc)
  - With Google SMP patch, mem0pool disabled, linked with umem (umem)
  - With Google SMP patch, mem0pool disabled, linked with mtmalloc (mtmalloc)

Here are some definitions, by the way:

mem0pool
    InnoDB's internal "memory pools" feature, found in mem0pool.c (NOTE: Even if this is enabled, other parts of the server will not use this memory allocator - they will use whatever allocator is linked with mysqld)

tcmalloc
    The "libtcmalloc_minimal.so.0.0.0" that is built from google-perftools-0.99.2

Hoard
    The Hoard memory allocator, version 3.7.1

umem
    The libumem library (included with Solaris)

mtmalloc
    The mtmalloc library (included with Solaris)

My test setup was a 16-CPU Intel system, running Solaris Nevada build 100. I chose to use only an x86 platform, as I was not able to build tcmalloc on SPARC. I also chose to run with the database in TMPFS, and with an InnoDB buffer size smaller than the database size. This was to ensure that we would be CPU-bound if possible, rather than slowed by I/O.

If I built any package (no need for mtmalloc or umem), I used GCC 4.3.1, except for Hoard, which seemed to prefer the Sun Studio 11 C compiler (over Sun Studio 12 or GCC).

My test was a sysbench OLTP read-write run, of 10 minutes. Each series of runs at different thread counts is preceded by a database re-build and 20 minute warmup. Here are my throughput results for 1-32 SysBench threads, in transactions per second:

These results show that while the Google SMP changes are a benefit, the disabling of InnoDB's mem0pool does not seem to provide any further benefit for my configuration. My results also show that TCMalloc is not a good allocator for this workload on this platform, and Hoard is particularly bad, with significant negative scaling above 16 threads.

The remaining configurations are pretty similar, with mtmalloc and umem a little ahead at higher thread counts.

Before I get a ton of comments and e-mails, I would like to point out that I did some verification of my TCMalloc builds, as the results I got surprised me. I verified that it was using the supplied assembler for atomic routines, and I built it with optimization (-O3) and without.

I also discovered that TCMalloc was emitting this diagnostic when mysqld was starting up:

    src/tcmalloc.cc:151] uname failed assuming no TLS support (errno=0)

I rectified this with a change in tcmalloc.cc, and called this configuration "TCMalloc -O3, TLS". It is shown against the other two configurations below.

I often like to have a look at what the CPU cost of different configurations is. This helps to demonstrate headroom, and whether different throughput results may be due to less efficient code or something else. The chart below lists what I found - note that this is system-wide CPU (user & system) utilization, and I was running my SysBench client on the same system.

Lastly, I did do one other comparison, which was to measure how much each memory allocator affected the virtual size of mysqld. I did not expect much difference, as the most significant consumer - the InnoDB buffer pool - should dominate with large long-lived allocations. This was indeed the case, and memory consumption grew little after the initial start-up of mysqld. The only allocator that then caused any noticeable change was mtmalloc, which for some reason made the heap grow by 35MB following a 5 minute run (it was originally 1430 MB).

References

  - Sun Developer Network - A Comparison of Memory Allocators in Multiprocessors
  - The Hoard Memory Allocator
  - TCMalloc : Thread-Caching Malloc
  - google-perftools - the home of TCMalloc
  - MySQL InnoDB Performance Tuning for the Solaris 10 OS - includes a recommendation for mtmalloc
  - MySQL scalability on Linux with sysbench
  - My previous blog on improving SysBench with a scalable memory allocator, which also discusses mtmalloc and umem
  - (a version of) the source for mem0pool.c, which documents what InnoDB's memory pools do



Scalability and Stability for SysBench on Solaris

My mind is playing "Suffering Succotash..."

I have been working on MySQL performance for a while now, and the team I am in has discovered that SysBench could do with a couple of tweaks for Solaris.

Sidebar - sysbench is a simple "OLTP" benchmark which can test multiple databases, including MySQL. Find out all about it here, but go to the download page to get the latest version.

To simulate multiple users sending requests to a database, sysbench uses multiple threads. This leads to two issues we have identified with SysBench on Solaris, namely:

- The implementation of random() is explicitly identified as unsafe in multi-threaded applications on Solaris. My team has found this is a real issue, with occasional core-dumps happening to our multi-threaded SysBench runs.
- SysBench does quite a bit of memory allocation, and could do with a more scalable memory allocator.

Neither of these issues is necessarily relevant only to Solaris, by the way.

Luckily there are simple solutions. We can fix the random() issue by using lrand48() - in effect a drop-in replacement. Then we can fix the memory allocator by simply choosing to link with a better allocator on Solaris.

To help with a decision on memory allocator, I ran a few simple tests to check the performance of the two best-known scalable allocators available in Solaris. Here are the results ("libc" is the default memory allocator):

To see the differences more clearly, let's do a relative comparison, using "umem" (A.K.A. libumem) as the reference:

So - around 20% less throughput at 16 or 32 threads. Very little difference at 1 thread, too (where the default memory allocator should be the one with the lowest synchronization overhead).

Where you see another big difference is CPU cost per transaction:

I will just point out two other reasons why I would recommend libumem:

- It's portable - available for at least Linux, OS X and Windows. This means any cross-OS comparisons could eliminate the memory allocator as a possible difference.
- You get debugging and profiling for free. See these links:
  - Identifying Memory Management Bugs Within Applications Using the libumem Library
  - Debugging with libumem and MDB
  - Adam Leventhal's Weblog - Number 11 of 20: libumem

I have logged these two issues as sysbench bugs:

- 2422927 use of random() & srandom() is not MT-Safe on Solaris
- 2422935 not using scalable memory allocator on Solaris

However, if you can't wait for the fixes to be released, try these:

- Patch to make SysBench 0.4.8 use lrand48() instead of random()
- Script to configure SysBench 0.4.8 so that it will be linked with libumem



The Seduction of Single-Threaded Performance

The following is a dramatization. It is used to illustrate some concepts regarding performance testing and architecting of computer systems. Artistic license may have been taken with events, people and time-lines. The performance data I have listed is real and current however.

I got contacted recently by the Systems Architect of latestrage.com. He has been a happy Sun customer for many years, but was a little displeased when he took delivery of a beta test system of one of our latest UltraSPARC servers.

"Not very fast", he said.

"Is that right, how is it not fast?", I inquired eagerly.

"Well, it's a lot slower than one of the LowMarginBrand x86 servers we just bought", he trumpeted indignantly.

"How were you measuring their speed?", I asked, getting wary.

"Ahh, simple - we were compressing a big file. We were careful to not let it be limited by I/O bandwidth or memory capacity, though..."

What then ensues is a discussion about what was being used to test "performance", whether it matches latestrage.com's typical production workload and further details about architecture and objectives.

Data compression utilities are a classic example of a seemingly mature area in computing. Lots of utilities, lots of different algorithms, a few options in some utilities, reasonable portability between operating systems, but one significant shortcoming - there is no commonly available utility that is multi-threaded.

Let me pretend I am still in this situation of using compression to evaluate system performance, and I want to compare the new Sun SPARC Enterprise T5440 with a couple of current x86 servers. Here is my own first observation about such a test, using a single-threaded compression utility:

Now if you browse down to older blog entries, you will see I have written my own multi-threaded compression utility. It consists of a thread to read data, as many threads to compress or decompress data as demand requires, and one thread to write data. Let me see whether I can fully exploit the performance of the T5440 with Tamp...

Well, this turned out to be not quite the end of the story. I designed my tests with my input file located on a TMPFS (in-memory) filesystem, and with the output being discarded. This left the system focusing on the computation of compression, without being obscured by I/O. This is the same objective that latestrage.com had.

What I found on the T5440 was that Tamp would not use more than 12-14 threads for compression - it was limited by the speed at which a single thread could read data from TMPFS.

So, I chose to use another dimension by which we can scale up work on a server - add more sources of workload. This is represented by multiple "Units of Work" in my chart below.

After completing my experiments I discovered that, as expected, the T5440 may disappoint if we restrict ourselves to a workload that can not fully utilize the available processing capacity. If we add more work however, we will find it handily surpasses the equivalent 4-socket quad-core x86 systems.

Observing Single-Thread Performance on a T5440

A little side-story, and another illustration of how inadequate a single-threaded workload is at determining the capability of the T5440. Take a look at the following output from vmstat, and answer this question:

Is this system "maxed out"?

(Note: the "us", "sy" and "id" columns list how much CPU time is spent in User, System and Idle modes, respectively)

     kthr      memory            page            disk          faults      cpu
     r b w   swap     free  re mf pi po fr de sr d0 d1 d2 d3   in   sy  cs us sy  id
     0 0 0 1131540 12203120  1  8  0  0  0  0  0  0  0  0  0 3359 1552 419  0  0 100
     0 0 0 1131540 12203120  0  0  0  0  0  0  0  0  0  0  0 3364 1558 431  0  0 100
     0 0 0 1131540 12203120  0  0  0  0  0  0  0  0  0  0  0 3366 1478 420  0  0  99
     0 0 0 1131540 12203120  0  0  0  0  0  0  0  0  0  0  0 3354 1500 441  0  0 100
     0 0 0 1131540 12203120  0  0  0  0  0  0  0  0  0  0  0 3366 1549 460  0  0  99

Well, the answer is yes. It is running a single-threaded process, which is using 100% of one CPU. For the sake of my argument we will say the application is the critical application on the system. It has reached its highest throughput and is therefore "maxed out". You see, when one CPU represents less than 0.5% of the entire CPU capacity of a system, then a single saturated CPU will be rounded down to 0%. In the case of the T5440, one CPU is 1/256th or 0.39%.

Here is a tip for watching a system that might be doing nothing, but then again might be doing something as fast as it can:

    $ mpstat 3 | grep -v ' 100$'

This is what you might see:

    CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
      0    2   0   48  204    4   2    0    0    0   0   127   1   1  0  99
     32    0   0    0    2    0   3    0    0    0   0     0   0   8  0  92
     48    0   0    0    6    0   0    5    0    0   0     0 100   0  0   0
    CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
      0    1   0   49  205    5   3    0    0    0   0   117   0   1  0  99
     32    0   0    0    4    0   5    0    0    1   0     0   0  14  0  86
     48    0   0    0    6    0   0    5    0    0   0     0 100   0  0   0
    CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
      0    0   0   48  204    4   2    0    0    0   0   103   0   1  0  99
     32    0   0    0    3    0   4    0    0    0   0     3   0  14  0  86
     48    0   0    0    6    0   0    5    0    0   0     0 100   0  0   0

mpstat uses "usr", "sys", and "idl" to represent CPU consumption. For more on "wt" you can read my older blog. For more on utilization, see the CPU/Processor page on solarisinternals.com.

To read more about the Sun SPARC Enterprise T5440, which is announced today, go to Allan Packer's blog listing all the T5440 blogs.

Tamp - a Multi-Threaded Compression Utility

Some more details on this:

- It uses a freely-available Lempel-Ziv-derived algorithm, optimised for compression speed.
- It was compiled using the same compiler and optimization settings for SPARC and x86.
- It uses a compression block size of 256KB, so files smaller than this will not gain much benefit.
- I was compressing four 1GB database files. They were being reduced in size by a little over 60%.
- Browse my blog for more details and a download.
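
The rounding effect described under "Observing Single-Thread Performance on a T5440" is easy to check with a little arithmetic. The following Python sketch assumes the reporting tool simply truncates the busy percentage to a whole number; the function name is mine and is not part of any Solaris tool.

```python
def one_cpu_share_percent(total_cpus):
    """Percentage of system-wide CPU capacity represented by one saturated CPU."""
    return 100.0 / total_cpus


if __name__ == "__main__":
    # 256 is the CPU count quoted for the T5440 in the text;
    # the other counts are only there for contrast.
    for cpus in (4, 64, 256):
        share = one_cpu_share_percent(cpus)
        # int() truncates toward zero, which is the assumed display behaviour.
        print(f"{cpus:4d} CPUs: one busy CPU = {share:.2f}% -> shown as {int(share)}%")
```

On a small system one busy CPU is impossible to miss in the system-wide figures; once the share drops below 1% it can vanish from them entirely, which is why the per-CPU view from mpstat is the better place to look.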



Tamp - a Lightweight Multi-Threaded Compression Utility

UPDATE: Tamp has been ported to Linux, and is now at version 2.5. Packages for Solaris (x86 and SPARC), and a source tarball are available below.

Back Then

Many years ago (more than I care to remember), I saw an opportunity to improve the performance of a database backup. This was before the time of Oracle on-line backup, so the best choice at that time was to:

1. shut down the database
2. export to disk
3. start up the database
4. back up the export to tape

The obvious thing to improve here is the time between steps 1 and 3. We had a multi-CPU system running this database, so it occurred to me that perhaps compressing the export may speed things up.

I say "may" because it is important to remember that if the compression utility has lower throughput than the output of the database export (i.e. raw output; excluding any I/O operations to save that data) we may just end up with a different bottleneck, and not run any faster; perhaps even slower.

As it happens, this era also pre-dated gzip and other newer compression utilities. So, using the venerable old "compress", it actually was slower. It did save some disk space, because Oracle export files are eminently compressible.

So, I went off looking for a better compression utility. I was now more interested in something that was fast. It needed to not be the bottleneck in the whole process.

What I found did the trick - it reduced the export time by 20-30%, and saved some disk space as well. The reason why it saved time was that it was able to compress at least as fast as Oracle's "exp" utility was able to produce data to compress, and it eliminated some of the I/O - the real bottleneck.

More Recently

I came across a similar situation more recently - I was again doing "cold" database restores and wanted to speed them up. It was a little more challenging this time, as the restore was already parallel at the file level, and there were more files than CPUs involved (72). In the end, I could not speed up my 8-odd minute restore of ~180GB, unless I already had the source files in memory (via the filesystem cache). That would only work in some cases, and is unlikely to work in the "real world", where you would not normally want this much spare memory to be available to the filesystem.

Anyway, it took my restore down to about 3 minutes in cases where all my compressed backup files were in memory - this was because it had now eliminated all read I/O from the set of arrays holding my backup. This meant I had eliminated all competing I/O's from the set of arrays where I was re-writing the database files.

Multi-Threaded Lightweight Compression

I could not even remember the name of the utility I used years ago, but I knew already that I would need something better. The computers of 2008 have multiple cores, and often multiple hardware threads per core. All of the current included-in-the-distro compression utilities (well, almost all utilities) for Unix are still single-threaded - a very effective way to limit throughput on a multi-CPU system.

Now, there are some multi-threaded compression utilities available, if not widely available:

- PBZIP2 is a parallel implementation of BZIP2. You can find out more here.
- PIGZ is a parallel implementation of GZIP, although it turns out it is not possible to decompress a GZIP stream with more than one thread. PIGZ is available here.

Here is a chart showing some utilities I have tested on a 64-way Sun T5220. The place to be on this chart is toward the bottom right-hand corner.

Here is a table with some of the numbers from that chart:

    Utility          Reduction (%)   Elapsed (s)
    tamp             66.18            0.31
    pigz --fast      71.18            1.04
    pbzip2 --fast    77.17            4.17
    gzip --fast      71.10           16.13
    gzip             75.73           40.29
    compress         61.61           18.21

To answer your question - yes, tamp really is 50-plus-times faster than "gzip --fast".

Tamp

The utility I have developed is called tamp. As the name suggests, it does not aim to provide the best compression (although it is better than compress, and sometimes beats "gzip --fast"). It is however a proper parallel implementation of an already fast compression algorithm.

If you wish to use it, feel free to download it. I will be blogging in the near future on a different performance test I conducted using tamp.

Compression Algorithm

Tamp makes use of the compression algorithm from QuickLZ version 1.40. I have tested a couple of other algorithms, and the code in tamp.c can be easily modified to use a different algorithm. You can get QuickLZ from here (you will need to download source yourself if you want to build tamp).

Update, Jan 2012 - changed the downloads to .zip files, as it seems blogs.oracle.com interprets a download of a file ending in .gz as a request to compress the file via gzip before sending it. That confuses most people.

Resources

- Tamp 2.5 package for Solaris x86 (built on Solaris 10)
- Tamp 2.5 package for Solaris SPARC (built on Solaris 10)
- Tamp 2.5 source tarball
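
The general shape of a block-parallel compressor can be sketched in a few lines. This is not tamp's code or file format: it is a minimal Python sketch that assumes independent 256KB blocks, uses zlib at its fastest level as a stand-in for QuickLZ, and invents a simple length-prefixed framing purely for illustration. For brevity the calling thread does both the reading and the writing, rather than using dedicated reader and writer threads.

```python
import io
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 256 * 1024  # assumed block size; blocks are compressed independently


def read_blocks(stream, block_size=BLOCK_SIZE):
    """Yield successive fixed-size blocks from a binary stream."""
    while True:
        block = stream.read(block_size)
        if not block:
            return
        yield block


def compress_block(block):
    # Level 1 favours speed over ratio; zlib is a stand-in, not QuickLZ.
    return zlib.compress(block, 1)


def compress_stream(src, dst, workers=4):
    """Compress blocks in a thread pool and write them out in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in submission order. By default it
        # submits every block up front, so this sketch buffers the whole
        # input; a real tool would bound the amount of work in flight.
        for packed in pool.map(compress_block, read_blocks(src)):
            dst.write(struct.pack(">I", len(packed)))  # hypothetical framing
            dst.write(packed)


def decompress_stream(src, dst):
    """Reverse compress_stream: read length-prefixed blocks and inflate them."""
    while True:
        header = src.read(4)
        if not header:
            return
        (length,) = struct.unpack(">I", header)
        dst.write(zlib.decompress(src.read(length)))


if __name__ == "__main__":
    original = bytes(range(256)) * 4096  # 1 MiB of repetitive sample data
    packed = io.BytesIO()
    compress_stream(io.BytesIO(original), packed)
    print(len(original), "->", len(packed.getvalue()), "bytes")
```

Because each block is compressed on its own, the work spreads across as many threads as there are blocks in flight; the price is a slightly worse compression ratio than a single stream would achieve, and little benefit for inputs smaller than one block.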



Installing Solaris from a USB Disk

I regularly do a full install of a Solaris Development release onto my laptop. Why full? Well, that is another story for another day, but it is not because the Solaris Upgrade software, including Live Upgrade, is lacking.

I decided I no longer see the sense of burning a DVD to do this, and I know that Solaris can boot from a USB device.

I used James C. Liu's blog as an inspiration, but the following is what I have found worked well to boot an install image located on a USB disk. You may also be interested in the Solaris Ready USB FAQ.

NOTE: This procedure only has a chance of working if you have a version of Solaris 10 or later that uses GRUB and has a USB driver that works with your drive.

1. Set up an 8GB "Solaris2" partition on the USB drive using fdisk. Make it the active partition.

2. Set up a UFS slice using all but the first cylinder of that 8GB as slice 0 using format. Run newfs. Mount. The first cylinder ends up being dedicated to a "boot" slice. I do not know what it is used for, perhaps avoidance of overwriting the PC-style partition table & boot program.

3. Mount the DVD ISO using lofiadm/mount (hint: google lofiadm solaris iso).

4. Use cpio to copy the contents of the DVD ISO into the UFS partition on the USB drive, e.g:

       # cd <rootdir of DVD ISO>
       # find . | cpio -pdum <rootdir of USB filesystem>

5. Run installgrub to install the stage1 & stage2 files from the DVD ISO onto the USB drive. If the filesystem on your USB drive has mounted as /dev/dsk/c2t0d0s0 for example, then use:

       # cd <rootdir of DVD ISO>
       # /sbin/installgrub boot/grub/stage1 boot/grub/stage2 /dev/rdsk/c2t0d0s0

6. Boot off the USB disk. It uses the same GRUB install that would be on a DVD.

7. Now, I can not remember whether the next step was either:
   - Wait for the install to fail (unable to find distribution), or:
   - Exit/quit out of installation

   ...but you need to get to a shell.

8. Manually mount the USB partition at /cdrom. NOTE: your controller numbers are probably not as you expect at this point, so double-check what you are mounting.

9. Re-start the install. I used "suninstall". I think you can use "solaris-install" instead.

The install seemed to run fine from there, however it went through a sysconfig stage after the reboot.

Then I ended up with one teeny problem - my X server would not start. I discovered some issues with fonts, and then decided to check the install log. I discovered a number of packages had reported status like:

    Installation of <SUNWxwfnt> partially failed.
    19997 blocks
    pkgadd: ERROR: class action script did not complete successfully
    Installation of <SUNWxwcft> partially failed.
    Installation of <SUNW5xmft> partially failed.
    Installation of <SUNW5ttf> partially failed.
    Installation of <SUNWolrte> partially failed.
    Installation of <SUNWhttf> partially failed.

I have since pkgrm/pkgadd-ed these packages (using -R while running the laptop on an older release with the new boot environment mounted), and all is now well.



Building GCC 4.x on Solaris

I needed to build GCC 4.3.1 for my x86 system running a recent development build of Solaris. I thought I would share what I discovered, and then improved on.

I started with Paul Beach's Blog on the same topic, but I knew it had a couple of shortcomings, namely:

- No mention of a couple of pre-requisites that are mentioned in the GCC document Prerequisites for GCC
- A mysterious "cannot compute suffix of object files" error in the build phase
- No resolution of how to generate binaries that have a useful RPATH (see Shared Library Search Paths for a discussion on the importance of RPATH).

I found some help on this via this forum post, but here is my own cheat sheet.

1. Download & install GNU Multiple Precision Library (GMP) version 4.1 (or later) from sunfreeware.com. This will end up located in /usr/local.
2. Download, build & install MPFR Library version 2.3.0 (or later) from mpfr.org. This will also end up in /usr/local.
3. Download & unpack the GCC 4.x base source (the one of the form gcc-4.x.x.tar.gz) from gcc.gnu.org.
4. Download my example config_make script, edit as desired (you probably want to change OBJDIR and PREFIX, and you may want to add other configure options).
5. Run the config_make script.
6. "gmake install" as root (although I instead create the directory matching PREFIX, make it writable by the account doing the build, then "gmake install" using that account).

You should now have GCC binaries that look for the shared libraries they need in /usr/sfw/lib, /usr/local/lib and PREFIX/lib, without anyone needing to set LD_LIBRARY_PATH. In particular, modern versions of Solaris will have a libgcc_s.so in /usr/sfw/lib.

If you copy your GMP and MPFR shared libraries (which seem to be needed by parts of the compiler) into PREFIX/lib, you will also have a self-contained directory tree that you can deploy to any similar system more simply (e.g. via rsync, tar, cpio, "scp -pr", ...).



Comparing the UltraSPARC T2 Plus to Other Recent SPARC Processors

Update - now the UltraSPARC T2 Plus has been released, and is available in several new Sun servers. Allan Packer has published a new collection of blog entries that provide lots of detail.

Here is my updated table of details comparing a number of current SPARC processors. I can not guarantee 100% accuracy on this, but I did quite a bit of reading...

Name: UltraSPARC IV+ | SPARC64 VI | UltraSPARC T1 | UltraSPARC T2 | UltraSPARC T2 Plus
Codename: Panther | Olympus-C | Niagara | Niagara 2 | Victoria Falls

Physical
  process: 90nm | 90nm | 90nm | 65nm | 65nm
  die size: 335 mm² | 421 mm² | 379 mm² | 342 mm²
  pins: 1368 | 1933 | 1831
  transistors: 295 M | 540 M | 279 M | 503 M
  clock: 1.5 – 2.1 GHz | 2.15 – 2.4 GHz | 1.0 – 1.4 GHz | 1.0 – 1.4 GHz | 1.2 – 1.4 GHz

Architecture
  cores: 2 | 2 | 8 | 8 | 8
  threads/core: 1 | 2 | 4 | 8 | 8
  threads/chip: 2 | 4 | 32 | 64 | 64
  FPU : IU: 1 : 1 | 1 : 1 | 1 : 8 | 1 : 1 | 1 : 1
  integration: 8 × small crypto | 8 × large crypto, PCI-E, 2 × 10Gbe | 8 × large crypto, PCI-E, multi-socket coherency
  virtualization: domains [1] | hypervisor
  L1 i$: 64K/core | 128K/core | 16K/core
  L1 d$: 64K/core | 128K/core | 8K/core
  L2 cache (on-chip): 2MB, shared, 4-way, 64B lines | 6MB, shared, 10-way, 256B lines | 3MB, shared, 12-way, 64B lines | 4MB, shared, 16-way, 64B lines
  L3 cache: 32MB shared, 4-way, tags on-chip, 64B lines | n/a | n/a
  MMU: on-chip | on-chip, 4 × DDR2 | on-chip, 4 × FB-DIMM | on-chip, 2 × FB-DIMM
  Memory Models: TSO | TSO | TSO, limited RMO
  Physical Address Space: 43 bits | 47 bits | 40 bits
  i-TLB: 16 FA + 512 2-way SA | 64 FA
  d-TLB: 16 FA + 512 2-way SA | 64 FA | 128 FA
  combined TLB: 32 FA + 2048 2-way SA
  Page sizes: 8K, 64K, 512K, 4M, 32M, 256M | 8K, 64K, 512K, 4M, 32M, 256M | 8K, 64K, 4M, 256M
  Memory bandwidth [2] (GB/sec): 9.6 | 25.6 | 60+ | 32

Footnotes

1 - domains are implemented above the processor/chip level
2 - theoretical peak - does not take cache coherency or other limits into account

Glossary

- FA - fully-associative
- FPU - Floating Point Unit
- i-TLB - Instruction Translation Lookaside Buffer (d means Data)
- IU - Integer (execution) Unit
- L1 - Level 1 (similarly for L2, L3)
- MMU - Memory Management Unit
- RMO - Relaxed Memory Order
- SA - set-associative
- TSO - Total Store Order

References:

- Sun SPARC Enterprise T5120, T5220, T5140, and T5240 Server Architecture whitepaper
- Victoria Falls - Scaling Highly-Threaded Processor Cores - presentation to HOT CHIPS 19
- Memory and coherency on the UltraSPARC T2 Plus Processor - Denis Sheahan's blog
- UltraSPARC T2 Supplement to the UltraSPARC Architecture 2007
- UltraSPARC T1 Supplement to the UltraSPARC Architecture 2005
- SPARC64 VI Extensions
- UltraSPARC-IV+ Processor User's Manual Supplement
- Wikipedia - SPARC



What Drove Processor Design Toward Chip Multithreading (CMT)?

I thought of a way of explaining the benefit of CMT (or more specifically, interleaved multithreading - see this article for details) using an analogy the other day.

Bear with me as I wax lyrical on computer history...

Deep back in the origins of the computer, there was only one process (as well as one processor). There was no operating system, so in turn there were no concepts like:

- scheduling
- I/O interrupts
- time-sharing
- multi-threading

What am I getting at? Well, let me pick out a few of the advances in computing, so I can explain why interleaved multithreading is simply the next logical step.

The first computer operating systems (such as GM-NAA I/O) simply replaced (automated) some of the tasks that were undertaken manually by a computer operator - load a program, load some utility routines that could be used by the program (e.g. I/O routines), record some accounting data at the completion of the job. They did nothing during the execution of the job, but they had nothing to do - no other work could be done while the processor was effectively idle, such as when waiting for an I/O to complete.

Then multi-processing operating systems were developed. Suddenly we had the opportunity to use the otherwise wasted CPU resource while one program was stalled on an I/O. In this case the O.S. would switch in another program. Generically this is known as scheduling, and operating systems developed (and still develop) more sophisticated ways of sharing out the CPU resources in order to achieve the greatest/fairest/best utilization.

At this point we had enshrined in the OS the idea that CPU resource was precious, not plentiful, and there should be features designed into the system to minimize its waste. This would reduce or delay the need for that upgrade to a faster computer as we continued to add new applications and features to existing applications. This is analogous to conserving water to offset the need for new dams & reservoirs.

With CMT, we have now taken this concept into silicon. If we think of a load or store to or from main (uncached) memory as a type of I/O, then thread switching in interleaved multithreading is just like the idea of a voluntary context switch. We are not giving up the CPU for the duration of the "I/O", but we are giving up the execution unit, knowing that if there is another thread that can use it, it will.

In a way, we are delaying the need to increase the clock rate or pipe-lining abilities of the cores by taking this step.

Now the underlying details of the implementation can be more complex than this (and they are getting more complex as we release newer CPU architectures like the UltraSPARC T2 Plus - see the T5140 Systems Architecture Whitepaper for details), but this analogy to I/O's and context switches works well for me to understand why we have chosen this direction.

To continue to throw engineering resources at faster, more complicated CPU cores seems to be akin to the idea of the mainframe (the closest descendant to early computers) - just make it do more of the same type of workload.

See here for the full collection of UltraSPARC T2 Plus blogs.
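
The effect of handing the execution unit to another thread during a memory stall can be illustrated with a toy model. The sketch below is a deliberate simplification and not a model of any UltraSPARC pipeline: it assumes one execution unit, threads that alternate a fixed number of compute cycles with a fixed memory stall, and zero-cost round-robin switching. All names and numbers are illustrative.

```python
def execution_unit_utilization(n_threads, compute_cycles, stall_cycles,
                               total_cycles=10_000):
    """Fraction of cycles in which the single execution unit does useful work."""
    ready_at = [0] * n_threads                # first cycle each thread may issue again
    remaining = [compute_cycles] * n_threads  # cycles left in the current compute burst
    next_thread = 0                           # round-robin starting point
    busy = 0
    for cycle in range(total_cycles):
        for offset in range(n_threads):
            t = (next_thread + offset) % n_threads
            if ready_at[t] <= cycle:
                # Thread t uses the execution unit for this cycle.
                busy += 1
                remaining[t] -= 1
                if remaining[t] == 0:
                    # Burst finished: the thread now waits on "memory".
                    remaining[t] = compute_cycles
                    ready_at[t] = cycle + 1 + stall_cycles
                next_thread = (t + 1) % n_threads
                break
        # If no thread was ready, the execution unit idles for this cycle.
    return busy / total_cycles


if __name__ == "__main__":
    # One compute cycle followed by a three-cycle stall (made-up numbers).
    for threads in (1, 2, 4, 8):
        u = execution_unit_utilization(threads, compute_cycles=1, stall_cycles=3)
        print(f"{threads} thread(s): execution unit busy {u:.0%} of the time")
```

In this model a single thread leaves the execution unit idle for most cycles, and adding threads fills those idle cycles until the unit is saturated; beyond that point extra threads add nothing.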



Margins in Consumer Telephony

Here is a little observation on telephone margins that is dear to my heart. Below is a list of rates (in US dollars per minute, taxes and other fees not shown) for various methods of calling from the US to a land-line in Australia. The last four options use VoIP.

    Source     Carrier            Add-on Plan              Add-on $/month  Rate
    Land-line  AT&T               none – Peak              -               $4.00
    Land-line  AT&T               none – Off-peak          -               $2.76
    Mobile     AT&T               none                     -               $3.49
    Mobile     AT&T               World Connect            $3.99           $0.09
    Land-line  AT&T               Occasional Calling       $1.00           $1.75
    Land-line  AT&T               Worldwide Value Calling  $5.00           $0.09
    Land-line  Time-Warner Cable                           -               $0.10
    Land-line  Comcast                                                     $0.09
    Land-line  Vonage                                                      $0.05
    Land-line  AT&T CallVantage                                            $0.04
    Land-line  Callcentric                                                 $0.0231
    Land-line  CallWithUs                                                  $0.0148

As you may see, there is a 27000% range in these numbers. Even within one carrier (AT&T) there is a 100x range. Plenty of opportunity for profit.

Hopefully it is useful to be aware there can be some very steep rates for ex-pat Aussies to call home if they are away from their preferred carrier.

I have been quite satisfied with CallWithUs, if anyone is interested. They even have a call-back feature if I want to call from my mobile.

While I'm on the topic, I should also mention this helpful message I got from my wireless (mobile) provider (although they are no longer my provider):

    When you're on the go and don't have the info you need, AT&T 411 is here to help. Whether you're searching for a business or residence - dial 4-1-1 to get quick access to phone numbers and addresses. Plus, with AT&T 411 you can find movie times, driving directions and more. And it's just $1.79 per call plus standard airtime charges.*

Thanks for the reminder - I will be vigilant to avoid that $1.79 charge, and stick to 1-800-FREE411...



Utilization - Can I Have More Accuracy Please?

Just thought I would share another Ruby script - this one takes the output of mpstat, and makes it more like the output of mpstat -a, only the values are floating point. I wrote it to process mpstat -a that I got from a customer. It can also cope with the date (in Unix ctime format) being prepended to every line. Here is some sample output:

    CPUs minf mjf   xcal    intr    ithr      csw   icsw    migr   smtx srw   syscl   usr   sys  wt   idl
       4  7.0 0.0   26.0  114.25    68.0   212.75  16.75   64.75  11.75 0.0  141.25   1.0   1.0 0.0  98.5
       4 0.75 0.0 929.75  2911.5 1954.75 10438.75  929.0  4282.0  715.0 0.0 6107.25 39.25 35.75 0.0 25.25
       4  0.0 0.0 892.25 2830.25  1910.5  10251.5  901.5  4210.0  694.5 0.0  5986.0  38.5  35.0 0.0 26.75
       4  0.0 0.0  941.5 2898.25 1926.75  10378.0 911.75  4258.0  698.0 0.0  6070.5  39.0  35.5 0.0 25.25
       4  0.0 0.0 893.75 2833.75 1917.75  10215.0 873.75 4196.25 715.25 0.0 5925.25  38.0 34.75 0.0 27.25

The script is here.

Interestingly, you can use this to get greater accuracy on things like USR and SYS than you would get if you just used vmstat, sar, iostat or mpstat -a. This depends on the number of CPUs you have in your system though.

Now, if you do not have a lot of CPUs, but still want greater accuracy, I have another trick. This works especially well if you are conducting an experiment and can run a command at the beginning and end of the experiment. This trick is based around the output of vmstat -s:

    # vmstat -s
    [...]
      54056444 user cpu
      42914527 system cpu
    1220364345 idle cpu
             0 wait cpu

Those numbers are "ticks" since the system booted. A tick is usually 0.01 seconds.

NEW: I have now uploaded a script that uses these statistics to track system-wide utilization.
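
The before-and-after trick with vmstat -s can be sketched as follows. This is not the uploaded script mentioned above; it is a minimal Python illustration that assumes the four "cpu" lines appear in the format shown, and the "after" snapshot is invented purely to make the arithmetic easy to follow.

```python
import re

# Matches lines such as "  54056444 user cpu" in vmstat -s output.
CPU_LINE = re.compile(r"\s*(\d+)\s+(user|system|idle|wait) cpu\s*$")


def parse_cpu_ticks(vmstat_s_output):
    """Extract the cumulative CPU tick counters from vmstat -s output."""
    ticks = {}
    for line in vmstat_s_output.splitlines():
        match = CPU_LINE.match(line)
        if match:
            ticks[match.group(2)] = int(match.group(1))
    return ticks


def utilization(before, after):
    """Percentage of elapsed ticks spent in each state between two snapshots."""
    delta = {state: after[state] - before[state] for state in before}
    total = sum(delta.values())
    if total == 0:
        raise ValueError("no ticks elapsed between snapshots")
    return {state: 100.0 * count / total for state, count in delta.items()}


# Snapshot taken from the sample output in the text.
BEFORE = """
  54056444 user cpu
  42914527 system cpu
1220364345 idle cpu
         0 wait cpu
"""

# Hypothetical later snapshot (made up for illustration): 1000 ticks later.
AFTER = """
  54056744 user cpu
  42914627 system cpu
1220364945 idle cpu
         0 wait cpu
"""

if __name__ == "__main__":
    result = utilization(parse_cpu_ticks(BEFORE), parse_cpu_ticks(AFTER))
    for state, pct in result.items():
        print(f"{state:>6}: {pct:.2f}%")
```

Because the counters are cumulative ticks rather than whole percentages, the precision of the result is limited only by how many ticks elapse during the experiment.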



How Event-Driven Utilization Measurement is Better than Sample-Based

...and how to measure both at the same timeWith the delivery of Solaris 10, Sun made two significant changes to how systemutilization is measured. One change was to how CPU utilisation is measuredSolaris used to (and virtually all other POSIX-like OS'es still) measure CPU utilisation by sampling it. This happened once every "clock tick". A clock tick is a kernel administrative routine which is executed once (on one CPU) for every clock interrupt that is received, which happens once every 10 milliseconds. At this time, the state of each CPU was inspected, and a "tick" would be added to each of the "usr", "sys", "wt" or "idle" buckets for that CPU.The problem with this method is two-fold: It is statistical, which is to say it is an approximation of something, derived via sampling The sampling happens just before the point when Solaris looks for threads that are waiting to be woken up to do work.Solaris 10 now uses microstate accounting. Microstates are a set of finer-grained states of execution, including USR, SYS, TRP (servicing a trap), LCK (waiting on an intra-process lock), SLP (sleeping), LAT (on a CPU dispatch queue), although these all fall under one of the traditional USR, SYS and IDLE. These familiar three are still used to report system-wide CPU utilisation (e.g. in vmstat, mpstat, iostat), however you can see the full set of states each process is in via "prstat -m".The key difference in system-wide CPU utilization comes in how microstate accounting is captured - it is captured at each and every transition from one microstate to another, and it is captured in nanosecond resolution (although the granularity of this is platform-dependent). 
To put it another way it, it is event-driven, rather than statistical sampling.This eliminated both of the issues listed above, but it is the second issue that can cause some significant variations in observed CPU utilization.If we have a workload that does a unit of work that takes less than one clock tick, then yields the CPU to be woken up again later, it is likely to avoid being on a CPU when the sampling is done. This is called "hiding from the clock", and is not difficult to achieve (see "hide from the clock" below).Other types of workloads that do not explicitly behave like this, but do involve processes that are regularly on and off the CPU can look like they have different CPU utilization on Solaris releases prior to 10, because the timing of their work and the timing of the sampling end up causing an effect which is sort-of like watching the spokes of a wheel or propeller captured on video. Another factor involved in this is how busy the CPUs are - the closer a CPU is to either idle or fully utilized, the more accurate sampling is likely to be. What This Looks Like in the Wild I was recently involved in an investigation where a customer had changed only their operating system release (to Solaris 10), and they saw an almost 100% increase (relative) in reported CPU utilization. We suspected that the change to event-based accounting may have been a factor in this.During our investigations, I developed a DTrace utility which can capture CPU utilization that is like that reported by Solaris 10, then also measure it the same way as Solaris 9 and 8, all at the same time.The DTrace utility, called util-old-new, is available here.It works by enabling probes from the "sched" provider to track when threads are put on and taken off CPUs. It is event-driven, and sums up nanoseconds the same way Solaris 10 does, but it also tracks the change in a system variable, "lbolt64" while threads are on CPU, to simulate how many "clock ticks" the thread would have accumulated. 
This should be a close match, because lbolt64 is updated by the clock tick routine, at pretty much the same time as when the old accounting happened.

Using this utility, we were able to prove that the change in observed utilisation was pretty much in line with the way Solaris has changed how it measures utilisation. The up-side for the customer was that their understanding of how much utilisation they had left on their system was now more accurate. The down-side was that they now had to re-assess whether, and by how much, this changed the amount of capacity they had left.

Here is some sample output from the utility. I start the script when I already have one CPU-bound thread on a 2-CPU system, then I start up one instance of Alexander Kolbasov's "hide-from-clock", which event-based accounting sees, but sample-based accounting does not:

    mashie[bash]# util-old-new 5
    NCPUs = 2
    Date-time             s8-tk/1000  s9-tk/1000  ns/1000
    2007 Aug 16 12:12:14         508         523      540
    2007 Aug 16 12:12:19         520         523      553
    2007 Aug 16 12:12:24         553         567      754
    2007 Aug 16 12:12:29         549         551      798
    2007 Aug 16 12:12:34         539         549      810
    ^C

The Other Change in Utilization Measurement

By the way, the other change was to "hard-wire" the Wait I/O ("%wio" or "wt" or "wait time") statistic to zero. The reasoning behind this is that CPUs do not wait for I/O (or any other asynchronous event) to complete - threads do. Trying to characterize how much a CPU is not doing anything in more than one statistic is like having two fuel gauges on your car - one for how much fuel remains for highway driving, and another for city driving.

References & Resources

  * My util-old-new utility - based on DTrace
  * "How Busy Is Your CPU, Really?" - article by Adrian Cockcroft.
  * Eric Schrock's blog entry describing microstate accounting.
  * Alexander Kolbasov's "hide from the clock" example program.
  * "How busy is the CPU, really" - Adrian Cockcroft's ITworld article from 1998 - good diagrams to explain the shortcoming of sample-based accounting.
  * Usenix Security Symposium paper - "Secretly Monopolizing the CPU Without Superuser Privileges"
  * Interesting opensolaris.org thread that covers the Wait I/O issue.
  * Call Record for the RFE to hard-wire Wait I/O to zero in Solaris 10.

P.S. This entry is intended to cover what I have spoken about in my previous two entries. I will soon delete the previous entries.
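To make "hiding from the clock" concrete, here is a minimal simulation sketch in Python. It is not the util-old-new DTrace script and not Alexander Kolbasov's program; the function names and the 8 ms burst length are assumptions chosen for illustration. It models one CPU with a 10 ms clock tick and a thread that always starts work just after a tick and yields before the next one, then compares the two accounting methods.

```python
TICK_NS = 10_000_000  # one clock tick every 10 ms, as described in the post


def bursts(n_periods, offset_ns, length_ns):
    """One on-CPU burst per tick period, starting offset_ns after each tick."""
    return [(k * TICK_NS + offset_ns, k * TICK_NS + offset_ns + length_ns)
            for k in range(n_periods)]


def event_based(on_cpu, window_ns):
    """Event-driven accounting: add up the nanoseconds actually spent on CPU."""
    return sum(end - start for start, end in on_cpu) / window_ns


def tick_sampled(on_cpu, n_periods):
    """Sampled accounting: inspect the CPU once per tick, count busy ticks."""
    busy = 0
    for k in range(1, n_periods + 1):
        t = k * TICK_NS  # instant of the k-th clock tick
        if any(start <= t < end for start, end in on_cpu):
            busy += 1
    return busy / n_periods


# Hypothetical "hider": runs for 8 ms, starting 1 ms after every tick.
hider = bursts(100, 1_000_000, 8_000_000)
print("event-based utilisation:", event_based(hider, 100 * TICK_NS))
print("tick-sampled utilisation:", tick_sampled(hider, 100))
```

In this model the thread is on CPU for 80% of the time, yet it is never on CPU at a sampling instant, so the sampled figure is zero. Real workloads are rarely this well aligned, which is why the observed difference varies.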



Using DTrace to Capture Statement Execution Times in MySQL

Introduction

I have recently been engaged with a customer that is evaluating MySQL, in particular with its Memory (formerly known as Heap) engine, which stores all database data and metadata in RAM.

I was asked to assist with diagnosing whether, and where, statements were taking longer than 5 milliseconds to complete. Now, this was being observed from the viewpoint of the "client" - the client was a synthetic benchmark run as a Java program. It could be run either on a separate system or on the same system as the MySQL database, and a small number of transactions would be measured as taking longer than 5ms.

Now, there is more than one way to skin this cat, and it turns out that Jenny Chen has had a go at putting static probes into MySQL. For my (and the customer's) purposes however, we wanted to skip the step of re-compiling with our own probes, and just use what we can observe via the PID provider.

How Do We Do This?

Well, it is not trivial. However as it turns out, I have seen a bit of the MySQL code. I also had someone pretty senior from MySQL next to me, who helped confirm what was happening, as I used some "fishing" bits of DTrace to watch a mysqld thread call functions as we ran the "mysql" client and entered simple statements.

This allowed me to narrow down on a couple of vio_*() routines, and to build some pseudo-code to describe the call flow around reception of requests from a client, processing of the statement, then return of a result to the client.

It is not as simple as looking for the entry and return of a single function, because I wanted to capture the full time from when the mysqld thread returns from a read(2) indicating a request has arrived from a client through to the point where the same thread has completed a write(2) to send the response back. This is the broadest definition of "response time measured at the server".

The Result

The result of all of our measurements showed that there were no statements from the synthetic benchmark that took longer than 5 ms to complete.
Here is an example of the output of my DTrace capture (a histogram of microseconds):

    bash-3.00# ./request.d -p `pgrep mysqld`
    ^C

               value  ------------- Distribution ------------- count
                 < 0 |                                         0
                   0 |@@@@@@@@@@@@@@@@@@@@@                    10691
                 500 |@@@@@@@@@@@@@@@@@                        8677
                1000 |@                                        680
                1500 |                                         31
                2000 |                                         0

The Script

Feel free to use the DTrace script for your own purposes. It should work on MySQL 5.0 and 5.1.

The Upshot - Observability in the JVM

There is a nagging question remaining - why was the Java client seeing some requests run longer than 5 ms?

There are three possible answers I can think of:

  1. There is latency involved in transmitting the requests and responses between the client and server (e.g. network packet processing).
  2. The thread inside the JVM was being pre-empted (thrown off the CPU) between taking its time measurements.
  3. The measurements (taken using System.nanoTime()) are not reliable.
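The measurement idea described above (start the clock when a thread returns from read(2), stop it when the same thread completes write(2), then bucket the results) can be sketched in Python. This is an illustration only, not the request.d script: the event tuples, the event names "read_return" and "write_return", and both function names are assumptions made up for the example.

```python
from collections import Counter


def request_latencies_us(events):
    """Pair each thread's read-return with its next write-return.

    events: iterable of (thread_id, kind, timestamp_ns) in time order,
    where kind is the hypothetical label "read_return" or "write_return".
    Returns per-request latencies in whole microseconds.
    """
    started = {}    # thread id -> timestamp of the request's arrival
    latencies = []
    for tid, kind, t_ns in events:
        if kind == "read_return":
            started[tid] = t_ns
        elif kind == "write_return" and tid in started:
            latencies.append((t_ns - started.pop(tid)) // 1000)
    return latencies


def linear_histogram(values, step=500):
    """Count values in fixed-width buckets, keyed by each bucket's lower bound."""
    hist = Counter()
    for v in values:
        hist[(v // step) * step] += 1
    return dict(sorted(hist.items()))


sample = [
    (1, "read_return", 0),
    (2, "read_return", 100_000),
    (1, "write_return", 400_000),
    (2, "write_return", 1_300_000),
]
lat = request_latencies_us(sample)
print(lat, linear_histogram(lat))
```

Keying the start time by thread id matters: several mysqld threads are serving requests at once, so a single global start timestamp would mix up unrelated requests.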



Case Study - AT&T Customer Relationship Management

Time-line of Events

April 20th, 2006 - I apply for SBC Yahoo Internet, which includes a sweetheart deal that will last for 12 months. Included in the terms & conditions is an early termination fee of $99.

April 21st, 2006 - The service is connected.

April 13th, 2007 (approx) - I contact AT&T (which had since re-acquired SBC) to disconnect phone & internet service, due to my imminent move to a new apartment. Disconnection is booked for April 14th. No mention from AT&T call center staff of an early termination fee.

April 18th, 2007 - AT&T issue me a bill. It is for $17.22 and is clearly marked "FINAL".

May 16th, 2007 - AT&T issue me another bill. This one is marked "REVISED FINAL BILL", and contains a $99 charge for "EARLY TERMINATION-HSI BASIC".

May 25th, 2007 - I contact AT&T and ask for an explanation. They explain the early termination fee. I ask if I can get my service re-connected. I am told that no, I can not get it re-connected as it has been more than 30 days since it was disconnected. I ask if there is any way this situation can be rectified, and suggest that I could reconnect if a credit was made to cover the early termination fee. There is no flexibility available to the person I was speaking to, though they did empathize with my situation.

UPDATE

June 26th, 2007 - I send a check to AT&T for the $99, but include a request for an enclosed copy of this blog entry to be forwarded to their customer service department.

July 27th, 2007 - AT&T send a check for $99 back to me, refunding the early termination fee.

Outcome

So, initially I was disappointed, but ultimately I am happy with AT&T. I could never say that they were not in the right, but I would still suggest that they could give more latitude to the people who work in their call centers.

It would also be wise to advise customers of the pending early termination fee at the earliest opportunity.



Simplifying "lockstat -I" Output (or Ruby Is Better Than Perl)

There, I said it. I have been a Perl scripter for nearly 20 years (since early version 3). I think Ruby has pretty much everything Perl has, and more, like:

  * It is really object-oriented (rather than Perl's bolt-on objects). I am much more likely to end up using this functionality.
  * It has operator overloading
  * It has a "case" statement
  * There is no Obfuscated Ruby Contest (though IMHO there could be one)

Anyway, enough religious argument. I want to thank my colleague Neel for putting me onto it, and now I will provide a simple example.

Post-Processing "lockstat -I"

The Ruby script below post-processes the output of "lockstat -I". Why, you ask? Well, because the output of "lockstat -I" can be tens of thousands of lines, and it does not coalesce by CPU.

The script below will coalesce by CPU, and PC within a function (useful if you forgot the "-k" flag to lockstat). A very simple change will also make it coalesce by PIL, but I will leave that as an exercise for the reader.

Ruby Post-Processor Script

One thing about this is that I would write this script almost exactly the same way if I used Perl.
That is another plus of Ruby - it is easy to pick up for a Perl programmer.

    #!/bin/env ruby -w
    # lockstatI.rb - Simplify "lockstat -I" output
    #
    # http://blogs.sun.com/timc, 16 Mar 2007
    #
    # SYNOPSIS
    #     lockstat -I [-i 971 -n <nnnnnn> ] sleep <nnn> | lockstatI.rb

    PROG = "lockstatI"

    #-- Once we have printed values that cover this proportion, then quit
    CUTOFF = 95.0

    DASHES = ('-' * 79) + "\n"

    #========================================================================
    #    MAIN
    #========================================================================

    print "#{PROG} - will display top #{CUTOFF}% of events\n"

    events = 0
    period = 0
    state = 0
    counts = {}

    #-- The classic state machine
    ARGF.each_line do |line|
        next if line == "\n"
        case state
        when 0
            if line =~ /^Profiling interrupt: (\d+) events in (\d.*\d) seconds/ then
                puts line
                state = 1
                events = $1
                period = $2
                next
            end
        when 1
            if line == DASHES then
                state = 2
            end
        when 2
            if line == DASHES then
                break
            end
            f = line.split
            count = f[0]
            cpu, pil = f[5].split('+')

            #-- Coalesce PCs within functions; i.e. do not differentiate by
            #-- offset within a function. Useful if "-k" was not specified
            #-- on the lockstat command.
            caller = f[6].split('+')[0]

            if pil then
                caller = caller + "[" + pil + "]"
            end
            counts[caller] = counts[caller].to_i + count.to_i
        end
    end

    #-- Give me an array of keys sorted by descending value
    caller_keys = counts.keys.sort { |a, b| counts[b] <=> counts[a] }

    #-- Dump it out
    printf "%12s %5s %5s %s\n", "Count", "%", "cuml%", "Caller[PIL]"
    cuml = 0.0
    caller_keys.each do |key|
        percent = counts[key].to_f * 100.0 / events.to_f
        cuml += percent
        printf "%12d %5.2f %5.2f %s\n", counts[key], percent, cuml, key
        if cuml >= CUTOFF then
            break
        end
    end

Example Output

Free beer to the first person to tell me what this is showing. It was not easy to comprehend the 90,000 line "lockstat -I" output prior to post-processing it though.
You get this problem when you have 72 CPUs...

    lockstatI - will display top 95.0% of events
    Profiling interrupt: 4217985 events in 60.639 seconds (69559 events/sec)
           Count     % cuml% Caller[PIL]
         1766747 41.89 41.89 disp_getwork[11]
         1015005 24.06 65.95 (usermode)
          502560 11.91 77.86 lock_set_spl_spin
           83066  1.97 79.83 lock_spin_try[11]
           62670  1.49 81.32 mutex_enter
           53883  1.28 82.60 mutex_vector_enter
           40847  0.97 83.57 idle
           40024  0.95 84.51 lock_set_spl[11]
           27004  0.64 85.15 splx
           17432  0.41 85.57 send_mondo_set[13]
           15876  0.38 85.94 atomic_add_int
           14841  0.35 86.30 disp_getwork



SSH Cheat Sheet

This is offered for those who want to kick their telnet habit. I also offer a simple text version, which you can keep in ~/.ssh.

To create an SSH key for an account

    srchost$ ssh-keygen -t rsa

This will create id_rsa and id_rsa.pub in ~/.ssh. "-t dsa" can be used instead. You will need an SSH key if you want to log in to a system without supplying a password.

To be able to log in to desthost from srchost without a password (as below)

    srchost$ ssh desthost
    desthost$

Simply add the contents of srchost:~/.ssh/id_rsa.pub to desthost:~/.ssh/authorized_keys in the form "ssh-rsa AAAkeystringxxx= myusername@srchost".

To enable forwarding of an X-windows session back to your $DISPLAY on srchost

Just use "-X":

    srchost$ ssh -X desthost
    desthost$ xterm

If I use a different account on desthost (and I want to use a short name for desthost)

Add something like this:

    Host paedata
        Hostname paedata.sfbay
        User tc35445

to srchost:~/.ssh/config

Still Getting Prompted For a Password

If I find that my key is not being recognised on desthost (I still get prompted for a password), I probably have a permission problem. Try this as the user on desthost:

    cd
    chmod g-w,o-w .
    chmod g=,o= .ssh .ssh/authorized_keys

To allow root logins (but must specify password or have an authorized_key) on a host

Edit /etc/ssh/sshd_config, change line to

    PermitRootLogin yes

Solaris 9 & earlier:

    # /etc/init.d/sshd restart

Solaris 10 & later:

    # svcadm restart ssh

Here is a patch (will save the original config file in sshd_config.orig)

    /usr/bin/patch -b /etc/ssh/sshd_config
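For the "still getting prompted for a password" case, the two chmod commands amount to a rule: the home directory must not be writable by group or other, and .ssh and authorized_keys must give group and other no access at all. A small Python sketch of a checker for that rule follows. The function names are made up for this example, and the rule is taken from the chmod lines in this cheat sheet rather than from the sshd source, so treat it as an approximation of what sshd actually enforces.

```python
import os
import stat

# Bits removed by "chmod g-w,o-w": group/other write on the home directory.
GROUP_OTHER_WRITE = stat.S_IWGRP | stat.S_IWOTH
# Bits removed by "chmod g=,o=": every group/other permission.
GROUP_OTHER_ANY = stat.S_IRWXG | stat.S_IRWXO


def too_open(mode, forbidden):
    """True if the permission bits in mode include any forbidden bit."""
    return bool(stat.S_IMODE(mode) & forbidden)


def ssh_permission_problems(home):
    """Return the paths under home whose modes the chmod commands would change."""
    checks = [
        (home, GROUP_OTHER_WRITE),
        (os.path.join(home, ".ssh"), GROUP_OTHER_ANY),
        (os.path.join(home, ".ssh", "authorized_keys"), GROUP_OTHER_ANY),
    ]
    problems = []
    for path, forbidden in checks:
        if os.path.exists(path) and too_open(os.stat(path).st_mode, forbidden):
            problems.append(path)
    return problems
```

Calling ssh_permission_problems(os.path.expanduser("~")) on desthost lists the paths that need attention; an empty list means the permissions already match what the chmod commands would set.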



The Compiler Detective - What Compiler and Options Were Used to Build This Application?

("cc CSI")

Introduction

Performance engineers often look at improving application performance by getting the compiler to produce more efficient binaries from the same source. This is done by changing what compiler options are used.

In this modern era of Open Source Software, you can often get your hands on a number of binary distributions of an application, but if you really want to roll your sleeves up, the source is there, just waiting to be compiled with the latest compiler and optimisations.

Now, it might be useful to have as a reference the compiler version and flags that were originally used on the binary distribution you tried out, or you just might be interested to know. Read on for details on the forensic tools.

What architecture was the executable compiled for?

Solaris supports 64 and 32-bit programming models on SPARC and x86. You may need to know which one an application is using - it's easy enough to find out.

    $ file *.o
    xxx.o: ELF 64-bit LSB relocatable AMD64 Version 1
    yyy.o: ELF 32-bit LSB relocatable 80386 Version 1 [SSE2]

Note: yyy.o was built using "-native", informing the compiler that it can use the SSE2 instruction set extensions supported by the CPUs on the build system.
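Most of what file(1) prints here comes straight from the first 20 bytes of the ELF header. As a rough Python sketch (describe_elf is a name invented for this example, and only a handful of machine codes are mapped), the class, byte order, object type and machine can be decoded like this. It does not attempt the "[SSE2]" part, which comes from information elsewhere in the object.

```python
import struct

# Small, deliberately incomplete lookup tables using ELF specification values.
ELF_CLASS = {1: "32-bit", 2: "64-bit"}
ELF_DATA = {1: ("LSB", "<"), 2: ("MSB", ">")}
ELF_TYPE = {1: "relocatable", 2: "executable", 3: "dynamic lib"}
ELF_MACHINE = {2: "SPARC", 3: "80386", 43: "SPARCV9", 62: "AMD64"}


def describe_elf(header):
    """Summarise an ELF header given at least its first 20 bytes."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    bits = ELF_CLASS.get(header[4], "unknown-class")
    order, fmt = ELF_DATA.get(header[5], ("unknown-order", "<"))
    # e_type and e_machine are two 16-bit fields right after the 16-byte e_ident.
    e_type, e_machine = struct.unpack_from(fmt + "HH", header, 16)
    kind = ELF_TYPE.get(e_type, "type %d" % e_type)
    machine = ELF_MACHINE.get(e_machine, "machine %d" % e_machine)
    return "ELF %s %s %s %s" % (bits, order, kind, machine)
```

To try it on a real file, read the first 20 bytes in binary mode, for example describe_elf(open("xxx.o", "rb").read(20)).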



Why CONNECT / AS SYSDBA did not work - Trap For Young Players

Background

I have been installing Oracle databases onto a system where a database and oracle account are already installed. I wanted to effect changes that could be easily cleaned up. So, I created my own oracle account/uid and group/gid.

The Problem

The problem I encountered was that when I tried to connect to the database to do DBA-type things, my experience went like this:

    tora ) sqlplus / as sysdba
    ...
    ERROR:
    ORA-01031: insufficient privileges

The Solution

This mode of connection, where a bare slash (/) is used instead of oracleaccount/oraclepassword, is where you ask Oracle to authorize you via your OS credentials. What I was missing is that the Unix group specified during creation of the database (or installation of the software, I forget which) is separate from the Unix group used to automatically grant a user the SYSDBA and SYSOPER privileges - which would have allowed the "sqlplus / as sysdba" to succeed.

So the Unix group that does grant these privileges (I think they are Oracle roles, just as Solaris has roles) is "dba". I am not sure whether this can be changed, so I took the shortest route and added my "tora" user to the "dba" group, and all was well.

File this one under just-hoping-anyone-in-the-same-pain-finds-this-via-Google.

I also would like to plug Installing and Configuring Oracle Database 10g on the Solaris Platform by Roger Schrag as a useful Cheat Sheet.



Lsync - Keeping Your Sanity by Keeping Your Home Directory Synchronised

I have battled for a number of years with how to keep a laptop and a home directory reasonably synchronised. About a year ago I decided to solve this problem once and for all.

Below I describe Lsync, a script I wrote to do the work for me. Here is the Lsync script for download.

Introduction

WARNING: Command-Line Content

Lsync is a tool to keep your home directory synchronised across multiple systems. While there are a few solutions out there for doing this automagically, such solutions are either still in a research phase, have a price tag, or are designed on a different scale.

This solution is intended to be cheap, and work with most variants of Unix. It is also intended to be used by the user who is doing updates on one or both copies of their own home directory. It should be used regularly, to minimise the work required at each operation, and to reduce the risk of data loss, or insanity.

Lsync is implemented on top of the excellent OSS program rsync.

Pre-Requisites

As configured, Lsync depends on bash, ssh, and a recent version of rsync (exactly how recent I do not know). It can probably be modified to use the (far less secure) rsh/rexec protocol, but I am not going to do this work. So far, I have found it to work on Solaris 9, 10 and Express, as well as Mac OS X 10.3.9. I would fully expect it to work on any variant of Linux.

The user will need to edit the Lsync script to set the values for their "laptop" and "master" hosts.

What It Does

Lsync offers 5 basic functions:

  1. Check on what needs to be synced from your "laptop" to your "master"
  2. Check on vice-versa
  3. Sync from your "laptop" to your "master" (make it so...)
  4. Sync vice-versa
  5. Edit your "rsync includes" file

The sync/check operations by default are done between the user's home directories on the "laptop" and "master" hosts, but they can instead perform the operation just on a sub-directory of the home directory.

Any operation that modifies data requires further input from the user - either editing of the "rsync includes" file, or entering a "y" to confirm that we really want to sync.

An advantage of using ssh as the remote shell protocol is that the user can leverage ~/.ssh/config to specify a different username to be used on a remote host. For example, I have the username "timc" on my laptop, but have to use the corporate standard "tc35445" on any SWAN host (yuck). The magic for this to happen without work on my behalf is to put this in ~/.ssh/config on my laptop:

    Host *.sfbay
        User tc35445

Terminology

laptop
    A host you designate. This can have a dynamic name, established at run-time as the host you are running Lsync on, or it can be static. If the laptop host is static, you can run Lsync on either the master or the laptop host.

master
    A host you designate. This is static.

sender
    Host deemed to have the current authoritative version of your home directory.

receiver
    Host being synchronised to the sender's version of your home directory.

rsync includes file
    A file containing rules for including & excluding files and/or directories to be synchronised. The format is documented in the rsync(1) manual page. The default location for this is ~/.rsync-include.

Performance

Unless you have seen rsync before, you will be surprised how fast it can do its work. I am currently syncing about 50,000 files, but if the metadata for these is cached on both systems, a full home directory check takes around a minute.
If it is the first check of the day, it might take 5 minutes.

In either case, it is fast enough to use at least daily, which means you can easily have your full, up-to-date "working set" with you on your laptop when you are out of the office, but go back to a SunRay when you get in.

Alternatively, if you know you have been working in a sub-directory, just specify that subdirectory, and Lsync limits its update to that directory tree. For example:

    $ Lsync L ~/tools/sh/Lsync

Caveats, Warnings, Disclaimers and Other Fine Print

It is important to understand that I am using the "--delete" option to rsync, which means that rsync will delete files on the receiver that do not exist on the sender. This means if you delete something on one host, it won't come back to haunt you, but it also means you must get sync operations in the correct order with your work activity. For easy identification, rsync tells you whenever it would have deleted or is deleting something by prepending it with "deleting".

Also, all sync operations will act on files and directories at the same level under your home directory on both systems. In other words, you can not use Lsync to copy a directory to a different location on the receiver, leaving the receiver's version of the directory in place. This is a deliberate decision - Lsync is intended to synchronise, not to replicate.

Examples

When I want to sync my home directory, if I am not sure what I have modified in what directory, I first check:

    d-mpk12-65-186 ) Lsync l
    -- Lsync: listing what to sync under ~ from d-mpk12-65-186.SFBay.Sun.COM to paedata.sfbay
    building file list ... done
    deleting tools/sh/Lsync/tmpfile
    ./
    man/cat1/
    man/cat1/rsync.1.gz
    tools/sh/Lsync/README
    wrote 1016147 bytes  read 28 bytes  16796.28 bytes/sec
    total size is 4882980897  speedup is 4805.26
    -- Use "Lsync L" to perform sync
    d-mpk12-65-186 )

Then, noticing that the only thing to delete is something I have deleted on the sender and genuinely do not want any more, I go ahead and "make it so":

    d-mpk12-65-186 ) Lsync L
    -- Lsync: SYNCING everything under ~ from d-mpk12-65-186.SFBay.Sun.COM to paedata.sfbay
    Enter 'y' to confirm: y
    building file list ...
    49676 files to consider
    deleting tools/sh/Lsync/tmpfile
    ./
    man/cat1/
    man/cat1/rsync.1.gz
           47248 100%  780.49kB/s    0:00:00  (1, 70.8% of 49676)
    tools/sh/Lsync/
    tools/sh/Lsync/README
           12288 100%  255.32kB/s    0:00:00  (2, 99.4% of 49676)
    wrote 1075767 bytes  read 60 bytes  16942.16 bytes/sec
    total size is 4882980897  speedup is 4538.82

If I want to double-check, I can now see what might be out of date with my "master":

    d-mpk12-65-186 ) Lsync m
    -- Lsync: listing what to sync under ~ from paedata.sfbay to d-mpk12-65-186.SFBay.Sun.COM
    receiving file list ... done
    wrote 337 bytes  read 1049835 bytes  21653.03 bytes/sec
    total size is 4882980897  speedup is 4649.70
    -- Use "Lsync M" to perform sync

If I want to just sync a particular directory tree, I can specify this as an argument after the operation letter. This will be a lot faster than examining my whole home directory tree on both hosts. It also ignores my "rsync includes" file.

Any absolute or relative path can be specified, but it must resolve to something below my $HOME:

    d-mpk12-65-186 ) pwd
    /Users/timc/tools/sh/Lsync
    d-mpk12-65-186 ) Lsync l .
    -- Lsync: listing what to sync under ~/tools/sh/Lsync from d-mpk12-65-186.SFBay.Sun.COM to paedata.sfbay
    building file list ... done
    ./
    README
    wrote 165 bytes  read 28 bytes  55.14 bytes/sec
    total size is 32083  speedup is 166.23
    -- Use "Lsync L" to perform sync
    d-mpk12-65-186 ) Lsync l ~/tools/
    -- Lsync: listing what to sync under ~/tools from d-mpk12-65-186.SFBay.Sun.COM to paedata.sfbay
    building file list ... done
    sh/Lsync/
    sh/Lsync/README
    wrote 54756 bytes  read 28 bytes  9960.73 bytes/sec
    total size is 84506552  speedup is 1542.54
    -- Use "Lsync L" to perform sync
    d-mpk12-65-186 ) Lsync l /var/tmp
    Lsync: can not sync "/var/tmp", as it is not under $HOME
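The rule that a path argument "must resolve to something below my $HOME" is worth spelling out, since the same relative location is then used on both hosts. Lsync itself is a bash script whose source is not shown here, so the following Python sketch only illustrates the check; subdir_under_home is a made-up name, and the sketch normalises the path textually without following symbolic links.

```python
import posixpath


def subdir_under_home(path, home, cwd):
    """Resolve path against cwd and return it relative to home.

    Returns "" for the home directory itself, and raises ValueError when
    the path is outside home, mirroring the error shown for /var/tmp.
    """
    full = posixpath.normpath(posixpath.join(cwd, path))
    home = posixpath.normpath(home)
    if full == home:
        return ""
    if full.startswith(home + "/"):
        return full[len(home) + 1:]
    raise ValueError('can not sync "%s", as it is not under $HOME' % path)


home = "/Users/timc"
cwd = "/Users/timc/tools/sh/Lsync"
print(subdir_under_home(".", home, cwd))
print(subdir_under_home("/Users/timc/tools/", home, cwd))
```

Comparing against home + "/" rather than just home avoids wrongly accepting a sibling such as /Users/timc2.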



How the T2000 handles 20,000 Database Connections

I have recently changed careers at Sun, moving from a pre/post-sales technical role to an engineer role. I now work in Sun's PAE (Performance Availability & Architecture Engineering) group.

You could say I am definitely enjoying the change, getting to play with real geeky stuff :)

Both jobs had the opportunity to play with new hardware, but the new job should see me get much more play time with pre-release systems, which I find very interesting.

Anyway, I had the chance to do some testing on one of our new systems that many customers have also had a great chance to test on - the T2000.

The workload I am using is designed to replicate a database query load that is common for some parts of modern on-line retail operations - almost exclusively relatively simple queries, running pretty much fully cached in the RAM of the database server. The queries come from a very large population of clients (5 - 20 thousand).

As a highly analytical, fact-oriented engineer, I remained just a little sceptical, up until my first opportunity to experience the T2000 first-hand, that it could deliver something like the promise of what I had heard. It has been covered that the T2000 suits some workloads better than others - I was keen to get a quantitative understanding of this to support the qualitative things I had heard.

The base system I had been doing tests on was an 8-CPU (16 core) E2900. I expected this would provide a pretty fair comparison to a system that has 8 cores in its single CPU. In fact I knew of some preliminary numbers for the T2000 which made me think the E2900 would come out slightly ahead.

I need to update my thinking. The advantage of four threads per core is proven by reality, with the 8-core T2000 showing it certainly had greater throughput for my workload than the 16-core E2900.

As you can see in the graph, the T2000's CPU consumption is well ahead of the E2900's (lower is better). Also, as I scale up the number of connections (keeping the overall throughput the same), the T2000 does not see as much of a penalty for the increase.

References:

  * Sun T2000 vs Dell 6850: LDAP Authrate
  * T2000 & T1000 industry benchmark results
  * Sun Fire Server Try 'n' Buy programs



Demonstrating ZFS Self-Healing

I'm the kind of guy who likes to tinker. To see under the bonnet. I used to have a go at "fixing" TVs by taking the back off and seeing what could be adjusted (which is kind-of anathema to one of the philosophies of ZFS).

So, when I have been presenting and demonstrating ZFS to customers, the thing I really like to show is what ZFS does when I inject "silent data corruption" into one device in a mirrored storage pool.

This is cool, because ZFS does a couple of things that are not done by any comparable product:

  * It detects the corruption by using checksums on all data and metadata.
  * It automatically repairs the damage, using data from the other mirror, assuming checksum(s) on that mirror are OK.

This all happens before the data is passed off to the process that asked for it. This is how it looks in slideware:

The key to demonstrating this live is how to inject corruption, without having to apply a magnet or lightning bolt to my disk. Here is my version of such a demonstration:

Create a mirrored storage pool, and filesystem

    cleek[bash]# zpool create demo mirror /export/zfs/zd0 /export/zfs/zd1
    cleek[bash]# zfs create demo/ccs

Load up some data into that filesystem, see how we are doing

    cleek[bash]# cp -pr /usr/ccs/bin /demo/ccs
    cleek[bash]# zfs list
    NAME       USED  AVAIL  REFER  MOUNTPOINT
    demo      2.57M   231M  9.00K  /demo
    demo/ccs  2.51M   231M  2.51M  /demo/ccs

Get a personal checksum of all the data in the files - the "find/cat" will output the contents of all files, then I pipe all that data into "cksum"

    cleek[bash]# cd /demo/ccs
    cleek[bash]# find . -type f -exec cat {} + | cksum
    1891695928 2416605

Now for the fun part. I will inject some corruption by writing some zeroes onto the start of one of the mirrors.

    cleek[bash]# dd bs=1024k count=32 conv=notrunc if=/dev/zero of=/export/zfs/zd0
    32+0 records in
    32+0 records out

If I re-read the data now, ZFS will not find any problems, and I can verify this at any time using "zpool status"

    cleek[bash]# find . -type f -exec cat {} + | cksum
    1891695928 2416605
    cleek[bash]# zpool status demo
      pool: demo
     state: ONLINE
     scrub: none requested
    config:

            NAME                 STATE     READ WRITE CKSUM
            demo                 ONLINE       0     0     0
              mirror             ONLINE       0     0     0
                /export/zfs/zd0  ONLINE       0     0     0
                /export/zfs/zd1  ONLINE       0     0     0

The reason for this is that ZFS still has all the data for this filesystem cached, so it does not need to read anything from the storage pool's devices.

To force ZFS' cached data to be flushed, I export and re-import my storage pool

    cleek[bash]# cd /
    cleek[bash]# zpool export -f demo
    cleek[bash]# zpool import -d /export/zfs demo
    cleek[bash]# cd -
    /demo/ccs

At this point, I should find that ZFS has found some corrupt metadata

    cleek[bash]# zpool status demo
      pool: demo
     state: ONLINE
    status: One or more devices has experienced an unrecoverable error.  An
            attempt was made to correct the error.  Applications are unaffected.
    action: Determine if the device needs to be replaced, and clear the errors
            using 'zpool online' or replace the device with 'zpool replace'.
       see: http://www.sun.com/msg/ZFS-8000-9P
     scrub: none requested
    config:

            NAME                 STATE     READ WRITE CKSUM
            demo                 ONLINE       0     0     0
              mirror             ONLINE       0     0     0
                /export/zfs/zd0  ONLINE       0     0     7
                /export/zfs/zd1  ONLINE       0     0     0

Cool - Solaris Fault Manager at work. I'll bring that mirror back online, so ZFS will try using it for what I plan to do next...

    cleek[bash]# zpool online demo /export/zfs/zd0
    Bringing device /export/zfs/zd0 online

Now, I can repeat my read of data to generate my checksum, and check what happens

    cleek[bash]# find . -type f -exec cat {} + | cksum
    1891695928 2416605

Note that my checksum is the same

    cleek[bash]# zpool status
    [...]
            NAME                 STATE     READ WRITE CKSUM
            demo                 ONLINE       0     0     0
              mirror             ONLINE       0     0     0
                /export/zfs/zd0  ONLINE       0     0    63
                /export/zfs/zd1  ONLINE       0     0     0

Of course, if I wanted to know the instant things happened, I could also use DTrace (in another window):

    cleek[bash]# dtrace -n :zfs:zio_checksum_error:entry
    dtrace: description ':zfs:zio_checksum_error:entry' matched 1 probe
    CPU     ID                    FUNCTION:NAME
      0  40650         zio_checksum_error:entry
      0  40650         zio_checksum_error:entry
      0  40650         zio_checksum_error:entry
      0  40650         zio_checksum_error:entry
    [...]
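The detect-and-repair behaviour shown in the demonstration can be modelled in a few lines of Python. This is a toy sketch and not how ZFS is implemented: the class name is invented, CRC32 from zlib stands in for the checksums ZFS really uses, and a single in-memory block stands in for a pool. It shows the essential idea: the checksum is kept apart from the data, every read verifies it, and a bad copy is rewritten from a good one before the data is returned.

```python
import zlib


class MirroredBlock:
    """Toy two-way mirror of one block, with a checksum stored separately."""

    def __init__(self, data):
        self.copies = [bytearray(data), bytearray(data)]
        self.checksum = zlib.crc32(data)   # stand-in for a real ZFS checksum
        self.cksum_errors = [0, 0]         # like the CKSUM column per device

    def read(self):
        bad = []
        for i, copy in enumerate(self.copies):
            if zlib.crc32(bytes(copy)) == self.checksum:
                good = bytes(copy)
                for j in bad:              # self-heal copies that failed
                    self.copies[j] = bytearray(good)
                return good
            self.cksum_errors[i] += 1
            bad.append(i)
        raise IOError("all copies failed their checksum")


blk = MirroredBlock(b"contents of /usr/ccs/bin")
blk.copies[0][:8] = bytes(8)               # "dd if=/dev/zero" onto one side
data = blk.read()                          # still returns the original bytes
print(data, blk.cksum_errors)
```

After the first read the damaged copy has been rewritten, so reading again does not increase the error count for that side.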



Customising man(1) for more readable manual pages

This entry can be considered a test run, or it can be considered as an entry that will stop my next entry from being the first and only entry when it is released. These things are true, but I hope this entry will be of interest anyway.

I am still a habitual user of the "man" command in Solaris (and other Unixes). It has pretty good response time, and the format is familiar. I did notice a number of years ago that the format could be better. All sorts of terminals are able to display bold and underlined text, and printed man pages show some elements as bold, but man(1) on Solaris only shows underlined elements (see below).

My interest was piqued when I re-acquainted myself with Linux, and noticed that man(1) was showing bold elements. I had to investigate.

I did a few things like "truss -f" on the man program, and "strings" on the binary, until I discovered that Solaris' man(1) was different in that it was using a "-u0" option to nroff(1) when formatting the text. This flag was undocumented at the time, but I discovered that if I hand-built a man page and used "-u1" instead, I got bold text.

I was obviously not firing on all cylinders that day, as I chose to customise man(1) by putting a copy of /usr/bin/man (the binary) in my ~/bin directory, then editing the binary, changing any occurrence of "-u0" to "-u1".

When I next upgraded my workstation, I then had to replace the binary with a script that called a release-specific binary, as I was then using multiple releases of Solaris on different systems.

Eventually, the injector on cylinder 3 cleared, and I figured out how to do it with a script that effectively interposed on /usr/bin/man and /usr/bin/nroff. This would also work on many releases of Solaris (currently working fine on S10 and the most recent build of OpenSolaris). At the same time, I figured I could use a custom PAGER program for displaying man pages, and implement this in the same script.

If you feel envious and want this for yourself, download the script from here.
Put this in a directory in your PATH that comes before /usr/bin and /bin, then link it to "nroff" in the same directory (e.g. "ln ~/bin/man ~/bin/nroff"). You will then get what I see:
