Wednesday Mar 28, 2007

Eeek: Time isn't accurate in a VM!

I noticed that time was all over the place on my Solaris guest, and it made me wonder just how to measure and quantify time on a virtualized guest. The problem is, if you use gettimeofday() in the guest as a reference, it too may not be accurate. So, I used an external time reference to measure the guest, and lo and behold, time was indeed out!

On my most recent VMware test configuration, with snv running as a guest under VMware on Ubuntu, Solaris time was jumping forward by several seconds or minutes at random times.

The test host was Ubuntu 6.10 on a 2 x dual-core Opteron rev. E system.

Basically, the problem is that the Ubuntu dom0 power-manages the Opteron cores, and it seems that some virtualization layers (in this case VMware) don't take into account that the time stamp counters (TSCs) are not in sync across cores. When this happens, time jumps forward at random intervals, sometimes by up to an hour. This particular problem only happens on NUMA systems with unsynchronized TSCs.

To solve the problem, I bound my snv guest to a core and told VMware not to adjust the TSC. Here's what I have as a description: "The host.noTSC and ptsc.noTSC lines enable a mechanism that tries to keep the guest clock accurate even when the time stamp counter (TSC) is slow."

processor0.use = FALSE 
processor1.use = FALSE 
processor2.use = FALSE 
host.cpukHz = 2200000 
host.noTSC = TRUE 
ptsc.noTSC = TRUE 

Here's what I did to quantify the issue: I ran an externally controlled, timed benchmark of a time program, once on a reference (unvirtualized) host and once on the virtualized snv guest. That way, I could know what the real elapsed time was rather than assuming what the guest was telling me. Interestingly, the guest was indeed lying about its notion of wallclock time.

Here's what 18 seconds looks like on (a) a VMware Solaris guest and (b) a reference machine (an old SPARC); i.e. both of these tests ran for exactly 18 real seconds. Sec is the number of seconds from the start according to gettimeofday(); tod is the delta of gettimeofday() between 1s sleeps, in microseconds.

Reference machine:

Sec     tod   hrtime 
 0  1009354  1009476 
 1  1009801  1009817 
 2  1010484  1010505 
 3  1009315  1009333 
 4  1009905  1009923 
 5  1009909  1009930 
 6  1009905  1009924 
 7  1009911  1009945 
 8  1009842  1009869 
 9  1009897  1009918 
10  1009887  1009906 
11  1009918  1009933 
12  1009903  1009920 
13  1009913  1009931 
14  1009892  1009910 
15  1009908  1009928 
16  1009895  1009921 
17  1009900  1009929 
18  1009876  1009896 

snv Guest:

Sec     tod   hrtime 
 0  1000665  1000702  
 1  1007738  1007756  
 2 177169798 177169813  <= Argh!
179  1011251  1011275  
180  1008404  1008432  
181  1009989  1010016  
182  1009618  1009644  
183  1009896  1009924  
184  1009747  1009766  
185  1000265  1000291  
186  1009336  1009360  

With numanode = "1" set in the Solaris VMware guest, it gets better, but now time runs slow (this setting binds the guest onto a NUMA- and time-coherent set of cores):

Sec     tod   hrtime 
 0  1000122  1000139 
 1  1004468   989472 
 2  4973883   939326 
 6  1009682  1009689 
 7  1005019   991275 
 8  4975168   939355 
13  1009630  1009638 
14  1003097   989955 

With the following params set:

processor0.use = FALSE 
processor1.use = FALSE 
processor2.use = FALSE 
host.cpukHz = 2200000 
host.noTSC = TRUE 
ptsc.noTSC = TRUE 

Sec     tod   hrtime 
 0  1004787  1004911 
 1  1009783  1009802 
 2  1009914  1009935 
 3  1009894  1009913 
 4  1009895  1009913 
 5  1009900  1009918 
 6  1037644  1037680 
 7  1002091  1002117 
 8  1009910  1009929 
 9  1009897  1009920 
10  1009904  1009923 
11  1009893  1009913 
12  1009916  1009934 
13  1009876  1009894 
14  1009893  1009918 
15  1009873  1009891 
16  1009901  1009921 
17  1009883  1009911 
18  1009903  1009922 
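
For anyone wanting to reproduce tables like the ones above, here is a minimal sketch of the sampling loop in Python. It is not the program I ran. It assumes the hrtime column is the matching delta from a high-resolution timer such as gethrtime(), and uses time.monotonic() as a stand-in; the function names are made up for illustration. Note that both clocks here belong to the machine being measured, so the real elapsed time still has to come from an external reference.

```python
import time


def sample_clock_deltas(samples=3, interval=1.0, sleep=time.sleep,
                        wall=time.time, mono=time.monotonic):
    """Return (sec, tod_us, hr_us) rows, one per sleep interval.

    sec    -- whole seconds since the start, according to the wall clock
    tod_us -- wall-clock delta across the sleep, in microseconds
    hr_us  -- monotonic-clock delta across the sleep, in microseconds
    """
    rows = []
    start = wall()
    prev_wall, prev_mono = start, mono()
    for _ in range(samples):
        sleep(interval)
        now_wall, now_mono = wall(), mono()
        rows.append((int(prev_wall - start),
                     round((now_wall - prev_wall) * 1e6),
                     round((now_mono - prev_mono) * 1e6)))
        prev_wall, prev_mono = now_wall, now_mono
    return rows


def suspicious(rows, interval=1.0, tolerance=0.05):
    """Return rows whose wall-clock delta is off by more than `tolerance`
    (as a fraction of the requested sleep interval)."""
    expected = interval * 1e6
    return [r for r in rows if abs(r[1] - expected) > tolerance * expected]


if __name__ == "__main__":
    print("Sec      tod   hrtime")
    for sec, tod, hr in sample_clock_deltas():
        print(f"{sec:3d} {tod:8d} {hr:8d}")
```

The sleep and clock functions are passed in as parameters so the loop can be exercised with fake clocks; the 5% tolerance in suspicious() is an arbitrary choice that lets the usual ~1.0099s samples through while flagging a 177 second jump.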


Thursday Mar 15, 2007

New Features of the Solaris Performance Wiki

We've added a few new features and some more content at the Solaris Performance Wiki.


  • Popular content Rating and Navigation
  • Performance news aggregation
  • New content

Of course, everyone is welcome to contribute!


Thursday Oct 19, 2006

BlackBox Project

I put up a few of the Blackbox release-day photos at my website. IMHO, a great new breakthrough; putting datacenters next to power stations and eliminating the power transmission costs is actually a pretty big deal when deploying at massive scale.

On a related note, this site is running in an OpenSolaris zone, connected via gigabit ethernet to an OpenSolaris ZFS NFS file server, alongside about a dozen other websites... ;-)

Sunday Sep 24, 2006

Virtualization in Paris

Last week I presented on server consolidation & virtualization technologies in Paris, at Sun's SunUP-Network conference. We had a good turnout, around 100 customers from around Europe. There was tremendous interest in virtualization, and some of the customers are quite a long way down the path of deploying it.

I presented on OS virtualization vs Hardware virtualization, and talked about the differences between the two; including some of the performance studies we have been doing.

Some of the interesting tid-bits from the discussions:

  • Over 90% of the attendees were considering or currently deploying virtualization technologies
  • Some are already using VMware - all but one were using it for Windows consolidation, for consolidating many small older servers to increase server utilization. We're doing lots of exciting work with VMware on our new Galaxy servers, including the X4600
  • Many customers are already planning to use the up-and-coming logical domaining capability of the T2000 SPARC systems for consolidation of many small SPARC systems
  • Zones are increasingly being used as a consolidation technology, given that they are the lightest-weight of all the solutions (they have the lowest virtualization performance overhead and minimal OS administration requirements). One financial customer has standardized on using Containers/Zones for all new application deployments, and has completed internal standards and training for new deployments. This will allow them to template their provisioning strategy, and make it easy to migrate applications between servers.

All in all a very interesting meeting and set of discussions!

Friday Aug 04, 2006

DTrace, MDB: Solaris Internals Podcast

During Jim's recent west coast tour, we were apparently overheard talking in the local about Solaris Internals. Catch the Podcast here, or the raw mp3 here.


Wednesday Jul 26, 2006

Fun at the OpenSolaris Users Group last night

Jim and I had fun at last night's Silicon Valley OpenSolaris Users Group. We finally met a few more great OpenSolaris community members, including Ben Rockwood, who has blogged about the meeting already! We ran a quiz and gave away a set of signed books too.

I just posted the slides we used to talk about the book here, for reference.


Wednesday Jul 19, 2006

Solaris Internals Released!

I'm very happy to be finally able to say that Solaris Internals is shipping! I received a box from the same batch that went to Amazon this week, and Amazon have updated their status to available.

We expect the 2nd book (Solaris Performance and Tools) to ship next week.

Also, we've started creating a Performance FAQ for the 2nd book. It's in early stages right now, but growing quickly.

On a final note, Jim and I hope to do a talk about the books at the Silicon Valley OpenSolaris Users Group next week in Santa Clara; hope to see you there!


Monday Apr 03, 2006

Performance, Observability, DTrace and MDB

Solaris Internals, 2nd Edition is finally done!

At 5:30am this morning, Jim, Brendan Gregg and I submitted two completed books to the publisher. You may notice two easter eggs here: first, there are now TWO books, and second, there is another primary author in the fold, Brendan Gregg.

The first of the two books is an update to Solaris Internals, for Solaris 10 and OpenSolaris. It covers Virtual Memory, File systems, Zones, Resource Management, Process Rights etc (all the good stuff in S10). This book is about 1100 pages.

The TOC for this book is here

The second book is aimed at Administrators to learn about performance and debugging. It's basically the book to read to understand and learn DTrace, MDB and the Solaris Performance tools, and a methodology for performance observability and debugging. This book is about 550 pages.

The TOC for this book is here

We need your help to name the two books. The current proposals are:

  • Solaris Internals: Kernel Architecture for Solaris 10 and OpenSolaris
  • Solaris Performance and Tools: Performance Measurement and Debugging with DTrace and MDB

We would very much like to hear your thoughts on what you feel would be great titles and subtitles for the books, and we look forward to reading them!


Wednesday Jan 25, 2006

Previously confidential SPARC docs released via OpenSPARC

I see the OpenSPARC folks have opened up the specifications for the Niagara processor AND the new Hypervisor over at the OpenSPARC community website.


Tuesday Jan 17, 2006

Solaris Internals - 2nd Edition

It's coming. Really! Jim, the team and I think we are within a couple of weeks of finishing the writing phase. You can check the TOC here, and please do comment.

Solaris Internals, 2nd Edition


Tuesday Dec 06, 2005

Welcome to the CMT Era!

You've no doubt heard a lot of noise about a new chip from Sun code-named Niagara. It's Sun's first chip level multiprocessor, with 32 virtual CPUs (threads) on a single chip. But wait, isn't this just another product release on the roadmap? Heck no. This is the dawn of the CMT era, which I believe represents a significant shift in the way we build and deploy massive scale systems. The official name is UltraSPARC T1, but personally I like the code-name Niagara. Today, we released two systems around the Niagara chip, the T1000 and T2000.

I was convinced of the significance of CMT about two years ago by Rick Hetherington, Distinguished Engineer and architect of the Niagara-based system. I was working with an extreme-scale web provider here in the Bay Area, who rolls out thousands of web-facing servers. So many, in fact, that they had already concluded that server power consumption was responsible for up to 40% of the cost of running their data center, due to the relationship between power, AC, UPS, floorspace and infrastructure costs. I went in with an open mind, considering SPARC (at the time), commodity x86, and a range of low-power x86 options. Rick Hetherington and Kunle Olukotun (a founding architect of the chip) started sketching out how much throughput they would expect from their CMT design: eight 1.2GHz cores on a single 60 watt die, which was still being taped out at the time. Being a skeptic, I threw in some what-if questions comparing the throughput from some of the new breakaway x86 ultra-low-power cores, like the AMD Geode or VIA EPIAs, at about 1GHz and 10-20 watts. It turns out that they were right; while the Geodes and EPIAs were much more efficient than commodity x86, none of these options came close to the throughput per watt and cost per throughput delivered by a single die with many cores. Two years later, it seems so obvious to conclude that the more cores you put on a single die, the greater the savings in both cost and power; hence the tag-line "cool-threads". Check out Sim Datacenter, a downloadable power simulator for the datacenter.

I'm pleased today to be able to walk you through some of today's Niagara blog entries from the microprocessor, hardware, operating system and application performance teams. There are some great articles on all aspects of the technical details around Niagara:

A hearty congratulations to the whole team who brought this technology together. I've personally observed one of the most significant cross-company collaboration efforts ever -- this technology brought together teams from the microprocessor group, the Solaris kernel group, the JVM, compilers, and application experts all across the company over the past two years, with an enthusiasm level that's hard to put into words.

On a final note, there are two easter eggs: Oracle have just announced that they recognise Niagara as a 2-CPU system from a licensing perspective. And, we've Open Sourced SPARC!

We hope you enjoy exploring CMT and the new Niagara based servers. We'll be opening up a discussion forum shortly, to connect you directly with the developers and application performance experts who work with these systems. Stay Tuned!


Monday Dec 05, 2005

Welcome to the CMT Era!

Richard McDougall: Today is the release of the most exciting processor development in the last decade: UltraSPARC T1 - the first Chip Level Multithreading based system from Sun code-named Niagara. Today, you'll find an exciting set of discussions direct from the experts; discussing CMT processor principles, blazing application performance, and what all the buzz around "cool threads" is about. Check out my introductory story linking to all the discussions.

Friday Nov 11, 2005

Tuning for Maximum Sequential I/O Bandwidth

John asks the question "what does maxphys do, and how should I tune it?"

The maxphys parameter used to be the authoritative limit on the maximum I/O transfer size in Solaris. A large transfer size, if the device and I/O layer support it, generally provides better large I/O throughput, since disks like larger transfers if you are trying to get absolute maximum throughput. For a single disk, 64k is generally the point at which the maximum transfer rate occurs, and for disk arrays, typically 1MB.

Historically, maxphys was set to 56k, due to some older (VME?) bus maximum transfer limitations. The maxphys limit was increased to 128k on SPARC around the Solaris 2.6 timeframe. Since Solaris 7, the sd/ssd drivers (SCSI and Fibre Channel) override maxphys if the device supports tag queuing, up to a default of 1MB.

On x86/x64, the default is still 56k (we need to look at this!).

In summary, the defaults for SPARC with SCSI or Fibre Channel are optimal. You can always check by doing a large I/O test and observing the average I/O size:

# dd if=/dev/rdsk/c0t0d0s0 of=/dev/null bs=8192k

# iostat -xnc 3
 us sy wt id
 10  4  0 86
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   22.4    0.0 22910.7    0.0  0.0  1.0    0.1   44.6   0 100 c0t0d0

Here we can see that we are doing 22 reads per second at 22MB/s; thus the disks are performing optimal 1MB I/Os. Also, a simple set of tests with varying block sizes will help identify the best I/O size for your device.
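
The arithmetic behind that conclusion is just kr/s divided by r/s. As a small sketch in Python (the helper names are made up for illustration, and the parser assumes the exact column order shown in the header above):

```python
def parse_iostat_line(line):
    """Parse one 'iostat -xn' device line into (device, r/s, kr/s).

    Assumes the column order from the header shown above:
    r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
    """
    fields = line.split()
    return fields[-1], float(fields[0]), float(fields[2])


def average_io_kb(reads_per_sec, kb_read_per_sec):
    """Average read size in KB: kilobytes per second over reads per second."""
    if reads_per_sec <= 0:
        raise ValueError("no reads in this interval")
    return kb_read_per_sec / reads_per_sec


# The device line from the iostat sample above.
line = "   22.4    0.0 22910.7    0.0  0.0  1.0    0.1   44.6   0 100 c0t0d0"
device, rps, krps = parse_iostat_line(line)
print(f"{device}: {average_io_kb(rps, krps):.1f} KB per read")
```

For the sample above this works out to roughly 1023 KB per read, i.e. about 1MB.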

One more comment about max transfer size: this parameter (the driver's max transfer size) is read by UFS's newfs command, and used to set the cluster size of the file system, i.e. the number of contiguous blocks to read ahead or write behind. If you put a file system on a device and want to see large I/Os, you'll need to ensure that the maxcontig parameter also reflects the device's max transfer size. It can be tweaked afterwards with tunefs.

# mkfs -m /dev/dsk/c0t0d0s0
mkfs -F ufs -o nsect=248,ntrack=19,bsize=8192,fragsize=1024,cgsize=22,free=1,rps=120,nbpi=8155,opt=t,apc=0,gap=0,nrpos=8,maxcontig=128,mtb=n /dev/dsk/c0t0d0s0 18588840

Here you can see that maxcontig is 128 8k blocks, which is 1MB. If you tune maxphys, then reset the file system's max cluster size with tunefs afterwards, too.
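
If you want to check this on several file systems, the cluster size can be worked out from the -o option string that mkfs -m prints. A rough sketch in Python, with hypothetical helper names and the option string copied from the output above:

```python
def parse_mkfs_options(option_string):
    """Split a 'key=value,key=value' mkfs option string into a dict of strings."""
    opts = {}
    for item in option_string.split(","):
        key, _, value = item.partition("=")
        opts[key.strip()] = value.strip()
    return opts


def cluster_bytes(opts):
    """File system cluster size in bytes: maxcontig blocks of bsize bytes each."""
    return int(opts["maxcontig"]) * int(opts["bsize"])


# Option string from the mkfs -m output above.
options = ("nsect=248,ntrack=19,bsize=8192,fragsize=1024,cgsize=22,free=1,"
           "rps=120,nbpi=8155,opt=t,apc=0,gap=0,nrpos=8,maxcontig=128,mtb=n")
opts = parse_mkfs_options(options)
print(cluster_bytes(opts))  # 128 * 8192 = 1048576 bytes, i.e. 1MB
```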


Monday Oct 31, 2005

Update: Cheap Terabyte of NAS

It looks like there is now an NFS option for the Buffalo Terastation, and a community around the device:

Hacking the Terastation

CMT is coming: Is your application ready?

We're close to seeing some of the most exciting SPARC systems in over a decade. The new Niagara-based systems are the most aggressive CMT systems the industry has seen to date, with 32 threads in a single chip. A chip like this will be able to deliver the performance of up to 15 UltraSPARC processors while using less than one third of the power. This represents a compelling advantage not only in performance, but also in significantly reduced power, cooling and space.

Since even a single Niagara chip presents itself to software as a 32-processor system, the ability of system and application software to exploit multiple processors or threads simultaneously is becoming more important than ever. As CMT hardware progresses, the software is required to scale accordingly to fully exploit the parallelism of the chip.

Current efforts are delivering successful scaling results for key applications. Oracle, Sun Web Server and SAP are among many examples of applications that have already shown scalability that can fully exploit all the threads of a Niagara-based system.

To maximize the success of CMT systems we need renewed focus on application scalability. Many of the applications we migrate to CMT systems will have been developed on low end Linux systems; they may have never been tested on a higher end system.

The Association for Computing Machinery (ACM) is running a special feature on the impact of CMT on software this month. There are several relevant articles in this issue:

  • Kunle Olukotun, founder of Afara Websystems that pioneered what is now the Niagara processor, writes about the inevitable transition to CMT:
  • "the transition to CMPs is inevitable because past efforts to speed up processor architectures with techniques that do not modify the basic von Neumann computing model, such as pipelining and superscalar issue, are encountering hard limits. As a result, the microprocessor industry is leading the way to multicore architectures"

    Throughput computing is the first and most pressing area where CMPs are having an impact. This is because they can improve power/performance results right out of the box, without any software changes, thanks to the large numbers of independent threads that are available in these already multithreaded applications."


  • Luiz Barroso, principal engineer at Google, shows why CMT is the only viable economic solution to large datacenter scale-out:
  • "We can break down the TCO (total cost of ownership) of a large-scale computing cluster into four main components: price of the hardware, power (recurring and initial data-center investment), recurring data-center operations costs, and cost of the software infrastructure.

    ...And it gets worse. If performance per watt is to remain constant over the next few years, power costs could easily overtake hardware costs, possibly by a large margin."


  • Richard McDougall, Performance and Availability Engineering, on the fundamentals of software scaling:
  • "We need to consider the effects of the change in the degree of scaling on the way we architect applications, on which operating system we choose, and on the techniques we use to deploy applications - even at the low end."


  • Herb Sutter, Architect at Microsoft, writes about changes to programming languages which could exploit parallelism:
  • "But concurrency is hard. Not only are today's languages and tools inadequate to transform applications into parallel programs, but also it is difficult to find parallelism in mainstream applications, and - worst of all - concurrency requires programmers to think in a way humans find difficult.

    Nevertheless, multicore machines are the future, and we must figure out how to program them. The rest of this article delves into some of the reasons why it is hard, and some possible directions for solutions."


In addition to the ACM Queue articles, there was a recent NetTalk on Scaling My Apps, featuring Bryan Cantrill. There will be a follow-up experts exchange on this topic, where customers can live chat with the technical scaling experts. Also, look for a new whitepaper on scaling applications for CMT, from Denis Sheahan of the Niagara architecture group.



