Sunday Sep 24, 2006

Virtualization in Paris

Last week I presented on server consolidation and virtualization technologies in Paris, at Sun's SunUP-Network conference. We had a good turnout, about 100 customers from across Europe. There was tremendous interest in virtualization, and some of the customers are quite a long way down the path of deploying it.

I presented on OS virtualization versus hardware virtualization, and talked about the differences between the two, including some of the performance studies we have been doing.

Some of the interesting tid-bits from the discussions:

  • Over 90% of the attendees were considering or currently deploying virtualization technologies
  • Some are already using VMware -- all but one were using it for Windows consolidation, consolidating many small, older servers to increase server utilization. We're doing lots of exciting work with VMware on our new Galaxy servers, including the X4600
  • Many customers are already planning to use the upcoming logical domaining capability of the T2000 SPARC systems to consolidate many small SPARC systems
  • Zones are increasingly being used as a consolidation technology, given that they carry the lightest overhead of all the solutions (the lowest virtualization performance overhead and minimal OS administration requirements). One financial customer has standardized on Containers/Zones for all new application deployments, and has completed internal standards and training for new deployments. This will allow them to template their provisioning strategy, and make it easy to migrate applications between servers.

All in all a very interesting meeting and set of discussions!

Wednesday Jan 25, 2006

Previously confidential SPARC docs released via OpenSPARC

I see the OpenSPARC folks have opened up the specifications for the Niagara processor AND the new Hypervisor over at the OpenSPARC community website.

Tuesday Jan 17, 2006

Solaris Internals - 2nd Edition

It's coming. Really! Jim, the team, and I think we are within a couple of weeks of finishing the writing phase. You can check the TOC here, and please do comment.

Solaris Internals, 2nd Edition

Tuesday Dec 06, 2005

Welcome to the CMT Era!

You've no doubt heard a lot of noise about a new chip from Sun code-named Niagara. It's Sun's first chip level multiprocessor, with 32 virtual CPUs (threads) on a single chip. But wait, isn't this just another product release on the roadmap? Heck no. This is the dawn of the CMT era, which I believe represents a significant shift in the way we build and deploy massive scale systems. The official name is UltraSPARC T1, but personally I like the code-name Niagara. Today, we released two systems around the Niagara chip, the T1000 and T2000.

I was convinced of the significance of CMT about two years ago by Rick Hetherington, Distinguished Engineer and architect of the Niagara-based systems. I was working with an extreme-scale web provider here in the Bay Area, who rolls out thousands of web-facing servers. So many, in fact, that they had already concluded that server power consumption was responsible for up to 40% of the cost of running their data center, due to the relationship between power, AC, UPS, floorspace and infrastructure costs. I went in with an open mind, considering SPARC (at the time), commodity x86, and a range of low-power x86 options. Rick Hetherington and Kunle Olukotun (a founding architect of the chip) started sketching out how much throughput they would expect from their CMT design -- eight 1.2GHz cores on a single 60-watt die, which was still being taped out at the time. Being a skeptic, I threw in some what-if questions comparing the throughput from some of the new breakaway ultra-low-power x86 cores, like the AMD Geode or VIA EPIAs -- about 1GHz at 10-20 watts. It turns out that they were right; while the Geodes and EPIAs were much more efficient than commodity x86, none of these options came close to the throughput per watt and cost per throughput delivered by a single die with many cores. Two years later, it seems so obvious to conclude that the more cores you put on a single die, the greater the savings in both cost and power -- hence the tag-line "cool-threads". Check out Sim Datacenter, a downloadable power simulator for the datacenter.
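The back-of-envelope math behind that conclusion can be sketched as follows. The figures are the rough numbers quoted above, and aggregate clock rate (cores x GHz) is used as a crude stand-in for real throughput, which of course depends heavily on the workload:

```python
# Back-of-envelope comparison of throughput per watt, using the rough
# figures quoted in the text. Aggregate clock rate is a crude proxy
# for throughput, not a measured result.

def ghz_per_watt(cores, ghz_per_core, watts):
    """Aggregate GHz delivered per watt of quoted chip power."""
    return cores * ghz_per_core / watts

# Niagara: eight 1.2GHz cores on a ~60 watt die (figures from the text)
niagara = ghz_per_watt(cores=8, ghz_per_core=1.2, watts=60)

# An ultra-low-power x86 core: ~1GHz at roughly 15 watts (midpoint of
# the 10-20 watt range quoted above)
low_power_x86 = ghz_per_watt(cores=1, ghz_per_core=1.0, watts=15)

print(f"Niagara: {niagara:.3f} GHz/W")
print(f"Low-power x86: {low_power_x86:.3f} GHz/W")
print(f"Advantage: {niagara / low_power_x86:.1f}x")
```

Even with generous assumptions for the low-power cores, the many-cores-per-die design comes out well ahead on this crude metric, before counting the per-server packaging, board, and infrastructure costs that a single die avoids.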

I'm pleased to be able to walk you through some of today's Niagara blog entries from the microprocessor, hardware, operating system and application performance teams. There are some great articles on all aspects of the technical details around Niagara.

A hearty congratulations to the whole team who brought this technology together. I've personally observed one of the most significant cross-company collaboration efforts ever -- this technology brought together teams from the microprocessor group, the Solaris kernel group, the JVM, compilers, and application experts all across the company over the past two years, with an enthusiasm level that's hard to put words to.

On a final note, there are two easter eggs: Oracle has just announced that it recognizes Niagara as a 2-CPU system from a licensing perspective. And we've open-sourced SPARC!

We hope you enjoy exploring CMT and the new Niagara based servers. We'll be opening up a discussion forum shortly, to connect you directly with the developers and application performance experts who work with these systems. Stay Tuned!

Monday Dec 05, 2005

Welcome to the CMT Era!

Richard McDougall: Today marks the release of the most exciting processor development in the last decade: UltraSPARC T1, the first chip multithreading (CMT) based system from Sun, code-named Niagara. Today you'll find an exciting set of discussions direct from the experts, covering CMT processor principles, blazing application performance, and what all the buzz around "cool threads" is about. Check out my introductory story linking to all the discussions.

Monday Oct 31, 2005

Update: Cheap Terabyte of NAS

It looks like there is now an NFS option for the Buffalo TeraStation, and a community around the device:

Hacking the Terastation

Friday Jul 15, 2005

New look for Solaris Internals Website

Well, one thing that having a blog with any significant technical content forces you to do is understand cascading style sheets. As a result, I've also given Solaris Internals a CSS makeover with our OpenSolaris theme.

I found a picture that my good friend Bill Walker took of our two cars parked together in the Grand Canyon, and included that too.

Technorati Tag: OpenSolaris

Technorati Tag: Solaris

Monday Jul 11, 2005

A terabyte of NAS for $800!

I just noticed these little devices at Fry's: the TeraStation

While the target market is no doubt CIFS/Windows, it does claim to support NFS (Appletalk and FTP, too).

It has 4 x 250GB drives in a RAID configuration, and gigabit ethernet as the primary transport.

Looks pretty cool on the surface, I wonder if it has a decent NFS implementation?

Wednesday Apr 13, 2005

Commentary about IP Storage Directions

It's increasingly clear that storage is rapidly being recast as a network data service. If implemented correctly, we can make our servers stateless nodes on an IP network, accessing their storage with the same ease with which we access HTML content on a web server. In this entry, I would like to open for discussion the emerging shift to network attached storage (NAS). I believe the emergence of NAS is being facilitated by two major transitions: from block-level device access to file-level network data services, and the merging of storage and IP networks.

The summary of this blog is:

    - Ethernet bandwidth is growing faster than traditional storage interconnect bandwidth

    - Ethernet is cheap -- less than $10 per port at commodity volumes

    - Ten Gigabit Ethernet is available today, and we are measuring >800MBytes/s per link with the Solaris 10 FireEngine network stack

    - iSCSI growth is exponential, and even in early form it's demonstrating >500MBytes/s per link

    - NFS single-stream bandwidth in Solaris 10 is now over 200MBytes/s (a 5x increase in the last two years)

    - NFS in Solaris 10 is now data center ready, and can be used for commercial applications

Here's a link if you want to skip straight to the performance numbers.

Storage area Network or Network Attached Storage

A Storage Area Network (SAN) is defined as a fabric of interconnected block-level devices -- disks or virtualized disks. A SAN is merely an evolution of disk interconnects, which have been extended so that they may be connected via specialized hubs and/or switches. A practical implementation of a SAN consists of hosts connected via Fibre Channel to Fibre Channel switches (pictorially, a storage cloud) to storage appliances with Fibre Channel target adapters. In a SAN, we virtualize the disk -- we still use a file system on each host computer, and often a volume manager to aggregate multiple disks into one that can be used by the file system (a way of aggregating storage across multiple channels or storage servers).

Network Attached Storage (NAS) is a term used to describe data services connected with networking protocols -- e.g. the Network File System (NFS) or Microsoft's Common Internet File System (CIFS). NAS storage is typically configured using fast networks, i.e. gigabit Ethernet. Access to the data is typically via a network file system client built into the operating system. For example, the NFS client is provided as part of both Linux and Solaris, allowing heterogeneous access to data residing on NFS servers.

NAS storage is much easier to administer -- connecting a new client system to preconfigured NAS storage is typically just one command on a Unix system (or a point and click on a desktop OS). Unlike a SAN, no special device administration is required on the client, since the data is accessed as a network service at the file level.

Interestingly, we're at a pivotal point for SANs and storage interconnects. SANs are attempting to evolve into networks. Historically SANs have offered higher performance than NAS, forcing NAS to be used only in environments with low performance requirements -- e.g. web servers or for desktop user files/directories. However, today's commodity networking speeds are allowing NAS to grow into the heart of the commercial application space, and take a prominent place in our data centers.

A transitional step along the way is iSCSI, a new mechanism for running a SAN topology across a commodity network. iSCSI allows us to configure raw block devices on the client machine that connect to a SAN -- presenting themselves on the client in exactly the same way a Fibre Channel HBA does (as a raw block device). It still requires device administration, a volume manager and a file system on the client. Why is this interesting, given that NAS offers easier administration than its block/SAN counterpart? Primarily as a transition vehicle, allowing cost-effective access to an existing corporate SAN.

Transports and Interconnects

Storage interconnect technologies have historically been driven primarily by performance characteristics and cost, with bandwidth being the most important. Over the last decade, the focus has additionally emphasized the ability to soft-cable the storage, allowing separation of the host computer from the storage device. If we look at storage interconnect milestones over the last 20 years, we can see a transition in cost/performance occurring: rather than being the slower counterpart, networks are becoming faster and cheaper than storage-specific interconnects. Let's take a look at the bandwidth available from both network and storage interconnects from 1989 through 2005:

In 1989, SPARCservers used 5MB/s SCSI to connect to disks, and 1MB/s (10Mbit) Ethernet -- storage had 5x the bandwidth of the network port. Today, we have Ethernet at 10Gbits (1GByte/s) and storage at 200MB/s (Fibre Channel).
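That reversal is worth stating explicitly; a quick sketch using the figures above:

```python
# The storage-vs-network bandwidth ratio flip, using the figures
# quoted in the text (all values in MB/s).
scsi_1989, ethernet_1989 = 5, 1        # SCSI vs 10Mbit Ethernet
fc_2005, ethernet_2005 = 200, 1000     # Fibre Channel vs 10 Gigabit Ethernet

print(f"1989: storage had {scsi_1989 / ethernet_1989:.0f}x the network's bandwidth")
print(f"2005: the network has {ethernet_2005 / fc_2005:.0f}x storage's bandwidth")
```

The 5x advantage has swapped sides entirely: the network port is now five times faster than the common storage interconnect.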

In addition to bandwidth, we are also seeing storage interconnects tackle many of the classic networking problems -- the ability to use longer cables, the ability to connect multiple systems via hubs and switches, and the build-out of a virtual network cloud. One could argue that the storage interconnect industry is busy re-solving many of the problems its IP networking counterpart has already solved.

SCSI performance grew from the original 1MByte/s to over 160MBytes/s. While parallel SCSI offers quite good performance, it is seriously distance-limited, and supports only a small number of devices on one connection. Fibre Channel (FC) was introduced as a practical, inexpensive mechanism for connecting systems using fibre optic cables.

Fibre channel is simply another form of networking -- it uses a serial physical link, which can carry higher level protocols such as ATM or IP, but is primarily used to transport SCSI from hosts to storage.

The Fibre Channel Protocol (FCP) serializes SCSI over Fibre Channel. Using FCP, storage could be connected at speeds of 25MBytes/s (when it was first introduced) over distances up to 30 meters. Shortly after its introduction, FC was increased to 100MBytes/s, which is the most common deployment today. The simplest form of FC is an arbitrated loop (FC-AL), in which a series of disks can be connected to a hub. However, some of the real benefits of FC storage come from its ability to be switched like a network. This is, of course, the basis for our storage area networks (SANs) today.

Ethernet is Commodity - Let the tides turn

Technically superior solutions continue to leapfrog Ethernet -- historically ATM and Fibre Channel, and more recently InfiniBand. All offer significantly more bandwidth than Ethernet at the same point in time, and we could easily predict Ethernet's demise. Nonetheless, as usual, conventional wisdom is wrong. Ethernet is quietly preparing for a new era of unified interconnects.

Gilder is quoted saying: "The reason Ethernet prevailed in the first place is that it was incredibly simple and elegant and robust. In other words, it is cheap and simple for the user. Customers can preserve their installed base of equipment while the network companies innovate with new transmission media. When the network moves to new kinds of copper wires or from one mode of fiber optics to another, Ethernet still looks essentially the same to the computers attached to it. Most of the processing - connecting the user to the network, sensing a carrier frequency on the wire and detecting collisions - can be done on one Ethernet controller chip that costs a few dollars."

Recall that we started with SCSI being much faster than the network: 5MBytes/s versus Ethernet's 1MB/s. Just a few years ago, both were at 100MBytes/s (gigabit), and now Ethernet is deployable at 1GByte/s (10 Gigabit). Today, adding a server to a SAN costs at least $1,800 in switch port and host bus adapter (HBA) costs alone. Unlike switch port pricing, which has dropped to well under $1,000/port, HBA prices have remained steady. The Fibre Channel Industry Association reports that in 2004 the number of Fibre Channel ports topped 10 million. A significant number, but Ethernet volume dwarfs it: an estimated one billion ports today and a $4B market, with growth expected to continue a further ten times every three years.

Volume drives Ethernet per-port costs down -- to less than ten dollars per port. Gigabit Ethernet is between $5 and $10 today. Sure, 10 Gigabit Ethernet is still around $4k per port, but history suggests it will drop to under $100 within the next 12-18 months.

Shared Access, Centralized Admin

"While we may end up using coaxial cable trees to carry our broadcast transmissions, it seems wise to talk in terms of an ether, rather than `the cable'...." Robert Metcalfe

Metcalfe's Law states that the usefulness, or utility, of a network equals the square of the number of users. The same is true of storage networks -- if only one person or host can access data, its usefulness is limited. Imagine a campus-wide file server that held data for all employees at that location, but from which you could access your data from only one client system...
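The asymmetry Metcalfe's Law describes is easy to see in a few lines: utility grows quadratically with participants, while port costs grow roughly linearly, so shared access wins more decisively the larger the network gets:

```python
# Metcalfe's Law as stated above: the utility of a network grows as
# the square of the number of users, while cost grows roughly
# linearly with the number of ports.
def utility(n):
    return n * n

for clients in (1, 10, 100, 1000):
    print(f"{clients:>5} clients -> relative utility {utility(clients):>8}")
```

A file server reachable from one client has a utility of 1 on this scale; open the same data to a thousand clients and the relative utility is a million, for a thousandfold increase in port cost.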

The "SunOS Netdisk" (introduced in 1983, and short-lived) attempted to solve this problem; it allowed a disk to be accessed across a network by transferring disk blocks over the wire, just like an extension of SCSI. However, given the network bandwidth available at the time, it was very slow, and since it merely exported disk blocks across the network, all it did was separate the host from the storage -- the client-side semantics were exactly the same as with a local disk. Just one client host with a single file system could access each "network disk"; there was no semantic sharing. Imagine a campus with 5000 desktop systems -- the storage administrator would need to format and administer 5000 separate UFS file systems!

A better answer is the concept of a network file system -- if we share "files" rather than blocks, we can treat storage as a network data service, which can then be shared by multiple users. A single file system can be administered centrally and accessed by thousands of users: just one storage admin, one storage pool to administer, and no storage administration at the client. The ability to separate and delegate administration, and to share with thousands of clients, are two of the key reasons network file systems have become ubiquitous as the way to access and manage data for workgroup users. The Sun Network File System (NFS) and Windows File Sharing (SMB/CIFS) are standard practice in such scenarios.

So what happened in our datacenters? We've been steadily evolving storage networks into something that approximates some of the services of a network file system -- SCSI disks became "shareable" via fabrics of Fibre Channel. Storage has become somewhat separated from the client; at least the volume management and data recovery parts have moved mostly to the backend as a virtualized network disk. However, only part of the storage administration has been centralized -- in this paradigm we still rely on file systems in the client host systems, and more often than not we still use a volume manager on the client.

How did workgroup data management and datacenter storage diverge so much? Performance and evolution are the most significant factors.

IP Storage

The commoditization of Ethernet, with bandwidth in the hundreds of megabytes per second, is making storage over IP viable in many places from which it was previously excluded. We are seeing two key realizations: storage networks constructed from commodity IP networks and switches, and the rising suitability of network file systems for commercial applications. Price/performance is driving IP storage networks.

Our data centers are moving to grid-based architectures, and as a result data needs to be accessed from many (sometimes thousands of) nodes. As a direct result, shared access semantics together with complexity reduction are driving network file systems rapidly into this space. In this model, servers become diskless clients of a networked data service. For example, we can easily provision a system as an Oracle database server simply by mounting all of its configuration state and data files over NFS. "Data files?" I hear someone screaming. Yes, today's network file systems are more than capable of running many database configurations over gigabit networks.

A new Netdisk: iSCSI

Price/performance alone is giving rise to iSCSI. Connecting a host system to a SAN is expensive, often over one thousand dollars, so the SAN connection alone can now exceed the cost of the server, especially at the low end. iSCSI offers a cheap method of connecting systems into a SAN. By running an iSCSI client, a LUN on a storage server can be accessed as if it were connected by a Fibre Channel adapter. The SAN connection comes virtually for free, given that most systems have gigabit Ethernet as a standard option.

Sharing blocks over the network offers little improvement in the way we access data; it merely provides a cheap per-port cost to access a SAN. As a result, iSCSI is today primarily considered at the low end as a gateway into the corporate SAN.

However, iSCSI is a classic disruptive technology. IDC's forecast expects a surge in use of Ethernet switch ports with iSCSI, and predicts that the total market for iSCSI-based disk arrays will grow from $216 million in 2003 to $4.9 billion in 2007. They also predict that iSCSI-based switches used in SAN implementations will grow from some 18,000 total ports in 2003 to 6.94 million 1GbE-equivalent ports in 2007.
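Those forecast figures imply a striking growth rate. A quick back-of-envelope calculation (my arithmetic on the quoted numbers, not part of IDC's report):

```python
# Implied annual growth rate of the IDC iSCSI disk-array forecast
# quoted above: $216M in 2003 growing to $4.9B in 2007.
start, end, years = 216e6, 4.9e9, 4
cagr = (end / start) ** (1 / years) - 1
print(f"Implied growth rate: {cagr:.0%} per year")  # roughly 118%
```

In other words, the forecast has the market more than doubling every year for four years running -- the signature of a disruptive technology crossing over from niche to mainstream.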

What is iSCSI?

An iSCSI network consists of an iSCSI initiator and an iSCSI target. An iSCSI initiator is the client of an iSCSI network; it connects to an iSCSI target over an IP network using the iSCSI protocol. As with SCSI, logical units have Logical Unit Numbers (LUNs), where the numbering is per target. The same LU may have different LUNs at different targets.

The iSCSI protocol is implemented beneath SCSI and atop TCP/IP. Multiple TCP connections are allowed per session, and can optionally be used to exploit parallelism and provide error recovery. In addition to SCSI commands, there are commands to login (connect a TCP session) and logout (TCP session teardown). iSCSI naming is similar to SCSI -- names are globally unique, using either a worldwide number or a reversed DNS name, with the DNS system acting as a naming authority. iSCSI targets may share a common IP address and port, and initiators/targets may have multiple IP addresses.

There are two basic types of discovery: static configuration and a naming service. The static configuration defines an IP address and port for each target. The naming service allows consolidation of iSCSI target information into a network service.
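The two discovery styles can be sketched in a few lines of Python. Note that the target names, addresses, and the NamingService class here are hypothetical stand-ins for illustration, not a real initiator API:

```python
# Hypothetical sketch of the two iSCSI discovery styles described
# above. IQN names, addresses, and NamingService are illustrative
# stand-ins, not a real initiator interface.

# Static configuration: each target's IP address and port is listed
# explicitly on the initiator.
static_config = {
    "iqn.2005-04.com.example:array1": ("192.168.10.5", 3260),
    "iqn.2005-04.com.example:array2": ("192.168.10.6", 3260),
}

class NamingService:
    """Stand-in for a discovery service: targets register themselves,
    and initiators ask one place for the full list."""
    def __init__(self):
        self._targets = {}

    def register(self, iqn, addr):
        self._targets[iqn] = addr

    def discover(self):
        return dict(self._targets)

svc = NamingService()
for iqn, addr in static_config.items():
    svc.register(iqn, addr)

print(svc.discover())
```

The practical difference is administrative: with static configuration every initiator must be updated when a target moves, while a naming service consolidates target information into one network service that initiators query.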

iSCSI as a SAN Gateway

The most common deployment option for iSCSI is that of a SAN gateway. Using a FC to iSCSI gateway, a host can be connected to the SAN storage via ethernet. Typically, a dedicated or VLAN'ed network will be used for the iSCSI portion, to provide an additional level of security and/or performance partitioning.

Here's the common roll-out form for iSCSI: an Ethernet network connects all of the hosts together, and to an FC-to-iSCSI gateway. The FC gateway connects to the SAN in the same way as a regular FC host, but allows multiple iSCSI connections to pass through a single FC port.

iSCSI Performance

We're observing interesting performance with the iSCSI initiator on Solaris. Because it utilizes the Solaris FireEngine networking stack, we're able to leverage its bandwidth features to move data efficiently. We've just seen the iSCSI initiator provide an impressive 120MBytes/s (wire speed) over gigabit Ethernet, and over half a gigabyte per second over 10 Gigabit Ethernet. If we ever thought performance was going to be a significant show-stopper, those arguments are looking pretty bleak.

So what about NAS?

Performance is one of the most significant drivers of current storage architectures. If we look at the bandwidth and IOPS characteristics, we can draw a map of workload segmentation.

Workloads such as TPC-C (an artificial transaction workload) are dominated by thousands of small I/Os (~8k). A large 72-CPU config might perform ~30k IOPS, but its bandwidth requirements are less than 100MB/s. On the other hand, a TPC-H (an artificial decision support workload) might perform only a few thousand IOPS, but requires several GB/s, even for systems with only 24 CPUs.

The I/O requirements of typical commercial applications fall at the bottom left of such a map; they have relatively low IOPS (less than 15k) and bandwidth requirements (less than 100MB/s). A moderately configured storage system and file system typically provides adequate throughput, and our focus is more on parallelism and CPU efficiency at the client.

Many scientific applications are bandwidth hungry. These applications often require bandwidth in the hundreds of megabytes, or sometimes several gigabytes, per second. These workloads typically have only a few streams of execution, i.e. low concurrency. These are the types of applications which today still exceed the limits of gigabit IP storage.

In the context of IP storage, this helps position why many commercial workloads obtain "good enough" performance, even over gigabit ethernet.
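The segmentation above can be reduced to a rule of thumb. This is a sketch using the illustrative thresholds from the text, not a sizing tool:

```python
# Rule-of-thumb classifier for the workload map described above.
# The 15k IOPS and 100MB/s thresholds are the illustrative figures
# from the text, not hard limits.

def fits_gigabit_nas(iops, mb_per_sec):
    """True if a workload sits in the low-IOPS, low-bandwidth region
    where gigabit IP storage is typically "good enough"."""
    return iops < 15_000 and mb_per_sec < 100

print(fits_gigabit_nas(iops=10_000, mb_per_sec=80))    # typical commercial app
print(fits_gigabit_nas(iops=2_000, mb_per_sec=2_000))  # TPC-H-style scan
print(fits_gigabit_nas(iops=30_000, mb_per_sec=90))    # TPC-C-scale OLTP
```

Only the first example lands inside the region; the bandwidth-hungry scan and the extreme OLTP case each exceed one of the two thresholds, which is exactly where fatter pipes or specialized interconnects still earn their keep.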

NFS for Bandwidth Intensive Applications

The Solaris implementation of NFS has been significantly optimized for bandwidth over the last three releases of Solaris. Just a few years ago, it wouldn't be uncommon to measure 20-30MB/s maximum for large streaming file I/O. NFS optimizations together with the Solaris FireEngine networking stack provide substantial improvements in NFS performance. At the time of writing, the Solaris 10 NFSv4 client can perform streaming I/O at approximately 120MBytes/s on a gigabit link (wire speed) and 250MBytes/s over a 10 Gigabit link. For now, Fibre Channel offers higher bandwidth potential than NFS, but further optimizations are in the pipeline :-)

So, NFS can be used for Commercial Applications?

In a nutshell, yes! It used to be true that it was rare and risky to attempt to put mission critical, latency sensitive commercial applications on NFS. Today I often see customers who have live OLTP applications deployed on gigabit ethernet NAS.

Why is it now possible? A couple of years ago, we asked "what is the performance for commercial apps over NAS?" Our engineering team took a large SMP machine, configured Oracle across NFS on gigabit Ethernet, and compared it to local file systems. Not surprisingly, the performance was terrible. However, after peeling back the layers, it quickly became apparent that this was a fixable problem. About 12 months later, we had Oracle over NFS running at the same throughput as local SAN-connected disks!

It turns out that there wasn't anything wrong with Ethernet, IP, or the NFS protocol -- rather, a few key optimizations in the Solaris NFS client could easily provide database I/O semantics, which work just fine across gigabit Ethernet. The latencies were actually better in the NFS case with the new Solaris client. These optimizations are part of the new Solaris 10 NFS clients (both V3 and V4), and some of the changes were also made available in later versions of Solaris 9.

For those who like specifics, I'll include some numbers: our Oracle system running an OLTP benchmark delivered equivalent throughput at about 50,000 transactions per minute, with slightly better response times: the transaction response times for both the gigabit Ethernet case and the local disk case were ~0.3ms. The NFS system actually showed faster I/O response time -- iostat reported 7.39ms for the UFS file system, and 7.19ms for the NFS case! The NFS system did, however, use slightly more host CPU (approx 20%), but this was with a heavy-I/O database benchmark -- in reality the overhead is likely to be lower and virtually unnoticeable on a real application.
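Reducing those iostat figures to a percentage makes the comparison concrete (simple arithmetic on the numbers above):

```python
# The iostat service times quoted above, reduced to a percentage
# difference between the local UFS and NFS cases.
ufs_ms, nfs_ms = 7.39, 7.19
improvement = (ufs_ms - nfs_ms) / ufs_ms
print(f"NFS I/O response time was {improvement:.1%} lower than UFS")  # ~2.7%
```

A 2.7% edge is well within the noise for most deployments; the point is not that NFS is faster, but that it is no longer slower.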

Are you saying NFS is ready for all commercial applications?

While many apps are viable, we still need to look carefully at the application scenario and how it's being deployed. For applications which do small I/Os but are latency sensitive, NFS is now a strong candidate. Common sense prevails for the interconnect, however -- if you try to run your database over 10Mbit Ethernet, you'll see exactly the performance you would expect. A dedicated gigabit Ethernet is a good starting point.

That's all for this blog essay. We now have quite a bit of performance data comparing local file systems to NFS for a variety of different commercial workloads -- which I will begin to summarize in a later discussion.

Wednesday Apr 06, 2005

Entering the blogosphere, an Introductory Blog

OK, so I've been convinced by numerous people that I should start a blog. Today is the first, and I hope to push out many of the performance tools and tips that we often destine for papers through this vehicle. I'll also try to capture my thoughts (and opinions) on some technical topics here; the first will likely be on IP storage.

I work within the performance engineering group (officially known as "Performance and Availability Engineering"), and have been working on various aspects of system performance. I moved to Menlo Park in 1998, after working for Sun engineering remotely from Australia as part of a high end systems group with Brian Wong and Adrian Cockroft. I've spent most of my time working on operating system performance and workload management -- and have enjoyed studies in the areas of virtual memory and file system performance.

Jim Mauro and I published Solaris Internals in 2000, and we are working aggressively on a new edition for Solaris 10 -- leveraging DTrace for performance observability will be one of the main focuses of the new edition, which is targeted for this summer. Most recently, I've been working with a team on Solaris improvements for high-end systems, looking at OS implications of CMT processors, and, in my spare time, looking at ways to characterize and improve file system performance.

Applied Performance Engineering

So what do we do? Our team focuses on characterization and optimization for performance and RAS. The group's charter encompasses developing workloads for performance measurement, characterizing performance, and identifying opportunities for improvement. We cover Solaris, Opteron/SPARC hardware, and key ISV applications (like Oracle, SAP, and BEA) as part of product development.

We work closely with customers and engineering, creating a link between the two. It's extremely important that we design systems close to how they will be used. We partner with sister performance groups, such as the Strategic Application Engineering group (who are responsible for the majority of the benchmark publications), and the Performance Pit, who run extensive performance testing as part of the performance lifecycle.

Workload Characterization

Capturing data about systems deployment is key to knowing what to optimize. We use a variety of methods to do this -- a toolset known as WCP (workload characterization for performance -- Sun-only link) collects the most relevant data from real customer applications into a database, allowing detailed mining of many aspects of systems performance. This data comes from applications tested in the Sun benchmark centers, and a large sample from live customer applications.

In addition, extensive trace data is collected from the key "benchmark" workloads, like SJAppserver and SPECweb, to run simulations.

Workload Creation

Once an application has been characterized, it can be decomposed into a representative benchmark. Some of the more recent public workloads we've been responsible for are SPEC jAppServer2004 and SPEC WEB2005.

Other than the formal workloads, there are a variety of in-house developed workloads that fit closely with customer applications: for example, we have a large mockup datacenter application (OLTP-NW) running on an F25k to simulate how our Starcats are used. Others include XMLmark for XML parsing, FileBench, which emulates a large list of applications for file system measurement, and libMicro, which characterizes an operating system at the system-call level.

Performance Optimization and Prototyping

Performance work starts early in the development process of our products, to identify performance opportunities early enough to make changes. By improving the operating system and applications we can ensure the applications run at their full potential of the platform. Areas of expertise are networking (network stack and drivers), compilers, JVM+Application Server, filesystems, NFS, and operating system internals.

As an example of some of the customer-connected work, we looked closely at databases on file systems. It was clear that for benchmarking purposes, databases run much faster on raw disks than on file systems. However, we rarely use raw disks in a production environment, because they make for complex administration. Prior to analysis, this was put down to CPU overheads in the file system. After deeper analysis with the Oracle database, we discovered it was mostly a result of esoteric interactions with the database's use of synchronous operations (O_DSYNC), used to guarantee writes to stable storage for data integrity. Once resolved, file system performance for databases went from being 5x slower than raw disks to within a few percent -- in the noise.

Of course it goes without saying that we love DTrace! We use DTrace extensively -- prior to the existence of DTrace, we had to write custom kernels or tools to instrument the layers we were interested in. We can now ask arbitrary questions at any time, and zero in on exactly where to optimize. It's hard to describe with words just how much DTrace helps us. We are in the process of making all of our DTrace scripts and tools externally available -- I'll post more about this in a followup.

Knowledge Management

It's important to leverage the performance knowledge, so that we can all understand how to better configure and tune our systems. We distill the actionable performance information into forms that can be used by Sun field engineers and our customers. We make this information available through various mechanisms, including books, TOIs at conferences, and papers (and now blogs :-)). Allan Packer has a wealth of knowledge on database tuning and capacity planning -- he captured a lot of it in his book on Configuring and Tuning Databases. In the future, we will be communicating more of the timely information through this medium as well.

Conferences and Introductions

There are two conferences coming up: SuPerG and Usenix. Some of the PAE'ers are presenting there: Phil Harman is doing his famous DTrace talk and live demo, Bob Sneed is talking about Starcat performance, Biswadeep Nag will talk about optimizing Oracle RAC, Roch Bourbonnais is talking about NFS optimization, Richard Elling is talking about benchmarking for availability, and I will be presenting on IP Storage. Jim and I are also doing a two-day Solaris 10 Performance and DTrace tutorial at Usenix this month -- you can access the updated slides at

Stay tuned for followups, we plan on pushing out additional material in the future.




