Wednesday Aug 28, 2013

Threads, as far as the eye can see...

Recently, I contributed to a new white paper that addresses the question "ok... now I have a computer with over 1,000 hardware threads... what do I do with all of those threads?" The topics include details of the newest SPARC S3 core, workload consolidation and server virtualization, multi-threaded programming, and more.

Oracle published the paper, which you can find here:

My personal thanks go to Dr. Foxwell, who filled the role of engineer-herder for this project, and also to the other co-authors: Ruud, JeffS, and Darryl.

Wednesday Aug 17, 2011

Oracle Virtualization Launch

On August 23, Oracle will host a virtualization launch.

You can attend in person, in Redwood Shores, CA, or via webcast:

Wednesday Dec 01, 2010

SPARC and Solaris Webcast

Today Oracle will webcast several announcements and roadmaps, including:

  • New SPARC cluster - including world record database performance
  • Oracle Exalogic Elastic Cloud powered by the SPARC T3
  • Sun SPARC Enterprise Servers, M-series, with SPARC64 VII+ processors
  • Oracle Solaris 11

For details and webcast registration, see:

Wednesday Mar 04, 2009

AMD Names Fab Unit

The newest major chip manufacturing corporation finally has a name: GlobalFoundries. The next step is to build its new fab.

AMD has been slowly spinning off its chip fab business for the past few years, and in the process is building a new US$4.2 billion fab in the northeast USA - near Albany, New York. Cash-strapped AMD was only able to do this via a joint venture with an investment fund headquartered in Abu Dhabi.

More details are available at eWeek, PCWorld, and the major newspaper in Albany, the Times Union.

Friday Sep 05, 2008

got (enough) memory?

DBAs are in for a rude awakening.

A database runs most efficiently when all of the data is held in RAM. Insufficient RAM causes some data to be sent to a disk drive for later retrieval. This process, called 'paging,' can have a huge performance impact. This can be shown numerically by comparing the time to retrieve data from disk (about 10,000,000 nanoseconds) to the access time for RAM (about 20 ns).
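
To put those two numbers side by side, here is a minimal Python sketch. The latencies are the rough figures quoted above. The constant names and the effective_access_ns helper are my own illustration, and the model (a simple weighted average of disk and RAM access times) is an assumption, not a measurement.

```python
DISK_NS = 10_000_000  # approximate time to retrieve data from disk (from the text)
RAM_NS = 20           # approximate RAM access time (from the text)


def effective_access_ns(disk_fraction: float) -> float:
    """Average access time if `disk_fraction` of accesses go to disk.

    A simple weighted average; a hypothetical model for illustration only.
    """
    return disk_fraction * DISK_NS + (1.0 - disk_fraction) * RAM_NS


# How many times slower one disk access is than one RAM access.
ratio = DISK_NS // RAM_NS
print(f"disk is {ratio:,}x slower than RAM")

for fraction in (0.0, 0.001, 0.01):
    average = effective_access_ns(fraction)
    print(f"{fraction:.1%} of accesses from disk -> {average:,.0f} ns average")
```

Under this simple model, sending even one access in a thousand to disk raises the average access time by a factor of roughly 500.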

Databases are the backbone of most Internet services. If a database does not perform well, no amount of improvement of the web servers or application servers will achieve good performance of the overall service. That explains the large amount of effort that is invested in tuning database software and database design.

These tasks are complicated by the difficulty of scaling a single database to many systems in the way that web servers and app servers can be replicated. Because of those challenges, most databases are implemented on one computer. But that single system must have enough RAM for the database to perform well.

Over the years, DBAs have come to expect systems to have lots of memory, either enough to hold the entire database or at least enough for all commonly accessed data. When implementing a database, the DBA is asked "how much memory does it need?" The answer is often padded to allow room for growth. That number is then increased to allow room for the operating system, monitoring tools, and other infrastructure software.

And everyone was happy.

But then server virtualization was (re-)invented to enable workload consolidation.

Server virtualization is largely about workload isolation - preventing the actions and requirements of one workload from affecting the others. This includes constraining the amount of resources consumed by each workload. Without such constraints, one workload could consume all of the resources of the system, preventing other workloads from functioning effectively. Most virtualization technologies include features to do this - to schedule time using the CPU(s), to limit use of network bandwidth... and to cap the amount of RAM a workload can use.

That's where DBAs get nervous.

I have participated in several virtualization architecture conversations which included:
Me: "...and you'll want to cap the amount of RAM that each workload can use."
DBA: "No, we can't limit database RAM."

Taken out of context, that statement sounds like "the database needs infinite RAM." (That's where the CFO gets nervous...)

I understand what the DBA is trying to say:
DBA: "If the database doesn't have sufficient RAM, its performance will be horrible, and so will the performance of the web and app servers that depend on it."

I completely agree with that statement.

The misunderstanding is the assumption that a capped database must use less memory than before. It is not expected to. The "rude awakening" is adjusting one's mindset to accept the notion that a RAM cap on a virtualized workload is the same as having a finite amount of RAM - just like a real server.

This also means that system architects must understand and respect the DBA's point of view, and that a virtual server must have available to it the same amount of RAM that it would need in a dedicated system. If a non-consolidated database needed 8GB of RAM to run well in a dedicated system, it will still need 8GB of RAM to run well in a consolidated environment.

If each workload has enough resources available to it, the system and all of its workloads will perform well.

And they all computed happily ever after.

P.S. A consolidated system running multiple workloads needs more memory than any one of the unconsolidated systems had - but less than the aggregate amount they had.

Considering that need, and the fact that most single-workload systems were running at 10-15% CPU utilization, I advise people configuring virtual server platforms to focus more effort on ensuring that the computer has enough memory for all of its workloads, and less effort on achieving sufficient CPU performance. If the system is 'short' on CPU power by 10%, performance will be 10% less than expected. That rarely matters. But if the system is 'short' on memory by 10%, excessive paging can cause transaction times to increase by 10 times, 100 times, or more.

Thursday Aug 21, 2008

Virtual Eggs, One Basket

One of the hottest computer industry trends is virtualization. If you skim off the hype, there is still a great deal to be excited about. Who doesn't like reducing the number of servers to manage, and reducing the electric power consumed by servers and by the machines that move the heat they create? (Though I suppose that the power utilities and the coal and natural gas companies aren't too thrilled by virtualization...)

But there are some critical factors which limit the consolidation of workloads into virtualized environments (VEs). One often-overlooked factor is that the technology which controls VEs is a single point of failure (SPOF) for all of the VEs it is managing. If that component (a hypervisor for virtual machines, an OS kernel for operating system-level virtualization, etc.) has a bug which affects its guests, they may all be impacted. In the worst case, all of the guests will stop working.

One example of that was the recent licensing bug in VMware. If the newest version of VMware ESX was in use, the hypervisor would not permit guests to start after August 12. EMC created a patch to fix the problem, but solving it and testing the fix took enough time that some customers could not start some workloads for about one day.

Clearly, the lesson from this is the importance of designing your consolidated environments with this factor in mind. For example, you should never configure both nodes of a high-availability (HA) cluster as guests of the same hypervisor. In general, don't assume that the hypervisor is perfect - it's not - and that it can't fail - it can.

Thursday Mar 15, 2007

Spawning 0.5kZ/hr (Part 2)

As I said last time, zone-clone/ZFS-clone is time- and space-efficient. And that entry looked briefly at cloning zones. Now let's look at the integration of zone-clones and ZFS-clones.

Enter Z^2 Clones

Instead of copying every file from the original zone to the new zone, a clone of a zone that 'lives' in a ZFS file system is actually a clone of a snapshot of the original zone's file system. As you might imagine, this is fast and small. When you use zone-clone to install a zone, most of the work is merely copying zone-specific files around. Because all of the files start out identical from one zone to the next, and because each zone is a snapshot of an existing zone, there is very little disk activity, and very little additional disk space is used.

But how fast is the process of cloning, and how small is the new zone?

I asked myself those questions, and then used a Sun Fire X4600 with eight AMD Opteron 854s and 64GB of RAM to answer them. Unfortunately, the system only has its internal disk drives, and the disk drive was the bottleneck most of the time. I created a zpool from one disk slice on that drive, which is neither robust nor efficient. But it worked.

Creating the first zone took 150 seconds, including creating the ZFS file system for the zone, and used 131MB in the zpool. Note that this is much smaller than the disk space used by other virtualization solutions. Creating the next nine zones took less than 50 seconds, and used less than 20MB, total, in the zpool.

The length of time to create additional zones gradually increased. Creation of the 200th through 500th zones averaged 8.2 seconds each. Also, the disk space used gradually increased per zone. After booting each zone several times, they each used 6MB-7MB of disk space. The disk space used per zone increased as each zone made its own changes to configuration files. But the final rate of creation was 489 zones per hour.

But will they run? And are they as efficient at memory usage as they are at disk usage?

I booted them from a script, sequentially. This took roughly 10 minutes. Using the "memstat" tool of mdb, I found that each zone uses 36MB of RAM. This allowed all 500 zones to run very comfortably in the 64GB on this system. This small amount was due to the model used by sparse-root zones: a program that is running in multiple zones shares the program's text pages.
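
As a quick check on "very comfortably," here is the arithmetic as a small Python sketch. The three inputs come from the measurements above; the variable names are mine, and I am assuming 1GB = 1024MB.

```python
ZONES = 500          # zones booted in this test
MB_PER_ZONE = 36     # RAM per zone, as reported by memstat
SYSTEM_RAM_GB = 64   # RAM installed in the Sun Fire X4600

total_mb = ZONES * MB_PER_ZONE
total_gb = total_mb / 1024           # assumes 1GB = 1024MB
share = total_gb / SYSTEM_RAM_GB     # fraction of system RAM used by the zones

print(f"{total_mb}MB total, about {total_gb:.1f}GB, {share:.0%} of system RAM")
```

By this estimate, the 500 zones together use well under a third of the system's memory.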

The scalability of performance was also excellent. A quick check of CPU usage showed that all 500 zones used less than 2% of the eight CPUs in the system. Of course, there weren't any applications running in the zones, but just try to run 500 guest operating systems in your favorite hypervisor-based virtualization product...

But why stop there? 500 zones not enough for you? Nah, me neither. How about 1,000 zones? That sounds like a good reason for a "Part 3."


New features added recently to Solaris zones improve on their excellent efficiency:

  1. the time to make a copy of a zone was reduced from 30 minutes to 8 seconds
  2. disk usage of zones decreased from 100MB to 7MB
  3. memory usage stayed extremely low - roughly 36MB per zone
  4. CPU cycles used just by unbooted zones is zero, and by running zones (with no applications) is negligible

So, maybe computers hate me for pushing them out of their comfort zone. Or maybe it's something else.

Wednesday Dec 20, 2006

Spawning 0.5kZ/hr (Part 1)

I hate computers. No, that's not true. They hate me. Probably because I derive pleasure from pushing them to the hirsute edge of their design specifications. They interpret this as sadistic behavior, and think that I am mean for no reason. But that's not true. I have a reason.

Actually, I have two reasons:

  1. I am curious, and enjoy measuring things
  2. I want to help you understand the limits of Solaris so you can safely maximize the benefits you obtain from using it, without performing all of these measurements yourself.

Over the past few weeks I have wondered how realistic Solaris' upper bound of 8,191 non-global zones (per Solaris instance) is. So I "acquired" access to a Sun Fire X4600 which has a pre-release build of Solaris on it. This build includes several useful features from Solaris 10 11/06 including:

  • zone migration: the ability to move a non-global zone from one computer to another
  • zone clone: make an identical copy of an existing non-global zone
  • integration of zone clones and ZFS clones, enabling incredibly efficient cloning of zones.

The efficiency extends to both time and space. Originally the only method to create a copy of a zone was configuring a new zone like an existing zone and installing it. This operation took 10-30 minutes, depending mostly on disk speed, and did not include any customizations made to the original zone after it was installed.

Spawn of Non-global Zone

The newer "zone clone" method is a vast improvement because it simply copies the files from the original zone to the new zone. This is much faster (2-5 minutes) and carries with it any post-installation customizations. This method, along with other new features in Solaris 10 11/06, enables a new method of provisioning virtual Solaris systems. I call this new method 'spawning a zone.'

Suppose that your data center uses five applications, but multiple computers run each application for different purposes. These might be test and production systems, for example. Each system that runs the same application has the same set of customizations applied to it.

Many data centers simplify the provisioning process by creating, testing, and internally certifying one or more Solaris bootable images, called 'gold masters,' which are then installed on their systems. This also simplifies troubleshooting because it removes the need to analyze a troubled system for customizations. In some cases, system administrators are specifically trained to troubleshoot systems installed with a gold master.

Solaris zones can be applied to this model by creating a gold master Solaris image that has "gold master zones" pre-installed on it. Each zone is customized for one of the five applications. After a new system is installed from the gold master, only the zone(s) customized for the applications that will be used by this computer are turned on. The rest are left off, merely using up a few hundred megabytes of disk space.

That model is simple and very useful, but often is not flexible enough. For example, it does not allow for multiple instances of one application on one system, each in separate zones.

Instead, a "gold master system" could be created that has one "gold master zone" installed per application, with appropriate customizations made to each zone. When an application is needed on a new or existing system, application provisioning includes these steps:

  1. clone a gold master zone
  2. migrate the clone to its new home system (also see my How To Guide on this topic)
    1. "zoneadm detach" the zone from the old system
    2. move the zone's files to the new system
    3. "zoneadm attach" the zone to its new system
  3. destroy the detached zone on the gold master system

You might imagine the gold master system as a queen bee, spawning zones instead of drones.
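
The provisioning steps listed above can be sketched as a small Python helper that only builds the command lines and runs nothing. The function name, the zone names, the zonepath, and the host name are hypothetical. The zonecfg and zoneadm command forms reflect my understanding of the zonecfg(1M) and zoneadm(1M) man pages, so verify them on your own release before using them. The file-copy step is deliberately left as a placeholder.

```python
from typing import List, Tuple


def spawn_plan(gold_zone: str, new_zone: str, zonepath: str,
               target_host: str) -> List[Tuple[str, str]]:
    """Return (where, command) pairs for the provisioning steps.

    Nothing is executed. "gold" means the gold master system and
    "target" means the zone's new home system.
    """
    return [
        # Step 1: configure the new zone from the gold master zone, then clone it.
        ("gold", f'zonecfg -z {new_zone} "create -t {gold_zone}; set zonepath={zonepath}"'),
        ("gold", f"zoneadm -z {new_zone} clone {gold_zone}"),
        # Step 2.1: detach the clone from the gold master system.
        ("gold", f"zoneadm -z {new_zone} detach"),
        # Step 2.2: placeholder, not a real command; use the copy tool you prefer.
        ("gold", f"<copy {zonepath} to {target_host}:{zonepath}>"),
        # Step 2.3: configure the zone from the moved files, then attach it.
        ("target", f"zonecfg -z {new_zone} create -a {zonepath}"),
        ("target", f"zoneadm -z {new_zone} attach"),
        # Step 3: remove the detached zone's configuration from the gold master system.
        ("gold", f"zonecfg -z {new_zone} delete -F"),
    ]


for where, command in spawn_plan("gold-web", "web01", "/zones/web01", "host2"):
    print(f"{where:6} {command}")
```

Step 3 in this sketch removes only the zone's configuration. Any files left behind under the zonepath on the gold master system would still need to be cleaned up separately.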

The new, recently migrated zone can then be further customized. If conditions change, and you want to move it to a different computer, you can do that, too.

Part 1 - Summary: Well, this has gotten a bit long-winded, but the explanation in this entry provides the background necessary to understand the next entry, which in turn (finally!) explains this entry's title.


  1. If that word isn't your vocabulary word for the day, perhaps it should be.
  2. Yes, I know: they can't think. But let's not tell them that. It will ruin their day.
  3. Back in college I was a founding member of the theatre group's Acquisitions Committee, so I have some experience in this area.

Monday Dec 18, 2006

Zones and Configurable Privileges

Part 2 of Many

Another network feature that won't work in a non-global zone in Solaris 10 3/05, 1/06, or 6/06 is the service "dhcp-server". I wondered if appropriate privileges could be assigned to a zone, using Solaris 10 11/06, in order to enable that service to work properly in a non-global zone.

But how do you know which privilege(s) are needed? Although a tool to analyze an executable (and the libraries that it uses) for necessary privileges would be very useful, I am not aware of such a tool. However, there is a tool which will analyze a running program: privdebug.

I used dhcpmgr(1M) to configure the global zone of one Solaris 10 system to be a DHCP server, and told another Solaris 10 system to be a DHCP client by creating the appropriate /etc/dhcp.<interface-name> file. Then I ran privdebug to start gathering data.

After running privdebug as:

# ./ -n in.dhcpd -v -f

its output looked something like this (abbreviated slightly):

STAT TIMESTAMP          PPID   PID    PRIV                 CMD
USED 481061858324       7      1489   proc_fork            in.dhcpd
USED 481063008106       1489   1490   sys_resource         in.dhcpd
USED 481067169173       1489   1490   net_privaddr         in.dhcpd
USED 481067214515       1489   1490   net_privaddr         in.dhcpd
USED 481067261082       1489   1490   net_privaddr         in.dhcpd
USED 7602182665254      7      2307   proc_fork            in.dhcpd
USED 7602184084176      2307   2308   sys_resource         in.dhcpd
USED 7602195780436      1      2308   net_privaddr         in.dhcpd
USED 7602195826717      1      2308   net_privaddr         in.dhcpd
USED 7602195874362      1      2308   net_privaddr         in.dhcpd
USED 7617671777513      1      2308   net_icmpaccess       in.dhcpd
USED 7618028208673      1      2308   sys_net_config       in.dhcpd
USED 7618028224029      1      2308   sys_net_config       in.dhcpd
USED 7618028622618      1      2308   sys_net_config       in.dhcpd
USED 7618937845453      1      2308   sys_net_config       in.dhcpd
USED 7618937861126      1      2308   sys_net_config       in.dhcpd
USED 7786427652239      1      2308   net_icmpaccess       in.dhcpd
USED 7786782253121      1      2308   sys_net_config       in.dhcpd
USED 7786782266742      1      2308   sys_net_config       in.dhcpd
USED 7786782417242      1      2308   sys_net_config       in.dhcpd

With that list, it was easy to check each of the privileges that in.dhcpd used against the list of privileges that are allowed in a non-global zone.

Although proc_fork, sys_resource, net_privaddr and net_icmpaccess are in a non-global zone's default list of privileges, sys_net_config is not allowed in a non-global zone. Because of that, a non-global zone cannot be a DHCP server using Solaris 10 11/06.
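
That comparison is easy to script. The Python sketch below parses a few lines in the same format as the privdebug output shown earlier and reports any privilege that is not in an allowed set. The function and variable names are mine. KNOWN_ALLOWED holds only the four privileges this entry names as being in the default set; it is not the complete default list for a non-global zone.

```python
SAMPLE = """\
STAT TIMESTAMP          PPID   PID    PRIV                 CMD
USED 481061858324       7      1489   proc_fork            in.dhcpd
USED 481063008106       1489   1490   sys_resource         in.dhcpd
USED 481067169173       1489   1490   net_privaddr         in.dhcpd
USED 7617671777513      1      2308   net_icmpaccess       in.dhcpd
USED 7618028208673      1      2308   sys_net_config       in.dhcpd
"""

# Only the privileges named in this entry as allowed by default.
# This is NOT the complete default list for a non-global zone.
KNOWN_ALLOWED = {"proc_fork", "sys_resource", "net_privaddr", "net_icmpaccess"}


def used_privileges(output: str) -> set:
    """Collect the PRIV column from every USED line of the output."""
    privs = set()
    for line in output.splitlines():
        fields = line.split()
        # Data lines have six columns and start with USED; the header does not.
        if len(fields) == 6 and fields[0] == "USED":
            privs.add(fields[4])
    return privs


missing = sorted(used_privileges(SAMPLE) - KNOWN_ALLOWED)
print("not in the allowed set:", missing)
```

For a real check, you would replace KNOWN_ALLOWED with the full set of privileges permitted in a non-global zone on your release.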

That was a fun experiment, but in order to make a non-global zone a DHCP server we must wait for the Crossbow project to add sufficient IP instance functionality, along with its new sys_ip_config privilege. The latter will be allowed in a non-global zone.

Monday Oct 30, 2006

Snoop Zoney Zone

Ever been frustrated by the inability to snoop network traffic from within your Solaris Zones? Good news: Solaris 10 11/06 adds "configurable privileges" - the ability to modify the security boundary around one or more zones. How can this help you?

First some background: part of the implementation of a zone's security boundary is the lack of certain Solaris Privileges(5) - privileges that, in the wrong hands, could be used to affect other zones or even the entire system. One simple example is the SYS_TIME privilege, which allows the user to change the system clock that is used by all zones.

In the first release of Solaris 10 (in March, 2005) those privileges were not allowed in a zone. Even the root user of a non-global zone could not gain those privileges. This was a Good Thing, as you would not want one zone to change the system clock, for example.

However, since the debut of Solaris 10, we have investigated the implications of adding those 'prohibited' privileges into specific zones. Solaris 10 11/06 allows many of those privileges to be added to the default set of privileges that are permitted in a zone. Adding privileges must be performed by the global zone administrator, using zonecfg(1M). While adding this functionality, we also added the ability to remove privileges from a zone's limit set.

Of course, adding functionality may also add security risks, and this is true for "configurable privileges." Adding a privilege to a zone's limit set may have unintended consequences. It is crucial to understand the implications of adding a privilege to a zone before actually doing so.

A comprehensive analysis of new possibilities would be a significant undertaking, but in this blog entry and a few others, I hope to provide some guidance on this topic. I'll start with the new ability to snoop network traffic from within a zone. Keep in mind that this includes all traffic on the network interface(s), including traffic for other zones, including the global zone. Adding net_rawaccess also allows the zone to do other nefarious things. Use this privilege, and others, with caution.

To allow a zone to snoop network traffic, you must add two directives to the zone's configuration, and then [re]boot the zone:

global# zonecfg -z twilight
<zonecfg:twilight> set limitpriv="default,net_rawaccess"
<zonecfg:twilight> add device
<zonecfg:twilight> set match=/dev/e1000g0
<zonecfg:twilight> end
<zonecfg:twilight> exit

After booting the zone, the root user can snoop that network interface, and see all traffic on that NIC.

Finally, note that the set of privileges and the rules regarding their use may change in the future. For example, Project Crossbow will significantly change the way that zones use IP networks.

Monday Jun 12, 2006

Itanium Followup

My previous blog entry decried the failed predictions of the future of Itanium. It has been almost two years since - and nothing has changed. The reports of Itanium's success have been greatly exaggerated. Instead of the "industry standard"[1] status which has been repeatedly claimed, the number of Itanium systems shipped has never exceeded 9,000 per quarter, and one vendor makes up 90% of that "volume." That peak was in the last quarter of 2004 - just after my previous blog. I guess I demoralized them. :-)

Further, Itanium shipments have exceeded 0.5% (yes, "one-half of one percent") of the number of servers sold in only one year, and that was 2004. I think I can safely leave this topic behind and move on to more fruitful discourse.


Tuesday Sep 21, 2004

Itanium Who?

Instead of barreling onto the scene as originally predicted, Itanium seems to be knocking timidly at the door of the computer industry, asking for permission to join the party already in progress.

Industry analyst firm IDC originally predicted, back in 1997, that in 2002 Itanium-based server sales would generate about $33B in revenue. The actual number for that year was short by a few digits - in fact, they were shy by over 99%.

IDC made six revisions to their predictions by the time 2002 arrived, each progressively smaller, each pushing off the date to achieve any significant revenue.

Intel's Paul Otellini proclaimed 2003 the "Year of the Itanium." I'm not sure what his point was...perhaps "the year that we sell a few Itaniums?" But I am confident that you know what I mean when I nominate Itanium for the award "Most Over-Hyped Product of the Decade."

  • "Year of the Itanium"
  • Intel claims they will not do x86-64
  • "Itanium sales fall $13.4bn shy of $14bn forecast"
  • "Intel and IDC at odds over Itanium's future"


Jeff Victor writes this blog to help you understand Oracle's Solaris and virtualization technologies.

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

