Sunday Dec 19, 2004

Oracle RAC's Secret

I'm a big fan of Oracle's RAC technology. I (speaking for myself, not Sun) think it is the only database product out there that can solve the challenge of near-continuous database transaction access to a single (complete) data set even when the database node that a client is connected to experiences a catastrophic failure. Traditional failover can incur a 10x longer service disruption, and multi-site "replicated" state designs are complex and subject to sync skew.

However, there is a little secret associated with the magic of Oracle RAC. Well, it isn't really a secret; it is just something that most people don't like to talk about, an elephant in the room that people choose to ignore. In fact, it is a very natural consequence of a NUMA design. NUMA, of course, means "non-uniform memory access", and it generally becomes a serious issue when frequent memory accesses take place with a best-to-worst latency ratio of >10x. Local SGA memory access latency on an SMP node takes from 100 to 400ns (depending on the type of node). However, if that node is part of a RAC cluster and it needs to access a memory block in another RAC node's shared SGA (via Cache Fusion), the latency to retrieve that block will be measured in microseconds, often 1000x worse! Here is an illustration of the NUMA aspect of Oracle RAC:

Oracle published a paper recently in which it lists GBE as having an average latency from 600us to over 1000us. That is well over 1000x worse than the local SMP node! Even Infiniband has a latency of almost 200us, which is 1000x worse than a 200ns local SMP node. Ouch. That is a serious performance hit! Here is a graphic from Oracle's paper:
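These ratios are easy to sanity-check. A quick sketch (the latency figures are the ones quoted above, not fresh measurements):

```python
# Sanity check of the latency ratios quoted above (figures from the post).
local_smp_ns = 200          # local SGA access on an SMP node (post's 200ns case)
gbe_us = 600                # GBE average latency, low end of Oracle's range
ib_us = 200                 # Infiniband latency ("almost 200us")

gbe_ratio = gbe_us * 1000 / local_smp_ns    # convert us -> ns, then compare
ib_ratio = ib_us * 1000 / local_smp_ns

print(f"GBE:        {gbe_ratio:,.0f}x slower than local SMP access")
print(f"Infiniband: {ib_ratio:,.0f}x slower")
```

At the low end of Oracle's GBE range the penalty is already 3000x; "well over 1000x" is, if anything, generous.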

There is also the issue of bandwidth. An older server from Sun, the F15K, has over 172GB/s of internal bandwidth. That's aggregate bandwidth among 18 boards, but it is still a TON of bandwidth. GBE, bless its heart, can only push about 70MB/s of user data. Even with 18 of those links (if you attempted to build an "F15K" from blades), that adds up to only about 1.2GB/s. And consider the CPU utilization needed to drive each GBE link. Hmmm. Let's see what Oracle says about Bandwidth, Latency, and CPU:
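The link arithmetic above, spelled out (figures from the post; real-world GBE payload rates vary):

```python
# 18 GBE links vs. the F15K backplane (post's figures).
gbe_user_MBps = 70                          # usable payload per GBE link
links = 18
blade_GBps = gbe_user_MBps * links / 1000   # aggregate, in GB/s
f15k_GBps = 172                             # F15K internal bandwidth

print(f"18 GBE links: {blade_GBps:.2f} GB/s")
print(f"F15K advantage: {f15k_GBps / blade_GBps:.0f}x")
```

That works out to roughly 1.26 GB/s for the blades, putting the F15K well over 100x ahead.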

You can get an idea of why this is a problem when you understand the internal structure of an Oracle database. It's amazing what Oracle can do w.r.t. data integrity and performance. It takes a lot of behind the scenes action. Here is a peek:

And when you try to spread this out among even two nodes, you suffer the consequences of 1000x higher latency, and 100x less throughput. Here is a look at the protocol mgmt that must take place for every node sync or transfer, which can happen thousands of times per second:

So it is no wonder that RAC can run into scaling issues if the data is not highly partitioned, to reduce to a trickle the amount of remote references and cache fusion transfers. TPC-C is an example of a type of benchmark in which the work is split between each node without inter-node interaction. RAC scales wonderfully in that benchmark. The problem is that most ad-hoc databases that customers are attempting to use with RAC involve significant inter-node communications. You can imagine the challenge, even with Infiniband, which still has 1000x higher latency (according to Oracle's tests).

Compare this to an SMP node, in which we have shown near 100% linear scale to 100+ CPUs running real world workloads that involve intense remote memory access. Thankfully, a "remote" access on an SMP box (a CPU asking for a block that is cached by another CPU) is still in the nano-second range. Here is a look at what SMP can do:

I have graphs that show Oracle RAC performance on real-world workloads, but Oracle doesn't allow anyone to publish Oracle performance results without their permission. So I will only suggest that the graph has a much different shape, and that anyone contemplating Oracle RAC should run full load testing and a comparison against a non-RAC SMP baseline.

Okay... what does all this mean? Well, just that Oracle RAC, as I started out saying, is incredible technology that solves a particularly nasty problem that many customers face. But you must enter the decision to deploy RAC with full knowledge of the engineering trade-offs. RAC can be made to perform well in many environments, given a proper design, data/query partitioning, and proper skills training. But in general, if there is appreciable inter-node communication, then you should consider using fewer RAC nodes (eg: 2 or 3), in which each node is larger in size. This keeps memory accesses as local as possible.

For many customers, a traditional HA-Failover is actually a very good design choice, in which you leverage the linear scale of an SMP box, and let SunCluster restart the database on some other node if there is a problem. This generally takes ~5-10 minutes, which is an acceptable service disruption duration for many, especially since that might only happen a couple times per year. And, for clusters with more than 4 CPU cores, Oracle charges $70K per CPU core for RAC+Partitioning, whereas Oracle "only" charges $40K per CPU core for an HA-Failover environment (and for failover, you only pay Oracle for the active node if you only expect to run Oracle on the failover node for less than 10 days per year).

Wednesday Dec 15, 2004

Boeing & Root Cause of Failure

I found the following very interesting! It is buried in a 22 page report on Boeing's web site:

Statistical Summary of Commercial Jet Airplane Accidents Worldwide Operations (1959 - 2003)

On page 19, you'll find the following graphic (I've added some context elements) that describes the root cause of hull loss and/or loss of life in the worldwide commercial air fleet over the last 10 years:

It is interesting to note that the large majority of cases of, um, "down" time, were due to people either making mistakes (or acting maliciously), or people correctly following faulty or incomplete procedures (which were written by people). It is rarely the products (airplanes) or the environmentals (weather).

In the same way, Gartner and others have long held that complex IT systems fail to deliver expected service levels mostly because of people and process related root causes (est. ~80%). Product failures actually account for a tiny fraction of IT service disruptions.

This seems to point to a general pattern: whenever complex systems expose their complexity to human touch points, catastrophic failures will occur that impact business and/or life, even in situations in which those humans are psychologically screened, highly trained, highly paid, and limited in number.

This is probably no surprise to us. Each one of us has made mistakes behind the wheel, in social settings, etc, due to a variety of reasons (boredom, over-confidence, etc). But the implication of the Boeing and Gartner studies is that we should strive to abstract complexity away from human touch points at every opportunity. Think of "fly-by-wire" controls, in which a pilot's actions are constrained by a flight control system that will not allow actions that could harm the airplane or its passengers. Freedom and flexibility are permitted up to, but not exceeding, a "pain" threshold.

In professional audio systems, a "compressor" is often used. Dynamic response is not affected until it reaches a threshold at which it might distort or consume undesired energy. Then the system steps in and cleanly limits further dynamic range. As long as you operate in the expected range, you have freedom. If your actions threaten the quality of the output, your action is constrained. Seems like a fair trade-off of freedom and control.

The ultimate expression of an automated datacenter would be to define desired service levels (and the cost and reward sensitivities as actual service levels vary from the nominal/desired) and let "fly-by-wire" micro-adjustments to the IT infrastructure handle the optimization. This could radically reduce IT service disruptions, as complexity would be managed by highly-available and hardened controllers rather than distracted operators. The sensitivity parameters allow the system to distribute excess resources to those services that could benefit most from better-than-desired performance, or degrade the least sensitive services if a shortfall were to occur.

Of course, cascading failures are still possible. Remember the recent blackout in the Northeast! Codified heuristics that control optimization decisions are simply human-designed algorithmic procedures. And procedures can be flawed, or reach an "if" branch of a decision tree based on stimuli for which there is no "then" statement. But, once solved and hardened, this datacenter control "product" will be much more dependable at delivering desired service levels than an army of humans manually adjusting knobs.

Can this go too far? Sure! I'm not sure I'd want to fly in a pilotless helicopter around Kauai... There are limits to the value of automated services that pre-define concepts of optimization. However, a helicopter with controllers that prevent potentially harmful actions from an error-prone human pilot would be comforting, and not only might keep the charter service in business (and me alive!), but be leveraged as a way to drive more business.

Sunday Dec 12, 2004

ZFS: Boils the Ocean, Consumes the Moon

ZFS (aka: the Zettabyte Filesystem), Sun's newest filesystem, which will ship with an update of Solaris 10 in 2005, supports 128-bit storage addressing! Let's explore how insanely huge this is from various perspectives.

First, I've heard several times now that constructing and powering a storage farm of this size would boil the world's oceans. Is this just hyperbole? Presenters typically say something like: "someone in engineering ran the math, and this is amazing but true". The latest was from Larry Wake's presentation, in which he said:

If we could implement a physical system with a storage capacity that matches the 128-bit address range of ZFS, we would "literally evaporate all the oceans on earth".

This got me a little curious... Let's see:
ZFS = 128-bit = 2^128 bytes ≈ 3×10^26 [3E26] TB (per filesystem)

Using 300GB spindles, you'd need about 1E27 spindles. Seagate's modern drives consume ~10W idle, and ~14W at startup and under load. So, let's go with 10W each, for round numbers. That's 1E28W, or 8.8E28 KW-hr over a full 24x7 year. That's 3.2E35 Joules.

If we apply E=mc^2, we'd need to annihilate 3.5E18 kg of something (old beer cans?) to produce this much energy (power those spindles for a year). The National Oceanic and Atmospheric Administration (NOAA) experts figure that the world's oceans consist of 275 million cubic miles. Seawater weighs 1027 kg/m^3. That means all the oceans of the world weigh about 1.2E21 kg. Perfect conversion of the oceans to energy would spin those disks for only about 340 years!!
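Running the whole chain through explicitly (done carefully, the yearly energy comes to about 3.2E35 Joules, which means a fully annihilated ocean buys a few centuries of spin time, not months; still absurd):

```python
# The full energy chain for powering a 128-bit pool of 300GB spindles.
SECONDS_PER_YEAR = 3.156e7
C = 3.0e8                        # speed of light, m/s

spindles = 3e26 / 0.3            # 3E26 TB pool / 0.3 TB per drive = ~1E27
watts = spindles * 10            # 10W each -> ~1E28 W
joules_per_year = watts * SECONDS_PER_YEAR       # ~3.2E35 J

mass_per_year = joules_per_year / C**2           # E = mc^2 -> ~3.5E18 kg
ocean_kg = 1.2e21                                # all the world's oceans
ocean_years = ocean_kg * C**2 / joules_per_year  # ~340 years of spin time

print(f"{mass_per_year:.1e} kg annihilated per year")
print(f"the oceans last {ocean_years:.0f} years")
```

Either way, the oceans are a rounding error against geologic time for this farm.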

Total world consumption of energy in 2002 was about 450 Quadrillion BTUs, or about 1.3E14 (130 trillion) KW-hrs. Note: quadrillion means 10^15 in the US and 10^24 in Europe... this stat is in the US units.

Therefore we'd need about 7E14 times more capacity than the current worldwide consumption just to power this storage farm. If all the world's power generating capacity was a single grain of sand, we'd need a pile of roughly 7E14 grains, about one ten-thousandth of all the sand on all the beaches of the entire planet (est. number of grains of beach sand: 7.5E18).
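The supply-side comparison, worked through (a 1E28W farm against the 2002 worldwide figure):

```python
# Storage-farm demand vs. 2002 worldwide energy consumption.
BTU_TO_J = 1055.0
world_2002_kwh = 450e15 * BTU_TO_J / 3.6e6   # 450 quadrillion BTU ~= 1.3E14 kWh
farm_kwh = 1e28 / 1000 * 8766                # 1E28 W for one year ~= 8.8E28 kWh

print(f"farm / world = {farm_kwh / world_2002_kwh:.1e}")
```

That ratio lands around 7E14, which is where the grain-of-sand comparison comes from.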

As far as the size of this storage farm goes (just laying the drives as close as possible to each other): each drive is ~600K mm^3 (the size of Seagate's 180GB disk), so you'd need 6E32 mm^3. The land surface of the earth is about 1.5E8 km^2. You'd have to cover the earth's land surface with disks to a depth of 2.5 million miles to get this capacity (about 10x the distance to the moon!).
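The volume math, for the skeptical:

```python
# How deep would 1E27 drives pile up over the Earth's land surface?
drive_mm3 = 6e5                    # ~600K mm^3 per 3.5" drive
total_mm3 = 1e27 * drive_mm3       # 6E32 mm^3 of spindles
land_mm2 = 1.5e8 * 1e12            # 1.5E8 km^2 of land, in mm^2

depth_miles = total_mm3 / land_mm2 / 1.609e6   # mm of depth -> miles
moon_miles = 2.39e5                            # mean Earth-moon distance

print(f"depth: {depth_miles:.1e} miles ({depth_miles / moon_miles:.0f}x the moon)")
```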

Okay. The oceans are history! So are we. But ZFS will live forever. :-)

Let's look from another perspective. Seth Lloyd tells us that the sub-nuclear limit for storage is 10^25 bits/kg. That means that a fully populated 128-bit storage pool would have to weigh at least 600 trillion pounds, just for the recording medium. Any less, and you can't exceed the 128-bit space.  Sun employees see: http://zfs.eng/faq.shtml
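The Lloyd-limit math:

```python
# Minimum mass of a full 128-bit (byte-addressed) pool at Lloyd's limit.
bits = 2**128 * 8          # 2^128 bytes, in bits
kg = bits / 1e25           # sub-nuclear limit: 1E25 bits/kg
lbs = kg * 2.205

print(f"{lbs:.1e} lbs")    # ~6E14, i.e. 600 trillion pounds
```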

A combat-ready aircraft carrier weighs only 194 million pounds! The Empire State Building weighs only 1.1 billion pounds! A solid cube made up of 1 trillion pennies (273 feet per side, about the length of a football field in each dimension) weighs ~5.5 billion lbs. That is 300% more pennies than the US Mint has ever produced!

A penny made after 1982 weighs just 2.5 grams (5.5116E-3 lbs). That site suggests that 1 trillion (+ 16k or so, to make a cube) pennies weigh 3.125 million tons. But in 1982, the penny's composition was altered from 95% copper / 5% zinc to the current 97.5% zinc / 2.5% copper mix, which made it "cheaper" and lighter. That many pennies now weigh just 2.75 million tons (US), or 2.5 million metric tons, so we'd need a few more cubes.

You'd need 110 thousand of those cubes to equal the mass of the theoretically perfect mass-efficient storage pool.
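Penny-cube math, using the post-1982 2.5g penny:

```python
# How many trillion-penny cubes weigh as much as the Lloyd-limit pool?
penny_g = 2.5                              # post-1982 penny
cube_lbs = 1e12 * penny_g / 1000 * 2.205   # one cube ~= 5.5E9 lbs
pool_lbs = 6e14                            # 600 trillion pounds, from above

print(f"{pool_lbs / cube_lbs:,.0f} cubes")   # ~110,000
```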

In reality, the latest Seagate 300GB disk weighs 1.6 lbs. You'd need 1E27 of these, or 1.6E27 lbs. The moon weighs about 1.6E23 lbs. So you'd need the weight of 10,000 moons!! And that's just the spindles (sans racks, air handlers, etc).
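Moon math (taking the moon's mass as ~7.35E22 kg, which puts its weight near 1.6E23 lbs and the answer near 10,000 moons):

```python
# 1E27 drives at 1.6 lbs each, vs. the moon.
drives_lbs = 1e27 * 1.6
moon_lbs = 7.35e22 * 2.205        # moon's mass ~7.35E22 kg -> ~1.6E23 lbs

print(f"{drives_lbs / moon_lbs:,.0f} moons")
```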

Hmmm. I'm thinking 128-bit filesystems might just be enough for a few years. :-)

Thursday Dec 09, 2004

Moore's Law & Greg's Law

Bill Joy, in a Wired Mag article about a year ago, thinks Moore's Law will last at least another 10 years. With algorithmic work, those processors will have 1000x more power than today (he wrote that in Dec 2003). That's "about" the same timeframe as general use of Jim Mitchell's DARPA-derived systems using a proximity interconnect. Hmmm. Is that possible? What has history taught us?

And Ray Kurzweil (yep - same Kurzweil famous for his MIDI keyboards and synths) wrote a fascinating book in 1999 called "The Age of Spiritual Machines". On page 22, he charts the evolution of compute power since the 1900-era "mechanical" devices, through the 1940s relay-based devices, through the 1950s vacuum tube computers, to today's modern computers. Plotting the calcs/sec available for $1000.00 on log paper results in an almost perfect straight line!!! Moore's Law simply tracks the 5th paradigm of computing. If the IC's useful life ends around 2020 (as suggested by some scientists), a 6th paradigm will emerge and likely continue to sustain the exponential curve that started in 1900 and has continued for over 100 years and across 5 compute paradigms. Hard to argue with this historic consistency. Yet, exponentials always (must) tail off at some point.

If we were to extrapolate (just for fun) to the year 2020 (when Ray thinks the 5th paradigm will end), using recent trends in H/W (CPU, storage, networking, etc...), an affordable home computer will offer the following characteristics! Of course, as Jonathan points out, the personal desktop computer is not that interesting anymore: "...hardware is nearly identical, and the value's moved to services available through the device. Over the network. Battery life matters more than processor speed. Size of display more than disk...". However, this extrapolation might well apply to a 1RU server blade in 2020!

   4THz               Processor!! (or the equiv in throughput power, as we now understand)
   10TB               Disk (via NFS v6?)
   64GB+              RAM
   100Gbps            Wired Network (photonic?)
   1Gbps              Wireless Network (the client/consumer end points)
   3D Holographic     Video (the presentation side of the net)

Pretty amazing. I'd bet against that kind of power in a home computer. But 10 years ago, everyone would have bet against a 2GHz CPU, 80GB disk, 1GB RAM, and 100Mbps networking in a laptop. Can you imagine what you could do with a system that contains a CPU that has the (throughput) power of 1000 PCs running 4GHz Pentiums?

Read more on Ray's thoughts on Technology at:

A friend replied to this.... While meant to be humorous, there is a grain of truth in this sarcasm. Enjoy...

Greg replies:
You have incorrectly assumed that Moore's Law is the only law at work and have completely overlooked "Greg's Law".  While Moore's Law follows a geometric progression, Greg's Law is an inverse log relationship.  Further, one of the dependent variables is a function of Moore's Law, thereby making it a trinomial inverse-logarithmic function.

Briefly, Greg's Law states "The mass and volume of software, (i.e.  LOC size, memory demands, and processor loading) increase in an inverse natural logarithm relationship to the available processor resources".

 SWmass = e^(PR),
    where PR follows Moore's Law: PR[1] = PR[0]*2^(t/18),
    where t is in months

combining the equations we have:

 SWmass = e^(PR*2^(t/18))

solving for the SW increase in an 18 month period we have:

 SWmass = 7.39X
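For the curious, Greg's formula really does produce his 7.39X (a minimal check, with PR[0] normalized to 1):

```python
import math

# Greg's Law: SWmass = e^(PR * 2^(t/18)), with PR[0] normalized to 1.
def sw_mass(t_months, pr0=1.0):
    return math.exp(pr0 * 2 ** (t_months / 18))

print(f"after 18 months: {sw_mass(18):.2f}X")   # e^2 ~= 7.39X
```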

If we use your example of 20 years, processors will be roughly 1M times as powerful, yet software will be e\^1M times as massive which equates to "-E-" using my calculator.

Now consider MTBF.  With processors containing 1M times the circuitry, hardware failures will increase at a staggering rate.  Most will go unnoticed however, since the software will have grown by e\^1M causing applications and/or Windows to hang every 23.7ms on average and masking true HW failures.  This is conservative since I have assumed the SW failure rate to be equivalent to the HW rate.  Empirical data puts the SW failure rate at about 3 orders of magnitude higher.

In the near future: the DLL to support graphical display and interfacing to the file system will require 1GB of memory alone. Single-instruction operations will be replaced with object-oriented classes that consist of 1000 LOC and consume 1MB of RAM. The program will require the transfer of several GBs of data and library calls, to the point where the I/O will consume the first 1GHz of the processor power.

The bottom line is it will still take 30 seconds for the CNN webpage to come up - even though you will have a gazillion times the processing power of the Saturn V that took man to the surface of the moon and back.

It's all in the bloatware

The Galactic Scope of our Marketplace!

It is amazing to consider that our own Milky Way galaxy has between 100 and 300 billion stars:

And that even with the finite capability of the Hubble Telescope, we can see about 100 billion galaxies:

Now, if only we could figure out how to sell our "Sun" systems beyond our own speck of dust... :-)

Of course, then we'd probably also have hundreds of billions of competitors, many of whom have figured out how to harness Fusion-power at Nano-scale with FTL communications and Superconductor cores operating at Googol-Hz rates.

Hmmm. We better put off our intergalactic sales campaign for another few years....

Evolution of Sun (Darwinian?)

Our corporate evolution demonstrates that Sun has a rare ability to predict, perceive, and navigate the time-compressed shifts, competitive pressures, and demand dynamics that have sent many of our peer "organisms" to their grave. Was it just dumb luck and fortuitous timing that has allowed us to re-invent ourselves and thrive? Let's consider this.... and our future...

One of our Distinguished Engineers asked:

"The big question in the coming years is are we selected for or selected against?"

I offer a few thoughts... linking them to how they apply to Sun. And why I'm encouraged by the answer to the question he asks above.

First, many scientists no longer believe that organisms can "evolve" into a different type. A mouse didn't turn into an elephant. A protozoan didn't eventually turn into a botanist :-) What does seem to be encoded into life is an ability to adapt and evolve within a species. Sun is an example of this. We had the right DNA from the beginning... Networked Computing. We survive because we don't need to re-sequence our DNA and turn into something we are not... we simply have to adapt our existing nature to the dynamics of our changing environment. We remain the same fundamental species. Some current competitors will not find the journey along the Networked Computing evolution as natural, due to less robust or complete DNA. Many will die.

If we ever find that Networked Computing is no longer viable, \*then\* Sun is in trouble.

Second, micro evolution depends on 1. an efficient process for discarding capabilities that have become a burden, and 2. a knack for inventing new capabilities, quickly, that provide for advantages as the environment shifts. Sun has, historically, been very good at this process. If we ever find ourselves spending more time on our appendix or tonsils (eg: Eagle / US-V), rather than investing in the areas that we'll need for the next age, then we are doomed. We need to encourage our leaders to make the hard decisions and take risks. They have proven themselves in the past. But we are responsible for being their eyes and ears, letting them know when subtle changes are occurring to which strategic adjustments might be needed.

Finally, Darwin would have suggested that surviving species were just lucky. That evolution and adaptation is based on huge numbers of random mutations over billions of years. Rabbits that started to evolve such that they glowed in the dark quickly died out, being a bad choice in fashion :-) However, Darwinism is pretty much discarded these days by many scientists. Many are convinced that there was some kind of "Intelligent Design" involved in the variety and complexity of life, over the limited time that Earth has existed. The good news is that Sun Microsystems demonstrates this as well. We aren't just at the whim of random choices in innovation, with the hope that these random capabilities produce an advantage that somehow helps us thrive. No, we have some of the brightest Intelligent Designers working to ensure that we invest in the right areas, and that we prune the baggage (no matter how useful that baggage was in the past). The ranks of our Intelligent Designers include our execs (eg: McNealy, Schwartz, etc), our DEs/TDs, and every other cell in our collective body (SEs, SRs, etc).

In summary, we started with the right DNA, tuned for long-term survival in a changing world. We have Intelligent Designers working on an active process of fast-track microevolution within our species. And we've optimized the process of focusing our energy expenditure on those areas that matter most, looking forward. As long as our eyesight remains clear to our changing environment, and our DNA remains viable (Networked Computing), and our corporate culture continues to be efficient at pruning and focusing on the right use of our energy, then the question above becomes rhetorical, with the answer clear to every cell in our body.

We need to keep asking the hard questions, and driving towards the needed changes. But on the topic of basic viability and survivability, we need every cell in our body to be able to answer this question, in the affirmative, with confidence.

We do need to adapt to a new strategy for hunting (revenue capture). Our basic DNA remains unchanged. We remain focused on Networked Computing. But our behavior must adapt in the capture of food. We can no longer wait for opportunities to present themselves. If we sit at the cave entrance, we'll starve. We must hunt in teams, carefully planning our strategies, allowing each member of the hunting party to do what they do best... organizing into bands, lethal to our competitors, that know what we need to accomplish for a particular mission and see it through to the end. This shift in behavior will cause a reshuffling of the tribe, but the species will survive if leadership will support the effort needed to make the change. And it is clear that our leaders are 100% committed to this brave new world.

Wednesday Dec 01, 2004

My First (few) Computers (Sinclair ZX81)

I was recently reminded of my first computer... the Sinclair ZX81, in the summer of 1981. I paid $80.00 for the kit. I spent the whole day soldering the kit together (4 ICs, power supply, caps, jacks, etc). I used an old B&W TV (it had a 32x24 text mode and 64x48 graphical resolution), and a cassette recorder as an I/O device to save my Basic programs. No sound capability. It had a 3.25MHz Zilog Z80A processor with 1KB of RAM (expandable to 56KB) and 8KB of ROM with a crippled Basic interpreter. I was a sophomore in college and had just been introduced to time-share systems with teletypes and punchcards. So this interactive graphical "personal" computer was very cool. The thermal printer was actually helpful for debugging my program listings! It is amazing what we put up with just 25 years ago!

My next purchase was a COLOR Timex/Sinclair 2068, in 1983, for $200.00. It had an amazing resolution of 256x192 in color, or 512x192 in monochrome. I picked up a crisp green-pixeled monochrome monitor. The TS2068 had the same processor (the Zilog Z80A), but slightly faster (3.58MHz). It also had sound!! It had a whopping 48KB of RAM, and 24KB of ROM, with an expansion ROM cartridge port. I could use the same cassette recorder for saving programs. I still have this one :-)

I had an alternating-semester co-op job in Boca Raton with IBM in '80-'82, and was there when they introduced the PCjr. So I was pretty much locked into the PC vector (rather than the Apple worldview). So, when I was ready to buy my first "real" computer, I settled on the Tandy 1000 SL PC clone. That was in 1988, when Radio Shack came out with the 1000 SL for $900.00. It was actually a very nice system for the time, with an 8MHz Intel 8086 and 640x200 video resolution with 16 colors! It had a 5.25" 360KB floppy, and a 20MB hard disk! It had 384KB of RAM, which I expanded to 640KB. It was an IBM XT clone with some PCjr enhancements.

The rest is, well, history... As I type on my laptop with an 80GB disk, wireless networking, 2.4GHz processor, and 1400x1050 display!!

Thursday Sep 23, 2004

2008 will ROCK!!

Check out the end of the following article:

The suggestion is made that a single ROCK processor (a single chip, which fits into a single socket, which could fit in a single rack unit server, or even in a blade form factor) could be as powerful as a $3.7M E25K server (which has 144 1.2GHz cores). The public roadmap shows the ROCK processor shipping in 2008.

Think about that.. a 42RU rack could deliver the compute power of $100M in today's IT dollars.

My goodness.... How things are going to change in just a few years. The rate of change continues to accelerate. The high-tech shakeup will intensify. Adapt or die!! Thankfully, Sun appears to be making the right decisions on many fronts (Solutions, SOA, S/W & JES, Utility, Grids, Storage, Dev Tools, OpenSource, Desktops, Linux/x86, Fujitsu, Microsoft, Partners, etc....). Directed and inspired innovation will drive the revolution.

Me... I'm holding onto my stock options... Believing that lead can turn back into gold in the  thermonuclear furnace of our G2-class Sun.



