500GB/sec and Database Machine Generation 2
By Jean-Pierre Dijcks on Sep 21, 2009
Last Tuesday we announced the second generation of Database Machine. This second generation of Oracle Exadata is now running on Sun hardware. The premise for Database Machine is still the same: deliver extreme performance systems on commodity hardware with ease of deployment.
The database machine is a prime example (as was the first generation) of software-enabled hardware. The software offers the real value; the hardware is off-the-shelf stuff, allowing a great price point and an easy way to quickly release a next-generation system and get the benefits of faster chips and other components. The software allows the easy migration and the extreme benefits.
The Sun Oracle Database Machine comes with some new and very cool Exadata software features. It once again has InfiniBand (generation 2 delivers even higher throughput numbers) and it is now available in smaller configurations.
So what is new here?
For one, the addition of flash into the system is something very compelling and a leap forward in terms of performance and throughput. And yes, that is where the 500GB/sec comes from...
Effectively, what we did in generation 2 is add a very fast cache into the storage tier of the system, and by doing this we created a hierarchy as shown above. The fastest tier is the actual memory in the database nodes, which we increased on the machine. The bottom part of the hierarchy is the disk, where we increased the throughput for a whole rack to 21GB/sec. By adding flash cards (not flash drives!) to the storage tier we can leverage this as cache and get the benefits from a scale-out strategy. As we scale out the storage, we scale out the flash and the throughput.
The Exadata cache is a smart cache that we carefully manage. If you deem it necessary you can pin objects into the cache as well. Since the Exadata Storage Server actually understands the structure of the data stored, the cache does so too. It is after all managed by the Exadata software. This means that we do not use a regular LRU (Least Recently Used) algorithm, but determine which data is hot and cache these sets when we deem it better to do so.
One distinct difference from the flash you see in traditional storage arrays is that we are not using flash disks in Exadata. We are using PCIe cards. This means we are not constrained by slow disk controllers and can get these massive throughput numbers of 50GB/sec for a full rack database machine.
On top of this, we are introducing Hybrid Columnar Compression with Exadata generation 2. We talked about this already in a previous post around the 11g Release 2 database new features.
In the data warehousing workload (assuming bulk loads for example and lots of querying) we can achieve a 10x compression of the data with almost no impact on query performance. That compression rate allows us to achieve up to 500GB/sec of scan rates from the flash cards.
To put that into perspective, in generation 1 of the Database Machine we achieved up to 14GB/sec of throughput from the disks (in a full rack). In generation 2 we are up to 21GB/sec. Both numbers are uncompressed. Flash gets us to around 50GB/sec. The truly staggering numbers come with that 10x Hybrid Columnar Compression rate: 50GB/sec of physical scanning over data compressed 10x is equivalent to scanning 500GB/sec of uncompressed data. For anyone who has ever run queries on a system, 500GB/sec is really, really fast!
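The arithmetic behind the headline number can be written out as a small sketch. The constants are the full-rack figures quoted in this post; the function name is made up for illustration, and the simple multiplication assumes the full 10x ratio applies to everything being scanned, which is a best case rather than a guarantee.

```python
# Full-rack throughput figures quoted in the post, in GB/sec.
DISK_GEN1_GBPS = 14
DISK_GEN2_GBPS = 21
FLASH_GEN2_GBPS = 50

# The "10x" Hybrid Columnar Compression ratio from the post.
# Assumption: the ratio applies uniformly to all scanned data.
COMPRESSION_RATIO = 10


def effective_scan_rate(physical_gbps, compression_ratio):
    """Return the uncompressed-equivalent scan rate in GB/sec.

    Reading compressed data at `physical_gbps` covers
    `compression_ratio` times as much logical data per second.
    """
    return physical_gbps * compression_ratio


if __name__ == "__main__":
    rate = effective_scan_rate(FLASH_GEN2_GBPS, COMPRESSION_RATIO)
    print(f"Flash with {COMPRESSION_RATIO}x compression: {rate} GB/sec")
```

With a lower compression ratio the effective rate drops proportionally, which is why the 500GB/sec figure is stated as "up to".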
That is not all though. Generation 2 of Exadata also introduces Storage Indexes. A storage index is something more akin to a range partition, but we evaluate this at the storage layer. Sometimes this is referred to as a negative index.
What happens is that for each commonly queried column we transparently store the min and max values of that column. We do this per region of data of a certain size, i.e. as soon as we finish writing the data and filling up that predefined size, we calculate the min and max for the relevant columns. The result is something like this:
If the user now issues a query such as SELECT * FROM TABLE WHERE B<2, the scan will only look at the first set of rows in the picture above. Since the minimum value of B in the second block is 3, no rows matching the query can be in that set of rows. This allows a Storage Index to give us transparent data elimination without overhead, making the scans more selective and therefore faster.
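To make the min/max idea concrete, here is a minimal Python sketch of the pruning logic. It is an illustration only, not how the Exadata software is implemented: the function names, the row layout and the sample values are all made up, chosen so that the second region has a minimum B of 3 as in the example above.

```python
def build_storage_index(regions, column):
    """Record (min, max) of `column` for every region of rows."""
    return [
        (min(row[column] for row in region),
         max(row[column] for row in region))
        for region in regions
    ]


def scan_less_than(regions, index, column, bound):
    """Return rows where column < bound, skipping regions that cannot match.

    Also returns how many regions were actually read.
    """
    matches = []
    regions_read = 0
    for region, (low, _high) in zip(regions, index):
        if low >= bound:
            # Even the smallest value is too large: skip the whole region.
            continue
        regions_read += 1
        matches.extend(row for row in region if row[column] < bound)
    return matches, regions_read


# Hypothetical data: two regions of a table with columns A and B.
regions = [
    [{"A": 10, "B": 1}, {"A": 11, "B": 3}, {"A": 12, "B": 5}],
    [{"A": 13, "B": 3}, {"A": 14, "B": 6}, {"A": 15, "B": 8}],
]
index_b = build_storage_index(regions, "B")

# Equivalent of: SELECT * FROM TABLE WHERE B < 2
rows, regions_read = scan_less_than(regions, index_b, "B", 2)

if __name__ == "__main__":
    print(index_b, rows, regions_read)
```

The second region is never read because its minimum B already fails the predicate. That is the "negative index" idea: the index tells you where the data cannot be, not where it is.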
So as you can see, the whole system is faster on all counts than the already fast generation 1 system. It is also much faster than anything else out there in the market.
Seeing that there is much more news, like the actual family details (half racks, quarter racks and smaller), the offloading of data mining scoring and all the 11g Release 2 details we haven't yet covered, expect quite a few follow-up posts on both 11gR2 and Generation 2 Exadata.
Next, as promised earlier, is the 11gR2 in-memory parallel execution.