Tuesday May 26, 2009


This is my last entry in this Blog. I have decided to pursue my career outside of Sun.

I hope you enjoyed the Blog!

See you on the Net.


Tuesday Mar 31, 2009

Intelligent Performance - Solaris optimized for Nehalem

Over the last two years, Sun and Intel have been working together - from design and architecture through implementation - to ensure that Solaris is optimized to unleash the power and capabilities of current and future Intel Xeon processors. The compelling results include:
  • Increased performance, as the Solaris OS takes advantage of Intel multi-core processor capabilities and Intel Turbo Boost Technology
  • Optimized power efficiency and utilization, by enabling Solaris to take advantage of the performance-enhanced dynamic power management capabilities of the Intel Xeon processor 5500 series (aka Nehalem)
  • Extended predictive capabilities to improve reliability, by incorporating Nehalem features into the Solaris Fault Management Architecture (FMA)
Now, let's talk more in-depth about the innovations within Solaris in combination with Intel's Nehalem architecture.

Intelligent Performance

We have optimized the performance of Solaris for individual cores and the overall multi-core microarchitecture, which increases both single- and multi-threaded performance. Intel Turbo Boost Technology uses any available power headroom to deliver higher clock rates. In situations where the application requires maximum processing power, the Intel Xeon processor 5500 series increases the frequency of the active cores when conditions such as load, power consumption, and temperature permit it.

The Solaris threading model delivers sophisticated performance with specific optimizations for the new Nehalem architecture. Solaris also takes advantage of the new Intel QuickPath Interconnect (QPI) through features such as an optimized scheduler and memory placement optimization (MPO), which has proven performance benefits on non-uniform memory access (NUMA) systems by reducing memory latency. The Solaris NUMA implementation takes its information from the Advanced Configuration and Power Interface (ACPI) System Resource Affinity Table (SRAT) and System Locality Information Table (SLIT).

Modern processors provide the ability to observe the performance characteristics of applications using performance counters. Solaris provides the libcpc(3LIB) API to access these performance counters. These interfaces can be used to observe the performance characteristics of applications. The following utilities provide information from libcpc:
  • cpustat -h provides a listing of all the events that are available on a given processor.
  • cputrack is used to analyze performance characteristics on a per-process or per-LWP basis.
  • The Performance Analyzer tools collect, analyzer, and er_print.
The DTrace CPU Performance Counter provider (cpc provider) makes available probes associated with processor performance counter events.

Automated Energy Efficiency

Solaris takes advantage of many power efficiency features in the Nehalem architecture. For example, an innovative Power Aware Dispatcher (PAD) has been integrated into OpenSolaris, enabling fine-grained power management (P-states). We have seen a substantial reduction in idle power consumption, lower power consumption at maximum CPU utilization, and improved performance when switching between power states.

The kernel dispatcher - the part of the kernel that decides where threads should run - is integrated with the power management subsystem of the Nehalem CPU. The kernel can therefore direct work to the parts of the processor that are active and avoid scheduling work on parts that are powered down.

PowerTOP is a new command-line tool that shows how effectively a system is taking advantage of the processor's power management features. The application observes the system at regular intervals and displays a summary of how long (on average) the processor spends in each power state.

Reliability and Availability

Sun and Intel are working together to extend the capabilities of the Solaris Fault Manager by also supporting the Machine Check Architecture (MCA). The Fault Manager automatically diagnoses the underlying problems and responds by off-lining faulty components.

The ability to fast reboot a system drastically reduces downtime and improves efficiency. Fast Reboot is a command-line feature that lets you reboot an Intel Xeon processor 5500 series system quickly, bypassing the BIOS, the power-on self-test, and the GRUB boot loader. Fast Reboot (reboot -f) implements an in-kernel boot loader that loads the kernel into memory and then switches to it, so that the reboot process completes within seconds.

The ability to work around processor errata in the operating system by applying microcode updates is available in Solaris. This support alleviates the need to upgrade a system's BIOS every time a new microcode update is required.


We have also made optimizations to the compiler tools and runtime libraries, including full support for Streaming SIMD Extensions (SSE 4.2). To enable automatic use of SSE instructions, specify -xvector=simd in your compiler options.

To speed up serial application performance on multithreaded chips like the Intel Xeon 5500 series, you can use the compiler option -xautopar. The compiler will then generate code that, at runtime, uses more than one thread to execute the loop body.

OpenMP is the de facto standard for writing multi-threaded applications to run on multi-threaded machines. The OpenMP specification version 3.0 defines a rich set of directives, runtime routines, and environment variables that allow the programmer to write multi-threaded applications in C, C++, and Fortran. The Sun Studio software facilitates OpenMP development. The main motivations for using OpenMP are performance, scalability, portability, and standardization: with a relatively small coding effort, a programmer can write multi-threaded applications that run on multi-threaded machines.


This video from David Stewart (Intel) also provides some insights into the improvements in Solaris for the Nehalem architecture:

Sun provides everything necessary to get the best out of the latest Intel Xeon 5500 series CPUs: starting with the hardware in our innovative X-Series servers, continuing with the Solaris operating system and all its enhancements, and ending with a complete compiler environment that leverages all of these features for your application.

You can find more information at the following links:

Wednesday Mar 25, 2009

MacHeist 3 Bundle

The MacHeist bundle features a core lineup of a dozen award-winning and popular apps, games, and utilities that represent the cream of the crop from the Mac development community. This year, the bundle has some great and cool tools like iSale, Picturesque 2, SousChef, World of Goo, PhoneView, LittleSnapper, Kinemac, WireTap Studio, Boinx TV, The Hit List, and Espresso.

MacHeist is also donating 25% of each bundle purchase to a partnering charity of the customer's choice. To date, MacHeist customers have raised over $700,000 for philanthropic efforts around the world.

So if you are a Mac user and need some cool little apps to improve your daily work, this bundle might be of interest to you. I personally bought it :-)


Tuesday Mar 17, 2009

Open Flash

Our new Open Flash Module is the world's first enterprise-quality, open-standard Flash design. Built to an industry-standard JEDEC form factor, the module is being made available to developers and the OpenSolaris Storage community to foster Flash innovation. The Open Flash Module delivers unprecedented I/O performance, saves on power, space, and cooling, and will enable new levels of server optimization and datacenter efficiencies.

Imagine what you can do with such a Flash Module!
  • Instead of having servers that only use RAM to increase their performance, you can now build a server that combines RAM and SSD in one. The big advantage is certainly the non-volatility of the Flash Module: you can use it for caching write transactions or anything else that needs persistent storage.
  • Or, use the module in traditional FC array controllers as a backup of the traditional RAM cache. If the controller loses power, the RAM-based cache can be written to flash in seconds - a kind of hibernate in an array ;-)
  • You can also use the module in traditional Array Controllers as an extension of the RAM Cache. Very similar to the ZFS Hybrid Storage Pool Model, Raid Controllers could implement a tiered cache model. In this case any application or file system (that is not yet as innovative as ZFS) could profit from the combination of RAM and Flash!
  • In addition to RAM, notebooks could have a Flash DIMM slot. On this Flash DIMM, the complete operating system and the active data you are working with are stored in a tiered storage approach. That allows the system to spin down your high-capacity hard drive and save a lot of power!
  • Compute nodes in clusters could dramatically expand their working storage with Flash DIMMs. Eight Flash DIMMs at the initial 24GB capacity would provide 192GB of quite fast compute memory.
In many situations the performance of a Flash DIMM is sufficient. Considering its higher density (24GB per DIMM) and lower power consumption compared to a RAM module, this product is really powerful and interesting!

Initial Specifications

  • 24 GB initial capacity
  • 64 MB DRAM buffer cache

Form factor
  • JEDEC MO-258A

Interface
  • 3 Gb SATA-II/SAS-I

Endurance
  • 7 x 24 x 3 years (100% write duty cycle)
  • Designed for enterprise-class applications

Want to know more?

See what Chief Architect Andy Bechtolsheim says about SSDs and Open Storage.

Product Information

To get more information about Flash Storage and the Open Flash Module, use the following links:

Tuesday Feb 10, 2009

Distributed Computing

Today everybody thinks about distributed computing, virtualization and data center optimization. So, within the next few blog entries I am going to talk about distributed computing.

Distributed Computing and its Market

Historically, distributed computing was mostly used in education and research. In recent years, the trend has changed tremendously: finance and insurance companies are moving more and more away from big-iron systems to grid-based solutions and distributed computing. It is not only much cheaper, but also much more efficient. Typical applications for grid computing are risk analysis, instant credit calculations, 3D calculations, and simulations of all kinds (weather, earthquake, semiconductor, etc.). In fact, distributed computing is great for every application that can run multiple processes in parallel on different systems.

What do I need for Distributed Computing?

Basically, distributed computing is very simple, and you just need a few tools and systems to run your own Grid / Cluster / HPC setup. Below is a short overview of what you need to go distributed:
  • 2 + N Server Nodes with same software stack
  • Centralized Storage that can be accessed from any System Node
  • A Scheduling Software that manages the processes which are distributed to the System Nodes
  • A Monitoring Tool that gives you insight into how efficiently your System Nodes are being used
  • A Network for efficient communication between the System Nodes, preferably 1Gb Ethernet or faster. InfiniBand would be optimal!


In the next few blog entries, I am going to talk about:
  • Mastering the Grid
  • Storage in the Grid
  • Monitoring the Grid
  • Into the Cloud!
So, Have fun!

Wednesday Feb 04, 2009

The Screaming Fast Sun Modular Storage 6000 Family!

Did you know that both the Sun StorageTek 6140 and 6540 disk arrays, which belong to our Modular Storage line, are still leading the price/performance rankings in their class? Feel free to verify this at StoragePerformance.org. The modular approach and the ability to upgrade from the smallest to the biggest system just by exchanging controllers is unique, and our customers love this investment protection!

The Uniqueness of the Sun Storage 6000 Family

Today, the 6000 modular storage portfolio looks as follows:
  • 6140-2 (up to 64 Disk Drives mixed FC and SATA)
  • 6140-4 (up to 112 Disk Drives mixed FC and SATA)
  • 6540 (up to 224 Disk Drives mixed FC and SATA)
  • 6580 (up to 256 Disk Drives mixed FC and SATA)
  • 6780 (up to 256/448\* Disk Drives mixed FC and SATA)

All 6000 series arrays use an ASIC (Application-Specific Integrated Circuit) to do RAID operations. This results in very low latency overhead and guaranteed performance. The publicized cache volume is 100% dedicated to the ASIC and can't be accessed by the management CPU, which, in the case of the 6780 for example, has its own separate 2GB of RAM. Across the complete family, you have upgrade protection.

You can start with a 6140-2 and seamlessly upgrade to a 6780 by just replacing the controllers! No configuration changes or exports are necessary, as the complete RAID configuration is distributed to each single disk in the array. You can also move a complete RAID group to a different array in the family. Of course, you should make sure that both are running the same firmware level. ;-)

Sun StorageTek 6780 Array

Today, Sun announced its latest and greatest midrange disk array, completing the modular line as the high-end model of the 6000 series. The connectivity of the storage array and its features are very impressive and pretty unique in the midrange segment!
  • Replaceable Host Interface Cards (two per Controller)
    • Up to 16x 4Gb or 8Gb\* FC Host Channels
    • Up to 8x 10Gb\* Ethernet Host Channels
  • 16x 4Gb FC Drive Channels
  • Up to 16x/28x\* Drive Enclosures
  • Up to 32GB\* dedicated RAID Cache
  • RAID 0,1,3,5,6,10
  • Up to 512 Domains = up to 512 servers with dedicated LUN mapping can be attached to the array
  • Enterprise Features:
    • Snapshot
    • Data Copy
    • Data Replication
Below are some insights into the architecture of the 6780 array:


The internal flash storage allows the array to survive long power outages without losing I/O that has not yet been written to disk. As you can see, each drive chip has access to all disk drives. Everything in the controller and drive enclosure has a redundancy factor of at least two. In some cases, like the drive chips, the redundancy factor is even higher.


The expansion trays are SBODs (Switched Bunch Of Disks) and therefore limit the impact of a drive failure. Most other vendors still use looped JBODs, where the loop is vulnerable if a drive fails. In the worst case, a complete tray could fail just because of one failing drive. Looped BODs are also slower than switched BODs.


Due to the high number of drive channels, the maximum drive count per dual 4Gb FC loop is 64 (with 448 drives). With 256 drives, you will only have 32 drives per dual 4Gb FC loop. Thanks to this, and to the dedicated ASICs for RAID calculations, the 6780 array can do up to 175,000 IOPS and 6.4GB/s of throughput in disk read operations. This is for sure the top rank in the midrange segment!


By now, you should know that Sun is NOT a me-too manufacturer in the storage business. Our modular storage family uses leading-edge technology and delivers investment protection by providing an easy upgrade path to the next higher controller level.

\*Will be available after initial release.

Sunday Feb 01, 2009

Traditional Arrays vs "The Open Storage Approach"

Why should I still use a traditional Array?

You may ask yourself why you should use a traditional array if Sun is pushing towards OpenStorage. Good question! Just as there isn't a cow that provides milk, coke, and beer, there isn't a storage product that does everything for you ... today. So, while our OpenStorage family is perfect today for IP-network-oriented access like CIFS, NFS, and iSCSI, it doesn't yet cover the FC block-attached community. And despite all the honour that ZFS and OpenSolaris deserve, an ASIC, if you have the money and the skills to build one, will do faster RAID calculations. ASICs do not require an operating system underneath the RAID code, which results in far less latency in the calculations.

The Unanswered Question ...

There is one unanswered question that remains in the IT business: how long can companies afford to build ASICs that keep up with the performance increases of the volume business? ASICs, as the name states, are built for a specific purpose and are therefore manufactured in much lower volumes. That means they are simply much more expensive than general-purpose CPUs.

Another question might give you an impression of the future: who still programs Assembler? Every programmer knows that if you write perfect Assembler code, no - but really NO - C, C++, or Java program will ever run faster than your Assembler program, right? But programming Assembler gets so complex that you can no longer manage your code. That's why we use abstraction layers to simplify the job.... Got the hint?

Now, there is also a huge design problem with a dedicated ASIC. Since it is hardware, you cannot extend its features by just upgrading software. An ASIC can do what it's built for, and is therefore very limited when it comes to adding features! From a manufacturing and design perspective, this is very limiting: one little thing missing or wrong in an ASIC, and the complete product fails, with no chance to fix or change it. Uhhh, you'd better make no mistakes ...


So, depending on your requirements, you will have to choose the appropriate technology! If you can afford the no-compromise way of storage, the best solution is to have both, or maybe a combination of the two. :-)

In the long term, I see only one solution surviving: the combined approach of commodity hardware and software. It provides the key elements for success, namely:
  • Great price/performance
  • Possibility to add features (in best case for free) with easy upgrades
If the software used is open source, you suddenly have the ability to add features to the subsystem yourself! One example is the COMSTAR project, which turns an OpenSolaris host into a SCSI target.


So, you'd better keep Sun's OpenStorage vision in mind.

See how the L2ARC in OpenStorage works

Brendan Gregg from the Fishworks engineering team did a tremendous job testing the behavior of the L2ARC in a well-populated Sun Unified Storage Server 7410. It is a great introduction to how the L2ARC works in combination with SSD technology! To read more about it, click here.

Thursday Jan 15, 2009

OpenSolaris 2008.11, Closing the gap or leapfrogging Linux?

When OpenSolaris 2008.05 was released in May 2008, the feedback from several communities and magazines was generally positive, but with several hints that OpenSolaris still had a long way to go to close some gaps with Linux. Often mentioned were the number of pre-compiled (packaged) applications available, the package management, and the lack of drivers for certain devices.

With the release of OpenSolaris 2008.11, we can say that we have addressed many of these topics. We have also improved other areas that we believe are simply unique in the Linux/UNIX world! Very often, people just talk about the front-end and the applications that run on your desktop, ignoring that OpenSolaris is not just another Linux - in fact, it is a UNIX!

Since the release of OpenSolaris 2008.05, the installed base has doubled, and more than 80 OpenSolaris user groups are now spread around the world.

Let me now introduce some new features and key differentiators we have compared to any other Linux/UNIX:

If you look at what made OpenSolaris more and more popular, it is certainly ZFS and DTrace.
  • Now, DTrace is probably not a feature that you would use as a standard computer user, but it is essential for developers, helping them improve the quality of their applications and discover bugs. DTrace allows you to debug your application on the fly while it is running in OpenSolaris, without special debugging code (binaries). No other operating system is capable of doing this, except FreeBSD and Mac OS X. The guys at FreeBSD are adopting and implementing a lot of cool features first introduced in Solaris and OpenSolaris, like ZFS.

  • Many people are switching to OpenSolaris because of ZFS. The very simple interface and the tremendous functionality of ZFS make it the best file system ever. This leads me to the newly added feature called Time Slider, a combination of features built on top of ZFS and integrated into the GNOME file browser. It basically allows you to slide your file system back in time and recover files that you have, for example, accidentally deleted or modified. You can compare this feature with Time Machine on the Mac, with the exception that you don't need an external disk!
This video shows how Time Slider looks and works:

We have tremendously improved the application stack including:
  • updated GNOME to release 2.24
  • the latest version of Firefox, including DTrace probes for debugging web applications and their behaviour
  • integration of Songbird, which is built on the Mozilla framework and is simply a great music player with a touch of the look-and-feel of iTunes ;-)
  • a new fully featured AMP stack, including Ruby and DTrace probes for easier troubleshooting
  • integration of OpenOffice 3.0
Further cool enhancements are:
  • more open source software support than ever before
  • tuning of OpenSolaris core components
  • support and performance optimizations for the Intel Core microarchitecture (upcoming Intel CPUs, codename Nehalem)
  • power efficiency optimizations for datacenters, not just notebooks
  • virtualization optimizations for runtime environments
  • improved package management software
  • proper and clean sleep and resume functions for notebooks
See what Intel says about its new Core microarchitecture and the relationship with OpenSolaris!

Intel and AMD are also helping to optimize the IOMMU, which manages DMA. Want to know what DMA and the IOMMU are? Check this video for more information.

One great and unique feature of OpenSolaris is its simple and easy distribution upgrade. There is no reason to worry if you are not satisfied with a new release of the operating system: when upgrading to the next release, the system automatically creates a bootable ZFS snapshot of your old release, allowing you to boot your old environment whenever you want. Which Linux does this? There are also not many Linux distributions that allow easy major-release upgrades; as far as I know, only the ones based on the Debian package manager.

In further news, Sun & Toshiba announced pre-configured OpenSolaris notebooks that will be available soon, as well as the certification of Zmanda's backup solutions for OpenSolaris.

The combination of Zmanda’s robust backup solutions (Amanda) with Sun’s innovative OpenSolaris operating system, including the advanced ZFS file system, creates one of the most advanced backup-to-disk offerings available on the market today. Specifically, the snapshot capability of ZFS enables fast and scalable backups of today’s most demanding workloads. With the new features and record-breaking performance introduced in OpenSolaris 2008.11, we demonstrate our commitment to rapidly innovating on the open source ecosystem.

Amanda Enterprise is an enterprise-grade network backup solution based on Amanda, the world’s most popular open source data backup and recovery software. It is a powerful, low-cost, open source solution that protects OpenSolaris, Solaris, Linux, Windows, and Mac OS X environments using a single management console.

Did you know that our Sun Storage 7000 Series is also certified for Amanda Enterprise Backup?


Well, we might not have completely closed the gaps with Linux in the desktop area, but OpenSolaris is clearly leapfrogging Linux on the backend by using ZFS, the most flexible, scalable, and innovative file system ever, and by providing unique DTrace functionality to our open source developers. Today, OpenSolaris is my personal choice for a Web 2.0 based environment!

Tuesday Nov 25, 2008

ZFS - A Smashing Hit

See how Jim Hughes demonstrates how ZFS compresses data and how, if a failure occurs, the data is still there! This is a very practical video ;-) Have fun!

Thursday Nov 13, 2008

New Class Of Storage Systems - Sun Storage 7000 Unified Storage Systems

I have been blogging for a while about Open Storage, ZFS Hybrid Storage Pools and Solid State Disks. Now the products that combine all of these technologies are available - and they will disrupt the complete storage market!

You may ask why? Here are a few reasons:
  1. There is simply no price-competitive system on the high-end market
  2. Our systems have no license fees - everything is included, including future features
  3. There is no other system that has such in-depth, built-in analytics
  4. There is no other system that combines new SSD technology and traditional storage
  5. No other system has a rock-solid OS like Solaris, with all its features such as DTrace and the Fault Management Architecture (FMA)
I could mention dozens more reasons, but that should already be enough to seriously consider these systems for your business!

Enough marketing talk - I would like to go a bit deeper and give you a short introduction to the products and their extraordinary features!

Sun Storage 7000 Unified Storage Systems

We have announced three different Unified Storage Systems to begin with. The two smaller versions are single-node systems, while the 7410 can be used in an active/active cluster (2 nodes). All systems are fully licensed and run the same OS. There are no feature restrictions on the smaller systems, except those given by the hardware configuration.

Sun Storage 7110 Unified Storage System

The 7110 is the entry-level Unified Storage System. It has the following hardware specifications:
  • 14x Usable Disk Drives
  • Quad-Core Opteron
  • 8GB RAM
  • 4x 1 Gigabit Ethernet Ports
  • 6x PCI-E Slots per Node
  • 1Gb-E and 10 Gb-E Network Interface Cards
  • FC/SCSI HBA Options for Backup/Restore
Today the system is equipped with 14x 146GB 10k RPM disks. In the future, you will also be able to have it equipped with 14x 500GB SATA disks.

The 7110 is the only system that cannot be equipped with SSDs for now. The system is perfectly suited as workgroup storage and uses just 2U of rack space.

Sun Storage 7210 Unified Storage System

The 7210 is the ultimate dense Unified Storage System. It not only has a lot of disks but also considerable caching and CPU power. Here are some hardware specifications:
  • 44-46x Usable Disk Drives
  • 0-2x LogZilla 18GB SSDs
  • Dual Quad-Core Opteron
  • 32GB/64GB RAM
  • 4x 1 Gigabit Ethernet Ports
  • 3x PCI-E Slots per Node
  • 1Gb-E and 10 Gb-E Network Interface Cards
  • FC/SCSI HBA Options for Backup/Restore
The 7210 is the ultimate dense storage pod! In combination with the 44TB of storage and the LogZilla write acceleration, the system can provide up to 780MB/sec of throughput - all of this in only 4U of rack space!

Sun Storage 7410 Unified Storage System

The 7410 is our highly available, high-performance Unified Storage System. It supports two configurations, a single node and a 2-node cluster for high availability. Each configuration comes in three levels - Entry, Mid, and High - where the main differences are in compute power. Here is an overview of the hardware specifications:
  • Up to 576x Usable Disk Drives
  • Up to four Quad-Core Opteron per Node
  • Up to 4x LogZilla 18GB SSDs per Node
  • Up to 6x ReadZilla 100GB SSDs per Node
  • up to 128GB RAM per Node
  • 4x 1 Gigabit Ethernet Ports per Node
  • 6x PCI-E Slots per Node
  • 1Gb-E and 10 Gb-E Network Interface Cards
  • FC/SCSI HBA Options for Backup/Restore
Have you ever seen a storage system with 128GB of cache per controller? We go even further by adding 600GB of L2ARC cache! So in fact, if you go for the big cluster, you will have 256GB of L1ARC cache and 600GB of L2ARC cache. Again, this is where we start today - imagine how much cache we will have in the future.

The 7410 is based on compute nodes (heads) and storage nodes (JBODs). In terms of compute power, you have 16 cores per head to do all storage and file system related work. In a clustered configuration, you have 32 cores that work in parallel (active/active)! Together, the heads therefore have a theoretical capability of more than 1.6 million I/Os per second!

A storage node is a 4U rack-mountable chassis that can hold up to 24 disks. You can attach up to 24 storage nodes to this system, which gives you a total of 576 disk drives.

The Sun Storage 7410 implements a true ZFS Hybrid Storage Pool, with support for flash-memory devices for the acceleration of reads (100GB Read Flash Accelerator, aka ReadZilla) and writes (18GB Write Flash Accelerator, aka LogZilla). Multiple configurations are available for both the node and the expansion array to accommodate the most demanding customer application performance requirements. You can find more details about the SSD integration below in the feature section.

Extraordinary Features

Now that we have seen the hardware features of the three Unified Storage Systems, I would like to go a bit deeper into the software features. These are in fact the features that make these products so unique and interesting!

SSD Integration / Hybrid Storage Pools

The Sun Storage 7000 systems use a Flash Hybrid Storage Pool design, which is composed of optional flash-memory devices for the acceleration of reads and writes, low-power high-capacity enterprise-class SATA disks, and DRAM memory. All these components are managed transparently as a single data hierarchy, with automated data placement by the file system. In the Storage 7410 model, both the Write Flash Accelerator (write-optimized SSD, aka LogZilla) and the Read Flash Accelerator (read-optimized SSD, aka ReadZilla) are used to deliver superior performance and capacity at lower cost and energy consumption than competitive solutions. The Storage 7210 currently implements only the write-optimized SSD, and the Storage 7110 does not currently implement this design.

ZFS provides two ways of adding flash memory to the file system stack to improve overall system performance: the L2ARC (Level 2 ARC) for random reads, and the ZIL (ZFS Intent Log) for writes. The L2ARC (the ARC is the ZFS main memory cache in DRAM) sits between the memory cache and the disk drives and extends the main memory cache to improve read performance. The ZFS Intent Log uses write-flash SSDs as log devices to improve write performance.

The main reason why we have chosen different SSDs (ReadZilla, WriteZilla) lies in the fact that flash-based SSDs are still quite expensive and have some limitations in how they write data. The WriteZilla SSDs have a more complex controller chip that can handle thousands of write I/Os per second, a bigger DRAM cache, and a capacitor that ensures no I/O gets lost between the DRAM and the flash chips in case of a power outage. WriteZilla SSDs are therefore optimized for writes, while ReadZilla SSDs are optimized for read operations.

Realtime Analytics

Realtime Analytics is one of the coolest features of this product, and it was only possible because Solaris has DTrace built in. The Sun Storage 7000 systems are equipped with DTrace Analytics, an advanced DTrace-based facility for server analytics. DTrace Analytics provides real-time analysis of the Storage 7000 system and of the enterprise network, from the storage system to the clients accessing the data. It is an advanced facility for graphing a variety of statistics in real time and recording this data for later viewing. It has been designed for both long-term monitoring and short-term analysis. When needed, it uses DTrace to dynamically create custom statistics, which allows different layers of the operating system stack to be analyzed in detail.

Analytics has been designed around an effective performance analysis technique called drill-down analysis: check high-level statistics first, then focus on finer details based on the findings so far. This quickly narrows the focus to the most likely problem areas.

So how does this work?

You may discover a throughput problem on your network. By selecting the interface that causes you headaches, you can drill down by protocol, and even deeper to the NFS client that causes the high load. We don't stop there: you can drill down further to figure out what kind of files the NFS client is accessing, at what latency, and so on. DTrace Analytics creates datasets as you drill down. These datasets can be stored and reused at a later time. The analytic data is not discarded: if an appliance has been running for two years, you can zoom down to by-second views for any time in those two years for your archived datasets. The data is stored on a compressed file system and can be easily monitored. You can destroy datasets on demand or export them as CSV.

Other Features

The Unified Storage Systems have a lot of other features, which I will cover briefly.

Data Compression

Data compression is useful because it reduces the consumption of expensive resources, such as hard disk space or transmission bandwidth. The Sun Storage 7000 System software supports four levels of data compression: LZJB and three levels of GZIP. Shares can optionally compress data before writing to the storage pool. This allows for much greater storage utilization at the expense of increased CPU utilization. In the Sun Storage 7000 family, no compression is done by default. If compression does not yield a minimum space saving, the compressed version is not committed to disk, which avoids unnecessary decompression when reading the data back.
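The "commit compressed data only if it is worth it" rule can be sketched as follows. The 12.5% minimum-savings threshold and the use of zlib here are assumptions for illustration, not the exact ZFS implementation:

```python
# Sketch of the "write compressed only if it saves enough space" rule.
# The 12.5% threshold mirrors ZFS's behavior but is an assumption here,
# and zlib stands in for the appliance's LZJB/GZIP codecs.
import zlib

MIN_SAVINGS = 0.125  # compressed block must be at least 12.5% smaller

def store_block(data: bytes, level: int = 6):
    """Return (payload, compressed?) as it would be written to the pool."""
    compressed = zlib.compress(data, level)
    if len(compressed) <= len(data) * (1 - MIN_SAVINGS):
        return compressed, True   # worth it: write the compressed block
    return data, False            # not worth it: write raw, so later reads
                                  # skip the decompression step entirely

if __name__ == "__main__":
    payload, used = store_block(b"abcd" * 4096)   # highly compressible
    print(used, len(payload))
```

Incompressible data (already-compressed media, random bytes) falls through the threshold and is stored raw, which is exactly why reads of such data pay no decompression penalty.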


Snapshots

A snapshot is a read-only copy of a file system or volume. Snapshots can be created almost instantly and initially consume no additional disk space within the pool. When data within the active dataset changes, the snapshot consumes disk space by continuing to reference the old data, preventing that space from being freed. Snapshots are the basis for replication and just-in-time backup.

Remote Replication

The Sun Storage 7000 Remote Replication feature can be used to copy a filesystem, a group of filesystems, or LUNs from any Storage 7000 System to another 7000 system at a remote location over an interconnecting TCP/IP network that propagates the data between them. Replication transfers the data and metadata in a project and its component shares either at discrete, point-in-time snapshots or continuously. Discrete replication can be initiated manually or run on a schedule of your own creation. With continuous replication, data is streamed asynchronously to the remote appliance as it is modified locally, at the granularity of storage transactions to ensure data consistency. In both cases, data transmitted between appliances is encrypted using SSL.

iSCSI Block Level Access

The Sun Storage 7000 family of products acts as an iSCSI target for several iSCSI hardware and software initiators. When you configure a LUN on the appliance, you can specify that it is an Internet Small Computer System Interface (iSCSI) target. The service supports discovery, management, and configuration using the iSNS protocol. The iSCSI service supports both unidirectional (target authenticates initiator) and bidirectional (target and initiator authenticate each other) authentication using CHAP. Additionally, the service supports CHAP authentication data management in a RADIUS database. You can even do thin provisioning with iSCSI LUNs, which means they grow on demand.

Virus Scan

This feature allows the Storage 7000 family to be configured as a client of an antivirus scan engine. The Virus Scan service will scan for viruses at the filesystem level. When a file is accessed from any protocol, the Virus Scan service will first scan the file, and both deny access and quarantine the file if a virus is found. Once a file has been scanned with the latest virus definitions, it is not rescanned until it is next modified.

NDMP Backup and Restore

Backup and restore is one of the primary goals of enterprise storage management. Backups and restores should happen in a timely, secure, and cost-effective manner across enterprise-wide operating systems. Companies need high-performance backup and the ability to back up data to local media devices. While the data itself may be distributed throughout the enterprise, its cataloging and control must be centralized. The emergence of network-attached storage and dedicated file servers makes storage management more challenging. The Network Data Management Protocol (NDMP) recognizes that these issues must be addressed. NDMP is an opportunity to provide truly enterprise-wide, heterogeneous storage management solutions, permitting platforms to be driven at a departmental level and backup at the enterprise level.

The Sun Storage 7000 Systems support NDMP v3 and v4.

Phone-Home Telemetry for all Software and Hardware Issues

Phone-home provides automated case opening when failures are detected in the system. This assures faster time to resolution and reduces the time needed to figure out what the problem might be.

End-to-End Data Integrity and self-healing mechanisms

The Sun Storage 7000 systems include FMA (Fault Management Architecture), which provides the capability to detect faulty hardware components and take them offline in order to prevent system disruption. In addition, to avoid accidental data corruption, the ZFS file system provides memory-based end-to-end data and metadata checksumming with self-healing capabilities to fix potential issues. FMA combined with the ZFS data integrity facilities makes the Sun Storage 7000 the most comprehensive self-healing unified storage system.


What makes this system so screaming cool? It is simply the combination of all the features: starting at the hardware with the SAS protocol, the incredibly large amount of cache and the integration of SSD technology, going on to soft features like real-time analysis, end-to-end data integrity and FMA (Fault Management Architecture), and finally its foundation on open source technology (OpenSolaris, ZFS, and many other open source projects) that assures future innovation. Features like encryption, de-duplication and FC target mode are on their way. And you know what? You will get them all at no additional license cost! That is what I call investment protection.

If you don't consider these Unified Storage Systems for your next IT investment, you are simply ignoring the facts and may spend far too much money on a product with limited features.

Tuesday Nov 11, 2008

ZFS and the Hybrid Storage Concept

I initially wrote about ZFS and Hybrid Storage Pools in an older blog entry, "Open Storage - The (R)Evolution". Now I would like to dive a bit deeper into this great feature.

Hybrid Storage

ZFS is not just a filesystem. It is actually a hybrid filesystem and volume manager. These two functions are the main source of the flexibility of ZFS. Being hybrid means that ZFS manages storage differently than traditional solutions. Traditionally, you have a 1:1 mapping of filesystems to disk partitions, or alternately a 1:1 mapping of filesystems to logical volumes, each of which is made out of one or more disks. In ZFS, all disks participate in one storage pool. Each filesystem can use all disk drives in a pool, and since the filesystem is not mapped to a volume, all space is shared! Space can be reserved, so that a single filesystem cannot fill up the whole pool, and space reservations can be changed at will. Growing or shrinking a filesystem isn't just painless, it is irrelevant!

The definition of hybrid storage within ZFS goes even further! A storage pool can consist of more than just logical volumes or partitions. You can split the pool into three different areas:
  1. ZIL - ZFS Intent Log
  2. Read / Write Cache Pool
  3. High Capacity Pool
By using different devices for each position above, you can tremendously increase the performance of your filesystem.

ZFS Intent Log (ZIL)

All file system related system calls are logged as transaction records by the ZIL. The transaction records contain sufficient information to replay them back in the event of a system crash.

The ZIL is critical for the performance of synchronous writes. A common application that issues synchronous writes is a database, which means that all of these writes run at the speed of the ZIL.

Synchronous writes can be quickly written to and acknowledged by the "slog" (separate log device, in ZFS jargon) before the data is written to the storage pool. The slog is used only for small transactions, while large transactions use the main storage pool: it's tough to beat the raw throughput of large numbers of disks. A flash-based log device is ideally suited for a ZFS slog. Using such a device with ZFS can reduce the latency of small transactions to a range of 100-200µs.
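The small-versus-large routing decision can be sketched like this. The 32KB cutoff stands in for a ZFS tunable and should be treated as an assumed value for illustration, not the shipped default:

```python
# Sketch of how synchronous writes are split between the flash slog and the
# main pool. The 32 KB cutoff is an assumed stand-in for the real ZFS tunable.

SLOG_CUTOFF = 32 * 1024   # bytes; small sync writes go to the flash slog

def route_sync_write(size: int) -> str:
    """Decide where a synchronous write is logged before acknowledgement."""
    if size <= SLOG_CUTOFF:
        return "slog"      # low-latency flash log device (~100-200 us)
    return "main-pool"     # large writes: raw disk throughput wins

if __name__ == "__main__":
    for size in (4 * 1024, 128 * 1024):
        print(size, "->", route_sync_write(size))
```

A 4KB database commit is acknowledged at flash latency, while a 128KB bulk write streams straight to the spindles.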

Read Cache Pool

How much of the data on your traditional storage systems is active data? 5%? 10%? Wouldn't it be nice to have low-latency, persistent storage that delivers the information in time and without additional I/O on your traditional storage (disks)? Is your RAM not sufficient to hold all hot read data, or is it too expensive to have 256GB of RAM?

That is exactly where the read cache pool has its role.

ZFS and most other filesystems use an L1ARC (Adaptive Replacement Cache) that resides in RAM. The drawback is that RAM is volatile and very expensive. Volatile means that after each reboot you rely on your traditional storage for a certain time, until the cache has been rebuilt for optimal performance.

The ZFS team has now also implemented an L2ARC that can use nearly any device to improve your read performance!

The level 2 ARC (L2ARC) is a cache layer in-between main memory and the disk. It uses dedicated storage devices to hold cached data, which are populated using large infrequent writes. The main role of this cache is to boost the performance of random read workloads. The intended L2ARC devices include short-stroked disks, solid state disks, and other media with substantially faster read latency than disk.

Imagine a 10TB file system with a 1TB SSD L2ARC! Screaming fast!

High Capacity Pool

The high capacity pool now simply takes care of the mass storage. You can basically go with low-performance, high-capacity disks, as most of your IO/s are handled in the L2ARC and ZIL.
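A back-of-the-envelope calculation shows why slow capacity disks suffice; all of the figures below are assumptions I picked for illustration:

```python
# Back-of-the-envelope: how many random IO/s actually reach the capacity
# disks once the caches absorb most of the load. All figures are assumed.

total_iops     = 20_000   # offered random-read load
cache_hit_rate = 0.95     # fraction served from DRAM (ARC) + flash (L2ARC)
hdd_iops       = 80       # a low-RPM SATA disk, roughly

disk_iops = total_iops * (1 - cache_hit_rate)
disks_needed = -(-disk_iops // hdd_iops)   # ceiling division

print(f"{disk_iops:.0f} IO/s reach the disks -> {disks_needed:.0f} SATA disks")
```

Under these assumptions, a 20,000 IO/s workload is carried by about a dozen cheap SATA spindles, because only the cache misses ever touch them.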

Old-fashioned Storage vs the new Fashion

The following pictures illustrate the historic view of filesystems and storage versus the current view and implementation:
Old Model

New Model
ZFS Model


By combining the use of flash as an intent-log to reduce write latency with flash as a cache to reduce read latency, we create a system that performs far better and consumes less power than a traditional system at similar cost. It's now possible to construct systems with a precise mix of write-optimized flash, flash for caching, DRAM, and cheap disks designed specifically to achieve the right balance of cost and performance for any given workload with data automatically handled by the appropriate level of the hierarchy. Most generally, this new flash tier can be thought of as a radical form of hierarchical storage management (HSM) without the need for explicit management.

ZFS allows Flash to join DRAM and commodity disks to form a hybrid pool automatically used by ZFS to achieve the best price, performance and energy efficiency conceivable. Adding Flash will be like adding DRAM - once it's in, there's no new administration, just new capability.

And do you know what? All of those features are part of our recently announced Sun Storage 7000 Unified Storage Systems!

Monday Aug 25, 2008

Why you should avoid placing SSDs in traditional Arrays!

Some vendors are announcing SSDs for their traditional arrays in the midrange and high-end sector.

This is quite surprising to me, as it is comparable to placing an 8-cylinder bi-turbo engine with 450HP into an entry-level car (I'll try to avoid naming any brands ;-)).

You might ask for an explanation? Here it is:

Traditional midrange arrays are designed to handle hundreds of traditional (15k RPM) hard disk drives. A traditional hard disk is capable of about 250 IO/s. If we compare this with the actual enterprise-class Solid State Disks available on the market, a single solid state disk can do about 50k IO/s read or 12k IO/s write. So in fact it is roughly 50-200x faster than a 15k RPM hard disk.

The controller of a midrange array system can probably do about 500k IO/s against its internal cache. So if we place about 10 Solid State Disks into such a storage system, they would simply consume the complete power of the controller. And I haven't even started talking about RAID functionality!

There is another major reason that makes such solutions ridiculous: the latency you are adding by using FC networks. While a traditional hard disk works with a latency of about 3,125µs (roughly 3.1ms), an enterprise-class Solid State Disk works with a latency of less than 100µs (0.1ms). With a traditional disk drive, you might lose about 1 IO/s to the overhead of FC switches, cable length and array controllers. With an SSD and a latency of less than 100µs, that same overhead can cost you thousands of IO/s in read performance.
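Both objections can be put into rough numbers. The figures below, and the single-outstanding-request model, are simplifying assumptions of mine; the exact loss depends on the real fabric overhead and queue depth:

```python
# The two objections in rough numbers. All figures are assumptions,
# modeled per request stream (queue depth 1).

# 1) Controller saturation: how many SSDs exhaust the array controller?
controller_iops = 500_000
ssd_read_iops   = 50_000
print("SSDs to saturate the controller:", controller_iops // ssd_read_iops)

# 2) Latency overhead: assume the FC fabric + controller add ~100 us.
def iops(latency_us: float) -> float:
    return 1_000_000 / latency_us   # IO/s with one outstanding request

fabric_us = 100
for name, dev_us in (("15k HDD", 3125), ("SSD", 100)):
    lost = iops(dev_us) - iops(dev_us + fabric_us)
    print(f"{name}: lose {lost:,.0f} IO/s per stream to the fabric")
```

The same 100µs of fabric overhead costs a spinning disk a handful of IO/s but cuts the SSD's per-stream rate in half, which is why the latency argument only bites for flash.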

So, where do I place the SSD technology?

The answer is simple!


The best protocol and technology today is SAS (Serial Attached SCSI). The only limitation of SAS is the cable length, which is limited to about 8m, but there is no additional protocol overhead as there is with FC!

There are two ways to implement SAS attached SSDs.
  1. Directly in a server, as most servers already use SAS-attached internal hard disks.
  2. Attached via SAS JBOD (Just a Bunch Of Disks) if you need more disks than a server could cover.
You might also ask how to implement the SSD technology in the most cost-effective way.

That's where most vendors have to stop, as they have no solution or good answer.

Sun's ZFS is exactly the product that is capable of combining all the benefits of SSDs with the benefits of traditional storage (DENSITY). Combining the two technologies within one file system provides performance AND density under one umbrella. The magic word is Hybrid Storage Pool.


While the slow part of ZFS (density) remains on traditional fibre channel storage arrays, the performance-critical parts, the ZIL (ZFS Intent Log) and the L2ARC (Level 2 Adaptive Replacement Cache), reside on SSD technology.

Monday Aug 04, 2008

Sun / Avaloq Banking Platform

About Avaloq

The Avaloq Banking System is an innovative and integrated IT platform that embraces modern banking practices. It is an ideal solution for asset managers, plus private, retail and commercial banks, wanting to increase their business efficiency and intending to protect their competitive advantage and long-term profitability. Avaloq's modular and open architecture provides comprehensive functionality, covering a variety of banking products, and enables the optimisation and breakdown of the value chain. Its flexible design allows financial institutions to adapt swiftly to changing market conditions, including the ability to rapidly launch new products and implement new business models.

Avaloq has a very fast-growing customer base and is one of the most innovative banking solutions available today!

Sun's Avaloq Infrastructure

The Sun infrastructure landscape perfectly suits the Avaloq banking requirements. We have found that the M-Series servers are the ideal systems for a highly scalable Avaloq platform.

The M5000 System is the most frequently used Sun server for Avaloq implementations.

The major reason for that is its internal scalability and reliability. Whether it is for the project phase (implementation, integration, testing) or for production, the M5000 matches most requirements from our customers.

With the latest announcements of the SPARC64 VII CPUs, the M-Series got a tremendous performance boost. The performance/power efficiency has been increased by 50% while the core density has been doubled: 32 SPARC64 VII cores at 2.4 GHz within one single server! With today's 4GB DIMMs, the server scales up to 256GB RAM. Nearly unlimited I/O expandability delivers high-performance connectivity to storage and network. Up to two internal I/O Units can be configured for the M5000, and each I/O Unit delivers:
  • 4 x8 PCI-E Slots
  • 1 64-bit PCI-X Slot
  • 2 SAS Disk Bays
  • 2 Gigabit Ethernet Ports
If this is not enough, you can attach up to four external I/O Units, each of which has two housings. Each housing can deliver 6 PCI-E slots or 6 PCI-X slots. With a full expansion of the system, you could have up to 32 PCI-E slots... Enough? I think so.

Now that we know which system is mostly used for Avaloq implementations, let's figure out why and what the sizing rules are!

Typical Sun Avaloq Sizings

The following figure shows a possible implementation scenario for a medium sized bank:


A typical Sun Avaloq implementation is fragmented into three different areas:
  1. The Production Servers and Integration Servers are preferably identical.
  2. The Project, Development and Test Servers need, in 99% of the cases, more compute and memory power than the Production and Integration environment. This is related to the number of databases being used.
The Avaloq Project Phase is the most demanding phase for a server. During this time multiple Avaloq instances are running, for example to verify the implementation or to test various features and modifications.

In most cases, such Avaloq instances run in separate Solaris Zones, a built-in and free Solaris feature. In combination with JumpStart, a complete Avaloq instance can be brought online from scratch within a few minutes!

High Availability Production System

As banking applications are the core of every bank, you don't want any interruptions at all. BUT, as systems, networks, storage and datacenters can fail even with the most complete RAS (Reliability, Availability and Serviceability) stack, it is wise to have a good failover scenario.

The figure below illustrates a possible HA implementation scenario. Certainly this could also be expanded.


Avaloq Sizing Challenge

Depending on what kind of bank you are, the size of your production system may vary. Why? There are two major types of banks on the market:
  • Retail Banks
    A retail bank traditionally has many more cashier transactions per day than a private bank, thus causing a higher system load.

  • Private Banks
    Traditionally, a private bank doesn't have as many cashier transactions as a retail bank, but it has many more STEX (Stock Exchange) transactions.
It is wise to work together with Avaloq and Sun to design the optimal sizing for your specific needs. Avaloq and Sun have done many sizings together, and this is a major reason why our customers are happy!

The performance bottleneck!

Like most DB-oriented applications, Avaloq implementations are highly dependent on the underlying storage subsystem; if it is not capable of storing transactions fast enough or of generating complex reports quickly, performance suffers.

The good news is that Sun has the right answer in case you have to deal with this kind of problem. Sun not only delivers the fastest midrange and high-end storage subsystems, we also have the right answer when we start talking about Solid State Disks (SSDs).

One of the first steps to increase performance is to place the Oracle redo log on Solid State Disks. A very dense Solid State Disk could even hold the complete Avaloq implementation. The disk subsystem will no longer be the bottleneck!

Keep in mind that in terms of performance increase, the NAND-based solid state disk market is growing faster today than the CPU market. You can find more about the SSD technology here.

Monday Jul 28, 2008

NAND Flash based SSDs


What is Flash?

Flash memory is non-volatile computer memory that can be electrically erased and reprogrammed. Flash technology is primarily used in memory cards and USB flash drives for general storage and transfer of data between computers and other digital products. It is a specific type of EEPROM (Electrically Erasable Programmable Read-Only Memory) that is erased and (re)programmed in large blocks. In early flash products, the entire chip had to be erased at once. Flash memory costs far less than byte-programmable EEPROM and has therefore become the dominant technology wherever a significant amount of non-volatile storage is needed.

Flash memory needs no power to maintain the information stored on the chip. In addition, flash memory offers fast read access times and better shock resistance than hard disks. Flash is able to withstand intense pressure, extremes of temperature and even immersion in water.

What is NAND?

The NAND flash architecture was introduced in 1989. These memories are accessed much like block devices such as hard disks or memory cards. Each block consists of a number of pages. The pages are typically 512, 2048 or 4096 bytes in size. Associated with each page are a few spare bytes (typically 12-16 bytes) that are used to store an error detection and correction checksum.

While programming is performed on a page basis, erasure can only be performed on a block basis. Another limitation of NAND flash is that data in a block can only be written sequentially. The number of operations (NOPs) is the number of times a sector can be programmed. So far this number is always one for MLC (Multi-Level Cell) flash, whereas for SLC (Single-Level Cell) flash it is four.

NAND devices also require bad block management by a separate controller chip. SD cards, for example, include controller circuitry to perform bad block management and wear leveling.


MLC (Multi Level Cell) NAND flash allows each memory cell to store 2 bits of information, compared to 1 bit-per-cell for SLC NAND flash, resulting in a larger capacity and lower bit cost. As a rule of thumb, MLC devices are available at twice the density of SLC devices of the same flash technology. Mature and proven, MLC technology is generally used in cost-sensitive consumer products such as cell phones and memory cards.

A significant portion of the NAND flash-based memory cards on the market today are made from MLC NAND, and the continuing rapid growth of this market can be considered an indication that the performance is meeting consumers' needs. Although the use of MLC technology offers the highest density (and the lowest cost), the tradeoff compared to single-bit-per-cell is lower performance in the form of slower write (and potentially erase) speeds, as well as reduced write/erase cycling endurance.

Also, because of the storage of 2 bits per cell, the probability of bit error is higher than for SLC technology. However, this is partially compensated for by using error detection and correction codes (EDC). System designers have long been aware of the benefits of using EDC to detect and correct errors in systems using Hamming codes (common in memory subsystems) and Reed Solomon codes (common in hard drives and CD-ROMs).

SLC NAND is generally specified at 100,000 write/erase cycles per block with 1-bit ECC. MLC is generally specified at 10,000 cycles with ECC. While the datasheet for the MLC device does not specify the level of ECC required, the MLC manufacturers recommend 4-bit ECC when using this technology. Therefore, when using the same controller, a storage device using SLC will have an endurance value roughly 10 times that of a similar MLC-based product.
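From these figures, here is a rough lifetime-writes comparison. It assumes ideal wear leveling, and the example capacities are mine, chosen only to reflect the "MLC has twice the density" rule of thumb:

```python
# Rough endurance comparison from the cycle counts above. Real devices add
# over-provisioning and imperfect wear leveling; treat this as a sketch.

def total_writes_tb(capacity_gb: int, cycles: int) -> float:
    """Total data writable over the device's life, in TB (ideal leveling)."""
    return capacity_gb * cycles / 1024

slc = total_writes_tb(32, 100_000)   # e.g. 32 GB SLC, ~100k cycles/block
mlc = total_writes_tb(64, 10_000)    # e.g. 64 GB MLC, ~10k cycles/block

print(f"SLC: {slc:,.0f} TB   MLC: {mlc:,.0f} TB   ratio: {slc / mlc:.1f}x")
```

Note that because the MLC part in this example has twice the capacity, the per-device gap shrinks to 5x; at equal capacity, the endurance advantage is the full 10x mentioned above.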

The following table shows the advantages and disadvantages of SLC Flash and MLC Flash:

                           SLC NAND                  MLC NAND
  Bits per cell            1                         2
  Relative density         baseline                  ~2x at the same process
  Cost per bit             higher                    lower
  Write/erase endurance    ~100,000 cycles/block     ~10,000 cycles/block
  Write speed              faster                    slower
  ECC requirement          1-bit ECC                 4-bit ECC recommended

When we talk about Enterprise Class Flash Storage, we clearly talk about SLC-based NAND Solid State Disks!

Why Solid State Disks?

One of the biggest advantages of flash-based SSDs is their latency. The performance of flash is a bit unusual, as it is highly asymmetric. A block of flash must be erased before it can be written, which takes on the order of 1-2ms per block. Writing to erased flash requires around 200-300µs. Most flash-based disks try to maintain a pool of previously erased blocks, so that the latency of a write is just that of the program operation. Read operations are much faster: 25-30µs for 4k. Flash-based SSDs also use internal DRAM memory to assure good write performance. The RAM is protected with a capacitor to avoid data loss on power outages, and unlike a battery, a capacitor doesn't require any maintenance.
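The asymmetry can be summarized numerically from the figures above (all values approximate midpoints of the ranges I quoted):

```python
# Flash latency asymmetry, using the approximate figures above:
# a write costs erase + program unless a pre-erased block is available.

ERASE_US   = 1500   # block erase, ~1-2 ms
PROGRAM_US = 250    # page program, ~200-300 us
READ_US    = 27     # 4k read, ~25-30 us

def write_latency_us(pre_erased: bool) -> int:
    """Effective write latency with and without a pool of erased blocks."""
    return PROGRAM_US if pre_erased else ERASE_US + PROGRAM_US

if __name__ == "__main__":
    print("write, erased pool :", write_latency_us(True), "us")
    print("write, no pool     :", write_latency_us(False), "us")
    print("write/read ratio   :", round(write_latency_us(True) / READ_US, 1))
```

Maintaining the erased-block pool turns a 1.75ms worst-case write into a 250µs one, yet even then a write is still roughly an order of magnitude slower than a read.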

Conventional storage solutions mix dynamic memory (DRAM) and hard drives; flash is interesting because it falls in a sweet spot between those two components for both cost and performance in that flash is significantly cheaper and denser than DRAM and also significantly faster than disk. Flash accordingly can augment the system to form a new tier in the storage hierarchy – perhaps the most significant new tier since the introduction of the disk drive with RAMAC in 1956.

ZFS by Sun Microsystems has been optimized to manage flash SSDs, both as cache and as main storage, and is available for OpenSolaris, Mac OS X, and the Linux operating system.

Business Cases for Flash Based SSDs

Below you can find a few business cases where flash clearly is useful and probably the technology of choice for the future:
  • As a log device for databases, like Oracle redo logs
  • As an extended, huge and persistent cache for ZFS
  • As a ZIL (ZFS Intent Log) device for ZFS, which is similar to redo logs in Oracle
  • As a metadata device for SAM-QFS. QFS can separate the metadata from the real data. This improves common file system operations like ls and find by factors. It also improves the "Directory Name Lookup Cache" by factors on huge file systems (above 10 million files), where the traditional RAM won't be sufficient.
  • As a storage device for immense data transactions, or for databases with a huge number of transactions per second; mostly writes, as for reads you can also use the server's RAM
