Commentary about IP Storage Directions
By rmc on Apr 13, 2005
It's pretty visible that storage is rapidly being recast as a network data service. If implemented correctly, we can make our servers become stateless nodes on an IP network -- accessing their storage with the same ease with which we access HTML content on a web server. In this post I would like to open a discussion of the emerging shift to network attached storage (NAS). I believe that the emergence of NAS is being facilitated by two major transitions: the move from block level device access to file level network data services, and the merging of storage and IP networks.
The summary of this blog is:
- Ethernet bandwidth is growing faster than traditional storage interconnect bandwidth
- Ethernet is cheap -- less than $10 per port at commodity volumes
- Ten Gigabit ethernet is available today, and we are measuring >800MBytes/s per link with the Solaris 10 FireEngine network stack
- iSCSI growth is exponential, and even in early form it's demonstrating >500MBytes/s per link
- NFS single-stream bandwidth in Solaris 10 is now over 200MBytes/s (a 5x increase in the last two years)
- NFS in Solaris 10 is now data center ready, and can be used for commercial applications
Here's a link if you want to skip straight to the performance numbers.
Storage Area Network or Network Attached Storage
A Storage Area Network (SAN) is defined as a fabric of interconnected block level devices -- disks or virtualized disks. A SAN is merely an evolution of disk interconnects, which have been extended so that they may be connected via specialized hubs and/or switches. A practical implementation of a SAN consists of hosts connected via fibre channel to fibre channel switches (pictorially, a storage cloud) and on to storage appliances with fibre channel target adapters. In a SAN, we virtualize the disk -- we still use a file system on each host computer, and often a volume manager to aggregate multiple disks into one that can be used by the file system (a way of aggregating storage across multiple channels or storage servers).
Network Attached Storage (NAS) is a term used to describe data services connected with networking protocols -- e.g. the Network File System (NFS) or Microsoft's Common Internet File System (CIFS). NAS storage is typically configured using fast networks, e.g. gigabit ethernet. Access to the data is typically via a network file system client that is built into the operating system. For example, the NFS client is provided as part of both Linux and Solaris, allowing heterogeneous access to data residing on NFS servers.
NAS storage is much easier to administer -- connecting a new client system to pre-configured NAS storage is typically just one command on a Unix system (or just a point and click on a desktop OS). Unlike a SAN, no special device administration is required on the client, since the data is accessed as a network service at the file level.
Interestingly, we're at a pivotal point for SANs and storage interconnects. SANs are attempting to evolve into networks. Historically SANs have offered higher performance than NAS, forcing NAS to be used only in environments with low performance requirements -- e.g. web servers or for desktop user files/directories. However, today's commodity networking speeds are allowing NAS to grow into the heart of the commercial application space, and take a prominent place in our data centers.
A transitional step along the way is iSCSI, a new mechanism for running a SAN topology across a commodity network. iSCSI allows us to configure raw block devices on the client machine that connect to a SAN -- presenting themselves on the client in exactly the same way a fibre channel HBA does (as a raw block device). It still requires device administration, a volume manager and a file system on the client. Why is this interesting, given that NAS offers easier administration than its block/SAN counterpart? Primarily as a transition vehicle, allowing cost effective access to an existing corporate SAN.
Transports and Interconnects
Storage interconnect technologies have historically been driven primarily by performance characteristics and cost -- with bandwidth being the most important. Over the last decade, the focus has additionally emphasized the ability to soft-cable the storage, allowing separation of the host computer from the storage device. If we take a look at storage interconnect milestones over the last 20 years, we can see a transition in cost/performance occurring -- rather than being the slower counterpart, networks are becoming faster and cheaper than storage-specific interconnects. Let's take a look at the bandwidth available from both network and storage interconnects from 1989 through 2005:
In 1989, SPARCservers used 5MB/s SCSI to connect to disks, and 1MB/s (10Mbit) ethernet -- storage had 5x the bandwidth of the network port. Today, we have ethernet at 10Gbits (1GByte/s) and storage at 200MB/s (fibre channel).
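To make the crossover concrete, here is a back-of-the-envelope sketch in Python based on the figures above; the intermediate years are illustrative estimates rather than measured data points:

    # Network vs. storage interconnect bandwidth, in MB/s per port/link.
    # 1989 and 2005 follow the figures quoted above; the other years are rough estimates.
    datapoints = {
        1989: {"network": 1,    "storage": 5},    # 10Mbit ethernet vs 5MB/s SCSI
        1995: {"network": 12,   "storage": 40},   # 100Mbit ethernet vs wide SCSI (estimate)
        2000: {"network": 100,  "storage": 100},  # gigabit ethernet vs 1Gbit fibre channel (estimate)
        2005: {"network": 1000, "storage": 200},  # 10 Gigabit ethernet vs 2Gbit fibre channel
    }

    for year in sorted(datapoints):
        net, sto = datapoints[year]["network"], datapoints[year]["storage"]
        print(f"{year}: network {net:4d}MB/s, storage {sto:3d}MB/s, "
              f"network/storage ratio {net / sto:.1f}x")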
In addition to bandwidth, we are also seeing storage interconnects tackling many of the classic networking problems -- the ability to use longer cables, the ability to connect multiple systems via hubs and switches, and the build out of a virtual network cloud. One could argue that the storage interconnect industry is re-solving many problems that its IP networking counterpart has already solved.
SCSI performance grew from the original 1MByte/s to over 160MBytes/s. Whilst parallel SCSI offers quite good performance, it is seriously distance limited and supports only a small number of devices on one connection. Fibre channel (FC) was introduced as a practical, inexpensive mechanism for connecting systems using fibre optic cables.
Fibre channel is simply another form of networking -- it uses a serial physical link, which can carry higher level protocols such as ATM or IP, but is primarily used to transport SCSI from hosts to storage.
The Fibre Channel Protocol (FCP) serializes SCSI over fibre channel. Using FCP, storage could be connected at speeds of 25MBytes/s (when it was first introduced) over distances up to 30 meters. Shortly after its introduction, FC was increased to 100MBytes/s, which is the most common deployment today. The simplest form of FC is an arbitrated loop (FC-AL), in which a series of disks can be connected to a hub. However, some of the real benefits of FC storage come from its ability to be switched like a network. This is of course the basis for our storage area networks (SANs) today.
Ethernet is Commodity - Let the tides turn
Technically superior solutions continue to leapfrog Ethernet -- historically ATM and fibre channel, and more recently Infiniband. All offer significantly more bandwidth than Ethernet at the same point in time, and we could easily predict Ethernet's demise. Nonetheless, as usual, conventional wisdom is wrong. Ethernet is quietly preparing for a new era of unified interconnects.
Gilder is quoted saying: "The reason Ethernet prevailed in the first place is that it was incredibly simple and elegant and robust. In other words, it is cheap and simple for the user. Customers can preserve their installed base of equipment while the network companies innovate with new transmission media. When the network moves to new kinds of copper wires or from one mode of fiber optics to another, Ethernet still looks essentially the same to the computers attached to it. Most of the processing - connecting the user to the network, sensing a carrier frequency on the wire and detecting collisions - can be done on one Ethernet controller chip that costs a few dollars."
Recall that we started with SCSI being much faster than a network -- SCSI at 5MBytes/s and Ethernet at 1MB/s. Just a few years ago, both were at 100MBytes/s (gigabit), and now Ethernet is deployable at 1GByte/s (10 Gigabit). Today, adding a server to a SAN costs at least $1,800 in switch port and host bus adapter (HBA) costs alone. Unlike switch port pricing, which has dropped to well under $1,000/port, HBA prices have remained steady. The Fibre Channel Industry Association reports that in 2004 the number of fibre channel ports topped 10 million. A significant number, but ethernet volume dwarfs it: an estimated one billion ports today and a $4B market, with volume continuing to grow by a further ten times every three years.
Volume drives ethernet per-port costs down -- to less than ten dollars per port. Gigabit ethernet is between $5-$10 today. Sure, 10 Gigabit ethernet is still around $4k per port, but history suggests that it will drop to under $100 within the next 12-18 months.
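The per-port arithmetic shows how quickly those costs diverge at scale. A rough Python sketch using the approximate figures above (illustrative list prices, not quotes):

    # Approximate per-server attach cost, using the figures quoted above.
    FC_PER_SERVER  = 1800   # FC HBA + FC switch port (approximate)
    GBE_PER_SERVER = 10     # commodity gigabit ethernet port (approximate)

    for servers in (10, 100, 1000):
        print(f"{servers:5d} servers: fibre channel ${servers * FC_PER_SERVER:,} "
              f"vs gigabit ethernet ${servers * GBE_PER_SERVER:,}")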
Shared Access, Centralized Admin
"While we may end up using coaxial cable trees to carry our broadcast transmissions, it seems wise to talk in terms of an ether, rather than `the cable'...." Robert Metcalfe
Metcalfe's Law states that the usefulness, or utility, of a network equals the square of the number of users. The same is true of storage networks -- if only one person or host can access data, its usefulness is limited. Imagine a campus-wide file server that held data for all employees at that location, but from which you could access your data from only one client system...
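As a quick illustration, here is a tiny Python sketch of Metcalfe's Law applied to storage clients; the client counts are arbitrary:

    # Metcalfe's Law: the utility of a network grows roughly as the square of
    # the number of connected users. Data only one host can reach is worth far
    # less than data shared by thousands of clients.
    def relative_utility(clients):
        return clients ** 2

    for clients in (1, 10, 100, 1000, 5000):
        print(f"{clients:5d} clients -> relative utility {relative_utility(clients):,}")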
The "SunOS Netdisk" (introduced and died in 1983) attempted to solve this problem; it allowed a disk to be accessed across a network by transferring disk blocks across the network, just like an extension of SCSI. However due to available network bandwidth it was very slow, and since it merely exported a disk blocks across the network, all it did was allow separation of the host from the storage -- the client side semantics were exactly the same as with a local disk -- just one client host with a single file system could access each "network disk". There was no semantic sharing. Imaging a campus with 5000 desktop systems -- the storage administrator would need to format and administer 5000 separate UFS file systems!
A better answer is the concept of a network file system -- if we share "files" rather than blocks, we can treat storage as a network data service, which can then be shared by multiple users. A single file system can be administered centrally and accessed by thousands of users: just one storage admin, one storage pool to administer, and no storage administration at the client. The ability to separate and delegate administration, and to share with thousands of clients, are two of the key reasons network file systems have become ubiquitous as the way to access and manage data for workgroup users. The Sun Network File System (NFS) and Windows File Sharing (SMB/CIFS) are standard practice in such scenarios.
So what happened in our datacenters? We've been steadily evolving storage networks into something that approximates some of the services of a network file system -- SCSI disks became "shareable" via fabrics of fibre channel. Storage has become somewhat separated from the client; at least the volume management and data recovery parts have mostly moved to the backend as a virtualized network disk. However, only part of the storage administration has been centralized -- in this paradigm we still rely on file systems in the client host systems, and more often than not we still use a volume manager on the client.
How did workgroup data management and datacenter storage diverge so much? Performance and evolution are the most significant factors.
The commoditization of Ethernet and bandwidth in the hundreds of megabytes per second are making storage over IP viable in many places from which it was often excluded. We are seeing two key realizations -- storage networks constructed from commodity IP networks and switches, and the recognition that network file systems are now suitable for commercial applications. Price/performance is driving IP storage networks.
Our data centers are moving to grid based architectures, and as a result data needs to be accessed from many (sometimes thousands of) nodes. As a direct result, the shared access semantics, together with the reduction in complexity, are driving network file systems rapidly into this space. In this model, servers become diskless clients of a networked data service. For example, we can easily provision a system as an Oracle database server simply by mounting all of its configuration state and data files over NFS. "Data files?" I hear someone screaming. Yes, today's network file systems are more than capable of running many database configurations over gigabit networks.
A new Netdisk: iSCSI
Price/performance alone is giving rise to iSCSI. Connecting a host system to a SAN is expensive, often over one thousand dollars, so the SAN connection alone can now exceed the cost of the server, especially at the low end. iSCSI offers a cheap method of connecting systems into a SAN. By running an iSCSI client, a LUN on a storage server can be accessed as if it were connected by a fibre channel adapter. The SAN connection comes virtually for free, given that most systems have gigabit ethernet as a standard option.
Sharing blocks over the network offers little improvement in the way we access data; it merely provides a cheap per-port cost to access a SAN. As a result, iSCSI is today primarily considered at the low end as a gateway into the corporate SAN.
However, iSCSI is a classic disruptive technology. IDC's forecast expects a surge in use of Ethernet switch ports with iSCSI, and predicts that the total market for iSCSI-based disk arrays will grow from $216 million in 2003 to $4.9 billion in 2007. They also predict that iSCSI-based switches used in SAN implementations will grow from some 18,000 total ports in 2003 to 6.94 million 1GbE-equivalent ports in 2007.
What is iSCSI?
An iSCSI network consists of an iSCSI initiator and an iSCSI target. An iSCSI initiator is the client of an iSCSI network. The iSCSI initiator connects to an iSCSI target over an IP network using the iSCSI protocol. As with SCSI, logical units have Logical Unit Numbers (LUNs), where the numbering is per target. The same LU may have different LUNs at different targets.
The iSCSI protocol is layered beneath SCSI and on top of TCP/IP. Multiple TCP connections are allowed per session, and can optionally be used to exploit parallelism and provide error recovery. In addition to SCSI commands, there are commands to login (connect a TCP session) and logout (TCP session teardown). iSCSI naming is similar to SCSI -- names are globally unique, using a world wide number. The DNS system is used as a naming authority, providing reversed domain names. iSCSI targets may share a common IP address and port, and initiators/targets may have multiple IP addresses.
There are two basic types of discovery: static configuration and a naming service. The static configuration defines an IP address and port for each target. The naming service allows consolidation of iSCSI target information into a network service.
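To make the discovery model a bit more concrete, here is a minimal Python sketch of a statically configured initiator; the IQN names and addresses are hypothetical, and a real initiator would of course speak the full iSCSI login protocol rather than just opening a TCP connection to the target portal:

    import socket

    # Static discovery: each target is described by an IQN name plus the IP
    # address and TCP port of its portal (3260 is the standard iSCSI port).
    STATIC_TARGETS = [
        {"iqn": "iqn.2005-04.com.example:storage.array1", "addr": "192.168.10.5", "port": 3260},
        {"iqn": "iqn.2005-04.com.example:storage.array2", "addr": "192.168.10.6", "port": 3260},
    ]

    def portal_reachable(target, timeout=5.0):
        """Check that the target portal answers on TCP; a real initiator would
        follow this with an iSCSI Login to establish the session."""
        try:
            sock = socket.create_connection((target["addr"], target["port"]), timeout)
            sock.close()
            return True
        except OSError:
            return False

    for t in STATIC_TARGETS:
        print(t["iqn"], "reachable:", portal_reachable(t))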
iSCSI as a SAN Gateway
The most common deployment option for iSCSI is that of a SAN gateway. Using a FC to iSCSI gateway, a host can be connected to the SAN storage via ethernet. Typically, a dedicated or VLAN'ed network will be used for the iSCSI portion, to provide an additional level of security and/or performance partitioning.
Here's the common roll-out form for iSCSI: an ethernet network connects all of the hosts together and to an FC-to-iSCSI gateway. The gateway connects to the SAN in the same way as a regular FC host, but allows multiple iSCSI connections to pass through a single FC port.
We're observing interesting performance with the iSCSI initiator on Solaris. Because it utilizes the Solaris FireEngine networking stack, we're able to leverage its bandwidth features to move data efficiently. We've just seen the iSCSI initiator provide an impressive 120MBytes/s (wire speed) over gigabit ethernet, and over half a gigabyte per second over 10 Gigabit ethernet. If we ever thought performance was going to be a significant show stopper, those arguments are looking pretty bleak.
So what about NAS?
Performance is one of the most significant drivers of current storage architectures. If we look at the bandwidth and IOPS characteristics, we can draw a map of workload segmentation.
Workloads such as TPC-C (an artificial transaction processing workload) are dominated by thousands of small I/Os (~8k). A large 72 CPU configuration might perform ~30k IOPS, but its bandwidth requirements are less than 100MB/s. On the other hand, a TPC-H run (an artificial decision support workload) might only perform a few thousand IOPS, but requires several GB/s, even for systems with only 24 CPUs.
The I/O requirements of typical commercial applications sit at the bottom left of the diagram; they have relatively low IOPS (less than 15k IOPS) and bandwidth requirements (less than 100MB/s). A moderately configured storage system and file system typically provides adequate throughput, and our focus is more on parallelism and CPU efficiency at the client.
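The segmentation falls out of simple arithmetic: bandwidth is roughly IOPS multiplied by the average I/O size. A small Python sketch, with illustrative (not measured) profiles:

    # bandwidth (MB/s) ~= IOPS x average I/O size
    def bandwidth_mb_per_s(iops, io_size_kb):
        return iops * io_size_kb / 1024.0

    # Illustrative workload profiles, not the measured figures quoted above.
    profiles = {
        "OLTP-style, many small I/Os": (12000, 8),     # 12k IOPS of 8KB
        "DSS-style, few large I/Os":   (2000, 1024),   # 2k IOPS of 1MB
    }

    for name, (iops, size_kb) in profiles.items():
        print(f"{name}: {iops} IOPS x {size_kb}KB ~= "
              f"{bandwidth_mb_per_s(iops, size_kb):.0f} MB/s")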
Many scientific applications are bandwidth hungry. These applications often require bandwidth in the hundreds of megabytes, or sometimes several gigabytes, per second. These workloads typically have only a few streams of execution, i.e. low concurrency. These are the types of applications which today still exceed the limits of gigabit IP storage.
In the context of IP storage, this helps position why many commercial workloads obtain "good enough" performance, even over gigabit ethernet.
NFS for Bandwidth Intensive Applications
The Solaris implementation of NFS has been significantly optimized for bandwidth during the last three releases of Solaris. Just a few years ago, it wouldn't be uncommon to measure a maximum of 20-30MB/s for large streaming file I/O. NFS optimizations together with the Solaris FireEngine networking stack provide substantial improvements in NFS performance. At the time of writing, the Solaris 10 NFSv4 client can perform streaming I/O at approximately 120MBytes/s on a gigabit link (wire speed) and 250MBytes/s over a 10 Gigabit link. For now, fibre channel offers higher bandwidth potential than NFS, but further optimizations are in the pipeline :-)
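If you want to sanity check streaming NFS bandwidth on your own systems, a crude estimate is easy to script. A Python sketch follows; the mount point is hypothetical, and a meaningful measurement needs a file much larger than client memory so the cache doesn't flatter the result:

    import time

    def streaming_read_mb_per_s(path, block_size=1024 * 1024):
        """Read a file sequentially and report the observed MB/s."""
        total = 0
        start = time.time()
        with open(path, "rb") as f:
            while True:
                buf = f.read(block_size)
                if not buf:
                    break
                total += len(buf)
        elapsed = time.time() - start
        return (total / (1024.0 * 1024.0)) / elapsed

    # Example, assuming a large file on an NFS mount (hypothetical path):
    # print(streaming_read_mb_per_s("/mnt/nfs/bigfile"))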
So, NFS can be used for Commercial Applications?
In a nutshell, yes! It used to be true that it was rare and risky to attempt to put mission critical, latency sensitive commercial applications on NFS. Today I often see customers who have live OLTP applications deployed on gigabit ethernet NAS.
Why is it now possible? A couple of years ago, we asked "what is the performance for commercial apps over NAS?" Our engineering team took a large SMP machine, configured Oracle across NFS on gigabit ethernet and compared it to local file systems. Not surprisingly, the performance was terrible. However, after peeling back the layers, it became quickly apparent that this was a fixable problem. About 12 months later we had Oracle over NFS running at the same throughput as the local SAN connected disks!
It turns out that there wasn't anything wrong with ethernet, IP, or the NFS protocol -- rather, a few key optimizations in the Solaris NFS client could easily provide database I/O semantics, which work just fine across gigabit ethernet. The latencies were actually better in the NFS case with the new Solaris client. These optimizations are part of the new Solaris 10 NFS clients (both V3 and V4). Some of the changes were also made available in later versions of Solaris 9.
For those who like to see specifics, I'll include some numbers: our Oracle system running an OLTP benchmark was able to deliver equivalent throughput at about 50,000 transactions per minute, with slightly better response times: the transaction response times for both the gigabit ethernet case and the local disk case were ~0.3ms. The NFS system actually showed faster I/O response time -- iostat reported 7.39ms for the UFS file system and 7.19ms for the NFS case! The NFS system did however use slightly more host CPU (approx 20%), but this was with a heavy I/O database benchmark -- in reality the overhead is likely to be lower and virtually unnoticeable in a real application.
Are you saying NFS is ready for all commercial applications?
While many apps are viable, we still need to look carefully at the application scenario and how it's being deployed. For applications which do small I/Os but are latency sensitive, NFS is now a strong candidate. Common sense prevails for the interconnect however -- if you try to run your database over 10Mbit ethernet, you'll see the performance you would expect. A dedicated gigabit ethernet link is a good starting point.
That's all for this blog essay. We now have quite a bit of performance data comparing local filesystems to NFS for a variety of different commercial workloads -- which I will begin to summarize in a later discussion.