NAS and SAN sharing versus availability

Many people use NAS or SAN to share storage in their data centers. The computer storage industry has pushed this approach for many years. However, sharing poses some architectural issues that can lead to availability problems. Most computer/storage interfaces are designed around the premise that the computer owns the storage and that the connection to storage is reliable. This is reflected in the protocols used in SANs, such as SCSI. NAS designs did take into account the unreliable nature of networks, but did little to control who shares the data. After all, the whole idea behind NAS or SAN is to share, right?

When we design products like Sun Cluster, we take a very data-centric view of the availability problem. We begin by ensuring that the data is only accessible by nodes which are part of the cluster. We do not want to share the data with just any node. We accomplish this for SANs and DAS by using SCSI reservations and failfast drivers. Only those nodes which are part of the cluster can access the data. If a node tries to access the data, but is not allowed because it doesn't hold the reservation, then the failfast driver will panic the node. This allows the cluster to control access to the data in a coordinated manner.
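To make the reservation idea concrete, here is a minimal toy model in Python. It is not the Sun Cluster implementation and does not use any real SCSI API; the names `SharedDisk`, `NodePanic`, `register`, and `preempt` are hypothetical and exist only for illustration. It shows the rule described above: a node that no longer holds a reservation is stopped before it can touch the data.

```python
class NodePanic(Exception):
    """Models a node being halted by its failfast driver (illustrative only)."""


class SharedDisk:
    """Toy disk that honors reservations; not a real SCSI interface."""

    def __init__(self):
        self.registered = set()  # nodes that currently hold a reservation
        self.blocks = {}         # block number -> data

    def register(self, node):
        # A node joining the cluster places its reservation.
        self.registered.add(node)

    def preempt(self, node):
        # The cluster removes the reservation of a node that left membership.
        self.registered.discard(node)

    def write(self, node, block, data):
        if node not in self.registered:
            # Failfast model: the offending node goes down instead of writing.
            raise NodePanic(f"{node} attempted a write without a reservation")
        self.blocks[block] = data


disk = SharedDisk()
disk.register("node-a")
disk.register("node-b")
disk.write("node-a", 0, "payload")

disk.preempt("node-b")  # node-b is no longer a cluster member
try:
    disk.write("node-b", 0, "stale")
    fenced = False
except NodePanic:
    fenced = True
```

In this sketch the write from `node-b` never reaches the disk, so block 0 still holds what `node-a` wrote. The point is that the access decision is made by the cluster as a whole, not by the individual node.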

Architecturally, this concept of restricted access is at odds with the desire to share. So when you build a cluster using NAS or SAN, you must take extra precautions to ensure that only those nodes which should access the data can access the data. You can think of this as a form of security, implemented at the node level.

There is a real danger here when you mix different types of nodes on a SAN, for example computers running different operating systems. In that situation you cannot expect the nodes to have any real knowledge of the other nodes and their data access policies. A SuSE node may not know anything about a Solaris node's data access policy. It is therefore critical that the sharing policy can be enforced and is properly enforced. Unfortunately, this enforcement today is done largely by hand, so it is susceptible to human error, the most difficult error to overcome.

We have seen such errors cause significant downtime. If you corrupt data, recovery time can be very long. Loss of data integrity is a very high risk, so you should be extra careful when designing such systems. Often there is little that can be done physically to separate the nodes, especially if you are using some sort of storage consolidation scheme. You will need to take special care to ensure that the sharing policies are enforced in every place they are configured: NFS shares, SAN fabric configuration, etc. You will also need to make sure that change control mechanisms are in place to help prevent human error from being introduced later.
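One way to reduce the reliance on manual checking is to compare the intended sharing policy against what is actually configured. The sketch below assumes you can export both as simple mappings from a resource (a LUN or an NFS share) to the set of hosts allowed to reach it. The function `audit_access`, the host names, and the data layout are all hypothetical; real fabric and NFS configurations would have to be translated into this form first.

```python
def audit_access(policy, actual):
    """Return (resource, host) pairs that are configured but not permitted."""
    violations = []
    for resource, hosts in sorted(actual.items()):
        allowed = policy.get(resource, set())
        # Any host configured for the resource but absent from the policy
        # is a potential source of data corruption.
        for host in sorted(set(hosts) - allowed):
            violations.append((resource, host))
    return violations


# Intended policy: only the two Solaris cluster nodes may reach the data.
policy = {
    "lun0": {"sol-1", "sol-2"},
    "/export/db": {"sol-1", "sol-2"},
}

# Configuration as found: a SuSE host was added to lun0 by mistake.
actual = {
    "lun0": {"sol-1", "sol-2", "suse-7"},
    "/export/db": {"sol-1", "sol-2"},
}

violations = audit_access(policy, actual)
for resource, host in violations:
    print(f"{host} can reach {resource} but is not in the policy")
```

Running a check like this as part of change control would catch a stray host before it touches the data, rather than after.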

About

relling
