Tuesday Jun 15, 2004

NFS clusters

I am part of a team that is taking a new look at NFS clusters. We've actually been working on this for a long time, but we're focusing on a few new techniques which should identify the key areas where we can improve the availability of NFS services. The current release of Sun Cluster, internally called SC3.1u2 and known to the rest of the world as Sun Cluster 3.1 4/04, showed significant improvement in the recovery of NFS services as measured by clients. I have to use the term significant here because it is not easy to describe these sorts of improvements accurately with a single metric. I would say recovery is already quite good, but we know we can improve it further, and we are taking a slightly different approach than in the past. As is often the case with such complex services, there are some failure modes which take longer to recover from than we'd like. And we must refrain from making changes on the client, which, for NFS service availability, is a serious constraint.
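Measuring recovery "as measured by clients" really means timing how long client operations stall or fail while the service moves between nodes. Below is a minimal client-side probe sketched in Python; the mount point, probe file name, and threshold are assumptions for illustration, not part of any Sun Cluster tooling, and the spread of observed times across operations and mount options is exactly why a single metric falls short.

    #!/usr/bin/env python
    # Rough client-side probe: time how long a write to an NFS mount stalls
    # or fails during a failover. All names below are placeholders.
    import os
    import time

    MOUNT_POINT = "/mnt/ha-nfs"                      # assumed NFS mount
    PROBE_FILE = os.path.join(MOUNT_POINT, ".probe")
    INTERVAL = 1.0                                   # seconds between probes
    THRESHOLD = 2.0                                  # report anything slower

    while True:
        t0 = time.time()
        try:
            with open(PROBE_FILE, "w") as f:         # open + write + fsync forces
                f.write(str(t0))                     # traffic over the wire; a hard
                os.fsync(f.fileno())                 # mount simply blocks here
        except (OSError, IOError):                   # a soft mount returns an error
            pass
        elapsed = time.time() - t0
        if elapsed > THRESHOLD:
            print("client-observed interruption: %.1f seconds" % elapsed)
        time.sleep(INTERVAL)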

I'll try to communicate through this blog some of the gotchas we know about NFS services in a highly available environment. Stay tuned...

Thursday Jun 10, 2004

Sun Cluster forum

Reminder: we do offer a Sun Forum for discussing Sun Cluster issues.
http://forum.sun.com/forum.jsp?forum=1

Monday Jun 07, 2004

Diversity in your connections...

[Photo: a backhoe at the cell tower construction site, seen from the front porch]

Cingular is building a cell phone tower on the ranch. This view of the construction from the front porch reminds me that diversity in your connections is, and will always be, very important for maintaining a highly available system. The pole on the right, next to the backhoe's bucket, supplies all of the power and telephone service to my house. If the bucket were to move about a foot to the right, it could knock out both. Given the nature of this sort of failure, recovery could take many hours, and ruin my day.

I use diversity to overcome this single point of failure. For power, I've got a generator and generally use laptop computers with batteries. In a pinch, I could connect some solar panels and charge controllers to run off the power grid. For telephone service, I have a cellular phone (not Cingular, since they don't have service in my area, yet) with fairly good coverage on the ranch. For internet, I have a wireless connection to Thunder Mountain Wireless, a local wireless service provider run by my friend Terry.

As you build your highly available service, you should consider these things too. A backhoe many miles away could cut your power and telecom service. If you do not have diverse supplies for these crucial services, then your system is at risk. In large metropolitan areas it may be very difficult to achieve diverse supplies for power and telecom services. But when you need that diversity, you need it immediately, and it cannot be arranged after the failure. Plan ahead to avoid this pitfall.

It is worth noting that this cell site is susceptible to the same fault. The backhoe is digging the trench that will supply power and telephone to the cell site. Generally, such a site has batteries and perhaps a generator, but its connection to the telecom network is via the wired phone system (a T1). Fast-forward a few years: someone digging along the same route could put the site out of commission.

A few years ago I wrote about a similar experience in Central Florida, where a developer was putting in a subdivision and a backhoe cut the power. In that data center, everything was functional except the automatic transfer switch (ATS), which was supposed to change the data center power over to the batteries and generators. The ATS had not been tested in several years and did indeed fail. Remember, an ATS is a single point of failure. My colleagues in Europe asked me what a "backhoe" was. I think this picture, and its direct application to my ranch, should drive the point home.

Friday May 21, 2004

NAS and SAN sharing versus availability

Many people use NAS or SAN to share storage in their data centers. This has been pushed by the computer storage industry for many years. However, sharing does pose some architectural issues which lead to availability problems. The design of most computer/storage interfaces is built around the premise that the computer owns the storage and that the connection to storage is reliable. This is reflected in the protocols used in SANs, such as SCSI. NAS designs did take into account the unreliable nature of networks, but didn't do much on the sharing side. After all, the whole idea behind NAS or SAN is to share, right?

When we design products like Sun Cluster, we take a very data-centric view of the availability problem. We begin by ensuring that the data is accessible only to nodes which are part of the cluster; we do not want to share the data with just any node. We accomplish this for SAN and DAS storage by using SCSI reservations and failfast drivers. If a node tries to access the data but is not allowed to because it doesn't hold the reservation, the failfast driver will panic the node. This allows the cluster to control access to the data in a coordinated manner.
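To make the coordination concrete, here is a toy model in Python, not Sun Cluster code: I/O is honored only for nodes that hold the reservation, and a node that discovers it has lost the reservation removes itself rather than risk the data.

    # Toy model of reservation-based fencing with failfast behavior.
    # Classes and names are illustrative only.

    class ReservationConflict(Exception):
        pass

    class SharedDisk(object):
        def __init__(self):
            self.reservation_holders = set()

        def reserve(self, node_names):
            # Cluster membership grants the reservation to surviving nodes only.
            self.reservation_holders = set(node_names)

        def io(self, node):
            if node.name not in self.reservation_holders:
                raise ReservationConflict(node.name)
            return "I/O completed for %s" % node.name

    class Node(object):
        def __init__(self, name):
            self.name = name

        def write(self, disk):
            try:
                return disk.io(self)
            except ReservationConflict:
                self.panic()

        def panic(self):
            # The failfast driver's job: better a dead node than corrupted data.
            raise SystemExit("%s: reservation lost, panicking" % self.name)

    disk = SharedDisk()
    node_a, node_b = Node("nodeA"), Node("nodeB")
    disk.reserve(["nodeA"])        # nodeB has been excluded from the cluster
    print(node_a.write(disk))      # succeeds
    node_b.write(disk)             # nodeB fences itself instead of writing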

Architecturally, this concept of restricted access is at odds with the desire to share. So when you build a cluster using NAS or SAN, you must take extra precautions to ensure that only those nodes which should access the data can access the data. You can think of this as a form of security, implemented at the node level.

There is a real danger here when you mix different types of nodes on a SAN, such as computers running different operating systems. You cannot expect those nodes to have any real knowledge of one another or of one another's data access policies; a SuSE node, for example, may not know anything about a Solaris node's data access policy. In that case, it is critical that the sharing policy can be enforced, and is properly enforced. Unfortunately, this enforcement today is done largely by hand, so it is susceptible to human error, the most difficult error to overcome.

We have seen such errors cause significant downtime. If you corrupt data, recovery time can be very long. The risk to data integrity is very high, so you should be extra careful when designing such systems. Often there is little that can be done physically, especially if you are using some sort of storage consolidation scheme. You will need to take special care to ensure that the sharing policies are enforced: NFS share access lists, SAN fabric zoning, LUN masking, and so on. You will also need to make sure that change control mechanisms are in place to help prevent human error from being introduced later.
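One way to reduce the exposure to human error is to treat the sharing policy itself as data and audit the actual configuration against it regularly. Here is a hedged sketch of such a check in Python; the share paths and node names are made up, and in practice the observed access lists would be gathered from your NFS server or SAN fabric configuration rather than hard-coded.

    # Sketch of a sharing-policy audit: compare the approved access policy
    # against the observed configuration and report any drift.
    # All shares and node names below are hypothetical.

    POLICY = {
        "/export/oracle": {"clusternode1", "clusternode2"},
        "/export/home":   {"clusternode1", "clusternode2"},
    }

    # In practice, collect this from the NFS server or fabric; hard-coded here.
    observed = {
        "/export/oracle": {"clusternode1", "clusternode2"},
        "/export/home":   {"clusternode1", "clusternode2", "build-server"},
    }

    violations = []
    for share, allowed in POLICY.items():
        current = observed.get(share, set())
        for node in sorted(current - allowed):
            violations.append("%s: unexpected access for %s" % (share, node))
        for node in sorted(allowed - current):
            violations.append("%s: expected access missing for %s" % (share, node))

    if violations:
        print("sharing policy drift detected:")
        for v in violations:
            print("  " + v)
    else:
        print("sharing policy matches the observed configuration")

Run something like this from a change control job or cron so that drift, however it gets introduced, is noticed before it causes downtime.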

Thursday Apr 29, 2004

LIPs on fabric?

A question came up today about whether Fibre Channel fabrics also do LIPs. My FC docs are not conclusive. If you know, let me know and send a reference.