Friday Mar 23, 2007

RAID-0 (stripe) on Solaris 10 using Solaris Volume Manager


The motivation for this how-to is partly to implement an excellent idea by Nemanja Lukic: deleting a whole zone by running newfs on its file system is several times faster than deleting all of the zone's files with rm, so each zone on our testing machine should live on a separate file system. And it's not just about deleting zones; speed is a significant factor too, and so is the use of other file system tools like ufsdump, ufsrestore, fssnap etc., which is possible only if your zones are on separate file systems. So we have 4 zones and 3 hard drives (we actually have 4 hard drives, but only 3 can be used for this purpose, since the first drive is the system drive). We could of course create 2 slices on one drive and one slice on each of the remaining 2 drives, but that would be so uncool :). The cool thing is to use Solaris Volume Manager: create a RAID-0 (stripe) metadevice out of the 3 hard drives, and then create 4 so-called soft partitions within that metadevice. This way every soft partition has exactly the same size, and all 3 hard drives are used completely. The platform is a brand new x4100 with 4 identical hard drives, 72 GB each; the operating system is Solaris 10 u3. Oh yes, if you are asking why I'm not using zfs, it's because the software that is meant to be tested in the zones is not supported (yet) on zfs :(.


The first step is to prepare the hard drives for RAID-0, so we will create one big partition that spans the whole drive. Run format and select the first data drive (disk 1, c2t1d0; disk 0 is the system drive):



root@jsc-x4100-17:~# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
0. c2t0d0
/pci@7b,0/pci1022,7458@11/pci1000,3060@2/sd@0,0
1. c2t1d0
/pci@7b,0/pci1022,7458@11/pci1000,3060@2/sd@1,0
2. c2t2d0
/pci@7b,0/pci1022,7458@11/pci1000,3060@2/sd@2,0
3. c2t3d0
/pci@7b,0/pci1022,7458@11/pci1000,3060@2/sd@3,0
Specify disk (enter its number): 1
selecting c2t1d0
[disk formatted]


FORMAT MENU:
disk - select a disk
type - select (define) a disk type
partition - select (define) a partition table
current - describe the current disk
format - format and analyze the disk
fdisk - run the fdisk program
repair - repair a defective sector
label - write label to the disk
analyze - surface analysis
defect - defect list management
backup - search for backup labels
verify - read and display labels
save - save new disk/partition definitions
inquiry - show vendor, product and revision
volname - set 8-character volume name
!<cmd> - execute <cmd>, then return
quit
format>p


PARTITION MENU:
0 - change `0' partition
1 - change `1' partition
2 - change `2' partition
3 - change `3' partition
4 - change `4' partition
5 - change `5' partition
6 - change `6' partition
7 - change `7' partition
select - select a predefined table
modify - modify a predefined partition table
name - name the current table
print - display the current table
label - write partition map and label to the disk
!<cmd> - execute <cmd>, then return
quit
partition> 0

Part Tag Flag Cylinders Size Blocks
0 home wm 1 - 8920 68.33GB (8920/0/0) 143299800

Enter partition id tag[home]:home

Enter partition permission flags[wm]:wm

Enter new starting cyl[1]: 1
Enter partition size[143299800b, 8920c, 8920e, 69970.61mb, 68.33gb]: $
partition> label
Ready to label disk, continue? y
partition> q

FORMAT MENU:
disk - select a disk
type - select (define) a disk type
partition - select (define) a partition table
current - describe the current disk
format - format and analyze the disk
fdisk - run the fdisk program
repair - repair a defective sector
label - write label to the disk
analyze - surface analysis
defect - defect list management
backup - search for backup labels
verify - read and display labels
save - save new disk/partition definitions
inquiry - show vendor, product and revision
volname - set 8-character volume name
!<cmd> - execute <cmd>, then return
quit
format> q



At this point disk 1 is partitioned, with slice 0 spanning from cylinder 1 to the end of the drive ($). Instead of repeating the same steps for disk 2 and disk 3, we will use the Solaris prtvtoc command to print disk 1's partition table and fmthard to apply that table to disks 2 and 3 (all disks are identical).




root@jsc-x4100-17:~# prtvtoc /dev/rdsk/c2t1d0s2 > /var/tmp/prtvtoc.c2t1d0s2
root@jsc-x4100-17:~# fmthard -s /var/tmp/prtvtoc.c2t1d0s2 /dev/rdsk/c2t2d0s2
fmthard: New volume table of contents now in place.
root@jsc-x4100-17:~# fmthard -s /var/tmp/prtvtoc.c2t1d0s2 /dev/rdsk/c2t3d0s2
fmthard: New volume table of contents now in place.
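With more than a couple of target disks, the same copy is easier to write as a loop. The following is only a dry-run sketch: it prints the commands instead of running them, and the variable and function names (SRC, TARGETS, VTOC, copy_vtoc_cmds) are mine, with values taken from the transcript above.

```shell
#!/bin/sh
# Dry run: print the prtvtoc/fmthard commands instead of executing them.
# SRC, TARGETS and VTOC are illustrative names; the values come from this post.
SRC=c2t1d0
TARGETS="c2t2d0 c2t3d0"
VTOC=/var/tmp/prtvtoc.${SRC}s2

copy_vtoc_cmds() {
    # Save the partition table of the source disk once ...
    echo "prtvtoc /dev/rdsk/${SRC}s2 > ${VTOC}"
    # ... then apply it to every target disk.
    for disk in $TARGETS; do
        echo "fmthard -s ${VTOC} /dev/rdsk/${disk}s2"
    done
}

copy_vtoc_cmds
```

Once the printed commands look right for your machine, you can pipe the output to sh as root. This only makes sense when the target disks are identical to the source disk, as they are here.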


The next step is to create replicas of the metadevice state database. The state database contains the configuration and state of all metadevices and hot spare pools on the system. Since this information is important, we will create 3 replicas of the database, one on each drive. A state database replica can be created on any slice of a hard drive, including a slice that will later become part of a metadevice. It's also possible to create more than one replica per slice. If one or more replicas fail, the volume manager compares the remaining replicas and uses a majority consensus algorithm to decide which of them are valid. The command to create state database replicas is metadb.



root@jsc-x4100-17:~# metadb -a -f c2t1d0s0 c2t2d0s0 c2t3d0s0


-a adds database replicas, and -f forces the operation (we have to force it because no state database replicas exist yet). Use metadb -i to check the state of the replicas. In our case we can see that the replicas are active (a flag) and up to date (u flag):



root@jsc-x4100-17:~# metadb -i
flags first blk block count
a u 16 8192 /dev/dsk/c2t1d0s0
a u 16 8192 /dev/dsk/c2t2d0s0
a u 16 8192 /dev/dsk/c2t3d0s0
r - replica does not have device relocation information
o - replica active prior to last mddb configuration change
u - replica is up to date
l - locator for this replica was read successfully
c - replica's location was in /etc/lvm/mddb.cf
p - replica's location was patched in kernel
m - replica is master, this is replica selected as input
W - replica has device write errors
a - replica is active, commits are occurring to this replica
M - replica had problem with master blocks
D - replica had problem with data blocks
F - replica had format problems
S - replica is too small to hold current data base
R - replica had device read errors
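To make the majority consensus rule mentioned above a bit more concrete, here is a small sketch. It assumes the rule as I remember it from the Solaris Volume Manager documentation: the system keeps running while at least half of the replicas are available, panics with fewer than half, and needs half plus one to boot into multiuser mode. The function name replica_status and its output strings are made up for this example; please check the documentation for your release before relying on the exact thresholds.

```shell
#!/bin/sh
# Sketch of the replica quorum rule (an assumption based on the SVM docs,
# not something SVM itself prints). replica_status is a hypothetical helper.
replica_status() {
    total=$1   # number of replicas configured
    good=$2    # number of replicas still available
    if [ $(( good * 2 )) -lt "$total" ]; then
        echo "panic"                 # fewer than half are left
    elif [ "$good" -ge $(( total / 2 + 1 )) ]; then
        echo "ok"                    # majority (half + 1) is available
    else
        echo "running-no-reboot"     # exactly half: runs, but cannot reboot to multiuser
    fi
}

replica_status 3 3
replica_status 3 2
replica_status 3 1
```

With our 3 replicas (one per drive), losing one drive still leaves a majority. With RAID-0 that is of little comfort, of course, because losing any one drive loses the stripe anyway.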


metadb's -c switch sets how many replicas are created per slice. If we had issued -c 3 on three slices, we would have ended up with 9 metadevice state database replicas:



root@jsc-x4100-17:~# metadb -a -f -c 3 c2t1d0s0 c2t2d0s0 c2t3d0s0

root@jsc-x4100-17:~# metadb -i

flags first blk block count
a u 16 8192 /dev/dsk/c2t1d0s0
a u 8208 8192 /dev/dsk/c2t1d0s0
a u 16400 8192 /dev/dsk/c2t1d0s0
a u 16 8192 /dev/dsk/c2t2d0s0
a u 8208 8192 /dev/dsk/c2t2d0s0
a u 16400 8192 /dev/dsk/c2t2d0s0
a u 16 8192 /dev/dsk/c2t3d0s0
a u 8208 8192 /dev/dsk/c2t3d0s0
a u 16400 8192 /dev/dsk/c2t3d0s0
r - replica does not have device relocation information
o - replica active prior to last mddb configuration change
u - replica is up to date
l - locator for this replica was read successfully
c - replica's location was in /etc/lvm/mddb.cf
p - replica's location was patched in kernel
m - replica is master, this is replica selected as input
W - replica has device write errors
a - replica is active, commits are occurring to this replica
M - replica had problem with master blocks
D - replica had problem with data blocks
F - replica had format problems
S - replica is too small to hold current data base
R - replica had device read errors
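The "first blk" column in the metadb -i output above follows a simple pattern: the first replica on a slice starts at block 16 and each further replica starts one block count (8192 blocks) later. The sketch below just reproduces that arithmetic. The numbers are read off the output above, and replica_start is a name I made up.

```shell
#!/bin/sh
# Values read off the metadb -i output in this post.
FIRST_BLK=16      # start of the first replica on a slice
REPLICA_LEN=8192  # block count of one replica

replica_start() {
    # $1 = index of the replica on its slice, starting at 0
    echo $(( FIRST_BLK + $1 * REPLICA_LEN ))
}

for i in 0 1 2; do
    replica_start "$i"
done
```

So three replicas take up the first 16 + 3 * 8192 = 24592 blocks of each slice in this example.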


Once we have the metadevice state databases, we can proceed with creating the metadevice. The command is metainit <metadevice name> <number of stripes> <width> <logical name for slice1> <slice2> .... The number-of-stripes parameter determines how many stripes we want in the metadevice. For example, if the number of stripes equals 1, we are creating a simple stripe; if it equals the number of slices (so that each stripe is only one slice wide), we have a concatenation. width specifies the number of slices that make up a stripe. In our case the number of stripes is 1 and the width is 3:



root@jsc-x4100-17:~# metainit d0 1 3 c2t1d0s0 c2t2d0s0 c2t3d0s0
d0: Concat/Stripe is setup


To verify the stripe and get some information about it, we use the metastat command:



root@jsc-x4100-17:~# metastat
d0: Concat/Stripe
Size: 429835140 blocks (204 GB)
Stripe 0: (interlace: 32 blocks)
Device Start Block Dbase Reloc
c2t1d0s0 16065 Yes Yes
c2t2d0s0 16065 Yes Yes
c2t3d0s0 16065 Yes Yes

Device Relocation Information:
Device Reloc Device ID
c2t1d0 Yes id1,sd@SSEAGATE_ST973401LSUN72G_3710ZJ07____________3LB0ZJ07
c2t2d0 Yes id1,sd@SSEAGATE_ST973401LSUN72G_3710ZGLR____________3LB0ZGLR
c2t3d0 Yes id1,sd@SSEAGATE_ST973401LSUN72G_3710Z1DG____________3LB0Z1DG
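If you want to double check the size that metastat reports, the conversion from blocks to GB is easy to do by hand. This sketch assumes that metastat counts 512-byte blocks, so that 1 GB is 2097152 blocks; blocks_to_gb is a helper name I made up.

```shell
#!/bin/sh
# Assumes 512-byte blocks: 2^30 / 512 = 2097152 blocks per GB.
BLOCKS_PER_GB=2097152

blocks_to_gb() {
    # Integer division, so the fractional part is dropped.
    echo $(( $1 / $BLOCKS_PER_GB ))
}

echo "d0 stripe: $(blocks_to_gb 429835140) GB"
echo "one slice: $(blocks_to_gb 143299800) GB"
```

The first value matches the 204 GB shown by metastat, and the second is the whole-GB part of the 68.33GB that format reported for slice 0.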



And now the final step is to create 4 soft partitions within metadevice d0. To create a soft partition we use metainit with the -p switch, giving the size of the soft partition as the last parameter (in our example it's 204 GB / 4 = 51 GB):



root@jsc-x4100-17:~# metainit d1 -p d0 51g
d1: Soft Partition is setup
root@jsc-x4100-17:~# metainit d2 -p d0 51g
d2: Soft Partition is setup
root@jsc-x4100-17:~# metainit d3 -p d0 51g
d3: Soft Partition is setup
root@jsc-x4100-17:~# metainit d4 -p d0 51g
d4: Soft Partition is setup
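The four metainit calls above differ only in the metadevice name, so they can also be generated. This is a dry-run sketch that prints the commands rather than running them. STRIPE, STRIPE_GB, COUNT and soft_partition_cmds are names I picked for the example, and the size is simply the stripe size divided by the number of zones.

```shell
#!/bin/sh
# Dry run: print the metainit commands instead of executing them.
STRIPE=d0       # stripe metadevice created earlier in this post
STRIPE_GB=204   # size of d0 as reported by metastat
COUNT=4         # one soft partition per zone

soft_partition_cmds() {
    size=$(( STRIPE_GB / COUNT ))   # 204 / 4 = 51
    i=1
    while [ "$i" -le "$COUNT" ]; do
        echo "metainit d$i -p $STRIPE ${size}g"
        i=$(( i + 1 ))
    done
}

soft_partition_cmds
```

The division is an integer one, so with a stripe size that is not a multiple of the zone count a little space would be left unused at the end of the stripe.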


You can verify this with metastat:



root@jsc-x4100-17:home# metastat
d4: Soft Partition
Device: d0
State: Okay
Size: 106954752 blocks (51 GB)
Extent Start Block Block count
0 320864384 106954752

d0: Concat/Stripe
Size: 429835140 blocks (204 GB)
Stripe 0: (interlace: 32 blocks)
Device Start Block Dbase State Reloc Hot Spare
c2t1d0s0 16065 Yes Okay Yes
c2t2d0s0 16065 Yes Okay Yes
c2t3d0s0 16065 Yes Okay Yes

d3: Soft Partition
Device: d0
State: Okay
Size: 106954752 blocks (51 GB)
Extent Start Block Block count
0 213909600 106954752

d2: Soft Partition
Device: d0
State: Okay
Size: 106954752 blocks (51 GB)
Extent Start Block Block count
0 106954816 106954752

d1: Soft Partition
Device: d0
State: Okay
Size: 106954752 blocks (51 GB)
Extent Start Block Block count
0 32 106954752

Device Relocation Information:
Device Reloc Device ID
c2t1d0 Yes id1,sd@SSEAGATE_ST973401LSUN72G_3710ZJ07____________3LB0ZJ07
c2t2d0 Yes id1,sd@SSEAGATE_ST973401LSUN72G_3710ZGLR____________3LB0ZGLR
c2t3d0 Yes id1,sd@SSEAGATE_ST973401LSUN72G_3710Z1DG____________3LB0Z1DG
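The start blocks in the metastat output above show how the soft partitions are laid out inside d0: d1 starts at block 32, and each following extent starts 32 blocks after the previous one ends. I'm only reading this pattern off the output (I assume the gap holds the soft partition bookkeeping and alignment, but I have not verified that), and extent_start is a made-up helper name.

```shell
#!/bin/sh
# Values read off the metastat output in this post.
SIZE=106954752   # block count of each 51 GB soft partition
GAP=32           # blocks in front of every extent

extent_start() {
    # $1 = soft partition number (1 for d1, 2 for d2, ...)
    echo $(( $GAP + ($1 - 1) * ($SIZE + $GAP) ))
}

for n in 1 2 3 4; do
    echo "d$n starts at block $(extent_start "$n")"
done
```

Note that 51 GB is exactly 51 * 2097152 = 106954752 blocks of 512 bytes, which is the block count shown for every soft partition.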



After this you can use the soft partitions as you would any other partition: create file systems on them, mount them, run ufsdump, ufsrestore etc. For example:



root@jsc-x4100-17:~# echo y|newfs /dev/md/rdsk/d1

/dev/md/rdsk/d1: 106954752 sectors in 17408 cylinders of 48 tracks, 128 sectors
52224.0MB in 1088 cyl groups (16 c/g, 48.00MB/g, 5824 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
32, 98464, 196896, 295328, 393760, 492192, 590624, 689056, 787488, 885920,
Initializing cylinder groups:
.....................
super-block backups for last 10 cylinder groups at:
105978656, 106077088, 106175520, 106273952, 106372384, 106470816, 106569248,
106667680, 106766112, 106864544


Repeat this step for /dev/md/rdsk/d2, /dev/md/rdsk/d3 and /dev/md/rdsk/d4. When you're finished, you can happily mount the soft partitions at the locations where the zones will be installed.
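Creating the file systems and mounting them can be sketched as one loop. Again this is a dry run that only prints the commands. The mount points under /zones are purely hypothetical (the post does not say where the zones live), and ZONE_ROOT and newfs_and_mount_cmds are names I made up. Note that newfs takes the raw device (/dev/md/rdsk) while mount takes the block device (/dev/md/dsk).

```shell
#!/bin/sh
# Dry run: print the commands instead of executing them.
ZONE_ROOT=/zones   # hypothetical location, adjust to where your zones live

newfs_and_mount_cmds() {
    for n in 1 2 3 4; do
        echo "echo y | newfs /dev/md/rdsk/d$n"           # raw device for newfs
        echo "mkdir -p $ZONE_ROOT/zone$n"
        echo "mount /dev/md/dsk/d$n $ZONE_ROOT/zone$n"   # block device for mount
    done
}

newfs_and_mount_cmds
```

To make the mounts survive a reboot you would also add the file systems to /etc/vfstab, which is not shown here.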

If you want to remove your metadevice and metadb replicas, go through the steps in reverse.

First remove the soft partitions from the metadevice:



root@jsc-x4100-17:~# metaclear -p d0
d4: Soft Partition is cleared
d3: Soft Partition is cleared
d2: Soft Partition is cleared
d1: Soft Partition is cleared


Then remove the metadevice:


root@jsc-x4100-17:~# metaclear d0
d0: Concat/Stripe is cleared


And finally the metadb replicas:


root@jsc-x4100-17:~# metadb -f -d c2t1d0s0 c2t2d0s0 c2t3d0s0


I hope this was helpful. Feel free to comment, and see you in my next blog post, which will probably be about either dtrace (basics) or x86 Crash Dump Analysis.
About

bilke
