ZFS with Cloud Storage or Faraway Storage

Recently I have been testing a few of the storage projects in OpenSolaris with PostgreSQL. One of the tests involves using an iSCSI disk with PostgreSQL. Unfortunately, the storage available to me is in Colorado while my PostgreSQL server is located in Massachusetts. Latency will definitely be one of my top problems, since the storage is halfway across the country, and running a database server against it from my end doesn't really sound like a good idea. Come to think of it, this could become a common problem nowadays, since Cloud Storage (for example, Amazon S3) could optimistically be halfway across the country and pessimistically on the other side of the world.

So what are my options to solve such problems? ZFS in OpenSolaris 2008.05 has many new features, a couple of which can potentially help with my problem.

  • ZFS Separate Intent Log: the ability to place the ZFS Intent Log (the log of writes, in simple terms) on a separate device
  • ZFS L2ARC: the ability to use a Level 2 Adaptive Replacement Cache on a block device (a read cache on a device, in simple terms)

That's an interesting set of new features that I thought would be useful in my case: one to log writes separately on a fast disk, and another to use a fast disk for caching reads. Of course, I am not the first to write on this topic; these features have been discussed at length, especially with SSDs. But I plan to tackle the latency of my Cloud Storage with these new ZFS features and some local disk partitions that I have in my system.
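
For reference, here is roughly what these features look like at the command line. A separate log or cache device can be specified when the pool is created (as I do later) or bolted onto an existing pool with zpool add; the pool and device names below are only placeholders:

# zpool add mypool log c5t0d0s0      (separate intent log on a fast local partition)
# zpool add mypool cache c5t0d0s1    (Level 2 ARC read cache on another partition)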

Many people draw the analogy that, compared to regular 7,200 rpm SATA, 10,000 rpm SATA/SAS, or 15,000 rpm SAS drives, SSDs act like 40,000 rpm drives. Extending this to Cloud Storage, I think Cloud Storage is more like a 500 to 1,000 rpm drive, depending on the phase of the moon and/or the stock market.

Anyway, to continue with my setup, I used an iSCSI disk exported from Colorado. On the server in Massachusetts I created a regular zpool on top of it and called it "colorado", as shown below:

# zpool create colorado c9t600144F048DAAA5E0000144FA6E7AC00d0
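
For completeness, the iSCSI LUN first has to be made visible on the initiator side. On OpenSolaris that would look roughly like this (the discovery address here is made up; the resulting device ends up with the long c9t...d0 name used above):

# iscsiadm add discovery-address 192.168.10.20:3260
# iscsiadm modify discovery --sendtargets enable
# devfsadm -i iscsi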

Then I created a PostgreSQL database in /colorado/pgdata and started loading data into it using pgbench. It was painful to do this late in the day and then wait for it to finish. At this point I also wished that pgbench allowed a scale factor smaller than 1 (maybe it does, I don't know). Anyway, I did not have the patience to let it finish, and I terminated the process after about 8 minutes since that scenario was unacceptable.
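
For reference, the database was set up along these lines (a sketch; the paths match the pool above and everything else is left at the defaults):

$ /usr/postgres/8.2/bin/initdb -D /colorado/pgdata
$ /usr/postgres/8.2/bin/pg_ctl -D /colorado/pgdata -l /colorado/pgdata/server.log start
$ /usr/postgres/8.2/bin/createdb pgbench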

$ time /usr/postgres/8.2/bin/pgbench -i -s 1 pgbench
creating tables...
10000 tuples done.
20000 tuples done.
30000 tuples done.
40000 tuples done.
50000 tuples done.
60000 tuples done.
^C

real    8m1.509s
user    0m0.052s
sys     0m0.011s


I destroyed that "colorado" pool. I referred to the Solaris ZFS Administration Guide for the updated syntax of these new features and recreated the pool using one local disk partition for the log and another for the cache, as follows:

 # zpool create -f colorado2 c9t600144F048DAAA5E0000144FA6E7AC00d0 log c5t0d0s0 cache c5t0d0s1
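
To confirm that the log and cache devices are attached and actually being exercised, zpool itself will show them (output omitted here):

# zpool status colorado2
# zpool iostat -v colorado2 5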

I then repeated the steps, recreated the database on it, and started loading the data again with pgbench.

Boom! It finished in record time:


$ time /usr/postgres/8.2/bin/pgbench -i -s 1 pgbench
creating tables...
10000 tuples done.
20000 tuples done.
30000 tuples done.
40000 tuples done.
50000 tuples done.
60000 tuples done.
70000 tuples done.
80000 tuples done.
90000 tuples done.
100000 tuples done.
set primary key...
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "branches_pkey" for table "branches"
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "tellers_pkey" for table "tellers"
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "accounts_pkey" for table "accounts"
vacuum...done.

real    0m4.560s
user    0m0.076s
sys     0m0.011s


Not bad. A load that would have taken in excess of 8-10 minutes against the remote storage alone is acknowledged within about 4 seconds, with the writes recorded on the local nonvolatile log/cache combination and ZFS syncing them up to the actual storage in the background.

Now let's try a quick pgbench run to make sure it executes as expected.

$ time /usr/postgres/8.2/bin/pgbench -c 1 -s 1 pgbench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
number of clients: 1
number of transactions per client: 10
number of transactions actually processed: 10/10
tps = 113.030111 (including connections establishing)
tps = 119.500012 (excluding connections establishing)

real    0m0.144s
user    0m0.005s
sys     0m0.008s
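
A single client running 10 transactions is obviously a tiny sample; a longer run along these lines (the client and transaction counts are arbitrary) would give a more meaningful number:

$ /usr/postgres/8.2/bin/pgbench -c 8 -t 1000 pgbench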

So using these new features of ZFS in OpenSolaris 2008.05 can help hide the latency of low-cost Cloud Storage and actually make it usable even for a database server.

Plus, I hear ZFS is also coming out with recovery options that will allow you to recover not only with the log but also without the separate log and cache devices available. If your server dies and takes your ZIL disk with it, you can build another server, attach it to your cloud device, and regain your data even if you don't have the device to replay the ZIL. If you do have your ZIL, you can use it to recover the most current version. This is important for disaster recovery, where you are willing to take whatever you have and restart the business from that point.
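
To make that concrete, recovery on a replacement server might look roughly like this, assuming the pool from above and a ZFS version that can import a pool whose separate log device is missing (the -m option; availability depends on the ZFS release):

# zpool import                  (scan the re-attached iSCSI device for importable pools)
# zpool import -m colorado2     (import even though the local log/cache partitions are gone)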


Comments:

The scale factor tells pgbench how many entries go into the table named "branches". Since there has to be at least one branch, no, you can't set the scale any smaller than one.

If you're willing to recompile pgbench, what you could do in order to make it operate on a smaller data set is to find every place in pgbench.c where the magic number "100000" is found (that's how many accounts each branch gets) and replace them all with something smaller, say 10000.
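
For illustration, the change would be something along these lines (purely a sketch; rebuild the contrib module afterwards):

$ grep -n '100000' pgbench.c                  (find the hard-coded accounts-per-branch value)
$ perl -pi -e 's/100000/10000/g' pgbench.c    (shrink it to, say, 10000)
$ make && make install                        (rebuild and reinstall pgbench)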

The other thing people should be aware of is that PostgreSQL checkpoints can be really brutal with remote storage like this. You really need to make sure checkpoint_segments is set to something reasonable (not the tiny default) if you want to keep that under control.
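
For example, something like this in postgresql.conf (the value is a starting point, not a recommendation):

checkpoint_segments = 32    # the default of 3 forces very frequent checkpoints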

Posted by Greg Smith on September 25, 2008 at 09:10 PM EDT #

Hey Greg,
Thanks for your comment. :-)
checkpoint_segments is set higher, about 32 or so, for this test. As for the scale factor, I was just lamenting since it was late in the day and the remote storage was taking too much time. But I am happy the cache/log features of ZFS are working to my benefit as I expected.

Posted by Jignesh Shah on September 26, 2008 at 02:45 AM EDT #

[Trackback] I talked about the advantages of the separated ZIL and the L2ARC even on rotating rust in my article "iSCSI for I/O intensive tasks" in March 2008. I didn't find the time to write down the findings of my tests as I started LKSF not much later. But Jign...

Posted by c0t0d0s0.org on September 27, 2008 at 01:55 AM EDT #

You can use mirroring on the ZIL to help protect it from drive failure.
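
A sketch of the syntax, reusing the pool layout from the post with a second (hypothetical) local partition as the other half of the mirror:

# zpool create colorado2 c9t600144F048DAAA5E0000144FA6E7AC00d0 log mirror c5t0d0s0 c6t0d0s0 cache c5t0d0s1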

Posted by taltamir on May 24, 2009 at 08:29 AM EDT #
