Wednesday Sep 24, 2008

ZFS with Cloud Storage or Faraway Storage

Recently I am been testing few pieces of Storage projects of OpenSolaris with PostgreSQL. One of tests involves using an iSCSI disk  with PostgreSQL.  Unfortunately the storage that's available is  in Colorado  while my PostgreSQL server is  located in Massachusetts. Latency will definitely be one of my top problems since storage  is halfway across the country (in Colorado). Plus the fact that I will be running a database server on my end  doesn't really sound like a good idea. Come to think about it, this could be a more common problem nowadays since Cloud Storage (for example Amazon S3 Webservice ) could be  optimistically  half way across the country and pessimistically be on the other side of the world.

 So what are my options to solve such problems?  ZFS in OpenSolaris 2008.05 has many new features, couple of which can potentially help with my problem. 

  • ZFS Separate Intent Log: Ability to separate out the ZFS Intent Log (or log writes in simple terms)
  • ZFS L2 ARC: Ability to use a Level 2 Adaptive Replacement Cache which can be block device (or cache reads on device in simple terms).

Thats an interesting set of new features that I thought will be useful in my case. One to log writes separately which can be on a fast disk and another to use a fast disk for caching reads. Of course I am not the first to say on this topic since these new features have been discussed in length a lot specially with SSDs. But I plan to  solve the problem of latency of my Cloud Storage with these new ZFS features and some local disks partitions that I have in my system.

Many people do the analogy that compared to a regular 7,200 rpm SATA or 10,000 rpm SATA/SAS or 15,000 SAS drives, the SSDS act like 40,000 rpm drives. Well extending this to Cloud Storage, I think Cloud Storage is like more like a  500 rpm to 1000 rpm drives depending on the phase of the moon and/or the stock market.

Anyway to continue with my setup, I used an iSCSI disk exported  in Colorado.   I created a regular zpool on top of it  on the server in Massachusetts  and called it "colorado" as shown below:

# zpool create colorado c9t600144F048DAAA5E0000144FA6E7AC00d0

Then I created a PostgreSQL database in /colorado/pgdata and started loading up data in it using pgbench. It was painful to do this late in the day and then  waiting for it to finish. At this point of time I also wished that pgbench had a scale factor of smaller than 1 (maybe it does I don't know). Anyway I did not have the patience to let it finish. I terminated the process after about 8 minutes as  that scenario was unacceptable.

$ time /usr/postgres/8.2/bin/pgbench -i -s 1 pgbench
creating tables...
10000 tuples done.
20000 tuples done.
30000 tuples done.
40000 tuples done.
50000 tuples done.
60000 tuples done.

real    8m1.509s
user    0m0.052s
sys     0m0.011s

I destroyed that "colorado" pool .  I referred  the Solaris ZFS Administration Guide to get help with the updated syntax of these new features and recreated the pool using a local disk partition for cache and another for log as follows:

 # zpool create -f colorado2 c9t600144F048DAAA5E0000144FA6E7AC00d0 log c5t0d0s0 cache c5t0d0s1

And then repeated the steps and then recreated the database on it. Then I started loading the data again with pgbench.

Boom!!!  it finished in record time:

$ time /usr/postgres/8.2/bin/pgbench -i -s 1 pgbench
creating tables...
10000 tuples done.
20000 tuples done.
30000 tuples done.
40000 tuples done.
50000 tuples done.
60000 tuples done.
70000 tuples done.
80000 tuples done.
90000 tuples done.
100000 tuples done.
set primary key...
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "branches_pkey" for table "branches"
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "tellers_pkey" for table "tellers"
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "accounts_pkey" for table "accounts"

real    0m4.560s
user    0m0.076s
sys     0m0.011s

Not bad. Cutting the latency of writes of something that would have taken in excess of 8-10 minutes is atleast recorded within 4 seconds on nonvolatile cache/log combination and allowing ZFS to sync it up to the actual storage. 

Now trying a quick pgbench run to make sure it executes as expected.

$ time /usr/postgres/8.2/bin/pgbench -c 1 -s 1 pgbench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
number of clients: 1
number of transactions per client: 10
number of transactions actually processed: 10/10
tps = 113.030111 (including connections establishing)
tps = 119.500012 (excluding connections establishing)

real    0m0.144s
user    0m0.005s
sys     0m0.008s

So using these new features of ZFS in OpenSolaris 2008.05  can helps hide the latency of these low-cost Cloud Storage and actually make them usable even as a database server.

Plus I heard ZFS is also coming out with recovery  options which will allow to recover not only with the log but also without the separate log and cache device available. If your server dies and takes your ZIL disk with it, you can go and build another server and attach to your cloud device  to regain your data even if you don't have the device to replay the ZIL.If you have your ZIL Log, you can use it to get most current version.  This is important during disaster recovery where you are willing to take whatever you have and start the business again from that point.


Jignesh Shah is Principal Software Engineer in Application Integration Engineering, Oracle Corporation. AIE enables integration of ISV products including Oracle with Unified Storage Systems. You can also follow me on my blog


« July 2016