Monday Nov 10, 2008

PostgreSQL and Project Amber Road (Sun Storage 7000 Series)

The New Sun Microsystems announced a new line of OpenStorage devices. The new Sun Storage 7000 Unified Storage Systems (code name Amber Road) are storage appliances that use ZFS as the core technology and DTrace as the means of "understanding" what's happening inside the device. This now allows systems which DO NOT have ZFS capabilities (Linux, Windows, etc.) to use and benefit from the features of ZFS and DTrace, albeit limited to the "storage" part only.

PostgreSQL (on Linux, Solaris, Windows) can use such systems through one of the standard protocols that these OpenStorage devices support:

  • NFS
  • CIFS (only on Windows)
  • iSCSI

PostgreSQL on NFS has already been discussed at length in the PostgreSQL community, both the merits and the demerits of using NFS with PostgreSQL. Also, I haven't yet met anybody who said they are running their PostgreSQL database on CIFS-mounted devices. So I think one might actually select iSCSI as the favorable protocol for such OpenStorage devices with PostgreSQL. (Correct me via comments if you think otherwise.)

Here are the quick steps that I used to configure a PostgreSQL server (running Solaris 10) to use iSCSI devices exported from a Sun Storage 7000 Unified Storage System.

In my case I had created two iSCSI LUNs (one for PGDATA and one for pg_xlog) with file permissions for the postgres user. (This setup is generally done using the BUI, the Browser User Interface of the OpenStorage "Amber Road" device.)

Now on the PostgreSQL server I did the following to set up the iSCSI initiator and LUNs:

  1. Add and display iSCSI target discovery address by giving the IP of the "Amber Road" system.

    # iscsiadm add discovery-address

    # iscsiadm list discovery-address

    Discovery Address:

  2. Display iSCSI targets discovered from the Amber Road system

    # iscsiadm list discovery-address -v

    Discovery Address:

    Target name:

    Target address:, 1

    Target name:

    Target address:, 1

  3. Enable and display static discovery.

    # iscsiadm modify discovery --static enable

    # iscsiadm list discovery


    Static: enabled

    Send Targets: disabled

    iSNS: disabled

  4. Add a target to the list of statically configured targets. A connection to a target is not attempted unless the static configuration method of discovery has been enabled.

  5. Add the static configuration for each of the two targets:

    # iscsiadm add static-config,

    # iscsiadm add static-config,

  6. Use Solaris devfsadm(1M) to create iSCSI device nodes.

    # devfsadm -i iscsi

  7. Use the format(1M) command to access the iSCSI disks. The disks to select contain /scsi_vhci in their path names. Local disks are listed before iSCSI disks in the format command's list. The following shows that disks 4 and 5 are iSCSI disks.

    # format

    Searching for disks...done


    0. c0t0d0 <DEFAULT cyl 17830 alt 2 hd 255 sec 63>


    1. c0t1d0 <DEFAULT cyl 17830 alt 2 hd 255 sec 63>


    2. c0t2d0 <DEFAULT cyl 17830 alt 2 hd 255 sec 63>


    3. c0t3d0 <DEFAULT cyl 17830 alt 2 hd 255 sec 63>


    4. c2t600144F04890703F0000144FA6CCAC00d0 <DEFAULT cyl 13051 alt 2 hd 255 sec 63>


    5. c2t600144F0489070250000144FA6CCAC00d0 <DEFAULT cyl 13051 alt 2 hd 255 sec 63>


  8. For UFS, follow this procedure:

    1. Create file systems on iSCSI disks.

    # newfs /dev/rdsk/c2t600144F04890703F0000144FA6CCAC00d0s0

    # newfs /dev/rdsk/c2t600144F0489070250000144FA6CCAC00d0s0

    2. Make the mount points.

    # mkdir /pgdata

    # mkdir /pglog

    3. Mount the iSCSI LUNs.

    # mount -F ufs -o forcedirectio /dev/dsk/c2t600144F04890703F0000144FA6CCAC00d0s0 /pgdata

    # mount -F ufs -o forcedirectio /dev/dsk/c2t600144F0489070250000144FA6CCAC00d0s0 /pglog

  9. For ZFS, create the zpools as follows:

    # zpool create pgdata c2t600144F04890703F0000144FA6CCAC00d0s0

    # zpool create pglog c2t600144F0489070250000144FA6CCAC00d0s0

  10. For ZFS with a read cache and a separate intent log (OpenSolaris 2008.05 and later) using local disks:

    # zpool create pgdata c2t600144F04890703F0000144FA6CCAC00d0s0 cache c0t1d0s1 log c0t1d0s0

    # zpool create pglog c2t600144F0489070250000144FA6CCAC00d0s0 cache c0t2d0s1 log c0t2d0s0
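The whole initiator-side flow above can be pulled together into a single script. This is a sketch only: the appliance IP below is a placeholder (the real address and target names come from your Amber Road BUI), and each command is echoed rather than executed, since the real commands need root and a reachable appliance.

```shell
# Sketch of the initiator-side setup. Each command is printed, not run,
# because the real commands need root and an actual Amber Road appliance.
run() { echo "+ $*"; }     # drop the echo to execute for real
AR_IP=192.0.2.10           # placeholder appliance IP (TEST-NET-1), not a real address

run iscsiadm add discovery-address "$AR_IP"
run iscsiadm modify discovery --static enable
run devfsadm -i iscsi      # create the iSCSI device nodes
run zpool create pgdata c2t600144F04890703F0000144FA6CCAC00d0s0
run zpool create pglog c2t600144F0489070250000144FA6CCAC00d0s0
```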

PostgreSQL Setup:

UFS filesystems often have a "lost+found" directory, which causes PostgreSQL's initdb command to fail because the directory is not empty. Also, major versions of the database are incompatible on disk, so it makes sense to adopt a naming convention and create a subdirectory within /pgdata and /pglog for each version you plan to use:

mkdir /pgdata/8.3 /pglog/8.3

mkdir /pgdata/8.2 /pglog/8.2

mkdir /pgdata/8.1 /pglog/8.1
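Before running initdb, it is worth verifying that the target directory really is empty. This sketch rehearses the check on a throwaway /tmp path; substitute /pgdata/8.3 on the real system.

```shell
# Sketch: check that a data directory is empty before pointing initdb at it.
# Uses a throwaway /tmp path, not the real /pgdata mount.
datadir=/tmp/pgdata_demo/8.3
rm -rf "$datadir" && mkdir -p "$datadir"
if [ -n "$(ls -A "$datadir")" ]; then
    echo "$datadir is not empty; initdb would fail"
else
    echo "$datadir is empty; safe for initdb"
fi
```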

Then for PostgreSQL 8.3 (Solaris 10 10/08) the step is as follows:

/usr/postgres/8.3/bin/initdb -D /pgdata/8.3 -X /pglog/8.3/pg_xlog

As for 8.2 the steps are as follows:

/usr/postgres/8.2/bin/initdb -D /pgdata/8.2

mv /pgdata/8.2/pg_xlog /pglog/8.2/

ln -s /pglog/8.2/pg_xlog /pgdata/8.2/pg_xlog

And for the default 8.1 the steps are similar:

/usr/bin/initdb -D /pgdata/8.1

mv /pgdata/8.1/pg_xlog /pglog/8.1/

ln -s /pglog/8.1/pg_xlog /pgdata/8.1/pg_xlog

And then use the pg_ctl from the version of PostgreSQL being used to start and stop the server:

pg_ctl start -D /pgdata/8.3 -l /pgdata/8.3/server.log

pg_ctl stop -D /pgdata/8.3
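The move-and-symlink pattern used above for 8.2 and 8.1 can be rehearsed safely on scratch directories first. This sketch uses throwaway /tmp paths and a stand-in WAL file rather than a live cluster:

```shell
# Sketch of the pg_xlog move-and-symlink pattern on throwaway /tmp paths;
# on the real system the directories are /pgdata/8.2 and /pglog/8.2.
rm -rf /tmp/demo_pgdata /tmp/demo_pglog
mkdir -p /tmp/demo_pgdata/pg_xlog /tmp/demo_pglog
echo wal > /tmp/demo_pgdata/pg_xlog/000000010000000000000001   # stand-in WAL segment
mv /tmp/demo_pgdata/pg_xlog /tmp/demo_pglog/
ln -s /tmp/demo_pglog/pg_xlog /tmp/demo_pgdata/pg_xlog
cat /tmp/demo_pgdata/pg_xlog/000000010000000000000001          # still readable via the symlink
```

On a real cluster the same ordering applies: run initdb first, move pg_xlog while the server is stopped, then create the symlink before starting the server.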

In another blog post I will explain how to take snapshots with PostgreSQL and the OpenStorage devices.

Wednesday Sep 24, 2008

ZFS with Cloud Storage or Faraway Storage

Recently I have been testing a few pieces of the Storage projects of OpenSolaris with PostgreSQL. One of the tests involves using an iSCSI disk with PostgreSQL. Unfortunately, the storage available to me is in Colorado, while my PostgreSQL server is located in Massachusetts. Latency will definitely be one of my top problems, since the storage is halfway across the country. Running a database server against it from my end doesn't really sound like a good idea either. Come to think of it, this could be an increasingly common problem nowadays, since Cloud Storage (for example the Amazon S3 web service) could optimistically be halfway across the country and pessimistically on the other side of the world.

So what are my options for solving this problem? ZFS in OpenSolaris 2008.05 has many new features, a couple of which can potentially help:

  • ZFS Separate Intent Log: the ability to put the ZFS Intent Log on a separate device (in simple terms, log writes elsewhere).
  • ZFS L2 ARC: the ability to use a Level 2 Adaptive Replacement Cache on a block device (in simple terms, cache reads on a separate device).

That's an interesting set of new features that I thought would be useful in my case: one to log writes separately on a fast disk, and another to use a fast disk for caching reads. Of course I am not the first to write on this topic, since these new features have already been discussed at length, especially in connection with SSDs. But I plan to solve the latency problem of my Cloud Storage with these new ZFS features and some local disk partitions that I have in my system.

Many people make the analogy that, compared to regular 7,200 rpm SATA, 10,000 rpm SATA/SAS, or 15,000 rpm SAS drives, SSDs act like 40,000 rpm drives. Extending this to Cloud Storage, I think Cloud Storage is more like a 500 to 1,000 rpm drive, depending on the phase of the moon and/or the stock market.
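A back-of-the-envelope calculation (with assumed latencies, since I haven't measured this path precisely) shows why a local intent log matters: every synchronous WAL commit has to wait for at least one write round trip.

```shell
# Rough arithmetic with assumed latencies: write latency caps the rate of
# synchronous commits, since each commit waits one write round trip.
wan_rtt_ms=70        # assumed MA <-> CO round trip time
local_log_ms=5       # assumed local disk write latency
echo "commits/s bounded by WAN write:  $((1000 / $wan_rtt_ms))"
echo "commits/s bounded by local log:  $((1000 / $local_log_ms))"
```

The assumed numbers are illustrative only, but the order-of-magnitude gap is the point.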

Anyway, to continue with my setup: I used an iSCSI disk exported from Colorado. I created a regular zpool on top of it on the server in Massachusetts and called it "colorado", as shown below:

# zpool create colorado c9t600144F048DAAA5E0000144FA6E7AC00d0

Then I created a PostgreSQL database in /colorado/pgdata and started loading data into it using pgbench. It was painful to kick this off late in the day and then wait for it to finish. At this point I also wished that pgbench supported a scale factor smaller than 1 (maybe it does, I don't know). Anyway, I did not have the patience to let it finish; I terminated the process after about 8 minutes, as that scenario was unacceptable.

$ time /usr/postgres/8.2/bin/pgbench -i -s 1 pgbench
creating tables...
10000 tuples done.
20000 tuples done.
30000 tuples done.
40000 tuples done.
50000 tuples done.
60000 tuples done.

real    8m1.509s
user    0m0.052s
sys     0m0.011s

I destroyed that "colorado" pool. I referred to the Solaris ZFS Administration Guide for the updated syntax of these new features, and recreated the pool using one local disk partition for cache and another for log, as follows:

 # zpool create -f colorado2 c9t600144F048DAAA5E0000144FA6E7AC00d0 log c5t0d0s0 cache c5t0d0s1

Then I repeated the steps, recreated the database on it, and started loading the data again with pgbench.

Boom!!! It finished in record time:

$ time /usr/postgres/8.2/bin/pgbench -i -s 1 pgbench
creating tables...
10000 tuples done.
20000 tuples done.
30000 tuples done.
40000 tuples done.
50000 tuples done.
60000 tuples done.
70000 tuples done.
80000 tuples done.
90000 tuples done.
100000 tuples done.
set primary key...
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "branches_pkey" for table "branches"
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "tellers_pkey" for table "tellers"
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "accounts_pkey" for table "accounts"

real    0m4.560s
user    0m0.076s
sys     0m0.011s

Not bad. A load that would have taken in excess of 8-10 minutes is now acknowledged within about 4 seconds, with the writes recorded on the non-volatile local cache/log devices and ZFS syncing them up to the actual storage in the background.

Now let's try a quick pgbench run to make sure it executes as expected:

$ time /usr/postgres/8.2/bin/pgbench -c 1 -s 1 pgbench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
number of clients: 1
number of transactions per client: 10
number of transactions actually processed: 10/10
tps = 113.030111 (including connections establishing)
tps = 119.500012 (excluding connections establishing)

real    0m0.144s
user    0m0.005s
sys     0m0.008s

So these new features of ZFS in OpenSolaris 2008.05 can help hide the latency of low-cost Cloud Storage and actually make it usable even for a database server.

Plus, I hear ZFS is also getting recovery options that will allow recovery not only with the log but also without the separate log and cache devices available. If your server dies and takes your ZIL disk with it, you can build another server, attach it to your cloud device, and regain your data even without the device needed to replay the ZIL. If you do still have your ZIL log, you can use it to recover the most current version. This is important for disaster recovery, where you are willing to take whatever you have and restart the business from that point.


Jignesh Shah is Principal Software Engineer in Application Integration Engineering, Oracle Corporation. AIE enables integration of ISV products including Oracle with Unified Storage Systems. You can also follow me on my blog

