Sunday Apr 05, 2009

Effects of Flash/SSD on PostgreSQL - PostgreSQL East 2009

I presented a talk at PostgreSQL East 2009 at Drexel University this morning. The topic was "Effects of Flash/SSDs on PostgreSQL". Here are the slides from the presentation.

If you have questions please leave comments.

Wednesday Nov 12, 2008

Snapshots with PostgreSQL and Sun Storage 7000 Unified Storage Systems

Continuing my earlier blog entry on setting up PostgreSQL with Sun Storage 7000 Unified Storage (Amber Road), let's now look at how to take snapshots on such systems.

For a typical PostgreSQL system there are two ways to take a backup:

Cold or Offline backup: Backing up PostgreSQL while the PostgreSQL server is not running. The advantage of such a backup is that it is simple to do; the disadvantage is that the database system is unavailable for work. The general steps for a cold backup are: shut down the PostgreSQL server gracefully and back up all the files, including $PGDATA, pg_xlog if it is in a different location, and all the tablespace locations used by that server instance. After backing up the files, start the PostgreSQL server again and you are done. If you ever want to go back to that earlier backup version, you shut down the database, restore all the old files and restart the database. This is essentially the same logic we will follow with snapshots, except that with a multi-gigabyte database the snapshot operation will seem a million times faster than doing a "tar" or "cpio" (especially since "tar" is a single-threaded application doing its best, but still slow on multi-core systems).
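For comparison, here is a minimal sketch of the traditional file-copy approach that the snapshot replaces (the /backup/pg destination is hypothetical; adjust the paths for your own layout):

pg_ctl -D $PGDATA stop                          # shut the server down gracefully
tar cf /backup/pg/pgdata.tar $PGDATA            # copy $PGDATA (single-threaded and slow for large databases)
tar cf /backup/pg/pgtbs1.tar /pgtbs1            # repeat for pg_xlog and every tablespace location
pg_ctl -D $PGDATA start -l $PGDATA/server.log   # bring the server back up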

Let's consider an example of how to take a snapshot on OpenStorage systems with the Cold or Offline backup strategy.

Assume that the Sun Storage 7000 Unified Storage System has the hostname "amberroad" ("Sun Storage 7000 Unified Storage Systems" is a mouthful) and has a project defined called "postgres" with three shares exported: "pgdata", which holds $PGDATA; "pglog", which holds $PGDATA/pg_xlog; and "pgtbs1", which is one of the tablespaces defined.

Then a backup/snapshot script (run as the postgres user) will look something like the following (Disclaimer: provided AS-IS; you will have to modify it for your own setup):

#!/bin/sh
date
echo "---------------------------------------------------------------"
echo "Shutting down the PostgreSQL database..."
echo "---------------------------------------------------------------"
pg_ctl -D $PGDATA stop
echo "---------------------------------------------------------------"
echo "Deleting earlier snapshot called testsnap.."
echo "---------------------------------------------------------------"
ssh -l root amberroad confirm shares select postgres snapshots select testsnap destroy
echo "---------------------------------------------------------------"
echo "Taking a new snapshot..."
echo "---------------------------------------------------------------"
ssh -l root amberroad shares select postgres snapshots snapshot testsnap
echo "---------------------------------------------------------------"
echo "Verifying the snapshot.."
echo "---------------------------------------------------------------"
ssh -l root amberroad shares select postgres snapshots show
ssh -l root amberroad shares select postgres snapshots select testsnap show
echo "---------------------------------------------------------------"
echo "Restarting the database.."
echo "---------------------------------------------------------------"
pg_ctl -D $PGDATA start -l $PGDATA/server.log
echo "---------------------------------------------------------------"
date

Note: A snapshot named testsnap must not already exist or the snapshot operation will fail, hence the script first destroys any earlier snapshot called testsnap. You may modify the script to suit your priorities.

Now, in case you want to restore to the "testsnap" snapshot, you can use a sample script as follows:

#!/bin/sh
date
echo "Restoring the database"
echo "---------------------------------------------------------------"
echo "Shutting down the database..."
echo "---------------------------------------------------------------"
pg_ctl -D $PGDATA stop
echo "---------------------------------------------------------------"
echo "Restoring the snapshot.."
echo "---------------------------------------------------------------"
ssh -l root amberroad confirm shares select postgres select pgdata snapshots \
select testsnap rollback
ssh -l root amberroad confirm shares select postgres select pglog snapshots \
select testsnap rollback
ssh -l root amberroad confirm shares select postgres select pgtbs1 snapshots \
select testsnap rollback
echo "---------------------------------------------------------------"
echo "Restarting the database.."
echo "---------------------------------------------------------------"
pg_ctl -D $PGDATA start -l $PGDATA/server.log
echo "---------------------------------------------------------------"
echo "Database restored to earlier snapshot: testsnap."
echo "---------------------------------------------------------------"
date

NOTE: When you use the above script, all snapshots taken after "testsnap" are lost, as the shares are rolled back to the point in time when "testsnap" was taken. There is also a feature to clone the snapshot in case you don't want to use rollback.

Hot or Online backup: A Hot or Online backup is one taken while the PostgreSQL server is running. The advantage is that the database server remains available; the disadvantage is that the setup is more complex than the Cold or Offline backup strategy. The recommended way to do a Hot or Online backup with PostgreSQL is to use it in conjunction with PITR, the Point-In-Time Recovery feature of PostgreSQL.

This can be achieved by turning on continuous WAL archiving in PostgreSQL, issuing SELECT pg_start_backup('label'); before taking the snapshot, and then issuing SELECT pg_stop_backup(); after the snapshot is completed.

For this you will also need another dedicated device under the "postgres" project just to hold the WAL archive files. (In my case I call it "pgwal" and mount it as /pgwal/archive.)

To turn on continuous WAL archiving you need to set the following variables in postgresql.conf (and then restart the PostgreSQL server):

archive_mode = true
archive_command = 'test ! -f /var/postgres/8.3/data/backup_in_progress || cp -i %p /pgwal/archive/%f < /dev/null'
archive_timeout=3600
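Before relying on the archive during a restore, it is worth a quick check that WAL segments actually land in /pgwal/archive. A minimal sketch, assuming the settings above (note that this particular archive_command only copies files while the backup_in_progress marker exists):

touch $PGDATA/backup_in_progress                # marker that enables the copy in archive_command
psql -c "select pg_switch_xlog();" postgres     # force a WAL segment switch so something gets archived
ls -l /pgwal/archive                            # a new segment file should show up here
rm $PGDATA/backup_in_progress                   # remove the marker again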

Then a sample backup/snapshot script using hot backup can look as follows:

#!/bin/sh
#PGDATA=/var/postgres/8.3/data; export PGDATA # uncomment if not already set
date
echo "---------------------------------------------------------------"
echo "Indicating PostgreSQL for a hot Snapshot.."
echo "---------------------------------------------------------------"
touch $PGDATA/backup_in_progress
psql -c "select pg_start_backup('MyHotSnapShot');" postgres
echo "---------------------------------------------------------------"
echo "Deleting earlier snapshot.."
echo "---------------------------------------------------------------"
ssh -l root amberroad confirm shares select postgres snapshots select testsnap destroy
echo "---------------------------------------------------------------"
echo "Taking a new snapshot..."
echo "---------------------------------------------------------------"
ssh -l root amberroad shares select postgres snapshots snapshot testsnap
echo "---------------------------------------------------------------"
echo "Verifying the snapshot.."
echo "---------------------------------------------------------------"
ssh -l root amberroad shares select postgres snapshots show
ssh -l root amberroad shares select postgres snapshots select testsnap show
echo "---------------------------------------------------------------"
echo "Indicating to PostgreSQL Server that Hot Snapshot is done."
echo "---------------------------------------------------------------"
psql -c "select pg_stop_backup();" postgres
rm $PGDATA/backup_in_progress
echo "---------------------------------------------------------------"
date

Restoring from a hot backup snapshot is also a bit tricky, since we need to decide whether to roll forward and up to what point. That is probably beyond the scope of this post, so I will just do a simple restore that replays all the files in the WAL archive.

First of all, restoring with PITR requires a recovery.conf file which should contain the path to the WAL archives:

restore_command = 'cp /pgwal/archive/%f "%p"'
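If you do not want to roll forward through every archived segment, recovery.conf can also carry a stop point. A hedged sketch (the timestamp is purely hypothetical):

restore_command = 'cp /pgwal/archive/%f "%p"'
# optional: stop replay at a chosen point in time instead of the end of the archives
recovery_target_time = '2008-11-12 12:00:00'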

Also, since we will be rolling forward with the archived files, we need to remove all the old pg_xlog files, create an archive_status directory, remove the postmaster.pid file and, as an artifact of our earlier script, remove the backup_in_progress marker file too.

(We will do the above steps as part of our restore script.)

A restore snapshot script can look something like the following:

#!/bin/sh
#PGDATA=/var/postgres/8.3/data; export PGDATA # uncomment if not already set
date
echo "Restoring the database"
echo "---------------------------------------------------------------"
echo "Shutting down the database..."
echo "---------------------------------------------------------------"
pg_ctl -D $PGDATA stop
echo "---------------------------------------------------------------"
echo "Restoring the snapshot.."
echo "---------------------------------------------------------------"
ssh -l root amberroad confirm shares select postgres select pgdata snapshots \
select testsnap rollback
ssh -l root amberroad confirm shares select postgres select pglog snapshots \
select testsnap rollback
ssh -l root amberroad confirm shares select postgres select pgtbs1 snapshots \
select testsnap rollback
# if there is a reason to roll back the WAL archives too, uncomment the next two lines
#ssh -l root amberroad confirm shares select postgres select pgwal snapshots \
#select testsnap rollback
echo "---------------------------------------------------------------"
echo "Make sure pg_xlog/\* is empty ..."
echo "Make sure pg_xlog/archive_status exists..."
echo "Make sure postmaster.pid is removed"
echo "---------------------------------------------------------------"
rm $PGDATA/backup_in_progress
rm $PGDATA/postmaster.pid
rm $PGDATA/pg_xlog/*
mkdir -p $PGDATA/pg_xlog/archive_status
echo "---------------------------------------------------------------"
echo "Set recovery.conf for recover.."
echo "---------------------------------------------------------------"
echo restore_command = \'cp /pgwal/archive/%f "%p"\' > $PGDATA/recovery.conf
echo "---------------------------------------------------------------"
echo "Restarting the database.."
echo "---------------------------------------------------------------"
pg_ctl -D $PGDATA start -l $PGDATA/server.log
echo "---------------------------------------------------------------"
echo "Database restored."
echo "---------------------------------------------------------------"
date

Of course the above scripts may not be the best, but they give an idea of how one can use the snapshot features provided by Sun Storage 7000 Unified Storage Systems with PostgreSQL.



Monday Nov 10, 2008

PostgreSQL and Project Amber Road (Sun Storage 7000 Series)

Sun Microsystems has announced a new line of OpenStorage devices. The new Sun Storage 7000 Unified Storage Systems (code name Amber Road) are storage appliances which use ZFS as the core technology and DTrace as the means of "understanding" what's happening on the devices. This now allows systems which DO NOT have ZFS capabilities (Linux, Windows, etc.) to use and benefit from the features of ZFS and DTrace (albeit limited to the "storage" part only).

PostgreSQL (on Linux, Solaris, Windows) can use such systems through one of the standard protocols that these OpenStorage devices support:

  • NFS
  • CIFS (Only on Windows)
  • iSCSI

PostgreSQL on NFS has already been discussed at length in the PostgreSQL community (both the merits and the demerits of using NFS with PostgreSQL). Also, I haven't yet met anybody who said they are running PostgreSQL with their database on CIFS-mounted devices. So I think one might actually select iSCSI as the favorable protocol for using such OpenStorage devices with PostgreSQL. (Correct me via comments if you think otherwise.)

Here are the quick steps I used to configure a PostgreSQL server (running Solaris 10) to use iSCSI devices exported from a Sun Storage 7000 Unified Storage platform.

In my case I had created two iSCSI LUNs (one for PGDATA and one for pg_xlog) with file permissions for the postgres user. (This setup is generally done using the BUI, the Browser User Interface of the OpenStorage "Amber Road" device.)

Now, on the PostgreSQL server, I did the following to set up the iSCSI initiator and LUNs:

  1. Add and display the iSCSI target discovery address by giving the IP of the "Amber Road" system.

    # iscsiadm add discovery-address 10.9.168.93

    # iscsiadm list discovery-address

    Discovery Address: 10.9.168.93:3260

  2. Display iSCSI targets discovered from the Amber Road system

    # iscsiadm list discovery-address -v 10.9.168.93

    Discovery Address: 10.9.168.93:3260

    Target name: iqn.1986-03.com.sun:02:a4602145-85f8-64fa-c0ef-a059394d9a12

    Target address: 10.9.168.93:3260, 1

    Target name: iqn.1986-03.com.sun:02:0449398a-486f-4296-9716-bcba3c1be41c

    Target address: 10.9.168.93:3260, 1

  3. Enable and display static discovery.

    # iscsiadm modify discovery --static enable

    # iscsiadm list discovery

    Discovery:

    Static: enabled

    Send Targets: disabled

    iSNS: disabled

  4. Add the targets to the list of statically configured targets. A connection to a target will not be attempted unless the static configuration method of discovery has been enabled.

    # iscsiadm add static-config iqn.1986-03.com.sun:02:9e0b0e03-8823-eb7e-d449-f9c21930ba15,10.9.168.93

    # iscsiadm add static-config iqn.1986-03.com.sun:02:2cc4fe10-c7ba-697f-d95f-fa75efe50239,10.9.168.93

  5. Use Solaris devfsadm(1M) to create the iSCSI device nodes.

    # devfsadm -i iscsi

  6. Use the format(1M) command to access the iSCSI disks. The disks to select contain /scsi_vhci in their path names. Local disks are listed before iSCSI disks in the format output. The following shows that disks 4 and 5 are the iSCSI disks.

    # format

    Searching for disks...done

    AVAILABLE DISK SELECTIONS:

    0. c0t0d0 <DEFAULT cyl 17830 alt 2 hd 255 sec 63>

    /pci@0,0/pci8086,25f8@4/pci108e,286@0/sd@0,0

    1. c0t1d0 <DEFAULT cyl 17830 alt 2 hd 255 sec 63>

    /pci@0,0/pci8086,25f8@4/pci108e,286@0/sd@1,0

    2. c0t2d0 <DEFAULT cyl 17830 alt 2 hd 255 sec 63>

    /pci@0,0/pci8086,25f8@4/pci108e,286@0/sd@2,0

    3. c0t3d0 <DEFAULT cyl 17830 alt 2 hd 255 sec 63>

    /pci@0,0/pci8086,25f8@4/pci108e,286@0/sd@3,0

    4. c2t600144F04890703F0000144FA6CCAC00d0 <DEFAULT cyl 13051 alt 2 hd 255 sec 63>

    /scsi_vhci/disk@g600144f04890703f0000144fa6ccac00

    5. c2t600144F0489070250000144FA6CCAC00d0 <DEFAULT cyl 13051 alt 2 hd 255 sec 63>

    /scsi_vhci/disk@g600144f0489070250000144fa6ccac00

  7. For UFS, follow this procedure:

    1. Create file systems on iSCSI disks.

    # newfs /dev/rdsk/c2t600144F04890703F0000144FA6CCAC00d0s0

    # newfs /dev/rdsk/c2t600144F0489070250000144FA6CCAC00d0s0

    2. Make the mount points.

    # mkdir /pgdata

    # mkdir /pglog

    3. Mount the iSCSI LUNs.

    # mount -F ufs -o forcedirectio /dev/dsk/c2t600144F04890703F0000144FA6CCAC00d0s0 /pgdata

    # mount -F ufs -o forcedirectio /dev/dsk/c2t600144F0489070250000144FA6CCAC00d0s0 /pglog


  8. For ZFS, create the zpools as follows:

    # zpool create pgdata c2t600144F04890703F0000144FA6CCAC00d0s0

    # zpool create pglog c2t600144F0489070250000144FA6CCAC00d0s0



  9. For ZFS with a read cache and a separate intent log (OpenSolaris 2008.05 and later) using local disks; a note on recordsize follows these commands:

    # zpool create pgdata c2t600144F04890703F0000144FA6CCAC00d0s0 cache c0t1d0s1 log c0t1d0s0

    # zpool create pglog c2t600144F0489070250000144FA6CCAC00d0s0 cache c0t2d0s1 log c0t2d0s0
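One optional tweak worth considering before loading any data: a commonly suggested practice is to match the data pool's recordsize to PostgreSQL's 8K block size (the WAL pool can stay at the default). This is a suggestion, not a requirement:

# zfs set recordsize=8k pgdata
# zfs get recordsize pgdata pglog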

PostgreSQL Setup:

UFS filesystems often have a "lost+found" directory, which causes PostgreSQL's initdb command to fail because the directory is not empty. Also, major versions of the database are incompatible with each other, so it makes sense to adopt a naming convention as follows and create a subdirectory within /pgdata and /pglog for each version you are planning to use:

mkdir /pgdata/8.3 /pglog/8.3

mkdir /pgdata/8.2 /pglog/8.2

mkdir /pgdata/8.1 /pglog/8.1


Then for PostgreSQL 8.3 (Solaris 10 10/08) the step is as follows:

/usr/postgres/8.3/bin/initdb -D /pgdata/8.3 -X /pglog/8.3/pg_xlog


As for 8.2 the steps are as follows:

/usr/postgres/8.2/bin/initdb -D /pgdata/8.2

mv /pgdata/8.2/pg_xlog /pglog/8.2/

ln -s /pglog/8.2/pg_xlog /pgdata/8.2/pg_xlog


And for the default 8.1 the steps are similar:

/usr/bin/initdb -D /pgdata/8.1

mv /pgdata/8.1/pg_xlog /pglog/8.1/

ln -s /pglog/8.1/pg_xlog /pgdata/8.1/pg_xlog

And then use the pg_ctl from the version of PostgreSQL being used to start and stop the server:

pg_ctl start -D /pgdata/8.3 -l /pgdata/8.3/server.log

pg_ctl stop -D /pgdata/8.3

In another blog post I will explain how to take snapshots with PostgreSQL and these OpenStorage devices.


Wednesday Nov 05, 2008

Openbravo ERP with PostgreSQL 8.3 on Solaris 10 10/08

Solaris 10 10/08 (a.k.a. Solaris 10 Update 6) is now available. One of the new features of Solaris 10 10/08 is that PostgreSQL 8.3 is now bundled and available at /usr/postgres/8.3. I thought I would take this opportunity to show folks how to use the new version of PostgreSQL, using Openbravo ERP 2.40 as an example. (Note: This is specifically for Solaris 10 Update 6. If you are looking for instructions on how to install Openbravo and PostgreSQL on OpenSolaris, check the other blog entry.)

First of all, I recommend using the ZFS root feature now available in this update. On my Sunblade 2000 I had to use the text installer via

boot cdrom - text

to install the system with ZFS boot.

Another recommendation is to put /var in a separate dataset. Why? All PostgreSQL databases typically go into /var/postgres by default, so a separate dataset gives fine granularity when using the ZFS snapshot feature.

Once Solaris 10 Update 6 is installed, to see all the PostgreSQL options available on it, do the following:

# svcs -a |grep postgres
disabled       Oct_31   svc:/application/database/postgresql:version_82_64bit
disabled       Oct_31   svc:/application/database/postgresql:version_82
disabled       Oct_31   svc:/application/database/postgresql:version_81
disabled       Oct_31   svc:/application/database/postgresql_83:default_64bit
disabled       Oct_31   svc:/application/database/postgresql_83:default_32bit

In my case I want to use a 32-bit instance of PostgreSQL 8.3, so I just do the following:

# svcadm enable svc:/application/database/postgresql_83:default_32bit
# svcs -a |grep postgres
disabled       Oct_31   svc:/application/database/postgresql:version_82_64bit
disabled       Oct_31   svc:/application/database/postgresql:version_82
disabled       Oct_31   svc:/application/database/postgresql:version_81
disabled       Oct_31   svc:/application/database/postgresql_83:default_64bit
online         16:56:35 svc:/application/database/postgresql_83:default_32bit

Note that when you run svcadm enable for any of the PostgreSQL instances, it first does an initdb and then the equivalent of pg_ctl start. So only after the first enable of the service will you see the data directory initialized in /var/postgres/8.3/data (for 64-bit it will be data_64).
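If the service does not come online, SMF itself will tell you why and point you at the service log. A quick check, assuming the 32-bit 8.3 instance used above:

# svcs -xv svc:/application/database/postgresql_83:default_32bit
# svcs -l svc:/application/database/postgresql_83:default_32bit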

At this point I typically edit postgresql.conf, located at /var/postgres/8.3/data/postgresql.conf, and apply my basic tweaks:

checkpoint_segments=16 
shared_buffers = 128MB
wal_buffers=512kB

and then restart the PostgreSQL server so the changes are picked up before I start using the instance:

# svcadm restart svc:/application/database/postgresql_83:default_32bit

Great, now the PostgreSQL 8.3 server is up and running. Don't forget to use psql from /usr/postgres/8.3/bin instead of the default /usr/bin, which is the PostgreSQL 8.1 psql.

Let's now put an application on top of this new PostgreSQL server. Openbravo ERP 2.40, which was recently released, also requires Tomcat.

Now, there are two versions of Tomcat in Solaris 10 Update 6 (unlike OpenSolaris, which has none by default; when you install it via pkg you get the latest version). The two versions are the original Tomcat 4 and Tomcat 5.5:

/usr/apache/tomcat    - Tomcat 4

/usr/apache/tomcat55 - Tomcat 5.5 

For ease of use with Openbravo we want the newer Tomcat version,

so let's set up the already bundled tomcat55 quickly as follows:

# cp /var/apache/tomcat55/conf/server.xml-example /var/apache/tomcat55/conf/server.xml

Openbravo ERP 2.40 requires ant 1.6.5 or higher. The ant in /usr/sfw/bin on Solaris 10 Update 6 is still the older version 1.5.4, so I downloaded apache-ant-1.7.1-bin.zip from ant.apache.org and unzipped it in /opt.
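The unpacking itself is just a couple of commands (the path to the downloaded zip is whatever location you saved it to):

# cd /opt
# unzip /path/to/apache-ant-1.7.1-bin.zip
# /opt/apache-ant-1.7.1/bin/ant -version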

Download the Openbravo ERP 2.40 Solaris (SPARC) installer available on its download site.

First of all, if you are not logged in as root and use "su -" to become root, then from another terminal running as your normal userid, give X server display rights to root. As a shortcut I use "xhost localhost" to give permission to all users on localhost.

Then, back as root:

# chmod a+x OpenbravoERP_2.40-solaris-sparc-installer.bin
# DISPLAY=localhost:0.0; export DISPLAY
# ./OpenbravoERP_2.40-solaris-sparc-installer.bin


Here is my quick cheat sheet of answers for the questions that the Openbravo installation GUI asks (the ones in bold are changed from the defaults):

  • /opt/OpenbravoERP
  • /var/OpenbravoERP/AppsOpenbravo/attachments
  • Complete
  • Standard
  • /usr/jdk/instances/jdk1.5.0
  • /opt/apache-ant-1.7.1/bin/ant 
  • /var/apache/tomcat55
  • PostgreSQL
  • /usr/postgres/8.3/bin
  • localhost     5432
  • (Enter password for postgres user as "postgres" twice)
  • openbravo    tad     (Enter password for tad user  twice)
  • Context name: openbravo
  • Date format: DD MM YYYY, Date Separator -, Time format 24h, Time Separator :
  • Demo data: Y or N depending on your preferences

After you provide this information, the installation GUI takes quite a bit of time to complete, especially if you chose to load the demo data. (Hopefully you made the PostgreSQL tuning changes above to speed up this loading.)

Notice that all user-related data (PostgreSQL, Tomcat, Openbravo attachments) now goes into /var. This allows us to take a snapshot of /var and revert back to this installed image in case we accidentally corrupt user data or application settings.

Before taking the snapshot it is good practice to make sure the database and Tomcat are not running. Since we haven't started tomcat55 yet, we only have to bring down the PostgreSQL server.

# svcadm disable svc:/application/database/postgresql_83:default_32bit

# zfs snapshot rpool/ROOT/s10s_u6wos_07b/var@openbravoinstall
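You can confirm the snapshot exists (and later watch how much space it holds as /var changes) with a quick listing:

# zfs list -t snapshot | grep openbravoinstall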

Once the snapshot is taken, bring PostgreSQL back online and start Tomcat as follows:

# svcadm enable svc:/application/database/postgresql_83:default_32bit

# /usr/apache/tomcat55/bin/startup.sh

Once Tomcat has completely started (depending on the system this could take more than a couple of seconds), open a browser and go to

http://localhost:8080/openbravo

If everything goes right, you will see the login screen! Use "Openbravo" as the username and "openbravo" (all lower case) as the password to log in and set it up for your business.

If you ever want to roll back to this initial version, just do:

# /usr/apache/tomcat55/bin/shutdown.sh

# svcadm disable svc:/application/database/postgresql_83:default_32bit

# zfs rollback rpool/ROOT/s10s_u6wos_07b/var@openbravoinstall

# svcadm enable svc:/application/database/postgresql_83:default_32bit

# /usr/apache/tomcat55/bin/startup.sh


Once Tomcat has completely started, go back to the browser and open the same URL again:

http://localhost:8080/openbravo

More help on installation is available on their wiki site.

Wednesday Oct 01, 2008

Replacing Boot Disk which is running OpenSolaris

Recently the boot disk of my workstation running OpenSolaris 2008.05 started giving me media errors. Just for the record, the errors looked like this:

 WARNING: /pci@5,0/pci1022,7450@4/pci108e,534d@4,1/sd@0,0 (sd2):
     Error for Command: read(10)                Error Level: Retryable
       Requested Block: 4213122                   Error Block: 4213126
       Vendor: FUJITSU                            Serial Number: AAG0P5300991
       Sense Key: Media_Error
       ASC: 0x11 (read retries exhausted), ASCQ: 0x1, FRU: 0x0
 WARNING: /pci@5,0/pci1022,7450@4/pci108e,534d@4,1/sd@0,0 (sd2):
    Error for Command: read(10)                Error Level: Retryable
       Requested Block: 4213122                   Error Block: 4213126
       Vendor: FUJITSU                            Serial Number: AAG0P5300991
       Sense Key: Media_Error
       ASC: 0x11 (read retries exhausted), ASCQ: 0x1, FRU: 0x0

I did not want to wait until the whole disk died, so I started thinking about replacing it. I asked our local lab manager for a replacement disk and within a couple of days he came back with one. So with a replacement disk in hand, and the failing boot disk still the primary drive in the system, I worked out a strategy to change the disk without re-installing the system.

To start with, I am using an EOL'd product called the Sun Java Workstation W2100z. Fortunately it has extra disk slots. After halting OpenSolaris 2008.05, I opened the system without a screwdriver (it has twist screws) and inserted the new disk into one of the available slots, which already had a ribbon cable and power connector waiting for it. I booted up OpenSolaris and could see both disks. Great, now what?

My plan was to lay out the new disk exactly like my boot disk, so here are the steps to make it look the same:

 # format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c5t0d0 <DEFAULT cyl 17879 alt 2 hd 255 sec 63>
          /pci@5,0/pci1022,7450@4/pci108e,534d@4/sd@0,0
       1. c6t0d0 <DEFAULT cyl 17846 alt 2 hd 255 sec 63>
          /pci@5,0/pci1022,7450@4/pci108e,534d@4,1/sd@0,0
Specify disk (enter its number): 0
selecting c5t0d0
[disk formatted]
/dev/dsk/c5t0d0s0 is part of active ZFS pool rpool. Please see zpool(1M).
/dev/dsk/c5t0d0s1 is currently used by swap. Please see swap(1M).
format> par
partition> name
Enter table name (remember quotes): "myroot"
partition> pr
Current partition table (myroot):
Total disk cylinders available: 17879 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders         Size            Blocks
  0       root    wm     262 - 17878      134.95GB    (17617/0/0) 283017105
  1       swap    wu       1 -   261        2.00GB    (261/0/0)     4192965
  2     backup    wu       0 - 17878      136.96GB    (17879/0/0) 287226135
  3 unassigned    wm       0                0         (0/0/0)             0
  4 unassigned    wm       0                0         (0/0/0)             0
  5 unassigned    wm       0                0         (0/0/0)             0
  6 unassigned    wm       0                0         (0/0/0)             0
  7 unassigned    wm       0                0         (0/0/0)             0
  8       boot    wu       0 -     0        7.84MB    (1/0/0)         16065
  9 unassigned    wm       0                0         (0/0/0)             0

partition> quit
format> disk
AVAILABLE DISK SELECTIONS:
       0. c5t0d0 <DEFAULT cyl 17879 alt 2 hd 255 sec 63>
          /pci@5,0/pci1022,7450@4/pci108e,534d@4/sd@0,0
       1. c6t0d0 <DEFAULT cyl 17846 alt 2 hd 255 sec 63>
          /pci@5,0/pci1022,7450@4/pci108e,534d@4,1/sd@0,0
Specify disk (enter its number)[1]: 1
selecting c6t0d0
[disk formatted]
format> ver
WARNING - This disk may be in use by an application that has
      modified the fdisk table. Ensure that this disk is
      not currently in use before proceeding to use fdisk.
format> fdisk
No fdisk table exists. The default partition for the disk is:

  a 100% "SOLARIS System" partition

Type "y" to accept the default partition,  otherwise type "n" to edit the
 partition table.
y
format> par
partition> pr
Current partition table (original):
Total disk cylinders available: 17846 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders         Size            Blocks
  0 unassigned    wm       0                0         (0/0/0)             0
  1 unassigned    wm       0                0         (0/0/0)             0
  2     backup    wu       0 - 17845      136.71GB    (17846/0/0) 286695990
  3 unassigned    wm       0                0         (0/0/0)             0
  4 unassigned    wm       0                0         (0/0/0)             0
  5 unassigned    wm       0                0         (0/0/0)             0
  6 unassigned    wm       0                0         (0/0/0)             0
  7 unassigned    wm       0                0         (0/0/0)             0
  8       boot    wu       0 -     0        7.84MB    (1/0/0)         16065
  9 unassigned    wm       0                0         (0/0/0)             0

partition>

Unfortunately, since the cylinder counts were somehow different, I could not use "select" to choose the saved "myroot" layout and had to set the layout manually to match my boot disk.

partition> 0
Part      Tag    Flag     Cylinders         Size            Blocks
  0 unassigned    wm       0                0         (0/0/0)             0

Enter partition id tag[unassigned]:
Enter partition permission flags[wm]:
Enter new starting cyl[0]: 262
Enter partition size[0b, 0c, 262e, 0.00mb, 0.00gb]: 17845e
partition> 1
Part      Tag    Flag     Cylinders         Size            Blocks
  1 unassigned    wm       0                0         (0/0/0)             0

Enter partition id tag[unassigned]:
Enter partition permission flags[wm]:
Enter new starting cyl[0]: 1
Enter partition size[0b, 0c, 1e, 0.00mb, 0.00gb]: 261e
partition> pr
Current partition table (unnamed):
Total disk cylinders available: 17846 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders         Size            Blocks
  0 unassigned    wm     262 - 17845      134.70GB    (17584/0/0) 282486960
  1 unassigned    wm       1 -   261        2.00GB    (261/0/0)     4192965
  2     backup    wu       0 - 17845      136.71GB    (17846/0/0) 286695990
  3 unassigned    wm       0                0         (0/0/0)             0
  4 unassigned    wm       0                0         (0/0/0)             0
  5 unassigned    wm       0                0         (0/0/0)             0
  6 unassigned    wm       0                0         (0/0/0)             0
  7 unassigned    wm       0                0         (0/0/0)             0
  8       boot    wu       0 -     0        7.84MB    (1/0/0)         16065
  9 unassigned    wm       0                0         (0/0/0)             0

partition> lab
Ready to label disk, continue? y

partition>

Now the new disk's layout looks like my boot disk's. Next I need to tell ZFS to move to the new disk. I wasn't comfortable doing all this in the GUI, so I dropped to character mode by issuing the following command:

# svcadm disable gdm

In the character-mode login, I used the following step to move the ZFS rpool to the new disk:

# zpool replace rpool c5t0d0s0 c6t0d0s0

Sometimes you may have to use the force flag if there is already some data on c6t0d0s0:

# zpool replace -f rpool c5t0d0s0 c6t0d0s0

The command looked like it returned quite quickly, but I then found it spawns a background task to start moving the pool to the new disk. I observed it by using

# zpool iostat -v

to see the I/Os happening from the old disk to the new one,

and also

# zpool status -v

Once it had successfully finished replacing the disk, which can take a while, I made sure the swap location was also moved to the new disk in /etc/vfstab:

# vi /etc/vfstab

(replace c5t0d0s1 with c6t0d0s1)
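If you would rather not wait for the reboot to pick up the new swap slice, swap(1M) can also switch it live; a sketch assuming the same slice names as above:

# swap -a /dev/dsk/c6t0d0s1
# swap -d /dev/dsk/c5t0d0s1
# swap -l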

Now, to make sure GRUB is also available on the new disk and knows where the new ZFS root will be located:

# cd /boot/grub

# installgrub stage1 stage2 /dev/rdsk/c6t0d0s0

As a final step before rebooting the operating system, I made sure the mini-root (boot archive) was updated with the new information:

# bootadm update-archive

And now I rebooted the system. Here is where there is a bit of a timing issue, since I need to press F2 to get into the CMOS boot menu and select the other disk to boot from. Booting from the other hard disk and seeing a working GRUB menu gave me a positive feeling that things were going right. Selecting the OpenSolaris installation started booting Solaris again. Since I had disabled the GUI login, it took me to the console login prompt. I logged in with my primary id and then "su"-ed into a root shell.

I checked using "zpool status -v" and "swap -l" to verify that rpool and swap were pointing to the new disk. Once I was satisfied, I enabled the GUI login as follows:

# svcadm enable gdm; exit

(Note: as soon as you enable the GUI via gdm it starts X Windows and leaves your shell hanging logged in. Hence the extra ";exit" to immediately exit after starting the GUI.)

Then I took the system down again, removed the dying disk, and sent it back to my lab manager.




Wednesday Sep 24, 2008

ZFS with Cloud Storage or Faraway Storage

Recently I have been testing a few pieces of the OpenSolaris storage projects with PostgreSQL. One of the tests involves using an iSCSI disk with PostgreSQL. Unfortunately, the storage available to me is in Colorado while my PostgreSQL server is located in Massachusetts. Latency will definitely be one of my top problems, since the storage is halfway across the country. Plus, the fact that I will be running a database server on my end doesn't really sound like a good idea. Come to think of it, this could be a common problem nowadays, since Cloud Storage (for example the Amazon S3 web service) could optimistically be halfway across the country and pessimistically on the other side of the world.

So what are my options to solve such a problem? ZFS in OpenSolaris 2008.05 has many new features, a couple of which can potentially help:

  • ZFS Separate Intent Log: Ability to separate out the ZFS Intent Log (or, in simple terms, the log of writes)
  • ZFS L2 ARC: Ability to use a Level 2 Adaptive Replacement Cache which can be a block device (or, in simple terms, a read cache on a device)

That's an interesting set of new features that I thought would be useful in my case: one to log writes separately on a fast disk, and another to use a fast disk for caching reads. Of course, I am not the first to write on this topic; these new features have been discussed at length, especially in connection with SSDs. But I plan to solve the latency problem of my Cloud Storage with these new ZFS features and some local disk partitions that I have in my system.

Many people make the analogy that, compared to regular 7,200 rpm SATA, 10,000 rpm SATA/SAS or 15,000 rpm SAS drives, SSDs act like 40,000 rpm drives. Extending this to Cloud Storage, I think Cloud Storage is more like a 500 rpm to 1,000 rpm drive, depending on the phase of the moon and/or the stock market.

Anyway, to continue with my setup, I used an iSCSI disk exported in Colorado. I created a regular zpool on top of it on the server in Massachusetts and called it "colorado", as shown below:

# zpool create colorado c9t600144F048DAAA5E0000144FA6E7AC00d0

Then I created a PostgreSQL database in /colorado/pgdata and started loading data into it using pgbench. It was painful to do this late in the day and then wait for it to finish. At this point I also wished that pgbench had a scale factor smaller than 1 (maybe it does, I don't know). Anyway, I did not have the patience to let it finish; I terminated the process after about 8 minutes, as that scenario was unacceptable.

$ time /usr/postgres/8.2/bin/pgbench -i -s 1 pgbench
creating tables...
10000 tuples done.
20000 tuples done.
30000 tuples done.
40000 tuples done.
50000 tuples done.
60000 tuples done.
^C

real    8m1.509s
user    0m0.052s
sys     0m0.011s


I destroyed that "colorado" pool .  I referred  the Solaris ZFS Administration Guide to get help with the updated syntax of these new features and recreated the pool using a local disk partition for cache and another for log as follows:

 # zpool create -f colorado2 c9t600144F048DAAA5E0000144FA6E7AC00d0 log c5t0d0s0 cache c5t0d0s1

Then I repeated the steps and recreated the database on it, and started loading the data again with pgbench.

Boom!!! It finished in record time:


$ time /usr/postgres/8.2/bin/pgbench -i -s 1 pgbench
creating tables...
10000 tuples done.
20000 tuples done.
30000 tuples done.
40000 tuples done.
50000 tuples done.
60000 tuples done.
70000 tuples done.
80000 tuples done.
90000 tuples done.
100000 tuples done.
set primary key...
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "branches_pkey" for table "branches"
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "tellers_pkey" for table "tellers"
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "accounts_pkey" for table "accounts"
vacuum...done.

real    0m4.560s
user    0m0.076s
sys     0m0.011s


Not bad. Writes that would otherwise have taken in excess of 8-10 minutes to complete are now acknowledged within about 4 seconds on the non-volatile local cache/log combination, with ZFS syncing them up to the actual storage behind the scenes.

Now let's try a quick pgbench run to make sure it executes as expected:

$ time /usr/postgres/8.2/bin/pgbench -c 1 -s 1 pgbench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
number of clients: 1
number of transactions per client: 10
number of transactions actually processed: 10/10
tps = 113.030111 (including connections establishing)
tps = 119.500012 (excluding connections establishing)

real    0m0.144s
user    0m0.005s
sys     0m0.008s

So, using these new features of ZFS in OpenSolaris 2008.05 can help hide the latency of low-cost Cloud Storage and actually make it usable even for a database server.

Plus, I hear ZFS is also coming out with recovery options which will allow recovery not only with the separate log but also without the separate log and cache devices being available. If your server dies and takes your ZIL disk with it, you can build another server, attach it to your cloud device and regain your data even without the device to replay the ZIL. If you do have your ZIL log, you can use it to recover the most current version. This is important for disaster recovery, where you are willing to take whatever you have and start the business again from that point.


Wednesday Jun 20, 2007

How about a ZFS-Extension module for Postgres?

Now that ZFS on Linux is live, one can argue that ZFS is available on multiple OSes including Solaris, FreeBSD (thanks Marc), Linux and Mac OS X (almost). I started thinking about how ZFS could simplify the effort for open source databases (OSDBs) to provide features that ZFS already offers.

An obvious question is which features of ZFS would be appealing to an OSDB. I think the following ZFS features provide quite a bit of excitement for any DBA. In this example I will take a Postgres view (based on what I know about Postgres, which may not be much, and I might be missing some features in my evaluation here):

  • ZFS offers simple commands to setup bare disks
  • ZFS offers simple commands to quickly monitor the status of the disks
  • ZFS offers snapshot features, allowing a roll back to a previous snapshot ("Time Machine")
  • ZFS allows devices to be added quickly without causing downtime
  • ZFS offers checksums, self-healing and compression capabilities

So for this example how can PostgreSQL take advantage of these features?

My first reaction would be an extension to the SQL standard "TABLESPACE" commands. If I were to create a ZFS feature extension to PostgreSQL, the first command I would add options to is CREATE TABLESPACE. For example, PostgreSQL could have a default ZFS pool option, and every "CREATE TABLESPACE" could create an additional dataset within ZFS with its own checksum, compression and snapshot options. That by itself is a huge gain in the number of features. Of course, snapshots here will be tricky since you still have to worry about keeping the logs in sync, and that could require a lot of handling. The simplest option would be to put snapshots on the default pool itself and roll back the whole database to the point when the snapshot was last taken; this is great when you are about to migrate to a new version or are going to make lots of changes to the database. Ah, but I am deviating from CREATE TABLESPACE to the whole database. Getting back to CREATE TABLESPACE: I think it would be easy to have an extension in PostgreSQL which creates ZFS datasets with the mount point determined by the syntax. The advantage is that the DBA can then quickly use zfs list to see all the tablespaces and the amount of space used, and fine-tune each one to its workload using the "recordsize" tunable of ZFS.
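Until such an extension exists, the same effect can be approximated by hand today. A sketch with hypothetical pool and dataset names (pgpool/tbs1 mounted at /pgtbs1):

# zfs create -o mountpoint=/pgtbs1 -o recordsize=8k -o compression=on pgpool/tbs1
# chown postgres:postgres /pgtbs1
$ psql -c "CREATE TABLESPACE tbs1 LOCATION '/pgtbs1';" postgres

The DBA then gets the zfs list view of all tablespaces for free; the only thing missing is that CREATE TABLESPACE itself knows nothing about the dataset.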

Similarly I think "initdb" and "CREATE DATABASE" can be extended to have and set default ZFS pools to have the snapshot features work at the database level itself. Now that's cool... Imagine if you have a testbed where you destruct the database every now and then and then have to rebuild the database again just to destruct it again to do "breakdown" tests, such snapshot feature can cut down your "setup" time by a magnitude.

There could be other things that would be beneficial here. I think I will have to discuss this more with other folks.

 




About

Jignesh Shah is Principal Software Engineer in Application Integration Engineering, Oracle Corporation. AIE enables integration of ISV products including Oracle with Unified Storage Systems. You can also follow me on my blog http://jkshah.blogspot.com
