Monday Sep 14, 2009

Infobright Tuning on OpenSolaris/Solaris 10

Recently I was working on a project which used Infobright as the database. The version tested was 3.1.1 both on OpenSolaris as well as Solaris 10. Infobright is like a column-oriented database engine for MySQL primarily targeted towards data warehouse, data mining type of project deployments.

While everything was working as expected, one thing we did notice that as number of concurrent connections tried to query against the database we noticed that queries deteriorated fast in the sense that not much parallel benefits were being squeezed from the machine. Now this sucks! (apparently sucks is now a technical term). It sucks because the server has definitely many  cores and typically each Infobright query still can at the max peg a core. So the expectation will be typically to atleast handle concurrent queries which is close to the number of cores  (figuratively speaking though in reality it depends).

 Anyway we started digging into this problem. First we noticed that CPU cycles were heavy so IO was probably not the culprit (in this case). Using plockstat we found

# plockstat -A -p 2039    (where 2039 is the PID of mysqld server running 4 simultaneous queries)

Mutex hold 

Count     nsec Lock                         Caller 
3634393     1122`libc_malloc_lock`_Znwm+0x2b 
3626645     1047`libc_malloc_lock`_ZdlPv+0xe 
    2 536317885 0x177b878                    mysqld`_ZN7IBMutex6UnlockEv+0x12 
   12  6338626 mysqld`LOCK_open             mysqld`_Z10open_tableP3THDP13st_table_listP11st_mem_rootPbj+0x55a 
 9057     1275`libc_malloc_lock`_Znwm+0x2b 
 8493     1051`libc_malloc_lock`_ZdlPv+0xe 
 7928     1119`libc_malloc_lock`_ZdlPv+0xe 
    5   326542 0x177b878                    mysqld`_ZN7IBMutex6UnlockEv+0x12 
  683     1189`libc_malloc_lock`_Znwm+0x2b 
  564     1339`libc_malloc_lock`_Znwm+0x2b 
  564     1274`libc_malloc_lock`_Znwm+0x2b 
  564     1156`libc_malloc_lock`_ZdlPv+0xe 
   17    36292 0x1777780                    mysqld`_ZN7IBMutex6UnlockEv+0x12 
    2   246377 mysqld`rccontrol+0x18        mysqld`_ZN7IBMutex6UnlockEv+0x12 
   57     8074 mysqld`_iob+0xa8   `_ZNSo5flushEv+0x30 
  218     1479`libc_malloc_lock`_Znwm+0x2b 
    4    78172 mysqld`rccontrol+0x18        mysqld`_ZN7IBMutex6UnlockEv+0x12 
    4    75161 mysqld`rccontrol+0x18        mysqld`_ZN7IBMutex6UnlockEv+0x12 

R/W reader hold 

Count     nsec Lock                         Caller 
   44     1171 mysqld`THR_LOCK_plugin       mysqld`_Z24plugin_foreach_with_maskP3THDPFcS0_P13st_plugin_intPvEijS3_+0xa3 
   12     3144 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c 
    1    14125 0xf7aa18                     mysqld`_ZN11Query_cache21send_result_to_clientEP3THDPcj+0x536 
    1    12089 0xf762e8                     mysqld`_ZN11Query_cache21send_result_to_clientEP3THDPcj+0x536 
    2     1886 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c 
    2     1776 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c 
    1     3006 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c 
    1     2765 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c 
    1     1797 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c 
    1     1131 mysqld`THR_LOCK_plugin       mysqld`_Z24plugin_foreach_with_maskP3THDPFcS0_P13st_plugin_intPvEijS3_+0xa3 

Mutex block 

Count     nsec Lock                         Caller 
 2175 11867793`libc_malloc_lock`_ZdlPv+0xe 
 1931 12334706`libc_malloc_lock`_Znwm+0x2b 
    3 93404485`libc_malloc_lock   mysqld`my_malloc+0x32 
    1    11581`libc_malloc_lock   mysqld`_ZN11Item_stringD0Ev+0x49 
    1     1769`libc_malloc_lock`_ZnwmRKSt9nothrow_t+0x20

Now typically if you see libc_malloc_lock in a plockstat for a  multi-threaded program then it is a sign that the default malloc/free routines in libc is the culprit since the default malloc is not scalable enough for a multi-threaded program. There are alternate implementations which are more scalable than the default. Two such options which are already part of OpenSolaris, Solaris 10 are and They can be forced to be used instead of the default without recompiling the binaries by preloading anyone of them before the startup command.

In case of the 64-bit Infobright binaries we did that by modifying the startup script mysqld-ib and added the following line just before invocation of mysqld command.

LD_PRELOAD_64=/usr/lib/64/; export LD_PRELOAD_64

What we found was now the response times for each query was more in-line as it was being executed on its own. well not true entirely but you get the point. For a 4 concurrent queries we found that it had improved from like 1X to 2.5X reduction in total execution time.

Similary when we used we found the reduction more like 3X when 4 queries were executing concurrently.

LD_PRELOAD_64=/usr/lib/64/; export LD_PRELOAD_64

Definitely something to use for all Infobright installations on OpenSolaris or Solaris 10.

In a following blog post we will see other ways to tune Infobright which are not as drastic as this one but still buys some percentage of improvements. Stay tuned!!

Tuesday Jul 21, 2009

Olio on 6-core Opterons (Istanbul) based Sun Systems

Sun is launching systems with multisocket  6-core Opterons (Istanbul) today. Last week I got access to  Sun Fire X4140 with 2 x 6-core Opterons with 36GB RAM. It is always great to see such a 1RU system packaged with so many x64 cores.

# psrinfo -vp
The physical processor has 6 virtual processors (0-5)
  x86 (chipid 0x0 AuthenticAMD family 16 model 8 step 0 clock 2600 MHz)
    Six-Core AMD Opteron(tm) Processor 8435
The physical processor has 6 virtual processors (6-11)
  x86 (chipid 0x1 AuthenticAMD family 16 model 8 step 0 clock 2600 MHz)
    Six-Core AMD Opteron(tm) Processor 8435

I decided to take the system for a test drive with Olio. Olio is a Web 2.0 toolkit consisting on a web 2.0 event calendar application  which can help stress a system. Depending on your favorite scripting language you can use either PHP, Ruby on Rails, Java as the language used to create the application. (I took the easy way out and selected Olio PHP's prebundled binary kit)

Please don't let the small 2MB kit size fool you thinking it will be a easy workload to test it out. While setting it up I figured that to generate the data population for say 5000 users you will need space with atleast 500GB disk space for the content that it generates for it. Yes I quickly had to figure out how to get a storage array for Olio with about 800GB LUN.

Olio requires a webserver, PHP (of course) and  a database for its metadata store (it has scripts for MySQL already in the kit). The system came preconfigured with Solaris 10 5/09. I downloaded MySQL 5.4.1 beta  and also the Sun WebStack kit which has Apache Httpd 2.2, PHP 5.2 (and also MySQL 5.1 which had not used since I had already downloaded MySQL 5.4 Beta). Memcached 1.2.5 is part of the WebStack download and Olio is configured to use it also by default (but can be disabled too).

Eventually everything was installed and configured in the same X4140 and using the Faban Harness on another system started executing some runs with file store and the meta store preconfigured to handle all the way up to 5000 concurrent users. The results are as follows:


Here are my observation/interpretations:

  • Eventually beyond 10 cores run I find that the system memory (36GB) is not enough to sustain more concurrent users to fully utilize the remaining cores. I would probably need RAM  in the range of 48GB or more to handle more users. (PHP is not completely thread-safe and hence the web server used here spawns processes)
  • This 1RU system can handle more than 3200 users  (with everything on the same system) with CPU cycles to spare is pretty impressive. It means you still have enough CPU to log into the system without seeing degraded performance.
  • Actually you can see here that SMP (or should be called  SMC - Scalable Multi Cores) type system helps when the initial cores are added  instead of using multiple single core systems (ala in Cloud).

 In an upcoming blog entries I will talk more about the individual components used.

Wednesday May 27, 2009

Postgres on OpenSolaris using 2x Quad Cores: Use FX Scheduler

During my PGCon 2009 presentation there was a question on the saw tooth nature of the workload results on the high end side of benchmark runs. To which Matthew Wilcox (from Intel) commented it could be scheduler related. I did not give it much thought at that time till today when I was trying to do some iGen runs for the JDBC Binary Transfer patch (more on that in another blog post) and also Simon's read only scalability runs . Then I realized that I was not following one of my one tuning advice for running Postgres on OpenSolaris. The advice is to  use FX Class of scheduler instead of the default TS Class on OpenSolaris . More details on various scheduler classes can be found on

Now how many times I have forgotten to do that with Postgres on OpenSolaris I have no idea. But yes it is highly recommended specially on multi-core systems to use FX scheduler class for Postgres on OpenSolaris. How much gain are we talking about? The following graph will give an indication using the default TS scheduler class Vs the FX Scheduler class using the iGen benchmark.

The gain is about 14% by just switching over to FX Class. How did I get Postgres server instance to use FX class? I cheated and put all processes of the user (with userid 236177)  in FX class using the following command line.

# priocntl -s -c FX -i uid 236177

One thing to figure out is how to make sure Postgres uses FX scheduler class out of the box on OpenSolaris so I don't keep forgetting about that minute performance tip.

Tuesday Nov 18, 2008

Quest for OpenSolaris based Appliance

Recently I burned a copy of OpenSolaris 2008.11 RC1 and used it with Songbird (using pkg install SUNWsongbird)  and Fluendo MP3 Decoder (which is free for OpenSolaris),  Flash and soon I had a setup in my bedroom which I love to call OpenSolaris Home Theatre Edition that I used to listen to my songs collection and watch online shows that I controlled via Remote Desktop  which to me was a serious contender to those Media Centers out there. 

However I realized that while I wanted to "Pump it up" for my personal usage, I really wanted to "Strip it Down" for business usage. What I meant is in order for someone to try it out with say another Open Source Software it is easier now to say try that Open Source Software pre-installed on OpenSolaris in a Virtualbox Image. However I found it very hard to do it in practice.

Say for example I want to get somebody to try out the latest version of MySQL on OpenSolaris. The easiest thing is to give them a VDI image of preinstalled version of MySQL 5.1 RC  running on OpenSolaris that somebody just double clicks and boom a virutalbox instance starts up with MySQL ready to deliver.

However there is a problem with that theory. The VDI image of OpenSolaris fresh install in a virtualbox instance is about 2.5 - 3 GB. Of course adding MySQL on top of it won't drastically increase the size but I still have a problem with the base size itself. Since the only way that this can work is to use some sort of Peer to Peer File Sharing technology as hosting and hoping people will download this DVD Size downloads without any problems is like going at 10:00am for a Thanksgiving Deal at a store that opens at 6:00am with less than 20 SKU available (Tough Luck!!).

Anyway I started dissecting on how to reduce the size of the VDI image. There are few tricks provided by Virtualbox itself to release zero'ed out sectors using

VBoxManage modifyvdi VDINAME compact

However trying this out on a VDI holding OpenSolaris 2008.11 RC1 candidate did not really help.

The next thing I tried was to list out all the packages that are installed using

pkg list

There are about 550 packages installed as reported by that list. I tried  removing many of the non-essential desktop things (Aha who needs Xorg, Gnome, Firefox, Thunderbird) and reduced it to less than 400 packages. However even with that the VDI images is still not much smaller than 2.5GB. I tried the trick of cloning the disk to another VDI via ZFS replace thinking it will get rid of all the freed up space obtained by removing all the packages but the resulting new VDI image was still 2.5GB.

Looking at ZFS list output I found that for my virtualbox instance with 512MB I have three main zfs file systems defined which gives me the approximate 2.5GB usage.

rpool/ROOT/opensolaris- The legacy mount which is root (/) which is about 1.3 GB

rpool/dump which is about 512MB

rpool/swap which is again about 512MB

Now the goal is to reduce the VDI size without creating complications for the end user trying out the VDI and also still have the capability of going back to the fully installed version of OpenSolaris.

Hence the quest still continues..

Monday Nov 10, 2008

8x Quad-Core Opteron PostgreSQL/MySQL Server Try and Buy Offer

Somehow I must have missed this  try and buy offer before.  Evaluate the Sun Fire X4600 with PostgreSQL, MySQL or Microsoft SQL Server 2005. The Sun Fire X4600 M2 Server included in the program is available in two configurations. First configuration is  4 x Quad-Core with 16GB RAM and second configuration is with 8x Quad-Core (yes 32 cores in all) with 64GB RAM. Of course it only comes with two disks in it which means also need to "Try and Buy" the Sun StorageTek 2540 Array or the Sun Storage 7110 Unified Storage System with it.

Would love to see some real life deployment saturate 32-core Opteron system using only PostgreSQL or MySQL as the database system in it. 

Wednesday Nov 05, 2008

Yes You Can - Save $$$ on cost of Proprietary Databases

The New Sun Microsystems  announced a new Glassfish/MySQL/OpenSolaris SPECjAppServer2004 result today. The  real highlight is that all software used in this benchmark is all Open Source Software. It is  a win for Open Source Software including Open Source Databases like PostgreSQL and MySQL. We need more of such benchmarks to highlight the exorbitant prices charged by Proprietary Database & other Software Vendors who charge and force customers to give them all their major dollars of their IT budget.

Tom Daly's blog entry highlight that in terms of Price/Performance, the proprietary database vendors who  conveniently also happen to be the Application Server Vendors charge  a much higher cost including all hardware/software (10X) associated with same performance which can be obtained  by Open Source database alternatives like PostgreSQL and MySQL. Check BM Seer's blog entry for more comparison.

SPEC required disclosure : - SPEC, SPECjAppServer reg tm of Standard Performance Evaluation Corporation. Results from as of 5th Nov 2008 All comparisons are based on the SPEC SPECjAppServer2004JOPS@Standard metric from or on pricing made using the bill of materials included in each SPECjAppServer2004 result

Wednesday Sep 24, 2008

ZFS with Cloud Storage or Faraway Storage

Recently I am been testing few pieces of Storage projects of OpenSolaris with PostgreSQL. One of tests involves using an iSCSI disk  with PostgreSQL.  Unfortunately the storage that's available is  in Colorado  while my PostgreSQL server is  located in Massachusetts. Latency will definitely be one of my top problems since storage  is halfway across the country (in Colorado). Plus the fact that I will be running a database server on my end  doesn't really sound like a good idea. Come to think about it, this could be a more common problem nowadays since Cloud Storage (for example Amazon S3 Webservice ) could be  optimistically  half way across the country and pessimistically be on the other side of the world.

 So what are my options to solve such problems?  ZFS in OpenSolaris 2008.05 has many new features, couple of which can potentially help with my problem. 

  • ZFS Separate Intent Log: Ability to separate out the ZFS Intent Log (or log writes in simple terms)
  • ZFS L2 ARC: Ability to use a Level 2 Adaptive Replacement Cache which can be block device (or cache reads on device in simple terms).

Thats an interesting set of new features that I thought will be useful in my case. One to log writes separately which can be on a fast disk and another to use a fast disk for caching reads. Of course I am not the first to say on this topic since these new features have been discussed in length a lot specially with SSDs. But I plan to  solve the problem of latency of my Cloud Storage with these new ZFS features and some local disks partitions that I have in my system.

Many people do the analogy that compared to a regular 7,200 rpm SATA or 10,000 rpm SATA/SAS or 15,000 SAS drives, the SSDS act like 40,000 rpm drives. Well extending this to Cloud Storage, I think Cloud Storage is like more like a  500 rpm to 1000 rpm drives depending on the phase of the moon and/or the stock market.

Anyway to continue with my setup, I used an iSCSI disk exported  in Colorado.   I created a regular zpool on top of it  on the server in Massachusetts  and called it "colorado" as shown below:

# zpool create colorado c9t600144F048DAAA5E0000144FA6E7AC00d0

Then I created a PostgreSQL database in /colorado/pgdata and started loading up data in it using pgbench. It was painful to do this late in the day and then  waiting for it to finish. At this point of time I also wished that pgbench had a scale factor of smaller than 1 (maybe it does I don't know). Anyway I did not have the patience to let it finish. I terminated the process after about 8 minutes as  that scenario was unacceptable.

$ time /usr/postgres/8.2/bin/pgbench -i -s 1 pgbench
creating tables...
10000 tuples done.
20000 tuples done.
30000 tuples done.
40000 tuples done.
50000 tuples done.
60000 tuples done.

real    8m1.509s
user    0m0.052s
sys     0m0.011s

I destroyed that "colorado" pool .  I referred  the Solaris ZFS Administration Guide to get help with the updated syntax of these new features and recreated the pool using a local disk partition for cache and another for log as follows:

 # zpool create -f colorado2 c9t600144F048DAAA5E0000144FA6E7AC00d0 log c5t0d0s0 cache c5t0d0s1

And then repeated the steps and then recreated the database on it. Then I started loading the data again with pgbench.

Boom!!!  it finished in record time:

$ time /usr/postgres/8.2/bin/pgbench -i -s 1 pgbench
creating tables...
10000 tuples done.
20000 tuples done.
30000 tuples done.
40000 tuples done.
50000 tuples done.
60000 tuples done.
70000 tuples done.
80000 tuples done.
90000 tuples done.
100000 tuples done.
set primary key...
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "branches_pkey" for table "branches"
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "tellers_pkey" for table "tellers"
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "accounts_pkey" for table "accounts"

real    0m4.560s
user    0m0.076s
sys     0m0.011s

Not bad. Cutting the latency of writes of something that would have taken in excess of 8-10 minutes is atleast recorded within 4 seconds on nonvolatile cache/log combination and allowing ZFS to sync it up to the actual storage. 

Now trying a quick pgbench run to make sure it executes as expected.

$ time /usr/postgres/8.2/bin/pgbench -c 1 -s 1 pgbench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
number of clients: 1
number of transactions per client: 10
number of transactions actually processed: 10/10
tps = 113.030111 (including connections establishing)
tps = 119.500012 (excluding connections establishing)

real    0m0.144s
user    0m0.005s
sys     0m0.008s

So using these new features of ZFS in OpenSolaris 2008.05  can helps hide the latency of these low-cost Cloud Storage and actually make them usable even as a database server.

Plus I heard ZFS is also coming out with recovery  options which will allow to recover not only with the log but also without the separate log and cache device available. If your server dies and takes your ZIL disk with it, you can go and build another server and attach to your cloud device  to regain your data even if you don't have the device to replay the ZIL.If you have your ZIL Log, you can use it to get most current version.  This is important during disaster recovery where you are willing to take whatever you have and start the business again from that point.

Tuesday Jun 10, 2008

PostgreSQL Monitor - A Demo

Netbeans 6.1 was released recently which includes PostgreSQL JDBC driver. It had been a while since I had done programming as a developer, so I decided to take it for a test drive to refresh my programming skills with it. Well here is the resulting demo application:

PostgreSQL Monitor

Remember it is just a demonstration application. It requires your client (laptop or desktop ip address) to be trusted in  pg_hba.conf  ip adddress in order to work with your database.

Update on Sept 30,2008: Added a new dialog box which includes option for password and also the new L&F called nimbus (JRE6u10 recommended). Here is a quick screenshot:

Friday May 16, 2008

PGCon 2008 next week in Ottawa, Canada

It is that time of the year when many PostgreSQL fans gather in Ottawa, Canada for PGCon 2008 next week. This will be my first visit to PGCon in Ottawa. Earlier this year I had presented two sessions "PostgreSQL and Benchmarks" and "Best Practices of PostgreSQL on Solaris" at PostgreSQL Conference 2008 East in Maryland. Thanks to that visit, this time I might recognize many people by face this time around.

Sun is a Gold Sponsor at PGCon 2008. There will be quite a bit of presence from Sun in PGCon. Josh, Max, Robert, Magne, Zdenek, Jim, Mayuresh et all will be present out there.

Josh Berkus is doing a tutorial on  "GUCs: A Three Hour Tour" on Tuesday morning 9:00 am.

Susanne Ebrecht (from MySQL team @ Sun) will also be presenting "What PostgreSQL could learn from MySQL" on Thursday 1:30pm.

 I  will be presenting "Problems with PostgreSQL on multi-core Systems with multi-terabyte Data" at Thursday 3:00pm.  I am actually presenting on behalf of the PostgreSQL Performance Team @ Sun and based on various performance work done by the group.

Robert Lor is also doing a  5-min Lightning Talk on New DTrace Probes proposed for PostgreSQL 8.4. Not to steal his thunder but  the demo will be quite interesting. It is part of Lightning Talks on Thursday 5:30pm

Zdenek Kotala will be presenting "PostgreSQL Upgrade Project" on Friday 3:00pm . In-place Upgrade in an eagerly awaited project.

Personally I will be meeting many of the Sun folks also for the first time in person. Generally we talk on conference calls, emails, etc.  Definitely looking forward to that.


Tuesday May 06, 2008

How OpenSource and ISVs can leverage OpenSolaris

I briefly mentioned in my earlier post that OpenSolaris binary is probably just a chapter of a book. To really understand how the benefits of OpenSolaris lets look at the needs of a typical Software Application

A Software Application at a very base level of being a usage product needs the following:

  1. Operating System: First and foremost which generally nobody wants to rewrite unless there is something missing that you really need. Generally it is part of "Systems requirement" which defines the operating systems supported.
  2. Method of distributing the software: Either via CD, DVD or via internet downloads
  3. Installer: Installing the software on the target machine
  4. Patching/Upgrading: Upgrading the software when new versions are released (specially with security updates)

So how does OpenSolaris solve this problems?

Lets start with (1). For the first time that I am aware, this is first Any Solaris release which makes it so friendly to even bundle an Operating System with a software product that the customers of Software application need not be even concerned about if the server (virtual for that matter) is going to do only the task of running this software application. I may not be far from truth when I mention that since I am aware of many people who are using Virtual Machine which is only doing one tasks. Suddenly the number of operating systems increases drastically and multiplied with number of operating systems available, I forsee more and more people bundling the operating system as part of their software application. In such sense OpenSolaris is already ahead of the curve providing tools to make the whole bundling experience easy with open source projects which includes Distribution Constructor and the new Installer project. Infact the New Installer project even makes it easy to not only install OpenSolaris but also your own application with it  and since it is all open source it can all be modified to have your own branding image out there giving a sense of control to the "key task" product to be highlighted during installation. (Maybe call it PostgreSQL OS). Looks like I am also covering (3).

Now to cover (2) and (4) the repository with pkg(5) command is ready to cover those responsibilities. The thing to realize is that there can be more than one repositories. For instance is a repository which has all the packages for OpenSolaris itself, but an ISV or OpenSource communities can create their own repositories for their Software application. Now suddenly having a software application which is easy to distribute, install, update is now so easy to create and maintain with this ecosystem, its hard to imagine why this was not done years ago. 

This new eco-system makes it easy to cater to multiple type of Software applications

1. Controller Software application where all installations need to have identical stack. (Creating custom distribution helps achieve that)

2. Latest and Greatest Software application model (using pkg makes it very easy)

3. Pick and Choose model : As repositories increase in number there will be choices where to get packages and pre-built binaries for your needs.

Ofcourse I am probably skipping many other requirements of an ISV since every ISV is different but if dissected probably the new ecosystem can probably fill most (if not all) of the needs in this new world of Solaris.







Jignesh Shah is Principal Software Engineer in Application Integration Engineering, Oracle Corporation. AIE enables integration of ISV products including Oracle with Unified Storage Systems. You can also follow me on my blog


« April 2014