Tuesday Aug 22, 2006


We just announced CoolStack, a collection of the most commonly used open source software, compiled with Sun Studio 11 and optimized for the Solaris/UltraSPARC platform.

It includes an AMP package (Apache, PHP and MySQL) and individual MySQL, Squid, PHP, Perl and Tomcat packages.
It makes installing SAMP on Solaris much easier. Each package also includes a README explaining which flags and options were used to compile it, in case you need to recompile.

Try it out and let us know what you think of it.

Tuesday Aug 08, 2006

MySQL InnoDB performance tuning on T2000


MySQL is one of the most popular open source databases.  It supports several pluggable storage engines: MyISAM manages non-transactional tables and is a good performer, especially in MySQL 5.1.  The InnoDB and BDB storage engines provide transaction-safe tables; both were acquired by Oracle not long ago.  The MEMORY storage engine, formerly known as the heap storage engine, provides in-memory tables.

InnoDB supports both file systems and raw disks on Solaris, and it does row-level locking.
MySQL InnoDB allocates a few types of memory buffers from the heap.  The main buffer, sized by innodb_buffer_pool_size, caches the data and indexes of the tables. Each buffer block is 16 KB.  A dedicated thread uses pread to bring data into the buffer from disk. There are also other memory buffers such as the sort buffer, the log buffer, and the query cache buffer.  To get an idea of whether those buffers need to be larger, monitor MySQL statistics with the following commands.

mysql> show innodb status 

mysqladmin extended-status

It is important to understand how MySQL executes SELECT queries by using EXPLAIN. That can help you optimize SELECT performance.
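As a hedged illustration (the table and column names here are made up, not from the original evaluation), an EXPLAIN session might look like this:

```sql
-- Hypothetical query: check how MySQL will execute it before tuning.
EXPLAIN SELECT o.total
FROM orders o
WHERE o.customer_id = 42;
-- In the output, type = ALL with key = NULL means a full table scan;
-- adding an index on customer_id typically turns it into type = ref.
```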

MySQL is a single-process, multi-threaded server. For each new user connection, MySQL creates one thread. We
can control the number of threads that run simultaneously by limiting innodb_thread_concurrency. On a T2000 we normally set it equal to the number of CPUs.

There is no dedicated log writer thread.  There is one I/O thread that wakes up frequently to do group writes, but in our evaluation I rarely saw it fire.  Each user thread can issue a pwrite when its transaction commits, and these writes are serialized!  The default innodb_flush_log_at_trx_commit is 1, meaning that after each transaction commits there is a pwrite followed by an fsync.  Setting innodb_flush_log_at_trx_commit to 0 improves performance quite dramatically: the pwrite and fsync are done only once per second instead of after each commit, but the risk is that the customer might lose up to one second of data on power loss.  innodb_flush_log_at_trx_commit = 2 issues a pwrite after each commit but fsyncs only once per second.  If log disk response time is not fast enough, the log can become a huge performance bottleneck, so it is very important to provision enough IOPS for the log disks.  We have seen performance double or triple right away after adding a few more disks for the log files.  For performance, if you use file systems to store the data files and log files, you should mount them with forcedirectio.
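The trade-off described above maps onto a single /etc/my.cnf setting; here is a sketch (which value is right depends on how much committed data you can afford to lose):

```ini
[mysqld]
# 1 (default): pwrite + fsync on every commit -- safest, slowest.
# 2: pwrite on every commit, fsync once per second.
# 0: pwrite and fsync once per second -- fastest, may lose ~1s of commits.
innodb_flush_log_at_trx_commit = 1
```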

This is an example of how to use Solaris Volume Manager (SVM) to create a soft partition and then point the log file links at that location.

metadb -d -f c1t0d0s7 c1t1d0s7
metadb -a -f c1t0d0s7 c1t1d0s7

Then create a concatenation

metainit d10 1 4 c1t0d0s0 c1t1d0s0 c1t2d0s0 c1t3d0s0

You can stop here and run newfs on /dev/md/rdsk/d10, or continue and create a soft partition (for example, 10 GB) on d10

metainit d101 -p d10 10G  

After that

newfs /dev/md/rdsk/d101
mount -o forcedirectio /dev/md/dsk/d101  /logs

There is also innodb_flush_method.  The default performs very well on Solaris; it uses fsync() to flush both the data
and log files.

When MySQL starts, ten threads are created. Thread 1 handles network connections and creates new threads for new user connections. Then there are four I/O threads. There is one log-write thread that does group commits once in a while. There is one insert buffer thread: InnoDB stores data physically in primary key order, so insertion does not cause random reads on the disk, but inserts into a non-unique secondary index can cause a lot of random reads. To avoid that, InnoDB checks whether the secondary index page is in the buffer pool. If it is, InnoDB does the insertion directly into the index page. If the index page is not in the buffer pool, InnoDB inserts the record into a special insert buffer structure. The insert buffer is kept small enough to fit entirely in the buffer pool, so insertions can be done very fast. Periodically, the insert buffer is merged into the secondary index trees in the database; often several insertions can be merged into the same index page, saving disk I/O operations. There is another thread whose purpose I am not sure of.  Thread 6 handles rollback. Threads 7 and 8 are monitoring threads. Thread 9 is the master thread. Thread 10 is the signal handler thread.

Here are a few steps for tuning MySQL InnoDB performance on a T2000 or other Solaris platforms
  • Choose at least MySQL version 5.0.22.  In my experience MySQL 4.1 has more internal user locks and hence doesn't scale well.  The MySQL 5.1 beta is even better than 5.0.

  • If your database is larger than 4 GB, use 64-bit MySQL, because it can give the InnoDB buffer pool more than 4 GB of memory.

  • The 64-bit MySQL 5.0.22 build from the MySQL website is a pretty good performer.  But if you want to compile it yourself on Solaris, you can refer to a good whitepaper by Jenny Chen of Sun Microsystems.  Compile time can be reduced to a few minutes using the parallel build tool dmake in Sun Studio 11.

  • You can use libumem or libmtmalloc to get scalable memory allocation on multi-threaded MySQL.  Before mysql is started, run setenv LD_PRELOAD_64 /usr/lib/sparcv9/libumem.so.  This example is for 64-bit MySQL.
  • Always keep an eye on MySQL internal statistics. Normally you can compare reads against hits to tell whether you need to increase a certain buffer size:
    • mysql> show innodb status;
    • mysql> show full processlist;
    • mysqladmin extended-status
  • Make sure there is no I/O bottleneck. Collect "iostat -xtcnz 5" on the system and look at the disk response time (column asvc_t, in milliseconds) and the average number of outstanding I/Os (column actv). Make sure you have enough IOPS for the log file disks and data file disks.
  • On Solaris systems, you can use "prstat -Lmc" to monitor the active processes on the system.
  • MySQL startup options are very important. The following is an example of the /etc/my.cnf I use to evaluate MySQL sysbench OLTP performance on a T2000
#MySQL configure
# The MySQL server
#######below is for innodb############
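As a rough illustration only (these are not the exact values used in the evaluation), such a file might contain settings along these lines:

```ini
[mysqld]
# Cache data and index pages; size to fit the working set (illustrative value).
innodb_buffer_pool_size = 8G
# Redo log buffer; flushed at commit per innodb_flush_log_at_trx_commit.
innodb_log_buffer_size = 64M
innodb_flush_log_at_trx_commit = 1
# Roughly the number of CPUs on the T2000, as suggested above.
innodb_thread_concurrency = 32
```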

Monday Aug 07, 2006

MySQL MyISAM performance tuning on T2000


The default storage engine of MySQL is MyISAM.  Unlike InnoDB, it stores each MyISAM table in three files: the schema file (.frm), the data file (.MYD) and the index file (.MYI). It supports only file system storage, and it does table-level locking.

Compared to InnoDB, MyISAM doesn't have its own data buffer cache; it only has an index buffer cache, the key buffer (variable key_buffer_size).  We have to rely on the file system buffer cache to cache data for MyISAM tables.

Here are a few steps for tuning MySQL MyISAM performance on a T2000 or other Solaris platforms
  • There is a very good performance feature in the MySQL 5.1 beta that allows MyISAM to use mmap'd memory instead of malloc'd buffers. If you experienced very high mutex contention in earlier MySQL releases, you can get a huge performance improvement.  How do you know if you have this problem?

    • If you are using MySQL 5.0 or an earlier version, collect "lockstat sleep 1" during the test. If you see ufs_lockfs_begin and ufs_lockfs_end in the callers column, you are likely to get a performance boost by moving to the latest 5.1. You need "myisam_use_mmap=1" in /etc/my.cnf to enable this feature. This is an example of lockstat output showing the problem.
Adaptive mutex spin: 140294 events in 1.145 seconds (122505 events/sec)

Count indv cuml rcnt spin Lock Caller
31341 22% 22% 0.00 9 0x600052c4d10 ufs_lockfs_end+0x70
30952 22% 44% 0.00 10 0x600052c4d10 ufs_lockfs_begin+0xe4

  • You can use libumem or libmtmalloc to get scalable memory allocation on multi-threaded MySQL.  Before mysql is started, run setenv LD_PRELOAD_64 /usr/lib/sparcv9/libumem.so.  This example is for 64-bit MySQL; for 32-bit MySQL, use setenv LD_PRELOAD /usr/lib/libumem.so.
  • Since we have to use the file system buffer cache to cache data for MyISAM tables, it is important to tune segmap_percent in /etc/system. The default is 12% on Solaris, which means only 12% of system memory can be used for the file system buffer cache.  The right value depends on your database size, but setting it too high can leave the system short of memory and cause excessive paging. In our case, we set it to 80%: in /etc/system, set segmap_percent=80. You need to reboot the system for this to take effect.
  • The key cache is an important MyISAM feature for caching index blocks, and it allows multiple threads to access the key buffer simultaneously. You can check its performance by running "show status" and looking at the ratio of Key_reads to Key_read_requests; it should be less than 1%.  Otherwise, you might need to increase key_buffer_size. The maximum key_buffer_size is 4 GB, but you can create multiple key caches (the 4 GB limit applies to each cache individually, not as a group). That also helps because access to one key cache structure does not block access to the others.
  • Make sure there is no I/O bottleneck. Collect "iostat -xtcnz 5" on the system and look at the disk response time (column asvc_t, in milliseconds) and the average number of outstanding I/Os (column actv).
  • On Solaris systems, you can use "prstat -Lmc" to monitor the active processes on the system.
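The libumem preload described above can be sketched in Bourne shell syntax (the setenv commands in the text are the csh equivalent):

```shell
# Preload libumem so malloc/free scale across MySQL's many threads.
# 64-bit MySQL (path is the standard Solaris location):
LD_PRELOAD_64=/usr/lib/sparcv9/libumem.so
export LD_PRELOAD_64
# For 32-bit MySQL you would set LD_PRELOAD=/usr/lib/libumem.so instead.
# Then start MySQL as usual, for example:
#   bin/mysqld_safe --user=mysql &
```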
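The key cache miss-ratio check above can be scripted. This sketch uses a made-up extended-status excerpt; in practice you would pipe in the output of mysqladmin extended-status instead:

```shell
# Hypothetical excerpt of `mysqladmin extended-status` output.
cat > /tmp/key_status.txt <<'EOF'
| Key_read_requests | 1000000 |
| Key_reads         | 5000    |
EOF

# Miss ratio = Key_reads / Key_read_requests; under 1% is healthy,
# otherwise consider raising key_buffer_size.
awk '$2 == "Key_read_requests" {req = $4}
     $2 == "Key_reads"         {miss = $4}
     END {printf "key cache miss ratio: %.2f%%\n", 100 * miss / req}' /tmp/key_status.txt
```

Here 5000 reads against 1000000 requests gives a 0.50% miss ratio, comfortably under the 1% guideline.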

Tuesday Dec 06, 2005

UltraSPARC T1 performance on SAP


We just launched UltraSPARC T1 based systems today. There is a lot of debate about whether it is a good fit for running SAP. Is response time an issue because of the CMT architecture?

I have spent a lot of time working on SAP Sales and Distribution performance on the Sun Fire T2000 system, and it runs very well with very good response times.

Unlike other CPU architectures, on UltraSPARC T1 each core has four hardware threads sharing one pipeline. There are a number of ways for a hardware thread to be scheduled or unscheduled on the single pipeline. Every clock tick, a different thread is scheduled to run in round-robin order. If a hardware thread is stalled on a cache miss, the hardware skips it and schedules the next thread. If a hardware thread has no work to do, it is also skipped and the next thread is scheduled on the pipeline. This makes more efficient use of the execution unit and avoids wasting CPU cycles.

CPU speed is only a small factor in how well an application runs on a system. More often, the CPU is stalled on memory accesses. This means that even on a system clocked at 3.0 GHz, the number of useful CPU cycles you get out of it is much less than that. How much less depends on the CPU architecture and the application itself. But on the Sun Fire T2000 system, the rate of useful CPU cycles doing real work is close to the CPU clock speed.

The Solaris CMT scheduler is very smart. If you have only a few processes, the OS scheduler spreads them across cores evenly, so you don't end up with all four threads on one core 100% busy while the other cores sit idle. Running a single-threaded batch job can be an issue on UltraSPARC T1 based systems, but if the application has a small working set, it should perform close to another CPU clocked at 1.2 GHz.

Let's look at some of the scaling data we collected. The experiment was done without much tuning; we incrementally enabled more hardware threads on each core. This is SAP SD ECC 5.0 running on a Sun Fire T2000 box with 8 cores and 32 GB of memory.
# of threads on each core    Performance/baseline
1                            1.0x
2                            1.7x
3                            2.2x
4                            2.5x
With two threads, performance improves by 70% over using one hardware thread. Adding the third thread improves it another 30%, and the fourth adds 14% more. More hardware threads keep the execution units busier.

Here are a few tips for tuning SAP on UltraSPARC T1 based systems. Remember, the TLB is a precious resource; reducing TLB misses is one of the keys to high performance. Solaris integrated several changes to put an executable's text, heap and anonymous memory on large pages by default. Here are a few parameters you should use in the SAP profile for all SAP instances.

es/use_shared_memory = true

This parameter gets extended memory and all the other buffers to use shared memory segments. If "es/use_shared_memory" is not set, extended memory will be allocated on anonymous pages.

Also set this parameter

ipc/sem_fast_watchdog = 0

This disables a watchdog process that constantly checks whether the shared memory segments are healthy; this process consumes a lot of CPU.

We also use a shared memory pool to reduce the number of shared memory segments SAP tries to allocate.

For example,

ipc/shm_psize_10 = 1290000000
ipc/shm_psize_01 = -10
ipc/shm_psize_02 = -10
ipc/shm_psize_03 = -10
ipc/shm_psize_04 = -10

This tells the operating system that SAP wants to allocate one 1.29 GB shared memory segment; the first buffer, with key 01, will be allocated from this big segment, and so on.

Note that SAP has different types of buffers. Most of them can be allocated from this pool if you find the key
and set ipc/shm_psize_<key> = -10. You can also specify a size if you don't like the default.

Most importantly, always check the parameters before you start up SAP. This irons out any potential errors. Generally, if there is a warning saying that a certain parameter is unknown, it is OK to ignore it.

sappfpar check pf=<profile name>

Here is an example output.

sappfpar check pf=E50_D10_ngpae2

== Checking profile: E50_D10_ngpae2

Shared memory disposition overview
Shared memory pools
Key: 10 Pool
Size configured.....: 1290000000 (1230.2 MB)
Size min. estimated.: 1270473596 (1211.6 MB)
Advised Size........: 1274000000 (1215.0 MB)

Shared memories inside of pool 10
Key: 1 Size: 2500 ( 0.0 MB) System administration
Key: 2 Size: 213570652 ( 203.7 MB) Disp. administration tables
Key: 3 Size: 178144000 ( 169.9 MB) Disp. communication areas
Key: 4 Size: 509248 ( 0.5 MB) statistic area
Key: 6 Size: 308838400 ( 294.5 MB) ABAP program buffer
Key: 7 Size: 14838 ( 0.0 MB) Update task administration
Key: 8 Size: 67108964 ( 64.0 MB) Paging buffer
Key: 9 Size: 134217828 ( 128.0 MB) Roll buffer
Key: 11 Size: 1000000 ( 1.0 MB) Factory calender buffer
Key: 12 Size: 6000000 ( 5.7 MB) TemSe Char-Code convert Buf.
Key: 13 Size: 40500000 ( 38.6 MB) Alert Area
Key: 14 Size: 4400000 ( 4.2 MB) Presentation buffer
Key: 16 Size: 22400 ( 0.0 MB) Semaphore activity monitoring
Key: 17 Size: 2672386 ( 2.5 MB) Roll administration
Key: 18 Size: 917604 ( 0.9 MB) Paging adminitration
Key: 19 Size: 60000000 ( 57.2 MB) Table-buffer
Key: 30 Size: 13312 ( 0.0 MB) Taskhandler runtime admin.
Key: 31 Size: 4806000 ( 4.6 MB) Dispatcher request queue
Key: 33 Size: 10240000 ( 9.8 MB) Table buffer, part.buffering
Key: 41 Size: 25010000 ( 23.9 MB) DB statistics buffer
Key: 42 Size: 7232992 ( 6.9 MB) DB TTAB buffer
Key: 43 Size: 32534392 ( 31.0 MB) DB FTAB buffer
Key: 44 Size: 7958392 ( 7.6 MB) DB IREC buffer
Key: 45 Size: 4886392 ( 4.7 MB) DB short nametab buffer
Key: 46 Size: 20480 ( 0.0 MB) DB sync table
Key: 47 Size: 3073024 ( 2.9 MB) DB CUA buffer
Key: 48 Size: 300000 ( 0.3 MB) Number range buffer
Key: 49 Size: 2769392 ( 2.6 MB) Spool admin (SpoolWP+DiaWP)
Key: 51 Size: 3200000 ( 3.1 MB) Extended memory admin.
Key: 52 Size: 40000 ( 0.0 MB) Message Server buffer
Key: 54 Size: 4202496 ( 4.0 MB) Export/Import buffer
Key: 55 Size: 8192 ( 0.0 MB) Spool local printer+joblist
Key: 57 Size: 1048576 ( 1.0 MB) Profilparameter in shared mem
Key: 58 Size: 4096 ( 0.0 MB) Enqueue ID for reset
Key: 62 Size: 85983232 ( 82.0 MB) Memory pipes
Key: 63 Size: 409600 ( 0.4 MB) ICMAN shared memory
Key: 64 Size: 4202496 ( 4.0 MB) Online Text Repository Buf.
Key: 65 Size: 4202496 ( 4.0 MB) Export/Import Shared Memory
Key: 1002 Size: 400000 ( 0.4 MB) Performance monitoring V01.0
Key: 58900110 Size: 4096 ( 0.0 MB) SCSA area

Nr of operating system shared memory segments: 1

Shared memory resource requirements estimated
Total Nr of shared segments required.....: 1
System-imposed number of shared memories.: 1000
Shared memory segment size required min..: 1274000000 (1215.0 MB)
System-imposed maximum segment size......: 44828721152 (42752.0 MB)

Swap space requirements estimated
Shared memory....................: 1218.8 MB
..in pool 10 1211.6 MB, 98% used
..not in pool: 0.0 MB
Processes........................: 107.9 MB
Extended Memory .................: 2000.0 MB
Total, minimum requirement.......: 3326.7 MB
Process local heaps, worst case..: 1907.3 MB
Total, worst case requirement....: 5234.1 MB

Errors detected..................: 0

After the SAP instance has started, always run "pmap -xs <pid>" to see the address space layout of the SAP processes and check that they are on large pages as much as possible.

Here is an example output
pmap -xs 15272

15272: dw.sapE50_D10 pf=/usr/sap/E50/SYS/profile/E50_D10_ngpae2
Address Kbytes RSS Anon Locked Pgsz Mode Mapped File
0000000080000000 4096 4096 - - 4M r-x-- disp+work
0000000080400000 24576 16384 - - - r-x-- disp+work
0000000081C00000 3840 512 - - - r-x-- disp+work
0000000081FC0000 32 8 - - - r-x-- disp+work
00000000822C6000 40 8 - - - rwx-- disp+work
00000000822D0000 192 192 128 - 64K rwx-- disp+work
0000000082300000 256 - - - - rwx-- disp+work
0000000082340000 192 192 192 - 64K rwx-- disp+work
0000000082370000 64 - - - - rwx-- disp+work
0000000082380000 64 64 64 - 64K rwx-- disp+work
0000000082390000 640 - - - - rwx-- disp+work
0000000082430000 384 384 384 - 64K rwx-- disp+work
0000000082490000 64 64 - - - rwx-- disp+work
00000000824A0000 192 192 192 - 64K rwx-- disp+work
00000000824D0000 192 - - - - rwx-- disp+work
0000000082500000 704 704 704 - 64K rwx-- disp+work
00000000825B0000 1664 128 - - - rwx-- disp+work
0000000082750000 16 - - - - rwx-- disp+work
0000000084000000 36864 36864 36864 - 4M rwx-- [ heap ]
0000010000000000 1048576 1048576 - 1048576 256M rwxsR [ ism shmid=0x30001f9 ]
0000010040000000 208896 208896 - 208896 4M rwxsR [ ism shmid=0x30001f9 ]
000001004CC00000 2240 2240 - 2240 64K rwxsR [ ism shmid=0x30001f9 ]
000001004CE30000 56 56 - 56 8K rwxsR [ ism shmid=0x30001f9 ]
0000010050000000 8 8 - 8 8K rwxsR [ ism shmid=0x3000263 ]
0000010050400000 4096 4096 - 4096 4M rwxsR [ ism shmid=0x100005e ]
0000010060000000 1835008 1835008 - 1835008 256M rwxsR [ ism shmid=null ]
00000100D0000000 212992 212992 - 212992 4M rwxsR [ ism shmid=null ]

---------------- ---------- ---------- ---------- ----------
total Kb 3428296 3391800 40344 3311872

cpustat is a good tool to use if you suspect TLB misses are too high on an UltraSPARC T1 based system.

/usr/sbin/cpustat -c pic0=ITLB_miss,pic1=Instr_cnt \
-c pic0=DTLB_miss,pic1=Instr_cnt 1 10

trapstat is also a great tool; it shows how much CPU time is spent on ITLB and DTLB misses.

Here is example output from "trapstat -T 10"

cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
0 u 8k| 1261 0.1 44 0.0 | 17422 0.8 50 0.0 | 0.9
0 u 64k| 2560 0.2 3 0.0 | 166073 7.1 9 0.0 | 7.3
0 u 4m| 1378 0.1 1 0.0 | 43193 2.6 51 0.0 | 2.8
0 u 256m| 0 0.0 0 0.0 | 33228 1.5 53 0.0 | 1.5
- - - - - + - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - + - -
0 k 8k| 560 0.0 0 0.0 | 26409 1.1 18 0.0 | 1.1
0 k 64k| 0 0.0 0 0.0 | 0 0.0 0 0.0 | 0.0
0 k 4m| 0 0.0 0 0.0 | 640 0.0 2 0.0 | 0.0
0 k 256m| 0 0.0 0 0.0 | 2719 0.2 0 0.0 | 0.2

Always check the SAP R/3 buffer statistics from an SAP snapshot to make sure there is no buffer swapping.

Also make sure the proper number of update processes is configured so that the update queue length is less than 5 on average. The VB-Que column in the system enqueue performance output file ENQHIS is a good indicator of whether more update processes should be configured.

We also recommend running 4 SAP instances on an 8-core Sun Fire T2000 system.

Check out the SAP SD 2-tier benchmark website for our latest benchmark results on the Sun Fire T2000.


Thursday Dec 01, 2005

Hello There

I joined Sun about 9 years ago from the University of Wisconsin-Madison, and have spent most of that time in the Performance and Availability Engineering group. I spent many years improving Oracle TPC-C and SAP performance on various hardware platforms. Recently I started looking at open source software performance issues.
I am the mommy of a 15-month-old toddler. She has the most beautiful smile I have ever seen.


