Tuesday Nov 10, 2009

PeopleSoft North American Payroll on Sun Solaris with F5100 Flash Array : A Blog Reprise

During the "Sun day" keynote at OOW 09, John Fowler stated that we are #1 in PeopleSoft North American Payroll performance. Later, Vince Carbone from our Performance Technologies group went on to compare our benchmark numbers with HP's and IBM's in BestPerf's group blog at Oracle PeopleSoft Payroll (NA) Sun SPARC Enterprise M4000 and Sun Storage F5100 World Record Performance. Meanwhile Joerg Moellenkamp had been clarifying a few things in his blog at App benchmarks, incorrect conclusions and the Sun Storage F5100. Interestingly, all of this happened while we had no concrete evidence in our hands to show to the outside world. We got our benchmark results validated right before Oracle OpenWorld, which gave us the ability to speak about them publicly [and we used that ability to the extent we could]. However, Oracle folks were busy with their scheduled tasks for OOW 09 and couldn't work on the benchmark results white paper until now. Finally, the white paper with the NA Payroll benchmark results is available on the Oracle Applications benchmark web site. Here is the URL:

        PeopleSoft Enterprise Payroll 9.0 using Oracle for Solaris on a Sun SPARC Enterprise M4000

Once again, the summary of results is shown below, but in a slightly different format. These numbers were extracted from the very first page of the benchmark results white papers, where PeopleSoft usually highlights the significance of the results and the actual numbers that they are interested in. The results are sorted by hourly throughput (payments/hour) in descending order. The goal is to achieve as much hourly throughput as possible. Since there is one 16 stream result in the following table as well, exercise caution when comparing 8 stream results with 16 stream results. In general, 16 parallel job streams are supposed to yield better throughput than 8 parallel job streams, so comparing a 16 stream number with an 8 stream number is not an exact apples-to-apples comparison. It is more like comparing an apple to another apple that is half its size. Click on the link underneath the hourly throughput values to open the corresponding benchmark result.

Oracle PeopleSoft North American Payroll 9.0 - Number of employees: 240,000 & Number of payments: 360,000
Vendor  OS               Hardware Configuration                                              #Job Streams  Elapsed Time (min)  Hourly Throughput (Payments/Hour)

Sun     Solaris 10 5/09  1 x Sun SPARC Enterprise M4000 with 4 x 2.53 GHz SPARC64-VII         8             67.85               318,349
                         Quad-Core processors and 32 GB memory,
                         1 x Sun Storage F5100 Flash Array with 40 Flash Modules for
                         data and indexes,
                         1 x Sun Storage J4200 Array for redo logs

HP      HP-UX            1 x HP Integrity rx6600 with 4 x 1.6 GHz Intel Itanium2 9000         16            68.07               317,320
                         Dual-Core processors and 32 GB memory,
                         1 x HP StorageWorks EVA 8100

HP      HP-UX            1 x HP Integrity rx6600 with 4 x 1.6 GHz Intel Itanium2 9000         8             89.77               240,615*
                         Dual-Core processors and 32 GB memory,
                         1 x HP StorageWorks EVA 8100

IBM     z/OS             1 x IBM zSeries 990 model 2084-B16 with 313 Feature,                 8             91.7                235,551
                         6 x IBM z990 Gen1 processors (populated: 13, used: 6)
                         and 32 GB memory,
                         1 x IBM TotalStorage DS8300 with dual 4-way processors

This is all public information -- so, feel free to draw your own conclusions. *At the time of this writing, HP's 8 stream result had been pulled from the Oracle Applications benchmark web site for reasons unknown to me. Hopefully it will show up again on the same web site soon. If it doesn't re-appear even after a month, we can probably assume that the result has been withdrawn.

As these benchmark results were already discussed by different people in different blogs, I have nothing much to add. The only thing that I want to highlight is that this particular workload is moderately CPU intensive, but very I/O bound. Hence the better the I/O sub-system, the better the performance. Vince provided insight into Why Sun Storage F5100 is a good option for this workload, while Jignesh Shah from our ISV-Engineering organization focused on the performance of this benchmark workload with F20 PCIe Card.

Also, when dealing with NA Payroll, it is very unlikely that you will achieve nice out-of-the-box performance; it requires a fair amount of database tuning too. As the data sets are very large, we partitioned the data in some of the very hot objects, and that showed a good improvement in query response times. So if you are a PeopleSoft customer running the Payroll application with millions of rows of non-partitioned data, consider partitioning the data. We had been working on a best practices blueprint document for PeopleSoft North American Payroll that presents a variety of tuning tips like these, in addition to the recommended practices for the F5100 flash array and the Flash Accelerator F20 PCIe card. [Updated 11/30/09] Sun has published the blueprint; you can download it from the following location:

    Best Practices for Oracle PeopleSoft Enterprise Payroll for North America using the Sun Storage F5100 Flash Array or Sun Flash Accelerator F20 PCIe Card

Related Blog Post:

Friday Oct 09, 2009

Sun achieves the Magic Number 50,000 on T5440 with Oracle Business Intelligence EE 10.1.3.4

Less than two months ago, Sun Microsystems published an Oracle Business Intelligence benchmark with the best single system performance of 28,000 concurrent BI EE users at ~75% CPU utilization. Sun and Oracle Corporation announced another Oracle Business Intelligence benchmark result today with two identical T5440 servers in the Oracle BI Cluster serving 50,000 concurrent BI EE users.

An Oracle white paper with Sun's 50,000 user benchmark results can be accessed from Oracle's Business Intelligence web site.

The hardware specifications for each of the T5440s are similar to the hardware that was used in the prior benchmark effort on a single T5440 server. However, this time the Presentation Catalog (also frequently referred to as the Web Catalog) was moved to a T5220 server where the NFS server was running. Besides this, the only other change from the earlier 28,000 user benchmark exercise is the addition of another T5440 to the test rig.

The following graph shows the scalability of the application from one node to four nodes to eight nodes running on T5440 servers.

OBIEE on T5440 : Scalability Graph

Without further ado, here is the summary of the benchmark results along with their significance and some interesting facts:

  • One of the major goals of this benchmark effort is to show the horizontal and vertical scalability of the application (OBIEE) by highlighting the superior performance and the resilience of the underlying hardware (T5440) and the operating system (Solaris). Needless to say, the goal has been met.

  • Another goal of this benchmark is to show a decent number of concurrent BI EE users executing transactions with good response times. Since we already showed the maximum load that can be achieved on a single BI instance (7,500 users) and on a single T5440 server running multiple BI instances (28,000 users), this time we did not attempt to get the peak number that can be achieved from the two T5440 servers in the benchmark environment. Now that there is an additional server in the test setup taking care of the Presentation Catalog and the database server, 2 x 28,000 = 56,000 BI EE users would have been an achievable target -- but we opted to stop at the "magic" and "respectable" number 50,000 instead.

  • The entire benchmark run lasted about 9 hours 45 minutes, of which 8 hours were ramp-up, during which the 50,000 BI virtual users logged into the application a few users at a time. The LoadRunner tool reported only 4 errors for the entire duration of the run, and there were zero errors in the 60 minute steady state period during which the statistics reported in the document were collected.

  • Two Sun SPARC Enterprise T5440 servers each with 4 x 8-Core 1.6 GHz UltraSPARC T2 Plus processors delivered the best performance of 50,000 concurrent BI EE users at around 63% CPU utilization.

  • The BI EE Cluster was deployed on two T5440 servers running the Solaris 10 5/09 operating system. All the nodes in the BI Cluster were consolidated onto two T5440 servers using the free and efficient Solaris Containers virtualization technology.

  • The Presentation Catalog was hosted on a ZFS file system created on top of four internal Solid State Drives (SSDs). The Catalog was shared among all eight BI nodes in the cluster as an NFS share. A T5220 server powered by one 8-Core 1.2 GHz UltraSPARC T2 processor was used to run the NFS server. Due to the minimal activity of the database, the Oracle 11g database was also hosted on the same server, running Solaris 10 5/09.

  • Solid State Drive (SSD) disks with the ZFS file system showed significant I/O performance improvement over traditional disks for the Presentation Catalog activity. In addition, ZFS helped get past the UFS limitation of 32,767 sub-directories in a Presentation Catalog directory.

  • Caching was turned ON at the application server, which led to minimal database activity on the server. Note that the caching mechanism was turned ON in the prior benchmark exercise as well.

  • The low-end CoolThreads CMT server T5220 and the mid-range T5440 server once again proved to be ideal candidates to deploy and run multi-threaded workloads, exhibiting resilient performance when handling a large number of simultaneous requests from 50,000 BI EE virtual users. The T5220 handled a large number of concurrent asynchronous read/write requests from eight different NFS clients.

  • NFS v3 was configured at the NFS server as well as at the NFS client nodes. NFS version 4 is the default on Solaris 10, and it might well have worked as expected; however, a handful of bug reports prompted us to go with the more mature version 3.

  • 3,283 watts is the average power consumption when all 50,000 concurrent BI users are in the steady state of the benchmark test. That is, in the case of similarly configured workloads, this T5440-based setup supports 15.2 users per watt of energy consumed and 5,000 users per rack unit (the arithmetic behind these two figures is shown right after the summary table below).

  • A summary of the results with system-wide averages of CPU and memory utilization is shown below. The latest (50,000 user) results are in the last row.

    #Vusers  Clustered  #BI Nodes  #CPUs  #Cores  RAM     Avg CPU  Avg Memory  Avg Trx Response Time  #Trx/sec
    7,500    No         1          1      8       32 GB   72.85%   18.11 GB    0.22 sec               155
    28,000   Yes        4          4      32      128 GB  75.04%   76.16 GB    0.25 sec               580
    50,000   Yes        8          8      64      256 GB  63.32%   172.21 GB   0.28 sec               1031
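
For the curious, here is the arithmetic behind the power and density figures quoted above, assuming the usual 4 RU per T5440 (as noted in the related 28,000 user post below) and 2 RU for the T5220 -- 10 RU in total:

    50,000 users / 3,283 watts  ≈  15.2 users per watt
    50,000 users / (2 x 4 RU + 1 x 2 RU)  =  50,000 users / 10 RU  =  5,000 users per rack unit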

TOPOLOGY DIAGRAM

The topology diagram in the benchmark results white paper is almost illegible. Here is the original topology diagram that was inserted into the white paper.

OBIEE on T5440 : 50K User Benchmark Topology

Quite frankly, I'm not very proud of this drawing -- but that's the best I could come up with in a short span of time. Rather than showing the flow of communication between each and every component in the benchmark setup, I simplified the drawing by introducing a "black box" of sorts -- a "private network" -- in the middle, which kept the drawing from getting messy.


CPU USAGE GRAPH

The following two-dimensional graph shows the CPU utilization patterns at all 3 nodes in the benchmark setup for the 60 minute steady state of the benchmark run. This graph was generated using the free GNUplot tool with sar data as the inputs.

OBIEE on T5440 : 50K User Benchmark CPU Usage Graph

COMPETITIVE LANDSCAPE

And finally, here is a quick summary of all the results published by different vendors so far with a similar benchmark kit. Feel free to draw your own conclusions. All of this is public information. Check the corresponding benchmark reports by clicking on the URLs under the "#Users" column.

Server                                       Chips  Cores  Threads  GHz  Processor Type      #Users  OS
  2 x Sun SPARC Enterprise T5440 (APP)       8      64     512      1.6  UltraSPARC T2 Plus  50,000  Solaris 10 5/09
  1 x Sun SPARC Enterprise T5220 (NFS,DB)    1      8      64       1.2  UltraSPARC T2
  1 x Sun SPARC Enterprise T5440             4      32     256      1.6  UltraSPARC T2 Plus  28,000  Solaris 10 5/09
  5 x Sun Fire T2000                         1      8      32       1.2  UltraSPARC T1       10,000  Solaris 10 11/06
  3 x HP DL380 G4                            2      4      4        2.8  Intel Xeon          5,800   OEL
  1 x IBM x3755                              4      8      8        2.8  AMD Opteron         4,000   RHEL4


Before you go, do not forget to check the best practices for configuring / deploying Oracle Business Intelligence on top of Solaris 10 running on Sun CMT hardware.

Related Blog Posts:
T5440 Rocks [again] with Oracle Business Intelligence Enterprise Edition Workload

Monday Aug 17, 2009

T5440 Rocks [again] with Oracle Business Intelligence Enterprise Edition Workload

A while ago, I blogged about how we scaled Siebel 8.0 up to 14,000 concurrent users by consolidating the entire Siebel stack on a single Sun SPARC® Enterprise T5440 server with 4 x 1.4 GHz eight-core UltraSPARC® T2 Plus processors. An OLTP workload was used in that performance benchmark effort.

We repeated a similar effort by collaborating with Oracle Corporation, but with an OLAP workload this time around. Today Sun and Oracle announced the 28,000 user Oracle Business Intelligence Enterprise Edition (OBIEE) 10.1.3.4 benchmark results on a single Sun SPARC Enterprise T5440 server with 4 x 1.6 GHz eight-core UltraSPARC T2 Plus Processors running Solaris 10 5/09 operating system. An Oracle white paper with Sun's 28,000 user benchmark results is available on Oracle's benchmark web site.

Some of the notes and key takeaways from this benchmark are as follows:

  • Key specifications for the Sun SPARC Enterprise T5440 system under test are: 4 x UltraSPARC T2 Plus processors, 32 cores, 256 compute threads and 128 GB of memory in a 4RU space.

  • The entire OBIEE solution was deployed on a single Sun SPARC Enterprise T5440 server using Oracle BI Cluster software.

  • The BI Cluster was configured with 4 x BI nodes. Each of those BI nodes was configured to run inside a Solaris Container.

    1. Each Solaris Container was configured with one physical processor (that is, 8 cores or 64 virtual CPUs) and 32 GB of physical memory.

    2. Each BI node was configured to run BI Server, Presentation Server and OC4J Web Server

    3. Two of the BI nodes had the BI Cluster Controller running (primary & secondary)

    4. One out of the four Containers shared CPU and memory resources with the Oracle 11g RDBMS and the host operating system, which were running in the global zone

  • Caching was turned ON at the application server, which led to minimal database activity on the server.

    1. In other words, one can use these results only to size the hardware requirements for a complete BI EE deployment excluding the database server.

    2. All the OBIEE benchmark results published so far are with caching turned ON. This fact was not explicitly mentioned in some of the benchmark results white papers. Check the Competitive Landscape for pointers to the different benchmark results published by different vendors.

  • From our experiments with the OBIEE benchmark workload, it appears that a BI deployment with a single non-clustered BI node can scale reasonably well up to 7,500 active users on a T5440 server. To scale beyond 7,500 concurrent users, you might need another instance of BI. Of course, your mileage may vary.

  • BI EE exhibited excellent horizontal scalability when multiple BI nodes were clustered using BI Cluster software. Four BI nodes in the Cluster were able to handle 28,000 concurrent users with minimal impact on the overall average transaction response times.

      It appeared as though we could simply add more BI nodes to the BI Cluster to cope with an increase in the user base. However, due to the limited hardware resources, we could not try running beyond 4 nodes in the BI Cluster. As of today, the theoretical limit for the number of BI nodes in a Cluster is 16.

  • The underlying hardware must behave well in order for the application to scale and perform well -- so, credit goes to the UltraSPARC T2 Plus powered Sun SPARC Enterprise T5440 server as well. In other words, it is fair to say the combination of (T5440 + OBIEE) performs and scales well on Solaris.

  • A summary of the results with system-wide averages of CPU and memory utilization is shown below.

    #Vusers  Clustered  #BI Nodes  #CPUs  #Cores  RAM     Avg CPU  Avg Memory  Avg Trx Response Time  #Trx/sec
    7,500    No         1          1      8       32 GB   72.85%   18.11 GB    0.22 sec               155
    28,000   Yes        4          4      32      128 GB  75.04%   76.16 GB    0.25 sec               580
  • Internal Solid State Drives (SSDs) with the ZFS file system showed significant I/O performance improvement over traditional disks for the BI catalog activity. In addition, ZFS helped get past the UFS limitation of 32,767 sub-directories in a BI catalog directory.

  • The benchmark demonstrated that the 64-bit BI EE platform is immune to the 4 GB virtual memory limitation of the 32-bit BI EE platform -- hence it can potentially support even more users and larger caches, as long as the hardware resources are available.

      Solaris runs in 64-bit mode by default on the SPARC platform. Consider running 64-bit BI EE on Solaris.

  • 2,107 watts is the average power consumption when all 28,000 concurrent users are in the steady state of the benchmark test. That is, in the case of similarly configured workloads, the T5440 supports 13.2 users per watt of power consumed and 7,000 users per rack unit.

TOPOLOGY DIAGRAM:

A picture is worth a thousand words. The following topology diagrams say it all about the configuration.

1. Single Node BI Non-Cluster Configuration : 7,500 Concurrent Users

Even though the Solaris Container is shown in a cloud-like graphical form, it has nothing to do with "Cloud Computing"; it is just a side effect of a fancy drawing.

2. Four Node BI Cluster Configuration : 28,000 Concurrent Users

COMPETITIVE LANDSCAPE

Here is a quick summary of all the results published by different vendors. Feel free to draw your own conclusions. All of this is public information. Check the corresponding benchmark reports by clicking on the URLs under the "#Users" column.

Server                               Chips  Cores  Threads  GHz  Processor Type      #Users  OS
  1 x Sun SPARC Enterprise T5440     4      32     256      1.6  UltraSPARC T2 Plus  28,000  Solaris 10 5/09
  5 x Sun Fire T2000                 1      8      32       1.2  UltraSPARC T1       10,000  Solaris 10 11/06
  3 x HP DL380 G4                    2      4      4        2.8  Intel Xeon          5,800   OEL
  1 x IBM x3755                      4      8      8        2.8  AMD Opteron         4,000   RHEL4

CAUTION

Although the T5440 possesses a ton of great qualities, it might not be suitable for deploying workloads with heavy single-threaded dependencies. The T5440 is an excellent hardware platform for multi-threaded and moderately single-threaded/multi-process workloads. When in doubt, it is a good idea to leverage Sun Microsystems' Try & Buy program to try the workloads on the T5440 server before making the final call.


Check the second part of this blog post for the best practices for configuring / deploying Oracle Business Intelligence on top of Solaris 10 running on Sun CMT hardware.

Related Blog Posts:

Monday Jun 22, 2009

Sun Studio: Debugging Multi-Threaded Applications with dbx

(Crossposting the three and a half year old blog entry "as is" from my other blog hosted on Blogger. It needs some serious editing, but I believe the content is still relevant. Source URL: http://technopark02.blogspot.com/2005/12/sun-studio-debugging-multi-threaded.html)

Multi-threading lets different tasks run concurrently in a single process, so multi-threaded programs can run faster [or achieve better throughput] on machines with multiple processors and on CPUs with multiple cores. On an SMP system (Symmetric Multi-Processing, where multiple processors share a single memory system) with no CMT (Chip Multi-Threading), software threads are executed on different processors; on an SMP system with CMT, the threads are executed on cores and logical processors in CMP (Chip Multi-Processing) processors. As revolutionary chip designs evolve, many important commercial applications like Oracle, SAP, Siebel and PeopleSoft are designed to be multi-threaded.

Debugging a multi-threaded (MT, in short) application is a bit harder than debugging a single-threaded program, where only one task runs per process at any given time, due to the number of software threads running in parallel. Thread synchronization plays an important role when concurrently running threads have to share global resources. Improperly synchronized threads may starve, and lead to unnecessary deadlocks and race conditions. So it is good to have an MT-aware debugger handy, during development and in the support phases of the software life cycle, to debug threading issues.

Fortunately, on Solaris, Sun Studio's debugger, dbx, has support for MT applications that are designed to use Solaris threads and/or POSIX threads. With dbx, it is possible to get information like thread state, stack trace and locks from all threads, navigate between threads, suspend/resume threads, put break points in a thread, and do step-by-step execution in a function in a designated thread. Note that the Solaris Modular Debugger (mdb) also has support for MT programs; but this blog post concentrates on Studio's dbx.

Siebel processes were used to show various dbx commands in the following examples. Siebel is a multi-threaded application, written in C/C++.
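
If you would like to experiment with these dbx commands on something much smaller than Siebel first, a deliberately broken toy program like the one below will do. This is just a minimal sketch -- the file name, function names and build line are made up for illustration -- in which one thread takes the process down with a SIGSEGV, so the resulting core file can be examined with the commands shown in the rest of this post.

/*
 * crashdemo.c -- a tiny, deliberately broken POSIX threads program
 * (a made-up example, unrelated to Siebel) for experimenting with the
 * dbx commands shown in this post. One worker thread dereferences a
 * NULL pointer after a couple of seconds and brings the whole process
 * down with SIGSEGV, leaving a core file behind; the other workers
 * simply sleep so that 'threads' has something to list.
 *
 *   % cc -mt -g -o crashdemo crashdemo.c
 *   % ./crashdemo
 *   % dbx crashdemo core
 */
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

static void *spinner(void *arg)
{
    for (;;)
        sleep(1);          /* parked here forever */
    /* NOTREACHED */
}

static void *crasher(void *arg)
{
    int *p = NULL;
    sleep(2);
    *p = 42;               /* SIGSEGV -- the frame 'where' should point at */
    return NULL;
}

int main(void)
{
    pthread_t tid[4];
    int i;

    for (i = 0; i < 3; i++)
        pthread_create(&tid[i], NULL, spinner, NULL);
    pthread_create(&tid[3], NULL, crasher, NULL);

    for (i = 0; i < 4; i++)
        pthread_join(tid[i], NULL);
    return 0;
}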

Core dump analysis

The following example shows some useful commands to get the stack trace in the thread where the process crashed. For more information about dbx commands, type help or help <command> in the dbx environment, i.e., at the dbx prompt.

% ls -lh core
-rw-------   1 giri     other       273M Dec  9 16:56 core

% file core
core:           ELF 32-bit MSB core file SPARC Version 1, from 'siebprocmw'

% /opt/SS11/SUNWspro/prod/bin/dbx siebprocmw core
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.5' in your .dbxrc
Reading siebprocmw
core file header read successfully
Reading ld.so.1
Reading libsslcwsl.so
Reading libssscsci.so
Reading libssscscf.so
...
...
Reading libsbcfui.so
Reading libsbcfuiapps.so
t@1 (l@1) terminated by signal KILL (Killed)
0xfd2bc7e0: ___nanosleep+0x0008:        blu      _cerror        ! 0xfd2206a0
Since we don't know which thread crashed the process, let's list all known threads with the threads command. threads -all lists all threads, including zombies.
(dbx) threads
 >    t@1  a  l@1   ?()   LWP suspended in  ___nanosleep()
      t@2  b  l@2   MwTimerThread()   LWP suspended in  __pollsys()
      t@3  b  l@3   MwAsyncSignalThread()   sleep on 0xfd874078  in  __lwp_park()
      t@4  b  l@4   MwThread()   LWP suspended in  __pollsys()
      t@5  b  l@5   MwThread()   LWP suspended in  __pollsys()
o     t@6  b  l@6   MwThread()   signal SIGABRT in  __lwp_kill()
      t@7  b  l@7   MwThread()   LWP suspended in  __pollsys()
      t@9  b  l@9   MwThread()   LWP suspended in  ___nanosleep()

In the above list, t@1 is the current thread, which is indicated by ">", and the start function is not known (indicated with a "?()").

(dbx) thread
current thread ($thread) is t@1

(dbx) where
current thread: t@1
=>[1] ___nanosleep(0x4, 0xffbfd9a8, 0x0, 0xff000000, 0x0, 0x0), at 0xfd2bc7e0
  [2] _sleep(0x64, 0x0, 0xfd2e8bc0, 0xfd0e2000, 0xfd0e2000, 0x0), at 0xfd2afaa0
  [3] thr_t::do_thr_action(0xfd86ba10, 0xc, 0x1608, 0xfd86ba20, 0x1, 0x2), at 0xfd770e14
  [4] thr_t::t_sleep(0xfb80f5c0, 0x0, 0xffbfdb0e, 0xffbfdb08, 0xfd8546cc, 0xffffffff), at 0xfd770c58
  [5] MwWaitForMultipleObjects(0xfb80f5c0, 0x2, 0xfb80f5c8, 0x2, 0xffffffff, 0x9cd48), at 0xfd774dd4
  [6] WaitForMultipleObjectsEx(0x2, 0xffbfde3c, 0x0, 0x100000, 0x0, 0x9cd48), at 0xfd77fe9c
  [7] OSDNTWait::WaitForThread(0xc, 0xffffffff, 0xffbfdecc, 0xd0108, 0x1004f, 0xff8a1d64), at 0xffa7b050
  [8] OSDWaitTid(0xc, 0xffffffff, 0xffbfe7c4, 0x0, 0xc, 0xc), at 0xff05f1c4
  [9] scfEventFacility::scfEventFac::ShutdownCmd(0xe14450, 0x1, 0x7, 0xfe4de0f4, 0xffbfe7c8, 0xff48f8d4), at 0xff819884
  [10] scfEventFacility::scfEventFac::Shutdown(0xffbfe96c, 0xff877530, 0x0, 0x5e000, 0xff874e8c, 0x5e114), at 0xff819390
  [11] ScfSisDetach(0x0, 0x0, 0x0, 0xffffffff, 0xffbfe96c, 0xfc81c), at 0xff781ed4
  [12] _shutdown(0x6479c, 0x0, 0x651a8, 0x651a8, 0x7, 0x0), at 0x49c7c
  [13] wmain(0x12a, 0x6479c, 0x0, 0x0, 0xffbfedac, 0x6479c), at 0x4995c
  [14] main(0xfd85f310, 0xc94, 0xffbfef90, 0x54, 0xfd85f310, 0xc00), at 0x4d3cc

This is not exactly what we are looking for. The above call stack shows where the current thread (t@1) is waiting. Since our interest is to find the thread that is responsible for the process crash, we need to look for an o before the thread id. t@6 is the ill-fated thread in the list of all known threads; the process was killed because of a SIGABRT in the __lwp_kill() method. Note that the OS provides the necessary abstraction for creating and destroying threads, and also has the freedom to kill malfunctioning threads when things go haywire. In this example, __lwp_kill() was called by the operating system, due to some event which we are going to investigate.

The thread -info <tid> command provides more information, such as what exactly happened in the application code that triggered the forcible shutdown.

(dbx) thread -info t@6
        Thread t@6 (0xfcb80c00) at priority 0
        state: bound to    l@6
        base function: 0xfd770ff4: MwThread() stack: 0xfa380000[524288]
        flags: BOUND|DETACHED|SUSPENDED
        masked signals: SEGV
        Currently active in __lwp_kill

Observe that the kernel trapped an illegal memory access with a SEGV signal. The default behavior for a SEGV is to shut down the process, with possible core file generation (aka core dump). Let's switch to thread t@6 with the thread <tid> command, and get to the instruction which raised the segmentation fault.

(dbx) thread t@6
t@6 (l@6) stopped in __lwp_kill at 0xfd2bd5ec
0xfd2bd5ec: __lwp_kill+0x0008:  bcc,a,pt  %icc,__lwp_kill+0x18  ! 0xfd2bd5fc

(dbx) thread
current thread ($thread) is t@6

(dbx) where
current thread: t@6
=>[1] __lwp_kill(0x0, 0x6, 0x0, 0x6, 0xffff0000, 0x0), at 0xfd2bd5ec
  [2] raise(0x6, 0x0, 0xfd2a1af4, 0x42770, 0xfd2e4278, 0x6), at 0xfd25d884
  [3] abort(0xe15220, 0x1, 0x0, 0xa6544, 0xfd2e7298, 0x0), at 0xfd23de38
  [4] SehScanInvokeTryList(0x44bd308, 0x108000, 0xfd8571c4, 0x0, 0x2, 0x0), at 0xfd74c9d4
  [5] Signal_Handler::raise(0xc0000005, 0xfa37cde8, 0x0, 0x2, 0xfa37cc80, 0x1800), at 0xfd74d778
  [6] Raise_Exception::operator()(0x67670, 0xb, 0xfa37d0a0, 0xfa37cde8, 0xfd86a07c, 0x2c), at 0xfd74d8dc
  [7] __sighndlr(0xb, 0xfa37d0a0, 0xfa37cde8, 0xfd74d7c8, 0x0, 0x1), at 0xfd2bc52c
  ---- called from signal handler with signal 11 (SIGSEGV) ------
  [8] CSSSqlObj::GetTrxDbConn(0x458a7d8, 0x0, 0x1394478, 0x64c00, 0x0, 0x4611290), at 0xf91de72c
  [9] CSSSqlObj::Execute(0x4611290, 0x0, 0x0, 0x0, 0x0, 0xfe4dd294), at 0xf91c7b98
  [10] CSSBusComp::SqlExecute(0x4606640, 0x0, 0x0, 0x0, 0x1, 0x4b22e84), at 0xf9a9c160
  [11] CSSBCBase::SqlExecute(0x4606640, 0x0, 0xfa37d6fc, 0x0, 0x1, 0xf57be3e8), at 0xf56c2294
  [12] CSSBusComp::Execute(0x0, 0x0, 0x0, 0x0, 0x4606640, 0xfa37d7cc), at 0xf9a6b118
  [13] CSSMsgBoardMaintSvc::UpdTaskHistory(0x44b5ae0, 0xfa37df90, 0x0, 0x4567d14, 0xf8611198, 0x489cd94), at 0xf85f2d48
  [14] CSSMsgBoardMaintSvc::HandleEventDataList(0x44b5ae0, 0x43a0018, 0xff486b38, 0x0, 0xfa37e0ac, 0xf8611198), at 0xf85f5afc
  [15] CSSMsgBoardMaintSvc::ReadTaskHistory(0x44b5ae0, 0x43a0018, 0xf85f4e60, 0x44b5ae0, 0x43a0018, 0x1), at 0xf85f53c0
  [16] scfEventFacility::scfEventFac::CallRegSub(0x2a59448, 0x4109bd8, 0x0, 0x0, 0x8, 0x2), at 0xff81ad20
  [17] scfEventFacility::scfEventFac::HandleCurrProcEvents(0xe14450, 0x7530, 0xe14450, 0xff432ef0, 0xff874e8c, 0x1), 
            at 0xff81b19c
  [18] scfEventFacility::scfEventFac::scfEventThreadMain(0x0, 0x0, 0x0, 0x7400, 0xfa37fc90, 0xd0001), at 0xff81a7dc
  [19] OSDWslThreadStart(0x101d58, 0xff81a580, 0x101d58, 0x6, 0x0, 0x101d70), at 0xff05bec8
  [20] _AfxThreadEntry(0xffbfde34, 0xe9568, 0x0, 0x1, 0x0, 0x17289c), at 0xfeb95730
  [21] MwThread(0x1, 0x0, 0x1, 0x0, 0xfd86bed0, 0xe15220), at 0xfd771230

From the above stack trace it is clear that the binary doesn't contain the necessary debug information to show high-level instructions; so, let's try to get the disassembly with the dis command.

(dbx) dis GetTrxDbConn / 50
More than one identifier 'GetTrxDbConn'.
Select one of the following:
 0) Cancel
 1) `libsscfdm.so`#__1cPCSSModelPhysDefMGetTrxDbConn6MpkH_pnJCSSDbConn__
  [non -g, demangles to: CSSModelPhysDef::GetTrxDbConn(const unsigned short*)]
 2) `libsscfdm.so`#__1cJCSSSqlObjMGetTrxDbConn6kM_pnJCSSDbConn__
  [non -g, demangles to: CSSSqlObj::GetTrxDbConn()const]
> 2
0xf91de6c0: GetTrxDbConn       :        save     %sp, -96, %sp
0xf91de6c4: GetTrxDbConn+0x0004:        mov      %i0, %i5
0xf91de6c8: GetTrxDbConn+0x0008:        ld       [%i0 + 388], %i0
0xf91de6cc: GetTrxDbConn+0x000c:        cmp      %i0, 0
0xf91de6d0: GetTrxDbConn+0x0010:        be,pn    %icc,GetTrxDbConn+0x60 ! 0xf91de720
0xf91de6d4: GetTrxDbConn+0x0014:        sethi    %hi(0x5b400), %l6
0xf91de6d8: GetTrxDbConn+0x0018:        call     GetTrxDbConn+0x20      ! 0xf91de6e0
0xf91de6dc: GetTrxDbConn+0x001c:        mov      %o7, %o7
0xf91de6e0: GetTrxDbConn+0x0020:        sethi    %hi(0x2d1400), %o5
0xf91de6e4: GetTrxDbConn+0x0024:        xor      %l6, 88, %l4
0xf91de6e8: GetTrxDbConn+0x0028:        inc      420, %o5
0xf91de6ec: GetTrxDbConn+0x002c:        sethi    %hi(0x1000), %l5
0xf91de6f0: GetTrxDbConn+0x0030:        add      %o5, %o7, %l3
0xf91de6f4: GetTrxDbConn+0x0034:        add      %l5, 868, %l1
0xf91de6f8: GetTrxDbConn+0x0038:        add      %l3, %l4, %l2
0xf91de6fc: GetTrxDbConn+0x003c:        ld       [%l2], %l0
0xf91de700: GetTrxDbConn+0x0040:        ld       [%l0 + %l1], %o4
0xf91de704: GetTrxDbConn+0x0044:        cmp      %o4, 0
0xf91de708: GetTrxDbConn+0x0048:        be,a,pn  %icc,GetTrxDbConn+0x68 ! 0xf91de728
0xf91de70c: GetTrxDbConn+0x004c:        ld       [%i5 + 128], %i2
0xf91de710: GetTrxDbConn+0x0050:        ld       [%o4 + 88], %l7
0xf91de714: GetTrxDbConn+0x0054:        cmp      %i5, %l7
0xf91de718: GetTrxDbConn+0x0058:        bne,a,pn  %icc,GetTrxDbConn+0x68        ! 0xf91de728
0xf91de71c: GetTrxDbConn+0x005c:        ld       [%i5 + 128], %i2
0xf91de720: GetTrxDbConn+0x0060:        ret
0xf91de724: GetTrxDbConn+0x0064:        restore  %g0, 0, %o0
0xf91de728: GetTrxDbConn+0x0068:        ld       [%i2 + 188], %i1
0xf91de72c: GetTrxDbConn+0x006c:        ld       [%i1 - 16], %i3
0xf91de730: GetTrxDbConn+0x0070:        cmp      %i3, 0
0xf91de734: GetTrxDbConn+0x0074:        bge,pn   %icc,GetTrxDbConn+0x90 ! 0xf91de750
0xf91de738: GetTrxDbConn+0x0078:        add      %i2, 188, %i4
0xf91de73c: GetTrxDbConn+0x007c:        clr      %o0
0xf91de740: GetTrxDbConn+0x0080:        call     RequiredConditionIsFalse [PLT] ! 0xf94b0684
0xf91de744: GetTrxDbConn+0x0084:        mov      84, %o1
0xf91de748: GetTrxDbConn+0x0088:        ld       [%i4], %i1
0xf91de74c: GetTrxDbConn+0x008c:        ld       [%i5 + 388], %i0
0xf91de750: GetTrxDbConn+0x0090:        call     GetTrxDbConn   ! 0xf90e0e00
0xf91de754: GetTrxDbConn+0x0094:        restore  %g0, 0, %g0
0xf91de758: GetTrxDbConn+0x0098:        unimp    0x0
...
...

To see the actual C++ statement which seg faulted, compile the binary with the -g (debug) option, and reproduce the crash. If the source code is readable from the location where you run the dbx session, you will see the actual high-level instructions.


Some fun with an active process

The objective of this section is to show how to use some of the dbx commands to get useful information from a running MT process.

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  2754 giri      399M  302M sleep   59    0   0:00:34 2.0% siebmtshmw/21

% /opt/SS11/SUNWspro/prod/bin/dbx - 2754
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.5' in your .dbxrc
Reading -
Reading ld.so.1
Reading libsslcwsl.so
Reading libssscsci.so
Reading libssscscf.so
...
...
Reading libsscasvbc.so
Reading libswcasvfr.so
Attached to process 2754 with 21 LWPs
t@1 (l@1) stopped in __pollsys at 0xfd13d1c4
0xfd13d1c4: __pollsys+0x0004:   ta       8

(dbx) threads
 >    t@1  a  l@1   ?()   running          in  __pollsys() <- t@1 is always the default current thread under dbx
      t@2  b  l@2   MwTimerThread()   sleep on 0xfb80f4c0  in  __lwp_park()
      t@3  b  l@3   MwAsyncSignalThread()   sleep on 0xfd774078  in  __lwp_park()
      t@4  b  l@4   MwThread()   running          in  __pollsys()
      t@5  b  l@5   MwThread()   running          in  __pollsys()
      t@6  b  l@6   MwThread()   sleep on 0xf9b7eb80  in  __lwp_park()
      t@7  b  l@7   MwThread()   running          in  __pollsys()
      t@8  b  l@8   MwThread()   running          in  _so_recv()
      t@9  b  l@9   MwThread()   sleep on 0xf927fb68  in  __lwp_park()
     t@10  b l@10   MwThread()   sleep on 0xf877f500  in  __lwp_park()
     t@11  b l@11   MwThread()   sleep on 0xf867fa40  in  __lwp_park()
     t@12  b l@12   MwThread()   sleep on 0xf857fa50  in  __lwp_park()
     t@13  b l@13   MwThread()   sleep on 0xf847fa38  in  __lwp_park()
     t@14  b l@14   MwThread()   running          in  __pollsys()
     t@15  b l@15   MwThread()   sleep on 0xf827f490  in  __lwp_park()
     t@16  b l@16   MwThread()   running          in  __pollsys()
     t@17  b l@17   MwThread()   sleep on 0xf807f490  in  __lwp_park()
     t@18  b l@18   MwThread()   running          in  __pollsys()
     t@19  b l@19   MwThread()   sleep on 0xf4c7f490  in  __lwp_park()
     t@20  b l@20   MwThread()   running          in  __pollsys()
     t@21  b l@21   MwThread()   sleep on 0xf4a7f490  in  __lwp_park()

Put a break point in thread 21 (t@21) for all calls to memcpy():

(dbx) stop in memcpy -thread t@21
More than one identifier 'memcpy'.
Select one of the following:
 0) Cancel
 1) `libc.so.1`memcpy
 2) `libc_psr.so.1`memcpy
 a) All
> a
dbx: warning: 'memcpy' has no debugger info -- will trigger on first instruction
dbx: warning: 'memcpy' has no debugger info -- will trigger on first instruction
Will create handlers for all 2 hits
(2) stop in _private_memcpy -thread t@21 <- implicit break point set by dbx
(3) stop in _memcpy -thread t@21  <- implicit break point

(dbx) cont
t@21 (l@21) stopped in _memcpy at 0xfe1f04c0
0xfe1f04c0: _memcpy       :     nop

Note that dbx is synchronous -- when any thread or lightweight process (LWP) stops, all other threads and LWPs stop as well.

(dbx) thread
current thread ($thread) is t@21

(dbx) where
current thread: t@21
=>[1] _memcpy(0x5080e14, 0xff406b38, 0x2, 0x36, 0x1, 0x6c), at 0xfe1f04c0
  [2] SSstring::GetWriteBuffer(0xf4a7e6ac, 0xff406b28, 0xff874e8c, 0x32, 0x0, 0xff3b2ef0), at 0xff31ffcc
  [3] sciProcState::sciBlock::FormatLatchName(0xf4a7e6ac, 0x1, 0x7, 0x853c, 0xffa30bd8, 0x8400), at 0xffa02744
  [4] sciProcState::sciProcState(0x5ad31f8, 0xf9fc0000, 0xf4a7e644, 0xff406b3c, 0x0, 0x0), at 0xffa012c4
  [5] sciProcState::GetSciProcState(0xf4a7e7f8, 0x26fcb8, 0x5ad31f8, 0xff88db30, 0x5f5e4, 0x61e6c90), at 0xffa014f0
  [6] SciCheckShutdown(0xf4a7e8cc, 0x34151f8, 0x74, 0x26fcb8, 0x0, 0x2ef798), at 0xff9fe0e4
  [7] SciGetInterrupt(0x0, 0x6a20950, 0x0, 0xf4a7e864, 0x25cd94, 0x1da84), at 0xff9fde40
  [8] _smiMessageQ::ProcessMessage(0x15f85c0, 0x6a20950, 0x0, 0x0, 0x24a360, 0x32e18f0), at 0x2158e4
  [9] _smiMessageQ::ProcessRequest(0x3380c48, 0x6a20950, 0x191, 0x2, 0x5ae22f0, 0x15f85c0), at 0x21461c
  [10] _smiWorkQueue::ProcessWorkItem(0x15f98b8, 0x3380c48, 0x6a20950, 0x5ae2390, 0x0, 0x101f180), at 0x208d08
  [11] _smiWorkQueue::WorkerTask(0x15f98b8, 0x5b7f6b8, 0x3326338, 0x1500e0, 0x0, 0x0), at 0x208764
  [12] SmiThrdEntryFunc(0x32f72d8, 0x70000f, 0x700010, 0x0, 0x0, 0x0), at 0x1f7a0c
  [13] OSDWslThreadStart(0x3380568, 0x1f75a0, 0x3380568, 0x15, 0x0, 0x3380760), at 0xfefdbec8
  [14] _AfxThreadEntry(0xf4b7de5c, 0x3386210, 0x0, 0x1, 0x0, 0x17289c), at 0xfeb95730
  [15] MwThread(0x1, 0x0, 0x1, 0x0, 0xfd76bed0, 0x33cdc40), at 0xfd671230

Let's step into memcpy() with stepi, and observe how the thread state changes.

(dbx) stepi
t@21 (l@21) stopped in _memcpy at 0xfe1f04c4
0xfe1f04c4: _memcpy+0x0004:     nop

(dbx) threads
      t@1  a  l@1   ?()   running          in  __pollsys()
      t@2  b  l@2   MwTimerThread()   sleep on 0xfb80f4c0  in  __lwp_park()
      t@3  b  l@3   MwAsyncSignalThread()   sleep on 0xfd774078  in  __lwp_park()
      t@4  b  l@4   MwThread()   running          in  __pollsys()
      t@5  b  l@5   MwThread()   running          in  __pollsys()
      t@6  b  l@6   MwThread()   sleep on 0xf9b7eb80  in  __lwp_park()
      t@7  b  l@7   MwThread()   running          in  __pollsys()
      t@8  b  l@8   MwThread()   running          in  _so_recv()
      t@9  b  l@9   MwThread()   sleep on 0xf927fb68  in  __lwp_park()
     t@10  b l@10   MwThread()   sleep on 0xf877f500  in  __lwp_park()
     t@11  b l@11   MwThread()   sleep on 0xf867fa40  in  __lwp_park()
     t@12  b l@12   MwThread()   sleep on 0xf857fa50  in  __lwp_park()
     t@13  b l@13   MwThread()   sleep on 0xf847fa38  in  __lwp_park()
     t@14  b l@14   MwThread()   running          in  __pollsys()
     t@15  b l@15   MwThread()   sleep on 0xf827f490  in  __lwp_park()
     t@16  b l@16   MwThread()   running          in  __pollsys()
o    t@17  b l@17   MwThread()   breakpoint       in  _memcpy()
o    t@18  b l@18   MwThread()   breakpoint       in  _memcpy()
o    t@19  b l@19   MwThread()   breakpoint       in  _memcpy()
     t@20  b l@20   MwThread()   running          in  __pollsys()
*>   t@21  b l@21   MwThread()   single stepped   in  _memcpy()

In the above example, t@17, t@18 and t@19 are stopped at calls to memcpy(); and t@21 stepped into memcpy(). Get out of memcpy() with the step up command.

(dbx) step up
_memcpy returns 84413972
t@21 (l@21) stopped in SSstring::GetWriteBuffer at 0xff31ffd4
0xff31ffd4: GetWriteBuffer+0x0114:      ld       [%i1 + 4], %i2

Clear the break point (in the current thread) with the clear command

(dbx) cont
t@21 (l@21) stopped in _memcpy at 0xfe1f04c0
0xfe1f04c0: _memcpy       :     nop

(dbx) clear
cleared (3) stop in _memcpy -thread t@21

Locks

thread -blocks [<tid>] lists all locks held by the given thread, blocking other threads. If tid is not specified, dbx lists the locks held by the current thread. In the following example, t@21 (current thread) is not holding any locks.

(dbx) thread -blocks
Locks held by t@21:

thread -blockedby [<tid>] shows the synchronization object (monitor) on which the given thread is blocked. If tid is not specified, dbx shows this information for the current thread. Note that only sleeping threads can be in a blocked state.

(dbx) thread -blockedby t@10
Thread t@10 is blocked by:
0xf877f500 (0xf877f500): thread  condition variable

(dbx) thread -blockedby t@12
Thread t@12 is blocked by:
0xf857fa50 (0xf857fa50): thread  condition variable

(dbx) thread -blockedby t@17
Thread t@17 is not asleep

The syncs command lists all synchronization objects, i.e., locks/monitors.

(dbx) syncs
All locks currently known to libthread:
0x01020320 (0x01020320): thread  mutex(unlocked)
0x010203f8 (0x010203f8): thread  mutex(unlocked)
0xf827f490 (0xf827f490): thread  condition variable
0xf827f4a0 (0xf827f4a0): thread  mutex(unlocked)
0xf877f500 (0xf877f500): thread  condition variable
0xf877f510 (0xf877f510): thread  mutex(unlocked)
0xf927fb68 (0xf927fb68): thread  condition variable
0xf927fb78 (0xf927fb78): thread  mutex(unlocked)
0xf867fa40 (0xf867fa40): thread  condition variable
0xf867fa50 (0xf867fa50): thread  mutex(unlocked)
0xf9b7eb80 (0xf9b7eb80): thread  condition variable
0xf9b7eb90 (0xf9b7eb90): thread  mutex(unlocked)
0x015c2ed8 (0x015c2ed8): thread  mutex(unlocked)
0x015c2f38 (0x015c2f38): thread  mutex(unlocked)
0x015c2f18 (0x015c2f18): thread  mutex(unlocked)
0x015c2dd8 (0x015c2dd8): thread  mutex(unlocked)
0x015c34d8 (0x015c34d8): thread  mutex(unlocked)
0x03325fb8 (0x03325fb8): thread  mutex(unlocked)
0x033264b8 (0x033264b8): thread  mutex(unlocked)
0x033261b8 (0x033261b8): thread  mutex(unlocked)
0x017a6ce8 (0x017a6ce8): thread  mutex(locked)
0xfa4f4314 (0xfa4f4314): process mutex(locked)
0x0332c438 (0x0332c438): thread  mutex(unlocked)
0x0332c348 (0x0332c348): thread  mutex(unlocked)
0x02fcd7e8 (0x02fcd7e8): thread  mutex(unlocked)
0x0028f860 (0x0028f860): thread  mutex(unlocked)
__1cUCSSSISLocalTransSrvrKs_instLock_+0x8 (0xff1ee220): thread  mutex(unlocked)
0x034150e8 (0x034150e8): thread  mutex(unlocked)
0x034151d8 (0x034151d8): thread  mutex(unlocked)
__uberdata+0x80 (0xfd168c40): thread  mutex(unlocked)
0x01878b98 (0x01878b98): thread  mutex(unlocked)
0x01878aa8 (0x01878aa8): thread  mutex(unlocked)
0xfa4c7e9c (0xfa4c7e9c): process mutex(unlocked)
libc_malloc_lock (0xfd1676f8): thread  mutex(unlocked)
0x0179cb30 (0x0179cb30): thread  mutex(unlocked)
0x0179c830 (0x0179c830): thread  mutex(unlocked)
0xfa5c2664 (0xfa5c2664): process mutex(unlocked)
0xfa5c2c94 (0xfa5c2c94): process mutex(unlocked)
0x0161dd90 (0x0161dd90): thread  mutex(unlocked)
0x0101f6e0 (0x0101f6e0): thread  mutex(unlocked)
0x0101f718 (0x0101f718): thread  mutex(unlocked)
0x0101f770 (0x0101f770): thread  mutex(unlocked)
0x0101f508 (0x0101f508): thread  mutex(locked)
0x0101f5a8 (0x0101f5a8): thread  mutex(unlocked)
0x015bfe90 (0x015bfe90): thread  mutex(unlocked)
0x015bfe20 (0x015bfe20): thread  mutex(unlocked)
0x015bfe58 (0x015bfe58): thread  mutex(unlocked)

To get information about a synchronization object at a given address, use sync -info <address>

(dbx) sync -info 0x0028f860
0x0028f860 (0x28f860): thread  mutex(unlocked)
Lock is unowned
No threads are blocked by this lock

(dbx) sync -info 0xf877f500
0xf877f500 (0xf877f500): thread  condition variable

(dbx) sync -info 0xfd1676f8
libc_malloc_lock (0xfd1676f8): thread  mutex(unlocked)
Lock is unowned
No threads are blocked by this lock

Tracing

The trace command can be used to trace executed source lines, function calls, or variable changes. The following example traces thread creation, and prints a message whenever a thread gets created.

(dbx) trace thr_create
(4) trace thr_create

(dbx) cont
trace: thread created t@22 on l@22
trace: thread created t@23 on l@23
Reading libsrlcver.so
Reading libsscafsbc.so
...

(dbx) threads
*>    t@1  a  l@1   ?()   signal SIGINT in  __pollsys()
      t@2  b  l@2   MwTimerThread()   sleep on 0xfb80f4c0  in  __lwp_park()
      t@3  b  l@3   MwAsyncSignalThread()   sleep on 0xfd774078  in  __lwp_park()
 ...
 ...
     t@20  b l@20   MwThread()   running          in  __pollsys()
     t@21  b l@21   MwThread()   sleep on 0xf4a7f490  in  __lwp_park()
     t@22  b l@22   MwThread()   running          in  __pollsys() <- new thread
     t@23  b l@23   MwThread()   sleep on 0xea6ff490  in  __lwp_park() <- new thread

In the above example, there is no information about who created the threads t@22 & t@23. To get that information as well, use the when command as shown below:

(dbx) when thr_create { echo "New thread $newthread was created by thread $thread"; }
(6) when thr_create { kprint "New thread ${newthread} was created by thread ${thread}"; }
(dbx) cont
New thread t@24 was created by thread t@10
New thread t@25 was created by thread t@24

$newthread and $thread are pre-defined dbx variables, which hold the thread ID of a newly created thread and the thread ID of the current thread, respectively.

Similarly thread exits can be traced as follows:

(dbx) trace thr_exit
(5) trace thr_exit

(dbx) cont
New thread t@26 was created by thread t@10
New thread t@27 was created by thread t@26
trace: thr_exit t@27

Suspending/Resuming threads

To suspend the execution of a thread, run the command thread -suspend <tid>; to resume the suspended thread, run thread -resume <tid>

(dbx) thread -suspend t@26
Thread t@26 suspended

(dbx) thread -resume t@26
Thread t@26 unsuspended

Break point with stop command

The following example shows how to set a break point to stop the execution, when a new thread with id t@34 gets created.

(dbx) stop thr_create t@34
(9) stop thr_create t@34

(dbx) cont
t@10 (l@10) stopped in tdb_event_create at 0xfd1377e8
0xfd1377e8: tdb_event_create       :    retl
trace: thread created t@34 on l@34

(dbx) where   <- who initiated the new thread creation? entire call stack
current thread: t@10
=>[1] tdb_event_create(0x2, 0x1084, 0x3ff, 0x0, 0xfc8e1c00, 0x1000), at 0xfd1377e8
  [2] _thrp_create(0x180, 0x10f8, 0xfd1377e8, 0x1e, 0xc1, 0xfde32000), at 0xfd138c04
  [3] _pthread_create(0xf877f310, 0x0, 0xfd670ff4, 0xf877f318, 0x0, 0xfd168bc0), at 0xfd12d104
  [4] MwCreateThread(0x0, 0xfeb95630, 0xf877f414, 0x4, 0x0, 0x9383cb0), at 0xfd671460
  [5] CreateThread(0x0, 0x0, 0xfeb95630, 0xf877f414, 0x4, 0x9383cb0), at 0xfd67d124
  [6] CWinThread::CreateThread(0x9383c80, 0x4, 0x0, 0x0, 0xfd164278, 0x88cabc9), at 0xfeb95f1c
  [7] AfxBeginThread(0xffa7a420, 0x88cabc0, 0x0, 0x0, 0x4, 0x0), at 0xfeb958a4
  [8] WslCreateThread(0xfefdbe00, 0x5c135c0, 0x0, 0x88cabc0, 0xf877f584, 0x16b8c), at 0xffa7a4cc
  [9] OSDCreateThread(0x211200, 0x5b40660, 0x0, 0x0, 0x5ab1590, 0x5c135c0), at 0xfefdc16c
  [10] SmiDispatchThrdMain(0x101f180, 0x5ab1588, 0x5ab1590, 0xf877fd64, 0xf877fcec, 0xff40f8d4), at 0x1f53f4
  [11] OSDWslThreadStart(0x10b8ad0, 0x1f5240, 0x10b8ad0, 0xa, 0x0, 0x15d07e8), at 0xfefdbec8
  [12] _AfxThreadEntry(0xffbfeaac, 0x2f4948, 0x0, 0x1, 0x0, 0x17289c), at 0xfeb95730
  [13] MwThread(0x1, 0x0, 0x1, 0x0, 0xfd76bed0, 0x15cd558), at 0xfd671230

Light Weight Processes (LWPs)

Application (user) threads are not visible to the kernel. The kernel treats light weight processes (LWPs) as the only schedulable entities within a process; LWPs bridge the user-level and kernel-level threads. Each process contains one or more LWPs, and each LWP is associated with a kernel thread. Prior to Solaris 9, each LWP would run one or more user-level threads (i.e., 1xN). From Solaris 9 onwards, there is one LWP for every user-level thread (i.e., 1x1).
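
A quick way to observe the 1x1 mapping on a live system is a tiny program along these lines (a hypothetical sketch, unrelated to the Siebel examples) that prints the Solaris thread ID and the LWP ID from each of its threads; on Solaris 9 and later the two IDs are expected to match, just like the t@N (l@N) pairs in the dbx listings in this post.

/*
 * lwpdemo.c -- a made-up example that prints the Solaris thread ID
 * (thr_self()) and the LWP ID (_lwp_self()) from a few threads. With
 * the 1x1 model of Solaris 9 and later, the two IDs are expected to
 * be identical for every thread.
 *
 *   % cc -mt -o lwpdemo lwpdemo.c
 *   % ./lwpdemo
 */
#include <pthread.h>
#include <thread.h>      /* thr_self() -- Solaris threads API */
#include <sys/lwp.h>     /* _lwp_self() */
#include <stdio.h>

static void *report(void *arg)
{
    printf("thread id = %lu, LWP id = %lu\n",
           (unsigned long)thr_self(), (unsigned long)_lwp_self());
    return NULL;
}

int main(void)
{
    pthread_t tid[4];
    int i;

    for (i = 0; i < 4; i++)
        pthread_create(&tid[i], NULL, report, NULL);
    for (i = 0; i < 4; i++)
        pthread_join(tid[i], NULL);
    return 0;
}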

Use the lwps command to list all LWPs in the process.

(dbx) lwps
  l@1 running          in _private_mprotect()
  l@2 running          in __lwp_park()
  l@3 running          in __lwp_park()
  l@4 running          in __pollsys()
  l@5 running          in __pollsys()
  l@6 running          in __lwp_park()
  l@7 running          in __pollsys()
  l@8 running          in _so_recv()
  l@9 running          in __lwp_park()
  l@10 running          in __lwp_park()
  l@11 running          in __lwp_park()
  l@12 running          in __lwp_park()
  l@13 running          in __time()
  l@14 running          in __pollsys()
  l@15 running          in __lwp_park()
  l@16 running          in __pollsys()
o l@17 breakpoint       in SSstring::GetWriteBuffer()
  l@18 running          in __lwp_unpark()
o l@19 breakpoint       in SSstring::GetWriteBuffer()
  l@20 running          in __pollsys()
*>l@21 breakpoint       in SSstring::GetWriteBuffer()

The lwp command displays the current LWP. To switch to a different LWP, use lwp <lwpid>. lwp -info [<lwpid>] shows some useful information for a given LWP.

(dbx) lwp
current LWP ($lwp) is l@21

(dbx) lwp -info
l@21 breakpoint       in SSstring::GetWriteBuffer()
masked signals are:

(dbx) lwp -info l@12
l@12 running          in __lwp_park()
masked signals are:

(dbx) lwp l@18
t@18 (l@18) stopped in __pollsys at 0xfd13d1c4
0xfd13d1c4: __pollsys+0x0004:   ta       8

Scalability issues

In general, MT applications that make heavy use of the standard Solaris memory allocator may exhibit poor scalability. This problem occurs when multiple threads are in malloc() or free(), waiting to obtain the malloc lock.

If the application suffers from this scalability issue, the top of the thread stacks (which can be obtained using either dbx or the pstack command) will appear as shown below:

lwp_park
mutex_lock_queue
slow_lock
free
or
lwp_park
mutex_lock_queue
slow_lock
malloc

One such problem was described in this Solaris forum's thread slow_lock making application hang.

MT aware memory allocators

The mtmalloc and umem libraries in the Solaris distribution resolve this kind of scalability problem. libmtmalloc was introduced in Solaris 7, and libumem was introduced in Solaris 9 Update 3. These userland memory allocators are packaged as drop-in replacements for the standard malloc() and free() library calls; so, to take advantage of these allocators, link the MT application with either of them.

The mtmalloc and umem allocators are a redesign of the standard library and hence provide finer-grained locking. These libraries will significantly outperform the standard library in cases where multiple concurrent requests are made to the memory allocator. In the case of a single-threaded application, however, the standard memory allocator will provide better performance, and it also has a smaller memory footprint. In other words, the trade-off with the mtmalloc and umem allocators is a much bigger memory footprint, due to the way the memory gets allocated. For these reasons the standard memory allocator may be preferred in cases where the advantages of mtmalloc and umem do not apply. Make sure to experiment with these memory allocators to see which one fits your application best.
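
If you want a quick experiment, a toy allocator stress test along the lines of the following sketch (illustrative only, not taken from any real workload) can be built once and then timed with the standard allocator, and again with mtmalloc or umem linked in or preloaded as described in the next two sections. While it runs with the standard allocator, pstack on the process will typically show several threads waiting in malloc()/free().

/*
 * alloc_stress.c -- a toy microbenchmark (illustrative only) that hammers
 * malloc()/free() from several threads at once to expose contention on
 * the single lock in the standard memory allocator.
 *
 *   % cc -mt -O -o alloc_stress alloc_stress.c
 *   % timex ./alloc_stress                                (standard allocator)
 *   % env LD_PRELOAD=libmtmalloc.so timex ./alloc_stress  (mtmalloc preloaded)
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS   16
#define ITERATIONS 1000000

static void *churn(void *arg)
{
    int i;
    for (i = 0; i < ITERATIONS; i++) {
        void *p = malloc((size_t)(i % 512) + 16);   /* small, varied sizes */
        free(p);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    int i;

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, churn, NULL);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    printf("done\n");
    return 0;
}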

Linking with mtmalloc or umem

At compile time, the application can be linked against the mtmalloc or umem library. Adding the -lmtmalloc or -lumem option to the link line results in the application being linked appropriately.

e.g.,

% cc -mt -o my_program my_program.c -lmtmalloc
        or
% cc -mt -o my_program my_program.c -lumem

You can check the library dependency with ldd my_program.

Quick workaround -- library interposition

If re-building the application to link with mtmalloc or umem is not feasible, either of these libraries can be preloaded with the LD_PRELOAD environment variable when the program is executed.

e.g.,

% setenv LD_PRELOAD libmtmalloc.so
% ./my_program
        or
% setenv LD_PRELOAD libumem.so
% ./my_program

You can verify whether the library is preloaded with pldd `pgrep my_program`.

Resources:
  1. Debugging a Program With dbx
  2. Multithreaded Programming Guide
  3. malloc vs mtmalloc
Suggested Reading:
  1. Welcome to the CMT Era!
  2. Improving Application Efficiency Through Chip Multi-Threading

Tuesday Feb 17, 2009

PeopleSoft HRMS 8.9 Self-Service Benchmark on M3000 & T5120 Servers

Sun published the PeopleSoft HRMS 8.9 Self-Service benchmark results today. The benchmark was conducted on 3 x Sun SPARC Enterprise M3000 and 1 x Sun SPARC Enterprise T5120 servers. Click on the following link for the full report with the benchmark results.

PeopleSoft HRMS 8.9 SELF-SERVICE Using ORACLE on Sun SPARC Enterprise M3000 and Enterprise T5120 Servers

Admittedly, it is Sun's first PeopleSoft benchmark after a hiatus of over five years. However, I am glad that we came up with a very nice, cost-effective solution in our comeback effort in PeopleSoft applications benchmarking.

Some of the notes and highlights from this competitive benchmark are as follows.

  • The benchmark measured the average search and save transaction response times at a peak load of 4,000 concurrent users.

  • 4,000 users is the limit of the benchmark kit, and all vendors using this benchmark kit are bound by that limit. Hence it is easy to compare performance, as the throughput achieved by each vendor will be the same. When comparing benchmark results from workloads like these, lower average transaction response times, lower CPU and memory utilization, and less hardware in use usually indicate better performance.

  • IBM and Sun are the only vendors who published benchmark results with PeopleSoft HRMS 8.9 Self-Service benchmark kit.

  • Sun's benchmark results are superior to IBM's best published result, which used a combination of z990 2084-C24 and eServer pSeries p690 servers. While I leave the price comparisons to the reader [1], I'd like to show the performance numbers extracted from the benchmark reports published by Sun and IBM. All the following data/information is available in the benchmark reports. Feel free to draw your own conclusions.

    Average Transaction Response Times

    Vendor  Single User Search (sec)  4,000 Users Search (sec)  Single User Save (sec)  4,000 Users Save (sec)
    Sun     0.78                      0.77                      0.71                    0.74
    IBM     0.78                      1.35                      0.65                    1.01

    Average CPU Utilizations

    Vendor  Web Server CPU%  App Server1 CPU%  App Server2 CPU%  DB Server CPU%
    Sun     23.10            66.92             67.85             27.45
    IBM     45.81            59.70             N/A               40.66

    Average Memory Utilizations

    Vendor  Web Server GB  App Server1 GB  App Server2 GB  DB Server GB
    Sun     4.15           3.67            3.72            5.54
    IBM     5.00           15.70           N/A             0.3 (Huh!?)

    Hardware Configuration

    Vendor: Sun Microsystems

    Topology Diagram




    Tier  Server Model  Server Count  Processor      Processor Speed  Processor Count  #Cores per Processor  Memory
    Web   T5120         1             UltraSPARC-T2  1.2 GHz          1                4                     8 GB
    App   M3000         2             SPARC64-VII    2.52 GHz         1                4                     8 GB
    DB    M3000         1             SPARC64-VII    2.52 GHz         1                4                     8 GB

    2 x Sun Storage J4200 arrays were used to host the database. Total disk space: ~1.34 Terabytes. Only 120 GB of disk space was consumed -- 115 GB for data on one array, and 5 GB for redo logs on the other array.


    Vendor: IBM

    Tier  Server Model                 Server Count  Processor  Processor Speed  Processor Count  #Cores per Processor  Memory
    Web   p690 (7040-681)              1             POWER4     1.9 GHz          4                NA (?)                12 GB
    App   p690 (7040-681)              1             POWER4     1.9 GHz          12               NA (?)                32 GB
    DB    zSeries 990, model 2084-C24  1             z990 Gen1  ???              6                NA (?)                32 GB

    1 x IBM TotalStorage DS8300 Enterprise Storage Server, 2107-922 was used to host the database. Total disk space: ~9 Terabytes.

  • The combination of Sun SPARC Enterprise M3000 and T5120 servers consumed 1,030 watts on average in a 7RU space while achieving 4,000 concurrent users. That is, in the case of similarly configured workloads, the M3000/T5120 combination supports 3.88 users per watt of power consumed and 571 users per rack unit.

Just as with our prior Siebel and Oracle E-Business Suite Payroll 11i benchmarks, Sun collaborated with Oracle Corporation in executing this benchmark, and we sincerely thank our peers at Oracle Corporation for all their help and support over the past few months.

___________

I'm planning to post some of the tuning tips to run PeopleSoft optimally on Solaris 10. Stay tuned ..

1: It is relatively hard to obtain IBM's server list prices. On the other hand, it is very easy to find the list prices of Sun servers from http://store.sun.com