Modern multi-socket servers exhibit NUMA characteristics that may hurt application performance if ignored. On a NUMA (Non-Uniform Memory Access) system, all memory is shared among processors. Each processor has access to its own memory (local memory) as well as memory that is local to another processor (remote memory). However, the memory access time (latency) depends on the memory location relative to the processor. A processor can access its local memory faster than remote memory, and these varying memory latencies play a big role in application performance.
Solaris organizes the hardware resources -- CPU, memory and I/O devices -- into one or more logical groups based on their proximity to each other, in such a way that all the hardware resources in a group are considered local to that group. These groups are referred to as locality groups or NUMA nodes. In other words, a locality group (lgroup) is an abstraction that tells which hardware resources are near each other on a NUMA system. Each locality group has at least one processor and possibly some associated memory and/or I/O devices. To minimize the impact of NUMA characteristics, Solaris considers the lgroup-based physical topology when mapping threads and data to CPUs and memory.
Note that even though Solaris attempts to provide good performance out of the box, some applications may still suffer the impact of NUMA, either due to misconfiguration of the hardware/software or for some other reason. Engineered systems such as Oracle SuperCluster go to great lengths in setting up customer environments to minimize the impact of NUMA so applications perform as expected in a predictable manner. Still, application developers and system/application administrators need to take the NUMA factor into account while developing for and managing applications on large systems. Solaris-provided tools and APIs can be used to observe, diagnose, control and even correct the issues related to locality and latency. The rest of this post is about the tools that can be used to examine the locality of cores, memory and I/O devices.
Sample outputs are collected from a SPARC T4-4 server.
Locality Group Hierarchy
lgrpinfo prints information about the lgroup hierarchy and its contents. It is useful in understanding the context in which the OS is trying to optimize applications for locality, and also in figuring out which CPUs are closer, how much memory is near them, and the relative latencies between the CPUs and different memory blocks.
The lgrpinfo utility shows the total memory that belongs to each of the locality groups. However, it doesn't show exactly which memory blocks belong to which locality groups. One of mdb's debugger commands (dcmds) helps retrieve this information.
The values under the MN (memory node) column can be treated as lgroup numbers after adding 1 to the existing values. For example, a value of zero under MN translates to lgroup 1, a value of 1 under MN translates to lgroup 2, and so on. Better yet, the ::mnode debugger command lists the mapping of mnodes to lgroups.
Main memory on T4-4 is interleaved across all memory banks with an 8 GB interleave size. That is, the first 8 GB chunk (excluding _sys_ blocks) is populated in lgroup 1 closer to processor #1, the second 8 GB chunk in lgroup 2 closer to processor #2, the third 8 GB chunk in lgroup 3 closer to processor #3, the fourth 8 GB chunk in lgroup 4 closer to processor #4, and then the fifth 8 GB chunk again in lgroup 1 closer to processor #1, and so on. Memory is not interleaved on T5 and M6 systems (confirm by running the ::syslayout dcmd). Conceptually, memory interleaving is similar to disk striping.
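The round-robin placement described above can be sketched as simple arithmetic. This is only an idealized illustration: it assumes offsets are counted in whole GB from the first non-_sys_ block, and the function and variable names are made up for this example. The real layout should always be checked with ::syslayout.

```sh
#!/bin/sh
# Idealized model of the T4-4 interleave described in the text:
# 8 GB chunks handed out round-robin to lgroups 1..4.
INTERLEAVE_GB=8   # interleave size (from the text)
NUM_LGROUPS=4     # one lgroup per processor on a T4-4 (from the text)

# Hypothetical helper: print the lgroup that would hold the given
# offset (in GB, counted from the first non-_sys_ block).
lgroup_for_offset_gb() {
    offset_gb=$1
    chunk=$((offset_gb / INTERLEAVE_GB))
    echo $(( (chunk % NUM_LGROUPS) + 1 ))
}

for off in 0 8 16 24 32; do
    printf 'offset %2d GB -> lgroup %d\n' "$off" "$(lgroup_for_offset_gb "$off")"
done
```

The loop walks the first five 8 GB chunks, so the fifth chunk wraps back to lgroup 1, matching the description above.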
Keep in mind that debugger commands (dcmds) are not committed interfaces, so there is no guarantee that they will continue to work on future versions of Solaris. Some of these dcmds may not work on some of the existing versions of Solaris either.
I/O Device Locality
The -d option of the lgrpinfo utility accepts a path to an I/O device and returns the lgroup IDs closest to that device. Each I/O device on the system can be connected to one or more NUMA nodes, so it is not uncommon to see more than one lgroup ID returned by lgrpinfo.
# lgrpinfo -d /dev/dsk/c1t0d0
lgroup ID : 1
# dladm show-phys | grep 10000
net4 Ethernet up 10000 full ixgbe0
# lgrpinfo -d /dev/ixgbe0
lgroup ID : 1
# dladm show-phys | grep ibp0
net12 Infiniband up 32000 unknown ibp0
# lgrpinfo -d /dev/ibp0
lgroup IDs : 1-4
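When the result needs to be fed into another script, the "lgroup ID" / "lgroup IDs" line can be expanded into a plain list. The helper below is a rough sketch with a made-up name (expand_lgroup_ids). It handles the two forms shown above (a single ID and a range such as 1-4); support for comma-separated lists is an assumption, since no such output appears in the samples. On Solaris, use nawk or /usr/xpg4/bin/awk, since the legacy /usr/bin/awk lacks gsub().

```sh
#!/bin/sh
# Hypothetical helper: read `lgrpinfo -d <device>` output on stdin and
# print the lgroup IDs as a space-separated list.
expand_lgroup_ids() {
    awk -F: '/^lgroup IDs?[ \t]*:/ {
        spec = $2
        gsub(/[ \t]/, "", spec)            # drop blanks around the IDs
        n = split(spec, parts, ",")        # comma lists are an assumption
        out = ""
        for (i = 1; i <= n; i++) {
            if (split(parts[i], r, "-") == 2) {
                # expand a range such as 1-4
                for (j = r[1] + 0; j <= r[2] + 0; j++)
                    out = out (out == "" ? "" : " ") j
            } else {
                out = out (out == "" ? "" : " ") parts[i]
            }
        }
        print out
    }'
}

# Usage on a Solaris system:
#   lgrpinfo -d /dev/ibp0 | expand_lgroup_ids
```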
NUMA I/O Groups
The debugger command ::numaio_group shows information about all NUMA I/O Groups.
# dladm show-phys | grep up
net0 Ethernet up 1000 full igb0
net12 Ethernet up 10 full usbecm2
net4 Ethernet up 10000 full ixgbe0
# echo ::numaio_group | mdb -k
ADDR GROUP_NAME CONSTRAINT
10050e1eba48 net4 lgrp : 1
10050e1ebbb0 net0 lgrp : 1
10050e1ebd18 usbecm2 lgrp : 1
10050e1ebe80 scsi_hba_ngrp_mpt_sas1 lgrp : 4
10050e1ebef8 scsi_hba_ngrp_mpt_sas0 lgrp : 1
Relying on prtconf is another way to find the NUMA I/O locality for an I/O device.
# dladm show-phys | grep up | grep ixgbe
net4 Ethernet up 10000 full ixgbe0
== Find the device path for the network interface ==
# grep ixgbe /etc/path_to_inst | grep " 0 "
"/pci@400/pci@1/pci@0/pci@4/network@0" 0 "ixgbe"
== Find NUMA IO Lgroups ==
# prtconf -v /devices/pci@400/pci@1/pci@0/pci@4/network@0
name='numaio-lgrps' type=int items=1
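The grep " 0 " step above matches the instance number loosely. A slightly stricter lookup can compare the driver name and instance number as separate fields. The helper below is a sketch with an invented name (device_path_for); it assumes the three-column path_to_inst layout shown above ("physical path" instance "driver") and takes the file name as an optional third argument.

```sh
#!/bin/sh
# Hypothetical helper: print the /devices path for a driver instance by
# matching the driver and instance columns of path_to_inst exactly.
device_path_for() {
    driver=$1
    instance=$2
    file=${3:-/etc/path_to_inst}
    awk -v drv="\"$driver\"" -v inst="$instance" '
        $3 == drv && $2 == inst {
            gsub(/"/, "", $1)          # strip the quotes around the path
            print "/devices" $1
        }' "$file"
}

# Usage on a Solaris system:
#   prtconf -v "$(device_path_for ixgbe 0)"
```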
The list-rsrc-group subcommand of the Logical Domains Manager command line interface (ldm) shows a consolidated list of processor cores, memory blocks and I/O devices that belong to each resource group. This subcommand is available in ldm 3.2 and later versions.
In a Resource Group, resources are grouped based on the underlying physical relationship between cores, memory, and I/O buses. Depending on the hardware platform, some server configurations, such as SPARC M7-8, may have a Resource Group that maps directly to a Locality Group.
The -H option of the prstat command shows the home lgroup of active user processes and threads.
The -H option of the ps command can be used to examine the home lgroup of all user processes and threads. The -h option can be used to list only the processes that are in a certain locality group.
[Related] Solaris assigns a thread to an lgroup when the thread is created. That lgroup is called the thread's home lgroup. Solaris runs the thread on the CPUs in the thread's home lgroup and allocates memory from that lgroup whenever possible.
The plgrp tool shows the placement of threads among locality groups. The same tool can be used to set the home locality group and lgroup affinities for one or more processes, threads, or LWPs.
Memory placement among lgroups can be influenced with pmadvise while the application is running, or with the madvise() system call during development. Both provide advice to the kernel's virtual memory manager. The OS uses this hint to determine how to allocate memory for the specified range. This mechanism is beneficial when administrators and developers understand the target application's data access patterns.
It is not possible to specify memory placement locality for OSM and ISM segments using the pmadvise command or the madvise() system call (DISM is an exception).
When Oracle acquired Sun, IBM tried to capitalize on the situation just like every other competitor Sun had: doubts were raised about Oracle's ability to turn Sun's hardware business around, and Solaris customers were advised to flee SPARC. Fast forward four years, and Oracle appears to have successfully dispelled the doubts with a proven long-term commitment to the Solaris/SPARC business, with consistent investment and delivery on established roadmaps. Besides, Oracle has been innovating in the server space with engineered systems that are pre-integrated to reduce the cost and complexity of IT infrastructures while increasing productivity and performance.
On the other hand, judging by the recent turn of events at IBM, such as selling off critical server technologies, a decline in the data center business, employee furloughs and layoffs, it appears that Big Blue has its own struggles to deal with. In any case, irrespective of what is happening at IBM, AIX customers who are contemplating a migration to a modern operating platform that is reliable, secure, cloud-ready and offers a rich set of features to virtualize, consolidate, diagnose, debug and, most importantly, scale and perform, have an attractive alternative: Oracle Solaris. Act before it is too late.
Unfortunately, migrating larger deployments from one platform to another is not as easy as migrating desktop users from one operating system to another. So, Oracle put together a bunch of documents to make the AIX to Solaris transition as smooth as possible for the existing AIX customers. Access the AIX-to-Solaris migration pages at:
Oracle Internet Directory (OID) is an LDAP v3 Directory Server that has a multi-threaded, multi-process, multi-instance process architecture with the Oracle database as the directory store.
BENCHMARK WORKLOAD DESCRIPTION
Five test scenarios were executed in this benchmark - each test scenario performing a different type of LDAP operation. The key metrics are throughput -- the number of operations completed per second, and latency -- the time it took in milliseconds to complete an operation.
TEST SCENARIOS & RESULTS
1. LDAP Search operation : search for and retrieve specific entries from the directory
In this test scenario, each LDAP search operation matches a single unique entry. Each Search operation results in the lookup of an entry in such a way that no client looks up the same entry twice, no two clients look up the same entry, and all entries are looked up randomly.
2. LDAP Add operation : add entries, their object classes, attributes and values to the directory
In this test scenario, 16 concurrent LDAP clients added 500,000 entries of object class InetOrgPerson with 21 attributes to the directory.
3. LDAP Compare operation : compare a given attribute value to the attribute value in a directory entry
In this test scenario, the userpassword attribute was compared. That is, each LDAP Compare operation matches the user password of a user.
4. LDAP Modify operation : add, delete or replace attributes for entries
In this test scenario, 50 concurrent LDAP clients updated a unique entry each time, and a total of 50 million entries were updated. The attribute being modified was not indexed.
5. LDAP Authentication operation : authenticates the credentials of a user
In this test scenario, 1000 concurrent LDAP clients authenticated 50 million users.
BONUS: LDAP Mixed operations Test
In this test scenario, 1000 LDAP clients were used to perform LDAP Search, Bind and Modify operations concurrently. Operation breakdown (load distribution): Search: 65%. Bind: 30%. Modify: 5%
And finally, the hardware and software configuration:
1 x Oracle SPARC T5-2 Server
» 2 x 3.6 GHz SPARC T5 sockets each with 16 Cores (Total Cores: 32) and 8 MB L3 cache
» 512 GB physical memory
» 2 x 10 GbE cards
» 1 x Sun Storage F5100 Flash Array with 80 flash modules
» Oracle Solaris 11.1 operating system
Major credit goes to our colleague, Ramaprakash Sathyanarayan
Hardly six months after announcing Siebel 8.1.x benchmark results on Oracle SPARC T4 servers, we have a brand new set of Siebel 8.1.x benchmark results on Oracle SPARC T5 servers. There have been no updates to the Siebel benchmark kit in the last couple of years, so we continued to use the Siebel 8.1.x benchmark workload to measure the performance of Siebel Financial Services Call Center and Order Management business transactions on the recently announced SPARC T5 servers.
The latest Siebel 8.1.x benchmark was executed on a mix of SPARC T5-2, SPARC T4-2 and SPARC T4-1 servers. The benchmark test simulated the actions of a large corporation with 40,000 concurrent active users. To date, this is the highest user count we achieved in a Siebel benchmark.
User Load Breakdown & Achieved Throughput
Siebel Application Module
Business Trx per Hour
Financial Services Call Center
Average Transaction Response Times for both Financial Services Call Center and Order Management transactions were under one second.
Software & Hardware Specification
Per Server Specification
Solaris 10 1/13 (S10U11)
Oracle 11g R2
Solaris 10 8/11 (S10U10)
iPlanet Web Server
7.0.9 (7 U9)
Solaris 10 8/11 (S10U10)
Oracle Application Test Suite
AMD Opteron 285 SE
Windows 2003 R2 SP2
Load Drivers (Agents)
Oracle Application Test Suite
Intel Xeon X5670
Windows 2003 R2 SP2
Siebel Gateway Server was configured to run on one of the application server nodes
Four Siebel application servers were configured in the Siebel Enterprise to handle 40,000 concurrent users
- Each SPARC T5-2 was configured to run two Siebel application server instances
- Each of the Siebel application server instances on SPARC T5-2 servers was separated using Solaris virtualization technology, Zones
- 40,000 concurrent user sessions were load balanced across all four Siebel application server instances
Siebel database was hosted on a Sun Storage F5100 Flash Array consisting of 80 x 24 GB flash modules (FMODs)
- Siebel 8.1.x benchmark workload is not I/O intensive and does not require flash storage for better I/O performance
Fourteen iPlanet Web Server virtual servers were configured with Siebel Web Server Extension (SWSE) plug-in to handle 40,000 concurrent user load
- All fourteen iPlanet Web Server instances forwarded HTTP requests from Siebel clients to all four Siebel application server instances in a round robin fashion
Oracle Application Test Suite (OATS) was stable and held up amazingly well over the entire duration of the test run.
- The test ran for more than five hours including a three-hour ramp-up state
The benchmark test results were validated and thoroughly audited by the Siebel benchmark and PSR teams
- Nothing new here. All Sun published Siebel benchmarks including the SPARC T4 one were properly audited before releasing those to the outside world
Finally, how does this benchmark stack up against other published benchmarks? The short answer is "very well". Head over to the Oracle Siebel Benchmark White Papers webpage to do the comparison yourself.
[Credit to our hard working colleagues in SAE, Siebel PSR, benchmark and Oracle Platform Integration (OPI) teams. Special thanks to Sumti Jairath and Venkat Krishnaswamy for the last minute fire drill]
Just like the Siebel 8.1.x/SPARC T4 benchmark post, this one too was overdue by at least four months. In any case, I hope the Oracle BI customers already knew about the OBIEE 11g/SPARC T4 benchmark effort. Here I will try to provide a few additional, interesting details that aren't covered in the following Oracle PR that was posted on oracle.com on 09/30/2012.
The entire BI middleware stack including the WebLogic 11g Server, OBI Server, OBI Presentation Server and Java Host was installed and configured on a single SPARC T4-4 server consisting of four 8-Core 3.0 GHz SPARC T4 processors (total #cores: 32) and 128 GB physical memory. Oracle Solaris 10 8/11 is the operating system.
BI users were authenticated against Oracle Internet Directory (OID) in this benchmark, hence the OID software, which was part of Oracle Identity Management, was also installed and configured on the system under test (SUT). Oracle BI Server's Query Cache was turned on, and as a result, most of the query results were cached in the OBIS layer. That resulted in minimal database activity, making it ideal to have the Oracle 11g R2 database server with the OBIEE database running on the same box as well.
Oracle BI database was hosted on a Sun ZFS Storage 7120 Appliance. The BI Web Catalog was under a ZFS/zpool on a couple of SSDs.
In this benchmark, 25,000 concurrent users assumed five different business user roles -- Marketing Executive, Sales Representative, Sales Manager, Sales Vice-president, and Service Manager. The load was distributed equally among those five business user roles. Each of those different BI users accessed five different pre-built dashboards, with each dashboard having an average of five reports (a mix of charts, tables and pivot tables) and returning 50-500 rows of aggregated data. The benchmark test scenario included drilling down into multiple levels from a table or chart within a dashboard. There was a 60 second think time between requests, per user.
BI Setup & Test Results
OBIEE 11g was deployed on the SUT in a vertical scale-out fashion. Two Oracle BI Presentation Server processes, one Oracle BI Server process, one Java Host process and two instances of WebLogic Managed Servers handled 25,000 concurrent user sessions smoothly. This configuration resulted in a sub-second overall average transaction response time (average of averages over a duration of 120 minutes). On average, 450 business transactions were executed per second, which triggered 750 SQL executions per second.
It took only 52% of CPU on average (~5% system CPU and the rest in user land) to achieve the throughput outlined above. Since 25,000 unique test/BI users hammered different dashboards consistently, not so surprisingly the bulk of the CPU was spent in the Oracle BI Presentation Server layer, which took a whopping 29%. BI Server consumed about 10-11% and the rest was shared by Java Host, OID, WebLogic Managed Server instances and the Oracle database.
So, what is the key takeaway from this whole exercise?
SPARC T4 rocks the Oracle BI world. OBIEE 11g/SPARC T4 is an ideal combination that may work well for the majority of OBIEE deployments on the Solaris platform. Or in marketing jargon: the excellent vertical and horizontal scalability of the SPARC T4 server gives customers the option to scale up as well as scale out, to support large BI EE installations with minimal hardware investment.
Evaluate and decide for yourself.
[Credit to our colleagues in Oracle FMW PSR, ISVe teams and SCA lab support engineers]
Siebel is a multi-threaded native application that performs well on Oracle's T-series SPARC hardware. We have several versions of Siebel benchmarks published on previous generation T-series servers ranging from Sun Fire T2000 to Oracle SPARC T3-4. So, it is natural to see that tradition extended to the current generation SPARC T4 as well.
A 29,000 user Siebel 8.1.x benchmark on a mix of SPARC T4-1 and T4-2 servers was announced during the Oracle OpenWorld 2012 event. In this benchmark, Siebel application server instances ran on three SPARC T4-2/Solaris 10 8/11 systems, whereas the Oracle database server 11gR2 was configured on a single SPARC T4-1/Solaris 11 11/11 system. Several iPlanet web server 7 U9 instances with the Siebel Web Plug-in (SWE) installed ran on one SPARC T4-1/Solaris 10 8/11 system. Siebel database was hosted on a single Sun Storage F5100 flash array consisting of 80 flash modules (FMODs), each with a capacity of 24 GB.
Siebel Call Center and Order Management System are the modules that were tested in the benchmark. The benchmark workload had 70% of virtual users running Siebel Call Center transactions and the remaining 30% vusers running Siebel Order Management System transactions. This benchmark on T4 exhibited sub-second response times on average for both Siebel Call Center and Order Management System modules.
Load balancing at various layers including web and test client systems ensured near uniform load across all web and application server instances. All three Siebel application server systems consumed ~78% CPU on average. The database and web server systems consumed ~53% and ~18% CPU respectively.
All these details are supposed to be available in a standard Oracle|Siebel benchmark template - but for some reason, I couldn't find it on Oracle's Siebel Benchmark White Papers web page yet. Meanwhile check out the following PR that was posted on oracle.com on 09/28/2012.
This solution was centered around the engineered system, SPARC SuperCluster T4-4. Check the business and technical white papers along with a bunch of relevant useful resources online at the above optimized solution page for EBS.
What is an Optimized Solution?
Oracle's Optimized Solutions are designed, tested and fully documented architectures that are tuned for optimal performance and availability. Optimized solutions are NOT pre-packaged, fully tuned, ready-to-install software bundles that can be downloaded and installed. An optimized solution is usually a well documented architecture that was thoroughly tested on a target platform. The technical white paper details the deployed application architecture along with various observations from installing the application on target platform to its behavior and performance in highly available and scalable configurations.
Oracle E-Business Suite R12 Use Case
Multiple E-Business Suite R12 12.1.3 application modules were tested in this optimized solution -- Financials (online - oracle forms & web requests), Order Management (online - oracle forms & web requests) and HRMS (online - web requests & payroll batch). The solution will be updated with additional application modules, when they are available.
For the sake of completeness, test results were also documented in the optimized solution white paper. Those test results are for educational purposes only. They give a good sense of application behavior under the circumstances in which the application was tested. Since the major focus of the optimized solution is around highly available and scalable configurations, the application was configured to meet those criteria. Hence the documented test results are not directly comparable to any other E-Business Suite performance test results published by any vendor including Oracle. Such an attempt may lead to skewed, incorrect conclusions.
Questions & Requests
Feel free to direct your questions to the author of the white papers. If you are a potential customer who would like to test a specific E-Business Suite application module on any non-engineered system such as SPARC T4-X or engineered system such as SPARC SuperCluster, contact Oracle Solution Center.
It should be easy to find this information just by running an OS command. However, for some reason it ain't the case as of today. The user must know a few details about the underlying hardware and run multiple commands to figure out the exact number of physical processors, cores, etc.
For the benefit of our customers, here is a simple shell script that displays the number of physical processors, cores, virtual processors, cores per physical processor, number of hardware threads (vCPUs) per core and the virtual CPU mapping for all physical processors and cores on a Solaris system (SPARC or x86/x64). This script showed valid output on recent T-series, M-series hardware as well as on some older hardware - Sun Fire 4800, x4600. Due to the changes in the output of cpu_info over the years, it is possible that the script may return incorrect information in some cases. Since it is just a shell script, tweak the code as you like. The script can be executed by any OS user.
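The following is a separate, much smaller sketch of the general idea, not the script described above. It assumes that the cpu_info kstat exposes chip_id and core_id statistics and that `kstat -p` prints them as module:instance:name:statistic followed by the value. As noted above, cpu_info output has changed over the years, so treat the field names as assumptions to verify on the target system. The function name is made up for illustration.

```sh
#!/bin/sh
# Hypothetical helper: read `kstat -p -m cpu_info` style lines on stdin and
# count physical processors, cores and virtual processors.
summarize_cpu_info() {
    awk -F'[:\t ]+' '
        $4 == "chip_id" { chip[$2] = $5 }    # $2 is the virtual CPU instance
        $4 == "core_id" { core[$2] = $5 }
        END {
            for (c in chip) {
                nvcpu++
                if (!(chip[c] in chips)) { chips[chip[c]] = 1; nchip++ }
                key = chip[c] "/" core[c]    # core IDs scoped per chip
                if (!(key in cores)) { cores[key] = 1; ncore++ }
            }
            printf "physical processors: %d\n", nchip
            printf "cores: %d\n", ncore
            printf "virtual processors: %d\n", nvcpu
        }'
}

# Usage on a Solaris system:
#   kstat -p -m cpu_info | summarize_cpu_info
```

Cores per processor and hardware threads per core follow by dividing the counts, provided the system is populated symmetrically.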