Thursday Dec 08, 2005

How to demonstrate the value of the CoolThread UltraSPARC T1 servers (T1000 - T2000) to your boss ?

Well, after a very long entry presenting my pfp tool, here is a very short one...

To demonstrate the value of the CoolThread UltraSPARC T1 servers (T1000 - T2000) to your boss
there is only one thing to do : make her/him benchmark it using Sun Sim Datacenter
(yes ! your boss is gonna run a benchmark and she/he will like it ! )

How to do it and simulate all your Datacenter with UltraSPARC T1000 or T2000 ?
Very simple : download Sim Datacenter here, and run it on Solaris 9 or 10 !

What, you don't have Solaris 10 on your laptop ?
Get it right now on this Solaris page ...
Easy, no ?

Wednesday Dec 07, 2005

Is my workload recommended for a CoolThread UltraSPARC T1 server ( T1000 - T2000 ) ?

Since the pre-release and announcement of the UltraSPARC T1 systems (T1000 - T2000),
our customers coming into the Sun Solution Benchmark Center have been very interested to know whether their
applications will work well on UltraSPARC T1. While assessing the multi-threaded nature of a
workload is easy using standard system tools, it is less straightforward to obtain at will
the amount and proportion of floating-point instructions executed by a system. Some complex
tools exist, but we would like to have a simple go/no-go binary that answers
only this question. (If you are interested in a more detailed analysis of CPU behavior, please
ask me about a great tool called ripc )

The key information coming from our UltraSPARC T1 engineers is the choice they had to make (because
of space limitations) to have a single floating point unit shared by the 8 cores (and 32 strands).
Please note that this challenge has been solved in the next release of this processor.

They tell us that, in their best estimation, any workload in which floating-point instructions represent
more than 2% of the total instruction count is not recommended for UltraSPARC T1. Between 1% and 2% is the gray area where
they recommend us to try, because a number of the simpler FPU instructions were moved to the
core and do not incur the 40-cycle penalty.

The idea of this article is to explain how to get this information and provide a simple tool
(for all UltraSPARC based systems).

The UltraSPARC III (or UltraSPARC IV core) can fetch a maximum of four instructions
from cache in a clock cycle, and a total of sixteen fetched instructions
can wait for an execution unit to become available. Six parallel execution units exist on
the chip : one load/store unit, one branch unit, two identical integer Arithmetic Logical
Units, one add (and therefore subtract) floating point unit named FA_PIPE (see FP 1
on the schema below) and one multiply (and therefore divide) floating point unit named FM_PIPE
(see FP 2 below).


For the UltraSPARC III (and IV or IV+), multiple performance instrumentation counters are provided to analyze CPU
behavior under load, but for our purpose we need to consider only three of them :

1 - The total number of instructions completed, not counting annulled, mispredicted or
trapped instructions. This is the Instr_cnt counter.

2 - The total number of instructions completed on the FA_PIPE. This is FA_pipe_completion.

3 - The total number of instructions completed on the FM_PIPE. This is FM_pipe_completion.

Note that counters 2 and 3 are also incremented by some types of VIS instructions. Therefore,
they have to be considered only as estimates.

For the UltraSPARC T1 based systems, it is simpler, as the single counter FP_instr_cnt is directly provided.

As you have already deduced, we will be able to determine the percentage of floating-point
operations with the formula :

%FP_ops = 100 * (FA_pipe_completion + FM_pipe_completion) / Instr_cnt

We are also able to provide this simple heuristic :

if ( %FP_ops < 1% ) -> Recommended for UltraSPARC T1
else if ( %FP_ops between 1% and 2% ) -> Possible fit for UltraSPARC T1
else -> Not recommended for UltraSPARC T1
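For illustration, the formula and heuristic can be sketched in a few lines of Python (this is not the pfp implementation, just a model of its decision; the counter values in the example are hypothetical) :

```python
def classify_fp_ratio(instr_cnt, fa_pipe, fm_pipe):
    """Percentage of floating-point instructions among all completed
    instructions, and the resulting UltraSPARC T1 recommendation."""
    fp_pct = 100.0 * (fa_pipe + fm_pipe) / instr_cnt
    if fp_pct < 1.0:
        verdict = "Recommended for UltraSPARC T1"
    elif fp_pct <= 2.0:
        verdict = "Possible fit for UltraSPARC T1"
    else:
        verdict = "Not recommended for UltraSPARC T1"
    return fp_pct, verdict

# Hypothetical pipe-completion counts sampled over a 30-second interval
pct, verdict = classify_fp_ratio(instr_cnt=22756679, fa_pipe=30000, fm_pipe=15000)
```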

To do this, here is a program named pfp that you can use as : pfp <duration in seconds>
If you are on a T1000 or T2000 system, please use the -n flag, as this program does not detect the CPU
type in its first release. Please remember to run your workload first and, while it is running,
use this program as shown below.

paris # ./pfp 30
We observed 22756679 instructions separated in 0.20% floating point and 99.80% others
This workload is recommended for UltraSPARC T1 systems.

ontario # ./pfp -n 30
We observed 342593950 instructions separated in 0.77% floating point and 99.23% others
This workload is recommended for UltraSPARC T1 systems.

If you just want the percentage of floating point instructions, you can also do
paris # ./pfp -s 30

Finally, you can also use the tool on Solaris 8 or Solaris 9 with :
Dtrace # ./pfp -ps 30

The binary of this tool can be found here.

Thursday Nov 17, 2005

Solaris Performance Analysis Methodology (The APM) - part 1b

Good morning;

As  promised in my previous post, I am continuing the exposé of the way I analyze
customer  workloads. Please note that the way you will approach this mission can be :
  • Personal - To be successful and have a happy customer (or consumer, using a DTrace analogy), you may have to follow your own logic.
  • Traditional - The traditional approach is to start with a general tool (vmstat, prstat or sar) and drill down into interesting areas.
  • Based on previous analysis - You don't need to understand the whole performance picture and can focus on a specific issue.
This is fine. My overall intent is to give you some ideas. If you can walk away with a better process based on mine, my mission will be accomplished.

The APM for Solaris

Part 1 (continued) - Control what's running

Based on my previous post, you now have a better idea of what's going on in the different
layers of your architecture. But what about the quality of these applications ? The most
common reaction that I get to this question is : "We can't do anything about it". This may
be true for commercially available applications like a database or a web server. In fact, I found out that for a majority of our customers, we can improve performance by doing one
very simple thing : recompile with a modern compiler and adequate flags.
(For Java, read this as : use the latest supported JVM with adequate flags.)

Please remember that Sun Studio 11 is now FREE. You have no excuse not to use it. A
better question is how you can find out what compiler was used for a specific binary.
An old Unix System V tool named elfdump is hidden under /usr/ccs/bin with other tools like prof or lex (usually it is not in your path - just add it). Note that elfdump -C can demangle C++ names. For example :

elfdump -c /iGen/iGen_all | grep "SUNWspro"

Convincing your customer to re-compile is sometimes difficult, but will most of the time
yield performance gains. Fundamentally, the quality of the instructions executed
by the processors is key. Using the -fast flag (it will be automatically expanded to a set of
platform-dependent flags) is a good place to start.

Well, that's it for today.
See you next time in the wonderful world of benchmarking.

Wednesday Nov 16, 2005

Welcome to BM Seer and Solaris Performance Methodology

Good morning all;

As it is lightly snowing this morning in St Charles (outside of Chicago),
let me give a warm welcome to a new blogger : BM Seer .
You will find there all the latest news on Sun benchmark results.

Now, I will start today a new series on my performance analysis methodology.
I called this methodology the ASTROLABE. (If you do not know what an
astrolabe is, check out this web site.)
The intent of this approach is to be SIMPLE and POWERFUL.
This methodology has seven sections and I will present the first one today.

The ASTROLABE performance analysis methodology.

Section one : Control what's running

You may be surprised, but most of the customers I am seeing in the benchmark center
do not really control what's running in their environments.

Top three questions to answer :
1 - What are the applications running in this environment
and their main characteristics ?
2 - What are the main data streams and what transport mechanisms
are used (tcp, udp ...) ?
3 - Most importantly, what is running that we do not know about ?
A way to detect this is to run this simple DTrace audit script :
(Note that this script is Solaris 10 zone aware... )

dtrace -n 'proc:::exec{printf("%s execing %s, uid/zone = %d/%s\n",execname,args[0],uid,zonename)}'

The main issue that this script will uncover is runaway shell scripts.
They may use a very valuable chunk of your system resources.
Short-lived applications can also be uncovered this way.
A hint : if the total CPU usage reported by prstat is lower
than the total CPU usage reported by vmstat, you should worry about these two issues.
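As a toy illustration of that hint, here is how you might flag the gap (the threshold and the sample percentages are made up; in practice you simply eyeball the two tools side by side) :

```python
def hidden_cpu_gap(prstat_total_pct, vmstat_busy_pct, tolerance=5.0):
    """Compare the sum of per-process CPU (prstat) with system-wide
    CPU usage (vmstat). A large gap hints at short-lived processes
    or runaway scripts that per-process sampling misses."""
    gap = vmstat_busy_pct - prstat_total_pct
    return gap > tolerance, gap

# Hypothetical readings: prstat accounts for 40% but vmstat says 75% busy
suspicious, gap = hidden_cpu_gap(prstat_total_pct=40.0, vmstat_busy_pct=75.0)
```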

In more than 80% of the customer workloads we analyzed, performance benefits are
achieved by tuning the software stack, and the customer applications
in particular, not by tuning Solaris.

See you soon in the wonderful world of benchmarking....

Thursday Nov 03, 2005

To the princes and princesses

As expected, find some of our jewels below.
Feel free to send me your questions by email to

Part 3
Part 4
Part 5
Part 6
Part 7

See you soon in the wonderful world of pop ...

Tuesday Oct 18, 2005

DTrace deep dive in Southern California

Thanks to all the attendees of today's Sun Solaris 10 DTrace deep dive in rainy LA.
Feel free to send me your questions by email to
Please find below the two DTrace presentations.
Dtrace concepts Dtrace scenarios
Let me know if our discussion was useful.
See you soon in the wonderful world of Solaris 10 ...

Thursday Sep 08, 2005

Solaris 10 deep dive in Santa Clara,CA

I had the privilege to co-present today at the Solaris 10 deep dive bootcamp in the Santa Clara, CA auditorium.
To all the attendees, thank you so much for coming and staying with us all day. Presenting an operating system is
not an easy endeavor, and your patience and outstanding remarks were very much appreciated !

My presentation is available here in pdf format. Enjoy !
Also, send me an email at if you have any questions that could not be answered today.
Please also check Bob Netherton's and Linda Kateley's weblogs for their latest updates and presentations.

Wednesday Aug 17, 2005

DCSS in Vegas - Feedback 2

More feedback from the DCSS in Vegas, where I presented this morning on Solaris 10 performance,
covering various topics to help tune your environment, such as the impact of FireEngine,
ptools, libumem, MPSS and so on. A great crowd of 160 partners, very much motivated to get the
very best from the best OS...
Now, the highlight of the day was the one-man show of Brian Wilson, one of our distinguished
engineers who make Sun an exciting workplace.
If you need to close a deal and the customer is questioning Sun's strategy, bring him in and the
PO will follow ...
You know the food pyramid ? Here is the BW pyramid, composed of all the components of the IT
infrastructure in the proper order, from the network to business processes and ROI.
Great stuff. Need more on this ? Let us know....

Tuesday Aug 16, 2005

DCSS in Vegas - Feedback 1

Hi all; Attending and presenting @ the Sun Microsystems Data Center Summit in Vegas this week. Great to see all these familiar faces... This morning was marked by a bright keynote presentation by Rich Napolitano. Finally an executive who got the point : products and technology are second on the list of success. A big first is sales force discipline, sales tactics and a reminder of the ONE SUN attitude. Thanks Rich for your sane back-to-basics reminder ! benoit

Tuesday Jun 14, 2005

OpenSolaris - This is the day !

This is the day !

And to say it again :


This is the day !

Details at :


Thursday May 19, 2005

Solaris 10 bootcamp

Well, I delivered today our Solaris 10 bootcamp in the Marriott on 4th street in San Francisco. About 200 attendees (a great group) with a lot of smart questions. Feedback was really positive on the content...great as I was trying a more interactive, terminal based Solaris 10 format. Who said you can not have fun presenting an Operating system...

StarOffice 8 beta was good to us, as it did not freeze like when I was reviewing the slides the day before. Just a couple of complaints on the French accent. Sorry about that, friends. I never learned English at school (only German and Italian).

See great feedback on the bootcamp at

I will probably publish the DTrace performance scenarios on this blog starting next week, after my 3-day trip to Oregon to meet key customers there...

Tuesday May 10, 2005

Hidden parameters to optimize Oracle 10g on Solaris 10

You may have recently installed Oracle 10g on Solaris 10 and wandered
into the wonderful world of Oracle hidden parameters. Every time Oracle
produces a new vintage of the unbreakable database, we get a bulkload
of new mysterious parameters. To the Oracle DBA's eye, some of them have
a very explicit name (_lgwr_async_io). Some of them have names directly
extracted from a Martian dictionary (see _kghdsidx_count).

Now, of course, your noble intent is to do tuning, not debugging. What
if you obtain a very sexy
"ORA-03113: end-of-file on communication channel"
on your first 1000-user attempt ?

Well, looking into the Oracle Net Dispatcher log, you will see a helpful :

"NS Primary Error: TNS-12535: TNS:operation timed out
NS Secondary Error: TNS-12606: TNS: Application timeout occurred"

So you call Oracle and they tell you : This is a bug, Sir. Please go
into sqlnet.ora and do not specify the SQLNET.INBOUND_CONNECT_TIMEOUT parameter.

One problem fixed.... though only by giving up a documented feature that you should
not use.... great start.

Starting the workload again, and now you observe some FULL TABLE SCANs. Oops...
I know how to fix this one, and here are the "create index" statements.
Unfortunately, the unbreakable database sends you a very rude
ORA-00600 [kcbgtcr_5] or ORA-00600 [kcbgcur_3] error message.

Good thing this young lady from Oracle had the coolest voice in the world,
so it is not a problem to call again. And a certain John answers the phone...
Excuse me, may I speak with Virginia ?... ok, I'll wait.

Yes, this is a bug again (3392439) and to fix it, just type :
"ALTER SYSTEM FLUSH BUFFER_CACHE". Interesting... or you can put this in
your pfile : "_db_cache_pre_warm=false". Oracle is easy.

(By the way, some more 600 errors can occur on Oracle 10g for Solaris x86
and the previous parameters do not fix them. You will need the very
entertaining "_enable_NUMA_optimization = FALSE" to keep going...)

Here we are... my 1000 users are running.
Looking at Statspack and system statistics, I notice a lot of pressure on
the shared pool and latch contention.
First, I made sure I was using ISM with "_use_ism_for_pga = true". Yep...
Then, I discovered that we can now segment the shared pool into multiple separate
zones, each protected by bound latches. How to do this ?
Just say "_kghdsidx_count = 4" and you will get four of those. The maximum
is apparently seven. No idea why.... And I can not find this Martian dictionary.

And running again.... but Oracle is still singing the latch contention hymn.
Could I have a high level of contention on certain blocks ?
To find the culprit, I queried V$LATCH_CHILDREN for the address and joined it
to V$BH to identify the blocks protected by this latch (doing so will show all
blocks that are affected by the warm block).
Two ways to fix this :

 - If this is on an index (use DBA_EXTENTS to find out this common case),
 use a reverse-key index.
 - If not, set _db_block_hash_buckets to the prime number just larger than twice
 the number of buffers.
Do not forget you must have at least one LRU latch for each database writer.
You can increase them with a very elegant "_db_block_lru_latches= xx".
Just tell me why this is undocumented when it appears to be an absolute best practice ?
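The _db_block_hash_buckets rule (the prime just larger than twice the number of buffers) is easy to compute; here is a small sketch (the buffer count is hypothetical) :

```python
def next_prime_above(n):
    """Smallest prime strictly greater than n. Trial division is
    plenty fast for buffer-cache-sized numbers."""
    def is_prime(k):
        if k < 2:
            return False
        i = 2
        while i * i <= k:
            if k % i == 0:
                return False
            i += 1
        return True

    candidate = n + 1
    while not is_prime(candidate):
        candidate += 1
    return candidate

# Hypothetical cache of 262144 buffers: prime just larger than twice that
buckets = next_prime_above(2 * 262144)
```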

And here I am, running again. Now that I fixed the latch issue, the contention
has moved to the log writer. No surprise.
 A new feature of Oracle 10g is log parallelism that you can obtain with :
and the tuning of _log_parallelism_max

Looking further into this, it does not provide full parallelism.
And because this is not a 24x7 production system, it looks like you can also
do a really, really exciting :

(Common sense could have been _log_parallelism_private=true, but these
Oracle engineers like poetry too...)

Oracle did not crash (unbreakable, right ?) and I am running as fast as ever.
I realized later that I really did not need to update v$pga_advice all the time
(_smm_advice_enabled=false) or enable auto-tuning of undo_retention
(_undo_autotune=false), as I really need these CPU cycles for my transactions
and not for the Oracle kernel.

Finally, here I am using the 21st-century software jewel, DTrace,
and realize that I am not using malloc() anymore but mmap(). Great !
But can I tune the mmap byte preallocation.... oh, yes. Here is our final
undocumented pearl : _realfree_heap_pagesize_hint . Only 28 letters, what
do you think ?

Unbreakable, yes ! Simple, not yet ....

Wednesday Apr 20, 2005

RAID-1 vs RAID-5 Part 7

Out-cache results - Random IO - 8 kbytes - 2 x 32 Gbytes

SE3510 - Scalability analysis

This analysis is showing us the impact of concurrency on IOPS performance as well as scalability differences between RAID-1 and RAID-5. Please find below the SE3510 results and charts on test R1 to R5 :

se3510_outcache 1

se3510outcache 2

Observations : RAID-5 is on average 9% slower in read-only, up to 33% slower at 50% read, and 61% slower in a write-only situation. Scalability is good in all cases. The RAID-5 vs RAID-1 difference is stable in percentage for every IO pattern tested; however, the IOPS difference is proportional to the concurrency. For example, at 50% read, a difference of 2123 IOPS is observed between RAID-1 and RAID-5 at 64 threads, while this difference is only 1024 IOPS at 16 threads. If IO is one of the critical components of the architecture performance, choosing RAID-5 vs RAID-1 may already change the end-user experience. A performance note : RAID-1 random write raw performance, at almost 10000 IOPS, is outstanding.

Monday Apr 18, 2005

RAID-1 vs RAID-5 - Part 6

Tonight, we are continuing our in-cache investigation by showing SE9980 results.
Please remember that the intent here is not to compare different IO subsystems, but really to understand how the different RAID algorithms compare.

We already noticed that this is STORAGE DEPENDENT. In fact, while the RAID level does not affect the SE6120 performance in-cache, it does cause RAID-5 to be 20% slower on the SE3510.

What about the SE9980 ?


Observations : When you can fit in the cache, RAID-1 and RAID-5 are delivering similar levels of performance. However, the central cache architecture causes more variability on the SE9980 compared to the other IO subsystems.

Next on this blog, out-cache results.
When we do not fit in-cache anymore, does the behavior change ?

Friday Apr 15, 2005

RAID-1 vs RAID-5 - Part 5

A very interesting and exciting week with customers... and the blog suffers. I can only take time today to continue this blog paper. Send me your comments....


    All in-cache tests have been performed at 64 threads total. Please find below what we observed for tests I1 to I5 :


Observations : As on the SE3510, RAID-1 and RAID-5 are delivering similar performance when you are reading in-cache. In addition, this is still true on the SE6120 when you are reading and writing from the cache. This is a great result, conforming to expectations. Please look at the out-cache results to complete this analysis.

SE9980 in-cache results are next....

Friday Apr 08, 2005

RAID-1 vs RAID-5 Part 4 - SE3510 in-cache results

Here we are.... first benchmark results... let me know your thoughts.

In-cache results - Random IO - 8 kbytes - 2 x 256 Mbytes

We are checking first in-cache performance. On all IO subsystems we are guaranteed to perform all the random IO in-cache after warm-up. (we have 1 Gbyte of cache per controller and are testing one 256 Mbytes raw device per controller.) Warm-up period may vary and is not measured. Refer to the Methodology section if you have more questions on when the reported measures have been taken.


    All in-cache tests have been performed at 64 threads total. Please find below what we observed for tests I1 to I5. Results are provided in IO operations per second :


Observations : There is no penalty for RAID-5 in-cache when you are performing reads only.
Otherwise, RAID-5 is performing 22% less IO than RAID-1 on average for the same percentage
of reads and writes at 64 threads.

Tomorrow, I will show you SE6120 results....

Thursday Apr 07, 2005

RAID-1 vs RAID-5 - Part 3/7


First, let me thank all the wonderful attendees at today's Solaris 10 bootcamp.
I had a great time and I hope you did ... Now, let me continue the RAID-1 vs RAID-5
investigations. I know I promised you some results, but I forgot some important details;
results will hopefully come tomorrow.

6. Physical Disks

The following table describes the physical disks used in each IO subsystem:

Hard disk drive characteristics

[Table lost in conversion - the drives used were Cheetah X15 and Cheetah 4 models; the seek time and other per-drive values are not recoverable.]
Notes : We are not using the same capacity or disk speed on every storage subsystem. This is perfectly compatible with the main goal of this study which is to compare RAID-1 and RAID-5 performance for each IO subsystem and not to provide and compare the best obtainable performance.

7. Software

The following table describes the software testing environment characteristics:

Software Stack

Operating System : Sun Microsystems - Solaris 10 pre-GA
IO testing : Sun Microsystems - Ortera AtlasSP

Notes : To simplify the analysis, no volume manager was used for this study. All tests are done on raw devices configured with the Solaris format command.

8. Methodology

The main purpose of this test was to provide answers to the following questions that are commonly asked by the Sun Microsystems sales force :

  1. Are RAID-1 and RAID-5 providing the same performance level ?

  2. Are concurrency (number of IO threads) or the read/write ratio important factors ?

  3. Are the answers to questions 1 and 2 the same for random IO and sequential IO ?

  4. Does an IO subsystem behave differently in-cache and out-cache ?

  5. Do we have generic answers to these four questions, or does it depend on the IO subsystem ?

The following table describes the different tests performed in order to help us answer the previous five questions :

Tests performed

[The original table (Test Id, IO type, IO size, Read %, Write %, Chunk size) was lost in conversion. From what remains and from the test identifiers used later in this series :

  • Tests R1 to R5 - random IO, 8 kbytes, on 2 x 32 Gbytes raw devices (out-cache), with varying read/write percentages
  • Tests I1 to I5 - random IO, 8 kbytes, on 2 x 256 Mbytes raw devices (in-cache), with varying read/write percentages
  • Two sequential IO tests - 256 kbytes and 64 kbytes IO sizes, on 2 x 32 Gbytes raw devices]

Notes : These twelve tests represent common IO patterns encountered on an Oracle database running a commercial application.

To obtain the results detailed in the next section of this document and to guarantee the accuracy of the information reported, we followed these rules :

  • Workloads are always executed in the same order for a duration of 5 minutes each.

  • The iostat tool is continuously running at an interval of 10 seconds.

  • The IOPS (or Mbytes/s) value reported is the median value obtained during the fifth minute of the test (not the average or the peak value)

  • To provide different concurrency values and a scalability curve, various thread numbers have been tested (from 16 to 128 depending on the situation).
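The reported-value rule above (the median of the 10-second samples taken during the fifth minute, not the average or the peak) can be sketched as follows; the sample values are made up :

```python
def reported_iops(samples_10s):
    """Median of the six 10-second iostat samples taken during the
    fifth minute of a 5-minute run."""
    fifth_minute = samples_10s[24:30]   # samples 25 to 30
    ordered = sorted(fifth_minute)
    mid = len(ordered) // 2
    # Even number of samples: median is the mean of the two middle ones
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical run: 30 IOPS samples, one every 10 seconds
samples = [5000] * 24 + [5100, 4950, 5050, 5020, 4980, 5010]
```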

In-cache benchmark results are next ...Promised

Tuesday Apr 05, 2005

RAID-1 vs RAID-5 on Solaris 10 - Part 2/7

I presented yesterday an introduction and some basic concepts of this paper.
A new style is born : the blog whitepaper. If it is totally unsuccessful, we will
just forget about it. Right ?

3. Objectives

Here are the objectives of this study :
  • Configure the SE6120, SE3510 and SE9980 using twelve physical disks and two 2 Gbit/s front-end interfaces

  • Compare performance and scalability of RAID-1 and RAID-5 for different scenarios. Please note that the choice has been made to test 8k and 256k IO sizes based on the most common Oracle database installations on Sun Solaris. OLTP databases are very often configured with an 8k block size (thus issuing 8k random IO) and DSS databases often settle on a 256k sequential IO size by tuning the Oracle multi-block-read-count parameter.
Note : We are doing our scalability study on two 100%-accessed 32-Gbyte raw devices. Maximum in-cache throughput will also be provided and is obtained on two 256-Mbyte raw devices that typically fit in a 1-Gbyte storage data cache. In fact, you should consider all the random IO results included in this document as worst cases, because our locality of reference is minimal.

4. Hardware

The server is a Sun Fire E6900 with 24 x UltraSPARC IV 1.2 GHz and 96 GB RAM.

The following IO subsystems have been used for this study : SE3510, SE6120 and SE9980.

The Sun StorEdge SE3510 is the flagship product of the Sun StorEdge 3000 family. This product supports 2-Gb Fibre Channel throughout the architecture (midplane, drives, host ports and drive ports).

Key technical features of the SE3510 we used include :

  • Dual-redundant RAID controllers in a 2U high chassis
  • One 2-Gb host port per RAID controller
  • Server is directly attached to the array
  • Out-of-band management via Ethernet connection
  • Configuration control via CLI
  • Two physical luns total using all the available capacity
  • Dual redundant power supplies

The Sun StorEdge 6120 array is a highly available Fibre Channel RAID array with intuitive management software that is designed to simplify storage administration and provisioning.

Key technical features of the SE6120 we used include :

  • High availability configuration provides redundant hardware RAID controllers with mirrored cache and hot-swappable hardware components
  • 2-Gb Fibre channel configuration front-to-back
  • Two high performance RAID controllers with 1-GB data cache each, ECC protection and one 2-Gb fibre channel each
  • Five possible block sizes from 4KB to 64KB
  • Back-end Fault Isolation Task (BEFIT)
  • Explicit LUN failover for multipathing
  • ESM 2.x LE, unlimited RTU
  • Fast Volume Initialization

The Sun StorEdge SE9980 is a high-performance, non-stop operation storage system designed for multi-host applications and very large databases. The advanced components and features of this system represent an integrated approach to data retrieval and storage management. This system is the second-generation design of the unique switched-fabric architecture pioneered in the Sun StorEdge 9960. By doubling the number of data paths between cache switches and increasing the clock speed by over 60%, the total aggregate bandwidth has been increased way beyond the current competition.

Key technical features of the SE9980 we used include :
  • Fully addressable 2GB cache with a separate control cache
  • Fully redundant, hot-swappable components
  • Dual port fibre channel disk drives
  • Global dynamic hot sparing
  • RAID-1 and RAID-5 array groups within the same system
  • Duplexed write cache with battery backup

5.  Configuration details

Here are the details of the IO subsystems configuration as used in this study.

IO subsystems Stack

[Table lost in conversion - only the data cache sizes survived : 1 GB, 2 GB and 2 GB.]

Notes :

  • We are using 12 spindles in the SE3510 and SE6120 and only 8 on the SE9980. This is due to the lack of flexibility of the SE9980, which provides only 2 choices : (2+2) or (4+4) in RAID-1 and (3+1) or (7+1) in RAID-5. In order to run exactly the same workload in every case (and the same benchmarking scripts), we decided to use a configuration that provides a single physical lun per controller.
  • While you can have two 2 Gb/s interfaces on a single fully-populated SE3510, you need two half-populated SE6120 to obtain the same front-end capacity and the same number of physical disks. However, we have verified that the list price of one fully-populated SE3510 is about the same as one partner pair of half-populated SE6120.

Next to come :  in-cache performance results...

Monday Apr 04, 2005

RAID-1 vs RAID-5 on Solaris 10 - Part 1/7

Well, today went so fast I had no chance to write an entry in MrBenchmark's blog.

Sorry to disappoint you, but I will not talk about the best winery of the Napa Valley or the movies I saw last week. Instead, I'd like to start today a series of blog entries on the topic of RAID-1 vs RAID-5 on Solaris 10.

I will start today with an Introduction and some Concepts.

1. Introduction

We are regularly engaged by Fortune 500 companies to assist in determining the right IO subsystem for their future information systems. The task of choosing the appropriate IO subsystem must take into consideration many factors like availability, capacity, performance, heterogeneity or price. Capacity and performance are both the consequences of one major decision : the RAID level.

The main goal of this study is to compare RAID-1 and RAID-5 performance on three important IO subsystems out of the Sun Microsystems available products : the SE6120, SE3510 and SE9980. RAID-0 configurations are not part of this study as they are rarely requested by our customers and not available on the SE9980.

2. Concepts

Please find below the definition of what we freely refer in this document as RAID-1 and RAID-5 :

  • RAID-1 is implemented on modern IO subsystems as hardware RAID-1+0. It allows mirroring and disk striping in one step. The total usable capacity is the capacity of half the drives used to create the Logical Unit (or Lun).
  • RAID-5 implements multi-block striping with distributed parity. This RAID level offers redundancy with the parity information distributed across all disks in the array. Data and its parity are never stored on the same disk. In the event that a disk fails, original data can be reconstructed using the parity information and the information on the remaining disks.

By stating this, we easily realize that performing this comparison apples-to-apples is a difficult task. If we take the storage architect's point of view, we would compare these two layout techniques by capacity. It basically means comparing the performance of a RAID-1 (3+3) lun to a RAID-5 (3+1) lun. From the system engineer's point of view, the problem is rather to help the customer configure a storage subsystem that has already been purchased in the best interest of the business requirements. As an example, if the customer purchased a SE3510 FC array with two RAID controllers, how should this IO subsystem be configured to ensure good performance, low cost per gigabyte and good reliability (in the customer's order) ?

We have chosen the second approach, by spindles, as this is the most common question issued by the sales force. It means that a RAID-1 (3+3) lun will be compared to a RAID-5 (5+1) lun. We will not answer the configuration question directly, but we hope to provide you the data to answer it case-by-case with confidence.
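The capacity consequence of the by-spindle comparison is worth quantifying; here is a small sketch (the 36-GB drive size is hypothetical) :

```python
def usable_capacity_gb(data_drives, redundancy_drives, drive_gb):
    """Usable capacity for the two layouts compared by spindles:
    RAID-1 (mirrored stripes) keeps half the drives' capacity,
    RAID-5 loses one drive's worth to distributed parity."""
    total = data_drives + redundancy_drives
    if redundancy_drives == data_drives:   # RAID-1 (n+n mirror)
        return total * drive_gb // 2
    return (total - 1) * drive_gb          # RAID-5 (n+1 parity)

raid1 = usable_capacity_gb(3, 3, drive_gb=36)   # RAID-1 (3+3): 108 GB usable
raid5 = usable_capacity_gb(5, 1, drive_gb=36)   # RAID-5 (5+1): 180 GB usable
```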

As you have plenty of other things to do, I will stop there. Tomorrow,  I will detail the benchmark environment and clarify my objectives. Then, the rest of week will be used to show you benchmark results.


Friday Apr 01, 2005

Memory page coalescing update and Solaris 10

Well, some of you may remember my December technical brief
talking about Solaris memory page coalescing on high-end servers.
If you don't know what I'm talking about, feel free to send me an email at and I will give you the link.

Since this technical brief, I have received a lot of requests on this topic
that I would like to answer today :

--> Request 1 : What is the list of issues linked to this one ?

Here is the list with bugIds so you have a complete picture :

4802594 - Idle loop degrades IO performance on large psets
5059920 - Idle loop is not scalable on large systems
5054052 - disp_getwork() is greedy and negatively impacts dispatch latency
5050686 - Solaris mutexes should be made more efficient under contention
5095432 - Oracle startup takes too long due to memory fragmentation
5046939 - kcage_freemem grows too large when large ISM segments assigned on SF15k
4904187 - page_freelist_coalesce() holds the page freelist locks for too long

--> Request 2 : You mentioned that the fixes for these issues are available in
IDRs for Solaris 8 and Solaris 9. What is the status of the patches ?

Good news here. Solaris 8 patches have just been released. The fixes for these issues
are available in the following kernel update patches (KUP) :

        Solaris 8       Solaris 9
SPARC   117350-23       117171-17
x86     117351          117172

Now, you are ready to ask me : what about Solaris 10 ?
The answer for Solaris 10 (and Nevada) is : in progress....
I tested some of these issues last month on Solaris 10, and while the
problems are still there (the page_freelist_coalesce() routine is in the common Solaris
code), the impact is much, much lower. As an example, the 10G Oracle startup testcase we
built took 50s on a normal system. With no 4M pages available, it took up to 2 hours on
Solaris 8, up to 15 minutes on Solaris 9 and up to 3 minutes on Solaris 10.

--> Request 3 : It is very complicated to get a memory picture on our system.
    vmstat or sar data are not detailed enough. Can you help ?

No need here for complex packaged tools. The best kept secret of Solaris is
the numerous options of mdb. So if you write a little script like :

# Displaying the memory map...
echo Browsing memory...
mdb -k 2>/dev/null <<!
::memstat
!
date

You will get this output :

    Browsing memory...

    Page Summary                Pages                MB  %Tot
    ------------     ----------------  ----------------  ----
    Kernel                      36480               285    1%
    Anon                        12891               100    0%
    Exec and libs                5106                39    0%
    Page cache                 208799              1631    5%
    Free (cachelist)           139913              1093    3%
    Free (freelist)           3780231             29533   90%

    Total                     4183420             32682
    Physical                  4116397             32159

    Fri Apr  1 10:18:16 PST 2005

Cool !

--> Request 4 : Solaris 10 provides updated memory structures, and the page freelist is now available. Can we use it to get the amount of free 4M pages ?
This request came last week from the VOS escalation team. And the answer is : yes, but it requires a close look at how the page_freelists are implemented to get the right number.
We worked on this question with my good friend Mike C. in December, and here is
the updated script for Solaris 10 (yes, mdb again) :

# Walking the page_freelists in Solaris 10 to get the amount of 4M pages...
mdb -k 1>/tmp/1 2>&1 <<!
page_freelists+30::array uintptr_t 1 | \
::print uintptr_t | ::array uintptr_t 0t18 | \
::print uintptr_t | ::array uintptr_t 0t2 | \
::print uintptr_t | ::grep ".!=0" | ::list page_t p_vpnext
!
cat /tmp/1 | grep -v failed | wc -l

And on my v490, I have :

v490 # ./

That's it for now...


