DTrace and determining the zones' limiting factor

I have started to investigate what the limiting factor on the Ultra 10 is at 167 zones. Using svcadm, I have disabled the name service cache in every zone. That has let me boot all 171 created zones (I'll have to create more).

Here is the vmstat output of the "idle" Ultra 10:

 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr dd dd f0 s1   in   sy   cs us sy id
 0 0 0 4920928 729616 1   4  8  6  8  0 14  0  1  0  0  402  113  161  0  1 98
 0 0 0 1773368 44480  0  15 15  0  0  0  0  0  0  0  0  409  325  619  1  6 92
 0 0 0 1773320 44432  0   0  0  0  0  0  0  0  0  0  0  401  356  676  2  5 92
 0 0 0 1773264 44376  0   1  0  0  0  0  0  0  0  0  0  399  319  518  2  6 92
 0 0 0 1773384 44496  0   1  0  0  0  0  0  0  0  0  0  405  233  500  0  4 95

Run queue looks good. Virtual memory looks good. Occasional page faulting. On this rather overloaded host, I expected more activity. System and user time together are taking 5-8% of the CPU. Hmmm. A quick run of prstat shows:
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 16253 root     7696K 7320K cpu0    19    0   0:00:01 4.4% prstat/1
 15060 root     8568K 1720K sleep   59    0   0:00:09 0.1% sshd/1
 15102 root     2936K 1984K sleep   59    0   0:00:01 0.0% bash/1
   223 root     4032K 2040K sleep   59    0   0:02:11 0.0% nscd/41
  8826 root     6384K 1608K sleep   59    0   0:00:29 0.0% svc.startd/13
  7782 root     6424K 1552K sleep   59    0   0:00:31 0.0% svc.startd/12
  6988 root     6304K 1464K sleep   59    0   0:00:33 0.0% svc.startd/12
  5950 root     6304K 1440K sleep   59    0   0:00:33 0.0% svc.startd/12
  4380 root     6256K 1384K sleep   59    0   0:00:32 0.0% svc.startd/12
  2037 root     6448K 1240K sleep   59    0   0:00:33 0.0% svc.startd/12
 29486 root     6360K 1208K sleep   59    0   0:00:33 0.0% svc.startd/12
 29244 root     6224K 1216K sleep   59    0   0:00:34 0.0% svc.startd/12
 29002 root     6568K 1208K sleep   59    0   0:00:32 0.0% svc.startd/12
 27792 root     6288K 1256K sleep   59    0   0:00:33 0.0% svc.startd/13
 26582 root     6288K 1264K sleep   59    0   0:00:33 0.0% svc.startd/12
 14054 root     6224K 2840K sleep   59    0   0:00:11 0.0% svc.startd/12
 23098 root     6096K 1304K sleep   59    0   0:00:34 0.0% svc.startd/12
 13262 root     6432K 3088K sleep   59    0   0:00:11 0.0% svc.startd/12
 21415 root     6240K 1368K sleep   59    0   0:00:33 0.0% svc.startd/14
 21175 root     6240K 1304K sleep   59    0   0:00:34 0.0% svc.startd/12
 20450 root     6224K 1304K sleep   59    0   0:00:34 0.0% svc.startd/12
 20180 root     6304K 1296K sleep   59    0   0:00:34 0.0% svc.startd/13
 17449 root     6144K 1384K sleep   59    0   0:00:34 0.0% svc.startd/12
 16938 root     6160K 1320K sleep   59    0   0:00:33 0.0% svc.startd/12
 16674 root     6224K 1336K sleep   59    0   0:00:34 0.0% svc.startd/12
  9435 root     6096K 1464K sleep   59    0   0:00:41 0.0% svc.startd/13
 12657 root     6304K 1880K sleep   59    0   0:00:12 0.0% svc.startd/12
  6378 root     6296K 1336K sleep   59    0   0:00:39 0.0% svc.startd/12
  3843 root     6304K 1336K sleep   59    0   0:00:41 0.0% svc.startd/12
  3248 root     6160K 1336K sleep   59    0   0:00:39 0.0% svc.startd/13
  3090 root     6360K 1336K sleep   59    0   0:00:39 0.0% svc.startd/13
  1785 root     6176K 1464K sleep   59    0   0:00:39 0.0% svc.startd/12
   583 root     6584K  976K sleep   59    0   0:02:15 0.0% svc.startd/12
     7 root     6480K  392K sleep   59    0   0:03:09 0.0% svc.startd/11
...
I wonder what svc.startd is doing. Not much, I'm sure, but multiplied by 171 ... hmmm ... Let's throw DTrace, along with my inferior intellect, at the problem. I'm still at the early RTFM stage of DTrace, so let's see what I can find out. First, let's see who is making the most system calls:
/* count system calls by the name of the calling process */
syscall:::entry
{
   @func[execname] = count();
}
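
(If you are following along, saving the script to a file, say syscount.d, and running it with dtrace -s syscount.d should do it; hit Ctrl-C after a few seconds and the aggregation is printed.)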

That results in:
  sshd                                                             16
  nscd                                                             23
  sac                                                              42
  utmpd                                                            54
  ttymon                                                           63
  automountd                                                      124
  init                                                            150
  inetd                                                           170
  svc.configd                                                     171
  dtrace                                                          417
  svc.startd                                                     1735

No surprise here based on what we know from the vmstat and prstat. But what is svc.startd doing? Let's throw together another DTrace script to show what system calls svc.startd is making:
/* count system calls made by svc.startd, by system call name */
syscall:::entry
/execname == "svc.startd"/
{
   @func[probefunc] = count();
}

After a couple of seconds, this results in:
  lwp_park                                                        112
  portfs                                                          855

What has all of this accomplished? Don't know exactly :) I'm out of time, and I realize that I shouldn't necessarily be focusing on the number of times each call is made, but on the time spent in each call. My 'D' isn't quite that good yet.
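
For what it's worth, I think the time-based version would look something like this (just a sketch I haven't verified yet, pairing each entry probe with its return and summing the elapsed nanoseconds per system call):

syscall:::entry
/execname == "svc.startd"/
{
   /* remember when this thread entered the system call */
   self->ts = timestamp;
}

syscall:::return
/execname == "svc.startd" && self->ts/
{
   /* sum elapsed nanoseconds per system call */
   @time[probefunc] = sum(timestamp - self->ts);
   self->ts = 0;
}

Something to try next time.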

Gotta go!

Comments:

John, what are the symptoms that you see when you "hit the limit"? You've got an awful lot of idle there, so looking at what is getting on-proc isn't where I would be looking first.

So that you can do comparisons between runs, you may also find the tick probe useful.

i.e.

tick-10s { exit(0);}

This will exit the script after 10 seconds, also printing out your aggregations.
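
For example, combined with the counting script above (a minimal sketch):

syscall:::entry
{
   @func[execname] = count();
}

/* stop after 10 seconds; the aggregation prints on exit */
tick-10s
{
   exit(0);
}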

If you can tell me what exactly "hitting the limit" looks like, I'd be happy to try to give you a hand.

Posted by Alan Hargreaves on December 30, 2004 at 05:34 AM PST #

I could use the hand (especially with the process of determination).

Yeah, I got side-tracked trying to figure out what was going on with that 5-8% system/user time on the "idle" box. That was a baseline for playing around with DTrace. Right now I am enabling nscd in each zone, trying to get back to where things became unstable. I'm up to 150 zones with nscd running and things are looking good (outside of massive paging and a wide-ranging scan rate). I've got a feeling I am going to get all 171 running, so I have no idea what happened last time at 168. If I get to 171, I'll have to create more zones and move forward.

Posted by John Clingan on December 30, 2004 at 06:18 AM PST #
