Thursday, Jan. 24, 2013

simple tracing of bad frees using DTrace

#pragma D option quiet

inline int stack_size = 3; inline int error_stack_size = 10;

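/* Remember the requested size and show who is allocating. */
pid$target:libc:malloc:entry
{
    self->alloc_size = arg0;
    ustack(stack_size);
}

/* Successful allocation: record the returned address as live. */
pid$target:libc:malloc:return
/self->alloc_size && arg1 != 0/
{
    printf("alloc of size [%d] on addr 0x%p\n", self->alloc_size, arg1);
    allocations[arg1] = 1;
}

pid$target:libc:malloc:return
/self->alloc_size && arg1 == 0/
{
    printf("alloc of size [%d] failed\n", self->alloc_size);
}

/* Free of an address never returned by malloc, or already freed.
   This clause must come before the next one, which clears the entry. */
pid$target:libc:free:entry
/arg0 != 0 && !allocations[arg0]/
{
    printf("free of unallocated addr 0x%p\n", arg0);
    ustack(error_stack_size);
}

/* Free of a live address: forget it, so that a second free is reported. */
pid$target:libc:free:entry
/arg0 != 0 && allocations[arg0]/
{
    printf("free of addr 0x%p\n", arg0);
    allocations[arg0] = 0;
}

pid$target:libc:free:entry
/arg0 == 0/
{
    printf("free of NULL addr !!\n");
    ustack(error_stack_size);
}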
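The script relies on $target, so DTrace has to be told which process to follow. A minimal sketch of how it could be launched, assuming the script is saved as badfree.d (a hypothetical file name, as are the two helper names); -s, -p and -c are the standard dtrace(1M) options:

```sh
#!/bin/sh
# Hypothetical wrapper around the D script of this post.
# BADFREE_SCRIPT is an assumed location, not something the post defines.
BADFREE_SCRIPT=${BADFREE_SCRIPT:-./badfree.d}

# Attach to an already running process: $target becomes that pid.
trace_frees_of_pid() {
    dtrace -s "$BADFREE_SCRIPT" -p "$1"
}

# Start a command under DTrace: $target becomes the new process.
trace_frees_of_command() {
    dtrace -s "$BADFREE_SCRIPT" -c "$*"
}
```

For instance "trace_frees_of_pid 1234" follows an existing process, while "trace_frees_of_command ./myprog" traces a program from its very first allocation.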

Tracing processes

What is this process doing?

Tracing a running process to find out what it is doing is also a good way to prevent a crash or to understand what caused one.

With DTrace, processes can be traced a lot more efficiently, I agree, but for that you need a DTrace script or a DTrace expert who can write D instructions on the fly. That is not the case for everybody, and when the question is as simple as "which files are opened by that process?" or "which port number is it trying to connect to?", a simple call to a tool like truss (strace on Linux) is much quicker.

Several commands are available. The man pages are usually good, and a drawing is more efficient than my poor English, so I will only use pictures which are (I hope) self-explanatory.
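For the two questions quoted above (opened files, connection attempts), something like the following is usually enough. This is only a sketch: trace_io is a made-up helper name, the list of system calls given to truss is an assumption (recent Solaris releases open files through openat), and enough privileges are needed to attach to the target process.

```sh
#!/bin/sh
# Hypothetical helper: show the file and network calls of a running process.
# Uses truss where it exists (Solaris), otherwise falls back to strace (Linux).
trace_io() {
    pid=$1
    if command -v truss >/dev/null 2>&1; then
        # -f follows children, -t restricts the trace to the listed system calls
        truss -f -t open,openat,connect -p "$pid"
    else
        # -f follows children, -e trace= selects the file and network call classes
        strace -f -e trace=file,network -p "$pid"
    fi
}
```

Called as "trace_io 1234", it prints one line per matching system call until it is interrupted.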

Using truss(1)

[picture not preserved]

Using whocalls(1)

[picture not preserved]

Using apptrace(1)/sotruss(1)

[pictures not preserved]


memory problems with libumem

Debuggers like dbx are quite powerful for diagnosing memory problems (try "help rtc" inside dbx).

The problem with debuggers is that they are not always available on the host you are working on.

Besides, when you have to diagnose a problem in a shared library, using a debugger may become a little heavy.


A nice alternative on Solaris is libumem. You can get debugging information about memory usage just by setting LD_PRELOAD.

Here are pictures describing the basic usage. For more information see the libumem(3LIB) man page. The first example demonstrates a leak and the second one an invalid memory access.

[pictures not preserved]
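A minimal sketch of the basic usage. The environment variables and the mdb dcmds are the ones described in umem_debug(3MALLOC) and libumem(3LIB); the helper names, and the idea of working on a core file taken with gcore, are assumptions made for the illustration.

```sh
#!/bin/sh
# Run a command with libumem preloaded and its debugging features switched on.
run_with_umem() {
    LD_PRELOAD=libumem.so.1 UMEM_DEBUG=default UMEM_LOGGING=transaction "$@"
}

# Report leaked buffers found in a core file.
#   $1 = core file of a process started through run_with_umem
umem_leaks() {
    echo "::findleaks" | mdb "$1"
}

# Check the umem caches of a core file for corrupted buffers.
#   $1 = core file of a process started through run_with_umem
umem_verify() {
    echo "::umem_verify" | mdb "$1"
}
```

A typical session would be "run_with_umem ./myprog &", then "gcore" on its pid, then "umem_leaks" on the resulting core file.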



Thursday, Aug. 14, 2008

tiny zone to run apache

Even though the communities around Solaris Zones and the available documentation are great, I had difficulty answering this simple question:
how do I create a zone as small as possible to run only Apache/MySQL in it?


I just needed a web server (and MySQL) to run a site with very limited activity, all of it contained in a zone. So I needed a zone with a minimal footprint which would not disturb my host, which is not a beast.


After trying to aggregate everything I have read, I am not pretending that this is the perfect answer, but here is what I have done (I used a ZFS pool already present on my host):

#zfs create tank/tinyzone
#zfs set mountpoint=/tinyzone tank/tinyzone
#zfs set quota=500M tank/tinyzone
...
#dispadmin -d FSS   (need a reboot/init 6)
...
#zonecfg -z tinyzone
zonecfg:tinyzone> create
zonecfg:tinyzone> set zonepath=/tinyzone
zonecfg:tinyzone> set autoboot=true
zonecfg:tinyzone> set scheduling-class=FSS
zonecfg:tinyzone> set ip-type=shared
(on my host /opt contains lots of packages)
zonecfg:tinyzone> add inherit-pkg-dir
zonecfg:tinyzone:inherit-pkg-dir> set dir=/opt
zonecfg:tinyzone:inherit-pkg-dir> end
zonecfg:tinyzone> add net
zonecfg:tinyzone:net> set address=x.x.x.x
zonecfg:tinyzone:net> set physical=bge0
zonecfg:tinyzone:net> set defrouter=x.x.x.x
zonecfg:tinyzone:net> end
(global zone will receive a lot more shares)
zonecfg:tinyzone> set cpu-shares=1
zonecfg:tinyzone> add capped-memory
zonecfg:tinyzone:capped-memory> set physical=512M
zonecfg:tinyzone:capped-memory> set swap=512M
zonecfg:tinyzone:capped-memory> end
zonecfg:tinyzone> verify
zonecfg:tinyzone> commit
zonecfg:tinyzone> exit
...
#chmod 700 /tinyzone
#zonecfg -z tinyzone info
zonename: tinyzone
zonepath: /tinyzone
brand: native
autoboot: true
bootargs:
pool:
limitpriv:
scheduling-class: FSS
ip-type: shared
[cpu-shares: 1]
inherit-pkg-dir:
        dir: /lib
inherit-pkg-dir:
        dir: /platform
inherit-pkg-dir:
       dir: /sbin
inherit-pkg-dir:
        dir: /usr
inherit-pkg-dir:
        dir: /opt
net:
        address: 1.2.3.4
        physical: bge0
        defrouter: 1.2.3.1
capped-memory:
        physical: 512M
        [swap: 512M]
rctl:
        name: zone.cpu-shares
        value: (priv=privileged,limit=1,action=none)
rctl:
        name: zone.max-swap
        value: (priv=privileged,limit=536870912,action=deny)

Once the zone is installed, we can see that the disk space used is quite small:

zfs  get available,used tank/tinyzone
NAME            PROPERTY   VALUE           SOURCE
tank/tinyzone  available  381M            -
tank/tinyzone  used       119M            -

Now use an SMF profile to disable all unused services (i.e. in my case all but Apache and MySQL):

#svccfg extract > /tmp/tinyprofile
... go through the list inside /tmp/tinyprofile to disable everything not needed
#cp /tmp/tinyprofile /tinyzone/root/var/svc/profile/site.xml

Once the zone is booted and configured (using sysidcfg), check the Apache and MySQL web sites for performance tuning, et voilà :-)

prstat -Z
-----------------------------------------------------------
ZONEID    NPROC  SWAP   RSS MEMORY      TIME  CPU ZONE                       
     0      131  704M  847M    33%   1:42:30  15% global                     
     4       32  126M   85M   3.3%   0:00:18 0.1% tinyzone

 

Monday, July 28, 2008

taking snapshots and creating templates of Xen domains

For testing purposes I used VMware images. One of the advantages was being able to take snapshots of an image
and to turn these snapshots into templates for future image creation.

When developing on Windows platforms, all this saves you a lot of time.

Now I am using Xen on the latest release of Solaris. This is great, but I was missing snapshots and templates.
Thanks to ZFS I am happy again.

The trick is simple: use a ZFS volume to create each guest and just use the cloning and snapshot features of ZFS.

In this example I use a file to create the ZFS pool. This is not reliable, but it is enough for testing.

1 - Create a file.

#mkfile -n 10g /my-image-file


2 - Create a pool on that file.

#zpool create mypool /my-image-file
#zpool status mypool
pool: mypool
state: ONLINE
scrub: none requested
config:

NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
/my-image-file ONLINE 0 0 0


3 - Create a volume on the newly created pool.

#zfs create -V 10g mypool/myhostdisk


4 - Create the guest on this new volume.

virt-install --name=foo --hvm --file-size=10 --ram=1024 --os-type=windows --vnc --cdrom=win.iso --file=/dev/zvol/dsk/mypool/myhostdisk

That's it! Each time you want a snapshot, shut down the guest and take a snapshot of the volume.
Cloning the snapshot gives you templates.
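The snapshot and clone steps can be sketched as follows. The helper names, and the snapshot and clone names used in the usage note, are placeholders chosen for the example; the guest is assumed to be shut down first.

```sh
#!/bin/sh
# Take a point-in-time snapshot of a guest volume.
#   $1 = volume (for instance mypool/myhostdisk), $2 = snapshot name
snapshot_guest_disk() {
    zfs snapshot "$1@$2"
}

# Create a new writable volume from a snapshot, to be used as a template copy.
#   $1 = volume, $2 = snapshot name, $3 = name of the new volume
clone_guest_disk() {
    zfs clone "$1@$2" "$3"
}

# Roll a guest volume back to a snapshot. Without extra flags zfs only
# accepts the most recent snapshot of the volume.
#   $1 = volume, $2 = snapshot name
rollback_guest_disk() {
    zfs rollback "$1@$2"
}
```

For instance "snapshot_guest_disk mypool/myhostdisk clean" followed by "clone_guest_disk mypool/myhostdisk clean mypool/newguestdisk" gives a second volume which a new guest can use in the same way as the first one.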

Monday, Apr. 21, 2008

Problem about corrupted cacao SMF service

On Solaris 10 and later the Common Agent Container (CAC) is registered under SMF. Sometimes (mainly because of rough or incorrect use of the packaging tools) the service configuration gets corrupted, and messages like the following ones may appear when trying to start the CAC.

#cacaoadm start
svcs: Pattern 'svc:/application/management/common-agent-container-1:default' doesn't match any instances
Error when reseting SMF service maintenance state:
[svc:/application/management/common-agent-container-1:default].
Error when trying to start SMF service:
[svc:/application/management/common-agent-container-1:default].

 

One way to recover from that is to uninstall the package and to clean up the repository if needed.

Note that the CAC configuration remains on the host across installations. Also note that this operation will not solve every possible corruption, but I hope it will help most people facing that issue.

  • uninstall the package
#pkgrm SUNWcacaort
  • remove any stale service registrations
Be careful to only remove the FMRIs related to your installation.

#svcs svc:/application/management/common-agent-container\*
disabled Dec_03 svc:/application/management/common-agent-container-4:default
disabled Dec_03 svc:/application/management/common-agent-container-6:default
disabled Dec_03 svc:/application/management/common-agent-container-7:default
disabled Dec_03 svc:/application/management/common-agent-container-8:default
online Apr_02 svc:/application/management/common-agent-container-9:default


#svcprop -p common-agent-container/basedir svc:/application/management/common-agent-container\*
svc:/application/management/common-agent-container-9:default/:properties/common-agent-container/basedir astring /
svc:/application/management/common-agent-container-8:default/:properties/common-agent-container/basedir astring /
svc:/application/management/common-agent-container-7:default/:properties/common-agent-container/basedir astring /var/tmp/root/cacao_2.2/
svc:/application/management/common-agent-container-6:default/:properties/common-agent-container/basedir astring /var/tmp/kkoteste/cacao_2.1/
svc:/application/management/common-agent-container-4:default/:properties/common-agent-container/basedir astring /var/tmp/cacao_2.0/


Here, 4, 6 and 7 are obviously not ours (their basedir is not /), so only 8 and 9 are deleted:

svccfg delete svc:/application/management/common-agent-container-8:default
svccfg delete svc:/application/management/common-agent-container-9:default


  • install the package back on the host.
#pkgadd ....
  • check everything is back to green.
#cacaoadm status
default instance is DISABLED at system startup.
default instance is not running. 


 

 

How do I wait for a module to be ready?

A common mistake is to confuse the status of the Common Agent Container with the status of the modules deployed inside it.

As with any container, the CAC may be in good shape while the modules deployed in it are not.

The command "cacaoadm status" gives you the status of the container, but to get the actual state and status of a module the command is "cacaoadm status <module name>".

The container may be in three states:

  • stopped
# cacaoadm status
default instance is DISABLED at system startup. 
default instance is not running.
  • started
# cacaoadm status
default instance is DISABLED at system startup.
Smf monitoring process:
13400
Uptime: 0 day(s), 0:16
  • starting
# cacaoadm status
default instance is DISABLED at system startup.
Daemon is running but not available. Try again as it might be starting.

The container may still be starting: it runs its own initialization phase, but it is mainly busy starting the modules deployed inside it (see here for details). The container is ready to serve as soon as an uptime is printed.

All this has nothing to do with the status of the deployed modules. The container will start a module, but the module may not actually be ready to serve:

  • The module's creation (deployment) may have failed because of missing jars or an incompatible platform.
  • The module may not have been started because of a dependency problem.
  • The module may not have been started because it is locked by default.
  • The module may have failed to start because of an error raised during its initialization.
  • etc.

If an application depends on one or more services deployed in the CAC (i.e. on one or more modules), it must monitor the status of the modules it is interested in, not the status of the container itself.

This is done with "cacaoadm status <module name>".
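An application start script could poll the module along these lines. This is a sketch under assumptions: wait_for_module and WAIT_DELAY are made-up names, and readiness is detected by looking for the "Operational State:ENABLED" line that cacaoadm prints for a healthy module.

```sh
#!/bin/sh
# Wait until a module reports "Operational State:ENABLED".
#   $1 = module name, $2 = maximum number of attempts (default 30)
# WAIT_DELAY (seconds between attempts, default 2) is a hypothetical knob.
# Returns 0 when the module is ready, 1 if it never became ready.
wait_for_module() {
    module=$1
    tries=${2:-30}
    while [ "$tries" -gt 0 ]; do
        if cacaoadm status "$module" 2>/dev/null |
            grep "Operational State:ENABLED" >/dev/null; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "${WAIT_DELAY:-2}"
    done
    return 1
}
```

For instance "wait_for_module com.sun.cacao.rbac 10 || exit 1" gives up after ten attempts instead of waiting forever on a module that failed to load.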

A few examples of module state/status:

  • Module up and running.
# cacaoadm status com.sun.cacao.rbac
Operational State:ENABLED
Administrative State:UNLOCKED
Availability Status:[]
Module is in good health.
  • Module only enabled on demand (off line for now)
#cacaoadm status com.sun.cacao.efd
Module com.sun.cacao.efd has not been loaded.
Cause of the problem:[OFF_LINE]
  • Module registered and enabled but which had a problem starting
#cacaoadm status com.sun.cacao.instrum
Operational State:DISABLED
Administrative State:UNLOCKED
Availability Status:[OFF_LINE]
  • Module not loaded by the container due to errors
#cacaoadm status com.sun.scn.base.SCNBase
Module com.sun.scn.base.SCNBase has not been loaded.
Cause of the problem:[FAILED]
  • Module not started by the container due to dependency problem
#cacaoadm status com.sun.scn.SolarisAssetModule
Module com.sun.scn.SolarisAssetModule has not been loaded.
Cause of the problem:[DEPENDENCY]

Another tip for module developers is to declare dependencies correctly. One basic example is connectors. If a module offers a service through the RMI connector (i.e. the client part of the application accesses it only through the RMI connector), and the developer knows that the module's entire logic is down when that connector is unavailable, then the module descriptor file should define a dependency on the RMI module.

Thursday, Feb. 21, 2008

why does the Common Agent Container (CACAO) take a long time to start or stop?

One question we get from time to time is about the time the container takes to start.

The answer, in 99.9% of cases, is: because of one of the modules deployed inside.

The remaining 0.1% is due to sub-processes, as explained here.

The Common Agent Container loaded with only its core modules never takes more than a few seconds to start (including the start of the underlying JVM).

For instance, on a host with 2 x UltraSPARC-IIIi (1280 MHz) and 2.50 GB of memory, the container initialisation takes 4 to 5 seconds. A big part of this is spent creating and starting modules.

As an example, here are statistics displayed using a DTrace script.

                   modules - time (secs/milliseconds)
    startAdaptorsAndConnectors -             312
 registerAdaptorsAndConnectors -            1375
          register all modules -           82003
               container start -           86893


In this case DTrace is really not the recommended way to take statistics, as the JVM is slowed down considerably, but it gives you an idea of how the time is distributed.

Everything in the container start sequence (this is also true for the shutdown) is done sequentially:

container initialisation

-> connectors/adaptors creation

-> start modules in order (dependency order)

-> start connectors (connections are then allowed)

-> stop modules in order

-> stop connectors

container finalization

 

The benefit of this is simplicity, and it eases the life of module developers. A module cannot be disturbed while it is performing tricky actions like initialisation and cleanup: as the connectors are not open, no request can be received.

The disadvantage is that if one module takes a long time to complete, everything else is blocked waiting.

The initialisation and finalisation of a module (its start and stop methods) are supposed to be as short as possible and bounded in time. If a module has to perform actions which may block (DB connection, remote host connection, execution of an interactive process...), the developer should implement timeouts and/or use an additional thread dedicated to the "dangerous" phases.

By doing this there is no risk of blocking or slowing down the container, and all deployed modules will enjoy being started on time.

 

Knowing that, here is what you can do to identify the root cause of a slow start or stop of the container.

When this happens on Solaris 10 and above, the CLI command "/usr/sbin/cacaoadm start" takes a long time to return. On other systems (Solaris 9, Linux, HP-UX, Windows) the CLI returns, but the container takes a long time to be ready to serve. When issuing the "cacaoadm status" command you will see messages like:

default instance is DISABLED at system startup.
14834
14835
Daemon is running but not available. Try again as it might be starting.

 

Just having a look at the log file should help you find out what happened.

First set a high level on the following filters.

cacaoadm set-filter --persistent com.sun.cacao.element=ALL

cacaoadm set-filter --persistent com.sun.cacao.container.impl=ALL

 

The container initialization will be logged, but also (the most interesting part) the beginning and the end of each module's initialisation. You have to look for entries from the unsynchronizedDeployModule method and entries from the ElementSupport class. Here is an example with the RMI module:

 ....
 Feb 21, 2008 6:08:49 PM com.sun.cacao.container.impl.ContainerPrivate unsynchronizedDeployModule
FINER: Add the descriptor Module name: com.sun.cacao.rmi
Descriptor full path:/usr/lib/cacao/lib/tools/template/modules/com.sun.cacao.rmi.xml
 Description: This module manages the secure and unsecure RMI connectors.
 Version: 1.0
 Module class: com.sun.cacao.rmi.impl.RMIModule
 Initial admin State: UNLOCKED
 Heap: 100
 Instances Descriptor:
 Dependencies: []
 Parameters: {}
 Cacao version supported: 2.1
 Private paths: []
 Public paths: [file:../../../../lib/cacao_rmi.jar]
 Library paths: []
 Ignored at startup: false
 Enable on demand: false
Feb 21, 2008 6:08:50 PM com.sun.cacao.element.ElementSupport setAdministrativeState
FINE: Administrative state change to UNLOCKED : com.sun.cacao:type=ModuleManager,instance="com.sun.cacao.rmi"
Feb 21, 2008 6:08:50 PM com.sun.cacao.element.ElementSupport setAdministrativeState
FINE: Administrative state change to UNLOCKED : com.sun.cacao:type=module,instance="com.sun.cacao.rmi"
....

As you can see, the start of this module began at 6:08:49 and ended more or less at 6:08:50.

You now know that the RMI module took around 1 second to start.

An administrative state set to UNLOCKED for a module means "started".

Looking at the log file like this should give you the name of the culprit.
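On a long log file, a small filter makes the gaps easier to spot. The sketch below assumes the two-line entry layout shown in the excerpt (a header line with date and time, then the message); module_start_times is a made-up name and the log file location depends on the installation.

```sh
#!/bin/sh
# Print "<time> <AM/PM> <module>" for every module that reached UNLOCKED.
#   $1 = path of the container log file
module_start_times() {
    awk '
        # Header line such as: Feb 21, 2008 6:08:50 PM <class> <method>
        $5 == "AM" || $5 == "PM" { stamp = $4 " " $5; next }
        # State change of the module itself (not of its ModuleManager)
        /Administrative state change to UNLOCKED/ && /type=module,/ {
            n = split($0, part, "instance=")
            name = part[n]
            gsub(/"/, "", name)
            print stamp, name
        }
    ' "$1"
}
```

A large jump between two consecutive times in the output points at the module which was starting in between.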

High-level filters can be time and disk consuming. When your investigation is completed, you should set them back to their default value (in our case null, as only com.sun.cacao is set by default).
 

cacaoadm set-filter --persistent com.sun.cacao.element=null

cacaoadm set-filter --persistent com.sun.cacao.container.impl=null

 




Wednesday, Feb. 6, 2008

relocatable binaries and scripts

Once, somebody asked me how to relocate scripts and binaries in a safe manner. Depending on what you do, it may not be as easy as looking at argv[0]. I hope this helps somebody else.


  • From shell script.
  
#!/bin/sh

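where_am_i () {
# directory part of the command, as typed by the caller
cmd=`dirname "$1"`
case ${cmd} in
    /*)
    # absolute path: use it as is
    cannon_cmd=${cmd}
    ;;
    */*|.)
    # relative path (including a plain "."): prepend the current directory
    _pwd=`pwd 2>/dev/null`
    cannon_cmd="${_pwd}/${cmd}"
    ;;
    *)
    echo "ousp..."
    exit 1
    ;;
esac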
echo "$cannon_cmd" | awk -F/ '{
    realpath="";
    skip_it=0;
    for (i=NF; i>0; i--) {
        if ($i         == ".")                         continue;
        if (length($i) == 0&& i != 1)       continue;
        if ($i         == "..") {skip_it++; continue;}
        if (skip_it    > 0 )    {
            skip_it--;
            continue;
        }
        if (length(realpath) > 0) {
                   realpath=$i"/"realpath;
                } else {
                   realpath=$i;
                }
    }
    } END {print realpath}' 2>/dev/null
}

res=`where_am_i "$0"`

echo "resolved command line <$res>"

 

This will produce output like:

#/tmp/a.sh
resolved command line </tmp/>
#/var/tmp/..//../tmp/a.sh
resolved command line </tmp/>
#/var/tmp/..//../tmp/./a.sh
resolved command line </tmp/>


  • from BAT file

    @ECHO OFF
    setlocal

    set "canonBaseDir=%~dp0"

     ....

    endlocal

  •  From perl module

 

use FindBin;
use File::Basename qw(dirname);
use File::Spec::Functions qw(catfile canonpath);

# add our module repository to perl @INC
use lib "$FindBin::RealBin";

#.... later in a sub module

my ($mod_package,$mod_name) = split (/::/,__PACKAGE__);
 
  my $modnameToPath=catfile($mod_package,$mod_name);
  $modnameToPath.=".pm";
 
  foreach my $v (@INC) {   
    my $loaded_file = catfile($v,$modnameToPath);
    if ( -e $loaded_file ) {
      # we were loaded from $loaded_file
      $relocation_base = canonpath(dirname($loaded_file));
      last;
    }
  }




 

  • From a binary on Solaris

#include <dlfcn.h>
#include <stdio.h>

Dl_info dli;
/* ask the runtime linker which object contains main() */
dladdr((void *)&main, &dli);
printf("resolved cmd : %s\n", (char *)dli.dli_fname);


  • From a binary on Linux
max = pathconf("/", _PC_PATH_MAX);
buf = (char *)malloc(max + 1);
end = readlink("/proc/self/exe", buf, max);
if (end < 0) end = 0;   /* readlink failed: print an empty path */
buf[end] = '\0';
printf("resolved cmd : %s\n", buf);

  • From a binary on AIX
struct ld_info *myld_info;
loadquery(L_GETINFO, buf, size);
myld_info = (struct ld_info *)buf;
printf("resolved cmd : %s\n", myld_info->ldinfo_filename);

  • From a binary on HP-UX
struct shl_descriptor *desc;
shl_gethandle(PROG_HANDLE, &desc);
/* Note: may need to prepend pwd to desc->filename */
printf("resolved cmd : %s\n", desc->filename);



  • From a win32 binary
GetModuleFileName(NULL, buffer, <buffer length>);
printf("resolved cmd : %s\n", buffer);



Tuesday, Feb. 5, 2008

How to get Common Agent Container/detached jvm dump

One of the frequently asked questions is: my JVM just crashed but I cannot find any core dump or log.

All this may depend on the host and/or JVM configuration. The Common Agent Container is a service and therefore runs in the background, so it is not as easy to get traces as with a console application.

There are several methods I have used. Here they are; just pick the one you prefer (some do not work on all JDKs).


Note that if you are used to a Java debugger like the one in NetBeans, it may be useless
for you to read this page :-)


  • Getting the JVM stack output using truss/strace.
Just go here.

  • Enabling core file dumps on the host.
By default on systems like Solaris, core file dumps are disabled.
Use this with caution: enabling core files can cause security issues and/or fill up your disk.
It is recommended to set everything back to the defaults once the investigation is done.
  • On Solaris :
See coreadm(1M) man page for details.
    #coreadm -e log
    #coreadm -e process
    #coreadm -e global
    #coreadm -G all
    #coreadm -i /var/tmp/core.%f.%p.%t
    #coreadm -g /var/tmp/core.%f.%p.%t
    #coreadm -u

It is even better to change the configuration only for the running JVM.
    #coreadm -p /var/tmp/core.%f.%p.%t -P all  <jvm pid>
  • On Linux :
  • set the core file limit in /etc/profile.
Replace any line like
ulimit -S -c 0 > /dev/null 2>&1
by something like
[ `/usr/bin/id -u` -eq 0 ] && ulimit -c unlimited
  • set the core file pattern (cf. proc(5) for details)
echo 1 > /proc/sys/kernel/core_uses_pid
echo /var/tmp/core.%e.%p.%t > /proc/sys/kernel/core_pattern
you can also use the sysctl command:
sysctl -w kernel.core_pattern=/var/tmp/core.%e.%p.%t

  • Setting the JVM error file path on the command line.
Starting with JDK 6, you can set the path of the error file. This is done with the
-XX:ErrorFile option, which you can add to the Common Agent Container Java flags.

#cacaoadm set-param java-flags=-XX:ErrorFile=/var/tmp/myDump `cacaoadm get-param java-flags --value`

  • Using DTrace to catch JVM faults.

Using the fault probe of the proc provider you can print the stack trace each time the JVM
triggers one. I have written a small DTrace script that you can use. For all running JVMs it will
print output like the following in case of failure. In this example I hacked a JNI call to trigger
a segmentation fault. (get the script)


[java:  4213/  2] experiencing fault <address not mapped to object>, signal 11

              libc.so.1`0xff1c5a70
              libc.so.1`0xff2103ec
              Interpreter
              com/sun/cacao/agent/auth/UserPrincipal.internalGetUid(Ljava/lang/String;)I
              com/sun/cacao/agent/auth/UserPrincipal.internalGetUid(Ljava/lang/String;)I
              com/sun/cacao/agent/auth/UserPrincipal.getUid()I
              com/sun/cacao/agent/auth/AssertMechanism.createSubject(Ljava/lang/String;Z)Ljavax/security/auth/Subject;
              com/sun/cacao/container/impl/ContainerPrivate.internalStart(Ljava/lang/String;)V
              com/sun/cacao/container/impl/ContainerPrivate.start(Ljava/lang/String;Ljava/lang/String;)V
              com/sun/cacao/container/impl/ContainerPrivate.main([Ljava/lang/String;)V
              StubRoutines (1)
              libjvm.so`__1cJJavaCallsLcall_helper6FpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v_+0x1e4
              libjvm.so`jni_CallStaticVoidMethod+0x4b8
              java`0x13a4c
              libc.so.1`0xff245e28

              unix`trap_cleanup+0x24
              unix`trap+0x1b84
              unix`utl0+0x4c
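
The script itself is not reproduced in this page. A rough approximation of the idea, not the original script: it assumes that the JVM process is named "java", it prints the numeric fault code instead of the readable message shown above, and it only prints the user stack. watch_jvm_faults is a made-up name.

```sh
#!/bin/sh
# Print the Java-aware user stack of any "java" process which takes a fault.
# Must run with enough privileges to use DTrace; stop it with Ctrl-C.
watch_jvm_faults() {
    dtrace -q -n '
        proc:::fault
        /execname == "java"/
        {
            /* args[0] is the fault code, see <sys/fault.h> */
            printf("[%s: %d/%d] fault code %d\n", execname, pid, tid, args[0]);
            jstack(20);
        }'
}
```

jstack() is used rather than ustack() so that interpreted and compiled Java frames are shown with their method names.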







Sunday, Jan. 13, 2008

start/stop of Common Agent Container hangs

One question which is raised more and more often is:

"Why can't I stop/start the Common Agent Container?"
or
"Why does the /usr/sbin/cacaoadm stop/start command never return?"

Users facing this are on Solaris 10/11. It is due to user modules executing sub-processes and not cleaning up properly when they are asked to stop. Another cause is a wrong implementation of a module's start/stop method.

There are at least two rules that a module must follow:

  • The start and stop methods of a module must be relatively quick and must always return.
The Common Agent Container starts the registered modules sequentially. Its overall
start sequence is only completed when all registered modules have started.
The same holds, in reverse, for the container shutdown.
If a module never returns from its start/stop method, it blocks the whole sequence
and the container never finishes starting or stopping. A module which has a complex initialisation/termination phase (like connecting to a database) should
delegate every action that may hang to a separate thread.

  • The stop method of a module must ensure that all resources it created are cleaned up.
A module can be locked or unlocked (stopped/started) dynamically at runtime. Its life cycle may not be aligned with the container life cycle. This implies that the
module must clean up after itself: if a module leaves things behind,
then depending on its logic a future (re)start may fail.


This is particularly true for modules executing sub-processes on Solaris 10 and later,
where the Common Agent Container is registered as an SMF service. Processes
created by a service live in a process contract (see contract(4)).
A process contract has a lifetime; by default it lasts as long as there are
processes running in it (i.e. the contract is not empty).
So, like any other service, the Common Agent Container has an associated contract.

# svcs -p -o  CTID,SVC svc:/application/management/common-agent-container-1:default
CTID   SVC
  4924 application/management/common-agent-container-1
               18:40:23     2536 launch
               18:40:23     2537 java

# /usr/bin/ptree -c 2537
[process contract 1]
  1     /sbin/init
    [process contract 4]
      7     /lib/svc/bin/svc.startd
        [process contract 4924]
          2536  /usr/lib/cacao/lib/tools/launch -w /usr/lib/cacao -f -U root -G sys -- /usr/jdk
            2537  /usr/jdk/jdk1.6.0_03/bin/java -Xms4M -Xmx128M -classpath /usr/share/lib/jdmk/jd

SMF considers the service as terminated only when the associated process contract becomes empty. A user executing sub-processes from within a module (using Runtime.exec(...)) populates the contract with new processes.

As an example I have written a simple module executing a ping command. You can see that the ping process (2787) is now part of the contract.


ctstat -v -i `/usr/bin/svcs -H -o  CTID svc:/application/management/common-agent-container-1:default`
CTID    ZONEID  TYPE    STATE   HOLDER  EVENTS  QTIME   NTIME  
4924    0       process owned   7       0       -       -      
        cookie:                0x20
        informative event set: none
        critical event set:    core signal hwerr empty
        fatal event set:       none
        parameter set:         inherit regent
        member processes:      2536 2537 2785 2787
        inherited contracts:   none


# pargs 2787
2787:   /usr/sbin/ping -s 127.0.0.1
argv[0]: /usr/sbin/ping
argv[1]: -s
argv[2]: 127.0.0.1


When I try to stop the Common Agent Container, the associated svcadm(1M) command hangs waiting for the service to terminate, i.e. waiting for the contract to become empty.
The Common Agent Container is properly stopped (the JVM is no longer running), but the command does not return
until I kill the orphan process (pid 2787). You can also notice that the service state is now wrong: the real logic is gone but the state is still online.


# /usr/sbin/cacaoadm stop
(we are hanging here...)

# ptree -c `pgrep -x cacaoadm`

[process contract 1]

  1     /sbin/init

   [process contract 4]

     7     /lib/svc/bin/svc.startd

       [process contract 87]

         1346  gnome-terminal

           27663 /bin/bash

             27699 -bash

               2842  /bin/sh /usr/sbin/cacaoadm stop

                 2951  /usr/sbin/svcadm disable -st svc:/application/management/common-agent-container

svcs -p -o  CTID,STATE,SVC svc:/application/management/common-agent-container-1:default
CTID   STATE          SVC
  4924 online\*        application/management/common-agent-container-1
               18:52:26     2787 ping


#kill -INT 2787
(the cacaoadm command now completes)


What should users do about that?
There are two solutions:
  • The module keeps a list of all the commands it has executed and terminates those still running
in its stop method.
  • All sub-commands are executed using the container's InvokeCommand or UserProcess class. These two helpers take care of
sub-contracts: the sub-command is launched in its own contract using the ctrun(1) command.
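
From a shell, the second approach boils down to something like the sketch below. The ctrun options are the ones visible in the process listings of this post; the helper names are made up, and contract_members relies on the -c (contract id) option of the Solaris pgrep.

```sh
#!/bin/sh
# Run a command in its own process contract so that it does not stay in
# the contract of the service which launched it.
run_in_own_contract() {
    # -l child: ctrun ends when the command itself exits
    # -o pgrponly: on a fatal contract event only the process group
    #              of the failing process is killed
    ctrun -l child -o pgrponly "$@"
}

# List the processes still living in the contract of a service.
#   $1 = service FMRI
contract_members() {
    ctid=$(svcs -H -o CTID "$1" | tr -d ' ')
    pgrep -c "$ctid"
}
```

Running contract_members on the service before stopping it shows which processes would keep the contract alive, and therefore keep the stop command waiting.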

I ran the previous example again, but this time a second ping command is launched using the InvokeCommand class.
You can see two ping commands: the one launched through ctrun runs in its own contract, while the other one stays in the contract of the service.

# svcs -p -o  CTID,STATE,SVC svc:/application/management/common-agent-container-1:default
CTID   STATE          SVC
  4947 online         application/management/common-agent-container-1
               19:32:33     3590 launch
               19:32:33     3591 java
               19:33:42     3744 ctrun
               19:33:42     3745 ping

# ptree 3591
3590  /usr/lib/cacao/lib/tools/launch -w /usr/lib/cacao -f -U root -G sys -- /usr/jdk
  3591  /usr/jdk/jdk1.6.0_03/bin/java -Xms4M -Xmx128M -classpath /usr/share/lib/jdmk/jd
    3744  /usr/bin/ctrun -l child -o pgrponly /usr/lib/cacao/lib/tools/suexec /usr/sbin/p
      3746  /usr/sbin/ping -s localhost
# ptree -c 3591
[process contract 1]
  1     /sbin/init
    [process contract 4]
      7     /lib/svc/bin/svc.startd
        [process contract 4947]
          3590  /usr/lib/cacao/lib/tools/launch -w /usr/lib/cacao -f -U root -G sys -- /usr/jdk
            3591  /usr/jdk/jdk1.6.0_03/bin/java -Xms4M -Xmx128M -classpath /usr/share/lib/jdmk/jd
              3744  /usr/bin/ctrun -l child -o pgrponly /usr/lib/cacao/lib/tools/suexec /usr/sbin/p
                [process contract 4950]
                  3746  /usr/sbin/ping -s localhost
# ptree -c 3745
[process contract 1]
  1     /sbin/init
    [process contract 4]
      7     /lib/svc/bin/svc.startd
        [process contract 4947]
          3745  /usr/sbin/ping -s 127.0.0.1



When the application is a container which may run external code executing sub-processes,
developers should keep contracts in mind. For C applications libcontract(3LIB) is there to help; for Java applications this becomes a little more problematic.

Another point which can be a problem is that the critical events of a process contract can include the core dump of a process or a signal received by it. Take the
Common Agent Container for instance: if a process executed by one of the registered modules crashes or is killed by a signal, the entire contract will be restarted, and so will the container... oops...






Friday, Aug. 10, 2007

my new colleague

Here is the picture of my new colleague at Sun.

I think he lied about his resume but he looks motivated....

 

Thursday, Aug. 9, 2007

jumping sockets

One day I faced an X-Files situation on Windows.

(Now you should hear the tudu..dudu..duduuuu sound from the trailer... no?

...never mind :-))


One server was listening for connections on a specific address and port. So far so good...

Another instance of that server was launched (bound to the same address/port by mistake), and I suddenly
saw a client communication already started with the first server ending up with the second one.


As you can imagine: at first I was afraid... I was petrified...

I scratched my head for days... (unless you are already bald, don't try :-)).

Reading the Microsoft documentation for the 23343123rd time, I found the solution. The reuse-address option
of Windows sockets (Winsock) has a tiny difference which makes a big one in the end: binding a socket
to an address already in use is allowed, not only when the previous socket has just been closed, as on Unix, but at any time.
The second server was stealing the socket from the first.

I am still confused about some details, and there are surely some parts which I did not get,
but what I am sure of is that removing the SO_REUSEADDR flag solved my issue.


Usual shell command lines translated for windows

I had to work on Windows recently. As a standard Unix guy, I thought that you cannot use the command line on Windows and that you have to install Cygwin, MKS, etc.

I was curious and started to dig into cmd.exe usage. I discovered that the Windows command line is far from being poor.

For some aspects I would even say that cmd.exe may be no less powerful than sh.

Here is a list of the usual commands I use, translated for the Windows command line.

I hope this helps someone.

People crying for awk while scripting on Windows should have a closer look at "FOR /?" :-)

Unix:    /bin/ls -l
Windows: dir /N /Q

Unix:    grep -i ^bla myfile.txt
Windows: findstr /B /I "bla" myfile.txt

Unix:    grep -v -i ^bla myfile.txt
Windows: findstr /V /B /I "bla" myfile.txt

Unix:    find /tmp -name '*.c' | xargs grep blo
Windows: findstr /S blo C:\tmp\*.c

Unix:    find /tmp -name '*.c'
Windows: dir /S /B C:\tmp\*.c

Unix:    echo $?
Windows: echo %ERRORLEVEL%

Unix:    ls *.c 2>/dev/null
Windows: dir *.c 2>NUL

Unix:    pwd
Windows: echo %CD%

Unix:    echo $0
Windows: echo %CMDCMDLINE%

Unix:    exit 12
Windows: exit /B 12

Unix:    dirname $1
Windows: %~p1

Unix:    basename $1
Windows: %~nx1

Unix:    which $1
Windows: %~dp$PATH:1

Unix:    env | grep TERM
Windows: set TERM 2>NUL

Unix:    su - foo -c command
Windows: runas /profile /user:foo command

Unix:    rsh myHost -l foo command
Windows: rexec myHost -l foo command

Unix:    for file in `find /tmp -name '*.txt'`; do echo `basename $file`; done
Windows: for /R C:\TEMP %i in (*.txt) DO echo %~nxi

Unix:    mount
Windows: net use

Unix:    mount foo:/bar /mnt/bar
Windows: net use Z: \\foo\bar [/USER:username /PERSISTENT:NO]

Unix:    diff file1 file2
Windows: fc file1 file2

Unix:    echo -n "Enter a letter : " ; read foo
Windows: set /P foo="Enter a letter : "

Unix:    shutdown -g 120 -y -i 6
Windows: shutdown /r /l /f /dp:0:0


Bungee jumping at Ponsonnas

A few years ago (2000) I tried bungee jumping at Ponsonnas.

I still remember it!

PS: not to be tried with Velcro sneakers...

If you nod "yes" fast enough, it makes a movie :-)

 

[series of photos not preserved]

 

About

Emmanuel Jannetti blog
