Start/stop of the Common Agent Container hangs

A question that comes up more and more often is:

"Why can't I stop/start the Common Agent Container?"
or
"Why does the /usr/sbin/cacaoadm stop/start command never return?"

Users facing this are on Solaris 10/11. It is usually caused by user modules that execute sub-processes and
do not clean up properly when they are asked to stop. Another cause is a wrong implementation of a module's start/stop methods.

There are at least two rules that a module must follow:

  • The start and stop methods of a module must be relatively quick and must always return.
The Common Agent Container starts the registered modules sequentially. Its overall
start sequence is complete only when all registered modules have started.
The same applies to the container shutdown: it is complete only when all modules have stopped.
If a module never returns from its start/stop method, it blocks the whole sequence
and the container never finishes its start/stop sequence. A module which has a complex initialisation/termination phase (like connecting to a database, ...) should
delegate every action that may hang to a separate thread.

  • The stop method of a module must ensure that all resources it created are cleaned up.
A module can be locked or unlocked (started/stopped) dynamically at runtime. Its life-cycle may not be aligned with the container life-cycle. This implies that the
module must clean up whatever it creates.
I.e. if a module leaves things behind,
then depending on its logic, a future (re)start of it may fail.
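
The first rule can be sketched in plain Java as follows. This is only an illustration: NonBlockingModule, slowInit and isReady are names made up for the example and are not part of the Common Agent Container API. A real module would apply the same idea inside its own start/stop methods.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Hypothetical module skeleton (NOT the real container API):
 * start() and stop() hand slow work to a worker thread and return quickly.
 */
final class NonBlockingModule {
    private final Runnable slowInit;   // e.g. connecting to a database
    private final AtomicBoolean ready = new AtomicBoolean(false);
    private ExecutorService worker;

    NonBlockingModule(Runnable slowInit) {
        this.slowInit = slowInit;
    }

    /** Returns at once; the slow initialisation runs in the background. */
    synchronized void start() {
        if (worker != null) {
            return;                    // already started
        }
        worker = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "module-init");
            t.setDaemon(true);         // never keep the JVM alive
            return t;
        });
        worker.submit(() -> {
            slowInit.run();
            // Only report readiness if stop() did not interrupt us meanwhile.
            if (!Thread.currentThread().isInterrupted()) {
                ready.set(true);
            }
        });
    }

    /** Bounded stop: interrupt the worker and wait at most one second. */
    synchronized void stop() {
        if (worker == null) {
            return;                    // nothing to clean up
        }
        worker.shutdownNow();          // interrupts a still-running init
        try {
            worker.awaitTermination(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        worker = null;
        ready.set(false);
    }

    boolean isReady() {
        return ready.get();
    }
}
```

Even if the initialisation never completes, start() returns immediately and stop() gives up after a bounded wait, so the container sequence is never blocked by this module.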


This is particularly true for modules executing sub-processes on Solaris 10 and later.
On these releases the Common Agent Container is registered as an SMF service. Processes
created by a service live in a process contract (see contract(4)).
A process contract has a lifetime; by default it lasts as long as there are processes
running in it (i.e. the contract is not empty).
So, like any other service, the Common Agent Container has an associated contract.

# svcs -p -o  CTID,SVC svc:/application/management/common-agent-container-1:default
CTID   SVC
  4924 application/management/common-agent-container-1
               18:40:23     2536 launch
               18:40:23     2537 java

# /usr/bin/ptree -c 2537
[process contract 1]
  1     /sbin/init
    [process contract 4]
      7     /lib/svc/bin/svc.startd
        [process contract 4924]
          2536  /usr/lib/cacao/lib/tools/launch -w /usr/lib/cacao -f -U root -G sys -- /usr/jdk
            2537  /usr/jdk/jdk1.6.0_03/bin/java -Xms4M -Xmx128M -classpath /usr/share/lib/jdmk/jd

SMF considers the service "terminated" only when the associated process contract becomes empty. A user executing sub-processes from within a module (using Runtime.exec(...)) populates the contract with new processes.

As an example, I wrote a simple module executing a ping command. You can see that the ping process (2787) is now part of the contract:


# ctstat -v -i `/usr/bin/svcs -H -o  CTID svc:/application/management/common-agent-container-1:default`
CTID    ZONEID  TYPE    STATE   HOLDER  EVENTS  QTIME   NTIME  
4924    0       process owned   7       0       -       -      
        cookie:                0x20
        informative event set: none
        critical event set:    core signal hwerr empty
        fatal event set:       none
        parameter set:         inherit regent
        member processes:      2536 2537 2785 2787
        inherited contracts:   none


# pargs 2787
2787:   /usr/sbin/ping -s 127.0.0.1
argv[0]: /usr/sbin/ping
argv[1]: -s
argv[2]: 127.0.0.1


When I try to stop the Common Agent Container, the associated svcadm(1M) command hangs waiting for the service to terminate, i.e. waiting for the contract to become empty.
The Common Agent Container is properly stopped (the JVM is no longer running) but the command never returns
until I kill the orphan process (pid = 2787). You can also notice that the service state is now wrong: the real logic has gone but the state is still online.


# /usr/sbin/cacaoadm stop
(we are hanging here...)

# ptree -c `pgrep -x cacaoadm`
[process contract 1]
  1     /sbin/init
   [process contract 4]
     7     /lib/svc/bin/svc.startd
       [process contract 87]
         1346  gnome-terminal
           27663 /bin/bash
             27699 -bash
               2842  /bin/sh /usr/sbin/cacaoadm stop
                 2951  /usr/sbin/svcadm disable -st svc:/application/management/common-agent-container

# svcs -p -o  CTID,STATE,SVC svc:/application/management/common-agent-container-1:default
CTID   STATE          SVC
  4924 online*        application/management/common-agent-container-1
               18:52:26     2787 ping


# kill -INT 2787
(the cacaoadm command now completes)


What should users do about that?
There are two solutions:
  • The module keeps a list of all the commands it has executed and terminates those still running
in its stop method.
  • All sub-commands are executed using the container's InvokeCommand or UserProcess class. These two helpers execute the command while taking care of
the sub-contract: the sub-command is launched in its own sub-contract using the ctrun(1) command.
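
A minimal sketch of the first solution, in plain Java, could look like this. ChildProcessRegistry and its methods are hypothetical names chosen for the example. It assumes Java 9 or later (for ProcessBuilder.Redirect.DISCARD) and that the module calls stopAll() from its stop method.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: remembers every child started by a module. */
final class ChildProcessRegistry {
    private final List<Process> children = new ArrayList<>();

    /** Starts a command and records it so that it can be terminated later. */
    synchronized Process exec(String... command) {
        try {
            Process p = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    // Discard output so the child never blocks on a full pipe.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            children.add(p);
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Terminates every child still running; returns how many were stopped. */
    synchronized int stopAll() {
        int stopped = 0;
        for (Process p : children) {
            if (!p.isAlive()) {
                continue;                       // already gone
            }
            p.destroy();                        // polite termination first
            try {
                if (!p.waitFor(2, TimeUnit.SECONDS)) {
                    p.destroyForcibly();        // then insist
                    p.waitFor(2, TimeUnit.SECONDS);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                p.destroyForcibly();
            }
            stopped++;
        }
        children.clear();
        return stopped;
    }

    /** Number of recorded children that are still running. */
    synchronized int liveCount() {
        int live = 0;
        for (Process p : children) {
            if (p.isAlive()) {
                live++;
            }
        }
        return live;
    }
}
```

Note that Process.destroy() only signals the direct child. If that child forks further processes, they stay in the contract, which is one reason the second solution is more robust.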

I ran the previous example again, but this time a second ping command is launched using the InvokeCommand class.
You can see two ping commands: the one launched through InvokeCommand (pid 3746, under ctrun) runs in its own contract (4950), while the one started directly (pid 3745) still belongs to the container's contract (4947).

# svcs -p -o  CTID,STATE,SVC svc:/application/management/common-agent-container-1:default
CTID   STATE          SVC
  4947 online         application/management/common-agent-container-1
               19:32:33     3590 launch
               19:32:33     3591 java
               19:33:42     3744 ctrun
               19:33:42     3745 ping

# ptree 3591
3590  /usr/lib/cacao/lib/tools/launch -w /usr/lib/cacao -f -U root -G sys -- /usr/jdk
  3591  /usr/jdk/jdk1.6.0_03/bin/java -Xms4M -Xmx128M -classpath /usr/share/lib/jdmk/jd
    3744  /usr/bin/ctrun -l child -o pgrponly /usr/lib/cacao/lib/tools/suexec /usr/sbin/p
      3746  /usr/sbin/ping -s localhost
# ptree -c 3591
[process contract 1]
  1     /sbin/init
    [process contract 4]
      7     /lib/svc/bin/svc.startd
        [process contract 4947]
          3590  /usr/lib/cacao/lib/tools/launch -w /usr/lib/cacao -f -U root -G sys -- /usr/jdk
            3591  /usr/jdk/jdk1.6.0_03/bin/java -Xms4M -Xmx128M -classpath /usr/share/lib/jdmk/jd
              3744  /usr/bin/ctrun -l child -o pgrponly /usr/lib/cacao/lib/tools/suexec /usr/sbin/p
                [process contract 4950]
                  3746  /usr/sbin/ping -s localhost
# ptree -c 3745
[process contract 1]
  1     /sbin/init
    [process contract 4]
      7     /lib/svc/bin/svc.startd
        [process contract 4947]
          3745  /usr/sbin/ping -s 127.0.0.1



When the application is a container which may run external code executing sub-processes,
developers should keep contracts in mind. For C applications, libcontract(3LIB) is there to help; for Java applications this becomes a little more problematic.
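
For Java code that cannot use the container helpers, one option is to prefix the command with ctrun(1) by hand. The sketch below is not the InvokeCommand implementation. CtrunCommand is a hypothetical name, the ctrun path and the "-l child -o pgrponly" flags are simply copied from the ptree output above (see ctrun(1) for their exact meaning), and the suexec wrapper visible in that output is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: runs a command in its own process contract via ctrun(1). */
final class CtrunCommand {
    // Path and flags as they appear in the ptree output of the container.
    static final String CTRUN = "/usr/bin/ctrun";

    /** Builds the full argument list: ctrun, its flags, then the user command. */
    static List<String> wrap(String... command) {
        if (command.length == 0) {
            throw new IllegalArgumentException("empty command");
        }
        List<String> full = new ArrayList<>(
                Arrays.asList(CTRUN, "-l", "child", "-o", "pgrponly"));
        full.addAll(Arrays.asList(command));
        return full;
    }

    /** Returns a ProcessBuilder ready to start the wrapped command (Solaris only). */
    static ProcessBuilder builder(String... command) {
        return new ProcessBuilder(wrap(command));
    }
}
```

For example, CtrunCommand.builder("/usr/sbin/ping", "-s", "localhost").start() would launch ping under ctrun, so the ping process lands in a new contract instead of the container's one.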

Another point which can be a problem is that the critical events of a process contract can include the core dump of a process or a signal received by it. For instance, let's take the
Common Agent Container case. If a process executed by one of the registered modules crashes or is killed by a signal, the entire contract will be restarted, and so will the container... oops...






About

Emmanuel Jannetti blog
