Wednesday Apr 29, 2009

UnsatisfiedLinkError: no jss4 in java.library.path

As support engineer, considerable amount of my time goes in trying to replicate customer problems in the labs within Sun. When setting up things in the lab, sometimes I run into other issues that I have to overcome before I can possibly reproduce the actual problem reported by the customer.

Here is one such error I ran into when  I installed  Access Manager 7.1 from JES 5 on Appserver 8.2.

Installation of the compoents went through fine but when the web container was restarted,  it was throwing the error:  java.lang.UnsatisfiedLinkError: no jss4 in java.library.path

To overcome this, I had to add the following 3 entries to AMConfig.properties and then restart the container.

com.iplanet.security.SecureRandomFactoryImpl=com.iplanet.am.util.SecureRandomFactoryImpl
com.iplanet.security.SSLSocketFactoryImpl=netscape.ldap.factory.JSSESocketFactory
com.iplanet.security.encryptor=com.iplanet.services.util.JCEEncryption

Snippet of the exception in the application server log

[#|2009-04-13T16:50:59.482+1000|SEVERE|sun-appserver-pe8.2|javax.enterprise.system.container.web|_ThreadID=10;|WebModule[/amserver]Servlet /amserver threw load() exception javax.servlet.ServletException: no jss4 in java.library.path
at org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:300)
at org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:165)
at org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:118)
at org.apache.catalina.core.StandardWrapper.loadServlet(StandardWrapper.java:1093)
at org.apache.catalina.core.StandardWrapper.load(StandardWrapper.java:931)
at org.apache.catalina.core.StandardContext.loadOnStartup(StandardContext.java:4183)
at org.apache.catalina.core.StandardContext.start(StandardContext.java:4535)
at com.sun.enterprise.web.WebModule.start(WebModule.java:241)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1086)
at org.apache.catalina.core.StandardHost.start(StandardHost.java:847)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1086)
at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:483)
at org.apache.catalina.startup.Embedded.start(Embedded.java:894)
at com.sun.enterprise.web.WebContainer.start(WebContainer.java:741)at com.sun.enterprise.web.PEWebContainer.startInstance(PEWebContainer.java:515)
at com.sun.enterprise.web.PEWebContainerLifecycle.onStartup(PEWebContainerLifecycle.java:54)
at com.sun.enterprise.server.ApplicationServer.onStartup(ApplicationServer.java:300)
at com.sun.enterprise.server.PEMain.run(PEMain.java:292)
at com.sun.enterprise.server.PEMain.main(PEMain.java:218)
----- Root Cause -----
java.lang.UnsatisfiedLinkError: no jss4 in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1682)
at java.lang.Runtime.loadLibrary0(Runtime.java:822)
at java.lang.System.loadLibrary(System.java:993)
at org.mozilla.jss.CryptoManager.loadNativeLibraries(CryptoManager.java:1339)
at org.mozilla.jss.CryptoManager.initialize(CryptoManager.java:827)
at com.iplanet.services.util.JSSEncryption.<clinit>(JSSEncryption.java:250)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:164)
at com.iplanet.services.util.Crypt.createInstance(Crypt.java:133)
at com.iplanet.services.util.Crypt.<clinit>(Crypt.java:103)
at com.iplanet.services.ldap.LDAPUser.getPasswd(LDAPUser.java:117)
at com.iplanet.services.ldap.ServerInstance.getPasswd(ServerInstance.java:128)
at com.sun.identity.security.ServerInstanceAction.run(ServerInstanceAction.java:92)
at java.security.AccessController.doPrivileged(Native Method)
at com.iplanet.am.util.AdminUtils.<clinit>(AdminUtils.java:82)
at com.sun.identity.sm.SMSEntry.<clinit>(SMSEntry.java:169)
at com.sun.identity.sm.ServiceManager.<clinit>(ServiceManager.java:74)
at com.sun.identity.authentication.service.AuthD.<init>(AuthD.java:247)
at com.sun.identity.authentication.service.AuthD.getAuth(AuthD.java:507)
at com.sun.identity.authentication.UI.LoginLogoutMapping.initializeAuth(LoginLogoutMapping.java:89)
at com.sun.identity.authentication.UI.LoginLogoutMapping.init(LoginLogoutMapping.java:74)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:249)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAsPrivileged(Subject.java:517)
at org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:282)
at org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:165)
at org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:118)
at org.apache.catalina.core.StandardWrapper.loadServlet(StandardWrapper.java:1093)
at org.apache.catalina.core.StandardWrapper.load(StandardWrapper.java:931)
at org.apache.catalina.core.StandardContext.loadOnStartup(StandardContext.java:4183)
at org.apache.catalina.core.StandardContext.start(StandardContext.java:4535)
at com.sun.enterprise.web.WebModule.start(WebModule.java:241)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1086)
at org.apache.catalina.core.StandardHost.start(StandardHost.java:847)
at org.apache.catalina.core.StandardHost.start(StandardHost.java:847)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1086)
at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:483)
at org.apache.catalina.startup.Embedded.start(Embedded.java:894)
at com.sun.enterprise.web.WebContainer.start(WebContainer.java:741)
at com.sun.enterprise.web.PEWebContainer.startInstance(PEWebContainer.java:515)
at com.sun.enterprise.web.PEWebContainerLifecycle.onStartup(PEWebContainerLifecycle.java:54)
at com.sun.enterprise.server.ApplicationServer.onStartup(ApplicationServer.java:300)
at com.sun.enterprise.server.PEMain.run(PEMain.java:292)
at com.sun.enterprise.server.PEMain.main(PEMain.java:218)

 

Monday Feb 25, 2008

Access Manager Session State Diagram

Here is the session state diagram  I found in the access manager code, session.java.

  \* The following is the state diagram for a session:
\*
\* |
\* |
\* |
\* V
\* ---------------invalid
\* | |
\* | |creation (authentication OK)
\* | |
\* |max login time | max idle time
\* |destroy V ------------------->
\* | valid inactive --
\* | | <------------------ |
\* | | reactivate |
\* | | |
\* | | logout | destroy
\* | | destroy | max session time
\* | | max session time |
\* | V |
\* --------------> destroy <---------------------





 

Monday Dec 17, 2007

why is the stacktrace in gdb different to the one in dbx ?

After I posted my last blog, another idea struck me, how about I try live debugging with dbx ?

I downloaded Sun Studio 12 and installed on my lab system running Red Hat Enterprise Linux ES release 4 (Nahant Update 4). When I started apache in dbx, it crashed the first time around but subsequent starts did not crash. Even when it crashed, I couldn't get the symbols working correctly in dbx. What's the point in debugging live if I couldn't read the symbols correctly ?

I decided not to spend time on live debugging and focus instead on crash analysis.  But I was in for another surprise that is -- the stacktrace output in  dbx and gdb looked very different. I just couldn't figure out why the stacktrace output on gdb and dbx looked so different ?

The backtrace output in gdb looked as follows:

(gdb) bt
#0  0x003a9194 in spin () from /root/web_agents/apache22_agent/lib/libamapc22.so
#1  0x024dedc1 in PR_Select () from /usr/lib/libnspr4.so
#2  0x00b96371 in start_thread () from /lib/tls/libpthread.so.0
#3  0x009fdffe in clone () from /lib/tls/libc.so.6 

 
Now, compare that with the stacktrace in dbx ?

(dbx)

t@null (l@1) terminated by signal SEGV (Segmentation fault)
0x00929b7d: __mmap+0x000d: movl %edx,%ebx
Current function is apr_dso_load
139 os_handle = dlopen(path, flags);


(dbx) where
[1] __mmap(0xcbf000, 0x71d0, 0x5, 0x802, 0xa, 0x0), at 0x929b7d
[2] _dl_map_object_from_fd(0xbfe57e80, 0x90b9528, 0x90b9200, 0x2, 0x80000000, 0xbfe57e78), at 0x91d809
[3] _dl_map_object(0x2, 0x0, 0x80000000, 0x0, 0xbfe583b4, 0x44fd66), at 0x91f0bc
[4] openaux(0xbfe583b4, 0x80000000, 0x0, 0xbfe583b4, 0xb7f3d6bc, 0x922d50), at 0x922d98
[5] _dl_catch_error(0xbfe583b4, 0x24, 0xbfe584b0, 0xbfe58188, 0x92dfd4, 0xbfe583e8), at 0x9240fe
[6] _dl_map_object_deps(0x0, 0x80000000, 0x90000102, 0x0, 0xa5cff4, 0x0), at 0x92303a
[7] dl_open_worker(0xbfe585b0, 0x0, 0x0, 0x0, 0xb7f3d6bc, 0xa31fe0), at 0xa32159
[8] _dl_catch_error(0xbfe585b0, 0x90, 0xffffffff, 0x0, 0x970f98, 0xbfe58b40), at 0x9240fe
[9] __GI__dl_open(0xfffffffe, 0x0, 0x0, 0x92dfd4, 0xbfe58750, 0x92dca0), at 0xa32cb8
[10] dlopen_doit(0xbfe58750, 0x0, 0x0, 0xbfe58c44, 0xb7f3d6bc, 0xa88c70), at 0xa88cb8
[11] _dl_catch_error(0xbfe58750, 0xa88f50, 0x48, 0xe3736c, 0x905b0a8, 0x90c973a), at 0x9240fe
[12] _dlerror_run(0x0, 0x0, 0x31, 0x0, 0x90c9740, 0x102), at 0xa892bb
[13] __dlopen_check(0x90c9740, 0x102, 0x90c9739, 0xe23650, 0x905b0a8, 0x1), at 0xa88d11
=>[14] apr_dso_load(res_handle = <value not available>, path = <value not available>, pool = <value not available>), line 139 in "dso.c"
[15] load_module(cmd = <value not available>, dummy = <value not available>, modname = <value not available>, filename = <value not available>), line 238 in "mod_so.c"
[16] invoke_cmd(cmd = (nil), parms = (nil), mconfig = <value not available>, args = <value not available>), line 890 in "config.c"
[17] ap_build_config_sub(p = <value not available>, temp_pool = <value not available>, l = <value not available>, parms = <value not available>, current = <value not available>, curr_parent = <value not available>, conftree = <value not available>), line 1436 in "config.c"
[18] ap_build_config(parms = (nil), p = <value not available>, temp_pool = <value not available>, conftree = (nil)), line 1219 in "config.c"
[19] process_resource_config_nofnmatch(s = <value not available>, fname = (nil), conftree = <value not available>, p = <value not available>, ptemp = <value not available>, depth = <value not available>), line 1629 in "config.c"
[20] ap_process_resource_config(s = <value not available>, fname = <value not available>, conftree = <value not available>, p = <value not available>, ptemp = <value not available>), line 1661 in "config.c"
[21] include_config(cmd = (nil), dummy = <value not available>, name = (nil)), line 2584 in "core.c"
[22] invoke_cmd(cmd = (nil), parms = (nil), mconfig = <value not available>, args = <value not available>), line 890 in "config.c"
[23] ap_build_config_sub(p = <value not available>, temp_pool = <value not available>, l = <value not available>, parms = <value not available>, current = <value not available>, curr_parent = <value not available>, conftree = <value not available>), line 1436 in "config.c"
[24] ap_build_config(parms = (nil), p = <value not available>, temp_pool = <value not available>, conftree = (nil)), line 1219 in "config.c"
[25] process_resource_config_nofnmatch(s = <value not available>, fname = (nil), conftree = <value not available>, p = <value not available>, ptemp = <value not available>, depth = <value not available>), line 1629 in "config.c"
[26] ap_process_resource_config(s = <value not available>, fname = <value not available>, conftree = <value not available>, p = <value not available>, ptemp = <value not available>), line 1661 in "config.c"
[27] ap_read_config(process = <value not available>, ptemp = <value not available>, filename = <value not available>, conftree = <value not available>), line 2021 in "config.c"
[28] main(argc = -1075478912, argv = 0x90b9528), line 694 in "main.c"

There is hardly any resemblance in the stacktrace output produced by the two debug tools.  I posted this question to one of our internal aliases and I received a response from a colleague of mine in Singapore saying,  I was running backtrace against the wrong thread in gdb !!.   

When I ran the command "bt"  in gdb, I did not choose any particular thread,  all I did was open the core file in gdb and run the command "bt".  I always thought that by default when you open a core file in a debugger and run the command to print the stacktrace ( or backtrace),  it prints the trace of the thread that resulted this crash.   I now know that might always be the case. Finally, I managed to print the stacktrace of the thread that resulted in this crash in gdb.

(gdb) thread apply all bt
Thread 2 (process 7160):
#0 0x00929b7d in mmap () from /lib/ld-linux.so.2
#1 0x0091d809 in _dl_map_object_from_fd () from /lib/ld-linux.so.2
#2 0x0091f0bc in _dl_map_object () from /lib/ld-linux.so.2
#3 0x00922d98 in openaux () from /lib/ld-linux.so.2
#4 0x009240fe in _dl_catch_error () from /lib/ld-linux.so.2
#5 0x0092303a in _dl_map_object_deps () from /lib/ld-linux.so.2
#6 0x00a32159 in dl_open_worker () from /lib/tls/libc.so.6
#7 0x009240fe in _dl_catch_error () from /lib/ld-linux.so.2
#8 0x00a32cb8 in _dl_open () from /lib/tls/libc.so.6
#9 0x00a88cb8 in dlopen_doit () from /lib/libdl.so.2
#10 0x009240fe in _dl_catch_error () from /lib/ld-linux.so.2
#11 0x00a892bb in _dlerror_run () from /lib/libdl.so.2
#12 0x00a88d11 in dlopen@@GLIBC_2.1 () from /lib/libdl.so.2
#13 0x00e25222 in apr_dso_load (res_handle=0xbfe587ec,path=0x90c9740 "/root/web_agents/apache22_agent/lib/libamapc22.so",pool=0x905b0a8) at dso/unix/dso.c:139
#14 0x0809631e in load_module (cmd=0xbfe58bd0, dummy=0xbfe58a70,modname=0x90c96f8 "dsame_module", filename=0x90c9708 "/root/web_agents/apache22_agent/lib/libamapc22.so") at mod_so.c:238
#15 0x0807280f in invoke_cmd (cmd=0x80adde0, parms=0xbfe58bd0,mconfig=0xbfe58a70,args=0x90b6248 "") at config.c:890
#16 0x08072d3b in ap_build_config_sub (p=0x905b0a8,temp_pool=<value optimized out>, l=0x90b61f8 "LoadModule dsame_module /root/web_agents/apache22_agent/lib/libamapc22.so", parms=0xbfe58bd0, current=0xbfe58ab4, curr_parent=0xbfe58ab8, conftree=0xbfe58d74) at config.c:1436
#17 0x08073163 in ap_build_config (parms=0xbfe58bd0, p=0x905b0a8, temp_pool=0x90610c0, conftree=0xbfe58d74) at config.c:1219
#18 0x08073678 in process_resource_config_nofnmatch (s=0x905fc80,fname=0x90c8530 "/root/web_agents/apache22_agent/Agent_001/config/dsame.conf", conftree=0xbfe58d74, p=0x905b0a8, ptemp=0x90610c0, depth=0) at config.c:1629
#19 0x08073844 in ap_process_resource_config (s=0x905fc80,fname=0x90c8530 "/root/web_agents/apache22_agent/Agent_001/config/dsame.conf", conftree=0xbfe58d74, p=0x905b0a8, ptemp=0x90610c0) at config.c:1661
#20 0x0806bba1 in include_config (cmd=0xbfe59050, dummy=0xbfe58ef0,name=0x90c8468 "/root/web_agents/apache22_agent/Agent_001/config/dsame.conf") at core.c:2584
#21 0x0807280f in invoke_cmd (cmd=0x80a3ab8, parms=0xbfe59050,mconfig=0xbfe58ef0, args=0x90a2203 "") at config.c:890
#22 0x08072d3b in ap_build_config_sub (p=0x905b0a8,temp_pool=<value optimized out>,l=0x90a21c0 "include /root/web_agents/apache22_agent/Agent_001/config/dsame.conf", parms=0xbfe59050, current=0xbfe58f34, curr_parent=0xbfe58f38,conftree=0x80b43a8) at config.c:1436
#23 0x08073163 in ap_build_config (parms=0xbfe59050, p=0x905b0a8, temp_pool=0x90610c0, conftree=0x80b43a8) at config.c:1
#24 0x08073678 in process_resource_config_nofnmatch (s=0x905fc80,fname=0x90ab290 "/opt/apache226/conf/httpd.conf", conftree=0x80b43a8,p=0x905b0a8, ptemp=0x90610c0, depth=0) at config.c:1629
#25 0x08073844 in ap_process_resource_config (s=0x905fc80,fname=0x90ab290 "/opt/apache226/conf/httpd.conf", conftree=0x80b43a8,p=0x905b0a8, ptemp=0x90610c0) at config.c:1661
#26 0x0807418f in ap_read_config (process=0x9059120, ptemp=0x90610c0 filename=0x80a01e5 "conf/httpd.conf", conftree=0x80b43a8) at config.c:2021
#27 0x08061ee9 in main (argc=3, argv=0xbfe59354) at main.c:694

The backtrace against the correct thread in gdb made much more sense and gdb could resolve the symbols correctly too.  From gdb's backtrace it was clear that the crash was immediately after loading the dsame module, libamapc22.so which is our Policy Agent.

I filed a bug  6637032  against Access Manager Policy Agent and engineering has provided a fix to resolve this.

Monday Nov 19, 2007

libumem to detect modify-after-free corruptions.

Here is another example of how libumem can help us detect memory corruption..

The problem reported by the customer was that the Sun Web Server 6.1 protected by Sun's Access Manager Policy Agent  running in production was crashing.  This problem was not showing up in their staging environment, it was only the servers in production that were crashing.

Customer sent in several core files for us to analyze.  The stack trace in the first  core file that I looked at showed that we were aborting in libmtmalloc.so.1. 

=>[1] libc.so.1:__lwp_kill(0x0, 0x6, 0x0, 0xfe0bc000, 0x5, 0xaf295c8), at 0xfe0a0218
[2] libc.so.1:raise(0x6, 0x0, 0xe1efba50, 0xffffffff, 0x74ab0, 0xaf02900), at 0xfe050c80
[3] libc.so.1:abort(0x0, 0x0, 0x88, 0xffffffff, 0x74ab0, 0xaf02870), at 0xfe06e98
[4] libmtmalloc.so.1:free(0xaf28060, 0xfffb, 0xaf02800, 0xaf02828, 0x74ab0, 0xaf02828),

 I vaguely remembered that there was a bug in mtmalloc that returned an already freed pointer. I searched our bug database and found the bug . As I read through the synopsis of the bug  it became clear that we were NOT running into this bug. The bug was about : libmtmalloc's realloc() returning an already freed pointer. I don't think, we were doing a realloc here.  I ruled out that bug and just to be sure, I checked the mtmalloc patch level on the system and found that they were running the latest mtmalloc patch.

I looked at the other core files sent by the customer, and this  time, the crash was somewhere else. As I analyzed the core files and pstack output I noticed that the crashes were random in nature. The randomness of the crash indicated  memory corruption.  I initially suspected that this could be double-free type of  an error.

 fe0a0218 _lwp_kill (6, 0, e1ffba50, ffffffff, 74620, 2969700) + 18
fe036e98 addsev   (0, 0, 88, ffffffff, 74620, 2969670)
ff390bcc free     (3e0d188, fffb, 2969600, 2969628, 74620, 2969628) + 1e0
fd726f08 void\*__Crun::vector_del(void\*,unsigned,void(\*)(void\*)) (3e0d188, 0, 0, ffffffff, 0,
fd70e934 std::vector<bool>::iterator&std::vector<bool>::iterator::operator++()

I requested the customer to use libumem , the customer obliged (BIG Thank you) and sent us three corefiles generated with libumem enabled.

I opened the corefile in mdb, ran ::umem_verify and to my surprise, the integrity of all the caches came up as clean. Where do I go from here ?

Okay, I ran ::umem_status command and that printed the exact nature of corruption, including the stacktrace of the thread that last freed this buffer and the offset at which someone wrote to this buffer after it was being freed.

This information was sufficient for Sun Access Manager engineering team to come up with a fix and release  hotpatch 2.2-01 against Policy agent.  

How good is that ! I told you, libumem is so powerful yet so simple and easy to use.

mdb>::umem_status
Status: ready and active
Concurrency: 8
Logs: transaction=256k content=256k fail=256k (inactive)
Message buffer:
umem allocator: buffer modified after being freed modification occurred at offset 0x20 (0xdeadbeefdeadbeef replaced by 0xdeadbeeedeadbeef)
buffer=a7a0708 bufctl=a813f68 cache: umem_alloc_96
previous transaction on buffer a7a0708:
thread=28 time=T-284.970780700 slab=a5dede0 cache: umem_alloc_96
libumem.so.1'umem_cache_free+0x4c
libumem.so.1'process_free+0x006c
libames6.so'PR_DestroyLock
libames6.so'~Node
libames6.so'~RefCntPtr
libames6.so'~Tree
libames6.so'~PolicyEntry
libames6.so'~RefCntPtr
libames6.so'cleanup+0x0030
libames6.so'operator+0x01fc
libames6.so'spin+0x023c
libnspr4.so'_pt_root+0x00d4
libthread.so.1'_lwp_start
umem: heap corruption detected


About

Hema

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today