Monday Aug 17, 2009

Setting up ssoAdminTools fails with: /opensso/bootstrap (No such file or directory)

I was trying to install ssoadm commandline tool on a lab system that had opensso installed previously.

 I unzipped the  ssoAdminTools.zip to a directory and as per the instructions in the file README.setup I set the java path and then ran the setup command.  

The setup did not complete successfully, it complained about  not finding the bootstap file.

# export JAVA_HOME=/usr/jdk/jdk1.5.0_17
# ./setup
Path to config files of OpenSSO server (example: /opensso):/opensso
Debug Directory:/opensso/opensso/debug
Log Directory:/opensso/opensso/log
/opensso/bootstrap (No such file or directory)


I looked at the debug logs and found the following exception in the file:
/opensso/opensso/debug/Configuration

amSetupServlet:08/16/2009 11:42:25:705 PM EST: Thread[smIdmThreadPool,5,main]
ERROR: ServerConfigXMLObserver.notifyChanges
com.sun.identity.common.configuration.ConfigurationException: Not enough space
        at com.sun.identity.setup.BootstrapCreator.update(BootstrapCreator.java:112)

        at com.sun.identity.setup.BootstrapCreator.updateBootstrap(BootstrapCreator.java:79)        at com.sun.identity.common.configuration.ServerConfigXMLObserver.update(ServerConfigXMLObserver.java:100)
        at com.sun.identity.common.configuration.ServerConfigXMLObserver.notifyChanges(ServerConfigXMLObserver.java:80)
        at com.sun.identity.common.configuration.ConfigurationObserver.notifies(ConfigurationObserver.java:190)
        at com.sun.identity.common.configuration.ConfigurationObserver.globalConfigChanged(ConfigurationObserver.java:158)
        at com.sun.identity.sm.ServiceConfigManagerImpl.notifyGlobalConfigChange(ServiceConfigManagerImpl.java:464)
        at com.sun.identity.sm.ServiceConfigManagerImpl.objectChanged(ServiceConfigManagerImpl.java:436)
        at com.sun.identity.sm.SMSNotificationManager.sendNotifications(SMSNotificationManager.java:282)
        at com.sun.identity.sm.SMSNotificationManager$LocalChangeNotifcationTask.run(SMSNotificationManager.java:348)
        at com.iplanet.am.util.ThreadPool$WorkerThread.run(ThreadPool.java:306)

This meant that the opensso configurator was not successful in creating the bootstrap file and I came to know of this only when I tried to setup ssoadm command line tool.

When I last ran the opensso configurator, I remeber reading the message "configuration complete proceed to login" and I was able to login to opensso console as amadmin without any issues.

Is this a bug in the configurator ? Can I construct a bootstap file manually to include datastore information to workaround this ? I am not too sure on this, for now I just decided to use a different system to get going.

I'll make another entry in my blog if I find answers to the above questions.

Wednesday Apr 29, 2009

UnsatisfiedLinkError: no jss4 in java.library.path

As support engineer, considerable amount of my time goes in trying to replicate customer problems in the labs within Sun. When setting up things in the lab, sometimes I run into other issues that I have to overcome before I can possibly reproduce the actual problem reported by the customer.

Here is one such error I ran into when  I installed  Access Manager 7.1 from JES 5 on Appserver 8.2.

Installation of the compoents went through fine but when the web container was restarted,  it was throwing the error:  java.lang.UnsatisfiedLinkError: no jss4 in java.library.path

To overcome this, I had to add the following 3 entries to AMConfig.properties and then restart the container.

com.iplanet.security.SecureRandomFactoryImpl=com.iplanet.am.util.SecureRandomFactoryImpl
com.iplanet.security.SSLSocketFactoryImpl=netscape.ldap.factory.JSSESocketFactory
com.iplanet.security.encryptor=com.iplanet.services.util.JCEEncryption

Snippet of the exception in the application server log

[#|2009-04-13T16:50:59.482+1000|SEVERE|sun-appserver-pe8.2|javax.enterprise.system.container.web|_ThreadID=10;|WebModule[/amserver]Servlet /amserver threw load() exception javax.servlet.ServletException: no jss4 in java.library.path
at org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:300)
at org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:165)
at org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:118)
at org.apache.catalina.core.StandardWrapper.loadServlet(StandardWrapper.java:1093)
at org.apache.catalina.core.StandardWrapper.load(StandardWrapper.java:931)
at org.apache.catalina.core.StandardContext.loadOnStartup(StandardContext.java:4183)
at org.apache.catalina.core.StandardContext.start(StandardContext.java:4535)
at com.sun.enterprise.web.WebModule.start(WebModule.java:241)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1086)
at org.apache.catalina.core.StandardHost.start(StandardHost.java:847)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1086)
at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:483)
at org.apache.catalina.startup.Embedded.start(Embedded.java:894)
at com.sun.enterprise.web.WebContainer.start(WebContainer.java:741)at com.sun.enterprise.web.PEWebContainer.startInstance(PEWebContainer.java:515)
at com.sun.enterprise.web.PEWebContainerLifecycle.onStartup(PEWebContainerLifecycle.java:54)
at com.sun.enterprise.server.ApplicationServer.onStartup(ApplicationServer.java:300)
at com.sun.enterprise.server.PEMain.run(PEMain.java:292)
at com.sun.enterprise.server.PEMain.main(PEMain.java:218)
----- Root Cause -----
java.lang.UnsatisfiedLinkError: no jss4 in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1682)
at java.lang.Runtime.loadLibrary0(Runtime.java:822)
at java.lang.System.loadLibrary(System.java:993)
at org.mozilla.jss.CryptoManager.loadNativeLibraries(CryptoManager.java:1339)
at org.mozilla.jss.CryptoManager.initialize(CryptoManager.java:827)
at com.iplanet.services.util.JSSEncryption.<clinit>(JSSEncryption.java:250)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:164)
at com.iplanet.services.util.Crypt.createInstance(Crypt.java:133)
at com.iplanet.services.util.Crypt.<clinit>(Crypt.java:103)
at com.iplanet.services.ldap.LDAPUser.getPasswd(LDAPUser.java:117)
at com.iplanet.services.ldap.ServerInstance.getPasswd(ServerInstance.java:128)
at com.sun.identity.security.ServerInstanceAction.run(ServerInstanceAction.java:92)
at java.security.AccessController.doPrivileged(Native Method)
at com.iplanet.am.util.AdminUtils.<clinit>(AdminUtils.java:82)
at com.sun.identity.sm.SMSEntry.<clinit>(SMSEntry.java:169)
at com.sun.identity.sm.ServiceManager.<clinit>(ServiceManager.java:74)
at com.sun.identity.authentication.service.AuthD.<init>(AuthD.java:247)
at com.sun.identity.authentication.service.AuthD.getAuth(AuthD.java:507)
at com.sun.identity.authentication.UI.LoginLogoutMapping.initializeAuth(LoginLogoutMapping.java:89)
at com.sun.identity.authentication.UI.LoginLogoutMapping.init(LoginLogoutMapping.java:74)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:249)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAsPrivileged(Subject.java:517)
at org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:282)
at org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:165)
at org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:118)
at org.apache.catalina.core.StandardWrapper.loadServlet(StandardWrapper.java:1093)
at org.apache.catalina.core.StandardWrapper.load(StandardWrapper.java:931)
at org.apache.catalina.core.StandardContext.loadOnStartup(StandardContext.java:4183)
at org.apache.catalina.core.StandardContext.start(StandardContext.java:4535)
at com.sun.enterprise.web.WebModule.start(WebModule.java:241)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1086)
at org.apache.catalina.core.StandardHost.start(StandardHost.java:847)
at org.apache.catalina.core.StandardHost.start(StandardHost.java:847)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1086)
at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:483)
at org.apache.catalina.startup.Embedded.start(Embedded.java:894)
at com.sun.enterprise.web.WebContainer.start(WebContainer.java:741)
at com.sun.enterprise.web.PEWebContainer.startInstance(PEWebContainer.java:515)
at com.sun.enterprise.web.PEWebContainerLifecycle.onStartup(PEWebContainerLifecycle.java:54)
at com.sun.enterprise.server.ApplicationServer.onStartup(ApplicationServer.java:300)
at com.sun.enterprise.server.PEMain.run(PEMain.java:292)
at com.sun.enterprise.server.PEMain.main(PEMain.java:218)

 

Wednesday Mar 19, 2008

Gosling Creed

Sydney Morning Herald ran an article on James Gosling and I liked the last part on Gosling Creed.

The Gosling creed

- Reward risk and don't penalise failure.

- Innovation is about risk and thinking that risk and failure are good things.

- You have to learn to trust crazy people. It's the people on the edge that have the most interesting, insightful comments.

- Have confidence in yourself. People you think are oracles of wisdom - they're just making it up.

- Listen carefully, and try to figure out what the problem is to solve.

 

Link to the actual  article is here: http://www.smh.com.au/news/technology/wisdom-from-the-wayout/2008/03/18/1205602306483.html

Monday Feb 25, 2008

Access Manager Session State Diagram

Here is the session state diagram  I found in the access manager code, session.java.

  \* The following is the state diagram for a session:
\*
\* |
\* |
\* |
\* V
\* ---------------invalid
\* | |
\* | |creation (authentication OK)
\* | |
\* |max login time | max idle time
\* |destroy V ------------------->
\* | valid inactive --
\* | | <------------------ |
\* | | reactivate |
\* | | |
\* | | logout | destroy
\* | | destroy | max session time
\* | | max session time |
\* | V |
\* --------------> destroy <---------------------





 

LARES - Liberty-like AuthNResponse

LARES stands for  Liberty-like AuthNResponse.   This is the base64 encoded value of SAML assertion sent by the access manager to the agent when CDSSO is enabled.

The Access manager sends this as a hidden value in it's response message as shown below:

    <HTML>
    <BODY Onload="document.Response.submit()">
    <FORM NAME="Response" METHOD="POST" ACTION="http://hostname.singapore.sun.com:80/test.asp?sunwMethod=GET">
    <INPUT TYPE="HIDDEN" NAME="LARES" VALUE="PGxpYjpBdXRoblJlc3BvbnNlIHhtbG5zOmxpYj0iaHR0cDovL3Byb2plY3RsaWJlcnR5Lm9yZy9zY2hlbWFzL2NvcmUvMjAwMi8xMiIgeG1sbnM6c2FtbD0idXJuOm9hc2lzOm5hbWVzOnRjOlNBTUw6MS4wOmFzc2VydGlvbiIgeG1sbnM6c2FtbHA9InVybjpvYXNpczpuYW1lczp0YzpTQU1MOjEuMDpwcm90b2NvbCIgeG1sbnM6ZHM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvMDkveG1sZHNpZyMiIHhtbG5zOnhzaT0iaHR0cDovL3d3dy53My5vcmcvMjAwMS9YTUxTY2hlbWEtaW5zdGFuY2UiIFJlc3BvbnNlSUQ9InM1MjU5NjZlNmUzODhlNDQ4Y2I4OGFiYjVkNzNjZDBmNmM5ZDg4YmFkIiAgSW5SZXNwb25zZVRvPSIyNzE0NSIgIE1ham9yVmVyc2lvbj0iMSIgIE1pbm9yVmVyc2lvbj0iMCIgIElzc3VlSW5zdGFudD0iMjAwOC0wMi0yNVQwMTozODoyNloiPjxzYW1scDpTdGF0dXM%2BCjxzYW1scDpTdGF0dXNDb2RlIFZhbHVlPSJzYW1scDpTdWNjZXNzIj4KPC9zYW1scDpTdGF0dXNDb2RlPgo8L3NhbWxwOlN0YXR1cz4KPHNhbWw6QXNzZXJ0aW9uICB4bWxuczpzYW1sPSJ1cm46b2FzaXM6bmFtZXM6dGM6U0FNTDoxLjA6YXNzZXJ0aW9uIiB4bWxuczp4c2k9Imh0dHA6Ly93d3cudzMub3JnLzIwMDEvWE1MU2NoZW1hLWluc3RhbmNlIiAgeG1sbnM6bGliPSJodHRwOi8vcHJvamVjdGxpYmVydHkub3JnL3NjaGVtYXMvY29yZS8yMDAyLzEyIiAgaWQ9InMyOTBkODYyN2ZiYmQ4OGU4OGIyMGZjMjE2NmNjODhjODI3MDE2OWZkMDEiIE1ham9yVmVyc2lvbj0iMSIgTWlub3JWZXJzaW9uPSIwIiBBc3NlcnRpb25JRD0iczI5MGQ4NjI3ZmJiZDg4ZTg4YjIwZmMyMTY2Y2M4OGM4MjcwMTY5ZmQwMSIgSXNzdWVyPSJodHRwOi8vdjEwMC1jLmF1cy5zdW4uY29tOjgwODAvYW1zZXJ2ZXIvY2Rjc2VydmxldCIgSXNzdWVJbnN0YW50PSIyMDA4LTAyLTI1VDAxOjM4OjI2WiIgSW5SZXNwb25zZVRvPSIyNzE0NSIgeHNpOnR5cGU9ImxpYjpBc3NlcnRpb25UeXBlIj4KPHNhbWw6Q29uZGl0aW9ucyAgTm90QmVmb3JlPSIyMDA4LTAyLTI1VDAxOjM4OjI2WiIgTm90T25PckFmdGVyPSIyMDA4LTAyLTI1VDAxOjM5OjI2WiIgPgo8c2FtbDpBdWRpZW5jZVJlc3RyaWN0aW9uQ29uZGl0aW9uPgo8c2FtbDpBdWRpZW5jZT5odHRwOi8vdjR1LTIyMHJhLXNpbjA2LnNpbmdhcG9yZS5zdW4uY29tOjgwL2FtYWdlbnQ8L3NhbWw6QXVkaWVuY2U%2BCjwvc2FtbDpBdWRpZW5jZVJlc3RyaWN0aW9uQ29uZGl0aW9uPgo8L3NhbWw6Q29uZGl0aW9ucz4KPHNhbWw6QXV0aGVudGljYXRpb25TdGF0ZW1lbnQgIEF1dGhlbnRpY2F0aW9uTWV0aG9kPSJMREFQIiBBdXRoZW50aWNhdGlvbkluc3RhbnQ9IjIwMDgtMDItMjVUMDE6Mzg6MjRaIiBSZWF1dGhlbnRpY2F0ZU9uT3JBZnRlcj0iMjAwOC0wMi0yNVQwMTozOToyNloiIHhzaTp0eXBlPSJsaWI6QXV0aGVudGljYXRpb25TdGF0ZW1lbnRUeXBlIj48c2FtbDpTdWJqZWN0ICAgeHNpOnR5cGU9ImxpYjpTdWJqZWN0VHlwZSI%2BPHNhbWw6TmFtZUlkZW50aWZpZXIgTmFtZVF1YWxpZmllcj0iaHR0cDovL3YxMDAtYy5hdXMuc3VuLmNvbTo4MDgwL2Ftc2VydmVyL2NkY3NlcnZsZXQiPkFRSUM1d00yTFk0U2Zjd2tDTmFQJTJCemVMbWpwTVBmUUZ0czkxNzlBcGZrOHM5VlklM0QlNDBBQUpUU1FBQ01ERSUzRCUyMzwvc2FtbDpOYW1lSWRlbnRpZmllcj4KPHNhbWw6U3ViamVjdENvbmZpcm1hdGlvbj4KPHNhbWw6Q29uZmlybWF0aW9uTWV0aG9kPnVybjpvYXNpczpuYW1lczp0YzpTQU1MOjEuMDpjbTpiZWFyZXI8L3NhbWw6Q29uZmlybWF0aW9uTWV0aG9kPgo8L3NhbWw6U3ViamVjdENvbmZpcm1hdGlvbj4KPGxpYjpJRFBQcm92aWRlZE5hbWVJZGVudGlmaWVyICBOYW1lUXVhbGlmaWVyPSJodHRwOi8vdjEwMC1jLmF1cy5zdW4uY29tOjgwODAvYW1zZXJ2ZXIvY2Rjc2VydmxldCIgPkFRSUM1d00yTFk0U2Zjd2tDTmFQJTJCemVMbWpwTVBmUUZ0czkxNzlBcGZrOHM5VlklM0QlNDBBQUpUU1FBQ01ERSUzRCUyMzwvbGliOklEUFByb3ZpZGVkTmFtZUlkZW50aWZpZXI%2BCjwvc2FtbDpTdWJqZWN0PjxzYW1sOlN1YmplY3RMb2NhbGl0eSAgSVBBZGRyZXNzPSIxMC4xNS4zLjE0NiIgRE5TQWRkcmVzcz0idjEwMC1jLmF1cy5zdW4uY29tIiAvPjxsaWI6QXV0aG5Db250ZXh0PjxsaWI6QXV0aG5Db250ZXh0Q2xhc3NSZWY%2BaHR0cDovL3d3dy5wcm9qZWN0bGliZXJ0eS5vcmcvc2NoZW1hcy9hdXRoY3R4L2NsYXNzZXMvUGFzc3dvcmQ8L2xpYjpBdXRobkNvbnRleHRDbGFzc1JlZj48bGliOkF1dGhuQ29udGV4dFN0YXRlbWVudFJlZj5odHRwOi8vd3d3LnByb2plY3RsaWJlcnR5Lm9yZy9zY2hlbWFzL2F1dGhjdHgvY2xhc3Nlcy9QYXNzd29yZDwvbGliOkF1dGhuQ29udGV4dFN0YXRlbWVudFJlZj48L2xpYjpBdXRobkNvbnRleHQ%2BPC9zYW1sOkF1dGhlbnRpY2F0aW9uU3RhdGVtZW50Pjwvc2FtbDpBc3NlcnRpb24%2BCjxsaWI6UHJvdmlkZXJJRD5odHRwOi8vdjEwMC1jLmF1cy5zdW4uY29tOjgwODAvYW1zZXJ2ZXIvY2Rjc2VydmxldDwvbGliOlByb3ZpZGVySUQ%2BPC9saWI6QXV0aG5SZXNwb25zZT4K
    </FORM>
    </BODY>
</HTML>

 

 

 

If you stick the value in the LARES param to a base64 decoder, it should decode to the actual SAML assertion sent by the Access Manager.
 

<lib:AuthnResponse xmlns:lib="http://projectliberty.org/schemas/core/2002/12" xmlns:saml="urn:oasis:names:tc:SAML:1.0:assertion" xmlns:samlp="urn:oasis:names:tc:SAML:1.0:protocol" xmlns:ds="http://www.w3.org/2000/09/xmldsig#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ResponseID="s525966e6e388e448cb88abb5d73cd0f6c9d88bad"  InResponseTo="27145"  MajorVersion="1"  MinorVersion="0"  IssueInstant="2008-02-25T01:38:26Z"><samlp:Status?
</saml:Conditions>
<saml:AuthenticationStatement AuthenticationMethod="LDAP" AuthenticationInstant="2008-02-25T01:38:24Z" ReauthenticateOnOrAfter="2008-02-25T01:39:26Z" xsi:type="lib:AuthenticationStatementType"><saml:Subject xsi:type="lib:SubjectType"?
</saml:Subject><saml:SubjectLocality IPAddress="10.15.3.146" DNSAddress="v100-c.aus.sun.com" /><lib:AuthnContext><lib:AuthnContextClassRef?�</saml:AuthenticationStatement></saml:Assertion?
�</lib:AuthnResponse>


 

 Here is the snapshot of the actual packet as seen in ethereal. Choose View image and then zoom in to view the image correctly.


 


 



Monday Dec 17, 2007

why is the stacktrace in gdb different to the one in dbx ?

After I posted my last blog, another idea struck me, how about I try live debugging with dbx ?

I downloaded Sun Studio 12 and installed on my lab system running Red Hat Enterprise Linux ES release 4 (Nahant Update 4). When I started apache in dbx, it crashed the first time around but subsequent starts did not crash. Even when it crashed, I couldn't get the symbols working correctly in dbx. What's the point in debugging live if I couldn't read the symbols correctly ?

I decided not to spend time on live debugging and focus instead on crash analysis.  But I was in for another surprise that is -- the stacktrace output in  dbx and gdb looked very different. I just couldn't figure out why the stacktrace output on gdb and dbx looked so different ?

The backtrace output in gdb looked as follows:

(gdb) bt
#0  0x003a9194 in spin () from /root/web_agents/apache22_agent/lib/libamapc22.so
#1  0x024dedc1 in PR_Select () from /usr/lib/libnspr4.so
#2  0x00b96371 in start_thread () from /lib/tls/libpthread.so.0
#3  0x009fdffe in clone () from /lib/tls/libc.so.6 

 
Now, compare that with the stacktrace in dbx ?

(dbx)

t@null (l@1) terminated by signal SEGV (Segmentation fault)
0x00929b7d: __mmap+0x000d: movl %edx,%ebx
Current function is apr_dso_load
139 os_handle = dlopen(path, flags);


(dbx) where
[1] __mmap(0xcbf000, 0x71d0, 0x5, 0x802, 0xa, 0x0), at 0x929b7d
[2] _dl_map_object_from_fd(0xbfe57e80, 0x90b9528, 0x90b9200, 0x2, 0x80000000, 0xbfe57e78), at 0x91d809
[3] _dl_map_object(0x2, 0x0, 0x80000000, 0x0, 0xbfe583b4, 0x44fd66), at 0x91f0bc
[4] openaux(0xbfe583b4, 0x80000000, 0x0, 0xbfe583b4, 0xb7f3d6bc, 0x922d50), at 0x922d98
[5] _dl_catch_error(0xbfe583b4, 0x24, 0xbfe584b0, 0xbfe58188, 0x92dfd4, 0xbfe583e8), at 0x9240fe
[6] _dl_map_object_deps(0x0, 0x80000000, 0x90000102, 0x0, 0xa5cff4, 0x0), at 0x92303a
[7] dl_open_worker(0xbfe585b0, 0x0, 0x0, 0x0, 0xb7f3d6bc, 0xa31fe0), at 0xa32159
[8] _dl_catch_error(0xbfe585b0, 0x90, 0xffffffff, 0x0, 0x970f98, 0xbfe58b40), at 0x9240fe
[9] __GI__dl_open(0xfffffffe, 0x0, 0x0, 0x92dfd4, 0xbfe58750, 0x92dca0), at 0xa32cb8
[10] dlopen_doit(0xbfe58750, 0x0, 0x0, 0xbfe58c44, 0xb7f3d6bc, 0xa88c70), at 0xa88cb8
[11] _dl_catch_error(0xbfe58750, 0xa88f50, 0x48, 0xe3736c, 0x905b0a8, 0x90c973a), at 0x9240fe
[12] _dlerror_run(0x0, 0x0, 0x31, 0x0, 0x90c9740, 0x102), at 0xa892bb
[13] __dlopen_check(0x90c9740, 0x102, 0x90c9739, 0xe23650, 0x905b0a8, 0x1), at 0xa88d11
=>[14] apr_dso_load(res_handle = <value not available>, path = <value not available>, pool = <value not available>), line 139 in "dso.c"
[15] load_module(cmd = <value not available>, dummy = <value not available>, modname = <value not available>, filename = <value not available>), line 238 in "mod_so.c"
[16] invoke_cmd(cmd = (nil), parms = (nil), mconfig = <value not available>, args = <value not available>), line 890 in "config.c"
[17] ap_build_config_sub(p = <value not available>, temp_pool = <value not available>, l = <value not available>, parms = <value not available>, current = <value not available>, curr_parent = <value not available>, conftree = <value not available>), line 1436 in "config.c"
[18] ap_build_config(parms = (nil), p = <value not available>, temp_pool = <value not available>, conftree = (nil)), line 1219 in "config.c"
[19] process_resource_config_nofnmatch(s = <value not available>, fname = (nil), conftree = <value not available>, p = <value not available>, ptemp = <value not available>, depth = <value not available>), line 1629 in "config.c"
[20] ap_process_resource_config(s = <value not available>, fname = <value not available>, conftree = <value not available>, p = <value not available>, ptemp = <value not available>), line 1661 in "config.c"
[21] include_config(cmd = (nil), dummy = <value not available>, name = (nil)), line 2584 in "core.c"
[22] invoke_cmd(cmd = (nil), parms = (nil), mconfig = <value not available>, args = <value not available>), line 890 in "config.c"
[23] ap_build_config_sub(p = <value not available>, temp_pool = <value not available>, l = <value not available>, parms = <value not available>, current = <value not available>, curr_parent = <value not available>, conftree = <value not available>), line 1436 in "config.c"
[24] ap_build_config(parms = (nil), p = <value not available>, temp_pool = <value not available>, conftree = (nil)), line 1219 in "config.c"
[25] process_resource_config_nofnmatch(s = <value not available>, fname = (nil), conftree = <value not available>, p = <value not available>, ptemp = <value not available>, depth = <value not available>), line 1629 in "config.c"
[26] ap_process_resource_config(s = <value not available>, fname = <value not available>, conftree = <value not available>, p = <value not available>, ptemp = <value not available>), line 1661 in "config.c"
[27] ap_read_config(process = <value not available>, ptemp = <value not available>, filename = <value not available>, conftree = <value not available>), line 2021 in "config.c"
[28] main(argc = -1075478912, argv = 0x90b9528), line 694 in "main.c"

There is hardly any resemblance in the stacktrace output produced by the two debug tools.  I posted this question to one of our internal aliases and I received a response from a colleague of mine in Singapore saying,  I was running backtrace against the wrong thread in gdb !!.   

When I ran the command "bt"  in gdb, I did not choose any particular thread,  all I did was open the core file in gdb and run the command "bt".  I always thought that by default when you open a core file in a debugger and run the command to print the stacktrace ( or backtrace),  it prints the trace of the thread that resulted this crash.   I now know that might always be the case. Finally, I managed to print the stacktrace of the thread that resulted in this crash in gdb.

(gdb) thread apply all bt
Thread 2 (process 7160):
#0 0x00929b7d in mmap () from /lib/ld-linux.so.2
#1 0x0091d809 in _dl_map_object_from_fd () from /lib/ld-linux.so.2
#2 0x0091f0bc in _dl_map_object () from /lib/ld-linux.so.2
#3 0x00922d98 in openaux () from /lib/ld-linux.so.2
#4 0x009240fe in _dl_catch_error () from /lib/ld-linux.so.2
#5 0x0092303a in _dl_map_object_deps () from /lib/ld-linux.so.2
#6 0x00a32159 in dl_open_worker () from /lib/tls/libc.so.6
#7 0x009240fe in _dl_catch_error () from /lib/ld-linux.so.2
#8 0x00a32cb8 in _dl_open () from /lib/tls/libc.so.6
#9 0x00a88cb8 in dlopen_doit () from /lib/libdl.so.2
#10 0x009240fe in _dl_catch_error () from /lib/ld-linux.so.2
#11 0x00a892bb in _dlerror_run () from /lib/libdl.so.2
#12 0x00a88d11 in dlopen@@GLIBC_2.1 () from /lib/libdl.so.2
#13 0x00e25222 in apr_dso_load (res_handle=0xbfe587ec,path=0x90c9740 "/root/web_agents/apache22_agent/lib/libamapc22.so",pool=0x905b0a8) at dso/unix/dso.c:139
#14 0x0809631e in load_module (cmd=0xbfe58bd0, dummy=0xbfe58a70,modname=0x90c96f8 "dsame_module", filename=0x90c9708 "/root/web_agents/apache22_agent/lib/libamapc22.so") at mod_so.c:238
#15 0x0807280f in invoke_cmd (cmd=0x80adde0, parms=0xbfe58bd0,mconfig=0xbfe58a70,args=0x90b6248 "") at config.c:890
#16 0x08072d3b in ap_build_config_sub (p=0x905b0a8,temp_pool=<value optimized out>, l=0x90b61f8 "LoadModule dsame_module /root/web_agents/apache22_agent/lib/libamapc22.so", parms=0xbfe58bd0, current=0xbfe58ab4, curr_parent=0xbfe58ab8, conftree=0xbfe58d74) at config.c:1436
#17 0x08073163 in ap_build_config (parms=0xbfe58bd0, p=0x905b0a8, temp_pool=0x90610c0, conftree=0xbfe58d74) at config.c:1219
#18 0x08073678 in process_resource_config_nofnmatch (s=0x905fc80,fname=0x90c8530 "/root/web_agents/apache22_agent/Agent_001/config/dsame.conf", conftree=0xbfe58d74, p=0x905b0a8, ptemp=0x90610c0, depth=0) at config.c:1629
#19 0x08073844 in ap_process_resource_config (s=0x905fc80,fname=0x90c8530 "/root/web_agents/apache22_agent/Agent_001/config/dsame.conf", conftree=0xbfe58d74, p=0x905b0a8, ptemp=0x90610c0) at config.c:1661
#20 0x0806bba1 in include_config (cmd=0xbfe59050, dummy=0xbfe58ef0,name=0x90c8468 "/root/web_agents/apache22_agent/Agent_001/config/dsame.conf") at core.c:2584
#21 0x0807280f in invoke_cmd (cmd=0x80a3ab8, parms=0xbfe59050,mconfig=0xbfe58ef0, args=0x90a2203 "") at config.c:890
#22 0x08072d3b in ap_build_config_sub (p=0x905b0a8,temp_pool=<value optimized out>,l=0x90a21c0 "include /root/web_agents/apache22_agent/Agent_001/config/dsame.conf", parms=0xbfe59050, current=0xbfe58f34, curr_parent=0xbfe58f38,conftree=0x80b43a8) at config.c:1436
#23 0x08073163 in ap_build_config (parms=0xbfe59050, p=0x905b0a8, temp_pool=0x90610c0, conftree=0x80b43a8) at config.c:1
#24 0x08073678 in process_resource_config_nofnmatch (s=0x905fc80,fname=0x90ab290 "/opt/apache226/conf/httpd.conf", conftree=0x80b43a8,p=0x905b0a8, ptemp=0x90610c0, depth=0) at config.c:1629
#25 0x08073844 in ap_process_resource_config (s=0x905fc80,fname=0x90ab290 "/opt/apache226/conf/httpd.conf", conftree=0x80b43a8,p=0x905b0a8, ptemp=0x90610c0) at config.c:1661
#26 0x0807418f in ap_read_config (process=0x9059120, ptemp=0x90610c0 filename=0x80a01e5 "conf/httpd.conf", conftree=0x80b43a8) at config.c:2021
#27 0x08061ee9 in main (argc=3, argv=0xbfe59354) at main.c:694

The backtrace against the correct thread in gdb made much more sense and gdb could resolve the symbols correctly too.  From gdb's backtrace it was clear that the crash was immediately after loading the dsame module, libamapc22.so which is our Policy Agent.

I filed a bug  6637032  against Access Manager Policy Agent and engineering has provided a fix to resolve this.

Monday Nov 26, 2007

program's behaviour changing when attached to debugger or when using monitoring tools

Last week I managed to replicate a problem reported by one of our customers in Japan.  It was about Policy agent 2.2 for Apache 2.2 crashing on 32 bit  Red Hat Enterprise Linux ES release 4 (Nahant Update 4).

I downloaded the apache code from http://httpd.apache.org/download.cgi and built it on the system that I was trying to replicate this problem on.

I downloaded the policy agent from http://www.sun.com/download/products.xml?id=471909fc
and before I installed the agent, I verified that I could start and stop the apache server correctly.

I followed the install steps outlined the install document and installation was successful. http://docs.sun.com/app/docs/doc/820-3288.

I now  tried to start the apache server after installing the policy agent and  it crashed with segmentation fault.  Hiya,  I could reproduce the problem !

#/opt/apache226/bin/apachectl -k start
/opt/apache226/bin/apachectl: line 78: 18504 Segmentation fault     
(core dumped) $HTTPD $ARGV

I opened the corefile in gdb and the backtrace looked as follows:

(gdb) bt
#0  0x003a9194 in spin ()
from /web_agents/apache22_agent/lib/libamapc22.so
#1  0x024dedc1 in PR_Select () from /usr/lib/libnspr4.so
#2  0x00b96371 in start_thread () from /lib/tls/libpthread.so.0
#3  0x009fdffe in clone () from /lib/tls/libc.so.6

Now that I could replicate the problem consistently on my lab system, I thought, I'll do some live debugging with gdb for that extra bit of fun.

I ran the program in gdb setting breakpoint in spin() and to my surprise, the process did not crash.  The program exited normally. I thought, this might be due to the breakpoint that I set.  I re-ran the program without the breakpoint and the program ran normally again in gdb .

(gdb) file /opt/apache22/bin/httpd
(gdb) set args -k start
(gdb) start
Breakpoint 1 at 0x80618bd: file main.c, line 438.
Starting program: /opt/apache226/bin/httpd -k start
[Thread debugging using libthread_db enabled]
[New Thread 0xb7ee06c0 (LWP 28225)]
[Switching to Thread 0xb7ee06c0 (LWP 28225)]
main (argc=3, argv=0xbfeb32a4) at main.c:438
438     {
(gdb) break spin
Function "spin" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 2 (spin) pending.
(gdb) continue
Continuing.
Breakpoint 3 at 0x3a9326
Pending breakpoint "spin" resolved
[New Thread 0x19a5bb0 (LWP 28229)]
[Switching to Thread 0x19a5bb0 (LWP 28229)]
Breakpoint 3, 0x003a9326 in spin ()
   from /web_agents/apache22_agent/lib/libamapc22.so
(gdb) continue
Continuing.
warning: Temporarily disabling breakpoints for unloaded shared library "/hema/opensso/web_agents/apache22_agent/lib/libamapc22.so"
Breakpoint 3 at 0x4b6326
Program exited normally.

 I then ran strace and ltrace to get more information and to my surprise, apache didn't crash if I ran with those monitoring tools either.

I suspect this is due to some sort of timing issue.  How do I debug a program if  it's behavior changes when attached to a debugger or when run with the monitoring tools !!

I gave up on live debugging and finally had to go back to core dump analysis of the core file in gdb.

Monday Nov 19, 2007

libumem to detect modify-after-free corruptions.

Here is another example of how libumem can help us detect memory corruption..

The problem reported by the customer was that the Sun Web Server 6.1 protected by Sun's Access Manager Policy Agent  running in production was crashing.  This problem was not showing up in their staging environment, it was only the servers in production that were crashing.

Customer sent in several core files for us to analyze.  The stack trace in the first  core file that I looked at showed that we were aborting in libmtmalloc.so.1. 

=>[1] libc.so.1:__lwp_kill(0x0, 0x6, 0x0, 0xfe0bc000, 0x5, 0xaf295c8), at 0xfe0a0218
[2] libc.so.1:raise(0x6, 0x0, 0xe1efba50, 0xffffffff, 0x74ab0, 0xaf02900), at 0xfe050c80
[3] libc.so.1:abort(0x0, 0x0, 0x88, 0xffffffff, 0x74ab0, 0xaf02870), at 0xfe06e98
[4] libmtmalloc.so.1:free(0xaf28060, 0xfffb, 0xaf02800, 0xaf02828, 0x74ab0, 0xaf02828),

 I vaguely remembered that there was a bug in mtmalloc that returned an already freed pointer. I searched our bug database and found the bug . As I read through the synopsis of the bug  it became clear that we were NOT running into this bug. The bug was about : libmtmalloc's realloc() returning an already freed pointer. I don't think, we were doing a realloc here.  I ruled out that bug and just to be sure, I checked the mtmalloc patch level on the system and found that they were running the latest mtmalloc patch.

I looked at the other core files sent by the customer, and this  time, the crash was somewhere else. As I analyzed the core files and pstack output I noticed that the crashes were random in nature. The randomness of the crash indicated  memory corruption.  I initially suspected that this could be double-free type of  an error.

 fe0a0218 _lwp_kill (6, 0, e1ffba50, ffffffff, 74620, 2969700) + 18
fe036e98 addsev   (0, 0, 88, ffffffff, 74620, 2969670)
ff390bcc free     (3e0d188, fffb, 2969600, 2969628, 74620, 2969628) + 1e0
fd726f08 void\*__Crun::vector_del(void\*,unsigned,void(\*)(void\*)) (3e0d188, 0, 0, ffffffff, 0,
fd70e934 std::vector<bool>::iterator&std::vector<bool>::iterator::operator++()

I requested the customer to use libumem , the customer obliged (BIG Thank you) and sent us three corefiles generated with libumem enabled.

I opened the corefile in mdb, ran ::umem_verify and to my surprise, the integrity of all the caches came up as clean. Where do I go from here ?

Okay, I ran ::umem_status command and that printed the exact nature of corruption, including the stacktrace of the thread that last freed this buffer and the offset at which someone wrote to this buffer after it was being freed.

This information was sufficient for Sun Access Manager engineering team to come up with a fix and release  hotpatch 2.2-01 against Policy agent.  

How good is that ! I told you, libumem is so powerful yet so simple and easy to use.

mdb>::umem_status
Status: ready and active
Concurrency: 8
Logs: transaction=256k content=256k fail=256k (inactive)
Message buffer:
umem allocator: buffer modified after being freed modification occurred at offset 0x20 (0xdeadbeefdeadbeef replaced by 0xdeadbeeedeadbeef)
buffer=a7a0708 bufctl=a813f68 cache: umem_alloc_96
previous transaction on buffer a7a0708:
thread=28 time=T-284.970780700 slab=a5dede0 cache: umem_alloc_96
libumem.so.1'umem_cache_free+0x4c
libumem.so.1'process_free+0x006c
libames6.so'PR_DestroyLock
libames6.so'~Node
libames6.so'~RefCntPtr
libames6.so'~Tree
libames6.so'~PolicyEntry
libames6.so'~RefCntPtr
libames6.so'cleanup+0x0030
libames6.so'operator+0x01fc
libames6.so'spin+0x023c
libnspr4.so'_pt_root+0x00d4
libthread.so.1'_lwp_start
umem: heap corruption detected


Sunday Nov 18, 2007

libumem to detect write-beyond-what-you-allocate errors


I am a big fan of libumem . I've been using it for years to debug application crashes reported by our customers. I've found it very useful in isolating the source of corruption . I thought I'll share some of my experiences here.

I will use some examples from real cases although I might obscure the names of some of the libraries.

One of our elite customers reported that their application was crashing and he suspected that java is the cause of  the crash. This was a pretty complex java application that involved few native libraries as well.

One of the challenging part of a support job is isolating the problem and even more challenging is convincing the customer that the problem is elsewhere and not where he thinks it is. 

But, you know what, with libumem you'll see how easy this could be. Okay, I asked the customer to run their application with libumem enabled and send us another crash and so they did.

The first thing to do of course is open the corefile in mdb and run umem_verify command. This prints the name, address and the integrity of the cache. Take the address of the cache containing corrupt buffer and run umem_verify against it.  This gives you the address of the corrupt buffer.

 # mdb java corefile
Loading modules: [ libumem.so.1 libthread.so.1 libc.so.1 ]
> ::umem_verify
Cache Name                        Addr          Cache Integrity
umem_bufctl_audit_cache      254008        clean
umem_alloc_8                      256008        clean
umem_alloc_16                     25c008        clean
umem_alloc_24                    25e008        1 corrupt buffer
umem_alloc_32                     260008        clean
umem_alloc_40                     262008        clean

The corrupt buffer comes from cache: umem_alloc_24:                

> 25e008::umem_verify
Summary for cache 'umem_alloc_24'
  buffer 1bc9628 (allocated) has a corrupt redzone size encoding

Let's dump the buffer:

The first 8 bytes has meta data information.The actual buffer starts at: 0x1bc9628 + 0x8 = 0x1bc9630


> 0x1bc9630/7X
0x1bc9630: 3137322e 32332e31 37302e37 3700cafe
feedface 1498 1bcdc98

The contents of the buffer (hightlighted in green) indicate that the application has written 14 bytes. The content is actually hexadecimal ascii equivalent of IP address: 172.23.170.77 followed by a NULL character.

From the redzone data, let's find out the actual number of bytes that this application allocated.

Redzone data is 8 bytes that follows the buffer. When a buffer is allocated with libumem, the first 4 bytes of redzone contains the pattern 0xfeedface and the last 4 bytes contains the encoded value of the actual size of memory allocated by the application. Do the  following math to find the actual size allocated by the application.

0x1498 == 5272t
5272/251 = 21
21 - 8 = 13 bytes

Aha, someone allocated 13 bytes and wrote 14 bytes into it -- that explains it all.

Now, let's see who allocated this buffer itself:, to do that, take the value following the redzone data and run the command bufctl_audit against it:

> 1bcdc98::bufctl_audit
ADDR BUFADDR TIMESTAMP THR LASTLOG CONTENTS CACHE SLAB NEXT DEPTH
01bcdc98 01bc9628 a35debd7cb0b0 3b 000d92c0 00000000 0025e008 00934450 00000000 f
libumem.so.1`umem_cache_alloc+0x208
libumem.so.1`umem_alloc+0x44
libumem.so.1`malloc+0x2c
libobscure.so.1`decode+0x0020
libobscure.so.1`getItemN+0x0070
libobscure.so.1`_InitSession+0x934
libobscure.so.1`Java_com_xxx_yyy_od_zzzInterface_main+0x01f8
0xf900bc20

As you can see above, this buffer is getting allocated in the native library libobscure.so via java native interface. When they allocated the memory to store the IP address, they did not the NULL character into account and therefore were writing beyond what they actually allocated. This information was enough to convince the customer that the corruption was not in java but instead in a native library that they were using.

About

Hema

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today