By Hema on Nov 19, 2007
Here is another example of how libumem can help us detect memory corruption.
The customer reported that Sun Web Server 6.1, protected by Sun's Access Manager Policy Agent, was crashing in production. The problem did not show up in their staging environment; only the production servers were crashing.
The customer sent in several core files for us to analyze. The stack trace in the first core file I looked at showed that we were aborting in libmtmalloc.so.1.
=> libc.so.1:__lwp_kill(0x0, 0x6, 0x0, 0xfe0bc000, 0x5, 0xaf295c8), at 0xfe0a0218
 libc.so.1:raise(0x6, 0x0, 0xe1efba50, 0xffffffff, 0x74ab0, 0xaf02900), at 0xfe050c80
 libc.so.1:abort(0x0, 0x0, 0x88, 0xffffffff, 0x74ab0, 0xaf02870), at 0xfe06e98
 libmtmalloc.so.1:free(0xaf28060, 0xfffb, 0xaf02800, 0xaf02828, 0x74ab0, 0xaf02828),
I vaguely remembered a bug in mtmalloc that returned an already freed pointer. I searched our bug database and found it. As I read through the synopsis, it became clear that we were NOT running into this bug: it was about libmtmalloc's realloc() returning an already freed pointer, and I didn't think we were doing a realloc here. I ruled that bug out and, just to be sure, checked the mtmalloc patch level on the system and found that they were running the latest mtmalloc patch.
I looked at the other core files sent by the customer, and this time the crash was somewhere else. As I analyzed the core files and pstack output, I noticed that the crashes were random in nature. That randomness pointed to memory corruption. I initially suspected a double-free type of error.
fe0a0218 _lwp_kill (6, 0, e1ffba50, ffffffff, 74620, 2969700) + 18
fe036e98 addsev (0, 0, 88, ffffffff, 74620, 2969670)
ff390bcc free (3e0d188, fffb, 2969600, 2969628, 74620, 2969628) + 1e0
fd726f08 void*__Crun::vector_del(void*,unsigned,void(*)(void*)) (3e0d188, 0, 0, ffffffff, 0,
I opened the core file in mdb and ran ::umem_verify and, to my surprise, the integrity of all the caches came up clean. Where do I go from here?
Okay, I ran the ::umem_status command, and it printed the exact nature of the corruption, including the stack trace of the thread that last freed the buffer and the offset at which someone wrote to the buffer after it was freed.
This information was sufficient for the Sun Access Manager engineering team to come up with a fix and release hotpatch 2.2-01 for the Policy Agent.
How good is that! I told you, libumem is powerful yet simple and easy to use.
Status: ready and active
Logs: transaction=256k content=256k fail=256k (inactive)
umem allocator: buffer modified after being freed
modification occurred at offset 0x20 (0xdeadbeefdeadbeef replaced by 0xdeadbeeedeadbeef)
buffer=a7a0708 bufctl=a813f68 cache: umem_alloc_96
previous transaction on buffer a7a0708:
thread=28 time=T-284.970780700 slab=a5dede0 cache: umem_alloc_96
umem: heap corruption detected