Using libumem to detect modify-after-free corruptions

Here is another example of how libumem can help us detect memory corruption.

The problem reported by the customer was that Sun Web Server 6.1, protected by Sun's Access Manager Policy Agent, was crashing in production. The problem did not show up in their staging environment; only the production servers were crashing.

The customer sent in several core files for us to analyze. The stack trace in the first core file I looked at showed that we were aborting in libmtmalloc.so.1.

=>[1] libc.so.1:__lwp_kill(0x0, 0x6, 0x0, 0xfe0bc000, 0x5, 0xaf295c8), at 0xfe0a0218
  [2] libc.so.1:raise(0x6, 0x0, 0xe1efba50, 0xffffffff, 0x74ab0, 0xaf02900), at 0xfe050c80
  [3] libc.so.1:abort(0x0, 0x0, 0x88, 0xffffffff, 0x74ab0, 0xaf02870), at 0xfe06e98
  [4] libmtmalloc.so.1:free(0xaf28060, 0xfffb, 0xaf02800, 0xaf02828, 0x74ab0, 0xaf02828),

I vaguely remembered that there was a bug in mtmalloc that returned an already freed pointer. I searched our bug database and found the bug. As I read through its synopsis, it became clear that we were NOT running into it: that bug was about libmtmalloc's realloc() returning an already freed pointer, and we were not doing a realloc here. I ruled that bug out, and just to be sure, I checked the mtmalloc patch level on the system and confirmed they were running the latest mtmalloc patch.

I looked at the other core files sent by the customer, and this time the crash was somewhere else. As I analyzed the core files and the pstack output, I noticed that the crashes were random in nature. That randomness pointed to memory corruption; my initial suspicion was a double-free type of error (a contrived sketch of that pattern follows the stack below).

fe0a0218 _lwp_kill (6, 0, e1ffba50, ffffffff, 74620, 2969700) + 18
fe036e98 addsev   (0, 0, 88, ffffffff, 74620, 2969670)
ff390bcc free     (3e0d188, fffb, 2969600, 2969628, 74620, 2969628) + 1e0
fd726f08 void*__Crun::vector_del(void*,unsigned,void(*)(void*)) (3e0d188, 0, 0, ffffffff, 0,
fd70e934 std::vector<bool>::iterator&std::vector<bool>::iterator::operator++()
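For reference, a double free is exactly the kind of bug that produces these random, delayed crashes: the second free() quietly corrupts the allocator's internal state, and the process only dies later, inside some unrelated malloc() or free(). A contrived minimal sketch (not the agent's actual code):

#include <stdlib.h>

int main(void)
{
    char *buf = malloc(64);
    free(buf);

    /* ... much later, on a different code path ... */
    free(buf);  /* double free: corrupts the allocator's free list; the
                 * crash typically surfaces later, in an unrelated
                 * malloc() or free(), so the failing stacks look random */
    return 0;
}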

I asked the customer to run with libumem; the customer obliged (BIG thank you!) and sent us three core files generated with libumem enabled.
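For anyone who wants to try this: enabling libumem is just a matter of preloading it, with the debugging features turned on, in the environment of the process you want to watch. Something along these lines (the webservd invocation is illustrative; in a real deployment the variables go into the web server's start script):

$ LD_PRELOAD=libumem.so.1 UMEM_DEBUG=default UMEM_LOGGING=transaction,contents ./webservd

UMEM_DEBUG=default turns on the audit, contents, and guards debugging features, and UMEM_LOGGING keeps the transaction and content logs that the umem dcmds in mdb can read.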

I opened the first core file in mdb, ran ::umem_verify, and to my surprise the integrity of all the caches came up as clean. Where do I go from here?
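A clean ::umem_verify run looks something like this (the caches and addresses here are illustrative, not from the customer's core):

> ::umem_verify
Cache Name                            Addr     Cache Integrity
umem_alloc_8                          7a008    clean
umem_alloc_16                         7c008    clean
...
umem_alloc_96                         8e008    clean

::umem_verify walks the buffers in every umem cache checking the redzones and free patterns, so "clean" only tells you that nothing looked damaged at the instant the core was taken.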

Next I ran the ::umem_status command, and that printed the exact nature of the corruption, including the stack trace of the thread that last freed the buffer and the offset at which someone wrote to the buffer after it had been freed.

This information was sufficient for the Sun Access Manager engineering team to come up with a fix and release hotpatch 2.2-01 for the Policy Agent.

How good is that! I told you: libumem is so powerful, yet so simple and easy to use.

mdb>::umem_status
Status: ready and active
Concurrency: 8
Logs: transaction=256k content=256k fail=256k (inactive)
Message buffer:
umem allocator: buffer modified after being freed
modification occurred at offset 0x20 (0xdeadbeefdeadbeef replaced by 0xdeadbeeedeadbeef)
buffer=a7a0708 bufctl=a813f68 cache: umem_alloc_96
previous transaction on buffer a7a0708:
thread=28 time=T-284.970780700 slab=a5dede0 cache: umem_alloc_96
libumem.so.1`umem_cache_free+0x4c
libumem.so.1`process_free+0x006c
libames6.so`PR_DestroyLock
libames6.so`~Node
libames6.so`~RefCntPtr
libames6.so`~Tree
libames6.so`~PolicyEntry
libames6.so`~RefCntPtr
libames6.so`cleanup+0x0030
libames6.so`operator+0x01fc
libames6.so`spin+0x023c
libnspr4.so`_pt_root+0x00d4
libthread.so.1`_lwp_start
umem: heap corruption detected
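The message buffer tells the whole story. 0xdeadbeef is the pattern libumem writes over a buffer when it is freed, so a word at offset 0x20 changing from 0xdeadbeefdeadbeef to 0xdeadbeeedeadbeef means some thread cleared a single bit in the buffer after it had been freed, and the free stack shows that the buffer held an NSPR lock destroyed with PR_DestroyLock during PolicyEntry teardown. A hypothetical pattern consistent with that stack (the types and function names below are illustrative, not the agent's actual code):

#include <stdlib.h>
#include <prlock.h>   /* NSPR, which the agent uses via libnspr4.so */

/* Hypothetical reference-counted node; the names are made up. */
typedef struct node {
    PRLock *lock;
    /* ... */
} node_t;

static void node_destroy(node_t *n)
{
    PR_DestroyLock(n->lock);   /* frees the lock's buffer (umem_alloc_96) */
    free(n);
}

/* Meanwhile another thread still holds a stale pointer to the node: */
static void worker(node_t *n)
{
    PR_Lock(n->lock);          /* touches freed memory...                */
    PR_Unlock(n->lock);        /* ...clearing a bit in the dead lock:
                                * the modify-after-free libumem reported */
}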

