Friday Feb 29, 2008

pstack in Solaris 10 prints java frames

Prior to Solaris 10, if you ran the pstack command against a java process, you would see garbage characters in the java frames, as seen below.

-----------------  lwp# 3 / thread# 3  --------------------
fef9f9c8 lwp_cond_wait (f2918, f2900, 0)
fe8f2948 void ObjectMonitor::wait(long long,int,Thread*) (f2918, f2900, fed5c048, 0, fed675a0, f28d8) + 558
fe8f1d8c JVM_MonitorWait (0, f2304, 0, 0, f1d28, fed4dd24) + 2f0
f8d1b9b0 ???????? (bb6e1f08, fb67fa24, f26dc, bd13f668, bea2afa0, bd292558)
f8e7f49c ???????? (f8aa71c0, f8e7f1c8, f8e7f5d8, 0, f1d28, 8)
f8aa71b8 ???????? (0, fed34000, fed627f8, f8812e10, 5400, fb67fae0)
f8800218 ???????? (fb67fbc8, fb67fd90, a, d54435d8, f880a7e0, fb67fcc0)
fe8d9f00 void JavaCalls::call_helper(JavaValue*,methodHandle*,JavaCallArguments*,Thread*) (1, f1d28, fb67fcb8, fb67fbd8, fb67fcbc, 0) + 5b8
fe8f1734 void JavaCalls::call_virtual(JavaValue*,Handle,KlassHandle,symbolHandle,symbolHandle,Thread*) (555c, f1d28, f22f0, f22fc, f26dc, b0c00750) + 18c
fe8f1590 void thread_entry(JavaThread*,Thread*) (b0c00750, f1d28, fed5c928, d542470, f22f0, fed5c418) + 12c
fe8f1414 void JavaThread::run() (f1d28, fed4b784, 6, 5400, 5790, fed34000) + 1f4
fec1ffac void*_start(void*) (f1d28, 32f4, 5800, 5b1c, 3000, fed34000) + 144
feff5854 _lwp_start (0, 0, 0, 0, 0, 0) 

 
With Solaris 10, pstack has the ability to decode java frames, which makes our lives as support engineers a bit easier. Here is sample pstack output taken on a Solaris 10 system against a java process.

 -----------------  lwp# 10 / thread# 10  --------------------

 fef46acc lwp_cond_wait (94d7e0, 94d7c8, 0, 0)
fe8f490c void ObjectMonitor::wait(long long,int,Thread*) (94d7e0, 94d7c8, fed7ccc8, 0, fed88224, 94d7a0) + 558
fe8f4304 void ObjectSynchronizer::wait(Handle,long long,Thread*) (94d578, 0, 0, fed67688, cf77f5ec, 4818) + d8
fe8f3f18 JVM_MonitorWait (0, 94d0a4, 0, 0, 94d578, fed6e554) + 2f0
f900bc20 * java/lang/Object.wait(J)V+26828
f900bbc4 * java/lang/Object.wait(J)V+0
f9005764 * java/lang/Object.wait()V+2 (line 474)
f9005764 * java/util/TimerThread.mainLoop()V+28 (line 483)
f9005764 * java/util/TimerThread.run()V+1 (line 462)
f9000218 * StubRoutines (1)
fe8dc090 void JavaCalls::call_helper(JavaValue*,methodHandle*,JavaCallArguments*,Thread*) (1, 94d578, cf77fa98, cf77f9b8, cf77fa9c, 0) + 5b8
fe8f38c0 void JavaCalls::call_virtual(JavaValue*,Handle,KlassHandle,symbolHandle,symbolHandle,Thread*) (57cc, 94d578, 94d090, 94d09c, 94d47c, d4cb7350) + 18c
fe8f371c void thread_entry(JavaThread*,Thread*) (d4cb7350, 94d578, fed7d5a8, f4c2d2e0, 94d090, fed7d098) + 12c
fe8f35a0 void JavaThread::run() (94d578, fed6bef4, 6, 5800, 5a0c, fed54000) + 1f4
fec2af08 void*_start(void*) (94d578, fed6e58c, fed67650, 0, 3314, fed54000) + 200
fef457f8 _lwp_start (0, 0, 0, 0, 0, 0)
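
For completeness, both of the stacks above were gathered simply by pointing pstack at the process; it works the same way against a live process id or against a corefile:

# pstack <pid of the java process>
# pstack <corefile from the java process>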

Sunday Nov 18, 2007

Using libumem to detect write-beyond-what-you-allocate errors


I am a big fan of libumem. I've been using it for years to debug application crashes reported by our customers, and I've found it very useful in isolating the source of memory corruption. I thought I'd share some of my experiences here.

I will use some examples from real cases although I might obscure the names of some of the libraries.

One of our elite customers reported that their application was crashing, and he suspected that java was the cause of the crash. This was a pretty complex java application that involved a few native libraries as well.

One of the challenging parts of a support job is isolating the problem; even more challenging is convincing the customer that the problem is elsewhere and not where he thinks it is.

But, you know what, with libumem you'll see how easy this can be. I asked the customer to run their application with libumem enabled and send us the corefile from the next crash, and so they did.
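
In case you are wondering what "run with libumem enabled" means: it is just a matter of preloading the library and switching on its debugging features in the environment before starting the application, roughly like this (the startup script name here is made up):

# LD_PRELOAD=libumem.so.1
# UMEM_DEBUG=default
# UMEM_LOGGING=transaction
# export LD_PRELOAD UMEM_DEBUG UMEM_LOGGING
# ./startJavaApp.sh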

The first thing to do, of course, is open the corefile in mdb and run the ::umem_verify command. This prints the name, address, and integrity of each umem cache. Take the address of the cache containing the corrupt buffer and run ::umem_verify against it; this gives you the address of the corrupt buffer.

# mdb java corefile
Loading modules: [ libumem.so.1 libthread.so.1 libc.so.1 ]
> ::umem_verify
Cache Name                   Addr      Cache Integrity
umem_bufctl_audit_cache      254008    clean
umem_alloc_8                 256008    clean
umem_alloc_16                25c008    clean
umem_alloc_24                25e008    1 corrupt buffer
umem_alloc_32                260008    clean
umem_alloc_40                262008    clean

The corrupt buffer comes from the umem_alloc_24 cache:

> 25e008::umem_verify
Summary for cache 'umem_alloc_24'
  buffer 1bc9628 (allocated) has a corrupt redzone size encoding

Let's dump the buffer:

The first 8 bytes hold metadata information. The actual buffer therefore starts at 0x1bc9628 + 0x8 = 0x1bc9630.


> 0x1bc9630/7X
0x1bc9630:      3137322e        32332e31        37302e37        3700cafe
                feedface        1498            1bcdc98

The contents of the buffer (the first 14 bytes above, 3137322e 32332e31 37302e37 3700) indicate that the application has written 14 bytes. The content is the hexadecimal ASCII encoding of the IP address 172.23.170.77 (0x31 = '1', 0x37 = '7', 0x32 = '2', 0x2e = '.', and so on) followed by a NULL character.

From the redzone data, let's find out the actual number of bytes that this application allocated.

The redzone data is the 8 bytes that follow the buffer. When a buffer is allocated with libumem, the first 4 bytes of the redzone contain the pattern 0xfeedface and the last 4 bytes contain an encoding of the size that the application actually allocated. Do the following math to decode that size.

0x1498 == 5272 (decimal)
5272 / 251 = 21
21 - 8 = 13 bytes

Aha, someone allocated 13 bytes and wrote 14 bytes into it -- that explains it all.
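
If you find yourself decoding redzones by hand often, the same arithmetic fits in a few lines of C. This is just a throwaway sketch of the math above, not anything that ships with libumem:

#include <stdio.h>
#include <stdint.h>

/*
 * Decode a libumem redzone size encoding: divide by 251, then subtract
 * the 8 bytes of metadata that precede the user buffer.
 */
static uint32_t
redzone_decode(uint32_t encoded)
{
        return (encoded / 251 - 8);
}

int
main(void)
{
        /* 0x1498 is the encoded value found in the corrupt buffer above. */
        printf("%u bytes allocated\n", (unsigned)redzone_decode(0x1498));
        return (0);
}

Running it prints "13 bytes allocated", matching the hand calculation.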

Now, let's see who allocated this buffer in the first place. To do that, take the value that follows the redzone data (1bcdc98 above) and run the ::bufctl_audit command against it:

> 1bcdc98::bufctl_audit
ADDR BUFADDR TIMESTAMP THR LASTLOG CONTENTS CACHE SLAB NEXT DEPTH
01bcdc98 01bc9628 a35debd7cb0b0 3b 000d92c0 00000000 0025e008 00934450 00000000 f
libumem.so.1`umem_cache_alloc+0x208
libumem.so.1`umem_alloc+0x44
libumem.so.1`malloc+0x2c
libobscure.so.1`decode+0x0020
libobscure.so.1`getItemN+0x0070
libobscure.so.1`_InitSession+0x934
libobscure.so.1`Java_com_xxx_yyy_od_zzzInterface_main+0x01f8
0xf900bc20

As you can see above, this buffer is allocated in the native library libobscure.so via the Java Native Interface. When they allocated the memory to store the IP address, they did not take the NULL character into account and therefore were writing beyond what they actually allocated. This information was enough to convince the customer that the corruption was not in java but in a native library that they were using.
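
In other words, the native code was doing something morally equivalent to the sketch below. This is a hypothetical reconstruction, not the customer's actual code, but it is the classic shape of this bug:

#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical reconstruction: the buffer is sized for the characters of
 * the IP address but not for the terminating NULL.
 */
char *
save_peer_address(const char *ip)          /* ip = "172.23.170.77", 13 chars */
{
        char *buf = malloc(strlen(ip));    /* allocates 13 bytes              */
        strcpy(buf, ip);                   /* writes 14 bytes -- one too many */
        return (buf);
}

Allocating strlen(ip) + 1 bytes, or simply calling strdup(3C), is all it takes to avoid the overflow.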
