Thursday Jul 01, 2010

Partition Alignment Guidelines for Unified Storage

If you create and access logical disks (aka LUNs) from your Sun Unified Storage appliance, whether over iSCSI or Fibre Channel, you should be aware that client-side partition alignment can have a big impact on performance. This is a generic issue that applies to any virtual disk interface, not just Unified Storage, and relates to how client generated virtual disk I/O maps to actual I/O in the appliance. The good news is that it can be quite easy to properly align partitions.

Background

The reason we care about alignment is that most of the storage industry is based on the historical abstraction of 512 byte sectors. However, the consumers of those sectors (filesystems and applications managing raw storage) and the sophisticated storage arrays that provide block storage generally organize their data internally in larger units, typically 4KB or multiples thereof. This includes LUNs in a Sun Unified Storage appliance, which use a default volume block size of 8KB. Without proper care, filesystem blocks can end up non-aligned with the natural block size of the storage, impacting performance.

With proper alignment, a single client block that is the same size as or smaller than the volume block size of a LUN will be contained entirely within a single volume block in the LUN. Without proper alignment, that same client block may span multiple volume blocks in the LUN. That could result in 2 appliance reads for a single client read, and 2 appliance reads plus 2 appliance writes for a single client write. This will obviously have a big impact on performance if ignored.

The graphic below illustrates partition misalignment:

What we see in the graphic above is that the LUN is divided into fixed-size volume blocks, which are 8KB by default in the Unified Storage appliance. A given volume block in the LUN will always be read or written as a whole. When a LUN is imported by the client, it is presented as if it were a physical disk drive, with its own virtual sector size of 512 bytes. The client generally allocates sectors to one or more partitions or slices, which are then made available for file systems or raw application data. The block addresses of application or file system I/O are relative to the first sector in the partition, and the partition may be allowed to start on an arbitrary 512 byte sector.

In the example above, partition block P0 starts on an arbitrary sector, and spans LUN volume blocks L0 and L1. If we issue an 8KB read to P0, which matches the default 8KB volume block size of the LUN, we will have to read both L0 and L1 to get the data. If we issue an 8KB write to P0, we will have to read both L0 and L1 to get the data that is not being changed, then write back both L0 and L1 to store the combined new and old data.

The graphic below illustrates a properly aligned partition:

We now see that partition block P0 starts on a 512 byte sector that coincides with the start of LUN volume block L1. If we issue an 8KB read to P0, which matches the default 8KB volume block size of the LUN, we only have to read L1 to get the data. If we issue an 8KB write to P0, we simply replace LUN volume block L1 and do not need to do any reads at all. We have skipped a small amount of space in the LUN, but the result is a potentially large reduction in I/O.

There are three primary issues that lead to partition misalignment:

  • Most platforms consume some number of sectors at the beginning of a disk for a disk label, and actual data storage must skip these sectors to avoid overwriting the label.
  • Most partition management software, such as format, parted, and diskpart, was written to manage physical disks with 512 byte sectors. If there are constraints on how sectors are grouped into partitions, they typically relate to disk characteristics like heads, tracks, and cylinders, not virtual characteristics like user selected volume block sizes.
  • Most platforms allow you to create multiple partitions or slices within a disk, so even if the first partition is aligned, subsequent partitions may not be aligned.

Taken together, these factors mean that block zero of a given partition may map to an arbitrary 512 byte sector on a virtual disk, and for most platforms your partitions will not be aligned on LUN volume block size boundaries by default.

General Recommendations

If possible, use a disk label that allows sector addressing rather than cylinder addressing for partition/slice locations. This allows for simple math when calculating alignment.

If possible, create a single data partition/slice on the LUN, especially if you must use cylinder addressing. This avoids having to calculate alignment at multiple points within the LUN.

If you are offered an explicit alignment option by your disk partitioning software, use it. This currently only applies to Windows 2003 or later, where the diskpart.exe utility allows an "align=X" option on the create partition command, where X is the desired alignment in kilobytes. You should specify an alignment that either matches the volume block size in the LUN, or is a power of two and is larger than the volume block size.

Aligning by Sector

To manually calculate partition alignment by sector, make sure that the starting sector number of each partition is a multiple of the number of sectors in a LUN volume block. For example, with 512 byte sectors, there are 16 sectors in a default 8KB LUN volume block. In that case, the starting sector of each partition/slice should be a multiple of 16. The maximum volume block size for LUNs in the Sun Unified Storage appliance is currently 128KB, and there are 256 sectors in a 128KB volume block. For a 128KB volume block size, the starting sector of each partition/slice should be a multiple of 256.

If you are aligned for a power of two volume block size, you are also aligned for any smaller power of two volume block size. All supported volume block sizes in the Sun Unified Storage appliance are powers of two, so aligning for the maximum 128KB volume block size (i.e. starting partitions on multiples of 256 sectors) ensures alignment for all currently supported LUN volume block sizes.

Aligning by Cylinder

If you use a disk label that requires partitions/slices to begin on a cylinder boundary (for example, Solaris SMI labels), make sure that the starting cylinder number multiplied by the number of sectors per cylinder is a multiple of the number of sectors per LUN volume block.

The following Least Common Multiple (LCM) method can simplify the process:

  • Determine sectors per cylinder. In Solaris format, this is nhead * nsect. In Linux fdisk, this is heads * sectors/track.
  • Determine sectors per LUN volume block. There are two 512 byte sectors per kilobyte, so an 8KB volume block is 16 sectors, and a 128KB volume block is 256 sectors.
  • Find the LCM of the number of sectors per cylinder and per LUN volume block, for example by using a tool like http://www.mathsisfun.com/least-common-multiple-tool.html
  • Divide the LCM by the number of sectors per cylinder.
  • The result is the first non-zero cylinder that is aligned for your volume block size. Any cylinder that is a multiple of this number is also aligned.

For example, with 255 heads and 63 sectors per track, we have 16065 sectors per cylinder. With an 8KB LUN volume block size, we have 16 sectors per volume block. The LCM of 16065 and 16 is 257040. Dividing the LCM by 16065 (sectors per cylinder) gives us 16. Cylinder 16 is the first non-zero cylinder that is aligned for an 8KB LUN volume block, and any cylinder that is a multiple of 16 is also aligned.
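
If you would rather compute this locally than use a web tool, the following small C sketch (illustrative only, not part of the original procedure) implements the steps above:

/* lcm_align.c - illustrative sketch: given sectors per cylinder and
 * sectors per LUN volume block, print the LCM and the first non-zero
 * aligned cylinder. Example: ./lcm_align 16065 16 */
#include <stdio.h>
#include <stdlib.h>

static unsigned long gcd(unsigned long a, unsigned long b)
{
    while (b != 0) {
        unsigned long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

int main(int argc, char **argv)
{
    unsigned long spc, spb, lcm;

    if (argc != 3) {
        fprintf(stderr, "usage: %s sectors_per_cylinder sectors_per_volblock\n", argv[0]);
        return 1;
    }
    spc = strtoul(argv[1], NULL, 10);  /* e.g. 255 * 63 = 16065          */
    spb = strtoul(argv[2], NULL, 10);  /* e.g. 16 for an 8KB volume block */
    lcm = spc / gcd(spc, spb) * spb;
    printf("LCM = %lu sectors, first aligned cylinder = %lu\n", lcm, lcm / spc);
    return 0;
}

Running it with 16065 and 16 reproduces the example above: an LCM of 257040 sectors and cylinder 16.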

Caveats

  • Do not use sector 0 of an MBR/msdos labeled LUN, or sectors 0 through 33 of an EFI/gpt labeled LUN, to avoid overwriting the label.
  • Do not trust cylinder numbers reported by Linux fdisk or parted, because both may be rounded to the nearest cylinder. As described in the Linux specific section below, set units to sectors in both tools to verify alignment.
  • Do not trust KB offsets reported by Windows diskpart.exe, because they may be rounded to the nearest KB. As described in the Windows specific section below, you can use the wmic.exe utility to display actual byte offsets.
  • Do not trust cylinder numbers reported by Solaris fdisk on x86/amd64/x86_64 in interactive mode, because they may be rounded to the nearest cylinder. As described in the Solaris on x86/amd64/x86_64 section below, you can run "fdisk -W - {raw_device}" and use the reported Rsect (relative starting sector) to verify alignment. Note that Solaris fdisk will only create cylinder aligned partitions, so this issue relates primarily to reporting the location of partitions created by another mechanism.
  • If you use an SMI label with Solaris on x86/amd64/x86_64, keep in mind that the SMI label subdivides a partition within an MBR/msdos labeled LUN, so there are two levels of alignment to consider. See the Solaris on x86/amd64/x86_64 section for details.

Platform Specific Recommendations

Solaris on SPARC

If possible, use an EFI label (requires "format -e") which allows sector addressing. Configure data slices with a starting sector that is a multiple of the number of 512 byte sectors per LUN volume block. With a default volume block size of 8KB, the starting sector of each slice should be a multiple of 16. With any currently supported volume block size up to 128KB, a slice can begin on a sector that is a multiple of 256.
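
You can verify the resulting layout with the prtvtoc(1M) utility, which reports the first sector of each slice. For proper alignment, the First Sector value of each data slice should be a multiple of the sector count calculated above, for example 256. The device name below is just a placeholder:

prtvtoc /dev/rdsk/c2t1d0s0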

If you use an EFI label, ensure that sectors 0 through 33 are not assigned to any slice, to avoid overwriting the label.

If you use an SMI label, you will be constrained to begin all slices on a cylinder boundary. To determine whether a cylinder is aligned on a LUN volume block boundary, multiply the cylinder number by the number of 512 byte sectors per cylinder. The result should be a multiple of the number of sectors per LUN volume block.

Refer to the Aligning by Cylinder section above for a Least Common Multiple method you can use to determine cylinder alignment.

Solaris on x86/amd64/x86_64

If possible, use an EFI label (requires "format -e") which allows sector addressing. However, be aware that unlike an SMI label, which subdivides a partition within an MBR/msdos labeled LUN when used on x86/amd64/x86_64, an EFI label replaces any existing MBR/msdos label, destroying any existing non-Solaris partitions.

If using an EFI label, use the same EFI guidelines as those described above in the Solaris on SPARC section.

If you use an SMI label with Solaris on x86/amd64/x86_64, keep in mind that the SMI label subdivides a Solaris2 partition within an MBR/msdos labeled LUN, so there are two levels of alignment to consider. The Solaris fdisk utility will report partitions relative to the beginning of the disk/LUN, and the Solaris format utility will report slices relative to the beginning of the Solaris2 partition.

One caveat with fdisk is that in interactive mode it will only create cylinder aligned partitions, but will also report partition starting points rounded to the nearest cylinder if they were created by another mechanism and are not actually cylinder aligned.

To confirm that a Solaris2 fdisk partition starts on a cylinder boundary, run "fdisk -W - {raw device}" and verify that the reported Rsect (relative starting sector) is a multiple of the number of sectors per cylinder.

The simplest alignment method for SMI on x86/amd64/x86_64 is to ensure that the Solaris2 partition created/reported by fdisk is on a non-zero cylinder boundary that is aligned for your LUN volume block size. You can then use the same guidelines as those described above in the Solaris on SPARC section to align slices within the Solaris2 partition using the format utility.

Refer to the Aligning by Cylinder section above for a Least Common Multiple method you can use to determine cylinder alignment.

Linux

Make sure that units are set to sectors when creating or displaying partitions in fdisk and/or parted. If using fdisk in interactive mode, the "u" command toggles units back and forth between sectors and cylinders. If using parted in interactive mode, the "unit s" command sets units to sectors.

If you use either tool with units set to cylinders, the reported cylinder numbers may be rounded. Even if you do the math to determine a cylinder that should be aligned, you cannot be sure that you are actually aligned unless you set units to sectors.

To ensure alignment, configure data partitions with a starting sector that is a multiple of the number of 512 byte sectors per LUN volume block. With a default volume block size of 8KB, the starting sector of each partition should be a multiple of 16. With any currently supported volume block size up to 128KB, a partition can begin on a sector that is a multiple of 256.
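
To double check the result, both tools can display partition tables in sector units directly from the command line. For example (the device name is a placeholder):

fdisk -lu /dev/sdb
parted /dev/sdb unit s print

The starting sector reported for each data partition should be a multiple of 16 for an 8KB volume block, or a multiple of 256 for any volume block size up to 128KB.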

If you would like to choose a sector that is aligned for your LUN volume block and is also on a cylinder boundary, refer to the Aligning by Cylinder section above for a Least Common Multiple method you can use to determine cylinder alignment. After determining an aligned cylinder, multiply the cylinder number times sectors per cylinder, and use that as your starting sector number.

If you use a gpt label (equivalent to an EFI label in Solaris), ensure that sectors 0 through 33 are not assigned to any partition, to avoid overwriting the label.

If you use an MBR (aka msdos) label, ensure that sector 0 is not assigned to any partition, to avoid overwriting the label.

Windows

For Windows 2003 SP1 and later, the diskpart.exe utility can be used to create aligned partitions by including the align=X option on the create partition command, where X is the desired alignment in kilobytes. To create an aligned partition, simply specify a power of two alignment that is greater than or equal to the LUN volume block size. For example, use align=128 to align for any LUN volume block size up to 128 KB. The default in Windows Vista and Windows 2008 is align=1024, which is correctly aligned for any power of two LUN volume block size up to 1MB, and does not need to be changed.
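
For example, a diskpart session that creates such a partition might look like the following (the disk number is a placeholder):

DISKPART> select disk 1
DISKPART> create partition primary align=128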

A caveat with the diskpart.exe utility is that it displays the offset in KB, but this is a rounded value. For example, a default Windows 2003 partition offset of 63 sectors is actually 31.5 KB, but will be displayed by diskpart.exe as 32 KB.

To determine the actual byte offset of partitions in Windows, you can use the wmic.exe utility, with a command like:

wmic partition get StartingOffset, Name, Index

This will show partition information for all of the basic disks/LUNs in the system, with StartingOffset specified in bytes. For proper alignment, StartingOffset should be a multiple of the number of bytes (not sectors) in the LUN volume block. For example, with a default 8KB LUN volume block size, StartingOffset should be a multiple of 8192.

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Thursday Mar 11, 2010

OLTP Improvements in Sun Storage 7000 2010.Q1

The 2010.Q1 Software Update for the Sun Storage 7000 Unified Storage Systems product line is now available. Among the many enhancements and new features included in this update is an important performance improvement for configurations using shares or LUNs with small record sizes. The two most likely configurations to benefit from this change are OLTP databases, which typically configure the record size of a share to match the block size of the database, and iSCSI LUNs, which have a default block size of 8KB.

For OLTP databases, we have seen as much as:

  • 50% increase in average throughput
  • 70% reduction in variability

This is based on transaction rates measured over a series of identically configured benchmark runs. Roch Bourbonnais provides a detailed discussion on his blog of the engineering work that went into this improvement, and I will highlight the aspects specific to Oracle and other OLTP database configurations.

In general, if you have configured your Unified Storage appliance to have shares or LUNs with recordsize/blocksize less than 128K, you are strongly encouraged to upgrade to the latest software release for enhanced overall performance.

For the details of how these gains were achieved, read on....

As Roch describes in his blog, this improvement relates to metaslab and block allocation in ZFS, and was tracked as CR 6869229. As he describes, to store data in a ZFS pool, ZFS first selects a vdev (a physical block device like a disk, or a logical grouping of physical block devices comprising a RAID group), then selects a metaslab (a region of space) within that vdev, and finally a block within that metaslab. I refer you to Roch's blog for more details on this and on the changes being introduced, and to Jeff Bonwick's older blogs on ZFS Block Allocation and Space Maps for further background.

As you may know, ZFS supports multiple record sizes, from 512 bytes to 128 kilobytes. In most cases, we recommend that you use the default record size of 128K for ZFS file systems, unless you have an application that manages large files using small random reads and writes. The best-known example is database files, where it can be beneficial to match the ZFS record size to the database block size. This also applies to iSCSI LUNs, which have a default block size of 8K. In both cases, you may have a large amount of data that is randomly updated in small units of space.
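
On the appliance, this corresponds to the record size of a share or the volume block size of a LUN. On a general purpose ZFS system, the equivalent would be to set the property before loading the data, for example (pool and file system names are placeholders):

zfs create -o recordsize=8k tank/oradata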

The OLTP testing that contributed to CR 6869229 was for an Oracle database consisting of roughly 350GB of data and log files, stored on a Unified Storage appliance and accessed using NFSv4 with direct I/O. The workload was an OLTP environment simulating an order entry system for a wholesale supplier. The database block size was configured at 4KB, to minimize block contention, and the recordsize of the shares containing data files was configured with a matching 4KB record size. The database log files, which are accessed in a primarily sequential manner and with a relatively large I/O size, were configured with the default 128KB record size. In addition, the log file shares were configured with log bias set to latency, and the data file shares were configured with log bias set to throughput.

Initial testing consisted of repeated benchmarks runs with the number of active users scaled from 1 to 256. Three runs were completed at each user count before increasing to the next level. This testing revealed an anomaly, in that there was a high degree of variability among runs with the same user count, and that a group of runs with relatively low throughput could be followed by a sudden jump to relatively high throughput. To better understand the variability, testing was altered to focus on multiple, repeated runs with 64 active users, with all other factors held constant. This testing continued to exhibit a high degree of variability, and also revealed a cyclic pattern, with periods of high throughput followed by slow degradation over several runs, followed by a sudden return to the previous high. To identify the cause of the variation in throughput, we collected a broad range of statistics from Oracle, from Solaris, and from Analytics in the Unified Storage appliance. Some examples include Oracle buffer pool miss rates, top waiters and their contribution to run time, user and system CPU consumption, OS level reads and writes per second, kilobytes read and written per second, average service time, Appliance level NFSv4 reads and writes per second, disk reads and writes per second, and disk kilobytes read and written per second. These data were loaded into an OpenOffice spreadsheet, processed to generate additional derived statistics, and finally analyzed for correlation with the observed transaction rate in the database. This analysis highlighted I/O size in the appliance as the statistic having the strongest correlation (R\^2 = 0.83) to database transaction rates. What this showed is that database transaction rate seemed to increase with increased I/O size in the appliance, which also related to lower read and write service times as seen by the database server. Conversely, as average I/O size in the appliance dropped, database transaction rates would tend to drop as well. The question was, what was triggering changes in I/O size in the appliance, given a consistent I/O size in the application?

As Roch describes in his blog, metaslab and block allocation in ZFS were ultimately found to contribute heavily to the observed variability in OLTP throughput. When a given metaslab (a region of space within a vdev) became 70% full, ZFS would switch from a first fit to a best fit block allocation strategy within that metaslab, to help with the compactness of the on disk layout. Note that this refers to a single metaslab within a vdev, not the entire vdev or storage pool. With a random rewrite workload to a share with a small record size, like the 4KB OLTP database workload in our tests, the random updates tended to free up individual records within a given metaslab. When we switched to best fit allocation, new 4KB write requests would prefer to use these "best fit" locations rather than other, possibly larger areas of free space. This inhibited the ability of ZFS to do write aggregation, resulting in more IOPS required to move the same amount of data.

Two related space allocation issues were identified and ultimately improved. The first was to raise the threshold for transition to best fit allocation from 70% full to 96% full, and the second was to change the weighting factors applied to metaslab selection so that a higher level of free space would be maintained per metaslab. The latter avoids using metaslabs that might transition soon to best fit allocation, and more quickly switches away from a metaslab once it does make that transition. This will tend to spread a random write workload among more metaslabs, and each will have more free space and will permit a higher degree of write aggregation.

As mentioned already, the end result of these changes and other enhancements in the new software update was a 50% improvement in average OLTP throughput for this workload and a 70% reduction in variability from run to run. Roch also reports a 200% improvement in MS Exchange performance, and others have reported substantial improvements in performance consistency on iSCSI LUNs.

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Monday Nov 30, 2009

Maximizing NFS client performance on 10Gb Ethernet

I generally agree with the opening statement of the ZFS Evil Tuning Guide, which says "Tuning is often evil and should rarely be done." That said, tuning is sometimes necessary, especially when you need to push the envelope. At the moment, achieving peak performance for NFS traffic between a single client and a single server over 10 Gigabit Ethernet (10GbE) is one of those cases. I will outline below the tunings I used to achieve a ~3X throughput improvement in NFS IOPS over 10GbE, on a Chip Multithreading (CMT) system running Solaris 10 Update 7 (S10u7).

The default values for the tunables outlined below are all either being reviewed, or have already changed since the release of S10u7. Some of these tunings are unnecessary if you are running S10u8, and they should all be unnecessary in the future. Consider these settings a workaround to achieve maximum performance, and plan to revisit them in the future. A good place to monitor for future developments is the Networks page on the Solaris Internals site. You can also review the NFS section of the Solaris Tunable Parameters Reference Manual.

If you want to fine tune these settings beyond what is outlined here, a reasonable technique is to start from your current default settings and double each value until you see no further improvement.

For the time being, consider the following settings if you plan to run NFS between a single client and a single server over 10GbE:

Step 1 - TCP window sizes

The TCP window size defines how much data a host is willing to send/receive without an acknowledgment from its communication partner. Window size is a central component of the TCP throughput formula, which can be simplified to the following if we assume no packet loss:

max throughput (per second) = window size / round trip time (in seconds)

For example, with 1ms RTT and the current default window size of 48k, we have:

49152 / 0.001 = ~50 MB/sec per communication partner

This is obviously too low for NFS over 10GbE, so the send and receive window sizes should be increased. A setting of 1MB provides a max bandwidth of ~1 GB/sec with a RTT of 1ms.

Solaris 10 Update 8 and earlier

	ndd -set /dev/tcp tcp_xmit_hiwat 1048576
	ndd -set /dev/tcp tcp_recv_hiwat 1048576

TCP window size has been the subject of a number of CRs, has changed several times over the years, and the default is likely to change again in the near future. Use a command like
	ndd -get /dev/tcp tcp_xmit_hiwat
on your system to check the current default value before tuning, to make sure that you do not inadvertently lower the values.

Note: if you want to increase TCP window sizes beyond 1MB, you should also increase tcp_max_buf and tcp_cwnd_max, which currently default to 1MB.
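
For example, to experiment with 2MB windows (shown only as an illustration, not a recommendation), raise the caps before the window sizes:

	ndd -set /dev/tcp tcp_max_buf 2097152
	ndd -set /dev/tcp tcp_cwnd_max 2097152
	ndd -set /dev/tcp tcp_xmit_hiwat 2097152
	ndd -set /dev/tcp tcp_recv_hiwat 2097152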

Step 2 - IP software rings

A general heuristic for network bandwidth is that we need approximately 1GHz of CPU bandwidth to handle 1Gb (gigabit) per second of network bandwidth. That means that we need to use multiple CPUs to match the bandwidth of a 10GbE interface. Software Rings are used in Solaris as a mechanism to spread the incoming load from a network interface across multiple CPU strands, so that we have enough aggregate CPU bandwidth to match the network interface bandwidth. The default value for the number of soft rings in Solaris 10 Update 7 and earlier is too low for 10GbE, and must be increased:

Solaris 10 Update 7 and earlier on Sun4v

In /etc/system
	set ip:ip_soft_rings_cnt=16

Solaris 10 Update 7 and earlier on Sun4u, x86-64, etc

In /etc/system
	set ip:ip_soft_rings_cnt=8

Solaris 10 Update 8 and later

Thanks to the implementation of CR 6621217 in S10u8, the default value for the number of soft rings should be fine for network interface speeds up to and including 10GbE, so no tuning should be necessary.

The changes introduced by CR 6621217 highlight why tuning is often evil. It was found that it is difficult to find an optimal, system wide setting for the number of soft rings if the system contains multiple network interfaces of different types. This resulted in the addition of a new tunable, ip_soft_rings_10gig_cnt, which applies to 10GbE interfaces. The old tunable, ip_soft_rings_cnt, applies to 1GbE interfaces. Both tunables have good defaults at this point, so it is best not to tune either on S10u8 and later.

Step 3 - RPC client connections

Now that we have enough IP software rings to handle the network interface bandwidth, we need to have enough IP consumer threads to handle the IP bandwidth. In our case the IP consumer is NFS, and at the time of this writing, its default behavior is to open a single network connection from an NFS client to a given NFS server. This results in a single thread on the client that handles all of the data coming from that server. To maximize throughput between a single NFS client and server over 10GbE, we need to increase the number of network connections on the client:

Solaris 10 Update 8 and earlier

In /etc/system
	set rpcmod:clnt_max_conns = 8

Note: for this to be effective, you must have the fix for CR 2179399, which is available in snv_117, s10u8, or s10 patch 141914-02.

A new default value for rpcmod:clnt_max_conns is being investigated as part of CR 6887770, so it should be unnecessary to tune this value in the future.

Step 4 - Allow for multiple pending I/O requests

The IOPS rate of a single thread issuing synchronous reads or writes over NFS will be bound by the round trip network latency between the client and server. To get the most out of the available bandwidth you should have a workload that generates multiple pending I/O requests. This can be from multiple processes each generating an individual I/O stream, a multi-threaded process generating multiple I/O streams, or a single or multi-threaded process using asynchronous I/O calls.
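
As an illustration (the file name, I/O size, and thread count below are arbitrary placeholders), a simple multi-threaded reader like the following keeps several synchronous requests in flight against a single NFS file:

/* par_read.c - illustrative sketch: issue synchronous 8KB reads from
 * several threads so that multiple NFS requests are outstanding at once.
 * Build with something like: cc -o par_read par_read.c -lpthread */
#include <sys/types.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NTHREADS 8
#define IOSIZE   8192
#define NREADS   10000

static int fd;

static void *reader(void *arg)
{
    long id = (long) arg;
    char *buf = malloc(IOSIZE);
    off_t off;
    long i;

    for (i = 0; i < NREADS; i++) {
        /* each thread works on its own interleaved set of offsets */
        off = (off_t) (i * NTHREADS + id) * IOSIZE;
        if (pread(fd, buf, IOSIZE, off) < 0) {
            perror("pread");
            break;
        }
    }
    free(buf);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    long i;

    fd = open("/mnt/nfs/testfile", O_RDONLY);   /* placeholder path */
    if (fd < 0) {
        perror("open");
        return 1;
    }
    for (i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, reader, (void *) i);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    close(fd);
    return 0;
}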

Conclusion

Once you have verified/tuned TCP window sizes, IP soft rings, and RPC client connections, and you have a workload that can capitalize on the available bandwidth, you should see excellent NFS throughput on your 10GbE network interface. There are a few more tunings that might add a few percentage points of performance, but the tunings shown above should suffice for the majority of systems.

As I mentioned at the start, these tunables are all either under investigation or already adjusted in Solaris 10 Update 8. Our goal is always to provide excellent performance out of the box, and these tunings should be unnecessary in the near future.

Friday Jan 09, 2009

Memory Leak Detection with libumem

I recently had the opportunity to do some memory leak detection with libumem, so I decided to share some thoughts and examples on its use.  The issue I was working on was related to a call for help from a colleague who was working primarily on Linux and OS X.  His application had a memory footprint that was growing over time, and he had used Valgrind and dtrace (on OS X) to try to find a leak, but had reached a dead end.  I offered to run the application on Solaris and use the libumem(3LIB) library and mdb(1) to search for leaks, and was able to quickly find a leak in the open source SIGAR library that he was using with his application.  For more details and current status on the specific leak, check out bug SIGAR-132.  For this discussion, I'll focus primarily on a simple example program to highlight libumem.

What is libumem?

The libumem(3LIB) library is a highly scalable memory allocation library that supports the standard malloc(3C) family of functions as well as its own umem_alloc(3MALLOC) functions and umem_cache_create(3MALLOC) object caching services.  It also provides debugging support including detection of memory leaks and many other common programming  errors.  The debugging capabilities are described in umem_debug(3MALLOC).  This discussion will focus primarily on using the debugging capabilities with standard malloc(3C).  For a performance comparison between libumem and several other memory allocators, have a look at Tim Cook's memory allocator bake-off from a few weeks back.

What is a memory leak?

Before I get started, let me clarify what I mean by a memory leak.  To me, a pure memory leak occurs when you allocate memory but then fail to retain a pointer to that memory.  For example, you might overwrite a pointer with a new value, or allow an automatic variable to be discarded without first freeing the memory that it references.  Without a pointer to the memory, you can't use it any more or free it, and it has leaked out of your control.  Some people also refer to situations where memory is held longer than necessary as a memory leak, but to me that is a memory hog, not a memory leak.  The debugging tools in libumem can help with both issues, but the techniques are different.  I will focus on what I consider a pure memory leak for today.

How do I enable libumem?

If you are compiling a new application and want libumem as your default memory allocator, just add -lumem to your compile or link command.  If you want to use any of the libumem specific functions, you should also #include <umem.h> in your program.  If you want to enable libumem on an existing application, you can use the LD_PRELOAD environment variable (or LD_PRELOAD_64 for 64 bit applications) to interpose the library on the application and cause it to use the malloc() family of functions from libumem instead of libc.

For example with sh/ksh/bash:

LD_PRELOAD=libumem.so your_command

with csh/tcsh:

(setenv LD_PRELOAD libumem.so; your_command)

To confirm that you are using libumem, you can use the pldd(1) command to list the dynamic libraries being used by your application.  For example:

$ pgrep -l my_app
 2239 my_app
$ pldd 2239
2239:    my_app
/lib/libumem.so.1
/usr/lib/libc/libc_hwcap2.so.1
$

How do I enable libumem debugging?

As described in umem_debug(3MALLOC), the activation of run-time debugging features is controlled by the UMEM_DEBUG and UMEM_LOGGING environment variables.  For memory leak detection, all we need to enable is the audit feature of UMEM_DEBUG.

For example, with sh/ksh/bash:

LD_PRELOAD=libumem.so UMEM_DEBUG=audit your_command

with csh/tcsh:

(setenv LD_PRELOAD libumem.so; setenv UMEM_DEBUG audit; your_command)

How do I access the debug data?

The libumem library provides a set of mdb(1) dcmds to inspect the debug data collected while the program runs.  To use the dcmds, you can either run your program under the control of mdb, attach to the program with mdb, or generate a core dump (for example with gcore(1)) and examine the dump with mdb.  The latter is the simplest, and looks like this:

$ pgrep -l my_app
1603 my_app
$ gcore 1603
gcore: core.1603 dumped
$ mdb core.1603
Loading modules: [ libumem.so.1 ld.so.1 ]
>

The commands above assume that your program runs long enough for you to generate the core dump, and that the memory leak has been triggered before the core dump is generated.  For a fast running program or to examine the image just before program exit, you can do the following:

$ LD_PRELOAD=libumem.so UMEM_DEBUG=audit mdb ./your_app
> ::sysbp _exit
> ::run
mdb: stop on entry to _exit
mdb: target stopped at:
0xfee3301a: addb %al,(%eax)
> ::load libumem.so.1
>

Once you are in mdb, you can get a listing of the libumem dcmds by running ::dmods -l libumem.so.1 and can get help on an individual dcmd with ::help dcmd.  For example:

> ::dmods -l libumem.so.1

libumem.so.1
dcmd allocdby - given a thread, print its allocated buffers
dcmd bufctl - print or filter a bufctl
dcmd bufctl_audit - print a bufctl_audit
dcmd findleaks - search for potential memory leaks
...
> ::help findleaks

NAME
findleaks - search for potential memory leaks

SYNOPSIS
[ addr ] ::findleaks [-dfv]

DESCRIPTION

Does a conservative garbage collection of the heap in order to find
potentially leaked buffers. Similar leaks are coalesced by stack
trace, with the oldest leak picked as representative. The leak
table is cached between invocations.
...

You can now use the various dcmds to look for memory leaks and other common problems with memory allocation, or to simply better understand how your application uses memory.

A complete example

The attached mem_leak.c program includes three simple memory leaks.  The first is within main(), where we overwrite a pointer after allocating memory.  The second is within a function, where we allow an automatic variable to be discarded before freeing memory that it references.  The last is a nested function call that includes a logic bug that causes it to return early, also allowing an automatic variable to be discarded before freeing memory that it references.
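
The original attachment is not reproduced here, but a minimal sketch consistent with that description, and with the function names that appear in the stack traces below (buf_create, func_leak, nested_leak), might look like this:

/* mem_leak.c - illustrative sketch of a program with three deliberate
 * leaks; not the original attachment. */
#include <stdio.h>
#include <stdlib.h>

static char *buf_create(void)
{
    return malloc(1024);            /* caller is expected to free this */
}

static void nested_leak_l3(void)
{
    char *p = buf_create();
    if (p != NULL)                  /* leak 3: logic bug returns early, */
        return;                     /* discarding p without freeing it  */
    free(p);
}
static void nested_leak_l2(void) { nested_leak_l3(); }
static void nested_leak_l1(void) { nested_leak_l2(); }
static void nested_leak(void)    { nested_leak_l1(); }

static void func_leak(void)
{
    char *p = malloc(1024);         /* leak 2: automatic variable is    */
    (void) p;                       /* discarded without a free         */
}

int main(void)
{
    char *p = malloc(1024);         /* leak 1: this pointer is about    */
    p = malloc(1024);               /* to be overwritten                */

    func_leak();
    nested_leak();

    printf("Memory allocated, hit enter to continue:");
    fflush(stdout);
    getchar();

    free(p);                        /* only the second buffer is freed  */
    printf("Memory freed, hit enter to exit:");
    fflush(stdout);
    getchar();
    return 0;
}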

To get started, compile the program and start it up with libumem and its audit feature enabled:

$ /opt/SunStudioExpress/bin/cc -o mem_leak mem_leak.c
$ LD_PRELOAD=libumem.so UMEM_DEBUG=audit ./mem_leak
Memory allocated, hit enter to continue:
Memory freed, hit enter to exit:

With the program waiting at the second prompt, go to another window to generate a core dump and examine the results with mdb:

$ pgrep -l mem_leak
1714 mem_leak
$ gcore 1714
gcore: core.1714 dumped
$ mdb core.1714
Loading modules: [ libumem.so.1 ld.so.1 ]
> ::findleaks
CACHE LEAKED BUFCTL CALLER
08072c90 1 0807dd08 buf_create+0x12
08072c90 1 0807dca0 func_leak+0x12
08072c90 1 0807dbd0 main+0x12
------------------------------------------------------------------------
Total 3 buffers, 3456 bytes
>

The output from ::findleaks shows that we have leaked three memory buffers, as expected, and we can now obtain a stack trace for each by running ::bufctl_audit against each bufctl address:

> 0807dbd0::bufctl_audit
ADDR BUFADDR TIMESTAMP THREAD
CACHE LASTLOG CONTENTS
807dbd0 807bb00 f5c5bb73837 1
8072c90 0 0
libumem.so.1`umem_cache_alloc_debug+0x144
libumem.so.1`umem_cache_alloc+0x19a
libumem.so.1`umem_alloc+0xcd
libumem.so.1`malloc+0x2a
main+0x12
_start+0x7d

> 0807dca0::bufctl_audit
ADDR BUFADDR TIMESTAMP THREAD
CACHE LASTLOG CONTENTS
807dca0 807b180 f5c5bb74120 1
8072c90 0 0
libumem.so.1`umem_cache_alloc_debug+0x144
libumem.so.1`umem_cache_alloc+0x19a
libumem.so.1`umem_alloc+0xcd
libumem.so.1`malloc+0x2a
func_leak+0x12
main+0x2f
_start+0x7d

> 0807dd08::bufctl_audit
ADDR BUFADDR TIMESTAMP THREAD
CACHE LASTLOG CONTENTS
807dd08 807acc0 f5c5bb7446e 1
8072c90 0 0
libumem.so.1`umem_cache_alloc_debug+0x144
libumem.so.1`umem_cache_alloc+0x19a
libumem.so.1`umem_alloc+0xcd
libumem.so.1`malloc+0x2a
buf_create+0x12
nested_leak_l3+0xb
nested_leak_l2+8
nested_leak_l1+8
nested_leak+8
main+0x34
_start+0x7d

>

Note that if you have leaked any "oversized" allocations (currently anything over 16k) the output will include a list of these leaked buffers including a byte count and vmem_seg address.  You can obtain the stack traces for these buffer allocations by running ::vmem_seg -v against each vmem_seg address.

Looking at the stack traces, the entry just below libumem.so.1`malloc in each stack is the function that allocated the leaked buffer.  If it isn't clear which malloc() got leaked, it may help to use the ::dis dcmd to disassemble the code.  For example:

> main+0x12::dis
main: pushl %ebp
main+1: movl %esp,%ebp
main+3: subl $0x28,%esp
main+6: pushl $0x0
main+8: pushl $0x400
main+0xd: call -0x256 <PLT=libumem.so.1`malloc>
main+0x12: addl $0x8,%esp
main+0x15: movl %eax,-0x8(%ebp)
main+0x18: pushl $0x0
main+0x1a: pushl $0x400
main+0x1f: call -0x268 <PLT=libumem.so.1`malloc>
main+0x24: addl $0x8,%esp
main+0x27: movl %eax,-0x8(%ebp)
main+0x2a: call -0xff <func_leak>
main+0x2f: call -0x44 <nested_leak>
main+0x34: pushl $0x0
main+0x36: pushl $0x8050e70
>

The example above shows that there were two calls to malloc() near the beginning of main(), and we have leaked the memory allocated by the first one.  Note that the second malloc() is not reported as a leak even if the core is generated while the buffer is still active.  That is because we still have a reference to the buffer and it has not actually been leaked.  Whether the buffer is eventually freed doesn't really matter.  As long as we have a reference to the buffer at the time the core is generated or mdb examines the running program, it will not be reported as a leak.

Even with the information obtained from libumem and mdb, you will still have some detective work to do to determine exactly why you have leaked a particular buffer.  However, knowing which buffer has been leaked, and the point in the code where it was allocated, is more than half the battle.

Keep in mind that the allocation of the leaked memory may occur in a system library, not in the code for your program. This could mean you have found a leak in a system library, but more likely it means that you requested an object from the library and were supposed to call another function to discard that object when you were finished with it. For example, in the SIGAR leak that I mentioned at the start of this discussion, the leaks were related to buffers allocated by libnsl, but the real bug was a failure by sigar_rpc_ping() to call clnt_destroy(3NSL) to clean up a CLIENT handle it had created with clntudp_create(3NSL).
