dma_attr_sgllen can make your i/o look slow

I was asked to look at a slow i/o performance problem on Solaris 10 running on our fabulously fast
AMD64 boxes. The iostat command was reporting a very slow active service time (asvc_t) when the memory supplying the data for the large(ish) i/o was not allocated from large pages.

DTrace showed the large page based i/o going out in one chunk, with one call to sdintr() at the end of the
i/o before the buf was returned, but the 4k page based i/o was going out in a number of chunks. Each 128k chunk ended in a call to sdintr(), and the buf was returned only after the last chunk had completed. The important part of the stack trace that DTrace or lockstat profiling will show is the calls to ddi_dma_nextcookie() as each chunk is initialised.

For the i/o kstat, the service time starts when the buf is sent to the HBA for transport to the disk and ends when the buf is returned. For the 4k page based i/o, every chunk adds another real service time to that total, so the reported figure is a multiple of the real service time.
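To put rough numbers on that, here is a minimal sketch of the accounting. It assumes the chunks go out one after another and that every trip to the disk costs about the same, which is a simplification; the function name and the 5 ms figure are made up for illustration.

```python
def apparent_service_time_ms(per_chunk_ms, chunks):
    """Service time the i/o kstat would record for one buf, assuming the
    chunks are issued back to back and each one takes per_chunk_ms."""
    if chunks < 1:
        raise ValueError("a buf goes out as at least one chunk")
    return per_chunk_ms * chunks


# Hypothetical figures: 5 ms per trip to the disk.
one_chunk = apparent_service_time_ms(5.0, 1)     # large page case
eight_chunks = apparent_service_time_ms(5.0, 8)  # 4k page case, 8 chunks
print(one_chunk, eight_chunks)
```

The disk is doing the same 5 ms of work per request in both cases; only the window that the kstat measures has grown.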

So what was causing the i/o to be broken up? The sd target driver relies on the underlying HBA driver to
do the SCSI packet and DMA initialisation via the scsi_init_pkt() call. For this particular HBA the
ddi_dma_attr structure (man ddi_dma_attr) had 32 in the dma_attr_sgllen field. This field
describes the number of scatter gather segments that the DMA engine built into the HBA card can
deal with per i/o request. If an i/o requires more than 32 scatter gather list elements then it will be broken
up into multiple i/o requests.
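The splitting rule is just a ceiling division. The sketch below is only the arithmetic, not what the HBA driver actually executes; requests_needed is a made-up name and 32 is the dma_attr_sgllen value from this particular HBA.

```python
import math

SGLLEN = 32  # dma_attr_sgllen reported for this particular HBA


def requests_needed(sg_elements, sgllen=SGLLEN):
    """Number of i/o requests needed to carry sg_elements scatter gather
    list elements when each request can hold at most sgllen of them."""
    return max(1, math.ceil(sg_elements / sgllen))


for elements in (1, 32, 33, 256):
    print(elements, requests_needed(elements))
```

Anything up to 32 elements goes out as a single request; the 33rd element is what forces a second one.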

The large page buffer is allocated out of 2MB pages of contiguous virtual memory addresses, but
more importantly each large page is made from contiguous physical memory, which is what the DMA engine works with, so a buffer allocated from these large pages occupies just one DMA scatter gather list element.

The small page buffer is allocated out of 4k pages of contiguous virtual memory addresses, but
each 4k page can be mapped into its virtual address from any physical address, so in the worst case each 4k page takes one DMA scatter gather list element. A large i/o can therefore take more than 32 elements and so be chunked into N i/os of 128k (32 elements of 4k each), making the iostat active service time look N times worse than it really is.
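The difference between the two buffers can be sketched by counting physically contiguous runs. This is a toy model, not the DDI code: count_sg_elements and the addresses are invented for illustration, and it ignores the other limits that ddi_dma_attr can impose on a single element.

```python
PAGE_4K = 4 * 1024


def count_sg_elements(segments):
    """Count scatter gather elements for a buffer described as a list of
    (physical_address, length) pairs in buffer order. Neighbouring pieces
    that are adjacent in physical memory merge into one element."""
    elements = 0
    prev_end = None
    for addr, length in segments:
        if addr != prev_end:
            elements += 1  # a gap in physical memory starts a new element
        prev_end = addr + length
    return elements


# 1MB buffer inside one 2MB large page: a single physically contiguous run.
large_page_buf = [(0x40000000, 1024 * 1024)]

# 1MB buffer from 4k pages, worst case: no two neighbouring pages are
# adjacent in physical memory (hypothetical addresses, every other page).
small_page_buf = [(0x10000000 + i * 2 * PAGE_4K, PAGE_4K) for i in range(256)]

print(count_sg_elements(large_page_buf))
print(count_sg_elements(small_page_buf))
print(32 * PAGE_4K // 1024, "k per chunk")
```

In this model the large page buffer needs one element and the worst case small page buffer needs 256, and 32 elements of 4k each is where the 128k chunk size comes from.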

So using large pages can have hidden benefits.  (man ppgsz)
