Out-of-range LBAs stop my web business
By clive on Feb 07, 2008
I spent a large part of a day a few weeks back on a bridge call with a large US customer for whom serious I/O performance issues had developed overnight. They had made no changes to the application, platform, SAN or storage for months, and the SAN checked out fine. The problem came down to very high I/O latency across most LUNs.
At the iostat level all I could see was high service times, while the storage vendor could only see low service times. As is par for the course, it became a finger-pointing exercise as both sides dug deeper and deeper into their stacks of hardware and software.
With the customer's business in effect down (they are to a large extent a web-based business), the political environment started to heat up (you could even feel the heat from 3,000 miles away). Eventually one of the storage vendor's engineers found that SCSI packets with "out of range" LBAs (Logical Block Addresses) were being sent to the array, which pointed the finger back at our platform. I read through some Solaris code and concluded that if the platform were generating out-of-range LBAs, they would be recorded in the Illegal Request counter of the ssderr kstat, but we did not see that counter incrementing.
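For readers who have not looked at what comes back on the wire, here is a minimal sketch in C of how an initiator could recognise an out-of-range LBA rejection. It assumes the target answers with fixed-format sense data carrying sense key ILLEGAL REQUEST (5h) and additional sense code 21h/00h, "LOGICAL BLOCK ADDRESS OUT OF RANGE", as the SCSI standards define. The helper name `sense_is_lba_out_of_range` is made up for illustration; it is not a Solaris function and says nothing about how the Solaris driver itself updates its kstats.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard SCSI values: sense key ILLEGAL REQUEST and the additional
 * sense code/qualifier for "LOGICAL BLOCK ADDRESS OUT OF RANGE". */
#define SENSE_KEY_ILLEGAL_REQUEST 0x05
#define ASC_LBA_OUT_OF_RANGE      0x21
#define ASCQ_LBA_OUT_OF_RANGE     0x00

/*
 * Hypothetical helper (not part of any Solaris API).
 * Returns 1 if the fixed-format sense buffer reports an out-of-range
 * LBA, 0 otherwise. Descriptor-format sense data is not handled.
 */
int sense_is_lba_out_of_range(const uint8_t *sense, size_t len)
{
    assert(sense != NULL);

    /* The ASC and ASCQ live in bytes 12 and 13 of fixed-format sense. */
    if (len < 14)
        return 0;

    /* Response codes 70h (current) and 71h (deferred) are fixed format. */
    uint8_t response_code = sense[0] & 0x7f;
    if (response_code != 0x70 && response_code != 0x71)
        return 0;

    /* The sense key is the low nibble of byte 2. */
    return (sense[2] & 0x0f) == SENSE_KEY_ILLEGAL_REQUEST &&
           sense[12] == ASC_LBA_OUT_OF_RANGE &&
           sense[13] == ASCQ_LBA_OUT_OF_RANGE;
}
```

A host that sent such a request would see this sense data itself, which is why the counter on our side was the natural place to look.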
As a side remark, one of the customer's admins mentioned that they had seen similar I/O issues on a set of Wintel systems since installing a new set of HBA cards. I asked what the port addresses were for the HBAs generating the errant packets, and they were not from the 15k that had the performance problem. The call went silent as the picture unfolded; Sun and the storage vendor were thanked for their time and we left the call.
I wanted to understand how Solaris really behaves when out-of-range LBAs are generated, so I wrote some code to generate them. Please play with the code, but remember that it uses USCSI, which bypasses the checks made by the filesystem and so on, so please don't use it on your production server.
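To give a feel for what such a request looks like, here is a minimal sketch in C, not the original program, that builds a READ(10) command descriptor block whose starting LBA lies just past the end of the device. The helper names `build_read10` and `read_is_out_of_range` are invented for illustration, and the last valid LBA is assumed to have been obtained already (for example from READ CAPACITY). On Solaris the resulting CDB would typically be placed in a `struct uscsi_cmd` and issued with the `USCSICMD` ioctl; that part is left out so the sketch stays portable and harmless.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define READ10_OPCODE  0x28
#define READ10_CDB_LEN 10

/*
 * Hypothetical helper: fill a 10-byte CDB with a READ(10) for nblocks
 * blocks starting at lba. The LBA occupies bytes 2-5 and the transfer
 * length bytes 7-8, both big-endian.
 */
void build_read10(uint8_t cdb[READ10_CDB_LEN], uint32_t lba, uint16_t nblocks)
{
    assert(cdb != NULL);
    memset(cdb, 0, READ10_CDB_LEN);
    cdb[0] = READ10_OPCODE;
    cdb[2] = (uint8_t)(lba >> 24);
    cdb[3] = (uint8_t)(lba >> 16);
    cdb[4] = (uint8_t)(lba >> 8);
    cdb[5] = (uint8_t)lba;
    cdb[7] = (uint8_t)(nblocks >> 8);
    cdb[8] = (uint8_t)nblocks;
}

/*
 * Hypothetical helper: returns 1 if the request touches any block past
 * last_lba (the highest valid LBA on the device), 0 otherwise.
 * A zero-length request is judged on its starting LBA alone, which is
 * a simplifying assumption of this sketch.
 */
int read_is_out_of_range(uint32_t last_lba, uint32_t lba, uint16_t nblocks)
{
    /* Widen to 64 bits so lba + nblocks cannot wrap around. */
    uint64_t last_touched = (uint64_t)lba + (nblocks ? nblocks - 1u : 0u);
    return last_touched > last_lba;
}
```

Calling `build_read10(cdb, last_lba + 1, 1)` gives the simplest out-of-range case: a one-block read starting at the first address the device does not have.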
So in a SAN environment there is a risk that other hosts can affect your mission-critical business, and finding the root cause becomes a real challenge. This customer lost a number of hours of sales. The Solaris side has no visibility of out-of-range LBA SCSI packets generated by other hosts. One place where this code might be useful, then, is in determining how a workload's I/O latency is affected when the back-end storage has to handle "out of range" LBAs. In the case of the customer above, I suspect that resets were occurring within the array and that these hurt performance. Should storage be robust against performance degradation from such requests, or is it just a good idea to limit the size and scope of each SAN?