Wednesday Aug 05, 2009

Data Validation, Journaling, and journal recovery

One thing that I discovered last summer, but have not yet had the time to resolve, is that there is a small window after journal recovery in which Vdbench may report a Data Validation error without there being any real cause for it. Vdbench writes one journal record before it starts the i/o and another after the i/o completes, so if Vdbench or the system shuts down between those two records, the journal only shows that the write was started. The question during journal recovery then is: "did this i/o complete or not?" Vdbench does not resolve this. It should accept the block if it contains either the BEFORE contents or the AFTER contents. This window is pretty small, though.
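The acceptance rule suggested above can be sketched as follows. This is a minimal illustration in Python, not Vdbench code: the function name, the `write_completed` flag, and the use of plain byte strings for the BEFORE and AFTER contents are all assumptions made for the example.

```python
from enum import Enum


class Verdict(Enum):
    OK = "ok"
    CORRUPT = "corrupt"


def check_after_recovery(found, before, after, write_completed):
    """Decide whether a block is acceptable after journal recovery.

    found           -- contents read back from the lun
    before          -- contents expected if the write never reached the lun
    after           -- contents expected if the write did reach the lun
    write_completed -- True if the journal holds the record that is written
                       after the i/o completes (hypothetical flag)
    """
    if write_completed:
        # The journal proves the write finished: only AFTER is valid.
        acceptable = {after}
    else:
        # Only the first journal record exists: the write may or may not
        # have reached the lun, so either image must be accepted.
        acceptable = {before, after}
    return Verdict.OK if found in acceptable else Verdict.CORRUPT
```

With an incomplete journal entry, a block holding either image passes, while a block holding anything else is still flagged as corrupt.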

Friday Mar 06, 2009

Vdbench Data Validation and Journaling

Vdbench Data Validation (-v) gives Vdbench the ability to verify that data, once written, can be read back and still compares correctly. To accomplish this, Vdbench keeps a large table in memory that tells it what was written where. Once Vdbench terminates, that in-memory table is of course gone.

By using the '-j' execution parameter, Vdbench not only does Data Validation, it also maintains a Journal file that keeps a copy of the in-memory table. Upon a Vdbench Journal restart (-jr) the table can then be recovered, so that Vdbench knows what was written in the previous run. (There is a 'Data Validation and Journaling' chapter in the doc explaining all this.)

I saw an interesting case yesterday: one of my users was running Journaling without really needing it. This can cause problems. Why? For each write to a target lun, Journaling causes two synchronized writes of a journal record: one before the write to the lun, and one just after the write to the lun has completed. Synchronized writes, especially against a storage device that does not have a non-volatile write cache, can be pretty slow. As a result, the IOPS against the lun being tested is quite a bit lower than without journaling.
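To make the cost visible, here is a rough Python sketch of the pattern described above: one synchronized journal record before the data write and one after it. It is not Vdbench's implementation. The record format, the function names, and the use of `os.fsync` to force each record to stable storage are assumptions for illustration only.

```python
import os
import tempfile


def synced_append(fd, record):
    """Append a record and force it to stable storage before returning."""
    os.write(fd, record)
    os.fsync(fd)  # this wait is what makes journaling slow


def journaled_write(journal_fd, target_fd, block_no, data):
    """Write one block, bracketed by two synchronized journal records."""
    synced_append(journal_fd, b"before %d\n" % block_no)
    os.lseek(target_fd, block_no * len(data), os.SEEK_SET)
    os.write(target_fd, data)
    synced_append(journal_fd, b"after %d\n" % block_no)


def demo():
    """Run one journaled write against temporary files, return the journal."""
    with tempfile.TemporaryDirectory() as tmp:
        journal_path = os.path.join(tmp, "journal")
        target_path = os.path.join(tmp, "target")
        journal_fd = os.open(journal_path,
                             os.O_WRONLY | os.O_CREAT | os.O_APPEND)
        target_fd = os.open(target_path, os.O_RDWR | os.O_CREAT)
        try:
            journaled_write(journal_fd, target_fd, 3, b"x" * 512)
        finally:
            os.close(journal_fd)
            os.close(target_fd)
        with open(journal_path, "rb") as f:
            return f.read().splitlines()
```

Every data write now waits for two extra flushes, so on a device without a non-volatile write cache the achievable i/o rate is bounded by how fast those flushes complete.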

The objective of the test in question was to recreate a data integrity problem in a lab environment. In theory, and usually in practice, such a problem is faster to recreate at higher IOPS. If, for instance, your problem happens after 100,000 i/o operations, then the higher your IOPS, the sooner you hit the problem. Running Journaling when you do not need it therefore only delays the recreate.

Henk.


Friday Feb 27, 2009

Vdbench and Data Validation against very large luns or files

One warning about something that some people have run into without knowing it:
If your luns are very large, your xfersize is relatively small, and your iops is relatively low, then it may take quite a while before a data block is accessed for the second time, and therefore before it is read and its data validated.

For instance, with a one TB lun using an xfersize of 8k at 100 iops it will take about 373 hours, or more than 15 days, before each block is accessed ONCE.
You can make the lun look smaller by using the 'size=' parameter:
sd=sd1,lun=/dev/rdsk/cxtxdxsx,size=5g    (Vdbench then only uses the first 5gb of the lun)

or you can use the 'range=' parameter:

sd=sd1,lun=/dev/rdsk/cxtxdxsx,range=(50,60) (Vdbench now only uses the space between 50% and 60% of the lun's size)
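The arithmetic behind this warning can be checked with a short Python sketch. It assumes binary units (1 TB = 1024^4 bytes, 8k = 8192 bytes) and the best case in which every access lands on a block that has not been touched yet. With random access the real time is longer. The function name is made up for the example.

```python
KB, GB, TB = 1024, 1024**3, 1024**4


def hours_to_touch_every_block(area_bytes, xfersize_bytes, iops):
    """Best-case hours needed to access each block of the area once."""
    blocks = area_bytes // xfersize_bytes
    return blocks / iops / 3600


# Whole 1 TB lun, 8k transfers, 100 iops.
full = hours_to_touch_every_block(1 * TB, 8 * KB, 100)

# Same lun limited with size=5g.
limited = hours_to_touch_every_block(5 * GB, 8 * KB, 100)

# Same lun limited with range=(50,60), i.e. 10 percent of its size.
ranged = hours_to_touch_every_block(TB * 10 // 100, 8 * KB, 100)

print(f"full lun: {full:.1f} h, size=5g: {limited:.1f} h, "
      f"range=(50,60): {ranged:.1f} h")
```

Under these assumptions the full lun needs about 373 hours, the 5g subset about 1.8 hours, and the 10 percent range about 37 hours.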



Henk.


Wednesday Oct 22, 2008

A common mistake made when using Vdbench data validation.

When using Vdbench data validation, the content of a data block is validated only when the block is accessed for the second, third, etc. time. This means that if you have a very large LUN or file, and your run has a relatively short elapsed time, a run can complete without any data block ever having been accessed more than once. The run then appears successful, but no real data validation has been done, and of course you can then have undetected data integrity problems!

I therefore decided to make a change in Vdbench 5.00: I will count the number of data block validations that have been done, and if that count is still zero at the end of a run, Vdbench aborts with an explanation of what happened and suggestions on how to fix it, such as:

  • Use larger block sizes
  • Use longer elapsed= times.
  • Use only a portion of the LUN, using:
    • sd=sd1,….,size=1g
    • sd=sd1,….,range=(xx,yy) (Vdbench 5.00)
    • wd=wd1,…..,range=(xx,yy)
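The check described above can be sketched in Python as a small simulation. This is only an illustration of the idea, not Vdbench code: the table layout, the random access pattern, and all names are assumptions.

```python
import random


class NoValidationsError(RuntimeError):
    """Raised when a run ends without a single block validation."""


def run_workload(total_blocks, accesses, seed=0):
    """Simulate random block accesses and count validations."""
    rng = random.Random(seed)
    written = {}      # block number -> times written (hypothetical table)
    validations = 0
    for _ in range(accesses):
        block = rng.randrange(total_blocks)
        if block in written:
            # A real tool would read the block here and compare it with
            # what the table says was written last.
            validations += 1
        written[block] = written.get(block, 0) + 1
    if validations == 0:
        raise NoValidationsError(
            "No read validations done during a Data Validation run.")
    return validations
```

A run over a small area validates almost every access, while a run that never returns to a block ends with the error instead of a misleading success.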

Error message:

08:31:10.258 No read validations done during a Data Validation run.
08:31:10.258 This means very likely that your run was not long enough to
08:31:10.258 access the same data block twice.
08:31:10.258 There are several solutions to this:
08:31:10.259 - increase elapsed time.
08:31:10.259 - use larger xfersize.
08:31:10.259 - use only a subset of your lun by using the 'sd=...,size='
08:31:10.259 parameter or the 'wd=...,range=' parameter.

Normal completion message in logfile.html:

08:33:23.847 localhost-0: Total amount of blocks read and validated: 1271

About

Blog for Henk Vandenbergh, author of Vdbench, and Sun StorageTek Workload Analysis Tool (Swat). This blog is used to keep you up to date about anything revolving around Swat and Vdbench.
