Monday Nov 16, 2009

Fun With DTrace: The Windows-Key Prank

The current episode of the German HELDENFunk podcast features an interview with Chris Gerhard about one of his favourite subjects: DTrace (in English, beginning at 14:58):

After the interview, we hear a guy called "Konteener Kalle" express his love (in German) for DTrace by playing a prank on his boss: Whenever he presses the Windows key (on an OpenSolaris system, mind you), he's punished by watching the XScreensaver BSOD hack (of course not knowing that it's just a screensaver).

That little joke challenged me to actually implement this prank. Here's how to do it.

The Idea

The idea of this prank is to start the XScreensaver Blue-Screen-of-Death screensaver (which simulates a Windows crash experience) on an OpenSolaris system whenever the user presses a certain key a certain number of times. This could be the Windows-Key (which doesn't have any real use on an OpenSolaris machine) or any other key. We count the number of key presses and only execute the BSOD after a certain number of key presses in order to make the prank less obvious.

Step 1: Identify the Windows (or any other) Key

If you have a Windows-Keyboard, this is easy: Run xev and press the Windows-Key. Take note of the keycode displayed in the xev output. Of course you can use any other key as well to play this prank. In this case, I'm using the left Control-Key, because I don't have a Windows-Key on the system I'm working on. The Control key has the keycode 37.

Step 2: Configure XScreensaver for BSOD

XScreensaver comes with a great collection of "hacks" that do interesting stuff on the screen when the screensaver activates. Check out the /usr/lib/xscreensaver/hacks directory. Each hack can be run individually, but then it will only execute inside a new window. For the BSOD illusion to be realistic, we want to execute the BSOD hack in full-screen.

This can be achieved by telling XScreensaver to demo the BSOD hack for us. It will then create a full-screen window and execute the BSOD hack inside the new window. The following command will tell XScreensaver to run a hack for us:

xscreensaver-command -demo <number>

The <number> part is a little complicated: XScreensaver looks at its config file ~/.xscreensaver where it stores a list of programs and arguments after the keyword "programs:". <number> simply refers to the number of the hack on that list. Therefore, we must create an entry in our admin user's .xscreensaver file that starts bsod(6) with the right parameters and that gives us a known number to call xscreensaver-command with.

Let's put our entry at the top of the list so we can simply use the number "1" to execute the BSOD screensaver. Somewhere in our .xscreensaver, the programs section should look like this:

  textFile:       /etc/motd
  textProgram:    date

  programs:                                                                     \\
  -               "BSOD Windoze"  bsod -root -only nt         \\n\\
  -                "Qix (solid)"  qix -root -solid -segments 100              \\n\\
  -          "Qix (transparent)"  qix -root -count 4 -solid -transparent      \\n\\

You can test this by running xscreensaver-command -demo 1.

Step 3: Write a DTrace Script That Sets Up the Trap

Now it gets more interesting. How do we use DTrace to find out when a user presses a certain key? All we know is that the Xorg server processes the keystrokes for us. So let's start by watching Xorg in action. The following DTrace command will trace all function calls within Xorg:

pfexec dtrace -n pid`pgrep Xorg`:::entry'{ @func[probefunc] = count(); }'

Let's start it, press the desired key 10 times, then stop it with CTRL-C. You'll see a long list of Xorg functions, sorted by the number of times they've been called. Since we pressed the key 10 times, it's a good idea to look for functions that have been called ca. 10 times. And here, we seem to be lucky:

  miUnionO                                                          8
  DeviceFocusInEvents                                               9
  CommonAncestor                                                   10
  ComputeFreezes                                                   10
  CoreLeaveNotifies                                                10
  key_is_down                                                      11
  FreeScratchPixmapHeader                                          12
  GetScratchPixmapHeader                                           12
  LookupIDByType                                                   12
  ProcShmDispatch                                                  12
  ProcShmPutImage                                                  12

The key_is_down function looks like exactly the function we're looking for! In fact, some googling tells us that this function's 2nd argument is the keycode of the key that is down when the function is called.

Why do we see "11" and not "10" function calls to key_is_down? Because it also counted my pressing of the Ctrl-Key when I stopped the DTrace script through Ctrl-C :).

This gives us enough knowledge to create the following DTrace script:

  #!/usr/sbin/dtrace -s

   \* BSODKey.d

   \* This D script will monitor a certain key in the system. When this key is
   \* pressed, a shell script will be executed that simulates a BSOD.
   \* The script needs the process id of the Xorg server to tap into as its
   \* first argument.
   \* One example of using this script is to punish a user pressing the
   \* Windows key on an OpenSolaris system by launching the BSOD screen saver.

  #pragma D option quiet
  #pragma D option destructive

          ctrlcount = 0;

  /arg1 == keycode/
          ctrlcount ++;

  /ctrlcount == 10/
          ctrlcount = 0;
          system("/usr/bin/xscreensaver-command -demo 1");

First, we need to enable DTrace's destructive mode (ever heard of a "constructive prank"?) otherwise we can't call the system-command at the end. The script uses the pid provider to tap into Xorg. Therefore, we need to give it the PID of the Xorg server as an argument:

pfexec ./BSODKey.d `pgrep Xorg`

It then sets up a probe that fires whenever key_is_down is called with our keycode and counts the key presses. At the end of the key_is_down function call, it checks whether we reached 10 keypresses, then executes the BSOD screen saver and resets the counter. You may need to make sure that the DISPLAY variable is set correctly for the BSOD program to show up on the victim's screen when starting this script.

After hitting the Control-Key 10 times, we're rewarded with our beloved BSOD:


That wasn't too difficult, was it? Yes, one could have done the same thing by writing a regular script that taps into /dev/kbd or something similar. But the beauty of DTrace lies in the simplicity of this script (Tap into the right function while it's running) and in the fact that it now can be modified very easily to fire BSODs at any kind of event, including the user hitting a certain area of the screen with his mouse or selecting a particular text or whatever you choose it to be.

So, have fun with this script and let me know in the comments what kind of pranks (or helpful actions) you can imagine with DTrace!

Tuesday Feb 19, 2008

VirtualBox and ZFS: The Perfect Team

I've never installed Windows in my whole life. My computer history includes systems like the Dragon 32, the Commodore 128, then the Amiga, Apple PowerBook (68k and PPC) etc. plus the occasional Sun system at work. Even the laptop my company provided me with only runs Solaris Nevada, nothing else. Today, this has changed. 

A while ago, Sun announced the acquisition of Innotek, the makers of the open-source virtualization software VirtualBox. After having played a bit with it for a while, I'm convinced that this is one of the coolest innovations I've seen in a long time. And I'm proud to see that this is another innovative german company that joins the Sun family, Welcome Innotek!

Here's why this is so cool.

Windows XP running on VirtualBox on Solaris Nevada

After having upgraded my laptop to Nevada build 82, I had VirtualBox up and running in a matter of minutes. OpenSolaris Developer Preview 2 (Project Indiana) runs fine on VirtualBox, so does any recent Linux (I tried Ubuntu). But Windows just makes for a much cooler VirtualBox demo, so I did it:

After 36 years of Windows freedom, I ended up installing it on my laptop, albeit on top of VirtualBox. Safer XP if you will. To the top, you see my VirtualBox running Windows XP in all its Tele-Tubby-ish glory.

As you can see, this is a plain vanilla install, I just took the liberty of installing a virus scanner on top. Well, you never know...

So far, so good. Now let's do something others can't. First of all, this virtual machine uses a .vdi disk image to provide hard disk space to Windows XP. On my system, the disk image sits on top of a ZFS filesystem:

# zfs list -r poolchen/export/vm/winxp
NAME                                                          USED  AVAIL  REFER  MOUNTPOINT
poolchen/export/vm/winxp                                     1.22G  37.0G    20K  /export/vm/winxp
poolchen/export/vm/winxp/winxp0                              1.22G  37.0G  1.05G  /export/vm/winxp/winxp0
poolchen/export/vm/winxp/winxp0@200802190836_WinXPInstalled   173M      -   909M  -
poolchen/export/vm/winxp/winxp0@200802192038_VirusFree           0      -  1.05G  -

Cool thing #1: You can do snapshots. In fact I have two snapshots here. The first is from this morning, right after the Windows XP installer went through, the second has been created just now, after installing the virus scanner. Yes, there has been some time between the two snapshots, with lots of testing, day job and the occasional rollback. But hey, that's why snapshots exist in the first place.

Cool thing #2: This is a compressed filesystem:

# zfs get all poolchen/export/vm/winxp/winxp0
NAME                             PROPERTY         VALUE                    SOURCE
poolchen/export/vm/winxp/winxp0  type             filesystem               -
poolchen/export/vm/winxp/winxp0  creation         Mon Feb 18 21:31 2008    -
poolchen/export/vm/winxp/winxp0  used             1.22G                    -
poolchen/export/vm/winxp/winxp0  available        37.0G                    -
poolchen/export/vm/winxp/winxp0  referenced       1.05G                    -
poolchen/export/vm/winxp/winxp0  compressratio    1.53x                    -
poolchen/export/vm/winxp/winxp0  compression      on                       inherited from poolchen

ZFS has already saved me more than half a gigabyte of precious storage capacity already! 

Next, we'll try out Cool thing #3: Clones. Let's clone the virus free snapshot and try to create a second instance of Win XP from it:

# zfs clone poolchen/export/vm/winxp/winxp0@200802192038_VirusFree poolchen/export/vm/winxp/winxp1
# ls -al /export/vm/winxp
total 12
drwxr-xr-x   5 constant staff          4 Feb 19 20:42 .
drwxr-xr-x   6 constant staff          5 Feb 19 08:44 ..
drwxr-xr-x   3 constant staff          3 Feb 19 18:47 winxp0
drwxr-xr-x   3 constant staff          3 Feb 19 18:47 winxp1
dr-xr-xr-x   3 root     root           3 Feb 19 08:39 .zfs
# mv /export/vm/winxp/winxp1/WindowsXP_0.vdi /export/vm/winxp/winxp1/WindowsXP_1.vdi

The clone has inherited the mountpoint from the upper level ZFS filesystem (the winxp one) and so we have everything set up for VirtualBox to create a second Win XP instance from. I just renamed the new container file for clarity. But hey, what's this?

VirtualBox Error Message 

Damn! VirtualBox didn't fall for my sneaky little clone trick. Hmm, where is this UUID stored in the first place?

# od -A d -x WindowsXP_1.vdi | more
0000000 3c3c 203c 6e69 6f6e 6574 206b 6956 7472
0000016 6175 426c 786f 4420 7369 206b 6d49 6761
0000032 2065 3e3e 0a3e 0000 0000 0000 0000 0000
0000048 0000 0000 0000 0000 0000 0000 0000 0000
0000064 107f beda 0001 0001 0190 0000 0001 0000
0000080 0000 0000 0000 0000 0000 0000 0000 0000
0000336 0000 0000 0200 0000 f200 0000 0000 0000
0000352 0000 0000 0000 0000 0200 0000 0000 0000
0000368 0000 c000 0003 0000 0000 0010 0000 0000
0000384 3c00 0000 0628 0000 06c5 fa07 0248 4eb6
0000400 b2d3 5c84 0e3a 8d1c
8225 aae4 76b5 44f5
0000416 aa8f 6796 283f db93 0000 0000 0000 0000
0000432 0000 0000 0000 0000 0000 0000 0000 0000
0000448 0000 0000 0000 0000 0400 0000 00ff 0000
0000464 003f 0000 0200 0000 0000 0000 0000 0000
0000480 0000 0000 0000 0000 0000 0000 0000 0000
0000512 0000 0000 ffff ffff ffff ffff ffff ffff
0000528 ffff ffff ffff ffff ffff ffff ffff ffff
0012544 0001 0000 0002 0000 0003 0000 0004 0000

Ahh, it seems to be stored at byte 392, with varying degrees of byte and word-swapping. Some further research reveals that you better leave the first part of the UUID alone (I spare you the details...), instead, the last 6 bytes: 845c3a0e1c8d, sitting at byte 402-407 look like a great candidate for an arbitrary serial number. Let's try changing them (This is a hack for demo purposes only. Don't do this in production, please):

# dd if=/dev/random of=WindowsXP_1.vdi bs=1 count=6 seek=402 conv=notrunc
6+0 records in
6+0 records out
# od -A d -x WindowsXP_1.vdi | more
0000000 3c3c 203c 6e69 6f6e 6574 206b 6956 7472
0000016 6175 426c 786f 4420 7369 206b 6d49 6761
0000032 2065 3e3e 0a3e 0000 0000 0000 0000 0000
0000048 0000 0000 0000 0000 0000 0000 0000 0000
0000064 107f beda 0001 0001 0190 0000 0001 0000
0000080 0000 0000 0000 0000 0000 0000 0000 0000
0000336 0000 0000 0200 0000 f200 0000 0000 0000
0000352 0000 0000 0000 0000 0200 0000 0000 0000
0000368 0000 c000 0003 0000 0000 0010 0000 0000
0000384 3c00 0000 0628 0000 06c5 fa07 0248 4eb6
0000400 b2d3 2666 6fbb c1ca 8225 aae4 76b5 44f5
0000416 aa8f 6796 283f db93 0000 0000 0000 0000
0000432 0000 0000 0000 0000 0000 0000 0000 0000
0000448 0000 0000 0000 0000 0400 0000 00ff 0000
0000464 003f 0000 0200 0000 0000 0000 0000 0000
0000480 0000 0000 0000 0000 0000 0000 0000 0000
0000512 0000 0000 ffff ffff ffff ffff ffff ffff
0000528 ffff ffff ffff ffff ffff ffff ffff ffff
0012544 0001 0000 0002 0000 0003 0000 0004 0000

Who needs a hex editor if you have good old friends od and dd on board? The trick is in the "conv=notruc" part. It tells dd to leave the rest of the file as is and not truncate it after doing it's patching job. Let's see if it works:

VirtualBox with two Windows VMs, one ZFS-cloned from the other.

Heureka, it works! Notice that the second instance is running with the freshly patched harddisk image as shown in the window above.

Windows XP booted without any problem from the ZFS-cloned disk image. There was just the occasional popup message from Windows saying that it found a new harddisk (well observed, buddy!).

Thanks to ZFS clones we can now create new virtual machine clones in just seconds without having to wait a long time for disk images to be copied. Great stuff. Now let's do what everybody should be doing to Windows once a virus scanner is installed: Install Firefox:

Clones WinXP instance, running FireFox

I must say that the performance of VirtualBox is stunning. It sure feels like the real thing, you just need to make sure to have enough memory in your real computer to support both OSes at once, otherwise you'll run into swapping hell...

BTW: You can also use ZFS volumes (called ZVOLs) to provide storage space to virtual machines. You can snapshot and clone them just like regular file systems, plus you can export them as iSCSI devices, giving you the flexibility of a SAN for all your virtualized storage needs. The reason I chose files over ZVOLs was just so I can swap pre-installed disk images with colleagues. On second thought, you can dump/restore ZVOL snapshots with zfs send/receive just as easily...

Anyway, let's see how we're doing storage-wise:

# zfs list -rt filesystem poolchen/export/vm/winxp
NAME                              USED  AVAIL  REFER  MOUNTPOINT
poolchen/export/vm/winxp         1.36G  36.9G    21K  /export/vm/winxp
poolchen/export/vm/winxp/winxp0  1.22G  36.9G  1.05G  /export/vm/winxp/winxp0
poolchen/export/vm/winxp/winxp1   138M  36.9G  1.06G  /export/vm/winxp/winxp1

Watch the "USED" column for the winxp1 clone. That's right: Our second instance of Windows XP only cost us a meager 138 MB on top of the first instance's 1.22 GB! Both filesystems (and their .vdi containers with Windows XP installed) represent roughly a Gigabyte of storage each (the REFER column), but the actual physical space our clone consumes is just 138MB.

Cool thing #4: ZFS clones save even more space, big time!

How does this work? Well, when ZFS creates a snapshot, it only creates a new reference to the existing on-disk tree-like block structure, indicating where the entry point for the snapshot is. If the live filesystem changes, only the changed blocks need to be written to disk, the unchanged ones remain the same and are used for both the live filesystem and the snapshot.

A clone is a snapshot that has been marked writable. Again, only the changed (or new) blocks consume additional disk space (in this case Firefox and some WinXP temporary data), everything that is unchanged (in this case nearly all of the WinXP installation) is shared between the clone and the original filesystem. This is de-duplication done right: Don't create redundant data in the first place!

That was only one example of the tremenduous benefits Solaris can bring to the virtualization game. Imagine the power of ZFS, FMA, DTrace, Crossbow and whatnot for providing the best infrastructure possible to your virtualized guest operating systems, be they Windows, Linux, or Solaris. It works in the SPARC world (through LDOMs), and in the x86/x64 world through xVM server (based on the work of the Xen community) and now joined by VirtualBox. Oh, and it's free and open source, too.

So with all that: Happy virtualizing, everyone. Especially to everybody near Stuttgart.

Thursday Dec 06, 2007

X4500 + Solaris ZFS + iSCSI = Perfect Video Editing Storage

Digital video editing is one of those applications that tend to be very data hungry. At SD PAL resolution, we're talking about 720 pixels x 576 lines x 3 bytes of color x 25 full frames per second = about 30 MB/s of data. That's about 224 GB for a 2 hour feature film. Not counting audio (that would only be around 3-4 GB). And we (in Germany) haven't looked at HD or Digital Cinema a lot yet...

During the last couple of weeks I worked with a customer who bought a Sun Fire X4500 server (you know, Thumper). The plan is to run Solaris ZFS on it, then provide big iSCSI volumes to the video editing systems, which tend to be specialized Windows or Mac OS X machines. Wonderful idea: Just use zpool create to combine a number of disks with some RAID level into a pool, then zfs create -V to create a ZVOL. Thanks to zfs shareiscsi=on, sharing the volume over iSCSI is dead easy.

But it didn't work.

First, Windows wouldn't mount the iSCSI volume. After some trying, we discovered that there must be an upper limit of 2TB to the size of iSCSI volumes that Windows can mount (we initially tried something like 5 ot 10TB). So be it: zfs create -V 2047G videopool/videovolume.

Now it mounted ok, we formatted the disk with NTFS (yuck!) and started the editing system's speed test. Then came the real issue: The test reported a write performance of 8-10 MB/s, but the editing system needs something like 30 MB/s sustained to be able to record reliably!

After some trying, we started the systematic approach:

  • A simple dd from one disk to another yielded >39 MB/s.
  • dd'ing from one small ZFS pool to another exceeded 120 MB/s (I later learned that cp is a better benchmark because it works asynchronously with large chunks of data vs. dd's synchronous block approach), so that was again more than we needed.
  • We tried re-attaching our ZVOL through iscsiadm to test the iSCSI stack's performance and ran into a TCP fusion issue. Ok, I've always wanted to play with mdb, so we followed the workaround instructions and we were able to attach our own ZVOL over the loopback interface. Slightly less performance (due to up the stack, down the stack effects, I presume) but still way more than we needed. So, it wasn't the X4500's nor ZFS' fault.

Finally, Danilo pointed me into the right direction: Nagle's algorithm. What usually helps maximize network bandwidth turns out to be a killer for iSCSI performance. For Solaris iSCSI clients, we know this already,  but how do we turn off Nagle on Windows?

The answer is deeply buried inside the Microsoft's iSCSI Initiator user guide: The "Addressing Slow Performance with iSCSI Clusters" chapter mentions a similar issue (although they talk about read not write performance) and they do mention RFC 1122's delayed ACK feature, which is related to Nagle's algorithm. The Microsoft document suggests a workaround which involves setting a variable in the registry, so it was worth a try (and my vengeance for having to use mdb before).

And low and behold, the speed test now yielded 90-100 MB/s (Close to a GBE's raw performance)! Yipee that was it! One little registry entry on the client side gave us a 10x improvement in iSCSI performance!

Now, can someone explain to me, why on Windows 2000 you need to set "TcpAckDelTicks=0" while on Windows 2003 the same thing is accomplished by saying "TcpAckFrequency=1" (which is the same thing, only seen from the other side of the division sign)?

So, to all you storage hungry video editors out there: The Sun Fire X4500 with Solaris ZFS and iSCSI is a great solution for reliable, fast, easy to use and inexpensive video storage. You just need to know how to tell your TCP/IP stack to not delay ACKs...


Tune in and find out useful stuff about Sun Solaris, CPU and System Technology, Web 2.0 - and have a little fun, too!


« April 2014