Instruction-Level Time Travel in Solaris

Last week Prof. Peter Chen, an associate professor from Michigan, was here demonstrating a novel tool that can record and replay the execution of an operating system at the instruction level. The concept is simple: you record the initial state of the OS, then log every interrupt and input (as he puts it, record the "I" in I/O and any non-deterministic events). Once you have done that, you can "replay" the running of the OS: pause, rewind, and fast-forward as you wish.
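
To make the idea concrete, here is a minimal sketch in Python. It is not Prof. Chen's tool, just a toy under stated assumptions: the "machine" is a made-up deterministic state update, and the names ToyMachine, record, and replay are hypothetical. The only thing logged is each external input together with the instruction count at which it arrived.

```python
import random


class ToyMachine:
    """Hypothetical deterministic machine: state changes only via step()."""

    def __init__(self, state=0):
        self.state = state
        self.icount = 0  # number of "instructions" executed so far

    def step(self, external=None):
        # One "instruction": mix in an external input if there is one,
        # then apply a fixed, deterministic update.
        if external is not None:
            self.state ^= external
        self.state = (self.state * 31 + 7) % 1_000_003
        self.icount += 1


def record(initial_state, steps, input_source):
    """Run live, logging (instruction count, value) for every input event."""
    m = ToyMachine(initial_state)
    log = []
    for _ in range(steps):
        value = input_source()  # returns None when no event arrives
        if value is not None:
            log.append((m.icount, value))
        m.step(value)
    return m.state, log


def replay(initial_state, steps, log):
    """Re-run from the same initial state, injecting logged inputs
    at exactly the instruction counts where they were recorded."""
    m = ToyMachine(initial_state)
    pending = dict(log)
    for _ in range(steps):
        m.step(pending.get(m.icount))
    return m.state


if __name__ == "__main__":
    rng = random.Random()  # unseeded: the live run differs every time

    def noisy_input():
        return rng.randrange(256) if rng.random() < 0.1 else None

    final, log = record(42, 1000, noisy_input)
    print(replay(42, 1000, log) == final)
```

The live run is different every time because the inputs are random, but replaying from the same initial state with the same log lands in the same final state. Everything between inputs is deterministic, so it never needs to be logged.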

His demo was stunning. He booted Red Hat Linux with a GNOME desktop in User-Mode Linux, opened a few windows and terminals at random, and created ssh keys. Then he replayed it. All the windows opened exactly as in the previous run and, to loud exclamations from the crowd, the ssh keys were the same! He had to assure some people that it was not a screen capture and video replay.

The professor says this tool is coming to Xen, which means you could record a Solaris execution and replay it. That would be a great help in debugging, since you can go backward or forward in time as you wish. You could watch a variable backward in time and find when it was modified and who modified it.
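
One way to picture the "watch a variable backward in time" idea: since a replay is deterministic, you can re-run it from the start and remember the last point at which the variable changed. The sketch below is only an illustration under that assumption, not a description of how the tool does it. The function last_write_before and the toy program format (a list of named callables that mutate a dict) are hypothetical.

```python
def last_write_before(initial_state, program, var, stop_at):
    """Replay `program` from `initial_state` up to (not including) step
    `stop_at` and return (step, who, old_value, new_value) for the last
    step that changed `var`, or None if it never changed.

    `program` is a list of (who, instruction) pairs; each instruction is a
    callable that mutates the state dict in place.
    Note: this detects changes in value, so a write that stores the same
    value again is not reported.
    """
    state = dict(initial_state)
    last = None
    for i, (who, instr) in enumerate(program[:stop_at]):
        before = state.get(var)
        instr(state)
        after = state.get(var)
        if after != before:
            last = (i, who, before, after)
    return last


if __name__ == "__main__":
    program = [
        ("init", lambda s: s.update(x=1)),
        ("worker_a", lambda s: s.update(y=s["x"] + 1)),
        ("worker_b", lambda s: s.update(x=s["x"] * 10)),
        ("worker_a", lambda s: s.update(y=s["y"] + s["x"])),
    ]
    # Standing at step 4, ask: who last changed x, and when?
    print(last_write_before({}, program, "x", 4))
```

Replaying the whole run for every query gets slow on a long recording. Periodic checkpoints would let a query start from the nearest saved state instead of from the very beginning.
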
Comments:

Now that is amazing. What I find so amazing is the simplicity of it. It's so easy to understand that it's actually astonishing we haven't seen this until now. Was this on a single CPU? What happens when we have two or more CPUs? How about running this at all times on a troubled system and keeping the last 20 seconds in RAM, so that during a core dump we grab that section of RAM and it becomes part of the normal crash output? It would be great to replay the moments before the crash for analysis.

Posted by Sun Enthusiast on December 21, 2005 at 03:53 PM PST #

About 10 years ago I worked on a project with some Brits that resulted in a hardware fault-tolerant system called the FTSparc, based on SuperSparc chips. Fast forward a bit: Sun buys the company and rolls out an Ultra-based box which became the late but little lamented FT 1800. For the compute complex, the fault tolerance was based on the same principle: comparing the signatures of IO events and taking action if they mismatched between the dual or triple redundant elements.

Posted by David McDaniel on December 28, 2005 at 05:59 AM PST #
