DTrace with Oracle VM Server SPARC 2.1

While live migration is the most notable of the new features of OVMSS 2.1, there are several other enhancements. One that really appeals to me is the addition of DTrace support.

Tiny overview of DTrace

DTrace, described here, is an advanced feature of Solaris for gathering information - "telemetry" - of a running system's behavior. DTrace provides probe points that can be dynamically traced (hence the name), enabled and disabled at run-time. Each probe point can be associated with DTrace scripts that are invoked when the probe "fires", letting you non-intrusively select, collect, and aggregate data at the application and system level. DTrace is safe and intended for production systems. It provides the best observability of any OS I know, let alone any hypervisor.

DTrace enablement in Oracle VM Server SPARC

OVMSS now includes DTrace-enabled instrumentation. I learned about the DTrace support on an internal web page written by Menno Lageman, one of the OVMSS engineers. The list of probe points exposed by the OVMSS "provider" (a provider is the DTrace term for a software component that generates DTrace telemetry) shows interesting probe points (currently, a total of 53) for events or processes that might be experienced by a domain. Worth further investigation!

# dtrace -l -P 'ldmd*'
   ID   PROVIDER            MODULE                          FUNCTION NAME
   53   ldmd4724              ldmd                         do_hv_cmd hv-cmd-done
   54   ldmd4724              ldmd                         do_hv_cmd hv-cmd-start
   55   ldmd4724              ldmd                        hvctl_poll hv-pollmsg-done
   56   ldmd4724              ldmd                        hvctl_poll hv-pollmsg-start
   57   ldmd4724              ldmd                      mig_compress mig-compress-done
   58   ldmd4724              ldmd                      mig_compress mig-compress-start
   59   ldmd4724              ldmd                    mig_decompress mig-decompress-done
   60   ldmd4724              ldmd      tgt_mblock_decompress_thread mig-decompress-done
... lines snipped for brevity...
# dtrace -l -P 'ldmd*'|wc -l
      53

A script for watching domains migrate

Since live motion is so new and interesting, I decided to write a script migrate.d to measure times for parts of the migration process, and to calculate the efficiency of the memory compression and decompression OVMSS uses to reduce migration time. I'll explain the script, which may give a little insight into how DTrace scripts are coded.

DTrace script are structured into stanzas that have a probe description, an optional predicate enclosed between "/" characters, and an action delimited by "{" and "}". The action is executed when the associated probe fires, but only if the optional predicate evaluates to "true". The BEGIN and END probes are built-in probes that automatically fire when a DTrace script starts and ends. Other probes are specified by by a 4-field expression that names a provider (in most cases here, "ldmd" followed by a process ID), module and function, and name within a function. Omitted fields (look for adjacent colons) are "wild-carded", meaning that anything matches.

In this script I used the BEGIN probe to type out instructions and set variables for totals. The script waits until probes are fired and then executes the associated actions. Finally, the END probe fires when the script terminates, which lets me display results.

ldmd*:::mig-compress-start fires when a block of memory is being compressed. I record the uncompressed size and the (highly precise) time that compression started in thread-local variables, and add to a DTrace "aggregate" that counts the number of times this combination of probe function and name fired. When the corresponding ldmd*:::mig-compress-done probe fires, I compute the size of data saved by compression, and record the total and average time spent in compression. Note the predicate /self->t0/ used to guard against executing a "done" action without a corresponding "start" by checking for an existing (non-zero) timestamp. (When could that happen? Consider a DTrace script that starts up just after the "start" action but before the "done".) This is a very common DTrace idiom.

The corresponding ldmd*::mig_decompress:mig-decompress-start and and ldmd*::mig_decompress:mig-decompress-done clauses do similar calculation. During a single domain's migration, either the compress or decompress actions will appear on a given host, but coding both of them lets me use the same script on both source and target systems, or if a given system acts in both roles while the script is running.

The remaining probes are fired for overall migration and during the different "passes" OVMSS uses to transmit memory. The pass 1 transmits all of memory, pass 2 retransmits the memory that changed while pass 1 was running, and pass 3 retransmits the memory that changed during pass 2. The amount of memory is expected to decrease over subsequent passes. Eventually OVMSS says "enough already - let's pause the domain and transmit the residual bits" - an algorithm typical for virtual machine migration.

migrate.d
#!/usr/sbin/dtrace -qs

/* 
 * Based on Menno Lageman's DTrace scripts and helpful suggestions.
 * Jeff Savit, June 2011
 */


unsigned long long total_uncompressed;
unsigned long long total_compressed;

BEGIN
{
        printf("Type CTRL-C to exit when migrations complete.\n");
	total_uncompressed = 0l;
	total_compressed = 0l;
}


/*
 * compression: Save uncompressed size and start timestamp.
 */

ldmd*:::mig-compress-start
{
        self->uncompressed_size = arg0;
	total_uncompressed = total_uncompressed + self->uncompressed_size;
        self->t0 = timestamp;
        /* printf("%s %s -- %d bytes at %Y\n",  probefunc, probename, self->uncompressed_size, walltimestamp); */
	@count[probefunc,probename] = count();
}


/*
 * compression: show amount saved and calculate the elapsed time.
 */

ldmd*:::mig-compress-done 
/self->t0/
{
        this->t1 = timestamp;
        this->compressed_size=arg0;
	total_compressed = total_compressed + this->compressed_size;
	this->saved = (self->uncompressed_size - this->compressed_size);
        /* printf("%s %s -- %d bytes reduced to %d saving %d took %d ms\n", probefunc, probename,
	    self->uncompressed_size, this->compressed_size, this->saved,
            (this->t1 - self->t0) / 1000000); */
	@count[probefunc,probename] = count();
	@sum_compressed[probefunc,probename] = sum(this->saved);
        @sum_compress_time[probefunc,probename] = sum((this->t1 - self->t0) / 1000000);
	@avg_compress_time[probefunc,probename] = avg((this->t1 - self->t0) / 1000000);
	@compress_saved[probename] = quantize(this->saved);
}


/*
 * decompression: Save uncompressed size and start timestamp.
 */

ldmd*::mig_decompress:mig-decompress-start 
{
        self->compressed_size = arg0;
	total_compressed = total_compressed + self->compressed_size;
        self->t0 = timestamp;
        /* printf("%s %s -- %d bytes at %Y \n", probefunc, probename, self->compressed_size, walltimestamp); */
	@count[probefunc,probename] = count();
}


/*
 * decompression: show amount saved and calculate the elapsed time.
 */

ldmd*::mig_decompress:mig-decompress-done
/self->t0/
{
        this->t1 = timestamp;
        this->decompressed_size=arg0;
	total_uncompressed = total_uncompressed + this->decompressed_size;
	this->saved = (this->decompressed_size - self->compressed_size);
        /* printf("%s %s -- %d bytes decompressed from %d saving %d took %d ms\n",
	    probefunc, probename,
	    this->decompressed_size, self->compressed_size, this->saved,
            (this->t1 - self->t0) / 1000000); */
	@count[probefunc,probename] = count();
	@sum_compressed[probefunc,probename] = sum(this->saved);
        @sum_compress_time[probefunc,probename] = sum((this->t1 - self->t0) / 1000000);
        @avg_compress_time[probefunc,probename] = avg((this->t1 - self->t0) / 1000000);
	@compress_saved[probename] = quantize(this->saved);
}


/*
 * migration memory transfer pass start and end
 */

ldmd*:::mig-pass-start
{
        self->t0 = timestamp;
	self->pass = arg0;
        printf("%s %s -- pass %d at %Y\n", probefunc, probename, self->pass, walltimestamp);
	@count[probefunc,probename] = count();
}

ldmd*:::mig-pass-done
{
        this->t1 = timestamp;
        printf("%s %s -- pass %d at %Y took %d \n", probefunc, probename, self->pass, walltimestamp,
            (this->t1 - self->t0) / 1000000);
	@count[probefunc,probename] = count();
        @sum_pass_time[probefunc,probename,self->pass] = sum((this->t1 - self->t0) / 1000000);
}


/*
 * migration transfer start and end
 */

ldmd*:::mig-transfer-start  
{
        self->t0 = timestamp;
        printf("%s %s -- at %Y\n", probefunc, probename, walltimestamp);
	@count[probefunc,probename] = count();
}

ldmd*:::mig-transfer-done
{
        this->t1 = timestamp;
        printf("%s %s -- at %Y took %d \n", probefunc, probename, walltimestamp,
            (this->t1 - self->t0) / 1000000);
	@count[probefunc,probename] = count();
        @sum_mig_transfer_time[probefunc,probename] = sum((this->t1 - self->t0) / 1000000);
}

END
{
        printf("\n\nProbe Counts\n");
        printa(@count);
        printf("\n\n(De)compressed bytes by probe\n");
        printa(@sum_compressed);
	printf("\n\nTotal bytes uncompressed = %d - total compressed = %d - percent reduction %d", 
		total_uncompressed, total_compressed, (100*(total_uncompressed-total_compressed)/total_uncompressed) );
        printf("\n\nTotal (de)compression time\n");
        printa(@sum_compress_time);
        printf("\n\nAverage (de)compression time\n");
	printa(@avg_compress_time);
	printf("\n\nCompression savings\n");
	printa(@compress_saved);
        printf("\n\nTime by pass (source host only)\n");
        printa(@sum_pass_time);
        printf("\n\nTransfer Time\n");
	printa(@sum_mig_transfer_time);
}

EDIT: changed June 22 to use longs and milliseconds. That cleans up the display misformating below and handles bigger guest domain memories.

The results

I ran the script on the same T5120 and T5220 that I used in my previous testing. Here are the results (edited for brevity) from the source machine:

# ./migrate.d
Type CTRL-C to exit when migrations complete.
migrate_mem mig-transfer-start -- at 229030374454886
migrate_mem mig-pass-start -- pass 1 at 2011 Jun  9 12:11:57
migrate_mem mig-pass-start -- pass 2 at 2011 Jun  9 12:12:07
migrate_mem mig-pass-done -- pass 1 at 229038959512246 took 8583294 
migrate_mem mig-pass-done -- pass 2 at 229040364296377 took 694034 
migrate_mem mig-pass-start -- pass 3 at 2011 Jun  9 12:12:08
migrate_mem mig-pass-done -- pass 3 at 229041444794966 took 612191 
migrate_mem mig-transfer-done -- at 2011 Jun  9 12:12:08 took 612328 
^C
Probe Counts
  migrate_mem                                         mig-transfer-done                                                 1
  migrate_mem                                         mig-transfer-start                                                1
  migrate_mem                                         mig-pass-done                                                     3
  migrate_mem                                         mig-pass-start                                                    3
  mig_compress                                        mig-compress-done                                             10176
  mig_compress                                        mig-compress-start                                            10176
(De)compressed bytes by probe
  mig_compress                                        mig-compress-done                                         526348887
Total bytes uncompressed = 662896640 - total compressed = 140071626 - percent reduction 78
Total (de)compression time
  mig_compress                                        mig-compress-done                                         136264607
Average (de)compression time
  mig_compress                                        mig-compress-done                                             13390
Compression savings
  mig-compress-done                                 
           value  ------------- Distribution ------------- count    
            1024 |                                         0        
            2048 |                                         1        
            4096 |                                         1        
            8192 |                                         9        
           16384 |@                                        369      
           32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  9796
           65536 |                                         0
Time by pass (source host only)
  migrate_mem                                         mig-pass-done                                                     3           612191
  migrate_mem                                         mig-pass-done                                                     2           694034
  migrate_mem                                         mig-pass-done                                                     1          8583294
Transfer Time
  migrate_mem                                         mig-transfer-done                                            612328

This tells me that OVMSS compressed the transmitted memory by 78%, which I'm sure helped speed up migration (your mileage may vary). The average time to compress a block of memory (which happened to be 64K in size) was 13390 microseconds, and the total compression time was 136 seconds. How could the total compression time far exceed the time to migrate the domain? Because it counts multiple parallel threads of execution - CPU time can advance much faster than wall clock time for a highly parallel application. OVMSS makes very good use of parallelism! Pass 1 took about 8.6 seconds (the unit has been scaled from nanoseconds to microseconds), with subsequent passes much faster.

Here is corresponding output from the target machine, decompressing the received memory contents. Now we see decompress probes firing instead of compress probes. Fortunately, the amount of decompression matches the amount of compression - something would be very odd otherwise!

Probe Counts
  mig_tgt_warm_recv_state                             mig-transfer-done                                                 1
  mig_tgt_warm_recv_state                             mig-transfer-start                                                1
  mig_decompress                                      mig-decompress-done                                           10176
  mig_decompress                                      mig-decompress-start                                          10176
(De)compressed bytes by probe
  mig_decompress                                      mig-decompress-done                                       526348887
Total bytes uncompressed = 664535040 - total compressed = 140349831 - percent reduction 78
Total (de)compression time
  mig_decompress                                      mig-decompress-done                                        24515462
Average (de)compression time
  mig_decompress                                      mig-decompress-done                                            2409
Compression savings
  mig-decompress-done                               
           value  ------------- Distribution ------------- count    
            1024 |                                         0        
            2048 |                                         1        
            4096 |                                         1        
            8192 |                                         9        
           16384 |@                                        369      
           32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  9796     
           65536 |                                         0        
Transfer Time
  mig_tgt_warm_recv_state                             mig-transfer-done                                          11718964

Summary

The addition of DTrace support to OVMSS provides valuable new information on how it works, and can be used to determine its internal flow and timing.

This provides a way to get information about domain activity that is unavailable through other methods. I would not be surprised to see this enhanced in future editions of Oracle VM Server.

Much thanks to Menno for his helpful suggestions.

NOTE: These probes are not designated as "stable interfaces", and may be replaced as the product continues to emerge. The algorithm, nature and number of passes, probe names, and so forth are subject to change.

Comments:

Post a Comment:
Comments are closed for this entry.
About

jsavit

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today