News, tips, partners, and perspectives for the Oracle Linux operating system and upstream Linux kernel work

Oracle Data Analytics Accelerator (DAX) for SPARC

This blog post was written by kernel developers Jon Helman and Rob Gardner, whose code for the Oracle Data Analytics Accelerator driver was accepted into the Linux source earlier this year. This is the final installment in the kernel blog series on Linux enablement for SPARC chip features.

Oracle DAX Support in Linux

The Oracle Data Analytics Accelerator (DAX) is a coprocessor built into the SPARC M7, S7, and M8 chips, which can perform various operations on data streams. These operations are particularly suited to accelerate database queries but have a wide variety of uses in the field of data analytics. For the duration of a coprocessor operation, the main processors are free to execute other instruction streams. Since the coprocessor can operate on large data sets, this can potentially free up processor resources significantly. Each system may have multiple DAX coprocessors, and each DAX has multiple execution units. Each unit is capable of doing independent work in parallel with the others and applications may be able to take advantage of this parallelism for some data sets.

DAX Operations

The explanations and drawings below show in detail the basic operations that the DAX can perform.

Scan

The scan operation finds all instances of a value, values, or range of values in a list. In the following example, the DAX performs the operation of finding each instance of the search value, A, in the input vector. The resulting bit vector has a 1 set in each position where an A is found.
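To make the result format concrete, here is a minimal software model of a single-value scan. It is only a sketch of the semantics, not how the DAX or libdax implements the operation: the function name scan_value, the one-byte element width, and the bit ordering within each output byte (least significant bit first) are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Software model of a scan for one value: bit i of `out` is set when
 * in[i] equals `key`. `out` must hold at least (n + 7) / 8 bytes.
 * Returns the number of matches. The LSB-first bit order is an
 * assumption of this sketch, not a statement about the hardware.
 */
size_t scan_value(const uint8_t *in, size_t n, uint8_t key, uint8_t *out)
{
	size_t matches = 0;

	memset(out, 0, (n + 7) / 8);
	for (size_t i = 0; i < n; i++) {
		if (in[i] == key) {
			out[i / 8] |= (uint8_t)(1u << (i % 8));
			matches++;
		}
	}
	return matches;
}
```

The DAX performs the equivalent work without occupying a processor core, and also supports searching for several values or a range.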

Select

The select operation pulls elements from a vector to produce a subset which corresponds to the bits set in a bitmap. In the following example, the DAX filters the input data so that the resulting output vector consists of only those elements for which a 1 is set in the bit vector.
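The same semantics can be written as a short software model. This is a sketch under stated assumptions: the name select_elements, one-byte elements, and an LSB-first bitmap layout are illustrative choices, not the libdax interface.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Software model of select: copy in[i] to the output whenever bit i of
 * `bits` is set. `out` must have room for up to n elements. Returns the
 * number of elements written. The LSB-first bit order is an assumption
 * of this sketch.
 */
size_t select_elements(const uint8_t *in, size_t n,
		       const uint8_t *bits, uint8_t *out)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (bits[i / 8] & (1u << (i % 8)))
			out[count++] = in[i];
	}
	return count;
}
```

A bit vector produced by a scan can be passed directly as the bitmap here, which is how the two operations combine to filter a column.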

Extract

The extract operation converts a vector of values from one format to another format. In the following example, the DAX converts from an RLE-encoded input vector to an expanded output vector. (RLE, or run-length encoding, is a compression technique in which repeated elements are represented by a tuple consisting of the element and the number of repetitions.) This is just one of the many possible format conversions.
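As an illustration of the RLE case only, the following sketch expands a list of (value, count) runs into a flat vector. The struct rle_run layout and the function name rle_expand are hypothetical; the actual DAX input formats are described in the kernel documentation and differ from this simplified tuple.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical run representation: a value and its repetition count. */
struct rle_run {
	uint8_t value;
	uint8_t count;
};

/*
 * Expand `nruns` runs into `out`, which can hold `cap` bytes.
 * Returns the number of bytes written, or SIZE_MAX if the expanded
 * data would not fit in the output buffer.
 */
size_t rle_expand(const struct rle_run *runs, size_t nruns,
		  uint8_t *out, size_t cap)
{
	size_t pos = 0;

	for (size_t r = 0; r < nruns; r++) {
		if (runs[r].count > cap - pos)
			return SIZE_MAX;
		memset(out + pos, runs[r].value, runs[r].count);
		pos += runs[r].count;
	}
	return pos;
}
```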

Translate

The translate operation takes as input a vector and a bitmap. Each element in the vector is used as an index into the bitmap, and that bit is placed into the output bitmap. This operation is more easily described with this short code segment and illustrated in the diagram which follows.

for (i=0; i<N; i++) OUTPUT[i] = BITMAP[INPUT[i]];
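The one-line loop above treats BITMAP and OUTPUT as arrays of single bits. A self-contained version with packed bitmaps might look like the sketch below. The names translate and get_bit, the LSB-first bit order, and the choice to emit 0 for an out-of-range index are all assumptions of this example rather than documented DAX behavior.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read bit i of a packed bitmap (LSB-first order assumed). */
int get_bit(const uint8_t *bits, size_t i)
{
	return (bits[i / 8] >> (i % 8)) & 1;
}

/*
 * Software model of translate: for each input element, look up the bit
 * at that index in `table` and store it as bit i of `out`. `table`
 * holds `table_bits` valid bits; `out` must hold (n + 7) / 8 bytes.
 * Indices beyond the table yield 0 in this sketch.
 */
void translate(const uint8_t *in, size_t n,
	       const uint8_t *table, size_t table_bits, uint8_t *out)
{
	memset(out, 0, (n + 7) / 8);
	for (size_t i = 0; i < n; i++) {
		if (in[i] < table_bits && get_bit(table, in[i]))
			out[i / 8] |= (uint8_t)(1u << (i % 8));
	}
}
```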

Coprocessor Features

Control flow

  1. The hardware defines a Coprocessor Control Block (CCB) which specifies the operation to be done, the addresses of the buffers to process as well as metadata describing those buffers (format of the data, number of elements in the stream, compression format, etc.). 
  2. One or more CCBs are presented to the coprocessor via software. 
  3. Multiple requests may be enqueued in the hardware and these are serviced as resources allow.
  4. Many threads may make requests concurrently, and resources are shared much like the CPU is shared.
  5. After submission, software is free to do other work until it requires the computational results from the coprocessor.
  6. Upon completion of the request, no interrupt is sent, as is commonly done with other hardware. Rather, completion is signalled via memory which can be polled by software. The processor provides an efficient mechanism for polling this completion status in the form of two new instructions, monitored load and monitored wait. The monitored load instruction performs a memory load while also marking the address as one of interest. The monitored wait instruction pauses the virtual processor until one of several events occurs, one of which is modification of the memory location of interest. This allows other hardware threads to use core resources while the monitoring thread is suspended.

Data access

  • The DAX hardware reads from and writes to physical memory directly, which avoids moving large amounts of data through the main processor.
  • In order to optimize cache utilization, an option is provided that directs the DAX to place output directly in the processor's L3 cache. 
  • The DAX also optimizes data accesses with its capability of operating on compressed data: it can decompress data while performing the operation and hence does not need temporary memory to hold decompressed intermediate output. This helps to reduce the number of physical memory reads and increase the size of possible data sets. 
  • In addition to compressed data, the DAX can work with a variety of data formats and bit widths including fixed-width bit- and byte-packed, and variable width. The multitude of possible data formats and supported bit widths is documented in the Linux kernel file located at Documentation/sparc/oradax/dax-hv-api.txt.

Software Stack

Initiating a Request

An application will typically use the available function library (libdax) to utilize the capabilities of the coprocessor, though it is also feasible to use the raw driver interface. A request to submit an operation to the DAX starts with a user calling one of the libdax functions (e.g. dax_scan_value). These functions rigorously validate their arguments and convert them into the hardware-defined CCB format, which is then passed to the driver. The driver locks the pages containing the input and output buffers and then submits the CCBs to the hypervisor via the hypercall mechanism. The hypervisor translates each address in the CCB from virtual to physical and then initiates the hardware operation. Control immediately returns to the hypervisor, subsequently to the driver, and then back to libdax.

Request Completion

Since the kernel and hypervisor are not involved in processing a CCB after it has been submitted to the DAX, requests to the DAX driver do not block waiting for completion as is traditional for many other drivers. This means that the userland application has the option of performing other work while waiting for completion. libdax provides two variants of each DAX operation: blocking (e.g. dax_scan_value or dax_extract) and non-blocking (e.g. dax_scan_value_post and dax_extract_post). Completion of a request is signaled via a status byte in shared memory called the completion area. libdax waits on this byte using the monitored load and monitored wait instructions. The function dax_poll is provided for the application to check for completion in the non-blocking scenario. In libdax, the logic of checking the completion area is:

while (1) {
	uint8_t status = loadmon8(&completion_area->status);
	if (status == INPROGRESS)
		mwait(TIMEOUT);
	else
		break;
}
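The monitored load and monitored wait instructions are specific to SPARC, so the loop above cannot be tried elsewhere. For readers who want to experiment with the control flow on other hardware, a rough portable analogue using C11 atomics is sketched below. It spins instead of suspending the hardware thread, so it does not have the efficiency benefit described earlier. The status codes, the struct layout, and the name poll_completion are hypothetical and do not reflect the real completion area format.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical status codes for this sketch only. */
enum {
	CCB_INPROGRESS = 0,
	CCB_DONE = 1
};

/* Hypothetical completion area: a single status byte. */
struct completion_area {
	_Atomic uint8_t status;
};

/*
 * Poll the status byte up to `max_polls` times. Returns the final
 * status once it is no longer CCB_INPROGRESS, or CCB_INPROGRESS if
 * the poll budget runs out first.
 */
uint8_t poll_completion(struct completion_area *ca, unsigned long max_polls)
{
	for (unsigned long i = 0; i < max_polls; i++) {
		uint8_t status = atomic_load_explicit(&ca->status,
						      memory_order_acquire);
		if (status != CCB_INPROGRESS)
			return status;
	}
	return CCB_INPROGRESS;
}
```

Bounding the number of polls gives the caller a natural point to do other work and come back later, which mirrors the non-blocking usage pattern.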

Driver Operation

The oradax driver provides a transport mechanism for conveying one or more CCBs from a user application to the coprocessor, and also performs several housekeeping functions essential to security and integrity. The API consists of the Linux system calls open, close, read, write, and mmap. The /open/ call initializes a context for use by a single thread. The context contains buffers to hold CCBs and completion areas, and records the virtual pages used by requests. Multiple threads may utilize the coprocessor, but each thread must do its own /open/. A corresponding /close/ releases all resources associated with all requests submitted by the thread. The /mmap/ call is used to gain access to the completion area buffer. Driver commands are given via /write/, and responses (when necessary) are retrieved via /read/. Driver commands involve a CCB or group of CCBs and are submit, kill, request info, and dequeue.

The submit command is a /write/ of a buffer containing one or more CCBs to be conveyed to the coprocessor. Since the coprocessor accesses physical memory directly, the virtual to physical mappings of the I/O buffers must be locked in order to prevent the physical pages from being repurposed by the kernel. The driver locks all pages associated with the request and transmits the CCBs to the hypervisor. If any of the CCBs were not submitted successfully, the corresponding pages are unlocked and the /write/ return value will indicate this discrepancy.

If not all CCBs could be submitted successfully, then a /read/ must be done to retrieve further information that describes what went wrong. If all CCBs were submitted successfully, the application may poll for completion or proceed immediately to other tasks and defer polling until the results are required for further progress. The current state of a CCB may be queried at any time using the request info command, and a CCB may be terminated with the kill command. The dequeue command explicitly unlocks the pages associated with all completed requests; it is not usually necessary to call this since pages are unlocked implicitly during the submission process.
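The decision an application makes after a submit can be sketched as a small helper. This example rests on two assumptions that should be checked against the kernel documentation before relying on them: that each CCB occupies 64 bytes, and that the /write/ return value is the number of bytes accepted. The names CCB_BYTES, submit_outcome, and classify_submit are hypothetical and are not part of the driver interface.

```c
#include <stddef.h>
#include <sys/types.h>

#define CCB_BYTES 64	/* assumed size of one CCB */

enum submit_outcome {
	SUBMIT_ERROR,		/* write failed outright */
	SUBMIT_PARTIAL,		/* some CCBs rejected: follow up with read */
	SUBMIT_COMPLETE		/* every CCB accepted: poll when convenient */
};

/*
 * Interpret the return value of a submit write for a request of
 * `nccbs` CCBs, assuming it is a byte count. Stores the number of
 * accepted CCBs in *accepted.
 */
enum submit_outcome classify_submit(ssize_t written, size_t nccbs,
				    size_t *accepted)
{
	if (written < 0) {
		*accepted = 0;
		return SUBMIT_ERROR;
	}
	*accepted = (size_t)written / CCB_BYTES;
	return (*accepted == nccbs) ? SUBMIT_COMPLETE : SUBMIT_PARTIAL;
}
```

On SUBMIT_PARTIAL the application would issue the /read/ described above to learn why the remaining CCBs were not accepted.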

For much more detail, see Documentation/sparc/oradax/oracle-dax.txt.

Conclusion

Oracle DAX is supported by the oradax device driver and is available beginning with the Linux 4.16 kernel. A user may make calls directly to the oradax driver to submit requests to the DAX, and the kernel documentation files contain example code to demonstrate this. However, we fully expect applications that use the DAX to go through the libdax library, which provides higher-level services for analytics and frees the application writer from the need to understand the low-level DAX command structure. The library is fully open-sourced and available at the Oracle open source project webpage and includes a full set of manpages to describe the DAX operations.

Feedback is always welcome and we would be interested in hearing about your experiences with the DAX.
