
Pat Shuff's Blog

database options - compression

A natural follow-on to database partitioning is database compression. With partitioning we wanted to split everything into buckets based on how frequently it is used, keeping the most heavily used data small enough to fit into memory. The older data that we don't access very frequently can be put on slower and lower cost storage. In this blog we are going to look at different techniques to use the cheaper storage even more. Since we don't access this data very frequently, and most of the time when we do access it we only need to read it and not write to it, we should be able to take advantage of common data and compress the information to consume less storage. If, for example, we are storing census data and we want to record city and state information, we can avoid having Punxsutawney, Pennsylvania stored 5,900 times based on the current population. If we stored a copy of this roughly 6,000 times it would take up 6,000 times 12 bytes for the city and 6,000 times 12 bytes for the state. We would also store 15767 as the zip code roughly 6,000 times, consuming 6,000 times 9 bytes. If we could create a secondary table that contains Punxsutawney, Pennsylvania 15767 and correlate it to the hexadecimal value 2e, we could store 2e for the city, state, and zip code, thus consuming one byte each rather than 12, 12, and 9 bytes. We effectively save 180,000 bytes by storing a replacement value rather than storing the long strings multiple times. This is effectively the way that hybrid columnar compression works.
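
To make the arithmetic explicit (using the rough field sizes above, which are illustrative assumptions):

uncompressed: 6,000 rows x (12 + 12 + 9) bytes = 198,000 bytes
compressed:   6,000 rows x (1 + 1 + 1) bytes   =  18,000 bytes
savings:      198,000 - 18,000                 = 180,000 bytes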

Compression can be done at a variety of levels and locations. Disk vendors have for years touted in-place compression on storage to consume less space, and compression has been used in a variety of industries. Audio compression, for example, takes recorded audio, samples the changes in volume and pitch at a reduced rate, and records only about 8,000 samples per second on the theory that the ear cannot really hear changes faster than that. These samples are then compressed and stored in a lossy format such as mp3. Programs know how to take the mp3 format, rebuild the samples, and drive a speaker to approximate the sound that was originally created. Some people can hear the difference and still want to listen to music recorded on reel-to-reel tape or vinyl because they find the fidelity better than CD-ROM or DVD. Video does much the same thing by taking the large number of bits on the screen and breaking the image into squares. Only the squares that are changing are transmitted rather than sending all of the data for the whole screen; the blocks that did not change are simply redisplayed rather than being retransmitted thirty times a second. This allows for video distribution of movies and recordings across the internet and storage on a DVD rather than recording all of the data all of the time.

Generically compressing data for a database can be complex; done properly it works well, but done poorly it can cause performance problems and issues when reading back the data. Let's take the census data that we talked about earlier. If we store the data as raw bytes it will consume 198K of space on the disk. If we get the compression ratio that we talked about we will consume roughly 20K. This gives us a 10x compression ratio and saves us a significant amount of space on the disk. If the disk sub-system does this compression for us, we write 198K of data to the disk and it consumes 20K of storage on the spindles, but when we read it back it has to be rehydrated, so we transfer 198K back to the processor and consume 198K of memory to hold the rehydrated data. If the database knew the compression algorithm and compressed the data in memory to begin with, it would only transmit 20K to the disk, store 20K on the spindles, read 20K back from the disk, and consume 20K of memory to hold the data. This might not seem significant, but if we are reading the data across a 2.5 G/second SCSI connection it takes 80ms to read the data rather than 8ms. That 72ms difference can be significant if we have to repeat the read a few thousand times. It can also be significant if we have a 1 GigE network connection rather than a direct attached disk; the transfer time jumps to 200ms when the data moves from an attached disk to an nfs or smb mounted disk. We see performance problems like this with database backups to third party storage solutions like Data Domain. If you take a database backup and copy it to a Data Domain appliance you get the 10x compression and the backup takes roughly an hour, but you have to estimate that it will take seven to eight times as long to rehydrate the data, so a restore will take 7-8 hours to recover your database.

The recommended solution is to use compression inside the database rather than third party compression solutions that are designed to compress backups, home directories, and email attachments. Oracle offers Advanced Compression options for information stored in the database. If you look at the 12c Advanced Compression Data Sheet you will notice that there are a variety of options available. You can compress

  • using historic access patterns (heat map and ADO options)
  • using row compression (by analyzing update and insert operations as they occur)
  • file compression (duplicate file links and file compression of LOBs, BLOBs, and CLOBs)
  • backup data compression
  • Data Guard compression of redo logs before transmission
  • index compression
  • network transmission compression of results to client systems
  • hybrid columnar compression (Exadata and ZFS only)
  • storage snapshot optimization (ZFS only)

Heat Map Compression

At the segment level, Heat Map tracks the timestamps of the most recent modification and query of each table and partition in the database. At the block level, Heat Map tracks the most recent modification timestamp. These timestamps are used by Automatic Data Optimization to define compression and storage policies which will be automatically maintained throughout the lifecycle of the data. Heat Map skips internal operations done for system tasks -- automatically excluding Stats Gathering, DDLs, Table Redefinitions and similar operations. In addition, Heat Map can be disabled at the session level, allowing DBAs to exclude manual maintenance and avoid polluting the Heat Map data.
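
As a minimal sketch of turning this on (the SCOTT schema is just a placeholder), Heat Map tracking is enabled at the instance level and the segment-level statistics it collects can then be queried from the data dictionary:

ALTER SYSTEM SET HEAT_MAP = ON SCOPE=BOTH;

-- review what Heat Map has tracked for a given schema
SELECT * FROM dba_heat_map_segment WHERE owner = 'SCOTT';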

With the data collected by Heat Map, Oracle Database can automatically compress each partition of a table independently based on Heat Map data, implementing compression tiering. This compression tiering can use all forms of Oracle table compression, including: Advanced Row Compression and all levels of Hybrid Columnar Compression (HCC) if the underlying storage supports HCC. Oracle Database can also compress individual database blocks with Advanced Row Compression based on Heat Map data.

Row Compression

In this example, a segment-level ADO policy is created to automatically compress the entire table after there have been no modifications for at least 30 days, using Advanced Row Compression:

ALTER TABLE employee ILM ADD POLICY ROW STORE COMPRESS ADVANCED SEGMENT AFTER 30 DAYS OF NO MODIFICATION;

In this next example, a row-level ADO policy is created to automatically compress blocks in the table, after no rows in the block have been modified for at least 3 days, using Advanced Row Compression:

ALTER TABLE employee ILM ADD POLICY ROW STORE COMPRESS ADVANCED ROW AFTER 3 DAYS OF NO MODIFICATION;

In addition to Smart Compression, other ADO policy actions can include data movement to other storage tiers, including lower cost storage tiers or storage tiers with other compression capabilities such as Hybrid Columnar Compression (HCC). HCC requires the use of Oracle Storage – Exadata, Pillar Axiom or Sun ZFS Storage Appliance (ZFSSA).

In this example, a tablespace-level ADO policy automatically moves the table to a different tablespace when the tablespace currently containing the object meets a pre-defined tablespace fullness threshold:

ALTER TABLE employee ILM ADD POLICY TIER TO ilmtbs;

Another option when moving a segment to another tablespace is to set the target tablespace to READ ONLY after the object is moved. This is useful for historical data during database backups, since subsequent full database backups will skip READ ONLY tablespaces.
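
As a sketch of that variation (assuming the same illustrative employee table and ilmtbs tablespace, and the 12c ILM syntax), the target tablespace can be marked READ ONLY as part of the tiering policy:

ALTER TABLE employee ILM ADD POLICY TIER TO ilmtbs READ ONLY;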

Advanced Row Compression uses a unique compression algorithm specifically designed to work with OLTP applications. The algorithm works by eliminating duplicate values within a database block, even across multiple columns. Compressed blocks contain a structure called a symbol table that maintains compression metadata. When a block is compressed, duplicate values are eliminated by first adding a single copy of the duplicate value to the symbol table. Each duplicate value is then replaced by a short reference to the appropriate entry in the symbol table.
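
A minimal sketch of what this looks like in practice (the orders table and its columns are illustrative, not from the data sheet): create a table with Advanced Row Compression and then confirm the setting from the data dictionary.

CREATE TABLE orders (
  order_id   NUMBER PRIMARY KEY,
  customer   VARCHAR2(64),
  order_date DATE
) ROW STORE COMPRESS ADVANCED;

-- verify the compression attribute recorded for the table
SELECT table_name, compression, compress_for FROM user_tables WHERE table_name = 'ORDERS';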

File Compression

Consider an email application where 10 users receive an email with the same 1MB attachment. Without Advanced LOB Deduplication, the system would store one copy of the file for each of the 10 users, requiring 10MB of storage. If the email application in our example uses Advanced LOB Deduplication, it will store the 1MB attachment just once. That's a 90% savings in storage requirements.

In addition to the storage savings, Advanced LOB Deduplication also increases application performance. Specifically, write and copy operations are much more efficient since only references to the SecureFiles data are written. Further, read operations may improve if duplicate SecureFiles data already exists in the buffer cache.
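
A minimal sketch of enabling this for a SecureFiles LOB column (the table and column names are illustrative; COMPRESS HIGH adds Advanced LOB Compression on top of the deduplication):

CREATE TABLE email_attachments (
  msg_id     NUMBER,
  attachment BLOB
)
LOB (attachment) STORE AS SECUREFILE (
  DEDUPLICATE
  COMPRESS HIGH
);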

Backup Data Compression

RMAN makes a block-by-block backup of the database data, also known as a “physical” backup, which can be used to perform database, tablespace or block level recovery. Data Pump is used to perform a “logical” backup by offloading data from one or more tables into a flat file.

Due to RMAN's tight integration with Oracle Database, backup data is compressed before it is written to disk or tape and doesn't need to be uncompressed before recovery, providing an enormous reduction in storage costs and a potentially large reduction in backup and restore times. There are three levels of RMAN compression: LOW, MEDIUM, and HIGH. The amount of storage savings increases from LOW to HIGH, while potentially consuming more CPU resources.

Data Pump compression is an inline operation, so the reduced dump file size means a significant savings in disk space. Unlike operating system or file system compression utilities, Data Pump compression is fully inline on the import side as well, so there is no need to decompress a dump file before importing it. The compressed dump file sets are automatically decompressed during import without any additional steps by the Database Administrator.
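
A sketch of the export side (the directory object, dump file, and schema names are placeholders); the corresponding import needs no extra flags because decompression happens automatically:

expdp hr DIRECTORY=dp_dir DUMPFILE=hr_comp.dmp COMPRESSION=ALL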

Data Guard Redo Log Compression

Data Guard Redo Transport Services are used to transfer redo data to the standby site(s). With Advanced Compression, redo data may be transmitted in a compressed format to reduce network bandwidth consumption and, in some cases, reduce transmission time. Redo can be transmitted in a compressed format whether the Oracle Data Guard configuration uses synchronous redo transport (SYNC) or asynchronous redo transport (ASYNC).
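
A sketch of enabling this on the primary database (the service name and DB_UNIQUE_NAME are placeholders for a standby destination that is already configured):

ALTER SYSTEM SET LOG_ARCHIVE_DEST_2='SERVICE=standby_db ASYNC COMPRESSION=ENABLE DB_UNIQUE_NAME=standby_db' SCOPE=BOTH;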

Index Compression

Advanced Index Compression is a new form of index block compression. Creating an index using Advanced Index Compression reduces the size of all supported unique and non-unique indexes while still providing efficient access to them. Advanced Index Compression works well on all supported indexes, including those that are not good candidates for the existing index Prefix Compression feature (indexes with no duplicate values, or few duplicate values, for a given number of leading columns).
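
A sketch of creating and rebuilding an index with Advanced Index Compression (the index and column names are illustrative; ADVANCED LOW is the level available in 12.1):

CREATE INDEX emp_name_ix ON employee (last_name, first_name) COMPRESS ADVANCED LOW;

-- an existing index can be compressed during a rebuild
ALTER INDEX emp_name_ix REBUILD COMPRESS ADVANCED LOW;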

Network Compression

Advanced Network Compression, also referred to as SQL Network Data Compression, can be used to compress the network data to be transmitted at the sending side and then uncompress it at the receiving side to reduce the network traffic. Advanced Network Compression reduces the size of the session data unit (SDU) transmitted over a data connection. Reducing the size of data reduces the time required to transmit the SDU.

Advanced Network Compression not only makes SQL query responses faster but also saves bandwidth. On narrow bandwidth connections with available CPU capacity, it can significantly improve performance. The compression is transparent to client applications.
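
A sketch of the sqlnet.ora settings involved (the values shown are examples; the parameters need to be set on both the client and the server for compression to be negotiated):

SQLNET.COMPRESSION = on
SQLNET.COMPRESSION_LEVELS = (low, high)
SQLNET.COMPRESSION_THRESHOLD = 1024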

We won't cover the last two options since they don't apply to database services in the cloud unless you purchase the Exadata as a Service option. There is a Compression Estimation Tool to help you estimate the benefits of compression. A sample estimate looking at 100 TB of database data shows cost savings in the millions of dollars.

There is also a Compression Advisor that can be downloaded and installed in your database to look at your tables and estimate how much storage you can save based on your data and your usage patterns. You can watch a four minute marketing video on the tool and how to use it. I recommend Tyler Mouth's blog entry on customizing the output of the compression advisor to make it a little more user friendly. I would also look at Mike Haas's blog on compression and the DBAORA blog that provides a good overview of 11g compression. Mike Messin's blog is a good walkthrough of installing and executing the compression advisor.
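
As a rough sketch of running the advisor yourself (the scratch tablespace, schema, and table names are placeholders), DBMS_COMPRESSION can estimate the ratio you would get from Advanced Row Compression:

SET SERVEROUTPUT ON
DECLARE
  l_blkcnt_cmp    PLS_INTEGER;
  l_blkcnt_uncmp  PLS_INTEGER;
  l_row_cmp       PLS_INTEGER;
  l_row_uncmp     PLS_INTEGER;
  l_cmp_ratio     NUMBER;
  l_comptype_str  VARCHAR2(100);
BEGIN
  -- sample the EMPLOYEE table in the SCOTT schema using USERS as scratch space
  DBMS_COMPRESSION.GET_COMPRESSION_RATIO('USERS', 'SCOTT', 'EMPLOYEE', NULL,
    DBMS_COMPRESSION.COMP_ADVANCED,
    l_blkcnt_cmp, l_blkcnt_uncmp, l_row_cmp, l_row_uncmp,
    l_cmp_ratio, l_comptype_str);
  DBMS_OUTPUT.PUT_LINE('Estimated compression ratio: ' || l_cmp_ratio);
END;
/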

In summary, compression can be used with a variety of mechanisms based on your usage patterns and objectives. This option is not one size fits all and requires a DBA with knowledge of the usage patterns and familiarity with the data and applications. Letting a non-DBA decide on the compression mechanism can lead to poor performance, missed recovery time objectives, increased network traffic, and higher processor utilization than necessary. The Database 12c Compression Documentation details how to create tables that are compressed, how to see whether tables are compressed, and how to update tables for compression. Compression is a mechanism that can directly reduce your storage costs by consuming significantly less storage to hold the same data. In the cloud this correlates directly to storage cost savings. You get compression as an option with the High Performance Edition and Extreme Performance Edition but not the Standard Edition or Enterprise Edition versions of the database service.
