When I'm speaking to customers or at conferences, and I describe the higher levels of Database In-Memory compression, and I mean anything higher than the default, I invariably get looks like, "Yeah right, I know compression is going to impact performance so I'm not even going to consider it." The problem with this line of thinking is that it doesn't take into account that even if there is a performance penalty, it's not going to be close to the difference of not having the object in the IM column store at all.
If your choice is between populating an object in the IM column store at a higher compression level versus not populating it at all, or only partially populating it, then in most cases you will be better off taking the higher compression level. Now, this wouldn't be an issue if you had more than enough memory to populate all of your objects in the IM column store. But in the real world, most customers don't have enough memory to do that and so it's a balancing act between what can be populated with the amount of memory that is available.
Database In-Memory supports 5 levels of compression and you can even disable compression altogether if you want. As part of the ALTER TABLE/CREATE TABLE commands and the INMEMORY sub-clause, the options are:
Each successive level typically decreases the amount of memory required to populate the object at a possible reduction in scan performance. I say possible because some customers actually see a performance increase when going from MEMCOMPRESS FOR QUERY LOW to MEMCOMPRESS FOR QUERY HIGH for example.
I thought it would be interesting to run some simple tests using the LINEORDER table from the SSB schema to illustrate the differences in compression for the various compression levels. Now, this is not a benchmark and not meant to be anything more than an illustrative example.
The LINEORDER table I chose was approximately 630 GB uncompressed on disk. I wanted it to be large enough so that we would hopefully see a significant variation in memory usage based on the Database In-Memory compression level chosen. I also ran the following query at least three times to give a realtive sense of the performance difference between the different levels of compression:
SELECT SUM(lo_quantity), SUM(lo_ordtotalprice) FROM lineorder
The following chart lists the compression level, the amount of space occupied in the IM column store for that compression level and the minimum execution time of the three or more executions for that compression level:
Based on these results, we can save a huge amount of memory by moving from the default, MEMCOMPRESS FOR QUERY LOW to MEMCOMPRESS FOR QUERY HIGH. At QUERY HIGH we save over 100GB of column store memory and if we go to CAPACITY HIGH then it is less than half the QUERY LOW size. Performance is still pretty good as well. We can see that we go from 7.66 seconds to run the query to 10.14 at QUERY HIGH and 12.19 at CAPACITY HIGH. That's a degradation in query performance of around 25-37% for a space savings of around 1.6x to almost 2.5x. In many situations that may be a perfectly valid trade-off if that allows you to populate a lot more objects in the IM column store.
I think it's also worth mentioning that compression numbers for NO MEMCOMPRESS and MEMCOMPRESS FOR DML are basically the same. That's because MEMCOMPRESS FOR DML is optimized for DML operations and performs little or no data compression. In practice, it will only provide compression if all of the column values are the same.
At this point you might be thinking that you could just exclude some columns to save space, and you would be right. But I talked about the downside of doing that in my blog post "Questions You Asked: What happens if you access a column that is not populated in the IM column store?".
You might also be thinking, well what if I just populate most of the table in the IM column store? In my experiment above, what if I only had 200GB that I could allocate to the IM column store? If I used the default compression level then I cannot populate the entire LINEORDER table because we know that it will take approximately 339 GB. If I only have a 200GB column store then only about 150GB will be populated because some of that space is allocated to the 64KB Pool. If I run the test query above on this configuration it takes around 10 minutes to run. Now my system doesn't have a very good I/O subsystem, but clearly the query is going to take a lot longer than if all the data is populated in the column store. In fact, if I populate the LINEORDER table with CAPACITY HIGH and leave the dimension tables at the default FOR QUERY LOW then I can fit all of the tables in the IM column store with an allocation of only 200GB.
I think the takeaway from all of this is that trading a small performance penalty for a huge gain in IM column store capacity may be seriously worth considering. After all, even at the CAPACITY HIGH compression level the scan performance is going to be way faster than not being able to populate the object(s) in memory at all or having them only partially populated.