Lempel Ziv Markov Algorithm and 7-Zip
By clayb on Feb 06, 2008
I've been working with the LZMA compression algorithm and its code for a little while, playing around with rewriting it in C and trying to understand how the algorithm works. It has certainly attracted some attention with its impressive compression ratios, especially for random bits of one's disk. However, I've also found a lot of confusion about what is LZMA and what is 7-Zip: in some ways 7-Zip is LZMA's interface, and in other ways it's completely orthogonal to a discussion about LZMA. So hopefully these questions and answers can clarify some things for folks just starting to look into this cool compression algorithm.
What is LZMA and what's it have to do with 7za?
The Lempel-Ziv-Markov-Algorithm (LZMA) is a very nifty compression algorithm recently designed by Igor Pavlov. The algorithm is implemented in a few different software packages, but it's most often discussed as part of the 7-Zip archiver, 7za(1).
The 7-Zip archiver, also by Pavlov, was originally written for Windows and later ported to Unix. It is an archiver (like tar(1) or cpio(1)) which outputs the 7z Format. The format is open and supports multiple compression algorithms: LZMA is the default, but it also supports bzip2 and deflate (gzip), amongst others. The format supports AES-256 encryption, but because it wasn't originally designed for Unix systems, UID/GID pairs aren't stored. Similarly, directories aren't (or at least don't seem to be) supported on Unix systems, which is a significant drawback for an archive format.
Why would I use LZMA?
LZMA is a kick-butt compression algorithm when size is what matters. It does a super job of compressing files, usually improving on bzip2's compression by about 10%, but not always. For example, the SPARC binary /usr/bin/gimp-2.4 on my Ultra 45 is 4,997KB; LZMA compressed it to 1,411KB for a savings of 72%, versus bzip2 -9, which produced a 1,923KB file for a savings of only 62%. Likewise, a SPARC core dump lying around from some application test was 3,588KB, which LZMA got down to 2,051KB (a 43% savings) and bzip2 -9 got down to 2,307KB (a 36% savings). Meanwhile, for a 130KB English LaTeX document in ASCII encoding, LZMA compressed it to 11KB (91.5% savings), but bzip2 -9 achieved 10KB (92.3% savings). (Similar results occur for my /usr/sfw/man/man1/gcc.1, where the original, LZMA, and bzip2 -9 sizes were 514K, 104K, and 99K respectively.) Oh well, can't win them all. Still, on random chunks of my disk grabbed with dd(1), LZMA's results look much more like the binary cases than the text ones (likely because my disk is largely made up of binary files).
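If you'd like to try this sort of comparison on your own files, here is a minimal sketch in Python. It is not how the numbers above were produced (those came from the command-line tools); it assumes Python's standard lzma and bz2 modules, and note that lzma.compress() writes the .xz container by default, so sizes will differ a little from 7za output. The function names savings and compare are just my illustrative choices.

```python
import bz2
import lzma


def savings(original_size: int, compressed_size: int) -> float:
    """Return the space saved as a percentage of the original size."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return 100.0 * (1.0 - compressed_size / original_size)


def compare(data: bytes) -> dict:
    """Compress data with LZMA and bzip2; map name -> (size, savings %)."""
    results = {}
    for name, compressed in (
        # preset=9 is the highest standard preset of the stdlib lzma module
        ("lzma", lzma.compress(data, preset=9)),
        # compresslevel=9 corresponds to "bzip2 -9"
        ("bzip2 -9", bz2.compress(data, compresslevel=9)),
    ):
        results[name] = (len(compressed), savings(len(data), len(compressed)))
    return results


if __name__ == "__main__":
    # Sizes (in KB) quoted in the text for /usr/bin/gimp-2.4.
    print(round(savings(4997, 1411)))  # LZMA
    print(round(savings(4997, 1923)))  # bzip2 -9
```

To run it against a real file, read the file in binary mode and pass the bytes to compare(); each entry in the result holds the compressed size in bytes and the percentage saved.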
Well LZMA looks cool, tell me more:
LZMA, though new and very little documented, is an adaptation of LZ77 with the goals of high compression and fast decompression. It uses range encoding (where deflate, as used by gzip, uses Huffman coding) and a dictionary as large as necessary (up to ~1GB), which is searched using hashing combined with either binary trees or a hashed array of lists. There is an on-disk format for LZMA. It is different from the 7z Format, which is not merely a wrapper around the LZMA on-disk format, so one must transcode/convert between them. The LZMA format is very straightforward: first come the properties used by the LZMA encoder, then the dictionary size and the uncompressed size, followed by the actual compressed data.
The compression algorithm has been written primarily in C++ (but C# and Java ports are available too). A C decompression example is provided by the LZMA SDK; however, no non-OO port of the compression algorithm seems to be available yet.
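To make the on-disk layout concrete, here is a small Python sketch that decodes such a header. The field widths are my reading of the .lzma layout as described in the LZMA SDK, not something spelled out above: one properties byte packing the lc, lp, and pb parameters as (pb * 5 + lp) * 9 + lc, a 4-byte little-endian dictionary size, and an 8-byte little-endian uncompressed size, where all ones means "unknown". The names LzmaHeader and parse_lzma_header are hypothetical.

```python
import struct
from typing import NamedTuple, Optional

HEADER_SIZE = 13  # 1 properties byte + 4 dictionary-size bytes + 8 size bytes
UNKNOWN_SIZE = 0xFFFFFFFFFFFFFFFF  # all ones: uncompressed size not recorded


class LzmaHeader(NamedTuple):
    lc: int  # literal context bits
    lp: int  # literal position bits
    pb: int  # position bits
    dict_size: int
    uncompressed_size: Optional[int]  # None when the size is unknown


def parse_lzma_header(header: bytes) -> LzmaHeader:
    """Decode the first 13 bytes of a .lzma stream (assumed layout)."""
    if len(header) < HEADER_SIZE:
        raise ValueError("need at least 13 header bytes")
    # "<" means little-endian with no padding: B=1 byte, I=4 bytes, Q=8 bytes.
    props, dict_size, size = struct.unpack("<BIQ", header[:HEADER_SIZE])
    if props >= 9 * 5 * 5:
        raise ValueError("invalid properties byte")
    # Undo props = (pb * 5 + lp) * 9 + lc.
    pb, rest = divmod(props, 45)
    lp, lc = divmod(rest, 9)
    return LzmaHeader(lc, lp, pb, dict_size,
                      None if size == UNKNOWN_SIZE else size)


if __name__ == "__main__":
    # A hand-built header: properties 0x5D, 8MB dictionary, 1000 bytes of data.
    example = struct.pack("<BIQ", 0x5D, 1 << 23, 1000)
    print(parse_lzma_header(example))
```

The same function should work on the first 13 bytes of a file written in this format, for instance the output of Python's lzma.compress(data, format=lzma.FORMAT_ALONE).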
So why do I care about 7za(1) anyways?
Though some other tools support LZMA and have been written to take advantage of it, few tools other than 7za(1) allow one to compress a file with LZMA. Currently 7za(1) is a feature-rich "LZMAzip" akin to gzip and bzip2. The easiest way to use 7za(1) for compression is with the standard in/out flags, -si and -so, making it part of a tar(1) or cpio(1) pipeline just as previous compression utilities have been used for years. Just remember that your data will not be stored as a bare LZMA stream, but as LZMA inside the 7z Format (not an overhead worry - just something to understand). If you'd like to play with 7za(1) now, you can. If you're running Nevada build 79 or newer, p7zip version 4.55 was integrated into the SFW Consolidation on November 16th, making /usr/bin/7za available.
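As a rough sketch of such a pipeline, the Python below builds and runs the equivalent of "tar cf - dir | 7za a -si archive" and "7za x -so archive | tar xf -". It assumes tar and 7za are both on your PATH; the helper names pack_commands, unpack_commands, and run_pipeline, and the mydir paths, are made up for illustration.

```python
import shlex
import subprocess


def pack_commands(directory: str, archive: str):
    """Commands for: tar cf - DIRECTORY | 7za a -si ARCHIVE"""
    return ["tar", "cf", "-", directory], ["7za", "a", "-si", archive]


def unpack_commands(archive: str):
    """Commands for: 7za x -so ARCHIVE | tar xf -"""
    return ["7za", "x", "-so", archive], ["tar", "xf", "-"]


def run_pipeline(producer_cmd, consumer_cmd) -> None:
    """Run producer | consumer and raise if either command fails."""
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(consumer_cmd, stdin=producer.stdout)
    # Close our copy of the pipe so the producer gets SIGPIPE
    # if the consumer exits early.
    producer.stdout.close()
    consumer_rc = consumer.wait()
    producer_rc = producer.wait()
    if producer_rc != 0 or consumer_rc != 0:
        raise RuntimeError(
            f"pipeline failed: producer={producer_rc}, consumer={consumer_rc}"
        )


if __name__ == "__main__":
    # Only print the shell equivalents here; nothing is executed.
    for producer, consumer in (
        pack_commands("mydir", "mydir.tar.7z"),
        unpack_commands("mydir.tar.7z"),
    ):
        print(shlex.join(producer), "|", shlex.join(consumer))
```

Calling run_pipeline(*pack_commands("mydir", "mydir.tar.7z")) would actually create the archive. Since tar records ownership and directories itself, this also sidesteps the 7z Format's lack of UID/GID storage mentioned earlier.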