hpr4647 :: UNIX Curio #7 - Compression
Making your files smaller
Hosted by Vance on Tuesday, 2026-05-26 is flagged as Clean and is released under a CC-BY-SA license.
unix curio, unix, compression.
(Be the first).
Listen in ogg,
opus,
or mp3 format. Play now:
Duration: 00:18:27
Download the transcription and
subtitles.
general.
This series is dedicated to exploring little-known—and occasionally useful—trinkets lurking in the dusty corners of UNIX-like operating systems.
In UNIX Curio #4 ( HPR episode 4617 ), I teased the subject of file compression. Today I'm circling back to that.
The history of data compression goes back at least to the 1970s, and in contexts outside UNIX and computers, probably even earlier. Somehow, it is refreshing to learn that humans have always struggled to have enough storage space to keep all the data they want to hang on to. One way around this limitation is to use some form of compression.
I am only going to dive into lossless compression for this episode—that is, a compression method that can be reversed and will spit out the original data bit for bit. Lossy compression methods also have their places: you might be familiar with their use for audio (such as Ogg Vorbis or MP3); it's also used for images (such as JPEG). Lossy compression allows some of the original data to be thrown away, resulting in a smaller file than is possible with lossless compression, but the intent is for the result to still sound or look "good enough" to a human observer. Also, I am going to limit my discussion to generic methods used for many types of data; while FLAC does lossless compression, it is specifically designed just for audio.
I should make clear that I have never studied computer science or information theory, so this episode will not get into the science behind various types of compression algorithms and how they differ. But in general, these methods take advantage of the fact that many types of data have recurring patterns. English text mostly consists of words that often re-appear many times—source code similarly has keywords and variable names that recur. Compression is accomplished by representing a piece of data that occurs multiple times with a symbol that is shorter in length.
The first compression program in the UNIX world I could find is called
pack
, from 1978
1
. It was shortly followed in 1979 by a similar program called
compact
2
. Both of these used a technique called Huffman coding, but with some differences between them. Files compressed with
pack
were given a
.z
extension and
compact
gave filenames a
.C
extension. Roughly every five or ten years after this, a new program would come along and achieve lasting popularity.
There were, and still are, two opposing forces facing any new form of compression. Working in favor was the advantages it provided—first among these was achieving a better compression ratio, but performance improvements such as speed or reduced memory usage could also be compelling. The force against any new method was the fact that it was not yet widely supported—it doesn't much help to have a smaller file if the people you share it with cannot decompress it.
The next major advance in compression arose out of three scientific papers: two in 1977 and 1978 by Abraham Lempel and Jacob Ziv (called LZ77 and LZ78), and one by Terry Welch in 1984 which built on LZ78. This last method is typically referred to as LZW. Our UNIX Curio for today is a
program called
compress
3
that implements the LZW method. Files compressed this way are named with the extension
.Z
. I had always assumed that this was to honor Jacob Ziv, but now that I've researched the history, it seems more likely to be a follow-on from how files compressed by
pack
were named. Since
pack
did not use any of the Lempel-Ziv methods, I would guess that it used
.z
because that wasn't already taken by anything else, but that's pure speculation.
I do recall encountering
.Z
files in the wild, but feel certain that hasn't happened in the last 25 years, maybe longer. If you need to expand one of these,
uncompress
4
is the program to use (
GNU's
gunzip
can also handle them
5
). However, there was a serious problem that arose with the LZ78 and LZW compression methods. Both of them were patented, and the owner became aggressive in seeking payment from developers and users. The
compress
utility was developed within two months of the publication of Welch's 1984 paper and was included in Bell Laboratories' Eighth Edition UNIX before these shakedowns started. The paper did not disclose that a patent had been filed, and apparently Spencer Thomas and the other developers of
compress
were unaware of it. The utility became popular for a while, and was even standardized by POSIX, but people moved away from LZW once the legal threats started.
Another important advance came in 1991 and was called the DEFLATE compression method. It combined the un-patented LZ77 method with Huffman coding to achieve a similar level of compression as LZW (actually, often better) without the legal trouble. DEFLATE was developed for
PKZIP
and was soon adopted by the GNU project's
gzip
compressor. While Phil Katz (the "PK" in
PKZIP
) patented one way of implementing the DEFLATE method,
it was possible to write a compressor and decompressor without infringing
6
; also, he apparently
never tried to enforce the patent
7
.
As I mentioned in UNIX Curio #4, .zip is both an archive
and
a compression format. Each archive member can be compressed with one of several possible methods (or stored without compression). Unlike a
tar
file where compression can be applied to the entire archive, in .zip each archive member is compressed individually. This often means a .zip file will be slightly bigger than a
tar
file with the same contents compressed with
gzip
, because the .zip format cannot take advantage of duplication that occurs among more than one member of the archive. The vast majority of .zip files use only the DEFLATE and uncompressed storage methods and these are the only options if you want to follow the profile standardized in ISO/IEC 21320-1. Actually, since they both use DEFLATE,
gzip
is able to extract a .zip file in the special case where it only holds one member compressed with that method.
From the 1990s onward, people paid significant attention to avoiding patent landmines, so only methods that didn't have that problem became broadly popular. While the patents on LZ78 and LZW have since expired, I feel like their most successful legacy was in discouraging people from using those methods, leading to DEFLATE taking the popularity crown.
The next step came in 1996 and 1997 with the development of
bzip
and
bzip2
by Julian Seward. The original method was quickly followed by
bzip2
, which was the version that achieved true popularity. They use the Burrows-Wheeler transform, which does not itself compress data but re-arranges it to make it more compressible;
this is combined with other techniques
8
. (At least, that's my understanding. I told you, I'm not up on information theory.) This provides a significant reduction in the compressed size of the data compared to earlier methods—however, it is slower than DEFLATE both during compression and decompression.
Separate projects have developed parallel versions of
gzip
and
bzip2
that can take advantage of multi-processor machines, but the original utilities run single-threaded.
Another five years later, in 2001, Igor Pavlov added the Lempel-Ziv-Markov chain algorithm (LZMA), an enhancement to LZ77, to his 7-Zip compression tool. This was followed a few years later by LZMA2, a container format that allowed for LZMA compression to be split between multiple threads. Broad LZMA2 support came to the UNIX world in 2009 with the
xz
utility
9
. It offers roughly similar compression ratios to
bzip2
, though it can be better or worse depending on the data to be compressed. While compression generally takes even longer than
bzip2
, decompression is significantly faster (though still not as fast as
gzip
). The Linux kernel relatively quickly supported
booting from xz-compressed images
10
because it was a good match for that use case—compression, the time-consuming activity, only has to be done once while the more frequent decompression during boot happens relatively fast.
The last method I will cover is
Zstandard
11
, often written as
zstd
. This came about in 2015, and is another variation on LZ77 that uses finite-state entropy (which means nothing to me, but you might understand it). It performs about as well as DEFLATE in terms of compression ratios, but is much faster both when compressing and decompressing data. I should say that these statements are true with the typical default settings—depending on the compression level selected, it can compress more slowly, but compress the data smaller. However, decompression is always speedier than DEFLATE. This makes it attractive for some uses, and it is heavily promoted by Meta/Facebook, where Yann Collet developed it. For example, shipping large amounts of actively-used data between machines in a data center can go more quickly when the size is reduced; however, if the compression and decompression steps take too long that benefit is lost. A speedy method can be valuable even if it doesn't result in the greatest reduction in size. This use case stands in contrast to, say, a compressed backup file which might only be accessed in a disaster recovery scenario or never accessed at all, making size more important than speed.
Both the
xz
and
zstd
utilities have some built-in support for multi-threading, but the default is to run in a single thread. While
xz
can use multiple threads for decompression (but only if the file was compressed in multi-thread mode), the reference
zstd
utility can only use more than one thread for compression, not decompression.
There are many other methods of lossless compression that have been developed over the decades, but I believe these are the ones you are most likely to encounter in the world of UNIX-like systems. This is a personal opinion, and others might choose a different set. As mentioned, it can be tough for a new method to gain popularity and 35-year-old DEFLATE is still probably the most commonly used despite not being the fastest or offering the greatest reduction in size. Even systems like FreeBSD, NetBSD, and OpenBSD that do not like to include GNU tools supported it by developing their own version of
gzip
based on the permissively-licensed
zlib
library.
Technically, the LZW method used by the
compress
utility is still standardized by POSIX, so one might expect it to have the widest support. However, aggressive patent enforcement discouraged adoption, especially by Free and Open Source Software systems—even though the patent has expired, it is still out of favor compared to DEFLATE. For this reason, I feel justified in calling it a curio.
References:
- Eighth Edition UNIX pack.c https://www.tuhs.org/cgi-bin/utree.pl?file=V8/usr/src/cmd/pack/pack.c
- 2.9BSD compact.c https://www.tuhs.org/cgi-bin/utree.pl?file=2.9BSD/usr/src/ucb/compact/compact.c
- Compress specification https://pubs.opengroup.org/onlinepubs/009695399/utilities/compress.html
- Uncompress specification https://pubs.opengroup.org/onlinepubs/009695399/utilities/uncompress.html
- GNU Gzip manual https://www.gnu.org/software/gzip/manual/gzip.html
- RFC 1951: DEFLATE Compressed Data Format Specification version 1.3 https://tools.ietf.org/html/rfc1951
- History of Lossless Data Compression Algorithms: The Rise of Deflate https://ethw.org/History_of_Lossless_Data_Compression_Algorithms#The_Rise_of_Deflate
- bzip2 https://en.wikipedia.org/wiki/Bzip2
- XZ Utils https://en.wikipedia.org/wiki/XZ_Utils
- 2.6.38 merge window part 2 https://lwn.net/Articles/423541/
- zstd https://en.wikipedia.org/wiki/Zstd
Appendix
The table below demonstrates the results of compressing different types of data using tools described in this episode. While not totally rigorous, I did run each compression and decompression multiple times to ensure I was getting consistent results. The laptop I used has an Intel Core i5-6200U CPU running at 2.30GHz, and the system had at least 5 GB of free memory for each run. While this processor has two cores and can run four simultaneous threads, all utilities were run single-threaded.
The term "best" means the highest level of compression available (the exact level used is shown). For
bzip2
, the default
is
the best. For
zstd
, "best" is -19, which is the highest "normal" level, but "ultra" levels that are even higher also exist. Ratios are the percentage of the original size that the file was reduced to (other sources might instead express the compression ratio as the
reduction
in size achieved). In all results, smaller numbers are better.
┌────────────────────────────┬─────────────┬─────────────┬─────────────┬─────────────┬─────────────┬─────────────┬─────────────┐ │ │ gzip │ gzip │ bzip2 │ xz │ xz │ zstd │ zstd │ │ │(default -6) │ (best -9) │ (-9) │(default -6) │ (best -9) │(default -3) │ (best -19) │ ├──────────────┬─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┤ │ │Size (ratio) │ 22,036,508 │ 21,891,623 │ 15,795,698 │ 13,487,768 │ 12,938,464 │ 20,454,657 │ 13,709,078 │ │ │ │ (24%) │ (24%) │ (17%) │ (15%) │ (14%) │ (23%) │ (15%) │ │English Text ├─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┤ │(90,532,092 │Compression │ 4.8s │ 7.6s │ 8.5s │ 49.8s │ 58.8s │ 0.6s │ 65.2s │ │bytes │time │ │ │ │ │ │ │ │ │uncompressed) ├─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┤ │ │Decompression│ 0.7s │ 0.8s │ 3.7s │ 1.2s │ 1.2s │ 0.4s │ 0.4s │ │ │time │ │ │ │ │ │ │ │ ├──────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┤ │ │Size (ratio) │ 125,291,122 │ 124,189,544 │ 98,016,512 │ 84,882,492 │ 81,954,344 │ 120,604,855 │ 87,298,645 │ │ │ │ (21%) │ (21%) │ (17%) │ (14%) │ (14%) │ (20%) │ (15%) │ │Source Code ├─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┤ │(590,008,320 │Compression │ 22.0s │ 39.3s │ 54.8s │ 241s │ 298s │ 3.7s │ 348s │ │bytes │time │ │ │ │ │ │ │ │ │uncompressed) ├─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┤ │ │Decompression│ 5.1s │ 5.1s │ 20.3s │ 8.1s │ 7.8s │ 2.4s │ 2.4s │ │ │time │ │ │ │ │ │ │ │ ├──────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┤ │ │Size (ratio) │ 32,830,905 │ 32,371,241 │ 26,856,579 │ 20,717,288 │ 20,352,880 │ 28,538,810 │ 23,154,582 │ │ │ │ (19%) │ (19%) │ (16%) │ (12%) │ (12%) │ (17%) │ (13%) │ │Binary Program├─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┤ │(171,972,264 │Compression │ 6.4s │ 22.4s │ 18.6s │ 62.2s │ 67.8s │ 0.8s │ 111s │ │bytes │time │ │ │ │ │ │ │ │ │uncompressed) ├─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┤ │ │Decompression│ 1.5s │ 1.5s │ 5.6s │ 2.3s │ 2.3s │ 0.7s │ 0.7s │ │ │time │ │ │ │ │ │ │ │ ├──────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┤ │ │Size (ratio) │ 146,397,772 │ 146,397,757 │ 144,485,451 │ 131,950,232 │ 130,926,780 │ 147,154,979 │ 145,703,840 │ │ │ │ (89%) │ (89%) │ (88%) │ (80%) │ (80%) │ (90%) │ (89%) │ │WAVE Audio ├─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┤ │(164,396,302 │Compression │ 9.2s │ 9.2s │ 25.1s │ 70.4s │ 97.7s │ 0.7s │ 58.3s │ │bytes │time │ │ │ │ │ │ │ │ │uncompressed) ├─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┤ │ │Decompression│ 2.0s │ 2.0s │ 13.5s │ 12.2s │ 12.1s │ 0.6s │ 0.8s │ │ │time │ │ │ │ │ │ │ │ ├──────────────┴─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┤ │ │ gzip │ gzip │ bzip2 │ xz │ xz │ zstd │ zstd │ │ │(default -6) │ (best -9) │ (-9) │(default -6) │ (best -9) │(default -3) │ (best -19) │ └────────────────────────────┴─────────────┴─────────────┴─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘
- English text consists of Titles 1 through 10 of the 2020 U.S. Code of Federal Regulations .
-
Source code consists of a
tarfile containing the Linux kernel source, version 4.0. -
Binary program consists of an ELF-format executable of the
pandocapplication, version 2.17.1.1 found on Debian 12. -
Audio consists of a 24-bit Signed Integer PCM WAVE file with 2 channels at 44.1kHz, about 10:21 in length. For comparison, the audio-specific
flaclossless compression utility reduced this file to 97,962,711 bytes (60%) in 2.6 seconds at the default (-5) level and to 97,714,876 bytes (59%) in 5.4 seconds at the highest (-8) level.