Squashing Your Files
One problem common to all UNIX systems (indeed, to nearly all computer systems of any kind) is that you never have enough hard drive space. UNIX comes with a couple of programs that can alleviate this problem: compress and gzip. They change the data in a file into a more compact form. Although you can’t do anything with the file in this compact form except expand it back to its original format, for files you don’t need to refer to often, compressing can be a big space-saver.
Compress without stress
You use compress and gzip in pretty much the same way. To compress a file named confidential.txt, for example, type this line:
compress -v confidential.txt
The optional -v (for verbose) option merely tells UNIX to report how much space it saved. If you use it, UNIX responds with this information:
confidential.txt: Compression: 49.79% -- replaced with confidential.txt.Z
The compress program replaces the file with one that has the same name with .Z added to it. The degree of compression depends on what’s in the file, although 50 percent compression for text files is typical. For a few files, the compression scheme doesn’t save any space, in which case compress is polite enough not to make a .Z file.
To get the compressed file back to its original state, use uncompress:
uncompress confidential.txt.Z
This command gets rid of confidential.txt.Z and gets back confidential.txt. You can also use zcat, a compressed-file version of the cat program, which sends an uncompressed version of a compressed file to the terminal without storing the uncompressed version in a file. The command is rarely useful by itself but can be quite handy with programs such as more or lp. You use it this way:
zcat confidential.txt.Z | more
This command lets you see what’s in the file one page at a time. Unlike uncompress, zcat does not get rid of the .Z file.
The GNU crowd weighed in with its own compress-like program named gzip. It works the same way that compress does but uses a different, slightly better, compression scheme. The gzip program is analogous to compress, and gunzip and gzcat uncompress stuff (on many systems, gzip’s version of gzcat is also installed as plain zcat). Use them this way:
gzip -v confidential.txt
gunzip confidential.txt.gz
zcat confidential.txt.gz | more
Note that the files end with lowercase gz rather than uppercase Z.
Tip: Fortunately, gzip knows how to uncompress files produced by compress as well as those produced by several other compression programs, so you can use gunzip as your one-stop uncompression utility.
Yet another compression program, called bzip2, comes with companions bunzip2 and bzcat. You use it the same way as gzip, except that the files it makes end with bz2 and are a little smaller than the equivalent gz files. Files downloaded from the Web are sometimes compressed with bzip2. If your system doesn’t have bzip2 installed, you (or maybe your local helpful nerd) can find it at http://sources.redhat.com/bzip2. Here’s how you use them:
bzip2 -v confidential.txt
bunzip2 confidential.txt.bz2
bzcat confidential.txt.bz2 | more
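To see how much the savings depend on what’s in a file, here is a minimal sketch in Python (nothing the commands above require). It assumes Python’s standard gzip and bz2 modules, which read and write the same formats as the gzip and bzip2 programs; the sample data and the function name compressed_sizes are made up for illustration.

```python
import bz2
import gzip
import random


def compressed_sizes(data: bytes) -> dict:
    """Return the original, gzip, and bzip2 sizes of data, in bytes."""
    gz = gzip.compress(data)
    bz = bz2.compress(data)
    # Round-trip check: expanding must give back exactly what went in.
    assert gzip.decompress(gz) == data
    assert bz2.decompress(bz) == data
    return {"original": len(data), "gzip": len(gz), "bzip2": len(bz)}


# Made-up sample data: repetitive text versus patternless random bytes.
text = b"All work and no play makes Jack a dull boy.\n" * 200
rng = random.Random(0)  # fixed seed so every run uses the same bytes
noise = bytes(rng.getrandbits(8) for _ in range(len(text)))

print("text :", compressed_sizes(text))
print("noise:", compressed_sizes(noise))
```

Highly repetitive text like this sample shrinks far more than the 50 percent typical of ordinary text, whereas random bytes have no patterns to exploit and typically come out slightly larger than they went in. That second case is the kind of file for which compress declines to make a .Z file.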
Technical Stuff: How does file compression work, anyway?
This discussion is pretty technical. Don’t say that we didn’t warn you.
The issue of optimal codes (codes that use the fewest bits for a particular file, or message, because at that time people were thinking in terms of radioteletypes) was a hot topic in the late 1940s, challenging the deepest thinkers in the field. In 1952, a student named David Huffman published a paper that any high-school student could understand, showing how to use simple arithmetic techniques to construct optimal codes. Oops. Ever since then, this kind of code has been known as Huffman coding. For many years Huffman coding was the best available, and a UNIX program named pack used it.
Normally, every character in a file is stored by using 8 bits (binary digits, 1s and 0s, the smallest unit of data a computer can handle). Suppose that a file contains 800 As followed by 100 Bs and 100 Cs. That’s 1,000 characters, at 8 bits apiece, or 8,000 bits. For this particular file, a compression program can use much shorter codes. It can use a 1-bit code for A and 2-bit codes for B and C. That makes the total size 800 bits for the As and 200 bits apiece for the Bs and the Cs, for a total of 1,200 bits rather than 8,000. The packed file is a little larger than that (1,408 bits) because a table at the front of the packed file indicates which codes correspond to which letters.
The compress program uses a dictionary-compression scheme, which is kind of backward from Huffman coding. Rather than try to find the shortest code for every letter, compress runs through the file trying to find frequently occurring groups of letters it can encode as a single dictionary entry, or token. To compress the same file we packed in the previous paragraph, compress reads letter by letter and notes that it has seen AA more than once; then it notices that it has seen AAA more than once, and so on. It enters longer and longer runs of As into its dictionary, each one letter longer than the last, until it has runs of dozens of As, each represented by a single dictionary entry and a single token in the compressed file. When compress runs into the Bs and then the Cs, it does the same thing and also enters long runs of Bs and Cs in the dictionary.
Using a clever technique (at least, it’s clever to data-compression wonks), compress doesn’t have to store the dictionary in the compressed file because uncompress can deduce the contents of the dictionary that compress was building from the sequence of tokens in the compressed file. As a result, compress does a fantastic job on this file and squashes it to a mere 640 bits from the original 8,000.
Compression techniques are still a hot topic in the computer biz, and many techniques have been patented. The particular technique compress uses is known as LZW, after Lempel, Ziv, and Welch, the three guys who thought of it. Welch, who works for Unisys and made some improvements to an earlier scheme designed by Lempel and Ziv, has a patent on it. It’s such a cool technique, in fact, that two other guys named Miller and Wegman, who work for IBM, invented it at about the same time, and they also have a patent on it. Because the patent office is not supposed to grant two patents on the same invention, some people use this situation to suggest that issuing patents on software isn’t a good idea. Fortunately, neither Unisys nor IBM has ever objected to the compress program, and the patent expired in June 2003, so you can go ahead and use it.
The gzip, zip, and bzip2 programs use techniques that are somewhat similar to LZW but not covered by patents.
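For the truly curious, the arithmetic in this sidebar can be checked with a short Python sketch. It is a textbook illustration, not the actual source of pack or compress: the function names are made up, the Huffman part only works out code lengths, and the LZW part only counts tokens. The final size estimate assumes 9-bit tokens and a 3-byte file header, which is our understanding of how compress starts out.

```python
import heapq
from collections import Counter


def huffman_code_lengths(data: bytes) -> dict:
    """Return {byte value: code length in bits} for a Huffman code."""
    counts = Counter(data)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap items: (total count, unique tie-breaker, {symbol: depth so far}).
    heap = [(n, i, {sym: 0}) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two rarest groups; every symbol in them gets one bit longer.
        n1, _, a = heapq.heappop(heap)
        n2, _, b = heapq.heappop(heap)
        merged = {sym: depth + 1 for sym, depth in {**a, **b}.items()}
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]


def lzw_tokens(data: bytes) -> list:
    """Textbook LZW: return the dictionary codes emitted for data."""
    table = {bytes([i]): i for i in range(256)}  # start with single bytes
    next_code = 256
    tokens = []
    current = b""
    for value in data:
        candidate = current + bytes([value])
        if candidate in table:
            current = candidate  # keep extending the match
        else:
            tokens.append(table[current])  # emit the longest known run
            table[candidate] = next_code   # remember the run plus one letter
            next_code += 1
            current = bytes([value])
    if current:
        tokens.append(table[current])
    return tokens


# The sidebar's sample file: 800 As, then 100 Bs, then 100 Cs.
sample = b"A" * 800 + b"B" * 100 + b"C" * 100

lengths = huffman_code_lengths(sample)
huffman_bits = sum(lengths[value] for value in sample)
tokens = lzw_tokens(sample)
# Assumed layout: 3-byte header plus 9 bits per token, rounded up to bytes.
lzw_bits = -(-(24 + 9 * len(tokens)) // 8) * 8

print("Huffman bits:", huffman_bits)
print("LZW tokens:", len(tokens), "estimated bits:", lzw_bits)
```

Under those assumptions the sketch reports 1,200 bits for the Huffman codes (before pack’s table is added) and 68 LZW tokens, which works out to 640 bits.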
Zippedy day-tah
WinZip and PKZIP are compression programs widely used by Windows and DOS users to create ZIP files, each containing one or more files compressed together. You may run into ZIP files if you get information from the Internet or on a disk from a DOS or Windows system. Fortunately, a number of volunteers (led by a perfectly nice guy who goes by the enigmatic handle of Cave Newt) have written free zipping and unzipping programs named zip and unzip. Because they’re both available for free over the Internet, no UNIX system should be without them.
To unzip a ZIP file, you use unzip:
unzip video-list.zip
The unzip command has a bunch of options, the most useful of which is -l, which tells the program to list the contents of the ZIP file without extracting any of the files. To find out what all the options are, run unzip with no arguments.
If you need to create a ZIP file, you can use the equally boringly named zip program:
zip video-list *.txt
This command says to create a file named video-list.zip (it adds the .zip part if you don’t) containing all the files in the current directory whose names end in .txt. The zip program has a number of options, the most useful of which are -9, meaning to compress as well as possible even though it’s slow (-1 means as fast as possible; other digits give results in between), and -k, which means to make the file look just like one created on a DOS system, not using any lowercase filenames or other UNIX-isms. We use zip -9k to create ZIP files to copy to DOS systems.
Incidentally, gzip bears only the vaguest connection to zip and unzip. The gzip program compresses single files, whereas zip compresses multiple files into a single archive.
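If you ever need to do the same job from a script, here is a hedged sketch that uses Python’s standard zipfile module rather than the zip and unzip programs. The function names and sample filenames are invented for illustration, and compresslevel=9 (which needs Python 3.7 or later) plays roughly the role of zip’s -9 option.

```python
import tempfile
import zipfile
from pathlib import Path


def make_zip(archive: Path, files: list) -> None:
    """Create archive containing files, compressing as hard as possible."""
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED,
                         compresslevel=9) as zf:
        for path in files:
            # arcname stores just the file name, not the full directory path.
            zf.write(path, arcname=path.name)


def list_zip(archive: Path) -> list:
    """Return the names stored in archive without extracting anything."""
    with zipfile.ZipFile(archive) as zf:
        return zf.namelist()


# Demonstration in a throwaway directory, so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ("comedy.txt", "drama.txt"):
        (folder / name).write_text("made-up sample contents\n")
    archive = folder / "video-list.zip"
    make_zip(archive, sorted(folder.glob("*.txt")))
    print(list_zip(archive))
```

Here make_zip stands in for zip video-list *.txt, and list_zip does what unzip -l does: it reports the names in the archive without extracting them.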