History of Lossless Data Compression Algorithms

Introduction

There are two major categories of compression algorithms: lossy and lossless. In lossy compression, the original file cannot be restored exactly, because some of its data is discarded during compression.
Lossy compression is most commonly used to store image and audio data, and while it can achieve very high compression ratios through data removal, it is not covered in this article.
Lossless data compression is the size reduction of a file, such that a decompression function can restore the original file exactly with no loss of data. Lossless data compression is used ubiquitously in computing, from saving space on your personal computer to sending data over the web, communicating over a secure shell, or viewing a PNG or GIF image.
The basic principle that lossless compression algorithms work on is that any non-random file contains duplicated information, which can be condensed using statistical modeling techniques that determine the probability of a character or phrase appearing. These statistical models can then be used to generate codes for specific characters or phrases based on their probability of occurring, assigning the shortest codes to the most common data.
Such techniques include entropy encoding, run-length encoding, and dictionary-based compression. Using these techniques and others, an 8-bit character or a string of such characters can be represented with just a few bits, removing a large amount of redundant data.
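As a concrete illustration, here is a minimal sketch of run-length encoding in Python. The function names and the (count, byte) tuple output are illustrative choices for this sketch, not part of any particular standard.

    def rle_encode(data: bytes) -> list[tuple[int, int]]:
        """Collapse each run of repeated bytes into a (count, byte) pair."""
        runs = []
        i = 0
        while i < len(data):
            j = i
            while j < len(data) and data[j] == data[i]:
                j += 1
            runs.append((j - i, data[i]))
            i = j
        return runs

    def rle_decode(runs: list[tuple[int, int]]) -> bytes:
        """Expand (count, byte) pairs back into the original byte string."""
        return b"".join(bytes([value]) * count for count, value in runs)

    original = b"aaaaabbbccccccd"
    assert rle_decode(rle_encode(original)) == original

Runs of identical bytes shrink to a single pair, which is why run-length encoding works well on data with long repeats (such as simple images) and poorly on data without them.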
History

[Figure: A Hierarchy of Lossless Compression Algorithms]

Data compression has only played a significant role in computing since the 1970s, when the Internet was becoming more popular and the Lempel-Ziv algorithms were invented, but it has a much longer history outside of computing.
Later, as mainframe computers were starting to take hold in 1949, Claude Shannon and Robert Fano invented Shannon-Fano coding. Their algorithm assigns codes to symbols in a given block of data based on the probability of each symbol occurring: the more probable a symbol is, the shorter its code, resulting in a shorter representation of the data overall.
A few years later, David Huffman was studying information theory under Fano at MIT when Fano gave the class the choice of writing a term paper or taking a final exam.
Huffman chose the term paper, which was to be on finding the most efficient method of binary coding. After working for months and failing to come up with anything, Huffman was about to throw away all his work and start studying for the final exam in lieu of the paper.
It was at that point that he had an epiphany, figuring out a technique very similar to, yet more efficient than, Shannon-Fano coding. The key difference is that Shannon-Fano coding builds its code tree top-down, by repeatedly splitting the set of symbols, which can produce suboptimal codes, whereas Huffman coding builds the tree bottom-up, by repeatedly merging the two least probable symbols, which always yields an optimal prefix code.
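A compact sketch of that bottom-up construction in plain Python follows; the function name and the bit-string representation of the codes are assumptions made for readability, not any canonical implementation.

    import heapq
    from collections import Counter

    def huffman_codes(text: str) -> dict[str, str]:
        """Bottom-up Huffman construction: repeatedly merge the two least
        probable nodes, prefixing '0'/'1' to the codes of their symbols."""
        counts = Counter(text)
        heap = [(count, node_id, symbol)
                for node_id, (symbol, count) in enumerate(counts.items())]
        heapq.heapify(heap)
        codes = {symbol: "" for symbol in counts}
        members = {node_id: [symbol] for _, node_id, symbol in heap}
        next_id = len(members)
        while len(heap) > 1:
            lo_count, lo_id, _ = heapq.heappop(heap)
            hi_count, hi_id, _ = heapq.heappop(heap)
            for symbol in members[lo_id]:
                codes[symbol] = "0" + codes[symbol]
            for symbol in members[hi_id]:
                codes[symbol] = "1" + codes[symbol]
            members[next_id] = members[lo_id] + members[hi_id]
            heapq.heappush(heap, (lo_count + hi_count, next_id, None))
            next_id += 1
        return codes

    print(huffman_codes("abracadabra"))

Running it on "abracadabra" assigns a one-bit code to the frequent letter 'a' and longer codes to the rarer letters, which is exactly the behavior Huffman's term paper was after.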
It was not until the 1970s and the advent of the Internet and online storage that software compression was implemented, with Huffman codes generated dynamically from the input data. Shortly afterward, Abraham Lempel and Jacob Ziv published LZ77, the first algorithm to compress data using a dictionary; more specifically, LZ77 used a dynamic dictionary, oftentimes called a sliding window.
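The sliding-window idea can be sketched in a few lines of Python. This is a simplified illustration rather than any particular LZ77 variant: the window size, match limit, and (offset, length, next byte) token format are all assumptions made for clarity.

    def lz77_compress(data: bytes, window: int = 4096, max_match: int = 18):
        """Emit (offset, length, next_byte) tokens; offset/length point back
        into a sliding window of recently seen bytes."""
        tokens = []
        i = 0
        while i < len(data):
            best_offset, best_length = 0, 0
            for candidate in range(max(0, i - window), i):
                length = 0
                while (length < max_match and i + length < len(data)
                       and data[candidate + length] == data[i + length]):
                    length += 1
                if length > best_length:
                    best_offset, best_length = i - candidate, length
            next_index = i + best_length
            next_byte = data[next_index] if next_index < len(data) else None
            tokens.append((best_offset, best_length, next_byte))
            i = next_index + 1
        return tokens

    def lz77_decompress(tokens) -> bytes:
        buffer = bytearray()
        for offset, length, next_byte in tokens:
            for _ in range(length):
                buffer.append(buffer[-offset])  # copy from the sliding window
            if next_byte is not None:
                buffer.append(next_byte)
        return bytes(buffer)

    sample = b"abcabcabcabcx"
    assert lz77_decompress(lz77_compress(sample)) == sample

Repeated phrases are replaced by short back-references into the window, which is where most of the savings in Deflate-style compressors comes from.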
Most of the commonly used algorithms are derived from the LZ77 algorithm. This is not due to technical superiority, but because LZ78 algorithms became patent-encumbered after Sperry (later Unisys) patented the derivative LZW algorithm in 1984 and began suing software vendors, server administrators, and even end users for using the GIF format without a license.
In the long run, this was a benefit for the UNIX community because both the gzip and bzip2 formats nearly always achieve significantly higher compression ratios than the LZW format.
There have also been some LZW derivatives since then, but they do not enjoy widespread use either, and LZ77 algorithms remain dominant. Another legal battle erupted in 1993 regarding the LZS algorithm, which Stac Electronics had developed for its disk compression software.
When Stac Electronics found out that its intellectual property was being used by Microsoft, it filed suit. Although the case ended in a large judgment against Microsoft, it did not impede the development of Lempel-Ziv algorithms the way the LZW patent dispute did.
The only consequence seems to be that LZS has not been forked into any new algorithms.

The Rise of Deflate

Corporations and other large entities have used data compression ever since the Lempel-Ziv algorithms were published, because their storage needs keep growing and compression allows them to meet those needs.
However, data compression did not see widespread use until the Internet began to take off toward the late 1980s, when it became a practical necessity. Data compression can be viewed as a special case of data differencing: data differencing consists of producing a difference given a source and a target (with patching producing a target given a source and a difference), while data compression consists of producing a compressed file given only a target, and decompression consists of producing the target given only the compressed file. Compression is thus equivalent to differencing against an empty source.
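One way to make the analogy concrete in Python is zlib's preset-dictionary feature, used here as a rough stand-in for a real differencing tool; the helper names make_diff and apply_patch are invented for this sketch.

    import zlib

    def make_diff(source: bytes, target: bytes) -> bytes:
        """'Differencing': compress target using source as a preset dictionary."""
        compressor = zlib.compressobj(zdict=source) if source else zlib.compressobj()
        return compressor.compress(target) + compressor.flush()

    def apply_patch(source: bytes, diff: bytes) -> bytes:
        """'Patching': reconstruct target from source plus the difference."""
        decompressor = zlib.decompressobj(zdict=source) if source else zlib.decompressobj()
        return decompressor.decompress(diff) + decompressor.flush()

    source = b"the quick brown fox jumps over the lazy dog\n"
    target = b"the quick brown fox jumps over the lazy dog\n" * 4

    patch = make_diff(source, target)   # difference: small, since target mostly repeats source
    plain = make_diff(b"", target)      # plain compression: differencing against an empty source
    assert apply_patch(source, patch) == target
    assert apply_patch(b"", plain) == target

With an empty source, the "difference" is just the compressed target, which is the sense in which compression is a special case of differencing.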
Data compression algorithms can be viewed as attempts to approximate the Kolmogorov complexity of a source: they search for a minimal-length model that represents the data and then encode that model.
We refer to the data compression schemes used in internetworking devices as lossless compression algorithms. These schemes reproduce the original bit streams exactly, with no degradation or loss.
This feature is required by routers and other devices to transport data across the network.

In information technology, lossy compression or irreversible compression is the class of data encoding methods that uses inexact approximations and partial data discarding to represent the content.
These techniques are used to reduce data size for storing, handling, and transmitting content. Lossy algorithms achieve better compression ratios by selectively discarding some of the information in the file.
Such algorithms can be used for images or sound files but not for text or program data.