Data Compression Explained
Have you ever needed to send a few hundred vacation photos by packing them into an archive? Maybe you remember ripping your CDs into MP3s back when the iPod was still a brand-new gadget. Whatever the case may be, you probably noticed that the resulting file ended up significantly smaller than what you started with. That's because you compressed those files.
Data compression is almost as old as computers themselves. It's one of those innovations that truly changed how we interact with media. We wouldn't be able to stream Netflix through a VPN, quickly send pictures to our friends, or even back up music onto our smartphones without it. If you've ever wondered how it all works under the hood, this is the article for you.
How Data Compression Works
For the uninitiated, compression looks like some sort of wizardry. You just press a few buttons, and voila – you have a .zip or .rar file that’s significantly smaller than the file(s) you started with. How does the computer “know” how to pack all that data up without damaging anything?
That's where algorithms come into play. Every data compression technique follows a particular set of rules. When compressing text, for example, the algorithm can spot recurring sequences (even something as common as the spaces between words), replace each with a short placeholder, and store a record that tells the decoder how to put everything back.
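Here's a toy Python sketch of that substitution idea. It's purely illustrative, not how any real format encodes its data: every distinct word gets a short numeric token, and the dictionary travels alongside the tokens so the decoder can rebuild the text exactly.

```python
# Toy substitution scheme: assign each distinct word a numeric token
# and ship the dictionary with the token stream, so the original text
# can be reconstructed exactly. Illustrative only.

def compress(text: str):
    dictionary, tokens = {}, []
    for word in text.split(" "):
        if word not in dictionary:
            dictionary[word] = len(dictionary)  # assign the next free token
        tokens.append(dictionary[word])
    return dictionary, tokens

def decompress(dictionary, tokens) -> str:
    reverse = {token: word for word, token in dictionary.items()}
    return " ".join(reverse[t] for t in tokens)

text = "to be or not to be that is the question to be"
dictionary, tokens = compress(text)
print(tokens)  # [0, 1, 2, 3, 0, 1, 4, 5, 6, 7, 0, 1]
assert decompress(dictionary, tokens) == text
```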
Image compression works similarly. Depending on the algorithm, you may get a smaller file with visibly inferior image quality or something that’s almost the same size and looks pretty much identical to the original.
In short, compression works by either removing unnecessary data or by gathering identical or similar bytes under a new, shorter value in a way that lets the computer reconstruct the data later.
Data Compression Types
The two main types of compression are called lossy and lossless: the first produces smaller files but compromises image or sound quality, while the second tends to produce larger files but keeps the original quality intact.
Lossy compression produces smaller files by analyzing the original data and removing unnecessary bits. That can mean adjacent pixels of nearly identical color or frequencies in a song that the human ear barely registers. When executed well, lossy compression produces results that are very close to the original work.
However, making the compression algorithm more aggressive causes significant data loss in the final product: photos look pixelated, songs lose certain sounds, and videos turn into a blocky mess.
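One building block behind that trade-off is quantization: rounding values to a coarser scale. This minimal sketch on hypothetical grayscale pixel values shows how a gentle step stays close to the original while an aggressive one flattens detail that can never be recovered:

```python
# Minimal lossy step: quantization. Rounding 8-bit pixel values to the
# nearest lower multiple of `step` discards fine detail; the exact
# original can never be recovered from the output.

def quantize(pixels: list[int], step: int) -> list[int]:
    return [(p // step) * step for p in pixels]

pixels = [12, 13, 14, 200, 203, 205]  # hypothetical grayscale values
print(quantize(pixels, 4))    # mild:       [12, 12, 12, 200, 200, 204]
print(quantize(pixels, 32))   # aggressive: [0, 0, 0, 192, 192, 192]
```

Note how the aggressive pass turns neighboring pixels into identical runs. That's also why lossy codecs typically follow quantization with a lossless stage that packs those runs down.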
Lossless data compression produces much better results if you're willing to sacrifice some storage space. It's also non-destructive: instead of outright removing runs of same-value bytes, the algorithm counts them and replaces each run with the byte's value and the number of repetitions. The idea is to preserve the structure of the original file(s) exactly.
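The simplest scheme that works this way is run-length encoding (RLE, which also appears in the algorithm list below). Here's a minimal sketch: each run of identical bytes becomes a (count, value) pair, and decoding expands the pairs back byte for byte.

```python
# Run-length encoding: store each run of identical bytes as a
# (count, value) pair. Decoding restores the original data exactly.

def rle_encode(data: bytes) -> list[tuple[int, int]]:
    pairs = []
    for byte in data:
        if pairs and pairs[-1][1] == byte:
            pairs[-1] = (pairs[-1][0] + 1, byte)  # extend the current run
        else:
            pairs.append((1, byte))               # start a new run
    return pairs

def rle_decode(pairs: list[tuple[int, int]]) -> bytes:
    return b"".join(bytes([value]) * count for count, value in pairs)

data = b"AAAAABBBCCCCCCCCCA"
pairs = rle_encode(data)
print(pairs)                      # [(5, 65), (3, 66), (9, 67), (1, 65)]
assert rle_decode(pairs) == data  # lossless: original restored exactly
```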
This is how most archiving tools and formats work, which is why you get the original files when you unpack archives created this way. Lossless compression is used in situations where lossy compression would cause irreparable damage to files, such as executables. It is also popular with audiophiles looking to preserve the quality of their music recordings.
Common Data Compression Algorithms and Their Uses
Over the past several decades, computer scientists have been developing and perfecting different algorithms for data compression. Today, many different algorithms are in use, with some being more effective for video and others for images. Here are some of the most common ones:
- LZ77 – Released in 1977, it represents repeated data as triples of offset, match length, and the next deviating character (see the sketch after this list).
- LZSS – An improvement on LZ77 that emits plain (offset, length) pairs for matches and literal characters otherwise, dropping the deviating-character field. Used by the .rar compression format and for compressing network data.
- DEFLATE – A data compression method that combines LZSS-style matching with Huffman coding, which assigns shorter codes to more frequent characters.
- LZMA – Uses an LZ77-style dictionary coder and then further compresses the output with range coding, a form of arithmetic coding. Most commonly used by 7-Zip; the format was upgraded to LZMA2 in 2009.
- MLP – Short for multilayer perceptron, one of the earliest neural network designs. As a compression approach, it combines binary coding, quantization, and pixel-by-pixel transformation to produce its output.
- RLE – Lossless compression that stores each run of repeated values as a single value plus a count, great for simple images and animations.
- ZStandard – Another lossless method, similar to DEFLATE but with faster decompression. It can also be paired with a pre-trained dictionary, which helps especially when compressing many small pieces of data.
- bzip2 – Built on the Burrows-Wheeler block-sorting transform, bzip2 rearranges recurring sequences into runs of identical characters, then applies two further transformations (move-to-front and Huffman coding) to blocks between 100 and 900 KB in size.
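To make the LZ77 entry concrete, here's the brute-force sketch promised above. It finds the longest match in the text seen so far and emits (offset, length, next character) triples; real encoders bound the search window and use far smarter data structures, so treat this as illustration only.

```python
# Bare-bones LZ77: encode each position as (offset, length, next char).
# Brute force for clarity; real encoders limit the search window.

def lz77_encode(text: str) -> list[tuple[int, int, str]]:
    triples, i = [], 0
    while i < len(text):
        best_offset = best_length = 0
        for start in range(i):  # search the already-encoded prefix
            length = 0
            # Matches may overlap position i; keep one character in
            # reserve so there is always a "next character" to emit.
            while (i + length < len(text) - 1
                   and text[start + length] == text[i + length]):
                length += 1
            if length > best_length:
                best_offset, best_length = i - start, length
        triples.append((best_offset, best_length, text[i + best_length]))
        i += best_length + 1
    return triples

def lz77_decode(triples: list[tuple[int, int, str]]) -> str:
    out = []
    for offset, length, char in triples:
        for _ in range(length):
            out.append(out[-offset])  # copy one at a time; overlap is fine
        out.append(char)
    return "".join(out)

message = "abababc"
print(lz77_encode(message))  # [(0, 0, 'a'), (0, 0, 'b'), (2, 4, 'c')]
assert lz77_decode(lz77_encode(message)) == message
```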
Pretty much any image format, whether it be JPG or GIF, is a compressed file. In audio, MP3 is the most well-known file format, but audiophiles prefer FLAC files for that lossless, full-quality sound. Of course, online streaming platforms such as YouTube and Netflix all use compressed video for faster transfer to the end user.
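If you want to see some of these algorithms in action, three of them ship with Python's standard library: zlib implements DEFLATE, bz2 implements bzip2, and lzma implements LZMA. The exact numbers you get will vary with the input and the modules' default settings:

```python
# Compare three stdlib compressors on a repetitive sample.
import bz2
import lzma
import zlib

data = b"the quick brown fox jumps over the lazy dog " * 200

for name, module in (("zlib", zlib), ("bz2", bz2), ("lzma", lzma)):
    compressed = module.compress(data)
    assert module.decompress(compressed) == data  # lossless round trip
    print(f"{name}: {len(data)} -> {len(compressed)} bytes")
```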
Advantages of Data Compression
The main reason we compress our files is to save storage space. This, in turn, cuts transfer times and the data used when sending files over the internet, and it saves on hardware, since we don't need as many storage devices to hold everything. Compression is also good for backups, and many data loss prevention apps compress your backups for quicker access later on.
There is one major disadvantage to compression, though: it demands extra computational power. Because compressed data has to be unpacked before use, accessing it can be slower, which can cause stutters on weaker machines when decompression happens on the fly. This is one reason some algorithms and file formats became more popular than others.