A C++ implementation of Huffman coding for lossless file compression. Characters that appear more frequently in the input are assigned shorter binary codes, reducing the total number of bits needed to represent the file.
| File | Description |
|---|---|
| HuffmanCode.cpp | Core implementation |
| smallinput.txt | Two-sentence sample input |
| largeinput.txt | Full repeated-paragraph sample input |
| driver.cpp | Main program loop, using Huffman coding |
```cpp
HuffmanCode huff;

// Compress
huff.compressFile("largeinput.txt", "compressed.txt");

// Decompress
huff.decompressFile("compressed.txt", "decompressed.txt");
```

- Frequency table — count how often each character appears in the input.
- Min-heap — insert one leaf node per unique character, ordered by frequency.
- Tree construction — repeatedly merge the two lowest-frequency nodes under a new parent until one root remains.
- Code generation — traverse the tree; each left edge adds a 0, each right edge adds a 1. Frequent characters end up near the root with short codes; rare ones sit deeper with longer codes.
- Output — write a header (the code table) followed by the encoded content as a packed bitstream.
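The steps above can be sketched as follows. This is a minimal illustration of the heap-merge and code-assignment phases, not the repository's actual `HuffmanCode` implementation; all names here are hypothetical, and cleanup of the allocated nodes is omitted for brevity:

```cpp
#include <map>
#include <queue>
#include <string>
#include <vector>

// One node of the Huffman tree: leaves hold a character, internal
// nodes hold the combined frequency of their two subtrees.
struct Node {
    char ch;
    long freq;
    Node *left = nullptr, *right = nullptr;
};

// Build the tree: seed a min-heap with one leaf per unique character,
// then repeatedly merge the two lowest-frequency nodes under a new
// parent until a single root remains.
Node* buildTree(const std::map<char, long>& freqs) {
    auto cmp = [](const Node* a, const Node* b) { return a->freq > b->freq; };
    std::priority_queue<Node*, std::vector<Node*>, decltype(cmp)> heap(cmp);
    for (auto [ch, f] : freqs) heap.push(new Node{ch, f});
    while (heap.size() > 1) {
        Node* lo = heap.top(); heap.pop();
        Node* hi = heap.top(); heap.pop();
        heap.push(new Node{'\0', lo->freq + hi->freq, lo, hi});
    }
    return heap.top();
}

// Walk the tree: each left edge appends '0', each right edge appends '1'.
void assignCodes(const Node* n, const std::string& prefix,
                 std::map<char, std::string>& codes) {
    if (!n->left && !n->right) { codes[n->ch] = prefix; return; }
    if (n->left)  assignCodes(n->left,  prefix + '0', codes);
    if (n->right) assignCodes(n->right, prefix + '1', codes);
}
```

Because the most frequent leaf is merged last, it sits closest to the root and receives the shortest code, which is exactly the skew the compression ratio depends on.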
```
<ASCII value> <binary code>
<ASCII value> <binary code>
...
====
[packed bitstream]
```
The `====` line separates the human-readable header from the encoded binary content. A special EOF marker character (`~`) is embedded at the end of the bitstream so the decompressor knows exactly where real data ends, regardless of zero-padding in the final byte.
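The padding issue the EOF marker solves comes from the packing step itself. Here is a minimal sketch of packing a '0'/'1' string into bytes (an illustrative helper, not the repository's code); the final byte is zero-filled, which is why the decoder needs the encoded `~` marker to know where real data stops:

```cpp
#include <string>
#include <vector>

// Pack a string of '0'/'1' characters into bytes, MSB-first within
// each byte. Any leftover bits in the final byte stay zero; without
// an in-band EOF marker, those padding zeros could be misread as a
// valid (short) Huffman code.
std::vector<unsigned char> packBits(const std::string& bits) {
    std::vector<unsigned char> out((bits.size() + 7) / 8, 0);
    for (size_t i = 0; i < bits.size(); ++i)
        if (bits[i] == '1')
            out[i / 8] |= 1u << (7 - i % 8);  // set bit, MSB-first
    return out;
}
```

For example, packing the 3-bit string `"101"` yields the single byte `0xA0`: the three real bits land in the top of the byte and the remaining five positions are padding.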
Huffman coding's compression ratio is directly tied to how skewed the character frequency distribution is — and nothing skews a frequency table faster than repeated text.
`smallinput.txt` contains only a couple of sentences. The character frequencies are fairly evenly spread, so code lengths are roughly similar across characters, and the savings over plain ASCII are modest.
`largeinput.txt` repeats the same paragraph 35 times. Because the same characters appear over and over, the frequency table becomes extremely lopsided: common characters like `e`, `t`, space, and `a` accumulate enormous counts while rare ones stay small. Huffman coding responds by assigning the highest-frequency characters very short codes (sometimes just 2–3 bits), while rare characters get longer ones. Since the common characters dominate the file, the average bits per character drops significantly.
The math: if a character makes up 40% of the file and gets a 2-bit code instead of the standard 8-bit ASCII code, that's a 75% per-character reduction applied to 40% of the content. The more the same characters repeat, the more this effect compounds across the entire file.
In short: more repetition → more extreme frequency skew → shorter codes for dominant characters → better compression. `largeinput.txt` is an ideal input for this algorithm.
- Uses `~` as the EOF marker, so files that already contain `~` will have that character's frequency incremented by 1 (negligible in practice but worth noting).
- The compressed output includes a plain-text header, which adds some overhead for very small inputs — compression gains only outpace this overhead as file size grows.