A C++ implementation of Huffman coding for lossless file compression. Characters that appear more frequently in the input are assigned shorter binary codes, reducing the total number of bits needed to represent the file.
| File | Description |
|---|---|
| HuffmanCode.cpp | Core implementation |
| smallinput.txt | Two-sentence sample input |
| largeinput.txt | Full repeated-paragraph sample input |
| driver.cpp | Main program loop, using Huffman coding |
```cpp
HuffmanCode huff;

// Compress
huff.compressFile("largeinput.txt", "compressed.txt");

// Decompress
huff.decompressFile("compressed.txt", "decompressed.txt");
```

- Frequency table — count how often each character appears in the input.
- Min-heap — insert one leaf node per unique character, ordered by frequency.
- Tree construction — repeatedly merge the two lowest-frequency nodes under a new parent until one root remains.
- Code generation — traverse the tree; each left edge adds a 0, each right edge adds a 1. Frequent characters end up near the root with short codes; rare ones sit deeper with longer codes.
- Output — write a header (the code table) followed by the encoded content as a packed bitstream.
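The steps above can be sketched as follows. This is a minimal illustration of the heap-merge and code-assignment phases, not the repository's actual `HuffmanCode` implementation; all names here are hypothetical, and cleanup of the allocated nodes is omitted for brevity:

```cpp
#include <map>
#include <queue>
#include <string>
#include <vector>

// One node of the Huffman tree: leaves hold a character, internal
// nodes hold the combined frequency of their two subtrees.
struct Node {
    char ch;
    long freq;
    Node *left = nullptr, *right = nullptr;
};

// Build the tree: seed a min-heap with one leaf per unique character,
// then repeatedly merge the two lowest-frequency nodes under a new
// parent until a single root remains.
Node* buildTree(const std::map<char, long>& freqs) {
    auto cmp = [](const Node* a, const Node* b) { return a->freq > b->freq; };
    std::priority_queue<Node*, std::vector<Node*>, decltype(cmp)> heap(cmp);
    for (auto [ch, f] : freqs) heap.push(new Node{ch, f});
    while (heap.size() > 1) {
        Node* lo = heap.top(); heap.pop();
        Node* hi = heap.top(); heap.pop();
        heap.push(new Node{'\0', lo->freq + hi->freq, lo, hi});
    }
    return heap.top();
}

// Walk the tree: each left edge appends '0', each right edge appends '1'.
void assignCodes(const Node* n, const std::string& prefix,
                 std::map<char, std::string>& codes) {
    if (!n->left && !n->right) { codes[n->ch] = prefix; return; }
    if (n->left)  assignCodes(n->left,  prefix + '0', codes);
    if (n->right) assignCodes(n->right, prefix + '1', codes);
}
```

Because the most frequent leaf is merged last, it sits closest to the root and receives the shortest code, which is exactly the skew the compression ratio depends on.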
```
<ASCII value> <binary code>
<ASCII value> <binary code>
...
====
[packed bitstream]
```
The `====` line separates the human-readable header from the encoded binary content. A special EOF marker character (`~`) is embedded at the end of the bitstream so the decompressor knows exactly where real data ends, regardless of zero-padding in the final byte.
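The padding issue the EOF marker solves comes from the packing step itself. Here is a minimal sketch of packing a '0'/'1' string into bytes (an illustrative helper, not the repository's code); the final byte is zero-filled, which is why the decoder needs the encoded `~` marker to know where real data stops:

```cpp
#include <string>
#include <vector>

// Pack a string of '0'/'1' characters into bytes, MSB-first within
// each byte. Any leftover bits in the final byte stay zero; without
// an in-band EOF marker, those padding zeros could be misread as a
// valid (short) Huffman code.
std::vector<unsigned char> packBits(const std::string& bits) {
    std::vector<unsigned char> out((bits.size() + 7) / 8, 0);
    for (size_t i = 0; i < bits.size(); ++i)
        if (bits[i] == '1')
            out[i / 8] |= 1u << (7 - i % 8);  // set bit, MSB-first
    return out;
}
```

For example, packing the 3-bit string `"101"` yields the single byte `0xA0`: the three real bits land in the top of the byte and the remaining five positions are padding.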
Huffman coding's compression ratio is directly tied to how skewed the character frequency distribution is — and nothing skews a frequency table faster than repeated text.
`smallinput.txt` contains only a couple of sentences. The character frequencies are fairly evenly spread, so code lengths are roughly similar across characters, and the savings over plain ASCII are modest.
`largeinput.txt` repeats the same paragraph 35 times. Because the same characters appear over and over, the frequency table becomes extremely lopsided: common characters like `e`, `t`, space, and `a` accumulate enormous counts while rare ones stay small. Huffman coding responds by assigning the highest-frequency characters very short codes (sometimes just 2–3 bits), while rare characters get longer ones. Since the common characters dominate the file, the average bits per character drops significantly.
The math: if a character makes up 40% of the file and gets a 2-bit code instead of the standard 8-bit ASCII code, that's a 75% per-character reduction applied to 40% of the content. The more the same characters repeat, the more this effect compounds across the entire file.
In short: more repetition → more extreme frequency skew → shorter codes for dominant characters → better compression. `largeinput.txt` is an ideal input for this algorithm.
- Uses `~` as the EOF marker, so files that already contain `~` will have that character's frequency incremented by 1 (negligible in practice but worth noting).
- The compressed output includes a plain-text header, which adds some overhead for very small inputs — compression gains only outpace this overhead as file size grows.