mccrea-dev/AlgoFinal

Huffman Coding — File Compression & Decompression

A C++ implementation of Huffman coding for lossless file compression. Characters that appear more frequently in the input are assigned shorter binary codes, reducing the total number of bits needed to represent the file.


Files

File              Description
HuffmanCode.cpp   Core implementation of the HuffmanCode class
smallinput.txt    Two-sentence sample input
largeinput.txt    Larger sample input (one paragraph repeated many times)
driver.cpp        Main program loop that exercises the HuffmanCode class

Usage

// Example usage; driver.cpp contains the full program loop.
HuffmanCode huff;

// Compress
huff.compressFile("largeinput.txt", "compressed.txt");

// Decompress
huff.decompressFile("compressed.txt", "decompressed.txt");

How It Works

  1. Frequency table — count how often each character appears in the input.
  2. Min-heap — insert one leaf node per unique character, ordered by frequency.
  3. Tree construction — repeatedly merge the two lowest-frequency nodes under a new parent until one root remains.
  4. Code generation — traverse the tree; each left edge adds a 0, each right edge adds a 1. Frequent characters end up near the root with short codes; rare ones sit deeper with longer codes.
  5. Output — write a header (the code table) followed by the encoded content as a packed bitstream.
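
The steps above could look roughly like this in C++. This is a minimal sketch for illustration only: the `Node` struct and function names are invented here and are not the repository's `HuffmanCode` implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <queue>
#include <string>
#include <vector>

// Illustrative tree node -- not the repository's actual class.
struct Node {
    char ch;                 // meaningful only for leaf nodes
    std::size_t freq;
    Node* left = nullptr;
    Node* right = nullptr;
};

// Order the heap by ascending frequency (i.e. a min-heap).
struct ByFreq {
    bool operator()(const Node* a, const Node* b) const {
        return a->freq > b->freq;
    }
};

// Steps 1-3: frequency table, min-heap, tree construction.
// (Cleanup of the allocated nodes is omitted for brevity.)
Node* buildTree(const std::string& text) {
    std::map<char, std::size_t> freq;
    for (char c : text) ++freq[c];

    std::priority_queue<Node*, std::vector<Node*>, ByFreq> heap;
    for (const auto& [c, f] : freq) heap.push(new Node{c, f});

    while (heap.size() > 1) {
        Node* a = heap.top(); heap.pop();
        Node* b = heap.top(); heap.pop();
        heap.push(new Node{'\0', a->freq + b->freq, a, b});
    }
    return heap.empty() ? nullptr : heap.top();
}

// Step 4: each left edge appends '0', each right edge appends '1'.
void buildCodes(const Node* n, const std::string& prefix,
                std::map<char, std::string>& codes) {
    if (!n) return;
    if (!n->left && !n->right) { codes[n->ch] = prefix; return; }
    buildCodes(n->left, prefix + '0', codes);
    buildCodes(n->right, prefix + '1', codes);
}
```

Note that `std::priority_queue` is a max-heap by default; the `ByFreq` comparator returning "greater frequency" inverts it into the min-heap step 2 calls for.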

File Format

<ASCII value> <binary code>
<ASCII value> <binary code>
...
====
[packed bitstream]

The ==== line separates the human-readable header from the encoded binary content. A special EOF marker character (~) is embedded at the end of the bitstream so the decompressor knows exactly where real data ends, regardless of zero-padding in the final byte.
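
The packing step can be sketched as follows. This is illustrative only; the repository's actual byte layout may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Pack a string of '0'/'1' characters into bytes, most significant
// bit first. The final byte is zero-padded on the right -- this
// padding is exactly why the format embeds the '~' EOF marker, so
// the decompressor can tell real data from padding bits.
std::vector<std::uint8_t> packBits(const std::string& bits) {
    std::vector<std::uint8_t> out((bits.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < bits.size(); ++i)
        if (bits[i] == '1')
            out[i / 8] |= static_cast<std::uint8_t>(0x80u >> (i % 8));
    return out;
}
```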


Why Repetition Makes Huffman Shine

Huffman coding's compression ratio is directly tied to how skewed the character frequency distribution is — and nothing skews a frequency table faster than repeated text.

smallinput.txt contains only a couple of sentences. Its character frequencies are fairly evenly spread, so code lengths end up roughly similar across characters, and the savings over plain 8-bit ASCII are modest.

largeinput.txt repeats the same paragraph 35 times. Because the same characters appear over and over, the frequency table becomes extremely lopsided: common characters like e, t, space, and a accumulate enormous counts while rare ones stay small. Huffman responds by assigning the highest-frequency characters very short codes (sometimes just 2–3 bits) and the rare characters longer ones. Since the common characters dominate the file, the average bits per character drops significantly.

The math: if a character makes up 40% of the file and is assigned a 2-bit code instead of the standard 8-bit ASCII code, that's a 75% reduction in the bits spent on that character, applied across 40% of the content. The more the same characters repeat, the more this effect compounds over the entire file.
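
That arithmetic can be made concrete with a couple of helper functions. The names and the 6-bit figure in the example below are illustrative assumptions, not values from the repo.

```cpp
#include <cassert>

// Fractional saving when an 8-bit ASCII code is replaced by a
// shorter Huffman code (illustrative arithmetic only).
double codeReduction(double huffmanBits, double asciiBits) {
    return 1.0 - huffmanBits / asciiBits;
}

// Average bits per character under a simple two-class model:
// `share` of the characters use `shortBits`, the rest `longBits`.
double avgBits(double share, double shortBits, double longBits) {
    return share * shortBits + (1.0 - share) * longBits;
}
```

With the paragraph's numbers, codeReduction(2.0, 8.0) gives 0.75, the 75% saving on the dominant characters; and if, hypothetically, the remaining 60% of characters averaged 6 bits each, avgBits(0.4, 2.0, 6.0) comes to about 4.4 bits per character, barely more than half of ASCII's 8.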

In short: more repetition → more extreme frequency skew → shorter codes for dominant characters → better compression. largeinput.txt is an ideal input for this algorithm.


Limitations

  • Uses ~ as the EOF marker, so files containing ~ will have that character's frequency incremented by 1 (negligible in practice but worth noting).
  • Compressed output includes a plain-text header, which adds some overhead for very small inputs — compression gains only outpace this overhead as file size grows.

About

This was my final project for CMSC 285 (Algorithms): an implementation of Huffman coding, an algorithm for lossless data compression. The HuffmanCode class can both compress and decompress files.
