ISTA 311 HW 4: HUFFMAN CODES AND DATA COMPRESSION
Due Friday, May 1, 2020, 11:59 PM
1. Introduction
For this assignment, we will write a simple file compression utility for text files implementing the Huffman
code algorithm. This utility reads a text file, analyzes the frequency distribution of characters in the file,
and calculates an optimal symbol code (Huffman code) for that distribution. It then writes a binary file
encoded in that Huffman code, and an auxiliary file containing the Python representation of the code in the
form of a dictionary so that it can be decoded later.
2. Instructions
2.1. Code and submission. Create a Python module called compress.py and upload it to the assignment
folder on D2L. You will need to do the following imports:
from bitstring import Bits
import numpy as np
and then implement the functions described in the specifications below.
You might have to install the bitstring library if you haven’t yet. This can be installed through pip or
conda.
Documentation for the Bits class can be found at https://pythonhosted.org/bitstring/constbitarray.
html
2.2. Documentation. Your script must contain a header docstring containing your name, ISTA 311, the
date, and a brief description of the module. Each function must contain a docstring. Each function docstring
should include a description of the function’s purpose, the name, type, and purpose of each parameter, and
the type and meaning of the function’s return value.
2.3. Testing and required files. A test script, test compress.py, will be available on D2L along with
some auxiliary text and data files necessary to run the tests.
2.4. Grading. Your submission will be graded by running it through the test script, running some tests on
other files, and examining your code. Code should be clear and concise, but you will only lose points for
style if your code is a real mess. Include inline comments to explain tricky lines and summarize sections of
code.
3. Function specifications
Implement the following functions:
• get frequency dict: This function takes a single string argument and returns a dictionary whose
keys are the characters appearing in the string, and whose values are the normalized frequencies
(number of times each character appears, divided by the total number of characters in the string).
1
2 ISTA 311 HW 4: HUFFMAN CODES AND DATA COMPRESSION
• dict entropy: This function takes a frequency dictionary and returns the entropy of the probability
distribution. Recall that the entropy is computed by the formula:
H(X) = −
∑
i
P (xi) log2 P (xi)
where xi are the individual outcomes. Here that means the characters in the dictionary.
• huffman code: This takes a frequency dictionary and returns a Huffman code dictionary, where the
keys are the characters appearing in the string and the values are the code words (strings of 0s
and 1s) assigned to each character. (This may be simpler with some helper functions; see the tree
algorithm in the Jupyter notebook encoding trees filled.ipynb, for instance.)
• invert dict: This function takes a dictionary representing an encoding, with characters mapping
to code words, and reverses it so that it maps code words to characters. (This is useful in decoding.)
• encode string: This function takes two arguments: a string and a dictionary representing an
encoding. Encode the string into a string of 0s and 1s and return it.
• decode string: This function takes two arguments: a string of 0s and 1s and a dictionary repre-
senting a reversed encoding. Use the dictionary to decode the string.
• pad bitstring: Remember that bit strings we write to a file must be a whole number of bytes (i.e. a
multiple of 8 bits.) This function takes a string of 1s and 0s and appends the number of 0s necessary
onto the end of the string to make the length of the string a multiple of 8. Then it prepends onto
the string an 8-bit bitstring representation of that number of 1s and 0s. Then return the padded
string.
• write encoded file: This function takes a string as its only argument, and does the following:
– Passes the string to get frequency dict to get a frequency dictionary;
– Finds the Huffman code for that frequency dictionary;
– Converts the contents of the file into a bit string using the Huffman code, pads it using
pad bitstring, and writes it to a (binary) file. The filename of the binary file should be
the original filename with ’.compressed’ replacing the .txt extension;
– Writes (in plain text) a representation of the Huffman code dictionary. Use repr to get a string
representation of the dictionary. The filename of the text file should be the original filename
with .code replacing the .txt extension.
After all this, the function should return [file len, dict len], where file len is the length of
the compressed file in bytes (which is the length of your bit string divided by 8) and dict len is the
length of the string representation of the dictionary.
• decode encoded file: This function takes a filename stem (a filename without an extension). Then
do the following:
– Open the file filename.compressed in binary read mode and read it to a Bits object.
– Open the file filename.code in text read mode, read it to a string, and call eval() on that
string to obtain the encoding dictionary; reverse it to get a decoding dictionary.
– Access the bin attribute of the Bits object to obtain a string of 0s and 1s.
– Interpret the first 8 bits in the bit string as an integer to learn how many digits of zero padding
are on the end of the bit string.
– Strip the initial 8 bits and any zero padding from the end, and pass the resulting bit string to
decode string along with the decoding dictionary, then return the resulting string.
• main: Finally, write a main function. The main function should:
– Open the file saguaros.txt, reads the text, and passes the text to write encoded file to
write a compressed version of the file.
– Print the length of the file saguaros.txt in bytes (which is the same as the length of the string
you read from it). Also print the entropy of the frequency dictionary, length of the compressed
file in bytes and the length of the dictionary, formatting the output like this:
Length of input file: xxxx
Input file entropy: xxxx
Length of compressed file: xxxx
Length of encoding dict: xxxx
where the xxxx are the various numerical quantities.
ISTA 311 HW 4: HUFFMAN CODES AND DATA COMPRESSION 3
– Pass the filename stem poem to decode encoded file. Take the string that is returned and
write it to a file called poem.txt.
Hint: The code necessary for encode string, decode string, pad bitstring, invert dict,
get frequency dict, and part of write encoded file can be found in the Jupyter notebook
encoding trees filled.ipynb on D2L.