lab_dict

Devious Dictionaries

Assignment Description

This lab is all about making you think with dictionaries. Dictionaries (aka maps or associative arrays) are an abstract data type which stores pairs of data and can support operations like insert, find, and remove. We generally call these (key, value) pairs. The idea is to associate a key with a certain value. Hence, dictionaries are useful for when you want to quickly “lookup” a value associated with a certain key. The name dictionary comes from the more familiar physical dictionary, where you lookup the definitions of words. If we were to model a physical dictionary with an ADT dictionary the keys would be the words, and the values would be their definitions.

Since dictionaries are abstract in nature, they have no prescription for what underlying structure should be used to implement them. Common choices are tree-based structures (e.g. balanced binary search trees, BTrees, &c.) or some sort of hash table. Different underlying implementations have different performance considerations, and which you use in the real world will depend on things like desired performance and what is readily available.

For this lab, we’re not going to concern ourselves with the underlying implementation of dictionaries (we’ve already done that in previous labs). Instead, we’ll be using dictionary types that have already been defined for us. In C++ there are two “built-in” dictionary types: std::map and std::unordered_map. Both exist in the wonderful land of the Standard Template Library (STL). The STL is a subset of the C++ Standard Library (the library a compiler must implement to be standard-compliant) which provides data containers. As the name implies, these containers are templated and thus can contain any type (which meets the restriction of the container). You’ve probably experienced at least one of the STL types so far: std::vector, which is pretty much just a super fancy array. In general, the STL types are very convenient as they provide a lot of functionality (e.g. sorting, automatic resizing) that you don’t have to bother implementing yourself.

Dictionaries are amongst some of the most important abstract data types that are used in the real world. Therefore, this lab is perhaps one of the most useful labs from a real world perspective. Mastering the concepts that you see in this lab will be a tremendous help for most technical interviews and programming problems in the future.

Now that we’ve talked about what dictionaries and underlined their importance, let’s move on the actual contents of this lab. This lab is composed of a bunch of different components. We suggest you do them in the order that we describe them over here.

Lab Insight

Dictionaries are important data structures that allow you to quickly search for a value given a key. One important use for dictionaries is memoization, which is used in dynamic programming. This basically means just storing the intermediate values of an algorithm to speed up later computations. This concept is discussed more below. To learn more about the practical uses of dictionaries, CS 374, CS 473 will delve deeper into the applications.

Checking Out The Code

From your CS 225 git directory, run the following on EWS:

git fetch release
git merge release/lab_dict -m "Merging initial lab_dict files"

If you’re on your own machine, you may need to run:

git fetch release
git merge --allow-unrelated-histories release/lab_dict -m "Merging initial lab_dict files"

Upon a successful merge, your lab_dict files are now in your lab_dict directory.

This lab has a data directory that you need to download separately. You can get it by running:

make data

Memoization

Memoization is an aspect of Dynamic Programming, which you’ll learn much more about when you take CS 374. However, it’s a fairly simple idea, and a common implementation uses dictionaries at its core.

Factorial

Our first introduction to memoization will be through the use of factorials.

The factorial is a deterministic function: for a given input \(n\), the value (\(n!\)) is always the same. The formal (recursive) definition is:

\begin{align} 0! &= 1 \\
n! &= n \times (n-1)! \end{align}

For example, say we already know the value of \(5!\) and we want to compute \(6!\). The naïve way computing of \(6!\) would be to calculate \(6! = 6\times(5\times(4\times(3\times(2\times(1)))))\). Since we already know the value of \(5!\), however, we can save quite a bit of computation (and time) by directly computing \(6! = 6\times5!\), using the known value of \(5!\).

This idea is essentially what memoization is. According to Wikipedia:

[M]emoization is an optimization technique used primarily to speed up computer programs by storing the results of expensive function calls and returning the cached result when the same inputs occur again

Now let’s look at how we do this in code by looking at the memoized_fac function in fac.cpp. On line 93, we find:

static map<unsigned long, unsigned long> memo = { {0, 1} };

The left hand side is creating a variable named memo which is a std::map that associates keys of type unsigned long to values of type unsigned long. In our case, we’re going to map a key \(n\) to the value \(n!\).

The static keyword is a neat little thing that basically makes this local variable remain initialized for the lifetime of the program so that we can reuse it across multiple calls to the function. It’s like a global variable, but it’s only accessible by that function.

The right hand side initializes the map to contain a key-value pair where the key is 0 and the value is 1. Notice that this key-value pair corresponds to the base case of our mathematical definition of \(n!\) given above.

Alternatively, we could have used

static map<unsigned long, unsigned long> memo;
memo[0] = 1;

The disadvantage of this method is that now the key 0’s value will be changed to the value 1 during every function call, as opposed to just at initialization as in the first example.

Line 97 is

map<unsigned long, unsigned long>::iterator lookup = memo.find(n);

memo.find(n) uses the std::map::find function. As mentioned above, this method returns an iterator; the type of lookup is std::map<unsigned long, unsigned long>::iterator.

Lines 98–104 are:

if (lookup != memo.end()) {
    return lookup->second;
} else {
    unsigned long result = n * memoized_fac(n - 1);
    memo[n] = result;
    return result;
}

Going back to what we already know about iterators and std::map::find, a return value of memo.end() basically implies that the key n does not exist in the map. In this case, we need to compute n! and update our map to store that value, which is exactly what we do in the else branch.
A std::map::iterator is basically a pointer to a std::pair<const key_type, mapped_type> . A std::pair stores two objects (key and its associating data in the case of map) in a pair. To access the first object in a std::pair, we use the member name first; to access the second object, we use the member name second. For example, in the if branch above, lookup->second returns the value stored in the map for key lookup->first, which is going to be n in our case. If we wanted to be clever, we could rewrite this code as:

if (lookup == memo.end()) {
    memo[n] = n * memoized_fac(n - 1);
}
return memo[n];

Memoizing the factorial function doesn’t actually do much for us in terms of computation time. To see this for yourself you can run:

make fac
time ./fac 20

and compare it to the time it takes for the memoized version by passing the -m flag:

time ./fac 20 -m

Fibonacci

Now that you’re familiar with how to memoize the factorial function, you’ll apply memoization to the Fibbonacci function. As we’ll soon find out, memoizing this function has real speed benefits.

The Fibonacci numbers are recursively defined as:

\begin{align} F(0) &= 0 \\
F(1) &= 1 \\
F(n) &= F(n-1) + F(n-2) \end{align}

If you draw out some trees for calculating these numbers (which I highly recommend) you’ll notice something interesting. For example here is such a tree for \(F(6)\); each node represents a function call:

There’s quite a bit of repetition in this tree. Specifically the entire \(F(3)\) subtree appears three times in the tree and \(F(4)\) twice. The values in those trees get calculated twice. For larger values of \(n\), the tree naturally gets larger and so too do the repeated subtrees. Remember that the nodes in this tree represent function calls, so each one takes some time. Suppose the \(F(4)\) call on the right happens before the \(F(4)\) call on the left. By the time we get to the \(F(4)\) call on the left, we already knew the value of \(F(4)\). What if we could store the result of \(F(4)\)? If we stored it, then when we call \(F(4)\) on the left we don’t need to bother with going through that entire subtree.

As we found out with factorial, we can use a dictionary for exactly this purpose. Imagine that inside our \(F\) function that we associate an input number \(n\) with a Fibonacci number. Every time we successfully calculate a value \(F(n)\) we store the pair \((n, F(n))\) in our dictionary. So now in our function we first check to see if \(n\) already exists in the dictionary (i.e. we’ve already computed it). If it is, we return that immediately. Otherwise, we make the normal recursive call, store the result in the dictionary, and then return the result. Using memoization the tree becomes:

The dashed nodes are ones which didn’t need to be calculated; one of their parent’s values was taken from the dictionary rather than being recalculated from scratch. The red nodes are the only calls where it’s calculated. As you’ll see in a while, the difference memoization makes is rather dramatic.

Your task is to implement both the normal and memoized version of the fib function in fib.cpp. After you do this you can race them with the fib_generator executable:

make fib_generator
time ./fib_generator 45

To use the memoized version, pass the -m flag:

time ./fib_generator 45 -m

Common Words

The next part of this assignment involves making an object that can find common words in different files. The end goal is: given a list of files and a number \(n\) find all words that appear in every file at least \(n\) times. This is a sort of conditional intersection of the words in the files. As you might expect, you’ll be using a dictionary to solve this problem, with the words as the keys, and their frequencies as the values. (This is a common application of associative structures). You will accomplish this by finishing the CommonWords class.

You can find the Doxygen for the CommonWords class here. You are responsible for writing the init_file_word_maps, init_common, and get_common_words functions. The former two will help you write the latter function. The header file (common_words.h) has descriptions of the variables you are initializing.

Once you’ve written your class, you can compile it with make find_common_words. You can then test it on some small text files or novels:

./find_common_words data/small1.txt data/small2.txt -n 3
dog
pig
./find_common_words data/PrideAndPrejudice.txt data/Beowulf.txt data/SherlockHolmes.txt -n 500
and
in
of
the
to

The executable command below finds all the words that appear greater than or equal to \(n\) times in ALL the parameter text files. Here is the usage:

./find_common_words [TEXT FILES] -n [NUM] [-o FILE]

Pronunciation Puzzler

For this part of the lab you’ll complete an object which will help solve the following puzzle (taken from CarTalk):

This was sent in by a fellow named Dan O’Leary. He came upon a common one-syllable, five-letter word recently that has the following unique property. When you remove the first letter, the remaining letters form a homophone of the original word, that is a word that sounds exactly the same. Replace the first letter, that is, put it back and remove the second letter and the result is yet another homophone of the original word. And the question is, what’s the word?

Now I’m going to give you an example that doesn’t work. Let’s look at the five-letter word, ‘wrack.’ W-R-A-C-K, you know like to ‘wrack with pain.’ If I remove the first letter, I am left with a four-letter word, ‘R-A-C-K.’ As in, ‘Holy cow, did you see the rack on that buck! It must have been a nine-pointer!’ It’s a perfect homophone. If you put the ‘w’ back, and remove the ‘r,’ instead, you’re left with the word, ‘wack,’ which is a real word, it’s just not a homophone of the other two words.

But there is, however, at least one word that Dan and we know of, which will yield two homophones if you remove either of the first two letters to make two, new four-letter words. The question is, what’s the word?

Basically, we’re looking to find words such that

  • the word itself,
  • the word with its first character removed, and
  • the word with its second character removed

are all homophones (i.e., have the same pronounciation) of each other.

To accomplish this we are going to employ the help of CMU’s Pronouncing Dictionary. The data associated with that dictionary is stored in a file (data/cmudict.0.7a) and read on construction of a PronounceDict object (see pronounce_dict.h). You don’t have to worry about the constructors this time since they’re not very interesting. Instead, you’re responsible for writing the homophones function which determines whether two words are homophones are not. Check out pronounce_dict.h to see what kind of dictionary structure you’re working with, and remember that it is mapping a word to its pronunciation.

Here is the Doxygen for the PronounceDict class.

PronounceDict expects the words to be uppercase. Here’s how you can transform a std::string str to uppercase in C++:

std::transform(str.begin(), str.end(), str.begin(), ::toupper);

It uses std::transform (you know how to look this function up, right?) We’ll leave it up to you to try and figure out exactly how it works. If you need help, post on Piazza!

Next you will have to write the code which actually solves the puzzle. This function resides in cartalk_puzzle.cpp. It takes a word list file name (similar to before) and a PronounceDict.

You need to be able to read in a list of words form the word list. To do this, you’ll need to know some basic C++ file I/O. This following snippet will print out every word in a word list file:

ifstream wordsFile(filename);
string word;
if (wordsFile.is_open()) {
    /* Reads a line from `wordsFile` into `word` until the file ends. */
    while (getline(wordsFile, word)) {
        cout << word << endl;
    }
}

To solve the actual puzzle, methods like the std::string’s substr function will probably prove useful. If all goes well you should be able to make it and run the executable with:

make homophone_puzzle
./homophone_puzzle

If you want to debug with cout, I would recommend typing using std::cout; and using std::endl; instead of a blanket using namespace std; or else you may get some nasty errors.

There should be 5 (only 4 if you follow the CarTalk problem which looks for words with length 5) resulting triples of words, but only one which matters. I believe the answer CarTalk was looking for is the only one in the list where the two homophones are distinct (it doesn’t matter for us if you don’t make sure the homophones are distinct).

Anagrams

The final part of this assignment involves making a dictionary for looking up anagrams of a given word. Two words are anagrams of one another if we can rearrange the letters of one to form the other. For example, the letters in the word “dog” can be rearranged to form the word “god”. We want an object (built from a word list) whom I can feed the string "dog" and it will spit out both "dog" and "god".

You’ll find the definition of the AnagramDict class in anagram_dict.h. You are responsible for implementing both constructors, as well as the functions get_anagrams and get_all_anagrams. AnagramDict has an instance of map named dict which maps a string to a vector of strings.

Remember that when using dictionaries we need to figure out a way to uniquely identify each of our values. In this case we want to uniquely identify a group of words which are anagrams of one another: we need some sort of function f(key) that we can apply to a key such that f("dog") == f("god") and f("silent") == f("listen") == f("tensil"), but f("dog") != f("silent") Phrased differently, you’ll need to come up with some sort of a function, such that it always returns the same value for all words that are anagrams of one another.

Once you figure out what a function f must do to satisfy that constraint (and there are many possible answers), you’ll find it useful to Google how to actually implement that function if it seems non-trivial to you. (Actually, you’ll find it useful to Google for pretty much anything, including what should f do.)

Here is the Doxygen for the AnagramDict class.

One of the AnagramDict constructors takes a word list file name, which is expected to be a newline-separated list of words (see words.txt). Read this file the in same way as you did for the pronounciation puzzle.

To test your dictionary you can use the anagram_finder executable, which you can compile by running make anagram_finder. Example usage and output:

./anagram_finder dog
Anagrams for dog:
dog
god

You can run it with the flag -a to print out a list of all the known anagrams for this word list. -o will redirect the output to a file, e.g.

./anagram_finder -a -o anagrams.txt

You can diff your output of the above command with data/all_anagrams_soln.txt. You may also specify your own word list file with -w if you’re feeling adventurous (it defaults to the words.txt in the data directory).

Testing Your Program

As usual, there are Catch tests for this lab, which can be run with:

make test && ./test

Submitting Your Work:

The following files are used in grading:

  • anagram_dict.cpp
  • anagram_dict.h
  • fib.cpp
  • pronounce_dict.cpp
  • pronounce_dict.h
  • cartalk_puzzle.cpp
  • common_words.cpp
  • common_words.h

All other files including any testing files you have added will not be used for grading.

Good Luck!