Hash Tables

A hash table is a data structure where keys are organized into smaller groups that can be quickly searched. Hash tables are a remarkably useful data structure. They implement the abstract data type dictionary, a variation of the symbol table ADT.

A closely related ADT is the map. Many authors use the terms map and dictionary synonymously, but these notes distinguish the two. When a symbol table is implemented with a search tree, we call the resulting ADT a map. When implemented with a hash table, the resulting ADT is a dictionary. We maintain these distinctions to keep communication clear. That said, it's critical to keep in mind that there is no established terminology on the matter. Languages themselves give different names to ADTs implemented with hash tables:

  1. Python calls them dictionaries.
  2. JavaScript calls them object literals.
  3. C++, Java, Go, and Scala call them maps.
  4. Ruby, ever the odd one out, calls them hashes.

Benefits of Hash Tables

Hash tables exist because of the conflicting tradeoffs in searching on arrays. With arrays, we get constant time access and all the benefits of caching, but at the cost of O(n){O(n)} linear search. If we sort the elements, we can improve our situation to O(lgn){O(\lg n)} with binary search, but then we introduce the bottleneck and complexity of sorting. Hash tables are an attempt at hitting the middle ground: fast searching and fast access. The cost: we don't get anywhere near the caching benefits of plain arrays, and we require far more memory.

Basic Idea

Say we had the following array:

index:  0   1   2   3   4   5   6
value:  8   3   6   10  15  9   4

We're asked to build a data structure that allows the user to determine if an element n{n} exists in the array, in O(1){O(1)} time. That takes binary search and linear search off the table. So what can we do? Well, we could store the elements in an array where the indices correspond to the actual elements. We'll call this new data structure H:

index:  0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15
value:  -  -  -  3  4  -  6  -  8  9  10  -   -   -   -   15

This does, in fact, give us O(1){O(1)} search. If the user wants to know whether some value n{n} is in the array, they just write H[n]. If a number comes back, n{n} is present. Otherwise, they get some designated value, perhaps null.
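To make the idea concrete, here's a minimal sketch of this direct-address scheme in Python (the class and method names are made up for this illustration; they aren't from any particular library):

```python
class DirectAddressTable:
    """Stores each key k at index k of an underlying array."""

    def __init__(self, max_key):
        # One slot per possible key; None marks an empty slot.
        self.slots = [None] * (max_key + 1)

    def insert(self, key):
        self.slots[key] = key

    def contains(self, key):
        # A single index operation, no searching: O(1).
        return self.slots[key] is not None


H = DirectAddressTable(15)
for key in [8, 3, 6, 10, 15, 9, 4]:
    H.insert(key)

print(H.contains(10))  # True
print(H.contains(7))   # False
```

The lookup is one array access, which is exactly the constant-time behavior we were asked for. The price, as we're about to see, is the size of slots.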

Of course, this isn't very feasible. Suppose the array was simply:

index:  0
value:  100 000 000 000

To spare the trouble of counting zeros: that's one hundred billion. We'd need an H of size one hundred billion just to store that one element. This isn't some outlandish concern; character encodings today assign code points past the million mark! Clearly, our initial data structure H is only ideal for the smallest cases. But the idea itself isn't bad, so let's build on it.

Hash Functions

Thinking carefully about the problem with H, we realize that the problem stems from the way we map the elements to the indices. With our initial approach, it was a 1-to-1 mapping:

karray[k] k \rarr \texttt{array[k]}

More specifically, we passed k{k} to some function, and that function returned a value that we used to index into the array:

karray[h(k)]where  h(x)=x\begin{aligned} &k \rarr \texttt{array[$h(k)$]} \\[1em] &\text{where}~~h(x) = x \end{aligned}

The function h(x){h(x)} is a type of hash function, and when it takes the form h(x)=x,{h(x) = x,} we call it ideal hashing. Let's try a different function, h(x)=x rem 10.{h(x) = x \rem 10.} This is the remainder (or modulo) operation from number theory (divide x{x} by 10,{10,} give me the remainder). Say we had the following array:

index:  0  1  2  3   4   5
value:  8  3  6  10  15  4

Applying the hash function to each element k{k} of the array, we get the following:

k     h(k) = k mod 10
8     8
3     3
6     6
10    0
15    5
4     4

The resulting array uses substantially less space.

Before (one slot per possible key):

index:  0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15
value:  -  -  -  3  4  -  6  -  8  -  10  -   -   -   -   15

After (hashing with h(x) = x rem 10):

index:  0   1  2  3  4  5   6  7  8  9
value:  10  -  -  3  4  15  6  -  8  -
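Here's the same compression as a small Python sketch (the names h, keys, and table are made up for this example):

```python
def h(x):
    # The hash function from above: the remainder of x divided by 10.
    return x % 10

keys = [8, 3, 6, 10, 15, 4]

# Ten slots instead of one slot per possible key.
table = [None] * 10
for k in keys:
    table[h(k)] = k

print(table)         # [10, None, None, 3, 4, 15, 6, None, 8, None]
print(table[h(15)])  # 15 -- found with a single index operation
```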

It seems we've solved the problem. At least until the value 18{18} is added to the array:

k     h(k) = k mod 10
8     8  ← collision
3     3
6     6
10    0
15    5
4     4
18    8  ← collision

An array slot can only hold one element at a time, and we've gotten ourselves into a tangle: the values 8{8} and 18{18} map to the same index. This is called a collision. Now we have to solve this problem. We have some options:

  1. closed hashing — We keep the size of our table fixed. To prevent collisions, we'll either (1) modify our hash function, (2) use some contingency procedure in the event of a collision, or (3) some combination of both.
  2. open hashing — Forget all of the complexity with writing hash functions and dealing with minutiae — just use more memory.

Open Hashing

Also called chaining, open hashing is a straightforward approach to handling collisions. Instead of having the hash table's underlying array store the values directly, we have each of its slots store a pointer to a linked list.

For example, say we had the array:

[16 12 25 39 6 122 5 68 75]

To store these values, we use our usual hash function:

k     h(k) = k mod 10
16    6
12    2
25    5
39    9
6     6
122   2
5     5
68    8
75    5

Placing each key into the chain at its hashed index, we get:

index    chain
0        nil
1        nil
2        12 → 122
3        nil
4        nil
5        25 → 5 → 75
6        16 → 6
7        nil
8        68
9        39
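A minimal sketch of such a chained table in Python, with ordinary Python lists standing in for hand-rolled linked lists (the class and method names are made up for this example):

```python
class ChainedHashTable:
    """Open hashing: each slot holds a chain of the keys that hash to it."""

    def __init__(self, size=10):
        self.size = size
        self.slots = [[] for _ in range(size)]  # one chain per slot

    def h(self, key):
        return key % self.size

    def insert(self, key):
        # Collisions are fine: the key is simply appended to its chain.
        self.slots[self.h(key)].append(key)

    def contains(self, key):
        # Hash once, then scan the (hopefully short) chain.
        return key in self.slots[self.h(key)]


table = ChainedHashTable()
for key in [16, 12, 25, 39, 6, 122, 5, 68, 75]:
    table.insert(key)

print(table.slots[5])       # [25, 5, 75]
print(table.contains(122))  # True
```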

This successfully (and easily) solves the collision problem. But there's a catch. Say we had n=100{n = 100} keys. The size of our hash table is S=10.{S = 10.} Using these assumed values, the hash table world provides a special value:

λ=nS=10010=10 \lambda = \dfrac{n}{S} = \dfrac{100}{10} = 10

This is called the load factor: the number of keys divided by the size of the table. To see why this value is important, let's say the 100{100} keys are uniformly distributed, i.e., each of our hash table's slots gets 10{10} keys. Because our hash function is a simple modulo operation, it takes O(1){O(1)} time to obtain the hash table's index. And because the keys are uniformly distributed, the average time for a successful search of a key is:

T=1+λ2. T = 1 + \dfrac{\lambda}{2}.

One for obtaining the hash table index, and λ/2{\lambda / 2} for walking, on average, half of the chain at that index. That's the average time for a successful search. For an unsuccessful search, the average is T=1+λ.{T = 1 + \lambda.} That is, we walk all the way to the end of the chain. Now we see why the load factor is critical. If we had 10 000{10~000} keys and only 10{10} spaces, we'd have:

λ=10 00010=1000 \lambda = \dfrac{10~000}{10} = 1000
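Plugging these two load factors into the search-time formulas above makes the difference concrete (a quick check, still assuming uniformly distributed keys). For λ=10:{\lambda = 10:}

T_{\text{successful}} = 1 + \dfrac{10}{2} = 6 \qquad T_{\text{unsuccessful}} = 1 + 10 = 11

For λ=1000:{\lambda = 1000:}

T_{\text{successful}} = 1 + \dfrac{1000}{2} = 501 \qquad T_{\text{unsuccessful}} = 1 + 1000 = 1001

Five hundred comparisons on average per successful lookup is a far cry from the constant-time behavior we were after.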

We might want to decrease this load factor by consuming more memory. Unfortunately, even if we were content with consuming more memory, there's another problem with chaining: suppose we start with an empty hash table and allocate a size of 5000.{5000.} Then we get a key space that looks like:

{35,95,145,155,205,305,,5n} \set{35, 95, 145, 155, 205, 305, \ldots, 5n}

The resulting hash table:

index    chain
0        nil
1        nil
2        nil
3        nil
4        nil
5        35 → 95 → 145 → 155 → 205 → 305 → ⋯ → 5n
6        nil
7        nil
8        nil
9        nil
⋮        ⋮

With our current hash function, as long as the user keeps entering keys that end in 5,{5,} the entries will keep being appended to the list at index 5.{5.} This is called clustering, and it's arguably an even worse problem than the one we grappled with earlier. Why? Because the hash table is no longer uniformly distributed, which means that none of our load factor analysis applies.
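A short Python check makes the skew visible (the key list is the contrived one from above):

```python
from collections import Counter

def h(x):
    return x % 10

# Keys that all end in 5, as in the example above.
keys = [35, 95, 145, 155, 205, 305]

chain_lengths = Counter(h(k) for k in keys)
print(chain_lengths)  # Counter({5: 6}) -- every key lands in slot 5
```

Every chain but one stays empty, so the λ/2{\lambda/2} average says nothing useful about an individual lookup.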

Linear Probing

An alternative technique for resolving collisions is linear probing. To illustrate, we'll use the same simple hash function, h(x)=x rem 10.{h(x)=x \rem 10.} Suppose our key space is:

{26,30,45,23,25,43,74} \set{26, 30, 45, 23, 25, 43, 74}

Just as we did in the previous sections, let's map the keys using the hash function (we'll stop when we encounter a collision):

index:  0   1  2  3   4  5   6   7  8  9
value:  30  -  -  23  -  45  26  -  -  -
probe:                   ×

Once we hit 25, we encounter a collision. We need it at 5, but 5 is already occupied by 45. What we can do is check if the next space is available:

index:  0   1  2  3   4  5   6   7  8  9
value:  30  -  -  23  -  45  26  -  -  -
probe:                   ×   ×

The next index is occupied, so we try the next:

index:  0   1  2  3   4  5   6   7  8  9
value:  30  -  -  23  -  45  26  -  -  -
probe:                   ×   ×   ✓

We have a space, so we place 25 at this index:

index:  0   1  2  3   4  5   6   7   8  9
value:  30  -  -  23  -  45  26  25  -  -

This is the basic premise of linear probing. If we find that an index is already mapped-to, we try the next. We continue trying until we get an index that isn't already mapped-to. In short, linear probing is an attempt to keep the hash function as close to injective as possible, i.e., to avoid mapping many keys to the same index. Open hashing, in contrast, is perfectly happy with many keys sharing an index; we just use more memory. Finding the next available index is done with a linear probe: a function that takes two inputs, the key and a probe count, and is evaluated with incrementing counts until it returns an index that hasn't been mapped to. For example, one such probe might be h(x,i)=(h(x)+f(i)) rem 10,{h'(x, i)=(h(x)+f(i)) \rem 10,} where f(i)=i.{f(i)=i.} For the key 25,{25,} we have:

i     h'(25, i)
0     (5 + 0) rem 10 = 5    ×
1     (5 + 1) rem 10 = 6    ×
2     (5 + 2) rem 10 = 7    ✓
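Here's a minimal linear-probing sketch in Python, using the probe f(i)=i{f(i) = i} from above (the class and method names are made up for this example; deletion is deliberately left out and discussed below):

```python
class LinearProbingTable:
    """Closed hashing with a linear probe: h'(x, i) = (h(x) + i) mod size."""

    def __init__(self, size=10):
        self.size = size
        self.slots = [None] * size

    def h(self, key):
        return key % self.size

    def insert(self, key):
        # Probe forward from the home slot until an empty slot appears.
        for i in range(self.size):
            index = (self.h(key) + i) % self.size
            if self.slots[index] is None:
                self.slots[index] = key
                return index
        raise RuntimeError("table is full")

    def contains(self, key):
        # Follow the same probe sequence; an empty slot means the key is absent.
        for i in range(self.size):
            index = (self.h(key) + i) % self.size
            if self.slots[index] is None:
                return False
            if self.slots[index] == key:
                return True
        return False


table = LinearProbingTable()
for key in [26, 30, 45, 23, 25, 43, 74]:
    table.insert(key)

print(table.slots)         # [30, None, None, 23, 43, 45, 26, 25, 74, None]
print(table.contains(25))  # True
```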

Like open hashing, linear probing is analyzed with a load factor, λ=n/S,{\lambda = n/S,} where nN{n \in \nat} is the number of keys and S{S} is the table's size. For a table of size 10 holding 9 keys, we have λ=9/10=0.9.{\lambda = 9/10 = 0.9.} With linear probing, the load factor can never exceed one, since every key lives in the table itself. A well-implemented linear probing approach should ideally keep λ0.5.{\lambda \le 0.5.} Assuming the load factor is within that range, the average time for a successful search is

t=1λln(11λ). t = \dfrac{1}{\lambda} \ln \ar{\dfrac{1}{1-\lambda}}.

The average time for an unsuccessful search is

 t = \dfrac{1}{1-\lambda}. 
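At the recommended ceiling of λ=0.5,{\lambda = 0.5,} for instance, these approximations work out to roughly

t_{\text{successful}} = \dfrac{1}{0.5} \ln \ar{\dfrac{1}{1-0.5}} = 2 \ln 2 \approx 1.39 \qquad t_{\text{unsuccessful}} = \dfrac{1}{1-0.5} = 2

That is, one to two probes per lookup on average, provided we keep the table at most half full.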

The most obvious cost to linear probing is the fixed size. Not all applications have a determinable key space. The second cost is that deletion can get hairy. For example, suppose we had a hash table:

index:  0   1   2  3   4   5   6   7   8   9
value:  30  29  -  23  43  45  26  25  74  19

We then delete the key 25:

index:  0   1   2  3   4   5   6   7   8   9
value:  30  29  -  23  43  45  26  -   74  19

When this occurs, we must move 74 into the now-open slot. Why? Recall how lookups work under linear probing: to find 74, we start at h(74)=4{h(74) = 4} and probe forward (4, 5, 6, 7, ...) until we either find 74 or hit an empty slot. With slot 7 now empty, the probe stops at 7 and concludes, wrongly, that 74 is absent, even though it's sitting at index 8. Moving 74 back into slot 7 repairs the probe chain. Moreover, it's not always the case that we have to shift. If we deleted the key 45,{45,} we wouldn't shift 26, because 26 already sits at its home index, h(26)=6.{h(26) = 6.} Because of how messy this problem can get (real-world hash functions are far more complicated than the examples used here), some implementations employ the technique of rehashing: when a key is deleted, remove all of the entries and hash them again. This is a costly operation, but for many implementations, the cost of a bug in shifting keys is much higher. This is also why linear probing approaches almost always include some assumption about the number of deletions.
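As a sketch of that rehash-on-delete strategy (a standalone Python function; the name delete_and_rehash is made up for this example):

```python
def delete_and_rehash(slots, key):
    """Remove key from a linear-probing table, then reinsert every
    remaining key from scratch. Costly, but it sidesteps the
    key-shifting bugs described above."""
    size = len(slots)
    remaining = [k for k in slots if k is not None and k != key]

    # Rebuild the table with the same hash function and linear probe.
    rebuilt = [None] * size
    for k in remaining:
        for i in range(size):
            index = (k % size + i) % size
            if rebuilt[index] is None:
                rebuilt[index] = k
                break
    return rebuilt


table = [30, 29, None, 23, 43, 45, 26, 25, 74, 19]
print(delete_and_rehash(table, 25))
# [30, 19, None, 23, 43, 45, 26, 74, None, 29]
# (19 and 29 swap slots because the rebuild walks the table in slot order.)
```

Note that 74 lands in slot 7 after the rebuild, exactly where the shifting argument above says it belongs.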