Remember searching?
Implementation | Search | Insert | Delete |
---|---|---|---|
sorted array | \(O(\log n)\) | \(O(n)\) | \(O(n)\) |
unsorted list | \(O(n)\) | \(O(1)\) | \(O(n)\) |
binary search tree | \(O(h)\) | \(O(h)\) | \(O(h)\) |
There is a data structure which allows all three operations in expected constant time.
At the heart of such a data structure lies hashing.
Hashing tries to map data of arbitrary size to fixed-size values.
In other words, we calculate the storage location of an element.
With hashing, our search operation changes slightly.
Before: an ancient key-value storage. Paperback, ~1900.
With hashing: a contemporary dictionary. Python, ~2025:
phone_book = {'peter': '239-555-292',
              'paul':  '239-555-183',
              'mary':  '433-555-798'}
call(phone_book['mary'])
We have a universe \(U\) of possible keys, a hash table \(T[0,n-1]\) with \(n\) buckets, and a hash function \(h: U \to \{0,\dots,n-1\}\) that maps each key to a bucket.
Ideally, a hash function \(h\) should spread the keys uniformly over the buckets, causing as few collisions as possible.
Definitely, a hash function \(h\) should be deterministic and fast to compute.
Hashing often comprises two steps:
Let’s assume a string consisting of the characters \(a_0a_1a_2\).
The ASCII values are \(a_0=43\), \(a_1=68\), and \(a_2=100\):
Mapping the input \(a_0a_1a_2\) to a large integer:
\(43 + 256 \cdot 68 + 256 \cdot 256 \cdot 100 = 6571051\)
Mapping a large number \(x\) to the hash table \(T[0,n-1]\):
For example, using the modulo operation: \(h(x) = x \bmod n\)
Given two positive numbers \(a\) and \(n\), \(a\) modulo \(n\) (often abbreviated as \(a \bmod n\)) is the remainder of the Euclidean division of \(a\) by \(n\), where \(a\) is the dividend and \(n\) is the divisor.
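For instance, assuming a table with \(n = 13\) buckets (an illustrative choice), the number from above would be stored in bucket \(6571051 \bmod 13 = 6\).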
Alternatively, \(x\) can be mapped to the hash table \(T[0,n-1]\) using the multiplication method with a constant \(0 < \phi < 1\): \(h(x) = \lfloor n \cdot (\phi x - \lfloor \phi x \rfloor) \rfloor\)
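To make the two steps concrete, here is a small Python sketch; the table size \(n = 13\) and the choice \(\phi \approx 0.618\) (the fractional part of the golden ratio) are illustrative assumptions, not prescribed above.

```python
def string_to_int(s):
    """Step 1: interpret a_0 a_1 a_2 ... as a_0 + 256*a_1 + 256^2*a_2 + ..."""
    return sum(ord(c) * 256**i for i, c in enumerate(s))

def division_hash(x, n):
    """Step 2, division method: x mod n."""
    return x % n

def multiplication_hash(x, n, phi=0.6180339887):
    """Step 2, multiplication method: floor(n * (phi*x - floor(phi*x)))."""
    frac = (phi * x) % 1.0                  # the fractional part of phi*x
    return int(n * frac)

x = string_to_int("+Dd")                    # ASCII values 43, 68, 100
print(x)                                    # 6571051
print(division_hash(x, 13))                 # a bucket index in T[0, 12]
print(multiplication_hash(x, 13))           # a bucket index in T[0, 12]
```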
Finding a good hash function is not trivial.
Ideally, the hash function maps all elements to distinct buckets of the hash table. This is called direct addressing. It is desirable, yet rarely applicable, since the universe \(U\) is usually too large.
It would be perfect if we knew all keys beforehand and prohibited insertions and deletions, obviously.
The problem: hash collisions!
One possible solution for hash collisions: separate chaining
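To make this concrete, here is a minimal separate-chaining sketch in Python; the table size, the use of the built-in hash(), and all names are illustrative choices, not a reference implementation.

```python
class ChainedHashTable:
    """A hash table resolving collisions by separate chaining."""

    def __init__(self, n=13):
        self.n = n
        self.buckets = [[] for _ in range(n)]        # one chain (list) per bucket

    def _index(self, key):
        return hash(key) % self.n                    # which bucket a key belongs to

    def insert(self, key, value):
        chain = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                             # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))                   # otherwise append to the chain

    def search(self, key):
        for k, v in self.buckets[self._index(key)]:  # scan only one chain
            if k == key:
                return v
        return None

    def delete(self, key):
        i = self._index(key)
        self.buckets[i] = [(k, v) for k, v in self.buckets[i] if k != key]


phone_book = ChainedHashTable()
phone_book.insert('mary', '433-555-798')
print(phone_book.search('mary'))                     # '433-555-798'
```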
Another possible solution for hash collisions: cuckoo hashing, based on open addressing (Pagh and Rodler, 2001).
Basic idea: each key has two possible positions, one in each of two tables, determined by two hash functions. A lookup probes at most these two positions. On insertion, if the desired position is occupied, the new key kicks out the resident (like a cuckoo chick), and the evicted key is reinserted at its alternative position; this may cascade, and if it goes on for too long, the tables are rebuilt with new hash functions.
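A minimal sketch of that scheme, under illustrative assumptions: two tables, the built-in hash() salted with the table index as the two hash functions, and giving up instead of rehashing when an eviction cycle is detected.

```python
class CuckooHashTable:
    """Cuckoo hashing with two tables and two hash functions."""

    def __init__(self, n=13):
        self.n = n
        self.tables = [[None] * n, [None] * n]     # T_0 and T_1

    def _pos(self, which, key):
        # two different hash functions, one per table (salted built-in hash)
        return hash((which, key)) % self.n

    def search(self, key):
        for which in (0, 1):                       # at most two probes per lookup
            slot = self.tables[which][self._pos(which, key)]
            if slot is not None and slot[0] == key:
                return slot[1]
        return None

    def delete(self, key):
        for which in (0, 1):
            pos = self._pos(which, key)
            slot = self.tables[which][pos]
            if slot is not None and slot[0] == key:
                self.tables[which][pos] = None

    def insert(self, key, value, max_kicks=32):
        self.delete(key)                           # simple way to handle duplicate keys
        item, which = (key, value), 0
        for _ in range(max_kicks):
            pos = self._pos(which, item[0])
            # place the item, evicting whatever occupied the slot
            item, self.tables[which][pos] = self.tables[which][pos], item
            if item is None:                       # the slot was free: done
                return
            which = 1 - which                      # evicted item moves to the other table
        raise RuntimeError("eviction cycle: a real implementation would rehash")


t = CuckooHashTable()
t.insert('peter', '239-555-292')
t.insert('paul', '239-555-183')
print(t.search('paul'))                            # '239-555-183'
```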
There are many, many solutions for collision resolution:
There are two types of hash functions: cryptographic and non-cryptographic.
Important characteristics of cryptographic hash functions: reversing a hash is infeasible, a hash value can represent a message, small changes in the input drastically change the output (the avalanche effect), and more…
Non-cryptographic hash functions are more focused towards scalability, i.e. calculation is fast and less resource-intensive. Collision resistance or avalanche properties are less important.
The avalanche effect in action:
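As an illustration (using SHA-256 from Python's standard hashlib module; the input strings are arbitrary), flipping a single character produces a completely different digest:

```python
import hashlib

# Two inputs differing in a single character yield completely different digests.
print(hashlib.sha256(b"hashing is fun").hexdigest())
print(hashlib.sha256(b"hashing is fuN").hexdigest())
```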
Idea when using a good cryptographic hash function: store password hashes instead of the actual passwords:
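A hedged sketch of this idea using PBKDF2 from the standard library; the salt handling and iteration count are illustrative, and real systems typically rely on dedicated password-hashing schemes such as bcrypt, scrypt, or Argon2.

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None):
    """Return (salt, digest); the plaintext password is never stored."""
    if salt is None:
        salt = os.urandom(16)                      # random per-user salt
    digest = hashlib.pbkdf2_hmac('sha256', password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password, salt, stored_digest):
    candidate = hashlib.pbkdf2_hmac('sha256', password.encode(), salt, 100_000)
    return hmac.compare_digest(candidate, stored_digest)  # constant-time comparison

salt, digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, digest))  # True
print(verify_password("wrong guess", salt, digest))                   # False
```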
A data structure making use of (non-cryptographic) hash functions: the Bloom filter.
The data structure consists of a bit array with \(m\) bits and \(k\) different hash functions.
An empty Bloom filter: all bits are set to 0.
Inserting into a Bloom filter: compute the \(k\) hash values of the element and set the corresponding bits to 1.
Querying a Bloom filter: compute the \(k\) hash values of the element and check the corresponding bits.
If at least one of these bits is 0, the element is definitely not in the set!
If all of these bits are 1, the element is probably in the set. Beware of false positives, though!
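A minimal Bloom filter sketch in Python; deriving the \(k\) hash functions by salting SHA-256, and the particular values of \(m\) and \(k\), are illustrative choices.

```python
import hashlib

class BloomFilter:
    def __init__(self, m=96, k=7):
        self.m, self.k = m, k
        self.bits = [0] * m                        # empty filter: all bits are 0

    def _positions(self, item):
        # derive k hash values by salting SHA-256 with k different prefixes
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1                     # set all k corresponding bits

    def query(self, item):
        # False -> definitely not in the set; True -> probably in the set
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
bf.insert('mary')
print(bf.query('mary'))    # True: probably in the set
print(bf.query('peter'))   # False: definitely not in the set (barring a false positive)
```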
Bloom filters come in over 60 variants:
Bloom filters have a strong space advantage: a 1% false-positive probability can be achieved with only 9.6 bits per element, and adding another 4.8 bits per element decreases that probability by a factor of ten.
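These numbers follow from the standard space bound for Bloom filters (assuming an optimal number of hash functions): a false-positive probability \(p\) requires about \(\log_2(1/p)/\ln 2 \approx 1.44 \log_2(1/p)\) bits per element, so \(p = 0.01\) gives \(1.44 \cdot \log_2 100 \approx 9.6\) bits, and every further factor of ten in \(p\) costs another \(1.44 \cdot \log_2 10 \approx 4.8\) bits.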
Acknowledgement:
This script is inspired by Tantau (2016) and Räcke (2019, 2021).