Hashing

Prof. Dr. Mirco Schoenfeld

Recap

Remember searching?

Implementation Search Insert Delete
sorted array \(O( \log n)\) \(O(n)\) \(O(n)\)
unsorted list \(O(n)\) \(O(1)\) \(O(n)\)
binary search tree \(O(h)\) \(O(h)\) \(O(h)\)

REALLY FAST searches

There is a data structure which allows

  • inserting in \(O(1)\),
  • deleting in \(O(1)\), and
  • searching in \(O(1)\)

Too good to be true?

Hashing

At the heart of such a data strcucture lies hashing.

Hashing tries to map data of arbitrary size to fixed-size values.

In other words, we calculate the storage location of an element.

Hashing: Usage

With hashing, our search operation changes slightly.

Before

  • we selected an element for comparison
  • we search by comparing to that element

With Hashing

  • elements are adressed by keys
  • a search is conducted by calculating an address from a key
  • the search yields a reference to the element

Hashing: Usage

Ancient key-value storage. Paperback, ~1900.

Hashing: Usage

phone_book = {'peter' : '239-555-292',
              'paul'  : '239-555-183',
              'mary'  : '433-555-798'}

call(phone_book['mary'])

Contemporary dictionary, python, ~2025.

Hashing: Definitions

We have

  • a universe \(U\) of keys. \(U\) is very big.
  • a subset of actual keys \(S \subseteq U\), \(|S| = m \le |U|\)
  • an array \(T [0,...,n-1]\) which is a hash table
  • a hash function \(h: U \to [0,...,n-1]\)

Hashing: Definitions

Hashing: Definitions

Ideally, a hash function \(h\) should

  • be efficient to evaluate,
  • be small in size (easy to store), and
  • guarantee a good distribution of elements over the hash table

Hashing: Definitions

Definitely, a hash function \(h\) should

  • be non-reversible, i.e. input cannot be inferred from its hash
  • always produce the same hash value to the same input
  • produce a different hash even for very similar input

What is a hash function?

An example hash function

Hashing often comprises of two steps:

  1. transform elements to (potentially large) integers
  2. map large integers to the interval of the hash table

An example hash function

Let’s assume a string consisting of the characters \(a_0a_1a_2\).

The ASCII-values are \(a_0=43\), \(a_0=68\), and \(a_0=100\):

Mapping the input \(a_0a_1a_2\) to a large integer:

\(43 + 256 \cdot 68 + 256 \cdot 256 \cdot 100 = 6571051\)

An example hash function

Mapping a large number \(x\) to the hash table \(T[0,n-1]\):

For example using modulo-operation: \(x \bmod n\)

Given two positive numbers \(a\) and \(n\), \(a\) modulo \(n\) (often abbreviated as \(a \bmod n\)) is the remainder of the Euclidean division of a by \(n\), where \(a\) is the dividend and \(n\) is the divisor.

An example hash function

Mapping a large number \(x\) to the hash table \(T[0,n-1]\):

For example using a multiplication using a decimal number \(\phi\) : \(\lfloor n \cdot (\phi x -\lfloor \phi x \rfloor ) \rfloor\)

An example hash function

Finding a good hash function is not trivial.

Back to Searching

Hashing: Direct Addressing

Ideally, the hash function maps all elements
to distinct buckets of the hash table:

Hashing: Direct Addressing

This is called direct addressing. It is desirable,
yet rarely applicable since \(U\) is usually too large

Hashing: Perfect Hash Function

It’d be perfect if we knew all keys beforehand
and prohibit insertions and deletions, obviously.

Hashing: The problem

The problem: hash collisions!

Hash Collisions: solution using chaining

One possible solution for hash collisions: separate chaining

Hash Collisions: solution using chaining

One possible solution for hash collisions: separate chaining

  • search: obtain \(h(k)\) and search list
  • insert: obtain \(h(k)\) and prepend list: \(O(1)\)
  • delete: doubly linked list: \(O(1)\)
    singly linked list: search + \(O(1)\)
  • worst-case: all elements are mapped to the same bucket… search is then \(O(n)\)

Hash Collisions: solution using cuckoo hashing

Another possible solution for hash collisions:
Cuckoo hashing based on open addressing Pagh and Rodler (2001)

Hash Collisions: Cuckoo hashing

Hash Collisions: Cuckoo hashing

Basic idea:

  • use two hash tables with two independent hash functions
  • each element has two possible locations
  • hash collisions lead to elements being kicked out of their nest

Hash Collisions

There are many many solutions to collision resolution:

  • separate chaining (general principle)
  • open addressing (also a general principle)
  • coalesced hashing
  • hopscotch hashing
  • Robin Hood hashing

Moving away from search and retrieval

Applications of Hash functions

There are two types of hash functions:

Cryptographic Hash Functions

Important characteristics:

Reversing a hash is infesible,
a hash-value can represent a message,
small changes in input drastically change
the output - the avalanche effect,
and more…

Non-CHF

More focused towards scalability,
i.e. calculation is fast and less resource intense.

Collision resistance, or avalanche
properties are less important.

Applications of Hash functions

The avalanche effect in action:

Applications of Hash functions

Idea when using a good cryptographic hash function:
store password hashes instead of the actual passwords:

Application of Hashing: Bloom Filter

A data structure making use of (non-cryptographic) hash functions:
Bloom filter.

  • a space-efficient probabilistic data structure
  • used to test whether an element is in a set
  • false positives are possible (‘possibly in the set’)
  • false negates are not (‘definitely not in set!’)
  • used in huge distributed databases

Application of Hashing: Bloom Filter

The data structure consists of

  • a bit array of \(m\) bits, initally all 0
  • \(k\) hash functions, ideally uniformly distributed and independent
  • the idea that inserting an element into a set is immediately followed by inserting the element into the filter.

Application of Hashing: Bloom Filter

An empty Bloom filter:

Application of Hashing: Bloom Filter

Inserting into a Bloom filter:

Application of Hashing: Bloom Filter

Querying a Bloom filter:

Application of Hashing: Bloom Filter

Querying a Bloom filter:

The element is
definitely not
in the set!

Application of Hashing: Bloom Filter

Beware of false positives, though!

The element is
probably
in the set!

Application of Hashing: Bloom Filter and their many variants

Bloom filters come in over 60 variants:

  • counting Bloom filter
  • distributed Bloom filter
  • Bloom filters for streaming data
  • scalable Bloom filters
  • spatial Bloom filters

Application of Hashing: Bloom Filter and why

Bloom filters have a strong space advantage.

A 1% false positive probability can
be achieved with only 9.6 bits per element.

Adding another 4.8 bits per element
decreases that probability by ten times.

Thanks

https://xkcd.com/2934/

Acknowledgement

Acknowledgement:

This script is inspired by Tantau (2016) and Räcke (2019, 2021).

References

Pagh, Rasmus, and Flemming Friche Rodler. 2001. “Cuckoo Hashing.” In Algorithms — ESA 2001, 121–33. Springer Berlin Heidelberg. https://doi.org/10.1007/3-540-44676-1_10.
Räcke, Harald. 2019. “Lecture Notes to the Lecture Algorithmen und Datenstrukturen.” TUM. https://wwwalbers.in.tum.de/lehre/2019SS/ad/index.html.de.
———. 2021. “Lecture Notes to the Lecture Intro to Linear Programming.” TUM. http://www14.in.tum.de/lehre/2021WS/ea/.
Tantau, Till. 2016. Einführung in Die Informatik. Online. https://www.tcs.uni-luebeck.de/de/mitarbeiter/tantau/lehre/lectures/EinfuehrungindieInformatik-2016.pdf.
Back to Lecture Website