Remember searching?
Implementation | Search | Insert | Delete |
---|---|---|---|
sorted array | \(O(\log n)\) | \(O(n)\) | \(O(n)\) |
unsorted list | \(O(n)\) | \(O(1)\) | \(O(n)\) |
binary search tree | \(O(h)\) | \(O(h)\) | \(O(h)\) |
There is a data structure which allows all three operations in expected constant time.
At the heart of such a data structure lies hashing.
Hashing tries to map data of arbitrary size to fixed-size values.
In other words, we calculate the storage location of an element.
With hashing, our search operation changes slightly.
Before: an ancient key-value storage. Paperback, ~1900.
With hashing: a contemporary dictionary. Python, ~2025:
phone_book = {'peter': '239-555-292',
              'paul':  '239-555-183',
              'mary':  '433-555-798'}
call(phone_book['mary'])
We have a universe \(U\) of possible keys, a hash table \(T[0,n-1]\) with \(n\) buckets, and a hash function \(h: U \to \{0,\dots,n-1\}\) that maps each key to a bucket.
Ideally, a hash function \(h\) should spread the keys uniformly over the buckets, causing as few collisions as possible.
Definitely, a hash function \(h\) should be deterministic and fast to compute.
Hashing often comprises two steps:
Let’s assume a string consisting of the characters \(a_0a_1a_2\).
The ASCII values are \(a_0=43\), \(a_1=68\), and \(a_2=100\):
Mapping the input \(a_0a_1a_2\) to a large integer:
\(43 + 256 \cdot 68 + 256 \cdot 256 \cdot 100 = 6571051\)
Mapping a large number \(x\) to the hash table \(T[0,n-1]\):
For example, using the modulo operation: \(h(x) = x \bmod n\)
Given two positive numbers \(a\) and \(n\), \(a\) modulo \(n\) (often abbreviated as \(a \bmod n\)) is the remainder of the Euclidean division of \(a\) by \(n\), where \(a\) is the dividend and \(n\) is the divisor.
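For instance, assuming a table with \(n = 13\) buckets (an illustrative choice), the number from above would be stored in bucket \(6571051 \bmod 13 = 6\).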
Alternatively, \(x\) can be mapped to the hash table \(T[0,n-1]\) using the multiplication method with a constant \(0 < \phi < 1\): \(h(x) = \lfloor n \cdot (\phi x - \lfloor \phi x \rfloor) \rfloor\)
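To make the two steps concrete, here is a small Python sketch; the table size \(n = 13\) and the choice \(\phi \approx 0.618\) (the fractional part of the golden ratio) are illustrative assumptions, not prescribed above.

```python
def string_to_int(s):
    """Step 1: interpret a_0 a_1 a_2 ... as a_0 + 256*a_1 + 256^2*a_2 + ..."""
    return sum(ord(c) * 256**i for i, c in enumerate(s))

def division_hash(x, n):
    """Step 2, division method: x mod n."""
    return x % n

def multiplication_hash(x, n, phi=0.6180339887):
    """Step 2, multiplication method: floor(n * (phi*x - floor(phi*x)))."""
    frac = (phi * x) % 1.0                  # the fractional part of phi*x
    return int(n * frac)

x = string_to_int("+Dd")                    # ASCII values 43, 68, 100
print(x)                                    # 6571051
print(division_hash(x, 13))                 # a bucket index in T[0, 12]
print(multiplication_hash(x, 13))           # a bucket index in T[0, 12]
```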
Finding a good hash function is not trivial.
Ideally, the hash function maps all elements to distinct buckets of the hash table. This is called direct addressing. It is desirable, yet rarely applicable, since the universe \(U\) is usually too large.
It would be perfect if we knew all keys beforehand and prohibited insertions and deletions, obviously.
The problem: hash collisions!
One possible solution for hash collisions: separate chaining
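To make this concrete, here is a minimal separate-chaining sketch in Python; the table size, the use of the built-in hash(), and all names are illustrative choices, not a reference implementation.

```python
class ChainedHashTable:
    """A hash table resolving collisions by separate chaining."""

    def __init__(self, n=13):
        self.n = n
        self.buckets = [[] for _ in range(n)]        # one chain (list) per bucket

    def _index(self, key):
        return hash(key) % self.n                    # which bucket a key belongs to

    def insert(self, key, value):
        chain = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                             # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))                   # otherwise append to the chain

    def search(self, key):
        for k, v in self.buckets[self._index(key)]:  # scan only one chain
            if k == key:
                return v
        return None

    def delete(self, key):
        i = self._index(key)
        self.buckets[i] = [(k, v) for k, v in self.buckets[i] if k != key]


phone_book = ChainedHashTable()
phone_book.insert('mary', '433-555-798')
print(phone_book.search('mary'))                     # '433-555-798'
```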
Another possible solution for hash collisions: cuckoo hashing, based on open addressing (Pagh and Rodler, 2001).
Basic idea: each key has two possible positions, one in each of two tables, determined by two hash functions. A lookup probes at most these two positions. On insertion, if the desired position is occupied, the new key kicks out the resident (like a cuckoo chick), and the evicted key is reinserted at its alternative position; this may cascade, and if it goes on for too long, the tables are rebuilt with new hash functions.
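A minimal sketch of that scheme, under illustrative assumptions: two tables, the built-in hash() salted with the table index as the two hash functions, and giving up instead of rehashing when an eviction cycle is detected.

```python
class CuckooHashTable:
    """Cuckoo hashing with two tables and two hash functions."""

    def __init__(self, n=13):
        self.n = n
        self.tables = [[None] * n, [None] * n]     # T_0 and T_1

    def _pos(self, which, key):
        # two different hash functions, one per table (salted built-in hash)
        return hash((which, key)) % self.n

    def search(self, key):
        for which in (0, 1):                       # at most two probes per lookup
            slot = self.tables[which][self._pos(which, key)]
            if slot is not None and slot[0] == key:
                return slot[1]
        return None

    def delete(self, key):
        for which in (0, 1):
            pos = self._pos(which, key)
            slot = self.tables[which][pos]
            if slot is not None and slot[0] == key:
                self.tables[which][pos] = None

    def insert(self, key, value, max_kicks=32):
        self.delete(key)                           # simple way to handle duplicate keys
        item, which = (key, value), 0
        for _ in range(max_kicks):
            pos = self._pos(which, item[0])
            # place the item, evicting whatever occupied the slot
            item, self.tables[which][pos] = self.tables[which][pos], item
            if item is None:                       # the slot was free: done
                return
            which = 1 - which                      # evicted item moves to the other table
        raise RuntimeError("eviction cycle: a real implementation would rehash")


t = CuckooHashTable()
t.insert('peter', '239-555-292')
t.insert('paul', '239-555-183')
print(t.search('paul'))                            # '239-555-183'
```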
There are many, many solutions for collision resolution:
There are two types of hash functions: cryptographic and non-cryptographic.
Important characteristics of cryptographic hash functions: reversing a hash is infeasible, a hash value can represent a message, small changes in the input drastically change the output (the avalanche effect), and more…
Non-cryptographic hash functions are more focused towards scalability, i.e. calculation is fast and less resource-intensive. Collision resistance or avalanche properties are less important.
The avalanche effect in action:
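As an illustration (using SHA-256 from Python's standard hashlib module; the input strings are arbitrary), flipping a single character produces a completely different digest:

```python
import hashlib

# Two inputs differing in a single character yield completely different digests.
print(hashlib.sha256(b"hashing is fun").hexdigest())
print(hashlib.sha256(b"hashing is fuN").hexdigest())
```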
Idea when using a good cryptographic hash function: store password hashes instead of the actual passwords:
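A hedged sketch of this idea using PBKDF2 from the standard library; the salt handling and iteration count are illustrative, and real systems typically rely on dedicated password-hashing schemes such as bcrypt, scrypt, or Argon2.

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None):
    """Return (salt, digest); the plaintext password is never stored."""
    if salt is None:
        salt = os.urandom(16)                      # random per-user salt
    digest = hashlib.pbkdf2_hmac('sha256', password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password, salt, stored_digest):
    candidate = hashlib.pbkdf2_hmac('sha256', password.encode(), salt, 100_000)
    return hmac.compare_digest(candidate, stored_digest)  # constant-time comparison

salt, digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, digest))  # True
print(verify_password("wrong guess", salt, digest))                   # False
```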
A data structure making use of (non-cryptographic) hash functions: the Bloom filter.
The data structure consists of a bit array with \(m\) bits and \(k\) different hash functions.
An empty Bloom filter: all bits are set to 0.
Inserting into a Bloom filter: compute the \(k\) hash values of the element and set the corresponding bits to 1.
Querying a Bloom filter: compute the \(k\) hash values of the element and check the corresponding bits.
If at least one of these bits is 0, the element is definitely not in the set!
If all of these bits are 1, the element is probably in the set. Beware of false positives, though!
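A minimal Bloom filter sketch in Python; deriving the \(k\) hash functions by salting SHA-256, and the particular values of \(m\) and \(k\), are illustrative choices.

```python
import hashlib

class BloomFilter:
    def __init__(self, m=96, k=7):
        self.m, self.k = m, k
        self.bits = [0] * m                        # empty filter: all bits are 0

    def _positions(self, item):
        # derive k hash values by salting SHA-256 with k different prefixes
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1                     # set all k corresponding bits

    def query(self, item):
        # False -> definitely not in the set; True -> probably in the set
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
bf.insert('mary')
print(bf.query('mary'))    # True: probably in the set
print(bf.query('peter'))   # False: definitely not in the set (barring a false positive)
```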
Bloom filters come in over 60 variants:
Bloom filters have a strong space advantage: a 1% false-positive probability can be achieved with only 9.6 bits per element, and adding another 4.8 bits per element decreases that probability by a factor of ten.
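These numbers follow from the standard space bound for Bloom filters (assuming an optimal number of hash functions): a false-positive probability \(p\) requires about \(\log_2(1/p)/\ln 2 \approx 1.44 \log_2(1/p)\) bits per element, so \(p = 0.01\) gives \(1.44 \cdot \log_2 100 \approx 9.6\) bits, and every further factor of ten in \(p\) costs another \(1.44 \cdot \log_2 10 \approx 4.8\) bits.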
Acknowledgement:
This script is inspired by Tantau (2016) and Räcke (2019, 2021).