Count min sketch
Problem: given a stream of data with keys and values, how can we get the sum of all the values for a given key?
Approximate solution: Assume that we have update
), to get the sum of values (estimate
) we
take the hash of the key and return the minimum value of the counters in all the
Images taken from: Algorithms and Data Structures for Massive Datasets

Update

Estimate
https://florian.github.io/count-min-sketch/
Applications
Top k elements, every time we update
the count min sketch we also call estimate
and insert the record
to a min heap, when the heap’s capacity is greater than
Similarity of words, assume that we have a stream of pairs (word, context)
, the problem is to find if two words
A, B are similar in meaning based on the context where they appear, the similarity of two words is computed with:
To solve the problem we can create a matrix of size O(number of words * number of contexts)
.
The intuition behind this formula is that it measures how likely A and B are to occur close to each other (enumerator)
in comparison to how often they would co-occur if they were independent (denominator).
To answer queries we can processes by using a matrix
The solution is to transform the matrix such that the word-context pair frequencies are stored in the count-min sketch, the occurrences of words and contexts are kept in other hash maps.
Range queries Use a segment tree where each node is a CMS
Images taken from: Algorithms and Data Structures for Massive Datasets

Update

Read
e-approximate heavy hitters In a stream where the total number of frequencies is
If
If update(x, 1)
and then estimate(x)
,
if the estimate is
trending hashtags Quantify how different the currently observed activity against an estimate of the
expected activity, for each hashtag store how many times it’s shared in an X-minute window over the
last Y days
The top
Based on https://instagram-engineering.com/trending-on-instagram-b749450e6d93
Bloom filter
Problem: test if an element doesn’t exist in a set
Approximate solution: same as count min sketch, if the returned value is zero then we’re sure the element is not in the set, otherwise, it might be in the set, and we need to test for existence with another (more expensive) data structure
https://florian.github.io/bloom-filters/
Applications
SSTable reads In the read path, Cassandra merges data on disk (in SSTables) with data in RAM (in memtables). To avoid checking every SSTable data file for the partition being requested we can query the SSTable bloom filter.
Reservoir sampling
Problem: given a stream of elements, we want to sample k random ones, without replacement and by using uniform probabilities
Solution: store first