\documentclass[11pt]{article}
\usepackage{amsmath,amssymb,amsthm}
\usepackage{graphicx}
\usepackage{subcaption}
\usepackage{caption}
\usepackage{algorithm}
\usepackage[noend]{algpseudocode}
\usepackage{multicol}
\setlength{\columnsep}{1cm}
\DeclareMathOperator*{\E}{\mathbb{E}}
\let\Pr\relax
\DeclareMathOperator*{\Pr}{\mathbb{P}}
\newcommand{\eps}{\epsilon}
\newcommand{\inprod}[1]{\left\langle #1 \right\rangle}
\newcommand{\R}{\mathbb{R}}
\newcommand{\handout}[5]{
\noindent
\begin{center}
\framebox{
\vbox{
\hbox to 5.78in { {\bf CS 388R: Randomized Algorithms } \hfill #2 }
\vspace{4mm}
\hbox to 5.78in { {\Large \hfill #5 \hfill} }
\vspace{2mm}
\hbox to 5.78in { {\em #3 \hfill #4} }
}
}
\end{center}
\vspace*{4mm}
}
\newcommand{\lecture}[4]{\handout{#1}{#2}{#3}{Scribe: #4}{Lecture #1}}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{observation}[theorem]{Observation}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{fact}[theorem]{Fact}
\newtheorem{assumption}[theorem]{Assumption}
\newtheorem{example}{Example}
% 1-inch margins, from fullpage.sty by H.Partl, Version 2, Dec. 15, 1988.
\topmargin 0pt
\advance \topmargin by -\headheight
\advance \topmargin by -\headsep
\textheight 8.9in
\oddsidemargin 0pt
\evensidemargin \oddsidemargin
\marginparwidth 0.5in
\textwidth 6.5in
\parindent 0in
\parskip 1.5ex
\begin{document}
\lecture{11 --- October 7, 2015}{Fall 2015}{Prof.\ Eric Price}{Jimmy Xin Lin}
\section{Overview}
In the last lecture, we talked about perfect hashing and the mechanism for
constructing a minimal perfect hash function (MPHF).
In this lecture, we introduce Bloom Filters and the Approximate
Membership Queries that characterize them. Then we describe a variant of the
Bloom filter, the Counting Bloom Filter, which supports deletions. We also take
an in-depth look at the Count-Min Sketch algorithm, as well as the Count Sketch
algorithm.
\section{Hash Tables For Sets}
%{{{
As we know from the previous lecture, hash tables give mappings from a set of
keys to a collection of values. But can we use hash tables for sets?
To implement sets, the operations we need include insert/delete/build,
membership queries (is $x$ in the set $S$?), and cardinality queries ($|S|$).
The hash table implementation for sets requires:
\begin{itemize}
\item[--] $\mathcal{O}(1)$ time per operation.
\item[--] $\mathcal{O}(n)$ words of space (this cannot be reduced to a constant).
\end{itemize}
Note that the word size is $w = \log_2 u > \log_2 n$.
If we use a bit vector to implement sets, $u$ bits are clearly required,
whereas the hash-table implementation only requires $n\log_2 u$ bits.
The following question will lead us to show that this is essentially optimal.
\paragraph{Question:}
How many bits are necessary to store a set of $n$ items in $[u]$?
Hint: how many sets of size $n$ are there?
\begin{align*}
\log \binom{u}{n} &\geq \log \Big(\frac{u}{n}\Big)^n \\
&= n \log\frac{u}{n} \\
&= n\log u - n\log n \\
&\gtrsim n\log u \qquad (\text{when } u \geq n^2, \text{ say})
\end{align*}
which also implies that hash tables are a space-optimal implementation of sets,
up to constant factors, in the deterministic setting.
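To make the bound concrete, here is a worked example with hypothetical
parameters $u = 2^{32}$ and $n = 2^{20}$:
\begin{align*}
\text{bit vector:}\quad & u = 2^{32} \approx 4.3\times 10^{9} \text{ bits} \\
\text{hash table:}\quad & n\log_2 u = 2^{20}\cdot 32 \approx 3.4\times 10^{7} \text{ bits} \\
\text{lower bound:}\quad & \log_2\binom{u}{n} \geq n\log_2\frac{u}{n} = 2^{20}\cdot 12 \approx 1.3\times 10^{7} \text{ bits}
\end{align*}
so the hash table is within a small constant factor of the information-theoretic
minimum, while the bit vector is off by more than two orders of magnitude.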
%}}}
\section{Bloom Filters}
Now we start to discuss whether we can further improve the space cost by
introducing randomness.
\subsection{How does it work?}
%{{{
A Bloom filter is a space-efficient probabilistic data structure which employs
a bit vector $A$ of length $m$ to keep track of membership in a set.
The construction of a (regular) Bloom filter with $k$ independent
hash functions $h_1, \dots, h_k$ is as follows: each element $x$ of a given set
$S$ is hashed to $k$ positions, and the bits at those positions are set to $1$.
Formally, inserting $x \in S$ sets
\begin{align*}
A[h_i(x)] \leftarrow 1, \ \ \forall i \in [k],
\end{align*}
while bits untouched by every insertion remain $0$.
An arbitrary element $y$ is declared a member of the set $S$ if and only if all
of its hashed bits are set, that is, we report
\begin{align*}
A[h_i(y)] = 1 \ \ \forall i \in [k] \ \Longrightarrow \ \text{answer } y \in S
\end{align*}
However, this membership rule can be faulty: all hashed positions of an
element $w \not\in S$ may happen to have been set to $1$ by members of $S$.
Also note that removal of an element is not possible in a regular Bloom
filter.
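The insert and query operations above can be sketched in Python as follows.
This is a minimal illustration, not part of the lecture: salted SHA-256 stands
in for the $k$ independent hash functions, and all names are hypothetical.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit vector of length m with k hash functions."""

    def __init__(self, m, k):
        self.m = m            # number of bits
        self.k = k            # number of hash functions
        self.bits = [0] * m   # the bit vector A

    def _positions(self, x):
        # Derive k pseudo-independent positions h_1(x), ..., h_k(x)
        # by salting SHA-256 with the hash-function index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{x}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, x):
        # Set A[h_i(x)] = 1 for every i in [k].
        for j in self._positions(x):
            self.bits[j] = 1

    def query(self, x):
        # Answer YES iff every hashed bit is set; false positives occur
        # when other items happen to cover all k positions of x.
        return all(self.bits[j] for j in self._positions(x))
```

With $m = 8n$ bits, the analysis below predicts a false positive rate of
roughly $2\%$.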
\begin{figure}[h]
\centering
\includegraphics[width=3in,height=1.2in]{./bloom_filter.png}
\caption{A Simple Bloom Filter}
\end{figure}
Queries to bloom filters can be characterized as Approximate Membership
Queries.
\begin{definition}[Approximate Membership Queries]
Given a query $x$ and a set $S$, the approximate membership query returns
an answer as follows:
\begin{center}
If $x \in S$, answer YES. \\
If $x \not \in S$, usually answer NO.
\end{center}
\end{definition}
With these nondeterministic membership queries, there can be false
positives: \textit{a queried element $x$ is reported to be in the set $S$
when in reality it is not}.
Bloom filters only employ $\mathcal{O}(n\log\frac{1}{f})$ bits, where $f$ is
the false positive rate. If we set $f=2\%$, the filter uses roughly a tenth of
the space of the deterministic hash-table implementation.
%}}}
\subsection{Examples}
%{{{
Below is a list of examples to show the usage of bloom filters in real practices.
\begin{example}[Chrome Browser \cite{BFChrome}] \normalfont
Google Chrome web browser used to use a bloom filter to identify malicious
URLs. Any URL was first checked against a local Bloom filter, and only if the
Bloom filter returned a positive result was a full check of the URL performed.
\end{example}
\begin{example}[Databases] \normalfont
Bloom filters are used to reduce disk lookups for non-existent rows or
columns. The Bloom filter is kept in RAM, while the full content is stored on
disk.
\end{example}
\begin{example}[Bitcoin \cite{BFBitcoins}] \normalfont
The Bitcoin implementation uses Bloom filters to speed up wallet
synchronization, which updates the list of wallets and transaction IDs that
users care about.
\end{example}
%}}}
\subsection{Analysis}
%{{{
We now analyze the false positive rate $f$.
Let $n$ be the number of items, $m$ be the number of buckets, $k$ be the
number of hash functions, $p$ be the fraction of 0s in an array, and $f$ be the
false positive rate. Then we have
\begin{align*}
\E[p] = \Big(1-\frac{1}{m}\Big)^{kn} \approx e^{-\frac{kn}{m}}
\end{align*}
Although we can compute $\E[p]$, the realized $p$ may vary, so the question is
how much it varies. Since each of the $kn$ balls changes the number of zero
bits by at most one, a standard concentration argument shows that, with
probability $1 - \delta$, the number of zeros $\#$ satisfies
\begin{align*}
\# &= \E[\#] \pm \mathcal{O}\Big(\sqrt{n} \log \frac{1}{\delta}\Big)
= \big(1 \pm o(1)\big) \E[\#]
\end{align*}
so $p$ is tightly concentrated around its expectation.
With $\E[p]$ in hand, the false positive rate is
\begin{align*}
f &= (1-p)^k \\
&\approx \Big(1 - e^{-\frac{kn}{m}}\Big)^k
\end{align*}
Now our objective is to choose $k$ to minimize $f$.
\begin{align*}
\arg\min_k \Big(1-e^{-\frac{kn}{m}}\Big)^k
&= \frac{m}{n} \arg\min_z \Big(1-e^{-z}\Big)^{z}
\end{align*}
where $z=\frac{kn}{m}$, so $k = \frac{m}{n}z$ and raising the objective to the
power $\frac{m}{n}$ does not change the minimizer. Applying the logarithm and
then the derivative to this objective,
\begin{align*}
g(z) &= z \ln (1-e^{-z}) \\
g'(z)&= \ln (1-e^{-z}) + z \frac{e^{-z}}{(1-e^{-z})}
\end{align*}
Setting $g'(z) = 0$ gives $z = \ln 2$, and hence $k = \frac{m}{n}\ln2$. Thus,
\begin{align*}
p &= e^{-\frac{kn}{m}} = \frac{1}{2} \ \ \
\longrightarrow \text{ the Bloom filter is balanced (half the bits are set) }\\
f &= (1-p)^k = \Big(\frac{1}{2}\Big)^{\frac{m}{n}\ln2}
\approx (0.6185)^{\frac{m}{n}}
\end{align*}
For example, if $m=8n$ bits, then $f \approx 0.02$.
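The figure $f \approx 0.02$ for $m = 8n$ can be sanity-checked by simulation.
This is a sketch under the assumption that uniformly random probe positions
model ideal hash functions; the function name and parameters are hypothetical.

```python
import random

def simulate_false_positive_rate(n=1000, bits_per_item=8, trials=5000, seed=1):
    """Estimate the false positive rate of a Bloom filter with m = 8n bits
    and k = round((m/n) ln 2) hash functions, using ideal random hashing."""
    rng = random.Random(seed)
    m = bits_per_item * n
    k = round(bits_per_item * 0.6931)  # (m/n) ln 2, rounded to an integer
    bits = [0] * m
    # Insert n items: each item sets k uniformly random positions.
    for _ in range(n):
        for _ in range(k):
            bits[rng.randrange(m)] = 1
    # A non-member is a false positive iff all k probed bits are 1.
    hits = sum(
        all(bits[rng.randrange(m)] for _ in range(k))
        for _ in range(trials)
    )
    return hits / trials
```

The theory above predicts roughly $(0.6185)^8 \approx 0.02$, and the empirical
rate should land near that value.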
%$n$ items take up to $12n$ words, $768n$ bits.
%}}}
\subsection{Counting Bloom Filters}
%{{{
Now we discuss one particular variant of the Bloom filter, the
\textit{Counting Bloom Filter}, which replaces each bit with a small counter
so that removal of an element becomes possible.
%\begin{figure}[h]
% \centering
% \includegraphics[width=2in,height=1.5in]{CountingBF.png}
% \caption{Counting Bloom Filter}
%\end{figure}
Two operations need to be explicitly specified: insert$(x)$ and
membership$(x)$.
\begin{algorithm}[h]
\caption{Counting Bloom Filters}
\begin{algorithmic}[1]
\Statex
\Procedure{insert$(x)$}{}
\For{$ i \in [k]$}
\State $y_{h_i(x)}$ increments by $1$.
\EndFor
\EndProcedure
\end{algorithmic}
\begin{algorithmic}[1]
\Statex
\Procedure{membership$(x)$}{}
\If{$y_{h_i(x)} \geq 1$, $\forall i\in [k]$}
\State \Return YES
\Else
\State \Return NO
\EndIf
\EndProcedure
\end{algorithmic}
\end{algorithm}
As before, we choose $m$ and $k$ so that the false positive rate
$f = \Pr[\text{YES}]$ for a non-member is small. What is the chance that the
given Counting Bloom Filter does not work, i.e., that a counter overflows?
Let $y_j$ be the number of pairs $(x, i)$ with $x \in S$ and $i\in [k]$ such
that $h_i(x) = j$.
This is equivalent to throwing $nk$ balls into $m$ bins.
\begin{align*}
\Pr[y_j \geq t]
&\leq \binom{nk}{t} \cdot \frac{1}{m^t} \\
&\leq \Big( \frac{enk}{mt} \Big)^t \\
&\leq \Big( \frac{e\ln2}{t} \Big)^t
\end{align*}
using $\binom{nk}{t} \leq \big(\frac{enk}{t}\big)^t$ and $k = \frac{m}{n}\ln2$.
If we employ $4$ bits for each counter, overflow means $t = 16$, and
$\Pr[y_j \geq 16] \leq 1.4 \times 10^{-15}$.
In other words, by a union bound,
$\Pr[\text{any $y_j$ overflows its $4$-bit counter}] \leq m \times 1.4
\times 10^{-15}$.
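Putting the two operations and deletion together, a Counting Bloom Filter
might be sketched as follows. This is an illustration, not the lecture's
implementation: salted SHA-256 stands in for the hash family, and Python's
unbounded integers are used instead of capping counters at $4$ bits.

```python
import hashlib

class CountingBloomFilter:
    """Counting Bloom filter: counters instead of bits, enabling deletion."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counts = [0] * m   # y_j: counter for each bucket j

    def _positions(self, x):
        # k pseudo-independent positions from salted SHA-256.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{x}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def insert(self, x):
        for j in self._positions(x):
            self.counts[j] += 1   # increment instead of setting a bit

    def delete(self, x):
        # Only safe for items that were actually inserted.
        for j in self._positions(x):
            self.counts[j] -= 1

    def query(self, x):
        # YES iff every hashed counter is at least 1.
        return all(self.counts[j] >= 1 for j in self._positions(x))
```

With $4$-bit counters the analysis above bounds the per-counter overflow
probability by $(e\ln 2/16)^{16} \approx 1.4\times 10^{-15}$.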
%}}}
\section{Count-Min Sketch}
%{{{
Now we introduce a new sublinear-space data structure, the Count-Min
Sketch (CMS) \cite{CG05}, for the scenario where a given element can appear
multiple times. For CMS, we have a stream of insertions and deletions, and
queries asking how many times an element has appeared.
If every count remains non-negative, this is called the turnstile model.
For example, at a music party you will see lots of people come in and leave,
and you want to know what is happening inside; but you do not want to store
everything that happens, you want to store it more efficiently.
Another application of CMS: suppose you are scanning over a large corpus.
There is a huge number of URLs, and you cannot remember every URL you see,
but you want to estimate how many times you saw a given URL.
What we can do is store a counting Bloom filter.
Because a URL can appear multiple times, how do we answer the query from the
counters? We take the minimum over all hashed counters as the estimate of the
occurrence count of a particular URL. Specifically:
\begin{align*}
\text{query}(x) = \min_i y_{h_i(x)}
\end{align*}
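The notes present CMS as a counting Bloom filter queried with a min: $k$ hash
functions sharing one array of $m$ counters (the published version uses $k$
separate rows, which differs only by constants). A sketch under that reading,
again with salted SHA-256 as an assumed stand-in hash family:

```python
import hashlib

class CountMinSketch:
    """Count-Min Sketch in single-array form: k hash functions, m counters."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.y = [0] * m

    def _positions(self, x):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{x}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def update(self, x, delta=1):
        # Insertion (delta=+1) or deletion (delta=-1): the turnstile model.
        for j in self._positions(x):
            self.y[j] += delta

    def query(self, x):
        # Every counter overestimates C(x), so take the minimum.
        return min(self.y[j] for j in self._positions(x))
```

Since collisions only ever add mass, each counter is an overestimate and the
minimum is the tightest of the $k$ available estimates.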
Note that the previous analysis of counter overflow for counting Bloom
filters still applies here.
The next question is: how accurate is the query?
Let $C(x)$ be the real count of an individual item $x$.
One simple accuracy bound is
\begin{align*}
\forall i, \E[y_{h_i(x)} - C(x)] &= \frac{\big(||C||_1-C(x)\big)\cdot k}{m}
\\
&\leq \frac{||C||_1 \ln2}{n}
\end{align*}
using $k = \frac{m}{n}\ln 2$, which bounds the expected error of each single
hashed counter relative to the true count. Note also that every counter
overestimates: $y_{h_i(x)} \geq C(x)$ always.
For the total number of items $\sum_x C(x)$, we have ($C(x)$ is non-negative)
\begin{align*}
\sum_x C(x) = || C ||_1
\end{align*}
where, in general, for $z \in \mathbb{R}^d$,
$
||z||_1 = \sum_{i=1}^d |z_i|, \
||z||_2 = \Big(\sum_{i=1}^d z_i^2\Big)^{1/2}, \
||z||_{\infty} = \max_{i\in [d]} |z_i|
$.
Note that $m$ does not even need to be larger than $n$: with a huge number of
items, $m$ can be much smaller ($m$ can be millions for billions of URLs).
We now have a bound on the per-counter estimation error in expectation.
However, what we really care about is the query result.
By Markov's inequality, for each $i$,
\begin{align*}
y_{h_i(x)} - C(x) \leq 2 || C ||_1 \frac{k}{m}
\ \ \text{ w.p. at least } \frac{1}{2}
\end{align*}
and these events are independent across the $k$ hash functions. So if we
choose $k = \log{\frac{1}{\delta}}$,
\begin{align*}
\min_i y_{h_i(x)} - C(x) \leq 2 || C ||_1 \frac{k}{m}
\ \ \text{ w.p. } 1-\delta
\end{align*}
If $C$ is concentrated in a few elements, say the $t^{th}$ largest count is
roughly proportional to $\frac{1}{t^\alpha}$ as in a power law distribution,
then most of the total mass $\sum_x C(x)$ sits on the top few elements, and
even a small $m$ lets us estimate the top URLs quite well.
In fact, one can show a better result for CMS: rather than depending on the
full $\ell_1$ norm, the error depends on the $\ell_1$ norm after dropping the
top elements. A few elements may hold almost all of the occurrences; for
example, Google.com has a lot of visitors, and the top few URLs account for
almost all the mass. A given URL may collide with some of these heavy elements
under some of the hash functions, but with good probability at least one hash
function avoids all of them. So given billions of URLs, you can drop the top
roughly $\frac{m}{k}$ of them and bound the error by the $\ell_1$ norm of the
residual URLs; for a power law with $\alpha = 2$, say,
\begin{align*}
\sum_{\text{non-top}\ \frac{m}{k}} C(x) &\approx \frac{1}{m/k} = \frac{k}{m}
\end{align*}
%}}}
\section{Count Sketch}
%{{{
There is another sketching algorithm, called the Count Sketch \cite{CCF02}.
It is basically like CMS, except that each hash function $h_i$ also comes with
a random sign function $S_i : [u] \to \{-1,+1\}$, and updates are added with
their signs. Formally,
\begin{align*}
y_j = \sum_{(i,x):h_i(x)=j} C(x) S_i(x)
\end{align*}
The query is then defined as
\begin{align*}
\text{query}(x) = \text{median}_i \ S_i(x)\, y_{h_i(x)}
\end{align*}
The error guarantee improves from an $\ell_1$-norm bound to an
$\ell_2$-norm bound:
\begin{align*}
\text{error}^2 \lesssim \frac{||C_{\frac{m}{k}}||_2^2 }{\frac{m}{k}}
\end{align*}
where $C_{\frac{m}{k}}$ denotes $C$ with its top $\frac{m}{k}$ entries
dropped.
To see why this is a sensible estimator, suppose everything else hashing to
position $h_i(x)$ is $0$; then $y_{h_i(x)} = S_i(x)C(x)$, so
\begin{align*}
\text{query}(x) = \text{median}_i \big(S_i(x)\big)^2 C(x) = C(x)
\end{align*}
since $S_i(x)^2 = 1$. So when there are no collisions, the query returns
exactly $C(x)$.
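Putting the signed update and the median query together, a Count Sketch might
be sketched as follows. As before, this is an illustration: a salted SHA-256
digest supplies both the bucket $h_i(x)$ and the sign $S_i(x)$, which is an
assumption of this sketch rather than the paper's construction.

```python
import hashlib
import statistics

class CountSketch:
    """Count Sketch: signed counters, median query; errors scale with l2."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.y = [0] * m

    def _hash_and_sign(self, x):
        # One digest per hash function i yields both h_i(x) and S_i(x).
        for i in range(self.k):
            h = int(hashlib.sha256(f"{i}:{x}".encode()).hexdigest(), 16)
            j = h % self.m                       # bucket h_i(x)
            s = 1 if (h >> 128) % 2 == 0 else -1  # sign S_i(x) in {-1, +1}
            yield j, s

    def update(self, x, delta=1):
        # Signed update: colliding items cancel in expectation.
        for j, s in self._hash_and_sign(x):
            self.y[j] += s * delta

    def query(self, x):
        # Median of the signed counters S_i(x) * y_{h_i(x)}.
        return statistics.median(s * self.y[j]
                                 for j, s in self._hash_and_sign(x))
```

Unlike the Count-Min estimate, this one is unbiased: collisions contribute
with random signs, so the median filters out the corrupted counters.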
%}}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\bibliographystyle{alpha}
\begin{thebibliography}{42}
\bibitem[MR]{MR}
Rajeev~Motwani and Prabhakar~Raghavan.
\newblock Randomized Algorithms.
\newblock {\em Cambridge University Press}, ISBN 0-521-47465-5, 1995.
\bibitem[BFChrome]{BFChrome}
Yakunin, Alex.
\newblock Alex Yakunin's blog: Nice Bloom filter application.
\newblock {\em
http://blog.alexyakunin.com/2010/03/nice-bloom-filter-application.html}.
\bibitem[BFBitcoins]{BFBitcoins}
Bitcoin Developer Guide: Bloom Filters.
\newblock {\em https://bitcoin.org/en/developer-guide\#bloom-filters}.
\bibitem[CG05]{CG05}
Cormode, Graham, and S. Muthukrishnan.
\newblock An improved data stream summary: the count-min sketch and its applications.
\newblock {\em Journal of Algorithms}, 55(1), 58-75, 2005.
\bibitem[CCF02]{CCF02}
Moses Charikar, Kevin Chen, and Martin Farach-Colton.
\newblock Finding Frequent Items in Data Streams.
\newblock {\em ICALP}, 2002.
\end{thebibliography}
\end{document}