## Schedule

For Fall 2017, the Theory Seminar met on Fridays from 2:00-3:00pm in GDC 4.302 unless otherwise noted.

09/08/2017 - Shachar Lovett, University of California: “Linear Decision Trees for 3-SUM and a Duality with Active Learning”
09/15/2017 - Jelani Nelson, Harvard: "Optimality of the Johnson-Lindenstrauss lemma”
09/22/2017 - Pooya Hatami, University of Texas at Austin: "Approximate Local Decoding of Third-Order Reed-Muller Codes Beyond the List Decoding Radius"
09/29/2017 - John Iacono, New York University, "In Pursuit of the Dynamic Optimality Conjecture"
10/6/2017 - Yevgeniy Dodis, "Fixing Cracks in the Concrete: Random Oracles with Auxiliary Input, Revisited”
10/20/2017 - Vijaya Ramachandran, University of Texas: "Resource-Oblivious Sorting on Multicores”
10/27/2017 - Yang Yuan, Cornell University: "Convergence Analysis of Two-layer Neural Networks with ReLU Activation”
11/10/2017 – Valerie King, Victoria: “The Communication Cost of Broadcasting”
11/17/2017 – Eric Price, University of Texas at Austin: “Condition number-free query and active learning of linear families”
12/1/2017 - Kasper Green Larsen, Aarhus: “Crossing the Logarithmic Barrier for Dynamic Boolean Data Structure Lower Bounds”
12/8/2017 - Bernhard Haeupler, Carnegie Mellon: “Distributed Optimization Algorithms via Low-Congestion Shortcuts”
12/15/17 – Michał Dereziński, UC Santa Cruz: "Fixed Design Linear Regression with a Small Number of Labels"

9/08/2017
Shachar Lovett, University of California San Diego
Linear Decision Trees for 3-SUM and a Duality with Active Learning

Abstract: The 3-SUM problem asks, given n real numbers x_1,...,x_n, whether there exist 3 of them that sum to zero. There is a trivial algorithm that solves it in O(n^2) time, and the best algorithm to date only improves upon it in logarithmic terms. A significantly better algorithm is believed to be impossible (or at least would be very surprising), as it is a bottleneck for many problems in computational geometry and graph theory.

We show that in the linear decision tree model, 3-SUM has a near-linear O(n polylog(n)) algorithm. A linear decision tree is an adaptive algorithm which makes linear queries of the form "sum a_i x_i>0 ?" to the input x, and at the end decides whether a 3-SUM exists. Moreover, the type of queries we use are only *label queries* of the form "x_i+x_j+x_k>0 ?" or *comparison queries* of the form "x_i+x_j+x_k>x_a+x_b+x_c ?". Thus, the queries are all 6-sparse linear queries with {-1,0,1} coefficients. This improves upon previous work of Gronlund and Pettie (improved by Gold and Sharir) which gives an O(n^{3/2}) linear decision tree.

The same technique solves many related problem with near-optimal linear decision tree complexity. For example, for any k, we obtain a linear decision tree for k-SUM which makes O(kn polylog(n)) queries which are each 2k-sparse with {-1,0,1} coefficients. This matches a sparsity lower bound of Ailon and Chazelle. Other examples include sorting sumsets and subset sum.

The proof is based on a connection between the linear decision trees model and active learning. Specifically, it builds upon a new combinatorial notion introduced by the authors in previous work of "inference dimension".

Joint work with Daniel Kane and Shay Moran.
https://arxiv.org/abs/1705.01720

9/15/2017
Jelani Nelson, Harvard
Optimality of the Johnson-Lindenstrauss lemma

Abstract: Dimensionality reduction in Euclidean space, as attainable by the Johnson-Lindenstrauss lemma (also known as "random projections"), has been a fundamental tool in algorithm design and machine learning. The JL lemma states that any n points in Euclidean space can be mapped to m-dimensional Euclidean space while preserving all pairwise distances up to 1+epsilon, where m only needs to be on the order of (log n) / epsilon^2, independent of the original dimension. In this talk, I discuss our recent proof that the JL lemma is optimal, in the sense that for any n there are point sets of size n such that no embedding providing (1+epsilon)-distortion exists into a dimension that is more than a constant factor better than what the JL lemma guarantees. I will also discuss some subsequent work and future directions. Joint work with Kasper Green Larsen (Aarhus University).

Joint work with Kasper Green Larhus (Aarhus University).

9/22/2017
Pooya Hatami, University of Texas at Austin
Approximate Local Decoding of Third-Order Reed-Muller Codes Beyond the List Decoding Radius

Abstract: In this talk I will discuss the question of decoding Reed-Muller codes over \F_2^n beyond their list-decoding radius. Since, by definition, in this regime one cannot demand an efficient exact list-decoder, we seek an approximate decoder which we define as follows: Given a word F and radii r' > r> 0, the goal is to output a codeword within radius r' of F, if there exists a codeword within distance r. As opposed to the list decoding problem, we only require the decoder to output any single codeword with this property, since the list can be too large if r exceeds the list decoding radius.

Such decoders are known for Reed-Muller codes of degree 2, due to works of Wolf and Tulsiani [FOCS 2011]. In joint work with Madhur Tulsiani, we make the first progress on this problem for the degree 3 case where the list decoding radius is 1/8. We show that there is a constant \delta=1/2-\sqrt{1/8}>1/8 and an efficient approximate decoder, that given query access to a function F:\F_2^n \to \F_2, such that F is within distance r = \delta  - \epsilon from a cubic polynomial, runs in time polynomial in n and outputs with high probability a cubic polynomial which is at distance at most r' = 1/2 - \epsilon' from F, where \eps' is a quasi polynomial function of \eps.

9/29/2017
John Iacono, New York University
In Pursuit of the Dynamic Optimality Conjecture

Abstract: In 1985, Sleator and Tarjan introduced the splay tree, a self-adjusting binary search tree algorithm. Splay trees were conjectured to perform within a constant factor as any offline rotation-based search tree algorithm on every sufficiently long sequence---any binary search tree algorithm that has this property is said to be dynamically optimal. However, currently neither splay trees nor any other tree algorithm is known to be dynamically optimal. Here we survey the progress that has been made in the almost thirty years since the conjecture was first formulated.

10/6/2017
Yevgeniy Dodis, New York University
Fixing Cracks in the Concrete: Random Oracles with Auxiliary Input, Revisited

Abstract: We revisit security proofs for various cryptographic primitives in the random oracle model with *auxiliary input* (ROM-AI): a (computationally unbounded) attacker A can compute arbitrary S bits of leakage z=z(O) about the random oracle O *before* attacking the system, and then use additional T *oracle* queries to O *during* the attack. This model was explicitly studied by Unruh in 2007 (CRYPTO 2007), but dates back to the seminal paper of Hellman in 1980 about time-space tradeoffs for inverting random functions, and has natural applications in settings where traditional random oracle proofs are not useful: (a) security against non-uniform attackers; (b) space-time tradeoffs; (c) security against preprocessing; (d) resilience to backdoors in hash functions.

We obtain a number of new results about ROM-AI, but our main message is that ROM-AI is the "new cool kid in town": it nicely connects theory and practice, has a lot of exciting open questions, leads to beautiful math, short definitions, elegant proofs, surprising algorithms, and is still in its infancy. In short, you should work on it!

10/20/2017
Vijaya Ramachandran, University of Texas
Resource-Oblivious Sorting on Multicores

Abstract:
We present SPMS (Sample, Partition, and Merge Sort), a deterministic sorting algorithm
that interleaves the partitioning of a sample sort with merging steps. Sequentially, it
sorts n elements in O(nlog n) steps cache-obliviously with an optimal number of cache
misses. In also runs in O(log n  loglog n) parallel steps, which is  close to the optimal
bound for any parallel sorting algorithm.

SPMS is multithreaded and it has other useful properties, such as low cache miss overhead
and low false sharing cost in a parallel execution. SPMS is resource-oblivious in that
the dependence on machine parameters appears only in the analysis of its performance, and
not within the algorithm itself.

(Joint work with Richard Cole.)

10/27/2017

Yang Yuan, Cornel University
Convergence Analysis of Two-layer Neural Networks with ReLU Activation

Abstract: In recent years, stochastic gradient descent (SGD) based techniques has become the standard tools for training neural networks. However, formal theoretical understanding of why SGD can train neural networks in practice is largely missing.

In this talk, I will talk about our recent work on understanding this mystery by providing a convergence analysis for SGD on a rich subset of two-layer feedforward networks with ReLU activations. This subset is characterized by a special structure called “identity mapping”. We prove that, if input follows from Gaussian distribution, with standard O(1/\sqrt{d}) initialization of the weights, SGD converges to the global minimum in polynomial number of steps. Unlike normal vanilla networks, the “identity mapping” makes our network asymmetric and thus the global minimum is unique. To complement our theory, we are also able to show experimentally that multi-layer networks with this mapping have better performance compared with normal vanilla networks.

Our convergence theorem differs from traditional non-convex optimization techniques. We show that SGD converges to optimal in two phases'': In phase I, the gradient points to the wrong direction, however, a potential function g gradually decreases. Then in phase II, SGD enters a nice one point convex region and converges. We also show that the identity mapping is necessary for convergence, as it moves the initial point to a better place for optimization. Experiment verifies our claims.

This is joint work with Yuanzhi Li from Princeton University.

11/10/2017
Valerie King, University of Victoria

Abstract:  It was long thought that to broadcast a message in a network where each node
knows only its neighbors would require either a "flooding" of the network in which the
message is sent along every edge or the construction of a spanning tree through the use of
very long messages which communicate which nodes have already been seen. The number of bits
of communication in either of these techniques is at least m where m is the number of edges,
possibly much higher than the number of nodes n in the graph, which is clearly a lower
bound.

A method developed in the study of streaming and also dynamic graph data structures for
connectivity gives rise to a fairly simple Monte Carlo protocol to create a spanning tree
(and a minimum spanning tree) in a distributed network with Tilde(n) communication, if the
network is synchronous. My student Ali Mashraghi and I have recently found the first method
for asynchronous networks which is sublinear in m  (for large m).

I will talk about the ideas behind these protocol and related work investigating the
complexity of this problem under varying assumptions.

Bio: Valerie King is Professor of Computer Science at the University of Victoria and a
Fellow of the ACM. She received her PhD under Richard Karp from University of California,
Berkeley, and works in randomized algorithms and data structures, and distributed computing.

11/17/2017
Eric Price, University of Texas at Austin
Condition number-free query and active learning of linear families

Abstract:

We consider the problem of learning a function from samples with L2-bounded noise.  In the
simplest agnostic learning setting, the number of samples required for robust estimation depends
on a condition number that can be arbitrarily large.  We show how to improve this dependence in
two natural extensions of the setting: a query access setting, where we can estimate the function
at arbitrary points, and an active learning setting, where we get a large number of unlabeled
points and choose a small subset to label.

For linear spaces of functions, such as the family of n-variate degree-d polynomials, this
eliminates the dependence on the condition number.  The technique can also yield improvements for
nonlinear spaces, as we demonstrate for the family of k-Fourier-sparse signals with continuous
frequencies.

This is joint work with Xue Chen.

12/1/2017
Kasper Green Larsen, Aarhus
Crossing the Logarithmic Barrier for Dynamic Boolean Data Structure Lower Bounds

Abstract:
In this talk, we prove the first super-logarithmic lower bounds on the cell probe complexity
of dynamic boolean (a.k.a. decision) data structure problems, a long-standing milestone in
data structure lower bounds.
We introduce a new method for proving dynamic cell probe lower bounds and use it to prove
a ?Ω(lg^(1.5)n) lower bound on the operational time of a wide range of boolean data
structure problems, most notably, on the query time of dynamic range counting over F2
([Pat'07]). Proving an ω(lgn) lower bound for this problem was explicitly posed as one of
five important open problems in the late Mihai Patrascu's obituary [Tho'13]. This result
also implies the first ω(lgn) lower bound for the classical 2D range counting problem, one
of the most fundamental data structure problems in computational geometry and spatial
databases. We derive similar lower bounds for boolean versions of dynamic polynomial
evaluation and 2D rectangle stabbing, and for the (non-boolean) problems of range selection
and range median.
Our technical centerpiece is a new way of weakly" simulating dynamic data structures
using efficient one-way communication protocols with small advantage over random guessing.
This simulation involves a surprising excursion to low-degree (Chebychev) polynomials which
may be of independent interest, and offers an entirely new algorithmic angle on the cell
sampling" method of Panigrahy et al. [PTW'10].

Joint work with: Omri Weinstein and Huacheng Yu

12/8/2017
Bernhard Haeupler, Carnegie Mellon
Distributed Optimization Algorithms via Low-Congestion Shortcuts

Title: Distributed Optimization Algorithms via Low-Congestion Shortcuts

Abstract: How fast a non-local distributed optimization problem can be solved in a given
network depends in a highly non-trivial manner on the topology of the network. This talk
will introduce a simple graph structure, called low-congestion shortcuts, which often
tightly characterize and capture this dependency. Low-congestion shortcuts furthermore make
it easy to design near optimal distributed algorithms for a wide variety of problems. For
example, they lead to MST, approximate min-cut and shortest-path algorithms in the CONGEST
model which have the optimal O~(sqrt{n} + D) running times on worst-case network topologies
while also achieving near instance-optimal O~(D) running times on planar, low-genus,
low-treewidth and other non-pathological network topologies.

Based on joint work with Mohsen Ghaffari (ETHZ), Goran Zuzic & Jason Li & David Wajc & Ellis
Hershkowitz (CMU) and Taisuke Izumi (Nagoya IT)

12/15/17
Michał Dereziński, UC Santa Cruz
Fixed design linear regression with a small number of labels

Title: Fixed design linear regression with a small number of labels

Abstract:
We use volume sampling to obtain exact multiplicative loss bounds for linear regression. The
learner is given a fixed set of n input vectors in R^d of a linear regression problem (aka fixed
design). Each vector has a real hidden label. The goal is to approximately solve the linear least
squares problem for all n labeled vectors while seeing only a small number of the labels.

In our most basic setup, the learner selects a random subset of d vectors from the fixed set of
n vectors (without knowing any of the labels). The learner is then given the labels of the chosen
subset of d vectors. We show that if the random subset is chosen based on volume sampling, then
the linear least squares solution for the subset of d labeled vectors:

- is an unbiased estimator of optimum solution on all n labeled vectors,

- and in expectation, the total square loss (on all n labeled vectors) of the solution found for
the subset is by a factor of exactly d+1 larger than the minimum achievable total square loss
(provided the vectors are in general position).

We extend volume sampling to subsets of size larger than d, offering further statistical
guarantees on the total square loss. Our results are elementary and we will describe our proof
method for deriving unbiased matrix estimators. Surprisingly in some cases our method produces
exact formulas for the variance of the estimators as well. Finally we present an efficient
algorithm for volume sampling.

Joint work with Manfred Warmuth.