
Subsection 2.1.1 Low rank approximation

Consider this picture of the Gates Dell Complex that houses our Department of Computer Science:

It consists of an \(m \times n \) array of pixels, each of which is a numerical value. Think of the \(j\)th column of pixels as a vector of values, \(b_j \text{,}\) so that the whole picture is represented by columns as

\begin{equation*} B = \left( \begin{array}{c | c | c | c } b_0 \amp b_1 \amp \cdots \amp b_{n-1} \end{array} \right), \end{equation*}

where we recognize that we can view the picture as a matrix. What if we want to store this picture with fewer than \(m \times n \) numbers? In other words, what if we want to compress the picture? To do so, we might identify a few of the columns in the picture to be the "chosen ones" that are representative of the other columns in the following sense: all columns in the picture are approximately linear combinations of these chosen columns.

Let's let linear algebra do the heavy lifting: what if we choose \(k \) roughly equally spaced columns in the picture:

\begin{equation*} \begin{array}{ccc} a_0 \amp = \amp b_0 \\ a_1 \amp = \amp b_{n/k-1} \\ \vdots \amp \amp \vdots \\ a_{k-1} \amp = \amp b_{(k-1)n/k-1}, \end{array} \end{equation*}

where for illustration purposes we assume that \(n \) is an integer multiple of \(k \text{.}\) (We could instead choose them randomly or via some other method. This detail is not important as we try to gain initial insight.) We could then approximate each column of the picture, \(b_j \text{,}\) as a linear combination of \(a_0, \ldots, a_{k-1} \text{:}\)
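As a concrete sketch (in Python with NumPy, using a small made-up matrix in place of the picture and one simple indexing choice for "equally spaced"), selecting the chosen columns amounts to slicing out every \((n/k)\)th column:

```python
import numpy as np

# A hypothetical m x n "picture" (the values are made up for illustration).
m, n, k = 6, 8, 4
B = np.arange(m * n, dtype=float).reshape(m, n)

# Choose k equally spaced columns of B as the columns of A
# (here simply every (n // k)-th column, starting from column 0).
A = B[:, :: n // k]
print(A.shape)  # (6, 4)
```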

\begin{equation*} b_j \approx \chi_{0,j} a_0 + \chi_{1,j} a_1 + \cdots + \chi_{k-1,j} a_{k-1} = \left( \begin{array} { c | c | c } a_0 \amp \cdots \amp a_{k-1} \end{array} \right) \left( \begin{array}{c} \chi_{0,j} \\ \hline \vdots \\ \hline \chi_{k-1,j} \end{array} \right). \end{equation*}

We can write this more concisely by viewing these chosen columns as the columns of matrix \(A \) so that

\begin{equation*} b_j \approx A x_j, \quad \mbox{where} \quad A = \left( \begin{array}{c | c | c} a_0 \amp \cdots \amp a_{k-1} \end{array} \right) \mbox{ and } x_j = \left( \begin{array}{c} \chi_{0,j} \\ \hline \vdots \\ \hline \chi_{k-1,j} \end{array} \right). \end{equation*}

If \(A \) has linearly independent columns, the best such approximation (in the linear least squares sense) is obtained by choosing

\begin{equation*} x_j = ( A^T A )^{-1} A^T b_j, \end{equation*}

where you may recognize \(( A^T A )^{-1} A^T \) as the (left) pseudo-inverse of \(A \text{,}\) leaving us with

\begin{equation*} b_j \approx A ( A^T A )^{-1} A^T b_j. \end{equation*}

This approximates \(b_j \) with the orthogonal projection of \(b_j \) onto the column space of \(A \text{.}\) Doing this for every column \(b_j \) leaves us with the following approximation to the picture:
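A minimal NumPy sketch of this step (with random data standing in for \(A \) and \(b_j \)): rather than forming \(( A^T A )^{-1} \) explicitly, which can be numerically ill-advised, we can solve the least squares problem with `np.linalg.lstsq` and then verify that the residual is orthogonal to the column space of \(A \text{:}\)

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 100, 5
A = rng.standard_normal((m, k))   # chosen columns, assumed linearly independent
b = rng.standard_normal(m)        # stands in for one column b_j of the picture

# Least squares coefficients x_j = (A^T A)^{-1} A^T b_j, computed stably
# via lstsq instead of forming the inverse explicitly.
x, *_ = np.linalg.lstsq(A, b, rcond=None)
b_approx = A @ x                  # orthogonal projection of b onto Col(A)

# For the least squares solution, the residual is orthogonal to Col(A).
assert np.allclose(A.T @ (b - b_approx), 0)
```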

\begin{equation*} B \approx \left( \begin{array}[t]{c | c | c} A \begin{array}[t]{c} \underbrace{( A^T A )^{-1} A^T b_0} \\ x_0 \end{array} \amp \cdots \amp A \begin{array}[t]{c} \underbrace{ ( A^T A )^{-1} A^T b_{n-1}} \\ x_{n-1} \end{array} \end{array} \right), \end{equation*}

which is equivalent to

\begin{equation*} B \approx A \begin{array}[t]{c} \underbrace{ ( A^T A )^{-1} A^T \left( \begin{array}{c | c | c} b_0 \amp \cdots \amp b_{n-1} \end{array} \right)} \\ \left( \begin{array}{c | c | c} x_0 \amp \cdots \amp x_{n-1} \end{array} \right) \end{array} = A \begin{array}[t]{c} \underbrace{ ( A^T A )^{-1} A^T B } \\ X \end{array} = A X . \end{equation*}

Importantly, instead of requiring \(m \times n\) data to store \(B \text{,}\) we now need only store \(A \) and \(X \text{.}\)
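The whole compression can be sketched in a few lines of NumPy (random data stands in for the picture; sizes are made up): solving one least squares problem with all columns of \(B \) as right-hand sides produces \(X = ( A^T A )^{-1} A^T B \) in one call, and \(A X \) is the compressed approximation.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 120, 80, 10
B = rng.standard_normal((m, n))   # stands in for the m x n picture

A = B[:, :: n // k][:, :k]        # k roughly equally spaced columns of B
# Solve for all columns at once: X = (A^T A)^{-1} A^T B, computed via lstsq.
X, *_ = np.linalg.lstsq(A, B, rcond=None)
B_approx = A @ X                  # the rank-k approximation of B

print(X.shape)                    # (10, 80): storing A and X takes (m + n) * k numbers
```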


If \(B \) is \(m \times n \) and \(A \) is \(m \times k \text{,}\) how many entries are there in \(A \) and \(X \text{?}\)

  • \(A \) is \(m \times k \text{.}\)

  • \(X \) is \(k \times n \text{.}\)

A total of \((m+n) k \) entries are in \(A \) and \(X \text{.}\)
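To get a feel for the savings, here is the count for a hypothetical \(1080 \times 1920 \) picture compressed with \(k = 25 \) (the sizes are illustrative, not from the text):

```python
m, n, k = 1080, 1920, 25
full = m * n                # numbers needed to store B
compressed = (m + n) * k    # numbers needed to store A and X
print(full, compressed)     # 2073600 75000, roughly a 27x reduction
```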


\(A X \) is called a rank-k approximation of \(B \text{.}\) Why?


The matrix \(A X \) has rank at most \(k \text{,}\) since each of its columns can be written as a linear combination of the \(k \) columns of \(A \) and hence it has at most \(k \) linearly independent columns.

Let's have a look at how effective this approach is for our picture:


[Rank-k approximations of the picture for \(k = 1 \text{,}\) \(k = 2 \text{,}\) \(k = 10 \text{,}\) \(k = 25 \text{,}\) and \(k = 50 \text{.}\)]

Now, there is no reason to believe that picking equally spaced columns (or restricting ourselves to columns of \(B \)) yields the best rank-k approximation of the picture. It works reasonably well here in part because there is quite a bit of repetition in the picture from column to column. So the question becomes: how do we find the best rank-k approximation of a picture or, more generally, of a matrix? Answering it would let us get the most out of the data that must be stored. It is the Singular Value Decomposition (SVD), possibly the most important result in linear algebra, that provides the answer.
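As a preview (a NumPy sketch with random data in place of the picture), the truncated SVD gives the best rank-k approximation in the 2-norm and Frobenius norm, so it can never do worse than the column-sampling scheme above:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 60, 40, 5
B = rng.standard_normal((m, n))

# Truncated SVD: keep the k largest singular values and their vectors.
U, s, Vt = np.linalg.svd(B, full_matrices=False)
B_k = U[:, :k] * s[:k] @ Vt[:k, :]      # best rank-k approximation of B

# A rank-k approximation built from sampled columns, for comparison.
A = B[:, :: n // k][:, :k]
X, *_ = np.linalg.lstsq(A, B, rcond=None)

err_svd = np.linalg.norm(B - B_k)       # Frobenius norm of the error
err_cols = np.linalg.norm(B - A @ X)
assert err_svd <= err_cols + 1e-10      # SVD is at least as good
```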


Those who need a refresher on this material may want to review Week 11 of Linear Algebra: Foundations to Frontiers [20]. We will discuss solving linear least squares problems further in Week 4.