An Introduction to Randomized Sketching (draft)

In this post, I will make an introductory presentation about sketching, a statistical technique to handle large datasets. First, I will give the intuitive idea behind sketching, which is also the most important and valuable part of this post. Then, I will describe the various sketching algorithms in detail. Finally, I will give a non-exhaustive list of theoretical results concerning the soundness of sketching. Since this post is an introduction, I will build my presentation around Ordinary Least Square, which is the first topic in every machine learning course and arguably the most popular technique.

Table of Contents

  • Intuitive Ideas
    • Ordinary Least Square
    • Projection
    • Subsampling
    • Partial Sketching
  • Sketching Matrices
    • Random Projection
    • Random Sampling
  • Theoretical Guarantees
    • Algorithmic View
    • Statistical View
    • Raskutti and Mahoney’s Results
    • Other Results

Intuitive Ideas

Ordinary Least Square

Let us consider the classic Ordinary Least Square problem. Given a dataset $(X, Y) \in \mathbb{R}^{n \times d} \times \mathbb{R}^n$, where $n \gg d$, we want to calculate a $\hat{\beta} \in \mathbb{R}^d$, optimal in the following sense: [ \hat{\beta} \in \arg\min_\beta \lVert Y - X\beta \rVert_2. ]

For simplicity, we assume $\text{rank} (X)=d$, which is true for most applications. Otherwise, we can reduce the dimension so that this assumption could be valid. When $X$ is of full rank, $\hat{\beta}$ is unique and has the following closed-form formulation: [ \hat{\beta} = X^\dagger Y = (X^T X)^{-1} X^TY, ] where $X^\dagger := (X^T X)^{-1} X$ is the Moore-Penrose inverse.

The computation involves four steps:

  1. $X^T X$, which is a matrix-matrix multiplication, yielding $O(nd^2)$ time complexity.
  2. The inverse of $X^T X$, which yields $O(d^3)$ time complexity.
  3. $X^T Y$, which is a matrix-vector multiplication, yielding $O(nd)$ time complexity.
  4. The multiplication of $(X^T X)^{-1}$ and $XY$, which is again a matrix-vector multiplication, yielding $O(d^2)$ time complexity.

Therefore, the dominant part, or the bottleneck, is the 1st step—$X^T X$. For many modern applications, $n$ may be on the order of $10^6 – 10^9$ and $d$ may be on the order of $10^3 – 10^4$ [1]. Even though the time complexity is only linear in $n$, the gigantic $n$ makes the computation challenging.

For this sake, statisticians have been thinking of reducing the sample size into a more manageable $r$, where $d \lesssim r \ll n$. They came up with two solutions, with one being the subsampling method and the other being the projection method. In the following, I will talk about the projection method first, since it is less straightforward.

Projection

Here, we consider each column of $X$ and $Y$ as a point in $\mathbb{R}^n$, and we have $d+1$ points. Then, reducing sample size is equivalent to projecting these points into $\mathbb{R}^r$. Of course, we need this projection to preserve the relationship between $X$ and $Y$, so that we could recover the desired $\beta$ later.

The existence of such projection is suggested by the Johnson–Lindenstrauss lemma [2]:

For a set $A$ of $m$ fixed points in $\mathbb{R}^n$ and $\varepsilon>0$, there exists a linear map $S: \mathbb{R}^n \rightarrow \mathbb{R}^r$, where $r=O(\tfrac{\log m}{\varepsilon^2})$, such that for any point $u \in A$, we have [ (1-\varepsilon) \lVert u \rVert_2 \le \lVert S(u) \rVert_2 \le (1+\varepsilon) \lVert u \rVert_2. ]

This lemma claims that there exists a linear projection preserving the Euclidean norm of a bunch of points. The existence of such a projection is a strong hint that we could learn a desired $\beta$ in the lower-dimensional space.

Note: Michael W. Mahoney’s description of the Johnson–Lindenstrauss lemma is inaccurate. The probability clause therein is redundant.

The proof of this lemma adopts a probability approach. We construct a random linear projection and then prove that the probability of this random projection satisfying the above property is nonzero. Hence, we prove the existence. Since it is a non-constructing proof, we do not know what this desired projection is. We only know that if we try randomly (and not so hard), we can luckily finish with that desired projection.

Assuming now we have got this desired projection $S$, we want to solve the following problem: [ \hat{\beta}_ {S} := \arg\min_{\beta} \lVert SY - SX\beta \rVert_2, ] which has this closed-form expression: [ \hat{\beta}_ {S} = (SX)^\dagger SY ] Notice how we just replaced $X$ and $Y$ with $SX$ and $SY$ respectively. Therefore, the time complexity is $O(rd^2) \ll O(nd^2)$ once we have completed the projection. The tricky part is that the projection process is a matrix-matrix multiplication, which itself yields $O(nrd) \succsim O(nd^2)$ time complexity.

Warning: $(SX)^\dagger = ((SX)^T SX)^{-1} (SX)^T$ only if $\text{rank}(SX) = d$.

Luckily, not all projections are born equal. Some projections, such as the Hadamard transform [3], are fast to compute.

Subsampling

Probably the most straightforward way of reducing sample size is subsampling. Statisticians use sample distribution to approximate the population distribution, which is their way to understand the world. Ordinary Least Square is used to learn the relationship between $X$ and $Y$, which amounts to learning the joint distribution of $(X, Y)$. Thus, subsampling is a natural candidate for sketching.

Talking about sampling, one can immediately think about uniform sampling, which amounts to sampling each data point with equal probability. This method can work, but we can do better, because not every data point is born equal. We can adopt importance sampling by giving each data point a weight proportional to its importance.

This technique is very similar to novel writing. In a novel, you don’t write about the life of an average Joe; instead, you write about typical characters. If it is a miser person, it must be the most miser person in the world.

Then, how to decide the importance of a data point? Statisticians use the leverage score. The leverage score of matrix $X$ is defined as the diagonal entry of the hat matrix $H=X (X^T X)^{-1} X^T$. One way to remember this formula is to be aware of the fact that [ \widehat{Y} = HY, ] where $\widehat{Y}$ is the predicted output. The hat matrix is like putting a “hat” on $Y$. Also, this fact leads us to realize that the hat matrix is the (Jacobian) derivative of $\widehat{Y}$ on $Y$: [ \frac{\partial \widehat{Y}}{\partial Y} = H. ] In particular, the $i$-th leverage score is the derivative of $\widehat{Y}_i$ on $Y_i$: [ \frac{\partial \widehat{Y}_i}{\partial Y_i}. ] The larger the derivative is, the more sensitive $\widehat{Y}_i$ is to $Y_i$, and the more important the $i$-th data point will be.

With Singular Value Decomposition, we can represent the leverage score with a more concise form. It is just the $\ell_2$-norm of each row of the left singular matrix. Given the singular value decomposition $X=U \Sigma V^T$, we have [ H = U U^T, ] and thus [ H_{ii} = \lVert U_{i \cdot} \rVert_2^2 ]

It is easy to prove that the sum of all leverage scores, or equivalently, the trace of the hat matrix is equal to $d$. [ \text{tr}(H) = \text{tr}(U U^T) = \text{tr}(U^T U) = \text{tr}(I_d) = d. ] Incidentally, the trace of the hat matrix is also the Frobenius norm of $U$.

Once we get the leverage scores, we can either use them directly as the sampling probability (i.e., $\pi_i = \tfrac{H_{ii}}{d}$) or make a tradeoff between them and a prior (oftentimes the uniform distribution): [ \pi_i = (1-\theta) \tfrac{H_{ii}}{d} + \theta q_i, ] where $q$ is the prior and $\theta \in [0,1]$ is an arbitrary constant.

In terms of the time complexity, the bottleneck is (like the projection method) to compute the leverage scores. According to Raskutti and Mahoney, the naive way is to perform a QR decomposition to compute an orthogonal basis for $X$ [4], which yields $O(nd^2)$ time complexity and provides no benefit compared to the original problem. Luckily, one can do the above QR decomposition in an approximate way and hence accelerate the computation [5].

Subsampling sometimes also involves a rescaling step, whose meaning will be evident in the next subsection.

Partial Sketching

The above two methods are full sketching. In this subsection, I will first summarize them and then proceed to the so-called partial sketching.

Similar to the projection method, which is characterized by the projection matrix $S$, the subsampling method can also be characterized by a subsampling matrix. The only difference is that the projection matrix is a dense matrix, whereas the subsampling matrix is a sparse matrix with each row having a single nonzero entry.

Let us denote $\widetilde{X}:=SX$ and $\widetilde{Y}:=SY$, the full sketching methods can be universally formulated as [ \hat{\beta}_ \text{F} := \arg\min_{\beta} \lVert \widetilde{Y} - \widetilde{X}\beta \rVert_2. ] When $\widetilde{X}$ is of full rank, we have furthermore [ \hat{\beta}_ \text{F} = (\widetilde{X}^T \widetilde{X})^{-1} \widetilde{X}^T \widetilde{Y}. ] Noticing that $\widetilde{X}^T \widetilde{Y}$ only brings us marginal benefit compared to $X^T Y$ (from $O(n)$ to $O(r)$), we can apply the sketching only on the first part. This idea gives us the partial sketching: [ \hat{\beta}_ \text{P} = (\widetilde{X}^T \widetilde{X})^{-1} X^T Y. ]

Then, $\widetilde{X}^T \widetilde{X}$ can be regarded as an estimator of $X^T X$. In particular, if $S$ is a subsampling matrix with rescaling, $\widetilde{X}^T \widetilde{X}$ is an unbiased estimator of $X^T X$: \[ \mathbb{E}[\widetilde{X}^T \widetilde{X} | X^T X ] = X^T X \] It will be proved in the next section after we mathematically formulate the subsampling matrix and the rescaling matrix.

Sketching Matrices

Random Projection

There are so far 4 choices of projection matrices (the first two have high time complexity and thus are not practical for sketching):

  • Gaussian random variables
  • Rademacher random variables
  • Randomized Hadamard transform
  • Clarkson-Woodruff sketching

Gaussian random variables. In this case, the entries of the projection matrix $S$ are i.i.d. Gaussian random variables. That is, $S_{ij} \sim \mathcal{N}(0, \sigma^2)$.

Rademacher random variables. The Rademacher distribution is very similar to the Bernoulli distribution, with two exceptions though. The Bernoulli distribution has the support on $\{0, 1\}$, whereas the Rademacher distribution has the support on $\{-1, +1\}$. The Bernoulli distribution has the parameter $\theta$ characterizing the probability of $1$, whereas the Rademacher distribution puts equal probability on $+1$ and $-1$. In this case, the projection matrix consists of entries of independent Rademacher random variables.

Randomized Hadamard transform.[3] Before adding the randomization in, let us first talk about the (deterministic and linear) Hadamard transform, aka Walsh–Hadamard transform. The Hadamard transform $H_m$ is a $2^m \times 2^m$ matrix, dubbed the Hadamard matrix, which defines a linear transform in a $2^m$-dimensional space. The Hadamard matrix is symmetric and consists purely of $\pm 1$, and it has a special structure in that adjacent rows or columns confront each other equally with the same signs and opposite signs. For instance, in [ H_1 = \begin{pmatrix} +1 & +1 \\
+1 & -1 \end{pmatrix}, ] the 2nd row confronts the 1st row first with $+1$ against $+1$ and then $-1$ against $+1$. The Hadamard matrices can be explicitly constructed with the recursive formula: [ H_{m+1} := H_1 \otimes H_m = \begin{pmatrix} H_m & H_m \\
H_m & -H_m \end{pmatrix}. ] A naive application of Hadamard transform involves multiplication by an $n \times n$ matrix, yielding $O(n^2)$ time complexity. Luckily, like the Fourier transform, it has a fast version: a divide-and-conquer strategy yields only $O(n \log n)$ time complexity. That is all for the deterministic Hadamard transform. As for the randomized Hadamard transform, it is just the subsampling + Hadamard transform. Denote it as $S_\text{Had}$, which is defined as $S_\text{Had}:=S_\text{unif}HD$, where $D \in \mathbb{R}^{n \times n}$ is a diagonal matrix with random equiprobable $\pm 1$ entries, $H \in \mathbb{R}^{n \times n}$ is the Hadamard matrix, and $S_\text{unif} \in \mathbb{R}^{r \times n}$ is a uniform sampling matrix. Therefore, the randomized Hadamard transform is simply 1) randomly flipping the sample sign, 2) project on the basis defined by the Hadamard transform, and then 3) uniform subsampling. The time complexity is $O(dn \log n)$.

Note: Daniel Ahfock et al. got the time complexity of the randomized Hadamard transform wrong in their Table 1 [6].

Clarkson-Woodruff sketching.[7] The sketching matrix associated with Clarkson-Woodruff sketching is one having a single randomly chosen nonzero entry in each column, and that nonzero entry takes the value on $\pm 1$ with equal probability. In other words, Clarkson-Woodruff sketching is simply taking all $n$ rows of $(X, Y)$ and randomly smashing them together until it remains only $r$ rows. The time complexity is $O(nd)$, the lowest among all four.

Random Sampling

Let us suppose that we have already got the leverage scores and are prepared to do the subsampling. The sampling probability could be $\{\pi_i\}_ {i=1,\ldots,n}$, with $\pi_i = \tfrac{H_{ii}}{d}$, or it could be a convex combination between the leverage score and a prior. Denote $W \in \mathbb{R}^{r \times n}$ as the weighting matrix, with each row independently following a multinomial distribution parameterized by $\{\pi_i\}_ {i=1,\ldots,n}$. Then, we have $\Pr(W_{ij}=1)=\pi_j$. The rescaling matrix $R \in \mathbb{R}^{n \times n}$ is a diagonal matrix with $R_{ii} = \tfrac{1}{\sqrt{r \pi_i}}$. The final sketching matrix is the product of the weighting matrix and the rescaling matrix $S=WR$.

Now let us prove that $\mathbb{E}[\widetilde{X}^T \widetilde{X} | X^T X ] = X^T X$ as we claimed earlier. \[ \begin{align} \mathbb{E} [\widetilde{X}^T \widetilde{X} | X^T X] &= \mathbb{E}[X^T R^T W^T W R X| X^T X ] \\
&= X^T R^T \mathbb{E}[ W^T W | X^T X ] RX \\
&= X^T R^T \mathbb{E}[ W^T W] RX. \end{align} \] Denote $A:=\mathbb{E}[W^T W]$, then we have \[ A_{ii} = \mathbb{E} \sum_{k=1}^r W_{ki}^2 = \sum_{j=1}^r \pi_i = r \pi_i, \] and $A_{ij} = 0$ for $i \neq j$ because $W_{ki}$ and $W_{kj}$ cannot simultaneously be $1$ (each row of $W$ has a single $1$). Thus $A$ is a diagonal matrix, just like $R$, so we have $R^T A R = I_{n \times n}$. The proof is completed.

Theoretical Guarantees

There are two families of ways to evaluate the sketching effect. One is the algorithmic view, and the other is the statistical view. I will present both views and use Raskutti and Mahoney’s paper [1] as an example. After that, I will briefly mention some other results.

Algorithmic View

The algorithmic view sees the data $(X, Y)$, and it wants to compute the optimal $\hat{\beta}$ so that $\lVert Y - X\hat{\beta} \rVert_2$ is the smallest possible. In terms of sketching, it wants $\hat{\beta}_S$ to achieve the same effect as $\hat{\beta}$. Mathematically, $\frac{\lVert Y - X\hat{\beta}_S \rVert_2^2}{\lVert Y - X\hat{\beta}\rVert_2^2}$ is expected to be close to 1.

The above quantity depends on both $X$ and $Y$, which means that we are at the mercy of what nature offers us. Naturally, we want to quantify the worst-case performance. For this sake, we define the worst-case efficiency of a sketching matrix $S$: [ C_\text{WC}(S) := \sup_Y \frac{\lVert Y - X\hat{\beta}_S\rVert_2^2}{\lVert Y - X\hat{\beta} \rVert_2^2}. ] We take supreme only on $Y$ but not on $X$ because we consider that the design $X$ is for us to choose and that only the measure $Y$ is uncontrollable. In the cases where both $X$ and $Y$ are uncontrollable, we may want to take supreme on both $X$ and $Y$.

Statistical View

The statistical view does not see $(X, Y)$ as what it is. Instead, it considers $(X, Y)$ to be generated by some model $Y = X\beta + \varepsilon$, where $\varepsilon$ is a noise vector with good properties. The unobservable $\beta$ is important in that $\mathbb{E}[Y|X] = X\beta$. In other words, it does not want $X\hat{\beta}$ to be close to $Y$; it wants that quantity to be close to $X\beta$. In terms of sketching, it wants $\frac{\lVert X(\beta - \hat{\beta}_S) \rVert_2^2}{\lVert X(\beta - \hat{\beta}) \rVert_2^2}$ to be close to 1.

The above quantity depends on $X$ and $Y$ (because of $\hat{\beta}$ and $ \hat{\beta}_ S $), where $X$ is oftentimes considered as the fixed design. This time, we do not take supreme on $Y$, since now we have a model governing the (conditional) distribution of $Y$. Instead, we take expectations. For this sake, we define the prediction efficiency of a sketching matrix $S$: [ C_\text{PE}(S) := \frac{\mathbb{E}_\varepsilon \lVert X(\beta - \hat{\beta}_S)\rVert_2^2}{\mathbb{E} _\varepsilon \lVert X(\beta - \hat{\beta}) \rVert_2^2} ]

Raskutti and Mahoney’s Results

In this subsection, I will selectively present some results in Raskutti and Mahoney’s paper [1]. I chose this paper because their results are unambiguously stated and easy to digest. In the following, I will summarize their results in a table, which includes theoretical guarantees associated with sampling with rescaling $S_\text{R}$, sampling without rescaling $S_\text{NR}$, sub-Gaussian projection $S_\text{SGP}$, and randomized Hadamard projection $S_\text{Had}$.

Note: A sub-Gaussian random variable is one with its tail equal to or thinner than the one of a Gaussian random variable. Obviously, Gaussian random variables and random variables with finite support are all sub-Gaussian. Therefore, the Gaussian sketching matrix and the Rademacher sketching matrix mentioned earlier are both sub-Gaussian projections.

There are two elements important to interpret the table. The first is that these theoretical guarantees hold only with probability. The reason is that no matter $C_\text{WC}(S)$ or $C_\text{PE}(S)$, they are all random variables because of $S$. Especially for $C_\text{PE}(S)$, it takes expectation, but only on $\varepsilon$ not on $S$. The other element is that these guarantees all come with a cost: they all require a minimum number of rows $r$ for the sketching matrix.

$S$ $C_\text{WC}$ $C_\text{PE}$ $r$
$S_\text{R}$ $1+O(\tfrac{d}{r})$ $O(\tfrac{n}{r})$ $\Omega(d \log d)$
$S_\text{NR}$ $1+O(\tfrac{d}{r})$ $O(\tfrac{k}{r})$ $\Omega(d \log d)$
$S_\text{SGP}$ $1+O(\tfrac{d}{r})$ $O(\tfrac{n}{r})$ $\Omega(\log n)$
$S_\text{Had}$ $1+O(\tfrac{d}{r} \log nd)$ $O(\tfrac{n}{r} \log nd)$ $\Omega(d \log n(\log d + \log\log n))$

The performance guarantee of $S_\text{NR}$ needs a special condition. That is, the distribution of the leverage scores is severely skewed. A large portion is concentrated on $k$ data points. Hence, there is the $k$ appearing in the numerator of $C_\text{PE}(S_\text{NR})$.

Note: I did not present the residual efficiency in their paper, for I think their proof (Lemma 1 therein) is flawed. I wrote a letter to Prof. Mahoney but did not receive any reply. In general, I cannot guarantee everything I cited are absolutely correct, but if I have a doubt, I will not convey it to my readers.

The above results give the upper bounds of the prediction efficiency. One can ask whether they are tight. Pilanci and Wainwright proved that they are indeed tight [8].

Other Results

The above results focus exclusively on the prediction power of $\hat\beta_S$. One can also question the closeness of $\hat\beta_S$ to $\hat\beta$. Denote the total sum of squares as $\text{TSS}:=Y^T Y$, the residual sum of squares of the original OLS problem as $\text{RSS} := \lVert Y-X\hat{\beta} \rVert_2^2$, and the model sum of squares of the original OLS problem as $\text{MSS} := \lVert X\hat{\beta} \rVert_2^2$. Under certain conditions, we have [9] [ \lVert \hat\beta_F - \hat\beta \rVert_2^2 = O \left(\frac{\text{RSS}}{\sigma_\min^2(X)} \right) ] for the full sketching estimator, where $\sigma_\min(X)$ is the smallest singular value of $X$.

Also, under certain conditions, we have [6] [ \lVert \hat\beta_P - \hat\beta \rVert_2^2 = O \left(\frac{\text{MSS}}{\sigma_\min^2(X)} \right) ] for the partial sketching estimator.

References

  1. A Statistical Perspective on Randomized Sketching for Ordinary Least-Squares  2 3

  2. Extensions of Lipschitz mappings into a Hilbert space 

  3. Faster least squares approximation  2

  4. Matrix Computations 

  5. Fast approximation of matrix coherence and statistical leverage 

  6. Statistical properties of sketching algorithms  2

  7. Low rank approximation and regression in input sparsity time 

  8. Iterative Hessian Sketch: Fast and Accurate Solution Approximation for Constrained Least-Squares 

  9. Improved approximation algorithms for large matrices via random projections 

Written on November 12, 2021