Jeffrey G. Wang

Hutchinson Trace Estimator

The reader is assumed to be familiar with random variables, variance, covariance, moments, and basic linear algebra/matrix calculus.

Here, we write about the Hutchinson Trace Estimator, a core component of my recent paper at EMNLP on model-perturbation-based membership inference attacks against LLMs (MoPe).

Review of Normality: The sum of independent normal random variables is normal, which you can show with moment-generating functions (MGFs): the MGF of a sum of independent random variables is the product of the individual MGFs, and the product of normal MGFs is again a normal MGF.
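Concretely, for independent $X \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $Y \sim \mathcal{N}(\mu_2, \sigma_2^2)$:

$$M_{X+Y}(t) = M_X(t)\,M_Y(t) = e^{\mu_1 t + \frac{1}{2}\sigma_1^2 t^2}\, e^{\mu_2 t + \frac{1}{2}\sigma_2^2 t^2} = e^{(\mu_1 + \mu_2)t + \frac{1}{2}(\sigma_1^2 + \sigma_2^2)t^2}$$

which is exactly the MGF of $\mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.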

Standard Multivariate Normal Vector: We say that a vector is multivariate normal (MVN) if every linear combination of its components is (univariate) normally distributed. We’ll refer to the standard multivariate normal vector as $Z$, where $Z \sim \mathcal{N}(\mathbf{0}, \mathbf{\Sigma})$ and $\mathbf{\Sigma} = \mathbf{I}_{N \times N}$.

This means that every entry of the vector is independent of every other entry, and each entry is normally distributed with mean 0 and variance 1. Independence of two random variables $X$ and $Y$ implies $E[XY] = E[X]E[Y]$, so $\text{Cov}(X, Y) = E[XY]-E[X]E[Y] = 0$, i.e., zero correlation between the two random variables. So independent implies uncorrelated (the converse is false in general).

Moving on…

Now, let’s move onto the Hutchinson Trace Estimator, which is defined as follows: for isotropic random vectors $\epsilon$ (i.e., $E[\epsilon \epsilon^T] = \mathbf{I}$), we can estimate the trace of a matrix $A$ as $$\hat{\text{Tr}}(A) = \frac{1}{m}\sum_{i=1}^m\epsilon_i^TA\epsilon_i$$ Commonly used isotropic vectors include standard MVN and Rademacher vectors. To flesh this out, consider some vector $\epsilon \in \mathbb{R}^n$ that is standard MVN.
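To make this concrete, here is a minimal numpy sketch of the estimator (the function name `hutchinson_trace` is my own, not from any particular library):

```python
import numpy as np

def hutchinson_trace(A, m, probe="rademacher", rng=None):
    """Estimate Tr(A) as (1/m) * sum_i eps_i^T A eps_i."""
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    total = 0.0
    for _ in range(m):
        if probe == "rademacher":
            eps = rng.choice([-1.0, 1.0], size=n)  # +/-1 entries: E[eps eps^T] = I
        else:
            eps = rng.standard_normal(n)           # standard MVN probe
        total += eps @ A @ eps                     # one quadratic form per probe
    return total / m
```

Notice that the estimator only touches $A$ through matrix-vector products, which is the whole point: we never need $A$ explicitly (e.g., a Hessian accessed only via Hessian-vector products).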

Q: What is $E[\epsilon \epsilon^T]$?

A: Note that $\epsilon \epsilon^T$ is an $n \times n$ matrix. Each off-diagonal entry is the product of two independent standard normals, so its expectation is $E[\epsilon_i]E[\epsilon_j] = 0$ (independent implies uncorrelated, as above). Each diagonal entry is $\epsilon_i^2$, whose expectation is the variance of a standard normal, which is 1. Hence $E[\epsilon \epsilon^T] = \mathbf{I}$.
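A quick empirical sanity check of this (a sketch, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 200_000
eps = rng.standard_normal((m, n))    # m independent draws of a standard MVN vector
print(np.round(eps.T @ eps / m, 2))  # empirical E[eps eps^T]: approximately the identity
```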

Unbiasedness

We wish to show that $E[\hat{\text{Tr}}(A)] = \text{Tr}(A)$. To do so, we’ll use tricks with traces.

When working with traces, there are 2 tricks that unlock 90% of our derivations. The first is not so much a trick as a notational thing to remember: the trace of a scalar is itself. The second is the cyclic property, $\text{Tr}(AB) = \text{Tr}(BA)$; with three or more matrices, only cyclic shifts are allowed, not arbitrary reorderings, so those products must be handled with care. One can think of the matrix product $BCD$ as beads on a closed string. One can slide the last bead $D$ around to the front of the other two ($\text{Tr}(BCD) = \text{Tr}(DBC)$), but not interchange any two beads ($\text{Tr}(BCD) \neq \text{Tr}(BDC)$ in general).
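A tiny numerical check of the beads picture (a sketch, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(0)
B, C, D = (rng.standard_normal((4, 4)) for _ in range(3))
print(np.isclose(np.trace(B @ C @ D), np.trace(D @ B @ C)))  # True: cyclic shift
print(np.isclose(np.trace(B @ C @ D), np.trace(B @ D @ C)))  # False (generically): swapped beads
```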

Since expectation is linear and the $\epsilon_i$ are i.i.d., $$E\left[\frac{1}{m}\sum_{i=1}^m\epsilon_i^TA\epsilon_i\right] = \frac{1}{m} \cdot m \cdot E[\epsilon^TA\epsilon] = E[\epsilon^TA\epsilon]$$ Then $$E[\epsilon^TA\epsilon] = E[\text{Tr}(\epsilon^T A \epsilon)] = E[\text{Tr}(\epsilon \epsilon^T A)] = \text{Tr}(E[\epsilon \epsilon^T] A) = \text{Tr}(\mathbf{I}A) = \text{Tr}(A)$$ The first equality holds because $\epsilon^T A \epsilon$ is a scalar and the trace of a scalar is itself. The second is the cyclic property. The third swaps expectation and trace, which is allowed because both are linear. This result shows that the empirical Hutchinson estimator is unbiased for the trace.
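We can also see unbiasedness empirically by averaging many independent estimates (this reuses the hypothetical `hutchinson_trace` sketch from above):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M.T @ M                        # PSD matrix with a comfortably large trace
# average 5,000 independent Hutchinson estimates, each using m = 20 probes
ests = [hutchinson_trace(A, m=20, rng=seed) for seed in range(5_000)]
print(np.mean(ests), np.trace(A))  # should agree up to small Monte Carlo error
```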

Variance

We can write the variance of the estimator using the independence of the $\epsilon_i$ (the variance of a sum of independent terms is the sum of their variances):

$$\text{Var}(\hat{\text{Tr}}(A)) = \frac{1}{m^2} \sum_{i=1}^m \text{Var}(\epsilon_i^TA\epsilon_i) = \frac{1}{m} \text{Var}(\epsilon^T A \epsilon)$$
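As an aside, the per-probe variance has standard closed forms for symmetric $A$ (quoted here, not derived): $$\text{Var}(\epsilon^TA\epsilon) = 2\|A\|_F^2 \;\;\text{(Gaussian probes)}, \qquad \text{Var}(\epsilon^TA\epsilon) = 2\Big(\|A\|_F^2 - \sum_i A_{ii}^2\Big) \;\;\text{(Rademacher probes)}$$ so Rademacher probes are never worse than Gaussian ones in variance.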

Recall that the MSE of an estimator = Variance + Bias$^2$. Since this estimator is unbiased, the RMSE (the root-mean-squared error) decays at a rate of $\frac{1}{\sqrt{m}}$. This is slow!

Note that this slow $\frac{1}{\sqrt{m}}$ error decay is not surprising, as it is typical of Monte Carlo methods.

Improved Hutchinson Estimators

A few recent papers have demonstrated stochastic trace estimators with much better variance properties. One such method is Hutch++, whose variance falls at a rate of $\frac{1}{m^2}$ (so its error falls like $\frac{1}{m}$ rather than $\frac{1}{\sqrt{m}}$). The authors also analyze how performance depends on the spectrum: plain Hutchinson is already accurate in relative terms when the spectrum is balanced, so Hutch++ helps most when a few large eigenvalues dominate, which is super interesting!
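For intuition, here is a minimal sketch of the Hutch++ idea, simplified from the paper (my own implementation, not the authors' reference code): compute the trace of a low-rank piece of $A$ exactly from a random sketch, and run plain Hutchinson only on the residual.

```python
import numpy as np

def hutchpp_trace(A, m, rng=None):
    """Sketch of Hutch++: exact trace on a sketched low-rank part of A,
    plus Hutchinson on the residual. Uses roughly m matrix-vector
    products with A, split into thirds."""
    rng = np.random.default_rng(rng)
    n, k = A.shape[0], m // 3
    S = rng.choice([-1.0, 1.0], size=(n, k))   # sketching probes
    G = rng.choice([-1.0, 1.0], size=(n, k))   # residual probes
    Q, _ = np.linalg.qr(A @ S)                 # orthonormal basis for a top subspace of A
    exact = np.trace(Q.T @ A @ Q)              # trace of the projected part, computed exactly
    G_res = G - Q @ (Q.T @ G)                  # project probes onto the orthogonal complement
    resid = np.trace(G_res.T @ A @ G_res) / k  # Hutchinson on (I - QQ^T) A (I - QQ^T)
    return exact + resid
```

The low-rank piece soaks up the largest eigenvalues exactly, which is why the gains over plain Hutchinson are biggest when the spectrum is top-heavy.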