From Gibbs to GigaBytes: TurboQuant, Rate-Distortion, and Statistical Mechanics

A detailed walkthrough of TurboQuant KV compression, along with some thoughts from reading it.

Google Research recently published a new method for KV compression, called TurboQuant. The method is quite impressive.At least on the surface. There are actually quite a few flaws with how they benchmark results (like, comparing their method on a GPU against KIVI running on CPU. lol). It struck me in particular because they show that it achieves compression close to the information-theoretic limit, while keeping model quality nearly unchanged. There are also some peculiar parallels to systems from statistical mechanics, so I decided to walk through the paper in detail.

Problem

KV cache memory is one of the biggest bottlenecks in frontier LLM deployment. The task of the KV cache is to optimize the computation of this equation:

\[\text{Attention}(Q,K,V) = \text{softmax}\left( \frac{QK^\top}{\sqrt{d_k}}\right) V\]

which gives us the scores of the attention block in transformer models. $Q \in \mathbb{R}^{m\times d_k}$, $K\in \mathbb{R}^{n\times d_k}$, and $V\in \mathbb{R}^{n\times d_v}$ are the query, key, and value matrices (i.e., the stacked query, key, and value vectors for all positions). $d_k$ is the dimension of the attention head. What this equation does is essentially: for every query token, we compute the “interaction” with all previous tokens, and use that to weight the value vectors.

In principle, for every new token in $Q$, we would have to recompute all the entries in $K$ and $V$ for all previous tokens, despite the fact that their entries didn’t change at all. So, in order to avoid this recomputation, we store them in the…KV Cache. We can just look them up from memory.One might wonder whether recomputing $K$ and $V$ on the fly would be cheaper than storing and loading them. During decoding, however, the GPU is memory-bandwidth-bound rather than compute-bound: at batch size 1 or small batches, tensor cores are underutilized and the bottleneck is transferring data from HBM, not performing arithmetic. Recomputing $K$ and $V$ requires loading the projection weight matrices $W_K$ and $W_V$ from HBM on every step, which costs bandwidth proportional to the model size, independent of context length. Loading the KV cache, on the other hand, costs bandwidth proportional to context length. The advantage of caching therefore grows with context: at short contexts the KV cache is small and adds little overhead; at long contexts, where KV cache memory pressure is most acute, caching dominates. The downside of this approach is that this KV Cache can get huge, really huge. Just as an example, for a Llama-3.1-70B model with 80 layers, 8 KV heads (Grouped Query Attention), and head dimension $d_k = d_v = 128$, the KV cache for a single token in fp16 is

\[\underbrace{80}_\text{layers} \times \underbrace{8}_\text{KV heads} \times \underbrace{2}_\text{K and V} \times \underbrace{128}_\text{head dimension} \times \underbrace{2\,\text{bytes}}_\text{bytes per element} \approx 320\,\mathrm{KB}\]

This scales linearly with context length. With a context length of 128k tokens, this amounts to roughly $40\,\text{GB}$. For reference, the model weights themselves at fp16 are roughly $70 \times 10^9 \times 2\,\text{bytes} \approx 140\,\mathrm{GB}.$ If the model is served at int4, it’s around $35\,\mathrm{GB}$, so the KV Cache would be larger than the model itself! And this is even without batching for multiple users.

Then, there is also the latency problem. The KV cache is stored in the main GPU memory, the HBM, and for every new token, the entire cache must be streamed from HBM into the on-chip SRAM to compute attention. This makes the decoding phase heavily memory-bound: the GPU’s tensor cores sit underutilized while waiting on memory transfers.

All in all, we need ways to compress the KV cache while still ensuring that the model’s quality is high — in particular, that the distortion to the query-key inner product $\langle q,k\rangle$ is minimal. This problem is fundamental in information theory and dates back to Claude Shannon. Shannon established a lower bound on the distortion imposed by any compression. As we will see, the method introduced by Zandieh et al. gets very close.

Compression

There are two kinds of compression schemes: lossless compression and lossy compression. When you zip a file, you are dealing with lossless compression. There is a preservation of information, which is great since we want to reconstruct the exact same file from the compressed version. TurboQuant is a lossy compression scheme. We reduce the precision, but try to do so under minimum distortion.

TurboQuant tl;dr:

TurboQuant specifically targets key vectors, while value vectors are compressed separately using a simpler scalar scheme. There are two targets which are addressed in two steps:

  1. Minimize MSE distortion: Rotate key vectors randomly, do Lloyd-Max tessellation, and quantize each coordinate into $b-1$ bits.
  2. Minimize inner product distortion: Reconstruct the key vector from the compressed version, measure the residual, and store its magnitude in fp32 and each coordinate with $1$ bit.

TurboQuant

The overall encoding scheme of TurboQuant looks like this. We will next go through each step in detail.

Overview of the TurboQuant encoding pipeline.

Bit budget

Splitting the bit budget: $b-1$ bits for MSE quantization, 1 bit for the QJL residual correction.

We have a bit budget of $b$ bits per coordinate. We capture the bulk information with MSE quantization and store this in $b-1$ bits. We use the remaining bit for correcting the inner product with QJL.

Rotation

A random rotation $\Pi$ decorrelates the coordinates of the key vector.

We start by applying a random rotation matrix to the key vector:

\[y = \Pi \cdot k \in \mathbb{R}^d\]

The result is that for large $d$, the coordinates become approximately i.i.d.

\[y_i \sim \mathcal{N}\left(0,\frac{1}{d}\right)\]

The previously correlated coordinates become nearly independent, and the quantization reduces to a 1D scalar quantization problem under a known distribution.

Details behind this result
Slicing argument: fixing one coordinate of a point on the unit sphere constrains the remaining coordinates to a sub-sphere.

For a point drawn from the unit hypersphere $S^{d-1}$, the marginal distribution of any single coordinate follows from a slicing argument: if we fix one coordinate $x_1 = t$, this constrains the remaining coordinates to a “sub”-sphere of radius $\sqrt{1-t^2}$, because $|x|^2 = 1$.With $x_1 = t$, the remaining coordinates must satisfy $x_2^2 + x_3^2 + \dots x_d^2 = 1-t^2$. The surface area of this sphere scales as $(1-t^2)^{(d-2)/2}$. Together with the Jacobian factor from the projection, $(1-t^2)^{-1/2}$, the marginal density is

\[f(t) \propto (1-t^2)^{(d-3)/2}.\]

This is a symmetric Beta distribution with parameters $\alpha = \beta = (d-1)/2$ and variance $1/d$. As $d\to \infty$, $\alpha = \beta \to \infty$ and the Beta distribution concentrates tightly around zero, converging to $\mathcal{N}(0,1/d)$.

This is precisely what the rotation $\Pi$ achieves here. $\Pi$ is drawn from the Haar measure, which is the uniform distribution over all rotations, so $\Pi k$ is a uniformly random rotation of $k$. Assuming $k$ is normalized to unit norm, its coordinates follow this distribution regardless of what $k$ actually was. Any structure or correlation in the original key vector is randomized away, and every coordinate ends up with the same known marginal. The codebook can therefore be designed once for $\mathcal{N}(0,1/d)$ and reused universally — across every token, every layer, every model.

Tessellation

Each rotated coordinate is assigned to its nearest centroid from a precomputed Lloyd-Max codebook.

The rotated coordinates are grouped together at $2^{b-1}$ centroids via Lloyd-Max and quantized according to their index. The special part here is that we know the underlying distribution, and thus we can solve for the centroid coordinates only once, and reuse the result in all following iterations with a lookup table. All we need to remember are the indices of each coordinate.

The assignment rule is

\[\text{idx}_j = \arg \min_k |y_j-c_k|\]

We search for the index of the centroid which has the minimal absolute distance to the rotated vector $y$, in each of its coordinates. What we store in cache are the centroid indices as

\[\mathrm{idx} \in \{ 0,\dots, 2^{b-1}-1\}^d\]

So, for a bit budget of $b=3$, there would be 4 centroid indices $00, 01, 10, 11$.

This problem — quantizing the output from a Gaussian sourceRecall that the rotated vectors $y = \Pi k$ are approximately Gaussian. — is fundamental to information theory and dates back to the work of Claude Shannon. His rate-distortion theory establishes the hard floor: for a Gaussian source, no quantizer of any kind can achieve distortion below $4^{-b}$ at $b$ bits per coordinate. This is the information-theoretic minimum.

Shannon's rate-distortion limit

At $b$ bits per coordinate, we can distinguish at most $2^b$ different values. But no matter how cleverly we place those bins, there will always be some values that are genuinely different that end up in the same bin and thus become indistinguishable after compression. This introduces an inevitable reconstruction error.

Now, a Gaussian source has a specific amount of information content, which can be measured by its differential entropy

\[H(X) = \frac{1}{2}\log(2\pi e\sigma^2).\]

Shannon showed that this entropy implies a rate-distortion function

\[D(R) = \sigma^2\cdot 2^{-2R},\]

which at $R=b$ bits per coordinate gives

\[D=\sigma^2\cdot 4^{-b}.\]

This is the theoretical minimum distortion achievable by any compression scheme at $b$ bits per coordinate.

This holds for any compression scheme, be it scalar, vector, or otherwise. Here, we are using scalar quantization, meaning that each coordinate is quantized independently. This is computationally attractive but comes at the cost that a scalar quantizer cannot exploit correlations between coordinates, and even after the random rotation has made them nearly independent, quantizing one at a time wastes some capacity.

Panter and Dite have shown that the best scalar quantizer cannot achieve a distortion below

\[D \sim \frac{\pi\sqrt{3}}{2}\cdot 4^{-b},\]

sitting within a fixed factor of $\pi\sqrt{3}/2 \approx 2.72$ above the Shannon floor at all bit widths.The prefactor comes from the optimal bin placement: narrow bins where the Gaussian is dense, wide bins in the tails. The high-resolution asymptotic from Bennett (1948) gives that within each quantization cell the error variance is $\Delta^2/12$, where $\Delta$ is the cell width. Panter and Dite (1951) then showed that optimizing $\Delta(x)$ across a source density $f(x)$ gives a total distotion of $4^{-b}\cdot \frac{1}{12} \cdot \left(\int f(x)^{1/3} dx\right)^3$. For a Gaussian, $\frac{1}{12}\left(\int f(x)^{1/3} dx\right)^3 = \frac{\pi\sqrt{3}}{2}\sigma^2$, giving the Panter-Dite bound $\frac{\pi\sqrt{3}}{2}\sigma^2\cdot4^{-b}$. This is the unavoidable overhead that cannot be reduced by a better choice of centroids, only by moving to vector quantization.

We can test this by sampling points from a standard Gaussian $\mathcal{N}(0,1)$, then applying Lloyd-Max with $2^b$ levels, and measuring the distortion $D_\text{mse}(b)$ as the mean-squared error with an evaluation set (points sampled from the same Gaussian, but with a different random seed).

Left: Lloyd-Max MSE distortion as a function of bit budget $b$, compared to the Shannon lower bound and the Panter-Dite bound. Right: distortion normalized by $4^b$ converges to the Panter-Dite constant $\pi\sqrt{3}/2 \approx 2.72$.

The empirical markers sit below the Panter-Dite bound at every bit width, confirming that Lloyd-Max achieves near-optimal distortion. In the right plot, each value is normalized by $4^b$ to show the convergence to the Panter-Dite constant of $\pi\sqrt{3}/2 \approx 2.7207$. What’s interesting to note is that for $b=1$, the prefactor is only around 1.457, meaning that at the tightest compression, we are actually the closest to the Shannon bound.

Why is that?

tl;dr: The Shannon bound is achievable only by vector (block) quantizers with infinitely long blocks, where samples can be encoded jointly. At $b=1$, however, both scalar and vector quantizers are equally constrained by the coarseness of a two-level representation, leaving little room for a joint scheme to outperform a scalar one.

To understand why that is, we need to think about what the Shannon bound actually represents at a given bit width. The Shannon bound of $4^{-b}$ is the minimum distortion achievable by any quantizer (scalar, vector, whatever) at $b$ bits. But this bound is only achieved by an ideal vector quantizer operating on infinitely long blocks.

A block quantizer does not compress one sample at a time, but rather looks at a block of $n$ samples together and encodes them jointly. With such a joint code, we can redistribute the bit budget across the block (spending more bits where the signal is large, fewer where it is near zero), while keeping the total at $nb$ bits. As $n\to\infty$, the redistribution becomes increasingly efficient and the distortion approaches the Shannon bound from above. Scalar quantization is the special case of $n=1$: we look at one sample at a time and must commit to a fixed codebook, with no knowledge of what other samples look like. This is the overhead captured by the Panter-Dite constant.

Now, why is this gap smallest at $b=1$? With one bit per coordinate, there are only two levels, and the optimal placement is fully determined: two centroids sitting at $\pm \sqrt{2/(\pi d)}$ with the threshold at zero. A vector quantizer with the same budget of 1 bit per coordinate would face the same constraint: representing a continuous Gaussian with just two values will be coarse regardless of how sophisticated the scheme is. Both scalar and vector quantization are here equally limited by the coarseness of the representation, so the gap between them is small.

This changes with higher $b$. With more levels, a vector quantizer can exploit increasing design freedom across coordinates jointly, adapting to the actual realization of each block. A scalar quantizer cannot do this; it applies the same codebook to every coordinate independently. The gap thus widens as $b$ grows, and the normalized distortion $4^b \cdot D_\mathrm{mse}(b)$ rises from 1.457 at $b=1$ toward the asymptotic ceiling of 2.72.

Derivation for $b=1$

For a single coordinate $x\sim \mathcal{N}(0,\sigma^2)$ with $\sigma=1/\sqrt{d}$, the optimal 1-bit quantizer assigns every positive $x$ to centroid $+c$ and every negative $x$ to centroid $-c$, with threshold at zero. The optimal $c$ minimizes the expected squared reconstruction error:

\[\min_{c} \ \mathbb{E}[(x-c)^2 \ | \ x>0]\]

The minimizer is the conditional mean, $c=\mathbb{E}[x\ |\ x>0]$. A standard Gaussian distribution $X \sim \mathcal{N}(0,\sigma^2)$ has the probability density function (PDF)

\[f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{x^2}{2\sigma^2}}.\]

When restricted to $X>0$, the total area under the curve is $0.5$, so we multiply by $2$ to normalize it, which gives us the half-normal distribution PDF:

\[f_{X>0}(x) = \frac{\sqrt{2}}{\sigma\sqrt{\pi}}e^{-\frac{x^2}{2\sigma^2}}.\]

We get the expected value $\mathbb{E}[X]$ by integrating $x\cdot f(x)$ from $0$ to $\infty$:

\[\mathbb{E}[X] = \int_0^\infty x \frac{\sqrt{2}}{\sigma \sqrt{\pi}}e^{-\frac{x^2}{2\sigma^2}} \ \mathrm{d}x\]

Substituting $u=\frac{x^2}{2\sigma^2}$ with $\mathrm{d}u = \frac{x}{\sigma^2}\ \mathrm{d}x$:

\[\mathbb{E}[X] = \frac{\sqrt{2}}{\sigma\sqrt{\pi}}\int_0^\infty \sigma^2e^{-u} \ \mathrm{d}u = \frac{\sigma^2\sqrt{2}}{\sigma \sqrt{\pi}}[-e^{-u}]_0^\infty = \sigma \sqrt{\frac{2}{\pi}}\]

With $\sigma = \frac{1}{\sqrt{d}}$, we get

\[c = \mathbb{E}[x \ | \ x>0] = \sigma \sqrt{\frac{2}{\pi}} = \frac{1}{\sqrt{d}} \cdot \sqrt{\frac{2}{\pi}} = \sqrt{\frac{2}{\pi d}}\]

Reconstruction & Residual

The Lloyd-Max reconstruction $\tilde{k}$ is subtracted from $k$ to give the residual $r$, which is then compressed separately.

During encoding, we immediately perform a reconstruction of the key vector, but because the quantization is lossy, the reconstructed key vector $\tilde{k}$ will be slightly different from the actual key vector $k$. What we do is to rotate the centroid look-up table back into the original space:

\[\tilde{k} = \Pi^\top \cdot c_\text{idx} \in \mathbb{R}^d\]

From that, we can calculate the residual as the difference

\[r = k-\tilde{k} \in \mathbb{R}^d\]

This is the stuff that the MSE stage failed to capture. It is as small as it can be given the $b-1$ budget; it’s the minimal MSE from the Lloyd-Max stage, but we still need to correct for it to get the attention scores right. So we compress this as well. In particular, we are using a Quantized Johnson-Lindenstrauss transform, which allows us to quantize the residual vector into a single bit per coordinate.How fitting, we happen to have one bit left in the budget.

Residual compression

Residual compression via QJL: a random Gaussian projection followed by a sign operation, plus the residual norm $\|r\|_2$.

In short: we apply a random Gaussian matrix and take the sign. This gives us 1 bit per coordinate, which we store in the vector $\sigma \in {-1,+1}^d$:

\[\sigma = \text{sign}(Sr)\]

And we also store the magnitude of $r$ as a scalar, as we need that for reconstruction.

How QJL works

QJL in general

QJL builds on the Johnson-Lindenstrauss lemma: a random matrix $S\in\mathbb{R}^{m\times d}$ preserves inner products in expectation,

\[\mathbb{E}[\langle Su, Sv\rangle ] = m\langle u,v\rangle\]

with error shrinking as $m$ grows. QJL uses a square matrix $S\in \mathbb{R}^{d\times d}$ (no dimensional reduction) and compresses by storing only the sign of each projected coordinate:

\[\sigma = \text{sign}(Sx) \in \{ -1,+1\}^d\]

We go from 32 bits per coordinate to 1 bit. To recover the magnitude, QJL also stores $|x|$ as a single float32. The estimator for the inner product of $q$ and $x$ is then:

\[\widehat{\langle q,x\rangle} = \frac{\sqrt{\pi/2}}{d} \cdot \|x\|\cdot \langle Sq,\sigma\rangle\]

with $Sq$ computed in full float. The factor $\sqrt{\pi/2}$ corrects for the sign shrinkage and $d^{-1}$ normalizes for summing $d$ terms.

QJL on the residual

After Lloyd-Max at $b-1$ bits, we have a reconstruction $\tilde{k}$ and residual $r = k-\tilde{k} \in \mathbb{R}^d$. The inner product splits as

\[\langle q,k\rangle = \langle q,\tilde{k}\rangle + \langle q,r\rangle\]

We already have the first term from the Lloyd-Max reconstruction. For the second, we apply QJL to $r$ using the remaining 1 bit of the budget:

\[\sigma = \text{sign}(S\cdot r), \quad \text{store}\ \sigma \ \text{and} \ \|r\|\]

At query time:

\[\langle q,k\rangle \approx \langle q,\tilde{k} \rangle + \frac{\sqrt{\pi/2}}{d} \cdot \|r\|\cdot \langle Sq,\sigma\rangle\]

which gives us an unbiased estimator of the true inner product.

You may ask why we do not just apply QJL to $k$ directly. The reason is that the variance of the QJL estimator is proportional to $|x|^2$. Lloyd-Max has already minimized $|r|$, so the noise from the 1-bit quantization is already as small as it could possibly be. If we applied QJL to the full $k$, the variance would be proportional to $|k|^2$, which is much larger. I show this in more detail below.

Recap - What do we got?

We now have covered the encoding pipeline from TurboQuant, which happens for every new token, both during prefill (where we encode all prompt tokens in parallel) and during decode (where each newly generated token is encoded one by one). Instead of storing the full key vectors, our key cache now consists of three entries per vector:

By a napkin estimate, this reduces the memory footprint by a factor of more than $5\times$ for $b=3$, and approximately $8\times$ for $b=2$.

Napkin estimate

A key vector in float16 costs $d\times 2\,\text{bytes}$. The encoded representation costs

Component Size
idx ($b-1$ bits/coord) $\frac{d(b-1)}{8}$ bytes
σ ($1$ bit/coord) $\frac{d}{8}$ bytes
‖r‖₂ (one float32) 4 bytes
Total $\frac{db}{8}+4$ bytes

For large $d$, the scalar 4 bytes is negligible, which gives a compression ratio of:

\[\frac{2d}{db/8} = \frac{16}{b}\]

So approximately $5.3\times$ at $b=3$, and approximately $8\times$ at $b=2$, as we go from 16 bits to $b$ bits per coordinate.

Decode

At attention time, we do the decoding step, where we want to use the reconstructed key vectors together with the value vectorsIn the paper, the authors use an outlier-aware strategy where a small set of high-magnitude channels get 3 bits and the remaining channels get 2 bits. (which we may have compressed using the same strategy) for calculating the inner product.

We achieve this in two steps. First, we look up the centroid coordinates corresponding to the stored indices. We know these analytically, since they were precomputed once offline by solving the Lloyd-Max problem on the known Beta distribution for our specific bit width. We then rotate them back to the original key space using $\Pi^\top$, recovering the lossy MSE reconstruction $\tilde{k} = \Pi^\top c_\mathrm{idx}$. Second, we correct this reconstruction using the stored QJL representation, reversing the residual compression. Together, the inner product is estimated as:

\[\begin{aligned} \langle q, \tilde{k}+\tilde{r}\rangle &= \langle q,\Pi^\top c_\mathrm{idx}\rangle + \frac{\sqrt{\pi/2}}{d} \cdot \|r\|_2\cdot \langle q,S^\top\sigma\rangle \\ &= \langle q,\Pi^\top c_\mathrm{idx}\rangle + \frac{\sqrt{\pi/2}}{d} \cdot \|r\|_2\cdot \langle Sq,\sigma\rangle \end{aligned}\]

The second line uses the identity $\langle q,S^\top\sigma\rangle = \langle Sq, \sigma\rangle$, which means we pre-multiply the query by $S$ once and then dot with the stored sign vector. This keeps the query side in full float precision and only the key side quantized.

Some final thoughts

QJL vs TurboQuant

The TurboQuant method is an advancement from the authors’ previous paper on Quantized Johnson-Lindenstrauss, which we covered above. QJL was itself developed for KV cache quantization and I was interested in how much of an improvement the TurboQuant hybrid is above using just QJL alone. So the two methods we compare are:

TurboQuant: split the bit budget into $b-1$ bits for Lloyd-Max and $1$ bit for QJL on the residual.

vs.

QJL: skip Lloyd-Max entirely and spend all $b$ bits on QJL sign projection, applying $m=bd$ random sign measurements directly to $k$.

To test this, we run $N=2000$ independent trials for each strategy at $d=128$, drawing random unit vector pairs $(q,k)\in\mathbb{R}^d$ and measuring the inner product distortion of each estimator, which is shown in the left panel. The right panel shows the ratio between the two distortions, alongside the theoretical prediction $b\cdot D_\mathrm{mse}(b-1)$ and the Panter-Dite upper bound $2\pi\sqrt{3}\cdot b \cdot 4^{-b}$.

Left: inner product distortion of TurboQuant and pure QJL as a function of bit budget $b$. Right: distortion ratio $D_\mathrm{TQ}/D_\mathrm{QJL}$ compared to the theoretical scaling $b \cdot D_\mathrm{mse}(b-1)$ and the Panter-Dite upper bound.

At $b=1$, the two strategies are identical: TurboQuant allocates zero bits to Lloyd-Ma x and thus reduces to pure QJL. For $b>1$, the gap opens quite rapidly, and TurboQuant gains significantly in precision. The reason is that when QJL is applied directly to $k$, the variance of its inner product estimator is proportional to $|k|^2$, and this cannot be reduced by any amount of additional sign bits. However, when Lloyd-Max is applied first, the residual satisfies $|r|^2\ll|k|^2$ because Lloyd-Max minimizes exactly this quantity given $b-1$ bits. QJL then operates on a much smaller target, and its variance shrinks proportionally. Lloyd-Max shrinks the target, and QJL corrects the bias that Lloyd-Max cannot avoid.

The scaling law $D_\mathrm{TQ}/D_\mathrm{QJL}\sim b\cdot 4^{-b}$ tells us this advantage compounds with each additional bit. At $b=3$, TurboQuant achieves three times lower inner product distortion than pure QJL at the same memory cost. Equivalently, TurboQuant at $b=2$ achieves roughly the same distortion as pure QJL would at $b=3$, so we save one bit per coordinate at no quality cost. This is quite significant in a production KV cache with $d=128$ over millions of tokens.

Parallels in physics

As mentioned in the introduction, I found the TurboQuant paper also interesting because mathematically, there are a couple of interesting parallels to concepts used in physics, which I finally want to touch on briefly.

Ising and Potts model

With $b=1$, the Lloyd-Max codebook reduces to two centroids at $\pm\sqrt{2/(\pi d)}$ and the quantization map becomes $\sigma = \mathrm{sign}(\Pi k)$, which is a vector of binary spins, just like in the Ising model. Each coordinate $(\Pi k)_j$ acts as a local field, and the spin aligns with the field at zero temperature. For $b=2$, it is analogous to a 4-state Potts model; at $b=3$, to an 8-state Potts model. The compression hierarchy from 1-bit to multi-bit quantization maps onto an increasing spin multiplicity: from Ising through Potts towards the continuous limit.

Zero temperature

In a physical system, the hard sign in QJL corresponds to a zero-temperature setting, being the limit of an otherwise smooth sigmoid. We can see what would happen if we introduced a temperature parameter (inverse temperature $\beta$) by replacing the deterministic sign with a stochastic spin drawn from:

\[P(\sigma_\mu = +1) = \text{sigmoid}(2\beta\cdot s_\mu^\top k)\]

At $\beta \to \infty$ this recovers the hard sign; at finite $\beta$ the spin fluctuates around the field direction $s_\mu^\top k$ with thermal noise of scale $1/\beta$. To see how this affects the inner product estimator, we decompose $q$ into its component parallel to $k$ and its orthogonal complement:

\[q = \frac{\langle q,k\rangle}{\|k\|^2}k + q^\perp\]

The parallel component carries all information about $\langle q,k\rangle$. Taking the expectation of the estimator over the random matrix $S$ and the stochastic spins:

\[\mathbb{E}[\langle Sq,\sigma \rangle] = \frac{\langle q,k\rangle}{\|k\|} \cdot \mathbb{E}[|s_\mu^\top k|]\cdot g(\alpha)\]

where $q^\perp$ contributes zero by symmetry, and the bias function is:

\[g(\alpha) = \sqrt{\frac{\pi}{2}} \cdot \mathbb{E}_{z\sim\mathcal{N}(0,1)}[z \tanh(\alpha z)], \quad \alpha = \beta \|k\|\]

Since $g(\alpha)<1$ for all finite $\alpha$ and $g(\alpha)\to 1$ only as $\alpha\to\infty$, the estimator is systematically biased at any finite temperatures: it underestimates all inner products by the factor $g(\alpha)$. After correcting for this bias, the variance of the corrected estimator is strictly decreasing in $\alpha$, minimized only at $\beta \to \infty$. Hence, zero temperature is the unique optimum. Using the hard sign as done in TurboQuant, we get to the ground state of the Ising system, and thermal fluctuations only hurt.

Concentration of measure and the thermodynamic limit

The rotation step used in TurboQuant works because of concentration of measure in high dimensions. This is central to both statistical mechanics and high-dimensional probability. As $d\to\infty$, almost all of the sphere’s surface concentrates near the equator, and each coordinate of a random point on $S^{d-1}$ becomes approximately Gaussian with variance $1/d$. This is like a geometric version of the thermodynamic limit: as the number of degrees of freedom grows, macroscopic quantities (here, the coordinate distribution) become deterministic and predictable, even though individual realizations remain random. The $1/d$ variance is the direct analogue of $1/N$ fluctuation suppression in large thermodynamic systems.

Sphere packing and Voronoi tessellation

The Lloyd-Max optimisation partitions the real line into Voronoi cells, where each coordinate gets assigned to its nearest centroid, and the cell boundaries are midpoints between centroids. In higher dimensions, the analogous problem of partitioning space optimally with a given number of centres is the sphere packing problem, well studied in condensed matter physics. The paper itself notes (in Theorem 3) that comparable lower bounds can be derived using sphere packing arguments. The $\frac{\pi\sqrt{3}}{2}$ Panter-Dite constant is in this sense a one-dimensional sphere packing result evaluated for a Gaussian measure.

Entropy

Shannon built his entropy $H=-\sum p\log p$ explicitly by analogy with Boltzmann and Gibbs. The rate-distortion function for a Gaussian source has a formal structure similar to a free energy minimization.

The rate-distortion optimization problem asks essentially: for a source $X$ with distribution $p(x)$, find the encoder $p(\hat{x}|x)$ that minimizes the mutual information $I(X;\hat{X})$ subject to the constraint $\mathbb{E}[d(X,\hat{X})] \leq D$. We can formalize this as an unconstrained minimization using Lagrange multipliers:

\[\min_{p(\hat{x}|x)} \left[ I(X;\hat{X}) + \lambda \cdot \mathbb{E}[d(X,\hat{X})]\right]\]

for some $\lambda \geq 0$ that trades rate against distortion. The first-order condition yields the optimal encoder:

\[p^*(\hat{x}|x) = \frac{p(\hat{x}) \cdot e^{-\lambda \ d(x,\hat{x})}}{Z(x)}, \quad Z(x)=\int p(\hat{x}) \ e^{-\lambda \ d(x,\hat{x})} \ d\hat{x}\]

This is a Gibbs distribution. The Lagrange multiplier $\lambda$ acts as the inverse temperature, $d(x,\hat{x})$ as an energy, and $Z(x)$ is a partition function. At the optimum, the Lagrangian satisfies:

\[I^*(X;\hat{X}) + \lambda D^* = -\mathbb{E}_x[\log Z(x)] = \lambda \mathbb{E}_x[F(x)]\]

where $F(x) = -\frac{1}{\lambda}\log Z(x)$ is the free energy per sample. In that sense, rate-distortion optimization is analogous to a free energy minimization: Shannon’s lower bound is the minimum free energy achievable with a continuous reconstruction alphabet, and the Panter-Dite bound is the minimum achievable with a discrete alphabet of $2^b$ centroids. At zero temperature their ratio is exactly $\frac{\pi\sqrt{3}}{2}$.

The entire rate-distortion curve is traced by varying $\lambda$: at $\lambda=0$ (infinite temperature), the encoder ignores the input entirely, $p^*(\hat{x}|x)=p(\hat{x})$, achieving zero rate at maximum distortion; as $\lambda \to \infty$ (zero temperature), it concentrates on the nearest reconstruction point, approaching lossless compression. For a Gaussian source $X\sim \mathcal{N}(0,\sigma^2)$ with squared error $d(x,\hat{x}) = (x-\hat{x})^2$, the parametric form of the rate-distortion curve is:

\[\lambda= -\frac{dR}{dD}=\frac{1}{2D}, \quad R(D) = \frac{1}{2}\log\frac{\sigma^2}{D}\]

At $R=b$ bits, this gives $D=\sigma^2\cdot 4^{-b}$, which is the Shannon lower bound. The left panel below shows this curve. Every point on it corresponds to a specific operative temperature $T=1/\lambda=2D$: the four points from Lloyd-Max at $b=1,2,3,4$ sit on the curve, each at a different temperature (different slope). The achievable region (where compression is possible) lies above the curve ($R\geq R(D)$); the region below is forbidden by Shannon’s theorem, regardless of the encoding scheme used.

Left: Shannon rate-distortion curve $R(D) = \tfrac{1}{2}\log(\sigma^2/D)$ with Lloyd-Max operating points. Middle: the Gibbs posterior $p^*(\hat{x}|x)$ for fixed $x=0.75$ at varying temperatures. Right: Lloyd-Max distortion compared to the Shannon lower bound, separated by the constant Panter-Dite overhead.

The middle panel shows what the Gibbs encoder actually does for a fixed input $x=0.75$. At high temperature, the encoder spreads probability mass broadly, pulled toward the prior mean at zero. This comes from Bayesian shrinkage, as the reconstruction is a compromise between the observed input and the prior $p(\hat{x}) = \mathcal{N}(0,1)$. As temperature falls, the likelihood sharpens and the distribution concentrates closer to $x$. At $T\to0$, it collapses onto the nearest Lloyd-Max centroid. So in that sense, the zero-temperature limit of the Gibbs encoder is Lloyd-Max quantization.

A scalar quantizer with $2^b$ levels is a restricted version of this optimal encoder. The reconstruction $\hat{x}$ is constrained to one of $2^b$ fixed centroids rather than drawn from a continuous distribution. The partition function becomes a finite sum:

\[Z(x) = \sum_{k=1}^{2^b} e^{-\lambda(x-c_k)^2}\]

rather than a Gaussian integral, and the achievable distortion at rate $b$ is correspondingly higher than the Shannon minimum. The Panter-Dite result quantifies how much higher (right panel): the minimum distortion achievable by any scalar quantizer at $b$ bits is $\frac{\pi\sqrt{3}}{2}\approx 2.72$ times the Shannon lower bound $\sigma^2\cdot 4^{-b}$. This discretization overhead corresponds to the filled region between the two dashed lines. Note that this factor does not shrink with $b$: the two lines are parallel on the log scale, meaning the overhead is a fixed multiplicative cost of restricting the Gibbs distribution to finitely many states.

Randomness as a constructive resource

Finally, we see that both $\Pi$ and $S$ are random matrices used deliberately in TurboQuant. This is analogous to how randomness is used constructively in methods from statistical mechanics, like Monte Carlo, simulated annealing, or replica methods. The rotation $\Pi$ drawn from the Haar measure randomizes the input into a form where the optimal codebook is known analytically, which sidesteps the need for data-dependent calibration. The matrix $S$ spreads the residual’s energy uniformly across sign bits so the noise is isotropic and cancels in expectation. In both cases, the randomness is not noise to be suppressed, but acts as a resource that enables theoretical guarantees.

If you have any thoughts on this, or think I got something wrong somewhere, please reach out!