A detailed walkthrough of TurboQuant KV compression, along with some thoughts from reading it.
Google Research recently published a new method for KV compression, called TurboQuant. The method is quite impressive.
KV cache memory is one of the biggest bottlenecks in frontier LLM deployment. The task of the KV cache is to optimize the computation of this equation:
\[\text{Attention}(Q,K,V) = \text{softmax}\left( \frac{QK^\top}{\sqrt{d_k}}\right) V\]which gives us the scores of the attention block in transformer models. $Q \in \mathbb{R}^{m\times d_k}$, $K\in \mathbb{R}^{n\times d_k}$, and $V\in \mathbb{R}^{n\times d_v}$ are the query, key, and value matrices (i.e., the stacked query, key, and value vectors for all positions). $d_k$ is the dimension of the attention head. What this equation does is essentially: for every query token, we compute the “interaction” with all previous tokens, and use that to weight the value vectors.
In principle, for every new token in $Q$, we would have to recompute all the entries in $K$ and $V$ for all previous tokens, despite the fact that their entries didn’t change at all. So, in order to avoid this recomputation, we store them in the…KV Cache. We can just look them up from memory.Llama-3.1-70B model with 80 layers, 8 KV heads (Grouped Query Attention), and head dimension $d_k = d_v = 128$, the KV cache for a single token in fp16 is
This scales linearly with context length. With a context length of 128k tokens, this amounts to roughly $40\,\text{GB}$. For reference, the model weights themselves at fp16 are roughly $70 \times 10^9 \times 2\,\text{bytes} \approx 140\,\mathrm{GB}.$ If the model is served at int4, it’s around $35\,\mathrm{GB}$, so the KV Cache would be larger than the model itself! And this is even without batching for multiple users.
Then, there is also the latency problem. The KV cache is stored in the main GPU memory, the HBM, and for every new token, the entire cache must be streamed from HBM into the on-chip SRAM to compute attention. This makes the decoding phase heavily memory-bound: the GPU’s tensor cores sit underutilized while waiting on memory transfers.
All in all, we need ways to compress the KV cache while still ensuring that the model’s quality is high — in particular, that the distortion to the query-key inner product $\langle q,k\rangle$ is minimal. This problem is fundamental in information theory and dates back to Claude Shannon. Shannon established a lower bound on the distortion imposed by any compression. As we will see, the method introduced by Zandieh et al. gets very close.
There are two kinds of compression schemes: lossless compression and lossy compression. When you zip a file, you are dealing with lossless compression. There is a preservation of information, which is great since we want to reconstruct the exact same file from the compressed version. TurboQuant is a lossy compression scheme. We reduce the precision, but try to do so under minimum distortion.
TurboQuant specifically targets key vectors, while value vectors are compressed separately using a simpler scalar scheme. There are two targets which are addressed in two steps:
The overall encoding scheme of TurboQuant looks like this. We will next go through each step in detail.
We have a bit budget of $b$ bits per coordinate. We capture the bulk information with MSE quantization and store this in $b-1$ bits. We use the remaining bit for correcting the inner product with QJL.
We start by applying a random rotation matrix to the key vector:
\[y = \Pi \cdot k \in \mathbb{R}^d\]The result is that for large $d$, the coordinates become approximately i.i.d.
\[y_i \sim \mathcal{N}\left(0,\frac{1}{d}\right)\]The previously correlated coordinates become nearly independent, and the quantization reduces to a 1D scalar quantization problem under a known distribution.
For a point drawn from the unit hypersphere $S^{d-1}$, the marginal distribution of any single coordinate follows from a slicing argument: if we fix one coordinate $x_1 = t$, this constrains the remaining coordinates to a “sub”-sphere of radius $\sqrt{1-t^2}$, because $|x|^2 = 1$.
This is a symmetric Beta distribution with parameters $\alpha = \beta = (d-1)/2$ and variance $1/d$. As $d\to \infty$, $\alpha = \beta \to \infty$ and the Beta distribution concentrates tightly around zero, converging to $\mathcal{N}(0,1/d)$.
This is precisely what the rotation $\Pi$ achieves here. $\Pi$ is drawn from the Haar measure, which is the uniform distribution over all rotations, so $\Pi k$ is a uniformly random rotation of $k$. Assuming $k$ is normalized to unit norm, its coordinates follow this distribution regardless of what $k$ actually was. Any structure or correlation in the original key vector is randomized away, and every coordinate ends up with the same known marginal. The codebook can therefore be designed once for $\mathcal{N}(0,1/d)$ and reused universally — across every token, every layer, every model.
The rotated coordinates are grouped together at $2^{b-1}$ centroids via Lloyd-Max and quantized according to their index. The special part here is that we know the underlying distribution, and thus we can solve for the centroid coordinates only once, and reuse the result in all following iterations with a lookup table. All we need to remember are the indices of each coordinate.
The assignment rule is
\[\text{idx}_j = \arg \min_k |y_j-c_k|\]We search for the index of the centroid which has the minimal absolute distance to the rotated vector $y$, in each of its coordinates. What we store in cache are the centroid indices as
\[\mathrm{idx} \in \{ 0,\dots, 2^{b-1}-1\}^d\]So, for a bit budget of $b=3$, there would be 4 centroid indices $00, 01, 10, 11$.
This problem — quantizing the output from a Gaussian source
At $b$ bits per coordinate, we can distinguish at most $2^b$ different values. But no matter how cleverly we place those bins, there will always be some values that are genuinely different that end up in the same bin and thus become indistinguishable after compression. This introduces an inevitable reconstruction error.
Now, a Gaussian source has a specific amount of information content, which can be measured by its differential entropy
\[H(X) = \frac{1}{2}\log(2\pi e\sigma^2).\]Shannon showed that this entropy implies a rate-distortion function
\[D(R) = \sigma^2\cdot 2^{-2R},\]which at $R=b$ bits per coordinate gives
\[D=\sigma^2\cdot 4^{-b}.\]This is the theoretical minimum distortion achievable by any compression scheme at $b$ bits per coordinate.
This holds for any compression scheme, be it scalar, vector, or otherwise. Here, we are using scalar quantization, meaning that each coordinate is quantized independently. This is computationally attractive but comes at the cost that a scalar quantizer cannot exploit correlations between coordinates, and even after the random rotation has made them nearly independent, quantizing one at a time wastes some capacity.
Panter and Dite have shown that the best scalar quantizer cannot achieve a distortion below
\[D \sim \frac{\pi\sqrt{3}}{2}\cdot 4^{-b},\]sitting within a fixed factor of $\pi\sqrt{3}/2 \approx 2.72$ above the Shannon floor at all bit widths.
We can test this by sampling points from a standard Gaussian $\mathcal{N}(0,1)$, then applying Lloyd-Max with $2^b$ levels, and measuring the distortion $D_\text{mse}(b)$ as the mean-squared error with an evaluation set (points sampled from the same Gaussian, but with a different random seed).
The empirical markers sit below the Panter-Dite bound at every bit width, confirming that Lloyd-Max achieves near-optimal distortion. In the right plot, each value is normalized by $4^b$ to show the convergence to the Panter-Dite constant of $\pi\sqrt{3}/2 \approx 2.7207$. What’s interesting to note is that for $b=1$, the prefactor is only around 1.457, meaning that at the tightest compression, we are actually the closest to the Shannon bound.
tl;dr: The Shannon bound is achievable only by vector (block) quantizers with infinitely long blocks, where samples can be encoded jointly. At $b=1$, however, both scalar and vector quantizers are equally constrained by the coarseness of a two-level representation, leaving little room for a joint scheme to outperform a scalar one.
To understand why that is, we need to think about what the Shannon bound actually represents at a given bit width. The Shannon bound of $4^{-b}$ is the minimum distortion achievable by any quantizer (scalar, vector, whatever) at $b$ bits. But this bound is only achieved by an ideal vector quantizer operating on infinitely long blocks.
A block quantizer does not compress one sample at a time, but rather looks at a block of $n$ samples together and encodes them jointly. With such a joint code, we can redistribute the bit budget across the block (spending more bits where the signal is large, fewer where it is near zero), while keeping the total at $nb$ bits. As $n\to\infty$, the redistribution becomes increasingly efficient and the distortion approaches the Shannon bound from above. Scalar quantization is the special case of $n=1$: we look at one sample at a time and must commit to a fixed codebook, with no knowledge of what other samples look like. This is the overhead captured by the Panter-Dite constant.
Now, why is this gap smallest at $b=1$? With one bit per coordinate, there are only two levels, and the optimal placement is fully determined: two centroids sitting at $\pm \sqrt{2/(\pi d)}$ with the threshold at zero. A vector quantizer with the same budget of 1 bit per coordinate would face the same constraint: representing a continuous Gaussian with just two values will be coarse regardless of how sophisticated the scheme is. Both scalar and vector quantization are here equally limited by the coarseness of the representation, so the gap between them is small.
This changes with higher $b$. With more levels, a vector quantizer can exploit increasing design freedom across coordinates jointly, adapting to the actual realization of each block. A scalar quantizer cannot do this; it applies the same codebook to every coordinate independently. The gap thus widens as $b$ grows, and the normalized distortion $4^b \cdot D_\mathrm{mse}(b)$ rises from 1.457 at $b=1$ toward the asymptotic ceiling of 2.72.
For a single coordinate $x\sim \mathcal{N}(0,\sigma^2)$ with $\sigma=1/\sqrt{d}$, the optimal 1-bit quantizer assigns every positive $x$ to centroid $+c$ and every negative $x$ to centroid $-c$, with threshold at zero. The optimal $c$ minimizes the expected squared reconstruction error:
\[\min_{c} \ \mathbb{E}[(x-c)^2 \ | \ x>0]\]The minimizer is the conditional mean, $c=\mathbb{E}[x\ |\ x>0]$. A standard Gaussian distribution $X \sim \mathcal{N}(0,\sigma^2)$ has the probability density function (PDF)
\[f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{x^2}{2\sigma^2}}.\]When restricted to $X>0$, the total area under the curve is $0.5$, so we multiply by $2$ to normalize it, which gives us the half-normal distribution PDF:
\[f_{X>0}(x) = \frac{\sqrt{2}}{\sigma\sqrt{\pi}}e^{-\frac{x^2}{2\sigma^2}}.\]We get the expected value $\mathbb{E}[X]$ by integrating $x\cdot f(x)$ from $0$ to $\infty$:
\[\mathbb{E}[X] = \int_0^\infty x \frac{\sqrt{2}}{\sigma \sqrt{\pi}}e^{-\frac{x^2}{2\sigma^2}} \ \mathrm{d}x\]Substituting $u=\frac{x^2}{2\sigma^2}$ with $\mathrm{d}u = \frac{x}{\sigma^2}\ \mathrm{d}x$:
\[\mathbb{E}[X] = \frac{\sqrt{2}}{\sigma\sqrt{\pi}}\int_0^\infty \sigma^2e^{-u} \ \mathrm{d}u = \frac{\sigma^2\sqrt{2}}{\sigma \sqrt{\pi}}[-e^{-u}]_0^\infty = \sigma \sqrt{\frac{2}{\pi}}\]With $\sigma = \frac{1}{\sqrt{d}}$, we get
\[c = \mathbb{E}[x \ | \ x>0] = \sigma \sqrt{\frac{2}{\pi}} = \frac{1}{\sqrt{d}} \cdot \sqrt{\frac{2}{\pi}} = \sqrt{\frac{2}{\pi d}}\]
During encoding, we immediately perform a reconstruction of the key vector, but because the quantization is lossy, the reconstructed key vector $\tilde{k}$ will be slightly different from the actual key vector $k$. What we do is to rotate the centroid look-up table back into the original space:
\[\tilde{k} = \Pi^\top \cdot c_\text{idx} \in \mathbb{R}^d\]From that, we can calculate the residual as the difference
\[r = k-\tilde{k} \in \mathbb{R}^d\]This is the stuff that the MSE stage failed to capture. It is as small as it can be given the $b-1$ budget; it’s the minimal MSE from the Lloyd-Max stage, but we still need to correct for it to get the attention scores right. So we compress this as well. In particular, we are using a Quantized Johnson-Lindenstrauss transform, which allows us to quantize the residual vector into a single bit per coordinate.
In short: we apply a random Gaussian matrix and take the sign. This gives us 1 bit per coordinate, which we store in the vector $\sigma \in {-1,+1}^d$:
\[\sigma = \text{sign}(Sr)\]And we also store the magnitude of $r$ as a scalar, as we need that for reconstruction.
QJL in general
QJL builds on the Johnson-Lindenstrauss lemma: a random matrix $S\in\mathbb{R}^{m\times d}$ preserves inner products in expectation,
\[\mathbb{E}[\langle Su, Sv\rangle ] = m\langle u,v\rangle\]with error shrinking as $m$ grows. QJL uses a square matrix $S\in \mathbb{R}^{d\times d}$ (no dimensional reduction) and compresses by storing only the sign of each projected coordinate:
\[\sigma = \text{sign}(Sx) \in \{ -1,+1\}^d\]We go from 32 bits per coordinate to 1 bit. To recover the magnitude, QJL also stores $|x|$ as a single float32. The estimator for the inner product of $q$ and $x$ is then:
\[\widehat{\langle q,x\rangle} = \frac{\sqrt{\pi/2}}{d} \cdot \|x\|\cdot \langle Sq,\sigma\rangle\]with $Sq$ computed in full float. The factor $\sqrt{\pi/2}$ corrects for the sign shrinkage and $d^{-1}$ normalizes for summing $d$ terms.
QJL on the residual
After Lloyd-Max at $b-1$ bits, we have a reconstruction $\tilde{k}$ and residual $r = k-\tilde{k} \in \mathbb{R}^d$. The inner product splits as
\[\langle q,k\rangle = \langle q,\tilde{k}\rangle + \langle q,r\rangle\]We already have the first term from the Lloyd-Max reconstruction. For the second, we apply QJL to $r$ using the remaining 1 bit of the budget:
\[\sigma = \text{sign}(S\cdot r), \quad \text{store}\ \sigma \ \text{and} \ \|r\|\]At query time:
\[\langle q,k\rangle \approx \langle q,\tilde{k} \rangle + \frac{\sqrt{\pi/2}}{d} \cdot \|r\|\cdot \langle Sq,\sigma\rangle\]which gives us an unbiased estimator of the true inner product.
You may ask why we do not just apply QJL to $k$ directly. The reason is that the variance of the QJL estimator is proportional to $|x|^2$. Lloyd-Max has already minimized $|r|$, so the noise from the 1-bit quantization is already as small as it could possibly be. If we applied QJL to the full $k$, the variance would be proportional to $|k|^2$, which is much larger. I show this in more detail below.
We now have covered the encoding pipeline from TurboQuant, which happens for every new token, both during prefill (where we encode all prompt tokens in parallel) and during decode (where each newly generated token is encoded one by one). Instead of storing the full key vectors, our key cache now consists of three entries per vector:
idx σ ‖r‖₂ By a napkin estimate, this reduces the memory footprint by a factor of more than $5\times$ for $b=3$, and approximately $8\times$ for $b=2$.
A key vector in float16 costs $d\times 2\,\text{bytes}$. The encoded representation costs
| Component | Size |
|---|---|
idx ($b-1$ bits/coord) | $\frac{d(b-1)}{8}$ bytes |
σ ($1$ bit/coord) | $\frac{d}{8}$ bytes |
‖r‖₂ (one float32) | 4 bytes |
| Total | $\frac{db}{8}+4$ bytes |
For large $d$, the scalar 4 bytes is negligible, which gives a compression ratio of:
\[\frac{2d}{db/8} = \frac{16}{b}\]So approximately $5.3\times$ at $b=3$, and approximately $8\times$ at $b=2$, as we go from 16 bits to $b$ bits per coordinate.
At attention time, we do the decoding step, where we want to use the reconstructed key vectors together with the value vectors
We achieve this in two steps. First, we look up the centroid coordinates corresponding to the stored indices. We know these analytically, since they were precomputed once offline by solving the Lloyd-Max problem on the known Beta distribution for our specific bit width. We then rotate them back to the original key space using $\Pi^\top$, recovering the lossy MSE reconstruction $\tilde{k} = \Pi^\top c_\mathrm{idx}$. Second, we correct this reconstruction using the stored QJL representation, reversing the residual compression. Together, the inner product is estimated as:
\[\begin{aligned} \langle q, \tilde{k}+\tilde{r}\rangle &= \langle q,\Pi^\top c_\mathrm{idx}\rangle + \frac{\sqrt{\pi/2}}{d} \cdot \|r\|_2\cdot \langle q,S^\top\sigma\rangle \\ &= \langle q,\Pi^\top c_\mathrm{idx}\rangle + \frac{\sqrt{\pi/2}}{d} \cdot \|r\|_2\cdot \langle Sq,\sigma\rangle \end{aligned}\]The second line uses the identity $\langle q,S^\top\sigma\rangle = \langle Sq, \sigma\rangle$, which means we pre-multiply the query by $S$ once and then dot with the stored sign vector. This keeps the query side in full float precision and only the key side quantized.
The TurboQuant method is an advancement from the authors’ previous paper on Quantized Johnson-Lindenstrauss, which we covered above. QJL was itself developed for KV cache quantization and I was interested in how much of an improvement the TurboQuant hybrid is above using just QJL alone. So the two methods we compare are:
TurboQuant: split the bit budget into $b-1$ bits for Lloyd-Max and $1$ bit for QJL on the residual.
vs.
QJL: skip Lloyd-Max entirely and spend all $b$ bits on QJL sign projection, applying $m=bd$ random sign measurements directly to $k$.
To test this, we run $N=2000$ independent trials for each strategy at $d=128$, drawing random unit vector pairs $(q,k)\in\mathbb{R}^d$ and measuring the inner product distortion of each estimator, which is shown in the left panel. The right panel shows the ratio between the two distortions, alongside the theoretical prediction $b\cdot D_\mathrm{mse}(b-1)$ and the Panter-Dite upper bound $2\pi\sqrt{3}\cdot b \cdot 4^{-b}$.
At $b=1$, the two strategies are identical: TurboQuant allocates zero bits to Lloyd-Ma x and thus reduces to pure QJL. For $b>1$, the gap opens quite rapidly, and TurboQuant gains significantly in precision. The reason is that when QJL is applied directly to $k$, the variance of its inner product estimator is proportional to $|k|^2$, and this cannot be reduced by any amount of additional sign bits. However, when Lloyd-Max is applied first, the residual satisfies $|r|^2\ll|k|^2$ because Lloyd-Max minimizes exactly this quantity given $b-1$ bits. QJL then operates on a much smaller target, and its variance shrinks proportionally. Lloyd-Max shrinks the target, and QJL corrects the bias that Lloyd-Max cannot avoid.
The scaling law $D_\mathrm{TQ}/D_\mathrm{QJL}\sim b\cdot 4^{-b}$ tells us this advantage compounds with each additional bit. At $b=3$, TurboQuant achieves three times lower inner product distortion than pure QJL at the same memory cost. Equivalently, TurboQuant at $b=2$ achieves roughly the same distortion as pure QJL would at $b=3$, so we save one bit per coordinate at no quality cost. This is quite significant in a production KV cache with $d=128$ over millions of tokens.
As mentioned in the introduction, I found the TurboQuant paper also interesting because mathematically, there are a couple of interesting parallels to concepts used in physics, which I finally want to touch on briefly.
Ising and Potts model
With $b=1$, the Lloyd-Max codebook reduces to two centroids at $\pm\sqrt{2/(\pi d)}$ and the quantization map becomes $\sigma = \mathrm{sign}(\Pi k)$, which is a vector of binary spins, just like in the Ising model. Each coordinate $(\Pi k)_j$ acts as a local field, and the spin aligns with the field at zero temperature. For $b=2$, it is analogous to a 4-state Potts model; at $b=3$, to an 8-state Potts model. The compression hierarchy from 1-bit to multi-bit quantization maps onto an increasing spin multiplicity: from Ising through Potts towards the continuous limit.
Zero temperature
In a physical system, the hard sign in QJL corresponds to a zero-temperature setting, being the limit of an otherwise smooth sigmoid. We can see what would happen if we introduced a temperature parameter (inverse temperature $\beta$) by replacing the deterministic sign with a stochastic spin drawn from:
\[P(\sigma_\mu = +1) = \text{sigmoid}(2\beta\cdot s_\mu^\top k)\]At $\beta \to \infty$ this recovers the hard sign; at finite $\beta$ the spin fluctuates around the field direction $s_\mu^\top k$ with thermal noise of scale $1/\beta$. To see how this affects the inner product estimator, we decompose $q$ into its component parallel to $k$ and its orthogonal complement:
\[q = \frac{\langle q,k\rangle}{\|k\|^2}k + q^\perp\]The parallel component carries all information about $\langle q,k\rangle$. Taking the expectation of the estimator over the random matrix $S$ and the stochastic spins:
\[\mathbb{E}[\langle Sq,\sigma \rangle] = \frac{\langle q,k\rangle}{\|k\|} \cdot \mathbb{E}[|s_\mu^\top k|]\cdot g(\alpha)\]where $q^\perp$ contributes zero by symmetry, and the bias function is:
\[g(\alpha) = \sqrt{\frac{\pi}{2}} \cdot \mathbb{E}_{z\sim\mathcal{N}(0,1)}[z \tanh(\alpha z)], \quad \alpha = \beta \|k\|\]Since $g(\alpha)<1$ for all finite $\alpha$ and $g(\alpha)\to 1$ only as $\alpha\to\infty$, the estimator is systematically biased at any finite temperatures: it underestimates all inner products by the factor $g(\alpha)$. After correcting for this bias, the variance of the corrected estimator is strictly decreasing in $\alpha$, minimized only at $\beta \to \infty$. Hence, zero temperature is the unique optimum. Using the hard sign as done in TurboQuant, we get to the ground state of the Ising system, and thermal fluctuations only hurt.
Concentration of measure and the thermodynamic limit
The rotation step used in TurboQuant works because of concentration of measure in high dimensions. This is central to both statistical mechanics and high-dimensional probability. As $d\to\infty$, almost all of the sphere’s surface concentrates near the equator, and each coordinate of a random point on $S^{d-1}$ becomes approximately Gaussian with variance $1/d$. This is like a geometric version of the thermodynamic limit: as the number of degrees of freedom grows, macroscopic quantities (here, the coordinate distribution) become deterministic and predictable, even though individual realizations remain random. The $1/d$ variance is the direct analogue of $1/N$ fluctuation suppression in large thermodynamic systems.
Sphere packing and Voronoi tessellation
The Lloyd-Max optimisation partitions the real line into Voronoi cells, where each coordinate gets assigned to its nearest centroid, and the cell boundaries are midpoints between centroids. In higher dimensions, the analogous problem of partitioning space optimally with a given number of centres is the sphere packing problem, well studied in condensed matter physics. The paper itself notes (in Theorem 3) that comparable lower bounds can be derived using sphere packing arguments. The $\frac{\pi\sqrt{3}}{2}$ Panter-Dite constant is in this sense a one-dimensional sphere packing result evaluated for a Gaussian measure.
Entropy
Shannon built his entropy $H=-\sum p\log p$ explicitly by analogy with Boltzmann and Gibbs. The rate-distortion function for a Gaussian source has a formal structure similar to a free energy minimization.
The rate-distortion optimization problem asks essentially: for a source $X$ with distribution $p(x)$, find the encoder $p(\hat{x}|x)$ that minimizes the mutual information $I(X;\hat{X})$ subject to the constraint $\mathbb{E}[d(X,\hat{X})] \leq D$. We can formalize this as an unconstrained minimization using Lagrange multipliers:
\[\min_{p(\hat{x}|x)} \left[ I(X;\hat{X}) + \lambda \cdot \mathbb{E}[d(X,\hat{X})]\right]\]for some $\lambda \geq 0$ that trades rate against distortion. The first-order condition yields the optimal encoder:
\[p^*(\hat{x}|x) = \frac{p(\hat{x}) \cdot e^{-\lambda \ d(x,\hat{x})}}{Z(x)}, \quad Z(x)=\int p(\hat{x}) \ e^{-\lambda \ d(x,\hat{x})} \ d\hat{x}\]This is a Gibbs distribution. The Lagrange multiplier $\lambda$ acts as the inverse temperature, $d(x,\hat{x})$ as an energy, and $Z(x)$ is a partition function. At the optimum, the Lagrangian satisfies:
\[I^*(X;\hat{X}) + \lambda D^* = -\mathbb{E}_x[\log Z(x)] = \lambda \mathbb{E}_x[F(x)]\]where $F(x) = -\frac{1}{\lambda}\log Z(x)$ is the free energy per sample. In that sense, rate-distortion optimization is analogous to a free energy minimization: Shannon’s lower bound is the minimum free energy achievable with a continuous reconstruction alphabet, and the Panter-Dite bound is the minimum achievable with a discrete alphabet of $2^b$ centroids. At zero temperature their ratio is exactly $\frac{\pi\sqrt{3}}{2}$.
The entire rate-distortion curve is traced by varying $\lambda$: at $\lambda=0$ (infinite temperature), the encoder ignores the input entirely, $p^*(\hat{x}|x)=p(\hat{x})$, achieving zero rate at maximum distortion; as $\lambda \to \infty$ (zero temperature), it concentrates on the nearest reconstruction point, approaching lossless compression. For a Gaussian source $X\sim \mathcal{N}(0,\sigma^2)$ with squared error $d(x,\hat{x}) = (x-\hat{x})^2$, the parametric form of the rate-distortion curve is:
\[\lambda= -\frac{dR}{dD}=\frac{1}{2D}, \quad R(D) = \frac{1}{2}\log\frac{\sigma^2}{D}\]At $R=b$ bits, this gives $D=\sigma^2\cdot 4^{-b}$, which is the Shannon lower bound. The left panel below shows this curve. Every point on it corresponds to a specific operative temperature $T=1/\lambda=2D$: the four points from Lloyd-Max at $b=1,2,3,4$ sit on the curve, each at a different temperature (different slope). The achievable region (where compression is possible) lies above the curve ($R\geq R(D)$); the region below is forbidden by Shannon’s theorem, regardless of the encoding scheme used.
The middle panel shows what the Gibbs encoder actually does for a fixed input $x=0.75$. At high temperature, the encoder spreads probability mass broadly, pulled toward the prior mean at zero. This comes from Bayesian shrinkage, as the reconstruction is a compromise between the observed input and the prior $p(\hat{x}) = \mathcal{N}(0,1)$. As temperature falls, the likelihood sharpens and the distribution concentrates closer to $x$. At $T\to0$, it collapses onto the nearest Lloyd-Max centroid. So in that sense, the zero-temperature limit of the Gibbs encoder is Lloyd-Max quantization.
A scalar quantizer with $2^b$ levels is a restricted version of this optimal encoder. The reconstruction $\hat{x}$ is constrained to one of $2^b$ fixed centroids rather than drawn from a continuous distribution. The partition function becomes a finite sum:
\[Z(x) = \sum_{k=1}^{2^b} e^{-\lambda(x-c_k)^2}\]rather than a Gaussian integral, and the achievable distortion at rate $b$ is correspondingly higher than the Shannon minimum. The Panter-Dite result quantifies how much higher (right panel): the minimum distortion achievable by any scalar quantizer at $b$ bits is $\frac{\pi\sqrt{3}}{2}\approx 2.72$ times the Shannon lower bound $\sigma^2\cdot 4^{-b}$. This discretization overhead corresponds to the filled region between the two dashed lines. Note that this factor does not shrink with $b$: the two lines are parallel on the log scale, meaning the overhead is a fixed multiplicative cost of restricting the Gibbs distribution to finitely many states.
Randomness as a constructive resource
Finally, we see that both $\Pi$ and $S$ are random matrices used deliberately in TurboQuant. This is analogous to how randomness is used constructively in methods from statistical mechanics, like Monte Carlo, simulated annealing, or replica methods. The rotation $\Pi$ drawn from the Haar measure randomizes the input into a form where the optimal codebook is known analytically, which sidesteps the need for data-dependent calibration. The matrix $S$ spreads the residual’s energy uniformly across sign bits so the noise is isotropic and cancels in expectation. In both cases, the randomness is not noise to be suppressed, but acts as a resource that enables theoretical guarantees.
If you have any thoughts on this, or think I got something wrong somewhere, please reach out!