Gzip is all You Need! (This SHOULD NOT work)
Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Represent each text as a vector of normalized gzip compression distances (NCD) to all training samples, then run K-nearest neighbors on those vectors.
Briefing
A surprisingly effective sentiment classifier can be built from a simple recipe: compress text with gzip, convert those compression results into normalized compression distances (NCD), and then run K-nearest neighbors (KNN) on the resulting distance vectors. On common sentiment data, this “parameter-free” approach reaches roughly the mid-70% accuracy range, far above random guessing, despite using no neural network and only a few dozen lines of straightforward code. The core takeaway is that statistical compression behavior can encode enough information about language patterns (like common phrasing tied to positive vs. negative sentiment) for KNN to exploit.
The method starts by representing each text string numerically. For any two strings X1 and X2, gzip compresses each string on its own and the two concatenated together, and the NCD is computed as (C(X1 + X2) − min(C(X1), C(X2))) / max(C(X1), C(X2)), where C(·) is the gzip-compressed length. Each training example becomes a feature vector: it stores its NCD values against every training sample. That means classification is driven by how similarly a new text compresses relative to the entire training set’s “compression neighborhood,” not by token counts or embeddings.
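The NCD computation is simple enough to sketch in a few lines of Python. The snippet below is a minimal illustration using the standard-library gzip module, not the original script; the helper names are made up here, and joining the two strings with a space is just one common convention.

```python
import gzip

def clen(text: str) -> int:
    # Length in bytes of the gzip-compressed UTF-8 encoding of the text.
    return len(gzip.compress(text.encode("utf-8")))

def ncd(x1: str, x2: str) -> float:
    # Normalized compression distance:
    # NCD = (C(x1 + x2) - min(C(x1), C(x2))) / max(C(x1), C(x2))
    c1, c2 = clen(x1), clen(x2)
    c12 = clen(x1 + " " + x2)  # concatenation convention is an implementation detail
    return (c12 - min(c1, c2)) / max(c1, c2)
```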
Classification then uses standard KNN. For a test string, the model compresses it, computes its NCD vector against the training set, and finds the closest neighbors under that NCD-based distance. Those neighbors’ sentiment labels vote to determine the final prediction. The computational bottleneck is the all-pairs NCD calculation: every sample must be compared to every other sample, a quadratic workload. The implementation speeds this up with multiprocessing while preserving the ordering of the NCD vectors so each training row aligns with the right sample.
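As a sketch of how that pipeline could fit together, the snippet below builds the NCD feature matrix with a multiprocessing pool and feeds it to scikit-learn’s KNeighborsClassifier. The tiny dataset, the K value, and the helper names are placeholders for illustration, not the video’s actual code.

```python
import gzip
from multiprocessing import Pool
from sklearn.neighbors import KNeighborsClassifier

# Placeholder training data; a real run would load a sentiment dataset.
train_texts = ["great movie, loved it", "terrible plot and bad acting",
               "what a fantastic film", "boring and way too long"]
train_labels = ["pos", "neg", "pos", "neg"]

def clen(text):
    return len(gzip.compress(text.encode("utf-8")))

def ncd(x1, x2):
    c1, c2 = clen(x1), clen(x2)
    return (clen(x1 + " " + x2) - min(c1, c2)) / max(c1, c2)

def ncd_row(text):
    # One feature vector: NCD of `text` against every training sample,
    # kept in training order so rows stay aligned with labels.
    return [ncd(text, t) for t in train_texts]

if __name__ == "__main__":
    with Pool() as pool:
        # The quadratic, all-pairs part of the pipeline.
        X_train = pool.map(ncd_row, train_texts)

    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train, train_labels)

    print(knn.predict([ncd_row("an absolutely wonderful experience")]))
```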
Results are mixed when compared to the original research claim that this compressor-based method could beat BERT on sentiment. Using a small dataset (500 samples), accuracy lands around 70% in the author’s tests, with noticeable variance depending on which samples are drawn. The method stabilizes as dataset size grows: around 10,000 samples, accuracy settles near 75.7%. That’s still below the paper’s reported performance, but it remains strong for a baseline that uses only gzip, NCD, and KNN.
Part of the discrepancy is traced to a likely evaluation issue in the original paper: using K=2 in KNN can create ties, and the paper’s tie-handling reportedly counted ties as correct rather than breaking them. Correcting that kind of mistake can materially change reported accuracy. Even after accounting for it, the approach still performs meaningfully well, suggesting the compression-distance representation is not just a fluke.
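The tie problem is easy to reproduce in a toy voting function. The sketch below is hypothetical, not the paper’s evaluation code; it just shows how an explicit tie-break (falling back to the single nearest neighbor) avoids silently crediting ambiguous cases.

```python
from collections import Counter

def knn_vote(neighbor_labels):
    # Labels of the k nearest training samples, sorted closest-first.
    counts = Counter(neighbor_labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        # Tied vote (common with k=2): defer to the single closest neighbor
        # instead of counting the prediction as automatically correct.
        return neighbor_labels[0]
    return counts[0][0]

# With k=2 and one neighbor of each class the vote ties;
# the tie-break picks the closer neighbor's label ("neg").
print(knn_vote(["neg", "pos"]))
```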
The analysis also highlights practical limitations. Accuracy depends on text length: very short strings tend to be misclassified, and the classifier appears to work best when inputs are within a length range common to the training data (roughly 200+ characters, and more reliably around 600+ in these experiments). Attempts at adding dimensionality reduction via PCA didn’t reveal extra structure to exploit. Overall, the work functions as a reminder that non-neural baselines can still capture real signal—and that revisiting “first principles” methods can uncover useful, lightweight alternatives to deep learning in some NLP settings.
Cornell Notes
Gzip-based normalized compression distances (NCD) can turn raw text into numeric feature vectors for K-nearest neighbors. Each text is compressed with gzip, and NCD between two strings is computed from the compressed lengths of each string and of their concatenation. For classification, every training sample becomes a vector of NCD values against all other training samples; a new input gets its own NCD vector against the training set, then KNN votes using sentiment labels. On sentiment data, accuracy rises from roughly 70% on 500 samples to around 75.7% on 10,000 samples, though it doesn’t match the original paper’s claim of beating BERT. Performance is sensitive to text length, and runtime is dominated by the all-pairs NCD computation, which can be sped up with multiprocessing.
- How does gzip-based NCD convert two text strings into a comparable distance?
- What exactly becomes the feature vector for KNN in this approach?
- Why isn’t this “data leakage,” even though test vectors are computed against training samples?
- What caused a gap between the original paper’s reported results and the recreated results?
- Why does runtime become a problem, and how is it addressed?
- What input property most affects accuracy in practice?
Review Questions
- How does the NCD formula use min and max of individual compressed lengths to normalize similarity between two texts?
- Why does the method require an all-pairs NCD computation, and what part of the pipeline dominates runtime?
- What kinds of tie-handling issues can distort KNN accuracy when K is even (like K=2)?
Key Points
1. Represent each text as a vector of normalized gzip compression distances (NCD) to all training samples, then run K-nearest neighbors on those vectors.
2. Compute NCD from the gzip-compressed lengths of X1, X2, and X1 + X2 as (C(X1 + X2) − min(C(X1), C(X2))) / max(C(X1), C(X2)); a worked example follows this list.
3. The main computational cost is building the full NCD matrix (quadratic in dataset size), not the KNN prediction step.
4. Reported performance can be inflated if KNN tie cases (common with K=2) are handled incorrectly or counted as correct without a tie-breaker.
5. Accuracy improves with larger datasets (roughly 70% at 500 samples to about 75.7% at 10,000 samples in these tests) but remains sensitive to input length.
6. Inputs far shorter than the training distribution tend to be misclassified, suggesting the compression-distance signal depends on having enough text for statistical patterns to emerge.
7. Even without neural networks, compressor-based similarity can capture sentiment-related language regularities well enough to beat random guessing by a wide margin.
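For a concrete sense of the formula in point 2, take hypothetical compressed lengths of C(X1) = 110 bytes, C(X2) = 130 bytes, and C(X1 + X2) = 190 bytes (made-up numbers, purely for illustration): NCD = (190 − 110) / 130 ≈ 0.62. Values near 0 mean the pair compresses almost as if it were a single text; values near 1 mean the compressor finds little shared structure.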