Gzip is all You Need! (This SHOULD NOT work)
Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Represent each text as a vector of normalized gzip compression distances (NCD) to all training samples, then run K-nearest neighbors on those vectors.
Briefing
A surprisingly effective sentiment classifier can be built from a simple recipe: compress text with gzip, convert those compression results into normalized compression distances (NCD), and then run K-nearest neighbors (KNN) on the resulting distance vectors. On common sentiment data, this “parameter-free” approach reaches roughly the mid-70% accuracy range, far above random guessing, despite using no neural network and only a few dozen lines of straightforward code. The core takeaway is that statistical compression behavior can encode enough information about language patterns (like common phrasing tied to positive vs. negative sentiment) for KNN to exploit.
The method starts by representing each text string numerically. For any two strings X1 and X2, gzip compresses each string on its own and the two concatenated together, and the NCD is computed as (C(X1 + X2) − min(C(X1), C(X2))) / max(C(X1), C(X2)), where C(·) is the gzip-compressed length. Each training example becomes a feature vector: it stores its NCD values against every training sample. That means classification is driven by how similarly a new text compresses relative to the entire training set’s “compression neighborhood,” not by token counts or embeddings.
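The NCD computation is simple enough to sketch in a few lines of Python. The snippet below is a minimal illustration using the standard-library gzip module, not the original script; the helper names are made up here, and joining the two strings with a space is just one common convention.

```python
import gzip

def clen(text: str) -> int:
    # Length in bytes of the gzip-compressed UTF-8 encoding of the text.
    return len(gzip.compress(text.encode("utf-8")))

def ncd(x1: str, x2: str) -> float:
    # Normalized compression distance:
    # NCD = (C(x1 + x2) - min(C(x1), C(x2))) / max(C(x1), C(x2))
    c1, c2 = clen(x1), clen(x2)
    c12 = clen(x1 + " " + x2)  # concatenation convention is an implementation detail
    return (c12 - min(c1, c2)) / max(c1, c2)
```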
Classification then uses standard KNN. For a test string, the model compresses it, computes its NCD vector against the training set, and finds the closest neighbors under that NCD-based distance. Those neighbors’ sentiment labels vote to determine the final prediction. The computational bottleneck is the all-pairs NCD calculation: every sample must be compared to every other sample, a quadratic workload. The implementation speeds this up with multiprocessing while preserving the ordering of the NCD vectors so each training row aligns with the right sample.
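As a sketch of how that pipeline could fit together, the snippet below builds the NCD feature matrix with a multiprocessing pool and feeds it to scikit-learn’s KNeighborsClassifier. The tiny dataset, the K value, and the helper names are placeholders for illustration, not the video’s actual code.

```python
import gzip
from multiprocessing import Pool
from sklearn.neighbors import KNeighborsClassifier

# Placeholder training data; a real run would load a sentiment dataset.
train_texts = ["great movie, loved it", "terrible plot and bad acting",
               "what a fantastic film", "boring and way too long"]
train_labels = ["pos", "neg", "pos", "neg"]

def clen(text):
    return len(gzip.compress(text.encode("utf-8")))

def ncd(x1, x2):
    c1, c2 = clen(x1), clen(x2)
    return (clen(x1 + " " + x2) - min(c1, c2)) / max(c1, c2)

def ncd_row(text):
    # One feature vector: NCD of `text` against every training sample,
    # kept in training order so rows stay aligned with labels.
    return [ncd(text, t) for t in train_texts]

if __name__ == "__main__":
    with Pool() as pool:
        # The quadratic, all-pairs part of the pipeline.
        X_train = pool.map(ncd_row, train_texts)

    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train, train_labels)

    print(knn.predict([ncd_row("an absolutely wonderful experience")]))
```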
Results are mixed when compared to the original research claim that this compressor-based method could beat BERT on sentiment. Using a small dataset (500 samples), accuracy lands around 70% in the author’s tests, with noticeable variance depending on which samples are drawn. The method stabilizes as dataset size grows: around 10,000 samples, accuracy settles near 75.7%. That’s still below the paper’s reported performance, but it remains strong for a baseline that uses only gzip, NCD, and KNN.
Part of the discrepancy is traced to a likely evaluation issue in the original paper: using K=2 in KNN can create ties, and the paper’s tie-handling reportedly counted ties as correct rather than breaking them. Correcting that kind of mistake can materially change reported accuracy. Even after accounting for it, the approach still performs meaningfully well, suggesting the compression-distance representation is not just a fluke.
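The tie problem is easy to reproduce in a toy voting function. The sketch below is hypothetical, not the paper’s evaluation code; it just shows how an explicit tie-break (falling back to the single nearest neighbor) avoids silently crediting ambiguous cases.

```python
from collections import Counter

def knn_vote(neighbor_labels):
    # Labels of the k nearest training samples, sorted closest-first.
    counts = Counter(neighbor_labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        # Tied vote (common with k=2): defer to the single closest neighbor
        # instead of counting the prediction as automatically correct.
        return neighbor_labels[0]
    return counts[0][0]

# With k=2 and one neighbor of each class the vote ties;
# the tie-break picks the closer neighbor's label ("neg").
print(knn_vote(["neg", "pos"]))
```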
The analysis also highlights practical limitations. Accuracy depends on text length: very short strings tend to be misclassified, and the classifier appears to work best when inputs are within a length range common to the training data (roughly 200+ characters, and more reliably around 600+ in these experiments). Attempts at adding dimensionality reduction via PCA didn’t reveal extra structure to exploit. Overall, the work functions as a reminder that non-neural baselines can still capture real signal—and that revisiting “first principles” methods can uncover useful, lightweight alternatives to deep learning in some NLP settings.
Cornell Notes
Gzip-based normalized compression distances (NCD) can turn raw text into numeric feature vectors for K-nearest neighbors. Each text is compressed with gzip, and NCD between two strings is computed from the compressed lengths of each string and of their concatenation. For classification, every training sample becomes a vector of NCD values against all other training samples; a new input gets its own NCD vector against the training set, then KNN votes using sentiment labels. On sentiment data, accuracy rises from roughly 70% on 500 samples to around 75.7% on 10,000 samples, though it doesn’t match the original paper’s claim of beating BERT. Performance is sensitive to text length, and runtime is dominated by the all-pairs NCD computation, which can be sped up with multiprocessing.
- How does gzip-based NCD convert two text strings into a comparable distance?
- What exactly becomes the feature vector for KNN in this approach?
- Why isn’t this “data leakage,” even though test vectors are computed against training samples?
- What caused a gap between the original paper’s reported results and the recreated results?
- Why does runtime become a problem, and how is it addressed?
- What input property most affects accuracy in practice?
Review Questions
- How does the NCD formula use min and max of individual compressed lengths to normalize similarity between two texts?
- Why does the method require an all-pairs NCD computation, and what part of the pipeline dominates runtime?
- What kinds of tie-handling issues can distort KNN accuracy when K is even (like K=2)?
Key Points
1. Represent each text as a vector of normalized gzip compression distances (NCD) to all training samples, then run K-nearest neighbors on those vectors.
2. Compute NCD from the gzip-compressed lengths of X1, X2, and X1 + X2 as (C(X1 + X2) − min(C(X1), C(X2))) / max(C(X1), C(X2)); a worked example follows this list.
3. The main computational cost is building the full NCD matrix (quadratic in dataset size), not the KNN prediction step.
4. Reported performance can be inflated if KNN tie cases (common with K=2) are handled incorrectly or counted as correct without a tie-breaker.
5. Accuracy improves with larger datasets (roughly 70% at 500 samples to about 75.7% at 10,000 samples in these tests) but remains sensitive to input length.
6. Inputs far shorter than the training distribution tend to be misclassified, suggesting the compression-distance signal depends on having enough text for statistical patterns to emerge.
7. Even without neural networks, compressor-based similarity can capture sentiment-related language regularities well enough to beat random guessing by a wide margin.
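For a concrete sense of the formula in point 2, take hypothetical compressed lengths of C(X1) = 110 bytes, C(X2) = 130 bytes, and C(X1 + X2) = 190 bytes (made-up numbers, purely for illustration): NCD = (190 − 110) / 130 ≈ 0.62. Values near 0 mean the pair compresses almost as if it were a single text; values near 1 mean the compressor finds little shared structure.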