Distributed Representations of Words and Phrases and their Compositionality

Tomáš Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeff Dean

Read the full paper via its DOI or on arXiv

TL;DR

The paper improves Skip-gram training with three main extensions: negative sampling, subsampling of frequent words, and phrase token learning.

Briefing

This paper asks how to efficiently learn high-quality distributed vector representations for words and phrases, and whether these representations exhibit useful compositional structure. The question matters because scalable representation learning is a prerequisite for many downstream NLP tasks (e.g., translation, speech, semantic similarity), and because earlier word-embedding methods struggled with two practical issues: (1) training efficiency on very large corpora and vocabularies, and (2) representing phrase-level meaning, especially for idiomatic or non-compositional phrases.

The authors build on the continuous Skip-gram model, which learns vectors by predicting nearby context words from a center word. Their core contribution is to present several extensions that improve both training speed and embedding quality: subsampling of frequent words, a simpler alternative to hierarchical softmax via negative sampling, and a method for learning phrase embeddings by detecting frequent word pairs and treating them as single tokens. They also provide evidence that the learned vectors support linear compositional operations, enabling analogical reasoning and meaningful vector addition.

Methodologically, the paper evaluates multiple training objectives and training-time optimizations on analogy benchmarks. For word embeddings, they use the analogical reasoning task introduced in prior work: given an analogy of the form A:B::C:?, the model is considered correct if the nearest vector (by cosine distance) to vec(B) − vec(A) + vec(C) matches the target word. They train Skip-gram variants with 300-dimensional vectors; the context window size is a tunable parameter, and in the later phrase experiments they explicitly use context size 5. They discard words that occur fewer than 5 times, resulting in a vocabulary of 692K words. The main training corpus for the word experiments is an internal Google News dataset of one billion words.
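To make the evaluation criterion concrete, here is a minimal sketch (not the authors' code) of the analogy test, assuming `vecs` is a hypothetical dictionary mapping words to unit-normalized NumPy vectors:

```python
# Minimal sketch of the analogy evaluation criterion, assuming `vecs` is a
# hypothetical dict mapping each word to a unit-normalized NumPy vector.
import numpy as np

def analogy(vecs, a, b, c):
    """Answer a:b :: c:? via cosine similarity to vec(b) - vec(a) + vec(c)."""
    query = vecs[b] - vecs[a] + vecs[c]
    query /= np.linalg.norm(query)
    best_word, best_sim = None, -np.inf
    for word, v in vecs.items():
        if word in (a, b, c):       # exclude the query words themselves
            continue
        sim = float(query @ v)      # dot product = cosine for unit vectors
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# e.g. analogy(vecs, "man", "king", "woman") should return "queen"
# if the embeddings capture the gender relation.
```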

To address the computational bottleneck of the full softmax (whose gradient cost scales with the vocabulary size |V|), they compare hierarchical softmax (HS) and negative sampling (NEG) against noise contrastive estimation (NCE). HS uses a Huffman-coded binary tree so that computing probabilities and gradients requires evaluating only about log2(|V|) nodes rather than all vocabulary outputs. NEG simplifies NCE by using only k sampled negatives and optimizing a logistic loss that distinguishes the true target word from noise words drawn from a noise distribution Pn(w). They report that a noise distribution proportional to the unigram distribution raised to the 3/4 power (i.e., Pn(w) ∝ U(w)^(3/4)) outperforms both the plain unigram and the uniform choices across tasks.
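The following sketch illustrates the NEG objective for a single (center, context) pair, including the U(w)^(3/4) noise distribution; the array names and layout are assumptions for illustration, not the paper's implementation:

```python
# Minimal sketch of one negative-sampling (NEG) step, not the paper's code.
# `in_vecs`/`out_vecs` are hypothetical (V, d) arrays of input/output vectors,
# and `counts` holds unigram counts indexed consistently with them.
import numpy as np

rng = np.random.default_rng(0)

def noise_distribution(counts):
    """Noise distribution proportional to unigram counts raised to 3/4."""
    p = np.asarray(counts, dtype=np.float64) ** 0.75
    return p / p.sum()

def neg_loss(in_vecs, out_vecs, center, context, p_noise, k=5):
    """NEG objective for one (center, context) pair with k noise words."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    # Score the true context word positively...
    pos = np.log(sigmoid(out_vecs[context] @ in_vecs[center]))
    # ...and k sampled noise words negatively.
    noise = rng.choice(len(p_noise), size=k, p=p_noise)
    neg = np.log(sigmoid(-out_vecs[noise] @ in_vecs[center])).sum()
    return -(pos + neg)  # minimize the negative log-likelihood
```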

They also introduce subsampling of frequent words. Each word wi is discarded with probability P(wi) = 1 − sqrt(t / f(wi)), where f(wi) is the word's frequency and t is a threshold typically around 10⁻⁵. This reduces the dominance of high-frequency function words, increases the effective proportion of informative co-occurrences, and speeds up training.
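A minimal sketch of the subsampling rule, assuming `freq` is a word's relative corpus frequency (count divided by total tokens):

```python
# Minimal sketch of the frequent-word subsampling rule.
import random

def keep_word(freq, t=1e-5):
    """Discard a token with probability 1 - sqrt(t / freq)."""
    discard_prob = max(0.0, 1.0 - (t / freq) ** 0.5)
    return random.random() >= discard_prob

# e.g. a word taking up 1% of the corpus (freq=0.01) is discarded
# ~97% of the time, while words rarer than t are always kept.
```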

Key findings for word embeddings are summarized in Table 1. On the analogy task, negative sampling outperforms hierarchical softmax and is slightly better than NCE. With 300-dimensional embeddings trained on the one-billion-word dataset, the reported accuracies (total, with syntactic/semantic breakdown) are:

  • NEG with k=5: total 59% (syntactic 63%, semantic 54%), training time 38 minutes.
  • NEG with k=15: total 61% (syntactic 63%, semantic 58%), training time 97 minutes.
  • HS-Huffman: total 47% (syntactic 53%, semantic 40%), training time 41 minutes.
  • NCE with k=5: total 53% (syntactic 60%, semantic 45%), training time 38 minutes.

The paper also reports that applying subsampling of frequent words improves accuracy and yields large speedups. For example, with subsampling enabled, NEG-5 reaches total accuracy 60% with training time 14 minutes, and NEG-15 reaches total accuracy 61% with training time 36 minutes, while HS-Huffman with subsampling reaches total accuracy 55% with training time 21 minutes. Overall, the authors emphasize speedups on the order of roughly 2x to 10x from subsampling.

For phrase embeddings, the authors first detect phrases using a data-driven bigram scoring function based on unigram and bigram counts: score(wi, wj) = (count(wi wj) − δ) / (count(wi) × count(wj)), where δ is a discounting coefficient that prevents very infrequent word pairs from being merged. Bigrams with scores above a threshold are merged into single tokens, and they run multiple passes over the training data with decreasing thresholds to allow phrases longer than two words to form. They then train Skip-gram on this phrase-tokenized corpus, using 300-dimensional vectors and context size 5 for the initial phrase experiments.
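A minimal sketch of this scoring-and-merging pass, assuming `unigram` and `bigram` are count dictionaries built from the corpus (the δ default below is illustrative, not a value from the paper):

```python
# Minimal sketch of the bigram phrase score and one merge pass.
def phrase_score(unigram, bigram, w1, w2, delta=5):
    """Count-based score; delta discounts very rare word pairs."""
    return (bigram.get((w1, w2), 0) - delta) / (unigram[w1] * unigram[w2])

def merge_phrases(tokens, unigram, bigram, threshold):
    """One pass: join adjacent words whose score exceeds the threshold."""
    out, i = [], 0
    while i < len(tokens):
        if (i + 1 < len(tokens)
                and phrase_score(unigram, bigram, tokens[i], tokens[i + 1]) > threshold):
            out.append(tokens[i] + "_" + tokens[i + 1])  # e.g. "New_York"
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```

Running this function repeatedly with decreasing thresholds lets merged bigrams combine again, producing longer multiword phrases, as the paper describes.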

They evaluate phrase vectors using a phrase analogy dataset (3218 examples) and the same nearest-vector analogy criterion. Table 3 reports phrase analogy accuracy under different training choices:

  • NEG-5: 24% without subsampling vs 27% with subsampling.
  • NEG-15: 27% without subsampling vs 42% with subsampling.
  • HS-Huffman: 19% without subsampling vs 47% with subsampling.

These results show that subsampling is especially beneficial for hierarchical softmax in the phrase setting, and that increasing k improves negative sampling performance.

To maximize phrase accuracy, they scale up training data substantially. They train on a dataset of about 33 billion words, using hierarchical softmax, 1000-dimensional embeddings, and the entire sentence as context. Under these settings, the model reaches 72% accuracy on the phrase analogy dataset; reducing training data to 6 billion words lowers accuracy to 66%, suggesting that large-scale data is crucial.

Beyond analogy benchmarks, the paper provides qualitative and mechanistic evidence of compositionality. First, it shows that analogical reasoning can be performed with vector arithmetic (e.g., country-capital relations). Second, it reports an additional linear property: element-wise addition of word vectors can yield meaningful combined meanings. For instance, they state that the sum of vectors for “Russia” and “river” is close to “Volga River,” and “Germany” plus “capital” is close to “Berlin.” They also provide an explanation grounded in the Skip-gram objective: word vectors relate to log context probabilities, so adding vectors corresponds to combining (approximately) log-probability features, which behaves like an AND over contexts.
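A minimal sketch of this additive composition, reusing the same hypothetical unit-normalized vector dictionary as in the analogy sketch above:

```python
# Minimal sketch of additive composition, assuming `vecs` is a dict of
# unit-normalized NumPy vectors over a phrase-tokenized vocabulary.
import numpy as np

def compose(vecs, w1, w2, topn=3):
    """Nearest neighbors of vec(w1) + vec(w2) by cosine similarity."""
    query = vecs[w1] + vecs[w2]
    query /= np.linalg.norm(query)
    sims = {w: float(query @ v) for w, v in vecs.items() if w not in (w1, w2)}
    return sorted(sims, key=sims.get, reverse=True)[:topn]

# e.g. compose(vecs, "Russia", "river") should rank "Volga_River" highly
# if the embeddings were trained with phrase tokens.
```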

Limitations are not deeply formalized (the paper is primarily empirical), but several constraints are apparent from the methodology. Phrase detection is heuristic and depends on thresholds and multiple passes; the authors do not compare their phrase extraction method against other phrase mining approaches. Evaluation relies on analogy tasks, which may not fully capture performance on broader NLP objectives. Additionally, the paper’s reported results depend on specific hyperparameter choices (vector dimensionality, context window, subsampling threshold t, number of negative samples k, and noise distribution exponent), and the authors note that optimal hyperparameters are task-specific.

Practical implications are clear. For practitioners training word embeddings at scale, the paper provides concrete guidance: use negative sampling with an appropriate k (the paper suggests 5–20 for smaller datasets and as few as 2–5 for very large ones), subsample frequent words using the formula above to gain large speedups and better rare-word vectors, and consider hierarchical softmax when subsampling is used (especially for phrase embeddings). For applications requiring phrase-level representations (e.g., retrieval, semantic matching, or feeding phrase tokens into downstream models), the phrase-token approach offers a simple path to incorporating non-compositional phrase meaning without complex recursive architectures. NLP engineers and researchers building embedding pipelines for large corpora should care, as should teams needing efficient training and improved representations for rare entities and multiword expressions.
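For readers who want to apply this recipe today, here is a hedged sketch using the open-source gensim library (assuming gensim 4.x; this is an independent reimplementation of the paper's ideas, and parameter names follow gensim's API rather than the original C code). `sentences` is assumed to be a list of token lists:

```python
# Sketch of the paper's recipe with gensim 4.x (not the authors' code).
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases

# Pass 1: merge frequent bigrams into single phrase tokens ("new_york").
phrases = Phrases(sentences, min_count=5, threshold=10.0)
phrase_sentences = [phrases[s] for s in sentences]

# Skip-gram (sg=1) with negative sampling and frequent-word subsampling,
# mirroring the hyperparameters reported in the paper.
model = Word2Vec(
    phrase_sentences,
    vector_size=300,   # embedding dimensionality
    window=5,          # context size
    min_count=5,       # discard rare words
    sg=1,              # Skip-gram architecture
    negative=15,       # k noise words per positive example
    sample=1e-5,       # subsampling threshold t
)

# Analogical query: vec("king") - vec("man") + vec("woman") ≈ vec("queen")
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"]))
```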

Overall, the paper’s core contribution is demonstrating that with efficient training objectives (negative sampling), frequency-aware sampling (subsampling), and scalable phrase tokenization, Skip-gram embeddings can be trained on orders of magnitude more data while retaining (and sometimes enhancing) linear compositional behaviors that support analogical reasoning and meaningful vector arithmetic—a combination that helped establish word2vec-style embeddings as a practical foundation for modern NLP.

Cornell Notes

The paper extends the Skip-gram word2vec model with negative sampling, frequency subsampling, and a phrase-detection/tokenization method. It shows that these changes greatly improve training speed and embedding quality, and that learned word/phrase vectors support linear compositionality via analogies and vector addition.

What is the main research question of the paper?

How can we efficiently learn high-quality distributed vector representations for words and phrases, and do these representations exhibit useful compositional structure (e.g., analogies and vector arithmetic)?

What study design and evaluation method do the authors use to test embedding quality?

They train multiple Skip-gram variants on large news corpora and evaluate them on analogy reasoning tasks using nearest-neighbor vector arithmetic with cosine similarity for both word and phrase analogies.

What data sources and vocabulary filtering are used for word embedding experiments?

They train on an internal Google news dataset of about one billion words, discarding words occurring fewer than 5 times, yielding a vocabulary of 692K.

How do hierarchical softmax and negative sampling reduce computational cost compared to full softmax?

Hierarchical softmax replaces the full softmax with a Huffman-coded binary tree, so probability/gradient computation scales with the path length, about log2(|V|) (roughly 20 binary decisions for the 692K-word vocabulary instead of 692K outputs). Negative sampling replaces the softmax objective with logistic regression that distinguishes the true target from k noise samples.

What noise distribution works best for negative sampling/NCE in the authors’ experiments?

A unigram distribution raised to the 3/4 power (i.e., Pn(w) ∝ U(w)^(3/4)) outperformed both the plain unigram and uniform distributions across the tasks they tried.

What is the subsampling method for frequent words, and what does it accomplish?

Each word wi is discarded with probability P(wi) = 1 − sqrt(t / f(wi)) (with t typically around 10⁻⁵), which speeds training and improves accuracy for rare words by reducing the dominance of frequent co-occurrences.

What are the main quantitative results for word analogies?

On the one-billion-word setup, NEG-5 achieves 59% total analogy accuracy (38 minutes), NEG-15 achieves 61% (97 minutes), HS-Huffman achieves 47% (41 minutes), and NCE-5 achieves 53% (38 minutes). With subsampling, NEG-5 reaches 60% in 14 minutes and NEG-15 reaches 61% in 36 minutes.

How do phrase embeddings differ from word embeddings in this paper?

They detect phrases using a count-based scoring function, merge detected bigrams into single tokens, then train Skip-gram on the phrase-tokenized corpus and evaluate on a phrase analogy dataset (3218 examples).

What is the best reported phrase analogy performance and under what training conditions?

With about 33B words, hierarchical softmax, 1000-dimensional vectors, and full-sentence context, the model reaches 72% accuracy; with 6B words it drops to 66%.

What compositional behavior do the authors claim beyond analogies?

They show that element-wise addition of word vectors can produce meaningful combined meanings (e.g., “Russia” + “river” near “Volga River”), and they explain this via the relationship between vectors and context log-probabilities in the Skip-gram objective.

Review Questions

  1. Which training objective (HS, NCE, NEG) performed best on word analogies, and how did subsampling change the speed/accuracy tradeoff?

  2. How does the choice of noise distribution relate to negative sampling’s objective?

  3. What is the phrase detection scoring function, and why does it include a discount term δ?

  4. Explain, at a high level, why vector addition can correspond to combining context distributions in Skip-gram.

  5. What evidence do the authors provide that scaling training data improves phrase analogy accuracy?

Key Points

  1. The paper improves Skip-gram training with three main extensions: negative sampling, subsampling of frequent words, and phrase token learning.

  2. Negative sampling with k=5 or k=15 yields higher analogy accuracy than hierarchical softmax and is slightly better than NCE on the reported word analogy benchmark.

  3. Subsampling frequent words (discard probability 1 − sqrt(t / f(wi))) provides large speedups (about 2x–10x) and improves rare-word embedding quality.

  4. On word analogies (1B-word training), NEG-5 reaches 59% total accuracy and NEG-15 reaches 61%, while HS-Huffman reaches 47%. With subsampling, NEG-5 reaches 60% in 14 minutes and NEG-15 reaches 61% in 36 minutes.

  5. For phrase analogies, subsampling is crucial: NEG-15 improves from 27% to 42% with subsampling, and HS-Huffman improves from 19% to 47%.

  6. Phrase embeddings are learned by detecting frequent/informative bigrams using a count-based score and treating them as single tokens during Skip-gram training.

  7. Scaling phrase training data to ~33B words with hierarchical softmax and 1000 dimensions yields 72% phrase analogy accuracy (vs 66% with 6B words).

  8. The learned vectors exhibit linear compositionality: analogies work via vec(B) − vec(A) + vec(C), and element-wise addition can yield meaningful phrase-like combinations (e.g., “Russia” + “river” → “Volga River”).

Highlights

“Negative Sampling outperforms the Hierarchical Softmax on the analogical reasoning task, and has even slightly better performance than the Noise Contrastive Estimation.” (Table 1)
With subsampling, “NEG-5” reaches 60% total accuracy in 14 minutes, compared to 59% in 38 minutes without subsampling. (Table 1)
For phrase analogies, “NEG-15” improves from 27% (no subsampling) to 42% with subsampling. (Table 3)
The best phrase model reaches “72%” accuracy using ~33B words, hierarchical softmax, 1000 dimensions, and full-sentence context; it drops to 66% with 6B words. (Phrase results section)
The authors report that “vec(‘Russia’) + vec(‘river’) is close to vec(‘Volga River’)” and “vec(‘Germany’) + vec(‘capital’) is close to vec(‘Berlin’).” (Additive compositionality section)

Topics

  • Natural language processing
  • Representation learning
  • Word embeddings
  • Neural language models
  • Efficient training objectives
  • Negative sampling
  • Hierarchical softmax
  • Phrase mining
  • Distributional semantics
  • Compositionality in vector spaces

Mentioned

  • Google
  • word2vec (open-source project)
  • PCA (used for visualization)
  • Tomáš Mikolov
  • Ilya Sutskever
  • Kai Chen
  • Greg S. Corrado
  • Jeff Dean
  • Yoshua Bengio
  • Ronan Collobert
  • Jason Weston
  • Andriy Mnih
  • Geoffrey Hinton
  • Frederic Morin
  • Aapo Hyvärinen
  • Yee Whye Teh
  • Richard Socher
  • Andrew Y. Ng
  • Christopher D. Manning
  • Peter D. Turney
  • Patrick Pantel
  • Wen-tau Yih
  • Geoffrey Zweig
  • HS - Hierarchical Softmax
  • NCE - Noise Contrastive Estimation
  • NEG - Negative Sampling
  • NLP - Natural Language Processing
  • PCA - Principal Component Analysis