
Using Semantic Trees In Place of Sentences | Munashe Shumba | OpenAI Scholars Demo Day 2018

OpenAI · 5 min read

Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Dependency parse trees improved semantic relatedness prediction versus plain word-sequence inputs, lowering mean squared error from 1.3 to 0.35.

Briefing

Semantic trees—specifically dependency parse trees—can outperform plain sentence sequences for natural-language tasks like semantic relatedness when paired with an LSTM-style model. In a demo built around 10,000 sentence-pair comparisons, the dependency-tree approach reached lower mean squared error (0.35) than a baseline trained on regular sentence word sequences (1.3), reaching the target loss in far fewer training steps. The result matters because it suggests that preserving grammatical structure as an explicit tree can make learning more sample- and compute-efficient, even when the downstream model is sequence-based.

The work started with a semantic relatedness quiz: pairs of sentences were scored from 1 to 5 based on how closely they matched in meaning. Examples included near-paraphrases about “dogs fighting” and “dogs wrestling and hugging,” plus unrelated cases where only the actor overlapped. Instead of treating sentences as undifferentiated word sequences, the project treated meaning as layered: a high-level “essence” (e.g., fighting/wrestling) surrounded by additional details (who is fighting, what objects are involved, when it happens, and which participants are present).

To generate the tree structure, the project used a syntactic parser (referred to in the talk as a “syntactic net,” likely Google's SyntaxNet, producing dependency trees) and manually corrected parsing mistakes. Word embeddings came from GloVe, converting each token into a vector representation. The key technical challenge was feeding trees into an LSTM, which expects sequences. The solution was to linearize the dependency tree with a depth-first search traversal, producing a bracketed sequential form. In this representation, parent-child structure becomes explicit via bracket tokens placed around subtrees, with the “main idea” appearing early in the traversal and more specific details appearing deeper in the resulting sequence.
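A minimal sketch of this linearization step, assuming spaCy as the dependency parser (the talk does not name the parser library) and hypothetical bracket tokens `<(` and `)>`:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed parser; any dependency parser would do

OPEN, CLOSE = "<(", ")>"  # hypothetical bracket tokens marking subtree boundaries

def linearize(token):
    """Depth-first traversal of a dependency subtree into a bracketed token list.

    The head word comes first, then each child's subtree wrapped in brackets,
    so the root verb (the "main idea") appears early and details appear deeper.
    """
    seq = [token.text.lower()]
    for child in token.children:
        seq += [OPEN] + linearize(child) + [CLOSE]
    return seq

doc = nlp("Two dogs are fighting in the yard")
root = next(t for t in doc if t.head == t)  # root token of the dependency tree
print(" ".join(linearize(root)))
# e.g. "fighting <( dogs <( two )> )> <( are )> <( in <( yard <( the )> )> )>"
```

Each bracketed token sequence is then mapped to GloVe vectors before being fed to the sequence model.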

Two models were trained for comparison: one consumed the linearized dependency trees, and the other consumed the original sentences as standard word sequences. Both achieved similar training loss, but the tree-based model converged dramatically faster—about 150 steps to reach the target level versus roughly 1.06 million steps for the sentence baseline. That convergence gap translated into the lower mean squared error reported for semantic relatedness.
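To make the comparison concrete, the following is one plausible shape for such a relatedness model in PyTorch: a shared LSTM encodes each token sequence (linearized tree or plain sentence), and a small regression head predicts the 1–5 score under MSE loss. This is a hedged sketch of a common architecture, not the speaker's exact implementation, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class RelatednessLSTM(nn.Module):
    """Encode two token sequences with a shared LSTM and predict a 1-5 relatedness score."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=150):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # initialized from GloVe in practice
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(2 * hidden_dim, 1)

    def encode(self, ids):
        _, (h, _) = self.lstm(self.embed(ids))  # keep the final hidden state per sequence
        return h[-1]

    def forward(self, ids_a, ids_b):
        vec = torch.cat([self.encode(ids_a), self.encode(ids_b)], dim=-1)
        return 1.0 + 4.0 * torch.sigmoid(self.head(vec)).squeeze(-1)  # squash into [1, 5]

model = RelatednessLSTM(vocab_size=20_000)
loss_fn = nn.MSELoss()  # mean squared error, the metric reported in the talk (0.35 vs 1.3)
```

The same model class can consume either input format; only the token sequences differ, which is what makes the convergence comparison meaningful.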

Looking ahead, the project planned to move beyond generic LSTMs toward tree-structured recurrent models (Tree-LSTMs, LSTM variants designed for trees) that can operate directly on tree inputs rather than relying on bracketed sequence conversion. There was also interest in applying the same dependency-tree manipulation to question answering with SQuAD, leveraging the fact that dependency trees are easy to edit: for example, reordering a set of children in the tree to generate new, structurally valid training examples. The demo also clarified that the relatedness scores were curated by people with consensus rather than derived from an automatic empirical metric, and it described how special bracket symbols were embedded separately from normal word vectors to prevent confusion with ordinary tokens.

Cornell Notes

Dependency parse trees can improve semantic relatedness modeling compared with treating sentences as plain word sequences. In experiments on 10,000 sentence-pair examples scored by human consensus (1–5), a model trained on linearized dependency trees achieved lower mean squared error (0.35) than a baseline trained on regular sentence sequences (1.3). The tree-based model also converged far faster, reaching the target loss in about 150 steps versus roughly 1.06 million steps for the sentence baseline, despite similar training loss levels. The approach uses GloVe embeddings for words, converts trees to sequences via depth-first traversal with bracket tokens, and then feeds the result into an LSTM-style architecture. Future work targets tree-specific LSTM variants and data augmentation for SQuAD by editing dependency structures.

Why does representing sentences as dependency trees help semantic relatedness modeling?

Dependency trees encode grammatical structure—who modifies whom and how phrases attach—so the model can learn meaning composition with structure preserved. In the demo, the tree-based model reached lower mean squared error (0.35) than the sentence-sequence baseline (1.3), suggesting that explicit structure reduces learning difficulty for semantic similarity tasks.

How were the dependency trees turned into something an LSTM could consume?

The project linearized each dependency tree using a depth-first search traversal. It produced a bracketed sequential representation where subtree boundaries are marked with special bracket tokens, and parent-child relationships appear in the order produced by the traversal. This allowed a sequence model to ingest tree structure without changing the core LSTM interface.

What role did embeddings and special symbols play in the tree representation?

Words were embedded using GloVe. Bracket tokens used for tree linearization were handled separately: the project created custom symbols and assigned them vectors designed to be far from normal word vectors, with opening and closing bracket symbols placed close to each other but still distant from other tokens. This separation aimed to prevent the model from treating brackets as ordinary words.
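One way to realize “far from normal word vectors, but close to each other” is to give both bracket tokens a large shared offset vector plus a small per-token perturbation. The exact construction used in the project was not specified; this NumPy sketch uses illustrative magnitudes and hypothetical token names.

```python
import numpy as np

dim = 300  # GloVe dimensionality
rng = np.random.default_rng(0)

# GloVe entries are typically small in magnitude, so a large shared offset
# pushes both bracket embeddings far away from the whole word vocabulary.
offset = rng.normal(size=dim)
offset = 20.0 * offset / np.linalg.norm(offset)

open_bracket = offset + 0.1 * rng.normal(size=dim)   # "<(" sits near the offset
close_bracket = offset + 0.1 * rng.normal(size=dim)  # ")>" sits near "<(" but not on it

print(np.linalg.norm(open_bracket - close_bracket))  # small: brackets stay close together
```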

What evidence showed the tree model was more efficient than the sentence baseline?

Both models reportedly reached similar training loss, but the dependency-tree model converged much faster. It hit the target level at about 150 steps, while the sentence-sequence model required on the order of a million steps (figures of roughly 1.06 million and 1.8×10^6 both appear in the transcript), indicating faster learning when structure is explicit.

How were the semantic relatedness scores obtained for training and evaluation?

The 1–5 scores were not automatic measurements. They were curated by people with significant consensus, meaning the dataset reflects human judgment of semantic similarity rather than an empirically computed metric.

How might dependency trees be used for data augmentation in question answering?

Because dependency trees are manipulable, the project suggested editing the tree structure—such as changing the order of a set of children—to generate new tree variants. Those variants can be converted into new training examples, potentially increasing dataset size for tasks like SQuAD, though not every manipulation would remain valid in all cases.
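A hedged sketch of that augmentation idea, using a hypothetical minimal tree class: permute the order of a node's children and re-linearize each variant into a new training sequence. In practice only certain child sets can be safely reordered (for example coordinated phrases), so a real pipeline would filter the permutations.

```python
import itertools
from dataclasses import dataclass, field

@dataclass
class Node:
    word: str
    children: list = field(default_factory=list)

def linearize(node, open_tok="<(", close_tok=")>"):
    """Bracketed depth-first linearization, mirroring the sequence format above."""
    seq = [node.word]
    for child in node.children:
        seq += [open_tok] + linearize(child, open_tok, close_tok) + [close_tok]
    return seq

def reorderings(node):
    """Yield tree variants with the root's children in every possible order."""
    for perm in itertools.permutations(node.children):
        yield Node(node.word, list(perm))

# Toy tree for "dogs fighting in the yard", rooted at "fighting"
tree = Node("fighting", [Node("dogs"), Node("in", [Node("yard", [Node("the")])])])
for variant in reorderings(tree):
    print(" ".join(linearize(variant)))
```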

Review Questions

  1. What specific mechanism converted dependency trees into LSTM-compatible inputs, and why was that necessary?
  2. Compare the reported mean squared error and convergence speed between the dependency-tree model and the sentence-sequence baseline. What do those differences imply?
  3. Why might human-curated semantic relatedness scores (with consensus) affect how you interpret model performance?

Key Points

  1. Dependency parse trees improved semantic relatedness prediction versus plain word-sequence inputs, lowering mean squared error from 1.3 to 0.35.
  2. Tree-based training reached the target loss in far fewer steps (about 150) than the sentence-sequence baseline (about 1.06 million).
  3. Dependency trees were converted to sequences using depth-first traversal and a bracketed representation so an LSTM-style model could process them.
  4. GloVe embeddings represented words, while custom bracket symbols were embedded in a way intended to keep them distinct from normal vocabulary tokens.
  5. The dependency trees were produced by a syntactic parser and required manual correction for parsing errors before training.
  6. The semantic relatedness labels (1–5) came from human curation with consensus rather than an automatically computed empirical score.
  7. Future plans included using tree-specific LSTM variants and applying dependency-tree edits for SQuAD-style question answering augmentation.

Highlights

Dependency-tree modeling cut mean squared error to 0.35 from 1.3 while keeping training loss levels comparable to the sentence baseline.
The tree approach converged in ~150 steps, versus ~1.06 million steps for the word-sequence model.
Depth-first traversal plus bracket tokens turned dependency structure into an LSTM-friendly sequence representation.
Human consensus—not an automatic metric—generated the 1–5 semantic relatedness scores used for training and evaluation.

Topics

  • Dependency Trees
  • Semantic Relatedness
  • LSTM
  • Tree Linearization
  • SQuAD

Mentioned

  • Manoj Jaya
  • Munashe Shumba
  • LSTM
  • GloVe
  • SQuAD