Using Semantic Trees In Place of Sentences | Munashe Shumba | OpenAI Scholars Demo Day 2018
Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Dependency parse trees improved semantic relatedness prediction versus plain word-sequence inputs, lowering mean squared error from 1.3 to 0.35.
Briefing
Semantic trees—specifically dependency parse trees—can outperform plain sentence sequences for natural-language tasks like semantic relatedness when paired with an LSTM-style model. In a demo built around 10,000 sentence-pair comparisons, the dependency-tree approach achieved lower mean squared error (0.35) than a baseline trained on plain word sequences (1.3) and reached the target loss in far fewer training steps. The result matters because it suggests that preserving grammatical structure as an explicit tree can make learning more sample- and compute-efficient, even when the downstream model is sequence-based.
The work started with a semantic relatedness quiz: pairs of sentences were scored from 1 to 5 by how closely they matched in meaning. Examples included near-paraphrases about “dogs fighting” versus “dogs wrestling and hugging,” plus unrelated pairs where only the actor overlapped. Instead of treating sentences as undifferentiated word sequences, the project treated meaning as layered: a high-level “essence” (e.g., fighting/wrestling) plus progressively more specific details (who is fighting, what objects are involved, which participants appear).
To generate the tree structure, the project used a syntactic parser (described in the talk as a “syntactic net,” likely Google’s SyntaxNet, which produces dependency trees) and manually corrected its parsing mistakes. Word embeddings came from GloVe, converting each token into a vector representation. The key technical challenge was feeding trees into an LSTM, which expects sequences. The solution was to linearize each dependency tree with a depth-first traversal, producing a bracketed sequential form: parent-child structure becomes explicit via bracket tokens wrapped around subtrees, so the “main idea” appears early in the traversal and more specific details appear deeper in the resulting sequence.
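The depth-first linearization can be sketched as follows. The `Node` class and the `(` / `)` bracket tokens are illustrative assumptions; the talk does not specify the exact symbols or data structures used.

```python
# Minimal sketch: linearize a dependency tree into a bracketed token
# sequence via depth-first traversal. Parent words come first, and each
# child subtree is wrapped in bracket tokens.

class Node:
    def __init__(self, word, children=None):
        self.word = word
        self.children = children or []

def linearize(node):
    """Return a flat token list: the node's word, then each child
    subtree enclosed in "(" and ")"."""
    tokens = [node.word]
    for child in node.children:
        tokens.append("(")
        tokens.extend(linearize(child))
        tokens.append(")")
    return tokens

# "dogs are fighting": root verb "fighting" with "dogs" and "are" as
# dependents (a simplified parse for illustration)
tree = Node("fighting", [Node("dogs"), Node("are")])
print(linearize(tree))  # ['fighting', '(', 'dogs', ')', '(', 'are', ')']
```

Note how the root verb—the “main idea”—leads the sequence, while dependents appear later and deeper, matching the layered-meaning intuition described above.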
Two models were trained for comparison: one consumed the linearized dependency trees, and the other consumed the original sentences as standard word sequences. Both achieved similar training loss, but the tree-based model converged dramatically faster—about 150 steps to reach the target level versus roughly 1.06 million steps for the sentence baseline. That convergence gap translated into the lower mean squared error reported for semantic relatedness.
Looking ahead, the project planned to move beyond generic LSTMs toward tree-structured recurrent models (referred to in the talk as “three LSTMs,” most likely Tree-LSTMs, an LSTM variant designed for trees) that operate directly on tree inputs rather than relying on bracketed sequence conversion. There was also interest in applying the same dependency-tree manipulation to question answering on SQuAD, exploiting the fact that dependency trees are easy to edit: for example, reordering a node’s children to generate new, structurally valid training examples. The demo also clarified that the relatedness scores were curated by human consensus rather than derived from an automatic empirical metric, and that the special bracket symbols were embedded separately from normal word vectors to keep them distinct from ordinary tokens.
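The child-reordering augmentation idea could be sketched like this. The `Node` structure is hypothetical, and the talk does not say which children are safe to permute, so this is a sketch of the mechanism rather than the project's actual pipeline:

```python
import itertools

class Node:
    def __init__(self, word, children=None):
        self.word = word
        self.children = children or []

def reorderings(node):
    """Yield copies of the tree with the root's children permuted.

    Each permutation keeps every subtree intact, so the variants remain
    structurally valid dependency trees (though not every resulting word
    order is grammatical -- a real system would filter these)."""
    for perm in itertools.permutations(node.children):
        yield Node(node.word, list(perm))

tree = Node("fighting", [Node("dogs"), Node("park")])
variants = list(reorderings(tree))
print([[c.word for c in v.children] for v in variants])
# [['dogs', 'park'], ['park', 'dogs']]
```

Because the edit happens at the tree level, the subject-verb-object relationships are preserved even when surface order changes, which is what makes the generated examples usable as extra training data.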
Cornell Notes
Dependency parse trees can improve semantic relatedness modeling compared with treating sentences as plain word sequences. In experiments on 10,000 sentence-pair examples scored by human consensus (1–5), a model trained on linearized dependency trees achieved lower mean squared error (0.35) than a baseline trained on regular sentence sequences (1.3). The tree-based model also converged far faster, reaching the target loss in about 150 steps versus roughly 1.06 million steps for the sentence baseline, despite similar training loss levels. The approach uses GloVe embeddings for words, converts trees to sequences via depth-first traversal with bracket tokens, and then feeds the result into an LSTM-style architecture. Future work targets tree-specific LSTM variants and data augmentation for SQuAD by editing dependency structures.
Why does representing sentences as dependency trees help semantic relatedness modeling?
How were the dependency trees turned into something an LSTM could consume?
What role did embeddings and special symbols play in the tree representation?
What evidence showed the tree model was more efficient than the sentence baseline?
How were the semantic relatedness scores obtained for training and evaluation?
How might dependency trees be used for data augmentation in question answering?
Review Questions
- What specific mechanism converted dependency trees into LSTM-compatible inputs, and why was that necessary?
- Compare the reported mean squared error and convergence speed between the dependency-tree model and the sentence-sequence baseline. What do those differences imply?
- Why might human-curated semantic relatedness scores (with consensus) affect how you interpret model performance?
Key Points
1. Dependency parse trees improved semantic relatedness prediction versus plain word-sequence inputs, lowering mean squared error from 1.3 to 0.35.
2. Tree-based training reached the target loss in far fewer steps (about 150) than the sentence-sequence baseline (about 1.06 million).
3. Dependency trees were converted to sequences using depth-first traversal and a bracketed representation so an LSTM-style model could process them.
4. GloVe embeddings represented words, while custom bracket symbols were embedded separately to keep them distinct from normal vocabulary tokens.
5. The dependency trees were produced by a syntactic parser and required manual correction of parsing errors before training.
6. The semantic relatedness labels (1–5) came from human curation with consensus rather than an automatically computed empirical score.
7. Future plans included tree-specific LSTM variants and dependency-tree edits for SQuAD-style question-answering augmentation.
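The separate embedding of bracket symbols mentioned above might be implemented along these lines. The embedding dimension, vocabulary, and random initialization are assumptions for illustration, not details from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50  # GloVe-style embedding size (assumed)

# Stand-in for pretrained GloVe vectors over the word vocabulary.
word_vocab = ["dogs", "are", "fighting"]
glove = {w: rng.standard_normal(dim) for w in word_vocab}

# Bracket symbols get their own dedicated rows, held outside the word
# vocabulary so they can never collide with an ordinary token.
special = {t: rng.standard_normal(dim) for t in ["(", ")"]}

def embed(token):
    """Look up special tokens first, then fall back to word vectors."""
    return special[token] if token in special else glove[token]

# Embed a linearized tree sequence into a (length, dim) matrix.
seq = ["fighting", "(", "dogs", ")", "(", "are", ")"]
matrix = np.stack([embed(t) for t in seq])
print(matrix.shape)  # (7, 50)
```

Keeping the bracket embeddings in a separate table makes the structural tokens trainable independently of the frozen pretrained word vectors, which is one plausible reading of “embedded separately” in the talk.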