Richard Socher on NLP at Salesforce (Full Stack Deep Learning - March 2019)
Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.
Briefing
Natural language processing is stuck in a cycle of single-task models that improve benchmarks but don’t add up to a general system. Richard Socher’s central argument is that the field should unify most NLP tasks under one multitask framework—so a single model can transfer knowledge across tasks, adapt to new domains, and eventually support zero-shot learning.
Socher traces how NLP moved from heavy manual feature engineering (hand-built signals like negation cues) to deep learning, where progress increasingly comes from engineering architectures for individual tasks. That approach works—especially when each task has enough labeled data—but it doesn’t naturally produce a continuously improving “one model for everything,” analogous to how humans learn across experiences rather than swapping models per problem. The missing ingredient, in his view, is a mechanism for knowledge transfer that doesn’t require retraining from scratch every time the task changes.
The proposed solution is the “Natural Language Decathlon” (DecaNLP), a multitask learning framework built around a single input-output pattern: provide a context document, ask a question, and generate the answer word by word. Most NLP problems can be cast into one of three structural families—sequence tagging, text classification, and sequence-to-sequence generation—and DecaNLP maps all of them into this unified question-answering format. Language modeling becomes “predict the next words that answer the question,” translation becomes “generate the next words in the target language,” summarization becomes “generate the summary tokens,” and classification becomes “ask for the label,” with the label options embedded in the question.
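For concreteness, here is a minimal sketch of how different tasks reduce to (context, question, answer) triples. The example strings are illustrative stand-ins, not verbatim items from the decaNLP datasets:

```python
# Every task becomes a (context, question, answer) triple, and one model
# generates the answer token-by-token. All strings here are illustrative.
examples = [
    # Question answering: the answer is typically a span of the context.
    {"context": "Area 51 is a U.S. Air Force facility in southern Nevada.",
     "question": "Where is Area 51 located?",
     "answer": "southern Nevada"},
    # Translation: the question names the target language.
    {"context": "The house is blue.",
     "question": "What is the translation from English to German?",
     "answer": "Das Haus ist blau."},
    # Summarization: the question asks for the summary tokens.
    {"context": "<full news article>",
     "question": "What is the summary?",
     "answer": "<short summary>"},
    # Classification: the candidate labels live inside the question,
    # so the decoder can answer by copying a label word from it.
    {"context": "This movie was a complete waste of time.",
     "question": "Is this review positive or negative?",
     "answer": "negative"},
]
```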
A major practical challenge is making tasks share representations without forcing the model into a narrow solution. Socher argues that NLP’s reasoning needs vary widely—logical and linguistic reasoning, sentiment judgment, and even commonsense/visual-like inference—so the community historically split NLP into many small benchmarks and chased improvements within each. DecaNLP instead aims for weight sharing across tasks, including shared encoders and a shared decoder, so the model learns how to adapt internally based on the task prompt.
The architecture combines shared bidirectional LSTMs for context and question encoding, a co-attention mechanism (capturing interactions between the two sequences), transformer layers, and a pointer-style generation mechanism that can copy from the context, copy from the question, or generate from the vocabulary via a softmax. The model is trained and evaluated jointly on ten publicly available tasks spanning question answering, semantic role labeling, relation extraction, sentiment analysis, entailment, translation, summarization, and more.
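The talk describes this output layer only at a high level; the following PyTorch sketch shows one plausible way to mix the three distributions, where the tensor shapes, the function name, and the learned three-way switch are assumptions rather than the exact published model:

```python
import torch
import torch.nn.functional as F

def output_distribution(vocab_logits, ctx_scores, q_scores,
                        ctx_token_ids, q_token_ids, switch_logits):
    """Pointer-style mixture: generate from the vocabulary, copy from the
    context, or copy from the question, weighted by a learned switch.

    vocab_logits:  (batch, vocab)    scores over the output vocabulary
    ctx_scores:    (batch, ctx_len)  attention scores over context tokens
    q_scores:      (batch, q_len)    attention scores over question tokens
    ctx_token_ids: (batch, ctx_len)  vocab ids of the context tokens
    q_token_ids:   (batch, q_len)    vocab ids of the question tokens
    switch_logits: (batch, 3)        scores for [generate, copy-ctx, copy-q]
    """
    p_vocab = F.softmax(vocab_logits, dim=-1)
    p_ctx_attn = F.softmax(ctx_scores, dim=-1)
    p_q_attn = F.softmax(q_scores, dim=-1)
    switch = F.softmax(switch_logits, dim=-1)

    # Scatter the copy probabilities onto vocabulary ids so all three
    # distributions live in the same output space.
    p_ctx = torch.zeros_like(p_vocab).scatter_add(1, ctx_token_ids, p_ctx_attn)
    p_q = torch.zeros_like(p_vocab).scatter_add(1, q_token_ids, p_q_attn)

    return (switch[:, 0:1] * p_vocab
            + switch[:, 1:2] * p_ctx
            + switch[:, 2:3] * p_q)

# Toy shapes: batch=2, vocab=10, ctx_len=4, q_len=3.
p = output_distribution(torch.randn(2, 10), torch.randn(2, 4),
                        torch.randn(2, 3), torch.randint(0, 10, (2, 4)),
                        torch.randint(0, 10, (2, 3)), torch.randn(2, 3))
assert torch.allclose(p.sum(dim=-1), torch.ones(2))  # valid distribution
```

Copying is what makes extraction and classification cheap: the model can emit a context word, or a label word embedded in the question, without that word needing high probability under the generation vocabulary.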
Results highlight that multitasking helps—especially for zero-shot relation extraction—while also revealing a key failure mode: when tasks are dissimilar, joint training can cause catastrophic interference, where learning one task degrades performance on another. To address this, DecaNLP uses “anti-curriculum learning,” starting with harder tasks (like translation and question answering) and adding easier ones (like sentiment classification) later. With tuning and improved pretraining, the combined model narrows the gap to an “oracle” setup that uses a separate model per task.
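The exact schedule isn't given in the talk; a minimal, self-contained sketch of an anti-curriculum task sampler could look like the following, where the task grouping, warmup length, and uniform mixing afterwards are all illustrative assumptions:

```python
import random
from collections import Counter

# Assumed split into "hard" and "easy" tasks, for illustration only.
HARD_TASKS = ["translation", "question_answering", "summarization"]
EASY_TASKS = ["sentiment", "relation_extraction", "entailment"]

def sample_task(step, warmup_steps=50_000):
    """Anti-curriculum sampling: draw mini-batches only from hard tasks
    during warmup, so the shared weights don't collapse into shortcuts
    that solve the easy tasks; then mix in everything uniformly."""
    if step < warmup_steps:
        return random.choice(HARD_TASKS)
    return random.choice(HARD_TASKS + EASY_TASKS)

# Sanity check: easy tasks only appear after the warmup phase.
print(Counter(sample_task(s) for s in range(1_000)))           # hard only
print(Counter(sample_task(s) for s in range(60_000, 61_000)))  # all six
```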
Beyond benchmark scores, the most striking claim is transfer: a pretrained multitask model can be applied off-the-shelf to new domains like Amazon product reviews and Yelp restaurant reviews, and it can answer newly phrased label questions (a form of zero-shot learning) by learning to point to relevant adjectives—even if those exact label prompts weren’t in training. The broader takeaway is that unified multitask learning and prompt-based task specification may be a fundamental step toward generalized NLP systems rather than more incremental architecture tweaks.
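To make the zero-shot claim concrete, the prompts below illustrate the kind of newly phrased label questions involved; these exact strings are hypothetical, not quoted from the talk:

```python
# Hypothetical zero-shot prompts for a pretrained multitask model. Because
# the decoder can point at label words inside the question, new label
# vocabularies and phrasings require no retraining.
review = "The pasta was incredible, but the service was painfully slow."

prompts = [
    ("familiar phrasing", "Is this review positive or negative?"),
    ("new phrasing",      "Does the reviewer sound happy or unhappy?"),
    ("new label set",     "Is this review about food or service?"),
]

for kind, question in prompts:
    print(f"[{kind}] context={review!r} question={question!r}")
```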
Cornell Notes
DecaNLP reframes most NLP tasks as a single problem: given a context and a natural-language question, generate the answer token-by-token. The approach unifies language modeling, question answering, and classification under one multitask model with shared encoders/decoder plus a pointer mechanism that can copy from the context or question or generate from vocabulary. Training across ten tasks improves transfer and enables zero-shot behavior, but joint training can also trigger catastrophic interference when tasks don’t relate. To reduce that, the system uses anti-curriculum learning—start with harder tasks and add easier ones later—then tunes pretraining and model components. The practical goal is a general NLP model that adapts to new tasks and domains without retraining separate systems for every benchmark.
- Why does casting many NLP tasks into one “context + question → answer” format matter?
- What are the three structural families of NLP tasks, and how does DecaNLP unify them?
- How does the pointer mechanism help the model handle tasks like translation, extraction, and classification?
- What is catastrophic interference in multitask learning, and how does DecaNLP respond?
- What evidence is offered for zero-shot transfer and domain adaptation?
- Why does Socher argue that NLP needs more than unsupervised learning to become general?
Review Questions
- How does embedding the task definition into the model inputs (as a question) change the training and transfer dynamics compared with training separate task-specific models?
- What mechanisms in DecaNLP allow outputs to be copied from the context or question, and why is that useful for extraction and classification tasks?
- Why might anti-curriculum learning outperform curriculum learning in large multitask settings, according to the explanation given?
Key Points
1. DecaNLP unifies many NLP tasks by converting them into a single “context + question → answer” generation format.
2. Shared weights across tasks are intended to enable transfer learning and reduce the need for separate models per benchmark.
3. A pointer-style generation mechanism lets the decoder copy from the context or question or generate from the vocabulary, improving performance on tasks with extractive or label-token outputs.
4. Multitask training can fail via catastrophic interference when tasks are dissimilar, so task relatedness strongly affects outcomes.
5. Anti-curriculum learning—starting with harder tasks and adding easier ones later—helps prevent the model from collapsing into solutions optimized for simple tasks.
6. Pretraining the full multitask model (not just word vectors) supports faster convergence and better zero-shot/domain transfer.
7. Zero-shot behavior is demonstrated by prompting with new label questions and applying the pretrained model to new sentiment domains without retraining.