Semantic Parsing English to GraphQL | Andre Carerra | OpenAI Scholars Demo Day 2020
Based on OpenAI's video on YouTube.
Briefing
Semantic parsing from English into GraphQL is feasible with general-purpose encoder–decoder language models, but accuracy lags behind SQL-focused systems because GraphQL’s structure is harder to map from natural language without language-specific modeling. The core result is a new English-to-GraphQL dataset built by converting the SQL-focused Spider benchmark into GraphQL. Models, especially T5, are then trained to generate GraphQL queries, and a prediction counts as correct when its parsed query tree matches the target’s under an order-insensitive comparison. On the resulting GraphQL validation set, T5 reaches roughly 46–50% exact set matching accuracy, a meaningful generalization signal given the dataset’s scale and the need to produce syntactically valid, schema-consistent queries.
The project starts by framing GraphQL as an API query language with a schema that acts like a contract: clients request nested data predictably, and the schema constrains what queries are valid. Semantic parsing is then defined as translating a natural-language utterance into a machine-readable logical form. In this case, the logical form is a GraphQL query derived from an English question and a GraphQL schema.
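To make the task definition concrete, here is a minimal sketch of one (question, schema, query) example and how it might be flattened into a source string for a sequence-to-sequence model. The class name, the schema, the question, and the prompt layout are all illustrative assumptions; the talk does not specify the exact serialization.

```python
from dataclasses import dataclass


@dataclass
class ParsingExample:
    """One English-to-GraphQL pair. All values used below are made up."""
    question: str  # natural-language utterance
    schema: str    # GraphQL schema (SDL) the query must be valid against
    query: str     # target logical form: a GraphQL query


def to_model_input(example: ParsingExample) -> str:
    # Assumed serialization: question and schema joined into one source string.
    return (f"translate English to GraphQL: {example.question} "
            f"| schema: {example.schema}")


example = ParsingExample(
    question="What country is the artist named Joe from?",
    # Hypothetical, heavily abbreviated schema.
    schema=("type artist { name: String! country: String! } "
            "type Query { artist(name: String): [artist!]! }"),
    query='query { artist(name: "Joe") { country } }',
)

source_text = to_model_input(example)  # what the encoder would read
target_text = example.query            # what the decoder should produce
print(source_text)
print(target_text)
```

Putting the schema in the input is what lets a single model be asked about schemas it did not see during training.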
A major obstacle is data availability. There is no native GraphQL equivalent to Spider, so the work repurposes Spider itself, which pairs roughly 10,000 natural-language questions with complex SQL queries across 200 databases and 138 domains. The conversion pipeline uses pgloader to move data from SQLite to Postgres and Hasura to infer a GraphQL schema from each Postgres database. The central technical step is translating SQL abstract syntax trees into GraphQL abstract syntax trees, then serializing those trees into raw GraphQL queries.
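The tree-to-tree step can be illustrated with a toy version. This is a sketch under strong assumptions, not the project's converter: `SqlSelect`, `GqlField`, and the table and column names are hypothetical, only a single-table `SELECT ... WHERE col = value` is handled, and the filter is written in the `where: {column: {_eq: value}}` style that Hasura generates.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SqlSelect:
    """Toy SQL AST for: SELECT <columns> FROM <table> [WHERE <col> = <value>]."""
    table: str
    columns: list
    where: Optional[tuple] = None  # (column, literal value)


@dataclass
class GqlField:
    """Toy GraphQL AST node: a field with arguments and sub-selections."""
    name: str
    arguments: dict = field(default_factory=dict)
    selections: list = field(default_factory=list)


def sql_to_graphql(select: SqlSelect) -> GqlField:
    # Table becomes the root field; selected columns become leaf fields.
    root = GqlField(name=select.table,
                    selections=[GqlField(name=c) for c in select.columns])
    if select.where is not None:
        column, value = select.where
        root.arguments["where"] = {column: {"_eq": value}}
    return root


def render_value(value) -> str:
    if isinstance(value, dict):
        inner = ", ".join(f"{k}: {render_value(v)}" for k, v in value.items())
        return "{" + inner + "}"
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, str):
        return '"' + value.replace('"', '\\"') + '"'
    return str(value)


def serialize(node: GqlField) -> str:
    text = node.name
    if node.arguments:
        args = ", ".join(f"{k}: {render_value(v)}"
                         for k, v in node.arguments.items())
        text += f"({args})"
    if node.selections:
        text += " { " + " ".join(serialize(s) for s in node.selections) + " }"
    return text


# SELECT airline FROM airlines WHERE abbreviation = 'UAL'
sql_ast = SqlSelect(table="airlines", columns=["airline"],
                    where=("abbreviation", "UAL"))
graphql_query = "query { " + serialize(sql_to_graphql(sql_ast)) + " }"
print(graphql_query)
# query { airlines(where: {abbreviation: {_eq: "UAL"}}) { airline } }
```

Working on trees rather than strings keeps the mapping rules local: each SQL node type gets one translation rule, and serialization is a separate final pass.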
Validation is treated as a first-class requirement. The generated queries are checked for GraphQL syntax, validated against the schema to ensure keywords and fields exist, and executed against an endpoint to confirm they run. This validation step is also where the conversion’s limitations show up: Hasura has no direct mapping for SQL’s GROUP BY, so many such queries were only partially transferred or handled differently. Converting them by hand would have been possible but was not feasible within the project’s time constraints.
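The three checks can be pictured as a pipeline that reports the first stage a query fails. The sketch below is a stand-in built on the standard library only: the `SCHEMA` dictionary, the type and field names, and the tiny parser (which skips argument lists without checking them) are assumptions for illustration. A real pipeline would use a full GraphQL parser and the introspected schema.

```python
import json
import re
import urllib.request

# Hypothetical schema summary: type -> field -> type of the field's
# sub-selection (None marks a scalar leaf).
SCHEMA = {
    "query_root": {"airlines": "airlines"},
    "airlines": {"airline": None, "abbreviation": None, "country": None},
}
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|[{}()]|[^\s{}()"]+')
NAME = re.compile(r"[_A-Za-z][_0-9A-Za-z]*")


def parse_selection(tokens, pos):
    """Parse '{ field [(args)] [{ ... }] ... }'; return (fields, next_pos)."""
    if pos >= len(tokens) or tokens[pos] != "{":
        raise SyntaxError("expected '{'")
    pos += 1
    fields = {}
    while pos < len(tokens) and tokens[pos] != "}":
        name = tokens[pos]
        if not NAME.fullmatch(name):
            raise SyntaxError(f"unexpected token {name!r}")
        pos += 1
        if pos < len(tokens) and tokens[pos] == "(":
            depth = 0
            while pos < len(tokens):  # skip the argument list unchecked
                depth += {"(": 1, ")": -1}.get(tokens[pos], 0)
                pos += 1
                if depth == 0:
                    break
            else:
                raise SyntaxError("unclosed '('")
        children = {}
        if pos < len(tokens) and tokens[pos] == "{":
            children, pos = parse_selection(tokens, pos)
        fields[name] = children
    if pos >= len(tokens):
        raise SyntaxError("unclosed '{'")
    if not fields:
        raise SyntaxError("empty selection set")
    return fields, pos + 1


def parse_query(query):
    tokens = TOKEN.findall(query)
    start = 1 if tokens[:1] == ["query"] else 0
    fields, end = parse_selection(tokens, start)
    if end != len(tokens):
        raise SyntaxError("trailing tokens after query")
    return fields


def check_schema(fields, type_name="query_root"):
    """Raise ValueError on unknown fields or a wrong selection shape."""
    for name, children in fields.items():
        if name not in SCHEMA[type_name]:
            raise ValueError(f"unknown field {name!r} on {type_name!r}")
        child_type = SCHEMA[type_name][name]
        if (child_type is None) != (not children):
            raise ValueError(f"wrong selection shape for {name!r}")
        if children:
            check_schema(children, child_type)


def execute(query, endpoint):
    """POST the query as JSON; GraphQL servers report failures under 'errors'."""
    body = json.dumps({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        endpoint, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def validate(query, endpoint=None):
    """Return the first failing stage, or 'ok' if every stage passes."""
    try:
        fields = parse_query(query)
    except SyntaxError:
        return "syntax"
    try:
        check_schema(fields)
    except ValueError:
        return "schema"
    if endpoint is not None and "errors" in execute(query, endpoint):
        return "execution"
    return "ok"


print(validate('query { airlines(where: {abbreviation: {_eq: "UAL"}}) { airline } }'))
print(validate("query { airlines { airline "))
print(validate("query { airlines { founder } }"))
```

The execution stage only runs when an endpoint URL is supplied, so the first two stages can be tested offline. Ordering the stages from cheapest to most expensive means most bad queries are rejected before any network call.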
After building the dataset—160 schemas across 138 domains, about 4,300 English prompts, and roughly 2,400 GraphQL queries—the project evaluates model performance. BART and T5 are fine-tuned as translation-style systems: English prompt plus schema in, GraphQL query out, trained with an autoregressive objective. The evaluation metric is “exact set matching accuracy,” computed by parsing both predicted and target GraphQL queries into abstract syntax trees and comparing them while treating child-node order as irrelevant. Under this metric, T5 trained on GraphQL achieves 46–50% accuracy; notably, the 50% variant is trained on both SQL and GraphQL and performs better than the GraphQL-only version.
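The order-insensitive comparison can be sketched as follows. This is one possible reading of the metric, not the project's evaluation script: it parses a small GraphQL subset (no aliases, fragments, variables, or directives) into nested `frozenset`s, so sibling fields and argument keys compare equal in any order. List values keep their order, on the assumption that it can be meaningful (for example in sort clauses). The queries used are made up.

```python
import re

# Strings, punctuation, then any other run of characters (names, numbers).
# Commas and whitespace match nothing and are skipped; GraphQL ignores both.
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|[{}()\[\]:]|[^\s{}()\[\]:,"]+')
PUNCTUATION = set("{}()[]:")


class Canonicalizer:
    """Parses a small GraphQL subset into an order-insensitive tree."""

    def __init__(self, text):
        self.tokens = TOKEN.findall(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        token = self.peek()
        if token is None or (expected is not None and token != expected):
            raise SyntaxError(f"expected {expected!r}, got {token!r}")
        self.pos += 1
        return token

    def atom(self):
        token = self.take()
        if token in PUNCTUATION:
            raise SyntaxError(f"unexpected {token!r}")
        return token

    def selection_set(self):
        self.take("{")
        fields = []
        while self.peek() != "}":
            fields.append(self.field())
        self.take("}")
        return frozenset(fields)  # sibling order is discarded

    def field(self):
        name = self.atom()
        args = self.pairs("(", ")") if self.peek() == "(" else frozenset()
        children = self.selection_set() if self.peek() == "{" else frozenset()
        return (name, args, children)

    def pairs(self, opener, closer):
        self.take(opener)
        items = []
        while self.peek() != closer:
            key = self.atom()
            self.take(":")
            items.append((key, self.value()))
        self.take(closer)
        return frozenset(items)  # argument and object-key order is discarded

    def value(self):
        if self.peek() == "{":
            return self.pairs("{", "}")
        if self.peek() == "[":
            self.take("[")
            items = []
            while self.peek() != "]":
                items.append(self.value())
            self.take("]")
            return tuple(items)  # list order is kept
        return self.atom()


def canonical(query):
    parser = Canonicalizer(query)
    if parser.peek() == "query":
        parser.take()
    tree = parser.selection_set()
    if parser.peek() is not None:
        raise SyntaxError("trailing tokens")
    return tree


def exact_set_match(predicted, target):
    try:
        return canonical(predicted) == canonical(target)
    except SyntaxError:
        return False  # unparsable queries count as misses


def accuracy(pairs):
    return sum(exact_set_match(p, t) for p, t in pairs) / len(pairs)


target = 'query { artist(where: {name: {_eq: "Joe"}, age: {_gt: 30}}) { name country } }'
reordered = '{ artist(where: {age: {_gt: 30}, name: {_eq: "Joe"}}) { country name } }'
different = "query { artist { name } }"
print(exact_set_match(reordered, target))  # True
print(exact_set_match(different, target))  # False
print(accuracy([(reordered, target), (different, target)]))  # 0.5
```

Plain string comparison would score `reordered` as wrong even though it requests the same data, which is why the comparison is done on parsed trees.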
The results are positioned against Spider’s SQL leaderboard, where top systems reach around 60–65% exact match accuracy on SQL. The gap is attributed to SQL-specialized architectures that can’t naturally emit GraphQL. The work suggests future improvements: design GraphQL-aware “heads” while keeping general model backbones, expand the dataset with more complex GraphQL features, and test on enterprise GraphQL schemas such as Salesforce and GitHub. In practice, syntactically invalid outputs are reported as relatively rare (about 5%), and the most time-consuming part of the project was the SQL-to-GraphQL conversion via tree/graph transformations. The demo examples, drawn from music and flight databases, show that the model can generate correct GraphQL queries for new schemas and return expected answers such as “USA” for an artist’s country or “United Airlines” for an airline abbreviation.
Cornell Notes
The project builds a benchmark for translating English questions into GraphQL queries by converting the SQL-focused Spider dataset into GraphQL. Using pgloader to prepare databases and Hasura to infer GraphQL schemas, it converts SQL abstract syntax trees into GraphQL abstract syntax trees, then validates generated queries for syntax, schema compatibility, and successful execution. Models fine-tuned for translation, especially T5, produce GraphQL queries from (English prompt + schema) with an autoregressive objective. Evaluation uses exact set matching accuracy, comparing parsed query trees while ignoring child-node order. T5 reaches about 46–50% exact set matching accuracy on the GraphQL validation set, showing generalization across many schemas, though it trails SQL-specialized systems.
Why does GraphQL’s schema matter for semantic parsing from English?
How was a GraphQL dataset created when no direct GraphQL benchmark existed?
What role did validation scripts play in dataset quality?
What is “exact set matching accuracy,” and why isn’t normal string matching enough?
Why did T5’s performance differ from SQL-focused leaderboard results?
What practical error pattern appeared in generated outputs?
Review Questions
- What specific conversion steps were required to transform Spider’s SQL queries into executable GraphQL queries, and which tools supported each step?
- How does exact set matching accuracy handle semantically equivalent GraphQL queries that differ in child-node ordering?
- What architectural change is suggested for improving performance: replacing only output “heads” or redesigning the whole model? Explain the rationale given.
Key Points
1. GraphQL’s schema acts as a contract, so English-to-GraphQL parsing must generate queries that validate against real schema constraints.
2. A GraphQL benchmark was created by converting Spider’s SQL question–query pairs into GraphQL using Hasura for schema generation and pgloader for database conversion.
3. The conversion pipeline relies on translating SQL abstract syntax trees into GraphQL abstract syntax trees, then serializing them into GraphQL text.
4. Dataset reliability depended on validation: syntax checks, schema validation, and execution against an endpoint.
5. Fine-tuned encoder–decoder models (especially T5) can generate GraphQL queries from (English prompt + schema) with an autoregressive objective.
6. Evaluation uses exact set matching accuracy by comparing parsed query trees while ignoring child-node order, making it robust to superficial ordering differences.
7. T5 achieved about 46–50% exact set matching accuracy on the GraphQL validation set, trailing SQL-specialized systems (~60–65%), whose SQL-specific architectures cannot directly emit GraphQL.