Semantic Parsing English to GraphQL | Andre Carerra

TL;DR

GraphQL’s schema acts as a contract, so English-to-GraphQL parsing must generate queries that validate against real schema constraints.


Briefing

Semantic parsing from English into GraphQL is feasible with general-purpose encoder–decoder language models, but accuracy lags behind SQL-focused systems because GraphQL’s structure is harder to map from natural language without language-specific modeling. The core result is a new English-to-GraphQL dataset built by converting the SQL-focused Spider benchmark into GraphQL, then training models—especially T5—to generate GraphQL queries that match target queries under an order-insensitive comparison of parsed query trees. On the resulting GraphQL validation set, T5 reaches roughly 46–50% exact set matching accuracy, a meaningful generalization signal given the dataset’s scale and the need to produce syntactically valid, schema-consistent queries.

The project starts by framing GraphQL as an API query language with a schema that acts like a contract: clients request nested data predictably, and the schema constrains what queries are valid. Semantic parsing is then defined as translating a natural-language utterance into a machine-readable logical form. In this case, the logical form is a GraphQL query derived from an English question and a GraphQL schema.

A major obstacle is data availability. There is no native GraphQL equivalent to Spider, so the work repurposes Spider itself: roughly 10,000 natural-language questions paired with complex SQL queries across 200 databases and 138 domains. The conversion pipeline uses pgloader to move the data from SQLite to Postgres, then Hasura to generate a GraphQL schema from each Postgres database. The central technical step is translating SQL abstract syntax trees into GraphQL abstract syntax trees, then serializing those trees into raw GraphQL queries.
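The tree-to-tree step can be pictured with a small sketch. This is not the project's converter: the `SqlSelect` and `GqlField` classes are hypothetical stand-ins for real SQL and GraphQL syntax trees, the example table and columns are invented, and only a single-table SELECT with equality filters and a limit is handled. The output follows Hasura's `where` / `_eq` / `limit` argument style.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SqlSelect:
    """A deliberately tiny stand-in for a parsed SQL SELECT statement."""
    table: str
    columns: List[str]
    where_eq: Dict[str, object] = field(default_factory=dict)  # column -> value
    limit: Optional[int] = None


@dataclass
class GqlField:
    """A GraphQL field node: name, arguments, and nested selections."""
    name: str
    args: Dict[str, str] = field(default_factory=dict)  # values already rendered
    children: List["GqlField"] = field(default_factory=list)


def render_value(value: object) -> str:
    """Render a Python literal as a GraphQL literal (strings get double quotes)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, str):
        return '"' + value.replace('"', '\\"') + '"'
    return str(value)


def sql_to_gql(select: SqlSelect) -> GqlField:
    """Map the toy SQL tree onto a Hasura-style root field."""
    args: Dict[str, str] = {}
    if select.where_eq:
        conds = ", ".join(
            f"{col}: {{_eq: {render_value(val)}}}"
            for col, val in select.where_eq.items()
        )
        args["where"] = "{" + conds + "}"
    if select.limit is not None:
        args["limit"] = str(select.limit)
    return GqlField(select.table, args, [GqlField(c) for c in select.columns])


def serialize(node: GqlField) -> str:
    """Serialize a GraphQL field tree into compact query text."""
    text = node.name
    if node.args:
        text += "(" + ", ".join(f"{k}: {v}" for k, v in node.args.items()) + ")"
    if node.children:
        text += " { " + " ".join(serialize(c) for c in node.children) + " }"
    return text


def to_query(select: SqlSelect) -> str:
    """Convert the toy SQL tree to GraphQL text in one call."""
    return "query { " + serialize(sql_to_gql(select)) + " }"


# SELECT name, country FROM singer WHERE country = 'USA' LIMIT 5
example = SqlSelect("singer", ["name", "country"], {"country": "USA"}, limit=5)
print(to_query(example))
# query { singer(where: {country: {_eq: "USA"}}, limit: 5) { name country } }
```

Keeping the conversion at the tree level, and serializing only at the end, means each SQL construct needs a mapping to a GraphQL node. A construct with no mapping, such as GROUP BY in this setup, has nowhere to go.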

Validation is treated as a first-class requirement. The generated queries are checked for GraphQL syntax, validated against the schema to ensure keywords and fields exist, and executed against an endpoint to confirm they run. This validation step is also where the conversion’s limitations show up: Hasura has no direct group-by mapping, so many queries could only be partially transferred or had to be handled differently. Converting them by hand was possible in principle but not feasible within the project’s time constraints.
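As a rough illustration of the schema-validation step only, the sketch below checks that every selected field exists on its parent type. The `SCHEMA` dictionary, its type and field names, and the `(name, children)` selection format are assumptions made for the example. A real pipeline would rely on a GraphQL library's parser and validator, followed by execution against the endpoint.

```python
from typing import Dict, List, Tuple

# Hypothetical schema: type name -> {field name -> type name of that field}.
# Scalar fields point at a type that has no entry of its own.
SCHEMA: Dict[str, Dict[str, str]] = {
    "query_root": {"singer": "singer", "concert": "concert"},
    "singer": {"name": "String", "country": "String", "concerts": "concert"},
    "concert": {"concert_name": "String", "year": "Int"},
}

# A selection is (field name, list of nested selections).
Selection = Tuple[str, List["Selection"]]


def validate(selections: List[Selection], type_name: str = "query_root",
             path: str = "query") -> List[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    errors: List[str] = []
    fields = SCHEMA.get(type_name)
    for name, children in selections:
        here = f"{path}.{name}"
        if fields is None or name not in fields:
            errors.append(f"{here}: no field '{name}' on type '{type_name}'")
            continue
        field_type = fields[name]
        is_object = field_type in SCHEMA
        if is_object and not children:
            errors.append(f"{here}: object field needs a selection set")
        elif not is_object and children:
            errors.append(f"{here}: scalar field cannot have a selection set")
        else:
            errors.extend(validate(children, field_type, here))
    return errors


good = [("singer", [("name", []), ("concerts", [("year", [])])])]
bad = [("singer", [("name", []), ("age", [])])]
print(validate(good))  # []
print(validate(bad))   # ["query.singer.age: no field 'age' on type 'singer'"]
```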

After building the dataset—160 schemas across 138 domains, about 4,300 English prompts, and roughly 2,400 GraphQL queries—the project evaluates model performance. BART and T5 are fine-tuned as translation-style systems: English prompt plus schema in, GraphQL query out, trained with an autoregressive objective. The evaluation metric is “exact set matching accuracy,” computed by parsing both predicted and target GraphQL queries into abstract syntax trees and comparing them while treating child-node order as irrelevant. Under this metric, T5 trained on GraphQL achieves 46–50% accuracy; notably, the 50% variant is trained on both SQL and GraphQL and performs better than the GraphQL-only version.
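The order-insensitive comparison can be sketched as follows. This is one plausible reading of the metric, not the project's evaluation code: parsed queries are represented here as hypothetical `(label, children)` tuples, siblings are sorted recursively into a canonical form, and a prediction that failed to parse is passed as `None` and counted as wrong.

```python
from typing import List, Optional, Tuple

# A parsed node: (label, children). Arguments can be modelled as child nodes too.
Node = Tuple[str, List["Node"]]


def canonical(node: Node) -> tuple:
    """Recursively sort children so sibling order no longer matters."""
    label, children = node
    return (label, tuple(sorted(canonical(child) for child in children)))


def exact_set_match(predicted: Node, target: Node) -> bool:
    """True when both trees are equal up to reordering of siblings."""
    return canonical(predicted) == canonical(target)


def accuracy(pairs: List[Tuple[Optional[Node], Node]]) -> float:
    """Share of (predicted, target) pairs that match; None means unparseable."""
    hits = sum(
        1 for pred, gold in pairs
        if pred is not None and exact_set_match(pred, gold)
    )
    return hits / len(pairs)


gold = ("singer", [("name", []), ("country", [])])
reordered = ("singer", [("country", []), ("name", [])])  # same fields, new order
missing = ("singer", [("name", [])])                     # one field dropped

print(exact_set_match(reordered, gold))  # True
print(exact_set_match(missing, gold))    # False
print(accuracy([(reordered, gold), (missing, gold), (None, gold)]))
```

A plain string comparison would reject the `reordered` prediction even though it requests the same data, which is the case the metric is designed to forgive.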

The results are positioned against Spider’s SQL leaderboard, where top systems reach around 60–65% exact match accuracy on SQL. The gap is attributed to SQL-specialized architectures that can’t naturally emit GraphQL. The work suggests future improvements: design GraphQL-aware “heads” while keeping general model backbones, expand the dataset with more complex GraphQL features, and test on enterprise GraphQL schemas such as Salesforce and GitHub. In practice, syntactically invalid outputs are reported as relatively rare (around 5%), and the most time-consuming part of the project was the SQL-to-GraphQL conversion via tree/graph transformations. The demo examples, on music and flight databases, illustrate that the model can generate correct GraphQL queries for new schemas and return expected answers like “USA” for an artist’s country or “United Airlines” for an airline abbreviation.

Cornell Notes

The project builds a benchmark for translating English questions into GraphQL queries by converting the SQL-focused Spider dataset into GraphQL. Using pgloader to move the databases into Postgres and Hasura to infer GraphQL schemas from them, it converts SQL abstract syntax trees into GraphQL abstract syntax trees, then validates the generated queries for syntax, schema compatibility, and successful execution. Models fine-tuned for translation, especially T5, produce GraphQL queries from (English prompt + schema) with an autoregressive objective. Evaluation uses exact set matching accuracy, comparing parsed query trees while ignoring child-node order. T5 reaches about 46–50% exact set matching accuracy on the GraphQL validation set, showing generalization across many schemas, though it trails SQL-specialized systems.

Why does GraphQL’s schema matter for semantic parsing from English?

GraphQL’s schema defines the allowed types, fields, and relationships, acting like an API contract. Semantic parsing here means generating a query that must be consistent with that contract. The dataset construction explicitly validates generated queries against the schema, checking that keywords and fields exist, so the targets the model learns from are not just plausible-looking text but queries that can be executed against a real endpoint.

How was a GraphQL dataset created when no direct GraphQL benchmark existed?

The work repurposes Spider, which pairs natural-language questions with SQL queries across many databases and domains. It converts each SQL query into a GraphQL query in three steps: (1) moving each database from SQLite to Postgres with pgloader, (2) generating a GraphQL schema from the Postgres database using Hasura, and (3) translating SQL abstract syntax trees into GraphQL abstract syntax trees, then serializing them into raw GraphQL.

What role did validation scripts play in dataset quality?

Validation scripts ensure the generated GraphQL queries are actually usable. They check GraphQL syntax, validate the query against the inferred schema (so fields and arguments are legal), and execute the queries against an endpoint. This turns the dataset into something closer to “executable supervision,” not just text pairs.

What is “exact set matching accuracy,” and why isn’t normal string matching enough?

Exact set matching accuracy parses both predicted and target GraphQL queries into abstract syntax trees and compares the trees. It treats child-node order as irrelevant, so two queries that differ only by the order of equivalent children can still count as correct. This matters because GraphQL query structure can be semantically equivalent even when surface ordering differs.

Why did T5’s performance differ from SQL-focused leaderboard results?

SQL-specialized systems often use architectures tailored to SQL output formats. Those models can’t naturally emit GraphQL, so their strengths don’t transfer directly. In contrast, the GraphQL-capable setup fine-tunes a general translation model (T5) to output GraphQL, which supports the task but may be less optimized than SQL-only architectures.
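To make the translation-style framing concrete, the sketch below flattens a question and a schema summary into one source string, paired with a target query. The summary does not say how the schema was serialized, so the task prefix, the `schema:` separator, the table/column layout, and the example schema are all assumptions for illustration.

```python
from typing import Dict, List


def build_source(question: str, schema: Dict[str, List[str]]) -> str:
    """Flatten a question and a schema summary into one input string."""
    tables = " | ".join(
        f"{table}: {', '.join(columns)}"
        for table, columns in sorted(schema.items())
    )
    return f"translate English to GraphQL: {question.strip()} schema: {tables}"


# Hypothetical schema summary: table name -> column names.
schema = {"singer": ["name", "country"], "concert": ["concert_name", "year"]}

source = build_source("What country is each singer from?", schema)
target = "query { singer { name country } }"

print(source)
print(target)
```

Because both sides are plain text, any encoder-decoder model that accepts a source string and emits a target string can be fine-tuned on such pairs without changing its architecture, which matches the general-purpose setup described above.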

What practical error pattern appeared in generated outputs?

Syntactically invalid outputs were reported as relatively uncommon, about 5% of outputs. The evaluation metric counts any invalid query as wrong, so these cases are already included in the error rate. Most failures, however, are queries that parse but have the wrong structure rather than outright syntax errors.
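A share like this could be estimated by running every model output through a syntax check and counting failures. The sketch below uses a bracket-balance test as a crude proxy for syntactic validity. It is not a GraphQL parser (it does not handle block strings or check names and keywords), and the sample outputs are invented for the example.

```python
from typing import List

CLOSERS = {"}": "{", ")": "(", "]": "["}


def looks_well_formed(query: str) -> bool:
    """Rough proxy for validity: brackets balance outside string literals."""
    stack: List[str] = []
    in_string = False
    escaped = False
    for ch in query:
        if in_string:
            if escaped:
                escaped = False      # character after a backslash is skipped
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{([":
            stack.append(ch)
        elif ch in CLOSERS:
            if not stack or stack.pop() != CLOSERS[ch]:
                return False
    return bool(query.strip()) and not stack and not in_string


def invalid_rate(outputs: List[str]) -> float:
    """Fraction of outputs that fail the well-formedness check."""
    return sum(1 for q in outputs if not looks_well_formed(q)) / len(outputs)


outputs = [
    'query { singer { name } }',                               # balanced
    'query { singer { name }',                                 # missing brace
    'query { singer(where: {name: {_eq: "a}"}}) { name } }',   # brace in string
    'query { singer) { name } }',                              # stray paren
]
print(invalid_rate(outputs))  # 0.5
```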

Review Questions

What specific conversion steps were required to transform Spider’s SQL queries into executable GraphQL queries, and which tools supported each step?
How does exact set matching accuracy handle semantically equivalent GraphQL queries that differ in child-node ordering?
What architectural change is suggested for improving performance: replacing only output “heads” or redesigning the whole model? Explain the rationale given.

Key Points

1. GraphQL’s schema acts as a contract, so English-to-GraphQL parsing must generate queries that validate against real schema constraints.
2. A GraphQL benchmark was created by converting Spider’s SQL question–query pairs into GraphQL using Hasura for schema generation and pgloader for database conversion.
3. The conversion pipeline relies on translating SQL abstract syntax trees into GraphQL abstract syntax trees, then serializing them into GraphQL text.
4. Dataset reliability depended on validation: syntax checks, schema validation, and execution against an endpoint.
5. Fine-tuned encoder–decoder models (especially T5) can generate GraphQL queries from (English prompt + schema) with an autoregressive objective.
6. Evaluation uses exact set matching accuracy by comparing parsed query trees while ignoring child-node order, making it robust to superficial ordering differences.
7. T5 achieved about 46–50% exact set matching accuracy on the GraphQL validation set, trailing SQL-specialized systems (around 60–65% on Spider), whose architectures are tailored to SQL output and cannot naturally emit GraphQL.

Highlights

The work turns Spider’s SQL benchmark into an executable English-to-GraphQL dataset by converting SQL abstract syntax trees into GraphQL abstract syntax trees and validating them against real endpoints.

T5 reaches roughly 46–50% exact set matching accuracy on GraphQL, with a higher-performing variant trained on both SQL and GraphQL.

Exact set matching accuracy compares parsed query trees and treats child-node order as irrelevant, so equivalent query structures can score as correct.

Syntactically invalid outputs are reported at about 5%, indicating most errors are structural/semantic rather than outright malformed GraphQL.

Future gains are framed as adding GraphQL-aware output “heads,” expanding dataset coverage of complex GraphQL features, and testing on enterprise schemas like Salesforce and GitHub.

Topics

GraphQL
Semantic Parsing
Dataset Conversion
T5
Exact Set Matching

Mentioned

Andre Carerra
API
SQL
T5
BART
AST
GPU

Semantic Parsing English to GraphQL | Andre Carerra | OpenAI Scholars Demo Day 2020