Tagging and Extraction - Classification using OpenAI Functions
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Use OpenAI function-style schemas in LangChain to obtain structured JSON outputs without triggering external functions.
Briefing
OpenAI “functions” can be used in LangChain not to trigger external code, but to force large language models to return structured JSON outputs, turning messy text into reliable fields. LangChain splits this into two built-in capabilities, tagging (classification) and extraction (entity/field extraction), which let developers define a schema up front and then constrain what the model is allowed to output.
For tagging, the workflow starts with a schema describing the labels to predict. A tagging chain is created with an LLM (the transcript references the GPT-3.5 Turbo snapshot from June 13th, i.e., gpt-3.5-turbo-0613) with temperature set to zero. The schema in the example asks for three outputs from each review: sentiment (positive/negative), stars (a rating), and language (English/Spanish/French/German). Under the hood, the model receives both a human prompt template and a functions-role payload that includes the schema. A JSON output functions parser then converts the model’s response into structured data automatically.
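A minimal sketch of this setup, assuming the legacy `create_tagging_chain` helper shown in the video (these chain-style imports and the `run` API have since been deprecated in newer LangChain releases, so treat the module paths as era-specific):

```python
from langchain.chat_models import ChatOpenAI
from langchain.chains import create_tagging_chain

# Loose first-pass schema: the fields are listed but nothing is
# required or constrained, which is what leads to the gaps described below.
schema = {
    "properties": {
        "sentiment": {"type": "string"},
        "stars": {"type": "integer"},
        "language": {"type": "string"},
    }
}

# Temperature 0 keeps the classification deterministic;
# "gpt-3.5-turbo-0613" is the June 13th snapshot the transcript mentions.
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613")

chain = create_tagging_chain(schema, llm)
print(chain.run("Not worth the money. Very disappointed with this book."))
# e.g. {'sentiment': 'negative'} -- note the missing stars/language fields
```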
Initial tests on Amazon reviews from the book “Spare” show what happens when the schema is too loose. One five-star review returns positive sentiment and a stars value of four, but leaves the language field blank. A negative review returns sentiment but omits stars and language. A mixed review returns nothing at all. The takeaway is practical: when the model isn’t sufficiently constrained—especially for required fields—outputs can be incomplete or fail.
Tightening the schema fixes much of the problem. The sentiment field becomes an enum with explicit allowed values (positive, neutral, negative). Stars are constrained to a numeric range (one to five). Language is no longer a free-form guess; it is an enum limited to specific options (Spanish, English, French, German), and all three fields are marked as required. With these constraints, the chain returns complete structured results for the same reviews, including sentiment, stars, and language, as a Python dictionary. The transcript also demonstrates a Pydantic-based variant (`create_tagging_chain_pydantic`), where the output conforms to a typed Pydantic class, making it easier to access fields like response.sentiment and response.stars and to pass the object through downstream code.
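The tightened schema and the Pydantic variant might look like the following; the enum values and field names mirror the transcript, while the `Field(..., enum=...)` pattern assumes the Pydantic v1 behavior LangChain relied on at the time:

```python
from langchain.chains import create_tagging_chain, create_tagging_chain_pydantic
from langchain.chat_models import ChatOpenAI
from pydantic import BaseModel, Field  # assumes Pydantic v1 semantics

llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613")

# Tightened dict schema: enums, a fixed star range, and required fields.
schema = {
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
        "stars": {"type": "integer", "enum": [1, 2, 3, 4, 5]},
        "language": {"type": "string", "enum": ["Spanish", "English", "French", "German"]},
    },
    "required": ["sentiment", "stars", "language"],
}
review = "Una lectura fascinante, cinco estrellas."  # hypothetical sample input
print(create_tagging_chain(schema, llm).run(review))
# e.g. {'sentiment': 'positive', 'stars': 5, 'language': 'Spanish'}

# Pydantic variant: the same constraints expressed as a typed class.
class ReviewTags(BaseModel):
    sentiment: str = Field(..., enum=["positive", "neutral", "negative"])
    stars: int = Field(..., description="Star rating from 1 to 5")
    language: str = Field(..., enum=["Spanish", "English", "French", "German"])

response = create_tagging_chain_pydantic(ReviewTags, llm).run(review)
print(response.sentiment, response.stars)  # typed attribute access
```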
Extraction works similarly but targets pulling specific fields from longer text, akin to named entity recognition. Using a TechCrunch article about controversy at Reddit, an extraction chain is built with a schema for fields such as person name, startup, news outlet, app name, and month. Only person name and startup are required in the initial setup. The functions role is used to request “information extraction,” and the model can call the function multiple times across the passage, producing a list of JSON objects.
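A hedged sketch of that extraction setup, again using the legacy `create_extraction_chain` helper (the field names follow the transcript; the snake_case keys are an assumption):

```python
from langchain.chains import create_extraction_chain
from langchain.chat_models import ChatOpenAI

schema = {
    "properties": {
        "person_name": {"type": "string"},
        "startup": {"type": "string"},
        "news_outlet": {"type": "string"},
        "app_name": {"type": "string"},
        "month": {"type": "string"},
    },
    # Only these two fields must appear in every extracted object.
    "required": ["person_name", "startup"],
}

llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613")
chain = create_extraction_chain(schema, llm)

article_text = """(paste the TechCrunch article text here)"""

# The model may call the extraction function several times over the
# passage, so the result is a list of JSON objects (Python dicts).
entities = chain.run(article_text)
```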
The example highlights common extraction pitfalls: the model may confuse entity types (e.g., treating “Reddit” as a person), misclassify startups as apps and vice versa, and miss the month field. A key mitigation is context management: splitting the article into smaller chunks (e.g., two paragraphs at a time) improves coverage and reduces missed entities. The transcript also notes that adding descriptions to schema fields and using in-context learning examples can further improve accuracy.
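Continuing from the extraction sketch above (reusing its `chain` and `article_text`), one simple way to apply the chunking mitigation; this is a sketch, not the transcript’s exact code:

```python
# Split on blank lines and feed the chain two paragraphs at a time,
# so each call works over a small, focused context window.
paragraphs = [p for p in article_text.split("\n\n") if p.strip()]

all_entities = []
for i in range(0, len(paragraphs), 2):
    chunk = "\n\n".join(paragraphs[i:i + 2])
    all_entities.extend(chain.run(chunk))

# De-duplicate identical entity dicts extracted from overlapping mentions.
unique_entities = [dict(t) for t in {tuple(sorted(d.items())) for d in all_entities}]
```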
Overall, the approach turns unstructured text into structured data using constrained function-style outputs, enabling downstream uses like sentiment analysis, routing different complaint types to different chains, and building knowledge graphs from extracted entities.
Cornell Notes
LangChain’s “tagging” and “extraction” chains use OpenAI function-style schemas to force structured JSON outputs from unstructured text. Tagging behaves like multi-class classification: define allowed labels (enums), mark required fields, and the model returns sentiment, star ratings, and language in a consistent structure. Loose schemas lead to missing fields or empty outputs; tightening constraints (enums, required fields, numeric ranges) improves reliability. Extraction pulls named fields from articles using a similar functions role, often returning multiple JSON objects per passage. Accuracy improves with better schema descriptions and by limiting context size (e.g., processing two paragraphs at a time).
- Why did the initial tagging results come back incomplete or empty, even though the schema listed sentiment, stars, and language?
- How do enums and required fields change tagging reliability?
- What’s the difference between returning a dictionary and returning a Pydantic object in tagging?
- How does extraction resemble NER, and what does it return?
- What concrete strategy improved extraction results in the example article?
Review Questions
- When tagging, what specific schema changes (enums, required fields, numeric ranges) most directly prevent missing outputs?
- In extraction, why might the model label “Reddit” as a person, and how could schema descriptions or context chunking reduce such errors?
- How would you decide whether to use a plain dictionary output or a Pydantic-typed output for downstream processing?
Key Points
1. Use OpenAI function-style schemas in LangChain to obtain structured JSON outputs without triggering external functions.
2. Tagging is classification: define the fields to predict and constrain allowed values using enums and required fields.
3. Loose schemas often produce partial or empty outputs; tightening constraints improves completeness and consistency.
4. A Pydantic tagging chain returns a typed object, making field access and downstream validation easier than raw dictionaries.
5. Extraction pulls multiple structured entities from text by repeatedly calling an extraction function across the passage.
6. Extraction quality improves with better schema descriptions and by limiting context size (e.g., two paragraphs at a time).
7. Structured outputs can feed downstream workflows like sentiment analysis, complaint-type routing, and knowledge graph construction.