Information Extraction with LangChain & Kor
Based on Sam Witteveen's YouTube video. If you like this content, support the original creator by watching, liking, and subscribing.
Kor performs information extraction by prompting a large language model to fill a predefined schema, avoiding the need for immediate NER training data.
Briefing
Turning messy text into structured data is the bottleneck for many NLP workflows, especially when there's no labeled dataset to train a named-entity recognition (NER) model. Kor, used alongside LangChain, offers a practical workaround: it uses a large language model (in this case, an OpenAI chat model accessed through LangChain's ChatOpenAI wrapper) to extract fields directly into a predefined schema, producing report-ready outputs and even a seed dataset for training a future NER system.
The core idea is straightforward. Kor wraps a LangChain extraction chain that takes raw text, cleans and splits it, then prompts the LLM with a schema describing exactly what to extract. That schema is reinforced through in-context learning: each field comes with a description and one or more examples, so the model learns what values to pull out and how to format them. In the simplest setup, the schema defines a flat set of attributes—like extracting first name and age from sentences such as “John Smith was 23.” The system then returns structured results that match the schema, demonstrating that extraction can be done without training a dedicated model.
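A minimal sketch of that flat setup, assuming Kor's Object/Text primitives (field ids like first_name and age are illustrative, and call conventions vary across Kor/LangChain releases):

```python
from kor import create_extraction_chain, Object, Text
from langchain_openai import ChatOpenAI  # older releases: from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Flat schema: each attribute carries a description plus in-context examples
# that teach the model what to pull out and how to format it.
person_schema = Object(
    id="person",
    description="Personal information about people mentioned in the text",
    attributes=[
        Text(
            id="first_name",
            description="The person's first name",
            examples=[("John Smith was 23.", "John")],
        ),
        Text(
            id="age",
            description="The person's age in years",
            examples=[("John Smith was 23.", "23")],
        ),
    ],
    many=True,  # allow several people per input text
)

chain = create_extraction_chain(llm, person_schema)
result = chain.run("Jane Doe, who was 30, met John Smith, 23.")  # newer releases: chain.invoke(...)
print(result["data"])
# e.g. {'person': [{'first_name': 'Jane', 'age': '30'}, {'first_name': 'John', 'age': '23'}]}
```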
Kor also supports nested structures, which matters when real information is hierarchical. For example, address extraction can be modeled as a JSON object containing from/to addresses, each with street, city, state, zip code, and country. When nested fields are required, the workflow shifts from a CSV-style encoder to a JSON encoder, and the prompt is tightened further: it instructs the model to output only valid JSON, omit any fields not present in the schema, and ignore irrelevant attributes found in the text.
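A sketch of the nested case, reusing the llm from above; the address_object helper is illustrative, and nesting an Object inside another Object is what forces the switch to the JSON encoder:

```python
from kor import create_extraction_chain, Object, Text

def address_object(id: str, description: str) -> Object:
    """Illustrative helper that builds one address sub-object."""
    return Object(
        id=id,
        description=description,
        attributes=[
            Text(id="street", description="Street name and number"),
            Text(id="city", description="City name"),
            Text(id="state", description="State or province"),
            Text(id="zipcode", description="Postal / ZIP code"),
            Text(id="country", description="Country name"),
        ],
    )

schema = Object(
    id="shipping",
    description="Sender and recipient addresses mentioned in the text",
    attributes=[
        address_object("from_address", "Where the package is sent from"),
        address_object("to_address", "Where the package is delivered"),
    ],
    examples=[
        (
            "Ship from 100 Main St, Springfield, IL 62704, USA to 5 Oak Ave, Austin, TX 73301, USA.",
            [{
                "from_address": {"street": "100 Main St", "city": "Springfield",
                                 "state": "IL", "zipcode": "62704", "country": "USA"},
                "to_address": {"street": "5 Oak Ave", "city": "Austin",
                               "state": "TX", "zipcode": "73301", "country": "USA"},
            }],
        )
    ],
)

# Nested fields need the JSON encoder; the generated prompt instructs the model
# to emit valid JSON only and to skip attributes that are not in the schema.
chain = create_extraction_chain(llm, schema, encoder_or_encoder_class="json")
```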
Where the approach becomes most useful in practice is with Pydantic classes. Instead of manually defining a schema as plain text, the workflow defines a Restaurant model with required and optional fields: name (required), plus location, style, and top dish (all optional). Pydantic then enforces formatting and field-presence rules, validating that the model's output conforms to the expected structure. The extraction chain is run with the validator Kor derives from the Pydantic model and can handle multiple restaurants at once (via many=True). The results are then converted into human-readable summaries and can be loaded into a pandas DataFrame for downstream analysis or reporting.
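A sketch of the Pydantic route, assuming Kor's from_pydantic helper; the Restaurant fields mirror the ones described above, and transcript stands in for the raw conversation text:

```python
from typing import Optional

from kor import create_extraction_chain, from_pydantic
from pydantic import BaseModel, Field

class Restaurant(BaseModel):
    name: str = Field(description="The restaurant's name")  # required
    location: Optional[str] = Field(default=None, description="Where the restaurant is")
    style: Optional[str] = Field(default=None, description="Cuisine style")
    top_dish: Optional[str] = Field(default=None, description="Signature dish")

# from_pydantic returns both a Kor schema and a validator that checks
# the model's output against the Restaurant class.
schema, validator = from_pydantic(
    Restaurant,
    description="Restaurants mentioned in a conversation",
    examples=[
        (
            "We ate at Tanoshii downtown; their sashimi platter is unreal.",
            {"name": "Tanoshii", "location": "downtown", "top_dish": "sashimi platter"},
        )
    ],
    many=True,  # allow several restaurants per transcript
)

chain = create_extraction_chain(
    llm, schema, encoder_or_encoder_class="json", validator=validator
)
result = chain.run(transcript)          # transcript: raw conversation text (assumed defined)
restaurants = result["validated_data"]  # list of Restaurant objects when validation passes
```

Dropping the validator argument reproduces the unvalidated behavior described below, which is where drift and partial extractions start to show up.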
A key lesson from the examples is that validation and examples materially improve reliability. When Pydantic validation is removed, the output becomes more inconsistent: restaurant names and dish details start to drift, repeat, or appear partially—signs the model is improvising rather than adhering to the target structure. With validation in place and enough high-quality in-context examples, extraction quality improves substantially, even though errors still occur.
The takeaway is pragmatic: Kor can automate the pipeline from conversation transcripts to structured restaurant data—names, locations, and dishes—at low cost per run. That extracted dataset can then support later steps, including generating reports or training a conventional NER model (such as a BERT-based approach) once enough labeled data exists.
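One way that downstream step might look, assuming restaurants holds the validated Restaurant objects from the sketch above (the output filename is illustrative):

```python
import pandas as pd

# Tabulate validated extractions for reporting or later NER training data.
rows = [r.dict() for r in restaurants]  # Pydantic v1; use r.model_dump() on v2
df = pd.DataFrame(rows)

# Human-readable one-liners for a quick report.
for row in rows:
    parts = [row["name"]]
    if row.get("location"):
        parts.append(f"in {row['location']}")
    if row.get("top_dish"):
        parts.append(f"known for {row['top_dish']}")
    print(" ".join(parts))

df.to_csv("restaurants.csv", index=False)  # seed data for a future NER model
```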
Cornell Notes
Kor (built on LangChain) turns unstructured text into structured outputs by prompting a large language model to fill a predefined schema. It avoids the usual requirement of training an NER model by using in-context learning—field descriptions and examples—to guide extraction. The workflow supports both flat schemas and nested JSON structures, with strict instructions to output only schema-matching fields. For real-world reliability, Pydantic classes enforce output shape (required vs. optional fields) and validate formatting, which reduces drift and partial/incorrect extractions. The extracted results can be formatted for humans, converted into pandas DataFrames, and later used to build datasets for training a proper NER model.
- Why does Kor matter when there's no labeled data to train an NER model?
- How does Kor use schemas and in-context learning to control what gets extracted?
- What changes when the extraction target is nested data like from/to addresses?
- Why are Pydantic classes a big deal for extraction quality?
- What happens when Pydantic validation is removed?
- How can extracted information be used after extraction?
Review Questions
- How does Kor’s schema design (field descriptions and examples) function as a substitute for labeled training data?
- When would you choose a JSON encoder over a CSV encoder in Kor, and what prompt constraints help enforce correctness?
- How does Pydantic validation change the failure modes of extraction compared with running extraction without validation?
Key Points
1. Kor performs information extraction by prompting a large language model to fill a predefined schema, avoiding the need for immediate NER training data.
2. Field-level descriptions and examples in the prompt provide in-context learning that steers what the model extracts and how it formats results.
3. Nested extraction (like from/to addresses) is handled more reliably with JSON encoding and stricter "JSON-only" prompt instructions.
4. Pydantic classes improve extraction reliability by enforcing required vs. optional fields and validating output structure.
5. Using Pydantic validation reduces drift, repetition, and partial/mismatched fields that appear when validation is removed.
6. Extracted outputs can be transformed into human-readable summaries and loaded into pandas DataFrames for reporting or analysis.
7. The extracted data can later bootstrap a conventional NER model training pipeline once enough examples exist.