Information Extraction with LangChain & Kor
Based on Sam Witteveen's YouTube video. If you like this content, support the original creator by watching, liking, and subscribing.
Kor performs information extraction by prompting a large language model to fill a predefined schema, avoiding the need for immediate NER training data.
Briefing
Turning messy text into structured data is the bottleneck for many NLP workflows, especially when there's no labeled dataset to train a named-entity recognition (NER) model. Kor, used alongside LangChain, offers a practical workaround: it uses a large language model (in this case, an OpenAI chat model accessed through LangChain's ChatOpenAI wrapper) to extract fields directly into a predefined schema, producing report-ready outputs and even a seed dataset for training a future NER system.
The core idea is straightforward. Kor wraps a LangChain extraction chain that takes raw text, cleans and splits it, then prompts the LLM with a schema describing exactly what to extract. That schema is reinforced through in-context learning: each field comes with a description and one or more examples, so the model learns what values to pull out and how to format them. In the simplest setup, the schema defines a flat set of attributes—like extracting first name and age from sentences such as “John Smith was 23.” The system then returns structured results that match the schema, demonstrating that extraction can be done without training a dedicated model.
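A minimal sketch of that flat setup, assuming Kor's Object/Text primitives (field ids like first_name and age are illustrative, and call conventions vary across Kor/LangChain releases):

```python
from kor import create_extraction_chain, Object, Text
from langchain_openai import ChatOpenAI  # older releases: from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Flat schema: each attribute carries a description plus in-context examples
# that teach the model what to pull out and how to format it.
person_schema = Object(
    id="person",
    description="Personal information about people mentioned in the text",
    attributes=[
        Text(
            id="first_name",
            description="The person's first name",
            examples=[("John Smith was 23.", "John")],
        ),
        Text(
            id="age",
            description="The person's age in years",
            examples=[("John Smith was 23.", "23")],
        ),
    ],
    many=True,  # allow several people per input text
)

chain = create_extraction_chain(llm, person_schema)
result = chain.run("Jane Doe, who was 30, met John Smith, 23.")  # newer releases: chain.invoke(...)
print(result["data"])
# e.g. {'person': [{'first_name': 'Jane', 'age': '30'}, {'first_name': 'John', 'age': '23'}]}
```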
Kor also supports nested structures, which matters when real information is hierarchical. For example, address extraction can be modeled as a JSON object containing from/to addresses, each with street, city, state, zip code, and country. When nested fields are required, the workflow shifts from a CSV-style encoder to a JSON encoder, and the prompt is tightened further: it instructs the model to output only valid JSON, omit any fields not present in the schema, and ignore irrelevant attributes found in the text.
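A sketch of the nested case, reusing the llm from above; the address_object helper is illustrative, and nesting an Object inside another Object is what forces the switch to the JSON encoder:

```python
from kor import create_extraction_chain, Object, Text

def address_object(id: str, description: str) -> Object:
    """Illustrative helper that builds one address sub-object."""
    return Object(
        id=id,
        description=description,
        attributes=[
            Text(id="street", description="Street name and number"),
            Text(id="city", description="City name"),
            Text(id="state", description="State or province"),
            Text(id="zipcode", description="Postal / ZIP code"),
            Text(id="country", description="Country name"),
        ],
    )

schema = Object(
    id="shipping",
    description="Sender and recipient addresses mentioned in the text",
    attributes=[
        address_object("from_address", "Where the package is sent from"),
        address_object("to_address", "Where the package is delivered"),
    ],
    examples=[
        (
            "Ship from 100 Main St, Springfield, IL 62704, USA to 5 Oak Ave, Austin, TX 73301, USA.",
            [{
                "from_address": {"street": "100 Main St", "city": "Springfield",
                                 "state": "IL", "zipcode": "62704", "country": "USA"},
                "to_address": {"street": "5 Oak Ave", "city": "Austin",
                               "state": "TX", "zipcode": "73301", "country": "USA"},
            }],
        )
    ],
)

# Nested fields need the JSON encoder; the generated prompt instructs the model
# to emit valid JSON only and to skip attributes that are not in the schema.
chain = create_extraction_chain(llm, schema, encoder_or_encoder_class="json")
```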
Where the approach becomes most useful in practice is with Pydantic classes. Instead of manually defining a schema as plain text, the workflow defines a Restaurant model with required and optional fields: name (required), plus location, style, and top dish (all optional). Pydantic then enforces formatting and field-presence rules, validating that the model's output conforms to the expected structure. The extraction chain is run with the validator Kor derives from the Pydantic model and can handle multiple restaurants at once (via many=True). The results are then converted into human-readable summaries and can be loaded into a pandas DataFrame for downstream analysis or reporting.
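A sketch of the Pydantic route, assuming Kor's from_pydantic helper; the Restaurant fields mirror the ones described above, and transcript stands in for the raw conversation text:

```python
from typing import Optional

from kor import create_extraction_chain, from_pydantic
from pydantic import BaseModel, Field

class Restaurant(BaseModel):
    name: str = Field(description="The restaurant's name")  # required
    location: Optional[str] = Field(default=None, description="Where the restaurant is")
    style: Optional[str] = Field(default=None, description="Cuisine style")
    top_dish: Optional[str] = Field(default=None, description="Signature dish")

# from_pydantic returns both a Kor schema and a validator that checks
# the model's output against the Restaurant class.
schema, validator = from_pydantic(
    Restaurant,
    description="Restaurants mentioned in a conversation",
    examples=[
        (
            "We ate at Tanoshii downtown; their sashimi platter is unreal.",
            {"name": "Tanoshii", "location": "downtown", "top_dish": "sashimi platter"},
        )
    ],
    many=True,  # allow several restaurants per transcript
)

chain = create_extraction_chain(
    llm, schema, encoder_or_encoder_class="json", validator=validator
)
result = chain.run(transcript)          # transcript: raw conversation text (assumed defined)
restaurants = result["validated_data"]  # list of Restaurant objects when validation passes
```

Dropping the validator argument reproduces the unvalidated behavior described below, which is where drift and partial extractions start to show up.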
A key lesson from the examples is that validation and examples materially improve reliability. When Pydantic validation is removed, the output becomes more inconsistent: restaurant names and dish details start to drift, repeat, or appear partially—signs the model is improvising rather than adhering to the target structure. With validation in place and enough high-quality in-context examples, extraction quality improves substantially, even though errors still occur.
The takeaway is pragmatic: Kor can automate the pipeline from conversation transcripts to structured restaurant data—names, locations, and dishes—at low cost per run. That extracted dataset can then support later steps, including generating reports or training a conventional NER model (such as a BERT-based approach) once enough labeled data exists.
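One way that downstream step might look, assuming restaurants holds the validated Restaurant objects from the sketch above (the output filename is illustrative):

```python
import pandas as pd

# Tabulate validated extractions for reporting or later NER training data.
rows = [r.dict() for r in restaurants]  # Pydantic v1; use r.model_dump() on v2
df = pd.DataFrame(rows)

# Human-readable one-liners for a quick report.
for row in rows:
    parts = [row["name"]]
    if row.get("location"):
        parts.append(f"in {row['location']}")
    if row.get("top_dish"):
        parts.append(f"known for {row['top_dish']}")
    print(" ".join(parts))

df.to_csv("restaurants.csv", index=False)  # seed data for a future NER model
```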
Cornell Notes
Kor (built on LangChain) turns unstructured text into structured outputs by prompting a large language model to fill a predefined schema. It avoids the usual requirement of training an NER model by using in-context learning—field descriptions and examples—to guide extraction. The workflow supports both flat schemas and nested JSON structures, with strict instructions to output only schema-matching fields. For real-world reliability, Pydantic classes enforce output shape (required vs. optional fields) and validate formatting, which reduces drift and partial/incorrect extractions. The extracted results can be formatted for humans, converted into pandas DataFrames, and later used to build datasets for training a proper NER model.
- Why does Kor matter when there's no labeled data to train an NER model?
- How does Kor use schemas and in-context learning to control what gets extracted?
- What changes when the extraction target is nested data like from/to addresses?
- Why are Pydantic classes a big deal for extraction quality?
- What happens when Pydantic validation is removed?
- How can extracted information be used after extraction?
Review Questions
- How does Kor’s schema design (field descriptions and examples) function as a substitute for labeled training data?
- When would you choose a JSON encoder over a CSV encoder in Kor, and what prompt constraints help enforce correctness?
- How does Pydantic validation change the failure modes of extraction compared with running extraction without validation?
Key Points
1. Kor performs information extraction by prompting a large language model to fill a predefined schema, avoiding the need for immediate NER training data.
2. Field-level descriptions and examples in the prompt provide in-context learning that steers what the model extracts and how it formats results.
3. Nested extraction (like from/to addresses) is handled more reliably with JSON encoding and stricter "JSON-only" prompt instructions.
4. Pydantic classes improve extraction reliability by enforcing required vs. optional fields and validating output structure.
5. Using Pydantic validation reduces drift, repetition, and partial/mismatched fields that appear when validation is removed.
6. Extracted outputs can be transformed into human-readable summaries and loaded into pandas DataFrames for reporting or analysis.
7. The extracted data can later bootstrap a conventional NER model training pipeline once enough examples exist.