
LangChain Tutorial: The Core Building Blocks | LLMs, JSON output, RAGs, Tools and Observability

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

LangChain provides a unified way to initialize and invoke chat models across OpenAI, Google GenAI (Gemini), and local Ollama models.

Briefing

LangChain’s practical value comes from a small set of reusable building blocks: a unified way to call different LLM providers, structured outputs that constrain responses to a schema, retrieval-augmented generation (RAG) for private documents, tool calling for actions, and observability so developers can trace every model call end-to-end. The tutorial walks through each piece with working examples, showing how the same core patterns carry across OpenAI, Google Gemini, and local models.

It starts with the simplest interface: initialize a chat model for a chosen provider and call it with a prompt. Using LangChain’s chat model initialization, the same “invoke with a prompt, read the response” workflow works for OpenAI (example: GPT-4 mini), Google GenAI (example: Gemini 2.5 Flash with a “thinking budget” and token/usage metadata), and local inference via Ollama (example: Qwen 8B). The tutorial emphasizes what comes back from these calls: not just the answer text, but also metadata like finish reason and token usage, plus, depending on the model, reasoning content that can be enabled or suppressed.
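
As a concrete illustration, here is a minimal sketch of that unified pattern using LangChain’s init_chat_model helper. The exact model identifiers (gpt-4o-mini, gemini-2.5-flash, qwen3:8b) are assumptions rather than the precise ones from the video, and each provider requires an API key or a local Ollama install.

```python
# Minimal sketch: one init/invoke pattern across providers.
# Model identifiers below are illustrative assumptions.
from langchain.chat_models import init_chat_model

openai_model = init_chat_model("gpt-4o-mini", model_provider="openai")
gemini_model = init_chat_model("gemini-2.5-flash", model_provider="google_genai")
local_model = init_chat_model("qwen3:8b", model_provider="ollama")  # model pulled locally via Ollama

response = openai_model.invoke("Explain retrieval-augmented generation in one sentence.")
print(response.content)            # answer text
print(response.response_metadata)  # e.g. finish reason
print(response.usage_metadata)     # input/output token counts
```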

Next comes chat orchestration. Instead of a single prompt string, LangChain uses a chat prompt template built from a system message and user messages, with template variables filled at runtime (the example swaps in an agent name like “Slim Shady”). Conversation history is maintained using typed message objects (system, human, AI), and subsequent turns reuse that history to keep context. The tutorial also demonstrates how to disable “thinking” so outputs stay concise.
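
A sketch of that orchestration, assuming a chat model initialized as above; the {agent_name} variable mirrors the video’s “Slim Shady” example.

```python
# Chat prompt template with a runtime variable plus typed message history.
from langchain.chat_models import init_chat_model
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import HumanMessage

model = init_chat_model("gpt-4o-mini", model_provider="openai")  # illustrative model choice

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful agent named {agent_name}."),
    ("placeholder", "{history}"),   # prior human/AI turns slot in here
    ("human", "{question}"),
])

history = []  # accumulated HumanMessage / AIMessage objects

question = "Who are you?"
reply = model.invoke(prompt.invoke({
    "agent_name": "Slim Shady",
    "history": history,
    "question": question,
}))

# Persist the turn so the next call keeps prior context
history.extend([HumanMessage(question), reply])
```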

Structured output is presented as a reliability lever. By defining a Pydantic model (a “song classification” schema with fields like song name, style enum, and reasoning), the model is forced to return JSON that matches the schema. Two lyric-based classification attempts show the tradeoff: the structure is correct even when the factual fields are wrong, and the reasoning field can still be useful. The key takeaway is that schema-constrained outputs reduce format-related hallucinations and make downstream parsing safer.
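
A hedged sketch of that schema: the field names and enum values below approximate the “song classification” model described above rather than reproducing it exactly.

```python
# Structured output: constrain the response to a Pydantic schema.
from enum import Enum
from pydantic import BaseModel, Field
from langchain.chat_models import init_chat_model

class SongStyle(str, Enum):
    GANGSTER_RAP = "gangster rap"
    RNB = "R&B"
    OTHER = "other"

class SongClassification(BaseModel):
    song_name: str = Field(description="Best guess at the song title")
    style: SongStyle = Field(description="Style of the lyrics")
    reasoning: str = Field(description="Why this classification was chosen")

model = init_chat_model("gpt-4o-mini", model_provider="openai")
classifier = model.with_structured_output(SongClassification)

result = classifier.invoke("Classify these lyrics: ...")  # returns a SongClassification instance
print(result.song_name, result.style, result.reasoning)
```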

For private knowledge, the tutorial builds a minimal RAG pipeline over a PDF. It loads a two-page Aston Martin Valhalla document using PyPDF, embeds the pages with FastEmbed (locally), stores vectors in an in-memory vector store, and retrieves the most similar page(s) for a query. A Q&A prompt then instructs the model to answer only from retrieved context and to say “I don’t know” when information is missing. The example queries (engine type, horsepower, 0–60 acceleration) demonstrate that even without chunking, the retrieved context can be sufficient for accurate answers.
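
A minimal sketch of that pipeline; the PDF path and question are placeholders, and FastEmbed’s default local embedding model is assumed.

```python
# Load a PDF, embed its pages locally, retrieve the closest page, and answer from it.
from langchain.chat_models import init_chat_model
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import FastEmbedEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore

pages = PyPDFLoader("valhalla.pdf").load()        # one Document per page, no chunking
store = InMemoryVectorStore.from_documents(pages, FastEmbedEmbeddings())

question = "How much horsepower does the Valhalla have?"
context = store.similarity_search(question, k=1)  # k=1 as in the demo

qa_prompt = (
    "Answer using only the context below. "
    "If the answer is not in the context, say \"I don't know\".\n\n"
    f"Context:\n{context[0].page_content}\n\nQuestion: {question}"
)
model = init_chat_model("gpt-4o-mini", model_provider="openai")
print(model.invoke(qa_prompt).content)
```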

Tool calling ties the pieces together. A LangChain tool (decorated function) wraps the RAG “answer query” logic. The model is bound to this tool and returns tool-call arguments; the system executes the tool and feeds the tool result back into the conversation. The example asks for the Valhalla transmission, and the final response incorporates the PDF-derived answer.
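
A sketch of that flow; the tool name answer_query and the query text are illustrative, and the store setup repeats the RAG snippet so the example stands alone.

```python
# Wrap retrieval as a tool, let the model request it, execute it, and answer from the result.
from langchain.chat_models import init_chat_model
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import FastEmbedEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_core.tools import tool
from langchain_core.messages import HumanMessage, ToolMessage

store = InMemoryVectorStore.from_documents(
    PyPDFLoader("valhalla.pdf").load(), FastEmbedEmbeddings()
)

@tool
def answer_query(query: str) -> str:
    """Answer questions using the private Aston Martin Valhalla PDF."""
    return store.similarity_search(query, k=1)[0].page_content

model = init_chat_model("gpt-4o-mini", model_provider="openai").bind_tools([answer_query])

messages = [HumanMessage("What is the transmission of the Valhalla?")]
ai_msg = model.invoke(messages)
messages.append(ai_msg)

# Execute each requested tool call and feed the result back as a ToolMessage
for call in ai_msg.tool_calls:
    result = answer_query.invoke(call["args"])
    messages.append(ToolMessage(content=result, tool_call_id=call["id"]))

print(model.invoke(messages).content)  # final answer grounded in the PDF
```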

Finally, observability is handled through MLflow integration (MLflow 3 release mentioned). By running an MLflow tracking server locally and viewing traces in the MLflow UI, developers can inspect each request, prompt, response, timing, and model metadata—turning “it worked” into traceable, debuggable behavior across chat, RAG, and tool calls.
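
A sketch of wiring that up, assuming a local MLflow server on the default port and an arbitrary experiment name.

```python
# Enable MLflow tracing so every LangChain call shows up in the MLflow UI.
import mlflow
from langchain.chat_models import init_chat_model

mlflow.set_tracking_uri("http://127.0.0.1:5000")  # local `mlflow server`
mlflow.set_experiment("langchain-tutorial")       # illustrative experiment name
mlflow.langchain.autolog()                        # auto-trace chain, model, and tool calls

model = init_chat_model("gpt-4o-mini", model_provider="openai")
model.invoke("Hello!")  # this call now appears as a trace: prompt, response, timing, metadata
```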

Cornell Notes

LangChain’s core building blocks let developers reuse the same workflow across LLM providers, add reliability with structured outputs, and ground answers in private documents via RAG. The tutorial demonstrates unified model invocation (OpenAI, Google GenAI/Gemini, and local Ollama/Qwen), then upgrades to chat prompts with system/user roles and persistent message history. Structured output uses a Pydantic schema so responses conform to a JSON format, reducing parsing failures even when factual accuracy varies. For RAG, PDF pages are extracted with PyPDF, embedded with FastEmbed, stored in an in-memory vector store, retrieved by similarity, and injected into a Q&A prompt that discourages fabrication. Tool calling wraps the RAG function as a callable tool, and MLflow integration provides traceability through OpenTelemetry-style traces and a UI.

How does LangChain keep model calls consistent across different LLM providers?

A chat model is initialized with a chosen provider and model name, then invoked with a prompt. The same pattern—initialize, call invoke (or equivalent), and read the response—works for OpenAI (GPT-4 mini), Google GenAI (Gemini 2.5 Flash with parameters like a thinking budget and token/usage metadata), and local inference through Ollama (Qwen 8B). The response objects include not only content but also metadata such as finish reason and usage token counts, which helps standardize monitoring.

What changes when moving from a single prompt to a multi-turn chatbot?

Instead of one prompt string, LangChain uses a chat prompt template composed of a system message and user messages. Template variables can be filled at runtime (the example replaces an agent name with “Slim Shady”). Conversation state is maintained by appending typed messages (human and AI) to a history list. Each new turn invokes the model with the updated message history so the model can respond with prior context.

Why use structured output with a Pydantic schema, and what does it guarantee?

Structured output constrains the model’s response to a defined JSON schema (via a Pydantic model). In the song classification example, fields include song name, a style enum (e.g., gangster rap, R&B, other), and reasoning. This guarantees the response format matches the schema, making it easier and safer to parse downstream. It does not guarantee factual correctness—lyrics-based tests show the structure can be correct even when the model misidentifies the song.

How does the minimal RAG pipeline over a PDF work end-to-end?

The PDF is loaded with PyPDF, pages are embedded with FastEmbed (configured for local embeddings), and vectors are stored in an in-memory vector store. For each question, similarity search retrieves the most relevant page(s) (the example uses K=1 and does not chunk the document). A Q&A prompt injects the retrieved context and instructs the model to answer from that context or say “I don’t know” rather than inventing details.

How does tool calling connect the model to retrieval logic?

A retrieval function is wrapped as a LangChain tool using the @tool decorator. The tool includes a docstring describing what it does (answering questions using private PDF information). The model is bound to the tool, then returns tool-call arguments (e.g., query = “what is the transmission of the Valhalla”). The system executes the tool with those arguments, adds the tool result to message history, and then generates the final user-facing response.

What does MLflow integration add for debugging and monitoring?

MLflow integration provides traceability for LLM calls. The tutorial runs an MLflow tracking server locally, sets an MLflow tracking URI and experiment name, then inspects traces in the MLflow UI. Traces show request IDs, timing (matching notebook call durations), prompts, assistant responses (including reasoning when enabled), and model metadata—making it easier to debug slow calls or incorrect outputs across chat, RAG, and tool calling.

Review Questions

  1. When using structured output with a Pydantic schema, what is guaranteed about the response format—and what is not?
  2. In the minimal PDF RAG setup, what are the roles of PyPDF, FastEmbed, the in-memory vector store, and the similarity search parameter K?
  3. How does tool calling change the flow compared with directly injecting retrieved context into the prompt?

Key Points

  1. LangChain provides a unified way to initialize and invoke chat models across OpenAI, Google GenAI (Gemini), and local Ollama models.
  2. Chat prompt templates support system and user roles plus runtime template variables, and typed message history enables multi-turn context.
  3. Structured output with a Pydantic schema enforces JSON conformity, improving downstream parsing reliability even when factual answers can still be wrong.
  4. A minimal RAG pipeline can be built by extracting PDF pages (PyPDF), embedding them locally (FastEmbed), retrieving by similarity from an in-memory vector store, and grounding answers in retrieved context.
  5. Tool calling wraps retrieval logic as a callable tool, letting the model request specific actions via tool-call arguments and then generating a final response from tool results.
  6. MLflow integration adds trace-level observability—prompts, responses, timing, and metadata—so LLM behavior becomes debuggable rather than opaque.

Highlights

The same LangChain invocation pattern works across GPT-4 mini, Gemini 2.5 Flash, and a local Qwen 8B model via Ollama, with consistent access to response metadata and usage tokens.
Schema-constrained structured output can keep response formatting stable (JSON matching a Pydantic model) even when the model’s factual fields are incorrect.
A two-page PDF RAG demo answers questions accurately by retrieving the most similar page (K=1) and instructing the model not to fabricate when context is missing.
Tool calling turns the RAG function into a model-accessible capability: the model emits tool-call arguments, the system executes retrieval, and the final answer is generated from tool output.
MLflow UI traces provide a timing breakdown that matches notebook call durations and show prompts/responses for each step.

Topics

  • LangChain Building Blocks
  • LLM Provider Abstraction
  • JSON Structured Output
  • RAG PDF Ingestion
  • Tool Calling
  • Observability with MLflow
