LangSmith Crash Course | LangSmith Tutorial for Beginners | Observability in GenAI | CampusX
Based on CampusX's video on YouTube. If you find this content useful, support the original creators by watching, liking, and subscribing to their channel.
Briefing
LangSmith is positioned as the missing “white-box” layer for LLM applications—turning opaque, non-deterministic behavior into traceable, component-by-component evidence. The core problem is that LLM systems often fail in ways that don’t produce clear error traces: the same input can yield different outputs, and production issues like latency spikes, cost blowups, or hallucinations can’t be reliably attributed to a specific stage (retrieval, generation, parsing, etc.). LangSmith’s value is that it records end-to-end executions as granular traces, letting teams answer not just what happened, but where and why it happened.
The walkthrough starts with three production-style scenarios that motivate observability. First, a job-application assistant built on an LLM workflow suddenly slows from ~2 minutes to 7–10 minutes. The system has multiple stages—JD ingestion, document fetching from Google Drive, matching, cover-letter generation, and proofreading—but only the final input/output and total latency are visible. Without internal breakdowns, debugging becomes guesswork (e.g., a recent code push might cause the system to scan the entire Drive). Second, a research agent sees API costs spike: some reports jump from $0.50 to $2. The likely cause is agentic looping—an upgraded prompt can cause the agent to keep re-running steps until it believes the output is “perfect,” increasing token usage unpredictably for certain topics. Third, a RAG chatbot begins hallucinating policy answers (e.g., leave policy, health insurance, notice period). Here the failure could be in retrieval (wrong documents fetched) or generation (the LLM ignores context or the prompt is too permissive), but the system only returns the final answer—making root-cause analysis extremely difficult.
LangSmith is introduced with a definition of observability: understanding a system’s internal state by examining external outputs like logs, metrics, and traces. In practice, LangSmith traces each execution’s inputs, outputs, intermediate steps, latency per component, token usage, and cost—plus errors and optional tags/metadata. The tutorial then demonstrates integration with LangChain using a simple prompt → OpenAI model → string output parser chain. Without changing core logic, the user runs the app and LangSmith automatically creates traces under a project, showing component-level runs, per-step latency, tokens, and cost.
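A minimal sketch of that setup, assuming tracing is switched on through environment variables (LANGCHAIN_TRACING_V2 and LANGCHAIN_PROJECT, plus a LangSmith API key already in the environment); the project name, prompt, and model choice below are placeholders rather than the tutorial's exact values:

```python
import os

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Turn on LangSmith tracing without touching the chain's core logic.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "langsmith-demo"  # traces are grouped under this project
# LANGCHAIN_API_KEY (LangSmith) and OPENAI_API_KEY are assumed to be set already.

prompt = ChatPromptTemplate.from_template("Explain {topic} in two sentences.")
model = ChatOpenAI(model="gpt-4o-mini")  # any chat model works; this one is just an example
parser = StrOutputParser()

# prompt -> model -> parser: each component becomes a run inside one trace.
chain = prompt | model | parser
print(chain.invoke({"topic": "observability in GenAI"}))
```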
The same approach is extended to sequential chains (two-step report generation plus summarization), where the tutorial shows how to set project names, add custom tags and metadata, and even rename runs. Next comes a RAG example using a PDF, where the tutorial highlights two key production pitfalls. One is incomplete tracing—LangSmith only traces LangChain “runnables” by default, so PDF loading, chunking, and embedding may be invisible unless wrapped with LangSmith’s @traceable decorator. The second pitfall is repeated recomputation: every query reloads the PDF, re-chunks it, and re-embeds it, causing large latency. The fix is to persist an index (using a vector store like FAISS) and reuse it across runs, rebuilding only when inputs or configuration change.
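A sketch of how both pitfalls might be handled, assuming a local PDF, OpenAI embeddings, and LangSmith's @traceable decorator; the file paths, chunking parameters, run names, and tags are illustrative, not taken from the tutorial:

```python
import os

from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langsmith import traceable

INDEX_DIR = "faiss_index"  # hypothetical directory for the persisted index
embeddings = OpenAIEmbeddings()

@traceable(name="build_or_load_index")  # makes this non-runnable step visible in the trace
def build_or_load_index(pdf_path: str) -> FAISS:
    if os.path.isdir(INDEX_DIR):
        # Reuse the saved index instead of re-loading, re-chunking, and re-embedding per query.
        return FAISS.load_local(INDEX_DIR, embeddings, allow_dangerous_deserialization=True)
    docs = PyPDFLoader(pdf_path).load()
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    index = FAISS.from_documents(splitter.split_documents(docs), embeddings)
    index.save_local(INDEX_DIR)  # persist so later runs skip the expensive rebuild
    return index

# Custom run names, tags, and metadata can be attached per invocation:
retriever = build_or_load_index("policies.pdf").as_retriever()
docs = retriever.invoke(
    "What is the notice period?",
    config={"run_name": "policy_retrieval", "tags": ["rag", "hr"], "metadata": {"pdf": "policies.pdf"}},
)
```

Keying the index directory on a hash of the PDF and the chunking settings is one way to decide when a rebuild is actually needed rather than reusing the stored index.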
Finally, the tutorial covers agentic workflows and LangGraph integration. For agents, LangSmith traces the scratchpad, tool selection, tool inputs/outputs, and the iterative thought-action-observation loop—making it possible to debug wrong tool usage or incorrect intermediate assumptions. For LangGraph, the tutorial emphasizes that an entire graph execution becomes one trace, and each node execution becomes a run inside that trace, enabling visualization of branching and conditional paths.
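A minimal LangGraph sketch of that mapping, with invented state fields and node names; with tracing enabled as above, one invoke of the compiled graph shows up as a single trace and each node as a run inside it:

```python
from typing import TypedDict

from langgraph.graph import END, START, StateGraph

class State(TypedDict, total=False):
    question: str
    context: str
    answer: str

def retrieve(state: State) -> dict:
    # Stand-in for a retrieval step; returns a partial state update.
    return {"context": f"relevant chunks for: {state['question']}"}

def generate(state: State) -> dict:
    # Stand-in for the LLM call that uses the retrieved context.
    return {"answer": f"Answer grounded in: {state['context']}"}

graph = StateGraph(State)
graph.add_node("retrieve", retrieve)
graph.add_node("generate", generate)
graph.add_edge(START, "retrieve")
graph.add_edge("retrieve", "generate")
graph.add_edge("generate", END)
app = graph.compile()

# One graph execution -> one trace; "retrieve" and "generate" -> runs inside it.
print(app.invoke({"question": "What is the leave policy?"}))
```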
Beyond observability, LangSmith is framed as part of LLM Ops: monitoring and alerting across many traces (latency, cost, error rate, success rate), evaluation against datasets and metrics (including “LLM-as-a-judge”), prompt experimentation/AB testing, dataset creation and annotation, user feedback capture, and team collaboration via shareable trace links and versioned prompt workflows. The takeaway is that production reliability for LLM systems requires more than building prompts and chains—it requires systematic measurement, debugging, and iteration across the full lifecycle.
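A rough sketch of the dataset-and-evaluation flow using the langsmith SDK's evaluate helper; the dataset name, example, scoring rule, and stubbed pipeline are all placeholders for illustration:

```python
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# 1. A tiny evaluation dataset: question -> reference answer.
dataset = client.create_dataset("hr-policy-qa")  # hypothetical dataset name
client.create_examples(
    inputs=[{"question": "What is the notice period?"}],
    outputs=[{"answer": "60 days"}],
    dataset_id=dataset.id,
)

# 2. The system under test; a stub standing in for the real RAG pipeline.
def target(inputs: dict) -> dict:
    return {"output": "The notice period is 60 days."}

# 3. A simple programmatic evaluator: does the reference answer appear in the output?
def contains_reference(run, example):
    found = example.outputs["answer"].lower() in run.outputs["output"].lower()
    return {"key": "contains_reference", "score": int(found)}

# 4. Run the experiment; scores and traces appear together in LangSmith.
evaluate(target, data="hr-policy-qa", evaluators=[contains_reference], experiment_prefix="baseline")
```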
Cornell Notes
LangSmith is presented as a unified observability and evaluation platform for LLM apps, designed to make non-deterministic, multi-stage systems debuggable in production. It records end-to-end executions as traces and captures component-level runs, including inputs/outputs, intermediate steps, latency, token usage, and cost—so teams can identify whether failures come from retrieval, generation, or other stages. The tutorial demonstrates LangSmith integration with LangChain (simple chains and sequential chains), then shows RAG debugging where tracing can be incomplete unless non-runnable steps (PDF load/chunk/embed) are wrapped with traceable decorators. It also addresses RAG latency spikes by persisting embeddings/indexes (e.g., FAISS) and reusing them across queries. Finally, it extends to agentic workflows and LangGraph, where graph executions and node executions map cleanly into traces and runs, enabling step-by-step debugging and performance tracking.
Why does debugging LLM applications get harder in production compared to traditional software?
How does LangSmith’s tracing help pinpoint root causes like latency spikes or hallucinations?
How do the three core concepts LangSmith uses in the tutorial (Project, Trace, and Run) relate to one another?
What two major issues appear in the RAG example, and how are they addressed?
How does LangSmith make agentic workflows debuggable?
What does monitoring and alerting add beyond single-trace observability?
Review Questions
- In a multi-stage LLM workflow, what specific evidence would you look for in LangSmith to determine whether a RAG hallucination came from retrieval or generation?
- How would you modify a RAG pipeline so that PDF loading/chunking/embedding appear in LangSmith traces, and why might they be missing by default?
- When using a persistent vector index (e.g., FAISS), what conditions should trigger rebuilding the index rather than reusing it?
Key Points
1. LangSmith provides end-to-end traces that record inputs/outputs, intermediate steps, per-component latency, token usage, and cost—turning LLM “black-box” behavior into debuggable evidence.
2. LLM production failures often stem from specific stages (retrieval, generation, parsing) but don’t surface as clean error traces; tracing is how root cause becomes identifiable.
3. RAG systems commonly fail in two ways: incomplete tracing (non-runnable steps like PDF load/chunk/embed aren’t visible) and repeated recomputation (rebuilding embeddings/indexes every query).
4. Persisting a vector index (e.g., FAISS) and reusing it across runs can cut RAG latency dramatically, while rebuilding only when inputs or configuration change.
5. Agentic workflows become debuggable when traces capture tool selection, tool inputs/outputs, and scratchpad updates across iterative steps.
6. Monitoring aggregates metrics across many traces over time, while alerting notifies teams when latency/cost/error rates drift beyond thresholds.
7. LangSmith extends beyond observability into LLM Ops: evaluation, prompt experimentation, dataset creation/annotation, user feedback capture, and team collaboration.