
The 4 Stacks of LLM Apps & Agents

Sam Witteveen · 6 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Organize LLM apps into four stacks (LLM, search/memory/data, reasoning-and-action, and personalization) to make architecture decisions concrete.

Briefing

Building useful LLM apps and agents comes down to assembling four distinct “stacks” in the right places: the model itself, the data/search/memory layer, the reasoning-and-action layer, and the personalization layer. The practical value of this framework is that it turns vague “LLM app” ideas into an architecture you can plan—what needs to be trained or served, what information must be retrieved and injected, how decisions and tool use should work, and how the system should speak and behave.

At the foundation sits the LLM stack: everything tied to the language model’s capabilities and deployment. That includes how the model was created—pre-training, fine-tuning, and whether RLHF was used—and whether additional fine-tuning is worth doing for a specific domain, use case, or writing style. Deployment choices matter too. For open-source models, teams must decide between cloud serving (paying per token or compute) and local hosting, potentially using quantized “4-bit” style models. The tradeoff is not just cost; quantization can reduce performance on tasks beyond simple chat, especially where logic or mathematics is involved.
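The cloud-vs-local tradeoff above can be made concrete with some back-of-the-envelope arithmetic. All prices below are illustrative assumptions, not real quotes:

```python
# Back-of-the-envelope comparison of cloud per-token pricing vs. a flat
# local-hosting budget. All numbers are illustrative assumptions.

def cloud_cost(tokens_per_month: int, price_per_1k_tokens: float) -> float:
    """Monthly cost when paying per 1K tokens served."""
    return tokens_per_month / 1000 * price_per_1k_tokens

def breakeven_tokens(local_monthly_cost: float, price_per_1k_tokens: float) -> int:
    """Token volume at which local hosting matches cloud spend."""
    return int(local_monthly_cost / price_per_1k_tokens * 1000)

# Hypothetical numbers: $0.002 per 1K tokens vs. a $400/month GPU box.
print(cloud_cost(50_000_000, 0.002))   # 100.0 (dollars/month)
print(breakeven_tokens(400.0, 0.002))  # 200000000 tokens/month
```

Below the breakeven volume, cloud serving is cheaper; above it, local hosting (possibly with quantized weights, and their quality caveats) starts to pay off.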

Above that is the search, memory, and data stack, which focuses on getting the right information into prompts. This layer typically uses semantic search and vector stores, with decisions such as whether to use Faiss or hosted options like ChromaDB or Pinecone. It also covers how data is sourced—whether from conventional databases, knowledge graphs, or the live web via tools like Google or DuckDuckGo—and how it’s extracted and injected into the model. Context window size changes what’s feasible: a 16K context window allows more retrieved content to be included. For agents, persistent memory is crucial: saving and retrieving information for in-context learning so the system can condition responses on prior facts.
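The retrieve-and-inject pattern behind this stack can be sketched in a few lines. Real systems use learned embeddings and a vector store such as Faiss, ChromaDB, or Pinecone; here a toy bag-of-words vector stands in for an embedding so the sketch stays self-contained:

```python
# Minimal sketch of retrieve-and-inject: rank documents by similarity to
# the query, then place the best match into the prompt as context.
# Bag-of-words vectors stand in for real embeddings here.
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "Faiss is a library for efficient similarity search.",
    "Quantization shrinks models to 4-bit weights.",
]
context = retrieve("similarity search over vectors", docs)[0]
prompt = f"Answer using this context:\n{context}\n\nQuestion: ..."
```

Persistent agent memory follows the same shape: saved facts are embedded, stored, and retrieved back into the prompt when relevant.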

The third layer—reasoning and actions—handles decision-making and tool use. Instead of leaving every choice to the language model alone, this stack often incorporates structured approaches like ReAct-style tool calling and external solvers. Mathematical or logic solvers can run heuristics over inputs and return answers that the LLM then uses to take the next step. This layer is also where API calls and “go fetch information” behaviors belong. The expectation is that this stack will grow quickly as systems adopt reasoning methods beyond pure text generation.
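A ReAct-style loop can be sketched as below. The `decide` function is a stand-in for the LLM's tool-selection step (a real system would prompt the model for a thought and action), and both tools are hypothetical examples of the actions this stack owns:

```python
# Sketch of a ReAct-style reason-act loop: choose a tool, run it, and
# feed the observation back to the model. `decide` stands in for the
# LLM's tool-selection step; the tools are illustrative placeholders.

def calculator(expression: str) -> str:
    # External solver: deterministic arithmetic the LLM doesn't have to guess.
    return str(eval(expression, {"__builtins__": {}}, {}))

def search(query: str) -> str:
    # Placeholder for a web or database lookup action.
    return f"results for: {query}"

TOOLS = {"calculator": calculator, "search": search}

def decide(question: str) -> tuple[str, str]:
    """Stand-in for the model's 'thought -> action' choice."""
    if any(ch.isdigit() for ch in question):
        return "calculator", question
    return "search", question

def react_step(question: str) -> str:
    tool, tool_input = decide(question)    # Thought + Action
    observation = TOOLS[tool](tool_input)  # Observation
    return f"[{tool}] {observation}"       # fed back to the model

print(react_step("12 * 7"))  # [calculator] 84
```

The key structural point is that the numeric answer comes from the solver, not from sampled text, which is what makes the next step more reliable.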

At the top is the personalization stack, which shapes how the system sounds and relates to users. It includes prompt engineering for personality and brand, plus customization of output style and conversational behavior. Sometimes it involves tracking user-specific details; often it’s about controlling tone, format, and interaction patterns through prompts and related manipulations. The framework helps teams allocate effort: role-playing and stylistic fidelity demand more personalization work, while knowledge-heavy applications demand more investment in the data/search/memory layer.

Finally, architecture choices determine where each stack lives. Mobile apps may push much of the heavy lifting to the cloud while keeping some personalization on-device. Agent design requires deciding which tools the agent can use, what reasoning mechanisms it relies on, and whether heuristic systems should assist. Tools like LlamaIndex are positioned as strong interfaces for moving data from databases and vector stores into prompt-ready formats for in-context learning. The overall message: treat LLM apps as systems of components, not monolithic prompts, and build each stack deliberately.

Cornell Notes

Useful LLM apps and agents can be organized into four stacks: (1) the LLM stack (model training, fine-tuning, and serving choices), (2) the search/memory/data stack (retrieval, vector stores, web search, and persistent memory for in-context learning), (3) the reasoning-and-action stack (decision-making, tool use, and external solvers such as ReAct-style flows), and (4) the personalization stack (prompt engineering for tone, personality, and user-specific interaction). The key is matching each product goal to the right stack so engineering effort lands where it matters. This approach also clarifies deployment tradeoffs, like cloud vs local hosting and how quantization can affect logic-heavy tasks.

What belongs in the LLM stack, and why do deployment choices matter as much as model quality?

The LLM stack includes how the model was built (pre-training, fine-tuning, and whether RLHF was applied) and whether additional fine-tuning is needed for a specific domain or style. It also includes serving strategy for open-source models: cloud hosting with token or compute-based pricing versus local hosting. Local options may involve quantized models (e.g., 4-bit style), which can reduce cost but may hurt performance on tasks beyond chat—particularly logic and mathematics—compared with full-resolution models.
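The memory arithmetic behind the 4-bit tradeoff is simple to sketch. This counts weight storage only (ignoring KV cache and activations), with illustrative model sizes:

```python
# Rough weight-storage arithmetic behind quantized local hosting.
# Counts weights only; KV cache and activations add more on top.

def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB

fp16 = weight_memory_gb(7, 16)  # 14.0 GB for a 7B model at fp16
q4 = weight_memory_gb(7, 4)     # 3.5 GB at 4-bit: fits consumer GPUs
```

The 4x reduction is what makes local hosting feasible, but as noted above, the compression can cost accuracy on logic- and math-heavy tasks.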

How does the search, memory, and data stack turn external information into prompt context?

This stack is responsible for retrieving and injecting information into prompts. It commonly uses semantic search and vector stores, choosing between systems like Faiss or hosted options such as ChromaDB and Pinecone. It also covers data extraction from databases or knowledge graphs and real-time web retrieval via services like Google or DuckDuckGo. Context window size (e.g., a 16K context window) affects how much retrieved content can be included. For agents, it also provides persistent memory: saving and retrieving facts so the model can condition responses through in-context learning.
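Fitting retrieved content into the context window is itself a small algorithm: keep the highest-ranked chunks until the token budget is spent. The sketch below approximates token counts as whitespace-separated words; a real system would use the model's own tokenizer:

```python
# Sketch of packing ranked retrieval chunks into a context budget.
# Token counts are approximated as whitespace words for illustration.

def pack_context(chunks: list[str], budget_tokens: int) -> list[str]:
    """Greedily keep ranked chunks until the token budget is spent."""
    packed, used = [], 0
    for chunk in chunks:  # assumed pre-sorted by relevance
        cost = len(chunk.split())
        if used + cost > budget_tokens:
            break
        packed.append(chunk)
        used += cost
    return packed

chunks = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
print(pack_context(chunks, 5))  # ['alpha beta gamma', 'delta epsilon']
```

A 16K window simply raises `budget_tokens`, letting more retrieved context survive the cut per request.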

What distinguishes the reasoning and action stack from “just prompting” an LLM?

The reasoning and action stack focuses on decision-making and executing steps. It often uses LLM-driven tool selection patterns such as ReAct, but it can also rely on external reasoning components like mathematical or logic solvers that apply heuristics to inputs and return stable results. Those outputs then guide the LLM’s next action. Tool/API calls and behaviors like “decide to fetch information” also fit here, and the framework expects this stack to expand as systems adopt reasoning methods beyond text generation.
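The "stable results" point can be made concrete with a tiny external solver. Rather than asking the model to do algebra in-text, the agent extracts the equation's coefficients, calls a deterministic solver, and conditions its next generation on the exact answer. The solver below is an illustrative example, not a specific library:

```python
# Sketch of delegating math to an external solver: the answer is computed
# exactly and handed back to the LLM as an observation, with no sampling
# involved in the numeric step. Illustrative example only.

def solve_linear(a: float, b: float, c: float) -> float:
    """Solve a*x + b = c deterministically."""
    if a == 0:
        raise ValueError("not a linear equation in x")
    return (c - b) / a

# The agent parses (a, b, c) out of the user's question, runs the solver,
# and feeds the stable result back into the generation loop.
observation = f"x = {solve_linear(3, 2, 11)}"  # "x = 3.0"
```

The same pattern generalizes to logic engines, constraint solvers, and other heuristic systems that return answers the LLM can build on.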

What is the personalization stack responsible for, and what kinds of work does it demand?

The personalization stack shapes how the system communicates and relates to users. It includes prompt engineering for personality and brand, plus controlling conversational style and output format. It may involve tracking user attributes, but often it’s about customizing responses through prompts and related output controls. If an app needs role-playing or consistent persona behavior, most effort shifts to this stack; if the main challenge is knowledge grounding, the data stack becomes the priority.

How should teams decide where to invest across stacks when designing an app or agent?

The framework suggests mapping requirements to stacks. Knowledge-heavy features (many data sources, retrieval, grounding, long context) point to the search/memory/data stack. Complex decision-making, tool use, and reliable logic point to the reasoning-and-action stack. Output tone, persona, and interaction style point to the personalization stack. The LLM stack underpins everything through model choice, fine-tuning, and serving constraints. It also helps decide where components live: mobile apps may keep some personalization on-device while relying on cloud for lower-level stacks.

Where does LlamaIndex fit in this architecture?

LlamaIndex is positioned mainly as an interface for the data stack. It helps move data from sources like databases or vector stores into a format suitable for in-context learning—effectively bridging raw data systems and the prompt-ready context the LLM needs to produce responses.
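The shape of that bridge can be sketched generically: structured records go in, prompt-ready text comes out. Note this is not LlamaIndex's actual API, just an illustration of the transformation it performs:

```python
# Generic sketch of the data-stack bridge: flattening structured records
# into text a model can condition on via in-context learning.
# This is NOT LlamaIndex's API, only the shape of the transformation.

def rows_to_context(rows: list[dict], max_rows: int = 3) -> str:
    """Flatten structured records into prompt-ready lines."""
    lines = []
    for row in rows[:max_rows]:
        lines.append("; ".join(f"{k}: {v}" for k, v in row.items()))
    return "\n".join(lines)

rows = [{"product": "widget", "stock": 12}, {"product": "gadget", "stock": 0}]
context = rows_to_context(rows)
prompt = f"Using the records below, answer the question.\n{context}\nQ: ..."
```

Libraries in this role also handle chunking, indexing, and retrieval, but the end product is the same: context the model can use directly.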

Review Questions

  1. If an agent must perform reliable math or logic, which stack should carry more responsibility: personalization, data retrieval, or reasoning-and-action—and what mechanism would support that?
  2. How would a 16K context window change the design choices in the search/memory/data stack compared with a much smaller context window?
  3. What are the main tradeoffs between cloud serving and local hosting (including quantized models) in the LLM stack for logic-heavy applications?

Key Points

  1. Organize LLM apps into four stacks (LLM, search/memory/data, reasoning-and-action, and personalization) to make architecture decisions concrete.

  2. Model quality isn’t enough; serving strategy (cloud vs local, token/compute pricing vs quantized hosting) can determine whether logic-heavy tasks perform well.

  3. Use the search/memory/data stack to retrieve and inject grounded information via semantic search, vector stores (e.g., Faiss, ChromaDB, Pinecone), and web search (e.g., Google, DuckDuckGo).

  4. Persistent memory for agents—saving and retrieving facts for in-context learning—is a core requirement, not an optional add-on.

  5. Move decision-making and tool execution into the reasoning-and-action stack using patterns like ReAct and external solvers for stability.

  6. Allocate engineering effort based on product goals: persona and style belong in personalization; grounding and retrieval belong in data; reliability in decisions belongs in reasoning-and-action.

  7. Choose where each stack runs (cloud vs device) based on app type, latency needs, and what must remain personalized on-device.

Highlights

Quantization and local hosting can reduce cost, but full-resolution models often perform better on logic and mathematics than 4-bit-style approaches.
A 16K context window expands what can be injected from retrieval, changing how much external context can be used per request.
External solvers and heuristic reasoning can make agent decisions more stable than relying on the LLM alone.
Personalization is treated as a dedicated stack—prompt engineering for tone, persona, and user interaction style—not a side effect of the base model.
LlamaIndex is framed primarily as a data-stack interface that prepares database/vector-store content for in-context learning.
