
“Llama 3 with Agents gives you Godlike power” - Pietro Schirano

David Ondrej · 6 min read

Based on David Ondrej's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

An agentic orchestration approach can make local models more dependable by delegating subtasks to weaker agents and using a stronger model for final verification.

Briefing

Llama 3’s biggest practical punch, as described in this conversation, isn’t just raw benchmark parity with frontier systems—it’s the ability to run a capable model locally and still get “good enough” reasoning for real tasks. The guest built an agentic orchestration framework that turns a simple goal into a workflow executed by smaller “mini agents,” then uses a stronger model as a final checker. That design choice—splitting labor and verifying at the end—lets cheaper, faster models handle most of the work while a higher-level model catches mistakes, making local deployment feel like having “the power of the Gods” in hand.

An email from Anthropic about winning a developer competition is mentioned as external validation for the framework, which the guest calls Maestro, an agentic system. The practical demo is local app generation: the user describes what they want, the system thinks through sub-agents, and then outputs a structured folder of files, an end-to-end workflow rather than a chat-only experience. The guest also emphasizes why Llama 3 fits this approach: it’s fast, relatively cheap (citing a cost around 25 cents per million tokens), and strong enough for task execution when paired with a “senior developer” style final review.
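The quoted price point is easy to sanity-check with simple arithmetic. A minimal sketch, assuming the roughly $0.25 per million tokens figure cited in the conversation (actual provider pricing varies and changes often):

```python
# Rough token-cost estimate using the rate quoted in the conversation.
# The rate and the workload size below are illustrative assumptions,
# not official pricing.

PRICE_PER_MILLION_TOKENS = 0.25  # USD, as cited for Llama 3 in the talk

def estimate_cost(total_tokens: int,
                  price_per_million: float = PRICE_PER_MILLION_TOKENS) -> float:
    """Return the estimated USD cost for a given token volume."""
    return total_tokens / 1_000_000 * price_per_million

# Example: an agent workflow that consumes 2M tokens end to end
cost = estimate_cost(2_000_000)
print(f"${cost:.2f}")  # → $0.50
```

At that rate, even a workflow that burns millions of tokens across many mini-agent calls stays in the cents-to-dollars range, which is the economic argument for letting cheap agents do most of the work.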

On model comparisons, the discussion draws a nuanced line between general capability and specific features. Llama 3 70B is described as close to GPT-4 for many tasks like coding and functional outputs, but still not quite there for areas such as function calling and JSON mode, with expectations that fine-tuning or future releases will close the gap. A key theme is that open models are becoming usable on everyday hardware—an argument reinforced by the claim that an 8B Llama 3 model compares well to a much larger Llama 2 variant, and that people are only just beginning to realize what local inference unlocks.

Creativity becomes a second major thread. The guest argues that Llama 3 (and especially Claude 3) can outperform GPT-4 on pseudo-creative tasks—brainstorming, generating variations, and image-related reasoning—where GPT-4 may “go off the rails.” A concrete example is YouTube title generation: GPT-4 allegedly tends to insert semicolons regardless of prompt formatting, while Claude 3 and Llama 3 produce more usable title variations. Another example is one-shot personalization: the guest describes feeding Claude data from ~800 tweets into a JSON structure and getting outputs that mimic the author’s voice and habits, including specific tagging behavior.

From there, the conversation shifts to fine-tuning and personalization as the next step beyond retrieval and prompting. Fine-tuning is framed as most valuable when a company’s or person’s data isn’t present in base models (brand style, proprietary image sets, customer-support tone) or when the goal is to make an assistant feel like an extension of the user. The guest predicts that “instant fine-tuning” and more dynamic weight adaptation will arrive within a few years, enabling household devices and truly personal AI.

Finally, the discussion broadens into a philosophy of preparation for a post-AGI world: don’t wait for alignment with the future—build for it. The guest recommends doubling down on what people are already uniquely good at (human skills like empathy and taste), then using AI to amplify the rest. In that framing, the real advantage goes to those who can “speak the new language” of AI and keep building as capabilities accelerate—while regulation and corporate risk determine how quickly the stack becomes widely available.

Cornell Notes

Llama 3’s practical breakthrough, in this discussion, is that it can be run locally while still producing high-quality outputs—especially when paired with an agentic orchestration system that delegates work to smaller agents and uses a stronger model for final verification. The guest argues that this “senior developer checking a junior” approach makes cheaper models viable for most steps, while catching errors at the end. Comparisons with GPT-4 are described as task-dependent: Llama 3 is said to be close for many coding and general tasks, but still trails on function calling and some structured modes. Creativity is treated as a differentiator too, with examples like YouTube title generation and image reasoning where Claude 3 and Llama 3 are claimed to behave more usefully than GPT-4. The conversation then pivots to fine-tuning and personalization—especially when proprietary style or data isn’t in base models—and ends with advice to prepare for superintelligence by amplifying human strengths with AI tools.

How does the agentic framework described here make local Llama 3 feel more reliable than “just prompting” a model?

The framework turns a single goal into a workflow executed by multiple smaller “mini agents.” A more capable model acts as an orchestrator/director, while less-intelligent agents handle sub-tasks. The system then uses the smarter model at the end as a checker—so the cheaper solver can be imperfect as long as the final review catches and fixes issues. The guest likens the dynamic to a senior developer reviewing a junior developer’s work before delivery.
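The division of labor described above can be sketched in a few lines of plain Python. The three role functions here are stand-ins for model calls (a real system would call a local Llama 3 endpoint for the workers and a stronger model for the orchestrator and checker); all names and return values are illustrative, not the actual framework's code:

```python
# Minimal orchestrator / mini-agent / checker skeleton.
# Each role below is a stub standing in for an LLM call.

def orchestrator(goal: str) -> list[str]:
    """Stronger model: break the goal into sub-tasks."""
    return [f"{goal}: step {i}" for i in range(1, 4)]

def mini_agent(task: str) -> str:
    """Cheaper, faster model: attempt one sub-task (may be imperfect)."""
    return f"draft result for ({task})"

def final_check(goal: str, drafts: list[str]) -> str:
    """Stronger model again: review and merge the drafts,
    like a senior developer reviewing a junior's work."""
    reviewed = [d.replace("draft", "verified") for d in drafts]
    return "\n".join(reviewed)

def run(goal: str) -> str:
    tasks = orchestrator(goal)
    drafts = [mini_agent(t) for t in tasks]
    return final_check(goal, drafts)

print(run("build a todo app"))
```

The design point is that only `orchestrator` and `final_check` need a strong model; every `mini_agent` call in the middle can be cheap, because errors are caught at the end rather than prevented at every step.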

Why does the guest claim Llama 3 can be “close to GPT-4” without matching it perfectly?

The comparison is framed as capability-by-capability. Llama 3 70B is described as nearly competitive for many tasks such as coding and general functional outputs, with examples like generating working code on the first try. But it’s said to lag for structured behaviors like function calling and JSON mode, which the guest expects could improve via fine-tuning or future releases.
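To make the gap concrete: "function calling" asks the model to emit strict JSON naming a tool and its arguments, which host code then validates and dispatches. A minimal sketch of that contract, with a hard-coded model response and a hypothetical `get_weather` tool for illustration:

```python
import json

# Host-side dispatch for a function-calling exchange.
# The model_response string is hard-coded here; a real system would
# receive it from the model.

def get_weather(city: str) -> str:
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

model_response = '{"function": "get_weather", "arguments": {"city": "Prague"}}'

def dispatch(raw: str) -> str:
    """Parse the model's JSON and call the named tool.
    Models that are weak at function calling fail exactly here,
    by emitting malformed JSON or unknown tool names."""
    try:
        call = json.loads(raw)
        fn = TOOLS[call["function"]]
        return fn(**call["arguments"])
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        return f"invalid tool call: {exc}"

print(dispatch(model_response))  # → Sunny in Prague
```

A model that is "close to GPT-4" at free-form coding can still score poorly on this, because one stray character in the JSON breaks the whole exchange.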

What concrete examples are used to argue that Claude 3 and Llama 3 are stronger at “creative” tasks than GPT-4?

Two examples stand out. First, YouTube title generation: GPT-4 is claimed to repeatedly insert semicolons regardless of prompt formatting, while Claude 3 and Llama 3 produce more varied, usable titles. Second, image-related creativity: when given images, Claude 3 is described as producing outputs that stay on-task and understand the intent better than GPT-4 Vision, which the guest says can “go off the rails.”

How does one-shot fine-tuning/personalization enter the discussion, and what’s the claimed outcome?

The guest describes collecting around 800 tweets and converting them into a JSON dataset with fields like the author identity and the tweet text. The goal is to have Claude generate new tweets “in the style of” the author. The claimed result is that the model writes in the person’s voice—matching phrasing habits and even tagging behaviors (e.g., using specific tags the author commonly uses). The takeaway is that strong models can sometimes learn style from a relatively small, well-structured dataset.
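A dataset like the one described can be sketched in a few lines. The field names (`author`, `text`) and the sample tweets are assumptions for illustration; the actual schema used in the conversation was not shown:

```python
import json

# Sketch of turning raw tweets into a structured dataset for style
# imitation. In the conversation, ~800 tweets were collected; two
# placeholder tweets stand in for them here.

raw_tweets = [
    "shipping a new demo today, more soon",
    "open models on local hardware are underrated",
]

def to_dataset(author: str, tweets: list[str]) -> str:
    """Wrap each tweet in a record tagged with its author and
    serialize the whole collection as pretty-printed JSON."""
    records = [{"author": author, "text": t} for t in tweets]
    return json.dumps(records, indent=2)

dataset = to_dataset("example_author", raw_tweets)
print(dataset)
```

The resulting file is then placed in the model's context window (or used as fine-tuning data) so the model can pick up phrasing and tagging habits from the examples.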

When should companies fine-tune instead of relying on prompts or retrieval?

Fine-tuning is presented as most useful in two cases. First, when a company’s data or brand style isn’t available in the base model—such as proprietary image style, brand persona, or copyright-cleared training assets. Second, when the assistant must mirror a specific tone or lived experience, like customer support that sounds like the company or a personal device that adapts to how an individual speaks. In both cases, fine-tuning turns the assistant into an extension of the brand or person rather than a generic model.

What preparation strategy is recommended for a post-AGI world?

The guest’s advice is to keep building around human strengths. People should identify skills they’re uniquely good at (empathy, taste, communication, product execution) and then use AI to amplify the parts that are easier to automate. The goal is not to let AI replace everything, but to become “better with AI” at the tasks where humans still have an edge—so job displacement risk is reduced while output scales.

Review Questions

  1. Which design choice in the agentic framework reduces the need for a single “perfect” model at every step?
  2. What differences between Llama 3 and GPT-4 does the guest attribute to structured capabilities like function calling?
  3. According to the discussion, what are the two main reasons to fine-tune, and how do they differ from retrieval or prompting?

Key Points

  1. An agentic orchestration approach can make local models more dependable by delegating subtasks to weaker agents and using a stronger model for final verification.

  2. Llama 3 is described as close to GPT-4 for many practical tasks (especially coding), but still behind on function calling and certain structured modes.

  3. Creativity and “staying on-task” are treated as differentiators, with examples like YouTube title generation and image-based reasoning where Claude 3 and Llama 3 are claimed to outperform GPT-4.

  4. Fine-tuning is most valuable when proprietary style/data isn’t in base models or when a system must mirror a specific brand/person tone.

  5. Personalization is expected to move toward faster or even near-instant fine-tuning within a few years, enabling more household and device-level assistants.

  6. Preparation for superintelligence should focus on amplifying human-unique skills (empathy, taste, communication, product thinking) rather than trying to out-automate AI at everything.

Highlights

Local Llama 3 becomes more usable when an orchestrator splits work across mini agents and then performs a final “senior check” to catch errors.
The comparison to GPT-4 is task-specific: Llama 3 is portrayed as nearly competitive for many coding/general tasks, while function calling and some structured behaviors lag.
Creativity is argued as a measurable difference—especially in brainstorming and image-related tasks—supported by examples like YouTube title generation.
Fine-tuning is framed as turning assistants into extensions of brands or individuals, particularly when proprietary style or data isn’t present in foundation models.
The post-AGI strategy is to double down on what humans do best, then use AI to scale it rather than trying to replace it.
