“Llama 3 with Agents gives you Godlike power” - Pietro Schirano
Based on David Ondrej's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Llama 3’s biggest practical punch, as described in this conversation, isn’t just raw benchmark parity with frontier systems—it’s the ability to run a capable model locally and still get “good enough” reasoning for real tasks. The guest built an agentic orchestration framework that turns a simple goal into a workflow executed by smaller “mini agents,” then uses a stronger model as a final checker. That design choice—splitting labor and verifying at the end—lets cheaper, faster models handle most of the work while a higher-level model catches mistakes, making local deployment feel like having “the power of the Gods” in hand.
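The "split labor, verify at the end" pattern can be sketched in a few lines. This is a hypothetical illustration of the control flow only, not the guest's actual framework: `call_weak_model` and `call_strong_model` stand in for whatever LLM API is in use, stubbed here so the skeleton runs on its own.

```python
# Hedged sketch of delegate-to-weak-agents, verify-with-strong-model orchestration.
# Both model calls are stubs; swap in real local/remote LLM calls as needed.

def call_weak_model(prompt: str) -> str:
    # Stand-in for a fast, cheap local model (e.g. a small Llama 3 variant).
    return f"draft for: {prompt}"

def call_strong_model(prompt: str) -> str:
    # Stand-in for a stronger "senior developer" reviewer model.
    return f"reviewed: {prompt}"

def run_goal(goal: str, subtasks: list) -> str:
    # 1. Delegate each subtask to a cheap "mini agent".
    drafts = [call_weak_model(f"{goal} :: {task}") for task in subtasks]
    # 2. Hand the combined drafts to the stronger model for final verification.
    combined = "\n".join(drafts)
    return call_strong_model(f"Check and merge these results:\n{combined}")

result = run_goal("build a todo app", ["plan files", "write code", "write tests"])
```

The point of the design is that the expensive model is called once per goal, not once per step, which is why most of the work can stay on cheap local inference.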
Anthropic’s email about winning an ACT developer competition is mentioned as external validation for the framework, an agentic system the guest calls Maestro. The practical demo is local app generation: the user describes what they want, the system reasons through sub-agents, and then outputs a structured folder of files, an end-to-end workflow rather than a chat-only experience. The guest also emphasizes why Llama 3 fits this approach: it is fast, relatively cheap (a cited cost of around 25 cents per million tokens), and strong enough for task execution when paired with a “senior developer” style final review.
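The "structured folder with files" step amounts to turning a model's structured output into real files on disk. A minimal sketch, assuming a simple JSON shape (a `files` list of path/content pairs) that is an illustration, not the framework's actual schema:

```python
# Hypothetical sketch: materializing a model's JSON plan as a project folder.
# The schema ("files" -> list of {"path", "content"}) is an assumption.
import json
import pathlib
import tempfile

plan = json.loads("""
{"files": [
  {"path": "app/main.py", "content": "print('hello')\\n"},
  {"path": "app/README.md", "content": "# Generated app\\n"}
]}
""")

def write_project(plan: dict, root: str) -> list:
    written = []
    for spec in plan["files"]:
        target = pathlib.Path(root) / spec["path"]
        target.parent.mkdir(parents=True, exist_ok=True)  # create nested folders
        target.write_text(spec["content"])                # write the generated file
        written.append(str(target))
    return written

root = tempfile.mkdtemp()
paths = write_project(plan, root)
```

Writing into a temporary directory keeps the sketch self-contained; a real tool would write into the user's chosen project root.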
On model comparisons, the discussion draws a nuanced line between general capability and specific features. Llama 3 70B is described as close to GPT-4 for many tasks like coding and functional outputs, but still not quite there in areas such as function calling and JSON mode, with the expectation that fine-tuning or future releases will close the gap. A key theme is that open models are becoming usable on everyday hardware, an argument reinforced by the claim that the 8B Llama 3 model compares well to a much larger Llama 2 variant, and that people are only beginning to realize what local inference unlocks.
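Why JSON mode matters in practice: without enforced structure, a model reply may not parse, so a caller has to validate and fall back. A minimal, library-agnostic sketch of that validation step (the example replies are invented):

```python
# Hedged sketch of structured-output validation: parse the reply as JSON,
# return None on failure so an orchestrator can re-prompt or fall back.
import json
from typing import Optional

def parse_model_reply(reply: str) -> Optional[dict]:
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        return None

# A well-formed "JSON mode" style reply parses cleanly...
ok = parse_model_reply('{"title": "Llama 3 agents", "tags": ["local", "ai"]}')
# ...while chatty, unstructured output does not.
bad = parse_model_reply("Sure! Here is your JSON: {title: ...}")
```

Models that reliably emit parseable JSON remove the retry loop entirely, which is the gap the discussion says Llama 3 has not fully closed.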
Creativity becomes a second major thread. The guest argues that Llama 3 (and especially Claude 3) can outperform GPT-4 on pseudo-creative tasks—brainstorming, generating variations, and image-related reasoning—where GPT-4 may “go off the rails.” A concrete example is YouTube title generation: GPT-4 allegedly tends to insert semicolons regardless of prompt formatting, while Claude 3 and Llama 3 produce more usable title variations. Another example is one-shot personalization: the guest describes feeding Claude data from ~800 tweets into a JSON structure and getting outputs that mimic the author’s voice and habits, including specific tagging behavior.
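The "~800 tweets into a JSON structure" idea amounts to packing a writing sample into one structured context payload the model can mimic. A hypothetical sketch, with illustrative field names that are assumptions rather than the guest's actual format:

```python
# Hedged sketch of a one-shot personalization payload built from tweets.
# Field names ("author", "sample_count", "samples") are illustrative only.
import json

tweets = [
    "shipping a tiny agent demo today",
    "hot take: local models are underrated",
]

def build_style_payload(author: str, samples: list) -> str:
    payload = {
        "author": author,
        "sample_count": len(samples),
        "samples": samples,  # raw writing samples whose voice the model should mimic
    }
    return json.dumps(payload)

# The resulting JSON string would be prepended to a generation prompt.
context = build_style_payload("example_user", tweets)
```

At ~800 tweets this fits comfortably in a long context window, which is why it works as prompting rather than requiring fine-tuning.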
From there, the conversation shifts to fine-tuning and personalization as the next step beyond retrieval and prompting. Fine-tuning is framed as most valuable when a company’s or person’s data isn’t present in base models (brand style, proprietary image sets, customer-support tone) or when the goal is to make an assistant feel like an extension of the user. The guest predicts that “instant fine-tuning” and more dynamic weight adaptation will arrive within a few years, enabling household devices and truly personal AI.
Finally, the discussion broadens into a philosophy of preparation for a post-AGI world: don’t wait for alignment with the future—build for it. The guest recommends doubling down on what people are already uniquely good at (human skills like empathy and taste), then using AI to amplify the rest. In that framing, the real advantage goes to those who can “speak the new language” of AI and keep building as capabilities accelerate—while regulation and corporate risk determine how quickly the stack becomes widely available.
Cornell Notes
Llama 3’s practical breakthrough, in this discussion, is that it can be run locally while still producing high-quality outputs—especially when paired with an agentic orchestration system that delegates work to smaller agents and uses a stronger model for final verification. The guest argues that this “senior developer checking a junior” approach makes cheaper models viable for most steps, while catching errors at the end. Comparisons with GPT-4 are described as task-dependent: Llama 3 is said to be close for many coding and general tasks, but still trails on function calling and some structured modes. Creativity is treated as a differentiator too, with examples like YouTube title generation and image reasoning where Claude 3 and Llama 3 are claimed to behave more usefully than GPT-4. The conversation then pivots to fine-tuning and personalization—especially when proprietary style or data isn’t in base models—and ends with advice to prepare for superintelligence by amplifying human strengths with AI tools.
- How does the agentic framework described here make local Llama 3 feel more reliable than “just prompting” a model?
- Why does the guest claim Llama 3 can be “close to GPT-4” without matching it perfectly?
- What concrete examples are used to argue that Claude 3 and Llama 3 are stronger at “creative” tasks than GPT-4?
- How does one-shot fine-tuning/personalization enter the discussion, and what’s the claimed outcome?
- When should companies fine-tune instead of relying on prompts or retrieval?
- What preparation strategy is recommended for a post-AGI world?
Review Questions
- Which design choice in the agentic framework reduces the need for a single “perfect” model at every step?
- What differences between Llama 3 and GPT-4 does the guest attribute to structured capabilities like function calling?
- According to the discussion, what are the two main reasons to fine-tune, and how do they differ from retrieval or prompting?
Key Points
1. An agentic orchestration approach can make local models more dependable by delegating subtasks to weaker agents and using a stronger model for final verification.
2. Llama 3 is described as close to GPT-4 for many practical tasks (especially coding), but still behind on function calling and certain structured modes.
3. Creativity and “staying on-task” are treated as differentiators, with examples like YouTube title generation and image-based reasoning where Claude 3 and Llama 3 are claimed to outperform GPT-4.
4. Fine-tuning is most valuable when proprietary style/data isn’t in base models or when a system must mirror a specific brand/person tone.
5. Personalization is expected to move toward faster or even near-instant fine-tuning within a few years, enabling more household and device-level assistants.
6. Preparation for superintelligence should focus on amplifying human-unique skills (empathy, taste, communication, product thinking) rather than trying to out-automate AI at everything.