Wall Street Just Bet $285 Billion on AI Agents. The Best One Barely Works.
Based on AI News & Strategy Daily | Nate B Jones's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Wall Street’s $285 billion bet on AI agents is colliding with a sobering reality: many “outcome agents” still fail the hard requirements for dependable, long-running work. The standout example is Anthropic’s Co-work, which can autonomously operate on a user’s computer and files—opening apps, navigating browsers, and executing tasks without coding. That capability helped trigger a sell-off in software-as-a-service companies, because it threatened to replace expensive, license-heavy workflows. Yet Co-work remains in research preview and shows limitations that would be unacceptable for production software, such as going to sleep when a laptop is shut and becoming non-interactive afterward.
To separate genuine agents from hype, the analysis pivots to verifiability—how reliably an outcome can be judged as correct. Code-based agents came first because code is a “verifiable domain”: it’s easy to test whether it runs. For non-coding agents, three criteria determine whether outcomes can improve over time. First, persistent memory must exist beyond a single session; second, the agent must produce inspectable, editable artifacts rather than opaque outputs; third, the system must allow context to compound so it gets better as it’s used.
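The three criteria can be read as a simple pass/fail scorecard. A minimal sketch, with hypothetical class and field names (nothing here is from an actual product API):

```python
from dataclasses import dataclass

@dataclass
class AgentRubric:
    """Hypothetical scorecard for the three outcome-agent criteria."""
    persistent_memory: bool      # memory survives beyond a single session
    editable_artifacts: bool     # outputs users can inspect and correct
    compounding_context: bool    # the agent improves the more it is used

    def passes(self) -> bool:
        # A dependable outcome agent needs all three at once;
        # strength on one axis doesn't offset a gap on another.
        return all((self.persistent_memory,
                    self.editable_artifacts,
                    self.compounding_context))

# Co-work as described here: strong artifacts, but memory is only
# partial (users re-supply context) and context doesn't compound.
cowork = AgentRubric(persistent_memory=False,
                     editable_artifacts=True,
                     compounding_context=False)
print(cowork.passes())  # False
```

The all-or-nothing `passes()` reflects the article's framing: impressive artifact generation alone doesn't make an agent a durable coworker.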
Co-work scores unevenly across those tests. It has partial memory through continued conversations, but users still need to supply essential context each session. It performs strongly on artifacts—spitting out tangible deliverables that users can work with—one reason it’s attracted both enterprise attention and investor anxiety. But it does not truly compound context over time, meaning it doesn’t yet deliver the “learn as you go” behavior that would make it feel like a durable workforce.
Other tools illustrate where the market is heading—and where it still stumbles. Lindy targets busy executives with natural-language outcome requests and an interface designed for easy start-up. Still, its Trustpilot score sits around 2.4/5, with complaints about credits burning, unclear value, and difficulty debugging or steering complex tasks. On the three-part rubric, Lindy lands as a qualified “maybe” on memory, weaker on editable artifacts, and uncertain on compounding context.
Sauna (formerly Wordware) is framed as a pivot toward a professional “AI workspace” built around memory as a foundational substrate, not a toggle. It preserves orchestration infrastructure from the earlier product and emphasizes compounding context, browser sessions with persistent login, and integrations aimed at long-running work. But it remains early and heavily demo-driven, leaving open whether it truly satisfies the artifact and context-compounding requirements in real usage.
Google Opal, built on Gemini 3 Flash, is a free workflow builder that can route tools, self-correct, remember across sessions, and ask clarifying questions. Builders appear to be using it in practice—remixing workflows and creating meeting-prep agents—yet concerns remain about whether its memory is durable enough for long-running outcomes and whether its artifact support is too limited.
Finally, Obvious positions itself as a full workspace with cross-artifact relationships—slides referencing spreadsheets and vice versa—plus workbooks, documents with live charts, and custom apps. It appears to chase the “editable surfaces” and “context compounding” goals, but its newness makes real-world validation uncertain.
The through-line is a blueprint for building agents that can earn trust: memory must be built into the architecture, outcomes need editable surfaces users can verify, and context must compound over time. The proposed three-layer design—knowledge store, agent recipes, and a scheduling loop—maps directly to those requirements. The takeaway is less about which agent wins today and more about which foundations will determine whether agents become reliable coworkers or remain impressive demos.
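The three-layer design maps one-to-one onto the rubric. A toy sketch of that mapping, with all names invented for illustration (a real knowledge store would be a database, and a real recipe would call a model):

```python
class KnowledgeStore:
    """Layer 1: persistent memory that outlives any single session."""
    def __init__(self):
        self._facts = {}

    def remember(self, key, value):
        self._facts[key] = value

    def recall(self, key, default=None):
        return self._facts.get(key, default)


def weekly_report_recipe(store):
    """Layer 2: an agent recipe — a repeatable task whose output is an
    editable artifact (here, plain text) rather than an opaque blob."""
    prior = store.recall("report_count", 0)
    store.remember("report_count", prior + 1)  # context compounds run over run
    return f"Weekly report #{prior + 1}\n(edit me before sending)"


def scheduling_loop(store, recipe, runs):
    """Layer 3: the loop that re-runs recipes so context can compound."""
    return [recipe(store) for _ in range(runs)]


store = KnowledgeStore()
artifacts = scheduling_loop(store, weekly_report_recipe, runs=3)
print(artifacts[-1].splitlines()[0])  # Weekly report #3
```

The point of the sketch is the wiring, not the parts: each pass through the loop reads from and writes back to the same store, which is exactly the "learn as you go" behavior the article says current agents lack.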
Cornell Notes
AI agents are drawing massive investment, but dependable “outcome work” still hinges on three hard, verifiable capabilities. The rubric is: (1) persistent memory beyond a single session, (2) inspectable and editable artifacts as the deliverable, and (3) an architecture that lets context compound so performance improves over time. Co-work shows why the hype caught on—strong artifact generation and partial memory—but it doesn’t truly compound context and can stop working when the laptop is shut. Lindy, Sauna, Google Opal, and Obvious illustrate different trade-offs across memory, artifact editability, and long-running compounding, with many still early or demo-heavy. The practical implication: agents need architectural foundations (memory substrate, editable output surfaces, and compounding loops), not just better prompts.
- Why does code-based agent work feel more “verifiable” than non-coding agents?
- How does Co-work score on the three criteria for real outcome agents?
- What does Lindy’s performance suggest about the trade-off between executive-friendly UX and agent control?
- Why is Sauna treated as a “pivot story,” and what makes its memory approach stand out?
- What are the key strengths and concerns about Google Opal’s approach?
- How does the proposed three-layer architecture map to the three-part agent rubric?
Review Questions
- Which of the three criteria (persistent memory, editable artifacts, context compounding) is most directly tied to long-running reliability, and why?
- Why does “partial memory” still require users to provide context each session in systems like Co-work?
- What design choice most affects whether an agent’s output can be verified and corrected: memory, artifact surfaces, or orchestration—and how does the rubric justify that?
Key Points
1. Wall Street’s agent hype was triggered by computer-operating agents like Co-work, but investor fears intensified because dependable long-running behavior is still missing.
2. Non-coding outcome agents need verifiability through persistent memory, editable artifacts, and context compounding—not just impressive autonomy.
3. Co-work demonstrates strong artifact generation and partial memory, yet it doesn’t truly compound context over time and can stop working when the laptop is shut.
4. Lindy’s executive-friendly UX appears to trade off artifact editability and steering/debugging, contributing to complaints like credit burn and unclear failure causes.
5. Sauna’s differentiator is memory as a foundational substrate, but real-world validation remains uncertain because it’s still early and demo-heavy.
6. Google Opal’s free access and builder activity are strengths, while concerns center on whether its memory and artifact support are durable enough for serious long-running work.
7. A practical agent blueprint uses a three-layer architecture: knowledge store (memory), agent recipes (editable outputs), and a scheduling loop (context compounding).