
Wall Street Just Bet $285 Billion on AI Agents. The Best One Barely Works.

6 min read

Based on AI News & Strategy Daily | Nate B Jones's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Wall Street’s agent hype was triggered by computer-operating agents like Co-work, but investor fears intensified because dependable long-running behavior is still missing.

Briefing

Wall Street’s $285 billion bet on AI agents is colliding with a sobering reality: many “outcome agents” still fail the hard requirements for dependable, long-running work. The standout example is Anthropic’s Co-work, which can autonomously operate on a user’s computer and files—opening apps, navigating browsers, and executing tasks without coding. That capability helped trigger a sell-off in software-as-a-service companies, because it threatened to replace expensive, license-heavy workflows. Yet Co-work remains in research preview and shows limitations that would be unacceptable for production software, such as going to sleep when a laptop is shut and becoming non-interactive afterward.

To separate genuine agents from hype, the analysis pivots to verifiability—how reliably an outcome can be judged as correct. Code-based agents came first because code is a “verifiable domain”: it’s easy to test whether it runs. For non-coding agents, three criteria determine whether outcomes can improve over time. First, persistent memory must exist beyond a single session; second, the agent must produce inspectable, editable artifacts rather than opaque outputs; third, the system must allow context to compound so it gets better as it’s used.
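As a rough illustration only, that rubric can be treated as a simple checklist. The sketch below is hypothetical; the class and field names are invented here, not drawn from any of the products discussed.

```python
from dataclasses import dataclass

@dataclass
class OutcomeAgentRubric:
    """Illustrative checklist for the three criteria described above."""
    persistent_memory: bool    # memory that survives beyond a single session
    editable_artifacts: bool   # outputs users can inspect and correct
    compounding_context: bool  # the system improves as it is used

    def passes(self) -> bool:
        # A dependable outcome agent would need all three to hold.
        return all((self.persistent_memory,
                    self.editable_artifacts,
                    self.compounding_context))

# Example scoring that mirrors the Co-work assessment below: partial memory
# (treated here as not yet passing), strong artifacts, no real compounding.
print(OutcomeAgentRubric(False, True, False).passes())  # -> False
```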

Co-work scores unevenly across those tests. It has partial memory through continued conversations, but users still need to supply essential context each session. It performs strongly on artifacts—spitting out tangible deliverables that users can work with—one reason it’s attracted both enterprise attention and investor anxiety. But it does not truly compound context over time, meaning it doesn’t yet deliver the “learn as you go” behavior that would make it feel like a durable workforce.

Other tools illustrate where the market is heading—and where it still stumbles. Lindy targets busy executives with natural-language outcome requests and an interface designed for easy start-up. Still, its Trustpilot score sits around 2.4/5, with complaints about credits burning, unclear value, and difficulty debugging or steering complex tasks. On the three-part rubric, Lindy lands as a qualified “maybe” on memory, weaker on editable artifacts, and uncertain on compounding context.

Sauna (formerly Wordware) is framed as a pivot toward a professional “AI workspace” built around memory as a foundational substrate, not a toggle. It preserves orchestration infrastructure from the earlier product and emphasizes compounding context, browser sessions with persistent login, and integrations aimed at long-running work. But it remains early and heavily demo-driven, leaving open whether it truly satisfies the artifact and context-compounding requirements in real usage.

Google Opal, built on Gemini 3 Flash, is a free workflow builder that can route tools, self-correct, remember across sessions, and ask clarifying questions. Builders appear to be using it in practice—remixing workflows and creating meeting-prep agents—yet concerns remain about whether its memory is durable enough for long-running outcomes and whether artifact support is limited.

Finally, Obvious positions itself as a full workspace with cross-artifact relationships—slides referencing spreadsheets and vice versa—plus workbooks, documents with live charts, and custom apps. It appears to chase the “editable surfaces” and “context compounding” goals, but its newness makes real-world validation uncertain.

The through-line is a blueprint for building agents that can earn trust: memory must be built into the architecture, outcomes need editable surfaces users can verify, and context must compound over time. The proposed three-layer design—knowledge store, agent recipes, and a scheduling loop—maps directly to those requirements. The takeaway is less about which agent wins today and more about which foundations will determine whether agents become reliable coworkers or remain impressive demos.

Cornell Notes

AI agents are drawing massive investment, but dependable “outcome work” still hinges on three hard, verifiable capabilities. The rubric is: (1) persistent memory beyond a single session, (2) inspectable and editable artifacts as the deliverable, and (3) an architecture that lets context compound so performance improves over time. Co-work shows why the hype caught on—strong artifact generation and partial memory—but it doesn’t truly compound context and can stop working when the laptop is shut. Lindy, Sauna, Google Opal, and Obvious illustrate different trade-offs across memory, artifact editability, and long-running compounding, with many still early or demo-heavy. The practical implication: agents need architectural foundations (memory substrate, editable output surfaces, and compounding loops), not just better prompts.

Why does code-based agent work feel more “verifiable” than non-coding agents?

Code is a verifiable domain because correctness can be tested directly: does it run, and does it produce the expected behavior? That makes it easier to judge whether an agent’s output is right or wrong. Non-coding outcomes lack the same straightforward pass/fail test, so the analysis shifts to alternative verification signals—persistent memory, editable artifacts, and context compounding—so users can inspect, correct, and improve results over time.
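To make the "verifiable domain" point concrete, here is a minimal sketch; the function and test cases are invented for illustration. Generated code can be judged with a direct pass/fail check, which most non-coding deliverables lack.

```python
def candidate_slugify(title: str) -> str:
    """Stand-in for a small function an agent might have written."""
    return "-".join(title.lower().split())

def verify(fn) -> bool:
    """Judge the agent's output with a direct pass/fail test against known cases."""
    cases = {
        "Hello World": "hello-world",
        "AI Agents 101": "ai-agents-101",
    }
    return all(fn(title) == slug for title, slug in cases.items())

print(verify(candidate_slugify))  # True -> the outcome is objectively checkable
```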

How does Co-work score on the three criteria for real outcome agents?

Co-work has partial persistent memory through continued conversations, but it’s not reliable enough to remove the need for users to provide key context each session. It produces strong, tangible artifacts that users can inspect and build on, which helps explain its popularity. The weak point is context compounding: it doesn’t meaningfully accumulate learning across sessions, and even basic reliability issues appear (for example, shutting the laptop puts it to sleep and it stops doing work).

What does Lindy’s performance suggest about the trade-off between executive-friendly UX and agent control?

Lindy is designed for easy start-up and natural-language outcome requests, but its deliverables can be more opaque and harder to debug or edit than Co-work’s artifact-first approach. User feedback cited includes credits burning without clear value, tasks running toward unproductive results, and limited visibility into why failures happen. On the rubric, it’s a qualified “yes” for memory, a “no” for editable artifacts, and uncertain for context compounding—suggesting that simplifying the interface can reduce user control and verifiability.

Why is Sauna treated as a “pivot story,” and what makes its memory approach stand out?

Wordware originally built an IDE for agent development, but the team pivoted after concluding people don’t wake up wanting to build automations; they wake up with too much to do. Sauna keeps the underlying orchestration infrastructure and frames memory as foundational—an architectural substrate meant to enable compounding context and long-running work. The caveat is that it’s still early and demo-heavy, so it’s unclear whether it delivers the artifact editability and real-world compounding the rubric demands.

What are the key strengths and concerns about Google Opal’s approach?

Google Opal is notable for being free and for showing builder activity beyond demos, including meeting-prep agents and remixable workflows. It also claims capabilities like remembering across sessions, routing tools, self-correction, and clarifying questions. The concerns are that its memory may be too lightweight (compared to durable agent memory) and that artifact support may be limited—meaning it may fit lighter workflow actions more than serious long-running outcomes.

How does the proposed three-layer architecture map to the three-agent rubric?

The knowledge store is where persistent memory lives (e.g., a database and chunked knowledge with updatable storage). Agent recipes are pre-wired workflows that help produce inspectable, editable artifacts on known surfaces. The scheduling loop is the mechanism for long-running execution and iterative improvement, enabling context to compound over time as tasks are revisited and refined.
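Purely as a sketch of how those three layers might fit together, a toy version could look like the following. Every name here is hypothetical and invented for illustration, not an API from any product mentioned.

```python
class KnowledgeStore:
    """Layer 1: persistent memory -- chunked, updatable knowledge."""
    def __init__(self):
        self.chunks = {}

    def remember(self, key, value):
        self.chunks[key] = value  # updatable storage

    def recall(self, key, default=None):
        return self.chunks.get(key, default)

def meeting_prep_recipe(store):
    """Layer 2: an agent recipe -- a pre-wired workflow that emits an
    inspectable, editable artifact (here, a plain-text briefing)."""
    attendees = store.recall("attendees", "unknown attendees")
    previous = store.recall("last_artifact", "(no earlier briefing)")
    return f"Meeting briefing\nAttendees: {attendees}\nPrevious run: {previous!r}"

def scheduling_loop(store, recipe, runs=3):
    """Layer 3: the scheduling loop -- rerun the task so context compounds."""
    for _ in range(runs):
        artifact = recipe(store)
        # Feed the result back into memory so the next run starts further ahead.
        store.remember("last_artifact", artifact)

store = KnowledgeStore()
store.remember("attendees", "Ada, Grace")
scheduling_loop(store, meeting_prep_recipe)
print(store.recall("last_artifact"))
```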

Review Questions

  1. Which of the three criteria (persistent memory, editable artifacts, context compounding) is most directly tied to long-running reliability, and why?
  2. Why does “partial memory” still require users to provide context each session in systems like Co-work?
  3. What design choice most affects whether an agent’s output can be verified and corrected: memory, artifact surfaces, or orchestration—and how does the rubric justify that?

Key Points

  1. Wall Street’s agent hype was triggered by computer-operating agents like Co-work, but investor fears intensified because dependable long-running behavior is still missing.
  2. Non-coding outcome agents need verifiability through persistent memory, editable artifacts, and context compounding—not just impressive autonomy.
  3. Co-work demonstrates strong artifact generation and partial memory, yet it doesn’t truly compound context over time and can stop working when the laptop is shut.
  4. Lindy’s executive-friendly UX appears to trade off artifact editability and steering/debugging, contributing to complaints like credit burn and unclear failure causes.
  5. Sauna’s differentiator is memory as a foundational substrate, but real-world validation remains uncertain because it’s still early and demo-heavy.
  6. Google Opal’s free access and builder activity are strengths, while concerns center on whether its memory and artifact support are durable enough for serious long-running work.
  7. A practical agent blueprint uses a three-layer architecture: knowledge store (memory), agent recipes (editable outputs), and a scheduling loop (context compounding).

Highlights

Co-work’s autonomy impressed investors, but basic reliability gaps—like going to sleep when the laptop is shut—undercut the promise of dependable software.
A three-question rubric (persistent memory, editable artifacts, context compounding) is used to separate real outcome agents from hype.
Lindy’s mixed user sentiment (around a 2.4/5 Trustpilot score) is tied to weak artifact editability and limited control when tasks go off track.
Sauna reframes memory as the substrate for compounding context, while Google Opal’s free workflow builder shows how builders can remix real workflows.
The proposed three-layer agent architecture—knowledge store, agent recipes, scheduling loop—maps directly to the verifiability requirements.
