
I Stopped Drowning in AI Slop—Prompts That Saved Me 100+ Hours (Demo Inside)

5 min read

Based on AI News & Strategy Daily | Nate B Jones's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing.

TL;DR

Treat AI slop as a quality-gating problem: scale generation, but enforce review criteria that humans can’t manually apply to every draft.

Briefing

AI slop isn’t a detection problem—it’s a quality-gating problem. With AI churning out more PRDs, blog posts, emails, and drafts than any team can realistically review, organizations need a reliable way to decide what deserves scarce human attention. The core fix offered here is to make large language model (LLM) attention the dominant review channel: let AI do the bulk of the reading, while humans spend their limited time on the small fraction of outputs that truly matter.

The argument starts from a practical shift: AI has moved many workflows from “can I write one good thing?” to “can I generate fifty things fast?” That creates a new bottleneck—quality assessment. People either skim with their eyes or skip review entirely because there’s no consistent quality gate. But simply asking an AI “is this good?” leads to inconsistent results because quality judgments vary by model (e.g., ChatGPT vs. Claude) and by what the model already “believes” is good based on training and prior context. The proposed solution is therefore not generic prompting; it’s robust, use-case-specific prompting designed to function as a filter.
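In code, such a filter can be as small as a grading call wrapped in a pass/fail check. The sketch below is hypothetical: `call_llm` stands in for any model provider's API, and the prompt wording and cutoff are invented for illustration.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder: swap in your model provider's API call."""
    raise NotImplementedError

def passes_gate(draft: str, rubric: str, cutoff: int = 8) -> bool:
    """Grade a draft against an explicit, use-case-specific rubric."""
    reply = call_llm(
        f"Grade this draft 1-10 against the rubric below. "
        f'Reply only with JSON: {{"score": <number>}}\n\n'
        f"RUBRIC:\n{rubric}\n\nDRAFT:\n{draft}"
    )
    return json.loads(reply)["score"] >= cutoff
```

The rubric string is the part that changes per use case; the gate itself stays the same.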

A key mindset is borrowed from Andrej Karpathy’s suggestion that LLMs should handle roughly 98% of attention, leaving humans with the 2% that matters. In this framework, the goal becomes: surface the highest-quality pieces—such as the two best blog posts out of a hundred, or the PRD that is truly promotable—while filtering out the rest. That requires prompts tailored to each artifact type and job family, because the criteria for a strong PRD differ from the criteria for a customer announcement email or a sales follow-up.
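A minimal triage sketch of that split might look like the following, where the LLM reads every draft and humans see only the top k. Again, `call_llm` and the rubric wording are invented placeholders rather than anything from the video.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder: swap in your model provider's API call."""
    raise NotImplementedError

def score_draft(draft: str) -> float:
    """Ask the model for a numeric grade against a fixed, explicit rubric."""
    reply = call_llm(
        "Score this blog post 0-100 for clarity, original insight, and "
        'actionable takeaways. Reply only with JSON: {"score": <number>}\n\n'
        + draft
    )
    return json.loads(reply)["score"]

def triage(drafts: list[str], k: int = 2) -> list[str]:
    """The LLM handles the reading; humans review only the top k."""
    return sorted(drafts, key=score_draft, reverse=True)[:k]
```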

To make the approach concrete, the transcript walks through a sample prompt for evaluating a product requirements document (PRD). The prompt sets a clear “role” and stakes: determine whether an engineering team can build the PRD without needing three clarifying meetings. It then defines evaluation axes such as completeness, acceptance criteria, edge cases, and explicit non-goals. A scoring rubric maps quality to measurable signals—for example, a higher score corresponds to testable, well-documented edge cases and non-goals, while a low score corresponds to untestable vagueness.
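Based on that description, a reconstruction of the prompt's overall shape might look like this. It is a paraphrase of the structure the transcript describes, not the exact prompt.

```python
# Illustrative reconstruction of a PRD-grading prompt; the wording is
# paraphrased from the transcript's description, not quoted from it.
PRD_EVAL_PROMPT = """\
ROLE: You are a senior engineer reviewing a product requirements document.
STAKES: Decide whether an engineering team could build this PRD without
needing three clarifying meetings.

Score each axis from 1 to 5:
- Completeness: are all user flows and requirements covered?
- Acceptance criteria: are they present and testable?
- Edge cases: are failure modes documented and measurable?
- Non-goals: is out-of-scope work explicitly stated?
- Dependencies: are external systems and teams mapped?

RUBRIC:
5 = every requirement is testable; edge cases and non-goals are explicit.
3 = mostly complete, but some criteria are not measurable.
1 = vague, untestable statements; scope is unclear.

PRD TO EVALUATE:
{prd_text}
"""
```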

The prompt also checks whether the document is testable and readable (with examples like readability targets for sales emails), whether scope is clear, and whether key elements are present. It includes dependency mapping and an “elements check” to ensure nothing critical is missing. Rather than relying on a vague verdict, the output is structured (using JSON) to produce a grading score plus plain-English feedback, including actionable revision guidance. The emphasis is on feedback that a writer can immediately act on—such as being explicit about the Stripe API version—so the system supports an ongoing improvement loop.
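A structured reply in that spirit could look like the following; the field names and the second feedback item are invented to illustrate the shape.

```python
import json

# Invented example of a structured grading reply (field names illustrative).
reply = """{
  "overall_score": 7,
  "axes": {"completeness": 8, "acceptance_criteria": 7,
           "edge_cases": 6, "non_goals": 7},
  "verdict": "revise",
  "feedback": [
    "State the Stripe API version explicitly in the integration section.",
    "Add acceptance criteria for the failed-payment retry flow."
  ]
}"""

graded = json.loads(reply)
print(graded["verdict"], "->", graded["feedback"][0])
```

Because the reply is machine-readable, the score can drive routing while the feedback list goes straight back to the writer.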

Finally, the transcript argues that “AI slop” is partly overhyped: sloppy work existed long before AI. The real opportunity is raising accountability and quality standards using AI as a scalable quality filter. There’s no single magic prompt, but a prompt pack—built for marketing, customer success, sales, product, and engineering—can help organizations stop drowning in low-quality output and start placing human attention where it has the most impact.

Cornell Notes

AI slop is framed as a quality-control failure, not an AI-detection failure. Because AI can generate far more drafts than humans can review, the solution is to use LLMs as an “attention filter”: let AI read and grade most outputs, then route only the top candidates to the 2% of human attention that matters. The transcript demonstrates a PRD-specific grading prompt that scores completeness, testability, scope clarity, and other criteria, using a rubric tied to concrete signals (e.g., measurable edge cases and explicit non-goals). It also uses structured output (JSON) to produce actionable, plain-English feedback and accept/reject thresholds. The approach is meant to be adapted per artifact type and job family, since quality criteria differ across PRDs, emails, blog posts, and more.

Why does “AI slop” become a management problem once generation scales up?

When AI output volume jumps from “one good draft” to “dozens of drafts,” the bottleneck shifts from writing to evaluation. Teams can’t realistically read and assess every PRD, blog post, email, or outreach draft with human eyeballs alone. Without a consistent quality gate, people either skim, skip, or rely on inconsistent judgments like “does this look good?”—which varies by model and context. The transcript treats this as a workflow design issue: build a filter so only the best work reaches humans.

What does it mean to use LLMs as an “attention filter” rather than a writing helper?

The filter model assumes LLMs should handle most reading and triage, while humans handle the small fraction that truly matters. The transcript aligns with Andrej Karpathy’s idea that LLMs can take on about 98% of attention, leaving humans about 2% for the highest-stakes decisions. Practically, that means prompts should grade and rank outputs (e.g., which PRD is promotable, which two blog posts are worth posting) instead of merely generating text.

Why is a generic prompt like “is this good?” unreliable?

Quality judgments depend on the specific LLM and on what it already considers “good” from training data and prior conversation context. That makes outcomes unpredictable if the prompt lacks artifact-specific criteria. The transcript argues for robust, use-case-specific prompting—PRD prompts for PRDs, blog prompts for blog posts, and so on—so the evaluation rubric has the right context.

How does the sample PRD prompt define stakes and evaluation criteria?

It assigns a clear role: evaluate a PRD and decide whether an engineering team can build it without needing three clarifying meetings. It then evaluates completeness (and related elements like acceptance criteria, edge cases, and non-goals), and it uses a scoring rubric where higher scores correspond to testable, measurable details and lower scores correspond to vague, untestable statements. The prompt also includes checks for testability and scope clarity, plus dependency mapping and an elements check.

What makes the feedback actionable instead of just a score?

The prompt uses structured output (JSON) to return not only a grading score but also plain-English guidance: an accept/reject decision, thresholds, and specific revision feedback. The transcript emphasizes that the feedback should point to concrete fixes—like specifying the Stripe API version—so writers can immediately improve the document and iterate, enabling an anti-slop feedback loop.
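A sketch of that loop, assuming the JSON reply shape shown earlier and an invented acceptance threshold, might look like this:

```python
import json

ACCEPT_THRESHOLD = 8  # invented cutoff; tune per team, not from the video

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder: swap in your model provider's API call."""
    raise NotImplementedError

def revise_until_accepted(draft: str, max_rounds: int = 3) -> tuple[str, dict]:
    """Grade, apply the plain-English feedback, regrade: an anti-slop loop."""
    grade: dict = {}
    for _ in range(max_rounds):
        grade = json.loads(
            call_llm("Grade this PRD; reply only with JSON.\n\n" + draft)
        )
        if grade["overall_score"] >= ACCEPT_THRESHOLD:
            break  # good enough to route to scarce human attention
        # Feed the actionable feedback straight back into a revision pass.
        draft = call_llm(
            "Revise this draft to address the feedback:\n"
            + "\n".join(grade["feedback"])
            + "\n\nDRAFT:\n" + draft
        )
    return draft, grade  # either accepted or flagged for rework
```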

How should the filter differ across job families and artifact types?

The transcript stresses that the quality filter must be per use case and per job family. A PRD needs criteria like acceptance criteria and edge cases; a customer announcement email needs clarity about the product; a sales follow-up email might be evaluated for readability (e.g., an eighth-grade level target). The same overall prompt structure can be reused, but the axes and rubric must change to match the artifact’s purpose.
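One way to implement that reuse, sketched here with invented rubric entries, is a small registry that swaps the axes per artifact type while the gate structure stays constant:

```python
# Invented rubric registry: structure is shared, axes are artifact-specific.
RUBRICS = {
    "prd": [
        "acceptance criteria are present and testable",
        "edge cases are documented",
        "non-goals are explicit",
        "dependencies are mapped",
    ],
    "customer_announcement_email": [
        "explains clearly what the product does",
        "has one unambiguous call to action",
    ],
    "sales_followup_email": [
        "reads at roughly an eighth-grade level",
        "references the prior conversation",
    ],
}

def build_gate_prompt(artifact_type: str, text: str) -> str:
    axes = "\n".join(f"- {c}" for c in RUBRICS[artifact_type])
    return (
        f"Grade this {artifact_type} from 1 to 10 on each axis:\n"
        f"{axes}\n\nTEXT:\n{text}"
    )
```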

Review Questions

  1. What workflow bottleneck emerges when AI generation increases output volume, and how does the proposed filter address it?
  2. In the PRD grading example, which rubric dimensions are used to distinguish testable, promotable work from vague work?
  3. Why does the transcript argue that “AI slop” can’t be solved by a single magic prompt or by AI detectors alone?

Key Points

  1. Treat AI slop as a quality-gating problem: scale generation, but enforce review criteria that humans can’t manually apply to every draft.
  2. Use LLMs as an attention filter—aim for LLMs to handle most reading and triage while humans review only the top candidates.
  3. Avoid generic judgments like “is this good?”; quality scoring should be tied to artifact-specific rubrics and measurable signals.
  4. Build prompts per job family and per artifact type (PRDs vs. blog posts vs. emails) because the definition of “good” changes with context.
  5. Use structured outputs (e.g., JSON) to produce scores plus accept/reject decisions and plain-English, actionable revision feedback.
  6. Design the rubric around concrete outcomes (e.g., whether engineering can build without multiple clarifying meetings) rather than vague impressions.
  7. Focus on raising accountability and quality standards regardless of who wrote the work, since sloppy output predates AI.

Highlights

  • The proposed fix replaces “AI detectors” with a quality gate: LLMs grade and filter outputs so humans only review the small fraction that matters.
  • A sample PRD prompt scores completeness, acceptance criteria, edge cases, non-goals, testability, and scope clarity—anchored to a concrete engineering-stakes target (no three clarifying meetings).
  • Structured JSON output is used to turn grading into immediate, actionable feedback—like requiring explicit Stripe API version details.
  • The approach is explicitly per artifact type: PRD criteria can’t be the same as criteria for blog posts or sales emails.

Topics

  • AI Slop
  • Quality Gates
  • LLM Prompting
  • PRD Evaluation
  • Attention Filtering
