Hacking AI is TOO EASY (this should be illegal)

TL;DR

AI security testing must treat AI-enabled apps as full-stack systems, not just chatbots that can be jailbroken for text outputs.

Briefing Cornell Notes

Briefing

AI-enabled apps are becoming an easy target because attackers can chain multiple weaknesses—inputs, surrounding systems, and the model itself—into practical data theft and account compromise. The core message is that “hacking AI” isn’t limited to jailbreaks that make a chatbot say forbidden words; it also includes stealing sensitive information, manipulating business logic, and pivoting from the model to the broader software ecosystem. That matters because many organizations are rushing to deploy AI features without matching the security rigor used for traditional web and API systems.

A central framework introduced around AI security testing splits the work into a holistic “AI pen test” approach rather than narrow “AI red teaming” that focuses only on getting harmful outputs. The methodology uses six repeatable segments: identify system inputs, attack the ecosystem around the AI application, and then attack the model (including attempts to induce harmful or biased behavior). The playbook continues through prompt engineering attacks, data attacks, application attacks, and finally pivoting to other systems—turning model interaction into a pathway for real-world impact.

The most emphasized technique is prompt injection, described as the primary “vehicle” for many of these attacks. Prompt injection can work with little or no coding skill at first: attackers rely on clever natural-language instructions to trick an LLM into following malicious directives embedded in user content or indirect signals. A key warning is that defenses that look strong in early tests often fail against more advanced variants, especially once guardrails and filters are in place.

To make prompt injection learnable, Jason Haddock’s taxonomy organizes attacks into intents (what the attacker wants, such as leaking a system prompt or extracting business information), techniques (how to achieve the intent, including evasion strategies), and utilities (supporting mechanisms). The taxonomy also frames attacks as a mix of intense techniques, evasions, and utilities—because successful attacks often depend on bypassing specific controls rather than one universal trick.

Concrete examples show how attackers evade modern classifiers and guardrails. “Emoji smuggling” hides instructions inside emoji metadata so the model interprets hidden directives even when classifiers miss them. “Link smuggling” uses encoded data embedded in image URLs: the model is induced to attempt a fetch, and server logs reveal the exfiltrated content. For image generation, a “syntactic anti classifier” approach reframes requests through synonyms, metaphors, and indirect phrasing to bypass restrictions.

The transcript also stresses that the AI attack surface expands as AI becomes more agentic and tool-using. Over-scoped API permissions and weak input validation can let an agent write back into systems like Salesforce, enabling malicious actions through prompt injection. Even newer standards such as MCP (Model Context Protocol) can widen risk: MCP servers may lack role-based access control, may allow file access for parsing or RAG memory, and can be backdoored by altering prompts or tool behavior. The result is a “Wild West” dynamic where powerful AI components and integrations outpace security controls.

Defense in depth is presented as the practical answer: secure the web layer with input/output validation, add an “AI firewall” (classifier or guardrail) around the model for both inbound and outbound prompts, and enforce least-privilege on API keys and tool permissions. As systems coordinate multiple agents, the challenge grows harder because each component must be protected without destroying performance—making trade-offs unavoidable. The overall takeaway is that AI security is now a full-stack problem, and the same urgency that once drove web hacking defenses is arriving for AI.

Cornell Notes

AI hacking is framed as a full-stack security problem, not just jailbreaks that force harmful chatbot outputs. A six-part “AI pen test” methodology targets inputs, the surrounding ecosystem, the model itself, prompt engineering, data, application behavior, and then pivots to other systems. Prompt injection is highlighted as the main attack path because it can often be executed with clever natural-language prompting and can bypass guardrails using techniques like emoji smuggling and link smuggling. Real-world risk increases when AI agents call tools and APIs with weak validation or overly broad permissions, including in integrations built on standards like MCP. Defense centers on layered controls: web-layer validation, an AI firewall for inbound/outbound prompts, and least-privilege access for APIs and tools.

What does “hacking AI” include beyond jailbreaks?

It includes attacking the entire AI-enabled application stack: how data enters (system inputs), how the model interacts with tools and services (ecosystem), and how the model behaves (AI red teaming). The framework also targets prompt engineering, data, and application logic, then pivots to other systems—so the end goal can be sensitive data theft or unauthorized actions, not just bad language outputs.

Why is prompt injection treated as the primary weapon?

Prompt injection turns the model’s own instruction-following behavior against it. Attackers can embed malicious directives in user content or indirect signals so the LLM follows attacker-controlled instructions. The transcript emphasizes that early-level attacks may require little more than natural-language cleverness, while later levels become harder due to real-world guardrails.

How does the taxonomy organize prompt injection attacks?

It breaks attacks into intents (desired outcomes like leaking a system prompt or extracting business info), techniques (methods to achieve those intents, including evasion strategies), and utilities (supporting mechanisms). The taxonomy is meant to classify what works so defenders and testers can reason about attack patterns rather than treating each jailbreak as a one-off trick.

What are examples of prompt-injection evasion techniques mentioned?

Emoji smuggling hides instructions in emoji Unicode metadata so classifiers and guardrails may miss the directive. Link smuggling encodes sensitive data in image URLs (often base64) and relies on the model attempting a fetch; the attacker then reads server logs to recover the data. For image generation, a “syntactic anti classifier” approach reframes requests with synonyms, metaphors, and indirect phrasing to bypass restrictions.

How do tool-using agents and MCP change the risk picture?

Agents can exploit weak input validation and over-scoped API permissions to write malicious content back into systems (e.g., using prompts to induce actions like writing notes into Salesforce). MCP can widen the attack surface because MCP servers may fetch files for parsing or RAG, store artifacts, and lack strong role-based access control; attackers may also backdoor MCP servers by injecting invisible code or altering prompts.

What defense-in-depth strategy is recommended?

Three layers: (1) web-layer fundamentals—validate inputs and outputs so harmful data doesn’t reach the model and malware-like outputs don’t reach users; (2) an AI firewall around the model—use a classifier or guardrail to check prompts entering and leaving the model, including prompt-injection patterns; and (3) data/tools layer least privilege—scope API keys to only the permissions the agent needs (read-only vs write). The transcript warns that agentic multi-AI systems make this harder due to added complexity and latency trade-offs.

Review Questions

How does the six-part AI pen test methodology differ from narrower AI red teaming focused only on harmful outputs?
Which prompt-injection examples rely on metadata, encoded URLs, or indirect phrasing, and what control do they attempt to bypass?
Why can least-privilege API scoping be as important as prompt filtering when AI agents can call tools?

Key Points

1
AI security testing must treat AI-enabled apps as full-stack systems, not just chatbots that can be jailbroken for text outputs.
2
Prompt injection is positioned as the main attack path because it can redirect LLM behavior using attacker-controlled instructions embedded in prompts or signals.
3
A structured taxonomy of prompt injection helps classify attacks by intent, technique, and utility, making defenses and testing more systematic.
4
Evasion tactics like emoji smuggling and link smuggling show how hidden directives or encoded data can bypass common guardrails and classifiers.
5
Over-scoped API permissions and missing input validation can let prompt injection escalate into unauthorized reads/writes in systems such as Salesforce.
6
Standards and frameworks like MCP can expand the attack surface if MCP servers lack role-based access control or allow unsafe file/tool access.
7
Defense should be layered: web-layer validation, an AI firewall for inbound/outbound prompts, and least-privilege access for tools and APIs—especially when agents coordinate multiple actions.

Highlights

Prompt injection is described as the “vehicle” behind many AI attacks, often achievable with clever natural-language prompting rather than advanced coding.

Emoji smuggling hides instructions inside emoji metadata, while link smuggling exfiltrates data via encoded URLs and server logs.

MCP can introduce new vulnerabilities because tool/resource access and prompt layers can be manipulated without strong role-based controls.

Defense-in-depth is framed as non-negotiable: secure the web layer, enforce an AI firewall around the model, and lock down API permissions with least privilege.

Topics

AI Pen Testing
Prompt Injection
Evasion Techniques
Agentic Workflows
Defense in Depth

Mentioned

OpenAI
Wiz
Salesforce
Lang Chain
Lang Graph
Crew AI
Jason Haddock
Sam Altman
Daniel Misler
AI
MCP
RAG
SQL
API
XSS
CVE