Preventing Threats to LLMs: Detecting Prompt Injections & Jailbreak Attacks
Based on WhyLabs's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Prompt injection manipulates LLM behavior through crafted inputs, and the worst outcomes happen when the LLM is connected to powerful tools/APIs.
Briefing
LLM security hinges less on “better refusals” and more on stopping malicious instructions from ever turning into actions. Prompt injection attacks manipulate an LLM through crafted user inputs so the model unknowingly follows an attacker’s intent—often by exploiting the application’s tool permissions. Jailbreak attacks are a specialized subset that try to bypass safety constraints, such as getting an LLM to provide instructions for wrongdoing (e.g., framing a request as an “actor” role-play to elicit “how to steal a car” style guidance).
The talk places these threats inside a broader taxonomy of attacks against LLM systems. Beyond prompt attacks, it highlights data poisoning (altering information in sources that models ingest, potentially planting later-referenced instructions), training data extraction (querying models to recover memorized sensitive training content like addresses or Social Security numbers), and model backdoors (embedding a secret trigger in training data so anyone who knows the trigger can unlock unintended behavior). In production systems, the most dangerous prompt injections are those that combine text manipulation with operational access—like an email assistant that can delete messages if the LLM is connected to deletion APIs.
To mitigate these risks, the session points to OWASP’s LLM safety guidance, emphasizing privilege control and least-privilege access. Role-based permissions and read-only configurations reduce the blast radius of a successful injection. Human approval can gate high-impact actions, while segregating untrusted content into trust boundaries limits how user text influences system instructions. Output monitoring adds another layer: even if a malicious prompt reaches the model, the response can be checked, flagged, or transformed before it reaches users. Monitoring both inputs and outputs is framed as a continuous security practice rather than a one-time fix.
Detection then becomes a text-security problem: turning raw prompts and responses into measurable signals that indicate likely attacks. WhyLabs’ approach uses LanKit to compute security-focused metrics such as jailbreak similarity, refusal similarity, toxicity, readability, and “frequent items,” plus checks for data leakage patterns like Social Security numbers or addresses. A dashboard view groups these metrics by hourly batches of uploaded data, making it possible to spot spikes—such as a near-max jailbreak similarity percentile—so teams can adjust guardrails or mitigation strategies.
The session also distinguishes two detection strategies. One is similarity-based detection against a curated harm dataset of known jailbreak and prompt-injection patterns, using embeddings to flag semantically similar attacks even when wording differs. The other is proactive prompt-injection detection, which runs an extra LLM call using a controlled “repeat this letter while ignoring the following text” style test; if the model’s output deviates from the expected control output, the system treats the original instruction as likely manipulative. Limitations are explicit: similarity methods can miss novel attacks (false negatives), proactive checks may not catch injections that don’t overwrite outputs (e.g., data leakage), and proactive methods add latency and cost due to extra LLM calls.
Finally, the talk situates these techniques within a changing threat landscape. As LLM applications expand—especially by granting tools and permissions beyond read-only access—the number and severity of attack vectors increases. The practical takeaway is to combine least privilege, trust boundaries, monitoring, and layered detection so prompt injections and jailbreak attempts are caught early, before they can translate into harmful actions.
Cornell Notes
Prompt injection attacks manipulate an LLM by crafting inputs that cause the model to follow an attacker’s intent, especially when the surrounding application grants tool permissions. Jailbreak attacks are a targeted form of prompt injection aimed at bypassing safety constraints (for example, using role-play framing to elicit disallowed instructions). Mitigation starts with least-privilege access, trust boundaries, human approval for sensitive actions, and monitoring input/output to catch suspicious behavior before users see it. Detection can be similarity-based—flagging prompts that resemble known attacks using embedding similarity—or proactive—running an extra controlled LLM test to see whether instructions appear manipulative. Both approaches have gaps: similarity checks can miss new attacks, proactive checks may not detect injections that don’t overwrite outputs, and proactive methods add cost and latency.
How do prompt injection and jailbreak attacks differ, and why does application permissions matter?
What categories of LLM attacks were highlighted beyond prompt injection?
Which OWASP-style mitigations were emphasized for prompt injection defenses?
How does LanKit’s similarity-based detection work conceptually?
What is proactive prompt-injection detection, and what does it look for?
Why can proactive detection still miss threats?
Review Questions
- Which mitigation steps reduce the impact of a successful prompt injection, and how do trust boundaries contribute?
- Explain how embedding-based similarity detection can flag unseen variations of known jailbreaks.
- Give one example of a threat type that proactive prompt-injection detection might fail to catch, and why.
Key Points
- 1
Prompt injection manipulates LLM behavior through crafted inputs, and the worst outcomes happen when the LLM is connected to powerful tools/APIs.
- 2
Jailbreak attacks target safety constraints specifically, often using role-play or framing tricks to bypass refusals.
- 3
Least-privilege access, role-based permissions, and human approval for sensitive actions are core defenses against prompt injection-driven harm.
- 4
Trust boundaries and segregating untrusted content limit how user text can influence system instructions.
- 5
Monitoring input and output—using security metrics like jailbreak similarity and toxicity—enables continuous detection and faster response.
- 6
Similarity-based detection flags prompts that are semantically close to known attacks using embedding space comparisons, but it can miss novel attacks.
- 7
Proactive detection adds a controlled extra LLM call to test whether instructions are being overridden, but it may not catch injections aimed at data leakage and it increases latency/cost.