Get AI summaries of any video or article — Sign up free
Preventing Threats to LLMs: Detecting Prompt Injections & Jailbreak Attacks thumbnail

Preventing Threats to LLMs: Detecting Prompt Injections & Jailbreak Attacks

WhyLabs·
5 min read

Based on WhyLabs's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Prompt injection manipulates LLM behavior through crafted inputs, and the worst outcomes happen when the LLM is connected to powerful tools/APIs.

Briefing

LLM security hinges less on “better refusals” and more on stopping malicious instructions from ever turning into actions. Prompt injection attacks manipulate an LLM through crafted user inputs so the model unknowingly follows an attacker’s intent—often by exploiting the application’s tool permissions. Jailbreak attacks are a specialized subset that try to bypass safety constraints, such as getting an LLM to provide instructions for wrongdoing (e.g., framing a request as an “actor” role-play to elicit “how to steal a car” style guidance).

The talk places these threats inside a broader taxonomy of attacks against LLM systems. Beyond prompt attacks, it highlights data poisoning (altering information in sources that models ingest, potentially planting later-referenced instructions), training data extraction (querying models to recover memorized sensitive training content like addresses or Social Security numbers), and model backdoors (embedding a secret trigger in training data so anyone who knows the trigger can unlock unintended behavior). In production systems, the most dangerous prompt injections are those that combine text manipulation with operational access—like an email assistant that can delete messages if the LLM is connected to deletion APIs.

To mitigate these risks, the session points to OWASP’s LLM safety guidance, emphasizing privilege control and least-privilege access. Role-based permissions and read-only configurations reduce the blast radius of a successful injection. Human approval can gate high-impact actions, while segregating untrusted content into trust boundaries limits how user text influences system instructions. Output monitoring adds another layer: even if a malicious prompt reaches the model, the response can be checked, flagged, or transformed before it reaches users. Monitoring both inputs and outputs is framed as a continuous security practice rather than a one-time fix.

Detection then becomes a text-security problem: turning raw prompts and responses into measurable signals that indicate likely attacks. WhyLabs’ approach uses LanKit to compute security-focused metrics such as jailbreak similarity, refusal similarity, toxicity, readability, and “frequent items,” plus checks for data leakage patterns like Social Security numbers or addresses. A dashboard view groups these metrics by hourly batches of uploaded data, making it possible to spot spikes—such as a near-max jailbreak similarity percentile—so teams can adjust guardrails or mitigation strategies.

The session also distinguishes two detection strategies. One is similarity-based detection against a curated harm dataset of known jailbreak and prompt-injection patterns, using embeddings to flag semantically similar attacks even when wording differs. The other is proactive prompt-injection detection, which runs an extra LLM call using a controlled “repeat this letter while ignoring the following text” style test; if the model’s output deviates from the expected control output, the system treats the original instruction as likely manipulative. Limitations are explicit: similarity methods can miss novel attacks (false negatives), proactive checks may not catch injections that don’t overwrite outputs (e.g., data leakage), and proactive methods add latency and cost due to extra LLM calls.

Finally, the talk situates these techniques within a changing threat landscape. As LLM applications expand—especially by granting tools and permissions beyond read-only access—the number and severity of attack vectors increases. The practical takeaway is to combine least privilege, trust boundaries, monitoring, and layered detection so prompt injections and jailbreak attempts are caught early, before they can translate into harmful actions.

Cornell Notes

Prompt injection attacks manipulate an LLM by crafting inputs that cause the model to follow an attacker’s intent, especially when the surrounding application grants tool permissions. Jailbreak attacks are a targeted form of prompt injection aimed at bypassing safety constraints (for example, using role-play framing to elicit disallowed instructions). Mitigation starts with least-privilege access, trust boundaries, human approval for sensitive actions, and monitoring input/output to catch suspicious behavior before users see it. Detection can be similarity-based—flagging prompts that resemble known attacks using embedding similarity—or proactive—running an extra controlled LLM test to see whether instructions appear manipulative. Both approaches have gaps: similarity checks can miss new attacks, proactive checks may not detect injections that don’t overwrite outputs, and proactive methods add cost and latency.

How do prompt injection and jailbreak attacks differ, and why does application permissions matter?

Prompt injection is any crafted input that manipulates an LLM into executing an attacker’s intentions, directly or indirectly. Jailbreak attacks are a specific subset that tries to elicit responses violating safety constraints. The permissions around the LLM are crucial because the most harmful outcomes occur when the model has access to tools/APIs that can take actions—like an email assistant deleting messages—so a successful injection can translate into real-world effects rather than just text output.

What categories of LLM attacks were highlighted beyond prompt injection?

The session grouped attacks into several types: data poisoning (altering information in sources the model later ingests so it causes downstream harm or plants instructions), training data extraction (querying a model to recover memorized sensitive training content such as addresses or Social Security numbers), and model backdoors (embedding secret triggers in training data so the attacker can unlock unintended behavior when the trigger is used). These sit alongside prompt attacks as major threat vectors.

Which OWASP-style mitigations were emphasized for prompt injection defenses?

Key mitigations included privilege control (least-privilege and role-based permissions so the LLM can’t perform destructive actions), human approval for high-impact actions, segregating untrusted content from user prompts via trust boundaries, and monitoring input/output data to detect weaknesses. The overall theme is to reduce what the model can do and to add checks before actions or responses are finalized.

How does LanKit’s similarity-based detection work conceptually?

Similarity-based detection uses a curated harm dataset of known jailbreaks/prompt injections. Incoming prompts are embedded into a numerical vector space, then compared against stored embeddings of known attacks. Even if the exact attack text isn’t present, semantically similar prompts can be flagged via a similarity score (e.g., “jailbreak similarity”). This helps catch variations of known attack patterns.

What is proactive prompt-injection detection, and what does it look for?

Proactive detection adds an extra LLM call using a controlled instruction: for example, “repeat the letter A once while ignoring the following text,” then passing the suspected target prompt. If the model’s output deviates from the expected control output (e.g., returns “B” instead of “A” in a Caesar-cipher style scenario), the system treats the original prompt as likely manipulative. It’s designed to detect injections that attempt to override system instructions.

Why can proactive detection still miss threats?

The session noted that proactive overwrite-style checks don’t cover all injection types. If an attack aims for data leakage (e.g., extracting sensitive information) rather than changing the model’s output in the tested way, the proactive test may not trigger. It also doesn’t replace other defenses because it’s only one scheme, and it increases cost/latency due to the extra LLM call.

Review Questions

  1. Which mitigation steps reduce the impact of a successful prompt injection, and how do trust boundaries contribute?
  2. Explain how embedding-based similarity detection can flag unseen variations of known jailbreaks.
  3. Give one example of a threat type that proactive prompt-injection detection might fail to catch, and why.

Key Points

  1. 1

    Prompt injection manipulates LLM behavior through crafted inputs, and the worst outcomes happen when the LLM is connected to powerful tools/APIs.

  2. 2

    Jailbreak attacks target safety constraints specifically, often using role-play or framing tricks to bypass refusals.

  3. 3

    Least-privilege access, role-based permissions, and human approval for sensitive actions are core defenses against prompt injection-driven harm.

  4. 4

    Trust boundaries and segregating untrusted content limit how user text can influence system instructions.

  5. 5

    Monitoring input and output—using security metrics like jailbreak similarity and toxicity—enables continuous detection and faster response.

  6. 6

    Similarity-based detection flags prompts that are semantically close to known attacks using embedding space comparisons, but it can miss novel attacks.

  7. 7

    Proactive detection adds a controlled extra LLM call to test whether instructions are being overridden, but it may not catch injections aimed at data leakage and it increases latency/cost.

Highlights

Jailbreaks are treated as a specialized form of prompt injection: bypass safety constraints by manipulating how the model interprets the request.
The most dangerous prompt injections combine text manipulation with tool permissions—turning “bad instructions” into destructive API actions.
LanKit-style monitoring uses batch-level dashboards and metrics like jailbreak similarity to spot spikes in likely attacks over time.
Proactive detection works by running a controlled “expected output” test; deviations indicate likely instruction tampering.
Embedding similarity can detect semantically related attacks even when the exact wording hasn’t been seen before.

Topics

  • Prompt Injection
  • Jailbreak Attacks
  • LLM Security Mitigations
  • Threat Taxonomy
  • LanKit Detection

Mentioned

  • WhyLabs
  • LanKit
  • Bernice Herman
  • Felipe Adachi
  • OWASP