Intro to LLM Security - OWASP Top 10 for Large Language Models (LLMs)

TL;DR

OWASP’s LLM Top 10 risks—especially prompt injections/jailbreaks, unsafe output handling, data leakage, and insecure plugin integrations—require continuous detection and enforcement, not one-time review.

Briefing Cornell Notes

Briefing

Large language model security is increasingly about catching risky behavior before it reaches users—and doing it continuously once models go live. A practical framework from OWASP’s “Top 10 for Large Language Models” highlights recurring failure modes such as prompt injections (often framed as jailbreaks), insecure handling of malicious outputs, training-data poisoning, denial-of-service via excessive resource use, supply-chain and permission gaps, data leakage, and insecure plugin integrations. The core takeaway is that these risks aren’t one-off bugs; they require detection, monitoring, and guardrails that evolve as prompts, models, and usage patterns change.

The session walks through the OWASP Top 10 themes and then connects them to measurable signals that can be extracted from prompts and model responses. Prompt injection risk can be approximated with “jailbreak similarity” style metrics—if a prompt resembles known jailbreak patterns, systems can block the prompt or withhold the model’s response. Insecure output handling can be monitored by looking for response-side indicators such as toxicity, jailbreak-related similarity, or pattern matches for sensitive content. Data leakage is treated as both directions of flow: sensitive information leaking out of the model (e.g., PII like phone numbers, Social Security numbers, or credit cards) and sensitive information being sent into a hosted model unintentionally. The talk also emphasizes “excessive agency” risk—when models are given too much control—along with the need for authorization and permission tracking, especially when plugins connect LLMs to external tools and accept free-form inputs.

To make these ideas operational, the workshop introduces WhyLabs’ open-source Linkit and its integration with YLabs for AI observability. Linkit profiles prompt/response pairs and produces privacy-preserving data profiles rather than storing raw text. Those profiles include a mix of security-relevant metrics (pattern detection for PII, toxicity, jailbreak similarity, sentiment, and other similarity/relevancy signals) and general quality signals (readability and distribution statistics). Once metrics are written into YLabs, teams can visualize trends over time, set monitors, and trigger alerts or workflows when metrics drift or spike—turning security monitoring into something closer to standard ML/data observability.

A hands-on portion uses Google Colab with Hugging Face’s gpt2 to generate responses, then profiles multiple prompt/response examples and uploads the resulting metrics into YLabs. The dashboard demonstrates time-series behavior: metrics like sentiment can shift across simulated “days,” and the platform supports monitors such as data drift detection using algorithms like Hellinger distance. For security use cases, the workflow is to (1) profile inputs/outputs, (2) set guardrails based on thresholds (e.g., block responses when toxicity exceeds a cutoff or when jailbreak similarity indicates risk), and (3) continuously monitor for changes that might indicate new attack patterns or regressions.

The session closes by showing how to implement guardrails locally using extracted metrics and simple conditional logic, and how to extend Linkit with custom metrics (user-defined functions) for organization-specific signals—such as similarity checks against a vector database of known jailbreaks. Overall, the message is clear: LLM security scales when detection and evaluation become part of the production pipeline, not an after-the-fact review.

Cornell Notes

OWASP’s LLM security “Top 10” frames common ways large language model systems fail—especially prompt injections/jailbreaks, unsafe output handling, data leakage, poisoning, denial-of-service, permission gaps, and insecure plugin integrations. The workshop turns those categories into measurable signals by profiling prompts and responses with WhyLabs’ open-source Linkit, which extracts privacy-preserving metrics like jailbreak similarity, toxicity, and PII/pattern matches. Those metrics can be sent to YLabs to visualize trends over time and set monitors (e.g., drift detection) that alert when behavior changes. Guardrails can then be enforced either in production via thresholds or locally with simple conditional logic based on extracted metrics. This approach treats LLM security as continuous observability and iterative hardening, not a one-time checklist.

How does prompt injection/jailbreaking get operationalized into something a system can detect and block?

The workshop describes prompt injection as manipulating the model into doing actions it shouldn’t, including bypassing guardrails through clever rephrasing. A practical mitigation is detecting prompt injection risk using metrics such as “jailbreak similarity.” In the hands-on flow, Linkit extracts a jailbreak similarity score from prompts (and also supports response-side signals). If the score crosses a threshold, the system can avoid sending the prompt to the model or avoid returning the model’s response to the user—so the risky behavior never reaches the user-facing layer.

What’s the difference between monitoring for data leakage out of the model versus preventing sensitive data from entering it?

Data leakage is treated as two directions. First, the model can leak sensitive information in its outputs—so monitoring looks for PII patterns in responses, including Social Security numbers, phone numbers, and credit cards. Second, sensitive proprietary data can be sent into a hosted model—so monitoring/guardrails should also inspect prompts before they’re submitted. Linkit supports pattern-based detection for common PII and can be used to trigger alerts or block requests when sensitive patterns appear.

Why are “insecure output handling” and “excessive agency” linked in practice?

Insecure output handling focuses on malicious or unsafe outputs produced by the model—such as code, toxic content, or jailbreak-adjacent responses. Excessive agency is about giving the model too much control over actions. Together, they motivate output validation and permission/guardrail enforcement: systems should validate outputs (e.g., toxicity thresholds, pattern matches) and limit what the model is allowed to do, especially when external tools or plugins are involved.

How does the Linkit + YLabs approach make LLM security monitoring scalable over time?

Linkit profiles prompt/response pairs into privacy-preserving data profiles (summary statistics and extracted metrics) rather than storing raw text. Those metrics are then written into YLabs, where teams can view time-series trends and set monitors. Instead of manually reviewing logs, monitors can detect spikes or drift (for example, using Hellinger distance for non-discrete columns) and trigger alerts via email/Slack/PagerDuty or workflows like retraining.

What does “guardrails” look like when implemented with extracted metrics?

The workshop demonstrates a local guardrail pattern: profile a prompt (or response) to extract a metric like toxicity, then apply a threshold in code. If toxicity is above a cutoff, the system blocks or substitutes the response (e.g., returning “I can’t help with that” rather than calling the model). The same pattern can be applied to jailbreak similarity, PII/pattern matches, or other security metrics extracted by Linkit.

How can teams add their own security signals beyond the built-in metrics?

Linkit supports custom metrics via user-defined functions (UDFs). The workshop notes that teams can register a function that returns a label or numeric score, such as a similarity score computed against a vector database of known jailbreaks. Once registered, that custom metric becomes part of the extracted profile and can be visualized in YLabs and used for monitors and guardrails.

Review Questions

Which OWASP LLM risk categories in the workshop map most directly to prompt-side detection versus response-side detection?
Describe a concrete workflow for preventing jailbreaks from reaching users using jailbreak similarity metrics and thresholds.
How would you set up a monitor to catch security-relevant changes over time, and what kinds of metrics would you choose?

Key Points

1
OWASP’s LLM Top 10 risks—especially prompt injections/jailbreaks, unsafe output handling, data leakage, and insecure plugin integrations—require continuous detection and enforcement, not one-time review.
2
Jailbreak similarity and other similarity/pattern metrics can be used to block risky prompts or withhold risky responses before they reach users.
3
Data leakage monitoring should cover both directions: sensitive information leaking in outputs and sensitive proprietary data entering prompts.
4
AI observability with Linkit + YLabs turns extracted prompt/response metrics into time-series dashboards and alertable monitors.
5
Guardrails can be implemented with simple threshold logic based on extracted metrics like toxicity, jailbreak similarity, and PII/pattern matches.
6
Monitoring should include drift/spike detection (e.g., using Hellinger distance) to catch changes in model behavior or user behavior over time.
7
Linkit’s custom metrics (UDFs) let teams add organization-specific security signals, such as similarity checks against a known jailbreak corpus.

Highlights

OWASP-style LLM security becomes actionable when prompt/response risks are converted into measurable signals like jailbreak similarity, toxicity, and PII/pattern matches.

Linkit profiles prompts and responses into privacy-preserving summaries, enabling monitoring without storing raw sensitive text.

YLabs monitoring supports time-series analysis and drift detection, so security alerts can trigger when behavior changes across days.

Guardrails can be enforced with straightforward code: extract a metric, compare to a threshold, and block or reroute the model call accordingly.

Custom security metrics can be added via user-defined functions, including embedding-based similarity checks against a jailbreak database.

Topics

OWASP Top 10
Prompt Injection
Data Leakage
AI Observability
Guardrails

Mentioned

WhyLabs
Linkit
YLabs
Hugging Face
Google Colab
LangChain
OpenAI
PagerDuty
Slack
Gradio
S Elliot
OWASP
LLM
NIST
PII
DOS
CPU
ML
MLOps
UDF
API