Intro to LLM Security - OWASP Top 10 for Large Language Models (LLMs)
Based on WhyLabs' video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their content.
Briefing
LLM security hinges on treating every prompt-and-response cycle as potentially hostile—then building monitoring and guardrails that catch failures early. The workshop frames OWASP’s “Top 10 for Large Language Models” as a practical checklist for teams shipping LLM-powered features, especially when models are accessed through APIs, wrapped in plugins, or connected to downstream systems like databases and automation tools.
A key starting point is the lack of control teams have over model internals. When using popular LLMs via APIs (or even open-source models), developers often don’t know what data the model was trained on, what post-processing happens behind the scenes, or whether training data poisoning and other supply-chain issues occurred. Even “internal” deployments don’t eliminate risk because LLMs can be manipulated through natural-language attacks. That uncertainty is why the OWASP list matters: it turns a squishy, fast-moving threat landscape into concrete categories teams can design against and measure.
The first and most emphasized threat is prompt injection. Malicious users can craft instructions that override prior system or developer guidance, potentially leading the model to behave toxically or to take actions it otherwise shouldn’t. The mitigation guidance is blunt: assume the LLM application can be manipulated, apply least-privilege design so the model can’t control the whole user experience or upstream/downstream systems, monitor prompts and responses for unusual patterns (including known injection signatures), and use a human-in-the-loop for high-risk flows when feasible.
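To make the “known injection signatures” idea concrete, here is a minimal sketch of a prompt screen applied before the model is called. The pattern list and the `call_llm` callable are hypothetical stand-ins, not anything shown in the workshop, and signature matching alone will not catch novel injections; it is one layer alongside least privilege and human review.

```python
import re

# Illustrative, non-exhaustive signatures of common injection phrasings.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"disregard (the|your) system prompt", re.IGNORECASE),
    re.compile(r"you are now (in )?developer mode", re.IGNORECASE),
]

def looks_like_injection(prompt: str) -> bool:
    """Return True if the prompt matches a known injection signature."""
    return any(p.search(prompt) for p in INJECTION_PATTERNS)

def handle_user_prompt(prompt: str, call_llm) -> str:
    """Screen the prompt before forwarding it to the model.

    `call_llm` is a placeholder for whatever LLM client the application uses.
    """
    if looks_like_injection(prompt):
        # Surface the attempt to monitoring / human review instead of calling the model.
        return "This request has been flagged for manual review."
    return call_llm(prompt)
```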
From there, the workshop walks through additional OWASP categories with the same theme: validate outputs, isolate risky components, and instrument everything. Insecure output handling is illustrated with an LLM generating SQL—if outputs aren’t validated, an attacker could steer the model toward destructive queries. Training data poisoning is treated as especially hard when the model is an API black box; teams should assume undesirable training data may exist and isolate the model’s impact, while those fine-tuning should validate, clean, anonymize, and test their datasets. Denial of service is addressed through input validation, monitoring resource utilization, and enforcing rate limits and resource caps.
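For the SQL illustration, one common control is to treat the generated query as untrusted input and validate it before execution. The sketch below is an assumption about how such a check might look for a feature that should only ever run single read-only queries; the allowlist and keyword filter are illustrative, and in practice the query should also execute under a least-privilege, read-only database role rather than rely on string checks alone.

```python
import re

# The feature is only ever expected to produce single read-only queries.
ALLOWED_PREFIX = re.compile(r"^\s*SELECT\b", re.IGNORECASE)
# Destructive or schema-altering keywords that should never appear.
FORBIDDEN_KEYWORDS = re.compile(
    r"\b(DROP|DELETE|UPDATE|INSERT|ALTER|TRUNCATE|GRANT|REVOKE)\b", re.IGNORECASE
)

def validate_generated_sql(sql: str) -> str:
    """Accept a single read-only statement; raise on anything else."""
    body = sql.strip().rstrip(";")
    if ";" in body:
        raise ValueError("Multiple statements are not allowed")
    if not ALLOWED_PREFIX.match(body):
        raise ValueError("Only SELECT statements are allowed")
    if FORBIDDEN_KEYWORDS.search(body):
        raise ValueError("Destructive keyword detected")
    return body
```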
Supply-chain vulnerabilities broaden the lens to the entire lifecycle: training data, libraries, plugins, and even vendor terms that might allow data reuse or retraining. Sensitive information disclosure (PII leakage) is handled through prevention (block prompts containing sensitive data), careful dataset hygiene for fine-tuning, and monitoring for leaks in responses. Plugin risks and “excessive agency” are treated as closely related: plugins should run with least privilege, be vetted, and require user confirmation for sensitive actions; open-ended actions like arbitrary URL opening are hard to secure and should be avoided.
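As a rough illustration of “block prompts containing sensitive data” and “monitor for leaks in responses,” the sketch below uses a few regex patterns as a stand-in for a real PII detector. The patterns and function names are assumptions made for illustration; production systems typically rely on dedicated PII/NER detectors rather than hand-rolled regexes.

```python
import re

# Illustrative patterns only; real deployments typically use a dedicated
# PII detector rather than a handful of regexes.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[ .-]\d{3}[ .-]\d{4}\b"),
}

def pii_types_found(text: str) -> list[str]:
    """Return the names of PII patterns present in the text."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]

def screen_prompt(prompt: str) -> str:
    """Block prompts containing sensitive data before they leave the application."""
    found = pii_types_found(prompt)
    if found:
        raise ValueError(f"Prompt rejected: contains {', '.join(found)}")
    return prompt

def response_leaks_pii(response: str) -> bool:
    """Flag responses that appear to contain PII so monitoring can alert."""
    return bool(pii_types_found(response))
```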
The later items—overreliance and model theft—focus on operational and business risk. Overreliance happens when users treat fluent but incorrect outputs as authoritative; mitigation includes quality monitoring, interface nudges that force review, and cross-verification for high-stakes decisions. Model theft is addressed with strong access control to model assets and monitoring for suspicious activity.
To make the guidance actionable, the workshop introduces WhyLabs’ open-source LLM telemetry library LangKit and a hands-on monitoring demo. The approach extracts security-relevant signals (e.g., jailbreak similarity, toxicity, refusal likelihood, and PII patterns) from prompts and responses, then sends only derived scores, not raw text, to a dashboard for visibility and alerting. In the demo, the dashboards flag PII leakage in both prompts and responses and show how spikes in jailbreak similarity and toxicity can correlate with increased PII leakage, enabling targeted investigation even without storing sensitive raw content.
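The score-only telemetry pattern is easy to picture in code. The sketch below is a deliberately crude stand-in (keyword overlap instead of embedding similarity, a couple of regexes instead of trained detectors), and all names in it are hypothetical; the point it shows is that only derived numbers leave the application, which mirrors what the demo does with LangKit and its dashboards.

```python
import re

# Crude stand-ins for the signals mentioned in the demo (jailbreak similarity,
# PII patterns); a real pipeline would use embeddings and trained detectors.
JAILBREAK_HINTS = ("ignore previous instructions", "pretend you are", "developer mode")
PII_REGEX = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+|\b\d{3}-\d{2}-\d{4}\b")

def jailbreak_similarity(prompt: str) -> float:
    """Fraction of known jailbreak phrases present in the prompt (0.0 to 1.0)."""
    lowered = prompt.lower()
    return sum(hint in lowered for hint in JAILBREAK_HINTS) / len(JAILBREAK_HINTS)

def derive_telemetry(prompt: str, response: str) -> dict:
    """Compute numeric signals only; the raw text never leaves the application."""
    return {
        "prompt.jailbreak_similarity": jailbreak_similarity(prompt),
        "prompt.has_pii": float(bool(PII_REGEX.search(prompt))),
        "response.has_pii": float(bool(PII_REGEX.search(response))),
        "response.char_length": float(len(response)),
    }

# The resulting dict of scores is what would be shipped to a dashboard or
# metrics store; the raw prompt and response stay inside the application.
```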
Cornell Notes
OWASP’s Top 10 for Large Language Models turns LLM security into a set of design and monitoring targets teams can act on. The workshop stresses that developers rarely control what happens inside an LLM (especially via APIs), so systems must assume hostile inputs and validate outputs. Mitigations repeatedly return to least privilege, isolation between the LLM and downstream systems, and observability: monitoring prompts/responses for jailbreaks, PII leakage, toxicity, and refusal patterns. A hands-on demo shows how WhyLabs’ LangKit can extract security signals and feed dashboards for quick detection and investigation, while sending only scores (not raw prompts/responses) to reduce exposure of sensitive data. The practical takeaway: security improves when teams measure LLM behavior continuously and intervene when metrics drift.
Why does prompt injection remain the top OWASP concern, and what mitigations were emphasized?
How does insecure output handling create real downstream risk, and what controls reduce it?
What makes training data poisoning especially difficult for API-based LLMs, and what should teams do anyway?
How do denial-of-service risks apply to LLMs, and what practical defenses were listed?
What’s the connection between plugin security and “excessive agency”?
How does the hands-on demo use monitoring without sending raw sensitive text to WhyLabs?
Review Questions
- Which OWASP categories were tied most directly to the need for least-privilege design, and how do those categories differ (prompt injection vs. excessive agency vs. insecure plugin design)?
- For an LLM feature that generates SQL, what specific validation and monitoring steps would you implement to reduce insecure output handling risk?
- In the demo approach, what signals are extracted by LangKit, and why is sending only scores (not raw text) a meaningful security choice?
Key Points
1. Assume LLM applications can be manipulated; design with hostile prompts in mind rather than relying on “best effort” prompt engineering.
2. Use least-privilege and isolation so LLMs can’t control the full user experience or upstream/downstream systems after a successful attack.
3. Validate LLM outputs, especially when outputs trigger actions like SQL, API calls, or automation, then monitor for expected structure, length, and quality.
4. Treat training data poisoning and supply-chain risks as real uncertainties, particularly for API-based models where training data and lifecycle details are opaque.
5. Prevent and detect sensitive information disclosure by blocking sensitive inputs, anonymizing datasets for fine-tuning, and monitoring responses for PII patterns.
6. Harden plugin and tool integrations with least privilege, vetted sources, testing, and user confirmations for sensitive actions.
7. Reduce overreliance by monitoring output quality and adding interface and workflow steps that force review for high-stakes decisions.