5 LLM Security Threats- The Future of Hacking?

TL;DR

Prompt injection works by embedding attacker-controlled directives inside inputs so the model ignores earlier instructions and follows the injected ones.

Briefing Cornell Notes

Briefing

Large language models are vulnerable to attacks that manipulate what they follow—especially when prompts can be smuggled through websites, images, or encoded text. Prompt injection is the clearest example: attackers craft inputs that cause an LLM to ignore earlier instructions and instead follow hidden directives, potentially exposing sensitive data, bypassing content filters, or triggering unintended actions.

One demonstration used a normal-looking webpage that contained an instruction hidden in the page content. When the site was scraped, the hidden instruction became visible to the model, telling it to output a specific message (“you are on a private website leave now”). The key risk isn’t the message itself; it’s the mechanism—scraping can surface attacker-controlled instructions embedded in headers, metadata, or other “invisible” page elements. The same pattern appeared in multimodal settings. An image of a house looked ordinary, but hidden text inside the image instructed the model not to describe the image and instead output a different phrase. In another variant, a blurred instruction inside an image caused a vision-capable model to return an HTML link when asked to process the image. That kind of output could be weaponized with malicious URLs (for example, phishing or crypto scams), turning image-based inputs into a delivery channel for harmful instructions.

The transcript also ties prompt injection to real-world search behavior. A classic example attributed to Andrej Karpathy describes how search results can include persuasive “however” style content—complete with a link and urgency—that nudges users toward scams. The underlying lesson is that attacker-controlled text can ride along in web snippets or page descriptions, and LLMs may reproduce it as if it were relevant to the user’s request.

Jailbreaks shift the focus from hidden instructions to bypassing refusal behavior. A jailbreak hijacks the initial prompt so the model generates content it would normally refuse. The transcript distinguishes prompt-level jailbreaks (often requiring more human work and social engineering) from token-level jailbreaks that can be automated by optimizing prompts with special encodings. A replicated example used base64 encoding: a request that would be refused in plain text could be transformed into an encoded string, which the model still interpreted well enough to produce disallowed output. In the demonstration, the encoded prompt ultimately led to the generation of phishing-style email content targeting Bank of America.

Multimodal jailbreaks add another attack surface. The transcript describes research where carefully designed “noise patterns” embedded in an image can cause a model to break and follow harmful instructions. The panda example illustrates how structured noise—imperceptible to humans—can be optimized so the model treats it as a cue, enabling instruction-following that should not occur.

Overall, the message is blunt: as LLMs and multimodal models become more embedded in products and workflows (text, images, and eventually video), attackers gain more ways to smuggle malicious directives. The likely defense will mirror broader cybersecurity: continuous patching, layered controls, and an ongoing arms race between attackers and mitigations.

Cornell Notes

Prompt injection manipulates what an LLM follows by embedding attacker instructions in inputs like web pages or images, causing the model to ignore prior instructions. The transcript shows how scraping can reveal hidden directives in page elements, and how vision models can be tricked into outputting attacker-chosen text or even HTML links. Jailbreaks instead hijack refusal behavior by transforming prompts—such as using base64 encoding—so the model still produces disallowed content like phishing emails. Multimodal jailbreak research adds structured noise patterns to images that can “break” models into following harmful instructions. These threats matter because they expand the attack surface as systems rely more on multimodal inputs and automated workflows.

What makes prompt injection different from ordinary prompt misuse?

Prompt injection targets the model’s instruction-following hierarchy. Attackers embed carefully crafted directives inside inputs (webpage content, metadata, or images) so the model treats those hidden instructions as higher priority than the original user request. The transcript’s examples show scraping a webpage can surface hidden instructions, and vision processing can reveal hidden text or links embedded in images.

How can scraping a website turn “invisible” instructions into model-following behavior?

A webpage can contain attacker-controlled text in places that look harmless to a human viewer. When a program scrapes the page and feeds the extracted content to an LLM, the hidden instruction becomes part of the model’s context. In the demonstration, scraping caused the model to output a directive like “you are on a private website leave now,” showing how embedded instructions can override intended behavior.

Why are multimodal models a bigger target for prompt injection?

Multimodal models accept images (and later video), which can carry hidden instructions that are hard for users to notice. The transcript shows an image that looks like a normal house but contains embedded instructions to override the task, and another image that leads a vision model to output an HTML link. That link could be malicious, turning image inputs into a delivery mechanism for phishing or scams.

How does base64 encoding function as a jailbreak technique in the transcript’s example?

The transcript describes encoding a disallowed request into base64 so the model still interprets it despite not seeing it in plain text. The workflow shown: convert a harmful prompt into base64, paste the encoded string into a model, ask what it says, and then continue to elicit disallowed content. The end result was phishing-style email text targeting Bank of America.

What is the role of structured noise in multimodal jailbreaks?

Research described in the transcript shows that adding carefully optimized noise patterns to an image can cause a model to “break” and follow harmful instructions. Humans may not perceive the pattern, but the model treats it as meaningful signal. The example uses a panda image with hidden structured noise that, when included, leads to the model producing an instruction-following response.

Review Questions

In a prompt injection scenario, what specific mechanism allows attacker instructions to override the user’s original intent?
How can vision-model outputs like HTML links increase real-world risk compared with plain text manipulation?
Why might base64-encoded prompts bypass refusal behavior, and what does the transcript’s phishing example demonstrate about downstream harm?

Key Points

1
Prompt injection works by embedding attacker-controlled directives inside inputs so the model ignores earlier instructions and follows the injected ones.
2
Scraping can expose hidden webpage instructions that are not obvious in normal browsing, turning benign-looking content into an attack payload.
3
Multimodal prompt injection can hide instructions inside images, including cases where vision models output attacker-chosen text or HTML links.
4
Jailbreaks hijack refusal behavior by transforming prompts—token-level techniques like base64 encoding can make disallowed requests succeed.
5
Multimodal jailbreaks can use structured noise patterns in images that are imperceptible to humans but trigger harmful instruction-following.
6
As LLMs and multimodal models become more integrated into products, the attack surface expands and defenses must evolve continuously.

Highlights

A normal webpage can contain hidden instructions that only become effective after scraping, letting attackers steer an LLM’s output.

Vision models can be tricked into producing HTML links from images with barely visible embedded directives—turning image inputs into a phishing delivery channel.

Base64 encoding can transform a refused request into one that still yields disallowed content, including phishing-style emails targeting Bank of America.

Structured noise in images can act as an optimized jailbreak trigger, causing a multimodal model to follow harmful instructions.

Topics

Prompt Injection
Jailbreaks
Multimodal Attacks
Phishing
Base64 Encoding

Mentioned

Andrej Karpathy