
Can GPT-4 Vision Detect Deepfake AI Images?

All About AI · 5 min read

Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

GPT-4 Vision can assign very high fake probabilities (up to 90%) when composites show measurable inconsistencies like mismatched noise patterns and lighting direction.

Briefing

GPT-4 Vision can flag some AI-made images with high confidence, but it also produces shaky results—especially when the manipulation is subtle or when “common sense” cues dominate over pixel-level evidence. In repeated tests, the system often latched onto visible inconsistencies (like mismatched lighting, edge clarity, or compositing artifacts) to assign fake probabilities ranging from as low as 5% for a real photo to as high as 90% for certain manipulated composites.

A clear win came with a cat-and-Trump composite. After analyzing artifacts, pixel characteristics, noise levels, and lighting direction, it concluded the image was likely manipulated, assigning a 90% probability of being fake. The reasoning leaned on measurable-looking differences: the cat’s edges (around the ears and tail) appeared sharper than expected, the face and surrounding elements showed inconsistent clarity, and the cat and man seemed to carry different noise patterns. Those quality mismatches, typical of overlay insertion, pushed the assessment toward fabrication.
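The noise-pattern cue can be approximated numerically: a pasted-in element often carries a different local noise level than the material around it. The sketch below is an illustration on synthetic patches, not GPT-4 Vision's actual method; `local_noise_level` is a hypothetical helper that estimates noise as the residual left after a simple box blur.

```python
import numpy as np

def local_noise_level(patch: np.ndarray) -> float:
    """Estimate noise as the standard deviation of the residual left
    after a 3x3 box blur (a crude stand-in for real denoisers)."""
    h, w = patch.shape
    padded = np.pad(patch, 1, mode="edge")
    blurred = sum(padded[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0
    return float(np.std(patch - blurred))

rng = np.random.default_rng(0)
background = rng.normal(128.0, 8.0, (64, 64))  # noisier "camera" texture
inserted = rng.normal(128.0, 1.0, (64, 64))    # much cleaner pasted-in patch

ratio = local_noise_level(background) / local_noise_level(inserted)
print(f"noise ratio: {ratio:.1f}")  # a large ratio is a compositing red flag
```

Real forensic tools run statistics like this over sliding windows across the whole image and flag regions whose noise profile deviates from the rest.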

The system also struggled with false positives and inconsistent scoring. A real Pope Francis image was rated as likely genuine, with only a 5% chance of being fake, suggesting the model can sometimes avoid overreacting when no obvious manipulation signals appear. But a separate viral “Pope in a puffy jacket” image drew an 80% fake probability, with the critique hinging largely on contextual plausibility: the attire and how it appears on a public religious figure raised suspicion even when the artifact checklist didn’t find dramatic technical anomalies.

When the manipulation was harder to see—or when the image looked broadly coherent—scores dropped. A Lamborghini image generated with Stable Diffusion (and an alternate off-road-style version) was given only a 10% fake probability, despite the creator’s expectation that it would be obviously synthetic. Similarly, a grainy UFO-related image attributed to a U.S. Navy context was rated 20% fake (80% authentic), with the analysis treating the grain and screen-like overlays as consistent with device capture rather than editing.

The most telling pattern emerged in a direct comparison test using the Taylor Swift tattoo scenario. In one run, GPT-4 Vision assigned moderate fake probabilities (40–65%) to an image where tattoos were added. After uploading the “real” version for comparison, the model’s conclusion sharpened dramatically: the tattoo-added image was rated 90% fake. The takeaway is that the model’s confidence improves when it can anchor its judgment against a reference image, making “what changed” easier to detect.
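The reference-comparison step can be mimicked with basic difference imaging: subtract the known-good version from the suspect one and flag pixels that moved. This is a minimal sketch on synthetic data, not what GPT-4 Vision does internally; the change threshold of 10 is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
reference = rng.integers(0, 256, (64, 64)).astype(float)  # known-good "real" image

edited = reference.copy()
edited[20:30, 40:50] += 60.0    # simulate an added tattoo-like patch
edited = np.clip(edited, 0, 255)

diff = np.abs(edited - reference)
mask = diff > 10                # pixels that changed meaningfully

ys, xs = np.nonzero(mask)
print(f"{100 * mask.mean():.1f}% of pixels changed; edit localized to "
      f"rows {ys.min()}-{ys.max()}, cols {xs.min()}-{xs.max()}")
```

Without such a reference, a detector has to infer plausibility from a single image alone, which is consistent with the weaker 40–65% scores seen before the comparison.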

Overall, the results point to a practical conclusion: GPT-4 Vision isn’t reliable enough to be trusted blindly for deepfake detection, but it can be useful—especially when manipulations create detectable compositing inconsistencies or when comparisons to known-good references are available. The broader implication is that AI-assisted detection may be necessary as synthetic imagery becomes more common and harder for humans to spot by eye, even if current tools still miss the mark.

Cornell Notes

GPT-4 Vision sometimes detects manipulated images with strong confidence, particularly when edits create clear compositing signals such as mismatched noise patterns, inconsistent lighting direction, or unusually sharp edges around inserted elements. In tests, a cat-and-Trump composite was rated 90% fake, while a real Pope Francis photo was rated only 5% fake. Results were less dependable for other cases: a Stable Diffusion Lamborghini variant received just 10% fake, and a grainy UFO-related capture was rated 20% fake. Confidence improved most when a reference image was provided—after comparing two Taylor Swift versions, the tattoo-added image jumped to 90% fake, suggesting “difference detection” boosts accuracy.

What kinds of visual inconsistencies most reliably pushed GPT-4 Vision toward “fake” in the tests?

The strongest fake calls were tied to compositing-style mismatches. In the cat-and-Trump example, the cat’s edges (ears and tail) looked clearer than expected, the face and surrounding elements showed unequal clarity, and the cat and man appeared to have different noise characteristics. Lighting also appeared to come from different directions across the composite, which the analysis treated as a sign of digital insertion.
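The edge-clarity cue works the same way: blurring or rescaling during compositing strips a region's high-frequency detail. A common proxy, used here purely as an illustrative sketch rather than the model's actual procedure, is the variance of a discrete Laplacian, which drops sharply for blurred regions.

```python
import numpy as np

def sharpness(region: np.ndarray) -> float:
    """Variance of a 4-neighbor discrete Laplacian, a standard
    focus/sharpness score; blurred regions score lower."""
    r = region.astype(float)
    lap = (r[:-2, 1:-1] + r[2:, 1:-1] + r[1:-1, :-2] + r[1:-1, 2:]
           - 4.0 * r[1:-1, 1:-1])
    return float(np.var(lap))

rng = np.random.default_rng(1)
crisp = rng.normal(128.0, 20.0, (64, 64))  # detailed, textured region

# Box-blur the same texture to mimic a softer surrounding scene.
padded = np.pad(crisp, 1, mode="edge")
soft = sum(padded[i:i + 64, j:j + 64] for i in range(3) for j in range(3)) / 9.0

print(sharpness(crisp) > sharpness(soft))  # True: the crisp patch stands out
```

A sharpness score that spikes only inside the outline of an inserted element is exactly the "unusually clear edges" pattern the analysis flagged on the cat.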

How did GPT-4 Vision handle false positives when given real images?

It sometimes avoided false alarms. The Pope Francis photo was judged likely authentic with a 5% fake probability, with no glaring manipulation evidence flagged. That suggests the model can recognize when an image’s overall quality and internal consistency look natural enough to avoid overcalling edits.

Why did the “Pope puffy jacket” case land at a high fake probability even without obvious artifact findings?

The critique leaned heavily on contextual plausibility—how unusual the jacket and styling looked for a public religious figure—rather than on a dramatic technical artifact signature. The analysis still assigned an 80% fake probability, implying that “common sense” reasoning can dominate when the artifact checklist doesn’t produce decisive evidence.

What happened in the Taylor Swift tattoo experiment, and what changed after comparison?

Without a reference, the tattoo-added image received mixed scores (40–65%), with the model focusing on tattoo clarity and consistency rather than finding decisive technical flaws. After uploading the real Taylor Swift image for comparison, the conclusion flipped: the tattoo-added version was rated 90% fake. The improvement suggests the model benefits when it can identify what differs between two versions.

Why might AI-generated-looking images still receive low fake probabilities?

Some synthetic images can look internally coherent, making artifact-based detection harder. The Stable Diffusion Lamborghini variant was rated 10% fake, and the analysis noted no immediate signs of manipulation. In such cases, the model falls back on subtle cues (like lighting or historical plausibility) that are often not strong enough to raise confidence.

Review Questions

  1. Which specific visual signals (e.g., noise, edge sharpness, lighting direction) most increased the fake probability in the cat-and-Trump composite?
  2. What evidence did the model rely on more heavily in the “Pope puffy jacket” case—technical artifacts or contextual plausibility?
  3. How did providing a reference image change the Taylor Swift tattoo detection outcome, and what does that imply about difference-based forensics?

Key Points

  1. GPT-4 Vision can assign very high fake probabilities (up to 90%) when composites show measurable inconsistencies like mismatched noise patterns and lighting direction.

  2. The model can also avoid false positives on some real photos, such as rating a Pope Francis image as only 5% likely fake.

  3. Confidence drops when edits produce images that remain visually coherent, as seen with a Stable Diffusion Lamborghini variant rated 10% fake.

  4. Contextual “common sense” reasoning can drive high fake scores even when the artifact checklist finds little—illustrated by the Pope puffy jacket case at 80%.

  5. Providing a reference image can dramatically improve detection, as the Taylor Swift tattoo-added image rose to 90% fake after comparison.

  6. Deepfake detection remains unreliable for blind trust; it works best when manipulations create detectable compositing artifacts or when comparisons are available.

Highlights

A cat-and-Trump composite was rated 90% fake, with the analysis pointing to sharper inserted edges, inconsistent clarity, different noise patterns, and lighting direction mismatches.
A real Pope Francis photo was rated only 5% fake, suggesting the system can sometimes avoid false positives when no strong manipulation cues appear.
The Taylor Swift tattoo test improved from mixed 40–65% calls to 90% fake after uploading a real reference image for comparison.
A Stable Diffusion Lamborghini variant received just 10% fake probability, showing how easily synthetic images can slip past artifact-based checks when they look coherent.
