Local AI Surveillance Is Getting SCARY Good (Qwen3-VL)
Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Qwen3-VL can run locally on a snapshot-based camera feed to produce low-latency boolean decisions for security automation.
Briefing
A local vision-language model can power a practical, low-latency security setup that doesn’t just detect “a person,” but can trigger actions only for highly specific visual conditions—like a person wearing an orange jacket—then escalate to physical deterrence using a drone.
The build starts with Qwen3-VL, tested in multiple sizes (2B, 4B, 8B). For this security use case, the system leans on the 2B model for speed, using a simple true/false output to keep decision time low. Frames come from a mobile IP camera feed: an old Android phone runs an IP Webcam Pro stream, while a Python script periodically grabs snapshots (set to about every two seconds) at a reduced resolution (640×360) so the model can analyze quickly enough to act in near real time.
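A minimal sketch of that snapshot loop, assuming the IP Webcam app's standard /shot.jpg single-frame endpoint and a hypothetical LAN address (the video does not show the exact script):

```python
# Snapshot loop: poll the phone's IP Webcam stream every ~2 seconds and
# downscale each frame to 640x360 before handing it to the model.
# The address below is hypothetical; IP Webcam serves a single JPEG at /shot.jpg.
import time
from io import BytesIO

import requests
from PIL import Image

CAMERA_URL = "http://192.168.1.50:8080/shot.jpg"  # your phone's LAN address
SNAPSHOT_INTERVAL = 2.0   # seconds between frames
TARGET_SIZE = (640, 360)  # reduced resolution for fast inference

def grab_snapshot() -> Image.Image:
    """Fetch one JPEG frame from the phone and resize it."""
    resp = requests.get(CAMERA_URL, timeout=5)
    resp.raise_for_status()
    return Image.open(BytesIO(resp.content)).resize(TARGET_SIZE)

if __name__ == "__main__":
    snapshot_no = 0
    while True:
        grab_snapshot().save(f"snapshot_{snapshot_no:05d}.jpg")
        snapshot_no += 1
        time.sleep(SNAPSHOT_INTERVAL)
```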
The first detection mode is intentionally basic: ask the model whether any person appears in the frame, and return only “true” or “false.” When the model flips to true, the system triggers an alarm. In testing, the alarm fires almost immediately after the person enters the camera view, even when the subject barely makes it into frame—an outcome attributed to the combination of frequent snapshots and fast inference from the smaller Qwen3-VL variant.
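One way to wire that check up, assuming Qwen3-VL is served locally through Ollama's /api/generate endpoint; the model tag and prompt wording here are illustrative, not the video's exact code:

```python
# "Person present?" check, assuming Qwen3-VL runs locally behind
# Ollama's /api/generate endpoint. The model tag is illustrative.
import base64

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "qwen3-vl:2b"  # the small variant, favored here for speed

PROMPT = (
    "Is there a person anywhere in this image? "
    "Answer with exactly one word: true or false."
)

def person_present(jpeg_bytes: bytes) -> bool:
    resp = requests.post(OLLAMA_URL, json={
        "model": MODEL,
        "prompt": PROMPT,
        "images": [base64.b64encode(jpeg_bytes).decode()],
        "stream": False,
    }, timeout=30)
    resp.raise_for_status()
    return resp.json()["response"].strip().lower().startswith("true")
```

Parsing with a strict `startswith("true")` means any malformed or hedged reply counts as false, so a confused model cannot set off the alarm by accident.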
The system then adds specificity without training custom object-detection models. Instead of bounding boxes or dataset-driven classifiers, it chains two text-based checks: first confirm a person is present, then ask whether that person is wearing an orange jacket. Only when both conditions return true does the alarm activate. The result is a “semantic trigger” that behaves like a rule written in plain language: walk in without the jacket and the alarm stays quiet; put on the orange jacket and the alarm goes off.
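The chaining itself takes only a few lines. A sketch under the same Ollama assumption, with a generic ask_bool() helper and illustrative prompts:

```python
# Chained "semantic trigger": both boolean checks must return true before
# the alarm fires. Same Ollama assumption as above; prompts are illustrative.
import base64

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "qwen3-vl:2b"

def ask_bool(jpeg_bytes: bytes, question: str) -> bool:
    """Ask one yes/no question about the frame; parse a true/false reply."""
    resp = requests.post(OLLAMA_URL, json={
        "model": MODEL,
        "prompt": question + " Answer with exactly one word: true or false.",
        "images": [base64.b64encode(jpeg_bytes).decode()],
        "stream": False,
    }, timeout=30)
    resp.raise_for_status()
    return resp.json()["response"].strip().lower().startswith("true")

def orange_jacket_intruder(jpeg_bytes: bytes) -> bool:
    # First gate: generic presence check.
    if not ask_bool(jpeg_bytes, "Is there a person in this image?"):
        return False
    # Second gate: only reached when someone is actually in frame.
    return ask_bool(jpeg_bytes, "Is the person wearing an orange jacket?")
```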
Beyond intrusion detection, the same pipeline supports environment monitoring. A separate mode checks whether curtains/blinds are open or closed. Here, the setup switches to the 4B model for improved accuracy, and the alarm triggers when the state changes to the target condition. The system also logs events in text with timestamps, snapshot numbers, and camera identifiers, creating an audit trail that can be extended later.
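A sketch of that state monitor, reusing the ask_bool() helper from the previous snippet (pointed at a 4B model tag); the log format, camera identifier, and trigger_alarm() hook are all illustrative:

```python
# Curtain/blinds monitor: use the 4B variant for accuracy, fire only on a
# state change into the target condition, and append an audit-trail line.
# ask_bool() is the helper sketched above, with MODEL set to a 4B tag.
from datetime import datetime

CAMERA_ID = "cam-01"
previous_open: bool | None = None  # unknown until the first frame

def trigger_alarm() -> None:
    print("ALARM")  # placeholder for the real siren/alert hook

def log_event(snapshot_no: int, message: str) -> None:
    line = f"{datetime.now().isoformat()} | {CAMERA_ID} | snapshot {snapshot_no} | {message}"
    with open("events.log", "a") as f:
        f.write(line + "\n")

def check_curtains(jpeg_bytes: bytes, snapshot_no: int) -> None:
    global previous_open
    is_open = ask_bool(jpeg_bytes, "Are the curtains or blinds in this image open?")
    if previous_open is not None and is_open != previous_open:
        log_event(snapshot_no, f"curtains changed to {'open' if is_open else 'closed'}")
        if is_open:  # target condition reached
            trigger_alarm()
    previous_open = is_open
```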
Finally, the project connects detection to hardware deterrence. A drone is kept armed and ready; when the “person + orange jacket” condition becomes true, the drone lifts off to roughly 50 cm to 1 m and performs an intimidating maneuver: flying toward the subject, spinning, then landing. The drone does nothing for a person without the jacket, demonstrating that the model’s boolean outputs can gate real-world actions.
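The video does not name the drone SDK, so the gating sketch below uses the djitellopy library (for a Ryze Tello) purely as a stand-in; only the boolean gate and the maneuver sequence mirror what is described:

```python
# Gating the drone on the boolean trigger. The drone SDK is not specified
# in the source; djitellopy (Ryze Tello) is used here as an illustrative stand-in.
from djitellopy import Tello

def deter_intruder(condition_met: bool) -> None:
    if not condition_met:
        return  # no orange jacket: the drone stays grounded
    drone = Tello()
    drone.connect()
    drone.takeoff()              # lifts to roughly 0.5-1 m
    drone.move_forward(100)      # advance toward the subject (distance in cm)
    drone.rotate_clockwise(360)  # the intimidating spin
    drone.land()
```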
Overall, the setup frames local AI as an end-to-end control system: camera → snapshot → Qwen3-VL reasoning → boolean decision → logging and alerts → optional physical response. The next steps hinted at include integrating other hardware (Flipper Zero, Raspberry Pi, cameras on drones) and using local LLMs to orchestrate more complex automation.
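Put together, the control loop is short. This sketch simply strings the illustrative helpers from the previous snippets into that chain:

```python
# End-to-end loop tying the illustrative helpers together:
# camera -> snapshot -> Qwen3-VL reasoning -> boolean -> log -> drone.
import time
from io import BytesIO

def frame_to_jpeg(img) -> bytes:
    """Hypothetical helper: encode a PIL image as JPEG bytes."""
    buf = BytesIO()
    img.save(buf, format="JPEG")
    return buf.getvalue()

def main() -> None:
    snapshot_no = 0
    while True:
        jpeg = frame_to_jpeg(grab_snapshot())  # camera -> snapshot
        if orange_jacket_intruder(jpeg):       # model -> boolean decision
            log_event(snapshot_no, "person in orange jacket detected")
            trigger_alarm()
            deter_intruder(True)               # optional physical response
        snapshot_no += 1
        time.sleep(2.0)

if __name__ == "__main__":
    main()
```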
Cornell Notes
Qwen3-VL is used to build a local, low-latency security system that turns camera snapshots into simple boolean decisions. Frames from an Android IP Webcam Pro stream are sampled about every two seconds at 640×360 resolution, then analyzed by Qwen3-VL with prompts that force outputs like “true” or “false.” The system can trigger alarms for general presence (“is there a person?”) and for highly specific conditions (“is there a person wearing an orange jacket?”) by chaining two checks. It also monitors room state (curtains/blinds open vs. closed), using the larger 4B model for better accuracy. When the orange-jacket condition is met, a drone is armed to lift and perform an intimidating maneuver, while remaining idle for non-matching people.
Why does the system use a “true/false only” prompt instead of richer outputs?
How does the project achieve “orange jacket only” detection without training a custom detector?
What role do model size choices (2B vs 4B vs 8B) play in the system?
How is the camera feed integrated, and why does resolution matter?
How does detection translate into physical action with the drone?
Review Questions
- What design choices in the pipeline (snapshot rate, resolution, and prompt format) are meant to reduce latency, and how do they affect responsiveness?
- How does chaining two boolean checks (“person present” AND “orange jacket”) change the system’s behavior compared with a single “person present” trigger?
- Why might the system choose the 4B model for curtain/blinds detection but the 2B model for person/orange-jacket detection?
Key Points
1. Qwen3-VL can run locally on a snapshot-based camera feed to produce low-latency boolean decisions for security automation.
2. Sampling frames about every two seconds at reduced resolution (640×360) keeps inference fast enough to trigger actions quickly.
3. Constraining prompts to “true” or “false” makes model outputs easy to wire directly into alarms and hardware control.
4. Specific visual rules like “person wearing an orange jacket” can be implemented via chained text prompts rather than training bounding-box detectors.
5. Curtains/blinds monitoring uses the 4B model for better accuracy, showing a practical speed–accuracy tradeoff.
6. Event logging records timestamps, snapshot numbers, camera identifiers, and descriptions to create an audit trail.
7. A drone can be gated by the model’s boolean output: it arms in advance but only performs deterrent maneuvers when the orange-jacket condition is met.