Open Source LLMs on GOD mode. Local LLMs MAXXED OUT on the RTX 5090!
Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
LM Studio can run open-weight LLMs locally on an RTX 5090, enabling full offline access once models are downloaded.
Briefing
Running large language models entirely on a home PC is no longer a novelty—it’s practical, fast, and surprisingly capable when paired with a high-VRAM GPU. Using LM Studio on an RTX 5090, the workflow turns “download once, run forever” into a real option: DeepSeek R1 and other open-weight models generate long answers locally with no internet dependency, while still delivering usable speeds and context windows.
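For readers who want to script against a locally loaded model rather than use the chat UI, a minimal sketch follows. It assumes LM Studio's OpenAI-compatible local server is enabled on its default port (1234); the model identifier is a placeholder for whatever name the downloaded DeepSeek R1 distill shows in LM Studio.

```python
# Minimal sketch: querying a model loaded in LM Studio through its
# OpenAI-compatible local server (default http://localhost:1234/v1).
# No internet access is needed once the model is downloaded.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="deepseek-r1-distill-qwen-7b",  # placeholder identifier
    messages=[{"role": "user",
               "content": "Write a kid-friendly jingle about brushing teeth."}],
)
print(response.choices[0].message.content)
```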
The most concrete results come from testing DeepSeek R1 variants at different sizes and quantization levels. A 7B model (quant 8) is configured with an 8,000-token context window and full GPU offload across all model layers. It responds at roughly 77.8 tokens per second with a first-token latency of around 0.3 seconds, and LM Studio exposes the model's reasoning chain as it works through tasks like solving a Rubik's Cube and writing a kid-friendly jingle. Scaling up to a 14B model (also quant 8, with the same 8,000-token context) raises VRAM usage to about 18 GB while still producing a "comprehensive business plan" quickly, at around 47 tokens per second, suggesting that larger open models remain interactive on consumer hardware.
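The video applies these settings in the LM Studio GUI. As a rough script-level equivalent, the sketch below uses llama-cpp-python (not the tool from the video) with a placeholder GGUF path, the same 8,000-token context and full GPU offload, and times first-token latency and throughput by hand.

```python
# Sketch of an equivalent setup outside the LM Studio GUI, using
# llama-cpp-python: 8,000-token context, every layer offloaded to the GPU,
# with first-token latency and tokens/sec measured manually.
# The GGUF path is a placeholder for any Q8_0 DeepSeek R1 distill.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-Distill-Qwen-7B-Q8_0.gguf",  # placeholder path
    n_ctx=8000,        # context window used in the video
    n_gpu_layers=-1,   # -1 = offload all layers to the GPU
)

start = time.perf_counter()
first_token_at = None
n_tokens = 0
# Stream the completion so the first chunk marks first-token latency.
for chunk in llm("Explain how to solve a Rubik's Cube.", max_tokens=512, stream=True):
    if first_token_at is None:
        first_token_at = time.perf_counter()
    n_tokens += 1
elapsed = time.perf_counter() - start

print(f"first token: {first_token_at - start:.2f}s, "
      f"throughput: {n_tokens / elapsed:.1f} tok/s")
```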
The experiment then pushes toward the edge of what the RTX 5090 can hold. A 32B model at quant 4 runs with full GPU offload and uses about 20 GB of VRAM, generating detailed political career guidance at roughly 33 tokens per second. Switching to a 32B quant 8 configuration saturates the full 32 GB of GPU memory and spills into system RAM and even disk offload. Generation slows dramatically, down to single-digit tokens per second in one run, and the creator concludes that quant 8 is not worth it for day-to-day use: the quant 4 distilled model offers a better balance of speed, quality, and resource consumption.
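A back-of-envelope, weights-only estimate helps explain the spillover: at roughly 8 bits per weight, a 32B model's weights alone exceed the card's 32 GB before the KV cache is even counted. The bit widths below are illustrative assumptions for typical GGUF quants, not measurements from the video.

```python
# Weights-only VRAM estimate (ignores KV cache and runtime overhead),
# showing why a 32B model fits at ~4 bits per weight but not at ~8 bits
# on a 32 GB card. Bit widths are illustrative, not measured.
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for bits in (4.5, 8.5):  # rough effective bits for Q4- and Q8-style quants
    print(f"32B at ~{bits} bits/weight ~= {weight_gb(32, bits):.0f} GB of weights")
# 32B at ~4.5 bits/weight ~= 18 GB  -> fits in 32 GB with room for the KV cache
# 32B at ~8.5 bits/weight ~= 34 GB  -> exceeds 32 GB, so it spills to RAM/disk
```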
Beyond text, the setup demonstrates local multimodal AI using Gemma 3 27B (quant 4) with image understanding enabled in LM Studio. The model streams responses quickly, at over 50 tokens per second with very low first-token latency, and can roast a user's room from a photo, dissect memes, and analyze posters in detail. The results are strong on interpreting text and context, though character-level accuracy can slip when the image resembles a known pop-culture figure.
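To script the same kind of image prompt, the sketch below sends a base64-encoded photo through LM Studio's OpenAI-compatible endpoint using the standard vision message format; the model identifier and image path are placeholders.

```python
# Sketch: sending an image to a locally loaded vision-capable model
# (e.g. a Gemma 3 27B Q4 build) via LM Studio's OpenAI-compatible server.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

with open("room.jpg", "rb") as f:  # placeholder image
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gemma-3-27b-it",  # placeholder identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Roast this room."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```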
Finally, the tests include a much smaller 360M-parameter quant 8 model to measure how far "tiny LLM" performance can go. With aggressive evaluation batch sizes, it generates at around 400+ tokens per second with near-instant first-token responses, producing usable story and recipe-style outputs. The overall takeaway is that a high-end GPU plus open-weight models can deliver both high-quality reasoning and multimodal analysis locally, fast enough to feel conversational, provided the heaviest workloads are matched to the hardware's limits and the right quantization settings.
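One way to probe the batch-size claim yourself is to sweep the setting and time the output. The sketch below uses llama-cpp-python's n_batch as a stand-in for LM Studio's evaluation batch size slider, with a placeholder path for a small Q8 model; it times end-to-end generation rather than reproducing the video's exact measurement.

```python
# Sketch: measuring how end-to-end throughput changes as the evaluation
# batch size grows (n_batch stands in for LM Studio's slider).
# The model path is a placeholder for any small ~360M-parameter Q8 GGUF.
import time
from llama_cpp import Llama

for n_batch in (128, 512, 2048):
    llm = Llama(
        model_path="smollm-360m-q8_0.gguf",  # placeholder path
        n_ctx=2048,
        n_gpu_layers=-1,
        n_batch=n_batch,
        verbose=False,
    )
    start = time.perf_counter()
    out = llm("Write a two-sentence bedtime story.", max_tokens=256)
    tokens = out["usage"]["completion_tokens"]
    print(f"n_batch={n_batch}: {tokens / (time.perf_counter() - start):.0f} tok/s")
```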
The series’ next step is positioned as image and video generation, with diffusion and video models expected to be more complex to set up than local chat and vision in LM Studio.
Cornell Notes
Local LLMs become genuinely usable on a home PC when they’re run with LM Studio on a high-VRAM GPU like the RTX 5090. DeepSeek R1 variants show that 7B and 14B models can run with full GPU offload and an 8,000-token context window at interactive speeds, while 32B models require careful quantization to avoid heavy RAM/disk offloading. The quant 4 distilled model stays practical; quant 8 can saturate 32 GB of VRAM and slow generation sharply when it overflows into system memory and storage. Multimodal performance is demonstrated with Gemma 3 27B (quant 4), which can roast images and dissect memes and posters locally. Even a 360M-parameter model can generate hundreds of tokens per second, making summaries and creative text tasks fast.
- What settings made the 7B DeepSeek R1 run feel fast and responsive locally?
- How did scaling from 7B to 14B affect speed and memory, and what task was used to judge capability?
- Why did the 32B quant 8 attempt become much slower than quant 4?
- What evidence showed that Gemma 3 27B was truly multimodal and running locally?
- What did the small 360M model reveal about throughput and practical limits?
Review Questions
- When does quantization help most for local LLMs, and when can it backfire by forcing RAM/disk offload?
- How do context length and evaluation batch size trade off against VRAM usage and generation speed in these tests?
- What kinds of image understanding tasks did Gemma 3 27B handle well, and where did it make noticeable mistakes?
Key Points
1. LM Studio can run open-weight LLMs locally on an RTX 5090, enabling full offline access once models are downloaded.
2. DeepSeek R1 7B (quant 8) with an 8,000-token context and full GPU offload generated around 77.8 tokens per second with ~0.3s first-token latency.
3. DeepSeek R1 14B (quant 8) increased VRAM usage to about 18 GB while still producing detailed outputs at roughly 47 tokens per second.
4. 32B quant 4 stayed practical with full GPU offload and ~20 GB VRAM usage, but 32B quant 8 saturated 32 GB VRAM and spilled into RAM/disk, dropping speed to single-digit tokens per second.
5. Gemma 3 27B (quant 4) demonstrated local multimodal capability, streaming image-based roasts and meme/poster dissections at ~50+ tokens per second.
6. A 360M-parameter quant 8 model generated hundreds of tokens per second and showed that evaluation batch size has diminishing returns beyond a certain point.