
Open Source LLMs on GOD mode. Local LLMs MAXXED OUT on the RTX 5090!

MattVidPro · 5 min read

Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

LM Studio can run open-weight LLMs locally on an RTX 5090, enabling full offline access once models are downloaded.

Briefing

Running large language models entirely on a home PC is no longer a novelty—it’s practical, fast, and surprisingly capable when paired with a high-VRAM GPU. Using LM Studio on an RTX 5090, the workflow turns “download once, run forever” into a real option: DeepSeek R1 and other open-weight models generate long answers locally with no internet dependency, while still delivering usable speeds and context windows.
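For reference, LM Studio can expose whatever model is loaded through a local OpenAI-compatible server (enabled from its Developer tab, listening on port 1234 by default). Below is a minimal sketch of querying it from Python; the model name is a placeholder for whichever DeepSeek R1 distill is actually loaded:

```python
import requests

# LM Studio's local server speaks the OpenAI chat-completions protocol.
# Port 1234 is LM Studio's default; adjust if you changed it.
URL = "http://localhost:1234/v1/chat/completions"

payload = {
    # Placeholder identifier; use the name LM Studio shows for your loaded model.
    "model": "deepseek-r1-distill-qwen-7b",
    "messages": [
        {"role": "user", "content": "Write a kid-friendly jingle about recycling."}
    ],
    "temperature": 0.7,
}

resp = requests.post(URL, json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

Because the protocol matches the OpenAI API, any existing client library pointed at localhost works unchanged, which is what makes the "download once, run forever" workflow practical.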

The most concrete results come from testing DeepSeek R1 variants at different sizes and quantization levels. A 7B model (quant 8) is configured with an 8,000-token context window and full GPU offload across all model layers. It responds at roughly 77.8 tokens per second with a first-token latency around 0.3 seconds, and the system can visibly track the model’s reasoning chain while it attempts tasks like solving a Rubik’s Cube and producing a kid-friendly jingle. Scaling up to a 14B model (also quant 8, with the same 8,000-token context) increases VRAM usage to about 18 GB and still produces a “comprehensive business plan” quickly—around 47 tokens per second—suggesting that larger open models remain interactive on consumer hardware.

The experiment then pushes toward the edge of what the RTX 5090 can hold. A 32B model at quant 4 runs with full GPU offload and uses about 20 GB of VRAM, generating detailed political career guidance at roughly 33 tokens per second. Switching to a 32B quant 8 configuration saturates the full 32 GB of GPU memory and spills into system RAM and even disk offload. Generation slows dramatically (down to single-digit tokens per second in one run), and the creator concludes quant 8 is not worth it for day-to-day use: the quant 4 distilled model offers a better balance of speed, quality, and resource consumption.
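The memory math behind that cliff is straightforward. A back-of-envelope sketch of weight storage alone (illustrative only; real loaders add per-layer overhead, and the KV cache grows with context length):

```python
# Rough VRAM estimate for the weights alone, ignoring the KV cache
# and runtime overhead (both add several GB on top).
def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1024**3

for bits in (4, 8):
    print(f"32B at Q{bits}: ~{weight_vram_gb(32, bits):.1f} GB of weights")

# 32B at Q4: ~14.9 GB of weights -> fits in 32 GB with cache and overhead
# 32B at Q8: ~29.8 GB of weights -> leaves almost nothing for the KV
#                                   cache, so the load spills to RAM/disk
```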

Beyond text, the setup demonstrates local multimodal AI using Gemma 3 27B (quant 4) with image understanding enabled in LM Studio. The model streams responses quickly, at over 50 tokens per second with very low first-token latency, and can roast a user’s room from a photo, dissect memes, and analyze posters in detail. The results are strong on interpreting text and context, though character identification can slip when an image resembles a known pop-culture figure.
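Image inputs go through the same local endpoint using the OpenAI-style vision message format, with the image embedded as a base64 data URL. A sketch under that assumption, with the file path and model name as placeholders:

```python
import base64
import requests

URL = "http://localhost:1234/v1/chat/completions"

# Encode a local image as a base64 data URL, the format the
# OpenAI-compatible vision endpoint expects.
with open("room.jpg", "rb") as f:  # placeholder path
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "gemma-3-27b-it",  # placeholder name for the loaded vision model
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Roast this room."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
}

resp = requests.post(URL, json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```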

Finally, the tests include a much smaller 360M-parameter quant 8 model to measure how far “tiny LLM” performance can go. With aggressive evaluation batch sizes, it generates hundreds of tokens per second (roughly 400) with near-instant first-token responses, producing usable story and recipe-style outputs. The overall takeaway is that a high-end GPU plus open-weight models can deliver both high-quality reasoning and multimodal analysis locally—fast enough to feel conversational—while reserving heavier workloads for the best hardware and the right quantization settings.

The series’ next step is positioned as image and video generation, with diffusion and video models expected to be more complex to set up than local chat and vision in LM Studio.

Cornell Notes

Local LLMs become genuinely usable on a home PC when they’re run with LM Studio on a high-VRAM GPU like the RTX 5090. DeepSeek R1 variants show that 7B and 14B models can run with full GPU offload and an 8,000-token context window at interactive speeds, while 32B models require careful quantization to avoid heavy RAM/disk offloading. The quant 4 distilled model stays practical; quant 8 can saturate 32 GB of VRAM and slow generation sharply when it overflows into system memory and storage. Multimodal performance is demonstrated with Gemma 3 27B (quant 4), which can roast images and dissect memes/posters locally. Even a 360M-parameter model can generate hundreds of tokens per second, making summaries and creative text tasks fast.

What settings made the 7B DeepSeek R1 run feel fast and responsive locally?

The 7B model was loaded in LM Studio with an 8,000-token context window and full GPU offload (28/28 model layers). The evaluation batch size was increased (to 248), trading extra memory use for speed. With the RTX 5090’s VRAM headroom, this configuration used about 10 GB of VRAM and produced roughly 77.8 tokens per second with about 0.3 seconds to the first token.
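Both numbers, throughput and first-token latency, are easy to reproduce against the local server by timing a streamed response. A sketch, assuming streamed deltas roughly correspond to tokens and using the same placeholder model name as above:

```python
import json
import time
import requests

URL = "http://localhost:1234/v1/chat/completions"

payload = {
    "model": "deepseek-r1-distill-qwen-7b",  # placeholder for the loaded model
    "messages": [{"role": "user", "content": "Explain step one of solving a Rubik's Cube."}],
    "stream": True,  # stream deltas so we can time the first token
}

start = time.perf_counter()
first = None
chunks = 0

with requests.post(URL, json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # OpenAI-style server-sent events: each event looks like "data: {...}"
        if not line.startswith(b"data: "):
            continue
        body = line[6:]
        if body == b"[DONE]":
            break
        delta = json.loads(body)["choices"][0]["delta"]
        if delta.get("content"):
            if first is None:
                first = time.perf_counter()
            chunks += 1

end = time.perf_counter()
if first is not None:
    print(f"time to first token: {first - start:.2f}s")
    print(f"~{chunks / max(end - first, 1e-9):.1f} streamed chunks/s (roughly tokens/s)")
```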

How did scaling from 7B to 14B affect speed and memory, and what task was used to judge capability?

Moving to 14B (quant 8) kept the 8,000-token context window and full GPU offload, but VRAM usage rose to about 18 GB. Generation slowed to around 47.28 tokens per second, yet remained quick enough to answer a “comprehensive business plan” prompt with reasoning about AI’s impact on small businesses and revenue growth. The output was shorter than what newer frontier models produce, but it still described actionable steps and target markets.

Why did the 32B quant 8 attempt become much slower than quant 4?

Quant 8 pushed memory beyond what the GPU could hold comfortably. The system saturated the full 32 GB of GPU VRAM and then spilled into system RAM (and even disk offload). That offloading became the bottleneck: one run dropped to about 2.8 tokens per second, with generation speed limited by the slower memory/storage path rather than the GPU’s compute.
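A rough sanity check on that bottleneck: during decoding, every generated token must read the offloaded layers over the PCIe/system-RAM path, so throughput is capped near path bandwidth divided by the spilled weight bytes. The numbers below are illustrative assumptions, not measurements from the video:

```python
# Why offloading collapses decode speed: the slow path has to carry the
# spilled weights once per generated token. Illustrative numbers only.
offloaded_gb = 8   # assumed weight spill into system RAM
pcie_gbps = 32     # rough effective bandwidth of a PCIe 4.0 x16 link, GB/s

# Upper bound on decode speed imposed by the offloaded portion alone:
tokens_per_s = pcie_gbps / offloaded_gb
print(f"offload-limited ceiling: ~{tokens_per_s:.1f} tokens/s")
# -> ~4 tokens/s, the same order of magnitude as the ~2.8 tokens/s observed
```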

What evidence showed that Gemma 3 27B was truly multimodal and running locally?

LM Studio displayed a vision capability indicator, and the model accepted images alongside text. The tests included sending a photo to “roast” the user’s room and asking the model to dissect memes and posters. Responses streamed quickly (about 53.54 tokens per second with ~0.17 seconds to first token), and the creator emphasized that the analysis happened without internet access.

What did the small 360M model reveal about throughput and practical limits?

The 360M-parameter quant 8 model generated extremely fast, at around 411 tokens per second with near-instant first-token latency. Increasing the evaluation batch size from 8,192 to 65,536 did not scale throughput linearly; generation speed leveled off, suggesting a ceiling beyond which larger batches no longer improve performance. Despite the smaller size, outputs for stories and recipe/decor ideas were still usable.

Review Questions

  1. When does quantization help most for local LLMs, and when can it backfire by forcing RAM/disk offload?
  2. How do context length and evaluation batch size trade off against VRAM usage and generation speed in these tests?
  3. What kinds of image understanding tasks did Gemma 3 27B handle well, and where did it make noticeable mistakes?

Key Points

  1. LM Studio can run open-weight LLMs locally on an RTX 5090, enabling full offline access once models are downloaded.

  2. DeepSeek R1 7B (quant 8) with an 8,000-token context and full GPU offload generated around 77.8 tokens per second with ~0.3s first-token latency.

  3. DeepSeek R1 14B (quant 8) increased VRAM usage to about 18 GB while still producing detailed outputs at roughly 47 tokens per second.

  4. 32B quant 4 stayed practical with full GPU offload and ~20 GB VRAM usage, but 32B quant 8 saturated 32 GB VRAM and spilled into RAM/disk, dropping speed to single-digit tokens per second.

  5. Gemma 3 27B (quant 4) demonstrated local multimodal capability, streaming image-based roasts and meme/poster dissections at ~50+ tokens per second.

  6. A 360M-parameter quant 8 model generated hundreds of tokens per second and showed that evaluation batch size has diminishing returns beyond a certain point.

Highlights

A 7B DeepSeek R1 run hit ~77.8 tokens/second with an 8,000-token context window using full GPU offload on the RTX 5090.
Pushing DeepSeek R1 32B to quant 8 filled all 32 GB VRAM and forced RAM/disk offload, collapsing generation speed to about 2.8 tokens/second.
Gemma 3 27B handled image-based tasks locally—roasting a room photo and dissecting memes/posters—while streaming responses at ~50+ tokens per second.
Even a 360M model produced ~400+ tokens per second with near-instant first-token latency, making small-model local use feel snappy.

Mentioned

  • LM Studio
  • NVIDIA RTX 5090
  • DeepSeek R1
  • Gemma 3 27B