Quantization — Topic Summaries
AI-powered summaries of 8 videos about Quantization.
Something Strange Happens When You Trust Quantum Mechanics
Quantum particles don’t follow a single, definite route between two points. Instead, they effectively “try” every possible path at once, and the...
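The idea this summary gestures at is Feynman's sum over histories: every path from a to b contributes a phase set by its classical action, written schematically as below (a standard textbook form, not taken from the video itself).

```latex
% Feynman propagator: the amplitude to go from a at t_a to b at t_b sums a
% phase e^{iS/hbar} over every path x(t), where S is the classical action.
K(b, a) = \int \mathcal{D}[x(t)] \; e^{\, i S[x(t)] / \hbar},
\qquad
S[x(t)] = \int_{t_a}^{t_b} L\!\left(x, \dot{x}, t\right) \, dt
```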
Ollama - Local Models on your machine
Ollama is a user-friendly way to run large language models locally on a Mac or Linux machine by downloading them and serving them through a local...
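A minimal sketch of talking to that local server over Ollama's REST API, assuming `ollama serve` is running on its default port and a model has already been pulled (the model name "llama3" is illustrative):

```python
# Query a locally running Ollama server via its /api/generate endpoint.
import json
import urllib.request

payload = json.dumps({
    "model": "llama3",                       # assumed: any model you have pulled
    "prompt": "Explain 4-bit quantization in one sentence.",
    "stream": False,                          # return one JSON object instead of a stream
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:11434/api/generate",    # Ollama's default local port
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```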
Generative AI Fine Tuning LLM Models Crash Course
Fine-tuning large language models becomes practical on limited hardware when three ideas work together: quantization to shrink model weights,...
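The "shrink model weights" part boils down to storing weights in a low-precision integer format plus a scale. A toy sketch of symmetric int8 quantization on random weights (real libraries use per-channel or block-wise schemes, but the idea is the same):

```python
# Toy per-tensor symmetric int8 quantization: store weights as int8 plus one
# float scale, dequantize at compute time. Illustrative only.
import numpy as np

weights = np.random.randn(4, 4).astype(np.float32)

scale = np.abs(weights).max() / 127.0                               # per-tensor scale
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)   # ~4x smaller storage
dequant = q.astype(np.float32) * scale                              # approximate reconstruction

print("max abs error:", np.abs(weights - dequant).max())
```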
QLoRA is all you need (Fast and lightweight model fine-tuning)
QLoRA (quantized low-rank adapters) is positioned as a practical, lightweight way to fine-tune large language models without the months-long,...
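A sketch of the usual QLoRA recipe with Hugging Face Transformers, bitsandbytes, and PEFT: load the base model in 4-bit NF4, freeze it, and train only small LoRA adapters. The model id and `target_modules` are assumptions and depend on the architecture you actually fine-tune:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 weight quantization
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16, # compute in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # example base model; substitute any causal LM
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections; model-dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the adapter weights are trainable
```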
EmbeddingGemma - Micro Embeddings for Mobile Devices
EmbeddingGemma is a family of tiny, text-only embedding models designed to run on-device, enabling retrieval, semantic search, clustering, and “micro...
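A small retrieval sketch using the sentence-transformers interface; the model id "google/embeddinggemma-300m" is an assumption, and any compact embedding model exposes the same `encode()` call:

```python
# Embed a few documents and rank them against a query by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("google/embeddinggemma-300m")  # assumed model id

docs = [
    "How to quantize a model to 4 bits",
    "Recipe for sourdough bread",
    "Running LLMs locally with Ollama",
]
query = "shrink model weights for mobile inference"

doc_emb = model.encode(docs)                 # one vector per document
query_emb = model.encode(query)
scores = util.cos_sim(query_emb, doc_emb)    # cosine similarity ranking
print(docs[int(scores.argmax())])
```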
Run any LLMs locally: Ollama | LM Studio | GPT4All | WebUI | HuggingFace Transformers
Running large language models locally boils down to one trade-off: keeping data on-device and gaining control over models and prompts, while paying...
Deploying Local LLM but It Is Slow? Here's How to Fix It (Hopefully) | LLMOps with vLLM
Deploying a local LLM can feel painfully slow when using the default Hugging Face Transformers inference pipeline, but switching to vLLM can cut...
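A sketch of swapping a default Transformers generate loop for vLLM's offline engine, which batches requests and uses paged attention for higher throughput. The model id is illustrative:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")          # small model just for illustration
params = SamplingParams(temperature=0.7, max_tokens=64)

prompts = [
    "Summarize what weight quantization does.",
    "Why is batched inference faster than one request at a time?",
]
# vLLM batches the prompts internally instead of generating one at a time.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```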
Hardware/Mobile (7) - Testing & Deployment - Full Stack Deep Learning
Deploying deep learning models on mobile and embedded hardware is less about model design in the abstract and more about surviving the constraints of...
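One common way of surviving those constraints is post-training quantization. A minimal PyTorch sketch of dynamic quantization on a toy model (Linear weights stored as int8 and dequantized on the fly), not the specific pipeline from the lecture:

```python
import torch
import torch.nn as nn

# Toy model standing in for whatever you plan to ship to the device.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))

# Replace Linear layers with int8-weight versions; activations stay float.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 256)
print(quantized(x).shape)   # same interface, smaller weights
```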