Run your own AI (but private)

NetworkChuck · 6 min read

Based on NetworkChuck's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Run open LLMs locally to keep prompts and documents off third-party servers, enabling “private AI” even without internet access.

Briefing

Local “private AI” is becoming practical: a person can run an LLM entirely on a laptop or workstation, keep data off third-party servers, and then extend it to answer questions about personal notes or company documents using RAG (retrieval-augmented generation). The core setup is built around downloading an open model and running it through a local inference tool, so the model runs without internet access and without shipping prompts or files to a remote provider.

The walkthrough starts by demystifying what an AI model is—an LLM pre-trained on large datasets—and points to Hugging Face as a catalog of hundreds of thousands of models, including Llama 2 variants. Llama 2 is highlighted as a large language model trained on trillions of tokens and instruction data, with training described in terms of massive compute (thousands of GPUs, millions of GPU hours, and an estimated tens of millions of dollars). The key takeaway isn’t the training bill; it’s that the resulting model can be downloaded and run locally.

To run those models, the guide centers on Ollama (ollama.ai). After installing Ollama on macOS or Linux—or on Windows via WSL (Windows Subsystem for Linux)—the user can pull a model like “Llama 2” and start chatting immediately. Performance depends heavily on hardware: GPUs speed things up, while CPU-only setups are slower. The transcript also notes that Mac systems with M1/M2/M3 chips can work well, and that Nvidia GPU drivers may be needed in WSL for full acceleration.
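
The transcript drives all of this from the command line, but the same local setup can also be exercised programmatically. Below is a minimal sketch, assuming Ollama is installed, a model tagged llama2 has already been pulled, and the runtime is listening on its default local port (11434, part of Ollama's documented HTTP API rather than anything shown in the video); nothing in it reaches the internet.

    import json
    import urllib.request

    # Ask the locally running Llama 2 model a question through Ollama's
    # /api/generate endpoint. The prompt never leaves this machine.
    payload = json.dumps({
        "model": "llama2",   # any model tag already pulled with "ollama run"/"ollama pull"
        "prompt": "Explain retrieval-augmented generation in one sentence.",
        "stream": False,     # return a single JSON object instead of a token stream
    }).encode("utf-8")

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])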

Running a local model solves privacy, but it doesn’t automatically know personal facts or internal business information. The transcript illustrates this with “hallucination” examples—questions about identities or current events can come back wrong because the model’s training data is incomplete or outdated. That gap leads to the next step: teaching the AI the right context.

Two approaches are presented. Fine-tuning adapts the model by training it on new, proprietary examples—useful for internal procedures, help-desk knowledge, or product documentation. The transcript contrasts the enormous resources needed for original pre-training with the smaller scale of fine-tuning, describing an example workflow where only a small fraction of parameters are updated using a few thousand labeled examples. The other approach is RAG, which avoids retraining by connecting the LLM to a vector database (PostgreSQL is mentioned) so it can retrieve relevant passages from an internal knowledge base before answering.
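
To make the RAG flow concrete, here is a toy retrieve-then-generate sketch that reuses the local Ollama endpoint from above. The two "documents", the term-count similarity scoring, and the question are placeholders invented for illustration; a real deployment would use an embedding model and a vector database (such as the PostgreSQL-based store the transcript mentions), but the shape is the same: find the most relevant passage first, then ask the local model to answer from it.

    import json
    import math
    import urllib.request
    from collections import Counter

    # Toy "knowledge base" standing in for internal documents.
    documents = [
        "Ticket escalation: sev-1 incidents page the on-call engineer within 15 minutes.",
        "Expense policy: meals over $50 require a receipt and manager approval.",
    ]

    def vectorize(text):
        # Crude term-count vector; a real system would call an embedding model.
        return Counter(text.lower().split())

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in set(a) & set(b))
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    question = "How fast do we have to respond to a sev-1 incident?"
    q_vec = vectorize(question)
    best = max(documents, key=lambda d: cosine(q_vec, vectorize(d)))

    # Ground the answer in the retrieved passage instead of the model's memory.
    prompt = f"Answer using only this context:\n{best}\n\nQuestion: {question}"
    payload = json.dumps({"model": "llama2", "prompt": prompt, "stream": False}).encode("utf-8")
    req = urllib.request.Request("http://localhost:11434/api/generate", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])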

VMware’s role is framed as packaging the infrastructure and tooling needed for private AI in enterprise environments. The transcript references VMware Private AI with Nvidia, plus supporting ecosystems involving Intel and IBM, and describes a typical enterprise stack: vSphere for virtualized infrastructure, deep learning VMs preloaded with tools, GPU assignment via passthrough, and a data scientist workflow using Jupyter notebooks and datasets of prompt/answer pairs. RAG is positioned as the practical bridge between a general LLM and an organization’s private documents.

Finally, the transcript demonstrates a more hands-on “Private GPT” side project (separate from VMware) that uses RAG to let a user upload documents and chat with them through a web interface. In the example, a user ingests markdown journal entries and asks questions like what happened in specific places and when—showing the promise of private, document-grounded Q&A even if results aren’t perfect. The overall message: local inference plus RAG is a credible path to AI that stays private while becoming useful for real work.

Cornell Notes

Local private AI can be run on a laptop by downloading an open LLM and serving it locally, keeping prompts and data off third-party servers. The transcript uses Ollama to install and run models like Llama 2, with speed depending on GPU availability (Nvidia via WSL, or Mac M1/M2/M3). Because local models can still be wrong or outdated, the guide then turns to customization: fine-tuning on proprietary examples or using RAG to retrieve answers from a private document store before responding. VMware Private AI with Nvidia is presented as an enterprise-ready bundle for the infrastructure and tooling needed for fine-tuning and deployment, while a separate “Private GPT” project shows how RAG can power chat with uploaded notes and journals. This matters because privacy and security constraints often block workplace use of public chatbots.

How does running an LLM locally change privacy compared with using a public chatbot?

Local inference keeps the model and conversation on the user’s machine. The transcript describes “private contained” operation where the AI runs on the computer and does not require internet access, so prompts and data aren’t sent to a third-party company. That distinction matters in workplaces where privacy and security rules can prevent employees from using public LLM services.

What is Ollama, and what steps are needed to run a model like Llama 2 on different operating systems?

Ollama (ollama.ai) is the local runtime used to download and run LLMs. The transcript says it’s available on macOS and Linux, while Windows uses WSL (Windows Subsystem for Linux). After installing WSL (via a terminal command) and then installing Ollama inside Linux, the user runs a command such as “ollama run llama2” to select and start a model. The first run downloads the model manifest and weights.
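
As a quick sanity check after installation, the local API can confirm that the runtime is up and list which models have already been pulled. This assumes Ollama's default port of 11434; the /api/tags endpoint comes from Ollama's documented HTTP API, not from the transcript.

    import json
    import urllib.request

    # List locally available models, e.g. after the first "ollama run llama2"
    # has finished downloading the manifest and weights.
    with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
        tags = json.loads(resp.read())

    for model in tags.get("models", []):
        print(model["name"])   # e.g. "llama2:latest"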

Why do local models still produce incorrect answers, and how does the transcript demonstrate that?

Local LLMs inherit limitations from their training data. The transcript gives examples where the model confuses identities (e.g., mixing up “NetworkChuck” with incorrect personal details) and answers about a person or topic using outdated or wrong information. It also shows a different model (“Mistral”) answering “who is NetworkChuck” incorrectly, reinforcing that local doesn’t automatically mean accurate or up-to-date.

What’s the difference between fine-tuning and RAG for making an LLM useful with private documents?

Fine-tuning retrains or adapts the model using proprietary examples—prompt/answer pairs—so the model’s behavior changes based on new training data. RAG (retrieval-augmented generation) instead keeps the model mostly unchanged and connects it to a private knowledge base (via a vector database such as one built on PostgreSQL). Before answering, the system retrieves relevant passages from the documents and grounds responses in that retrieved content.
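
The retrieval half of RAG can be as small as one SQL query. The sketch below assumes PostgreSQL with the pgvector extension and an illustrative passages table; the schema, connection string, and placeholder embedding are assumptions for illustration, not details from the transcript. The rows it returns would then be pasted into the prompt before the local model answers.

    import psycopg2

    # Connect to a PostgreSQL database that has the pgvector extension enabled
    # and a table like: passages(content text, embedding vector(N)).
    conn = psycopg2.connect("dbname=knowledge user=rag_app")
    cur = conn.cursor()

    # Placeholder query embedding; in practice this comes from the same
    # embedding model used when the documents were ingested.
    question_embedding = [0.12, -0.03, 0.88]
    vector_literal = "[" + ",".join(str(x) for x in question_embedding) + "]"

    # pgvector's "<->" operator orders rows by distance to the query vector,
    # so the closest passages come back first.
    cur.execute(
        "SELECT content FROM passages ORDER BY embedding <-> %s::vector LIMIT 3",
        (vector_literal,),
    )
    for (content,) in cur.fetchall():
        print(content)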

What does the VMware Private AI with Nvidia stack add for enterprises?

The transcript frames VMware as packaging the complex setup: vSphere-based infrastructure, deep learning VMs preloaded with tools, GPU assignment (including passthrough), and an environment where data scientists can prepare datasets and run fine-tuning. It also mentions Nvidia’s AI tooling and broader partner coverage (Intel and IBM), aiming to reduce the guesswork of assembling the full pipeline for private LLM deployment.

How does the “Private GPT” side project work in the transcript’s hands-on example?

The transcript describes a separate “Private GPT” project (not Ollama-based) that uses RAG. The user sets it up on Windows using WSL and an Nvidia GPU (including driver installation). They then upload documents—first a VMware-related article, then markdown journal entries—ingest the folder via a command, and query the system through a web browser. Example questions include asking what happened in Takayama and what was eaten in Tokyo, showing document-grounded Q&A potential even if not perfect.
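
Private GPT's internals aren't shown in the transcript, but the "ingest" step it describes generally boils down to reading files and splitting them into overlapping chunks that can then be embedded and indexed. The sketch below illustrates only that first stage with the markdown journal example; the folder name and chunk sizes are arbitrary assumptions.

    import pathlib

    # Split a folder of markdown journal entries into overlapping chunks,
    # the usual precursor to embedding and indexing in a RAG pipeline.
    journal_dir = pathlib.Path("journal")   # e.g. one .md file per trip or day
    chunk_size, overlap = 800, 100          # characters per chunk, with overlap for context

    chunks = []
    for md_file in sorted(journal_dir.glob("*.md")):
        text = md_file.read_text(encoding="utf-8")
        step = chunk_size - overlap
        for start in range(0, len(text), step):
            chunks.append({"source": md_file.name, "text": text[start:start + chunk_size]})

    print(f"{len(chunks)} chunks ready to embed and index")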

Review Questions

  1. What two techniques are presented for making a local LLM answer questions about private information, and how does each one change the system?
  2. Why does GPU availability matter for local LLM performance, and what setup path does the transcript recommend for Windows users?
  3. In the transcript’s enterprise workflow, what roles do vSphere and deep learning VMs play in preparing data scientists to fine-tune or deploy LLMs?

Key Points

  1. Run open LLMs locally to keep prompts and documents off third-party servers, enabling “private AI” even without internet access.
  2. Use Ollama to download and run models like Llama 2; on Windows, rely on WSL to get a Linux environment for installation and execution.
  3. Expect speed differences: Nvidia GPUs (including via WSL driver setup) make local chatting much faster than CPU-only runs.
  4. Local LLMs can still be wrong or outdated, so accuracy requires grounding—either fine-tuning on proprietary examples or using RAG to retrieve from a private knowledge base.
  5. RAG uses a vector database (PostgreSQL is mentioned) to fetch relevant passages before generating an answer, reducing the need for retraining.
  6. VMware Private AI with Nvidia is positioned as an enterprise bundle that prepackages infrastructure, deep learning VMs, and tooling to simplify fine-tuning and deployment.
  7. A separate “Private GPT” RAG project demonstrates document upload and chat with personal markdown journals, showing the practical workflow beyond enterprise tooling.

Highlights

Ollama turns local hardware into an LLM workstation: install it, start a model with “ollama run”, and the weights download on first use so chatting works without requiring internet access.
Privacy isn’t just about where the model runs; it’s also about how answers get grounded—RAG retrieves from private documents instead of relying on the model’s memory.
Fine-tuning is framed as smaller-scale than original pre-training: a few thousand examples and limited parameter updates can shift behavior for a specific use case.
VMware’s pitch is packaging the hard parts—vSphere infrastructure, deep learning VMs, GPU assignment, and data-science tooling—so companies can run private AI on-prem.
The “Private GPT” demo shows the end goal: upload notes or journals, ingest them, and ask questions through a web interface grounded in retrieved content.
