The A-to-Z AI Literacy Guide (2025 Edition)
Based on AI News & Strategy Daily | Nate B Jones's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
AI literacy in 2025 comes down to understanding how language models turn text into tokens, map those tokens into mathematical meaning, and then generate outputs under specific constraints. Mastering a compact set of concepts—especially tokenization, embeddings, latent space, and positional encoding—lets users predict why an AI produces a given answer, why it sometimes “hallucinates,” and how to steer it toward more reliable results.
At the foundation is tokenization: models don’t read letters; they split text into token chunks (sometimes whole words, sometimes word fragments, sometimes punctuation). That design explains common failure modes like miscounting letters inside words—because the model “sees” chunks rather than individual characters. Next comes embeddings, which assign each token a vector of numbers that places it in a semantic space. Similar concepts cluster together, enabling operations like analogical reasoning (for example, “king − man + woman” landing near “queen”) and powering context matching.
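The embedding arithmetic above can be sketched with toy vectors. These are hand-picked 3-dimensional values chosen purely for illustration (real embeddings have hundreds or thousands of dimensions learned from data), but they show how “king − man + woman” can land nearest to “queen” under cosine similarity:

```python
import math

# Toy hand-picked embeddings (illustrative only, not learned values).
emb = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity: how closely two vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Vector arithmetic: king - man + woman
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]

# The nearest token in the vocabulary to the resulting vector.
nearest = max(emb, key=lambda t: cosine(emb[t], target))
# → "queen"
```

The same nearest-neighbor search is how embedding-based retrieval matches a query against documents: everything is a vector, and “similar meaning” becomes “small angle.”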
Those vectors then move through latent space, described as a vast “imagination zone” where meanings and connections exist. When a query lands in sparse or poorly supported regions, the model may confidently generate plausible-sounding but incorrect content—an intuition for why hallucinations happen. Positional encoding addresses a separate weakness: without order markers, “the cat ate the mouse” could be treated like “the mouse ate the cat.” By injecting position information (using sine/cosine patterns), modern models track word order, handle long-range dependencies, and maintain coherence across longer passages.
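The sine/cosine position markers mentioned above can be sketched directly. This follows the sinusoidal scheme from the original Transformer paper (“Attention Is All You Need”): even dimensions use sine, odd dimensions use cosine, with wavelengths growing geometrically so every position gets a unique fingerprint:

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal positional encoding: even dims use sin, odd dims use cos,
    with wavelengths in a geometric progression controlled by 10000^(2i/d)."""
    pe = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

# Position 0 encodes as [sin(0), cos(0), ...] = [0, 1, 0, 1, ...];
# every other position gets a distinct pattern the model can learn to read.
pe0 = positional_encoding(0, 4)
```

Because each position maps to a different vector, “the cat ate the mouse” and “the mouse ate the cat” produce different inputs even though they contain the same tokens.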
Once the mechanics are clear, the guide shifts to what users can control. Prompt engineering and context engineering determine what information the model uses—examples, constraints, and output formats—so vague requests produce vague “AI slop,” while specific instructions yield usable results. Temperature acts as a creativity dial: low values favor predictable, high-probability choices for factual tasks and coding; higher values increase randomness and can produce creative but less reliable text.
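The temperature dial is concretely a division applied to the model’s logits before softmax. A minimal sketch (toy logits, standard softmax math) shows why low temperature is near-deterministic and high temperature flattens the distribution:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then softmax. Low T sharpens the
    distribution toward the top token; high T flattens it toward uniform."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                        # toy next-token scores
cold = softmax_with_temperature(logits, 0.2)    # near-deterministic
hot = softmax_with_temperature(logits, 2.0)     # much closer to uniform
```

With these toy numbers, the top token gets roughly 99% of the probability mass at T=0.2 but only about 50% at T=2.0, which is exactly the factual-vs-creative trade-off described above.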
Context window limits define how much conversation the model can retain at once. When the window fills, some systems refuse new context while others silently drop earlier details, causing mid-conversation “forgetting” and drift in long chats. For generation strategy, beam search, top-k, and nucleus sampling change how the model explores candidate continuations—shaping the “personality” of outputs (careful editor vs reliable assistant vs creative collaborator). Under the hood, attention heads specialize in patterns like grammar, names, or pronoun resolution, while residual streams and layer norms help information accumulate across many layers without losing the original query.
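Top-k and nucleus (top-p) sampling can both be sketched as filters over the next-token distribution, applied before drawing a sample. The toy probabilities below are invented for illustration:

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens, then renormalize."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {t: p / total for t, p in top}

def nucleus_filter(probs, p):
    """Keep the smallest set of most-probable tokens whose cumulative
    probability reaches p (nucleus / top-p sampling), then renormalize."""
    kept, cum = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cum += prob
        if cum >= p:
            break
    total = sum(q for _, q in kept)
    return {t: q / total for t, q in kept}

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}  # toy distribution
```

The key difference: top-k keeps a fixed number of candidates regardless of confidence, while nucleus sampling adapts, keeping few candidates when the model is sure and many when it is not.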
The guide also explains why interpretability is hard: feature superposition means single neurons can represent overlapping concepts, leading to unexpected associations. Mixture of experts routes inputs to a small subset of specialized modules, improving capability without paying the full compute cost every time. Learning dynamics matter too: gradient descent adjusts weights to reduce error over millions of steps; fine-tuning specializes a pre-trained model; RLHF uses human feedback to optimize helpfulness and reduce harmful behavior; and catastrophic forgetting warns that new training can overwrite old skills.
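Gradient descent itself is simple enough to sketch in a few lines. Real training applies this same step rule to billions of weights with gradients computed by backpropagation; here it is shown on a one-variable loss, (x − 3)², whose minimum is at x = 3:

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly step against the gradient to reduce the loss."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Minimize (x - 3)^2; its gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
# x_min converges toward 3.0
```

Fine-tuning and RLHF both reuse this loop with different objectives, which is also why catastrophic forgetting is possible: the same weights that encode old skills get nudged toward the new objective.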
Finally, the guide connects these literacy concepts to modern capabilities and deployment: RAG (retrieval-augmented generation) and retrieval-augmented feedback loops let models consult fresh sources and iteratively refine answers; speculative decoding speeds generation by having a smaller model propose tokens ahead and a larger model verify them. Efficiency techniques like quantization shrink models for edge devices, while LoRA-style adapters (the transcript’s “LoRA and Qura” most likely refers to LoRA and QLoRA) enable swappable task-specific behavior without retraining the whole network. It closes with safety and multimodal fundamentals: prompt injection attacks hide malicious instructions in “innocent” text; diffusion models generate images by denoising from random noise; and multimodal fusion maps text, images, audio, and video into a shared embedding space for unified perception.
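Quantization, mentioned above as an efficiency technique, can be sketched as symmetric 8-bit rounding: scale each float weight into the integer range [−127, 127] and store one scale factor per tensor. This is a simplified illustration; production schemes add per-channel scales, zero points, and calibration:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats into [-127, 127] integers
    plus a single scale factor needed to recover approximate values."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 representation."""
    return [v * scale for v in quantized]

w = [0.52, -1.27, 0.03, 0.9]      # toy float32 weights
q, s = quantize_int8(w)           # 1 byte per weight instead of 4
restored = dequantize(q, s)       # close to w, within rounding error
```

Each weight now needs one byte instead of four, at the cost of rounding error bounded by half the scale, which is why quantized models are smaller and faster with only a modest quality loss.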
The practical takeaway is direct: pick three concepts and experiment—tune temperature, test prompt-injection defenses, and use retrieval or context strategies—so users can move from “why did it do that?” to “how do I fix it?” with the same AI tools everyone else uses.
Cornell Notes
The guide frames AI literacy as learning how models process text (tokenization, embeddings, latent space, positional encoding) and how users can steer outputs (prompt/context engineering, temperature, context window, and decoding strategies like beam search, top-k, and nucleus sampling). It explains why failures happen: tokenization hides letters inside chunks, latent-space sparsity drives hallucinations, and limited context causes forgetting. It also connects internal architecture to behavior, including attention heads, residual streams, feature superposition, and mixture-of-experts routing. Finally, it links these ideas to real-world tools—RAG, retrieval-augmented feedback loops, speculative decoding, quantization, adapters (LoRA), and safety against prompt injection—so users can predict, improve, and secure AI results.
Why does an AI sometimes miscount letters or fail at word games like “count the Rs in strawberry”?
How do embeddings and latent space explain both context understanding and hallucinations?
What role does positional encoding play in language quality?
What practical controls most affect output quality for everyday users?
How do RAG and retrieval-augmented feedback loops reduce outdated answers and improve multi-step problem solving?
Why can fine-tuning or user feedback cause the model to lose older skills?
Review Questions
- Which parts of the model’s pipeline are responsible for (1) reading text at all and (2) preserving word order?
- How do temperature and decoding methods (beam/top-k/nucleus) differ in what they control?
- What mechanisms in the guide explain why long chats drift or why models can confidently hallucinate?
Key Points
1. Tokenization breaks text into chunks, so letter-level tasks can fail when the model counts or reasons over tokens rather than individual characters.
2. Embeddings place tokens into a semantic vector space, enabling context matching and analogies through vector arithmetic.
3. Latent space navigation explains both creativity and hallucinations: sparse regions can produce confident but incorrect outputs.
4. Prompt and context engineering, temperature, and context-window management are the highest-leverage controls for improving day-to-day AI results.
5. Decoding strategy (beam search, top-k, nucleus sampling) changes how candidate continuations are explored, shaping output “personality” beyond temperature.
6. RLHF and catastrophic forgetting clarify why AI can become more helpful yet still refuse certain requests, and why updates can overwrite older skills.
7. RAG, retrieval-augmented feedback loops, speculative decoding, quantization, and adapter layers (LoRA-style) connect literacy to practical capability, speed, and deployment—while prompt injection highlights key safety risks.