Regular Expressions (Regex) Tutorial: How to Match Any Pattern of Text

TL;DR

Regular expressions match patterns rather than exact text, making them useful for searching and extracting variable data like phone numbers, emails, and URLs.


Briefing

Regular expressions let people search text by pattern, not by exact wording—turning messy, variable data (like phone numbers, emails, and URLs) into something that can be reliably found and extracted. The tutorial builds that capability from the ground up: start with literal matching, then add “meta characters” for classes of text, then use anchors and quantifiers to control where matches occur and how long they can be.

It begins with practical mechanics in a text editor: open the regex-enabled search tool, make sure the “.*” (dot-asterisk) toggle is active so patterns are treated as regular expressions, and enable match-case for predictable behavior. From there, literal searches behave as expected: ABC matches only that exact casing and order. The first major lesson is that regex symbols often have special meaning. A period matches “any character” (except newline), so finding a literal dot requires escaping it with a backslash. The same escape logic applies to other meta characters, including the backslash itself.
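The literal-versus-meta-character distinction can be sketched outside the editor as well. The snippet below is a minimal illustration using Python's standard `re` module (the tutorial itself works in a text editor's search box); the sample strings are made up for this example.

```python
import re

text = "coreyms.com abc ABC a.b"

# Literal characters match themselves, case-sensitively by default.
literal_hits = re.findall(r"abc", text)

probe = "a.b a-b axb a\nb"

# An unescaped dot matches any single character except a newline.
any_char_hits = re.findall(r"a.b", probe)

# Escaping the dot restricts the match to a real period.
dot_hits = re.findall(r"a\.b", probe)

# A literal backslash is written as two backslashes in the pattern.
backslash_hits = re.findall(r"\\", r"C:\temp\file")

print(literal_hits)         # ['abc']
print(any_char_hits)        # ['a.b', 'a-b', 'axb']
print(dot_hits)             # ['a.b']
print(len(backslash_hits))  # 2
```

Raw strings (`r"..."`) keep Python from interpreting the backslashes before the regex engine sees them.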

Next comes the toolkit of character classes. Lowercase \d matches digits (0–9), while uppercase \D matches anything that is not a digit. \w matches “word characters” (A–Z, a–z, 0–9, and underscore), and uppercase \W negates it. \s matches whitespace (spaces, tabs, newlines), while uppercase \S negates it. The tutorial then introduces anchors: patterns that don’t match characters but match positions. A word boundary (lowercase \b) marks a transition between word and non-word characters, and uppercase \B matches wherever there is no such boundary; the caret (^) anchors the start of a string, and the dollar sign ($) anchors the end.
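As a quick check of what each class accepts, here is a small sketch in Python's `re` module. The sample string is invented for illustration. One caveat: for Python `str` patterns these classes are Unicode-aware by default, so the ASCII ranges listed above hold exactly only with the `re.ASCII` flag, which is used here.

```python
import re

sample = "Room 42_b\tis open!"

# Lowercase classes match the set; uppercase classes match everything else.
digits = re.findall(r"\d", sample, re.ASCII)
non_digits = "".join(re.findall(r"\D", sample, re.ASCII))
word_chars = "".join(re.findall(r"\w", sample, re.ASCII))
non_word = re.findall(r"\W", sample, re.ASCII)
whitespace = re.findall(r"\s", sample, re.ASCII)
non_space = "".join(re.findall(r"\S", sample, re.ASCII))

print(digits)      # ['4', '2']
print(word_chars)  # Room42_bisopen
print(non_word)    # [' ', '\t', ' ', '!']
print(whitespace)  # [' ', '\t', ' ']
print(non_space)   # Room42_bisopen!
```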

With those building blocks, the tutorial shifts to real-world patterns. Phone numbers are matched using repeated digit classes and a separator pattern. To restrict separators to only dashes or dots, it uses a character set in square brackets (e.g., [.-]) so the regex consumes exactly one separator character and rejects anything else, such as an asterisk. It also demonstrates ranges inside character sets (e.g., [1-7] for digits 1 through 7, [a-z] for lowercase letters, and combined ranges such as [a-zA-Z] for letters of either case).
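The difference between a dot separator and a bracketed set can be sketched as follows, using Python's `re` module. The phone numbers are placeholder data, not taken from the tutorial.

```python
import re

lines = "321-555-4321\n123.555.1234\n123*555*1234"

# '.' as the separator accepts any character, including '*'.
loose = re.findall(r"\d\d\d.\d\d\d.\d\d\d\d", lines)

# A character set accepts exactly one character from the list.
# Inside brackets the dot is literal, and a trailing '-' is a literal dash.
strict = re.findall(r"\d\d\d[.-]\d\d\d[.-]\d\d\d\d", lines)

# Ranges: [1-7] is one digit from 1 through 7;
# [a-zA-Z] is one ASCII letter of either case.
low_digits = re.findall(r"[1-7]", "0189")
letters = re.findall(r"[a-zA-Z]", "a1B2")

print(loose)       # all three lines match
print(strict)      # ['321-555-4321', '123.555.1234']
print(low_digits)  # ['1']
print(letters)     # ['a', 'B']
```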

Quantifiers then remove repetition. The asterisk matches zero or more, plus matches one or more, question mark matches zero or one, and curly braces specify exact or ranged counts (e.g., {3} or {3,4}). This is used to match structured text like titles with optional punctuation (Mr. vs Mr) and to handle variable-length strings with word-character quantifiers.
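A short sketch of each quantifier in Python's `re` module follows. The names and numbers are made-up sample data, and the patterns are one reasonable way to write them rather than the tutorial's exact expressions.

```python
import re

# {3} and {4} replace the repeated \d tokens in a phone pattern.
phone = re.compile(r"\d{3}[.-]\d{3}[.-]\d{4}")
phone_hits = phone.findall("call 321-555-4321 or 123.555.1234")

# '?' makes the escaped period optional.
titles = re.findall(r"Mr\.?", "Mr. Smith and Mr Jones")

# \w* lets the name after the capital letter be any length, including zero.
names = re.findall(r"Mr\.?\s[A-Z]\w*", "Mr. Smith, Mr Jones, Mr. T")

# '*' allows zero repetitions, '+' requires at least one.
star = re.findall(r"ab*", "a ab abbb")
plus = re.findall(r"ab+", "a ab abbb")

# {3,4} accepts three or four repetitions and takes four when available.
runs = re.findall(r"\d{3,4}", "12 123 12345")

print(phone_hits)  # ['321-555-4321', '123.555.1234']
print(titles)      # ['Mr.', 'Mr']
print(names)       # ['Mr. Smith', 'Mr Jones', 'Mr. T']
print(star, plus)  # ['a', 'ab', 'abbb'] ['ab', 'abbb']
print(runs)        # ['123', '1234']
```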

Groups and alternation (parentheses and |) handle branching patterns such as Mr, Ms, and Mrs. Finally, the tutorial tackles email and URL matching. It shows how to build a regex that matches multiple email formats by allowing different top-level domains (com, edu, net) and by expanding allowed characters in the local-part and domain. It also demonstrates capturing groups with parentheses and using back-references (like group 1, group 2, group 3) to extract just the domain and top-level domain from URLs and rewrite matches during “replace all.” The end result is a workflow for reading, writing, and repurposing regex patterns to clean and extract information efficiently.
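Alternation inside a group and a simple email pattern might look like the sketch below, written with Python's `re` module. The names and addresses are hypothetical, the title pattern assumes the spellings Mr, Ms, and Mrs, and the email pattern is deliberately narrow: it illustrates the technique and is not a general-purpose validator.

```python
import re

# 'M' followed by one of r, s, or rs, then an optional period and a name.
# (?:...) groups the alternatives without capturing them.
title = re.compile(r"M(?:r|s|rs)\.?\s[A-Z]\w*")
titled = title.findall("Mr. Smith\nMs Davis\nMrs. Robinson")

# Local part: letters, digits, dots, dashes. Domain: letters and dashes.
# Only the three top-level domains named above are accepted.
email = re.compile(r"[a-zA-Z0-9.-]+@[a-zA-Z-]+\.(?:com|edu|net)")
addresses = (
    "JaneDoe@example.com\n"
    "jane.doe@university.edu\n"
    "jane-321-doe@my-work.net\n"
    "nobody@example.org"
)
emails = email.findall(addresses)

print(titled)  # ['Mr. Smith', 'Ms Davis', 'Mrs. Robinson']
print(emails)  # the .org address is not matched
```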

Cornell Notes

Regular expressions provide a way to match text by pattern, using special symbols to represent classes of characters, positions, and repetition. Literal characters match exactly, but meta characters like “.” require escaping to find a real dot. Character classes such as \d, \w, and \s (and their uppercase negations) let regex match digits, word characters, and whitespace, while anchors like \b, ^, and $ constrain matches to boundaries or string ends. Square-bracket character sets and quantifiers ({m,n}, *, +, ?, etc.) control which characters are allowed and how many are consumed. Parentheses create groups that can be captured and reused via back-references for tasks like extracting domains from URLs.

Why does searching for “.” behave differently than searching for a literal period, and how is that fixed?

In regex, the period is a meta character that matches “any character except newline.” So a pattern like “.” will match many characters rather than the dot itself. To match a literal period, the dot must be escaped with a backslash: “\.”. The same escaping idea applies to other meta characters (including the backslash), where “\” must be written as “\\” to represent a literal backslash.

What’s the practical difference between character classes (\d, \w, \s) and anchors (\b, ^, $)?

Character classes match actual characters. For example, “\d” matches digits 0–9, “\w” matches word characters (A–Z, a–z, 0–9, underscore), and “\s” matches whitespace (space, tab, newline). Anchors don’t consume characters; they match invisible positions. “\b” matches a word boundary, “^” matches the start of a string/line, and “$” matches the end. That’s why anchors can control where a match occurs without changing what characters are included.
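To make the "positions, not characters" point concrete, the sketch below uses Python's `re` module to inspect match offsets. The sample text is invented. The `re.MULTILINE` flag is Python's way of making ^ and $ apply per line rather than only to the whole string.

```python
import re

text = "Ha HaHa"

# \b requires a word boundary before 'Ha': the start of the text or after a space.
bounded = [m.start() for m in re.finditer(r"\bHa", text)]

# \B requires the opposite: 'Ha' preceded by another word character.
unbounded = [m.start() for m in re.finditer(r"\BHa", text)]

# Anchors consume nothing: a bare \b match has equal start and end offsets.
first_boundary = re.search(r"\b", text)
zero_width = first_boundary.start() == first_boundary.end()

# ^ and $ pin a match to the start or end of the string...
first_word = re.findall(r"^\w+", text)
last_word = re.findall(r"\w+$", text)

# ...or to the start of every line when MULTILINE is set.
line_starts = re.findall(r"^\w+", "one 1\ntwo 2", re.MULTILINE)

print(bounded, unbounded)     # [0, 3] [5]
print(zero_width)             # True
print(first_word, last_word)  # ['Ha'] ['HaHa']
print(line_starts)            # ['one', 'two']
```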

How do square-bracket character sets improve phone-number matching compared with using “.”?

Using “.” as a separator placeholder matches any character, which can incorrectly match non-separators (like an asterisk). A square-bracket character set restricts the allowed separator to exactly one character from a list, such as [.-] (dot or dash); inside a set the dot is already literal, so [\.-] is equivalent. Even though the set contains multiple characters, the regex still matches only one character at that position, then moves on to the next token (like the next digit class).

How do quantifiers reduce repetition, and what do the common ones mean?

Quantifiers specify how many times the preceding pattern should repeat. “*” matches zero or more, “+” matches one or more, “?” matches zero or one, and “{n}” matches exactly n. Ranges like “{min,max}” allow variability. In the tutorial, quantifiers replace repeated digit patterns (e.g., “\d{3}” for exactly three digits) and handle optional punctuation (e.g., “Mr” vs “Mr.” using “Mr\.?”).

How can groups and back-references be used to extract parts of matches (like domains from URLs)?

Parentheses create capturing groups. For URLs, one group can capture the domain name (e.g., the word characters before the top-level domain), and another group can capture the top-level domain (like .com or .gov). In replace operations, back-references (e.g., $1, $2, $3 in the editor) can substitute the captured groups into the replacement text—turning full URLs into cleaned “domain + TLD” outputs.
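The same extract-and-rewrite workflow can be sketched with Python's `re.sub`. Python writes back-references in the replacement as `\1`, `\2`, `\3` rather than the editor's `$1`, `$2`, `$3`. The URLs are illustrative sample data, and the pattern is one plausible form: an optional "www." as group 1, the domain as group 2, and the top-level domain as group 3.

```python
import re

urls = "https://www.example.com\nhttp://example.net\nhttps://www.nasa.gov"

# Group 1: optional "www."   Group 2: domain name   Group 3: top-level domain
pattern = re.compile(r"https?://(www\.)?(\w+)(\.\w+)")

# Read the captured groups from each match.
pairs = [(m.group(2), m.group(3)) for m in pattern.finditer(urls)]

# "Replace all": keep only groups 2 and 3 of every match.
cleaned = pattern.sub(r"\2\3", urls)

print(pairs)    # [('example', '.com'), ('example', '.net'), ('nasa', '.gov')]
print(cleaned)  # example.com, example.net, nasa.gov on separate lines
```

Group 1 is captured but never referenced in the replacement, which is what drops the "www." prefix from the output.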

Review Questions

When would you choose an anchor like ^ or $ instead of a character class like \d?
Write a regex fragment to match either the “Mr” or the “Mrs” prefix, allowing an optional period after the title. What grouping and alternation would you use?
Given a string with “cat mat Hat bat”, how would you use a negated character set inside square brackets to match words that end in “at” but exclude “bat”?

Key Points

1. Regular expressions match patterns rather than exact text, making them useful for searching and extracting variable data like phone numbers, emails, and URLs.
2. Meta characters often have special meanings (e.g., “.” matches any character except newline), so literal matching requires escaping with backslashes (e.g., “\.”).
3. Character classes such as \d, \w, and \s match digits, word characters, and whitespace; uppercase versions negate those sets.
4. Anchors like \b, ^, and $ match positions (word boundaries, start, end) rather than consuming characters.
5. Square-bracket character sets restrict matches to specific allowed characters (e.g., [.-] for dash or dot) and support ranges like [1-7] or [a-z].
6. Quantifiers (*, +, ?, {m}, {m,n}) control repetition, eliminating the need to rewrite the same token multiple times.
7. Capturing groups with parentheses and back-references (e.g., $1, $2, $3) enable extraction and rewriting, such as converting full URLs into just domain + top-level domain.

Highlights

A literal dot requires escaping: “.” matches any character (except newline), but “\.” matches an actual period.

Anchors don’t match characters—they match positions, which is why \b can find word boundaries without consuming text.

Character sets in square brackets match exactly one character from the set, letting phone-number separators be restricted to only “-” or “.”.

Quantifiers like + and {n} make regexes concise and precise by controlling how many characters must appear.

Capturing groups plus back-references allow “replace all” workflows that extract domains from URLs and rewrite matches automatically.

Topics

Regex Basics
Meta Characters
Character Classes
Anchors and Quantifiers
Groups and Back-References

Mentioned

Corey Schafer