Regular Expressions (Regex) Tutorial: How to Match Any Pattern of Text
Based on Corey Schafer's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Regular expressions match patterns rather than exact text, making them useful for searching and extracting variable data like phone numbers, emails, and URLs.
Briefing
Regular expressions let people search text by pattern, not by exact wording—turning messy, variable data (like phone numbers, emails, and URLs) into something that can be reliably found and extracted. The tutorial builds that capability from the ground up: start with literal matching, then add “meta characters” for classes of text, then use anchors and quantifiers to control where matches occur and how long they can be.
It begins with practical mechanics in a text editor: open the regex-enabled search tool, ensure the regex option (the “.*” button) is active so patterns are treated as regular expressions, and enable match-case for predictable behavior. From there, literal searches behave as expected: ABC matches only the exact casing and order. The first major lesson is that regex symbols often have special meaning. A period matches “any character” (except newline), so finding a literal dot requires escaping it with a backslash (\.). The same escape logic applies to other meta characters, including the backslash itself (\\).
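The tutorial works inside a text editor's search box; the same behavior can be reproduced with Python's standard re module. This is a minimal sketch, and the sample text and variable names are assumptions made for illustration, not taken from the video.

```python
import re

# Hypothetical sample text, chosen so the two patterns behave differently.
text = "coreyms.com and coreyms-com"

# An unescaped dot is a meta character: it matches any character except a newline.
any_char = re.findall(r"coreyms.com", text)

# Escaping the dot with a backslash matches only a literal period.
literal_dot = re.findall(r"coreyms\.com", text)

# Literal matching is case-sensitive by default.
exact_case = re.findall(r"ABC", "abc ABC")

print(any_char)     # ['coreyms.com', 'coreyms-com']
print(literal_dot)  # ['coreyms.com']
print(exact_case)   # ['ABC']
```

Raw strings (the r"..." prefix) keep Python from interpreting the backslash before the regex engine sees it.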
Next comes the toolkit of character classes. \d matches digits (0–9), while uppercase \D matches anything that is not a digit. \w matches “word characters” (A–Z, a–z, 0–9, and underscore), and uppercase \W negates it. \s matches whitespace (spaces, tabs, newlines), while uppercase \S negates it. The tutorial then introduces anchors: patterns that don’t match characters but match positions. A word boundary (\b, lowercase) marks a transition between word and non-word characters, and uppercase \B matches where no such boundary exists; the caret (^) anchors the start of a string, and the dollar sign ($) anchors the end.
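The classes and anchors above can be tried out with Python's re module. This is a rough sketch; both sample strings are assumptions picked to keep the results small and easy to check by hand.

```python
import re

# Hypothetical ASCII sample: letters, an underscore, a digit, a space, a tab, punctuation.
sample = "id_7 ok\t!"

digits = re.findall(r"\d", sample)       # digit characters
non_digits = re.findall(r"\D", sample)   # everything that is not a digit
word_chars = re.findall(r"\w", sample)   # letters, digits, underscore
non_word = re.findall(r"\W", sample)     # everything that is not a word character
whitespace = re.findall(r"\s", sample)   # the space and the tab

# Anchors match positions, not characters.
laugh = "Ha HaHa"
at_boundary = re.findall(r"\bHa", laugh)      # "Ha" preceded by a word boundary
not_at_boundary = re.findall(r"\BHa", laugh)  # "Ha" NOT preceded by a word boundary
at_start = re.findall(r"^Ha", laugh)          # "Ha" at the start of the string
at_end = re.findall(r"Ha$", laugh)            # "Ha" at the end of the string

print(digits, word_chars, non_word)
print(len(at_boundary), len(not_at_boundary))  # 2 1
```

Only the first two occurrences of "Ha" sit at a word boundary; the third follows another letter, so \B finds it instead.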
With those building blocks, the tutorial shifts to real-world patterns. Phone numbers are matched using repeated digit classes and a separator pattern. To restrict separators to only dashes or dots, it uses a character set in square brackets (e.g., [.-]), which consumes exactly one character from the set, so a number separated by some other character, such as an asterisk, no longer matches. Inside a set the dot is literal and needs no escape. It also demonstrates ranges inside character sets (e.g., [1-7] for digits 1 through 7, [a-z] for lowercase letters, and combined ranges such as [a-zA-Z] to match letters of either case).
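The effect of swapping the dot for a character set can be sketched in Python as follows. The phone numbers are made-up sample data, and the variable names are illustrative only.

```python
import re

# Hypothetical phone numbers using three different separators.
text = "321-555-4321 123.555.1234 123*555*1234"

# A bare dot as the separator accepts ANY character, including the asterisk.
loose = re.findall(r"\d\d\d.\d\d\d.\d\d\d\d", text)

# A character set accepts exactly one character, and only a dot or a dash.
strict = re.findall(r"\d\d\d[.-]\d\d\d[.-]\d\d\d\d", text)

# Ranges inside a set: digits 1 through 7, and letters of either case.
one_to_seven = re.findall(r"[1-7]", "0189")
letters = re.findall(r"[a-zA-Z]", "aZ9_")

print(loose)   # all three numbers
print(strict)  # ['321-555-4321', '123.555.1234']
```

The dash is placed last in [.-] so it is read as a literal dash rather than as a range operator.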
Quantifiers then remove repetition. The asterisk matches zero or more, plus matches one or more, question mark matches zero or one, and curly braces specify exact or ranged counts (e.g., {3} for exactly three, or {3,4} for three to four). This is used to match structured text like names with optional punctuation (Mr. vs Mr) and to handle variable-length strings with word-character quantifiers such as \w*.
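A short Python sketch of these quantifiers, assuming a small made-up list of names and numbers, might look like this:

```python
import re

# {3} and {4} replace the repeated \d tokens of a hand-written phone pattern.
phones = re.findall(r"\d{3}[.-]\d{3}[.-]\d{4}", "321-555-4321 123.555.1234")

# Hypothetical names, one per line.
names = "Mr. Schafer\nMr Smith\nMr. T"

# \.? makes the period optional; \w* allows zero or more trailing word characters.
zero_or_more = re.findall(r"Mr\.?\s[A-Z]\w*", names)

# \w+ requires at least one more character, so the single-letter name drops out.
one_or_more = re.findall(r"Mr\.?\s[A-Z]\w+", names)

# {3,4} accepts three or four digits; \b keeps longer runs from matching partially.
three_or_four = re.findall(r"\b\d{3,4}\b", "12 123 1234 12345")

print(zero_or_more)   # ['Mr. Schafer', 'Mr Smith', 'Mr. T']
print(one_or_more)    # ['Mr. Schafer', 'Mr Smith']
print(three_or_four)  # ['123', '1234']
```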
Groups and alternation (parentheses and |) handle branching patterns such as Mr, Ms, and Mrs. Finally, the tutorial tackles email and URL matching. It shows how to build a regex that matches multiple email formats by allowing different top-level domains (com, edu, net) and by expanding the characters allowed in the local part and the domain. It also demonstrates capturing groups with parentheses and using back-references (group 1, group 2, group 3) to extract just the domain and top-level domain from URLs and to rewrite matches during “replace all.” The end result is a workflow for reading, writing, and repurposing regex patterns to clean and extract information efficiently.
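The grouping, email, and URL steps can be approximated in Python as below. The names, addresses, and URLs are invented sample data, and the patterns are simplified sketches rather than complete validators. Note that Python's re.sub writes back-references as \1, \2, \3, whereas many editors use $1, $2, $3.

```python
import re

# Alternation inside a group: r, s, or rs after the M.
names = "Mr. Schafer\nMr Smith\nMs Davis\nMrs. Robinson\nMr. T"
titles = [m.group(0) for m in re.finditer(r"M(r|s|rs)\.?\s[A-Z]\w*", names)]

# Emails: letters, digits, dots, and dashes in the local part; three allowed TLDs.
emails_text = ("CoreyMSchafer@gmail.com corey.schafer@university.edu "
               "corey-321-schafer@my-work.net")
email_pattern = r"[a-zA-Z0-9.-]+@[a-zA-Z-]+\.(com|edu|net)"
emails = [m.group(0) for m in re.finditer(email_pattern, emails_text)]

# URLs: group 1 = optional "www.", group 2 = domain, group 3 = top-level domain.
urls = "https://www.google.com\nhttp://coreyms.com\nhttps://youtube.com\nhttps://www.nasa.gov"
url_pattern = r"https?://(www\.)?(\w+)(\.\w+)"

# Keep only groups 2 and 3, the equivalent of "replace all" with $2$3 in an editor.
domains = re.sub(url_pattern, r"\2\3", urls)

print(titles)
print(emails)
print(domains.split("\n"))  # ['google.com', 'coreyms.com', 'youtube.com', 'nasa.gov']
```

re.finditer with group(0) is used here because re.findall returns only the captured group when a pattern contains one.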
Cornell Notes
Regular expressions provide a way to match text by pattern, using special symbols to represent classes of characters, positions, and repetition. Literal characters match exactly, but meta characters like “.” require escaping to find a real dot. Character classes such as \d, \w, and \s (and their uppercase negations) let regex match digits, word characters, and whitespace, while anchors like \b, ^, and $ constrain matches to boundaries or string ends. Square-bracket character sets and quantifiers ({m,n}, *, +, ?, etc.) control which characters are allowed and how many are consumed. Parentheses create groups that can be captured and reused via back-references for tasks like extracting domains from URLs.
Why does searching for “.” behave differently than searching for a literal period, and how is that fixed?
What’s the practical difference between character classes (\d, \w, \s) and anchors (\b, ^, $)?
How do square-bracket character sets improve phone-number matching compared with using “.”?
How do quantifiers reduce repetition, and what do the common ones mean?
How can groups and back-references be used to extract parts of matches (like domains from URLs)?
Review Questions
- When would you choose an anchor like ^ or $ instead of a character class like \d?
- Write a regex fragment to match either a “Mr” or a “Mrs” prefix, allowing an optional period after it. What grouping and alternation would you use?
- Given a string with “cat mat Hat bat”, how would you use a negated character set inside square brackets to match words that end in “t” but exclude “bat”?
Key Points
1. Regular expressions match patterns rather than exact text, making them useful for searching and extracting variable data like phone numbers, emails, and URLs.
2. Meta characters often have special meanings (e.g., “.” matches any character except newline), so literal matching requires escaping with backslashes (e.g., “\.”).
3. Character classes such as \d, \w, and \s match digits, word characters, and whitespace; uppercase versions negate those sets.
4. Anchors like \b, ^, and $ match positions (word boundaries, start, end) rather than consuming characters.
5. Square-bracket character sets restrict matches to specific allowed characters (e.g., [.-] for dash or dot) and support ranges like [1-7] or [a-z].
6. Quantifiers (*, +, ?, {m}, {m,n}) control repetition, eliminating the need to rewrite the same token multiple times.
7. Capturing groups with parentheses and back-references (e.g., $1, $2, $3) enable extraction and rewriting, such as converting full URLs into just domain + top-level domain.