Python Tutorial: re Module - How to Write and Match Regular Expressions (Regex)
Based on Corey Schafer's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Regular expressions in Python become practical once you learn how to (1) pass patterns safely into `re`, (2) interpret matches via spans and groups, and (3) build patterns from reusable building blocks like meta-characters, character sets, anchors, and quantifiers. The core takeaway is that `re` lets you turn messy text—mixed case, punctuation, URLs, phone numbers, and emails—into structured data by matching repeatable text patterns, then extracting the parts you actually care about.
The tutorial starts with essentials: import the built-in `re` module and use raw strings (prefix `r` or `R`) so backslashes in regex patterns aren’t treated as Python escape sequences. Patterns are compiled with `re.compile(...)`, then searched using methods like `finditer`, which returns an iterator of match objects. Each match object provides the matched text plus a `span` (start and end indices), so slicing the original string with those indices retrieves the exact match. A simple example shows that regex matching is case-sensitive by default: the pattern `ABC` matches the text `ABC` but not `AbC`. It also shows why literal punctuation must be escaped: an unescaped `.` matches any character except a newline, while `\.` matches a literal period.
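The workflow in this paragraph can be sketched roughly as follows. The sample text and variable names are made up for illustration and are not taken from the video.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "abc ABC example.com example-com"

# Raw strings keep the backslash intact for the regex engine.
literal_dot = re.compile(r"example\.com")   # \. matches a literal period
any_char = re.compile(r"example.com")       # . matches any character except newline

# finditer yields match objects; span() returns (start, end) indices
# that can be used to slice the original string.
for match in literal_dot.finditer(text):
    start, end = match.span()
    print(match.group(0), (start, end), text[start:end])

literal_hits = [m.group(0) for m in literal_dot.finditer(text)]
any_hits = [m.group(0) for m in any_char.finditer(text)]

# Matching is case-sensitive by default: only the lowercase "abc" is found.
case_hits = [m.span() for m in re.finditer(r"abc", text)]

print(literal_hits)
print(any_hits)
print(case_hits)
```

Here the unescaped pattern also picks up `example-com`, which is usually not what you want when searching for a domain name.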
From there, the lesson builds a toolbox of meta-characters. `\d` matches digits; `\D` matches non-digits. `\w` matches word characters (letters, digits, underscore), while `\W` negates that. `\s` matches whitespace (spaces, tabs, newlines), and `\S` negates it. Anchors add positional logic without consuming characters: `\b` marks word boundaries, while `^` and `$` anchor to the beginning and end of a string (or, with flags, the beginning/end of lines). These pieces are then combined into a phone-number pattern: three digits, a separator, three digits, another separator, and four digits. The tutorial demonstrates running the same regex against a file (`data.txt`) to extract phone numbers from larger datasets, including a quick fix for a Unicode decode issue by specifying `encoding='utf-8'`.
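As a minimal sketch of these building blocks, the example below tries the meta-characters and anchors on small strings, then runs the phone-number pattern against a file. The sample data is invented, and a temporary file stands in for the tutorial's `data.txt` so the snippet is self-contained.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical sample data, standing in for the contents of data.txt.
sample = "Call 321-555-4321 or 123.555.1234.\nOffice: 123*555*1234\n"

# Meta-characters: \d is a digit, \S is any non-whitespace character.
digits = re.findall(r"\d", "a1 b22")
non_space = re.findall(r"\S+", "one  two\tthree")

# \b is a zero-width word boundary: "Ha" matches only at the start of a word.
boundary_hits = [m.span() for m in re.finditer(r"\bHa", "Ha HaHa")]

# ^ anchors to the start of the string; with re.MULTILINE, to each line.
lines = "one two\nthree four"
first_word_of_string = re.findall(r"^\w+", lines)
first_word_of_lines = re.findall(r"^\w+", lines, re.MULTILINE)

# Phone pattern: 3 digits, any separator, 3 digits, any separator, 4 digits.
phone = re.compile(r"\d\d\d.\d\d\d.\d\d\d\d")
phone_hits = [m.group(0) for m in phone.finditer(sample)]

# Same search against a file, opened with an explicit encoding.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.txt"
    path.write_text(sample, encoding="utf-8")
    with open(path, "r", encoding="utf-8") as f:
        contents = f.read()
    file_hits = [m.group(0) for m in phone.finditer(contents)]

print(digits, non_space, boundary_hits)
print(first_word_of_string, first_word_of_lines)
print(phone_hits, file_hits)
```

Because the separator is `.`, the pattern also accepts `123*555*1234`, which is why a stricter separator is worth considering.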
To make patterns stricter, the tutorial introduces character sets (`[...]`). Instead of using `.` (which matches any character), a set can restrict separators to just `-` or `.` (inside a set, `.` is literal and needs no escape). Character sets also support ranges like `1-5` for digits or `a-z` for lowercase letters, and negation via a leading `^` inside the set. Quantifiers (`*`, `+`, `?`, and `{m,n}`) then control repetition: `*` means zero or more, `+` means one or more, `?` means zero or one (optional), `{3}` sets an exact count, and `{2,4}` sets a range. This is used to handle real-world variability, such as matching name prefixes like `Mr`, `Mr.`, `Miss`, and `Mrs`.
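The following sketch shows one way character sets and quantifiers might be combined. The sample strings are invented, and the alternation inside the group (`r|rs|iss`) is an assumption made here to cover the listed prefixes; the video's exact pattern may differ.

```python
import re

# Hypothetical sample data for illustration.
numbers = "321-555-4321 123.555.1234 123*555*1234 800-555-1234 900-555-1234"

# Character set [-.] allows only a dash or a literal dot as the separator;
# {3} and {4} replace the repeated \d\d\d and \d\d\d\d.
strict_phone = re.compile(r"\d{3}[-.]\d{3}[-.]\d{4}")
strict_hits = strict_phone.findall(numbers)

# [89] matches exactly one character: either 8 or 9.
toll_free_hits = re.findall(r"[89]00[-.]\d{3}[-.]\d{4}", numbers)

# Ranges and negation inside a set.
one_to_five = re.findall(r"[1-5]", "0123456789")
not_bat_hits = re.findall(r"[^b]at", "cat mat pat bat")

# Quantifiers: \.? makes the period optional, \w* allows zero or more
# word characters after the capital letter.
names = "Mr. Schafer\nMr Smith\nMiss Davis\nMrs. Robinson\nMr. T"
prefix = re.compile(r"M(r|rs|iss)\.?\s[A-Z]\w*")
prefix_hits = [m.group(0) for m in prefix.finditer(names)]

print(strict_hits)
print(toll_free_hits)
print(one_to_five, not_bat_hits)
print(prefix_hits)
```

`finditer` with `group(0)` is used for the names because `findall` would return only the captured prefix text once the pattern contains a group.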
The final sections show how to scale from matching to extracting. Groups created with parentheses capture sub-parts of a match, and `group(n)` retrieves them (with `group(0)` representing the entire match). A URL example captures the optional `www`, the domain name, and the top-level domain (like `.com` or `.gov`), then uses `pattern.sub(...)` with backreferences (`\2`, `\3`) to rewrite URLs into a simplified “domain + TLD” format. The tutorial closes by comparing `finditer` with `findall` (which returns a list of strings, or of captured-group values when the pattern contains groups, rather than match objects), `match` (which matches only at the beginning of the string), and `search` (which returns the first match anywhere in the string), and by introducing flags such as `re.IGNORECASE` to make patterns case-insensitive.
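A rough sketch of the extraction and rewriting steps described here is shown below. The URL list, the sentence, and the exact pattern are assumptions for illustration, written to match the description (optional `www`, domain name, top-level domain).

```python
import re

# Hypothetical URLs for illustration.
urls = """https://www.google.com
http://example.org
https://youtube.com
https://www.nasa.gov"""

# Group 1: optional "www.", group 2: domain name, group 3: top-level domain.
url_pattern = re.compile(r"https?://(www\.)?(\w+)(\.\w+)")

# group(0) is the whole match; group(2) and group(3) are the captured parts.
parts = [(m.group(0), m.group(2), m.group(3)) for m in url_pattern.finditer(urls)]

# sub() rewrites each match using backreferences to groups 2 and 3.
simplified = url_pattern.sub(r"\2\3", urls)

# findall returns tuples of group values here, not match objects;
# a group that did not participate shows up as an empty string.
tuples = url_pattern.findall(urls)

sentence = "Start a sentence and then bring it to an end"
word = re.compile(r"sentence")
at_start = word.match(sentence)      # None: match() only tries position 0
anywhere = word.search(sentence)     # finds the first match anywhere

# re.IGNORECASE lets the lowercase pattern match "Start".
ignore_case = re.compile(r"start", re.IGNORECASE).match(sentence)

print(parts)
print(simplified)
print(tuples)
print(at_start, anywhere.span(), ignore_case.group(0))
```

Since `match` and `search` return `None` when nothing is found, real code would typically check the result before calling `span()` or `group()`.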
In short: regex power in Python comes from assembling small, well-understood pattern components, then using match spans and groups to extract exactly what matters from messy text.
Cornell Notes
Python’s `re` module turns text pattern matching into a repeatable workflow: compile a regex, search it, and then use match metadata to extract results. Raw strings (`r'...'`) prevent Python from interpreting backslashes, which is crucial for regex syntax. `finditer` returns match objects with `span` locations, while groups in parentheses let code pull out specific sub-parts using `group(n)` and rewrite text with `sub()` backreferences. Meta-characters (`\d`, `\w`, `\s` and their uppercase negations), anchors (`\b`, `^`, `$`), character sets (`[...]` with ranges and negation), and quantifiers (`*`, `+`, `?`, `{m,n}`) provide the building blocks for patterns like phone numbers, names, emails, and URLs. Flags like `re.IGNORECASE` reduce pattern complexity for case-insensitive matching.
Why does the tutorial insist on using raw strings for regex patterns in Python?
How do `finditer`, `findall`, `match`, and `search` differ in what they return and when they match?
What’s the practical difference between `.` and `\.` in regex patterns?
How do character sets tighten a regex compared with using meta-characters like `.`?
How do groups and backreferences enable extraction and rewriting (e.g., from URLs)?
What role do quantifiers play when patterns must handle variable-length text?
Review Questions
- When would you choose `finditer` over `findall`, and what extra capabilities do match objects provide?
- Design a regex for a string that contains either `cat` or `mat` as a whole word but not `bat`. Which anchors and character-set negation would you use?
- How would you modify a phone-number regex so the separator must be either `-` or `.` but not other characters?
Key Points
1. Use raw strings (`r'...'`) so backslashes in regex patterns reach the regex engine unchanged.
2. Compile patterns with `re.compile(...)` when you plan to reuse them across multiple searches.
3. Prefer `finditer` when you need match objects with `span` and `group(n)` for extraction and indexing.
4. Escape regex meta-characters like `.` when you need literal punctuation; otherwise `.` matches any character.
5. Build robust patterns using meta-characters (`\d`, `\w`, `\s`), anchors (`\b`, `^`, `$`), and character sets (`[...]` with ranges and negation).
6. Use quantifiers (`*`, `+`, `?`, `{m,n}`) to control repetition and handle optional or variable-length segments.
7. Capture sub-parts with groups and rewrite text with `sub()` using backreferences like `\2` and `\3`.