Python Tutorial: re Module - How to Write and Match Regular Expressions (Regex)

Corey Schafer

Based on Corey Schafer's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Use raw strings (`r'...'`) so backslashes in regex patterns reach the regex engine unchanged.

Briefing

Regular expressions in Python become practical once you learn how to (1) pass patterns safely into `re`, (2) interpret matches via spans and groups, and (3) build patterns from reusable building blocks like meta-characters, character sets, anchors, and quantifiers. The core takeaway is that `re` lets you turn messy text—mixed case, punctuation, URLs, phone numbers, and emails—into structured data by matching repeatable text patterns, then extracting the parts you actually care about.

The tutorial starts with essentials: import the built-in `re` module and write patterns as raw strings (prefix `r`) so backslashes in regex patterns aren’t treated as Python escape sequences. Patterns are compiled with `re.compile(...)`, then searched using methods like `finditer`, which returns an iterator of match objects. Each match object provides the matched text plus a `span` (start and end indices), so slicing the original string with those indices returns the exact match. A simple example shows that regex matching is case-sensitive by default: a pattern only matches text with the same casing, so `ABC` matches `ABC` but not `AbC`. Another shows why literal punctuation must be escaped: an unescaped `.` matches any character except a newline, while `\.` matches a literal period.
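The compile, search, and slice workflow described above can be sketched as follows. This is a minimal illustration, not the tutorial's own code; the sample `text` and the variable names are assumptions.

```python
import re

# Hypothetical sample text; only the lowercase "abc" should match.
text = "abc ABC AbC example.com"

# Raw string pattern; matching is case-sensitive by default.
pattern = re.compile(r"abc")

matches = list(pattern.finditer(text))
for m in matches:
    start, end = m.span()
    # Slicing the original string with the span reproduces the match.
    print(m.span(), text[start:end])
```

Because `finditer` yields match objects rather than strings, the span is available for slicing or for reporting where in the text each hit occurred.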

From there, the lesson builds a toolbox of meta-characters. `\d` matches digits; `\D` matches non-digits. `\w` matches word characters (letters, digits, underscore), while `\W` negates that. `\s` matches whitespace (spaces, tabs, newlines), and `\S` negates it. Anchors add positional logic without consuming characters: `\b` marks word boundaries, while `^` and `$` anchor to the beginning and end of a string (or, with the `re.MULTILINE` flag, the beginning and end of each line). These pieces are then combined into a phone-number pattern: three digits, a separator, three digits, another separator, and four digits. The tutorial demonstrates running the same regex against a file (`data.txt`) to extract phone numbers from larger datasets, including a quick fix for a Unicode decode issue by specifying `encoding='utf-8'`.
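As a rough sketch of the phone-number pattern and the file-based search, the block below uses `\d` for digits and `.` for the separators. The helper name `find_phone_numbers` and the sample sentence are assumptions made for illustration; only the `encoding='utf-8'` argument comes from the text above.

```python
import re

# Three digits, any single separator, three digits, separator, four digits.
phone_pattern = re.compile(r"\d\d\d.\d\d\d.\d\d\d\d")


def find_phone_numbers(path):
    """Return every phone-number-like match found in the file at `path`.

    Hypothetical helper: the tutorial reads a file such as data.txt,
    but this function name is not taken from it.
    """
    # An explicit encoding avoids platform-dependent decode errors.
    with open(path, "r", encoding="utf-8") as f:
        contents = f.read()
    return [m.group(0) for m in phone_pattern.finditer(contents)]


# The same compiled pattern also works on an in-memory string.
sample = "Call 321-555-4321 or 123.555.1234 today"
sample_numbers = [m.group(0) for m in phone_pattern.finditer(sample)]
print(sample_numbers)
```

Compiling the pattern once at module level lets the same object be reused for both the file contents and any other string.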

To make patterns stricter, the tutorial introduces character sets (`[...]`). Instead of using `.` (which matches any character), a set can restrict separators to just `-` or `.`. Character sets also support ranges like `1-5` for digits or `a-z` for lowercase letters, and negation via a leading `^` inside the set. Quantifiers (`*`, `+`, `?`, and `{m,n}`) then control repetition: `*` means zero or more, `+` means one or more, `?` means optional, and `{3}` or `{2,4}` sets exact or ranged counts. This is used to handle real-world variability, such as matching name prefixes like `Mr`, `Mr.`, `Miss`, and `Mrs`.
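One way to handle the name prefixes mentioned above is to combine alternation inside a group, an optional escaped period, and a character-set range for the capital letter. The sample names and the exact pattern are assumptions for illustration, not necessarily what the tutorial uses.

```python
import re

# Hypothetical sample data covering the prefixes listed above.
names = """Mr. Schafer
Mr Smith
Miss Davis
Mrs. Robinson
Mr T"""

# "M", then one of r / s / rs / iss, an optional literal period,
# one whitespace character, a capital letter, then zero or more word characters.
name_pattern = re.compile(r"M(r|s|rs|iss)\.?\s[A-Z]\w*")

found_names = [m.group(0) for m in name_pattern.finditer(names)]
print(found_names)
```

The `?` after `\.` makes the period optional, and `\w*` allows a surname as short as a single letter.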

The final sections show how to scale from matching to extracting. Groups created with parentheses capture sub-parts of a match, and `group(n)` retrieves them (with `group(0)` representing the entire match). A URL example captures the optional `www`, the domain name, and the top-level domain (like `.com` or `.gov`), then uses `pattern.sub(...)` with backreferences (`\2`, `\3`) to rewrite URLs into a simplified “domain + TLD” format. The tutorial closes by comparing `finditer` with `findall` (which returns a list of strings, or tuples of strings when the pattern has several groups), `match` (beginning of string only), and `search` (first match anywhere in the string), and by introducing flags such as `re.IGNORECASE` to make patterns case-insensitive.
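The effect of the `re.IGNORECASE` flag can be seen in a small sketch like the one below. The sample sentence and variable names are assumptions.

```python
import re

# Hypothetical sample sentence that begins with a capitalized word.
sentence = "Start a sentence and then bring it to an end"

# Without the flag, covering every casing would need something like
# r"[Ss][Tt][Aa][Rr][Tt]".
insensitive = re.compile(r"start", re.IGNORECASE)
sensitive = re.compile(r"start")

insensitive_hit = insensitive.search(sentence)
sensitive_hit = sensitive.search(sentence)

print(insensitive_hit.group(0))  # the text as it appears in the sentence
print(sensitive_hit)             # None, because the casing differs
```

The flag is passed as the second argument to `re.compile`, so the pattern itself stays short and readable.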

In short: regex power in Python comes from assembling small, well-understood pattern components, then using match spans and groups to extract exactly what matters from messy text.

Cornell Notes

Python’s `re` module turns text pattern matching into a repeatable workflow: compile a regex, search it, and then use match metadata to extract results. Raw strings (`r'...'`) prevent Python from interpreting backslashes, which is crucial for regex syntax. `finditer` returns match objects with `span` locations, while groups in parentheses let code pull out specific sub-parts using `group(n)` and rewrite text with `sub()` backreferences. Meta-characters (`\d`, `\w`, `\s` and their uppercase negations), anchors (`\b`, `^`, `$`), character sets (`[...]` with ranges and negation), and quantifiers (`*`, `+`, `?`, `{m,n}`) provide the building blocks for patterns like phone numbers, names, emails, and URLs. Flags like `re.IGNORECASE` reduce pattern complexity for case-insensitive matching.

Why does the tutorial insist on using raw strings for regex patterns in Python?

Raw strings (written as `r'...'`) tell Python not to treat backslashes as escape characters. Without a raw string, a pattern like `\t` can become a literal tab before the regex engine ever sees it. With a raw string, backslashes remain intact so regex tokens like `\d`, `\w`, `\s`, and escaped punctuation like `\.` are interpreted correctly by the regex engine.
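A short sketch of the difference, using made-up sample strings: in a normal string Python converts `\t` to a tab and `\b` to a backspace before `re` sees them, while a raw string passes both characters through.

```python
import re

plain = "\tTab"   # Python turns \t into a single tab character
raw = r"\tTab"    # backslash and "t" stay as two separate characters
print(len(plain), len(raw))

text = "a cat sat"

# Raw string: \b reaches the regex engine as a word-boundary anchor.
word_raw = re.findall(r"\bcat\b", text)

# Normal string: \b became a backspace character, so nothing matches.
word_plain = re.findall("\bcat\b", text)

print(word_raw, word_plain)
```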

How do `finditer`, `findall`, `match`, and `search` differ in what they return and when they match?

`finditer` scans the entire string and yields match objects, which include extra information like `span` and `group(n)`. `findall` also scans the string but returns plain strings (or tuples of group strings) rather than match objects. `match` checks only the beginning of the string and returns the first match object (or `None`). `search` scans anywhere in the string and returns the first match object (or `None`).
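The four methods can be compared side by side in a sketch like this one; the sample strings and variable names are illustrative assumptions.

```python
import re

text = "cat bat cat"
pattern = re.compile(r"cat")

# finditer: an iterator of match objects, each with span() and group().
all_objects = list(pattern.finditer(text))

# findall: a list of plain strings.
all_strings = pattern.findall(text)

# With more than one group, findall returns tuples of the group contents.
pairs = re.findall(r"(\d)-(\d)", "1-2 3-4")

# match: succeeds only if the pattern matches at the start of the string.
at_start = pattern.match(text)
not_at_start = pattern.match("bat cat")

# search: the first match anywhere in the string.
first_anywhere = pattern.search("bat cat")

print([m.span() for m in all_objects], all_strings, pairs)
print(at_start, not_at_start, first_anywhere)
```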

What’s the practical difference between `.` and `\.` in regex patterns?

A plain `.` is a meta-character that matches any character except a newline, so it can accidentally match too much (e.g., in a URL or phone-number separator). Escaping it as `\.` forces a literal period match. The tutorial demonstrates this by showing that searching for `.` without escaping produces many matches, while `\.` matches only actual periods in the text.
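The contrast is easy to reproduce on a short string. The sample text here is an assumption chosen for illustration.

```python
import re

text = "example.com"

# Unescaped dot: every character in the string matches individually.
any_char = re.findall(r".", text)

# Escaped dot: only the literal period matches.
literal_dot = re.findall(r"\.", text)

# The unescaped dot does not match a newline by default.
across_lines = re.findall(r".", "a\nb")

print(len(any_char), literal_dot, across_lines)
```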

How do character sets tighten a regex compared with using meta-characters like `.`?

Character sets restrict matches to specific characters using `[...]`. For phone numbers, using `.` as a separator matches any character, so even an unexpected separator like `*` gets captured. Replacing that with a set like `[\-.]` (conceptually: “either dash or dot”) ensures only `-` or `.` are allowed at that position. Character sets can also include ranges like `1-5` or `a-z`, and a leading `^` negates the set (match anything except those characters).
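A minimal sketch of both ideas, assuming some made-up sample data: a loose pattern with `.` separators next to a stricter one with a character set, followed by a negated set.

```python
import re

# Hypothetical sample numbers; the last one uses an unwanted "*" separator.
numbers = "321-555-4321 123.555.1234 123*555*1234"

loose = re.compile(r"\d\d\d.\d\d\d.\d\d\d\d")
strict = re.compile(r"\d\d\d[-.]\d\d\d[-.]\d\d\d\d")

loose_hits = loose.findall(numbers)
strict_hits = strict.findall(numbers)

# Negated set: any single character except "b", followed by "at".
not_bat = re.findall(r"[^b]at", "cat mat pat bat")

print(loose_hits)
print(strict_hits)
print(not_bat)
```

Inside a set the dot is already literal, and a dash placed first is treated as a plain dash, so `[-.]` is equivalent to the escaped form `[\-.]`.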

How do groups and backreferences enable extraction and rewriting (e.g., from URLs)?

Parentheses create capturing groups. After matching, `group(1)`, `group(2)`, etc. return the captured sub-parts, while `group(0)` returns the entire match. For rewriting, `pattern.sub(replacement, text)` can use backreferences like `\2` and `\3` to insert captured group contents into a new string. The URL example captures the optional `www`, the domain name, and the top-level domain, then substitutes them into a simplified output.
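The URL rewrite described above might look roughly like this. The sample URLs and the exact pattern are assumptions; the structure (optional `www`, domain, top-level domain, then `sub` with `\2` and `\3`) follows the description.

```python
import re

# Hypothetical sample URLs, one per line.
urls = """https://www.google.com
http://example.com
https://youtube.com
https://www.nasa.gov"""

# Group 1: optional "www.", group 2: domain name, group 3: top-level domain.
url_pattern = re.compile(r"https?://(www\.)?(\w+)(\.\w+)")

# group(0) is the whole match; group(2) and group(3) are the captured parts.
parts = [(m.group(0), m.group(2), m.group(3)) for m in url_pattern.finditer(urls)]

# Backreferences in the replacement keep only the domain and the TLD.
simplified = url_pattern.sub(r"\2\3", urls)

print(parts[0])
print(simplified)
```

The replacement is also written as a raw string so that `\2` and `\3` reach `sub` as backreferences rather than as Python escape sequences.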

What role do quantifiers play when patterns must handle variable-length text?

Quantifiers control repetition: `*` matches zero or more, `+` matches one or more, `?` makes the preceding token optional, and `{m}` or `{m,n}` sets exact or ranged counts. The tutorial uses quantifiers to avoid typing repeated tokens (like fixed digit counts in phone numbers) and to handle optional punctuation in names (like `Mr.` vs `Mr`).
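The quantifiers can be compared on small made-up strings. The sketch below also shows a phone pattern written with repeated tokens next to the same pattern written with `{m}` counts; the sample data is an assumption.

```python
import re

phone = "321-555-4321"

# The same pattern written with repeated tokens and with exact counts.
long_form = re.compile(r"\d\d\d[-.]\d\d\d[-.]\d\d\d\d")
short_form = re.compile(r"\d{3}[-.]\d{3}[-.]\d{4}")
same_result = long_form.findall(phone) == short_form.findall(phone)

sample = "a ab abbb"
zero_or_more = re.findall(r"ab*", sample)  # "a" followed by zero or more "b"
one_or_more = re.findall(r"ab+", sample)   # "a" followed by at least one "b"
optional = re.findall(r"ab?", sample)      # "a" followed by at most one "b"

# Between two and four digits; quantifiers take as many as they can.
ranged = re.findall(r"\d{2,4}", "1 22 333 55555")

print(same_result, zero_or_more, one_or_more, optional, ranged)
```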

Review Questions

  1. When would you choose `finditer` over `findall`, and what extra capabilities do match objects provide?
  2. Design a regex for a string that contains either `cat` or `mat` as a whole word but not `bat`. Which anchors and character-set negation would you use?
  3. How would you modify a phone-number regex so the separator must be either `-` or `.` but not other characters?

Key Points

  1. Use raw strings (`r'...'`) so backslashes in regex patterns reach the regex engine unchanged.
  2. Compile patterns with `re.compile(...)` when you plan to reuse them across multiple searches.
  3. Prefer `finditer` when you need match objects with `span` and `group(n)` for extraction and indexing.
  4. Escape regex meta-characters like `.` when you need literal punctuation; otherwise `.` matches any character.
  5. Build robust patterns using meta-characters (`\d`, `\w`, `\s`), anchors (`\b`, `^`, `$`), and character sets (`[...]` with ranges and negation).
  6. Use quantifiers (`*`, `+`, `?`, `{m,n}`) to control repetition and handle optional or variable-length segments.
  7. Capture sub-parts with groups and rewrite text with `sub()` using backreferences like `\2` and `\3`.

Highlights

  • `finditer` returns match objects that include both the matched text and its `span`, making it easy to slice the original string precisely.
  • Unescaped `.` matches almost any character; escaping it as `\.` is required to match literal periods.
  • Character sets (`[...]`) let you restrict matches to specific characters (like only `-` or `.` as phone separators), and ranges like `a-z` or `1-5` simplify pattern writing.
  • Groups with parentheses turn “matching” into “extraction,” and backreferences in `sub()` enable fast text rewriting.
  • Quantifiers make regex patterns practical by handling repetition counts and optional punctuation without manually duplicating tokens.

Topics

  • Regular Expressions
  • Python re Module
  • Regex Meta-Characters
  • Character Sets
  • Regex Groups & Backreferences
