Regular Expressions in Python: Patterns, Groups, and the re Module

Pyford Notes • July 1, 2026 • 9 min read

Key points

Use raw strings (r"pattern") to avoid double-escaping backslashes.
re.search() finds a match anywhere; re.match() only checks the start.
Parentheses create capturing groups; match.group(1) retrieves the first one.
Compile a pattern with re.compile() when you use it more than once.

Pattern syntax reference

Token	Matches	Example
`.`	Any character except newline	`r"c.t"` matches `cat`, `cut`
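`\d`	Digit (0-9; for str patterns, any Unicode decimal digit unless re.ASCII is set)	`r"\d+"` matches `42`
`\w`	Word character ([a-zA-Z0-9_]; for str patterns, also Unicode letters and digits unless re.ASCII is set)	`r"\w+"` matches `hello_3`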
`\s`	Whitespace	`r"\s+"` matches spaces/tabs
`^`	Start of string (or of each line with re.MULTILINE)	`r"^Error"`
`$`	End of string, or just before a trailing newline	`r"\.py$"`
`*`	0 or more of preceding	`r"ab*"` matches `a`, `ab`, `abbb`
`+`	1 or more	`r"\d+"` requires at least one digit
`?`	0 or 1	`r"colou?r"` matches both spellings
`{n,m}`	Between n and m repetitions	`r"\d{2,4}"`
`[abc]`	Character class	`r"[aeiou]"`
`[^abc]`	Negated class	`r"[^\s]+"` non-whitespace
`a|b`	a or b	`r"cat|dog"`
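
A few rows of the table can be checked directly with re.findall(). The sample strings below are invented for illustration; this is a quick sanity sketch, not an exhaustive test of each token.

```python
import re

# ? makes the preceding "u" optional, so both spellings match.
spellings = re.findall(r"colou?r", "color and colour")
print(spellings)    # ['color', 'colour']

# {2,4} needs at least two digits and takes at most four.
digit_runs = re.findall(r"\d{2,4}", "7 42 12345")
print(digit_runs)   # ['42', '1234']

# A negated class: runs of characters that are not whitespace.
words = re.findall(r"[^\s]+", "a  b\tc")
print(words)        # ['a', 'b', 'c']

# Alternation: either alternative may match.
animals = re.findall(r"cat|dog", "cat, dog, bird")
print(animals)      # ['cat', 'dog']
```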

Core functions

import re

text = "Order #4821 placed on 2026-06-15."

# search: find first match anywhere in string
m = re.search(r"\d{4}-\d{2}-\d{2}", text)
if m:
    print(m.group())   # "2026-06-15"

# match: only checks at the beginning
m = re.match(r"Order", text)   # succeeds
m = re.match(r"\d+", text)     # fails (text starts with "Order")

# findall: returns list of all non-overlapping matches
numbers = re.findall(r"\d+", text)
# ["4821", "2026", "06", "15"]

# finditer: returns iterator of match objects
for m in re.finditer(r"\d+", text):
    print(m.group(), "at", m.start())
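
Because re.match() anchors only at the start, it accepts trailing text after the matched part. The sketch below contrasts it with re.fullmatch(), a standard re function not covered above, and shows the offsets that finditer() exposes through span().

```python
import re

text = "Order #4821 placed on 2026-06-15."

# match() only requires the pattern to match at position 0.
starts_with_order = re.match(r"Order", text) is not None
print(starts_with_order)   # True

# fullmatch() requires the pattern to cover the entire string.
is_exactly_order = re.fullmatch(r"Order", text) is not None
print(is_exactly_order)    # False

# span() returns the (start, end) offsets of each match.
spans = [(m.group(), m.span()) for m in re.finditer(r"\d+", text)]
print(spans)
# [('4821', (7, 11)), ('2026', (22, 26)), ('06', (27, 29)), ('15', (30, 32))]
```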

Capturing groups

Wrap part of a pattern in parentheses to capture that substring:

log_line = "2026-06-15 14:32:07 ERROR failed to connect"
pattern = r"(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+)"

m = re.search(pattern, log_line)
if m:
    date   = m.group(1)   # "2026-06-15"
    time   = m.group(2)   # "14:32:07"
    level  = m.group(3)   # "ERROR"
    all_   = m.group(0)   # entire match

Named groups make the code self-documenting:

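# The time must be matched explicitly; otherwise \w+ would capture "14" as the level.
pattern = r"(?P<date>\d{4}-\d{2}-\d{2}) \d{2}:\d{2}:\d{2} (?P<level>\w+)"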
m = re.search(pattern, log_line)
print(m.group("date"))    # "2026-06-15"
print(m.group("level"))   # "ERROR"
print(m.groupdict())      # {"date": "...", "level": "ERROR"}
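
Named groups combine naturally with a small helper that returns None for lines that do not match. The following is a minimal sketch: LOG_RE and parse_log_line are hypothetical names, and the log format is assumed to be exactly "date time LEVEL message".

```python
import re

# Hypothetical pattern for lines shaped like "date time LEVEL message".
LOG_RE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<message>.*)"
)

def parse_log_line(line):
    """Return the named fields as a dict, or None if the line does not match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

fields = parse_log_line("2026-06-15 14:32:07 ERROR failed to connect")
print(fields["level"], "-", fields["message"])   # ERROR - failed to connect
print(parse_log_line("not a log line"))          # None
```

Checking for None before calling group() or groupdict() avoids an AttributeError on unmatched input.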

Flags

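Pass flags through the flags argument (the third positional argument of search, match, and findall) or inline at the start of the pattern, for example (?i), (?m), or (?s):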

re.search(r"hello", text, re.IGNORECASE)   # case-insensitive
re.search(r"^line", text, re.MULTILINE)    # ^ matches start of each line
re.search(r".", text, re.DOTALL)           # . matches newlines too

# Inline flag inside pattern:
re.search(r"(?i)hello", text)              # same as IGNORECASE
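
To see what each flag changes, the sketch below runs the same kind of searches on a short multi-line string. The sample string is made up; flags are combined with the bitwise OR operator.

```python
import re

sample = "Hello\nline one\nLine two"

# IGNORECASE: "hello" matches "Hello".
ci = re.search(r"hello", sample, re.IGNORECASE).group()
print(ci)           # Hello

# MULTILINE: ^ also matches right after each newline.
ml = re.findall(r"^line", sample, re.MULTILINE)
print(ml)           # ['line']

# Flags combine with |.
both = re.findall(r"^line", sample, re.MULTILINE | re.IGNORECASE)
print(both)         # ['line', 'Line']

# Without DOTALL, . stops at "\n", so this pattern cannot match.
no_dotall = re.search(r"Hello.+two", sample)
print(no_dotall)    # None

# With DOTALL, . also matches "\n" and the pattern spans all three lines.
with_dotall = re.search(r"Hello.+two", sample, re.DOTALL).group()
print(with_dotall == sample)   # True
```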

Compiled patterns

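Calling re.compile() parses the pattern once and returns a reusable pattern object with the same methods as the module-level functions (search, match, findall, sub). The re module also caches recently compiled patterns internally, so the saving is modest, but a precompiled, named pattern skips the cache lookup in a tight loop and keeps the regex in one place: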

DATE_RE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

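def iter_dates(log_lines):
    """Yield the year/month/day fields for each line that contains a date."""
    for line in log_lines:
        m = DATE_RE.search(line)
        if m:
            yield m.groupdict()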
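
As a usage sketch, a compiled pattern can be applied to a small batch of lines and inspected. The sample_lines list is made-up data, and DATE_RE is redefined here so the snippet runs on its own.

```python
import re

DATE_RE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

# Made-up sample data for illustration.
sample_lines = [
    "2026-06-15 14:32:07 ERROR failed to connect",
    "no date on this line",
    "2026-06-16 09:01:44 INFO retry succeeded",
]

# Search each line, then keep only the lines that matched.
matches = (DATE_RE.search(line) for line in sample_lines)
dates = [m.groupdict() for m in matches if m]
print(dates)
# [{'year': '2026', 'month': '06', 'day': '15'},
#  {'year': '2026', 'month': '06', 'day': '16'}]

# groupindex maps each group name to its group number.
group_numbers = dict(DATE_RE.groupindex)
print(group_numbers)   # {'year': 1, 'month': 2, 'day': 3}

# With several groups, findall() returns one tuple per match.
all_dates = DATE_RE.findall("from 2026-06-15 to 2026-06-16")
print(all_dates)       # [('2026', '06', '15'), ('2026', '06', '16')]
```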

Substitution with re.sub()

cleaned = re.sub(r"\s+", " ", text)           # collapse whitespace
redacted = re.sub(r"\b\d{4}\b", "XXXX", text) # redact 4-digit numbers

# Replacement can be a function
def double(m):
    return str(int(m.group()) * 2)

result = re.sub(r"\d+", double, "10 items at 5 each")
# "20 items at 10 each"
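
The replacement string can also refer back to captured groups. The sketch below uses the standard \1 and \g<name> backreference syntax and the count parameter of re.sub(); the input strings are illustrative only.

```python
import re

# \1, \2, \3 in the replacement refer to numbered groups.
dmy = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "2026-06-15")
print(dmy)       # 15/06/2026

# \g<name> refers to a named group.
swapped = re.sub(
    r"(?P<first>\w+) (?P<second>\w+)",
    r"\g<second> \g<first>",
    "hello world",
)
print(swapped)   # world hello

# count limits how many matches are replaced.
once = re.sub(r"\d+", "N", "10 items at 5 each", count=1)
print(once)      # N items at 5 each
```

Writing the replacement as a raw string matters here for the same reason as in patterns: "\1" in a normal string is the character with code 1, not a backreference.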

Common pitfalls

Greedy vs non-greedy: .* is greedy and matches as much as possible. Add ? after a quantifier to make it non-greedy: .*? matches as little as possible while still letting the rest of the pattern match.
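Forgetting raw strings: In a normal string, "\d" survives as backslash-d only because \d is not a recognized string escape (Python 3.12+ emits a SyntaxWarning for it), while "\b" silently becomes a backspace character instead of a word boundary. r"\d" and r"\b" reach the regex engine unchanged. Always use raw strings for regex patterns.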
Using regex for HTML/XML: Regular expressions cannot reliably parse nested structures. Use a proper parser like html.parser or lxml instead.
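
The three pitfalls above can be reproduced in a few lines. This is a minimal sketch with invented inputs; LinkCollector is a hypothetical helper built on the standard-library html.parser.HTMLParser, shown as one possible alternative to a regex for pulling out links.

```python
import re
from html.parser import HTMLParser

# Greedy vs non-greedy on a string with two quoted parts.
quoted = 'say "one" and "two"'
greedy = re.findall(r'".*"', quoted)
lazy = re.findall(r'".*?"', quoted)
print(greedy)   # ['"one" and "two"']
print(lazy)     # ['"one"', '"two"']

# Raw strings: "\b" is a single backspace character, r"\b" is two characters.
print(len("\b"), len(r"\b"))                     # 1 2
no_match = re.search("\bcat\b", "a cat")         # pattern contains backspaces
word = re.search(r"\bcat\b", "a cat").group()    # real word boundaries
print(no_match, word)                            # None cat

# HTML: let a parser handle nesting instead of a regex.
class LinkCollector(HTMLParser):
    """Hypothetical helper that collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<div><a href="/docs">Docs <b>here</b></a> <a href="/faq">FAQ</a></div>')
print(collector.links)   # ['/docs', '/faq']
```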