Regular Expressions in Python: Patterns, Groups, and the re Module

Pyford Notes July 1, 2026 9 min read
Key points
  • Use raw strings (r"pattern") to avoid double-escaping backslashes.
  • re.search() finds a match anywhere; re.match() only checks the start.
  • Parentheses create capturing groups; match.group(1) retrieves the first one.
  • Compile a pattern with re.compile() when you use it more than once.

Pattern syntax reference

Token    Matches                        Example
.        Any character except newline   r"c.t" matches cat, cut
\d       Digit 0-9                      r"\d+" matches 42
\w       Word character [a-zA-Z0-9_]    r"\w+" matches hello_3
\s       Whitespace                     r"\s+" matches spaces/tabs
^        Start of string                r"^Error"
$        End of string                  r"\.py$"
*        0 or more of preceding         r"ab*" matches a, ab, abbb
+        1 or more                      r"\d+" requires at least one digit
?        0 or 1                         r"colou?r" matches both spellings
{n,m}    Between n and m repetitions    r"\d{2,4}"
[abc]    Character class                r"[aeiou]"
[^abc]   Negated class                  r"[^\s]+" non-whitespace
a|b      a or b                         r"cat|dog"
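A quick way to get a feel for these tokens is to run a few of them through re.findall() and re.search(). The sketch below does that; the sample strings are made up for illustration and are not part of the reference table.

```python
import re

# Sample strings are illustrative only.
words = re.findall(r"c.t", "cat cut cot c\nt")     # "." does not match the newline in "c\nt"
spellings = re.findall(r"colou?r", "color colour")  # "u" is optional
years = re.findall(r"\d{2,4}", "7 42 2026 123456")  # runs of 2 to 4 digits
non_space = re.findall(r"[^\s]+", "a  b\tc")        # runs of non-whitespace
is_py = re.search(r"\.py$", "script.py") is not None

print(words)      # ['cat', 'cut', 'cot']
print(spellings)  # ['color', 'colour']
print(years)      # ['42', '2026', '1234', '56']
print(non_space)  # ['a', 'b', 'c']
print(is_py)      # True
```

Note how {2,4} splits the six-digit run into "1234" and "56": quantifiers take as much as they are allowed, then the search resumes after the match.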

Core functions

import re

text = "Order #4821 placed on 2026-06-15."

# search: find first match anywhere in string
m = re.search(r"\d{4}-\d{2}-\d{2}", text)
if m:
    print(m.group())   # "2026-06-15"

# match: only checks at the beginning
m = re.match(r"Order", text)   # succeeds
m = re.match(r"\d+", text)     # fails (text starts with "Order")

# findall: returns list of all non-overlapping matches
numbers = re.findall(r"\d+", text)
# ["4821", "2026", "06", "15"]

# finditer: returns iterator of match objects
for m in re.finditer(r"\d+", text):
    print(m.group(), "at", m.start())

Capturing groups

Wrap part of a pattern in parentheses to capture that substring:

log_line = "2026-06-15 14:32:07 ERROR failed to connect"
pattern = r"(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+)"

m = re.search(pattern, log_line)
if m:
    date   = m.group(1)   # "2026-06-15"
    time   = m.group(2)   # "14:32:07"
    level  = m.group(3)   # "ERROR"
    all_   = m.group(0)   # entire match

Named groups make the code self-documenting:

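# The time group is needed: without it, \w+ would capture "14" instead of "ERROR"
pattern = r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<level>\w+)"
m = re.search(pattern, log_line)
if m:
    print(m.group("date"))    # "2026-06-15"
    print(m.group("level"))   # "ERROR"
    print(m.groupdict())      # {"date": "2026-06-15", "time": "14:32:07", "level": "ERROR"}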
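When a match is not guaranteed, it is usually safer to check for None before reading any group. The sketch below uses its own three-group pattern (chosen for illustration) and shows m.groups() next to m.groupdict(); the error message and the `message` variable are hypothetical additions.

```python
import re

log_line = "2026-06-15 14:32:07 ERROR failed to connect"
pattern = r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<level>\w+)"

m = re.search(pattern, log_line)
if m is None:
    # re.search returns None on failure, so m.group(...) would raise AttributeError
    raise ValueError("line does not look like a log entry")

date, time, level = m.groups()         # all groups as a tuple, in pattern order
fields = m.groupdict()                 # the same values keyed by group name
message = log_line[m.end():].strip()   # whatever follows the matched prefix

print(level)    # ERROR
print(message)  # failed to connect
```

Named groups still have numbers, so m.group(3) and m.group("level") return the same string here.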

Flags

Pass flags as the third argument (or as the flags keyword), or set them inline at the start of the pattern with (?i), (?m), or (?s):

re.search(r"hello", text, re.IGNORECASE)   # case-insensitive
re.search(r"^line", text, re.MULTILINE)    # ^ matches start of each line
re.search(r".", text, re.DOTALL)           # . matches newlines too

# Inline flag inside pattern:
re.search(r"(?i)hello", text)              # same as IGNORECASE
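The effect of each flag is easier to see on a string that actually contains newlines and mixed case. The following is a small sketch using a made-up `report` string; it also shows that flags can be combined with the | operator.

```python
import re

# Made-up multi-line sample so the flags have something to act on.
report = "Line one\nline two\nLINE three"

no_flags = re.findall(r"^line", report)                            # ^ only at string start
multiline = re.findall(r"^line", report, re.MULTILINE)             # ^ at each line start
both = re.findall(r"^line", report, re.MULTILINE | re.IGNORECASE)  # combined flags
inline = re.findall(r"(?im)^line", report)                         # same flags, inline

across = re.search(r"one.line", report)                    # "." stops at "\n"
across_dotall = re.search(r"one.line", report, re.DOTALL)  # "." now matches "\n"

print(no_flags)   # []
print(multiline)  # ['line']
print(both)       # ['Line', 'line', 'LINE']
print(across)     # None
```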

Compiled patterns

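Calling re.compile() parses the pattern once and returns a reusable pattern object with the same methods as the module-level functions (search, match, findall, sub). The re module already caches recently used patterns, so the speed gain is usually modest; the main benefits are skipping the cache lookup inside tight loops and giving the pattern a descriptive name: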

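DATE_RE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

def iter_dates(log_lines):
    # Yield a dict with year, month, and day for each line that contains a date
    for line in log_lines:
        m = DATE_RE.search(line)
        if m:
            yield m.groupdict()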

Substitution with re.sub()

cleaned = re.sub(r"\s+", " ", text)           # collapse whitespace
redacted = re.sub(r"\b\d{4}\b", "XXXX", text) # redact 4-digit numbers

# Replacement can be a function
def double(m):
    return str(int(m.group()) * 2)

result = re.sub(r"\d+", double, "10 items at 5 each")
# "20 items at 10 each"
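The replacement string can also refer back to captured groups, which is handy for reordering parts of a match. The sketch below assumes the same sample `text` as earlier; \g<name> and \1-style references and the count argument are standard re.sub() features, while the variable names are just for illustration.

```python
import re

text = "Order #4821 placed on 2026-06-15."
DATE_RE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

# Refer to named groups with \g<name> (use a raw string for the replacement too)
by_name = DATE_RE.sub(r"\g<day>/\g<month>/\g<year>", text)

# Refer to numbered groups with \1, \2, ...
by_number = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)

# count limits how many matches are replaced
first_only = re.sub(r"\d+", "N", text, count=1)

print(by_name)     # Order #4821 placed on 15/06/2026.
print(first_only)  # Order #N placed on 2026-06-15.
```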

Common pitfalls

  • Greedy vs non-greedy: .* is greedy and matches as much as possible. Add ? after a quantifier to make it non-greedy: .*? stops at the first opportunity.
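  • Forgetting raw strings: in a normal string literal, "\b" is a backspace character rather than a word boundary, and unrecognized escapes such as "\d" trigger a warning (DeprecationWarning since Python 3.6, SyntaxWarning since 3.12). r"\d" and r"\b" reach the regex engine unchanged. Always use raw strings for regex patterns.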
  • Using regex for HTML/XML: Regular expressions cannot reliably parse nested structures. Use a proper parser like html.parser or lxml instead.
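The three pitfalls above can be reproduced in a few lines. This is a minimal sketch with toy inputs; `LinkCollector` is a hypothetical helper built on the standard library's html.parser, shown only as one possible alternative to a regex.

```python
import re
from html.parser import HTMLParser

# 1. Greedy vs non-greedy on a toy sentence
quote = 'say "hi" and "bye"'
greedy = re.findall(r'".*"', quote)    # one match from the first quote to the last
lazy = re.findall(r'".*?"', quote)     # each quoted piece separately

# 2. Raw strings: "\b" without the r prefix is a backspace character
plain = re.findall("\bcat\b", "cat concat")    # looks for backspace + "cat" + backspace
raw = re.findall(r"\bcat\b", "cat concat")     # word boundaries, as intended

# 3. A parser instead of a regex for HTML (hypothetical helper class)
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

collector = LinkCollector()
collector.feed('<p><a href="/a">A</a> <a href="/b">B</a></p>')

print(greedy)           # ['"hi" and "bye"']
print(lazy)             # ['"hi"', '"bye"']
print(plain, raw)       # [] ['cat']
print(collector.links)  # ['/a', '/b']
```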