Regular Expressions in Python: Patterns, Groups, and the re Module
Key points
- Use raw strings (r"pattern") to avoid double-escaping backslashes.
- re.search() finds a match anywhere; re.match() only checks the start.
- Parentheses create capturing groups; match.group(1) retrieves the first one.
- Compile a pattern with re.compile() when you use it more than once.
Pattern syntax reference
| Token | Matches | Example |
|---|---|---|
| . | Any character except newline | r"c.t" matches cat, cut |
| \d | Digit (0-9, plus other Unicode digits unless re.ASCII is set) | r"\d+" matches 42 |
| \w | Word character ([a-zA-Z0-9_], plus Unicode letters and digits unless re.ASCII is set) | r"\w+" matches hello_3 |
| \s | Whitespace | r"\s+" matches spaces/tabs |
| ^ | Start of string | r"^Error" |
| $ | End of string (or just before a trailing newline) | r"\.py$" |
| * | 0 or more of preceding | r"ab*" matches a, ab, abbb |
| + | 1 or more of preceding | r"\d+" requires at least one digit |
| ? | 0 or 1 of preceding | r"colou?r" matches both spellings |
| {n,m} | Between n and m repetitions | r"\d{2,4}" |
| [abc] | Character class | r"[aeiou]" |
| [^abc] | Negated class | r"[^\s]+" non-whitespace |
| a\|b | a or b | r"cat\|dog" |
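A few of the tokens above can be checked directly in the interpreter. This is a minimal sketch, not an exhaustive test; the sample strings are invented for illustration.

```python
import re

# Each entry: (pattern, sample text, expected result of re.findall).
# The sample strings are made up for this illustration.
checks = [
    (r"c.t", "cat cut coat", ["cat", "cut"]),
    (r"\d{2,4}", "7 42 12345", ["42", "1234"]),
    (r"colou?r", "color colour", ["color", "colour"]),
    (r"[^\s]+", "a  bc", ["a", "bc"]),
    (r"cat|dog", "hotdog, cat", ["dog", "cat"]),
]

results = []
for pattern, sample, expected in checks:
    found = re.findall(pattern, sample)
    results.append(found == expected)
    print(pattern, "->", found)
```

Note how r"\d{2,4}" skips the lone 7 and takes only the first four digits of 12345: quantifiers are greedy up to their upper bound.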
Core functions
import re
text = "Order #4821 placed on 2026-06-15."
# search: find first match anywhere in string
m = re.search(r"\d{4}-\d{2}-\d{2}", text)
if m:
    print(m.group()) # "2026-06-15"
# match: only checks at the beginning
m = re.match(r"Order", text) # succeeds
m = re.match(r"\d+", text) # fails (text starts with "Order")
# findall: returns list of all non-overlapping matches
numbers = re.findall(r"\d+", text)
# ["4821", "2026", "06", "15"]
# finditer: returns iterator of match objects
for m in re.finditer(r"\d+", text):
    print(m.group(), "at", m.start())
Capturing groups
Wrap part of a pattern in parentheses to capture that substring:
log_line = "2026-06-15 14:32:07 ERROR failed to connect"
pattern = r"(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+)"
m = re.search(pattern, log_line)
if m:
    date = m.group(1) # "2026-06-15"
    time = m.group(2) # "14:32:07"
    level = m.group(3) # "ERROR"
    all_ = m.group(0) # entire match
Named groups make the code self-documenting:
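# The time field sits between the date and the level, so the pattern must consume it
pattern = r"(?P<date>\d{4}-\d{2}-\d{2}) \d{2}:\d{2}:\d{2} (?P<level>\w+)"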
m = re.search(pattern, log_line)
print(m.group("date")) # "2026-06-15"
print(m.group("level")) # "ERROR"
print(m.groupdict()) # {"date": "...", "level": "ERROR"}
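As a sketch of how named groups turn a line of text into a structured record, the following parses a log line into a dictionary. The log format, the extra time and message groups, and the parse_line helper are assumptions made up for this example.

```python
import re

# Hypothetical log format: "<date> <time> <LEVEL> <message>"
LINE_PATTERN = (
    r"(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>\w+) "
    r"(?P<message>.*)"
)

def parse_line(line):
    """Return a dict of the named fields, or None if the line does not match."""
    m = re.match(LINE_PATTERN, line)
    return m.groupdict() if m else None

record = parse_line("2026-06-15 14:32:07 ERROR failed to connect")
print(record["level"], "-", record["message"])  # ERROR - failed to connect
print(parse_line("not a log line"))             # None
```

Returning None for non-matching lines mirrors what re.match() itself does, so callers can keep the usual "if result:" check.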
Flags
Pass flags as a third argument or inline with (?flags):
re.search(r"hello", text, re.IGNORECASE) # case-insensitive
re.search(r"^line", text, re.MULTILINE) # ^ matches start of each line
re.search(r".", text, re.DOTALL) # . matches newlines too
# Inline flag inside pattern:
re.search(r"(?i)hello", text) # same as IGNORECASE
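The effect of MULTILINE and DOTALL only shows up on text that contains newlines. The sketch below uses a made-up three-line string to compare results with and without each flag.

```python
import re

# Made-up three-line sample for illustration.
text = "Hello\nline two\nLINE three"

# Without MULTILINE, ^ anchors only at the very start of the string.
plain = re.findall(r"^line", text)                # []
multi = re.findall(r"^line", text, re.MULTILINE)  # ['line']

# Flags combine with the bitwise OR operator.
both = re.findall(r"^line", text, re.MULTILINE | re.IGNORECASE)  # ['line', 'LINE']

# By default . does not match a newline; DOTALL lets it cross lines.
no_dotall = re.search(r"Hello.*two", text)          # None
dotall = re.search(r"Hello.*two", text, re.DOTALL)  # matches "Hello\nline two"

print(plain, multi, both)
print(no_dotall, repr(dotall.group()))
```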
Compiled patterns
Calling re.compile() parses the pattern once and returns a reusable object. This is efficient when the pattern is used repeatedly in a loop:
DATE_RE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
def iter_dates(log_lines):
    """Yield the year/month/day dict for each line that contains a date."""
    for line in log_lines:
        m = DATE_RE.search(line)
        if m:
            yield m.groupdict()
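A compiled pattern object offers the same operations as the module-level functions (search, match, findall, finditer, sub), with any flags supplied once at compile time. A minimal sketch, using an invented sample string and a hypothetical WORD_RE name:

```python
import re

# Flags are passed to compile(), not to the individual method calls.
WORD_RE = re.compile(r"[a-z]+", re.IGNORECASE)

sample = "Order 4821 placed"  # made-up sample text

words = WORD_RE.findall(sample)       # ['Order', 'placed']
masked = WORD_RE.sub("_", sample)     # '_ 4821 _'
first = WORD_RE.search(sample).span() # (0, 5)

print(words, masked, first)
print(WORD_RE.pattern)  # the original pattern string is kept on the object
```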
Substitution with re.sub()
cleaned = re.sub(r"\s+", " ", text) # collapse whitespace
redacted = re.sub(r"\b\d{4}\b", "XXXX", text) # redact 4-digit numbers
# Replacement can be a function
def double(m):
    return str(int(m.group()) * 2)
result = re.sub(r"\d+", double, "10 items at 5 each")
# "20 items at 10 each"
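The replacement string can also refer back to captured groups, which is handy for reordering parts of a match. The sketch below rewrites an ISO date as day/month/year; the group names y, m and d are chosen for this example only.

```python
import re

text = "Order #4821 placed on 2026-06-15."

# \g<name> in the replacement refers to a named group;
# \1, \2, ... work the same way for numbered groups.
swapped = re.sub(
    r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})",
    r"\g<d>/\g<m>/\g<y>",
    text,
)
print(swapped)  # Order #4821 placed on 15/06/2026.

# count limits how many matches are replaced (here: only the first).
first_only = re.sub(r"\d+", "N", text, count=1)
print(first_only)  # Order #N placed on 2026-06-15.
```

The replacement is written as a raw string for the same reason as the pattern: it keeps the backslashes intact.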
Common pitfalls
- Greedy vs non-greedy: .* is greedy and matches as much as possible. Add ? after a quantifier to make it non-greedy: .*? matches as little as possible.
- Forgetting raw strings: "\d" only works because Python leaves unrecognized escapes such as \d untouched (recent versions warn about this), while "\b" silently becomes a backspace character instead of a word boundary. r"\d" and r"\b" are unambiguous. Always use raw strings for regex patterns.
- Using regex for HTML/XML: Regular expressions cannot reliably parse nested structures. Use a proper parser like html.parser or lxml instead.
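The first two pitfalls are easy to reproduce. The following sketch uses a made-up quoted-text sample to contrast greedy and non-greedy matching, then shows what happens to \b when the raw-string prefix is missing.

```python
import re

quoted = 'say "one" and "two"'  # made-up sample

# Greedy: .* runs to the last closing quote on the line.
greedy = re.findall(r'".*"', quoted)   # ['"one" and "two"']
# Non-greedy: .*? stops at the first closing quote it can.
lazy = re.findall(r'".*?"', quoted)    # ['"one"', '"two"']
print(greedy, lazy)

# "\b" in a normal string is a single backspace character;
# r"\b" is two characters, which the regex engine reads as a word boundary.
lengths = (len("\b"), len(r"\b"))                # (1, 2)
without_raw = re.findall("\bcat\b", "cat")       # []
with_raw = re.findall(r"\bcat\b", "cat")         # ['cat']
print(lengths, without_raw, with_raw)
```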