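Regular Expressions: A Must-Know Skill

What Is a Regular Expression?

A regular expression (regex) is a small "language" for describing string-matching rules. It shows up in most text-processing work: log analysis, data cleaning, form validation, web scraping, and more.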

TIP
When writing a regex in Python, prefer raw strings (r"...") so that backslashes do not have to be doubled. For example, r"\d+" is easier to read than "\\d+".
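
A quick check makes the tip concrete. The sketch below uses made-up sample text and shows that the raw and the doubly escaped spellings produce the same pattern, and how the r prefix changes what Python stores.

```python
import re

# Both spellings produce the same three-character pattern: backslash, d, plus.
raw = r"\d+"
escaped = "\\d+"
same_pattern = raw == escaped        # True

# Without the r prefix, "\n" is one newline character;
# with it, r"\n" is two characters: a backslash and the letter n.
plain_len = len("\n")                # 1
raw_len = len(r"\n")                 # 2

digits = re.findall(raw, "order 66, aisle 7")
print(same_pattern, plain_len, raw_len, digits)
# True 1 2 ['66', '7']
```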

Regex Syntax Cheat Sheet

Common Metacharacters

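Character	Meaning
`.`	Matches any character except a newline
`^`	Matches the start of the string
`$`	Matches the end of the string (or just before a trailing newline)
`*`	Repeats the preceding element 0 or more times
`+`	Repeats the preceding element 1 or more times
`?`	Repeats the preceding element 0 or 1 time
`{m,n}`	Repeats the preceding element m to n times
`|`	Alternation: matches the left side or the right side
`[]`	Character set, e.g. `[a-z]`
`()`	Capturing group
`\`	Escapes a special character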
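
To see several of these metacharacters working together, here is a small sketch. The hex-color check and the sample strings are illustrative assumptions, and the non-capturing group `(?:...)` is used so that `findall()` returns whole matches.

```python
import re

# ^ and $ anchor the match; [] is a character set; {6} is shorthand for {6,6}.
hex_color = re.compile(r'^#[0-9a-fA-F]{6}$')
is_color = bool(hex_color.search('#1a2B3c'))     # True
not_color = bool(hex_color.search('#1a2B3'))     # False: only 5 hex digits

# | chooses between alternatives; the group limits how far the alternation reaches.
pets = re.findall(r'(?:cat|dog)s?', 'cats and a dog')
# ['cats', 'dog']

# A backslash turns a metacharacter into a literal: \. matches only a dot.
versions = re.findall(r'\d+\.\d+', 'v3.12 and v3.9')
# ['3.12', '3.9']
```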

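Common Predefined Sequences

Sequence	Meaning	Equivalent (ASCII mode)
`\d`	Digit	`[0-9]`
`\D`	Non-digit	`[^0-9]`
`\w`	Word character	`[a-zA-Z0-9_]`
`\W`	Non-word character	`[^a-zA-Z0-9_]`
`\s`	Whitespace character	`[ \t\n\r\f\v]`
`\S`	Non-whitespace character	`[^ \t\n\r\f\v]`
`\b`	Word boundary	n/a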
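
The "ASCII mode" column matters because, for str patterns, Python matches Unicode characters by default. The following sketch uses an invented sample string containing full-width digits to show the difference, plus a short `\b` example.

```python
import re

# \uff14\uff12 are the full-width digits 4 and 2 (U+FF14, U+FF12).
text = 'id_42 costs \uff14\uff12 yen'

# For str patterns, \d matches any Unicode decimal digit by default.
unicode_digits = re.findall(r'\d+', text)           # ['42', '\uff14\uff12']
# With re.ASCII it is limited to [0-9], as in the table's last column.
ascii_digits = re.findall(r'\d+', text, re.ASCII)   # ['42']

# \b matches the empty position between a word character and a non-word character.
whole_words = re.findall(r'\bcat\b', 'cat concat cats cat.')   # ['cat', 'cat']
```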

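Greedy vs. Lazy

By default, the quantifiers *, +, ?, and {m,n} are greedy: they match as many characters as possible. Appending ? to a quantifier makes it lazy, so it matches as few characters as possible: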

import re

text = "<h1>Title</h1>"

re.findall(r'<.*>', text)    # greedy: ['<h1>Title</h1>']
re.findall(r'<.*?>', text)   # lazy:   ['<h1>', '</h1>']
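
Two related points may be worth a quick sketch: a negated character set is a common alternative to a lazy quantifier, and laziness applies to bounded quantifiers as well. The digit string below is an arbitrary example.

```python
import re

text = '<h1>Title</h1>'

# A negated character set gives the same result as the lazy quantifier here,
# because the inside of a tag never contains '>'.
tags = re.findall(r'<[^>]*>', text)                  # ['<h1>', '</h1>']

# Laziness also applies to {m,n}: the lazy form stops at the minimum.
greedy = re.search(r'\d{2,4}', '123456').group()     # '1234'
lazy = re.search(r'\d{2,4}?', '123456').group()      # '12'
```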

Core Functions in Detail

re.search(): Scan the Whole String

Searches the whole string for the first match and returns a Match object, or None if nothing matches.

import re

m = re.search(r'\d+', 'hello 42 world 100')
if m:
    print(m.group())   # '42'
    print(m.span())    # (6, 8)
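
Because `search()` can return None, calling `.group()` on the result without a check raises AttributeError when nothing matches. One way to wrap the check is sketched below; `first_number` is a hypothetical helper name, not part of the `re` module.

```python
import re
from typing import Optional

def first_number(text: str) -> Optional[int]:
    """Return the first run of digits in text as an int, or None if there is none."""
    m = re.search(r'\d+', text)
    return int(m.group()) if m else None

found = first_number('hello 42 world 100')   # 42
missing = first_number('no digits here')     # None
```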

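re.match(): Match at the Start

Tries to match only at the start of the string. Note that even in MULTILINE mode, match() matches only at the start of the string, not at the start of each line.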

re.match(r'\d+', '123abc')   # matches, returns a Match object
re.match(r'\d+', 'abc123')   # no match, returns None

NOTE
match() vs. search(): match() looks only at the start of the string, while search() scans all of it. In most cases search() is the more useful of the two.
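
The MULTILINE remark above is easy to verify. In this sketch (the two-line sample text is made up), `match()` fails even with the flag set, while `search()` finds the digits at the start of the second line.

```python
import re

text = 'first line\n123 second line'

# match() is pinned to index 0, so MULTILINE does not help it reach line 2.
at_start = re.match(r'^\d+', text, re.MULTILINE)     # None

# search() with MULTILINE does find the digits at the start of the second line.
per_line = re.search(r'^\d+', text, re.MULTILINE)
print(at_start, per_line.group(), per_line.span())
# None 123 (11, 14)
```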

re.fullmatch(): Match the Entire String

Requires the whole string to match the pattern. It is commonly used for form validation.

re.fullmatch(r'\d{11}', '13800138000')  # matches
re.fullmatch(r'\d{11}', '1380013800x')  # None
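
One reason to prefer `fullmatch()` over `match()` with a trailing `$` for validation: `$` also matches just before a final newline. The sketch below assumes an input that still carries its line ending, for example a line read from a file.

```python
import re

candidate = '13800138000\n'   # trailing newline left in place

# $ matches at the end of the string and also just before a final newline.
loose = re.match(r'\d{11}$', candidate)       # a Match object
# fullmatch() requires the pattern to cover every character, newline included.
strict = re.fullmatch(r'\d{11}', candidate)   # None
```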

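re.findall(): Find All Matches

Returns a list of all non-overlapping matches. If the pattern contains capturing groups, the list holds the groups' contents instead of the full matches.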

# No capturing group → list of strings
re.findall(r'\d+', 'a1 b22 c333')
# ['1', '22', '333']

# One capturing group → list of strings (the group's contents)
re.findall(r'(\d+)px', 'width:20px; height:30px')
# ['20', '30']

# Several capturing groups → list of tuples
re.findall(r'(\w+)=(\d+)', 'width=20 height=30')
# [('width', '20'), ('height', '30')]
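
When a group is needed only for alternation or repetition, a non-capturing group `(?:...)` keeps `findall()` returning whole matches. A minimal sketch with an invented CSS-like string:

```python
import re

css = 'width:20px; height:30em'

# Capturing group: findall() returns only what the group captured.
units_only = re.findall(r'\d+(px|em)', css)        # ['px', 'em']

# Non-capturing group: findall() returns the whole match.
full_values = re.findall(r'\d+(?:px|em)', css)     # ['20px', '30em']
```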

re.finditer(): Iterate Over All Matches

Similar to findall(), but returns an iterator of Match objects, which carry more information (positions, groups, and so on).

for m in re.finditer(r'\w+ly\b', 'He ran quickly and carefully'):
    print(f'{m.group()} at position {m.span()}')
# quickly at position (7, 14)
# carefully at position (19, 28)
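
Since each item is a Match object, named groups and per-group positions are available during iteration. The sketch below reuses the `key=value` sample from the `findall()` section; the group names `key` and `value` are chosen here for illustration.

```python
import re

settings = 'width=20 height=30'
pair = re.compile(r'(?P<key>\w+)=(?P<value>\d+)')

# Each Match gives access to named groups and to positions in the source text.
parsed = {m.group('key'): int(m.group('value')) for m in pair.finditer(settings)}
value_spans = [m.span('value') for m in pair.finditer(settings)]

print(parsed)        # {'width': 20, 'height': 30}
print(value_spans)   # [(6, 8), (16, 18)]
```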

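re.sub(): Replace Matches

Replaces every non-overlapping match. The repl argument can be a string or a function that takes a Match object and returns the replacement string.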

# Basic replacement
re.sub(r'\s+', '-', 'hello   world   python')
# 'hello-world-python'

# Swap two words with backreferences
re.sub(r'(\w+) (\w+)', r'\2 \1', 'hello world')
# 'world hello'

# Use a function to compute each replacement
def double(m):
    return str(int(m.group()) * 2)

re.sub(r'\d+', double, 'price: 10, tax: 3')
# 'price: 20, tax: 6'
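
A few more options of `sub()` are sketched here: the `count` argument, the sibling function `re.subn()`, and `\g<name>` references to named groups. The sample strings are made up.

```python
import re

# count limits how many matches are replaced (passed by keyword here).
first_only = re.sub(r'\s+', '-', 'hello   world   python', count=1)
# 'hello-world   python'

# subn() works like sub() but also reports how many replacements were made.
masked, n = re.subn(r'\d+', '#', 'price: 10, tax: 3')
# ('price: #, tax: #', 2)

# \g<name> refers to a named group inside the replacement string.
reordered = re.sub(
    r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})',
    r'\g<day>/\g<month>/\g<year>',
    'released 2026-02-11',
)
# 'released 11/02/2026'
```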

re.split(): Split by a Pattern

More powerful than str.split(), because the separator is a regular expression.

re.split(r'[,;\s]+', 'apple, banana;cherry  date')
# ['apple', 'banana', 'cherry', 'date']

# With a capturing group, the separators are kept in the result
re.split(r'(\W+)', 'one-two-three')
# ['one', '-', 'two', '-', 'three']
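
Two behaviors of `split()` that often surprise: `maxsplit` leaves the remainder unsplit, and a separator at either end of the string yields an empty string. A short sketch with invented inputs:

```python
import re

# maxsplit caps the number of splits; the remainder is returned untouched.
limited = re.split(r'\s*,\s*', 'a , b,c ,d', maxsplit=2)
# ['a', 'b', 'c ,d']

# A separator at either end produces an empty string in the result.
with_empties = re.split(r',', ',a,b,')
# ['', 'a', 'b', '']

# Filtering is one simple way to drop the empty pieces.
non_empty = [part for part in with_empties if part]
# ['a', 'b']
```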

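re.compile(): Precompile a Pattern

When the same pattern is used many times, compiling it once gives it a name and skips the per-call cache lookup. The speed gain is usually modest, because the re module already caches recently used patterns.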

pattern = re.compile(r'\b[A-Z][a-z]+\b')

pattern.findall('Hello World Python')   # ['Hello', 'World', 'Python']
pattern.search('say Hello')             # <re.Match object; span=(4, 9), match='Hello'>
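
A typical way to use a compiled pattern is as a module-level constant shared by helper functions. In the sketch below, `CAPITALIZED_WORD` and `count_capitalized` are hypothetical names introduced for illustration.

```python
import re

CAPITALIZED_WORD = re.compile(r'\b[A-Z][a-z]+\b')

def count_capitalized(lines):
    """Count capitalized words across an iterable of lines."""
    return sum(len(CAPITALIZED_WORD.findall(line)) for line in lines)

total = count_capitalized(['Hello World', 'say Hello', 'nothing here'])   # 3

# Compiled patterns also accept a start position, which the module-level
# functions do not.
second_word = CAPITALIZED_WORD.search('Hello World', 1).group()   # 'World'
```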

Common Match Object Methods

When search(), match(), and similar functions succeed, they return a Match object:

m = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})', 'Today is 2026-02-11, Wednesday')

m.group()        # '2026-02-11'  the full match
m.group(1)       # '2026'        first capturing group
m.group('year')  # '2026'        named capturing group
m.groups()       # ('2026', '02', '11')
m.groupdict()    # {'year': '2026', 'month': '02', 'day': '11'}
m.start()        # 9   start index of the match
m.end()          # 19  end index of the match
m.span()         # (9, 19)
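
A few further details of the Match API, sketched with made-up inputs: `group()` accepts several group names at once, `span()` accepts a group, and a group that did not participate in the match is reported as None.

```python
import re

m = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})', 'released 2026-02')

both = m.group('year', 'month')     # ('2026', '02'): several groups at once
month_span = m.span('month')        # (14, 16): position of a single group

# A group that did not take part in the match is reported as None,
# unless groups() is given a default.
size = re.search(r'(\d+)(px)?', 'width 20')
raw_groups = size.groups()          # ('20', None)
filled = size.groups(default='')    # ('20', '')
```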

TIP
A Match object is always truthy, so it can be used directly in an if test (the := operator requires Python 3.8 or later):
if m := re.search(pattern, text):
    process(m)

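Common Compilation Flags

The flags argument changes how matching behaves. Combine several flags with |:

Flag	Short form	Effect
`re.IGNORECASE`	`re.I`	Case-insensitive matching
`re.MULTILINE`	`re.M`	`^` / `$` match at the start and end of every line
`re.DOTALL`	`re.S`	`.` matches every character, including newlines
`re.VERBOSE`	`re.X`	Allows comments and whitespace in the pattern for readability
`re.ASCII`	`re.A`	`\w` `\d` `\s` match ASCII characters only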

# VERBOSE mode: write a more readable regex
pattern = re.compile(r"""
    (?P<protocol>https?)    # protocol
    ://                     # separator
    (?P<domain>[\w.-]+)     # domain
    (?P<path>/\S*)?         # path (optional)
""", re.VERBOSE)

m = pattern.search('Visit https://docs.python.org/zh-cn/3/ for the docs')
m.groupdict()
# {'protocol': 'https', 'domain': 'docs.python.org', 'path': '/zh-cn/3/'}
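
The other flags from the table can be exercised the same way. This sketch combines IGNORECASE with MULTILINE and then shows the effect of DOTALL; the log text and HTML fragment are invented samples.

```python
import re

log = 'Error: disk full\nwarning: low memory\nERROR: fan failure'

# IGNORECASE lets "error" match any casing; MULTILINE anchors ^ and $ per line.
errors = re.findall(r'^error: (.+)$', log, re.IGNORECASE | re.MULTILINE)
# ['disk full', 'fan failure']

html = '<p>line one\nline two</p>'

# Without DOTALL, . stops at the newline, so the closing tag is never reached.
no_dotall = re.search(r'<p>(.*)</p>', html)                         # None
with_dotall = re.search(r'<p>(.*)</p>', html, re.DOTALL).group(1)
# 'line one\nline two'
```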

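Hands-On Demo: A Log Analyzer

The small demo below ties the pieces together: it parses an Nginx access log, extracts the key fields, and computes a few statistics.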

import re
from collections import Counter

# Sample Nginx access log
log_data = """
192.168.1.1 - - [11/Feb/2026:10:00:01 +0800] "GET /index.html HTTP/1.1" 200 1024
10.0.0.5 - - [11/Feb/2026:10:00:02 +0800] "POST /api/login HTTP/1.1" 200 512
192.168.1.1 - - [11/Feb/2026:10:00:03 +0800] "GET /images/logo.png HTTP/1.1" 304 0
172.16.0.10 - - [11/Feb/2026:10:00:04 +0800] "GET /about HTTP/1.1" 200 2048
10.0.0.5 - - [11/Feb/2026:10:00:05 +0800] "GET /api/users HTTP/1.1" 403 128
192.168.1.1 - - [11/Feb/2026:10:00:06 +0800] "DELETE /api/users/3 HTTP/1.1" 500 64
172.16.0.10 - - [11/Feb/2026:10:00:07 +0800] "GET /index.html HTTP/1.1" 200 1024
10.0.0.5 - - [11/Feb/2026:10:00:08 +0800] "PUT /api/users/1 HTTP/1.1" 200 256
""".strip()

# A readable log-parsing pattern written in VERBOSE mode
log_pattern = re.compile(r"""
    (?P<ip>\d{1,3}(?:\.\d{1,3}){3})    # IP address
    \s-\s-\s
    \[(?P<time>[^\]]+)\]                # timestamp
    \s
    "(?P<method>\w+)                    # HTTP method
    \s(?P<path>\S+)                     # request path
    \s(?P<protocol>[^"]+)"             # protocol version
    \s(?P<status>\d{3})                 # status code
    \s(?P<size>\d+)                     # response size
""", re.VERBOSE)

# ========== 1. Parse every log entry ==========
print("=" * 50)
print("Parsed log entries")
print("=" * 50)

entries = []
for m in log_pattern.finditer(log_data):
    entry = m.groupdict()
    entry['status'] = int(entry['status'])
    entry['size'] = int(entry['size'])
    entries.append(entry)
    print(f"  {entry['ip']:>15} | {entry['method']:<6} | {entry['path']:<20} | {entry['status']}")

# ========== 2. Count requests per IP ==========
print(f"\n{'=' * 50}")
print("Requests per IP")
print("=" * 50)

ip_list = re.findall(r'\d{1,3}(?:\.\d{1,3}){3}', log_data)
for ip, count in Counter(ip_list).most_common():
    print(f"  {ip:<20} → {count} requests")

# ========== 3. Filter failed requests (status code >= 400) ==========
print(f"\n{'=' * 50}")
print("Failed requests (4xx / 5xx)")
print("=" * 50)

for entry in entries:
    if entry['status'] >= 400:
        print(f"  [{entry['status']}] {entry['method']} {entry['path']} ← {entry['ip']}")

# ========== 4. Mask IP addresses with re.sub ==========
print(f"\n{'=' * 50}")
print("IP masking")
print("=" * 50)

masked = re.sub(
    r'(\d{1,3}\.\d{1,3}\.)\d{1,3}\.\d{1,3}',
    r'\1*.*',
    log_data
)
# Show only the first 3 lines
for line in masked.strip().split('\n')[:3]:
    print(f"  {line}")
print("  ...")

Output:

==================================================
Parsed log entries
==================================================
      192.168.1.1 | GET    | /index.html          | 200
         10.0.0.5 | POST   | /api/login           | 200
      192.168.1.1 | GET    | /images/logo.png     | 304
      172.16.0.10 | GET    | /about               | 200
         10.0.0.5 | GET    | /api/users           | 403
      192.168.1.1 | DELETE | /api/users/3         | 500
      172.16.0.10 | GET    | /index.html          | 200
         10.0.0.5 | PUT    | /api/users/1         | 200

==================================================
Requests per IP
==================================================
  192.168.1.1          → 3 requests
  10.0.0.5             → 3 requests
  172.16.0.10          → 2 requests

==================================================
Failed requests (4xx / 5xx)
==================================================
  [403] GET /api/users ← 10.0.0.5
  [500] DELETE /api/users/3 ← 192.168.1.1

==================================================
IP masking
==================================================
  192.168.*.* - - [11/Feb/2026:10:00:01 +0800] "GET /index.html HTTP/1.1" 200 1024
  10.0.*.* - - [11/Feb/2026:10:00:02 +0800] "POST /api/login HTTP/1.1" 200 512
  192.168.*.* - - [11/Feb/2026:10:00:03 +0800] "GET /images/logo.png HTTP/1.1" 304 0
  ...
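
One thing the demo does not show is that `finditer()` silently skips lines that do not fit the pattern. If malformed lines need to be detected, a per-line `fullmatch()` is one option. The sketch below is an assumption-laden variant: `LOG_LINE` and `parse_line` are hypothetical names, and the pattern is a compact, non-VERBOSE rewrite of the one in the demo.

```python
import re
from typing import Optional

# Compact equivalent of the demo's pattern, with literal spaces between fields.
LOG_LINE = re.compile(
    r'(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) - - \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)

def parse_line(line: str) -> Optional[dict]:
    """Parse one access-log line; return None if it does not fit the format."""
    m = LOG_LINE.fullmatch(line.strip())
    if m is None:
        return None
    entry = m.groupdict()
    entry['status'] = int(entry['status'])
    entry['size'] = int(entry['size'])
    return entry

good = parse_line('10.0.0.5 - - [11/Feb/2026:10:00:05 +0800] "GET /api/users HTTP/1.1" 403 128')
bad = parse_line('corrupted line without the expected fields')

print(good['ip'], good['status'], bad)
# 10.0.0.5 403 None
```

Returning None instead of raising keeps the caller in charge of whether a bad line is counted, logged, or ignored.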

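NOTE
Q: In \[(?P<time>[^\]]+)\], the pattern is already a raw string (r prefix). Why is \[ still needed?
A: The r prefix only tells Python not to process backslash escapes when it reads the string literal. The regex engine then parses the pattern on its own, and there [ and ] are metacharacters that open and close a character set, so \[ is what matches a literal bracket.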
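
The two layers of escaping can be observed directly. The sketch below also uses `re.escape()`, a standard-library helper not covered above, which adds the regex-level backslashes automatically; the `[INFO]` tag and sample line are invented.

```python
import re

tag = '[INFO]'
line = '[INFO] service started'

# Unescaped, the brackets form a character set matching one of I, N, F, O.
as_set = re.findall(tag, 'INFO')             # ['I', 'N', 'F', 'O']

# re.escape() adds the backslashes needed to match the text literally.
escaped = re.escape(tag)                     # the string \[INFO\]
literal = re.search(escaped, line).group()   # '[INFO]'
```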

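Common Pitfalls

match() ≠ search(): match() only matches at the start of the string; use search() to scan the whole string.
When the pattern has capturing groups, findall() returns the group contents rather than the full match. To group without changing the return value, use a non-capturing group (?:...).
Greedy-matching trap: r'<.*>' matches everything from the first < to the last >; use r'<.*?>' to make it lazy.
Don't forget the r prefix: in a normal string, "\b" is a backspace character; only in r"\b" does the regex engine see a word boundary.
re.compile() is optional: Python caches recently used patterns internally, but precompiling a frequently used pattern is clearer and slightly faster.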
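
The `\b` pitfall fails silently, which makes it worth seeing once. In this sketch (sample text invented), the pattern without the r prefix compiles without error but matches nothing.

```python
import re

text = 'a cat sat'

with_raw = re.findall(r'\bcat\b', text)      # ['cat']
# Without the r prefix, Python turns each \b into a backspace character
# before the regex engine ever sees the pattern.
without_raw = re.findall('\bcat\b', text)    # []

is_backspace = '\b' == '\x08'                # True
```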

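Summary

Need	Recommended function
Check whether a string contains a pattern	`re.search()`
Match from the start	`re.match()`
Validate the whole string (e.g. a phone-number format)	`re.fullmatch()`
Extract all matches	`re.findall()` / `re.finditer()`
Replace text	`re.sub()`
Split by a pattern	`re.split()`
Reuse one pattern many times	`re.compile()`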

L1ngg's Home
