Regular Expressions

I. Basic Matching

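```python
import re

# re.match only matches at the start of the string
match = re.match(r'hello', 'hello world')
print(match.group())  # 'hello'

# re.search finds the first match anywhere in the string
match = re.search(r'world', 'hello world')
print(match.group())  # 'world'

# re.findall returns all non-overlapping matches
matches = re.findall(r'\d+', 'a1b2c3')
print(matches)  # ['1', '2', '3']
```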
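
The three calls differ mainly in where they are allowed to match, and both `re.match` and `re.search` return `None` on failure. The sketch below illustrates this; `first_number` is a hypothetical helper introduced only for this example.

```python
import re

text = 'hello world'

# re.match is anchored at the start, so a pattern that occurs later fails.
at_start = re.match(r'world', text)       # None

# re.search scans the whole string for the first occurrence.
anywhere = re.search(r'world', text)      # a match object

# re.fullmatch succeeds only if the entire string matches the pattern.
whole = re.fullmatch(r'hello', text)      # None


def first_number(s):
    """Hypothetical helper: return the first run of digits, or None."""
    m = re.search(r'\d+', s)
    # Check for None before calling .group(), otherwise AttributeError is raised.
    return m.group() if m else None


found = first_number('order 42 shipped')  # '42'
missing = first_number('no digits here')  # None
```

Checking the result before calling `.group()` avoids an `AttributeError` when nothing matches.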

II. Common Metacharacters

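| Character | Description |
| --- | --- |
| `.` | Any character except a newline (by default) |
| `\d` | Digit |
| `\w` | Letter, digit, or underscore |
| `\s` | Whitespace character |
| `^` | Start of the string |
| `$` | End of the string |
| `*` | Zero or more times |
| `+` | One or more times |
| `?` | Zero or one time |
| `{n}` | Exactly n times |
| `{m,n}` | Between m and n times |
| `[]` | Character set |
| `()` | Group |
| `\|` | Alternation (or) |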
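
As a quick illustration of the table, the following sketch runs a few of these metacharacters through `re.findall` and `re.search`. The sample strings are made up for the example.

```python
import re

# '.' matches any single character except a newline.
dot = re.findall(r'c.t', 'cat cot c\nt')            # ['cat', 'cot']

# '\w' covers letters, digits, and the underscore.
words = re.findall(r'\w+', 'snake_case, x2!')       # ['snake_case', 'x2']

# Quantifiers: '*' zero or more, '+' one or more, '?' zero or one.
star = re.findall(r'ab*', 'a ab abb')               # ['a', 'ab', 'abb']
plus = re.findall(r'ab+', 'a ab abb')               # ['ab', 'abb']
opt = re.findall(r'colou?r', 'color colour')        # ['color', 'colour']

# '{m,n}' bounds the repeat count.
bounded = re.findall(r'a{2,3}', 'a aa aaa')         # ['aa', 'aaa']

# '[]' defines a character set.
vowels = re.findall(r'[aeiou]', 'regex')            # ['e', 'e']

# '^' and '$' anchor the pattern; '()' groups the '|' alternatives.
plural = bool(re.search(r'^(cat|dog)s?$', 'dogs'))        # True
embedded = bool(re.search(r'^(cat|dog)s?$', 'hotdogs'))   # False
```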

III. Common Patterns

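```python
import re

# Email address (loose match, not full RFC validation)
email_pattern = r'[\w.-]+@[\w.-]+\.\w+'

# Mobile phone number (mainland China)
phone_pattern = r'1[3-9]\d{9}'

# URL
url_pattern = r'https?://[\w./-]+'

# IPv4 address (does not check that each octet is in 0-255)
ip_pattern = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'

# Date (YYYY-MM-DD)
date_pattern = r'\d{4}-\d{2}-\d{2}'
```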
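
These patterns are deliberately loose. The sketch below applies them to a made-up sample string. Because the IP pattern also accepts out-of-range octets such as `999.1.1.1`, it adds one possible post-check using the standard `ipaddress` module. The sample text and the `is_valid_ipv4` helper are assumptions for illustration only.

```python
import ipaddress
import re

email_pattern = r'[\w.-]+@[\w.-]+\.\w+'
phone_pattern = r'1[3-9]\d{9}'
url_pattern = r'https?://[\w./-]+'
ip_pattern = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
date_pattern = r'\d{4}-\d{2}-\d{2}'

# Hypothetical sample text, invented for this example.
text = ('Contact alice@example.com or 13812345678, '
        'docs at https://example.com/docs updated 2024-03-16 '
        'from 192.168.1.1 and 999.1.1.1')

emails = re.findall(email_pattern, text)   # ['alice@example.com']
phones = re.findall(phone_pattern, text)   # ['13812345678']
urls = re.findall(url_pattern, text)       # ['https://example.com/docs']
dates = re.findall(date_pattern, text)     # ['2024-03-16']
ips = re.findall(ip_pattern, text)         # ['192.168.1.1', '999.1.1.1']


def is_valid_ipv4(candidate):
    """Hypothetical helper: let the ipaddress module check octet ranges."""
    try:
        ipaddress.IPv4Address(candidate)
        return True
    except ValueError:
        return False


valid_ips = [ip for ip in ips if is_valid_ipv4(ip)]   # ['192.168.1.1']
```

Using the regex to find candidates and a dedicated parser to validate them keeps the pattern short and readable.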

IV. Groups

```python
import re

# Capturing groups
text = "2024-03-16"
match = re.search(r'(\d{4})-(\d{2})-(\d{2})', text)
print(match.group(0))  # '2024-03-16' (the whole match)
print(match.group(1))  # '2024'
print(match.group(2))  # '03'
print(match.groups())  # ('2024', '03', '16')

# Named groups
match = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})', text)
print(match.group('year'))   # '2024'
print(match.group('month'))  # '03'
```
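
Groups also work across multiple matches. As a minimal sketch, `re.finditer` yields one match object per occurrence, and `groupdict()` turns named groups into a dictionary. The `log` string here is invented for the example.

```python
import re

# Hypothetical input containing two dates.
log = 'released 2024-03-16, patched 2024-04-02'
date_re = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'

# One dictionary of named groups per match.
dates = [m.groupdict() for m in re.finditer(date_re, log)]
# [{'year': '2024', 'month': '03', 'day': '16'},
#  {'year': '2024', 'month': '04', 'day': '02'}]

# With more than one capturing group, findall returns tuples of the groups
# rather than the full matched text.
pairs = re.findall(r'(\d{4})-(\d{2})', log)
# [('2024', '03'), ('2024', '04')]
```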

V. Substitution

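```python
import re

# Simple substitution
result = re.sub(r'\d+', 'X', 'a1b2c3')
print(result)  # 'aXbXcX'

# Substitution with a function: it receives the match object
# and returns the replacement string
def double(match):
    return str(int(match.group()) * 2)

result = re.sub(r'\d+', double, '1 2 3')
print(result)  # '2 4 6'

# Substitution with backreferences (\1, \2 refer to the captured groups)
result = re.sub(r'(\w+) (\w+)', r'\2 \1', 'Hello World')
print(result)  # 'World Hello'
```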
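
A few related options of `re.sub` may be worth knowing. The sketch below shows `\g<name>` for referring to named groups in the replacement, the `count` argument, and `re.subn`. The input strings are made up for the example.

```python
import re

# \g<name> refers to a named group inside the replacement string.
iso = '2024-03-16'
dmy = re.sub(
    r'(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})',
    r'\g<d>/\g<m>/\g<y>',
    iso,
)
# '16/03/2024'

# count limits how many matches are replaced.
first_only = re.sub(r'\d+', 'X', 'a1b2c3', count=1)
# 'aXb2c3'

# re.subn returns the new string together with the number of replacements.
replaced, n = re.subn(r'\d+', 'X', 'a1b2c3')
# ('aXbXcX', 3)
```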

VI. Splitting

```python
import re

# Split on runs of whitespace
parts = re.split(r'\s+', 'a b  c   d')
print(parts)  # ['a', 'b', 'c', 'd']

# Keep the separators by wrapping the pattern in a capturing group
parts = re.split(r'(\s+)', 'a b  c')
print(parts)  # ['a', ' ', 'b', '  ', 'c']
```
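
`re.split` is mostly useful where `str.split` falls short, for instance with several possible delimiters. A small sketch, with sample strings invented for the example:

```python
import re

# Split on commas or semicolons, swallowing any whitespace that follows.
fields = re.split(r'[,;]\s*', 'a, b;c,  d')
# ['a', 'b', 'c', 'd']

# maxsplit limits the number of splits; the remainder stays in one piece.
head, rest = re.split(r'\s+', 'GET /index.html HTTP/1.1', maxsplit=1)
# head == 'GET', rest == '/index.html HTTP/1.1'

# A delimiter at the very start produces an empty first element.
leading = re.split(r',', ',a,b')
# ['', 'a', 'b']
```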

VII. Compiling Patterns

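```python
import re

# Compile once to reuse the pattern. The re module also caches recently
# used patterns, so the gain is mainly in readability and in hot loops.
pattern = re.compile(r'\d+')

# Use the compiled pattern
matches = pattern.findall('a1b2c3')
print(matches)  # ['1', '2', '3']

# With a flag
pattern = re.compile(r'hello', re.IGNORECASE)
match = pattern.match('HELLO')
print(match.group())  # 'HELLO'
```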

VIII. Common Flags

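| Flag | Description |
| --- | --- |
| `re.I` | Ignore case |
| `re.M` | Multiline mode (`^` and `$` also match at each line boundary) |
| `re.S` | `.` also matches newlines |
| `re.X` | Verbose mode (whitespace is ignored and comments are allowed) |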
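
To make the effect of each flag concrete, here is a minimal sketch that runs the same kind of pattern with and without a flag. The sample text is made up, and flags can be combined with `|`.

```python
import re

text = 'First line\nsecond LINE\nthird line'

# re.I: case-insensitive matching.
ci = re.findall(r'line', text, re.I)             # ['line', 'LINE', 'line']

# re.M: '^' (and '$') match at every line, not only at the string ends.
default_start = re.findall(r'^\w+', text)        # ['First']
multi_start = re.findall(r'^\w+', text, re.M)    # ['First', 'second', 'third']

# re.S: '.' also matches '\n', so '.*' can span several lines.
no_dotall = re.search(r'First.*third', text)         # None
dotall = re.search(r'First.*third', text, re.S)      # a match object

# re.X: whitespace in the pattern is ignored and '#' starts a comment.
date_re = re.compile(r'''
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
''', re.X)
parts = date_re.search('on 2024-03-16').groups()     # ('2024', '03', '16')

# Combining flags with the bitwise OR operator.
combined = re.findall(r'^second.line$', text, re.I | re.M)   # ['second LINE']
```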

IX. Practical Applications

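```python
import re

def validate_email(email):
    # fullmatch requires the whole string to match, so ^ and $ are not
    # needed and a trailing newline is not accepted
    pattern = r'[\w.-]+@[\w.-]+\.\w+'
    return bool(re.fullmatch(pattern, email))

def extract_urls(text):
    pattern = r'https?://[\w./-]+'
    return re.findall(pattern, text)

def clean_text(text):
    # Collapse runs of whitespace into a single space and trim the ends
    return re.sub(r'\s+', ' ', text).strip()
```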
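
One more helper in the same spirit, combining a compiled pattern, capturing groups, and a replacement function. It is only a sketch: `mask_phones` and `PHONE_RE` are hypothetical names, and the pattern makes no attempt to reject digits that belong to a longer number.

```python
import re

# Hypothetical pattern: prefix (3 digits), middle (4 digits), suffix (4 digits)
# of a mainland China mobile number.
PHONE_RE = re.compile(r'(1[3-9]\d)(\d{4})(\d{4})')


def mask_phones(text):
    """Hypothetical helper: hide the middle four digits of each number."""
    return PHONE_RE.sub(lambda m: f'{m.group(1)}****{m.group(3)}', text)


masked = mask_phones('Call 13812345678 or 15900001111')
# 'Call 138****5678 or 159****1111'
```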