Python Performance Optimization | Optimizing Heavy Use of Regular Expression Matching

Business Scenario

We matched 32 million strings against 30 to 40 regular expressions, one pattern at a time. As we kept adding patterns, performance degraded steadily.

Optimization

For a regular expression that is used frequently, write it like this:

import re

PATTERN = re.compile("...")


def task(s):
    """A function that is called very frequently."""
    PATTERN.search(s)

Avoid the following two forms:

PATTERN = re.compile("...")

def task(s):
    """A function that is called very frequently."""
    re.search(PATTERN, s)

def task(s):
    """A function that is called very frequently."""
    re.search("...", s)
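
The difference between the three forms can be measured with a small timeit benchmark. The following is a minimal sketch, not the measurement used in this article: the pattern `RAW`, the string `SAMPLE`, and the function names are hypothetical placeholders, and the absolute numbers depend on the Python version and the machine.

```python
import re
import timeit

RAW = r"\d+"                    # hypothetical pattern, for illustration only
PATTERN = re.compile(RAW)       # compiled once at import time
SAMPLE = "order 12345 shipped"  # hypothetical input string


def use_method(s):
    """Call search() directly on the compiled object."""
    return PATTERN.search(s)


def use_module_with_compiled(s):
    """Pass the compiled object back through the module-level function."""
    return re.search(PATTERN, s)


def use_module_with_string(s):
    """Pass the raw pattern string to the module-level function."""
    return re.search(RAW, s)


def benchmark(number=200_000):
    """Return the total seconds spent on `number` calls of each variant."""
    candidates = {
        "PATTERN.search(s)": use_method,
        "re.search(PATTERN, s)": use_module_with_compiled,
        "re.search(RAW, s)": use_module_with_string,
    }
    return {
        name: timeit.timeit(lambda f=func: f(SAMPLE), number=number)
        for name, func in candidates.items()
    }


if __name__ == "__main__":
    for name, seconds in benchmark().items():
        print(f"{name:<24} {seconds:.3f} s")
```

All three variants return the same match; only the per-call overhead differs, which is why the gap only becomes visible at tens of millions of calls.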

Caveats

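Before using the re module, we had read the documentation and learned that, for frequently used regular expressions, compiling them into regular expression objects with re.compile can noticeably improve performance. So our code everywhere looked like this:

PATTERN = re.compile("...")

def task(s):
    """A function that is called very frequently."""
    re.search(PATTERN, s)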

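However, when we profiled a small sample with cProfile, we found that _compile was still being called a very large number of times and accounted for a large share of the run time: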

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
......
  4159844    4.635    0.000    5.892    0.000 __init__.py:272(_compile)
......
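
A call count like the one above can also be collected programmatically. The sketch below is an illustration under assumptions, not the profiling script used for this article: the pattern, the sample strings, and the helper `count_compile_calls` are hypothetical, and it relies on the `stats` dictionary of `pstats.Stats`, whose layout is a CPython detail.

```python
import cProfile
import pstats
import re

PATTERN = re.compile(r"\d+")  # hypothetical pattern, compiled before profiling


def task_module_level(s):
    """Pass the compiled object back through the module-level function."""
    return re.search(PATTERN, s)


def task_method(s):
    """Call the method on the compiled object directly."""
    return PATTERN.search(s)


def count_compile_calls(task, samples):
    """Profile `task` over `samples`; count calls to functions named _compile."""
    profiler = cProfile.Profile()
    profiler.enable()
    for s in samples:
        task(s)
    profiler.disable()
    stats = pstats.Stats(profiler)
    # stats.stats maps (filename, lineno, function name) to
    # (primitive calls, total calls, tottime, cumtime, callers).
    return sum(
        total_calls
        for (_file, _line, name), (_prim, total_calls, _tt, _ct, _callers)
        in stats.stats.items()
        if name == "_compile"
    )


# Hypothetical sample; the real workload had millions of strings.
SAMPLES = ["order %d shipped" % i for i in range(1000)]

if __name__ == "__main__":
    print("re.search(PATTERN, s):", count_compile_calls(task_module_level, SAMPLES))
    print("PATTERN.search(s):    ", count_compile_calls(task_method, SAMPLES))
```

Profiling a small sample is usually enough here, because the call count per input string is what matters, not the absolute time.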

This made us suspect that re.search() and similar functions still call _compile, so we looked at the source of re.search():

def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a Match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)

Then we looked at the source of re._compile():

_cache = {}  # ordered!

_MAXCACHE = 512
def _compile(pattern, flags):
    # internal: compile pattern
    if isinstance(flags, RegexFlag):
        flags = flags.value
    try:
        return _cache[type(pattern), pattern, flags]
    ......

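From this source we learned that compiling with re.compile() first and then passing the regular expression object to re.search() and similar functions goes through the same _compile call and cache lookup, on every call, as passing the pattern string to re.search().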
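
The cache described above can be observed from the outside without touching private names. This is a small sketch assuming CPython, where the behavior shown is an implementation detail rather than a documented guarantee; the pattern `RAW` and the input text are hypothetical.

```python
import re

RAW = r"\d+"  # hypothetical pattern, for illustration only

first = re.compile(RAW)
second = re.compile(RAW)

# In CPython the second compile is served from the module-level cache,
# so both names refer to the same pattern object.
served_from_cache = first is second

# Passing an already compiled object to re.compile() hands it straight back.
returned_unchanged = re.compile(first) is first

# All three call styles find the same match; they differ only in overhead.
text = "abc 42"
via_string = re.search(RAW, text)
via_object = re.search(first, text)
via_method = first.search(text)

if __name__ == "__main__":
    print(served_from_cache, returned_unchanged)
    print(via_string.group(), via_object.group(), via_method.group())
```

Because the cache already avoids recompiling, the remaining cost of the module-level functions is the extra function call and lookup on each invocation.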

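The faster approach is to call search() and the other methods directly on the regular expression object:

import re

PATTERN = re.compile("...")


def task(s):
    """A function that is called very frequently."""
    PATTERN.search(s)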

After this change, _compile takes almost no time.
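
For a workload with dozens of patterns, the same idea extends to a list of compiled objects. The following is a hedged sketch of one way to arrange it, not the code from this project: the patterns in `RAW_PATTERNS` and the helper `first_match_index` are hypothetical examples.

```python
import re

# Hypothetical patterns standing in for the 30 to 40 real ones.
RAW_PATTERNS = [
    r"\d{4}-\d{2}-\d{2}",
    r"[A-Z]{3}\d+",
    r"error|fail",
]

# Compile once at import time and keep the bound search methods,
# so the hot loop never goes through the module-level re functions.
SEARCHERS = [re.compile(p).search for p in RAW_PATTERNS]


def first_match_index(s):
    """Return the index of the first pattern that matches s, or None."""
    for index, search in enumerate(SEARCHERS):
        if search(s):
            return index
    return None


if __name__ == "__main__":
    for line in ["2024-01-02", "ABC123", "an error occurred", "nothing here"]:
        print(line, "->", first_match_index(line))
```

Storing the bound methods also saves an attribute lookup per call, although that saving is small compared with avoiding the module-level functions.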
