Python performance optimization | Speeding up code that relies heavily on regular expression matching

Business scenario

We matched 32 million strings, one by one, against 30-40 regular expressions. As the number of regular expressions kept growing, performance gradually degraded.
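
To make the workload concrete, here is a minimal sketch of what such a job might look like. The rule names, patterns, and the `matching_rules` helper are hypothetical and chosen only for illustration; the real job used 30-40 patterns that are not shown in this article.

```python
import re

# Hypothetical rules, for illustration only (not the patterns from the real job).
RULES = {
    "has_digits": re.compile(r"\d+"),
    "has_email": re.compile(r"[\w.+-]+@[\w-]+\.\w+"),
    "starts_with_error": re.compile(r"^ERROR\b"),
}


def matching_rules(s):
    """Return the names of all rules whose pattern matches s."""
    return [name for name, pattern in RULES.items() if pattern.search(s)]


if __name__ == "__main__":
    print(matching_rules("ERROR 42 from bob@example.com"))
```

With tens of millions of input strings, the per-call overhead of each `search` is multiplied by both the number of strings and the number of rules, which is why small differences in how the search is invoked matter here.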

Optimization

For frequently used regular expressions, compile the pattern once and call the method on the compiled object:

import re

PATTERN = re.compile("...")


def task(s):
    """A frequently called function."""
    PATTERN.search(s)

Avoid the following two approaches:

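# Approach 1: pass the compiled pattern to the module-level function.
PATTERN = re.compile("...")


def task(s):
    """A frequently called function."""
    re.search(PATTERN, s)


# Approach 2: pass the pattern string to the module-level function.
def task(s):
    """A frequently called function."""
    re.search("...", s)

Precautions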

Before using the re module, we had read the documentation and learned that, for frequently used regular expressions, compiling them into regular expression objects with re.compile can significantly improve performance. So our usage everywhere looked like this:

PATTERN = re.compile("...")


def task(s):
    """A frequently called function."""
    re.search(PATTERN, s)

However, after profiling a small sample with cProfile, we found that _compile was still being executed many times and took up a lot of time:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
......
  4159844    4.635    0.000    5.892    0.000 __init__.py:272(_compile)
......
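
A measurement like the one above can be reproduced on a small sample with cProfile and pstats. The sketch below is only an illustration: the pattern, the sample strings, and the helper names `search_with_module_function` and `count_calls` are assumptions, not code from the original job.

```python
import cProfile
import pstats
import re

# Hypothetical pattern and sample data, for illustration only.
PATTERN = re.compile(r"\d+")
SAMPLES = ["order 12345", "no digits here"] * 1000


def search_with_module_function():
    """Pass the precompiled pattern to the module-level re.search()."""
    for s in SAMPLES:
        re.search(PATTERN, s)


def count_calls(workload, func_name="_compile"):
    """Profile workload() and return how often func_name was called."""
    profiler = cProfile.Profile()
    profiler.enable()
    workload()
    profiler.disable()
    stats = pstats.Stats(profiler)
    # stats.stats maps (filename, lineno, funcname) to
    # (primitive calls, total calls, tottime, cumtime, callers).
    return sum(
        entry[1]
        for (_, _, name), entry in stats.stats.items()
        if name == func_name
    )


if __name__ == "__main__":
    calls = count_calls(search_with_module_function)
    print(f"_compile calls for {len(SAMPLES)} searches: {calls}")
```

Even though the pattern is compiled only once up front, the profile records one _compile call per search, which matches the shape of the output shown above.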

The high _compile call count led us to suspect that re.search() and the other module-level functions still call _compile. The source code of re.search() is as follows:

def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a Match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)

The source code of re._compile() is as follows:

_cache = {}  # ordered!

_MAXCACHE = 512


def _compile(pattern, flags):
    # internal: compile pattern
    if isinstance(flags, RegexFlag):
        flags = flags.value
    try:
        return _cache[type(pattern), pattern, flags]
    ......

From this source code we can see that compiling the pattern with re.compile() first and then passing the regular expression object to re.search() and similar functions works much the same way as passing the pattern string directly to re.search(): in both cases, every call still goes through _compile and its cache lookup.
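
The caching behavior can be observed from the outside without touching private names. The short sketch below uses a hypothetical pattern and relies on CPython implementation details (the internal pattern cache and the pass-through of already compiled patterns), so treat the identity checks as an illustration rather than a documented guarantee.

```python
import re

pattern_text = r"\d+"  # hypothetical pattern, for illustration only

first = re.compile(pattern_text)
second = re.compile(pattern_text)

# In CPython, compiling the same string with the same flags again
# is served from the internal cache and yields the same object.
same_from_cache = first is second

# Passing an already compiled pattern back in returns it unchanged.
passthrough = re.compile(first) is first

# The module-level function accepts either form and finds the same match.
match_from_text = re.search(pattern_text, "abc 123")
match_from_object = re.search(first, "abc 123")

if __name__ == "__main__":
    print(same_from_cache, passthrough)
    print(match_from_text.group(), match_from_object.group())
```

Either way, the module-level call does extra work on every invocation before the actual search starts.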

A faster approach is to call search() and the other methods directly on the regular expression object, that is:

PATTERN = re.compile("...")


def task(s):
    """A frequently called function."""
    PATTERN.search(s)

After this change, _compile takes up almost no time.
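
To check the effect on your own machine, the three call styles can be timed with timeit. This is a rough sketch under stated assumptions: the pattern, the sample string, and the `bench` helper are hypothetical, and absolute timings will vary with the Python version and hardware, so no expected numbers are given.

```python
import re
import timeit

# Hypothetical pattern and input, for illustration only.
PATTERN_TEXT = r"\d+"
PATTERN = re.compile(PATTERN_TEXT)
SAMPLE = "order 12345 shipped"


def bench(number=200_000):
    """Time each call style and return {label: seconds}."""
    candidates = {
        "PATTERN.search(s)": lambda: PATTERN.search(SAMPLE),
        "re.search(PATTERN, s)": lambda: re.search(PATTERN, SAMPLE),
        "re.search('...', s)": lambda: re.search(PATTERN_TEXT, SAMPLE),
    }
    return {
        label: timeit.timeit(fn, number=number)
        for label, fn in candidates.items()
    }


if __name__ == "__main__":
    for label, seconds in bench().items():
        print(f"{label:24s} {seconds:.3f}s")
```

The lambda wrappers add the same small overhead to every candidate, so the comparison between the three styles remains meaningful even though the absolute numbers are slightly inflated.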

Origin blog.csdn.net/Changxing_J/article/details/133308468