Business scenario
We matched 32 million strings, one by one, against 30-40 regular expressions. As the number of regular expressions kept growing, performance steadily degraded.
Optimization
For frequently used regular expressions, use the following method:
import re

PATTERN = re.compile("...")

def task(s):
    """A frequently called function."""
    return PATTERN.search(s)
Avoid the following two forms:
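import re

PATTERN = re.compile("...")

# Form 1: passing the compiled object to the module-level function
def task(s):
    """A frequently called function."""
    return re.search(PATTERN, s)

# Form 2: passing the pattern string to the module-level function
def task(s):
    """A frequently called function."""
    return re.search("...", s)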
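To get a rough feel for the difference between the three forms, they can be timed with `timeit`. The following is a minimal sketch, not the measurement used in this post: the date pattern, the `SAMPLE` string, and the function names are all made up for illustration, and the absolute numbers will vary by machine and Python version.

```python
import re
import timeit

# Hypothetical pattern and input, chosen only for illustration.
PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")
SAMPLE = "shipped on 2024-01-15 from warehouse 7"


def method_call(s):
    """Recommended form: call the method on the compiled object."""
    return PATTERN.search(s)


def module_call_compiled(s):
    """Compiled object passed to the module-level function."""
    return re.search(PATTERN, s)


def module_call_string(s):
    """Pattern string passed to the module-level function."""
    return re.search(r"\d{4}-\d{2}-\d{2}", s)


def benchmark(number=200_000):
    """Return the total time in seconds for `number` calls of each form."""
    timings = {}
    for func in (method_call, module_call_compiled, module_call_string):
        timings[func.__name__] = timeit.timeit(lambda: func(SAMPLE), number=number)
    return timings


if __name__ == "__main__":
    for name, seconds in benchmark().items():
        print(f"{name:22s} {seconds:.3f} s")
```

All three functions return the same match; only the per-call overhead differs.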
Pitfalls
Before using the re module, we read the documentation and learned that, for frequently used regular expressions, compiling them into regular expression objects with re.compile() can significantly improve performance. So our code everywhere looked like this:
import re

PATTERN = re.compile("...")

def task(s):
    """A frequently called function."""
    return re.search(PATTERN, s)
However, after profiling a small sample with cProfile, we found that _compile was still being called many times and took up a lot of time:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   ......
  4159844    4.635    0.000    5.892    0.000 __init__.py:272(_compile)
   ......
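A profile like the one above can be collected roughly as follows. This is a sketch under assumptions: the `\d+` pattern, the `profile_sample` helper, and the input strings are hypothetical stand-ins for the real workload, and the report is filtered so that only entries whose name matches `_compile` are shown.

```python
import cProfile
import io
import pstats
import re

# Hypothetical pattern standing in for one of the real expressions.
PATTERN = re.compile(r"\d+")


def task(s):
    """The form under investigation: compiled object passed to re.search()."""
    return re.search(PATTERN, s)


def profile_sample(strings):
    """Profile task() over a sample and return the report text for _compile."""
    profiler = cProfile.Profile()
    profiler.enable()
    for s in strings:
        task(s)
    profiler.disable()

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # The string argument is a regex that restricts the printed entries.
    stats.sort_stats("tottime").print_stats("_compile")
    return out.getvalue()


if __name__ == "__main__":
    sample = [f"order {i}" for i in range(100_000)]
    print(profile_sample(sample))
```

The `ncalls` column for `_compile` is the number to watch: if it tracks the number of strings processed, every match is paying for a trip through `_compile`.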
So we suspected that re.search() and the other module-level functions still call _compile. The source code of re.search() is as follows:
def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a Match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)
The source code of re._compile() is as follows:
_cache = {}  # ordered!

_MAXCACHE = 512
def _compile(pattern, flags):
    # internal: compile pattern
    if isinstance(flags, RegexFlag):
        flags = flags.value
    try:
        return _cache[type(pattern), pattern, flags]
    ......
From this source code we learned that compiling with re.compile() first and then passing the regular expression object to re.search() and the other module-level functions still goes through _compile and its cache lookup on every call, just as passing the pattern string to re.search() does. A faster approach is to call search() and the other methods directly on the regular expression object, that is:
import re

PATTERN = re.compile("...")

def task(s):
    """A frequently called function."""
    return PATTERN.search(s)
After this change, _compile takes up almost no time.
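One way to double-check this without a profiler is to count how often `re._compile` is entered for each form. The sketch below temporarily wraps `re._compile` with a counter. Note that `_compile` is a private CPython implementation detail, so patching it is only suitable for an experiment like this one; the `count_compile_calls` helper, the pattern, and the input string are hypothetical.

```python
import re

# Hypothetical pattern, compiled before the counter is installed.
PATTERN = re.compile(r"\d+")


def count_compile_calls(func, n=1000):
    """Call func(s) n times and return how often re._compile was entered.

    Assumption: re.search() looks up _compile in the re module's globals,
    as in the source shown above, so replacing the attribute intercepts it.
    """
    original = re._compile
    calls = 0

    def counting_compile(pattern, flags):
        nonlocal calls
        calls += 1
        return original(pattern, flags)

    re._compile = counting_compile
    try:
        for _ in range(n):
            func("order 12345")
    finally:
        # Always restore the real function, even if func raises.
        re._compile = original
    return calls


via_module = count_compile_calls(lambda s: re.search(PATTERN, s))
via_method = count_compile_calls(lambda s: PATTERN.search(s))

if __name__ == "__main__":
    print("re.search(PATTERN, s):", via_module, "calls to _compile")
    print("PATTERN.search(s):   ", via_method, "calls to _compile")
```

On CPython the first form should report one `_compile` call per match, while the second should report none, which is consistent with the profile after the change.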