Using the re module in Python: regular expressions

Foreword

This article does not cover the basics of regular expressions. There are generally two extraction cases: one is to extract a string from a single position in the text, and the other is to extract strings from multiple consecutive positions. Log analysis often runs into the second case. The corresponding methods for both are described below.

1. String extraction from a single location

In this case, we can use the pattern (.+?) to extract the string. For example, given the string "a123b", if we want to extract the value 123 between a and b, we can call findall with this pattern. It returns a list containing all matches.

The code is as follows:
import re
text = "a123b"
print(re.findall(r"a(.+?)b", text))
# output: ['123']
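
Because findall returns every non-overlapping match, the same pattern also works when the delimiters occur more than once. The following is a minimal sketch, assuming a made-up sample string. The variable names are illustrative and not from the original example. It also shows re.search, which returns only the first match.

```python
import re

# Hypothetical sample text with three a...b segments
text = "a123b a45b a6b"

# findall returns the captured group of every non-overlapping match, in order
all_values = re.findall(r"a(.+?)b", text)
print(all_values)  # ['123', '45', '6']

# search returns only the first match (or None); group(1) is the captured part
first = re.search(r"a(.+?)b", text)
print(first.group(1))  # 123
```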

1.1 Greedy and non-greedy matching

Given the string "a123b456b", suppose we want to match everything between a and the last b, rather than between a and the first b. We can use ? after a quantifier to switch between greedy and non-greedy matching.

The code is as follows:
import re
text = "a123b456b"

print(re.findall(r"a(.+?)b", text))
# output: ['123']
# The ? after + makes the quantifier non-greedy, so the match stops at the nearest b

print(re.findall(r"a(.+)b", text))
# output: ['123b456']

print(re.findall(r"a(.*)b", text))
# output: ['123b456']
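
To make the difference easier to see, the sketch below prints the full matched text and its span for the non-greedy and greedy patterns on the same string. It also uses the made-up input "ab" to show where .+ and .* differ, since both give the same result on the string above.

```python
import re

text = "a123b456b"

lazy = re.search(r"a(.+?)b", text)    # non-greedy: stops at the first b that works
greedy = re.search(r"a(.+)b", text)   # greedy: extends to the last b that works

print(lazy.group(0), lazy.span())      # a123b (0, 5)
print(greedy.group(0), greedy.span())  # a123b456b (0, 9)

# .+ needs at least one character between a and b, .* accepts none
plus_result = re.findall(r"a(.+)b", "ab")
star_result = re.findall(r"a(.*)b", "ab")
print(plus_result)  # []
print(star_result)  # ['']
```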

1.2 Multi-line matching

To match across multiple lines, use the re.S and re.M flags. With re.S, . also matches newlines. By default, . does not match a newline.
The code is as follows:
import re
text = "a23b\na34b"

print(re.findall(r"a(\d+)b.+a(\d+)b", text))
# output: []
# By default . cannot match the \n newline in the middle of text

print(re.findall(r"a(\d+)b.+a(\d+)b", text, re.S))
# output: [('23', '34')]
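
As a small sketch of equivalent spellings: re.DOTALL is the long name of re.S in the standard library, and the inline form (?s) at the start of a pattern sets the same flag. The variable names here are illustrative only.

```python
import re

text = "a23b\na34b"
pattern = r"a(\d+)b.+a(\d+)b"

with_flag = re.findall(pattern, text, re.S)
with_long_name = re.findall(pattern, text, re.DOTALL)  # re.DOTALL is the same flag as re.S
with_inline = re.findall(r"(?s)" + pattern, text)      # (?s) sets the flag inside the pattern

print(with_flag)  # [('23', '34')]
print(with_flag == with_long_name == with_inline)  # True
```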

With re.M, ^ and $ match at the start and end of every line. By default, ^ matches only at the start of the whole string and $ only at its end (or just before a trailing newline).

The code is as follows:
import re
text = "a23b\na34b"

print(re.findall(r"^a(\d+)b", text))
# output: ['23']

print(re.findall(r"^a(\d+)b", text, re.M))
# output: ['23', '34']
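
The same flag affects $ as well, and flags can be combined with |. A minimal sketch on the same string (the combined pattern is an illustrative assumption, not from the original):

```python
import re

text = "a23b\na34b"

# Without re.M, $ matches only at the end of the string, so only the last line matches
end_default = re.findall(r"a(\d+)b$", text)
print(end_default)  # ['34']

# With re.M, $ also matches just before each newline
end_multiline = re.findall(r"a(\d+)b$", text, re.M)
print(end_multiline)  # ['23', '34']

# re.S lets . match the newline, re.M lets the second ^ match at the start of line two
combined = re.findall(r"^a(\d+)b.^a(\d+)b$", text, re.S | re.M)
print(combined)  # [('23', '34')]
```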

2. String extraction from multiple consecutive positions

In this case, we can use named groups of the form (?P<name>...) to extract the strings. For example, suppose we have a line from a web server access log: '192.168.0.1 25/Oct/2012:14:46:34 "GET /api HTTP/1.1" 200 44 "http://abc.com/search" "Mozilla/5.0"'. To extract every field in this line, we can write several (?P<name>expr) groups, where name is the variable name you choose for that position and expr is the regular expression that matches the text at that position.

The code is as follows:

import re
line = ('192.168.0.1 25/Oct/2012:14:46:34 "GET /api HTTP/1.1" 200 44 '
        '"http://abc.com/search" "Mozilla/5.0"')
# One named group per field: [^ ]* reads up to the next space, [^"]* up to the next quote
reg = re.compile(r'^(?P<remote_ip>[^ ]*) (?P<date>[^ ]*) "(?P<request>[^"]*)" '
                 r'(?P<status>[^ ]*) (?P<size>[^ ]*) "(?P<referrer>[^"]*)" '
                 r'"(?P<user_agent>[^"]*)"')
regMatch = reg.match(line)
linebits = regMatch.groupdict()  # dict mapping each group name to its matched text
print(linebits)
for k, v in linebits.items():
    print(k + ": " + v)
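
In real log analysis some lines will not match, and match then returns None, so calling groupdict directly would raise an error. The sketch below reuses the article's pattern and wraps it in a helper. The names LOG_PATTERN and parse_line and the malformed sample line are hypothetical, introduced only for illustration.

```python
import re

# The article's pattern, compiled once and reused for every line
LOG_PATTERN = re.compile(
    r'^(?P<remote_ip>[^ ]*) (?P<date>[^ ]*) "(?P<request>[^"]*)" '
    r'(?P<status>[^ ]*) (?P<size>[^ ]*) "(?P<referrer>[^"]*)" '
    r'"(?P<user_agent>[^"]*)"'
)

def parse_line(line):
    """Hypothetical helper: return a dict of fields, or None if the line does not match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

lines = [
    '192.168.0.1 25/Oct/2012:14:46:34 "GET /api HTTP/1.1" 200 44 '
    '"http://abc.com/search" "Mozilla/5.0"',
    'not a log line',  # made-up malformed input
]
parsed = [parse_line(entry) for entry in lines]
print(parsed[0]["status"])  # 200
print(parsed[1])            # None
```

All captured values are strings, so fields such as status and size need int() if they are to be used as numbers.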
