Foreword
This article does not cover the basics of regular expressions. Extraction generally falls into two cases: extracting the string at a single position in the text, and extracting the strings at multiple consecutive positions. Log analysis runs into both cases, and the corresponding methods are described below.
1. String extraction from a single location
In this case, we can use the regular expression (.+?) to extract the string. For example, given the string "a123b", to extract the value 123 between a and b we can call findall with a regular expression; it returns a list containing all matches. The code is as follows:
import re
text = "a123b"
print(re.findall(r"a(.+?)b", text))  # output: ['123']
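When only the first occurrence is needed, re.search is one alternative to findall. The following is a minimal sketch; extract_between is a hypothetical helper name introduced here for illustration, not something from the original article.

```python
import re

def extract_between(text, start, end):
    """Return the first substring found between start and end, or None."""
    # re.escape keeps the delimiters literal even if they contain regex metacharacters
    pattern = re.escape(start) + r"(.+?)" + re.escape(end)
    match = re.search(pattern, text)
    # re.search returns None when nothing matches, so check before calling group()
    return match.group(1) if match else None

print(extract_between("a123b", "a", "b"))  # 123
print(extract_between("xyz", "a", "b"))    # None
```

Checking for None before calling group() avoids an AttributeError on lines that do not contain the delimiters, which is common in real logs.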
1.1 Greedy and non-greedy matching
Suppose we have the string "a123b456b". Whether the match runs from a to the first b or from a to the last b is controlled by ?: a quantifier followed by ? is non-greedy and matches as little as possible, while a quantifier without it is greedy and matches as much as possible. The code is as follows:
import re
text = "a123b456b"
# The ? after + makes the quantifier non-greedy, so the match stops at the nearest b
print(re.findall(r"a(.+?)b", text))  # output: ['123']
# Without ?, + and * are greedy and match up to the last b
print(re.findall(r"a(.+)b", text))  # output: ['123b456']
print(re.findall(r"a(.*)b", text))  # output: ['123b456']
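To make the difference visible, the sketch below prints the captured text together with its position in the string. It also shows a negated character class, [^b]+, which is a different technique from non-greedy matching but gives the same result for this particular input.

```python
import re

text = "a123b456b"

greedy = re.search(r"a(.+)b", text)       # expands as far as possible
lazy = re.search(r"a(.+?)b", text)        # stops at the first b that allows a match
negated = re.search(r"a([^b]+)b", text)   # can never run past a b

# span(1) gives the start and end index of group 1
print(greedy.group(1), greedy.span(1))  # 123b456 (1, 8)
print(lazy.group(1), lazy.span(1))      # 123 (1, 4)
print(negated.group(1))                 # 123
```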
1.2 Multi-line matching
To match across multiple lines, add the re.S or re.M flag, depending on what is needed. With re.S, . also matches newlines; by default, . does not match a newline. The code is as follows:
import re
text = "a23b\na34b"
# Without re.S, . cannot cross the \n in the middle of the string
print(re.findall(r"a(\d+)b.+a(\d+)b", text))  # output: []
print(re.findall(r"a(\d+)b.+a(\d+)b", text, re.S))  # output: [('23', '34')]
With re.M, ^ and $ match at the start and end of every line. By default, ^ matches only at the start of the string and $ only at the end of the string (or just before a trailing newline).
The code is as follows:
import re
text = "a23b\na34b"
print(re.findall(r"^a(\d+)b", text))  # output: ['23']
print(re.findall(r"^a(\d+)b", text, re.M))  # output: ['23', '34']
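The two flags can also be combined with |. The following sketch uses a made-up multi-line string (the BEGIN and END markers are assumptions for illustration only) to show each flag alone and then both together.

```python
import re

# Hypothetical multi-line text, used only for illustration
text = "BEGIN\nline one\nline two\nEND\n"

# re.S (also spelled re.DOTALL): . matches \n, so the group can span several lines
block = re.search(r"BEGIN\n(.*?)\nEND", text, re.S).group(1)
print(repr(block))  # 'line one\nline two'

# re.M (also spelled re.MULTILINE): ^ and $ anchor at every line
words = re.findall(r"^line (\w+)$", text, re.M)
print(words)  # ['one', 'two']

# Both flags combined: line anchors for the markers, . crossing lines in between
combined = re.search(r"^BEGIN$(.*?)^END$", text, re.S | re.M).group(1)
print(combined.strip().splitlines())  # ['line one', 'line two']
```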
2. String extraction from multiple consecutive positions
In this case, we can use named groups of the form (?P<name>...) to extract the strings. For example, suppose we have one line of a web server access log: '192.168.0.1 25/Oct/2012:14:46:34 "GET /api HTTP/1.1" 200 44 "http://abc.com/search" "Mozilla/5.0"'. To extract every field in this line, we can write several (?P<name>expr) groups, where name is the variable name you give to the string at that position and expr is the regular expression that matches it. The code is as follows:
import re
line = '192.168.0.1 25/Oct/2012:14:46:34 "GET /api HTTP/1.1" 200 44 "http://abc.com/search" "Mozilla/5.0"'
reg = re.compile(r'^(?P<remote_ip>[^ ]*) (?P<date>[^ ]*) "(?P<request>[^"]*)" (?P<status>[^ ]*) (?P<size>[^ ]*) "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"')
regMatch = reg.match(line)
# groupdict() returns a dict mapping each group name to the string it matched
linebits = regMatch.groupdict()
print(linebits)
for k, v in linebits.items():
    print(k + ": " + v)
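When processing a whole log file, some lines may not fit the pattern, and numeric fields are usually more useful as integers. The sketch below wraps the same pattern in a function; parse_line and LOG_PATTERN are hypothetical names, and converting status and size to int is an assumption about how the fields will be used.

```python
import re

LOG_PATTERN = re.compile(
    r'^(?P<remote_ip>[^ ]*) (?P<date>[^ ]*) "(?P<request>[^"]*)" '
    r'(?P<status>[^ ]*) (?P<size>[^ ]*) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert numeric fields; keep None when the value is not a plain number (e.g. "-")
    for key in ("status", "size"):
        value = fields[key]
        fields[key] = int(value) if value.isdigit() else None
    return fields

line = ('192.168.0.1 25/Oct/2012:14:46:34 "GET /api HTTP/1.1" 200 44 '
        '"http://abc.com/search" "Mozilla/5.0"')
parsed = parse_line(line)
print(parsed["remote_ip"], parsed["status"], parsed["size"])  # 192.168.0.1 200 44
print(parse_line("not a log line"))  # None
```

Returning None for lines that do not match lets the caller skip or count bad lines instead of failing on the first one.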