Programming Xiaobai's self-study notes 3 (Python regular expressions)

Series Article Directory

Programming Xiaobai's self-study notes 2 (processing table files with python) 

Programming Xiaobai's self-study notes 1 (processing table files with python) 


Foreword

I had been hearing about regular expressions for a long time, and they have a reputation for being fairly difficult. Today I finally got hands-on with them, and I am ready to chew on this tough bone. Let's work through it together.

1. Predefined character "\d"

At first glance, regular expressions look like the wildcards we use to search for files. I will learn more later about the specific scenarios where they are useful.

The first topic is matching digits with a regular expression. Regular expressions live in the re module. It is part of the Python standard library, so there is nothing to install, but it still has to be imported with import re before it can be used. The findall method returns every part of a string that matches a pattern. The specific code is as follows:

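import re

# Sample text mixing Chinese characters and digits
string = '每天3问是哪个人才想出来的4.56'
# A raw string (r'...') keeps the backslash in \d intact
print('Matching digits:', re.findall(r'\d', string))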

The returned list is ['3', '4', '5', '6']. Every digit is matched one character at a time, and the results are collected in a list; the period in 4.56 is skipped. \d matches a single digit: 0-9, and, for str patterns in Python 3, other Unicode decimal digits as well.

2. Predefined character "\s"

The predefined character \s matches whitespace characters: the space, tab (\t), newline (\n), carriage return (\r), form feed (\f), and vertical tab (\v). The specific code is as follows:

import re

# The literal spans two lines; '\\' is a single backslash character
string = '每天3问 是哪' \
         '个\\脑子\n想出\t来\f的4.56'
print('Matching whitespace:', re.findall(r'\s', string))

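The returned list is [' ', '\n', '\t', '\x0c']: the space, the newline, the tab, and the form feed (\f is displayed as '\x0c'). The \\ was not returned, and when I also tried \a, \b, and \e, none of them were returned either. That is because none of them are whitespace: \\ is a literal backslash, \a is the bell character, \b is backspace, and \e is not a Python escape sequence at all, so it stays as a backslash followed by the letter e.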
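
To check this for myself, a small sketch like the one below tests each character against \s individually. The list name samples and the choice of characters are my own, picked only for illustration.

```python
import re

# Illustrative sample: backslash, bell, backspace, space, newline, tab, form feed
samples = ['\\', '\a', '\b', ' ', '\n', '\t', '\f']

for ch in samples:
    # fullmatch succeeds only if the single character is whitespace
    is_space = re.fullmatch(r'\s', ch) is not None
    print(repr(ch), 'whitespace' if is_space else 'not whitespace')

# Keep only the characters that \s accepts
whitespace_only = [ch for ch in samples if re.fullmatch(r'\s', ch)]
print(whitespace_only)  # [' ', '\n', '\t', '\x0c']
```

The backslash, bell, and backspace are reported as not whitespace, which agrees with the findall result above.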

3. Predefined character "\w"

"\w" matches word characters: letters, digits, and the underscore. It is not the opposite of "\s"; that role belongs to "\S", which matches any non-whitespace character. Let's try it.

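import re

# '\\e' is a backslash followed by the letter e (Python has no '\e' escape)
string = '每天3问 是哪' \
         '个\\e脑\a子\n想\b出\t来\f的4.56'
print('Matching word characters:', re.findall(r'\w', string))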

The returned list is ['每', '天', '3', '问', '是', '哪', '个', 'e', '脑', '子', '想', '出', '来', '的', '4', '5', '6']. Digits, letters, and Chinese characters are all matched, while the backslash, the control characters, the whitespace, and the period are not.
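
For comparison, here is a minimal sketch that puts \w next to \S, the actual complement of \s. The sample string text is made up for this illustration.

```python
import re

# Made-up sample: letter, underscore, digit, space, hyphen, Chinese character, tab, punctuation
text = 'a_1 -好\t!'

word_chars = re.findall(r'\w', text)  # letters, digits, underscore
non_space = re.findall(r'\S', text)   # anything that is not whitespace

print(word_chars)  # ['a', '_', '1', '好']
print(non_space)   # ['a', '_', '1', '-', '好', '!']
```

The hyphen and the exclamation mark show the difference: they are not whitespace, so \S matches them, but they are not word characters, so \w does not.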

4. Parameter re.A

The findall method also accepts an optional flags argument. The previous example returned digits, letters, and Chinese characters. If we don't want to match Chinese characters, we can pass the flag re.A (short for re.ASCII), which restricts \w to ASCII letters, digits, and the underscore. The call becomes re.findall(r'\w', string, re.A), and the returned list is ['3', 'e', '4', '5', '6'], with no Chinese characters.
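
As a runnable sketch of the same idea, the snippet below compares the default behaviour with re.A on a simplified version of the sample string (I left out the control characters to keep it readable). The inline form (?a) is an equivalent way to switch on the same flag inside the pattern.

```python
import re

# Simplified sample string: Chinese characters, digits, and one ASCII letter
string = '每天3问 是哪个e脑子想出来的4.56'

unicode_match = re.findall(r'\w', string)        # default: Unicode word characters
ascii_match = re.findall(r'\w', string, re.A)    # re.A is short for re.ASCII
inline_match = re.findall(r'(?a)\w', string)     # same flag written inside the pattern

print(len(unicode_match))  # 17
print(ascii_match)         # ['3', 'e', '4', '5', '6']
print(inline_match)        # ['3', 'e', '4', '5', '6']
```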

5. Quantifiers

The book says that the * symbol matches the preceding character 0 or more times, and the ? symbol matches it 0 or 1 time. For example, o*r matches the letter o 0 or more times, followed by r.

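import re

# Match "a" zero or more times
pattern1 = 'a*'
text1 = 'aaabaaa'
result1 = re.findall(pattern1, text1)
# The empty strings are zero-length matches at "b" and at the end of the text
print(result1) # ['aaa', '', 'aaa', '']

# Match "a" zero or one time
pattern2 = 'a?'
text2 = 'aaabaaa'
result2 = re.findall(pattern2, text2)
print(result2) # ['a', 'a', 'a', '', 'a', 'a', 'a', '']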
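
Because a* is allowed to match zero characters, findall also reports empty strings. A sketch using re.finditer, which exposes the position of each match, makes it easier to see where they come from:

```python
import re

# Show the (start, end) position of every match of a* in the text
for m in re.finditer(r'a*', 'aaabaaa'):
    print(m.span(), repr(m.group()))

spans = [m.span() for m in re.finditer(r'a*', 'aaabaaa')]
print(spans)  # [(0, 3), (3, 3), (4, 7), (7, 7)]
```

The spans (3, 3) and (7, 7) are zero-length matches: one in front of the letter b, where no a is available, and one at the very end of the text.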

 

This is a bit hard to grasp, so I personally prefer to put it this way: o*r matches any run of consecutive o's, including none at all, followed by r, for example r, or, and ooooor. In the same way, o+r requires at least one o before the r, so or and ooor match, but a bare r does not.
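
The examples in the previous paragraph can be checked with a short sketch. The list candidates is my own test data, and re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

# Test strings taken from the explanation above
candidates = ['r', 'or', 'ooooor', 'ooor']

star = [s for s in candidates if re.fullmatch(r'o*r', s)]      # zero or more o
plus = [s for s in candidates if re.fullmatch(r'o+r', s)]      # one or more o
optional = [s for s in candidates if re.fullmatch(r'o?r', s)]  # zero or one o

print(star)      # ['r', 'or', 'ooooor', 'ooor']
print(plus)      # ['or', 'ooooor', 'ooor']
print(optional)  # ['r', 'or']
```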


Summary

Regular expressions are a special form of text pattern that can be used in programs and programming languages. They can be used to verify that input conforms to a given pattern, to find text that matches the pattern in a large block of text, or to replace matched text with other text.
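
Those three uses map onto three functions in the re module. The sketch below is only an illustration: the date format, the price list, and the phone number are invented sample data.

```python
import re

# 1. Verify: does the whole input look like YYYY-MM-DD?
is_date = re.fullmatch(r'\d{4}-\d{2}-\d{2}', '2023-06-21') is not None

# 2. Find: pull every decimal number out of a longer text
prices = re.findall(r'\d+\.\d+', 'apple 3.50, pear 12.25')

# 3. Replace: mask every digit with '#'
masked = re.sub(r'\d', '#', 'call 555-0199')

print(is_date)  # True
print(prices)   # ['3.50', '12.25']
print(masked)   # call ###-####
```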

The following are some commonly used regular expression symbols:
- .: matches any character (except newline)
- *: matches the preceding character 0 or more times
- +: matches the preceding character 1 or more times
- ?: matches the preceding character 0 or 1 time
- {n}: matches the preceding character exactly n times
- {n,}: matches the preceding character at least n times
- {n,m}: matches the preceding character at least n times, but not more than m times
- [abc]: matches any one of the characters a, b, or c
- [^abc]: matches any one character except a, b, or c
- []: an empty character set is not valid in Python's re module; re.compile('[]') raises re.error instead of matching nothing
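
A few of the symbols above can be tried out with a sketch like this one. The helper matching and the sample words are hypothetical, written only for this illustration.

```python
import re

words = ['a', 'aa', 'aaa', 'aaaa']

def matching(pattern):
    """Hypothetical helper: return the words that fit the pattern completely."""
    return [w for w in words if re.fullmatch(pattern, w)]

exactly_two = matching(r'a{2}')       # {n}: exactly n times
at_least_three = matching(r'a{3,}')   # {n,}: at least n times
two_to_three = matching(r'a{2,3}')    # {n,m}: between n and m times

in_set = re.findall(r'[abc]', 'cabbage')       # any one of a, b, c
not_in_set = re.findall(r'[^abc]', 'cabbage')  # anything except a, b, c
any_char = re.findall(r'.', 'a\nb')            # . skips the newline

print(exactly_two)     # ['aa']
print(at_least_three)  # ['aaa', 'aaaa']
print(two_to_three)    # ['aa', 'aaa']
print(in_set)          # ['c', 'a', 'b', 'b', 'a']
print(not_in_set)      # ['g', 'e']
print(any_char)        # ['a', 'b']
```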

Origin blog.csdn.net/m0_49914128/article/details/131340359