Python100Days study notes: Day 12, strings and regular expressions

Using regular expressions

Regular expression basics
When writing programs that process strings or web pages, we often need to find strings that follow some complex rule. Regular expressions are the tool for describing such rules. In other words, a regular expression defines a pattern for matching strings: it lets us check whether a string contains a part that matches the pattern, and extract or replace the matching part. If you have used wildcards (* and ?) to search for files by name in Windows, regular expressions are a similar text-matching tool, but far more powerful: they can describe what you want much more precisely. The price is that a regular expression is considerably harder to write than a wildcard; every benefit has its cost, just as with learning a programming language. For example, you can write a regular expression that finds every string that starts with 0, is followed by 2 to 3 digits, then a hyphen "-", and ends with 7 or 8 digits (such as 0813-7654321 or 028-12345678). That is exactly the format of a domestic landline number. Computers were originally built to do mathematical calculation, and the information they processed was mostly numeric. Today most of the information we handle in daily work is text, and if we want a computer to recognise and process text that fits a certain pattern, regular expressions become very important. Almost every programming language now supports regular expressions; Python supports them through the re module in the standard library.
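
As a minimal sketch of the landline format described above (a leading 0, 2 to 3 more digits, a hyphen, then 7 or 8 digits), the pattern could be written as follows. The name LANDLINE and the sample strings are made up for illustration.

```python
import re

# Landline format from the text: a leading 0, 2-3 more digits,
# a hyphen, then 7 or 8 digits. LANDLINE is a hypothetical name.
LANDLINE = re.compile(r'0\d{2,3}-\d{7,8}')

samples = ['0813-7654321', '028-12345678', '813-7654321', '028-123456']
for text in samples:
    # fullmatch succeeds only if the whole string fits the pattern
    print(text, bool(LANDLINE.fullmatch(text)))
```

The first two samples fit the format; the third lacks the leading 0 and the fourth has only six digits after the hyphen, so both are rejected.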

Consider the following problem: we obtain a string from somewhere (a text file, or a news article on the web) and want to find the mobile numbers and landline numbers in it. We can describe a mobile number as 11 digits (though not an arbitrary 11 digits, since you have never seen a mobile number like "25012345678") and keep the landline pattern described above. Without regular expressions, this task would be very tedious.

For background on regular expressions, you can read the well-known blog post "Regular Expressions 30-Minute Guide". After reading it, the table below will make sense; it is our brief summary of the basic regular expression symbols.

[Images: summary table of basic regular expression symbols, in four parts]

Description: If the character you need to match is itself a special character in regular expressions, escape it with \. For example, to match a decimal point write \., because a bare . matches any character. Likewise, to match parentheses you must write \( and \); otherwise the parentheses are treated as a grouping construct.
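
The escaping rule above can be sketched with a few calls; the sample strings are made up for illustration.

```python
import re

# An unescaped dot matches any character, so 'a.c' also matches 'abc'.
print(re.findall(r'a.c', 'a.c abc'))    # ['a.c', 'abc']
# An escaped dot matches only a literal '.'.
print(re.findall(r'a\.c', 'a.c abc'))   # ['a.c']
# \( and \) match literal parentheses; the unescaped pair inside is a group.
print(re.findall(r'\((\d+)\)', 'call f(42) then g(7)'))  # ['42', '7']
# re.escape adds the backslashes for you when the text is a plain literal.
print(re.escape('1.5(a)'))
```

re.escape is handy when the text to match comes from user input and should never be interpreted as a pattern.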

Python's support for regular expressions
Python's re module provides the operations for working with regular expressions. The core functions of the re module are listed below.

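Function	Description
compile(pattern, flags=0)	Compile a regular expression and return a regular expression object
match(pattern, string, flags=0)	Match the pattern against the start of the string; return a match object on success, otherwise None
search(pattern, string, flags=0)	Search the string for the first occurrence of the pattern; return a match object on success, otherwise None
split(pattern, string, maxsplit=0, flags=0)	Split the string using the pattern as the separator; return a list
sub(pattern, repl, string, count=0, flags=0)	Replace the parts of the string that match the pattern with the given string; count limits the number of replacements
fullmatch(pattern, string, flags=0)	Whole-string version of match (the pattern must match from the start to the end of the string)
findall(pattern, string, flags=0)	Find all matches of the pattern in the string; return a list of strings
finditer(pattern, string, flags=0)	Find all matches of the pattern in the string; return an iterator
purge()	Clear the cache of implicitly compiled regular expressions
re.I / re.IGNORECASE	Flag: case-insensitive matching
re.M / re.MULTILINE	Flag: multi-line matching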

Description: In practice, each of the re module functions above can be replaced by the corresponding method of a regular expression object. If a regular expression will be used repeatedly, compiling it first with compile and reusing the resulting object is undoubtedly the wiser choice.
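
A minimal sketch of that advice: compile once, then call the methods of the resulting object. The pattern and the variable names word and text are chosen only for illustration.

```python
import re

# Compile once, then reuse the pattern object's own methods.
word = re.compile(r'[a-z]+', re.I)

text = 'Hello world 123'
print(word.match(text).group())         # 'Hello' (match starts at the beginning)
print(word.search('123 abc').group())   # 'abc' (search scans forward)
print(word.fullmatch(text))             # None (the whole string must match)
print(word.findall(text))               # ['Hello', 'world']
# The module-level function gives the same result as the method.
print(re.findall(r'[a-z]+', text, flags=re.I))  # ['Hello', 'world']
```

The methods take the same arguments as the module-level functions minus the pattern and flags, which are fixed at compile time.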

The following examples show how to use regular expressions in Python.

Example 1: Validate an input user name and QQ number, and print a corresponding message.

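"""
Validate an input user name and QQ number, and print a corresponding message.

Requirements: the user name must consist of letters, digits or underscores
and be 6 to 20 characters long; the QQ number must be 5 to 12 digits and
must not start with 0.
"""
import re


def main():
    username = input('Enter a user name: ')
    qq = input('Enter a QQ number: ')
    # The first argument of match is a regular expression string or object
    # The second argument is the string to match against the regular expression
    m1 = re.match(r'^[0-9a-zA-Z_]{6,20}$', username)
    if not m1:
        print('Please enter a valid user name.')
    m2 = re.match(r'^[1-9]\d{4,11}$', qq)
    if not m2:
        print('Please enter a valid QQ number.')
    if m1 and m2:
        print('The information you entered is valid!')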


if __name__ == '__main__':
    main()

Tip: The regular expressions above are written as "raw strings" (an r in front of the string). In a raw string every character keeps its literal meaning; put bluntly, a raw string has no escape sequences. Regular expressions contain many metacharacters and places that need a backslash, so without raw strings each backslash has to be written as \\. For example, \d, which stands for a digit, would have to be written as \\d, which is both inconvenient to write and hard to read.
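
The difference can be checked directly. This is a small sketch; the variable names plain and raw are made up for illustration.

```python
import re

# Without the r prefix, the backslash itself must be escaped.
plain = '\\d{3}'
raw = r'\d{3}'
print(plain == raw)   # True: both hold the five characters \d{3}
print(len(raw))       # 5
print(re.findall(raw, 'a123b4567'))  # ['123', '456']
# '\b' in a normal string is a backspace character, not a word boundary.
print(len('\b'), len(r'\b'))  # 1 2
```

The last line shows why the habit matters: a forgotten r can silently turn a regular expression escape into a different character.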

Example 2: Extract domestic mobile numbers from a piece of text.

The picture below shows the mobile number prefixes issued by the three domestic carriers as of the end of 2017.

[Image: mobile number prefixes of the three domestic carriers]

import re


def main():
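    # Create a regular expression object; lookbehind and lookahead ensure
    # that no digit appears directly before or after the mobile number
    pattern = re.compile(r'(?<=\D)1[34578]\d{9}(?=\D)')
    sentence = '''
    Important things get said 8130123456789 times. My mobile number is 13512346789, a lucky one,
    not 15600998765, and not 110 or 119; Wang Dachui's mobile number is 15600998765.
    '''
    # Find all matches and store them in a list
    mylist = re.findall(pattern, sentence)
    print(mylist)
    print('--------separator--------')
    # Take match objects from the iterator and print the matched text
    for temp in pattern.finditer(sentence):
        print(temp.group())
    print('--------separator--------')
    # Find all matches by calling search with an explicit start position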
    m = pattern.search(sentence)
    while m:
        print(m.group())
        m = pattern.search(sentence, m.end())


if __name__ == '__main__':
    main()

Description: The regular expression above for matching domestic mobile numbers is not good enough. For example, the only numbers starting with 14 are 145 and 147, and the pattern above ignores this. A better regular expression for domestic mobile numbers is (?<=\D)(1[38]\d{9}|14[57]\d{8}|15[0-35-9]\d{8}|17[678]\d{8})(?=\D). Numbers starting with 19 and 16 seem to have appeared recently, but we leave them out of consideration for now.
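
A quick sketch of how the stricter pattern behaves. The sample numbers are invented for illustration, and the name better is hypothetical.

```python
import re

# The stricter pattern from the description above.
better = re.compile(
    r'(?<=\D)(1[38]\d{9}|14[57]\d{8}|15[0-35-9]\d{8}|17[678]\d{8})(?=\D)'
)
# Invented sample numbers; 144... and 154... use prefixes the pattern rejects.
text = ' 13512346789, 14412345678, 14712345678, 15412345678, 17612345678 '
print(better.findall(text))  # ['13512346789', '14712345678', '17612345678']
```

Because (?<=\D) and (?=\D) each require an actual non-digit character, a number at the very start or end of the string is not matched. Using (?<!\d) and (?!\d) instead would lift that restriction.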

Example 3: Replace inappropriate content in a string

(Content omitted.)

Description: The matching functions in the re module take a flags parameter, which holds the matching flags. Flags control, for example, whether matching ignores case, whether matching is multi-line, and whether debugging information is shown. To specify several flags at once, combine them with the bitwise OR operator, for example flags=re.I | re.M.
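
The example itself is omitted above, so the following is only a sketch of how sub and the flags parameter might be combined. The word list and the sample text are hypothetical and are not the author's original example.

```python
import re

# Hypothetical word list and sample text, for illustration only.
text = 'What a DARN bug. Darn it,\nheck of a day.'
# re.I makes the replacement case-insensitive.
cleaned = re.sub(r'darn|heck', '*', text, flags=re.I)
print(cleaned)
# re.M lets ^ match at the start of every line; flags are combined with |.
print(re.findall(r'^\w+', text, flags=re.I | re.M))  # ['What', 'heck']
```

Without re.M, ^ would match only at the very start of the string and the second call would return just ['What'].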

Example 4: Split a long string

import re


def main():
    poem = '窗前明月光,疑是地上霜。举头望明月,低头思故乡。'
    sentence_list = re.split(r'[,。, .]', poem)
    while '' in sentence_list:
        sentence_list.remove('')
    print(sentence_list)  # ['窗前明月光', '疑是地上霜', '举头望明月', '低头思故乡']


if __name__ == '__main__':
    main()
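
As a possible variation on Example 4, a list comprehension can drop the empty string in one step, and maxsplit limits how many splits are made. This is a sketch that reuses the same poem; the name parts is made up.

```python
import re

poem = '窗前明月光,疑是地上霜。举头望明月,低头思故乡。'
# A list comprehension drops the empty string left after the final '。'.
parts = [s for s in re.split(r'[,。]', poem) if s]
print(parts)  # ['窗前明月光', '疑是地上霜', '举头望明月', '低头思故乡']
# maxsplit=1 splits only at the first separator; the rest stays in one piece.
print(re.split(r'[,。]', poem, maxsplit=1))
```

The comprehension builds a new list instead of repeatedly calling remove on the original one.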

Afterword
If you work on web crawlers, regular expressions will be a very good helper, because they let us quickly find a pattern we specify in a page's source code and extract the information we need. For beginners, of course, writing a correct regular expression may not be easy (although common regular expressions can be looked up online). So in real crawler development many people choose Beautiful Soup or lxml to match and extract information. The former is simple and convenient but performs poorly; the latter is easy to use and performs well but is a bit troublesome to install. We will cover both in the later topic on web crawlers.

Origin blog.csdn.net/weixin_36838630/article/details/105206843