4. Learning the re package (regular expressions)

Regular expression: a string pattern used to check whether a string conforms to a given format.


Usage

You can compile a pattern object so that the pattern can be reused, or you can call a module-level function such as re.search directly to look for a match.

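Creating a pattern object
1. Create a pattern object
    import re

    # Create a pattern object that matches a run of consecutive uppercase letters
    # (* also allows an empty match; use + to require at least one letter)
    pat = re.compile("[A-Z]*")
    # Match the lowercase character 'a'
    pat2 = re.compile("a")
2. Perform a search or a replacement
    m = pat.search("ABC12s4Da433D5FGA53a")
    print(m)  # <re.Match object; span=(0, 3), match='ABC'>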

search: scans the string from left to right and returns the first match. The span of the match is a half-open interval: the start index is included and the end index is excluded.
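
As a small sketch of what the returned match object carries (the variable names are only illustrative), the span and the matched text can be read off directly. Slicing the original string with the span gives back the match, because the end index is excluded:

```python
import re

pat = re.compile("[A-Z]*")
text = "ABC12s4Da433D5FGA53a"

m = pat.search(text)
start, end = m.span()
print(start, end)        # 0 3
print(m.group())         # ABC
print(text[start:end])   # ABC

# search returns None when nothing matches, so check before using the result
no_match = re.search("[A-Z]+", "abc123")
print(no_match)          # None
```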

    m = pat.findall("ABC12s4Da433D5FGA53a")
    print(m)

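findall: finds all matching substrings and returns them as a list. Because [A-Z]* can also match zero characters, the list for this pattern contains empty strings as well.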
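
The following sketch compares the * and + quantifiers on the same input. It assumes that only the non-empty runs of uppercase letters are of interest, which is an assumption made for this example:

```python
import re

text = "ABC12s4Da433D5FGA53a"

star = re.compile("[A-Z]*").findall(text)   # may contain empty strings
plus = re.compile("[A-Z]+").findall(text)   # only non-empty runs

print(plus)                      # ['ABC', 'D', 'D', 'FGA']
print([s for s in star if s])    # the same runs once empty strings are dropped
```

Using + is usually the simpler choice when empty matches are not wanted.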

    m = pat2.sub("A", "abcdcasd")
    print(m)

Pattern object .sub(): searches the second argument for every substring that matches the pattern and replaces each match with the first argument.
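
As a minimal sketch of two related options, sub also accepts a count argument that limits the number of replacements, and subn returns the number of replacements together with the new string:

```python
import re

pat2 = re.compile("a")

print(pat2.sub("A", "abcdcasd"))            # AbcdcAsd
print(pat2.sub("A", "abcdcasd", count=1))   # Abcdcasd

# subn also reports how many replacements were made
result, n = pat2.subn("A", "abcdcasd")
print(result, n)                            # AbcdcAsd 2
```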


Without creating a pattern object
    # Search
    m = re.search("asd", "AasdF")  # arguments: (pattern, string)
    print(m)

re.search("asd", "AasdF") returns: <re.Match object; span=(1, 4), match='asd'>

    # Find all
    print(re.findall("[A-Z]+", "AsDaDFGAa"))  # runs of one or more uppercase letters

re.findall("[A-Z]+", "AsDaDFGAa") returns: ['A', 'D', 'DFGA']

    # Replace with sub
    print(re.sub("a", "A", "abcdcasd"))  # find every a in "abcdcasd" and replace it with A

re.sub("a", "A", "abcdcasd") returns: AbcdcAsd

Notes:
a = "\aadb-\'"
a = r"\aadb-\'"  # without r, \a is treated as a single character

The two statements produce different strings: without the r prefix the escape sequences are interpreted (\a becomes a single character), while with the r prefix the backslashes are kept literally.

It is recommended to put r before the pattern string of a regular expression, so that you do not have to worry about escape characters.
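
A short sketch of the difference, using the two strings above (the names plain and raw are chosen for this example):

```python
import re

plain = "\aadb-\'"    # \a is the single BEL character, \' is a quote
raw = r"\aadb-\'"     # both backslashes are kept

print(len(plain))     # 6
print(len(raw))       # 8
print(repr(plain))
print(repr(raw))

# For patterns, the raw form avoids doubling every backslash
print("\\d+" == r"\d+")                 # True
print(re.findall(r"\d+", "a12b345"))    # ['12', '345']
```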


Rules related to regular expressions

Commonly used operators of regular expressions

Operator  Description                                           Example
.         Any single character (except newline by default)
[ ]       Character set: allowed values for one character       [abc] means a, b, or c; [a-z] means a single character from a to z
[^ ]      Negated set: excluded values for one character        [^abc] means a single character other than a, b, or c
*         Previous character repeated 0 or more times           abc* means ab, abc, abcc, abccc, etc.
+         Previous character repeated 1 or more times           abc+ means abc, abcc, abccc, etc.
?         Previous character repeated 0 or 1 times              abc? means ab or abc
|         Either the left or the right expression               abc|def means abc or def
{m}       Previous character repeated exactly m times           ab{2}c means abbc
{m,n}     Previous character repeated m to n times (inclusive)  ab{1,2}c means abc or abbc
^         Start of the string                                   ^abc means abc at the beginning of a string
$         End of the string                                     abc$ means abc at the end of a string
( )       Grouping; | inside separates alternatives             (abc) means abc; (abc|def) means abc or def
\d        Digit; same as [0-9] for ASCII text
\w        Word character; same as [A-Za-z0-9_] for ASCII text   Useful when a site requires user names made of letters, digits, and underscores
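
The examples in the table can be checked with a small sketch. The helper matches is hypothetical and written only for this illustration. It uses re.fullmatch so that a candidate counts only when the whole string fits the pattern:

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

print(matches(r"abc*", ["ab", "abc", "abcc", "a"]))           # ['ab', 'abc', 'abcc']
print(matches(r"abc+", ["ab", "abc", "abcc"]))                # ['abc', 'abcc']
print(matches(r"abc?", ["ab", "abc", "abcc"]))                # ['ab', 'abc']
print(matches(r"ab{1,2}c", ["ac", "abc", "abbc", "abbbc"]))   # ['abc', 'abbc']
print(matches(r"abc|def", ["abc", "def", "abd"]))             # ['abc', 'def']
print(matches(r"[^abc]", ["a", "d", "dd"]))                   # ['d']
print(matches(r"\w+", ["user_01", "user-01"]))                # ['user_01']
```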

Recommended reading on cnblogs (博客园): "The most complete regular expressions in history"


The main functions of the re library

Function       Description
re.search()    Search a string for the first location where the regular expression matches, and return a match object
re.match()     Match the regular expression at the beginning of a string, and return a match object
re.findall()   Search a string and return all matching substrings as a list
re.split()     Split a string at the matches of the regular expression, and return a list
re.finditer()  Search a string and return an iterator whose elements are match objects
re.sub()       Replace all substrings that match the regular expression, and return the resulting string
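
A compact sketch that runs each of these functions on the same input. The sample text and the date pattern are assumptions made for this illustration:

```python
import re

text = "2021-02-03 and 2020-12-25"
date = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

print(re.match(r"\d{4}", text).group())            # 2021
print(re.match(r"and", text))                      # None: match only tries the start
print(re.search(r"and", text).span())              # (11, 14)
print(date.findall(text))                          # [('2021', '02', '03'), ('2020', '12', '25')]
print(re.split(r"\s+and\s+", text))                # ['2021-02-03', '2020-12-25']
print([m.group(1) for m in date.finditer(text)])   # ['2021', '2020']
print(date.sub(r"\3/\2/\1", text))                 # 03/02/2021 and 25/12/2020
```

Note that findall returns tuples when the pattern contains more than one group, while finditer gives access to each full match object.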

The functions highlighted in blue in the original table are the ones to focus on learning.


Optional flag modifiers

Regular expressions accept optional flags that control how matching is performed. Multiple flags can be combined with bitwise OR (|). For example, re.I | re.M sets both the I and the M flags.

Flag  Description
re.I  Make the match case-insensitive
re.L  Do locale-aware matching
re.M  Multi-line matching; affects ^ and $
re.S  Make . match any character, including a newline
re.U  Interpret characters according to the Unicode character set. This flag affects \w, \W, \b, \B.
re.X  Verbose mode: allows whitespace and comments inside the pattern so that it is easier to read
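
The effect of re.I, re.M, and re.S, alone and combined, can be sketched as follows. The sample text is made up for this example:

```python
import re

text = "Title: Alpha\ntitle: beta\nother: gamma"
pattern = r"^title: (\w+)"

# Without flags, ^ matches only at the very start and case must agree
print(re.findall(pattern, text))                  # []
print(re.findall(pattern, text, re.I))            # ['Alpha']
print(re.findall(pattern, text, re.M))            # ['beta']
print(re.findall(pattern, text, re.I | re.M))     # ['Alpha', 'beta']

# Without re.S, . stops at a newline
print(re.search(r"Alpha.*beta", text))                 # None
print(re.search(r"Alpha.*beta", text, re.S).group())
```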

Task-driven development: parsing the data crawled from the Douban Top250 movies

The parsing extracts: ***movie details link, image link, Chinese title, foreign-language title, rating, number of ratings, summary, related information***

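import re  # regular expressions, for text matching
# import bs4  # only BeautifulSoup is needed from bs4, so it can be imported as follows:
from bs4 import BeautifulSoup  # HTML parsing, to extract the data
import xlwt  # Excel operations
import sqlite3  # SQLite database operations
import urllib.request, urllib.error  # build a request for a URL and fetch the page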


def main():
    # Page to crawl
    baseurl = "https://movie.douban.com/top250?start="
    # Save paths
    savepath = ".\\豆瓣电影Top250.xls"  # use \\ as the path separator, or prefix the whole string with r: r".\豆瓣电影Top250.xls"
    savepath2Db = "movies.db"
    # 1. Crawl the pages
    # print(askURL(baseurl))
    datalist = getData(baseurl)
    print(datalist)



# Pattern for the movie details link
findLink = re.compile(r'<a href="(.*?)">')  # create a regular expression object
# Pattern for the movie image link
findImgSrc = re.compile(r'<img alt=".*src="(.*?)"', re.S)  # re.S lets . match newlines as well
# Movie title
findTitle = re.compile(r'<span class="title">(.*)</span>')
# Movie rating
findRating = re.compile(r'<span class="rating_num" property="v:average">(.*)</span>')
# Number of ratings
# findJudge = re.compile(r'<span>(\d*)(.*)人评价</span>')
findJudge = re.compile(r'<span>(\d*)人评价</span>')
# Summary
findInq = re.compile(r'<span class="inq">(.*)</span>')
# Related information about the movie
findBd = re.compile(r'<p class="">(.*?)</p>', re.S)  # the content contains <br/> and line breaks, so . must also match newlines


# Crawl the pages and parse them
def getData(baseurl):
    datalist = []
    for i in range(0, 10):  # 10 pages, 25 movies per page
        url = baseurl + str(i*25)
        html = askURL(url)  # source of the fetched page
        # print(html)
        # 2. Parse the data (one item at a time)
        soup = BeautifulSoup(html, "html.parser")  # parse the HTML document into a tree with html.parser
        for item in soup.find_all("div", class_="item"):  # find every matching element; the result is a list
            # print(item)
            data = []  # holds the information for one movie
            item = str(item)
            # Movie details link
            link = re.findall(findLink, item)[0]
            data.append(link)
            # Image
            img = re.findall(findImgSrc, item)[0]
            data.append(img)
            # Titles
            titles = re.findall(findTitle, item)
            if len(titles) == 2:
                ctitle = titles[0]  # Chinese title
                data.append(ctitle)
                otitle = titles[1].replace("/", "")
                data.append(otitle)  # foreign-language title
            else:
                data.append(titles[0])
                data.append(' ')  # leave the foreign-language title blank
            # data.append(title)
            # Rating
            rating = re.findall(findRating, item)[0]
            data.append(rating)
            # Number of ratings
            judgeNum = re.findall(findJudge, item)[0]
            # print(judgeNum)
            data.append(judgeNum)
            # Summary
            inq = re.findall(findInq, item)
            if len(inq) == 0:
                data.append(" ")
            else:
                data.append(inq[0].replace("。", ""))
            # Related information about the movie
            bd = re.findall(findBd, item)[0]
            bd = re.sub(r'<br(\s+)?/>(\s+)?', " ", bd)  # remove <br/>
            bd = re.sub('/', " ", bd)  # replace /
            data.append(bd.strip())  # strip leading and trailing whitespace

            datalist.append(data)  # store the processed information for one movie
        # for it in datalist:
        #     print(it)
    return datalist


# Fetch the page content for the given URL
def askURL(url):
    # Request headers; the User-Agent makes the request look like it comes from a browser
    head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/87.0.4280.88 Safari/537.36"}
    req = urllib.request.Request(url, headers=head)
    html = ""  # source of the fetched page
    try:
        response = urllib.request.urlopen(req)
        html = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        if hasattr(e, "code"):  # HTTP status code, if the error has one
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    return html


if __name__ == '__main__':
    main()

When appending the parsed data, note that some values may be empty, such as a movie's alternative (foreign-language) title or its summary. Check each value before storing it: if it is empty, store an empty string or a single space instead, otherwise inserting the rows into a spreadsheet or database later will cause problems.
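
One way to centralize this check is a small helper that returns a default when a pattern finds nothing. This is only a sketch: first_or_default is a hypothetical helper that is not part of the program above, and the HTML snippet is made up for the example rather than copied from the real page:

```python
import re

findTitle = re.compile(r'<span class="title">(.*)</span>')
findInq = re.compile(r'<span class="inq">(.*)</span>')
findJudge = re.compile(r'<span>(\d*)人评价</span>')

def first_or_default(pattern, text, default=" "):
    """Return the first captured value, or the default when nothing matches."""
    found = pattern.findall(text)
    return found[0] if found else default

# Hypothetical item with no summary (no <span class="inq"> element)
item = '''
<span class="title">示例电影</span>
<span class="rating_num" property="v:average">9.0</span>
<span>12345人评价</span>
'''

print(first_or_default(findTitle, item))        # 示例电影
print(first_or_default(findJudge, item))        # 12345
print(repr(first_or_default(findInq, item)))    # ' '
```

With such a helper, every field produces a value, so each row has the same number of columns when it is written out later.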

Origin blog.csdn.net/qq_43808700/article/details/113590395