Crawling Maoyan Movie TOP100 Information with Python

1. Goal

Use Python's requests library to crawl the movie names, release times, ratings, and other information for the top 100 movies on Maoyan. The target page is https://maoyan.com/board/4, and the results are saved to a file.

2. Idea analysis

First open https://maoyan.com/board/4. The top-ranked movie is Farewell My Concubine (霸王别姬), and each entry shows the stars, release time, rating, and other information.
Scrolling down, there is a paged list at the bottom, with 10 movies per page. Turning to the next page, the browser's address bar changes: the offset parameter at the end of the URL goes from 0 to 10, and on the page after that it becomes 20. The pattern is clear: the offset increases by 10 with every page turned.
This gives us the URLs of all the pages we need to crawl, so we only have to write the code to scrape a single page and then run it against every URL to collect all the information.
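
As a quick check of this pattern, here is a minimal sketch that simply prints the ten page URLs implied by the offset rule (nothing is fetched yet):

base_url = 'https://maoyan.com/board/4'
# offset = 0, 10, 20, ..., 90 — one value per page of ten movies
for offset in range(0, 100, 10):
    print(f'{base_url}?offset={offset}')
# https://maoyan.com/board/4?offset=0
# https://maoyan.com/board/4?offset=10
# ...
# https://maoyan.com/board/4?offset=90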

3. Page scraping

Write a function getPageText(base_url, params) to fetch the HTML text of a URL, where params holds the query parameters (in this article, just offset).

import requests

def getPageText(base_url, params):
    # Send a GET request with a browser User-Agent; return the HTML text on success.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3676.400 QQBrowser/10.4.3469.400',
    }
    r = requests.get(base_url, params=params, headers=headers)
    if r.status_code == 200:
        return r.text
    else:
        return None
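
A quick way to sanity-check the function is to request the first page (a sketch; note that Maoyan may serve a verification page to automated clients, in which case the returned HTML will not contain the movie list):

text = getPageText('https://maoyan.com/board/4', {'offset': 0})
if text is not None:
    print(len(text))          # length of the returned HTML, just to confirm a response arrived
else:
    print('Request failed (non-200 status code)')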
4. Information extraction

Right-click in the browser and choose Inspect, or press F12, to open the developer tools and view the page source.
After locating the target part, observe its structure and then extract the information with regular expressions. Because the crawled HTML contains a large number of whitespace characters, we first use re.sub to replace all whitespace in the text with an empty string (i.e., delete it), which keeps the patterns simple to write.

import re

def fromTextGetData(text):
    # Remove all whitespace so the patterns do not have to account for it.
    text = re.sub(r'\s+', '', text)
    pattern = re.compile(r'<pclass="name"><a[^>]*?>(.*?)</a></p><pclass="star">(.*?)</p>.*?<pclass="releasetime">(.*?)</p>', re.S)
    score_pattern = re.compile(r'<iclass="integer">(.*?)</i><iclass="fraction">(.*?)</i></p>', re.S)  # extract the rating
    temp_result = pattern.findall(text)
    score_result = score_pattern.findall(text)
    result = [None] * len(temp_result)
    for i in range(0, len(temp_result)):
        if i < len(score_result):
            string = '分数:' + str(score_result[i][0] + score_result[i][1])
            # Store the result in a dictionary; the slices strip the "主演：" and "上映时间：" prefixes.
            dic = {
                '影片名字': temp_result[i][0],
                '主演:': temp_result[i][1][3:],
                '上映时间:': temp_result[i][2][5:],
                '分数:': string[3:]
            }
            result[i] = dic
    return result
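
To see what the function produces, we can feed it a small hand-written HTML fragment that mimics the structure of one list entry (the fragment and its values are illustrative, not copied from the live page):

sample = '''
<p class="name"><a href="/films/1203" title="霸王别姬">霸王别姬</a></p>
<p class="star">主演：张国荣,张丰毅,巩俐</p>
<p class="releasetime">上映时间：1993-01-01</p>
<p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
'''
print(fromTextGetData(sample))
# [{'影片名字': '霸王别姬', '主演:': '张国荣,张丰毅,巩俐', '上映时间:': '1993-01-01', '分数:': '9.5'}]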
5. Write to file

The data collected by a crawler generally needs to be saved locally; here we save it as a TXT file.
We use the dumps() method from the json library to serialize each dictionary. (Serialization is the process of converting an object into a data format that can be transmitted over the network or stored on disk; the reverse process is deserialization.)
Because json.dumps escapes non-ASCII characters by default, Chinese text would be written as \uXXXX escapes. Passing ensure_ascii=False keeps the output in readable Chinese. Note that json.dumps returns a str.

import json

def writeToFile(filePath, content):
    # Append one JSON object per line; ensure_ascii=False keeps Chinese characters readable.
    with open(filePath, 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')
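
A small sketch of the difference ensure_ascii makes (the dictionary is just an example value):

import json
movie = {'影片名字': '霸王别姬', '分数:': '9.5'}
print(json.dumps(movie))                       # Chinese is escaped: {"\u5f71\u7247\u540d\u5b57": "\u9738\u738b\u522b\u59ec", ...}
print(json.dumps(movie, ensure_ascii=False))   # readable: {"影片名字": "霸王别姬", "分数:": "9.5"}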
6. Code integration

Call the other functions from a main() function, which makes the program easier to manage and extend later.

def main():
    base_url = 'https://maoyan.com/board/4'
    offset = 0
    result = []
    # Crawl the ten pages: offset = 0, 10, 20, ..., 90.
    for i in range(offset, 100, 10):
        params = {
            'offset': i
        }
        text = getPageText(base_url, params)
        if text:                      # skip pages that failed to download
            result += fromTextGetData(text)
    print(result)
    for i in result:
        writeToFile('maoyan.txt', i)
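
Since writeToFile appends one JSON object per line, the saved file can be read back line by line, for example (a sketch, assuming maoyan.txt was produced by the code above):

import json

movies = []
with open('maoyan.txt', 'r', encoding='utf-8') as f:
    for line in f:
        if line.strip():                      # skip any blank lines
            movies.append(json.loads(line))
print(len(movies))                            # close to 100 if every page was parsed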
7. Complete code

import requests
import json
import re

def getPageText(base_url, params):
    # Send a GET request with a browser User-Agent; return the HTML text on success.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3676.400 QQBrowser/10.4.3469.400',
    }
    r = requests.get(base_url, params=params, headers=headers)
    if r.status_code == 200:
        return r.text
    else:
        return None

def fromTextGetData(text):
    # Remove all whitespace so the patterns do not have to account for it.
    text = re.sub(r'\s+', '', text)
    pattern = re.compile(r'<pclass="name"><a[^>]*?>(.*?)</a></p><pclass="star">(.*?)</p>.*?<pclass="releasetime">(.*?)</p>', re.S)
    score_pattern = re.compile(r'<iclass="integer">(.*?)</i><iclass="fraction">(.*?)</i></p>', re.S)  # extract the rating
    temp_result = pattern.findall(text)
    score_result = score_pattern.findall(text)
    result = [None] * len(temp_result)
    for i in range(0, len(temp_result)):
        if i < len(score_result):
            string = '分数:' + str(score_result[i][0] + score_result[i][1])
            # Store the result in a dictionary; the slices strip the "主演：" and "上映时间：" prefixes.
            dic = {
                '影片名字': temp_result[i][0],
                '主演:': temp_result[i][1][3:],
                '上映时间:': temp_result[i][2][5:],
                '分数:': string[3:]
            }
            result[i] = dic
    return result

def writeToFile(filePath, content):
    # Append one JSON object per line; ensure_ascii=False keeps Chinese characters readable.
    with open(filePath, 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

def main():
    base_url = 'https://maoyan.com/board/4'
    offset = 0
    result = []
    # Crawl the ten pages: offset = 0, 10, 20, ..., 90.
    for i in range(offset, 100, 10):
        params = {
            'offset': i
        }
        text = getPageText(base_url, params)
        if text:                      # skip pages that failed to download
            result += fromTextGetData(text)
    print(result)
    for i in result:
        writeToFile('maoyan.txt', i)

if __name__ == '__main__':
    main()

Reprinted from: blog.csdn.net/weixin_40735291/article/details/89762391