Crawling Maoyan movies with Requests + regular expressions

Target

Extract the name, release time, rating, picture, and other information for the movies on the Maoyan TOP100 board. The site URL is http://maoyan.com/board/4, and the extracted results are saved as text.

Preparation

Install the requests library first:

pip install requests

For the basic usage of the requests library, see this article: http://www.cnblogs.com/0bug/p/8899841.html

Crawl analysis

The target page to crawl is http://maoyan.com/board/4. After opening it, you can see the ranking list, as shown in the figure.

The first-ranked movie is Farewell My Concubine. The useful information displayed on the page includes the movie name, starring actors, release time, release region, rating, and picture.

Scroll to the bottom of the page and you will find a pagination list. Click the second page and observe how the URL and the content of the page change, as shown in the figure.

The URL becomes http://maoyan.com/board/4?offset=10, which has one more parameter than the previous URL, offset=10, and the page now shows the movies ranked 11 to 20. A first guess is that this is an offset parameter. Clicking the next page changes the URL to http://maoyan.com/board/4?offset=20: offset is now 20, and the results shown are the movies ranked 21 to 30.

From this it can be concluded that offset is the offset into the ranking: if offset is n, the page shows the movies ranked n+1 to n+10, with 10 movies per page. So to get the TOP100 movies, you only need to make 10 separate requests with offset set to 0, 10, 20, ..., 90. After fetching these pages and extracting the fields, you have the information for all 100 movies.
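The pagination rule above can be sketched as a small helper that builds the 10 page URLs. The names BASE_URL and build_page_urls are hypothetical and chosen only for this illustration; the sketch assumes that offset=0 returns the same content as the first page without the parameter.

```python
BASE_URL = 'http://maoyan.com/board/4'


def build_page_urls(total=100, page_size=10):
    """Return one URL per page, with offsets 0, 10, ..., total - page_size."""
    return [f'{BASE_URL}?offset={offset}'
            for offset in range(0, total, page_size)]


urls = build_page_urls()
print(len(urls))  # 10
print(urls[0])    # http://maoyan.com/board/4?offset=0
print(urls[-1])   # http://maoyan.com/board/4?offset=90
```

Each of these URLs can then be fetched and parsed in turn.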

Process framework:

Fetching the home page

import requests


def get_one_page(url):
    # Identify as a regular browser through the User-Agent header.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    # Return the page source only when the request succeeded.
    if response.status_code == 200:
        return response.text
    return None


def main():
    url = 'http://maoyan.com/board/4'
    html = get_one_page(url)
    print(html)


if __name__ == '__main__':
    main()

Extraction with regular expressions

Go back to the page and view the source code in the Network panel of the browser's developer tools.

Note that you should not view the source code directly in the Elements tab, because the source shown there may have been modified by JavaScript and can differ from the original response.

The source code of one entry is as follows:

You can see that the source for one movie is a dd node. We use regular expressions to extract the movie information from it.

1. Ranking

The ranking is in the i node whose class is board-index. Non-greedy matching is used to extract the text inside the i node. The regular expression is:

<dd>.*?board-index.*?>(.*?)</i>
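As a quick check of this pattern, here is a minimal sketch run against a made-up snippet. The markup in sample is hypothetical: it is only modeled on the class names the pattern refers to, and the real page source may differ. The sketch also shows why the re.S flag matters when a dd node spans several lines.

```python
import re

# Hypothetical markup, modeled on the class names used in the pattern;
# the real page source may differ.
sample = '''
<dd>
    <i class="board-index board-index-1">1</i>
    <p class="name"><a href="#">Movie A</a></p>
</dd>
<dd>
    <i class="board-index board-index-2">2</i>
    <p class="name"><a href="#">Movie B</a></p>
</dd>
'''

regex = r'<dd>.*?board-index.*?>(.*?)</i>'

# With re.S, "." also matches newlines, so the pattern can span lines.
print(re.findall(regex, sample, re.S))  # ['1', '2']

# Without re.S, "." stops at the newline after <dd>, so nothing matches.
print(re.findall(regex, sample))        # []
```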

2. Pictures of the movie

Next comes the movie picture. There are two img nodes inside the a node that follows. On inspection, the data-src attribute of the second img node holds the image link, so the regular expression is extended to:

<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)"

3. Movie name

After that, extract the movie name. It is in the p node that follows, whose class is name, so name can be used as a marker, and then the text content of the a node inside it is extracted:

<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>

4. Starring actors, release time, and rating

The starring actors, release time, rating, and so on are extracted on the same principle. The final regular expression is:

<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>


This regular expression matches the entry for one movie and captures 7 pieces of information. Next, call the findall() method to extract all the matches:

import re


def parse_one_page(html):
    # re.S lets "." match newlines, since each dd node spans several lines.
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>',
        re.S)
    # Each match is a tuple of the 7 captured groups.
    items = re.findall(pattern, html)
    print(items)
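The goal stated at the start is to save the results as text. One possible way to do that is sketched below: each 7-tuple from findall() is turned into a dictionary and written as one JSON line. The helper names to_record and write_records, the dictionary keys, and the file name result.txt are assumptions made for this illustration, and the example tuple is made up rather than taken from the site.

```python
import json


def to_record(item):
    """Map one 7-tuple from findall() to a dict with readable keys."""
    index, image, title, actor, release, integer, fraction = item
    return {
        'index': index,
        'image': image,
        'title': title.strip(),
        'actor': actor.strip(),
        'time': release.strip(),
        # The rating is split into an integer part and a fraction part.
        'score': integer.strip() + fraction.strip(),
    }


def write_records(items, path='result.txt'):
    """Append one JSON object per line to a text file."""
    with open(path, 'a', encoding='utf-8') as f:
        for item in items:
            # ensure_ascii=False keeps non-ASCII text readable in the file.
            f.write(json.dumps(to_record(item), ensure_ascii=False) + '\n')


# Made-up tuple in the shape findall() returns for the pattern above.
example = ('1', 'http://example.com/poster.jpg', 'Example Movie',
           '\n  Starring: A, B\n', 'Release time: 2000-01-01', '9.', '5')
print(to_record(example)['score'])  # 9.5
```

Stripping whitespace is useful because the captured text of a node often includes the line breaks and indentation around it.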

  

 
