[Hands-on] Python 3 Web Crawler Development: 3.4 Scraping the Maoyan (Cat's Eye) TOP100 Movies

Abstract: In this section, we use the requests library together with regular expressions to scrape the Maoyan movie TOP100 list. requests is more convenient to use than urllib, and since we have not yet systematically covered HTML parsing libraries, regular expressions are chosen as the parsing tool here.

1. Goal of This Section

In this section, we will extract the name, release time, score, poster image, and other information for each film on the Maoyan TOP100 board. The target URL is http://maoyan.com/board/4, and the extracted results will be saved to a file.

2. Preparation

Before beginning this section, make sure you have properly installed the requests library. If not, refer to the installation instructions in Chapter 1.

3. Analyzing the Target Page

The target site we need to scrape is http://maoyan.com/board/4. Opening it shows the board's list of movies, as in Figure 3-11.

3-11.jpg

Figure 3-11 list information

The number one movie is Farewell My Concubine (霸王别姬). For each entry, the page displays the movie name, starring actors, release time, release region, score, poster image, and other information.

Scroll to the bottom of the page, where the pagination links appear. Click page 2 and observe how the URL and the page content change, as shown in Figure 3-12.

3-12.jpg

Figure 3-12 page URL changes

You will find that the URL becomes http://maoyan.com/board/4?offset=10, which carries one more parameter than before: offset=10. The page now shows the movies ranked 11 to 20, so we can infer that offset is an offset parameter. Clicking Next again, the URL becomes http://maoyan.com/board/4?offset=20, offset changes to 20, and the page shows the movies ranked 21 to 30.

The pattern can now be summarized: offset is an offset value; when it is n, the page displays the movies numbered n+1 to n+10, ten per page. So to get the full TOP100, we only need to make 10 separate requests with offset set to 0, 10, 20, ..., 90. After fetching each page, we extract the relevant information with regular expressions, which gives us all 100 movies.
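
The pagination rule just described can be sketched in a few lines (a minimal illustration of the URL scheme, not part of the crawler code yet):

```python
# Each page of the board is addressed by an offset parameter
# increasing in steps of 10, covering ranks 1-100 in 10 requests.
base_url = 'http://maoyan.com/board/4?offset={}'
urls = [base_url.format(offset) for offset in range(0, 100, 10)]

print(urls[0])   # first page: offset=0, movies 1-10
print(urls[-1])  # last page: offset=90, movies 91-100
```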

4. Crawling the First Page

Now let's implement this process in code. First, crawl the content of the first page. We implement a get_one_page() method that takes a url parameter and returns the crawled page, then call it from a main() method. The preliminary code is as follows:

import requests

def get_one_page(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    return None

def main():
    url = 'http://maoyan.com/board/4'
    html = get_one_page(url)
    print(html)

main()

After running this, we successfully get the source code of the first page. With the source code in hand, we need to parse the page and extract the information we want.
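
One caveat: Maoyan has since tightened its anti-crawling checks, and a request sent with the default requests User-Agent may receive a verification page instead of the board. A common workaround, shown here as an assumption rather than part of the original text, is to send a browser-like User-Agent header:

```python
import requests

# Hypothetical browser-like header; Maoyan may reject the default
# requests User-Agent, so we mimic a desktop browser (assumption).
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/70.0.3538.102 Safari/537.36'
}

def get_one_page(url):
    # Same logic as above, but sending the headers with every request
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
    return None
```

If the plain version above works for you, the header is unnecessary; it simply makes the request look like an ordinary browser visit.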

5. Extracting with Regular Expressions

Next, let's look back at the real source of the page. View the source code of the original request in the Network panel of the developer tools, as shown in Figure 3-13.

3-13.jpg

Figure 3-13 Source Code

Note that you should not view the source code in the Elements tab here: that source may have been modified by JavaScript after the original request. Instead, view the source of the original response in the Network tab.

Look at the source code of one of the entries, shown in Figure 3-14.

3-14.jpg

Figure 3-14 Source code of one entry

We can see that each movie's information corresponds to one dd node, so we use regular expressions to extract the movie information inside it. First, extract the ranking, which sits inside an i node with class board-index. Using non-greedy matching to extract the text in the i node, the regular expression is written as:

<dd>.*?board-index.*?>(.*?)</i>
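
The effect of non-greedy matching can be checked on a toy fragment (hypothetical markup, simplified from the real page):

```python
import re

# Toy fragment with two <i> nodes, mimicking the ranking and score markup
html = ('<dd><i class="board-index">1</i>'
        '<i class="integer">9.</i></dd>')

# Non-greedy .*? stops at the FIRST </i>, capturing only the ranking
print(re.search('<dd>.*?board-index.*?>(.*?)</i>', html, re.S).group(1))
# prints '1'

# Greedy .* runs to the LAST </i>, swallowing the score node as well
print(re.search('<dd>.*?board-index.*?>(.*)</i>', html, re.S).group(1))
# prints '1</i><i class="integer">9.'
```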

Next, we need to extract the movie poster. As you can see, there is an a node behind it that contains two img nodes. On inspection, the data-src attribute of the second img node holds the link to the image. So we extract the data-src attribute of the second img node, and the regular expression is rewritten as follows:

<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)"

After that, we need to extract the movie name, which is inside the following p node whose class is name. We can therefore use name as a flag and extract the text of the a node inside it; the regular expression then becomes:

<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>

Extracting the stars, release time, score, and so on follows the same principle. Finally, the full regular expression is:

<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>

This regular expression matches one movie entry and captures seven pieces of information. Next, we extract all entries by calling the findall() method.
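
As a quick sanity check, the full pattern can be tried against a simplified, hand-written dd fragment (an assumption of the page structure, trimmed for brevity; the real markup carries more attributes):

```python
import re

# Simplified stand-in for one <dd> entry of the board page
html = '''<dd>
<i class="board-index board-index-1">1</i>
<a href="/films/1203"><img data-src="http://p1.meituan.net/movie/demo.jpg"></a>
<p class="name"><a href="/films/1203">霸王别姬</a></p>
<p class="star">主演:张国荣,张丰毅,巩俐</p>
<p class="releasetime">上映时间:1993-01-01(中国香港)</p>
<p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
</dd>'''

pattern = re.compile(
    '<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>'
    '.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>'
    '.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S)

items = re.findall(pattern, html)
print(items[0][0], items[0][2])  # prints: 1 霸王别姬
```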

Next, we define the page-parsing method parse_one_page(), which extracts the content we want from the page, mainly through this regular expression. The implementation is as follows:

import re

def parse_one_page(html):
    pattern = re.compile(
        '<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>',
        re.S)
    items = re.findall(pattern, html)
    print(items)

In this way, the information of all 10 movies on one page is extracted, in the form of a list. The output is as follows:

[('1', 'http://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg@160w_220h_1e_1c', 
'霸王别姬', '\n                主演:张国荣,张丰毅,巩俐\n        ', '上映时间:1993-01-01(中国香港)',
 '9.', '6'), ('2', 'http://p0.meituan.net/movie/__40191813__4767047.jpg@160w_220h_1e_1c', '肖申克的救赎',
  '\n                主演:蒂姆·罗宾斯,摩根·弗里曼,鲍勃·冈顿\n        ', '上映时间:1994-10-14(美国)',
   '9.', '5'), ('3', 'http://p0.meituan.net/movie/fc9d78dd2ce84d20e53b6d1ae2eea4fb1515304.jpg@160w_220h_1e_1c',
    '这个杀手不太冷', '\n                主演:让·雷诺,加里·奥德曼,娜塔莉·波特曼\n        ', '
    上映时间:1994-09-14(法国)', '9.', '5'), ('4', 'http://p0.meituan.net/movie/23/6009725.jpg@160w_220h_1e_1c',
     '罗马假日', '\n                主演:格利高利·派克,奥黛丽·赫本,埃迪·艾伯特\n     
        ', '上映时间:1953-09-02(美国)', '9.', '1'), ('5', 'http://p0.meituan.net/movie/53/1541925.jpg@160w_220h_1e_1c',
         '阿甘正传', '\n                主演:汤姆·汉克斯,罗宾·怀特,加里·西尼斯\n       
          ', '上映时间:1994-07-06(美国)', '9.', '4'), ('6', 'http://p0.meituan.net/movie/11/324629.jpg@160w_220h_1e_1c',
           '泰坦尼克号', '\n                主演:莱昂纳多·迪卡普里奥,凯特·温丝莱特,比利·赞恩\n     
              ', '上映时间:1998-04-03', '9.', '5'), 
              ('7', 'http://p0.meituan.net/movie/99/678407.jpg@160w_220h_1e_1c',
               '龙猫', '\n                主演:日高法子,坂本千夏,糸井重里\n        ', '
               上映时间:1988-04-16(日本)', '9.', '2'),
                ('8', 'http://p0.meituan.net/movie/92/8212889.jpg@160w_220h_1e_1c', 
               '教父', '\n                主演:马龙·白兰度,阿尔·帕西诺,詹姆斯·凯恩\n   
                    ', '上映时间:1972-03-24(美国)', '9.', '3'), 
                    ('9', 'http://p0.meituan.net/movie/62/109878.jpg@160w_220h_1e_1c', '唐伯虎点秋香',
                     '\n                主演:周星驰,巩俐,郑佩佩\n        
                     ', '上映时间:1993-07-01(中国香港)', '9.', '2'),
   ('10', 'http://p0.meituan.net/movie/9bf7d7b81001a9cf8adbac5a7cf7d766132425.jpg@160w_220h_1e_1c', 
   '千与千寻', '\n                主演:柊瑠美,入野自由,夏木真理\n        ',
    '上映时间:2001-07-20(日本)', '9.', '3')]

But this is not enough; the data is rather messy, so we process the match results further, traversing them and generating dictionaries. The method then reads as follows:

def parse_one_page(html):
    pattern = re.compile(
        '<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>',
        re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2].strip(),
            # drop the 3-character '主演:' prefix from the actor field
            'actor': item[3].strip()[3:] if len(item[3]) > 3 else '',
            # drop the 5-character '上映时间:' prefix from the time field
            'time': item[4].strip()[5:] if len(item[4]) > 5 else '',
            # join the integer and fraction parts of the score
            'score': item[5].strip() + item[6].strip()
        }
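
The slicing works because the raw actor field begins with the three-character prefix 主演: and the raw time field with the five-character prefix 上映时间:. A standalone illustration:

```python
# Raw values as captured by the regular expression from the page
raw_actor = '\n                主演:张国荣,张丰毅,巩俐\n        '
raw_time = '上映时间:1993-01-01(中国香港)'

# strip() removes surrounding whitespace; [3:] cuts '主演:',
# [5:] cuts '上映时间:'
print(raw_actor.strip()[3:])  # prints: 张国荣,张丰毅,巩俐
print(raw_time.strip()[5:])   # prints: 1993-01-01(中国香港)
```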

This successfully extracts the movie ranking, image, title, actors, release time, and score, and assigns them to dictionaries, forming structured data. The results are as follows:

{'image': 'http://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg@160w_220h_1e_1c', 'actor': '张国荣,张丰毅,巩俐',
 'score': '9.6', 'index': '1', 'title': '霸王别姬', 'time': '1993-01-01(中国香港)'}
{'image': 'http://p0.meituan.net/movie/__40191813__4767047.jpg@160w_220h_1e_1c', 'actor': '蒂姆·罗宾斯,摩根·弗里曼,鲍勃·冈顿',
 'score': '9.5', 'index': '2', 'title': '肖申克的救赎', 'time': '1994-10-14(美国)'}
{'image': 'http://p0.meituan.net/movie/fc9d78dd2ce84d20e53b6d1ae2eea4fb1515304.jpg@160w_220h_1e_1c', 'actor': 
'让·雷诺,加里·奥德曼,娜塔莉·波特曼', 'score': '9.5', 'index': '3', 'title': '这个杀手不太冷', 'time': '1994-09-14(法国)'}
{'image': 'http://p0.meituan.net/movie/23/6009725.jpg@160w_220h_1e_1c', 'actor': '格利高利·派克,奥黛丽·赫本,埃迪·艾伯特', 
'score': '9.1', 'index': '4', 'title': '罗马假日', 'time': '1953-09-02(美国)'}
{'image': 'http://p0.meituan.net/movie/53/1541925.jpg@160w_220h_1e_1c', 'actor': '汤姆·汉克斯,罗宾·怀特,加里·西尼斯', 
'score': '9.4', 'index': '5', 'title': '阿甘正传', 'time': '1994-07-06(美国)'}
{'image': 'http://p0.meituan.net/movie/11/324629.jpg@160w_220h_1e_1c', 'actor':
 '莱昂纳多·迪卡普里奥,凯特·温丝莱特,比利·赞恩', 'score': '9.5', 'index': '6', 'title': '泰坦尼克号', 'time': '1998-04-03'}
{'image': 'http://p0.meituan.net/movie/99/678407.jpg@160w_220h_1e_1c', 'actor': '日高法子,坂本千夏,糸井重里',
 'score': '9.2', 'index': '7', 'title': '龙猫', 'time': '1988-04-16(日本)'}
{'image': 'http://p0.meituan.net/movie/92/8212889.jpg@160w_220h_1e_1c', 'actor': '马龙·白兰度,阿尔·帕西诺,詹姆斯·凯恩',
 'score': '9.3', 'index': '8', 'title': '教父', 'time': '1972-03-24(美国)'}
 
{'image': 'http://p0.meituan.net/movie/62/109878.jpg@160w_220h_1e_1c', 'actor': '周星驰,巩俐,郑佩佩', 'score': '9.2', 
'index': '9', 'title': '唐伯虎点秋香', 'time': '1993-07-01(中国香港)'}
{'image': 'http://p0.meituan.net/movie/9bf7d7b81001a9cf8adbac5a7cf7d766132425.jpg@160w_220h_1e_1c', 'actor': '柊瑠美,
入野自由,夏木真理', 'score': '9.3', 'index': '10', 'title': '千与千寻', 'time': '2001-07-20(日本)'}

6. Writing to a File

Next, we write the extracted results to a file, here simply a text file. This is done by serializing the dictionary with the dumps() method of the json library and setting the ensure_ascii parameter to False, which ensures that the output keeps the Chinese characters rather than Unicode escape sequences. The code is as follows:

def write_to_file(content):
    with open('result.txt', 'a', encoding='utf-8') as f:
        # serialize the dictionary, keeping Chinese characters readable
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

Calling the write_to_file() method writes one dictionary to the text file; the content parameter here is the extraction result of a single movie, which is a dictionary.
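
The effect of ensure_ascii=False is easy to see in isolation: by default json.dumps() escapes Chinese characters as \uXXXX sequences, while ensure_ascii=False keeps them readable:

```python
import json

item = {'index': '1', 'title': '霸王别姬'}

print(json.dumps(item))                      # escaped: {"index": "1", "title": "\u9738\u738b\u522b\u59ec"}
print(json.dumps(item, ensure_ascii=False))  # readable: {"index": "1", "title": "霸王别姬"}
```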

7. Integrating the Code

Finally, implement the main() method to call the methods implemented above, writing the results of a single page to the file. The relevant code is as follows:

def main():
    url = 'http://maoyan.com/board/4'
    html = get_one_page(url)
    for item in parse_one_page(html):
        write_to_file(item)

At this point, we have completed the extraction of one page: the 10 movies on the first page are successfully extracted and saved to a text file.

8. Paginated Crawling

Since we want to grab the TOP100 movies, we still need to traverse the remaining pages, passing different offset values in the link to crawl the other 90 movies. Add the following call:

if __name__ == '__main__':
    for i in range(10):
        main(offset=i * 10)

The main() method also needs a small change: it now receives an offset value as the offset and uses it to construct the URL to crawl. The code is as follows:

def main(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)

At this point, our Maoyan TOP100 movie crawler is complete. After a little tidying up, the complete code is as follows:

import json
import requests
from requests.exceptions import RequestException
import re
import time

def get_one_page(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None

def parse_one_page(html):
    pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
                         r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
                         r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],
            'time': item[4].strip()[5:],
            'score': item[5] + item[6]
        }

def write_to_file(content):
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

def main(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)

if __name__ == '__main__':
    for i in range(10):
        main(offset=i * 10)
        time.sleep(1)

Maoyan now has stronger anti-crawling measures in place; if requests come too fast, the site stops responding, so a one-second delay is added here.
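
The fixed one-second pause is the simplest form of throttling. A slightly more reusable variant (a sketch, not from the original code) enforces a minimum interval between successive requests:

```python
import time

class Throttle:
    """Enforce a minimum interval between successive calls (sketch)."""
    def __init__(self, interval):
        self.interval = interval
        self.last_call = 0.0

    def wait(self):
        # Sleep only for the remainder of the interval, if any
        elapsed = time.time() - self.last_call
        if elapsed < self.interval:
            time.sleep(self.interval - elapsed)
        self.last_call = time.time()

throttle = Throttle(1.0)
for offset in (0, 10):
    throttle.wait()  # guarantees at least 1 second between iterations
    # main(offset) would be called here
```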

9. Running Results

Finally, run the code; the output is similar to the following:

{'index': '1', 'image': 'http://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg@160w_220h_1e_1c', 'title': 
'霸王别姬', 'actor': '张国荣,张丰毅,巩俐', 'time': '1993-01-01(中国香港)', 'score': '9.6'}
{'index': '2', 'image': 'http://p0.meituan.net/movie/__40191813__4767047.jpg@160w_220h_1e_1c', 'title': '肖申克的救赎',
 'actor': '蒂姆·罗宾斯,摩根·弗里曼,鲍勃·冈顿', 'time': '1994-10-14(美国)', 'score': '9.5'}
...
{'index': '98', 'image': 'http://p0.meituan.net/movie/76/7073389.jpg@160w_220h_1e_1c', 'title':
 '东京物语', 'actor': '笠智众,原节子,杉村春子', 'time': '1953-11-03(日本)', 'score': '9.1'}
{'index': '99', 'image': 'http://p0.meituan.net/movie/52/3420293.jpg@160w_220h_1e_1c', 'title': 
'我爱你', 'actor': '宋在河,李彩恩,吉海延', 'time': '2011-02-17(韩国)', 'score': '9.0'}
{'index': '100', 'image': ' 
 'title': '迁徙的鸟', 'actor': '雅克·贝汉,菲利普·拉波洛,Philippe Labro', 'time': '2001-12-12(法国)', 'score': '9.1'}

The middle portion of the output is omitted here. As you can see, the TOP100 movie information has been successfully crawled.

Now let's look at the text file; the result is shown in Figure 3-15.

3-15.jpg

Figure 3-15 operating results

As you can see, all the movie information has also been saved to the text file. Done!

10. Code for This Section

The code for this section is available at https://github.com/Python3WebSpider/MaoYan .

In this section, we crawled the Maoyan TOP100 movie information, practicing the use of requests and regular expressions. This is the most basic of examples; I hope it gives you a basic idea of how a crawler is implemented, as well as a deeper understanding of these two libraries.

Source: Huawei Cloud Community. Author: Cui Qingcai.

Origin: blog.csdn.net/devcloud/article/details/94554468