Python crawler: scraping the Maoyan Top 100 movies with regular expressions

import json
import requests
from requests.exceptions import RequestException
import re
import time


def get_one_page(url):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_one_page(html):
    # Regex for one <dd> entry on the board page: rank, poster URL, title,
    # star list, release time, and the two halves of the score.
    pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
                         + r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
                         + r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],  # drop the leading "主演:" prefix (open the page with F12 and check the Network panel to see the raw markup)
            'time': item[4].strip()[5:],   # likewise, drop the leading "上映时间:" prefix
            'score': item[5] + item[6]     # integer part + fractional part of the score
        }


def write_to_file(content):
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def main(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:  # request failed or returned a non-200 status
        return
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)


if __name__ == '__main__':
    for i in range(10):  # 10 pages of 10 movies each: offset 0, 10, ..., 90
        main(offset=i * 10)
        time.sleep(1)  # pause briefly between page requests
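
As a quick sanity check, parse_one_page can be run against a small hand-written HTML fragment instead of the live site. This is a minimal sketch using hypothetical markup shaped like one <dd> entry of the board page (it assumes the script above has already been loaded):

# Minimal sketch: a hand-made <dd> fragment (hypothetical markup) to see
# which groups the regular expression captures.
sample_html = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <img data-src="https://example.com/poster.jpg" alt="poster"/>
  <p class="name"><a href="/films/1203">霸王别姬</a></p>
  <p class="star">主演:张国荣,张丰毅,巩俐</p>
  <p class="releasetime">上映时间:1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
'''

for movie in parse_one_page(sample_html):
    print(movie)
# Expected shape:
# {'index': '1', 'image': 'https://example.com/poster.jpg', 'title': '霸王别姬',
#  'actor': '张国荣,张丰毅,巩俐', 'time': '1993-01-01', 'score': '9.5'}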

Analysis:

1. First we need the HTML source of the page. We fetch it with the requests.get() method, check that the status code is 200, and return the page text (otherwise None). This is what the get_one_page function does.
2. (1) Once we have the source, we extract what we want with a regular expression: the rank, the poster image, the film title, the starring actors, the release time, and the score.
(2) re.compile returns a compiled pattern object, and re.findall() returns every match as a tuple; wrapping each tuple into a dictionary inside a generator makes the output more readable and more structured.
(3) After extracting the fields we need, we can store them in a file. Here we use json.dumps from the json library; the key point is the ensure_ascii=False parameter, which keeps the Chinese text readable instead of escaped (see the sketch after this list).
(4) About the offset parameter: in https://maoyan.com/board/4?offset=10 the offset is 10, which is the second page; for the third page the offset is 20, and so on. So we splice the offset into the URL by string concatenation, which lets us print and save the film information from all the pages (also shown in the sketch below).
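
A minimal sketch of points (3) and (4), using an illustrative record (not real scraped data), showing what ensure_ascii=False changes and how the offset is spliced into the URL:

import json

item = {'index': '1', 'title': '霸王别姬', 'score': '9.5'}  # illustrative record

# The default escapes non-ASCII characters in the output...
print(json.dumps(item))
# {"index": "1", "title": "\u9738\u738b\u522b\u59ec", "score": "9.5"}

# ...while ensure_ascii=False keeps the Chinese characters readable in result.txt.
print(json.dumps(item, ensure_ascii=False))
# {"index": "1", "title": "霸王别姬", "score": "9.5"}

# Page n of the board corresponds to offset = (n - 1) * 10.
for i in range(3):
    print('http://maoyan.com/board/4?offset=' + str(i * 10))
# http://maoyan.com/board/4?offset=0
# http://maoyan.com/board/4?offset=10
# http://maoyan.com/board/4?offset=20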

The crawl results:

(screenshot of the scraped output omitted)

Origin blog.csdn.net/qq_45353823/article/details/104228044