How to Crawl the Maoyan Movies TOP100 List with a Web Crawler

Today Hippo Crawler Proxy shares how to crawl the Maoyan Movies TOP100 list with a web crawler. The fields we collect are the ranking, poster image, movie title, starring actors, release date, and rating. Before crawling, we first open the Maoyan Movies TOP100 page and analyze its structure to find where each piece of information is located; only then do we write the crawler.
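
To make the page analysis concrete, here is a minimal sketch of how one list entry might be located with a regular expression. The HTML in `SAMPLE_HTML` is a hypothetical fragment, not copied from the live site; it only assumes the class names (`board-index`, `name`, `star`, `releasetime`, `integer`, `fraction`) and the `data-src` attribute that the pattern in this article refers to. The names `SAMPLE_HTML` and `extract_fields` are illustrative.

```python
import re

# Hypothetical fragment modeled on the class names the article's pattern uses;
# the real page markup may differ.
SAMPLE_HTML = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <img data-src="http://example.com/poster.jpg" class="board-img" />
  <p class="name"><a href="/films/1">Example Movie</a></p>
  <p class="star">
    Stars: Actor A, Actor B
  </p>
  <p class="releasetime">Release: 1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
'''

def extract_fields(html):
    """Return a list of (rank, image, title, score) tuples found in html."""
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>'      # rank inside the board-index tag
        r'.*?data-src="(.*?)"'                  # poster URL
        r'.*?name"><a.*?>(.*?)</a>'             # title inside the name link
        r'.*?integer">(.*?)</i>'                # integer part of the score
        r'.*?fraction">(.*?)</i>.*?</dd>',      # fractional part of the score
        re.S)                                   # re.S lets . match newlines
    return [(rank, image, title, whole + frac)
            for rank, image, title, whole, frac in pattern.findall(html)]

rows = extract_fields(SAMPLE_HTML)
print(rows)
```

Each capture group corresponds to one piece of information found during page analysis, and the lazy `.*?` between groups skips over whatever markup separates them.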

The complete code is as follows:

import json
import re
import time

import requests
from requests.exceptions import RequestException

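def get_one_page(url):
    """Fetch one page and return its HTML, or None on any failure."""
    try:
        # Placeholder: replace with a real browser User-Agent string.
        headers = {'User-Agent': 'agent info'}
        # A timeout keeps the crawler from hanging on an unresponsive server.
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None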

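def parse_one_page(html):
    """Yield one dict per movie found in the page HTML."""
    # Raw strings keep \d from being read as a string escape.
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
        r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
        r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],  # skip the first 3 characters (field label)
            'time': item[4].strip()[5:],   # skip the first 5 characters (field label)
            'score': item[5] + item[6]     # integer part + fractional part
        }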

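def write_to_file(content):
    """Append one record to result.txt as a single JSON line."""
    with open('result.txt', 'a', encoding='utf-8') as f:
        # ensure_ascii=False keeps non-ASCII characters readable in the file.
        f.write(json.dumps(content, ensure_ascii=False) + '\n')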

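def main(offset):
    """Crawl one list page at the given offset and save its movies."""
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:
        # The request failed; skip this page instead of crashing in the parser.
        return
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)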

if __name__ == '__main__':
    # Ten pages in total; the offsets run 0, 10, ..., 90.
    for i in range(10):
        main(offset=i * 10)
        time.sleep(1)  # pause one second between requests
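
The loop above requests one page per offset. As a quick check of that paging scheme, the following sketch only builds the URLs and makes no network request. `BASE_URL` and `page_urls` are illustrative names, and the page size of 10 is an assumption taken from the `i * 10` step in the loop.

```python
BASE_URL = 'http://maoyan.com/board/4?offset='  # URL prefix used in main()

def page_urls(pages=10, page_size=10):
    """Build the list-page URLs for offsets 0, page_size, 2*page_size, ..."""
    return [BASE_URL + str(i * page_size) for i in range(pages)]

urls = page_urls()
for url in urls:
    print(url)
```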

With the code above, we can collect the data from the Maoyan Movies TOP100 list. Hippo HTTP Proxy provides safe, stable, efficient, and convenient proxy IP services for crawlers. If you have further questions, please contact customer service.
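
Because `write_to_file` appends one JSON object per line, the results can be loaded back line by line. The sketch below assumes that format. `read_results` is an illustrative helper that is not part of the original script, and the demo writes a made-up record to a temporary file rather than touching `result.txt`.

```python
import json
import os
import tempfile

def read_results(path):
    """Load a JSON-lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# Demo with a made-up record, written the same way write_to_file does.
sample = {'index': '1', 'title': 'Example Movie', 'score': '9.5'}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'result.txt')
    with open(demo_path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(sample, ensure_ascii=False) + '\n')
    loaded = read_results(demo_path)

print(loaded)
```

Note that the script opens `result.txt` in append mode, so running it twice leaves two copies of every record in the file.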


Origin www.cnblogs.com/hema2213/p/11039181.html