Today, Hippo crawler proxy will share with everyone how to use a web crawler to scrape the Maoyan (Cat's Eye) movie TOP list. The fields we crawl are each movie's rank, poster image, title, starring actors, release time, and rating. Before crawling, we first open the Maoyan TOP100 page and analyze its structure to find where each piece of information sits in the HTML, and then crawl it.
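As a rough illustration of the page-analysis step, the sketch below runs the same regular expression the crawler uses against a small hand-written HTML fragment. The fragment, its placeholder values, and the names SAMPLE_HTML, PATTERN, and parse_sample are assumptions made for illustration; the real page carries more attributes and nesting, and its layout may have changed since this was written.

```python
import re

# Hypothetical markup modeled on a single <dd> entry of the board page.
# All values are placeholders, not data taken from the site.
SAMPLE_HTML = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <img data-src="https://example.com/poster.jpg" alt="Example Movie" class="board-img" />
  <p class="name"><a href="/films/0" title="Example Movie">Example Movie</a></p>
  <p class="star">
    主演：Actor A,Actor B
  </p>
  <p class="releasetime">上映时间：1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
</dd>
'''

# re.S lets "." match newlines, so one pattern can span a whole <dd> block.
PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
    r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
    r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)


def parse_sample(html):
    """Return one dict per <dd> entry matched by PATTERN."""
    movies = []
    for m in PATTERN.findall(html):
        movies.append({
            'index': m[0],
            'image': m[1],
            'title': m[2],
            'actor': m[3].strip()[3:],  # skip the 3-character label "主演："
            'time': m[4].strip()[5:],   # skip the 5-character label "上映时间："
            'score': m[5] + m[6],       # integer part + fractional part
        })
    return movies


if __name__ == '__main__':
    print(parse_sample(SAMPLE_HTML))
```

Each capture group corresponds to one field, which is why the slices [3:] and [5:] appear: they cut off the text labels that precede the actor list and the release time in this assumed markup.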
The code is as follows:
import json
import requests
from requests.exceptions import RequestException
import re
import time
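
def get_one_page(url):
    """Fetch one page and return its HTML text, or None on failure."""
    try:
        # Placeholder value: replace it with your own browser's User-Agent string.
        headers = {'User-Agent': 'Mozilla/5.0'}
        # The timeout keeps a stalled connection from hanging the crawler.
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None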
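
def parse_one_page(html):
    """Yield one dict per movie found in the page HTML."""
    # Raw strings keep \d from being read as a string escape sequence.
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
        r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
        r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],  # skip the 3-character label before the names
            'time': item[4].strip()[5:],   # skip the 5-character label before the date
            'score': item[5] + item[6]     # integer part + fractional part
        }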

def write_to_file(content):
    """Append one record to result.txt as a single line of JSON."""
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

def main(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:
        # The request failed or returned a non-200 status; skip this page.
        return
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)

if __name__ == '__main__':
    # 10 pages of 10 movies each: offsets 0, 10, ..., 90.
    for i in range(10):
        main(offset=i * 10)
        time.sleep(1)  # pause between requests
With the code above, we can obtain the data from the Maoyan movie TOP list. Hippo HTTP proxy provides safe, stable, efficient, and convenient crawler proxy IP services. If you have more questions, please contact customer service.
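Since write_to_file appends one JSON object per line to result.txt, the crawled data can be read back for further processing. The following is a minimal sketch of one way to do that; the function name load_results is an assumption and is not part of the original code.

```python
import json


def load_results(path='result.txt'):
    """Read a file holding one JSON object per line and return a list of dicts."""
    movies = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                movies.append(json.loads(line))
    return movies
```

For example, sorted(load_results(), key=lambda m: float(m['score']), reverse=True) would order the movies by rating, given that the crawler stores each score as a string such as '9.6'.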