Crawling the Maoyan (Cat's Eye) movie ranking with XPath

I have been learning XPath recently. While looking for material online, I found a project that beginners often use for practice: crawling the Maoyan (Cat's Eye) Top 100 movie ranking. Many of the existing write-ups are very similar to Cui Qing's, basically copies. So I wrote a program using XPath instead. It also crawls Maoyan and collects the same information, and is offered here as another solution.

To be honest, for matching information on a web page I recommend XPath. Regular expressions can certainly do the job, but the patterns get too complicated, and a small oversight means nothing matches. This is especially true for beginners, who are not yet familiar with regular expressions and may not easily notice what went wrong. I generally use regular expressions for processing files, where they are an excellent tool.
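To make the comparison concrete, here is a minimal sketch that extracts the same movie title twice, once with a regular expression and once with an XPath expression. The HTML fragment, the movie name and the two helper functions are hypothetical stand-ins, not taken from the real page. The sketch uses the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath, so that it runs without extra packages; the program in this post uses lxml instead.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment modelled on one entry of the board.
SAMPLE = """
<dl>
  <dd>
    <i class="board-index">1</i>
    <p class="name"><a href="/films/1" title="Example Movie">Example Movie</a></p>
    <p class="star">
        Starring: Actor A, Actor B
    </p>
  </dd>
</dl>
""".strip()


def name_with_regex(markup):
    # The pattern spells out the tags and attributes around the title,
    # so a small change in the markup silently yields no match.
    match = re.search(r'<p class="name"><a[^>]*>(.*?)</a>', markup, re.S)
    return match.group(1) if match else None


def name_with_xpath(markup):
    # The path only states where the node sits in the tree.
    root = ET.fromstring(markup)
    node = root.find(".//p[@class='name']/a")
    return node.text if node is not None else None


print(name_with_regex(SAMPLE))
print(name_with_xpath(SAMPLE))
```

If the `<p>` tag gains one extra attribute, the regular expression above stops matching, while the XPath expression still finds the title.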

The code is posted below.

import requests
from requests.exceptions import RequestException
from lxml import etree
import csv
import re


def get_page(url):
    """
    Fetch the source code of a web page.
    :param url: address of the page
    :return: page text, or None if the request fails
    """
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/76.0.3809.100 Safari/537.36',
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_page(text):
    """
    Parse the page source.
    :param text: page source
    :return: iterator of (ranking, name, actors, release time, score) tuples
    """
    html = etree.HTML(text)
    movie_name = html.xpath("//p[@class='name']/a/text()")
    actor = html.xpath("//p[@class='star']/text()")
    # Remove the whitespace and line breaks around the cast text
    actor = list(map(lambda item: re.sub(r'\s+', '', item), actor))
    time = html.xpath("//p[@class='releasetime']/text()")
    # The score is split into an integer part and a fraction part
    grade1 = html.xpath("//p[@class='score']/i[@class='integer']/text()")
    grade2 = html.xpath("//p[@class='score']/i[@class='fraction']/text()")
    new = [grade1[i] + grade2[i] for i in range(min(len(grade1), len(grade2)))]
    ranking = html.xpath("//dd/i/text()")
    return zip(ranking, movie_name, actor, time, new)


def change_page(number):
    """
    Build the URL of the next page.
    :param number: offset of the first entry on the page
    :return: page URL
    """
    base_url = 'https://maoyan.com/board/4'
    url = base_url + '?offset=%s' % number
    return url


def save_to_csv(result, filename):
    """
    Append one row to the CSV file.
    :param result: one row of data
    :param filename: name of the CSV file
    :return:
    """
    # newline='' keeps the csv module from writing blank lines on Windows
    with open(filename, 'a', newline='') as csvfile:
        writer = csv.writer(csvfile, dialect='excel')
        writer.writerow(result)


def main():
    """
    Main function.
    :return:
    """
    # Ten pages of ten movies each: offsets 0, 10, ..., 90
    for i in range(0, 100, 10):
        url = change_page(i)
        text = get_page(url)
        if text is None:
            # Skip pages that could not be fetched
            continue
        result = parse_page(text)
        for j in result:
            save_to_csv(j, filename='message.csv')


if __name__ == '__main__':
    main()
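The program above reopens message.csv once for every row. One possible variation, sketched here on the assumption that all rows fit in memory, writes a header and all rows in a single pass. The function name save_all_to_csv, the header labels and the sample rows are hypothetical and not part of the original program; an in-memory buffer stands in for the real file so the sketch runs on its own.

```python
import csv
import io

# Column labels are an assumption; the original file has no header row.
HEADER = ["ranking", "name", "actors", "release_time", "score"]


def save_all_to_csv(rows, csvfile):
    """Write a header plus all rows to an already opened text file object."""
    writer = csv.writer(csvfile, dialect="excel")
    writer.writerow(HEADER)
    writer.writerows(rows)


# Made-up rows, shaped like the tuples produced by parse_page.
sample_rows = [
    ("1", "Example Movie", "Starring:ActorA", "Release:1993-01-01", "9.5"),
    ("2", "Another Movie", "Starring:ActorB", "Release:1994-09-10", "9.4"),
]

# Stands in for open('message.csv', 'w', newline='')
buffer = io.StringIO()
save_all_to_csv(sample_rows, buffer)
print(buffer.getvalue())
```

With a real file, the rows from all ten pages could be collected into one list first and then passed to this function once.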

Origin www.cnblogs.com/lattesea/p/11746488.html