Scraping the Maoyan Movie Rankings with requests

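Tutorials on scraping the Maoyan rankings are a dime a dozen online, so thank you to everyone who hit the pitfalls before me. I went and fell into every one of them again anyway. Cue manual sobbing.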

Here is my approach:

Build the page URL -> fetch the page source -> parse the page with a regular expression -> write the result to a TXT file

-----------------------------------------------------------------------------------------------------------------------------

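Building the page URL. Nothing much to say here: the offset grows by 10 per page, so appending '0' to the page number n gives the offset.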

def get_page_url(n):
    url=('https://maoyan.com/board/4?offset='+str(n)+'0')
    return url
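As a quick check of the URL scheme, here is a minimal sketch that computes the offset numerically instead of by string concatenation. It assumes the board shows 10 films per page, which is what the trailing '0' implies. One small difference: the original builds offset=00 for the first page, which the site presumably treats the same as offset=0.

```python
def get_page_url(n):
    # Page n starts at offset n * 10 (assumption: 10 films per page).
    return 'https://maoyan.com/board/4?offset=' + str(n * 10)

# Build the URLs for pages 0..9 and inspect the first and last ones.
urls = [get_page_url(n) for n in range(10)]
print(urls[0])   # https://maoyan.com/board/4?offset=0
print(urls[-1])  # https://maoyan.com/board/4?offset=90
```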

Fetching the page source

def get_one_page(url):
    page=requests.get(url)
    return page.text

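Parsing the page source with a regular expression. I hit a pitfall here: I forgot to wrap the regular expression in re.compile, and the run failed with an error.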

def parse_page(page):
    pattern=re.compile('<dd>.*?board-index.*?>(.*?)</i>.*?data-src=(.*?)alt=.*?data-act.*?>(.*?)</a>.*?class="star".*?>(.*?)</p>.*?releasetime">(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>',re.S)
    paged=re.findall(pattern,page)
    for item in paged:
        print(item)
    return paged
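To see what the seven capture groups hold, the pattern can be run against a small fragment without touching the network. This is only a sketch: the HTML below is a hypothetical fragment shaped to fit the pattern (the real page markup may differ), the film data is made up, and clean() is a helper introduced here for illustration. It shows that the raw groups still carry quotes and whitespace, and that the score arrives split into an integer part and a fraction part.

```python
import re

# Hypothetical fragment for one <dd> entry; values are invented for the demo.
sample = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <a href="/films/0" class="image-link">
    <img data-src="https://example.com/poster.jpg" alt="Film A" class="board-img" />
  </a>
  <p class="name"><a href="/films/0" data-act="boarditem-click">Film A</a></p>
  <p class="star">
    Starring: Actor A, Actor B
  </p>
  <p class="releasetime">Release date: 1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
'''

# Same pattern as parse_page; re.S lets '.' match newlines between tags.
pattern = re.compile('<dd>.*?board-index.*?>(.*?)</i>.*?data-src=(.*?)alt=.*?data-act.*?>(.*?)</a>.*?class="star".*?>(.*?)</p>.*?releasetime">(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>', re.S)
items = re.findall(pattern, sample)

def clean(item):
    # Hypothetical helper: strip whitespace and quotes, and join the score parts.
    index, poster, title, stars, release, integer, fraction = item
    return {
        'index': int(index),
        'poster': poster.strip().strip('"'),
        'title': title.strip(),
        'stars': stars.strip(),
        'release': release.strip(),
        'score': integer + fraction,
    }

movie = clean(items[0])
print(items[0])  # raw tuple, still with quotes and newlines
print(movie)     # cleaned dictionary
```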

Writing to a file. Another pitfall: I had imported the os module and called os.open, which kept raising errors. Normally you just call the built-in open().

def write_to_txt(paged):
    paged=str(paged)
    maoyan=open('猫眼电影排行榜.txt','a')
    maoyan.write(paged)
    maoyan.write('\n')
    maoyan.close()
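The difference between the two calls can be sketched as follows. The built-in open() returns a file object with a write() method, while os.open() is the low-level call that takes integer flags and returns an integer file descriptor, so calling it like open() fails. The path parameter, the explicit encoding='utf-8', and the temporary demo file are assumptions added for this example, not part of the original code.

```python
import os
import tempfile

def write_to_txt(paged, path):
    # Built-in open() gives a file object; the with-block closes it automatically.
    with open(path, 'a', encoding='utf-8') as f:
        f.write(str(paged))
        f.write('\n')

# Run the demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.txt')
    write_to_txt([('1', 'Film A')], path)

    # os.open() needs integer flags and returns a plain int descriptor.
    fd = os.open(path, os.O_RDONLY)
    fd_type = type(fd).__name__
    os.close(fd)

    # Read the file back to confirm what was written.
    with open(path, encoding='utf-8') as f:
        content = f.read()

print(fd_type)  # int
print(content)
```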

Full code

import requests
import re
# URL of the first page: 'https://maoyan.com/board/4?offset=0'
def get_page_url(n):
    url=('https://maoyan.com/board/4?offset='+str(n)+'0')
    return url

def get_one_page(url):
    page=requests.get(url)
    return page.text

def parse_page(page):
    pattern=re.compile('<dd>.*?board-index.*?>(.*?)</i>.*?data-src=(.*?)alt=.*?data-act.*?>(.*?)</a>.*?class="star".*?>(.*?)</p>.*?releasetime">(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>',re.S)
    paged=re.findall(pattern,page)
    for item in paged:
        print(item)
    return paged

def write_to_txt(paged):
    paged=str(paged)
    maoyan=open('猫眼电影排行榜.txt','a')
    maoyan.write(paged)
    maoyan.write('\n')
    maoyan.close()

def main():
    for i in range(0,10):
        url=get_page_url(i)
        page=get_one_page(url)
        writed=parse_page(page)
        write_to_txt(writed)

main()

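There are still gaps to fill. One is putting each element of the list on its own line (right now one whole list goes on one line); the other is writing the data to Excel for analysis.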
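One possible way to close both gaps, as a sketch only: loop over the list so each tuple gets its own line, and use the standard csv module to produce a file that Excel can open. The rows below are made-up, shortened tuples (index, title, score) rather than the seven fields parse_page returns, and write_lines and write_csv are hypothetical helpers. io.StringIO stands in for a real file so the example runs anywhere.

```python
import csv
import io

# Made-up rows; real rows from parse_page have seven fields.
paged = [
    ('1', 'Film A', '9.5'),
    ('2', 'Film B', '9.4'),
]

def write_lines(paged, out):
    # One tuple per line, fields separated by tabs.
    for item in paged:
        out.write('\t'.join(item) + '\n')

def write_csv(paged, out):
    # Header row first, then one CSV row per tuple.
    writer = csv.writer(out)
    writer.writerow(['index', 'title', 'score'])
    writer.writerows(paged)

text_out = io.StringIO()
write_lines(paged, text_out)

csv_out = io.StringIO()
write_csv(paged, csv_out)

print(text_out.getvalue())
print(csv_out.getvalue())
```

For a real file, csv_out could be replaced with open('maoyan.csv', 'w', newline='', encoding='utf-8-sig'); the file name is a placeholder, and the utf-8-sig encoding is suggested on the assumption that Excel needs the byte order mark to detect UTF-8.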


Reposted from www.cnblogs.com/yunman5/p/11448909.html