Crawling the reviews of the recent hit "Avengers 4" in 10 minutes with Python



"The Avengers 4: The final battle" has been released fast three weeks, breaking global box office of $ 2.4 billion, the domestic box office breaking four billion yuan.

Although the buzz is gradually fading, we can still shamelessly ride the wave of its popularity. When "Avengers 4" first came out, its Douban rating had broken nine points.

It has since declined, and the rating of "Avengers 4" is now stable at 8.6. People complain that Douban ratings are heavily padded by fake reviewers and that malicious scores abound, but we still chose Douban as our crawling target, because it is easy to crawl ~. Douban comments contain text, images, and other elements; for simplicity, we only crawl the text.

Open the Douban comments page for "Avengers 4" in a browser and take a look at the structure of the URL:

https://movie.douban.com/subject/26100958/comments?start=20&limit=20&sort=new_score&status=P

As you can see, we can move to a different page by modifying the start value, as the sketch below shows.
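As a quick illustration (this sketch is ours, not part of the original article), the following prints the URLs of the first five pages; each page begins at a multiple of 20:

base = 'https://movie.douban.com/subject/26100958/comments'
for start in range(0, 100, 20):
    # start = 0, 20, 40, 60, 80 -> pages 1 through 5
    print(base + '?start=' + str(start) + '&limit=20&sort=new_score&status=P')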

Right-click and choose View Source to see the HTML code the browser received. Press Ctrl+F and search for a few keywords from the first review to quickly locate the review tags:

You can see that each review's content sits inside a span tag whose class is "short".
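To make the structure concrete, here is a small sketch; the markup is reconstructed for illustration, not copied verbatim from Douban:

from bs4 import BeautifulSoup

sample = '<div class="comment"><span class="short">This is one review.</span></div>'
soup = BeautifulSoup(sample, 'html.parser')
# find_all picks out every span whose class is "short"
print(soup.find_all('span', attrs={'class': 'short'}))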

An overview of the crawling steps:

1) Access the URL and get the HTML page text; for this step we use the requests module.

2) Parse the returned text and extract the content we want; for this step we use the BeautifulSoup module.

Both modules can be installed directly through pip (pip install requests beautifulsoup4). A minimal sketch of the two steps follows.
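Before turning to the real target, here is a minimal end-to-end sketch of the two steps, using example.com as a stand-in target:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://example.com')       # step 1: fetch the page text
r.raise_for_status()
soup = BeautifulSoup(r.text, 'html.parser')   # step 2: parse the HTML
print(soup.title.get_text())                  # prints: Example Domain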

First, the main function:

def main():
    discuss = []              # collected review texts
    a = 0                     # counter for numbering the output
    for i in range(0, 100, 20):
        # i is the start offset of each page; every page shows 20 reviews
        url = 'https://movie.douban.com/subject/26100958/comments?start='+ str(i) +'&limit=20&sort=new_score&status=P'
        HTMLpage = getHTML(url)
        for t in parseHTML(HTMLpage):
            discuss.append(t)
    for i in discuss:
        print(str(a) + ':' + i)   # print each review with its index
        a = a + 1

Because Douban displays 20 reviews per page and we crawl the first 100, the loop above visits the first 5 pages. Next comes getHTML:

def getHTML(url):
    try:
        r = requests.get(url)
        r.raise_for_status()              # raise an exception on HTTP errors
        print("get html successfully")
        r.encoding = 'utf-8'              # set the encoding explicitly
        return r.text
    except:
        return ""

In the getHTML function, we request the target page and return the HTML text. Note that the encoding is set to UTF-8 explicitly; if we set r.encoding = r.apparent_encoding instead, the program cannot guess the correct encoding.

When r.raise_for_status() does not throw an exception, the program tells us the HTML was fetched successfully. If there is an exception, we return an empty string.
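If you want to see what requests detects on its own, here is a small sketch (an illustration, not part of the original code; note that some sites reject requests that lack browser-like headers):

import requests

url = 'https://movie.douban.com/subject/26100958/comments?start=0&limit=20&sort=new_score&status=P'
r = requests.get(url)
print(r.status_code)        # 200 means the request succeeded
print(r.encoding)           # encoding declared in the response headers
print(r.apparent_encoding)  # encoding guessed from the body content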

The next step is parsing:

As mentioned above, the reviews are spans whose class is short, so we can directly use bs4's find_all() function to get a list of all the review tags. We only need to extract the text from each tag and return it to the main function.

First we create a BeautifulSoup object using the html parser. An HTML page is a tree; we could find the tags we need through various tree traversals, but bs4 provides the quick and blunt find_all, which we can use directly.

The find_all() function returns a list of the matching tags.

def parseHTML(html):
    try:
        soup = BeautifulSoup(html, "html.parser")
        A = soup.find_all('span', attrs = {'class':'short'})   # all review spans
        B = []
        for i in A:
            B.append(i.get_text())   # keep only the text inside each span
        return B
    except:
        return []

The get_text function strips away the span tags, leaving only the text content, which we append to list B. Then we can return it. As before, if anything goes wrong, we return an empty list.
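To see concretely what get_text does, here is a tiny sketch with made-up markup; it drops the tags and keeps only the text, even when other tags are nested inside:

from bs4 import BeautifulSoup

sample = '<span class="short">Great <b>movie</b>!</span>'
tag = BeautifulSoup(sample, 'html.parser').find('span')
print(tag)             # <span class="short">Great <b>movie</b>!</span>
print(tag.get_text())  # Great movie!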

That's it: a very simple little crawler. By changing the number crawled, you can fetch the comments from any number of pages. Of course, we will later do some interesting analysis on this data, so stay tuned. Also, since the author's abilities are limited, this series may once again be indefinitely delayed /grin
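To make "any number of pages" concrete, here is a hedged sketch reusing the article's getHTML and parseHTML; the function name, page-count parameter, and one-second delay are illustrative, not from the article:

import time

def crawl_reviews(pages=5, delay=1.0):
    base = 'https://movie.douban.com/subject/26100958/comments'
    reviews = []
    for page in range(pages):
        # each page holds 20 reviews, so page n starts at n * 20
        url = base + '?start=' + str(page * 20) + '&limit=20&sort=new_score&status=P'
        reviews.extend(parseHTML(getHTML(url)))
        time.sleep(delay)   # pause between requests to go easy on the server
    return reviews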

The full code and the run result are attached below [for the code download, see the comments section].

import requests
from bs4 import BeautifulSoup
def getHTML(url):
    try:
        r = requests.get(url)
        r.raise_for_status()
        print("get html successfully")
        r.encoding = 'utf-8'
        #print(r.text)
        return r.text
    except:
        return ""
def parseHTML(html):
    try:
        soup = BeautifulSoup(html,"html.parser")
        A = soup.find_all('span',attrs = {'class':'short'})
        B = []
        for i in A:
            B.append(i.get_text())
        return B
    except:
        return []
def main():
    discuss = []
    a = 0
    for i in range(0,100,20):
        url = 'https://movie.douban.com/subject/26100958/comments?start='+ str(i) +'&limit=20&sort=new_score&status=P'
        HTMLpage = getHTML(url)
        #print(HTMLpage)
        for t in parseHTML(HTMLpage):
            discuss.append(t)
    for i in discuss:
        print(str(a) + ':' + i)
#        print(i)
        a = a + 1
        
if __name__ == "__main__":
    main()

The run result:
