Python Crawler Practice - Douban Movie Top250 (to be continued)

Crawl the web page: https://movie.douban.com/top250
Crawl target: Get each movie's title, rating, short comment, and number of ratings, and write the crawled data to a txt document.
Check the HTML code corresponding to the webpage. The pagination information is at the tail of the webpage.
The position of the movie list on the page is an ol tag whose class attribute is grid_view. Each page has 25 movies, a total of 10 pages.
a. Movie name:
The name of each movie is in the first span tag with a class attribute value of title, under the div tag with a class attribute value of hd.
b. Rating:
The rating of each movie is in the (unique) span tag with a class attribute value of rating_num in the corresponding li tag.
c. Number of ratings:
The number of people who rated each movie is the last number in the div tag with a class attribute value of star in the corresponding li tag.
d. Short comment:
The short comment of each movie is in a span tag with a class attribute value of inq in the corresponding li tag.
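
Item c relies on taking the last number in the star div. A minimal sketch of that idea, using only the standard re module; the HTML string is a made-up sample for illustration (not copied from the site), and last_number is a hypothetical helper name:

```python
import re

# Made-up sample of one movie's rating block (illustrative only).
sample_star_div = (
    '<div class="star">'
    '<span class="rating5-t"></span>'
    '<span class="rating_num">9.7</span>'
    '<span>1234567人评价</span>'
    '</div>'
)

def last_number(text):
    """Return the last run of digits in text, or None if there are no digits."""
    numbers = re.findall(r'\d+', text)
    return numbers[-1] if numbers else None

# Digits also appear in class names and in the rating itself,
# which is why only the last match is taken.
print(re.findall(r'\d+', sample_star_div))
print(last_number(sample_star_div))
```

Because the rating "9.7" and a class name such as "rating5-t" also contain digits, taking the first match would give the wrong value; the count of ratings comes last in this layout.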

Obtaining data:
Analysis:
Problem: The request returns 403. Possible causes:
• The website requires login and the request is not logged in
• The server refuses access (the crawler is recognized)
Solution: Request header processing
• The User-Agent header identifies the type of browser
• Set it to masquerade as a browser
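
The header idea can be tried without any third-party package. The sketch below uses the standard library's urllib.request instead of requests (a swapped-in technique, chosen only so the example runs on its own); build_request and fetch are hypothetical helper names, and nothing is sent unless fetch is called:

```python
import urllib.error
import urllib.request

BROWSER_UA = (
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/53.0.2785.89 Safari/537.36'
)

def build_request(url):
    """Build a request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': BROWSER_UA})

def fetch(url, timeout=10):
    """Send the request and turn a 403 into a clearer error message."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 403:
            raise RuntimeError('403 Forbidden: the server refused the request') from err
        raise

# Only builds the request; no network traffic happens here.
req = build_request('https://movie.douban.com/top250')
print(req.get_header('User-agent'))
```

urllib stores header names in capitalized form, which is why the lookup uses 'User-agent'.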

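import requests
import codecs
from bs4 import BeautifulSoup

URL = 'https://movie.douban.com/top250'

def download_page(url):
    # Send a browser-like User-Agent so the server is less likely to answer with 403.
    # The string is split with adjacent literals; a backslash continuation inside
    # the quotes would embed the next line's indentation in the header value.
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/53.0.2785.89 Safari/537.36'
    }
    data = requests.get(url, headers=header).content
    return data

def main():
    print(download_page(URL))

if __name__ == '__main__':
    main()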

Parse web pages with BeautifulSoup

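import re  # needed for re.findall below

def getData(html):
    soup = BeautifulSoup(html, "html.parser")
    # Find the first ol tag whose class attribute is grid_view
    movieList = soup.find('ol', attrs={'class': 'grid_view'})
    movieInfo = []
    # outputMode is not defined in the original; this format string is an assumed
    # layout that uses the fifth argument (a full-width space) as the fill character.
    outputMode = "{0:{4}<12}\t{1:<6}\t{2:<10}\t{3}"
    for movieLi in movieList.find_all('li'):  # iterate over all li tags
        data = []
        # Movie name: first div with class hd, then its first span with class title
        movieHd = movieLi.find('div', attrs={'class': 'hd'})
        movieName = movieHd.find('span', attrs={'class': 'title'}).getText()  # .string also works
        data.append(movieName)

        # Rating
        movieScore = movieLi.find('span', attrs={'class': 'rating_num'}).getText()
        data.append(movieScore)

        # Number of ratings: the last number inside the div with class star
        movieEval = movieLi.find('div', attrs={'class': 'star'})
        movieEvalNum = re.findall(r'\d+', str(movieEval))[-1]
        data.append(movieEvalNum)

        # Short comment (some movies have none)
        movieQuote = movieLi.find('span', attrs={'class': 'inq'})
        if movieQuote:
            data.append(movieQuote.getText())
        else:
            data.append("无")  # "无" means "none"
        print(outputMode.format(data[0], data[1], data[2], data[3], chr(12288)))
        movieInfo.append(data)
    return movieInfo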
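
The print call passes chr(12288) as its last argument. A likely reason is column alignment: chr(12288) is the full-width space (U+3000), which is as wide as one Chinese character, so it can serve as the fill character when padding Chinese titles. The sketch below illustrates this under that assumption; pad_title is a hypothetical helper, and the rows are sample values for illustration only:

```python
FULL_WIDTH_SPACE = chr(12288)  # U+3000, same display width as one CJK character

def pad_title(title, width=8):
    """Left-align title and pad it to width characters with full-width spaces."""
    # The nested fields supply the fill character and the width of the format spec.
    return "{0:{1}<{2}}".format(title, FULL_WIDTH_SPACE, width)

# Sample rows for illustration only.
rows = [("肖申克的救赎", "9.7"), ("霸王别姬", "9.6")]
for title, score in rows:
    print(pad_title(title) + score)
```

Padding with an ordinary ASCII space would leave the score column ragged in most fonts, because a Chinese character is wider than one ASCII space.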

Pagination:
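
Since the list has 10 pages of 25 movies each, one way to walk through them is to generate one URL per page. The sketch below assumes the page offset is passed as a start query parameter; that parameter name is an assumption, not something stated above, so check it against the links at the bottom of the page. page_urls is a hypothetical helper name:

```python
BASE_URL = 'https://movie.douban.com/top250'

def page_urls(base_url=BASE_URL, per_page=25, pages=10):
    """Return one URL per page, assuming an offset parameter named 'start'."""
    return ["{0}?start={1}".format(base_url, page * per_page)
            for page in range(pages)]

for url in page_urls():
    # Each URL could then be passed to download_page and getData.
    print(url)
```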
