Crawl the web page: https://movie.douban.com/top250
Crawl target: get each movie's title, rating, short comment, and number of ratings, then write the parsed data to a txt document.
Inspect the HTML source of the page:
The movie list on the page is an ol tag whose class attribute is grid_view. Each page has 25 movies, and there are 10 pages in total. Each movie is one li tag inside that list.
a. Movie name: in the first span tag whose class attribute is title, inside the div tag whose class attribute is hd;
b. Rating: in the (unique) span tag whose class attribute is rating_num, inside the movie's li tag;
c. Number of ratings: the last number in the div tag whose class attribute is star, inside the movie's li tag;
d. Short comment: in the span tag whose class attribute is inq, inside the movie's li tag. Not every movie has one.
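Item c is the least obvious of the four, so here is a minimal sketch of it using only the standard library. The HTML fragment is a made-up example shaped like the structure described above, not a copy of the live page, and last_number is a hypothetical helper name.

```python
import re

# Hypothetical fragment shaped like the div with class "star" described above.
SAMPLE_STAR_DIV = (
    '<div class="star">'
    '<span class="rating5-t"></span>'
    '<span class="rating_num">9.7</span>'
    '<span>2000000人评价</span>'
    '</div>'
)


def last_number(fragment):
    """Return the last run of digits in the fragment as a string, or None."""
    numbers = re.findall(r'\d+', fragment)
    return numbers[-1] if numbers else None


print(last_number(SAMPLE_STAR_DIV))
```

Taking the last match matters because earlier digits, such as those in the rating or in a class name like rating5-t, also match \d+.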
Obtaining data:
Analysis:
Problem: the request may come back with status 403 (Forbidden). Possible causes:
• The site requires login and the request is not logged in
• The server refuses access (the crawler has been recognized)
Solution: set the request headers
• The User-Agent header identifies the type of browser
• Set it to a browser's value so the request masquerades as a browser
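The same idea can be sketched with only the standard library. Note that urllib is swapped in here purely for illustration (the article's own code uses requests), and build_request and fetch are hypothetical helper names. The sketch also shows one way to notice that the server still answers 403.

```python
import urllib.error
import urllib.request

# Browser-like User-Agent value, the same string used in the article's code.
BROWSER_UA = (
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/53.0.2785.89 Safari/537.36'
)


def build_request(url):
    """Build a request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': BROWSER_UA})


def fetch(url):
    """Return the response body as bytes, or None if the server answers 403."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=10) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 403:
            return None  # still refused: the header alone was not enough
        raise
```

If fetch returns None even with the header set, the cause is more likely the missing login than the User-Agent.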
import requests
import codecs
from bs4 import BeautifulSoup

URL = 'http://movie.douban.com/top250'


def download_page(url):
    # Send a browser-like User-Agent so the server does not answer with 403.
    # Adjacent string literals are joined, so no stray whitespace ends up in the header.
    header = {
        'User-Agent': ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                       '(KHTML, like Gecko) Chrome/53.0.2785.89 Safari/537.36')
    }
    data = requests.get(url, headers=header).content  # raw bytes of the page
    return data


def main():
    print(download_page(URL))


if __name__ == '__main__':
    main()
Parse web pages with BeautifulSoup
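import re  # needed for re.findall below; missing from the imports above

# Assumption: outputMode is not defined in this excerpt, so this format string is only
# an example. Argument 4, chr(12288), is the full-width space; it is used as the fill
# character so that columns of Chinese text line up.
outputMode = "{0:{4}<20}\t{1:<6}\t{2:<10}\t{3:{4}<20}"


def getData(html):
    soup = BeautifulSoup(html, "html.parser")
    # Find the first ol tag whose class attribute is grid_view
    movieList = soup.find('ol', attrs={'class': 'grid_view'})
    moveInfo = []
    for movieLi in movieList.find_all('li'):  # every li tag is one movie
        data = []
        # Movie name: first span with class title inside the first div with class hd
        movieHd = movieLi.find('div', attrs={'class': 'hd'})
        movieName = movieHd.find('span', attrs={'class': 'title'}).getText()
        # The .string attribute can be used instead of getText()
        data.append(movieName)
        # Rating
        movieScore = movieLi.find('span', attrs={'class': 'rating_num'}).getText()
        data.append(movieScore)
        # Number of ratings: the last number inside the div with class star
        movieEval = movieLi.find('div', attrs={'class': 'star'})
        movieEvalNum = re.findall(r'\d+', str(movieEval))[-1]
        data.append(movieEvalNum)
        # Short comment; some movies have none
        movieQuote = movieLi.find('span', attrs={'class': 'inq'})
        if movieQuote:
            data.append(movieQuote.getText())
        else:
            data.append("无")  # placeholder meaning "none"
        print(outputMode.format(data[0], data[1], data[2], data[3], chr(12288)))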
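The stated crawl target is a txt document, and codecs is imported above, so one possible way to write the rows out is sketched here. The helper name save_rows, the tab-separated layout, the file name, and the sample rows are all assumptions made for illustration.

```python
import codecs


def save_rows(rows, path):
    """Write one tab-separated line per movie to a UTF-8 text file."""
    with codecs.open(path, 'w', encoding='utf-8') as f:
        for name, score, votes, quote in rows:
            f.write('\t'.join([name, score, votes, quote]) + '\n')


# Made-up rows in the same order as the data list built above:
# name, rating, number of ratings, short comment.
example_rows = [
    ['Movie A', '9.7', '2000000', 'A short comment'],
    ['Movie B', '9.6', '1500000', '无'],
]
save_rows(example_rows, 'top250_example.txt')
```

Opening the file with an explicit utf-8 encoding keeps the Chinese titles and comments readable regardless of the platform default.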
Pagination:
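With 25 movies per page and 10 pages, the page URLs can be generated rather than followed one link at a time. The sketch below assumes the site selects a page through a start query parameter that counts movies in steps of 25; that parameter name and the helper page_urls are assumptions to check against the "next page" link in the page source.

```python
BASE_URL = 'https://movie.douban.com/top250'
PAGE_SIZE = 25   # movies per page, as stated earlier
PAGE_COUNT = 10  # total number of pages, as stated earlier


def page_urls():
    """Return one URL per page; the "start" parameter name is an assumption."""
    return ['{}?start={}'.format(BASE_URL, page * PAGE_SIZE)
            for page in range(PAGE_COUNT)]


for url in page_urls():
    print(url)
```

Each URL can then be passed to download_page and the result to getData.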