Crawling the Movies Now Playing on Douban
1. Objectives
Crawl the information Douban displays for movies now playing, including the film name, score, director, lead actors, and other details. Save it in a CSV file that can be opened and viewed in Excel.
2. Approach
1. Obtain the URL of the page
2. Request the page source
3. Parse the source and extract the target information
4. Save the information
3. Preparations
1. Request the page source using webdriver.Chrome()
2. Parse the page using XPath
4. Implementation
1. Obtain the URL
https://movie.douban.com/cinema/nowplaying/xian/
2. Request the page source
driver = webdriver.Chrome()
driver.get(r'https://movie.douban.com/cinema/nowplaying/xian/')
html=driver.page_source
driver.close()
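If driver.get() raises an exception, the snippet above never reaches driver.close() and the browser window is left open. The following is a minimal sketch of one way to guard against that, assuming Selenium and a matching Chrome driver are installed. fetch_page_source is a hypothetical helper name, not part of the original code, and it swaps close() for quit(), which ends the whole WebDriver session rather than only the current window.

```python
def fetch_page_source(url):
    """Return the rendered HTML of `url` (hypothetical helper, not from the original)."""
    # Imported inside the function so the browser dependency is only
    # needed when a page is actually fetched.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Runs whether or not driver.get() succeeded, so the browser
        # is always shut down.
        driver.quit()
```

It would be called as fetch_page_source('https://movie.douban.com/cinema/nowplaying/xian/').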
3. Analyze page source
Viewing the page source in Chrome shows that the movies now playing all sit under the element with id="nowplaying", the element the XPath expressions in the next step start from.
Each film is in its own li tag, and the information we need is contained in that tag, so we only need to locate the li tags to extract it.
4. Extract the target information
html = etree.HTML(html)
title = html.xpath('//*[@id="nowplaying"]//li/ul/li[2]/a/@title')
actor = html.xpath('//*[@id="nowplaying"]//li/@data-actors')
score = html.xpath('//*[@id="nowplaying"]//li/@data-score')
duration = html.xpath('//*[@id="nowplaying"]//li/@data-duration')
director = html.xpath('//*[@id="nowplaying"]//li/@data-director')
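Each attribute above is collected into its own list, and the full source code later zips those lists together by position. If a single li lacked one of the values, every later row would shift. Below is a minimal sketch of a per-item alternative that reads all fields from the same li. The SAMPLE markup is a hypothetical fragment modeled only on what the XPath expressions above imply, and extract_rows is a hypothetical helper name.

```python
from lxml import etree

# Hypothetical fragment shaped like the structure the XPath expressions assume.
SAMPLE = """
<div id="nowplaying">
  <ul>
    <li data-score="8.1" data-director="Director A"
        data-actors="Actor A / Actor B" data-duration="120 min">
      <ul>
        <li><img src="poster-a.jpg"/></li>
        <li><a title="Movie A" href="#">Movie A</a></li>
      </ul>
    </li>
    <li data-score="0" data-director="Director B"
        data-actors="Actor C" data-duration="95 min">
      <ul>
        <li><img src="poster-b.jpg"/></li>
        <li><a title="Movie B" href="#">Movie B</a></li>
      </ul>
    </li>
  </ul>
</div>
"""


def extract_rows(page_source):
    """Return one (title, score, director, actors, duration) tuple per film."""
    root = etree.HTML(page_source)
    rows = []
    # Select only the <li> elements that carry the data-* attributes,
    # skipping the nested <li> elements inside each film entry.
    for item in root.xpath('//*[@id="nowplaying"]//li[@data-score]'):
        titles = item.xpath('./ul/li[2]/a/@title')
        rows.append((
            titles[0] if titles else '',      # empty string if no title found
            item.get('data-score', ''),
            item.get('data-director', ''),
            item.get('data-actors', ''),
            item.get('data-duration', ''),
        ))
    return rows


rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

The tuples use the same field order as the column list in the saving step, so the result could be passed to pd.DataFrame in the same way.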
5. Save the information
Use pandas to create a DataFrame and store the data as a .csv file
df=pd.DataFrame(data=data,columns=['电影','评分','导演','主演','时长'])
df.to_csv('豆瓣最近上映.csv',encoding='gb18030')
So that the saved file displays correctly when opened in Excel, it is saved with the gb18030 encoding.
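A likely reason the encoding matters: on a Chinese-locale Windows system, Excel tends to read a CSV that has no byte-order mark using the system's legacy code page, so Chinese text written as plain UTF-8 can appear garbled. The sketch below uses only the standard library to compare how the same header row is laid out in bytes under gb18030, plain UTF-8, and utf-8-sig (UTF-8 with a BOM). utf-8-sig is an alternative that Excel generally recognizes and is not used in the original.

```python
header = '电影,评分,导演,主演,时长\n'

as_gb18030 = header.encode('gb18030')
as_utf8 = header.encode('utf-8')
as_utf8_sig = header.encode('utf-8-sig')

# The two encodings produce different byte sequences for the same text,
# which is why a reader that guesses the wrong one shows garbled characters.
print(len(as_gb18030), len(as_utf8))

# utf-8-sig is plain UTF-8 preceded by a 3-byte byte-order mark.
print(as_utf8_sig[:3] == b'\xef\xbb\xbf')

# Decoding with the encoding used for writing recovers the text.
print(as_gb18030.decode('gb18030') == header)
```

With pandas, either choice is passed through the encoding argument of to_csv, as in the original call.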
5. Source code
# -*- coding: utf-8 -*-
from selenium import webdriver
from lxml import etree
import pandas as pd


def get():
    # Load the page in Chrome and grab the rendered source
    driver = webdriver.Chrome()
    driver.get(r'https://movie.douban.com/cinema/nowplaying/xian/')
    html = driver.page_source
    driver.close()
    # Build the document object
    html = etree.HTML(html)
    title = html.xpath('//*[@id="nowplaying"]//li/ul/li[2]/a/@title')
    actor = html.xpath('//*[@id="nowplaying"]//li/@data-actors')
    score = html.xpath('//*[@id="nowplaying"]//li/@data-score')
    duration = html.xpath('//*[@id="nowplaying"]//li/@data-duration')
    director = html.xpath('//*[@id="nowplaying"]//li/@data-director')
    # Combine the lists into one tuple per film
    data = list(zip(title, score, director, actor, duration))
    return data


def saving(data):
    # Columns: movie, score, director, lead actors, duration
    df = pd.DataFrame(data=data, columns=['电影', '评分', '导演', '主演', '时长'])
    # gb18030 so that Excel displays the Chinese text correctly
    df.to_csv('豆瓣最近上映.csv', encoding='gb18030')


def main():
    data = get()
    saving(data)


if __name__ == '__main__':
    main()
6. Results
7. Evaluation
1. Some of the crawled information is missing or abnormal; for example, some movies have a score of 0. Values like these can be handled with some simple data cleaning in pandas.
2. pandas adds its own index column when saving the CSV file; if you do not want it in the file, you can change this setting when saving.
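Both points can be sketched in a few lines of pandas. This is only an illustration under stated assumptions: the rows are made-up sample data, a score of 0 is treated as "no rating yet" (an interpretation, not something the page confirms), and the CSV is written to an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Made-up sample rows in the same column order as the crawler's output.
rows = [
    ('Movie A', '8.1', 'Director A', 'Actor A', '120'),
    ('Movie B', '0', 'Director B', 'Actor B', '95'),
]
df = pd.DataFrame(rows, columns=['电影', '评分', '导演', '主演', '时长'])

# The crawled scores are strings; convert them to numbers, turning
# anything unparseable into NaN.
df['评分'] = pd.to_numeric(df['评分'], errors='coerce')

# Treat a score of 0 as missing (assumption: 0 means no rating yet).
df['评分'] = df['评分'].mask(df['评分'] == 0)

# index=False leaves out the automatically generated index column.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file, the same index=False argument can be combined with the encoding argument used earlier.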