Crawling the movies now playing on Douban

1. Objectives

Crawl Douban's now-playing page for information about the films currently showing, including the film name, score, director, starring actors and other details. Save it to a CSV file, which you can open and view in Excel.

2. Approach

1. Obtain the URL of the page
2. Request the page source
3. Parse the source and extract the target information
4. Save the information

3. Preparations

1. Request the page source using webdriver.Chrome()
2. Parse the page using XPath

4. Implementation

1. Obtain URL

https://movie.douban.com/cinema/nowplaying/xian/

2. Request the page source
driver = webdriver.Chrome()
driver.get(r'https://movie.douban.com/cinema/nowplaying/xian/')
html = driver.page_source
driver.close()
3. Analyze the page source

Viewing the page source in Chrome, you can see that the movies now showing all sit under the element whose id is nowplaying.

Moreover, each film is in its own li tag, and the information we need is contained in that tag, so we only need to locate the li tags and then we can extract it.

4. Extract the target information
html = etree.HTML(html)
title = html.xpath('//*[@id="nowplaying"]//li/ul/li[2]/a/@title')
actor = html.xpath('//*[@id="nowplaying"]//li/@data-actors')
score = html.xpath('//*[@id="nowplaying"]//li/@data-score')
duration = html.xpath('//*[@id="nowplaying"]//li/@data-duration')
director = html.xpath('//*[@id="nowplaying"]//li/@data-director')
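Each expression above returns a flat list of strings, and the lists are later paired up by position. The following is a minimal sketch on a hand-written fragment: the markup, film names and the helper extract_rows are hypothetical stand-ins that only assume the structure the XPath expressions imply, not the real Douban page. It shows what the lists look like, plus an alternative that reads every attribute from one li at a time, so a film with a missing attribute cannot shift the other columns.

```python
from lxml import etree

# Hypothetical fragment shaped the way the XPath expressions expect;
# the real page contains many more elements and attributes.
SAMPLE = """
<div id="nowplaying">
  <ul class="lists">
    <li data-score="8.1" data-director="Director A"
        data-actors="Actor A / Actor B" data-duration="120 min">
      <ul>
        <li class="poster"></li>
        <li class="stitle"><a title="Film One">Film One</a></li>
      </ul>
    </li>
    <li data-score="0" data-director="Director B"
        data-actors="Actor C" data-duration="95 min">
      <ul>
        <li class="poster"></li>
        <li class="stitle"><a title="Film Two">Film Two</a></li>
      </ul>
    </li>
  </ul>
</div>
"""

tree = etree.HTML(SAMPLE)

# Same expressions as in the article: each one yields a list of strings.
titles = tree.xpath('//*[@id="nowplaying"]//li/ul/li[2]/a/@title')
scores = tree.xpath('//*[@id="nowplaying"]//li/@data-score')


def extract_rows(tree):
    """Build one tuple per film by reading attributes from each li in turn."""
    rows = []
    for li in tree.xpath('//*[@id="nowplaying"]//li[@data-score]'):
        title = li.xpath('./ul/li[2]/a/@title')
        rows.append((
            title[0] if title else '',      # empty string if no title found
            li.get('data-score', ''),
            li.get('data-director', ''),
            li.get('data-actors', ''),
            li.get('data-duration', ''),
        ))
    return rows


rows = extract_rows(tree)
```

The per-li variant produces tuples in the same order (film, score, director, starring, duration) as the zip call in the full source, so it could feed the same DataFrame.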
5. Save the information

Use pandas to create a DataFrame and store the data as a .csv file

df=pd.DataFrame(data=data,columns=['电影','评分','导演','主演','时长'])
df.to_csv('豆瓣最近上映.csv',encoding='gb18030')

So that the saved file can be viewed in Excel, it is written with the gb18030 encoding.
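As a rough illustration of the saving step, the sketch below writes a couple of made-up rows to a temporary directory. The sample rows, the English column names, the file names and the helper save_rows are all assumptions for the example, not part of the original script. It also passes index=False so that pandas does not write its row index, and shows utf-8-sig as an alternative encoding, which adds a byte order mark that Excel generally recognises.

```python
import os
import tempfile

import pandas as pd

# Made-up rows in the same order as the article's columns
# (film, score, director, starring, duration).
rows = [
    ('Film One', '8.1', 'Director A', 'Actor A / Actor B', '120 min'),
    ('Film Two', '0', 'Director B', 'Actor C', '95 min'),
]
columns = ['title', 'score', 'director', 'actors', 'duration']


def save_rows(rows, path, encoding='gb18030'):
    """Write the rows to a CSV file without the pandas row index."""
    df = pd.DataFrame(data=rows, columns=columns)
    df.to_csv(path, index=False, encoding=encoding)
    return df


out_dir = tempfile.mkdtemp()
gb_path = os.path.join(out_dir, 'now_playing_gb18030.csv')
bom_path = os.path.join(out_dir, 'now_playing_utf8.csv')

save_rows(rows, gb_path)                        # encoding used in the article
save_rows(rows, bom_path, encoding='utf-8-sig')  # UTF-8 with a byte order mark
```

Which encoding to prefer depends on who opens the file: gb18030 matches the article, while utf-8-sig is often easier to share with tools other than Excel.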

5. Source code

# -*- coding: utf-8 -*-
from selenium import webdriver
from lxml import etree
import pandas as pd


def get():
    # Load the page in Chrome and grab the rendered source
    driver = webdriver.Chrome()
    driver.get(r'https://movie.douban.com/cinema/nowplaying/xian/')
    html = driver.page_source
    driver.close()
    # Parse the source into a document object
    html = etree.HTML(html)
    title = html.xpath('//*[@id="nowplaying"]//li/ul/li[2]/a/@title')
    actor = html.xpath('//*[@id="nowplaying"]//li/@data-actors')
    score = html.xpath('//*[@id="nowplaying"]//li/@data-score')
    duration = html.xpath('//*[@id="nowplaying"]//li/@data-duration')
    director = html.xpath('//*[@id="nowplaying"]//li/@data-director')

    # Combine the lists into one tuple per film
    data = list(zip(title, score, director, actor, duration))
    return data


def saving(data):
    # Columns: film, score, director, starring, duration
    df = pd.DataFrame(data=data, columns=['电影', '评分', '导演', '主演', '时长'])
    # The file name means "recently released on Douban"
    df.to_csv('豆瓣最近上映.csv', encoding='gb18030')


def main():
    data = get()
    saving(data)


if __name__ == '__main__':
    main()

6. Results

[Image: screenshot of the results]

7. Evaluation

1. Some of the crawled information turns out to be missing or abnormal; for example, some scores have the value 0. For such movie ratings, we can use pandas to do some simple data cleaning.

2. Saving the CSV file adds pandas' own index column; if you do not want it, you can turn it off when saving the file (index=False in to_csv).
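The cleaning mentioned in the first point could look something like the sketch below. It assumes that a score of 0 means the film has no rating yet, which is an interpretation rather than something the page states; the sample data, the English column names and the helper clean_scores are made up for the example.

```python
import pandas as pd

# Made-up data; scores arrive as strings because they come from attributes.
df = pd.DataFrame(
    [('Film One', '8.1'), ('Film Two', '0'), ('Film Three', '7.4')],
    columns=['title', 'score'],
)


def clean_scores(df):
    """Return a copy with numeric scores, treating 0 as a missing value."""
    cleaned = df.copy()
    # Non-numeric strings become NaN instead of raising an error.
    cleaned['score'] = pd.to_numeric(cleaned['score'], errors='coerce')
    # Replace scores equal to 0 with NaN (assumption: 0 means "not rated").
    cleaned['score'] = cleaned['score'].mask(cleaned['score'] == 0)
    return cleaned


cleaned = clean_scores(df)
# Keep only the films that actually have a rating.
rated = cleaned.dropna(subset=['score'])
```

Marking the value as missing rather than deleting the row keeps the film in the table, so whether to drop it can be decided later.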

Origin blog.csdn.net/qq_45066719/article/details/95307927