Python crawler: how to automatically download Joey Wong posters?

The crawling process

How do you write a crawler to grab data? A crawler simulates the way a browser accesses a website. The whole process has three stages: opening the page, extracting the data, and saving the data.

In Python, each of these three stages has a corresponding tool:

"Open page" step, you can use the Requests access the page, get back to our server data, including HTML pages, and JSON data

"Extracted data", uses two tools for HTML pages, using XPath elemental locate, extract data; for JSON data, using JSON parsing

"Save data", using Pandas save the data, export CSV file

Accessing pages with Requests

Requests is Python's HTTP client library. It offers two access methods, get and post: with get the parameters are contained in the URL, while post passes parameters through the request body. If we want to access Douban with get, the code is:

r = requests.get('http://www.douban.com')

r holds the result of the get request; r.text or r.content then gives you the returned HTML or the raw bytes. If you want to submit a form, use post, as follows:

r = requests.post('http://xxx.com', data={'key': 'value'})

Here data carries the form's parameters; it is a dictionary, storing them as key-value pairs.
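
A minimal sketch combining both methods (the URLs are the placeholders used above, and the q parameter is purely illustrative); with get you can also hand the parameters over as a dictionary and let Requests build the query string:

import requests

# get: parameters end up in the URL as a query string
r = requests.get('http://www.douban.com', params={'q': 'poster'})
print(r.status_code)  # 200 on success
print(r.url)          # the final URL, with the query string encoded

# post: parameters travel in the request body
r = requests.post('http://xxx.com', data={'key': 'value'})
print(r.text)         # the response body as text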

Locating with XPath

XPath is a path language: it navigates through, and locates, elements and attributes.

expression  meaning
node        selects all child nodes of the node named node
//          selects matching nodes from the current node, regardless of their position
@           selects an attribute
|           union: selects both node sets
text()      the text content of the current path

For example:

  • xpath('node') selects all child nodes of the node named node
  • xpath('/div') selects div starting from the root node
  • xpath('//div') selects all div nodes in the document
  • xpath('./div') selects div nodes under the current node
  • xpath('..') selects the parent of the current node
  • xpath('//@id') selects all id attributes
  • xpath('//book[@id]') selects all book elements that have an id attribute
  • xpath('//book[@id="abc"]') selects all book elements whose id attribute equals "abc"
  • xpath('//book/title | //book/price') selects all title and price elements of all book elements

To use XPath in Python, use the parsing library lxml: parse the page with etree.HTML, then call the xpath function on the result.

For example, to locate all list items (li) in an HTML page:

from lxml import etree
html = etree.HTML(html_text)  # html_text: the page source string returned by Requests
result = html.xpath('//li')   # a list of all the <li> elements
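
To make this concrete, here is a small self-contained sketch (the HTML snippet is made up for illustration) showing how to read text and attributes from the located elements:

from lxml import etree

html_text = '<ul><li class="a">first</li><li class="b">second</li></ul>'
html = etree.HTML(html_text)
for li in html.xpath('//li'):
  # .text gives the element's text; .get() reads an attribute
  print(li.get('class'), li.text)
# prints: a first, then b second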

JSON object

Converting JSON into Python objects makes the data much more convenient to analyze.

method        meaning
json.dumps()  converts a Python object into a JSON string
json.loads()  converts a JSON string into a Python object

Code that converts a JSON string into a Python object:

import json
json_data = '{"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}'
result = json.loads(json_data)  # result is now a Python dict
print(result)
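
Going the other way, json.dumps() serializes a Python object back into a JSON string; a quick round-trip check:

import json

obj = {'a': 1, 'b': 2}
s = json.dumps(obj)          # '{"a": 1, "b": 2}'
assert json.loads(s) == obj  # loads undoes dumps
print(s)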

Below, we crawl the posters with Python from two angles: one crawls the JSON data, the other locates elements with XPath.

How to automatically download Joey Wong posters using JSON data

The steps are:

  • Open the web page
  • Enter the keyword "Joey Wong"
  • Select "Pictures" on the search results page
  • Download all the poster images on the page

If the page being crawled is a dynamic page, you need to focus on the XHR data, because a dynamic page works by sending a native request through an XHR object and then processing the data the server returns; XHR is used to exchange data with the server in the background.

Simulating a Douban search for "王祖贤" (Joey Wong), we find that there is an XHR data request: <https://www.douban.com/j/search_photo?q=%E7%8E%8B%E7%A5%96%E8%B4%A4&limit=20&start=0>

The Chinese in the URL is percent-encoded. Opening it, we see a JSON-formatted object:

{"images":
       [{"src": …, "author": …, "url":…, "id": …, "title": …, "width":…, "height":…},
    …
   {"src": …, "author": …, "url":…, "id": …, "title": …, "width":…, "height":…}],
 "total":22471,"limit":20,"more":true}

In this JSON object we can see there are 22,471 pictures of Joey Wong in total, of which 20 were returned by this request. The data sits in the images field, which is an array; each array element is a dictionary with src, author, url, id, title, width, and height fields, giving the original image address, author, publication address, image ID, image title, image width, and image height respectively.
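
As a sketch (the sample below is a trimmed-down, made-up version of the response above), pulling those fields out once the JSON is parsed looks like this:

import json

# a trimmed-down sample of the XHR response shown above
xhr_text = '{"images": [{"src": "http://example.com/1.jpg", "id": 1, "title": "t"}], "total": 22471, "limit": 20, "more": true}'
response = json.loads(xhr_text)
print(response['total'])  # 22471 pictures in total
for image in response['images']:
  print(image['id'], image['title'], image['src'])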

With the JSON figured out, you still need to find the pattern in the XHR request URLs.

Look at the URL itself: <https://www.douban.com/j/search_photo?q=王祖贤&limit=20&start=0>

The URL has three parameters: q, limit, and start. start is the starting index of the request; it numbers the pictures from 0, so if you want to start downloading from the 21st picture, set start=20.
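
Note that you could also let Requests build the query string from the same three parameters (a sketch; the full loop below sticks with plain string concatenation):

import requests

params = {'q': '王祖贤', 'limit': 20, 'start': 20}  # start=20: begin at picture 21
r = requests.get('https://www.douban.com/j/search_photo', params=params)
data = r.json()  # the parsed JSON body, as a Python dict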

There are 22,471 pictures of Joey Wong in total, so write a for loop to issue all the requests:

# coding:utf-8
import requests
import json

query = '王祖贤'

''' download one picture '''
def download(src, id):
  dir = './' + str(id) + '.jpg'
  try:
    pic = requests.get(src, timeout=10)
    fp = open(dir, 'wb')
    fp.write(pic.content)
    fp.close()
  except requests.exceptions.ConnectionError:
    print('Image could not be downloaded')

''' for loop over every request URL '''
for i in range(0, 22471, 20):
  url = 'https://www.douban.com/j/search_photo?q=' + query + '&limit=20&start=' + str(i)
  html = requests.get(url).text  # the response body
  response = json.loads(html)    # convert the JSON into a Python object
  for image in response['images']:
    print(image['src'])  # the URL of the picture being downloaded
    download(image['src'], image['id'])  # download one picture

How to automatically download Joey Wong movie covers using XPath

To download the cover pictures of Joey Wong's movies from Douban Movies, the steps are:

  • Open the web page movie.douban.com;
  • Enter the keyword "Joey Wong";
  • Download all the movie cover pictures on the page.

You need XPath to locate both the picture URLs and the movie titles. A quick way to work out an XPath expression is the browser plug-in XPath Helper: select the element you want to locate.

XPath Helper has two panes, Query and Results: Query lets you enter XPath syntax, and Results then shows what it matches.

We want to match all of the posters, so we need to pare the XPath expression down. The XPath for the movie posters (call the variable src_xpath) is:

//div[@class='item-root']/a[@class='cover-link']/img[@class='cover']/@src

And the XPath for the movie titles (call the variable title_xpath):

//div[@class='item-root']/div[@class='detail']/div[@class='title']/a[@class='title-text']

We need a tool that simulates a browser and waits until the page has finished loading, so that we get the complete HTML; that tool is Selenium:

from selenium import webdriver
driver = webdriver.Chrome()
driver.get(request_url)  # request_url: the search-results page to load

Selenium is a web application testing tool that runs directly in the browser; it works by simulating user actions. Import webdriver from selenium, create a Chrome browser driver through webdriver, then access pages through the driver to obtain the complete HTML.
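
If the covers load asynchronously, you may need to wait explicitly before reading the page source. A minimal sketch using Selenium's WebDriverWait (the class name 'cover' is taken from the XPath above):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get(request_url)  # request_url as above

# block until at least one cover image is present (10-second timeout)
WebDriverWait(driver, 10).until(
  EC.presence_of_element_located((By.CLASS_NAME, 'cover'))
)
html_source = driver.page_source  # now contains the rendered covers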

Once you have the complete HTML, you can extract from it with XPath. We need both the picture addresses (srcs) and the movie names (titles); since each expression matches multiple elements, walk them in pairs with a for loop:

html = etree.HTML(driver.page_source)  # parse the rendered page
srcs = html.xpath(src_xpath)
titles = html.xpath(title_xpath)
for src, title in zip(srcs, titles):
  download(src, title.text)
Below is the after-class exercise: crawling Hayao Miyazaki's (宫崎骏) movie posters, in Python 3.6 IDLE:
>>> import json
>>> import requests as req
>>> from lxml import etree
>>> from selenium import webdriver
>>> import os
>>> request_url = 'https://movie.douban.com/subject_search?search_text=宫崎骏&cat=1002'
>>> src_xpath = "//div[@class='item-root']/a[@class='cover-link']/img[@class='cover']/@src"
>>> title_xpath = "//div[@class='item-root']/div[@class='detail']/div[@class='title']/a[@class='title-text']"
>>> driver = webdriver.Chrome('/Users/apple/Downloads/chromedriver')
>>> driver.get(request_url)
>>> html = etree.HTML(driver.page_source)
>>> srcs = html.xpath(src_xpath)
>>> print (srcs) #大家可要看下打印出来的数据是否只是一页的内容,以及图片url的后缀格式
>>> picpath = '/Users/apple/Downloads/宫崎骏电影海报'
>>> if not os.path.isdir(picpath):
os.mkdir(picpath)
>>> def download(src, id):
dic = picpath + '/' + str(id) + '.webp'
try:
pic = req.get(src, timeout = 30)
fp = open(dic, 'wb')
fp.write(pic.content)
fp.close()
except req.exceptions.ConnectionError:
print ('图片无法下载')
>>> for i in range(0, 150, 15):
url = request_url + '&start=' + str(i)
driver.get(url)
html = etree.HTML(driver.page_source)
srcs = html.xpath(src_xpath)
titles = html.xpath(title_xpath)
for src,title in zip(srcs, titles):
download(src, title.text)
Origin blog.csdn.net/ywangjiyl/article/details/104758970