Downloading pictures automatically by driving a browser with selenium

Earlier, batch picture downloading was done with the requests library. That worked only because Douban provides an XHR interface (API) that returns JSON, which is very convenient to use. Sometimes, however, we need to parse HTML or XML and extract the required links before downloading, and this is where selenium comes in handy.

1. Downloading posters manually

Taking Donnie Yen (甄子丹) posters as an example, we would normally open the Douban Movies site at https://movie.douban.com/, enter the keyword 甄子丹, and then download the posters by hand.

2. Approach to automatic downloading

To download automatically, we need to work out the web addresses of the poster images and then pass them to a download routine.

2.1 Learning XPath

Here the image addresses are located with XPath. XPath is short for XML Path Language. It was originally designed for finding specific paths in XML, and it works equally well for searching HTML elements. A brief description of the syntax follows:
[Figure: basic XPath syntax]
In Python, the lxml library can turn HTML into an object that supports XPath queries, which makes the analysis very convenient. lxml is also fault tolerant and can handle HTML whose tags are left unclosed.
Here is a simple example:

from lxml import etree

text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''

Parsing:

# Parse from a string; to read from a file instead, use:
#html=etree.parse('test.html',etree.HTMLParser()) 
html = etree.HTML(text)
# Serialize to bytes; lxml has filled in the missing tags
r = etree.tostring(html,encoding='utf-8')
# Print the completed markup
#print(r.decode('utf-8'))
# Find every li node anywhere in the document
resultLi = html.xpath("//li")
print("//li: "+ str(resultLi))
# Find the a nodes under the li nodes and take the value of their href attribute
reLiA = html.xpath("//li/a/@href")
print("//li/a/@href :"+ str(reLiA))
# Get the class attribute of the parent of the a node whose href is link2.html
reClass=html.xpath('//a[@href="link2.html"]/../@class')
print('//a[@href="link2.html"]/../@class :'+ str(reClass))
# Find the a nodes under the li nodes and take their text
reLiText = html.xpath("//li/a/text()")
print("//li/a/text() :"+ str(reLiText))

Note: the periods in the code originally caused a problem when the Markdown was parsed, so they were changed to the two periods (..) shown above.
The printed results are as follows:

//li: [<Element li at 0x1cb14b89908>, <Element li at 0x1cb14b89988>, <Element li at 0x1cb14b899c8>, <Element li at 0x1cb14b89a08>, <Element li at 0x1cb14b89a48>]
//li/a/@href :['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
//a[@href="link2.html"]/../@class :['item-1']
//li/a/text() :['first item', 'second item', 'third item', 'fourth item', 'fifth item']
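
The fault tolerance mentioned earlier can be checked with a smaller fragment. The following is a minimal sketch, assuming lxml is installed; the fragment and its file names are made up for illustration. Neither li tag is closed, yet the XPath queries still see two separate list items.

```python
from lxml import etree

# Hypothetical fragment: both <li> tags are left unclosed
broken = '<ul><li><a href="one.html">one</a><li><a href="two.html">two</a></ul>'

html = etree.HTML(broken)

# The parser closes each <li> when the next one starts
items = html.xpath('//ul/li')
hrefs = html.xpath('//li/a/@href')
print(len(items), hrefs)

# Serialize the repaired tree to inspect the tags that lxml filled in
print(etree.tostring(html, encoding='unicode'))
```

Printing the serialized tree is a quick way to see what lxml actually parsed before writing an XPath against it.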

2.2 Extracting the XPath of the pictures

As the example above shows, XPath syntax is not complicated, but it is easy to forget. Fortunately, Chrome has a browser plug-in called XPath Helper. After installing it, hover the mouse over a picture and press Ctrl+Shift+X to bring up its dialog:
[Figure: getting the path with XPath Helper]
Moving the mouse back and forth across the posters shows which parts of the XPath change. Remove the fixed prefix at the front and replace the fixed list index with the wildcard *, which gives the following:
[Figure: adjusted XPath]
The resulting XPath for the posters:

//div[@id='recent_movies']/div[@class='bd']/ul[@class='list-s']/*/div[@class='pic']/a/img/@src

This XPath yields picture addresses such as the following:

https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2577437186.webp
https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2537133715.webp
https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2542380253.webp
https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2528842218.webp
https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2499052494.webp
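
The effect of swapping a fixed index for * can be seen on a small stand-in page. This is only a sketch: the HTML below is a hypothetical, heavily simplified imitation of the poster list (the real page has much more markup), and the p1.webp style file names are invented.

```python
from lxml import etree

# Hypothetical, simplified stand-in for the poster list on the real page
page = '''
<div id="recent_movies"><div class="bd"><ul class="list-s">
<li><div class="pic"><a href="#"><img src="p1.webp"></a></div></li>
<li><div class="pic"><a href="#"><img src="p2.webp"></a></div></li>
<li><div class="pic"><a href="#"><img src="p3.webp"></a></div></li>
</ul></div></div>
'''
html = etree.HTML(page)

prefix = "//div[@id='recent_movies']/div[@class='bd']/ul[@class='list-s']"
suffix = "/div[@class='pic']/a/img/@src"

# A fixed index selects a single list item
first_only = html.xpath(prefix + "/li[1]" + suffix)
# The wildcard matches every child of the list, so all posters are returned
all_posters = html.xpath(prefix + "/*" + suffix)

print(first_only)
print(all_posters)
```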

We use selenium to drive a browser that loads the HTML, then run the XPath query against it. Once the addresses have been obtained, the pictures can be fetched with a download function.

3. Downloading posters with selenium

Search Douban Movies for "甄子丹" (Donnie Yen):
https://search.douban.com/movie/subject_search?search_text=%E7%94%84%E5%AD%90%E4%B8%B9&cat=1002
Adjust the XPath as follows:

//div[1]/div[@class='sc-bZQynM jbSySb sc-bxivhb gemzcp'][*]/div[@class='item-root']/a[@class='cover-link']/img[@class='cover']/@src

This returns 15 results:

https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2577437186.webp
...

To turn the page, add start=15 to the link; the results then begin with the 16th poster.
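
A minimal sketch of building the paged links with the standard library, assuming each page holds 15 results as observed above. The function name build_search_url and the constant PAGE_SIZE are hypothetical names introduced for this illustration.

```python
from urllib.parse import urlencode

SEARCH_URL = 'https://search.douban.com/movie/subject_search'
PAGE_SIZE = 15  # results per page, as observed above (assumption)


def build_search_url(keyword, page=0):
    """Return the search link for a zero-based page number."""
    params = {'search_text': keyword, 'cat': 1002}
    if page > 0:
        # Page 1 starts at result 15, page 2 at result 30, and so on
        params['start'] = page * PAGE_SIZE
    # urlencode percent-encodes the keyword as UTF-8
    return SEARCH_URL + '?' + urlencode(params)


for page in range(3):
    print(build_search_url('甄子丹', page))
```

Letting urlencode handle the keyword avoids encoding the Chinese characters by hand.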
The XPath for the movie names:

//div[@class='_ytukbl17q']/div[1]/div[@class='sc-bZQynM cBnAay sc-bxivhb gemzcp'][*]/div[@class='item-root']/div[@class='detail']/div[@class='title']/a[@class='title-text']

The result:

武侠‎ (2011)
西游记之大闹天宫‎ (2014)
...
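
If names like these are used as file names, they may need cleaning first: a title could contain characters that Windows rejects, and scraped text may carry invisible direction marks. The helper below is a hedged sketch, not part of the original code; safe_filename and its replacement parameter are hypothetical.

```python
import re

# Characters that Windows does not allow in file names, plus control characters
_INVALID = re.compile(r'[\\/:*?"<>|\x00-\x1f]')
# Invisible direction marks that may be attached to scraped titles
_INVISIBLE = re.compile('[\u200e\u200f]')


def safe_filename(name, replacement='_'):
    """Return a version of name that is safe to use as a file name."""
    name = _INVISIBLE.sub('', name)
    name = _INVALID.sub(replacement, name)
    # Windows also rejects names that end in a space or a period
    return name.strip(' .') or 'untitled'


print(safe_filename('a/b:c'))
```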

The final download code:

# -*- coding: utf-8 -*-
import os
from urllib.parse import quote

import requests
from lxml import etree
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

def download(picPath, src, id):
    # Create the target directory (and any missing parents) on first use
    os.makedirs(picPath, exist_ok=True)
    filePath = os.path.join(picPath, str(id) + '.webp')
    print(src)
    imageHeader = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        #'accept-encoding': 'gzip, deflate',
        'accept-language': 'zh-CN,zh;q=0.9',
        'cache-control': 'max-age=0',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-site': 'none',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
    }
    try:
        pic = requests.get(src, headers=imageHeader, timeout=50)
        # Treat HTTP error statuses (403, 404, ...) as failures too
        pic.raise_for_status()
        # The with block closes the file even if the write fails
        with open(filePath, 'wb') as fp:
            fp.write(pic.content)
    except requests.exceptions.RequestException:
        print('Sorry, the image could not be downloaded, url: {}'.format(src))

def query_img(query, downloadUrl):
    # Percent-encode the keyword before placing it in the URL
    realUrl = downloadUrl.format(quote(query))
    print(realUrl)
    # Selenium 4 style; on Selenium 3, pass the path directly to webdriver.Chrome.
    # Adjust the path to your own chromedriver location.
    driver = webdriver.Chrome(service=Service('D:\\py3\\Lib\\site-packages\\selenium\\webdriver\\chrome\\chromedriver_win32\\chromedriver.exe'))
    try:
        driver.get(realUrl)
        # Parse the HTML rendered by the browser
        html = etree.HTML(driver.page_source)
    finally:
        # Close the browser whether or not loading succeeded
        driver.quit()
    image_url_path = "//div[1]/div[*]/div[@class='item-root']/a[@class='cover-link']/img[@class='cover']/@src"
    movie_name_path = "//div/div[1]/div[*]/div[@class='item-root']/div[@class='detail']/div[@class='title']/a[@class='title-text']/text()"
    urls = html.xpath(image_url_path)
    names = html.xpath(movie_name_path)
    picPath = 'F:\\python\\images'
    # Pair each poster address with its movie name; the name becomes the file name
    for (url, name) in zip(urls, names):
        download(picPath, url, name)

if __name__ == "__main__":
    query = '甄子丹'
    url = 'https://search.douban.com/movie/subject_search?search_text={}&cat=1002'
    query_img(query, url)
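
The download function above always saves files with a .webp suffix. If other image types turn up, one possible refinement is to take the suffix from the address itself. This is a small sketch using only the standard library; extension_from_url and its default parameter are hypothetical and not part of the original code.

```python
import os
from urllib.parse import urlparse


def extension_from_url(src, default='.webp'):
    """Return the file suffix found in the URL path, or default if none."""
    # Only the path is inspected, so query strings do not interfere
    path = urlparse(src).path
    ext = os.path.splitext(path)[1]
    return ext.lower() if ext else default


print(extension_from_url(
    'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2577437186.webp'))
```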

Note: this example uses the driver for the Chrome browser. Drivers for other browsers can be downloaded through the links at https://selenium-python.readthedocs.io/installation.html; the main point is that the driver must match your own browser and its version.
To check the Chrome version, type chrome://version/ into the browser's address bar.

Happy Winter Solstice, everyone!

Origin www.cnblogs.com/seaspring/p/12079861.html