Douyu crawler: scraping streamer pictures and names

On the Douyu page, the streamer thumbnails are lazy-loaded: if you do not pull the scroll bar down, they are just the default placeholder image (a picture of a fish). To make the browser scroll automatically, you can use the Python selenium library.

1. Configure the browser

To use selenium you also need chromedriver.exe. This article uses the Chrome browser: first download the chromedriver.exe matching your browser version from https://npm.taobao.org/mirrors/chromedriver, then place it in the Chrome installation directory.
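A minimal sketch of the setup, assuming the Selenium 3 style API used in the complete code below; the chromedriver path is only an example and depends on where you placed the file:

from selenium import webdriver

# Point Selenium at the chromedriver.exe you downloaded;
# adjust the path to your own Chrome installation directory.
driver = webdriver.Chrome(r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe')
driver.get('https://www.douyu.com/g_yz')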


2. Use the selenium library to load the entire live page

Automatically open the Chrome browser, go to the Douyu live page, maximize the window, and then automatically pull the scroll bar so that the whole page of streamers is loaded.

In the browser, press F12 to open the inspector; you can see that every streamer entry on the page is an li tag inside the ul tag with class="layout-Cover-list".

From those li elements we can extract the data, as in the sketch below.
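A short sketch of this step, assuming the driver from step 1 has already opened the page; the scroll count and step size are arbitrary choices, not values fixed by the site:

import time

# Maximize the window
driver.maximize_window()

# Scroll down in steps so the lazy-loaded thumbnails are fetched
for _ in range(16):
    time.sleep(1)
    driver.execute_script("window.scrollBy(0, 500)")

# One <li> per streamer under the <ul class="layout-Cover-list">
lis = driver.find_elements_by_xpath('//ul[@class="layout-Cover-list"]/li')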


3. Get the picture links and streamer names

The picture link is in the src attribute of the img tag with class="DyImg-content is-normal" inside each entry.

Each streamer name is in the h2 tag with class="DyListCover-user".
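A minimal sketch of the extraction, assuming the lis list from step 2; note the trailing space in the img class string, which matches the page's actual markup:

for li in lis:
    # src attribute of the thumbnail <img>
    img_url = li.find_element_by_xpath('.//img[@class="DyImg-content is-normal "]').get_attribute('src')
    # Streamer name in the <h2>
    name = li.find_element_by_xpath('.//h2[@class="DyListCover-user"]').text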


4. Save the pictures and names locally

Because the resulting image links may carry extra, unneeded parts, use a regular expression to keep only the useful portion (up to the .jpg or .png extension), and then save the file.
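A small sketch of the trimming, with a hypothetical link for illustration; the complete code below does the same thing with separate .jpg and .png branches:

import re

url = 'https://example.com/cover.jpg?x=1'  # hypothetical link with a trailing query string
m = re.match(r'.*\.(?:jpg|png)', url)
if m:
    url = m.group()  # 'https://example.com/cover.jpg'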


5. Complete code

import os
import re
import time

import requests
from selenium import webdriver

# 1. Prepare the URL
url = 'https://www.douyu.com/g_yz'

# 2. Get the driver object; the path is where chromedriver.exe was placed
driver = webdriver.Chrome(r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe')
driver.get(url)
time.sleep(1)

# 3. Load the page
# Maximize the window
driver.maximize_window()

# Pull the scroll bar down step by step so the thumbnails lazy-load
for i in range(16):
    time.sleep(1)
    driver.execute_script("window.scrollBy(0, 500)")

# 4. Get the data: one <li> per streamer
lis = driver.find_elements_by_xpath('//ul[@class="layout-Cover-list"]/li')

# 5. Request and save each picture
os.makedirs('./img', exist_ok=True)
for li in lis:
    url = li.find_element_by_xpath('.//img[@class="DyImg-content is-normal "]').get_attribute("src")
    name = li.find_element_by_xpath('.//h2[@class="DyListCover-user"]').text
    try:
        # Keep only the part of the link up to the .jpg extension
        url = re.match(r".*\.jpg", url).group()
        response = requests.get(url)
        with open("./img/" + name + ".jpg", "wb") as f:
            f.write(response.content)
    except AttributeError:
        # No .jpg match, so the link must be a .png
        url = re.match(r".*\.png", url).group()
        response = requests.get(url)
        with open("./img/" + name + ".png", "wb") as f:
            f.write(response.content)
    print(name)
    print(url)

# 6. Quit / next page

driver.quit()
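Note: find_element_by_xpath and find_elements_by_xpath are the Selenium 3 API. In Selenium 4 they were removed in favour of driver.find_element(By.XPATH, ...) and driver.find_elements(By.XPATH, ...), with By imported from selenium.webdriver.common.by.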


Results figure (screenshot of the saved images)


Source: www.cnblogs.com/Dandelion-L/p/11229109.html