Python Selenium crawler tutorial

Please install a Python environment first O(∩_∩)O

Preliminary work

First, install ChromeDriver.
Download site: ChromeDriver
Find the download on the site that matches your installed Chrome version.
Unzip it and move it into place.
If running chromedriver --version prints a version number, the installation succeeded.
On macOS, remember to grant the binary execute permission and allow it in the security settings!
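To confirm that Selenium can talk to the driver, a quick smoke test is to start and quit a browser session (a minimal sketch; it assumes chromedriver is on your PATH):

from selenium import webdriver

driver = webdriver.Chrome()  # fails here if chromedriver is missing or not executable
print(driver.capabilities.get('browserVersion'))  # prints the Chrome version on success
driver.quit()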
First, a note on the result: only part of the output is shown because there is a lot of data; the complete code is given at the end. The principle is that the code drives a real browser, and once the page's asynchronously loaded data has arrived, the fully rendered page can be scraped. (This website is just an example; the same approach works on other sites.)
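Because the code drives a real browser, the page's JavaScript runs exactly as it would for a human visitor. If you don't need to watch the browser, Chrome can also run headless; a minimal sketch using standard Selenium options:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)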


Required libraries (remember to import them):

import time
from selenium import webdriver
from selenium.webdriver.common.by import By  # this and the two imports below are used for explicit waits
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

Login:
First inspect the elements of the login page in the browser's developer tools (how to open them is not covered here).

1. Get the XPath of the two input fields

Code to fill in the username and password:

url = 'http://www.mirrorpsy.cn/index.php?s=member&c=login&m=index'
driver = webdriver.Chrome()
driver.get(url)
wait = WebDriverWait(driver, 25)
driver.maximize_window()  # maximize the window
user = wait.until(
    EC.presence_of_element_located((By.XPATH, '//*[@id="myform"]/div/div[1]/input')))
user.send_keys('xxxx')  # real credentials not shown
pw = wait.until(
    EC.presence_of_element_located((By.XPATH, '//*[@id="myform"]/div/div[2]/input')))
pw.send_keys('xxxxx')
print('login param input success')

Run it directly and you will see the browser start automatically and type in the username and password.
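As an alternative to clicking the login button in the next step, the form can often be submitted by sending an Enter keystroke to the password field (whether this works depends on the form):

from selenium.webdriver.common.keys import Keys

pw.send_keys(Keys.RETURN)  # submit the login form with Enter instead of clicking the button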

2. Login

Get the XPath of the login button:

login = wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="myform"]/div/div[4]/button')))  # login button
login.click()
print('login success')
time.sleep(10)

Sleep for 10 s so the page can fully load.
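A fixed sleep works, but an explicit wait is more robust; for example, waiting for the URL to change away from the login page (a sketch using the wait object created above):

wait.until(EC.url_changes(url))  # proceed as soon as login navigates away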

3. Cookies

Some websites require cookies. This one does not, but here is the code for collecting them anyway:

cookies = driver.get_cookies()
cookie_list = []
for c in cookies:  # avoid shadowing the built-in dict
    cookie_list.append(c['name'] + '=' + c['value'])
cookie = ';'.join(cookie_list)
print('cookie: ' + cookie)
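The assembled cookie string can then be reused outside the browser, for example with the requests library (a sketch; assumes requests is installed and the site accepts cookies sent this way):

import requests

headers = {'Cookie': cookie}  # reuse the browser session's cookies
resp = requests.get('http://www.mirrorpsy.cn/index.php?s=photo', headers=headers)
print(resp.status_code)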

4. Crawl the data and collect the image links

This block is the image list we need (pagination is discussed later).
Inspect the XPath of the ul tag directly (using the developer tools, as above):

//*[@id="cn_perio_list"]

Having located the ul, get the li tags under it:

items = driver.find_element(By.XPATH, '//*[@id="cn_perio_list"]')  # Selenium 4 API; the old find_element_by_xpath was removed
lis = items.find_elements(By.XPATH, './li')

find_elements returns a list; iterate over it, locate the img tag inside each li, and read its src attribute:

for li in lis:
    img = li.find_element(By.TAG_NAME, 'img')
    print(img)
    print(img.get_attribute('src'))
    image_list.append(img.get_attribute('src'))

That takes care of one page of data.
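Once the src URLs are collected, the images themselves can be downloaded. A minimal sketch using the standard library (naming each file after the last URL segment is an assumption about the URL format):

import os
from urllib.request import urlretrieve

os.makedirs('images', exist_ok=True)
for src in image_list:
    filename = os.path.join('images', src.split('/')[-1])  # name the file after the last URL segment
    urlretrieve(src, filename)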

Pagination

There are pagination links at the bottom, but this site is a bit unusual: the page count is loaded dynamically, and the content of the last page keeps changing.


So we simply keep requesting the next page, and break out of the endless loop when the ul contains no li elements. Code for reading the page count is shown below anyway (though it turned out to be useless):

# I originally wanted to parse the page count, but it changes dynamically, so it is not used
pages = driver.find_element(By.XPATH, '//*[@id="laypage_1"]')
a = pages.find_elements(By.XPATH, './a')
pageEle = pages.find_element(By.XPATH, '//*[@id="laypage_1"]/a[' + str(len(a) - 1) + ']')
print('page:' + str(pageEle.text))
page = int(pageEle.text)
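The loop-until-empty pattern that replaces it looks like this in outline (a sketch; the full version is in the complete code below):

page = 1
while True:
    driver.get('http://www.mirrorpsy.cn/index.php?s=photo&c=search&page=' + str(page))
    ul = driver.find_element(By.XPATH, '//*[@id="cn_perio_list"]')
    if len(ul.find_elements(By.XPATH, './li')) == 0:  # an empty page means we ran past the end
        break
    page += 1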

The above completes the crawl.

Complete code:

import time

from selenium import webdriver
from selenium.webdriver.common.by import By  # this and the two imports below are used for explicit waits
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

if __name__ == "__main__":
    url = 'http://www.mirrorpsy.cn/index.php?s=member&c=login&m=index'
    driver = webdriver.Chrome()
    driver.get(url)
    wait = WebDriverWait(driver, 25)
    driver.maximize_window()  # maximize the window
    user = wait.until(
        EC.presence_of_element_located((By.XPATH, '//*[@id="myform"]/div/div[1]/input')))
    user.send_keys('xxx')  # leave a comment if you want test credentials
    pw = wait.until(
        EC.presence_of_element_located((By.XPATH, '//*[@id="myform"]/div/div[2]/input')))
    pw.send_keys('xxxx')
    print('login param input success')
    login = wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="myform"]/div/div[4]/button')))  # login button
    login.click()
    print('login success')
    time.sleep(10)

    cookies = driver.get_cookies()
    cookie_list = []
    for c in cookies:  # avoid shadowing the built-in dict
        cookie_list.append(c['name'] + '=' + c['value'])
    cookie = ';'.join(cookie_list)
    print('cookie: ' + cookie)
    print('get knowledge')
    # subject knowledge
    url_knowledge = 'http://www.mirrorpsy.cn/index.php?s=xkzs'
    # psychology videos
    url_video = 'http://www.mirrorpsy.cn/index.php?s=video'
    # psychology audio
    url_jyxl = 'http://www.mirrorpsy.cn/index.php?s=jyxl'
    # psychology images
    url_photo = 'http://www.mirrorpsy.cn/index.php?s=photo'
    # academic papers
    url_xslw = 'http://www.mirrorpsy.cn/index.php?s=xslw'
    # psychology case studies
    url_xlal = 'http://www.mirrorpsy.cn/index.php?s=xlal'
    # psychology experts
    url_news = 'http://www.mirrorpsy.cn/index.php?s=news'

    # driver.get(url_photo)
    # I originally wanted to parse the page count, but it changes dynamically, so it is not used
    # pages = driver.find_element(By.XPATH, '//*[@id="laypage_1"]')
    # a = pages.find_elements(By.XPATH, './a')
    # pageEle = pages.find_element(By.XPATH, '//*[@id="laypage_1"]/a[' + str(len(a) - 1) + ']')
    # print('page:' + str(pageEle.text))
    # page = int(pageEle.text)
    image_list = []
    pageWeb = 0
    while True:
        print('get page:' + str(pageWeb))
        url_pic = 'http://www.mirrorpsy.cn/index.php?s=photo&c=search&page=' + str(pageWeb + 1)
        driver.get(url_pic)
        items = driver.find_element(By.XPATH, '//*[@id="cn_perio_list"]')
        lis = items.find_elements(By.XPATH, './li')
        print('len:' + str(len(lis)))
        if len(lis) == 0:  # an empty page means we are past the last page
            break
        for li in lis:
            img = li.find_element(By.TAG_NAME, 'img')
            print(img)
            print(img.get_attribute('src'))
            image_list.append(img.get_attribute('src'))
        time.sleep(5)
        pageWeb = pageWeb + 1

    print('image crawl finished')
    print('image list result: ' + str(image_list))
    time.sleep(500000)  # keep the browser open for inspection; call driver.quit() instead to close it
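To persist the result rather than just printing it, the collected URLs could be written to a file (a small sketch):

with open('image_urls.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(image_list))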

Thanks for reading. Follow me if you found this useful; I will post updates from time to time.
