Python: [2] Using Selenium to crawl multi-page table data

I had never worked with crawlers or HTML before, but many people around me learn Python because of crawlers. The overall approach follows this blogger's post: reference blog [1]. Because I am not sure whether the data is authorized, the website information is hidden here, and only the feasibility of the method is discussed, for your reference. If there are mistakes, I hope you will point them out~

Table of contents

Environment and module preparation

Element positioning

Page turning settings

Write to file

Overall code

Reference blogs

Environment and module preparation

Need to install: Python 3, Selenium, the Chrome browser, and chromedriver

For the installation steps of chromedriver, see reference blog [2]. Note that the chromedriver version must match the version of the Chrome browser you have installed; for how to check the Chrome version, see reference blog [3].
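As a quick sanity check (not part of the original post), a minimal sketch like the one below can confirm that Selenium can start Chrome through chromedriver:

from selenium import webdriver

# start Chrome through chromedriver (chromedriver must be on the PATH or in the
# working directory); if the versions do not match, this call raises an error
browser = webdriver.Chrome()
# should match the installed Chrome version (the key may be 'version' on older setups)
print(browser.capabilities.get('browserVersion'))
browser.quit()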


Element positioning

First, you need to inspect the page source of the website and find where the target elements are located in the HTML.

For Selenium's ways of locating page elements and other basic operations, see reference blog [4]. In practice, however, I could not locate the element at first; after asking Ning Ning, I learned that I had overlooked an iframe, which is essentially a web page embedded inside the web page, see reference blog [5]. Luckily, I noticed a URL inside the iframe tag of this site, and after opening that URL directly I was able to locate the element.

The URL turned out to be a page containing just the table, and the elements can be located there successfully.
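If the iframe does not expose a usable URL, another common approach (a rough sketch, with an assumed locator) is to switch the driver into the frame before locating elements:

# switch into the iframe instead of opening its URL directly; the tag-name
# locator is an assumption, use the frame's actual id or name if it has one
iframe = browser.find_element_by_tag_name('iframe')
browser.switch_to.frame(iframe)
# ... locate the table and its cells here ...
browser.switch_to.default_content()  # return to the top-level page when done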

This part locates the table by its class name; XPath would work as well. The code used to locate the elements and read the data is as follows:

biao = browser.find_element_by_class_name("list")  # the table element, located by its class name
td_content = biao.find_elements_by_tag_name("td")  # all <td> cells inside the table

After testing, the text of each cell is extracted successfully and stored in a list.
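The class-name lookup above could also be written with XPath, roughly as follows (the XPath is an assumption about the table markup, not taken from the site):

# XPath alternative to the class-name lookup; '//*[@class="list"]' is an assumed
# selector for the same table element
biao = browser.find_element_by_xpath('//*[@class="list"]')
td_content = biao.find_elements_by_tag_name('td')
lst = [td.text for td in td_content]  # collect each cell's text into a list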


Page turning settings

For page turning I chose a relatively simple approach: locate the "next page" button, click it once after each page has been collected, and use a loop to limit the number of clicks. The code is as follows:

key = WebDriverWait(browser, 10).until(
    EC.visibility_of_element_located((By.XPATH, '/html/body/div/div[4]/nav/ul/li[8]/a'))
)
key.click()

Other people's code adds some handling that looks more robust; due to time constraints I will add that part later. For now, this is enough to crawl the data.
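For reference, one possible way to harden the click (a sketch, not the original code) is to wait until the button is clickable and catch the timeout instead of letting the script crash:

from selenium.common.exceptions import TimeoutException

try:
    # wait for the next-page button to become clickable rather than merely visible
    key = WebDriverWait(browser, 10).until(
        EC.element_to_be_clickable((By.XPATH, '/html/body/div/div[4]/nav/ul/li[8]/a'))
    )
    key.click()
except TimeoutException:
    print('Next-page button not found within 10 seconds; stopping early')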


Write to file

Once obtained, the data is a one-dimensional list. My approach is to reshape the list with numpy, convert it to a string, strip the quotes and other redundant characters so that the values are separated by spaces, and append the result to a txt file, which makes it easy to import into Excel. The code is as follows:

# collect each cell's text into the list
for td in td_content:
    lst.append(td.text)

# reshape: page 197 has 8 rows, the earlier pages have 15 rows of 5 columns
if page >= 197:
    lst = np.array(lst).reshape(8, 5)
else:
    lst = np.array(lst).reshape(15, 5)

# convert to a string and strip quotes, commas and brackets
string = str(lst)
string = string.replace("''", '-')
string = string.replace(',', '')
string = string.replace("'", '')
string = string.replace('[[', ' ')
string = string.replace(']]', ' ')
string = string.replace('[', '')
string = string.replace(']', '')

stringdata = string + '\n'
data_write_txt('data.txt', stringdata)
print('Page saved')

To keep the string from taking up too much memory, the file is written once per crawled page, rather than accumulating everything in one string and writing it at the end.
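As an alternative to the string cleanup above (a sketch, not the author's method), the csv module can append each page's rows directly and avoid the replace() calls; data.csv is an assumed output name:

import csv

def data_write_csv(file_name, rows):
    # append the page's rows (e.g. the reshaped (n, 5) array) to a csv file
    with open(file_name, 'a', newline='', encoding='utf-8') as f:
        csv.writer(f).writerows(rows)

# usage inside get_data, instead of the string conversion:
# data_write_csv('data.csv', lst)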


Overall code

import numpy as np
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

browser = webdriver.Chrome()
browser.get('http://URL')
wait = WebDriverWait(browser, 10)

def get_data(page):
    lst = []  # create an empty list
    stringdata = ''  # create an empty string
    print('Crawling page %s/197' % page)  # show the current page number
    if (page < 197):
        key = WebDriverWait(browser, 10).until(
            EC.visibility_of_element_located((By.XPATH, '/html/body/div/div[4]/nav/ul/li[8]/a'))
        )
    # no need to click the next-page button once page 197 is reached
    # locate the table and its data
    biao = browser.find_element_by_class_name("list")
    td_content = biao.find_elements_by_tag_name("td")
    # collect the cell text into the list
    for td in td_content:
        lst.append(td.text)
    if (page >= 197):
        lst = np.array(lst).reshape(8, 5)  # page 197 has only 8 rows of data
    else:
        lst = np.array(lst).reshape(15, 5)  # the first 196 pages have 15 rows each

    string = str(lst)  # convert the array to a string
    # remove punctuation and redundant characters
    string = string.replace('\'\'', '-')
    string = string.replace(',', '')
    string = string.replace('\'', '')
    string = string.replace('[[', ' ')
    string = string.replace(']]', ' ')
    string = string.replace('[', '')
    string = string.replace(']', '')  
    
    stringdata = string + '\n'
    # write to the txt file; note that it is opened in append mode
    data_write_txt('data.txt', stringdata)
    print('Page saved')
    if (page < 197):
        key.click()  # click the next-page button

def data_write_txt(file_name, datas):
    file = open(file_name, 'a')
    file.write(datas)
    file.close()
                
def main():
    for page in range(1,198):
        get_data(page)
    print("保存文件成功,处理结束")

if __name__ == '__main__':
    main()

The result is as follows:

There is still a problem: some entries in the second column are so long that the web page breaks them across lines, and when crawled out they ended up taking only a single row. I only noticed this after importing into Excel; since there are only 6 such entries I fixed them by hand, and I do not know whether there is a better way. I will leave this hole here and look for a solution later.
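One possible mitigation, assuming the extra lines come from line breaks inside long cells, would be to strip newlines from each cell's text before adding it to the list (untested on this site):

# replace any line breaks inside a cell with a space before storing it
for td in td_content:
    lst.append(td.text.replace('\n', ' '))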

For importing the txt data into Excel, see reference blog [6].
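Alternatively, since the txt file is space-separated, pandas (not used in the original workflow) can convert it to an Excel file in a couple of lines; the column handling here is a rough assumption, and to_excel requires openpyxl:

import pandas as pd

# read the space-separated txt produced above and write it out as .xlsx
df = pd.read_csv('data.txt', sep=r'\s+', header=None, engine='python')
df.to_excel('data.xlsx', index=False, header=False)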


Reference blogs

[1] https://www.cnblogs.com/sanduzxcvbnm/p/10276617.html

[2] https://blog.csdn.net/qq_38486203/article/details/82852240

[3] https://jingyan.baidu.com/article/bad08e1ed2d0d709c9512155.html

[4] https://www.yukunweb.com/2017/7/python-spider-Selenium-PhantomJS-basic/

[5] https://www.cnblogs.com/alliefu/p/6554773.html

[6] https://blog.csdn.net/qq_35893120/article/details/90054410


Original post: https://blog.csdn.net/Alex497259/article/details/104775828