Python: 【2】 Scraping Multi-Page Table Data with Selenium

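I had never worked with web scraping or HTML before, although many people around me learned Python precisely because of scraping. The overall approach follows an expert's blog post: reference 【1】. Because I am not sure whether the data is licensed for reuse, I have hidden the site's details and only discuss whether the method works, for reference. If you spot any mistakes, please point them out.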

Contents

Environment and module setup

Locating elements

Pagination

Writing to a file

Full code

References

Environment and module setup

You need to install: Python 3, Selenium, the Chrome browser, and ChromeDriver. The code below also uses NumPy.

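For ChromeDriver installation steps, see reference 【2】. Note that its version must match the version of the Chrome browser you have installed; to check the browser version, see reference 【3】.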
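As a rough illustration of the version check, the sketch below compares two version strings by their leading number. It assumes that agreeing on the major version is what matters and that both versions are dotted strings; the function names and the sample version numbers are made up for this example.

```python
def major_version(version: str) -> int:
    """Return the leading (major) number of a dotted version string."""
    return int(version.strip().split(".")[0])


def versions_match(chrome_version: str, driver_version: str) -> bool:
    """True when Chrome and ChromeDriver share the same major version."""
    return major_version(chrome_version) == major_version(driver_version)


# Example values only; substitute the versions reported on your machine.
print(versions_match("75.0.3770.100", "75.0.3770.90"))  # True
print(versions_match("75.0.3770.100", "74.0.3729.6"))   # False
```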

Locating elements

First, inspect the site's page source and find where the relevant elements sit in it.

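For Selenium's element-locating methods and other basic operations, see reference 【4】. In practice, though, I could not locate the elements at all. After asking my classmate Ningning, I found that I had overlooked the iframe, which is essentially a page embedded inside another page; see reference 【5】. I was fairly lucky: the iframe tag on this site contained a URL, and after opening that URL directly I could locate the elements.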

That URL turned out to be a standalone page containing only the table, so locating elements there worked.
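Finding the URL inside the iframe tag can also be done programmatically. The sketch below is not what I actually ran: it uses only the standard library to pull every iframe `src` out of an HTML string and resolve it against the page address. The class name, function name, sample HTML, and `example.com` address are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeSrcParser(HTMLParser):
    """Collect the src attribute of every <iframe> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_urls(page_html, base_url):
    """Return absolute URLs for all iframes found in page_html."""
    parser = IframeSrcParser()
    parser.feed(page_html)
    return [urljoin(base_url, src) for src in parser.sources]


# Made-up HTML standing in for the real page source.
sample = '<html><body><iframe src="/table/list.html"></iframe></body></html>'
print(iframe_urls(sample, "http://example.com/index.html"))
# ['http://example.com/table/list.html']
```

With Selenium, `browser.page_source` could be passed in as `page_html`. Another option is to stay on the outer page and enter the frame with `browser.switch_to.frame(...)` before locating elements.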

Here I locate elements by class name; XPath would also work. The code for locating the elements and accessing the data is:

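# Locate the table by class name, then collect all of its <td> cells
# (requires: from selenium.webdriver.common.by import By)
biao = browser.find_element(By.CLASS_NAME, "list")
td_content = biao.find_elements(By.TAG_NAME, "td")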

After some trial and error, I could extract the text and save it into a list.

Pagination

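For pagination I chose a fairly simple approach: locate the "next page" button, click it once after each page is scraped, and use a loop to limit the number of clicks. The code is: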

key = WebDriverWait(browser, 10).until(
    EC.visibility_of_element_located((By.XPATH, '/html/body/div/div[4]/nav/ul/li[8]/a'))
)
key.click()

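Other people's code includes extra steps that seem intended to make it more robust. I was short on time, so I will come back and add that part later. For now, being able to scrape the data is enough.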
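I have not checked exactly what the other authors added, but one common way to make a step like "click next page" more robust is to retry it a few times before giving up. The sketch below is a generic helper, not code from the referenced posts; `retry`, `flaky_click`, and the default values are names and numbers chosen for this example.

```python
import time


def retry(action, attempts=3, delay=1.0, exceptions=(Exception,)):
    """Call action() until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise  # give up and surface the last error
            time.sleep(delay)  # wait a little before trying again


# Stand-in for a flaky click: fails twice, then succeeds.
calls = {"count": 0}


def flaky_click():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("element not ready")
    return "clicked"


print(retry(flaky_click, attempts=5, delay=0))  # clicked
```

In the scraper, `action` could be a small function that waits for the "next page" button and clicks it. Retrying a whole page scrape is riskier, because a click that succeeded before the failure would skip a page.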

Writing to a file

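The scraped data is a one-dimensional list. My approach is to reshape it with NumPy first, convert it to a string, and then strip the quote characters (') and other extra characters so that the values are separated by spaces. The result is saved to a txt file, which makes it easy to import into Excel. The code is: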

for td in td_content:
    lst.append(td.text)
if page >= 197:
    lst = np.array(lst).reshape(8, 5)   # page 197 has only 8 rows
else:
    lst = np.array(lst).reshape(15, 5)  # the first 196 pages have 15 rows each

string = str(lst)
string = string.replace('\'\'', '-')  # mark empty cells with '-'
string = string.replace(',', '')
string = string.replace('\'', '')
string = string.replace('[[', ' ')
string = string.replace(']]', ' ')
string = string.replace('[', '')
string = string.replace(']', '')

stringdata = string + '\n'
data_write_txt('data.txt', stringdata)
print('Page saved')

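To keep the string from using too much memory, each page is written to the file as soon as it is scraped, rather than collecting everything in one string and writing it at the end.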
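As an alternative to reshaping with NumPy and cleaning up the string form, the flat list can be split into rows directly and appended with the standard `csv` module. This is only a sketch of a different technique, not the method used in this post; `chunk_rows`, `append_rows`, and the sample cell values are made up, and the column count of 5 is taken from the table described above.

```python
import csv


def chunk_rows(cells, columns=5):
    """Split a flat list of cell texts into rows of `columns` items."""
    if len(cells) % columns != 0:
        raise ValueError("cell count is not a multiple of the column count")
    return [cells[i:i + columns] for i in range(0, len(cells), columns)]


def append_rows(file_name, rows):
    """Append rows to a tab-separated text file, one table row per line."""
    with open(file_name, "a", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t").writerows(rows)


# Made-up cell texts: two rows of five columns, with one empty cell.
cells = ["1", "a", "b", "", "c", "2", "d", "e", "f", "g"]
rows = chunk_rows(cells)
print(rows)
# [['1', 'a', 'b', '', 'c'], ['2', 'd', 'e', 'f', 'g']]

# append_rows("data.txt", rows) would then add these two lines to the file.
```

Because the number of rows is derived from the length of the list, the shorter last page needs no special case, and empty cells stay as empty fields between two tabs.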

Full code

import numpy as np
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

browser = webdriver.Chrome()
browser.get('http://URL')
wait = WebDriverWait(browser, 10)

def get_data(page):
    lst = []  # start with an empty list
    stringdata = ''  # start with an empty string
    print('Scraping page %s/197' % page)  # show the current page number
    if page < 197:
        key = WebDriverWait(browser, 10).until(
            EC.visibility_of_element_located((By.XPATH, '/html/body/div/div[4]/nav/ul/li[8]/a'))
        )
    # On page 197 there is no need to click
    # Locate the table and its cells
    biao = browser.find_element(By.CLASS_NAME, "list")
    td_content = biao.find_elements(By.TAG_NAME, "td")
    # Scrape the cell text and append it to the list
    for td in td_content:
        lst.append(td.text)
    if page >= 197:
        lst = np.array(lst).reshape(8, 5)  # page 197 has only 8 rows
    else:
        lst = np.array(lst).reshape(15, 5)  # the first 196 pages have 15 rows each

    string = str(lst)  # convert the array to its string form
    # Strip punctuation and other extra characters
    string = string.replace('\'\'', '-')
    string = string.replace(',', '')
    string = string.replace('\'', '')
    string = string.replace('[[', ' ')
    string = string.replace(']]', ' ')
    string = string.replace('[', '')
    string = string.replace(']', '')

    stringdata = string + '\n'
    # Write to the txt file; note that it is opened in append mode
    data_write_txt('data.txt', stringdata)
    print('Page saved')
    if page < 197:
        key.click()  # click "next page"

def data_write_txt(file_name, datas):
    # Append mode; the with-block closes the file even if the write fails
    with open(file_name, 'a') as file:
        file.write(datas)

def main():
    for page in range(1, 198):
        get_data(page)
    print("File saved, processing finished")

if __name__ == '__main__':
    main()

The result is as follows:

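It is not without problems. Some values in the second column are long enough to wrap onto a second line on the web page, and in the scraped output they also end up occupying a line of their own. I only noticed this after importing into Excel. Since only 6 entries were affected, I fixed them by hand; I do not know whether there is a better way. I will leave this as an open issue and try to solve it later.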
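I have not confirmed the cause, but two guesses seem plausible: the cell text itself may contain a line break, or NumPy may wrap long rows when an array is converted with `str()` (its default print line width is 75 characters). The sketch below avoids both by flattening the whitespace inside each cell and joining the row by hand; `format_row` and the sample row are made up for illustration.

```python
def format_row(cells, sep="\t"):
    """Join one table row into a single line, flattening whitespace in each cell."""
    return sep.join(" ".join(cell.split()) for cell in cells)


# Made-up row whose second cell contains a line break.
row = ["1", "a long value\nwrapped in the page", "x", "", "y"]
print(repr(format_row(row)))
# '1\ta long value wrapped in the page\tx\t\ty'
```

Each row then stays on one line no matter how long a cell is, since the line is built directly from the list instead of from the printed form of an array.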

For importing the txt data into Excel, see reference 【6】.

References

【1】：https://www.cnblogs.com/sanduzxcvbnm/p/10276617.html

【2】：https://blog.csdn.net/qq_38486203/article/details/82852240

【3】：https://jingyan.baidu.com/article/bad08e1ed2d0d709c9512155.html

【4】：https://www.yukunweb.com/2017/7/python-spider-Selenium-PhantomJS-basic/

【5】：https://www.cnblogs.com/alliefu/p/6554773.html

【6】：https://blog.csdn.net/qq_35893120/article/details/90054410