Python crawler: crawling listed-company financial statements from East Money (eastmoney.com) with Selenium

Original link: https://mp.weixin.qq.com/s?src=11×tamp=1572075945&ver=1935&signature=P8UKE6o5J6DJShbc22yrRvtBkjOcUxkWocpnGjxj2He1VG3sM8iI7sgMyx*I3-FBczDns1KttyYQu7YNLb8Uj8M6q2*xkMnQLtflshY0j7WE3EB7WsAywy6S*3oziBtX&new=1

1. Background

Many websites provide financial information and investment data on listed companies, such as announcements and financial statements: Tencent Finance, NetEase Finance, Sina Finance, East Money, and so on. Among them, the data on East Money is particularly complete.

East Money has a data center at http://data.eastmoney.com/center/, which provides a large amount of data, including featured data, research reports, quarterly reports, and more (see below).
[Screenshot: the East Money data center]
Take the quarterly report category as an example. Clicking into the 2018 interim (mid-year) report (see below), you can see that this category contains seven reports, including the performance report, the performance express report, the income statement, and others. Taking the performance report as an example, it holds performance data for more than 3,000 stocks, spread over about 70 pages.
[Screenshot: the 2018 interim report category and its performance report table]
Suppose we want to obtain all of the 2018 performance report data and then run some analysis on it. Copying 70-odd pages by hand is just barely feasible. But if you want the data of any year, any quarter, and any report, the manual workload becomes enormous. For example, to get 10 years (40 quarters) of data for all seven reports, the copying workload would be roughly 40 × 7 × 70 (about 70 pages per report), nearly 20,000 repetitive copy operations. That is practically an impossible task for a human. The goal of this article is therefore to use Selenium automation to crawl, under the quarterly report category, the data of any financial report for any year (as far back as the site's data goes). All we need to do is type a few characters and leave the rest to the computer; a while later we open Excel and the required data is "lying there quietly". Pretty cool, right?

Now let's get to the practical part. First, we need to analyze the web page we want to crawl.

2. Web page analysis

We have crawled tabular data before, so the structure of table data should not feel unfamiliar.

Here, taking the income statement of the 2018 interim report mentioned above as an example, let's look at its table.

Page url: http://data.eastmoney.com/bbsj/201806/lrb.html. Here bbsj stands for the quarterly report category; 201803 represents the 2018 Q1 report, and likewise 201806 represents the 2018 interim report. lrb is the pinyin abbreviation of 利润表 (income statement); similarly, yjbb stands for the performance report. As you can see, the url format is simple, which makes it convenient to construct urls.
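For example, here is a minimal sketch of how such a url could be assembled from the year, quarter and report abbreviation (the helper name build_report_url is purely illustrative and not part of the original script):

def build_report_url(year, quarter, report):
    # year: e.g. 2018; quarter: 1-4; report: pinyin abbreviation such as 'lrb' or 'yjbb'
    date = '{}{:02d}'.format(year, quarter * 3)  # e.g. 2018, 2 -> '201806'
    return 'http://data.eastmoney.com/bbsj/{}/{}.html'.format(date, report)

print(build_report_url(2018, 2, 'lrb'))  # http://data.eastmoney.com/bbsj/201806/lrb.html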

Next, when we click the "next page" button, we can see that the url does not change after the table updates, so the page must be using JavaScript. Let's first check whether it is loaded via Ajax. The method is simple: right-click and choose Inspect (or press F12), switch to the Network tab, select XHR below, and press F5 to refresh. You can see there is only one Ajax request, and clicking "next page" does not generate a new one. So this page is not the common type where clicking "next page" or scrolling down keeps firing Ajax requests, which means we cannot simply construct urls for page-by-page crawling that way.
The request we need is not under the XHR tab, so next let's see whether the table's data request can be found under JS. Switch the filter to JS and press F5 to refresh again; many JS requests appear. Clicking "next page" a few times, new requests pop up, with the corresponding response information shown on the right. The url is extremely long and looks complicated. OK, let's stop here for now.

As you can see, crawling this dynamic page by analyzing the backend requests is relatively complicated. Is there a more direct, straightforward way to grab the table content? Yes: the Selenium approach introduced next.

3. Selenium basics

What is Selenium? In one sentence: an automated testing tool. It was born for testing, but in the crawler boom of recent years it has turned into a powerful crawling tool. Put simply, Selenium can control a browser and "surf the web" like a human. For example, it can automatically turn pages, log in to websites, send emails, download images/music/videos, and so on. As an example, a few lines of Python with Selenium are enough to log in to ITjuzi (IT桔子) and then browse the site.
The one key point to remember is: with Selenium, "what you can see, you can crawl". In other words, Selenium can grab essentially anything you can see on a web page, including the East Money financial report data mentioned above, and it does so very simply and directly, without digging into the backend to figure out what JavaScript technology or Ajax parameters are used. Let's get hands-on.
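As a minimal sketch of this "what you can see, you can crawl" idea (assuming Chrome and a matching chromedriver are installed; the page used here is just an example), the following lines open a page, read its title, and close the browser:

from selenium import webdriver

browser = webdriver.Chrome()                       # launch a Chrome window controlled by Selenium
browser.get('http://data.eastmoney.com/center/')   # open the East Money data center
print(browser.title)                               # the page title, exactly as a human visitor sees it
browser.quit()                                     # close the browser when done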

4. Code implementation

Approach

  • Install and configure the environment needed to run Selenium; the browser can be Chrome, Firefox, PhantomJS, etc. I use Chrome.
  • The financial data on East Money can be accessed directly without logging in, which makes Selenium crawling easier.
  • Start with a single page of one financial report. Its table structure is simple: locate the entire table directly, then grab the content of all its td cells in one go.
  • Then crawl all listed-company data page by page in a loop and save it as a csv file.
  • Finally, construct flexible urls so that any report of any period can be crawled.

Following this approach, the code below implements it step by step.

Crawling a single-page table
Let's first take the income statement of the 2018 interim report as an example and crawl the table data of the first page. Page url: http://data.eastmoney.com/bbsj/201806/lrb.html

We can quickly locate the node containing the table: id = dt_1. Selenium can then be used to crawl it as follows:

from selenium import webdriver
browser = webdriver.Chrome()
# Once crawling works reliably, you can enable headless mode (no browser window pops up) to speed things up
# Headless options: 1) Chrome headless, 2) PhantomJS
# PhantomJS triggers a deprecation warning; Chrome headless is recommended
# chrome_options = webdriver.ChromeOptions()
# chrome_options.add_argument('--headless')
# browser = webdriver.Chrome(chrome_options=chrome_options)
# browser = webdriver.PhantomJS()
# browser.maximize_window()  # maximize the window (optional)

browser.get('http://data.eastmoney.com/bbsj/201806/lrb.html')
element = browser.find_element_by_css_selector('#dt_1')  # locate the table; element is a WebElement
# Extract the td cells of the table
td_content = element.find_elements_by_tag_name("td")  # locate the td nodes holding the cell contents
lst = []  # store the cell texts in a list
for td in td_content:
    lst.append(td.text)
print(lst)  # print the table content
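The result above is one flat list of cell texts. As a quick sketch of how it can be turned into rows (reusing the element and lst variables from the block above, plus pandas; the complete parse_table() function below does exactly this), the column count is taken from the first row and the list is split accordingly:

import pandas as pd

# the number of td cells in the first row equals the number of columns
col = len(element.find_elements_by_css_selector('tr:nth-child(1) td'))
rows = [lst[i:i + col] for i in range(0, len(lst), col)]  # split the flat list into rows
df = pd.DataFrame(rows)
print(df.head())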

The complete code

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
import time
import pandas as pd
import os

# Chrome first; PhantomJS is an alternative
# browser = webdriver.Chrome()
# Run Chrome headless (no visible window)
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
browser = webdriver.Chrome(chrome_options=chrome_options)

# browser = webdriver.PhantomJS()  # raises a deprecation warning; headless Chrome is recommended instead
browser.maximize_window()  # maximize the window
wait = WebDriverWait(browser, 10)

def index_page(page):
    try:
        print('Crawling page: %s' % page)
        wait.until(
            EC.presence_of_element_located((By.ID, "dt_1")))
        # If the page number is greater than 1, type it into the jump box; otherwise just wait for the load to finish
        if page > 1:
            # Locate the page-number input box
            input = wait.until(EC.presence_of_element_located(
                (By.XPATH, '//*[@id="PageContgopage"]')))
            input.click()
            input.clear()
            input.send_keys(page)
            submit = wait.until(EC.element_to_be_clickable(
                (By.CSS_SELECTOR, '#PageCont > a.btn_link')))
            submit.click()
            time.sleep(2)
        # Confirm that the page shown is the one we asked for
        wait.until(EC.text_to_be_present_in_element(
            (By.CSS_SELECTOR, '#PageCont > span.at'), str(page)))
    except Exception:
        return None

def parse_table():
    # Method 1 for locating the table
    # element = wait.until(EC.presence_of_element_located((By.ID, "dt_1")))
    # Method 2
    element = browser.find_element_by_css_selector('#dt_1')

    # Extract the td cells of the table
    td_content = element.find_elements_by_tag_name("td")
    lst = []
    for td in td_content:
        # print(type(td.text))  # str
        lst.append(td.text)

    # Determine the number of columns in the table
    col = len(element.find_elements_by_css_selector('tr:nth-child(1) td'))
    # The number of td cells in one row gives the column count; split the flat list into sub-lists of that length
    lst = [lst[i:i + col] for i in range(0, len(lst), col)]

    # The "详细" (details) link on the page leads to more detailed data; extract its url for later use
    lst_link = []
    links = element.find_elements_by_css_selector('#dt_1 a.red')
    for link in links:
        url = link.get_attribute('href')
        lst_link.append(url)

    lst_link = pd.Series(lst_link)
    # Convert the list of rows into a DataFrame
    df_table = pd.DataFrame(lst)
    # Add the url column
    df_table['url'] = lst_link

    # print(df_table.head())
    return df_table

# Write to file
def write_to_file(df_table, category):
    # Save files under D:\eastmoney
    file_path = 'D:\\eastmoney'
    if not os.path.exists(file_path):
        os.mkdir(file_path)
    os.chdir(file_path)
    df_table.to_csv('{}.csv'.format(category), mode='a',
                    encoding='utf_8_sig', index=False, header=False)

# Ask the user for the reporting period and report type
def set_table():
    print('*' * 80)
    print('\t\t\t\tEast Money financial report downloader')
    print('Author: 高级农民工  2018.10.6')
    print('--------------')

    # 1 Set the reporting period
    year = int(float(input('Enter the year to query (four digits, 2007-2018):\n')))
    # int truncates; float is applied first because the input is a str, and int() on a string
    # containing a decimal point raises an error while float() does not
    # https://stackoverflow.com/questions/1841565/valueerror-invalid-literal-for-int-with-base-10
    while (year < 2007 or year > 2018):
        year = int(float(input('Invalid year, please enter it again:\n')))

    quarter = int(float(input('Enter the quarter as a digit (1: Q1 report, 2: interim report, 3: Q3 report, 4: annual report):\n')))
    while (quarter < 1 or quarter > 4):
        quarter = int(float(input('Invalid quarter, please enter it again:\n')))

    # Convert the quarter to the month used in the url; 02d means two digits, padded with 0
    # http://www.runoob.com/python/att-string-format.html
    quarter = '{:02d}'.format(quarter * 3)
    # quarter = '%02d' % (int(month) * 3)
    date = '{}{}'.format(year, quarter)
    # print(date)  # date tested ok

    # 2 Set the report type
    tables = int(
        input('Enter the number of the report type (1: 业绩报表 performance report; 2: 业绩快报表 performance express report; 3: 业绩预告表 performance forecast; 4: 预约披露时间表 disclosure schedule; 5: 资产负债表 balance sheet; 6: 利润表 income statement; 7: 现金流量表 cash flow statement): \n'))

    dict_tables = {1: '业绩报表', 2: '业绩快报表', 3: '业绩预告表',
                   4: '预约披露时间表', 5: '资产负债表', 6: '利润表', 7: '现金流量表'}
    dict = {1: 'yjbb', 2: 'yjkb/13', 3: 'yjyg',
            4: 'yysj', 5: 'zcfz', 6: 'lrb', 7: 'xjll'}
    category = dict[tables]

    # 3 Build the url
    # e.g. url = 'http://data.eastmoney.com/bbsj/201803/lrb.html'
    url = 'http://data.eastmoney.com/{}/{}/{}.html'.format(
        'bbsj', date, category)

    # 4 Choose the range of pages to crawl
    start_page = int(input('Enter the start page:\n'))
    nums = input('Enter the number of pages to download (press Enter to download all):\n')
    print('*' * 80)

    # Determine the last page of the table
    browser.get(url)
    # The last page number is located on the page rather than hard-coded, because it differs between periods
    try:
        page = browser.find_element_by_css_selector('.next+ a')  # the a node after the "next" node
    except Exception:
        page = browser.find_element_by_css_selector('.at+ a')
    # else:
    #     print('node not found')
    # try/except is used because '.next+ a' works for most page lists, but some performance express
    # reports have only 2 pages and therefore no '.next+ a' node
    end_page = int(page.text)

    if nums.isdigit():
        end_page = start_page + int(nums)
    elif nums == '':
        end_page = end_page + 1  # +1 so that the last page is included by range() in the main loop
    else:
        print('Invalid number of pages')
    # Announce what is about to be downloaded
    print('About to download: {}-{}'.format(date, dict_tables[tables]))
    print(url)
    yield {
        'url': url,
        'category': dict_tables[tables],
        'start_page': start_page,
        'end_page': end_page
    }

def main(category, page):
    try:
        index_page(page)
        # parse_table()  # test print
        df_table = parse_table()
        write_to_file(df_table, category)
        print('Page %s finished' % page)
        print('--------------')
    except Exception:
        print('Failed to crawl the page; check whether the table exists on it')

# Single process
if __name__ == '__main__':

    for i in set_table():
        # url = i.get('url')
        category = i.get('category')
        start_page = i.get('start_page')
        end_page = i.get('end_page')

    for page in range(start_page, end_page):
        # for page in range(44, pageall+1):  # if the download is interrupted, resume by manually changing the page number here
        main(category, page)
    print('Done, all pages crawled')
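A minimal way to run the whole thing, assuming the code above is saved as eastmoney_report.py (the file name is only an example):

python eastmoney_report.py

Answering the prompts with, say, 2018 for the year, 2 for the quarter, 6 for the report type and 1 for the start page (then pressing Enter to download all pages) should make the script open http://data.eastmoney.com/bbsj/201806/lrb.html, crawl the income statement page by page, and append the rows to D:\eastmoney\利润表.csv.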

Origin: blog.csdn.net/fei347795790/article/details/102757496