Crawling listed companies' profit statement data with Python: data capture, data storage, and data visualization in one go

background

A small crawler exercise: crawl the income statement data of the listed company Conch Cement from the Sina Finance website, use JSON files and MySQL as two persistence methods, and visualize the company's total operating income and total operating cost over the past 10 years.

e.g. https://money.finance.sina.com.cn/corp/go.php/vFD_ProfitStatement/stockid/600585/ctrl/2013/displaytype/4.phtml

2023-08-12-Table.png

Python has many advantages for web crawling and data visualization that make it one of the programming languages of choice.

In terms of web crawling, Python has the following advantages:

  1. Easy to learn: Python's syntax is concise and clear, easy to understand and learn; even beginners can get started quickly.
  2. Powerful library and framework support: Python has rich third-party libraries and frameworks, such as BeautifulSoup and Scrapy. These libraries and frameworks provide a wealth of functions and tools, making crawler development more efficient and convenient.
  3. Multi-threading and asynchronous support: Python supports multi-threaded and asynchronous programming, which lets a crawler process multiple tasks at the same time and improves crawling efficiency.
  4. Strong data processing ability: Python has powerful data processing and analysis libraries, such as Pandas and NumPy, which make it easy to clean, analyze, and process the crawled data.

In terms of data visualization, Python also has the following advantages:

  1. Rich visualization libraries: Python has multiple powerful visualization libraries, such as Matplotlib, Seaborn, and Plotly. These libraries provide rich chart types and customization options to meet various data visualization needs.
  2. Flexibility and scalability: Python's visualization libraries are highly flexible and extensible, and can be customized and extended as required to meet individual visualization needs.
  3. Seamless integration with data processing: Python's data processing and visualization libraries integrate seamlessly, making the path from data processing to visualization smoother and more efficient.

The Python libraries used in this exercise are as follows:

  • requests
  • BeautifulSoup4
  • json
  • matplotlib
  • pandas
  • pymysql

Coding in collaboration with GPT

As an entry-level Python player, I am not very good at writing crawler programs, so I first had GPT generate a scaffold for me.

Prompt: Write a crawler that obtains the table data from https://money.finance.sina.com.cn/corp/go.php/vFD_ProfitStatement/stockid/600585/ctrl/2013/displaytype/4.phtml and saves it to a JSON file; pay attention to the file encoding format.

2023-08-12-GPT.png
GPT was afraid I might not understand, so it thoughtfully added comments to the key code.

Crawling, saving as JSON, and plotting

To simplify the problem, we deal with it step by step:

  1. Data capture: crawl the raw data and save it as JSON;
  2. Data preprocessing: use the first value of each element as the attribute name and the remaining elements as the values (to facilitate the JSON merge in the next step);
  3. Data merge: merge the 10 years of data into one JSON file;
  4. Data visualization: use matplotlib to draw bar and line charts.

data capture

First crawl the data for 1 year.

import requests
from bs4 import BeautifulSoup
import json

# Send an HTTP request to fetch the page content
url = "https://money.finance.sina.com.cn/corp/go.php/vFD_ProfitStatement/stockid/600585/ctrl/2013/displaytype/4.phtml"
response = requests.get(url)
html_content = response.text

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")

# Locate the profit statement table
table = soup.find("table", {"id": "ProfitStatementNewTable0"})

# Extract the table data row by row
data = []
for row in table.find_all("tr"):
    row_data = []
    for cell in row.find_all("td"):
        row_data.append(cell.text.strip())
    # Skip empty rows
    if len(row_data) != 0:
        data.append(row_data)
print(data)

# Save the data to a JSON file
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)

2023-08-12-DataRaw.jpg

The 10-year data is just a matter of wrapping the same logic in an outer loop.

import requests
from bs4 import BeautifulSoup
import json

# Profit statements for the past 10 years
years = 10
for i in range(years):
    # Send an HTTP request to fetch the page for each year
    url = "https://money.finance.sina.com.cn/corp/go.php/vFD_ProfitStatement/stockid/600585/ctrl/{}/displaytype/4.phtml".format(2013 + i)
    response = requests.get(url)
    html_content = response.text

    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(html_content, "html.parser")

    # Locate the profit statement table
    table = soup.find("table", {"id": "ProfitStatementNewTable0"})

    # Extract the table data row by row
    data = []
    for row in table.find_all("tr"):
        row_data = []
        for cell in row.find_all("td"):
            row_data.append(cell.text.strip())
        # Skip empty rows and incomplete data rows
        if len(row_data) > 1:
            data.append(row_data)
    print(data)

    # Save each year's data to its own JSON file
    with open("data-{}.json".format(2013 + i), "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False)

data preprocessing

Convert each year's data into JSON key:value format to facilitate the subsequent merge operation.

import json

# Read each year's JSON file in a loop
years = 10
for i in range(years):
    with open('data-{}.json'.format(2013 + i), 'r', encoding="utf-8") as f:
        first_data = json.load(f)

    result = {}
    for item in first_data:
        # Transform: use the first value of each element as the key, the remaining elements as the value
        result.update([(item[0], item[1:])])

    # Save the transformed data to a JSON file
    with open("data-transform-{}.json".format(2013 + i), "w", encoding="utf-8") as f:
        json.dump(result, f, ensure_ascii=False)

2023-08-12-DataTransformed.jpg

data merge

Merge the preprocessed 10 years of data into one large JSON file.

import json

# Read the yearly transformed JSON files in a loop and merge them
years = 10
merged_data = {}
for i in range(years):
    with open('data-transform-{}.json'.format(2013 + i), 'r', encoding="utf-8") as f:
        # Use the year as the key for that year's transformed data
        merged_data[2013 + i] = json.load(f)

# Write the merged JSON file to disk
with open('data-merged.json', 'w', encoding="utf-8") as f:
    json.dump(merged_data, f, indent=4, ensure_ascii=False)

data visualization

Read Conch Cement's operating income and operating cost for the past 10 years from the merged JSON data, and use matplotlib to draw a bar chart and a line chart. If you need large-screen dashboards, you can start with the AJ-Report open-source data visualization engine.

import json
import matplotlib.pyplot as plt

with open('data-merged.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
    
x_data = []
y1_data = []
y2_data = []
for key, value in data.items():
    # Extract operating income and operating cost for each year,
    # stripping thousands separators before converting to float
    x_data.append(key)
    y1_data.append(float(value['营业收入'][0].replace(',', '')))
    y2_data.append(float(value['营业成本'][0].replace(',', '')))

# Use a Chinese-capable font and keep the minus sign rendering correct
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

plt.bar(x_data, y1_data, label='营业收入')
plt.plot(x_data, y2_data, label='营业成本', color='cyan', linestyle='--')
plt.title('海螺水泥近10年营业收入与营业成本')
plt.xlabel('年份')
plt.ylabel('营业收入与营业成本/万元')

# Turn off scientific notation on the y-axis
axis_y = plt.gca()
axis_y.ticklabel_format(axis='y', style='plain')

# Legend
plt.legend()

# Show the chart
plt.show()

2023-08-12-Chart.jpg

Crawling and saving to MySQL

Data table design (only some fields)

Design the database table based on the actual table we see on the web page.

2023-08-12-MySQLTable.jpg

CREATE TABLE `b_profit_test` (
	`id` BIGINT(20) NOT NULL AUTO_INCREMENT COMMENT '主键',
	`company_id` BIGINT(20) NOT NULL DEFAULT '0' COMMENT '公司ID',
	`total_operating_income` DECIMAL(15,2) NOT NULL DEFAULT '0.00' COMMENT '一、营业总收入',
	`operating_income` DECIMAL(15,2) NOT NULL DEFAULT '0.00' COMMENT '营业收入',
	`total_operating_cost` DECIMAL(15,2) NOT NULL DEFAULT '0.00' COMMENT '二、营业总成本',
	`operating_cost` DECIMAL(15,2) NOT NULL DEFAULT '0.00' COMMENT '营业成本',
	`taxes_and_surcharges` DECIMAL(15,2) NOT NULL DEFAULT '0.00' COMMENT '营业税金及附加',
	`sales_expense` DECIMAL(15,2) NOT NULL DEFAULT '0.00' COMMENT '销售费用',
	`management_costs` DECIMAL(15,2) NOT NULL DEFAULT '0.00' COMMENT '管理费用',
	`financial_expenses` DECIMAL(15,2) NOT NULL DEFAULT '0.00' COMMENT '财务费用',
	`rd_expenses` DECIMAL(15,2) NOT NULL DEFAULT '0.00' COMMENT '研发费用',
	`operating_profit` DECIMAL(15,2) NOT NULL DEFAULT '0.00' COMMENT '三、营业利润',
	`net_profit` DECIMAL(15,2) NOT NULL DEFAULT '0.00' COMMENT '五、净利润',
	`basic_earnings_per_share` DECIMAL(15,2) NOT NULL DEFAULT '0.00' COMMENT '基本每股收益',
	`diluted_earnings_per_share` DECIMAL(15,2) NOT NULL DEFAULT '0.00' COMMENT '稀释每股收益',
	`report_period` VARCHAR(50) NOT NULL DEFAULT '' COMMENT '报告期' COLLATE 'utf8_general_ci',
	`date` DATE NULL DEFAULT NULL COMMENT '日期',
	PRIMARY KEY (`id`) USING BTREE
)
COMMENT='利润表测试'
COLLATE='utf8_general_ci'
ENGINE=InnoDB
;

Data crawling and data storage

Different from the crawler approach above, here we use pandas' read_html to obtain the table data from the web page directly. Note the index of the target table, which needs to be located by inspecting the page in the browser.
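
Since read_html returns a list of DataFrames, one per table on the page, one quick way to locate the right index is to print each candidate's shape and flag the one containing a known row label such as '营业总收入'. A minimal sketch:

import pandas as pd

url = ('https://money.finance.sina.com.cn/corp/go.php/vFD_ProfitStatement/'
       'stockid/600585/ctrl/2013/displaytype/4.phtml')

# read_html parses every <table> on the page into a list of DataFrames
tables = pd.read_html(url)
for i, t in enumerate(tables):
    # Flag the table that contains the profit statement row label
    hit = t.astype(str).apply(lambda col: col.str.contains('营业总收入')).any().any()
    print(i, t.shape, 'profit statement table' if hit else '')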

Taking a single link as an example, the crawl result is a two-dimensional table. After a series of preprocessing operations (transposing, setting the first row as the header, renaming the columns to match the database fields, dropping rows that are entirely NaN, and replacing individual NaN values with 0), we connect directly to the MySQL database to complete the data storage.

2023-08-12-Notebook.png

import pandas as pd
import pymysql

# Crawl the tables from the page; note that the index 13 here depends on the tables actually present on the page
df = pd.read_html('https://money.finance.sina.com.cn/corp/go.php/vFD_ProfitStatement/stockid/600585/ctrl/2013/displaytype/4.phtml')[13]

# The whole table can be inspected directly in a Jupyter Notebook
df.head(32)

# Preprocessing
df = df.transpose()  # transpose for easier handling
df = df.rename(columns=df.iloc[0]).drop(df.index[0])  # use the first row as the header
df.rename(columns={
    '报表日期': 'report_period',
    '一、营业总收入': 'total_operating_income',
    '营业收入': 'operating_income',
    '二、营业总成本': 'total_operating_cost',
    '营业成本': 'operating_cost',
    '营业税金及附加': 'taxes_and_surcharges',
    '销售费用': 'sales_expense',
    '管理费用': 'management_costs',
    '财务费用': 'financial_expenses',
    '三、营业利润': 'operating_profit',
    '五、净利润': 'net_profit',
    '基本每股收益(元/股)': 'basic_earnings_per_share',
    '稀释每股收益(元/股)': 'diluted_earnings_per_share'
}, inplace=True)
df.dropna(axis=0, how='all', inplace=True)  # drop rows that are entirely NaN
df.fillna(0, inplace=True)  # replace individual NaN values with 0

# Write to the database
conn = pymysql.connect(
    user='root',
    host='localhost',
    password='root',
    db='financial-statement',
    port=3306,
)

cur = conn.cursor()
for index, row in df.iterrows():
    # print(index)  # index of each row
    # print(row)    # content of each row

    sql = "insert into b_profit_test(report_period, total_operating_income, operating_income, total_operating_cost, operating_cost, taxes_and_surcharges, sales_expense, management_costs, financial_expenses, operating_profit, net_profit, basic_earnings_per_share, diluted_earnings_per_share) values ('" + str(row['report_period']) + "'," + str(row['total_operating_income']) + ',' + str(row['operating_income']) + ',' + str(row['total_operating_cost']) + ',' + str(row['operating_cost']) + ',' + str(row['taxes_and_surcharges']) + ',' + str(row['sales_expense']) + ',' + str(row['management_costs']) + ',' + str(row['financial_expenses']) + ',' + str(row['operating_profit']) + ','  + str(row['net_profit']) + ',' + str(row['basic_earnings_per_share']) + ',' + str(row['diluted_earnings_per_share']) + ');'
    print(sql)
    cur.execute(sql)
conn.commit()
cur.close()
conn.close()
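
The concatenated SQL above works for this dataset, but it is fragile if any value ever contains a quote. As a minimal sketch (assuming the same df, cur, and conn as above, used before they are closed), a parameterized INSERT with pymysql placeholders could look like this:

# Sketch: a parameterized INSERT instead of string concatenation (assumes df, cur, conn from above)
cols = ['report_period', 'total_operating_income', 'operating_income',
        'total_operating_cost', 'operating_cost', 'taxes_and_surcharges',
        'sales_expense', 'management_costs', 'financial_expenses',
        'operating_profit', 'net_profit', 'basic_earnings_per_share',
        'diluted_earnings_per_share']
insert_sql = "insert into b_profit_test ({}) values ({})".format(
    ", ".join(cols), ", ".join(["%s"] * len(cols)))
for _, row in df.iterrows():
    # pymysql quotes and escapes each value for us
    cur.execute(insert_sql, [row[c] for c in cols])
conn.commit()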

2023-08-12-MySQLData.jpg
After completing the data storage, the follow-up work is up to us: data modeling, data analysis, and data visualization.
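
As a rough sketch of that follow-up (assuming the b_profit_test table and the connection settings used above), the stored figures can be read back from MySQL and charted again with matplotlib:

import pymysql
import matplotlib.pyplot as plt

# Read the stored figures back from MySQL
# (assumes the b_profit_test table and connection settings used above)
conn = pymysql.connect(user='root', host='localhost', password='root',
                       db='financial-statement', port=3306)
cur = conn.cursor()
cur.execute("select report_period, total_operating_income, total_operating_cost "
            "from b_profit_test order by report_period")
rows = cur.fetchall()
cur.close()
conn.close()

periods = [r[0] for r in rows]
total_income = [float(r[1]) for r in rows]
total_cost = [float(r[2]) for r in rows]

# Chinese-capable font for the labels
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.plot(periods, total_income, label='营业总收入')
plt.plot(periods, total_cost, label='营业总成本')
plt.legend()
plt.show()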

Summary

In summary, we used Python to crawl the profit statement data of a listed company, completing data capture, data storage, and data visualization in one go, and along the way experienced Python's advantages for crawling and data visualization: ease of learning, powerful library and framework support, multi-threading and asynchronous support, and strong data processing ability.


If you have any questions or find any bugs, please feel free to contact me.
Your comments and suggestions are welcome!

Origin blog.csdn.net/u013810234/article/details/132255454