Getting started with pandas 01: DataFrame data processing and inserting into a MySQL database

Preface

The stock market has been very hot recently, so I thought about writing a quantitative analysis program for trading stocks. Maybe it will take me to the pinnacle of my life.

Most online stock market data interfaces are written for Python, and the data they return is a DataFrame. This article records the process of learning to process data with DataFrame and insert it into a database.

Environment preparation

A Python environment. Anaconda is recommended, and there are many installation tutorials online
A MySQL database, which can be installed with Docker
tushare (provides an interface for obtaining stock data)
pymysql

tushare and pymysql can be installed from the Anaconda interface, or from the command line with conda install xxx or pip install xxx. The data storage code below also uses sqlalchemy, which can be installed the same way.

On Windows, after installing Python, typing python in the cmd command line may open the application store instead of starting Python. In that case, edit the PATH environment variable and move the Python entries above the application store (WindowsApps) entry.

DataFrame data preparation

import tushare as ts
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
# The two lines above make the console print the full data
pro = ts.pro_api()
df = pro.query('trade_cal', start_date='20200101', end_date='20201231')
print(df)

The returned trading calendar data looks like this:

    exchange  cal_date  is_open
0        SSE  20200101        0
1        SSE  20200102        1
2        SSE  20200103        1
3        SSE  20200104        0
4        SSE  20200105        0
...

DataFrame data processing

Modify the header

Since the DataFrame will be stored in the database, and the database column names default to the DataFrame column names, rename the columns first.

# Method 1
# Assign directly. The drawback is that every column must be listed, otherwise an error is raised
df.columns=['exchange','trade_date','is_open']
# Method 2
# Rename only the listed columns (recommended)
df.rename(columns={'cal_date':'trade_date'},inplace=True)
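
To make the difference between the two methods concrete, here is a minimal sketch. It assumes a small hand-built DataFrame that mirrors the trade_cal columns, not real tushare output.

```python
import pandas as pd

# Hypothetical sample that mirrors the trade_cal columns shown above
df = pd.DataFrame({
    "exchange": ["SSE", "SSE"],
    "cal_date": ["20200101", "20200102"],
    "is_open": [0, 1],
})

# Method 1 with too few names fails: the list must cover every column
try:
    df.columns = ["exchange", "trade_date"]
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

# Method 2 renames only the listed column and leaves the others alone
renamed = df.rename(columns={"cal_date": "trade_date"})

print(mismatch_raised)
print(list(renamed.columns))
```

Without inplace=True, rename returns a new DataFrame, so df keeps its original column names here.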

Change the data

The returned dates are in yyyyMMdd format, and I want to change them to yyyy-MM-dd format.

df['trade_date']=df['trade_date'].apply(lambda x:x[0:4]+"-"+x[4:6]+"-"+x[6:8])
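
As an alternative sketch, the same conversion can be done without a lambda. The sample values below are made up for illustration; the first variant is the same slicing written with the .str accessor, and the second parses the strings as real dates, which raises an error on malformed input instead of silently reshaping it.

```python
import pandas as pd

# Sample values only, in the yyyyMMdd format returned by the interface
dates = pd.Series(["20200101", "20200102", "20201231"])

# Same slicing as the lambda, applied to the whole column at once
sliced = dates.str[0:4] + "-" + dates.str[4:6] + "-" + dates.str[6:8]

# Parse as dates, then format back to strings
parsed = pd.to_datetime(dates, format="%Y%m%d").dt.strftime("%Y-%m-%d")

print(sliced.tolist())
print(parsed.tolist())
```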

Data storage

import tushare as ts
import pandas as pd
from sqlalchemy import create_engine

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pro = ts.pro_api()

df = pro.query('trade_cal', start_date='20200101', end_date='20201231')
print(df)
df.rename(columns={'exchange':'exchange','cal_date':'trade_date','is_open':'is_open'},inplace=True)
df['trade_date']=df['trade_date'].apply(lambda x:x[0:4]+"-"+x[4:6]+"-"+x[6:8])

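# Create the connection (the charset goes in the URL; SQLAlchemy 2.0 no longer accepts an encoding argument)
conn = create_engine('mysql+pymysql://root:123456@localhost:3306/investment?charset=utf8')
# Write the data. t_stock_trade_date is the database table name; 'replace' means an existing table with the same name is replaced
df.to_sql("t_stock_trade_date", conn, if_exists='replace', index=False)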

pandas creates the database table automatically if it does not exist. What happens when the table already exists is controlled by the if_exists parameter:

if_exists : {'fail', 'replace', 'append'}, default 'fail'
    How to behave if the table already exists.

    * fail: Raise a ValueError.
    * replace: Drop the table before inserting new values.
    * append: Insert new values to the existing table.

In addition, when a DataFrame is written to the database, its index is stored as an extra column by default. Pass index=False to leave it out.
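
The if_exists and index behavior can be tried without a MySQL server. The sketch below assumes an in-memory SQLite database as a stand-in, and t_demo is a hypothetical table name; the to_sql arguments are the same ones used above.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({"trade_date": ["2020-01-01", "2020-01-02"], "is_open": [0, 1]})
conn = sqlite3.connect(":memory:")  # stand-in for the MySQL connection


def row_count():
    # Count the rows currently stored in the demo table
    return int(pd.read_sql("SELECT COUNT(*) AS n FROM t_demo", conn)["n"][0])


# The table does not exist yet, so it is created automatically
df.to_sql("t_demo", conn, if_exists="replace", index=False)

# append adds the rows to the existing table
df.to_sql("t_demo", conn, if_exists="append", index=False)
count_after_append = row_count()

# replace drops the table and writes the rows again
df.to_sql("t_demo", conn, if_exists="replace", index=False)
count_after_replace = row_count()

# The default, fail, raises ValueError because the table already exists
try:
    df.to_sql("t_demo", conn, index=False)
    fail_raised = False
except ValueError:
    fail_raised = True

# With index=False there is no extra index column in the table
columns = list(pd.read_sql("SELECT * FROM t_demo", conn).columns)
conn.close()
```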

Adding, deleting, modifying, and querying DataFrame data

There are many ways to add, delete, modify, and query data; only a few commonly used ones are covered here.

The inplace parameter is False by default: the original data is left unchanged and the edited result is returned as a new object. With inplace=True, the edit is applied to the original data and nothing is returned.
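
A minimal sketch of the two inplace modes, using rename on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"name": ["lisa", "joy"], "sex": ["f", "f"]}, index=[1, 2])

# inplace=False (the default): df is untouched and a new DataFrame is returned
renamed = df.rename(columns={"sex": "gender"})

# inplace=True: df itself is changed and the call returns None
result = df.rename(columns={"name": "Name"}, inplace=True)

print(list(renamed.columns))
print(result)
print(list(df.columns))
```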

import numpy as np
import pandas as pd

# Test data
df = pd.DataFrame(data = [['lisa','f',22],['joy','f',22],['tom','m',21]],index = [1,2,3],columns = ['name','sex','age'])
print(df)
# Printed test data:
#    name sex  age
# 1  lisa   f   22
# 2   joy   f   22
# 3   tom   m   21

Add

Add a column

# Insert a column named city at position 0, with the values in citys.
citys = ['ny','zz','xy']
df.insert(0,'city',citys)

# Append a column named job as the last column of df, with the values in jobs.
jobs = ['student','AI','teacher']
df['job'] = jobs

# Append a column named salary as the last column of df. The list needs one value per row.
df.loc[:,'salary'] = ['1k','2k','3k']

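Add a row

# If df has no row with index 4, this line appends a row with index 4 holding the values on the right.
# If df already has a row with index 4, this line overwrites that row with the values on the right.
# The list needs one value per column (city, name, sex, age, job, salary).
df.loc[4] = ['zz','mason','m',24,'engineer','2k']

df_insert = pd.DataFrame({'name':['mason','mario'],'sex':['m','f'],'age':[21,22]},index = [4,5])
# pd.concat returns the combined data and does not modify df.
# (df.append, used in older tutorials, was removed in pandas 2.0.)
# ignore_index defaults to False: the index values of df and df_insert are kept in ndf.
# With ignore_index=True, the existing index values are dropped and ndf gets a new default index.
ndf = pd.concat([df, df_insert],ignore_index = True)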
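
DataFrame.append was removed in pandas 2.0, and pd.concat is its replacement. The following sketch, on a reduced version of the test data, shows what ignore_index does to the resulting index:

```python
import pandas as pd

df = pd.DataFrame({"name": ["lisa", "joy", "tom"], "age": [22, 22, 21]}, index=[1, 2, 3])
extra = pd.DataFrame({"name": ["mason", "mario"], "age": [21, 22]}, index=[4, 5])

# ignore_index=False (default): the original index values are kept
kept = pd.concat([df, extra])

# ignore_index=True: a new default index 0..n-1 is generated
reset = pd.concat([df, extra], ignore_index=True)

print(list(kept.index))
print(list(reset.index))
```

In both cases df keeps its three rows, because concat returns a new DataFrame.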

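Query

df['column_name'] and df[row_start_index:row_end_index]

  df['name']
  df['sex']
  df[['name','sex']] # select several columns; the column names must be in a list
  df[0:]    # row 0 and every row after it, i.e. all of df; note that the colon is required
  df[:2]    # the rows before row 2 (row 2 not included)
  df[0:1]   # row 0
  df[1:3]   # rows 1 to 2 (row 3 not included)
  df[-1:]   # the last row
  df[-3:-1] # from the third-to-last row up to, but not including, the last row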

df.loc[index,column]

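# df.loc[index, column_name] selects data by row label and column name (test data above, index 1 to 3)
df.loc[1,'name'] # 'lisa'
df.loc[1:2, ['name','age']] # rows with index 1 through 2, columns name and age; note that label slices include the end label
df.loc[[2,3],['name','age']] # the rows with index 2 and 3, columns name and age
df.loc[df['sex']=='m','name'] # the name column of the rows where sex is 'm'
df.loc[df['sex']=='m',['name','age']] # the name and age columns of the rows where sex is 'm'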

df.iloc[row_index, column_index]

df.iloc[0,0]		# row 0, column 0: 'lisa'
df.iloc[1,2]		# row 1, column 2: 22
df.iloc[[0,2],0:2]	# rows 0 and 2, columns 0 up to 2 (column 2 not included)
df.iloc[1:3,[1,2]]	# rows 1 up to 3 (row 3 not included), columns 1 and 2
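
Because the test data uses the index 1, 2, 3, loc and iloc address different rows for the same number. The sketch below rebuilds the test data so that it runs on its own and compares the two:

```python
import pandas as pd

df = pd.DataFrame(
    data=[["lisa", "f", 22], ["joy", "f", 22], ["tom", "m", 21]],
    index=[1, 2, 3],
    columns=["name", "sex", "age"],
)

by_label = df.loc[1, "name"]     # label 1 is the first row
by_position = df.iloc[1, 0]      # position 1 is the second row

label_slice = df.loc[1:2, "name"]   # label slice: end label included
pos_slice = df.iloc[1:2, 0]         # position slice: end position excluded

males = df.loc[df["sex"] == "m", "name"]   # boolean condition on a column

print(by_label, by_position)
print(label_slice.tolist(), pos_slice.tolist(), males.tolist())
```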

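Modify

Change row and column headings

# Even though we only want to change 'sex' to 'gender', every column must be listed, otherwise an error is raised.
df.columns = ['name','gender','age']
# Rename only name and age. With inplace=True, df is modified directly; otherwise df is unchanged and a modified copy is returned.
df.rename(columns = {'name':'Name','age':'Age'},inplace = True)
# Change the index to a, b, c. This modifies df directly.
df.index = list('abc')
# Alternative: map the old labels 1, 2, 3 to a, b, c. No return value; the index of df is modified directly.
df.rename({1:'a',2:'b',3:'c'},axis = 0,inplace = True)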

Change value

Use loc

# Set the value at index 1, column name to 'aa'
df.loc[1,'name'] = 'aa'
# Replace every value in the row with index 1
df.loc[1] = ['bb','ff',11]
# In the row with index 1, set name to 'bb' and age to 11
df.loc[1,['name','age']] = ['bb',11]

Use iloc[row_index, column_index]

df.iloc[1,2] = 19 # modify a single element
df.iloc[:,2] = [11,22,33] # modify a whole column
df.iloc[0,:] = ['lily','F',15] # modify a whole row
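
The conditions used for querying can also be used for assignment. The following sketch, again on a rebuilt copy of the test data, updates only the rows that match a condition and then changes the same row once by label and once by position:

```python
import pandas as pd

df = pd.DataFrame(
    data=[["lisa", "f", 22], ["joy", "f", 22], ["tom", "m", 21]],
    index=[1, 2, 3],
    columns=["name", "sex", "age"],
)

# Update age only in the rows where sex is 'f'
df.loc[df["sex"] == "f", "age"] = 23

# The third row addressed by label (3) and by position (2)
df.loc[3, "name"] = "tommy"
df.iloc[2, 2] = 30

print(df)
```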

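Delete

Delete row

df.drop([1,3],axis = 0,inplace = False) # returns a copy without the rows whose index is 1 and 3; df itself is unchanged

Delete column

df.drop(['name'],axis = 1,inplace = False) # returns a copy without the name column; df itself is unchanged
del df['name'] # delete the name column from df
ndf = df.pop('age') # remove the age column from df and return it as ndf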
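
The three ways of deleting a column differ in what they return and whether df changes. A small sketch on a rebuilt copy of the test data:

```python
import pandas as pd

df = pd.DataFrame(
    data=[["lisa", "f", 22], ["joy", "f", 22], ["tom", "m", 21]],
    index=[1, 2, 3],
    columns=["name", "sex", "age"],
)

# drop without inplace=True returns a new DataFrame; df is unchanged
dropped = df.drop(["name"], axis=1)
columns_after_drop = list(df.columns)

# pop removes the column from df and returns it as a Series
age = df.pop("age")

# del removes the column from df and returns nothing
del df["sex"]

print(list(dropped.columns))
print(age.tolist())
print(list(df.columns))
```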