Python data processing - use of Pandas module (1)

1. Introduction to Pandas

Real financial data usually mixes several types in a single record: the stock code is a string, the closing price is a float, the volume is an integer, and so on. In C++, such data can be held in a container of structs, for example a std::vector whose elements are a user-defined struct. In Python, pandas provides two high-level data structures, Series and DataFrame, that make this kind of data convenient and fast to work with. Different versions of pandas are not fully compatible with each other, so it is worth knowing which version you are using.
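
Since behaviour differs between releases, it can help to print the installed version before running the examples. The following is a minimal sketch; the variable names `version` and `major` are only for illustration.

```python
import pandas as pd

# pd.__version__ is a plain string such as "1.5.3" or "2.2.0"
version = pd.__version__
print(version)

# The major version is the part before the first dot
major = int(version.split(".")[0])
print(major)
```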

2. Introduction to Pandas data structure

pandas has two core data structures: the Series and the DataFrame. A Series is similar to a one-dimensional NumPy array and supports most of the functions and methods available for one-dimensional arrays; in addition, its data can be accessed by index label, and it aligns automatically on the index during operations. A DataFrame is similar to a two-dimensional NumPy array and can likewise be used with most NumPy functions and methods; it also offers further, more flexible features, which are introduced later.
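
As a rough illustration of the comparison with NumPy arrays, the sketch below applies NumPy functions to a Series and then reads values by label. The labels and values are made up for the example.

```python
import numpy as np
import pandas as pd

arr = np.array([10, 20, 30])
s = pd.Series(arr, index=['x', 'y', 'z'])

# NumPy functions accept a Series much like an array
total = np.sum(s)          # 60
roots = np.sqrt(s)         # still a Series, labels preserved

# Unlike the array, the Series can also be read by label
y_value = s['y']           # 20

# A DataFrame wraps a 2-D array and adds row and column labels
df = pd.DataFrame(np.arange(6).reshape(2, 3),
                  index=['r0', 'r1'], columns=['a', 'b', 'c'])
cell = df.loc['r1', 'b']   # 4
```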

1. Creation of Series

There are three main ways to create a Series:

1) Create a Series from a one-dimensional array

import numpy as np
import pandas as pd

arr1 = np.arange(5)
print(arr1)
print(type(arr1))
s1 = pd.Series(arr1)
print(s1)
print(type(s1))

2) Create a Series from a dictionary

import pandas as pd

arr1 = {'a':10,'b':20,'c':30,'d':40,'e':50}
print(arr1)
print(type(arr1))
s1 = pd.Series(arr1)
print(s1)
print(type(s1))

3) Create a Series from a row or column of a DataFrame; see the DataFrame creation section for details.
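
A short sketch of this third approach, using a small hypothetical frame: selecting a column or a row of a DataFrame returns a Series.

```python
import pandas as pd

# Hypothetical frame used only for illustration
df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]},
                  index=['a', 'b', 'c'])

col = df['one']      # a column is a Series indexed by the row labels
row = df.loc['b']    # a row is a Series indexed by the column names

print(type(col), list(col.index))   # Series, ['a', 'b', 'c']
print(type(row), list(row.index))   # Series, ['one', 'two']
```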

2. Creation of DataFrame

There are three main ways to create a data frame:

1) Create a data frame from a 2D array

import numpy as np
import pandas as pd

arr1 = np.array(np.arange(12)).reshape(4, 3)
print(arr1)
print(type(arr1))
df1 = pd.DataFrame(arr1)
print(df1)
print(type(df1))
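
When a data frame is built from a bare array, both rows and columns get default integer labels. The sketch below shows one way to supply names at construction time; the labels are hypothetical.

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape(4, 3)

# Without names, rows and columns both get a default 0..n-1 index
df_default = pd.DataFrame(arr)

# Hypothetical labels supplied at construction time
df_named = pd.DataFrame(arr,
                        index=['r1', 'r2', 'r3', 'r4'],
                        columns=['x', 'y', 'z'])
print(df_named)
```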

2) Create a data frame from a dictionary

The two examples below use different kinds of dictionary: the first is a dictionary of lists, and the second is a nested dictionary.

import pandas as pd

dic1 = {'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8],
        'c': [9, 10, 11, 12], 'd': [13, 14, 15, 16]}
print(dic1)
print(type(dic1))
df1 = pd.DataFrame(dic1)
print(df1)
print(type(df1))

dic2 = {'one': {'a': 1, 'b': 2, 'c': 3, 'd': 4},
        'two': {'a': 5, 'b': 6, 'c': 7, 'd': 8},
        'three': {'a': 9, 'b': 10, 'c': 11, 'd': 12}}
print(dic2)
print(type(dic2))
df2 = pd.DataFrame(dic2)
print(df2)
print(type(df2))
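
The two dictionary forms differ in where the row labels come from. A minimal sketch with made-up data:

```python
import pandas as pd

# Dictionary of lists: keys become columns, rows get a default 0..n-1 index
by_lists = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Nested dictionary: outer keys become columns, inner keys become row labels
nested = pd.DataFrame({'one': {'a': 1, 'b': 2},
                       'two': {'a': 3, 'b': 4}})

print(list(by_lists.index), list(by_lists.columns))   # [0, 1] ['a', 'b']
print(list(nested.index), list(nested.columns))       # ['a', 'b'] ['one', 'two']
```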

3) Create a data frame from another data frame

df3=df2[['one','three']]
print(df3)
print(type(df3))
s3 = df3['one']
print(s3)
print(type(s3))


3. Data index

Attentive readers may have noticed that, for both Series and data frames, the printed output always has an extra column on the far left that is not part of the original data. That is the index, which we cover next.
The index of a Series or data frame has two main uses: retrieving data by index value or index label, and automatically aligning data when Series or data frames are combined in calculations. Let's look at each in turn.

1. Get data by index value or index label

If no index is specified when a Series is created, pandas automatically generates an integer index that starts at 0 and increases by 1:

s4 = pd.Series(np.array([1,1,2,3,5,8]))
print(s4)
# Output
# 0    1
# 1    1
# 2    2
# 3    3
# 4    5
# 5    8
# dtype: int32

The index of a Series can be viewed through its index attribute:

s4.index

Now we set a custom index for the Series:

s4.index = ['a','b','c','d','e','f']

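Once the Series has a custom index, data can be retrieved by position or by index label. Positional access is written with .iloc here, because plain integer keys such as s4[3] on a label-indexed Series are deprecated in recent pandas versions:

s4.iloc[3]               # by position
s4['e']                  # by label
s4.iloc[[1,3,5]]         # several positions
s4[['a','b','d','f']]    # several labels
s4.iloc[:4]              # positional slice, end position excluded
s4['c':]                 # label slice
s4['b':'e']              # label slice, end label included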

Note: when data is selected with a slice of index labels, the value at the end label is included in the result, unlike a positional slice. A one-dimensional NumPy array cannot be accessed by label at all, which is another way a Series differs from an array.
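
The difference between the two kinds of slice can be checked directly. This is a small sketch that uses the explicit .loc and .iloc accessors on a Series with the same values and labels as s4:

```python
import pandas as pd

s = pd.Series([1, 1, 2, 3, 5, 8], index=['a', 'b', 'c', 'd', 'e', 'f'])

by_label = s.loc['b':'e']      # labels b, c, d, e: end label included
by_position = s.iloc[1:4]      # positions 1, 2, 3: end position excluded

print(list(by_label.index))     # ['b', 'c', 'd', 'e']
print(list(by_position.index))  # ['b', 'c', 'd']
```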

2. Automatic alignment
When arithmetic is performed on two Series, the index shows its value through automatic alignment: values are matched by index label rather than by position.

s5 = pd.Series(np.array([10,15,20,30,55,80]),
               index = ['a','b','c','d','e','f'])
print(s5)
# a    10
# b    15
# c    20
# d    30
# e    55
# f    80
# dtype: int32
s6 = pd.Series(np.array([12,11,13,15,14,16]),
               index = ['a','c','g','b','d','f'])
print(s6)
# a    12
# c    11
# g    13
# b    15
# d    14
# f    16
# dtype: int32

print(s5 + s6)
# a    22.0
# b    30.0
# c    31.0
# d    44.0
# e     NaN
# f    96.0
# g     NaN
# dtype: float64

print(s5/s6)
# a    0.833333
# b    1.000000
# c    1.818182
# d    2.142857
# e         NaN
# f    5.000000
# g         NaN
# dtype: float64

Because s5 has no label g and s6 has no label e, the result contains two missing values (NaN). Note that the arithmetic is performed after the two indexes have been aligned; the values are not simply added or divided position by position. For data frames, alignment applies to the column index (the variable names) as well as the row index.
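
Alignment on both axes can be seen with two small, hypothetical data frames that overlap in only one cell. The use of add with fill_value at the end is an optional extra, not something the text above relies on.

```python
import pandas as pd

# Two hypothetical frames that overlap only partially
left = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['a', 'b'])
right = pd.DataFrame({'y': [10, 20], 'z': [30, 40]}, index=['b', 'c'])

total = left + right
print(total)
# Only the cell (row 'b', column 'y') exists in both frames: 4 + 10 = 14.
# Every other cell is NaN.

# add() with fill_value treats a value missing on one side as 0;
# cells missing on both sides still come out as NaN
filled = left.add(right, fill_value=0)
print(filled)
```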

A data frame also has an index. Because a data frame generalizes a two-dimensional array, it has both a row index and a column index. Indexing a data frame is considerably more powerful than indexing a Series; it is covered in the data query section.

4. Use pandas to query data

Querying data here is the counterpart of the subset function in R: it selects a subset of the original data, such as specific rows, specific columns, or rows that satisfy a condition expressed as a Boolean index. We first import a student dataset:

student = pd.read_csv('C:\\Users\\admin\\Desktop\\student.csv')

# First 5 and last 5 rows
student.head()
student.tail()

# Select specific rows (.ix was removed in pandas 1.0; .loc selects by label,
# and the labels of the default index are the integers 0, 1, 2, ...)
student.loc[[0,2,4,5,7]]

# Select specific columns; several columns need double square brackets
student[['Name','Height','Weight']].head()

# Columns can also be selected with .loc
student.loc[:,['Name','Height','Weight']].head()

# Select specific rows and columns
student.loc[[0,2,4,5,7],['Name','Height','Weight']].head()

# The queries above select by row or column. Boolean indexing selects by condition.
# All female students
student[student['Sex']=='F']

# All female students older than 12
student[(student['Sex']=='F') & (student['Age']>12)]

# Name, height and weight of all female students older than 12
student[(student['Sex']=='F') & (student['Age']>12)][['Name','Height','Weight']]

The query logic above is simple. Note that in a query with multiple conditions, each condition on either side of & (and) or | (or) must be enclosed in parentheses.
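
Because the student file is not included here, the sketch below uses a small made-up frame with the same column names to show why the parentheses matter and how the row filter and column selection can be combined in a single .loc call.

```python
import pandas as pd

# Hypothetical stand-in for the student dataset
student = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol', 'Dave'],
    'Sex': ['F', 'M', 'F', 'M'],
    'Age': [13, 12, 11, 14],
    'Height': [56.5, 62.8, 51.3, 69.0],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than == and >.
mask = (student['Sex'] == 'F') & (student['Age'] > 12)

# .loc selects rows by the mask and columns by name in one step
result = student.loc[mask, ['Name', 'Height']]
print(result)
```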

5. Statistical analysis

pandas provides many functions for descriptive statistics, such as sum, mean, minimum and maximum. Let's look at them in detail.
First, randomly generate three sets of data:

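np.random.seed(1234)
d1 = pd.Series(2*np.random.normal(size = 100)+3)
d2 = np.random.f(2,4,size = 100)
d3 = np.random.randint(1,100,size = 100)
d1.count()  # number of non-null elements
d1.min()    # minimum
d1.max()    # maximum
d1.idxmin() # index label of the minimum, similar to which.min in R
d1.idxmax() # index label of the maximum, similar to which.max in R
d1.quantile(0.1)    # 10% quantile
d1.sum()    # sum
d1.mean()   # mean
d1.median() # median
d1.mode()   # mode
d1.var()    # variance
d1.std()    # standard deviation
(d1 - d1.mean()).abs().mean()   # mean absolute deviation (Series.mad was removed in pandas 2.0)
d1.skew()   # skewness
d1.kurt()   # kurtosis
d1.describe()   # several descriptive statistics at once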

Note that the describe method is only available on Series and data frames; NumPy arrays have no such method.
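
A quick way to confirm this, sketched with an array generated like d3 above: the array itself has no describe attribute, but wrapping it in a Series makes the method available.

```python
import numpy as np
import pandas as pd

np.random.seed(1234)
arr = np.random.randint(1, 100, size=100)   # plain NumPy array

# A NumPy array has no describe method
print(hasattr(arr, 'describe'))   # False

# Wrapping it in a Series makes describe available
summary = pd.Series(arr).describe()
print(summary.index.tolist())
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
```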
The following custom function gathers all of these descriptive statistics into a single Series:

def stats(x):
    # Series.mad was removed in pandas 2.0, so compute it directly
    mad = (x - x.mean()).abs().mean()
    return pd.Series([x.count(),x.min(),x.idxmin(),
               x.quantile(.25),x.median(),
               x.quantile(.75),x.mean(),
               x.max(),x.idxmax(),
               mad,x.var(),
               x.std(),x.skew(),x.kurt()],
              index = ['Count','Min','Which_Min',
                       'Q1','Median','Q3','Mean',
                       'Max','Which_Max','Mad',
                       'Var','Std','Skew','Kurt'])
stats(d1)

In practice we often need to process a data frame with many numeric columns. How can this function be applied to every column? Use the data frame's apply method, which works much like apply in R.
First build a data frame from the d1, d2 and d3 data created earlier:

df = pd.DataFrame(np.array([d1,d2,d3]).T,columns=['x1','x2','x3'])
df.head()
df.apply(stats)
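
To make the behaviour of apply easier to see, here is a smaller sketch with a hypothetical two-statistic function named `spread`. By default the function is called once per column; with axis=1 it is called once per row.

```python
import pandas as pd

# Hypothetical numeric frame
df = pd.DataFrame({'x1': [1.0, 2.0, 3.0], 'x2': [10.0, 20.0, 60.0]})

def spread(x):
    """Return the range and mean of one Series."""
    return pd.Series([x.max() - x.min(), x.mean()], index=['Range', 'Mean'])

per_column = df.apply(spread)           # axis=0 (default): one call per column
per_row = df.apply(spread, axis=1)      # axis=1: one call per row

print(per_column)
print(per_row)
```

In the first result the statistic names form the row index and the original columns stay as columns, which is the same layout df.apply(stats) produces.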

This gives descriptive statistics for numeric data with very little code. Discrete (categorical) data needs different statistics: the number of observations, the number of unique values, the most frequent value (the mode) and its frequency. The describe method reports exactly these:
student['Sex'].describe()
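
Without the student file, the same idea can be sketched on a short made-up Series of strings. For non-numeric data, describe returns count, unique, top and freq.

```python
import pandas as pd

# Hypothetical stand-in for the student['Sex'] column
sex = pd.Series(['F', 'M', 'M', 'F', 'M'])

summary = sex.describe()
print(summary)
# count: 5 observations, unique: 2 levels, top: 'M' (the mode), freq: 3

# value_counts gives the full frequency table behind top and freq
counts = sex.value_counts()
print(counts)
```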

Beyond these simple descriptive statistics, pandas also computes the correlation coefficients (corr) and the covariance matrix (cov) of continuous variables, in the same way as R:
df.corr()

The correlation coefficient can be computed with the pearson, kendall or spearman method; pearson is the default.
df.corr('spearman')
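
The choice of method matters when a relationship is monotonic but not linear. The sketch below uses made-up data where y is the cube of x: the rank-based spearman coefficient is 1, while the pearson coefficient is lower.

```python
import pandas as pd

# Hypothetical data: y grows monotonically but not linearly with x
df = pd.DataFrame({'x': [1, 2, 3, 4, 5]})
df['y'] = df['x'] ** 3

pearson = df.corr().loc['x', 'y']              # below 1: relation is not linear
spearman = df.corr('spearman').loc['x', 'y']   # 1 for a strictly increasing relation

print(pearson, spearman)
```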

If you only care about the correlation between one variable and the others, use corrwith. The following computes the correlation of x1 with every column:
df.corrwith(df['x1'])
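
As a small check on what corrwith returns, the sketch below uses hypothetical columns with known relationships to x1. The result is a Series indexed by column name, and with the default pearson method it matches the x1 column of the full correlation matrix.

```python
import pandas as pd

df = pd.DataFrame({'x1': [1, 2, 3, 4],
                   'x2': [2, 4, 6, 8],     # perfectly correlated with x1
                   'x3': [4, 3, 2, 1]})    # perfectly anti-correlated with x1

result = df.corrwith(df['x1'])
print(result)
# A Series indexed by column name; x1 correlates with itself at 1

# Same numbers as the x1 column of the full correlation matrix
same = df.corr()['x1']
print(same)
```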

Covariance matrix of the numeric variables:
df.cov()
