Learning Python: the pandas data analysis tool (Part 1)

There is little need to argue for the usefulness of pandas; everyone knows how powerful it is. Books such as "Python for Data Analysis" and "Python Scientific Computing and Data Analysis" devote a great deal of space to its features. This post is a brief summary based on my own experience learning it and using it at work.

1. Data Structure

pandas has two commonly used data structures: the one-dimensional Series and the two-dimensional DataFrame. A Series consists of one index and one set of data, and all of its values share a single type. A DataFrame consists of two indexes (a row index and a column index) and several sets of data, whose columns may hold different types such as strings and numbers. In other words it is the familiar table structure, which makes DataFrame the data structure we use most often.
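To make the difference concrete, here is a minimal sketch. The variable names and sample values are made up for illustration; the point is only that a Series carries one dtype for all of its values, while each DataFrame column has its own.

```python
import pandas as pd

# A Series: one index, one set of values, one dtype
s = pd.Series([1.5, 2.5, 3.5], index=["x", "y", "z"])

# A DataFrame: a row index plus a column index; each column has its own dtype
df = pd.DataFrame(
    {"name": ["Ann", "Bob"], "age": [30, 25], "score": [88.5, 92.0]},
    index=["r1", "r2"],
)

print(s.dtype)    # a single dtype for the whole Series
print(df.dtypes)  # one dtype per column
print(df.shape)   # (rows, columns)
```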

1.1 Series

# Build a Series
import pandas as pd
import numpy as np
s1 = pd.Series(np.arange(5),index = ["a","c","e","d","b"])
print(s1)
print("Index of s1: {}".format(s1.index))
a    0
c    1
e    2
d    3
b    4
dtype: int32
Index of s1: Index(['a', 'c', 'e', 'd', 'b'], dtype='object')

# reindex rearranges the data to follow a new index order; labels that are not found get the missing value NaN
s1.reindex(["a","b","c","d","e","f"])
a    0.0
b    4.0
c    1.0
d    3.0
e    2.0
f    NaN
dtype: float64

# If NaN is not wanted, fill_value sets the value used for the missing labels
s1.reindex(["a","b","c","d","e","f"],fill_value = 0)
a    0
b    4
c    1
d    3
e    2
f    0
dtype: int32

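# Selecting data
print("Positional indexing:\n{}".format(s1.iloc[1]))   # s1[1] also worked in older pandas; iloc makes the positional lookup explicit
print("Slicing:\n{}".format(s1[:3]))
print("Boolean indexing:\n{}".format(s1[s1>0]))
print("Label indexing:\n{}".format(s1["c"]))
Positional indexing:
1
Slicing:
a    0
c    1
e    2
dtype: int32
Boolean indexing:
c    1
e    2
d    3
b    4
dtype: int32
Label indexing:
1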

# One of the most important features of Series: arithmetic automatically aligns data by index label
s1 = pd.Series(np.arange(5),index = ["a","c","e","d","b"])
s2 = pd.Series(np.arange(2,5),index = ["a","c","e"])
s1 + s2
a    2.0
b    NaN
c    4.0
d    NaN
e    6.0
dtype: float64
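Labels present on only one side come out as NaN in the result above. If that is not what you want, one option is the add method with fill_value. This is a small sketch that reuses the same s1 and s2, and treating a missing label as 0 is an assumption that suits addition but not every calculation.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.arange(5), index=["a", "c", "e", "d", "b"])
s2 = pd.Series(np.arange(2, 5), index=["a", "c", "e"])

# A label missing from one side is treated as 0 instead of producing NaN
total = s1.add(s2, fill_value=0)
print(total)
```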

1.2 DataFrame

# Build a DataFrame
df1 = pd.DataFrame(np.arange(1,10).reshape(3,3),index = ["a","b","c"],columns = ["A","B","C"])  # from a NumPy array
df2 = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]],index = ["a","b","c"],columns = ["A","B","C"])     # from a nested list
df3 = pd.DataFrame({"A":[1,4,7],"B":[2,5,8],"C":[3,6,9]},index = ["a","b","c"])                 # from a dict
print("df1 built from a NumPy array:\n{}".format(df1))
print("df2 built from a nested list:\n{}".format(df2))
print("df3 built from a dict:\n{}".format(df3))
df1 built from a NumPy array:
   A  B  C
a  1  2  3
b  4  5  6
c  7  8  9
df2 built from a nested list:
   A  B  C
a  1  2  3
b  4  5  6
c  7  8  9
df3 built from a dict:
   A  B  C
a  1  2  3
b  4  5  6
c  7  8  9

A DataFrame can be built in any of these three ways, and the result is the same. The examples in the rest of this post build DataFrames from NumPy arrays.

# Row and column indexes of df1
print("Row index of df1: {}".format(df1.index))
print("Column index of df1: {}".format(df1.columns))
Row index of df1: Index(['a', 'b', 'c'], dtype='object')
Column index of df1: Index(['A', 'B', 'C'], dtype='object')

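# Selecting data from df1: by column label, by label with loc, by position with iloc, and by mixed label/position with ix
print("Select by column label:\n{}".format(df1[["A","C"]]))  # column labels select one or more columns (note: this does not work for row labels)
Select by column label:
   A  C
a  1  3
b  4  6
c  7  9
print("Select by label with loc:\n{}".format(df1.loc["a","B"]))    # select by label
Select by label with loc:
2
print("Select by label with loc:\n{}".format(df1.loc[["a","c"],["B","C"]]))   # select by label
Select by label with loc:
   B  C
a  2  3
c  8  9
print("Select by position with iloc:\n{}".format(df1.iloc[:2,1:]))    # select by position
Select by position with iloc:
   B  C
a  2  3
b  5  6
print("Select with ix mixed indexing:\n{}".format(df1.ix[["a","c"],1:]))   # mixed label/position indexing; ix is deprecated and was removed in pandas 1.0
Select with ix mixed indexing:
   B  C
a  2  3
c  8  9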

Warning:
C:\Users\wenjianhua\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  """Entry point for launching an IPython kernel.

As the warning shows, ix is deprecated and pandas asks you to use loc or iloc instead (ix was removed entirely in pandas 1.0). So stick to label-based indexing with loc and position-based indexing with iloc.
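For a selection that mixes row labels with column positions, as the ix example did, there are two common replacements. The sketch below is one way to write them, not the only one; the names by_label, by_position and rows are made up for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(1, 10).reshape(3, 3),
                   index=["a", "b", "c"], columns=["A", "B", "C"])

# Option 1: turn the column positions into labels, then use loc
by_label = df1.loc[["a", "c"], df1.columns[1:]]

# Option 2: turn the row labels into positions, then use iloc
rows = df1.index.get_indexer(["a", "c"])
by_position = df1.iloc[rows, 1:]

print(by_label)
```

Both options select rows a and c and every column from position 1 onward.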

2. Common Operations

For both Series and DataFrame, the everyday operations are adding, deleting, modifying and querying. Querying by index was covered above; adding, deleting and modifying are covered next.

2.1 Adding

# Add a column with insert: insert(loc, column, value, allow_duplicates = True/False)
df1 = pd.DataFrame(np.arange(1,10).reshape(3,3),index = ["a","b","c"],columns = ["A","B","C"])
df1.insert(3,"D",[1,2,3])
print(df1)
   A  B  C  D
a  1  2  3  1
b  4  5  6  2
c  7  8  9  3

# Add a column by assigning to a new column label
df1["F"] = [3,5,8]
print(df1)
   A  B  C  D  F
a  1  2  3  1  3
b  4  5  6  2  5
c  7  8  9  3  8

# pd.merge() combines the columns of two tables, like SQL inner, left, right and full outer joins
# pd.merge(df1,df2,on,how = "inner"/"left"/"right"/"outer",left_on,right_on,left_index = True/False,right_index = True/False)
df2 = pd.DataFrame(np.random.randn(4).reshape(2,2),index = ["b","c"],columns = ["X","Y"])
print(pd.merge(df1,df2,left_index = True,right_index = True,how = "inner" ))      # join on the row index
   A  B  C  D  F         X         Y
b  4  5  6  2  5 -0.507837 -0.994847
c  7  8  9  3  8 -0.106815  0.780901
df2 = pd.DataFrame([[1,3],[4,2],[7,10]],index = ["d","e","f"],columns = ["A","Y"])
print(pd.merge(df1,df2,how = "left",on = df1["A"]))        # left join of df2 onto df1, using df1["A"] as the join key
   key_0  A_x  B  C  D  F  A_y   Y
0      1    1  2  3  1  3    1   3
1      4    4  5  6  2  5    4   2
2      7    7  8  9  3  8    7  10

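Here df1["A"] is the join key. Because it is passed as a Series rather than as a column name, the key shows up as an extra key_0 column. Both tables also have a column named A, so the result contains A_x and A_y; these suffixes can be changed with the suffixes parameter. Since the join is on a key rather than on the row index, the merged result gets a new default row index starting from 0.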

print(pd.merge(df1,df2,how = "left",on = df1["A"],suffixes = ["_a","_b"]))
   key_0  A_a  B  C  D  F  A_b   Y
0      1    1  2  3  1  3    1   3
1      4    4  5  6  2  5    4   2
2      7    7  8  9  3  8    7  10
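When both tables share a column name, the more usual call passes that name as a string. This is a minimal sketch for comparison; left and right are illustrative stand-ins for df1 and df2 above, without the D and F columns.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame(np.arange(1, 10).reshape(3, 3),
                    index=["a", "b", "c"], columns=["A", "B", "C"])
right = pd.DataFrame([[1, 3], [4, 2], [7, 10]],
                     index=["d", "e", "f"], columns=["A", "Y"])

# Join on the shared column "A"; the key appears once in the result,
# so there is no key_0 column and no _x/_y suffix
merged = pd.merge(left, right, how="left", on="A")
print(merged)
```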

df2 = pd.DataFrame([["a",3],["b",2],["c",10]],index = ["d","e","f"],columns = ["A","Y"])
print(pd.merge(df1,df2,how = "left",left_index = True,right_on = df2["A"]))    # left key is the row index of df1, right key is df2["A"]
key_0  A_x  B  C  D  F A_y   Y
d     a    1  2  3  1  3   a   3
e     b    4  5  6  2  5   b   2
f     c    7  8  9  3  8   c  10

# Add rows with pd.concat(): pd.concat(objs, ignore_index = True/False)
df3 = pd.DataFrame(np.random.randn(10).reshape(2,5),index = ["a","b"],columns = ["A","B","C","D","F"])
print(df3)
          A         B         C         D         F
a -0.813396  0.596950  0.299149  1.289255  1.172996
b -0.672459  0.314604  0.497577 -0.968354 -0.600552

df4 = pd.concat([df1,df3],ignore_index = True)  # after concatenation the row index may be non-contiguous or contain duplicates; ignore_index resets it
print(df4)
          A         B         C         D         F
0  1.000000  2.000000  3.000000  1.000000  3.000000
1  4.000000  5.000000  6.000000  2.000000  5.000000
2  7.000000  8.000000  9.000000  3.000000  8.000000
3 -0.813396  0.596950  0.299149  1.289255  1.172996
4 -0.672459  0.314604  0.497577 -0.968354 -0.600552
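To see what ignore_index changes, the sketch below concatenates two small frames both ways. The frames first and second are made up for illustration.

```python
import pandas as pd

first = pd.DataFrame({"A": [1, 2]}, index=["a", "b"])
second = pd.DataFrame({"A": [3, 4]}, index=["a", "b"])

# Default: the original row labels are kept, so labels repeat
kept = pd.concat([first, second])

# ignore_index=True: the row labels are replaced by 0..n-1
reset = pd.concat([first, second], ignore_index=True)

print(kept.index.tolist())
print(reset.index.tolist())
```

With repeated labels, a lookup such as kept.loc["a"] returns two rows, which is often a surprise further down the line.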

2.2 Deleting

# Delete a column with del
del df1["D"]
print(df1)
   A  B  C  F
a  1  2  3  3
b  4  5  6  5
c  7  8  9  8

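# drop deletes one or more columns/rows: drop(labels, axis = 0/1, inplace = True/False)
# axis = 1 deletes columns
# inplace = True modifies the original data
print(df1.drop(columns = ["B","C"]))           # no axis needed when using columns
   A  F
a  1  3
b  4  5
c  7  8
print(df1.drop(["C"],axis = 1))              # when passing column labels directly, axis = 1 must be specified
   A  B  F
a  1  2  3
b  4  5  5
c  7  8  8
print(df1.drop(["b"],axis = 0))            # row labels passed directly; axis = 0 is the default and is written out here for clarity
   A  B  C  F
a  1  2  3  3
c  7  8  9  8
print(df1.drop(df1.columns[[1,3]],axis = 1))          # when selecting columns by position, axis = 1 must be specified
   A  C
a  1  3
b  4  6
c  7  9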

After each of these drop calls df1 is unchanged, because inplace defaults to False and the original data is not modified. To modify the original, set inplace to True.

df1.drop(df1.columns[[1,3]],axis = 1,inplace = True)
print(df1)
   A  C
a  1  3
b  4  6
c  7  9
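An alternative to inplace = True is to assign the result of drop to a name. The sketch below uses an illustrative frame named original to show that drop on its own leaves its input untouched.

```python
import numpy as np
import pandas as pd

original = pd.DataFrame(np.arange(1, 10).reshape(3, 3),
                        index=["a", "b", "c"], columns=["A", "B", "C"])

# drop returns a new DataFrame; `original` still has all three columns
dropped = original.drop(columns=["B"])

print(original.columns.tolist())
print(dropped.columns.tolist())
```

Writing original = original.drop(columns=["B"]) has the same visible effect as inplace = True; which one to use is mostly a matter of style.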

# A column can also be removed with pop(), which deletes the column and returns it
print(df1.pop("C"))
a    3
b    6
c    9
Name: C, dtype: int32

2.3 Modifying

df1 = pd.DataFrame(np.arange(1,10).reshape(3,3),index = ["a","b","c"],columns = ["A","B","C"]) 
print(df1)
   A  B  C
a  1  2  3
b  4  5  6
c  7  8  9

# Values can be modified directly through label indexing
df1["A"] = 0
print(df1)
   A  B  C
a  0  2  3
b  0  5  6
c  0  8  9

# Values can also be modified through positional indexing
df1.iloc[:,0] = [1,4,7]
print(df1)
   A  B  C
a  1  2  3
b  4  5  6
c  7  8  9

# Values can also be modified with replace()
print(df1.replace(5,np.nan))   # np.nan rather than np.NaN: the NaN alias was removed in NumPy 2.0
   A    B  C
a  1  2.0  3
b  4  NaN  6
c  7  8.0  9

# Row/column labels can be changed by assigning to index/columns; the one drawback is that the full list must be supplied even to change a single label
print(df1.index)
Index(['a', 'b', 'c'], dtype='object')
df1.index = ["a","c","d"]
print(df1.index)
Index(['a', 'c', 'd'], dtype='object')

# A more convenient way to change row/column labels is rename
df1.rename(index = {"c":"b","d":"c"},inplace = True)
print(df1.index)
Index(['a', 'b', 'c'], dtype='object')
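rename takes a mapping for columns as well, and only the labels named in the mapping change. This is a short sketch; the new labels row_b and alpha are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 10).reshape(3, 3),
                  index=["a", "b", "c"], columns=["A", "B", "C"])

# Labels not listed in the mappings are kept as they are;
# without inplace=True a new DataFrame is returned
renamed = df.rename(index={"b": "row_b"}, columns={"A": "alpha"})
print(renamed)
```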

3. Descriptive statistics

  • count: number of non-NA values
  • describe: summary statistics for a Series or for each numeric column of a DataFrame
  • min: minimum
  • max: maximum
  • quantile: quantile
  • sum: sum
  • mean: mean (also in NumPy)
  • median: median (also in NumPy)
  • mad: mean absolute deviation from the mean (removed in pandas 2.0)
  • var: variance (also in NumPy)
  • std: standard deviation (also in NumPy)
  • skew: skewness (third moment)
  • kurt: kurtosis (fourth moment)
def summary(x):
    print("Sum:",sum(x))
    print("Min:",min(x))
    print("Max:",max(x))
    print("Mean:",np.mean(x))
    print("Median:",np.median(x))
    print("Variance:",np.var(x))
    print("Standard deviation:",np.std(x))

x = np.random.randn(9)
print(x)
[ 0.3278447   0.35207847  1.49437789 -2.11866331  1.13839775 -1.63416003
  0.3934872   0.85271989  0.36729024]

summary(x)
Sum: 1.173372792257903
Min: -2.1186633130744097
Max: 1.4943778881381498
Mean: 0.13037475469532256
Median: 0.3672902376436664
Variance: 1.3092951819301024
Standard deviation: 1.1442443715964272

We can define our own function to compute each descriptive statistic, as above, but describe does the same in a single call. Note that describe is only available on a Series or a DataFrame.

print(df1.describe())
         A    B    C
count  3.0  3.0  3.0
mean   4.0  5.0  6.0
std    3.0  3.0  3.0
min    1.0  2.0  3.0
25%    2.5  3.5  4.5
50%    4.0  5.0  6.0
75%    5.5  6.5  7.5
max    7.0  8.0  9.0
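The statistics listed above are also available one at a time as Series and DataFrame methods. One detail worth knowing is that pandas and NumPy use different defaults for variance and standard deviation: pandas divides by n - 1 (ddof=1), NumPy by n (ddof=0). The sketch below uses the values of column A from df1 to show why describe reports a std of 3.0.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 4, 7])

# Individual statistics as methods
print(s.sum(), s.min(), s.max(), s.mean(), s.median())
print(s.quantile(0.25))

# pandas: sample variance / standard deviation (ddof=1)
print(s.var(), s.std())

# NumPy: population variance / standard deviation (ddof=0)
print(np.var(s.to_numpy()), np.std(s.to_numpy()))

# Passing ddof=0 makes pandas agree with NumPy
print(s.var(ddof=0))
```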

Origin blog.csdn.net/d345389812/article/details/88718750