Python study notes: pandas

I. Overview
pandas is a tool built on top of NumPy, created to solve data analysis tasks. It incorporates a large number of libraries and some standard data models, and provides the functions needed to operate on large datasets efficiently. pandas offers many functions and methods that let us handle data quickly and easily; you will soon find it is one of the key factors that make Python a powerful and efficient data analysis environment.

II. Code Walkthrough

Lecture 1: Basics

The two core data structures in pandas are Series and DataFrame.

A comparison of the two:

Name       Dimensions  Explanation
Series     1-D         Can store data of different types
DataFrame  2-D         A labeled table structure of variable size that may hold several data types

Defining a Series
np.nan corresponds to null and is used for missing values.

s = pd.Series([1,3,6,np.nan,44,1])
print(s)

result:

0     1.0
1     3.0
2     6.0
3     NaN
4    44.0
5     1.0
dtype: float64
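
As noted above, a Series can hold data of different types. A minimal illustrative sketch (the index labels 'a' through 'd' are made up for the example):

```python
import numpy as np
import pandas as pd

# Mixed element types force the Series to fall back to dtype 'object'.
s = pd.Series([1, 'two', 3.0, np.nan], index=['a', 'b', 'c', 'd'])
print(s.dtype)
print(s['b'])
```

A homogeneous numeric Series, like the one in the example above, instead gets a numeric dtype such as float64.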

Defining a DataFrame
pd.date_range() generates a sequence of dates.
pd.DataFrame(data, index, columns) takes three arguments:
the first argument is the data;
the second argument is the index, i.e. the row labels;
the third argument is the column names.
If only data is given, default integer labels are used.

pd.Categorical() stores the content as numeric codes, which is faster.

dates = pd.date_range('20190607',periods=6)
print(dates)

df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=['a','b','c','d'])
print(df)

df1 = pd.DataFrame(np.arange(12).reshape((3,4)))    # default DataFrame rules
print(df1)

df2 = pd.DataFrame({'A':1.,
                    'B':pd.Timestamp('20130102'),
                    'C':pd.Series(1,index=list(range(4)),dtype='float32'),
                    'D':np.array([3]*4,dtype='int32'),
                    'E':pd.Categorical(['test','train','test','train']),
                    'F':'foo'
                    })
print(df2)

result:

DatetimeIndex(['2019-06-07', '2019-06-08', '2019-06-09', '2019-06-10',
               '2019-06-11', '2019-06-12'],
              dtype='datetime64[ns]', freq='D')
                   a         b         c         d
2019-06-07  0.484568 -0.439881 -0.960222 -1.520919
2019-06-08  1.054979  1.705260 -0.369167 -0.323814
2019-06-09  1.735345  0.404412 -0.306179 -0.380139
2019-06-10  2.583616  0.947599  0.700119 -3.001477
2019-06-11 -0.469525 -0.147207 -0.044570 -1.684648
2019-06-12 -0.345939  0.294284 -0.434633  0.006824
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

Here are some DataFrame attributes; df2 refers to the DataFrame above.

Name                                     Explanation
df2.dtypes                               Output the data type of each column
df2.index                                Output the row labels
df2.columns                              Output the column names
df2.values                               Output all the values
df2.describe()                           Output summary statistics of the DataFrame (numeric columns only)
df2.T                                    Output the transpose of df2
df2.sort_index(axis=1,ascending=False)   Sort by column names; axis=1 selects columns, ascending=False reverses the order
df2.sort_index(axis=0,ascending=False)   Sort by row labels; axis=0 selects rows, ascending=False reverses the order
df2.sort_values(by='E')                  Sort by the values in column E

Code:

print(df2.dtypes)         # output the data type of each column
print(df2.index)            # output the row labels
print(df2.columns)          # output the column names
print(df2.values)           # output all the values
print(df2.describe())       # output summary statistics
print(df2.T)                # output the transpose of df2
print(df2.sort_index(axis=1,ascending=False))   # sort by column names
print(df2.sort_index(axis=0,ascending=False))   # sort by row labels
print(df2.sort_values(by='E'))                  # sort by values

Output:

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

Int64Index([0, 1, 2, 3], dtype='int64')

Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')

[[1.0 Timestamp('2013-01-02 00:00:00') 1.0 3 'test' 'foo']
 [1.0 Timestamp('2013-01-02 00:00:00') 1.0 3 'train' 'foo']
 [1.0 Timestamp('2013-01-02 00:00:00') 1.0 3 'test' 'foo']
 [1.0 Timestamp('2013-01-02 00:00:00') 1.0 3 'train' 'foo']]
 
         A    C    D
count  4.0  4.0  4.0
mean   1.0  1.0  3.0
std    0.0  0.0  0.0
min    1.0  1.0  3.0
25%    1.0  1.0  3.0
50%    1.0  1.0  3.0
75%    1.0  1.0  3.0
max    1.0  1.0  3.0

                     0         ...                             3
A                    1         ...                             1
B  2013-01-02 00:00:00         ...           2013-01-02 00:00:00
C                    1         ...                             1
D                    3         ...                             3
E                 test         ...                         train
F                  foo         ...                           foo

[6 rows x 4 columns]

     F      E  D    C          B    A
0  foo   test  3  1.0 2013-01-02  1.0
1  foo  train  3  1.0 2013-01-02  1.0
2  foo   test  3  1.0 2013-01-02  1.0
3  foo  train  3  1.0 2013-01-02  1.0

     A          B    C  D      E    F
3  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
0  1.0 2013-01-02  1.0  3   test  foo

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
2  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
3  1.0 2013-01-02  1.0  3  train  foo

Lecture 2: Selecting Data

The first way: plain indexing.
df['A'] and df.A both select column A.
df[0:3] and df['20190608':'20190610'] both select rows 0 through 3: the former slices by row position, the latter by row label.
Code:

dates = pd.date_range('20190607',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
print(df)
print(df['A'],df.A)                # select column A
print(df[0:3],df['20190608':'20190610'])   # select rows 0:3

result:

             A   B   C   D
2019-06-07   0   1   2   3
2019-06-08   4   5   6   7
2019-06-09   8   9  10  11
2019-06-10  12  13  14  15
2019-06-11  16  17  18  19
2019-06-12  20  21  22  23

2019-06-07     0
2019-06-08     4
2019-06-09     8
2019-06-10    12
2019-06-11    16
2019-06-12    20
Freq: D, Name: A, dtype: int32 

2019-06-07     0
2019-06-08     4
2019-06-09     8
2019-06-10    12
2019-06-11    16
2019-06-12    20
Freq: D, Name: A, dtype: int32

            A  B   C   D
2019-06-07  0  1   2   3
2019-06-08  4  5   6   7
2019-06-09  8  9  10  11       
      
             A   B   C   D
2019-06-08   4   5   6   7
2019-06-09   8   9  10  11
2019-06-10  12  13  14  15

The second way: label indexing.
Select by label: the loc method.
The code below uses the same df as above.
Code:

print(df.loc['20190608'])       # select the row 20190608
print(df.loc[:,['A','B']])      # select columns A and B
print(df.loc['20190607',['A','B']])  # select row 20190607 and columns A and B

result:

A    4
B    5
C    6
D    7
Name: 2019-06-08 00:00:00, dtype: int32

             A   B
2019-06-07   0   1
2019-06-08   4   5
2019-06-09   8   9
2019-06-10  12  13
2019-06-11  16  17
2019-06-12  20  21

A    0
B    1
Name: 2019-06-07 00:00:00, dtype: int32

The third way: position indexing.
Select by position: the iloc method.
iloc[] indexes by row and column number only; unlike loc, it cannot take labels.
Code:

# select by position: iloc       # index by row and column number
print(df.iloc[3:5,1:3])          # slice rows 3:5, columns 1:3
print(df.iloc[[1,3,5],1:3])      # slice rows 1, 3, 5, columns 1:3

result:

             B   C
2019-06-10  13  14
2019-06-11  17  18

             B   C
2019-06-08   5   6
2019-06-10  13  14
2019-06-12  21  22

The fourth way: ix.
Mixed selection: the ix method can mix labels and row numbers.
(Note: ix was deprecated and has been removed in recent pandas versions; use loc and iloc instead.)
Code:

# mixed selection:ix
print(df.ix[:3,['A','C']])       # select the first three rows and columns A and C

result:

            A   C
2019-06-07  0   2
2019-06-08  4   6
2019-06-09  8  10
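
Since ix has been removed in recent pandas releases, the same selection can be reproduced by chaining iloc (positional rows) and loc (label columns). A sketch rebuilding the df used above:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20190607', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates,
                  columns=['A', 'B', 'C', 'D'])

# Equivalent to the old df.ix[:3, ['A', 'C']]: rows by position,
# columns by label.
res = df.iloc[:3].loc[:, ['A', 'C']]
print(res)
```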

The fifth way: Boolean indexing.
df.A > 8 selects all rows where column A is greater than 8.
Code:

# Boolean indexing
print(df[df.A>8])   # all rows where df.A is greater than 8

result:

             A   B   C   D
2019-06-10  12  13  14  15
2019-06-11  16  17  18  19
2019-06-12  20  21  22  23

Lecture 3: Setting Values

The values of a DataFrame can be changed in several ways; any of the selection methods above can be used on the left-hand side of an assignment:
loc
iloc
df[df.A > 4]
df['F']
When df[] holds a single item it refers to a column; if it is a slice such as 0:3, it refers to rows 0 through 3.

Code:

import numpy as np
import pandas as pd

dates = pd.date_range('20190607',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
print(df)
df.iloc[2,2] =1111                 # change a value by position
df.loc['20190610','B']=2222              # change a value by label
#df[df.A>4] = 0                     # rows where column A is greater than 4 become 0 entirely, including column B
#df.A[df.A>4] = 0                  # only the items in column A greater than 4 become 0
df.B[df.A>4] = 0                  # where column A is greater than 4, the corresponding items in column B become 0
df['F'] = np.nan
df['E'] = pd.Series([1,2,3,4,5,6],index=pd.date_range('20190607',periods=6))
print(df)

result:

             A   B   C   D
2019-06-07   0   1   2   3
2019-06-08   4   5   6   7
2019-06-09   8   9  10  11
2019-06-10  12  13  14  15
2019-06-11  16  17  18  19
2019-06-12  20  21  22  23


             A  B     C   D   F  E
2019-06-07   0  1     2   3 NaN  1
2019-06-08   4  5     6   7 NaN  2
2019-06-09   8  0  1111  11 NaN  3
2019-06-10  12  0    14  15 NaN  4
2019-06-11  16  0    18  19 NaN  5
2019-06-12  20  0    22  23 NaN  6
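
One caveat about df.B[df.A > 4] = 0 above: it is a chained assignment, which newer pandas versions flag with SettingWithCopyWarning. A sketch of the same update done in a single .loc call:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20190607', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates,
                  columns=['A', 'B', 'C', 'D'])

# One indexing step: select rows where A > 4 and column B, then assign.
df.loc[df.A > 4, 'B'] = 0
print(df['B'].tolist())
```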

Lecture 4: Handling Missing Data

Attribute                      Explanation
df.dropna(axis=1,how='any')    Drop columns with missing values; how={'any','all'} sets the dropping mode: 'any' drops the column as soon as it contains one or more NaN values, while 'all' drops it only when every value in the column is NaN
df.fillna(value=0)             Fill NaN values with the given value
df.isnull()                    Return True at every position that holds a NaN value
np.any(df.isnull())==True      Check whether there is at least one NaN value; returns True if so

Code:

import numpy as np
import pandas as pd

dates = pd.date_range('20190607',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])

df.iloc[0,1] = np.nan
df.iloc[1,2] = np.nan
print(df)
print(df.dropna(axis=1,how='any'))   # how={'any','all'}: how sets the dropping mode
print(df.fillna(value=0))              # fill NaN values with value
print(df.isnull())                     # isnull marks NaN positions with True

print(np.any(df.isnull())==True)        # check whether there is at least one NaN; returns True if so

result:

             A     B     C   D
2019-06-07   0   NaN   2.0   3
2019-06-08   4   5.0   NaN   7
2019-06-09   8   9.0  10.0  11
2019-06-10  12  13.0  14.0  15
2019-06-11  16  17.0  18.0  19
2019-06-12  20  21.0  22.0  23

             A   D
2019-06-07   0   3
2019-06-08   4   7
2019-06-09   8  11
2019-06-10  12  15
2019-06-11  16  19
2019-06-12  20  23

             A     B     C   D
2019-06-07   0   0.0   2.0   3
2019-06-08   4   5.0   0.0   7
2019-06-09   8   9.0  10.0  11
2019-06-10  12  13.0  14.0  15
2019-06-11  16  17.0  18.0  19
2019-06-12  20  21.0  22.0  23

                A      B      C      D
2019-06-07  False   True  False  False
2019-06-08  False  False   True  False
2019-06-09  False  False  False  False
2019-06-10  False  False  False  False
2019-06-11  False  False  False  False
2019-06-12  False  False  False  False

True
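
To make the difference between how='any' and how='all' concrete, a small sketch with made-up values (column B is entirely NaN, column A only partially):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, np.nan, np.nan],
                   'C': [7.0, 8.0, 9.0]})

# 'any' drops a column with at least one NaN; 'all' drops a column
# only when every value in it is NaN.
print(df.dropna(axis=1, how='any').columns.tolist())
print(df.dropna(axis=1, how='all').columns.tolist())
```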

Lecture 5: Import and Export

There are many ways to import and export data, but they all follow a very similar format: read one format in, save another format out.
Consult the official documentation for details; here we just take a brief look at the usage.
Code:

import pandas as pd

data = pd.read_csv('student.csv')   # read the file
print(data)
print(type(data))

data.to_pickle('student.pickle')    # save the file

The code reads student.csv into data.
The type of data is pandas.core.frame.DataFrame.
It is then stored in pickle form, i.e. serialized.
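
The read/write pairs follow the same naming pattern across formats (to_csv/read_csv, to_pickle/read_pickle, and so on). A self-contained round-trip sketch using fabricated data, since student.csv itself is not included here:

```python
import os
import tempfile

import pandas as pd

# Fabricated stand-in for student.csv.
df = pd.DataFrame({'name': ['Ann', 'Bob'], 'score': [90, 85]})

tmpdir = tempfile.mkdtemp()
csv_path = os.path.join(tmpdir, 'demo.csv')
pkl_path = os.path.join(tmpdir, 'demo.pickle')

df.to_csv(csv_path, index=False)     # export as CSV
df.to_pickle(pkl_path)               # export as pickle (serialized)

back_csv = pd.read_csv(csv_path)     # import the CSV back
back_pkl = pd.read_pickle(pkl_path)  # import the pickle back

print(back_csv.equals(df), back_pkl.equals(df))
```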

Lecture 6: Merging with concat

A brief look at merging DataFrames with concat.
pd.concat([df1, df2, df3], axis=0)  # vertical merge
When axis is 1, the merge is horizontal instead.
pd.concat([df1, df2, df3], axis=0, ignore_index=True)  # vertical merge, ignoring the index
When ignore_index is True, the original index of each DataFrame is discarded and the rows are renumbered.
Code:

# concatenating
df1 = pd.DataFrame(np.ones((3,4))*0,columns=['a','b','c','d'])
df2 = pd.DataFrame(np.ones((3,4))*1,columns=['a','b','c','d'])
df3 = pd.DataFrame(np.ones((3,4))*2,columns=['a','b','c','d'])

res = pd.concat([df1,df2,df3],axis=0)         # vertical merge
print(res)

res = pd.concat([df1,df2,df3],axis=0,ignore_index=True)         # vertical merge, ignoring the index
print(res)

Next, another concat parameter: join={'inner','outer'}.
It is used when the tables being merged have only partially matching column names. 'outer' is the default mode: the merged table contains the union of all the columns, with missing values set to NaN. 'inner' keeps only the intersection of the tables' columns.
Note: newer pandas versions add a sort parameter to concat; passing it explicitly avoids a warning.
The following code demonstrates:

df1 = pd.DataFrame(np.ones((3,4))*0,columns=['a','b','c','d'],index=[1,2,3])
df2 = pd.DataFrame(np.ones((3,4))*1,columns=['b','c','d','e'],index=[2,3,4])


res = pd.concat([df1,df2],join='outer',sort = True)   # default join mode
print(res)

res = pd.concat([df1,df2],join='inner',ignore_index=True)    # inner mode
print(res)

result:

     a    b    c    d    e
1  0.0  0.0  0.0  0.0  NaN
2  0.0  0.0  0.0  0.0  NaN
3  0.0  0.0  0.0  0.0  NaN
2  NaN  1.0  1.0  1.0  1.0
3  NaN  1.0  1.0  1.0  1.0
4  NaN  1.0  1.0  1.0  1.0

     b    c    d
0  0.0  0.0  0.0
1  0.0  0.0  0.0
2  0.0  0.0  0.0
3  1.0  1.0  1.0
4  1.0  1.0  1.0
5  1.0  1.0  1.0

Another parameter: join_axes.
join_axes=[df1.index] merges using the index of table df1.
Code:

# join_axes
df1 = pd.DataFrame(np.ones((3,4))*0,columns=['a','b','c','d'],index=[1,2,3])
df2 = pd.DataFrame(np.ones((3,4))*1,columns=['b','c','d','e'],index=[2,3,4])
res = pd.concat([df1,df2],axis=1,join_axes=[df1.index])       # horizontal merge, aligned on df1's index
print(res)

result:

     a    b    c    d    b    c    d    e
1  0.0  0.0  0.0  0.0  NaN  NaN  NaN  NaN
2  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
3  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
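
Note that join_axes was later removed from pd.concat (in pandas 1.0). A sketch of the modern equivalent: concatenate with the default outer join, then reindex on df1's index:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'], index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['b', 'c', 'd', 'e'], index=[2, 3, 4])

# Horizontal merge, then keep only the rows labeled by df1's index;
# df2 has no row 1, so its columns get NaN there.
res = pd.concat([df1, df2], axis=1).reindex(df1.index)
print(res)
```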

The append() method
This method is called on a DataFrame and extends that table with the given frames.
A Series can also be appended to a DataFrame.
Code:

# append
df1 = pd.DataFrame(np.ones((3,4))*0,columns=['a','b','c','d'])
df2 = pd.DataFrame(np.ones((3,4))*1,columns=['a','b','c','d'])
df3 = pd.DataFrame(np.ones((3,4))*1,columns=['a','b','c','d'])
res=df1.append([df2,df3],ignore_index=True)          # vertical append
print(res)
s1 = pd.Series([1,2,3,4],index=['a','b','c','d'])
res = df1.append(s1,ignore_index=True)    # append a Series
print(res)

result:

     a    b    c    d
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
3  1.0  1.0  1.0  1.0
4  1.0  1.0  1.0  1.0
5  1.0  1.0  1.0  1.0
6  1.0  1.0  1.0  1.0
7  1.0  1.0  1.0  1.0
8  1.0  1.0  1.0  1.0

     a    b    c    d
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
3  1.0  2.0  3.0  4.0
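
A caveat: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. Both appends above can be expressed with pd.concat; a sketch:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['a', 'b', 'c', 'd'])
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# Vertical append of frames.
res = pd.concat([df1, df2], ignore_index=True)
# Appending a Series: turn it into a one-row frame first.
res_s = pd.concat([df1, s1.to_frame().T], ignore_index=True)
print(res.shape, res_s.shape)
```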

Lecture 7: Merging with merge

merge combines tables on keyword columns, somewhat like a primary key in a database.
Let's start with a simple example.
Code:

import pandas as pd
# merging two df by key/keys.(may be used in database)
# simple example
left = pd.DataFrame({'key':['K0','K1','K2','K3'],
                     'A':['A0','A1','A2','A3'],
                     'B':['B0','B1','B2','B3']})
right = pd.DataFrame({'key':['K0','K1','K2','K3'],
                     'C':['C0','C1','C2','C3'],
                      'D':['D0','D1','D2','D3']})

print(left)
print(right)
res = pd.merge(left,right,on='key')
print(res)

result:

    A   B key
0  A0  B0  K0
1  A1  B1  K1
2  A2  B2  K2
3  A3  B3  K3

    C   D key
0  C0  D0  K0
1  C1  D1  K1
2  C2  D2  K2
3  C3  D3  K3

    A   B key   C   D
0  A0  B0  K0  C0  D0
1  A1  B1  K1  C1  D1
2  A2  B2  K2  C2  D2
3  A3  B3  K3  C3  D3

As you can see, the rows are combined according to the values in the key column.

The next example merges on two keys.
merge() has a how parameter:
how = ['left','right','outer','inner']
'left' keeps the keys of the first DataFrame;
'right' keeps the keys of the second DataFrame;
'outer' keeps the union of the keys from both DataFrames;
'inner' (the default) keeps only the keys present in both.

# consider two keys
left = pd.DataFrame({'key1':['K0','K0','K1','K2'],
                     'key2':['K0','K1','K0','K1'],
                     'A':['A0','A1','A2','A3'],
                     'B':['B0','B1','B2','B3']})
right = pd.DataFrame({'key1':['K0','K1','K1','K2'],
                     'key2':['K0','K0','K0','K0'],
                     'C':['C0','C1','C2','C3'],
                      'D':['D0','D1','D2','D3']})

print(left)
print(right)
# how options: 'left', 'right', 'outer', 'inner'
res = pd.merge(left,right,on=['key1','key2'],how='inner')
print(res)

result:

    A   B key1 key2
0  A0  B0   K0   K0
1  A1  B1   K0   K1
2  A2  B2   K1   K0
3  A3  B3   K2   K1

    C   D key1 key2
0  C0  D0   K0   K0
1  C1  D1   K1   K0
2  C2  D2   K1   K0
3  C3  D3   K2   K0

    A   B key1 key2   C   D
0  A0  B0   K0   K0  C0  D0
1  A2  B2   K1   K0  C1  D1
2  A2  B2   K1   K0  C2  D2
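
To see how the how modes differ on the same data, a sketch comparing 'inner' and 'outer' on the two-key frames above:

```python
import pandas as pd

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

# 'inner' keeps only key pairs present in both frames (3 rows here);
# 'outer' keeps every key pair from either frame, filling gaps with NaN.
inner = pd.merge(left, right, on=['key1', 'key2'], how='inner')
outer = pd.merge(left, right, on=['key1', 'key2'], how='outer')
print(len(inner), len(outer))
```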

The next parameter: indicator.
When indicator=True, an extra column is added recording in which of the two DataFrames each row's key appears.
When indicator='indicator_column', that extra column is given the custom name.
Code:

# indicator
df1 = pd.DataFrame({'col1':[0,1],'col_left':['a','b']})
df2 = pd.DataFrame({'col1':[1,2,2],'col_right':[2,2,2]})
print(df1)
print(df2)
#res = pd.merge(df1,df2,on='col1',how='outer',indicator=True)   # add a column recording where each key appears
# give the indicator a custom name
res = pd.merge(df1,df2,on='col1',how='outer',indicator='indicator_column')
print(res)

result:

   col1 col_left
0     0        a
1     1        b

   col1  col_right
0     1          2
1     2          2
2     2          2

   col1 col_left  col_right indicator_column
0     0        a        NaN        left_only
1     1        b        2.0             both
2     2      NaN        2.0       right_only
3     2      NaN        2.0       right_only

Merging on the index
pd.merge(left, right, left_index=True, right_index=True, how='outer')
Set both left_index and right_index to True;
how works the same as before.
Code:

# merged by index

left = pd.DataFrame({'A':['A0','A1','A2'],
                     'B':['B0','B1','B2']},
                    index=['K0','K1','K2'])
right = pd.DataFrame({ 'C':['C0','C1','C2'],
                      'D':['D0','D1','D2']},
                     index=['K0','K2','K3'])
print(left)
print(right)

# left_index and right_index
res = pd.merge(left,right,left_index=True,right_index=True,how='outer')   # merge on the index
# res = pd.merge(left,right,left_index=True,right_index=True,how='inner')
print(res)
print('*'*50)

result:

     A   B
K0  A0  B0
K1  A1  B1
K2  A2  B2

     C   D
K0  C0  D0
K2  C1  D1
K3  C2  D2

      A    B    C    D
K0   A0   B0   C0   D0
K1   A1   B1  NaN  NaN
K2   A2   B2   C1   D1
K3  NaN  NaN   C2   D2

When the two tables share a column name besides the key, suffixes tells merge how to distinguish the overlapping columns:
pd.merge(boys, girls, on='k', suffixes=['_boy','_girl'], how='outer')
The suffixes parameter appends the given strings to the names of the overlapping columns in the merged table.
Code:

# handle overlapping
boys = pd.DataFrame({'k':['K0','K1','K2'],'age':[1,2,3]})
girls = pd.DataFrame({'k':['K0','K0','K3'],'age':[4,5,6]})

print(boys)
print(girls)
res = pd.merge(boys,girls,on='k',suffixes=['_boy','_girl'],how='outer')
print(res)

result:

   age   k
0    1  K0
1    2  K1
2    3  K2

   age   k
0    4  K0
1    5  K0
2    6  K3

   age_boy   k  age_girl
0      1.0  K0       4.0
1      1.0  K0       5.0
2      2.0  K1       NaN
3      3.0  K2       NaN
4      NaN  K3       6.0

Lecture 8: Plotting with plot

plot is usually used with matplotlib and can also be used directly from pandas.
Plot kinds:
'bar', 'hist', 'box', 'kde', 'area', 'scatter', 'hexbin', 'pie'
plot supports many chart types; consult the official documentation for the ones you need. Here we show basic usage of plot:
we generate a set of random data and draw it.
Code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# plot data

# Series  (one-dimensional data)
data = pd.Series(np.random.randn(1000),index=np.arange(1000))
data = data.cumsum()

# DataFrame
data = pd.DataFrame(np.random.randn(1000,4),
                    index=np.arange(1000),
                    columns=list("ABCD"))
print(data.head())
data=data.cumsum()  # cumulative sum

data.plot()
# plot method:
# 'bar','hist','box','kde','area','scatter','hexbin','pie'
ax=data.plot.scatter(x='A',y='B',color='DarkBlue',label='Class 1')
data.plot.scatter(x='A',y='C',color='DarkGreen',label='Class 2',ax=ax)
plt.show()

Result: a line chart of the cumulative sums, and a scatter plot of columns A vs. B and A vs. C (figures omitted).

(To be continued)

Origin blog.csdn.net/Prince_IT/article/details/92402790