I. The concept
pandas is a tool built on top of NumPy for solving data analysis tasks. It incorporates a large number of libraries and some standard data models, and provides the tools needed to operate on large datasets efficiently. pandas offers a wealth of functions and methods that let us handle data quickly and easily. You will soon find that it is one of the key factors that make Python a powerful and efficient data analysis environment.
II. Code walkthrough
1. Basics
The core of pandas is two data structures: Series and DataFrame.
The two are compared below:
Name | Dimensions | Explanation |
---|---|---|
Series | 1-dimensional | Can store data of different types |
DataFrame | 2-dimensional | A table structure with labels, variable in size, which can hold a variety of data types |
Defining a Series
np.nan corresponds to null and is used to define missing values
s = pd.Series([1,3,6,np.nan,44,1])
print(s)
result:
0 1.0
1 3.0
2 6.0
3 NaN
4 44.0
5 1.0
dtype: float64
Defining a DataFrame
pd.date_range() generates a sequence of dates.
DataFrame(data, index, columns) takes three arguments:
the first argument is the data;
the second is index, the row labels;
the third, columns, gives the column names.
With only data given, the default rule numbers rows and columns from 0.
pd.Categorical() stores the content as categories, which is faster.
dates = pd.date_range('20190607',periods=6)
print(dates)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=['a','b','c','d'])
print(df)
df1 = pd.DataFrame(np.arange(12).reshape((3,4))) # default df rules: integer row and column labels
print(df1)
df2 = pd.DataFrame({'A':1.,
'B':pd.Timestamp('20130102'),
'C':pd.Series(1,index=list(range(4)),dtype='float32'),
'D':np.array([3]*4,dtype='int32'),
'E':pd.Categorical(['test','train','test','train']),
'F':'foo'
})
print(df2)
result:
DatetimeIndex(['2019-06-07', '2019-06-08', '2019-06-09', '2019-06-10',
'2019-06-11', '2019-06-12'],
dtype='datetime64[ns]', freq='D')
a b c d
2019-06-07 0.484568 -0.439881 -0.960222 -1.520919
2019-06-08 1.054979 1.705260 -0.369167 -0.323814
2019-06-09 1.735345 0.404412 -0.306179 -0.380139
2019-06-10 2.583616 0.947599 0.700119 -3.001477
2019-06-11 -0.469525 -0.147207 -0.044570 -1.684648
2019-06-12 -0.345939 0.294284 -0.434633 0.006824
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
Here are some DataFrame properties.
df2 refers to the DataFrame table above.
Name | Explanation |
---|---|
df2.dtypes | Output the data type of each column |
df2.index | Output the row labels |
df2.columns | Output the column names |
df2.values | Output all the values |
df2.describe() | Output summary statistics of the DataFrame (numeric columns only) |
df2.T | Output the transpose of df2 |
df2.sort_index(axis=1,ascending=False) | Output sorted by column name; axis=1 means columns, ascending=False means reverse order |
df2.sort_index(axis=0,ascending=False) | Output sorted by row label; axis=0 means rows, ascending=False means reverse order |
df2.sort_values(by='E') | Output sorted by the values of column E |
Code:
print(df2.dtypes) # output the data type of each column
print(df2.index) # output the row labels
print(df2.columns) # output the column names
print(df2.values) # output all the values
print(df2.describe()) # output summary statistics of the dataframe
print(df2.T) # output the transpose of df2
print(df2.sort_index(axis=1,ascending=False)) # sort by column name
print(df2.sort_index(axis=0,ascending=False)) # sort by row label
print(df2.sort_values(by='E')) # sort by the values of column E
Output:
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
Int64Index([0, 1, 2, 3], dtype='int64')
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
[[1.0 Timestamp('2013-01-02 00:00:00') 1.0 3 'test' 'foo']
[1.0 Timestamp('2013-01-02 00:00:00') 1.0 3 'train' 'foo']
[1.0 Timestamp('2013-01-02 00:00:00') 1.0 3 'test' 'foo']
[1.0 Timestamp('2013-01-02 00:00:00') 1.0 3 'train' 'foo']]
A C D
count 4.0 4.0 4.0
mean 1.0 1.0 3.0
std 0.0 0.0 0.0
min 1.0 1.0 3.0
25% 1.0 1.0 3.0
50% 1.0 1.0 3.0
75% 1.0 1.0 3.0
max 1.0 1.0 3.0
0 ... 3
A 1 ... 1
B 2013-01-02 00:00:00 ... 2013-01-02 00:00:00
C 1 ... 1
D 3 ... 3
E test ... train
F foo ... foo
[6 rows x 4 columns]
F E D C B A
0 foo test 3 1.0 2013-01-02 1.0
1 foo train 3 1.0 2013-01-02 1.0
2 foo test 3 1.0 2013-01-02 1.0
3 foo train 3 1.0 2013-01-02 1.0
A B C D E F
3 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
0 1.0 2013-01-02 1.0 3 test foo
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
2 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
3 1.0 2013-01-02 1.0 3 train foo
2. Selecting data
The first way: plain indexing
df['A'] and df.A both select column A.
df[0:3] and df['20190608':'20190610'] both select rows: the former slices by row number, the latter by row label (label slices include the endpoint).
Code:
dates = pd.date_range('20190607',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
print(df)
print(df['A'],df.A) # select column A
print(df[0:3],df['20190608':'20190610']) # select rows by number and by label
result:
A B C D
2019-06-07 0 1 2 3
2019-06-08 4 5 6 7
2019-06-09 8 9 10 11
2019-06-10 12 13 14 15
2019-06-11 16 17 18 19
2019-06-12 20 21 22 23
2019-06-07 0
2019-06-08 4
2019-06-09 8
2019-06-10 12
2019-06-11 16
2019-06-12 20
Freq: D, Name: A, dtype: int32
2019-06-07 0
2019-06-08 4
2019-06-09 8
2019-06-10 12
2019-06-11 16
2019-06-12 20
Freq: D, Name: A, dtype: int32
A B C D
2019-06-07 0 1 2 3
2019-06-08 4 5 6 7
2019-06-09 8 9 10 11
A B C D
2019-06-08 4 5 6 7
2019-06-09 8 9 10 11
2019-06-10 12 13 14 15
The second way: indexing by label
select by label: the loc method
The loc examples below use the df defined above.
Code:
print(df.loc['20190608']) # select the row labeled 20190608
print(df.loc[:,['A','B']]) # select columns A and B
print(df.loc['20190607',['A','B']]) # select row 20190607 restricted to columns A and B
result:
A 4
B 5
C 6
D 7
Name: 2019-06-08 00:00:00, dtype: int32
A B
2019-06-07 0 1
2019-06-08 4 5
2019-06-09 8 9
2019-06-10 12 13
2019-06-11 16 17
2019-06-12 20 21
A 0
B 1
Name: 2019-06-07 00:00:00, dtype: int32
The third way: indexing by row and column number
select by position: iloc
iloc[] indexes only by row and column number; unlike loc, it cannot use labels.
Code:
# select by position: iloc (index by row and column number)
print(df.iloc[3:5,1:3]) # slice rows 3:5, columns 1:3
print(df.iloc[[1,3,5],1:3]) # slice rows 1, 3 and 5, columns 1:3
result:
B C
2019-06-10 13 14
2019-06-11 17 18
B C
2019-06-08 5 6
2019-06-10 13 14
2019-06-12 21 22
The fourth way: ix
Mixed selection: ix
The ix method can mix labels and row numbers. (Note: ix is deprecated and was removed in pandas 1.0.)
Code:
# mixed selection: ix
print(df.ix[:3,['A','C']]) # select the first three rows and columns A and C
result:
A C
2019-06-07 0 2
2019-06-08 4 6
2019-06-09 8 10
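Since ix was removed in pandas 1.0, the example above raises an error on current versions. A minimal sketch of the same mixed selection using only loc (rebuilding the same df used in this section): take the first three row labels positionally, then select by label.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20190607', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)),
                  index=dates, columns=['A', 'B', 'C', 'D'])

# Modern equivalent of df.ix[:3, ['A', 'C']]: slice the first three
# row labels by position, then hand both rows and columns to loc.
res = df.loc[df.index[:3], ['A', 'C']]
print(res)
```

The same result can also be reached purely positionally with df.iloc[:3, [0, 2]]; which form to prefer depends on whether the row selection is conceptually by label or by position.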
The fifth way: Boolean indexing
df.A > 8 selects all rows whose value in column A is greater than 8.
Code:
# Boolean indexing
print(df[df.A>8]) # all rows where df.A is greater than 8
result:
A B C D
2019-06-10 12 13 14 15
2019-06-11 16 17 18 19
2019-06-12 20 21 22 23
3. Setting values
Values in a DataFrame can be changed in many ways.
Assignment works through any of the selection methods above:
loc
iloc
df[df.A > 4]
df['F']
df[] with a single label selects a column; with a slice such as 0:3 it selects rows 0 through 3.
Code:
import numpy as np
import pandas as pd
dates = pd.date_range('20190607',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
print(df)
df.iloc[2,2] =1111 # change a value
df.loc['20190610','B']=2222 # change a value
#df[df.A>4] = 0 # rows where column A is greater than 4 are set to 0 entirely, including column B
#df.A[df.A>4] = 0 # only the entries of column A greater than 4 become 0
df.B[df.A>4] = 0 # where column A is greater than 4, set the corresponding entry of column B to 0
df['F'] = np.nan
df['E'] = pd.Series([1,2,3,4,5,6],index=pd.date_range('20190607',periods=6))
print(df)
result:
A B C D
2019-06-07 0 1 2 3
2019-06-08 4 5 6 7
2019-06-09 8 9 10 11
2019-06-10 12 13 14 15
2019-06-11 16 17 18 19
2019-06-12 20 21 22 23
A B C D F E
2019-06-07 0 1 2 3 NaN 1
2019-06-08 4 5 6 7 NaN 2
2019-06-09 8 0 1111 11 NaN 3
2019-06-10 12 0 14 15 NaN 4
2019-06-11 16 0 18 19 NaN 5
2019-06-12 20 0 22 23 NaN 6
4. Handling missing data
Attribute | Explanation |
---|---|
df.dropna(axis=1,how='any') | Drop columns containing NaN. how={'any','all'}: 'any' drops a column as soon as it contains one or more NaN; 'all' drops it only when the whole column is NaN |
df.fillna(value=0) | Fill NaN values with the given value |
df.isnull() | Test for NaN values; returns True at each position that holds one |
np.any(df.isnull())==True | Test whether there is at least one NaN; returns True if so |
Code:
import numpy as np
import pandas as pd
dates = pd.date_range('20190607',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
df.iloc[0,1] = np.nan
df.iloc[1,2] = np.nan
print(df)
print(df.dropna(axis=1,how='any')) # how={'any','all'}: how specifies the dropping rule
print(df.fillna(value=0)) # fill NaN values with value
print(df.isnull()) # isnull marks each NaN position as True
print(np.any(df.isnull())==True) # check whether there is at least one NaN; returns True if so
result:
A B C D
2019-06-07 0 NaN 2.0 3
2019-06-08 4 5.0 NaN 7
2019-06-09 8 9.0 10.0 11
2019-06-10 12 13.0 14.0 15
2019-06-11 16 17.0 18.0 19
2019-06-12 20 21.0 22.0 23
A D
2019-06-07 0 3
2019-06-08 4 7
2019-06-09 8 11
2019-06-10 12 15
2019-06-11 16 19
2019-06-12 20 23
A B C D
2019-06-07 0 0.0 2.0 3
2019-06-08 4 5.0 0.0 7
2019-06-09 8 9.0 10.0 11
2019-06-10 12 13.0 14.0 15
2019-06-11 16 17.0 18.0 19
2019-06-12 20 21.0 22.0 23
A B C D
2019-06-07 False True False False
2019-06-08 False False True False
2019-06-09 False False False False
2019-06-10 False False False False
2019-06-11 False False False False
2019-06-12 False False False False
True
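Besides wrapping isnull in np.any, pandas can answer the same question directly. A small sketch, using a frame with the same shape and NaN positions as above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape((6, 4)).astype(float),
                  columns=['A', 'B', 'C', 'D'])
df.iloc[0, 1] = np.nan
df.iloc[1, 2] = np.nan

# Per-column check: which columns contain at least one NaN?
print(df.isnull().any())
# Whole-frame check: is there any NaN anywhere?
print(df.isnull().values.any())
```

The per-column form is often more useful in practice, since it tells you where the missing data lives, not just that it exists.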
5. Importing and exporting
There are many ways to import and export; they all look much alike, differing only in the format being read or saved.
Consult the official documentation for details; here we just look at basic usage.
Code:
import pandas as pd
data = pd.read_csv('student.csv') # read the file
print(data)
print(type(data))
data.to_pickle('student.pickle') # save the file
This code reads student.csv into data.
The type of data is pandas.core.frame.DataFrame.
The data is then saved in pickle form, a serialized format.
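Loading the pickle back works with pd.read_pickle, the counterpart of to_pickle. A self-contained round-trip sketch (the small frame and the temporary file path are made up for illustration, since student.csv is not included here):

```python
import os
import tempfile
import pandas as pd

# A small made-up frame standing in for the student data.
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'score': [90, 85]})

# to_pickle serializes the frame; read_pickle restores it,
# preserving the index and the dtypes exactly.
path = os.path.join(tempfile.mkdtemp(), 'student.pickle')
df.to_pickle(path)
restored = pd.read_pickle(path)
print(restored.equals(df))
```

Unlike CSV, the pickle round trip keeps dtypes and index information intact, at the cost of being Python-specific.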
6. Merging with concat
A brief look at merging DataFrames with concat.
pd.concat([df1,df2,df3],axis=0) # vertical concatenation
When axis is 1, the concatenation is horizontal.
pd.concat([df1,df2,df3],axis=0,ignore_index=True) # vertical concatenation, ignoring the index
With ignore_index=True the original index of each dataframe is ignored and the rows are renumbered.
Code:
# concatenating
df1 = pd.DataFrame(np.ones((3,4))*0,columns=['a','b','c','d'])
df2 = pd.DataFrame(np.ones((3,4))*1,columns=['a','b','c','d'])
df3 = pd.DataFrame(np.ones((3,4))*2,columns=['a','b','c','d'])
res = pd.concat([df1,df2,df3],axis=0) # vertical concatenation
print(res)
res = pd.concat([df1,df2,df3],axis=0,ignore_index=True) # vertical concatenation, ignoring the index
print(res)
Next, another concat parameter: join {'inner','outer'}.
join is used when the tables' column names only partly coincide. outer join is the default: the union of all the columns is kept, and positions with no value are set to NaN. inner keeps only the intersection of the tables' columns.
Note: newer versions of pandas added a sort parameter to concat (its default has since changed to sort=False).
The code below demonstrates:
df1 = pd.DataFrame(np.ones((3,4))*0,columns=['a','b','c','d'],index=[1,2,3])
df2 = pd.DataFrame(np.ones((3,4))*1,columns=['b','c','d','e'],index=[2,3,4])
res = pd.concat([df1,df2],join='outer',sort = True) # the default join mode
print(res)
res = pd.concat([df1,df2],join='inner',ignore_index=True) # inner mode
print(res)
result:
a b c d e
1 0.0 0.0 0.0 0.0 NaN
2 0.0 0.0 0.0 0.0 NaN
3 0.0 0.0 0.0 0.0 NaN
2 NaN 1.0 1.0 1.0 1.0
3 NaN 1.0 1.0 1.0 1.0
4 NaN 1.0 1.0 1.0 1.0
b c d
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 1.0 1.0 1.0
4 1.0 1.0 1.0
5 1.0 1.0 1.0
Another parameter: join_axes.
join_axes=[df1.index] aligns the merge on the index of df1.
Code:
# join_axes
df1 = pd.DataFrame(np.ones((3,4))*0,columns=['a','b','c','d'],index=[1,2,3])
df2 = pd.DataFrame(np.ones((3,4))*1,columns=['b','c','d','e'],index=[2,3,4])
res = pd.concat([df1,df2],axis=1,join_axes=[df1.index]) # horizontal concatenation, aligned on df1's index
print(res)
result:
a b c d b c d e
1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
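join_axes was removed in pandas 1.0, so on current versions the same result is produced by reindexing one frame to the other's index before concatenating. A sketch with the same df1 and df2:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'], index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['b', 'c', 'd', 'e'], index=[2, 3, 4])

# Align df2 to df1's index first: row 1 (absent from df2) becomes NaN
# and row 4 is dropped, the same effect join_axes=[df1.index] had.
res = pd.concat([df1, df2.reindex(df1.index)], axis=1)
print(res)
```

reindex is the more general tool here: it can align to any index, not just one taken from another frame in the concat list.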
The append() method
append is called on a specific df and extends that table.
A Series can also be appended to a df.
Code:
# append
df1 = pd.DataFrame(np.ones((3,4))*0,columns=['a','b','c','d'])
df2 = pd.DataFrame(np.ones((3,4))*1,columns=['a','b','c','d'])
df3 = pd.DataFrame(np.ones((3,4))*1,columns=['a','b','c','d'])
res=df1.append([df2,df3],ignore_index=True) # append vertically
print(res)
s1 = pd.Series([1,2,3,4],index=['a','b','c','d'])
res = df1.append(s1,ignore_index=True) # append a Series
print(res)
result:
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0
6 1.0 1.0 1.0 1.0
7 1.0 1.0 1.0 1.0
8 1.0 1.0 1.0 1.0
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 2.0 3.0 4.0
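DataFrame.append was removed in pandas 2.0, so on current versions the code above needs pd.concat instead. A sketch reproducing both appends:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['a', 'b', 'c', 'd'])
df3 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['a', 'b', 'c', 'd'])

# df1.append([df2, df3], ignore_index=True) becomes:
res = pd.concat([df1, df2, df3], ignore_index=True)

# Appending a Series: turn it into a one-row frame first.
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
res2 = pd.concat([df1, s1.to_frame().T], ignore_index=True)
print(res)
print(res2)
```

Note that repeatedly concatenating inside a loop is slow; collecting frames in a list and calling pd.concat once at the end is the usual pattern.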
7. Merging with merge
merge combines tables on key columns, a bit like a primary key in a database.
Start with a simple example.
Code:
import pandas as pd
# merging two df by key/keys.(may be used in database)
# simple example
left = pd.DataFrame({'key':['K0','K1','K2','K3'],
'A':['A0','A1','A2','A3'],
'B':['B0','B1','B2','B3']})
right = pd.DataFrame({'key':['K0','K1','K2','K3'],
'C':['C0','C1','C2','C3'],
'D':['D0','D1','D2','D3']})
print(left)
print(right)
res = pd.merge(left,right,on='key')
print(res)
result:
A B key
0 A0 B0 K0
1 A1 B1 K1
2 A2 B2 K2
3 A3 B3 K3
C D key
0 C0 D0 K0
1 C1 D1 K1
2 C2 D2 K2
3 C3 D3 K3
A B key C D
0 A0 B0 K0 C0 D0
1 A1 B1 K1 C1 D1
2 A2 B2 K2 C2 D2
3 A3 B3 K3 C3 D3
As you can see, the merge is performed on the values of the key column.
The next example merges on two keys.
merge() has a how parameter:
how = ['left','right','outer','inner']
'left' keeps the keys of the first dataframe;
'right' keeps the keys of the second dataframe;
'outer' keeps the union of the keys, and 'inner' (the default) keeps only the keys present in both.
# consider two keys
left = pd.DataFrame({'key1':['K0','K0','K1','K2'],
'key2':['K0','K1','K0','K1'],
'A':['A0','A1','A2','A3'],
'B':['B0','B1','B2','B3']})
right = pd.DataFrame({'key1':['K0','K1','K1','K2'],
'key2':['K0','K0','K0','K0'],
'C':['C0','C1','C2','C3'],
'D':['D0','D1','D2','D3']})
print(left)
print(right)
how = ['left','right','outer','inner']
res = pd.merge(left,right,on=['key1','key2'],how='inner')
print(res)
result:
A B key1 key2
0 A0 B0 K0 K0
1 A1 B1 K0 K1
2 A2 B2 K1 K0
3 A3 B3 K2 K1
C D key1 key2
0 C0 D0 K0 K0
1 C1 D1 K1 K0
2 C2 D2 K1 K0
3 C3 D3 K2 K0
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A2 B2 K1 K0 C1 D1
2 A2 B2 K1 K0 C2 D2
The next parameter: indicator.
When indicator=True, an extra column is added that records, for each row, which of the two dfs its key appeared in.
indicator='indicator_column' gives that column a custom name.
Code:
# indicator
df1 = pd.DataFrame({'col1':[0,1],'col_left':['a','b']})
df2 = pd.DataFrame({'col1':[1,2,2],'col_right':[2,2,2]})
print(df1)
print(df2)
#res = pd.merge(df1,df2,on='col1',how='outer',indicator=True) # adds a column recording in which df each key exists
# give the indicator a custom name
res = pd.merge(df1,df2,on='col1',how='outer',indicator='indicator_column')
print(res)
result:
col1 col_left
0 0 a
1 1 b
col1 col_right
0 1 2
1 2 2
2 2 2
col1 col_left col_right indicator_column
0 0 a NaN left_only
1 1 b 2.0 both
2 2 NaN 2.0 right_only
3 2 NaN 2.0 right_only
Merging on the index
pd.merge(left,right,left_index=True,right_index=True,how='outer')
Set left_index and right_index to True at the same time;
how works the same as before.
Code:
# merged by index
left = pd.DataFrame({'A':['A0','A1','A2'],
'B':['B0','B1','B2']},
index=['K0','K1','K2'])
right = pd.DataFrame({ 'C':['C0','C1','C2'],
'D':['D0','D1','D2']},
index=['K0','K2','K3'])
print(left)
print(right)
# left_index and right_index
res = pd.merge(left,right,left_index=True,right_index=True,how='outer') # merge on the index
# res = pd.merge(left,right,left_index=True,right_index=True,how='inner')
print(res)
print('*'*50)
result:
A B
K0 A0 B0
K1 A1 B1
K2 A2 B2
C D
K0 C0 D0
K2 C1 D1
K3 C2 D2
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C1 D1
K3 NaN NaN C2 D2
When the two tables share a column name, suffixes tells the columns apart.
pd.merge(boys,girls,on='k',suffixes=['_boy','_girl'],how='outer')
The suffixes parameter appends the given strings to the overlapping column names in the merged table.
Code:
# handle overlapping
boys = pd.DataFrame({'k':['K0','K1','K2'],'age':[1,2,3]})
girls = pd.DataFrame({'k':['K0','K0','K3'],'age':[4,5,6]})
print(boys)
print(girls)
res = pd.merge(boys,girls,on='k',suffixes=['_boy','_girl'],how='outer')
print(res)
result:
age k
0 1 K0
1 2 K1
2 3 K2
age k
0 4 K0
1 5 K0
2 6 K3
age_boy k age_girl
0 1.0 K0 4.0
1 1.0 K0 5.0
2 2.0 K1 NaN
3 3.0 K2 NaN
4 NaN K3 6.0
8. Plotting with plot
plot is usually used through matplotlib, and it can also be called directly from pandas.
plot methods:
'bar','hist','box','kde','area','scatter','hexbin','pie'
plot can draw many kinds of charts; consult the official documentation as needed. Here we demonstrate basic usage.
We generate a set of random data and plot it.
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# plot data
# Series: one-dimensional data
data = pd.Series(np.random.randn(1000),index=np.arange(1000))
data = data.cumsum()
# DataFrame
data = pd.DataFrame(np.random.randn(1000,4),
index=np.arange(1000),
columns=list("ABCD"))
print(data.head())
data=data.cumsum() # cumulative sum
data.plot()
# plot method:
# 'bar','hist','box','kde','area','scatter','hexbin','pie'
ax=data.plot.scatter(x='A',y='B',color='DarkBlue',label='Class 1')
data.plot.scatter(x='A',y='C',color='DarkGreen',label='Class 2',ax=ax)
plt.show()
result:
(To be continued)