pandas is a data analysis tool built on NumPy. Skilled use of pandas can greatly reduce our workload.
To import the pandas package:
import numpy as np
import pandas as pd
pandas Data Types
pandas has two main data types: Series and DataFrame.
A Series is a one-dimensional data structure in which each element has an index, similar to a one-dimensional array. The index can consist of strings or numbers.
A DataFrame is a two-dimensional data structure, similar to an Excel table, with labeled rows and columns.
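To make the two structures concrete, here is a minimal sketch. The labels `x`, `y`, `z` and the column names `A`, `B` are made up for illustration; they do not come from the original article.

```python
import pandas as pd

# A Series: one-dimensional values, each paired with an index label.
s = pd.Series([10, 20, 30], index=['x', 'y', 'z'])

# A DataFrame: a two-dimensional table; every column shares the same row index.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4.0, 5.0, 6.0]}, index=['x', 'y', 'z'])

print(s.ndim, s.shape)         # 1 (3,)
print(df.ndim, df.shape)       # 2 (3, 2)
print(type(df['A']).__name__)  # Series: each column of a DataFrame is a Series
```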
Creating a Series object
Example 1
# We can create the object directly with the Series constructor
import numpy as np
import pandas as pd
a=pd.Series([1,2,3,4,5])
print(a)
'''
Output:
0    1
1    2
2    3
3    4
4    5
dtype: int64
# The index is generated automatically, numbered from 0
'''
Example 2
# You can also specify the index. A Series can be created from an existing list or set of elements, or from an ndarray
import numpy as np
import pandas as pd
a=np.array([1,2,3,4,5,6])
b=pd.Series(a,index=['a','b','c','d','e','f'])
print(b)
'''
Output:
a 1
b 2
c 3
d 4
e 5
f 6
dtype: int32
'''
Example 3
# pandas can also create time series. date_range builds the DatetimeIndex, and you must specify at least two of start, end and periods
import numpy as np
import pandas as pd
b=pd.date_range('20200101','20200106')
print(b)
'''
Output:
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
'2020-01-05', '2020-01-06'],
dtype='datetime64[ns]', freq='D')
'''
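The example above passes `start` and `end`. As a small sketch of the other common combination, the snippet below passes `start` and `periods`, then uses the resulting DatetimeIndex as the index of a Series. The values 1.5, 2.5 and 3.5 are made up for illustration.

```python
import pandas as pd

# start + periods; the frequency defaults to one day.
dates = pd.date_range(start='2020-01-01', periods=3)

# Using the DatetimeIndex as a Series index gives a time series.
ts = pd.Series([1.5, 2.5, 3.5], index=dates)

print(dates[-1])                            # 2020-01-03 00:00:00
print(ts.loc[pd.Timestamp('2020-01-02')])   # 2.5
```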
Creating a DataFrame object
A DataFrame has row names and column names, and we can define both. My personal understanding of the constructor is as follows:
pd.DataFrame(data, index=row_labels, columns=column_labels)
Example 1
import numpy as np
import pandas as pd
s=np.arange(1,7)
s=s.reshape(2,3)
a=pd.DataFrame(s,index=['a','b'],columns=['A','B','C'])
print(a)
'''
Output:
   A  B  C
a  1  2  3
b  4  5  6
'''
Example 2
# A DataFrame can also be created from a dict; the dict keys become the column names
import numpy as np
import pandas as pd
a=pd.DataFrame({'name':pd.Categorical(['dn','muss']),
'age':pd.Categorical(['18','19']),
'score':pd.Categorical(['99','98'])})
print(a)
'''
Output:
   name age score
0    dn  18    99
1  muss  19    98
'''
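As a variation on the dict example, the sketch below uses plain lists instead of `pd.Categorical` and passes explicit row labels. The labels `s1` and `s2` are hypothetical; the point is that the dict keys end up as column names while `index` supplies the row names.

```python
import pandas as pd

data = {'name': ['dn', 'muss'], 'age': [18, 19], 'score': [99, 98]}

# Dict keys become columns; the index argument sets the row labels.
df = pd.DataFrame(data, index=['s1', 's2'])

print(list(df.columns))       # ['name', 'age', 'score']
print(list(df.index))         # ['s1', 's2']
print(df.loc['s2', 'score'])  # 98
```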
DataFrame properties
View Data Types
We can use the DataFrame.dtypes attribute to view the data type of each column
import numpy as np
import pandas as pd
a=pd.DataFrame({'name':pd.Categorical(['dn','muss']),
'age':18,
'score':pd.Series(np.arange(98,100))})
print(a.dtypes)
'''
Output:
name category
age int64
score int32
dtype: object
'''
View index and column names
Use the DataFrame.index attribute to see the index
import numpy as np
import pandas as pd
a=pd.DataFrame({'name':pd.Categorical(['dn','muss']),
'age':18,
'score':pd.Series(np.arange(98,100))})
print(a.index)
'''
Output: RangeIndex(start=0, stop=2, step=1)
'''
Use the DataFrame.columns attribute to see the column names
import numpy as np
import pandas as pd
a=pd.DataFrame({'name':pd.Categorical(['dn','muss']),
'age':18,
'score':pd.Series(np.arange(98,100))})
print(a.columns)
# Output: Index(['name', 'age', 'score'], dtype='object')
View data and statistics
Use the DataFrame.values attribute to view the data
import numpy as np
import pandas as pd
a=pd.DataFrame({'name':pd.Categorical(['dn','muss']),
'age':18,
'score':pd.Series(np.arange(98,100))})
print(a.values)
'''
Output:
[['dn' 18 98]
['muss' 18 99]]
'''
Use the DataFrame.describe() method to view summary statistics
import numpy as np
import pandas as pd
a=pd.DataFrame({'name':pd.Categorical(['dn','muss']),
'age':18,
'score':pd.Series(np.arange(98,100))})
print(a.describe())
'''
Output:
        age      score
count   2.0   2.000000
mean   18.0  98.500000
std     0.0   0.707107
min    18.0  98.000000
25%    18.0  98.250000
50%    18.0  98.500000
75%    18.0  98.750000
max    18.0  99.000000
'''
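Note that the output above has no `name` column: by default `describe()` only summarizes numeric columns. A short sketch of the difference, assuming the same kind of data as above (built here from plain lists), uses the `include='all'` argument:

```python
import pandas as pd

a = pd.DataFrame({'name': pd.Categorical(['dn', 'muss']),
                  'age': [18, 18],
                  'score': [98, 99]})

numeric_only = a.describe()               # default: numeric columns only
everything = a.describe(include='all')    # also summarizes the categorical column

print(list(numeric_only.columns))         # ['age', 'score']
print(numeric_only.loc['mean', 'score'])  # 98.5
print(everything.loc['unique', 'name'])   # 2
```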
Common DataFrame operations
Transpose
The transpose operation needs little explanation; it was already covered in the NumPy part.
import numpy as np
import pandas as pd
a=pd.DataFrame({'name':pd.Categorical(['dn','muss']),
'age':18,
'score':pd.Series(np.arange(98,100))})
print(a.T)
'''
Output:
        0     1
name   dn  muss
age    18    18
score  98    99
'''
Sorting
Sorting is divided into sorting by index and sorting by value.
Sort by index
import numpy as np
import pandas as pd
s=np.array([6,5,4,3,2,1]).reshape(3,2)
a=pd.DataFrame(s,index=[0,1,2],columns=['a','b'])
print(a.sort_index(axis=1))  # axis=1 sorts the columns by their labels
'''
Output:
   a  b
0  6  5
1  4  3
2  2  1
'''
Sort by value
import numpy as np
import pandas as pd
s=np.array([6,5,4,3,2,1]).reshape(3,2)
a=pd.DataFrame(s,index=[0,1,2],columns=['a','b'])
print(a.sort_values(by='b'))
'''
Output:
   a  b
2  2  1
1  4  3
0  6  5
'''
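Both sorting methods accept an `ascending` argument and return a new object rather than modifying the original. The following is a minimal sketch on the same data as above:

```python
import numpy as np
import pandas as pd

s = np.array([6, 5, 4, 3, 2, 1]).reshape(3, 2)
a = pd.DataFrame(s, index=[0, 1, 2], columns=['a', 'b'])

# Sort rows by index label, largest label first.
by_index_desc = a.sort_index(ascending=False)

# Sort rows by the values in column 'b', largest value first.
by_value_desc = a.sort_values(by='b', ascending=False)

print(list(by_index_desc.index))  # [2, 1, 0]
print(list(by_value_desc['b']))   # [5, 3, 1]
print(list(a.index))              # [0, 1, 2]  (the original is unchanged)
```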
Data Selection
Select a column
Selecting a single column returns it as a Series object
import numpy as np
import pandas as pd
s=np.arange(1,7).reshape(3,2)
a=pd.DataFrame(s,index=[0,1,2],columns=['a','b'])
print(a['a'])
'''
Output:
0 1
1 3
2 5
'''
Select row
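Using a slice
# This confused me at first: a bare row index such as a[0] raises an error, because a[...] with a single label looks up a column. Only a slice selects rows, so to output the first row use a[0:1].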
import numpy as np
import pandas as pd
s=np.arange(1,7).reshape(3,2)
a=pd.DataFrame(s,index=[0,1,2],columns=['a','b'])
print(a[0:1])
'''
Output:
   a  b
0  1  2
'''
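The behaviour described above can be checked directly. In this sketch, `a[0]` fails because square brackets with a single label look for a column named `0`, while a slice or `iloc` selects rows by position:

```python
import numpy as np
import pandas as pd

s = np.arange(1, 7).reshape(3, 2)
a = pd.DataFrame(s, index=[0, 1, 2], columns=['a', 'b'])

try:
    a[0]                      # looks for a column labelled 0
    raised = False
except KeyError:
    raised = True             # no such column, so pandas raises KeyError

first_row_frame = a[0:1]      # slice: a one-row DataFrame
first_row_series = a.iloc[0]  # position 0: the same row as a Series

print(raised)                  # True
print(first_row_frame.shape)   # (1, 2)
print(list(first_row_series))  # [1, 2]
```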
Label-based selection with loc
loc selects a cross-section by label. For example, to get the data in the rows labelled 1 and 2 for columns a and b:
import numpy as np
import pandas as pd
s=np.arange(1,7).reshape(3,2)
a=pd.DataFrame(s,index=[0,1,2],columns=['a','b'])
print(a)
print('.')
print(a.loc[[1,2],['a','b']])
'''
Output:
   a  b
0  1  2
1  3  4
2  5  6
.
   a  b
1  3  4
2  5  6
'''
Of course, loc can also be used to fetch a single row on its own.
import numpy as np
import pandas as pd
s=np.arange(1,7).reshape(3,2)
a=pd.DataFrame(s,index=[0,1,2],columns=['a','b'])
print(a.loc[1])
'''
Output:
a 3
b 4
Name: 1, dtype: int32
'''
Position-based selection with iloc
loc selects rows and columns by label, while iloc selects them by integer position. Positions start from 0, so the second column is at position 1. The example below takes the rows at positions 0 and 1 and the column at position 0.
import numpy as np
import pandas as pd
s=np.arange(1,7).reshape(3,2)
a=pd.DataFrame(s,index=[0,1,2],columns=['a','b'])
print(a)
print('.')
print(a.iloc[[0,1],0:1])
'''
Output:
   a  b
0  1  2
1  3  4
2  5  6
.
   a
0  1
1  3
'''
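With the default integer index, labels and positions happen to coincide, which hides the difference between loc and iloc. The sketch below uses string row labels (`x`, `y`, `z`, chosen here only for illustration) to separate the two, and also shows that label slices include the end point while position slices do not:

```python
import numpy as np
import pandas as pd

s = np.arange(1, 7).reshape(3, 2)
df = pd.DataFrame(s, index=['x', 'y', 'z'], columns=['a', 'b'])

by_label = df.loc['y', 'b']    # row labelled 'y', column labelled 'b'
by_position = df.iloc[1, 1]    # second row, second column (zero-based)

label_slice = df.loc['x':'y']  # label slices include the end label
position_slice = df.iloc[0:1]  # position slices exclude the end position

print(by_label, by_position)   # 4 4
print(len(label_slice))        # 2
print(len(position_slice))     # 1
```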
Reading and exporting files
Import function | Export function |
---|---|
read_csv | to_csv |
read_excel | to_excel |
read_sql | to_sql |
read_json | to_json |
read_msgpack (removed in pandas 1.0) | to_msgpack (removed in pandas 1.0) |
read_html | to_html |
read_gbq | to_gbq |
read_stata | to_stata |
read_sas | (none; pandas has no to_sas) |
read_clipboard | to_clipboard |
read_pickle | to_pickle |
Common statistical methods
Function | Description |
---|---|
count | Number of non-NA values |
describe | Compute summary statistics for a Series or for each DataFrame column |
min , max | Minimum and maximum |
argmin , argmax | Integer index positions of the minimum and maximum |
idxmin , idxmax | Index labels of the minimum and maximum |
quantile | Sample quantile (from 0 to 1) |
sum | Sum |
mean | Mean |
median | Median |
mad | Mean absolute deviation from the mean (removed in pandas 2.0) |
var | Variance |
std | Standard deviation |
skew | Sample skewness (third moment) |
kurt | Sample kurtosis (fourth moment) |
cumsum | Cumulative sum of the values |
cummin , cummax | Cumulative minimum and cumulative maximum of the values |
cumprod | Cumulative product of the values |
diff | First difference (useful for time series) |
pct_change | Percentage change |
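A few of the methods from the table, applied to a small Series, may help make the descriptions concrete. The data and the labels `a` to `d` are made up for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

print(s.sum())             # 10
print(s.mean())            # 2.5
print(s.var())             # sample variance (divides by n - 1)
print(s.idxmax())          # 'd': the index label of the maximum
print(s.argmax())          # 3: the integer position of the maximum
print(list(s.cumsum()))    # [1, 3, 6, 10]
print(s.diff().tolist())   # first element is NaN, then 1.0, 1.0, 1.0
print(s.pct_change().tolist())  # first element is NaN, then relative changes
```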