Pandas study notes
1. The basic data structure of pandas
1. Series object
A one-dimensional array of indexed data: it binds a set of indices to a set of values, mapping typed keys to typed values
pd.Series(data,index = index)
#index is an optional parameter
Get Series object data:
values attribute, index attribute, label indexing
data = pd.Series([1,2,3.00,4,5.0])
data
-->0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64
#values attribute
data.values
-->array([1., 2., 3., 4., 5.])
#index attribute
data.index
-->RangeIndex(start=0, stop=5, step=1)
#label indexing (slicing)
data[0:3]
-->0    1.0
1    2.0
2    3.0
dtype: float64
A Series can define its own index
Example:
data = pd.Series([1,2,3.00,4,5.0],index = ['a','b','c','d','e'])
data
-->a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
dtype: float64
Note: a Series has an explicit index, which can be customized. NumPy arrays have only an implicit integer index, which cannot be redefined.
You can create a Series object directly from a dictionary: the dictionary keys automatically become the index, and the values stay unchanged
Series can use index to filter dictionary elements
Example:
pd.Series({0:'a',1:'b',2:'c'}, index = [0,2])
-->0    a
2    c
dtype: object
#The Series object keeps only the key-value pairs explicitly requested by the index
2. DataFrame object
Can be seen as a two-dimensional generalization of the Series object (a set of Series objects sharing the same index)
pd.DataFrame(data, columns = columns , index = index)
example:
name = pd.Series({'a':1,'b':2,'c':3})
num = pd.Series([11,22,33],index = ['a','b','c'])
data = pd.DataFrame({'name':name,'number':num})
data
-->
   name  number
a     1      11
b     2      22
c     3      33
# The index values need to be consistent
#If the indices are inconsistent, missing entries appear:
num = pd.Series([11,22,33])
name = pd.Series({'a':1,'b':2,'c':3})
data = pd.DataFrame({'name':name,'number':num})
data
-->
   name  number
a   1.0     NaN
b   2.0     NaN
c   3.0     NaN
0   NaN    11.0
1   NaN    22.0
2   NaN    33.0
index attribute, columns attribute
#index attribute: get the index
data.index
-->Index(['a', 'b', 'c'], dtype='object')
#columns attribute: get the column labels
data.columns
-->Index(['name', 'number'], dtype='object')
3. Index object
It can be regarded as an immutable array.
It supports many of the operations of the standard library set type, including intersection and union.
It contains some attributes similar to NumPy arrays, such as: .size , .shape , .ndim , .dtype, etc.
operation object | features |
---|---|
Series object | Explicit index, can be changed |
DataFrame object | Explicit index, can be changed |
Index object | Index cannot be changed |
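The three properties above can be checked with a short sketch (the variable names and sample values here are made up for illustration):

```python
import pandas as pd

ind1 = pd.Index([1, 3, 5, 7, 9])
ind2 = pd.Index([2, 3, 5, 7, 11])

# Set-like operations between Index objects
inter = ind1.intersection(ind2)  # shared values 3, 5, 7
union = ind1.union(ind2)         # all values from both, sorted

# NumPy-like attributes
print(ind1.size, ind1.shape, ind1.ndim, ind1.dtype)

# Index objects are immutable: item assignment raises a TypeError
try:
    ind1[0] = 0
except TypeError as e:
    print("immutable:", e)
```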
2. Data operation
1. Data value and selection
value
Series | DataFrame |
---|---|
key-value map | key-value map |
in operation | values attribute (view data by row) |
keys() | |
items() |
Select
Series: indexers
Values can use the index directly. When the index is integer-valued, a slicing operation uses the implicit (positional) index by default, while a single-value operation uses the explicit (label) index by default.
indexer | effect |
---|---|
loc | Indicates that value and slice are explicit |
iloc | Indicates that value and slice are both implicit |
ix | Hybrid form, like standard Python list indexing (deprecated and removed in modern pandas) |
DataFrame: use Series indexer, preserve row and column labels
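The explicit/implicit distinction above can be demonstrated with a small sketch (the Series values here are illustrative):

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# Plain indexing: single values use the explicit index, slices the implicit one
print(s[1])       # label 1 -> 'a'
print(s[1:3])     # positions 1 and 2 -> 'b', 'c'

# loc: always the explicit (label) index, for both values and slices
print(s.loc[1])   # label 1 -> 'a'
print(s.loc[1:3]) # labels 1 through 3, inclusive -> 'a', 'b'

# iloc: always the implicit (positional) index
print(s.iloc[1])  # position 1 -> 'b'
```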
2. Numerical operations
Similar to NumPy universal functions;
indices are aligned automatically, and the merged index is the union of the input indices
DataFrame arithmetic methods (add(), sub(), etc.) can substitute a value for missing entries via the fill_value parameter
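A minimal sketch of the fill_value parameter (the frames A and B are made up for illustration):

```python
import pandas as pd

A = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
B = pd.DataFrame({'x': [10, 20]}, index=['b', 'c'])

# Plain + aligns on the union of the indices ('a', 'b', 'c');
# labels present in only one frame produce NaN
print(A + B)

# fill_value substitutes 0 for the missing operand before adding
C = A.add(B, fill_value=0)
print(C)
```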
3. Missing value
None:
It is a Python object; it can only appear in arrays of the object dtype, so it cannot serve as a missing value in numeric NumPy/Pandas arrays.
Aggregating an array that contains None raises an error.
NaN:
special floating-point number
Any operation involving NaN is assimilated to NaN.
Aggregation functions do not raise an error, but their output becomes NaN.
NumPy provides special aggregation functions that ignore missing values, such as np.nansum(), np.nanmin(), np.nanmax() ...
Example:
x = np.array([1,np.nan,3,4])
x.dtype
-->dtype('float64')
x.sum()
-->nan
np.nansum(x)
-->8.0
Note: in numeric data, pandas automatically converts None to NaN
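This automatic conversion can be verified directly (a minimal sketch with made-up values):

```python
import numpy as np
import pandas as pd

# In a numeric Series, pandas upcasts to float64 and stores None as NaN
s = pd.Series([1, np.nan, 2, None])
print(s.dtype)            # float64
print(s.isnull().sum())   # both missing entries are counted the same way
```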
Pandas method for dealing with missing values
isnull(): returns a Boolean mask marking the missing entries
x = pd.Series([1,np.nan,3,None])
x.isnull()
-->0    False
1     True
2    False
3     True
dtype: bool
notnull(): the opposite of isnull()
x.notnull()
0 True
1 False
2 True
3 False
dtype: bool
Note: the isnull() and notnull() methods generate a Boolean array, which can be used as an index to select missing or non-missing data
dropna(): returns the data with missing values removed
#Series object
x.dropna()
-->0    1.0
2    3.0
dtype: float64
#DataFrame object
y = pd.DataFrame([[1,np.nan,3,4],[5,6,None]])
#Can only remove entire rows or columns
#Removes entire rows by default
#Controlled by the axis parameter
y.dropna()
-->Empty DataFrame
Columns: [0, 1, 2, 3]
Index: []
y.dropna(axis = 'columns') #axis = 1 gives the same result
-->
   0
0  1
1  5
#Thresholds on the number of missing values allowed per row or column are set with the how or thresh parameters
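The how and thresh parameters can be sketched as follows (the frame here is made up and differs from the y above):

```python
import numpy as np
import pandas as pd

y = pd.DataFrame([[1, np.nan, 3],
                  [np.nan, np.nan, np.nan],
                  [4, 5, 6]])

# how='all' drops only the rows in which every value is missing
print(y.dropna(how='all'))      # the all-NaN middle row is removed

# thresh keeps rows that have at least that many non-missing values
kept = y.dropna(thresh=2)
print(kept)                     # rows 0 and 2 each have >= 2 valid values
```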
fillna(): fills missing values
#Series object
#Fill directly with a specified value
x.fillna(0)
#Forward fill: propagate the last valid value forward over the missing entries
x.fillna(method = 'ffill')
#Back fill: propagate the next valid value backward
x.fillna(method = 'bfill')
#DataFrame object: same methods, with the axis parameter selecting the fill direction
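A sketch of forward-filling a DataFrame along axis=1 (the frame is made up; note that newer pandas versions prefer the ffill() method, since fillna(method='ffill') is deprecated there):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 3],
                   [np.nan, 5, np.nan]])

# Forward-fill along columns (axis=1): each NaN takes the value to its left;
# equivalent to df.fillna(method='ffill', axis=1) in older pandas
filled = df.ffill(axis=1)
print(filled)
```

A NaN in the first column has no value to its left, so it stays NaN.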
3. Merge operation
1. Concat and Append operations
1. pd.concat()
The syntax is similar to np.concatenate().
Indices are preserved: unlike np.concatenate, pd.concat keeps the original index values even when the inputs share the same index; duplicate indices are not merged.
Parameters that act on this behavior:
verify_integrity parameter
Setting verify_integrity = True raises an error if the merged result contains duplicate indices
ignore_index parameter
Setting it to True creates a new integer index for the merged result
keys parameter
Adds a new level of a multi-level (hierarchical) index marking the source of each row
join and join_axes parameters
Intersection merge: join = 'inner'; union merge: join = 'outer' (join_axes was removed in newer pandas versions)
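A sketch of these parameters (the frames df1, df2, df3 are made up for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({'A': [3, 4]}, index=[0, 1])

# Duplicate indices are preserved by default
print(pd.concat([df1, df2]))  # index 0, 1, 0, 1

# verify_integrity=True raises a ValueError on duplicate indices
try:
    pd.concat([df1, df2], verify_integrity=True)
except ValueError as e:
    print('duplicate index:', e)

# ignore_index=True builds a fresh integer index instead
renumbered = pd.concat([df1, df2], ignore_index=True)

# keys adds an outer level, producing a hierarchical (multi-level) index
both = pd.concat([df1, df2], keys=['x', 'y'])
print(both.loc['y'])  # the rows that came from df2

# join controls column alignment: 'outer' keeps the union of columns,
# 'inner' keeps only the intersection
df3 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
inner = pd.concat([df1, df3], join='inner')
print(inner.columns)  # only the shared column 'A'
```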
2. append() method
df1.append(df2) concatenates rows like pd.concat; it was deprecated and removed in pandas 2.0, so prefer pd.concat in new code.
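Since append() no longer exists in current pandas, a sketch of the equivalent pd.concat call (the frames are made up for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})

# The old df1.append(df2) is equivalent to this pd.concat call.
# Note it returns a new object; df1 itself is not modified in place.
result = pd.concat([df1, df2], ignore_index=True)
print(result)
```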
2. pd.merge() and df1.join(df2)
1. pd.merge()
This method performs one-to-one, many-to-one, and many-to-many join operations
One-to-one case: when the key column of the two DataFrames contains no duplicates, the shared column is selected automatically as the merge key and the index is ignored
Many-to-one case: duplicate keys are preserved, and the data associated with the repeated key is filled in for every matching row
Many-to-many case: rows are aligned automatically and every combination of matching keys is filled in
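A sketch of the one-to-one and many-to-one cases (the employee data is made up for illustration):

```python
import pandas as pd

# One-to-one: each key appears exactly once in each table
df1 = pd.DataFrame({'employee': ['Bob', 'Jake'],
                    'group': ['Accounting', 'Engineering']})
df2 = pd.DataFrame({'employee': ['Jake', 'Bob'],
                    'hire_date': [2012, 2008]})
merged = pd.merge(df1, df2)   # joins on the shared 'employee' column
print(merged)

# Many-to-one: 'group' repeats in merged, so each supervisor value
# is filled in for every matching row
df3 = pd.DataFrame({'group': ['Accounting', 'Engineering'],
                    'supervisor': ['Carly', 'Guido']})
m = pd.merge(merged, df3)
print(m)
```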
The on parameter
Can be set to a column name string, or a list of multiple column names, to specify the merge key(s)
The left_on and right_on parameters
Take column names; they merge data sets whose key columns have different names. If the result keeps a redundant duplicate-key column, it can be removed with drop('name', axis = 1)
The left_index and right_index parameters
Set to True to merge on the index instead of a column
how parameter
Sets the join mode for the data connection;
the default is how = 'inner'
parameter | form |
---|---|
inner | Inner join, that is, intersection |
outer | Outer join, i.e., union |
left | Left join, takes the first argument as the criterion |
right | Right join, takes the second argument as the criterion |
Indexes and columns can be mixed by combining the _index and _on parameters
pd.merge(df1,df2,left_index = True, right_on = 'name')
This operation performs merging and column name selection at the same time, retaining the column name and position selected by right_on
suffixes parameter
to customize the suffix of repeated column names
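A sketch combining the how and suffixes parameters (the names and ranks are made up for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['Peter', 'Paul'], 'rank': [1, 2]})
df2 = pd.DataFrame({'name': ['Paul', 'Mary'], 'rank': [3, 4]})

# how controls which keys survive the join
inner = pd.merge(df1, df2, on='name', how='inner')  # only the shared 'Paul'
outer = pd.merge(df1, df2, on='name', how='outer')  # union of all names

# suffixes renames the clashing non-key columns ('rank' appears in both)
both = pd.merge(df1, df2, on='name', suffixes=['_L', '_R'])
print(both)   # columns: name, rank_L, rank_R
```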
2. df1.join(df2)
merges data directly on the index; the effect is the same as setting the left_index/right_index parameters in the merge operation above
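A minimal sketch of join() (the frames and column names are made up for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
df2 = pd.DataFrame({'y': [10, 20]}, index=['a', 'b'])

# join() merges on the index by default, like
# pd.merge(df1, df2, left_index=True, right_index=True)
joined = df1.join(df2)
print(joined)
```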
4. Aggregation and grouping
Aggregation
Built-in aggregation functions (sum(), mean(), median(), min(), max(), ...) can be referred to by abbreviated string names
Grouping: GroupBy
1. groupby()
Get values by column and return a groupby object
df.groupby('column')
2. Aggregation: aggregate()
Computes several aggregations at once
df.groupby('key').aggregate(['min', np.median, max])
Use a dictionary to specify an aggregation function for each column
df.groupby('key').aggregate({'data1': 'max','data2': 'min'})
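Both forms can be sketched together (the key/data columns here are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'A', 'B'],
                   'data1': [1, 2, 3, 4],
                   'data2': [10, 20, 30, 40]})

# A list applies every function to every column (multi-level columns)
agg_all = df.groupby('key').aggregate(['min', 'max'])
print(agg_all)

# A dict maps each column to its own aggregation
agg_map = df.groupby('key').aggregate({'data1': 'max', 'data2': 'min'})
print(agg_map)
```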
3. Filter: filter()
Filter certain values by grouping attribute
example:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
rng = np.random.RandomState(0)
df = pd.DataFrame({
'key':['A','B','C','A','B','C'],
'data1':range(6),
'data2':rng.randint(0,10,6)},
columns = ['key','data1','data2'])
print(df)
def filter_func(x):
    return x['data2'].std() > 4

print(df.groupby('key')['data2'].std())
print(df.groupby('key').std())
print(df.groupby('key').filter(filter_func))
#i.e. keep only the groups for which df.groupby('key')['data2'].std() > 4
4. Transformation: transform()
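Unlike aggregate(), transform() returns a result with the same shape as the input. A sketch (centering each value on its group mean; the data is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'A', 'B'],
                   'data': [1, 2, 3, 4]})

# transform() applies the function per group but keeps the original
# shape and index, so the result aligns with the input rows
centered = df.groupby('key')['data'].transform(lambda x: x - x.mean())
print(centered)
```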