Pandas study notes
1. The basic data structure of pandas
1. Series object
A one-dimensional array of indexed data: it binds a set of indices to a set of values, mapping typed keys to typed values
pd.Series(data,index = index)
#index is an optional parameter
Get Series object data:
values attribute, index attribute, label indexing
data = pd.Series([1,2,3.00,4,5.0])
data
-->0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64
#values attribute
data.values
-->array([1., 2., 3., 4., 5.])
#index attribute
data.index
-->RangeIndex(start=0, stop=5, step=1)
#label indexing (slicing)
data[0:3]
-->0    1.0
1    2.0
2    3.0
dtype: float64
A Series can define its own index
Example:
data = pd.Series([1,2,3.00,4,5.0],index = ['a','b','c','d','e'])
data
-->a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
dtype: float64
Note: a Series has an explicit index, which can be customized. NumPy arrays have only an implicit integer index, which cannot be redefined.
You can create a Series object directly from a dictionary: the dictionary keys automatically become the index, and the values stay unchanged
Series can use index to filter dictionary elements
Example:
pd.Series({0:'a',1:'b',2:'c'}, index = [0,2])
-->0    a
2    c
dtype: object
#The Series object keeps only the key-value pairs explicitly requested by the index
2. DataFrame object
Can be seen as a two-dimensional generalization of the Series object (a set of Series objects sharing the same index)
pd.DataFrame(data, columns = columns , index = index)
example:
name = pd.Series({'a':1,'b':2,'c':3})
num = pd.Series([11,22,33],index = ['a','b','c'])
data = pd.DataFrame({'name':name,'number':num})
data
-->
   name  number
a     1      11
b     2      22
c     3      33
# The index values need to be consistent
#If the indices are inconsistent, missing entries appear:
num = pd.Series([11,22,33])
name = pd.Series({'a':1,'b':2,'c':3})
data = pd.DataFrame({'name':name,'number':num})
data
-->
   name  number
a   1.0     NaN
b   2.0     NaN
c   3.0     NaN
0   NaN    11.0
1   NaN    22.0
2   NaN    33.0
index attribute, columns attribute
#index attribute: get the index
data.index
-->Index(['a', 'b', 'c'], dtype='object')
#columns attribute: get the column labels
data.columns
-->Index(['name', 'number'], dtype='object')
3. Index object
It can be regarded as an immutable array.
It supports many of the operations of the standard library set type, including intersection and union.
It contains some attributes similar to NumPy arrays, such as: .size , .shape , .ndim , .dtype, etc.
operation object | features |
---|---|
Series object | Explicit index, can be changed |
DataFrame object | Explicit index, can be changed |
Index object | Index cannot be changed |
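The three properties above can be checked with a short sketch (the variable names and sample values here are made up for illustration):

```python
import pandas as pd

ind1 = pd.Index([1, 3, 5, 7, 9])
ind2 = pd.Index([2, 3, 5, 7, 11])

# Set-like operations between Index objects
inter = ind1.intersection(ind2)  # shared values 3, 5, 7
union = ind1.union(ind2)         # all values from both, sorted

# NumPy-like attributes
print(ind1.size, ind1.shape, ind1.ndim, ind1.dtype)

# Index objects are immutable: item assignment raises a TypeError
try:
    ind1[0] = 0
except TypeError as e:
    print("immutable:", e)
```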
2. Data operation
1. Data value and selection
value
Series | DataFrame |
---|---|
key-value map | key-value map |
in operation | values attribute (view data by row) |
keys() | |
items() |
Select
Series: indexers
Values can use the index directly. When the index is integer-valued, a slicing operation uses the implicit (positional) index by default, while a single-value operation uses the explicit (label) index by default.
indexer | effect |
---|---|
loc | Indicates that value and slice are explicit |
iloc | Indicates that value and slice are both implicit |
ix | Hybrid form, like standard Python list indexing (deprecated and removed in modern pandas) |
DataFrame: use Series indexer, preserve row and column labels
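The explicit/implicit distinction above can be demonstrated with a small sketch (the Series values here are illustrative):

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# Plain indexing: single values use the explicit index, slices the implicit one
print(s[1])       # label 1 -> 'a'
print(s[1:3])     # positions 1 and 2 -> 'b', 'c'

# loc: always the explicit (label) index, for both values and slices
print(s.loc[1])   # label 1 -> 'a'
print(s.loc[1:3]) # labels 1 through 3, inclusive -> 'a', 'b'

# iloc: always the implicit (positional) index
print(s.iloc[1])  # position 1 -> 'b'
```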
2. Numerical operations
Similar to NumPy universal functions;
indices are aligned automatically, and the merged index is the union of the input indices
DataFrame arithmetic methods (add(), sub(), etc.) can substitute a value for missing entries via the fill_value parameter
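A minimal sketch of the fill_value parameter (the frames A and B are made up for illustration):

```python
import pandas as pd

A = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
B = pd.DataFrame({'x': [10, 20]}, index=['b', 'c'])

# Plain + aligns on the union of the indices ('a', 'b', 'c');
# labels present in only one frame produce NaN
print(A + B)

# fill_value substitutes 0 for the missing operand before adding
C = A.add(B, fill_value=0)
print(C)
```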
3. Missing value
None:
It is a Python object; it can only appear in arrays of the object dtype, so it cannot serve as a missing value in numeric NumPy/Pandas arrays.
Aggregating an array that contains None raises an error.
NaN:
special floating-point number
Any operation involving NaN is assimilated to NaN.
Aggregation functions do not raise an error, but their output becomes NaN.
NumPy provides special aggregation functions that ignore missing values, such as np.nansum(), np.nanmin(), np.nanmax() ...
Example:
x = np.array([1,np.nan,3,4])
x.dtype
-->dtype('float64')
x.sum()
-->nan
np.nansum(x)
-->8.0
Note: in numeric data, pandas automatically converts None to NaN
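This automatic conversion can be verified directly (a minimal sketch with made-up values):

```python
import numpy as np
import pandas as pd

# In a numeric Series, pandas upcasts to float64 and stores None as NaN
s = pd.Series([1, np.nan, 2, None])
print(s.dtype)            # float64
print(s.isnull().sum())   # both missing entries are counted the same way
```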
Pandas method for dealing with missing values
isnull(): returns a Boolean mask marking the missing entries
x = pd.Series([1,np.nan,3,None])
x.isnull()
-->0    False
1     True
2    False
3     True
dtype: bool
notnull(): the opposite of isnull()
x.notnull()
0 True
1 False
2 True
3 False
dtype: bool
Note: the isnull() and notnull() methods generate a Boolean array, which can be used as an index to select missing or non-missing data
dropna(): returns the data with missing values removed
#Series object
x.dropna()
-->0    1.0
2    3.0
dtype: float64
#DataFrame object
y = pd.DataFrame([[1,np.nan,3,4],[5,6,None]])
#Can only remove entire rows or columns
#Removes entire rows by default
#Controlled by the axis parameter
y.dropna()
-->Empty DataFrame
Columns: [0, 1, 2, 3]
Index: []
y.dropna(axis = 'columns') #axis = 1 gives the same result
-->
   0
0  1
1  5
#Thresholds on the number of missing values allowed per row or column are set with the how or thresh parameters
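The how and thresh parameters can be sketched as follows (the frame here is made up and differs from the y above):

```python
import numpy as np
import pandas as pd

y = pd.DataFrame([[1, np.nan, 3],
                  [np.nan, np.nan, np.nan],
                  [4, 5, 6]])

# how='all' drops only the rows in which every value is missing
print(y.dropna(how='all'))      # the all-NaN middle row is removed

# thresh keeps rows that have at least that many non-missing values
kept = y.dropna(thresh=2)
print(kept)                     # rows 0 and 2 each have >= 2 valid values
```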
fillna(): fills missing values
#Series object
#Fill directly with a specified value
x.fillna(0)
#Forward fill: propagate the last valid value forward over the missing entries
x.fillna(method = 'ffill')
#Back fill: propagate the next valid value backward
x.fillna(method = 'bfill')
#DataFrame object: same methods, with the axis parameter selecting the fill direction
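A sketch of forward-filling a DataFrame along axis=1 (the frame is made up; note that newer pandas versions prefer the ffill() method, since fillna(method='ffill') is deprecated there):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 3],
                   [np.nan, 5, np.nan]])

# Forward-fill along columns (axis=1): each NaN takes the value to its left;
# equivalent to df.fillna(method='ffill', axis=1) in older pandas
filled = df.ffill(axis=1)
print(filled)
```

A NaN in the first column has no value to its left, so it stays NaN.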
3. Merge operation
1. Concat and Append operations
1. pd.concat()
The syntax is similar to np.concatenate().
Indices are preserved: unlike np.concatenate, pd.concat keeps the original index values even when the inputs share the same index; duplicate indices are not merged.
Parameters that act on this behavior:
verify_integrity parameter
Setting verify_integrity = True raises an error if the merged result contains duplicate indices
ignore_index parameter
Setting it to True creates a new integer index for the merged result
keys parameter
Adds a new level of a multi-level (hierarchical) index marking the source of each row
join and join_axes parameters
Intersection merge: join = 'inner'; union merge: join = 'outer' (join_axes was removed in newer pandas versions)
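A sketch of these parameters (the frames df1, df2, df3 are made up for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({'A': [3, 4]}, index=[0, 1])

# Duplicate indices are preserved by default
print(pd.concat([df1, df2]))  # index 0, 1, 0, 1

# verify_integrity=True raises a ValueError on duplicate indices
try:
    pd.concat([df1, df2], verify_integrity=True)
except ValueError as e:
    print('duplicate index:', e)

# ignore_index=True builds a fresh integer index instead
renumbered = pd.concat([df1, df2], ignore_index=True)

# keys adds an outer level, producing a hierarchical (multi-level) index
both = pd.concat([df1, df2], keys=['x', 'y'])
print(both.loc['y'])  # the rows that came from df2

# join controls column alignment: 'outer' keeps the union of columns,
# 'inner' keeps only the intersection
df3 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
inner = pd.concat([df1, df3], join='inner')
print(inner.columns)  # only the shared column 'A'
```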
2. append() method
df1.append(df2) concatenates rows like pd.concat; it was deprecated and removed in pandas 2.0, so prefer pd.concat in new code.
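Since append() no longer exists in current pandas, a sketch of the equivalent pd.concat call (the frames are made up for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})

# The old df1.append(df2) is equivalent to this pd.concat call.
# Note it returns a new object; df1 itself is not modified in place.
result = pd.concat([df1, df2], ignore_index=True)
print(result)
```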
2. pd.merge() and df1.join(df2)
1. pd.merge()
This method performs one-to-one, many-to-one, and many-to-many join operations
One-to-one case: when the key column of the two DataFrames contains no duplicates, the shared column is selected automatically as the merge key and the index is ignored
Many-to-one case: duplicate keys are preserved, and the data associated with the repeated key is filled in for every matching row
Many-to-many case: rows are aligned automatically and every combination of matching keys is filled in
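A sketch of the one-to-one and many-to-one cases (the employee data is made up for illustration):

```python
import pandas as pd

# One-to-one: each key appears exactly once in each table
df1 = pd.DataFrame({'employee': ['Bob', 'Jake'],
                    'group': ['Accounting', 'Engineering']})
df2 = pd.DataFrame({'employee': ['Jake', 'Bob'],
                    'hire_date': [2012, 2008]})
merged = pd.merge(df1, df2)   # joins on the shared 'employee' column
print(merged)

# Many-to-one: 'group' repeats in merged, so each supervisor value
# is filled in for every matching row
df3 = pd.DataFrame({'group': ['Accounting', 'Engineering'],
                    'supervisor': ['Carly', 'Guido']})
m = pd.merge(merged, df3)
print(m)
```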
The on parameter
Can be set to a column name string, or a list of multiple column names, to specify the merge key(s)
The left_on and right_on parameters
Take column names; they merge data sets whose key columns have different names. If the result keeps a redundant duplicate-key column, it can be removed with drop('name', axis = 1)
The left_index and right_index parameters
Set to True to merge on the index instead of a column
how parameter
Sets the join mode for the data connection;
the default is how = 'inner'
parameter | form |
---|---|
inner | Inner join, that is, intersection |
outer | Outer join, i.e., union |
left | Left join, takes the first argument as the criterion |
right | Right join, takes the second argument as the criterion |
Indexes and columns can be mixed by combining the _index and _on parameters
pd.merge(df1,df2,left_index = True, right_on = 'name')
This operation performs merging and column name selection at the same time, retaining the column name and position selected by right_on
suffixes parameter
to customize the suffix of repeated column names
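A sketch combining the how and suffixes parameters (the names and ranks are made up for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['Peter', 'Paul'], 'rank': [1, 2]})
df2 = pd.DataFrame({'name': ['Paul', 'Mary'], 'rank': [3, 4]})

# how controls which keys survive the join
inner = pd.merge(df1, df2, on='name', how='inner')  # only the shared 'Paul'
outer = pd.merge(df1, df2, on='name', how='outer')  # union of all names

# suffixes renames the clashing non-key columns ('rank' appears in both)
both = pd.merge(df1, df2, on='name', suffixes=['_L', '_R'])
print(both)   # columns: name, rank_L, rank_R
```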
2. df1.join(df2)
merges data directly on the index; the effect is the same as setting the left_index/right_index parameters in the merge operation above
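A minimal sketch of join() (the frames and column names are made up for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
df2 = pd.DataFrame({'y': [10, 20]}, index=['a', 'b'])

# join() merges on the index by default, like
# pd.merge(df1, df2, left_index=True, right_index=True)
joined = df1.join(df2)
print(joined)
```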
4. Aggregation and grouping
Aggregation
Built-in aggregation functions (sum(), mean(), median(), min(), max(), ...) can be referred to by abbreviated string names
Grouping: GroupBy
1. groupby()
Get values by column and return a groupby object
df.groupby('column')
2. Aggregation: aggregate()
Computes several aggregations at once
df.groupby('key').aggregate(['min', np.median, max])
Use a dictionary to specify an aggregation function for each column
df.groupby('key').aggregate({'data1': 'max','data2': 'min'})
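Both forms can be sketched together (the key/data columns here are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'A', 'B'],
                   'data1': [1, 2, 3, 4],
                   'data2': [10, 20, 30, 40]})

# A list applies every function to every column (multi-level columns)
agg_all = df.groupby('key').aggregate(['min', 'max'])
print(agg_all)

# A dict maps each column to its own aggregation
agg_map = df.groupby('key').aggregate({'data1': 'max', 'data2': 'min'})
print(agg_map)
```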
3. Filter: filter()
Filter certain values by grouping attribute
example:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
rng = np.random.RandomState(0)
df = pd.DataFrame({
'key':['A','B','C','A','B','C'],
'data1':range(6),
'data2':rng.randint(0,10,6)},
columns = ['key','data1','data2'])
print(df)
def filter_func(x):
    return x['data2'].std() > 4

print(df.groupby('key')['data2'].std())
print(df.groupby('key').std())
print(df.groupby('key').filter(filter_func))
#i.e. keep only the groups for which df.groupby('key')['data2'].std() > 4
4. Transformation: transform()
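Unlike aggregate(), transform() returns a result with the same shape as the input. A sketch (centering each value on its group mean; the data is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'A', 'B'],
                   'data': [1, 2, 3, 4]})

# transform() applies the function per group but keeps the original
# shape and index, so the result aligns with the input rows
centered = df.groupby('key')['data'].transform(lambda x: x - x.mean())
print(centered)
```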