Pandas data analysis 2
One line to check whether pandas is already installed:
pip list
One line to install it:
pip install pandas
21 commonly used statistical methods
Function | Description |
---|---|
count | number of non-null values |
max | maximum value |
min | minimum value |
sum | sum |
prod | product |
cumsum | cumulative sum |
cumprod | cumulative product |
cummax | cumulative maximum |
cummin | cumulative minimum |
mean | mean (average) value |
std | standard deviation |
var | variance |
median | median |
abs | absolute value |
unique | list of unique values |
nunique | number of unique values |
value_counts | unique values and their frequencies |
skew | skewness (third standardized moment) |
kurt | kurtosis (fourth standardized moment) |
corr | correlation coefficient matrix |
cov | covariance matrix |
describe():
describe() displays summary statistics (count, mean, std, min, the quartiles, and max) for all numeric features
import pandas as pd
dataframe = pd.DataFrame({
'a' : [1, 2, 3, 4, 5],
'b' : [1.1, 1.2, 1.3, 1.4, 1.5],
'c' : ['a', 'b', 'c', 'd', 'e']
})
dataframe.describe()
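As a rough illustration of what describe() reports, the sketch below rebuilds a small frame like the one above. The names `df`, `summary`, and `full` are ours, chosen for the example. By default only the numeric columns are summarized; passing include='all' adds the non-numeric ones.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [1.1, 1.2, 1.3, 1.4, 1.5],
    'c': ['a', 'b', 'c', 'd', 'e'],
})

# Default: only the numeric columns 'a' and 'b' are summarized
summary = df.describe()
print(summary)

# include='all' also summarizes non-numeric columns (count, unique, top, freq)
full = df.describe(include='all')
print(full)
```

Statistics that do not apply to a column type show up as NaN in the include='all' result.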
numeric_only:
- Many of the operations above only support numeric types such as int and float; other types must define the corresponding operators, such as +
- If a column's type does not support the operation, the column may be silently skipped or the call may fail with an error, depending on the pandas version; passing numeric_only=True restricts the calculation to numeric columns
- In most cases, the statistical functions above are applied to a slice that selects only the numeric columns
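# Mean (numeric columns only)
print(dataframe.mean(numeric_only=True))
# Standard deviation (numeric columns only)
print(dataframe.std(numeric_only=True))
# Cumulative sum over the two numeric columns
print(dataframe.iloc[:, 0:2].cumsum())
# Cumulative product over the two numeric columns
print(dataframe.iloc[:, 0:2].cumprod())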
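The two ways of keeping non-numeric columns out of a statistic can be compared side by side. This is a minimal sketch, assuming a frame with one string column; `means` and `means_sliced` are names introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [1.1, 1.2, 1.3, 1.4, 1.5],
    'c': ['a', 'b', 'c', 'd', 'e'],
})

# Option 1: let pandas skip the non-numeric column
means = df.mean(numeric_only=True)

# Option 2: select the numeric columns explicitly, then aggregate
numeric = df.select_dtypes(include='number')
means_sliced = numeric.mean()

print(means)
print(means_sliced)
```

Both options report a mean for 'a' and 'b' only. The explicit selection is handy when the same numeric subset is reused for several statistics.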
Skewness:
Purpose:
- Describes the asymmetry of the data distribution
Meaning:
- skewness == 0: the distribution is symmetric, as a normal distribution is
- skewness > 0: the right tail is longer, with more extreme values on the right side of the data; the distribution is right-skewed (positively skewed)
- skewness < 0: the left tail is longer, with more extreme values on the left side of the data; the distribution is left-skewed (negatively skewed)
Calculation formula:
- $skewness = E[(x - E(x))^3] / (\sqrt{D(x)})^3$
Demo:
import numpy as np
dataframe = pd.DataFrame({
'id' : np.arange(10),
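# Geometric sequence: start exponent, stop exponent, number of points, base
'value' : np.logspace(1, 10, 10, base = 2),
# Arithmetic sequence: start, stop, number of points
'weight' : np.linspace(1, 10, 10)
})
print(dataframe)
# skew() > 0 for value: its extreme values lie on the right side
print(dataframe.skew())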
- Plot the columns to see their shapes:
# Select the numeric features
num_feats = dataframe.dtypes[dataframe.dtypes != 'object'].index
import matplotlib.pyplot as plt
# One 8 inch x 8 inch figure holding a 2 x 2 grid of subplots
fig, ax = plt.subplots(2, 2, figsize=(8, 8))
for row in range(2):
    for col in range(2):
        i = row * 2 + col
        if i >= len(num_feats):
            # Fewer features than subplots: skip the unused cell
            continue
        data = dataframe[num_feats[i]]
        ax[row][col].plot(data.index, data.values)
        ax[row][col].set_title(num_feats[i])
# Keep proper spacing between the subplots automatically
fig.tight_layout()
plt.show()
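To connect the skewness formula with skew(), the sketch below computes the population skewness of the 'value' column by hand and then applies the small-sample correction that, to our understanding, pandas uses (the adjusted Fisher-Pearson estimator). The names `g1` and `G1` are ours.

```python
import numpy as np
import pandas as pd

# Same 'value' column as in the demo: 2, 4, 8, ..., 1024
values = pd.Series(np.logspace(1, 10, 10, base=2))
n = len(values)

# Population skewness straight from the formula E[(x - E(x))^3] / D(x)^(3/2)
deviations = values - values.mean()
variance = (deviations ** 2).mean()              # D(x), population variance
g1 = (deviations ** 3).mean() / variance ** 1.5

# Small-sample correction (adjusted Fisher-Pearson estimator)
G1 = g1 * np.sqrt(n * (n - 1)) / (n - 2)

print(g1, G1, values.skew())
```

The corrected value should agree with values.skew(), and both are positive because the large values sit far to the right of the mean.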
Kurtosis:
Purpose:
- A statistic describing how steep the distribution of a variable's values is, that is, how sharp the peak of the data distribution is
Meaning:
- kurtosis == 0: the peak is as steep as that of a normal distribution
- kurtosis > 0: the peak is steeper than that of a normal distribution (sharp peak)
- kurtosis < 0: the peak is flatter than that of a normal distribution (flat peak)
Calculation formula:
- $Kurtosis = E[(x - E(x))^4] / (\sqrt{D(x)})^4 - 3$
Demo:
- Continue to demonstrate with the previous set of data:
print(dataframe.kurt())
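The three cases for the sign of the kurtosis can be checked on simulated data. This is a sketch under our own assumptions: the sample size, the seed, and the choice of the uniform and Laplace distributions as flat and sharp examples are not from the original text.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
n = 100_000

samples = pd.DataFrame({
    'normal': rng.normal(size=n),     # reference shape: kurtosis near 0
    'uniform': rng.uniform(size=n),   # flat peak: negative kurtosis
    'laplace': rng.laplace(size=n),   # sharp peak, heavy tails: positive kurtosis
})

kurt = samples.kurt()
print(kurt)
```

Because kurt() already subtracts 3, the normal sample lands close to 0 rather than close to 3.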
cov covariance:
Calculation formula:
- $cov(X, Y) = E[(X - E[X])(Y - E[Y])]$
- E[X] is the expectation of variable X.
- Intuitively, covariance is the expected joint deviation of two variables from their means.
- If one variable tends to be above its expected value when the other is also above its own, the covariance between the two variables is positive.
- If one variable tends to be above its expected value when the other is below its own, the covariance between the two variables is negative.
- If X and Y are statistically independent, the covariance between them is 0; the converse does not hold in general.
Value:
- cov() returns the covariance, which is unbounded and depends on the units of the variables
- The sign indicates whether the relationship is positive or negative
- For a value normalized to [-1, 1], use corr()
Operation object:
- For a DataFrame with n features, the covariance is calculated for every pair of features, forming an n*n matrix
- In the covariance matrix, the diagonal holds the variances and the off-diagonal entries hold the covariances
Demo:
- Continuing with the above data:
dataframe.cov()
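A hand calculation on a tiny frame makes the covariance matrix easier to read. This is a sketch: the frame and the helper `sample_cov` are hypothetical, written for this example. It uses the n - 1 denominator, which is what DataFrame.cov() uses by default.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'y': [2.0, 4.0, 6.0, 8.0, 10.0],   # moves with x
    'z': [5.0, 4.0, 3.0, 2.0, 1.0],    # moves against x
})

def sample_cov(a, b):
    """Sample covariance of two Series with the n - 1 denominator."""
    return ((a - a.mean()) * (b - b.mean())).sum() / (len(a) - 1)

cov_matrix = df.cov()
print(cov_matrix)
print(sample_cov(df['x'], df['y']), sample_cov(df['x'], df['z']))
```

The x-y entry is positive, the x-z entry is negative, and the diagonal matches df.var().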
Corr correlation coefficient:
Calculation formula:
- Based on the covariance cov
- $\rho_{XY} = Cov(X, Y) / (\sqrt{D(X)} \sqrt{D(Y)})$
Value:
- corr() returns the correlation coefficient, which lies in [-1, 1]
- A value of -1 or 1 indicates a perfect linear relationship
- The sign indicates positive or negative correlation
Operation object:
- For a DataFrame with n features, the correlation coefficient is calculated for every pair of features, forming an n*n matrix
- The diagonal of the correlation coefficient matrix is always 1
Demo:
- Continuing with the above data:
dataframe.corr()
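The relationship between cov() and corr() can be verified directly: dividing each covariance by the product of the two standard deviations should reproduce the correlation matrix. The frame and the name `manual_corr` below are made up for this sketch.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'y': [2.0, 4.0, 6.0, 8.0, 10.0],   # exact increasing linear function of x
    'z': [5.0, 3.0, 4.0, 1.0, 2.0],    # loosely decreasing in x
})

std = df.std()
# rho_XY = Cov(X, Y) / (sqrt(D(X)) * sqrt(D(Y))), applied to every pair of columns
manual_corr = df.cov() / np.outer(std, std)

corr_matrix = df.corr()
print(corr_matrix)
print(manual_corr)
```

Here x and y are perfectly correlated, while x and z show a strong but imperfect negative correlation.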
Five commonly used data processing functions:
map:
Function:
- Converts each value in a Series (a column of a DataFrame) to another value according to the given function or dictionary
Dictionary map:
- Pass a dictionary to the Series .map() method
# Mapping dictionary
gendermap = {'F': 0, 'M': 1}
# Data
dataframe = pd.DataFrame({
"name":['Jack', 'Alice', 'Lily', 'Mshis', 'Gdli', 'Agosh', 'Filu', 'Mack', 'Lucy', 'Pony'],
"gender":['F', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'F'],
"age":np.random.randint(15,50,10),
"salary":np.random.randint(5,50,10),
})
# Apply the mapping with map()
dataframe['gender'] = dataframe['gender'].map(gendermap)
print(dataframe)
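One detail worth knowing about dictionary mapping is what happens to values that have no entry in the dictionary. The sketch below uses a made-up Series with a deliberately unmapped value 'X'; the default of -1 is an arbitrary choice for illustration.

```python
import pandas as pd

gender = pd.Series(['F', 'M', 'F', 'X'])   # 'X' is deliberately missing from the dictionary
gender_map = {'F': 0, 'M': 1}

mapped = gender.map(gender_map)
print(mapped)

# Values without a dictionary entry become NaN; fill them if a default is wanted
filled = mapped.fillna(-1).astype(int)
print(filled.tolist())
```

Checking mapped.isna() after a dictionary map is a cheap way to catch categories that were overlooked.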
Function map:
- Pass a function to the Series .map() method
dataframe = pd.DataFrame({
"name":['Jack', 'Alice', 'Lily', 'Mshis', 'Gdli', 'Agosh', 'Filu', 'Mack', 'Lucy', 'Pony'],
"gender":['F', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'F'],
"age":np.random.randint(15,50,10),
"salary":np.random.randint(5,50,10),
})
print(dataframe)
print('*'*40)
# Mapping function
def gender_map(x):
    gender = 0 if x == 'F' else 1
    return gender
dataframe['gender'] = dataframe['gender'].map(gender_map)
print(dataframe)
apply:
Function:
- Runs the specified function over a Series or DataFrame: element by element for a Series, and column by column (axis=0) or row by row (axis=1) for a DataFrame. The function can be a custom function or one of the 21 built-in statistics above
apply anonymous lambda:
df=pd.DataFrame({
"name":['Jack', 'Alice', 'Lily', 'Mshis', 'Gdli', 'Agosh', 'Filu', 'Mack', 'Lucy', 'Pony'],
"gender":['F', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'F'],
"age":np.random.randint(15,50,10),
"salary":np.random.randint(5,50,10),
})
print(df)
print('*'*40)
print(df[['age', 'salary']].apply(lambda x: x*2))
apply with a built-in function:
- First select the columns on which the built-in function can run
# The function passed in can also be a pandas or Python built-in function
print(df[['age', 'salary']].apply(max))
print('*'*30)
print(df[['age', 'salary']].apply(np.mean))
apply with your own function:
# Called once per row
def apply_func(row):
    a = row['name']
    b = row['gender']
    c = row['age']
    return f'name:{a},gender:{b}, age:{c}'
# Add a new column 'all' to df in place
# With axis=1, each call receives one row of the DataFrame
df["all"] = df.apply(apply_func, axis=1)
print(df)
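The effect of the axis argument is easiest to see on a small deterministic frame. This is a minimal sketch; the frame and the names `col_max` and `row_sum` are ours.

```python
import pandas as pd

df = pd.DataFrame({
    'age': [20, 30, 40],
    'salary': [10, 20, 30],
})

# axis=0 (default): the function receives each column as a Series
col_max = df.apply(lambda col: col.max())

# axis=1: the function receives each row as a Series
row_sum = df.apply(lambda row: row['age'] + row['salary'], axis=1)

print(col_max)
print(row_sum)
```

With axis=0 the result has one entry per column; with axis=1 it has one entry per row.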
groupby():
- Works like GROUP BY in MySQL; the by parameter accepts one or more features
- When several features are passed, each group is one combination of their values that occurs in the data; see dfc.groupby(by=['gender','age']) below
Operations after grouping:
- Any operation chained after groupby() is applied within each group
import numpy as np
dfc=pd.DataFrame({
"name":['Jack', 'Alice', 'Lily', 'Mshis', 'Gdli', 'Agosh', 'Filu', 'Mack', 'Lucy', 'Pony'],
"gender":['F', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'F'],
"age":np.random.randint(25,28,10),
"salary":np.random.randint(5,50,10),
})
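# Split into groups, then sum the numeric columns within each group
print(dfc.groupby(by='gender').sum(numeric_only=True))
print("*"*25)
# groupby also accepts several features
print(dfc.groupby(by=['gender','age']).sum(numeric_only=True))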
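Since the demo above uses random data, a fixed example may make the grouping easier to follow. The sketch below uses a small hand-written frame (our own values) and sums salary by one key and by two keys.

```python
import pandas as pd

df = pd.DataFrame({
    'gender': ['F', 'M', 'F', 'M', 'F'],
    'age':    [25, 25, 26, 25, 26],
    'salary': [10, 20, 30, 40, 50],
})

# One key: one result row per gender
by_gender = df.groupby('gender')['salary'].sum()

# Two keys: one result row per (gender, age) combination that occurs in the data
by_both = df.groupby(['gender', 'age'])['salary'].sum()

print(by_gender)
print(by_both)
```

Note that the combination ('M', 26) does not appear in the result because no row has it.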
groupby + apply:
- apply() accepts:
  - a lambda
  - a built-in function
  - a function you wrote yourself
- apply() operates on:
  - the groups produced by groupby(), each of which is a sub-DataFrame
Custom sort:
- Run sort_values() on each sub-DataFrame
df=pd.DataFrame({
"name":['Jack', 'Alice', 'Lily', 'Mshis', 'Gdli', 'Agosh', 'Filu', 'Mack', 'Lucy', 'Pony'],
"gender":['F', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'F'],
"age":np.random.randint(25,28,10),
"salary":np.random.randint(5,50,10),
})
print(df)
print('*'*40)
# Here x is also a DataFrame (one group)
def group_staff_salary(x):
    # ascending=True (the default) sorts from smallest to largest
    df1 = x.sort_values(by='salary', ascending=True)
    return df1
df.groupby('gender',as_index=True).apply(group_staff_salary)
Get the maximum value of each group:
- Return only one row from each sub-DataFrame
# Keep only the highest salary in each group:
df=pd.DataFrame({
"name":['Jack', 'Alice', 'Lily', 'Mshis', 'Gdli', 'Agosh', 'Filu', 'Mack', 'Lucy', 'Pony'],
"gender":['F', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'F'],
"age":np.random.randint(25,28,10),
"salary":np.random.randint(5,50,10),
})
print(df)
print("*"*40)
# Here x is also a DataFrame (one group)
def group_staff_salary(x):
    df1 = x.sort_values(by='salary', ascending=True)
    # After an ascending sort, the last row has the highest salary
    return df1.iloc[-1, :]
df.groupby('gender',as_index=True).apply(group_staff_salary)
- The result is the full record of the highest earner for each gender
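The same question (who earns the most in each group) can also be answered without a custom function. The sketch below is an alternative technique, not the method used above: it relies on idxmax() to find the row label of the largest salary in each group. The frame is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'name':   ['Jack', 'Alice', 'Lily', 'Mack'],
    'gender': ['F', 'M', 'F', 'M'],
    'salary': [30, 45, 12, 20],
})

# idxmax() gives, per group, the row label of the largest salary
top_labels = df.groupby('gender')['salary'].idxmax()

# Use those labels to pull the complete rows
top_earners = df.loc[top_labels]
print(top_earners)
```

This version keeps the original row labels, which can help when the result has to be joined back to the source data.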
agg:
Function:
- Apply several functions to a set of data in one call
Dictionary specifying built-in functions:
- The dictionary keys are the features (columns) of the DataFrame, and the values are the functions to apply to those columns
- When several functions should be applied to one column, pass them as a list
# 1: dictionary: key is the column, value is the function or list of functions
df=pd.DataFrame({
"name":['Jack', 'Alice', 'Lily', 'Mshis', 'Gdli', 'Agosh', 'Filu', 'Mack', 'Lucy', 'Pony'],
"gender":['F', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'F'],
"age":np.random.randint(25,28,10),
"salary":np.random.randint(5,50,10),
})
df.agg({'age': ['max'], 'salary': ['mean', 'std']})
- In the result, the function names form the row index and the dictionary keys form the column index
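The layout of the result is easier to see with fixed numbers. This sketch runs the same kind of agg() call on a three-row frame of our own; cells where a function was not requested for a column come back as NaN.

```python
import pandas as pd

df = pd.DataFrame({
    'age':    [25, 26, 27],
    'salary': [10, 20, 30],
})

# max is requested only for age; mean and std only for salary
result = df.agg({'age': ['max'], 'salary': ['mean', 'std']})
print(result)
```

Each requested function becomes a row and each dictionary key becomes a column.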
groupby + agg:
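- A list of functions passed to agg is applied to every selected column of each sub-DataFrame after groupby; select the numeric columns first, because median is not defined for strings such as name:
df.groupby('gender')[['age', 'salary']].agg(['max', 'min', 'median'])
- The group keys form the row index, and the columns form a two-level index of (feature, function name)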
lambda anonymous function:
- The argument of agg can also be a lambda expression; select the numeric columns first so the arithmetic is valid
df.groupby(['gender'])[['age', 'salary']].agg(lambda x: x.mean()-x.min())
- The groupby features form the row index, and the remaining selected features form the column index
Array of lambda anonymous functions:
- The argument of agg() can also be a list of lambda expressions
df.groupby(['gender'])[['age', 'salary']].agg([lambda x: x.max()-x.min(), lambda x: x.mean()-x.min()])
- The row index is the by feature; the column index combines each feature with the lambda names, which pandas generates automatically (for example <lambda_0>, <lambda_1>)
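Auto-generated lambda names are hard to read in the output. One option, sketched here as a suggestion rather than as the approach used above, is named aggregation, where each keyword argument of agg() names an output column and maps it to a (column, function) pair. The frame and the names `salary_range` and `above_min` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'gender': ['F', 'M', 'F', 'M'],
    'salary': [10, 20, 30, 60],
})

# Named aggregation: keyword = (column, function); the keyword becomes the column name
stats = df.groupby('gender').agg(
    salary_range=('salary', lambda x: x.max() - x.min()),
    above_min=('salary', lambda x: x.mean() - x.min()),
)
print(stats)
```

The result has a flat column index with the chosen names, which is usually easier to work with downstream.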
Two commonly used file operations:
Read and write csv files:
read csv:
pd.read_csv('./test.csv')
Write csv:
# index=False: do not write the row index
df.to_csv('./test.csv',index=False)
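Read and write excel files:
- Reading and writing .xlsx files requires an Excel engine such as openpyxl to be installed
Read excel:
pd.read_excel('./test.xlsx')
Write excel:
# index=True: write the row index
df.to_excel('./test.xlsx',index=True)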
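The effect of the index argument shows up when the file is read back. The sketch below does a CSV round trip in a temporary directory (our choice, so that no file is left behind) and compares the columns that come back in each case.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'name': ['Jack', 'Alice'], 'salary': [10, 20]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.csv')

    # index=False: only the data columns are written
    df.to_csv(path, index=False)
    no_index = pd.read_csv(path)

    # index=True (the default): the row index is written as an extra first column
    df.to_csv(path, index=True)
    with_index = pd.read_csv(path)              # the index comes back as 'Unnamed: 0'
    restored = pd.read_csv(path, index_col=0)   # treat the first column as the index

print(no_index.columns.tolist())
print(with_index.columns.tolist())
print(restored.columns.tolist())
```

If the index carries no information, writing with index=False avoids the extra unnamed column on the next read.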