Python - Pandas - Data Analysis(2)

pandas data analysis

foreword


One line to check whether Pandas is already installed:
pip list
One line to install it:
pip install pandas

21 commonly used statistical methods

function        purpose
count           number of non-null values
max             maximum value
min             minimum value
sum             sum of the values
prod            product of the values
cumsum          cumulative sum
cumprod         cumulative product
cummax          cumulative maximum
cummin          cumulative minimum
mean            mean (average) value
std             standard deviation
var             variance
median          median
abs             absolute value
unique          list of unique values
nunique         number of unique values
value_counts    unique values and their frequencies
skew            skewness (third standardized moment)
kurt            kurtosis (fourth standardized moment)
corr            correlation coefficient matrix
cov             covariance matrix
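
A quick sketch (toy data of my own, not from the original post) showing a few of these methods in action:
import pandas as pd

toy = pd.DataFrame({'x': [1, 2, 2, 4], 'y': [1.0, 3.0, 5.0, 7.0]})
print(toy['x'].count())          # 4 non-null values
print(toy['x'].nunique())        # 3 unique values
print(toy['x'].value_counts())   # each unique value and its frequency
print(toy.cumsum())              # cumulative sum of each column
print(toy.mean())                # column means
print(toy.std())                 # column standard deviations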

describe():

describe() displays the count, mean, std, min, max and quartiles of all numeric columns
import pandas as pd

dataframe = pd.DataFrame({
    'a' : [1, 2, 3, 4, 5],
    'b' : [1.1, 1.2, 1.3, 1.4, 1.5],
    'c' : ['a', 'b', 'c', 'd', 'e']
})
dataframe.describe()
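
As a side note (not shown in the original post), describe can also summarize the non-numeric column when asked to:
# include='all' adds unique/top/freq statistics for the string column 'c'
dataframe.describe(include='all')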


numeric_only:

  • Most of the statistics above only work on int and float columns; for other types the relevant operators such as + would have to be overloaded
  • Columns whose type does not support the operation are either silently skipped or cause the call to fail with an error, depending on the function
  • In practice these statistical functions are therefore usually combined with slicing to select the numeric columns first
# mean (numeric columns only)
print(dataframe.mean(numeric_only=True))

# standard deviation
print(dataframe.std(numeric_only=True))

# cumulative sum (slice off the string column first)
print(dataframe.iloc[:, 0:2].cumsum())

# cumulative product
print(dataframe.iloc[:, 0:2].cumprod())


Skewness:

Function:

  • Used to describe the asymmetry of the data

meaning:

  • skewness == 0: the distribution is symmetric, like the normal distribution
  • skewness > 0: the right tail dominates, i.e. there are more extreme values on the right side of the data; the distribution is right-skewed (positively skewed)
  • skewness < 0: the left tail dominates, i.e. there are more extreme values on the left side of the data; the distribution is left-skewed (negatively skewed)

Calculation formula:

  • skewness = E[((x - E(x)) / \sqrt{D(x)})^3]

Demo:

import numpy as np

dataframe = pd.DataFrame({
    'id' : np.arange(10),
    # geometric sequence: start exponent, stop exponent, number of points, base
    'value' : np.logspace(1, 10, 10, base = 2),
    # arithmetic sequence: start, stop, number of points
    'weight' : np.linspace(1, 10, 10)
})
print(dataframe)
# skew() > 0: 'value' has more extreme values on its right side
print(dataframe.skew())


  • Plot the three columns to see this:
# pick the numeric features
num_feats = dataframe.dtypes[dataframe.dtypes != 'object'].index

import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 2, figsize = (8, 8))   # 8 inch * 8 inch
for row in range(2):
    for col in range(2):
        if row * 2 + col > 2 :
            continue
        data = dataframe[num_feats[row * 2 + col]]
        ax[row][col].plot(data.index, data.values)
        ax[row][col].set_title(f'{num_feats[row * 2 + col]}')
# keep the spacing between subplots correct automatically
fig.tight_layout()
plt.show()


Kurtosis value:

use:

  • A statistic describing how steep or flat the distribution of a variable is, i.e. how sharp its peak is

value:

  1. kurtosis == 0: the peak is as steep as that of the normal distribution
  2. kurtosis > 0: the peak is steeper (sharper) than that of the normal distribution
  3. kurtosis < 0: the peak is flatter than that of the normal distribution

Calculation formula:

Kurtosis = E[((x - E(x)) / \sqrt{D(x)})^4] - 3

Demo:

  • Continue to demonstrate with the previous set of data:
print(dataframe.kurt())
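
To connect the two formulas with the pandas output, here is a small check of my own that computes the raw population moments directly; pandas' skew() and kurt() use bias-corrected sample estimators, so the numbers only roughly agree on a sample this small:
# raw (population) skewness and excess kurtosis, straight from the formulas
v = dataframe['value']
z = (v - v.mean()) / v.std(ddof=0)   # standardise with the population std
print('raw skewness:', (z ** 3).mean())
print('raw kurtosis:', (z ** 4).mean() - 3)
print('pandas skew :', v.skew())
print('pandas kurt :', v.kurt())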


cov covariance

Calculation formula:

cov(X, Y) = E[(X - E[X]) * (Y - E[Y])]

  • E[X] denotes the expectation of the variable X.
  • Intuitively, the covariance is the expected value of the product of the two variables' deviations from their own means.
  • If, whenever one variable is above its own mean, the other also tends to be above its own mean, the covariance of the two variables is positive;
  • if one variable tends to be above its mean while the other is below its mean, the covariance of the two variables is negative (see the small sketch below).
  • If X and Y are statistically independent, their covariance is 0
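
A tiny sketch of my own (the column names are made up) illustrating the sign rule:
demo = pd.DataFrame({'x' : [1, 2, 3, 4, 5],
                     'y1': [2, 4, 6, 8, 10],    # moves with x
                     'y2': [10, 8, 6, 4, 2]})   # moves against x
print(demo['x'].cov(demo['y1']))   # positive covariance
print(demo['x'].cov(demo['y2']))   # negative covariance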


Operation object:

  • For a DataFrame with n numeric features, the covariance is computed for every pair of features, giving an n*n matrix
  • In the covariance matrix the diagonal entries are the variances and the off-diagonal entries are the covariances

Demo:

  • Continuing with the above data:
dataframe.cov()


Corr correlation coefficient:

Calculation formula:

  • Based on the cov covariance:
    ρ_XY = Cov(X, Y) / (\sqrt{D(X)} * \sqrt{D(Y)})

value:

  • corr() calculates correlation coefficients in the range [-1, 1]
  • values close to -1 or 1 indicate a strong linear relationship
  • the sign indicates whether the correlation is positive or negative

Operation object:

  • For a DataFrame with n numeric features, the correlation coefficient is computed for every pair of features, giving an n*n matrix
  • The diagonal of the correlation coefficient matrix is always 1

Demo:

  • Continuing with the above data:
dataframe.corr()
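
As a sanity check (my own sketch, not part of the original demo), the correlation coefficient can be recomputed from the covariance and the two standard deviations, and it matches what corr() returns:
cov_vw  = dataframe['value'].cov(dataframe['weight'])
corr_vw = cov_vw / (dataframe['value'].std() * dataframe['weight'].std())
print(corr_vw)
print(dataframe['value'].corr(dataframe['weight']))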


Five commonly used data processing functions:

map:

  • Schematic diagram: (figure in the original post)

Function:

  • Converts each value of a column in a DataFrame/Series to another value according to a given function or dictionary

Dictionary map:

  • For the DataFrame / Series .map() method, pass a dictionary
#mapping dictionary
gendermap = {'F' : 0, 'M' : 1}

#data
dataframe = pd.DataFrame({
    "name":['Jack', 'Alice', 'Lily', 'Mshis', 'Gdli', 'Agosh', 'Filu', 'Mack', 'Lucy', 'Pony'],
    "gender":['F', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'F'],
    "age":np.random.randint(15,50,10),
    "salary":np.random.randint(5,50,10),
    })

#map method
dataframe['gender'] = dataframe['gender'].map(gendermap)
print(dataframe)


Function map:

  • Pass a function to the .map() method of a DataFrame / Series
dataframe = pd.DataFrame({
    "name":['Jack', 'Alice', 'Lily', 'Mshis', 'Gdli', 'Agosh', 'Filu', 'Mack', 'Lucy', 'Pony'],
    "gender":['F', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'F'],
    "age":np.random.randint(15,50,10),
    "salary":np.random.randint(5,50,10),
    })
print(dataframe)
print('*'*40)

#conversion function
def gender_map(x) :
    gender = 0 if x == 'F' else 1
    return gender

dataframe['gender'] = dataframe['gender'].map(gender_map)

print(dataframe)


apply:

Function:

  • Traverses a Series or DataFrame and applies the specified function to each element (for a Series) or to each row/column (for a DataFrame); the function can be a custom function, one of the 21 built-in statistics above, a lambda, etc.

apply anonymous lambda:

df=pd.DataFrame({
    "name":['Jack', 'Alice', 'Lily', 'Mshis', 'Gdli', 'Agosh', 'Filu', 'Mack', 'Lucy', 'Pony'],
    "gender":['F', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'F'],
    "age":np.random.randint(15,50,10),
    "salary":np.random.randint(5,50,10),
    })

print(df)
print('*'*40)
print(df[['age', 'salary']].apply(lambda x: x*2))

apply built-in functions:

  • First pick out the columns on which the built-in function can be executed
#the function passed in can also be a pandas or Python built-in function
print(df[['age', 'salary']].apply(max))
print('*'*30)
print(df[['age', 'salary']].apply(np.mean))


apply own function:

#called once per row
def apply_func(row):
    a = row['name']
    b = row['gender']
    c = row['age']
    return f'name:{a}, gender:{b}, age:{c}'

#modify in place: add a new column 'all'
df["all"] = df.apply(lambda row:apply_func(row), axis = 1)
#with axis = 1, each row passed to the function is one row of the dataframe
print(df)


groupby()

  • Works like GROUP BY in MySQL; the by parameter can take one or several feature names
  • When several features are passed in, the groups are the combinations of those features' values, see dfc.groupby(by=['gender','age']) below

Note:

  • Whatever comes after groupby() is applied to each group separately
import numpy as np

dfc=pd.DataFrame({
    "name":['Jack', 'Alice', 'Lily', 'Mshis', 'Gdli', 'Agosh', 'Filu', 'Mack', 'Lucy', 'Pony'],
    "gender":['F', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'F'],
    "age":np.random.randint(25,28,10),
    "salary":np.random.randint(5,50,10),
    })

#after splitting into groups, sum within each group
print(dfc.groupby(by='gender').sum())
print("*"*25)
#groupby can also be given several features at once
print(dfc.groupby(by=['gender','age']).sum())


groupby + apply:

  • What can be passed to apply():
    1. a lambda
    2. a built-in function
    3. a function you wrote yourself
  • What apply() operates on:
    each of the groups produced by groupby(), i.e. each sub-DataFrame

Custom sort:

  • Execute sort_values() for each sub DataFrame
df=pd.DataFrame({
    "name":['Jack', 'Alice', 'Lily', 'Mshis', 'Gdli', 'Agosh', 'Filu', 'Mack', 'Lucy', 'Pony'],
    "gender":['F', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'F'],
    "age":np.random.randint(25,28,10),
    "salary":np.random.randint(5,50,10),
    })
print(df)
print('*'*40)

#x here is itself a dataframe (one group)
def group_staff_salary(x):
    df1 = x.sort_values(by = 'salary',ascending=True)
    #ascending=True sorts from smallest to largest (this is also the default)
    return df1

df.groupby('gender',as_index=True).apply(group_staff_salary)


Get the maximum value of each group:

  • Restrict what each sub-DataFrame returns
#only look at the highest salary in each group:
df=pd.DataFrame({
    "name":['Jack', 'Alice', 'Lily', 'Mshis', 'Gdli', 'Agosh', 'Filu', 'Mack', 'Lucy', 'Pony'],
    "gender":['F', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'F'],
    "age":np.random.randint(25,28,10),
    "salary":np.random.randint(5,50,10),
    })
print(df)
print("*"*40)

#x here is itself a dataframe (one group)
def group_staff_salary(x):
    df1 = x.sort_values(by = 'salary',ascending=True)
    return df1.iloc[-1, :]

df.groupby('gender',as_index=True).apply(group_staff_salary)
  • This returns the details of the highest-paid person among the women and among the men.

agg:

Function:

  • Apply several functions to the same set of data at once

Dictionary specifying built-in functions:

  • The keys of the dictionary are columns of the DataFrame, the values are the functions to apply to those columns
  • When several functions should be applied to one column, a list of functions can be passed
# 1: dictionary: keys are columns, values are the functions to apply
df=pd.DataFrame({
    "name":['Jack', 'Alice', 'Lily', 'Mshis', 'Gdli', 'Agosh', 'Filu', 'Mack', 'Lucy', 'Pony'],
    "gender":['F', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'F'],
    "age":np.random.randint(25,28,10),
    "salary":np.random.randint(5,50,10),
    })

df.agg({'age':['max'], 'salary':['mean', 'std']})
  • The dictionary values (the function names) form the row index, the keys (the columns) form the column index

groupby + agg:

  • agg can run several functions over the columns of each sub-DataFrame produced by groupby (here restricted to the numeric columns):
df.groupby('gender')[['age', 'salary']].agg(['max', 'min', 'median'])
  • The groups form the row index and the agg functions form the second level of the column index

lambda anonymous function:

  • The argument to agg can also be a lambda expression (again restricted to the numeric columns)
df.groupby(['gender'])[['age', 'salary']].agg(lambda x: x.mean()-x.min())
  • The groupby feature becomes the row index and the remaining (non-by) features become the column index

Array of lambda anonymous functions:

  • agg() can also take a list of lambda expressions
df.groupby(['gender'])[['age', 'salary']].agg([lambda x: x.max()-x.min(), lambda x: x.mean()-x.min()])
  • The row index is the by feature and the column index is the name of each lambda function

Two commonly used file operations:

Read and write csv files:

read csv:

pd.read_csv('./test.csv')

write csv:

df.to_csv('./test.csv',index=False)
#do not write the row index
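
A small round-trip sketch of my own (the file name is just a placeholder) showing a couple of commonly used read_csv options:
df.to_csv('./test.csv', index=False)
df2 = pd.read_csv('./test.csv',
                  usecols=['name', 'salary'],   # read only these columns
                  encoding='utf-8')             # state the encoding explicitly
print(df2.head())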

Read and write excel files:

Read excel:

pd.read_excel('./test.xlsx')

write excel:

df.to_excel('./test.xlsx',index=True)
#write the row index
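
One dependency worth noting: .xlsx support needs an Excel engine such as openpyxl installed, otherwise read_excel / to_excel raise an ImportError.
pip install openpyxl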


Origin blog.csdn.net/buptsd/article/details/129423974