Pandas data analysis, from getting started to giving up: code plus hands-on practice, opening the door to Pandas in 9 minutes!

Today I put together a guide to using Pandas.
It aims to be the most complete, concise, and readable all-in-one article on the subject!
Um... hard to believe? It's true~

Follow Xiaoyu, and in 9 minutes we will open the door to Pandas together!
From there you set out on the hard road of the data analyst!

1. Basic definition of Pandas

・Pandas is used very frequently in data analysis.
・Pandas is a toolkit built on NumPy that provides higher-level data structures and analysis capabilities.
・Series and DataFrame are its two core data structures. A Series represents a one-dimensional sequence, and a DataFrame represents a two-dimensional table.
・Based on these two data structures, Pandas can import, clean, process, summarize, and export data.

2. How to use Pandas

2.1 Series

· A Series is a fixed-length, ordered sequence that behaves like a dictionary: each value is paired with an index label.
· In storage it is equivalent to two ndarrays, one for the index and one for the values. This is its biggest structural difference from a dictionary, whose number of elements is not fixed.
A Series has two basic attributes:
①index
②values
Here is an example of how to use a Series:

# -*- coding: utf-8 -*-

"""
@ auth : carl_DJ
@ time : 2020-8-28
"""


from pandas import Series

x1 = Series([1, 2, 3, 4])
x2 = Series(data=[1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
# Create a Series from a dictionary
d = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
x3 = Series(d)
print(f'x1 prints as:\n{x1}')
print('=' * 20)
print(f'x2 prints as:\n{x2}')
print('=' * 20)
print(f'x3 prints as:\n{x3}')
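
To make the two attributes listed above concrete, here is a minimal sketch that reuses the data and labels of x2. The variable names s, labels, and values are only illustrative.

```python
from pandas import Series

# Same data and labels as x2 in the example above
s = Series(data=[1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# index holds the labels, values holds the data as a NumPy array
labels = list(s.index)
values = s.values.tolist()
print(labels)
print(values)

# A label looks up its value, much like a dictionary key
print(s['b'])
```

Looking a value up by label is what makes a Series feel like a dictionary, while index and values stay available as separate arrays.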


2.2 DataFrame

・A DataFrame is similar to a database table: it has both a row index and a column index. It can be thought of as a dictionary of Series that all share the same index.

Here is an example:


from pandas import DataFrame

data = {
    'Chinese': [66, 88, 93, 11, 66],
    'Math': [30, 20, 40, 50, 77],
    'English': [65, 88, 90, 55, 22],
}
# Without an index argument, rows are numbered 0, 1, 2, ...
df1 = DataFrame(data)
# With an index argument, each row gets its own label
df2 = DataFrame(
    data,
    index=['张三', '李四', '王五', '赵刘', '贾七'],
    columns=['Chinese', 'Math', 'English'],
)

print(f'df1 prints as:\n{df1}')
print('=' * 30)
print(f'df2 prints as:\n{df2}')
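
The "dictionary of Series with the same index" idea can be sketched directly. This is only an illustration; the labels 'a', 'b', 'c' and the variable names are made up for the example.

```python
from pandas import DataFrame, Series

index = ['a', 'b', 'c']
chinese = Series([66, 88, 93], index=index)
math_scores = Series([30, 20, 40], index=index)

# Build a DataFrame from a dictionary of Series that share one index
df = DataFrame({'Chinese': chinese, 'Math': math_scores})

# Selecting a column gives back a Series carrying the shared row index
col = df['Math']
print(type(col).__name__)
print(list(col.index))
```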


2.2.1 Deleting rows and columns

Use drop to delete rows and columns of a DataFrame.
Example:

# Delete a column
df2 = df2.drop(columns=['English'])
# Delete a row
df2 = df2.drop(index=['张三'])
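
One detail worth seeing in action: drop returns a new DataFrame and leaves the original alone, which is why the result is assigned back. A small sketch with made-up labels 'a' and 'b':

```python
from pandas import DataFrame

scores = DataFrame(
    {'Chinese': [66, 88], 'Math': [30, 20], 'English': [65, 88]},
    index=['a', 'b'],
)

# Drop one column, then one row; each call returns a new DataFrame
trimmed = scores.drop(columns=['English']).drop(index=['a'])
print(trimmed)

# The original is unchanged
print(scores.shape)
```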


2.2.2 Removing duplicates

Remove duplicate rows

# Remove duplicate rows
df1 = df1.drop_duplicates()
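
By default drop_duplicates compares whole rows. The sketch below, on a small made-up table, contrasts that with the subset parameter, which compares only the listed columns.

```python
from pandas import DataFrame

df = DataFrame({'Chinese': [66, 88, 66, 66], 'Math': [30, 20, 30, 77]})

# Whole-row comparison: only the repeated (66, 30) row is dropped
unique_rows = df.drop_duplicates()

# Compare on one column only, keeping the first occurrence of each value
unique_chinese = df.drop_duplicates(subset=['Chinese'])

print(unique_rows)
print(unique_chinese)
```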

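2.2.3 Changing the data type

Use astype to change the data type of a column. It returns a new Series, so assign the result back.

# Change the data type
import numpy as np

df2['Chinese'] = df2['Chinese'].astype('str')
df2['Chinese'] = df2['Chinese'].astype(np.int64)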
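
A typical reason to change the type is that numbers arrived as text. A minimal sketch, with made-up values, showing the round trip between strings and integers:

```python
import numpy as np
from pandas import Series

# Numbers stored as text, as often happens after reading a file
raw = Series(['66', '88', '93'])

as_int = raw.astype(np.int64)   # text -> 64-bit integers
as_str = as_int.astype(str)     # integers -> text again

print(as_int.dtype)
print(as_int.sum())
```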

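2.2.4 Stripping whitespace from values

Remove leading and trailing whitespace from the values in a column.

# Strip whitespace on both sides of each value.
# str.strip only accepts strings, so convert with astype(str) first.
df2['Chinese'] = df2['Chinese'].astype(str).map(str.strip)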
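
As an alternative to map, the .str accessor offers strip, lstrip, and rstrip for string columns. A short sketch on made-up values:

```python
from pandas import Series

names = Series(['  apple', 'banana  ', ' cherry '])

both = names.str.strip()    # remove whitespace on both sides
left = names.str.lstrip()   # remove leading whitespace only
right = names.str.rstrip()  # remove trailing whitespace only

print(both.tolist())
```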

2.2.5 Case conversion operation

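All uppercase

# Convert all column names to uppercase
df2.columns = df2.columns.str.upper()

All lowercase

# Convert all column names to lowercase
df2.columns = df2.columns.str.lower()

Capitalize the first letter

# Capitalize the first letter of each column name
df2.columns = df2.columns.str.title()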

2.2.6 Data cleaning

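Use apply to clean data.
apply is one of the most flexible functions in Pandas, and it is used very frequently.
For example:
① Convert the case of the values in the Math column with a built-in function

# Convert the values in the Math column to uppercase.
# str.upper only accepts strings, so the numeric column is converted with astype(str) first.
df2['Math'] = df2['Math'].astype(str).apply(str.upper)

② Define your own function and use it in apply

# Define a function and use it in apply
def par_df(par):
    return par * 2

df1['Chinese'] = df1['Chinese'].apply(par_df)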
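
apply also works on a whole DataFrame. The sketch below uses made-up scores and a hypothetical to_grade helper (the pass mark of 60 is an assumption for illustration): apply on a column calls the function once per value, while axis=1 calls it once per row.

```python
from pandas import DataFrame

df = DataFrame({'Chinese': [66, 11], 'Math': [30, 20]})

# axis=1 passes each row to the function as a Series
df['Total'] = df.apply(lambda row: row['Chinese'] + row['Math'], axis=1)

# Hypothetical helper: the pass mark of 60 is an assumption
def to_grade(score):
    return 'pass' if score >= 60 else 'fail'

# On a single column, apply calls the function once per value
df['ChineseGrade'] = df['Chinese'].apply(to_grade)

print(df)
```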

2.3 Statistical functions in Pandas

2.3.1 Basic data statistics usage

· count(): number of values, excluding the null value NaN
· describe(): several summary statistics at once, including count, mean, std, min, and max
· min(): minimum value
· max(): maximum value
· sum(): sum
· median(): median
· var(): variance
· std(): standard deviation
· argmin(): integer position of the minimum value
· argmax(): integer position of the maximum value
· idxmin(): index label of the minimum value
· idxmax(): index label of the maximum value
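
Here is a small sketch that exercises the functions listed above on one Series. The scores and labels are made up. Note that in Pandas var() and std() compute the sample statistics (ddof=1) by default.

```python
from pandas import Series

math_scores = Series([30, 20, 40, 50, 77], index=['a', 'b', 'c', 'd', 'e'])

print(math_scores.count(), math_scores.min(), math_scores.max())
print(math_scores.sum(), math_scores.median())
print(math_scores.var(), math_scores.std())

# Position of the extreme value versus its index label
print(math_scores.argmin(), math_scores.idxmin())
print(math_scores.argmax(), math_scores.idxmax())

# count() skips missing values, len() does not
with_gap = Series([1.0, None, 3.0])
print(len(with_gap), with_gap.count())

# describe() bundles several of these statistics into one Series
summary = math_scores.describe()
print(summary)
```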

2.3.2 Using the join function (merge)

Inner join

# Inner join
import pandas as pd

df3 = pd.merge(df1, df2, how='inner')

Outer join

# Outer join
df3 = pd.merge(df1, df2, how='outer')

Right join

# Right join
df3 = pd.merge(df1, df2, how='right')

Left join

# Left join
df3 = pd.merge(df1, df2, how='left')
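
To see how the four how values differ, here is a sketch on two small made-up tables that share a hypothetical name column. When no on argument is given, merge joins on the columns the two tables have in common; the sketch passes on='name' explicitly to make the key visible.

```python
import pandas as pd

left = pd.DataFrame({'name': ['a', 'b', 'c'], 'Chinese': [66, 88, 93]})
right = pd.DataFrame({'name': ['b', 'c', 'd'], 'Math': [20, 40, 50]})

# Collect the surviving keys for each join type
names_by_how = {}
for how in ['inner', 'outer', 'left', 'right']:
    merged = pd.merge(left, right, on='name', how=how)
    names_by_how[how] = list(merged['name'])
    print(how, names_by_how[how])
```

An inner join keeps only keys present in both tables, an outer join keeps every key, and left and right joins keep every key from the corresponding side. Missing values on the other side are filled with NaN.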

2.3.3 Usage of loc function and iloc function

• loc function: selects rows by their index label (for example, the row whose index label is "A")
• iloc function: selects rows by their integer position (for example, the second row)

As usual, here is some example code.
Extracting rows:

from pandas import DataFrame

data = {
    'Chinese': [66, 88, 93, 11, 66],
    'Math': [30, 20, 40, 50, 77],
    'English': [65, 88, 90, 55, 22],
}
df2 = DataFrame(
    data,
    index=['张三', '李四', '王五', '赵刘', '贾七'],
    columns=['Chinese', 'Math', 'English'],
)

# Extract the row whose index label is '张三'
print(f"Row with index label '张三', extracted with loc:\n{df2.loc['张三']}")
print("=" * 30)
# Extract the row at position 1 (the second row, since positions start at 0)
print(f"Row at position 1, extracted with iloc:\n{df2.iloc[1]}")


Examples of extracting columns:

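from pandas import DataFrame

data = {
    'Chinese': [66, 88, 93, 11, 66],
    'Math': [30, 20, 40, 50, 77],
    'English': [65, 88, 90, 55, 22],
}
df2 = DataFrame(
    data,
    index=['张三', '李四', '王五', '赵刘', '贾七'],
    columns=['Chinese', 'Math', 'English'],
)

# Extract everything in the English column
# Use loc to get the scores by column label
print(f"English column, extracted with loc:\n{df2.loc[:, ['English']]}")

print("=" * 30)

# Extract everything in the column at position 2 (the third column, English)
# Use iloc to get the scores by column position
print(f"Column at position 2, extracted with iloc:\n{df2.iloc[:, 2]}")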


Extracting several rows and columns at once:


from pandas import DataFrame

data = {
    'Chinese': [66, 88, 93, 11, 66],
    'Math': [30, 20, 40, 50, 77],
    'English': [65, 88, 90, 55, 22],
}
df2 = DataFrame(
    data,
    index=['zhangsan', 'lisi', 'wangwu', 'zhaoliu', 'jiaqi'],
    columns=['Chinese', 'Math', 'English'],
)
# Use loc to get the scores by row and column labels
loc_sco = df2.loc[['zhangsan', 'zhaoliu'], ['Chinese', 'English']]
print(f'Chinese and English scores of zhangsan and zhaoliu:\n{loc_sco}')

# Use iloc to get the same scores by row and column positions
iloc_sco = df2.iloc[[0, 3], [0, 2]]
print(f'Chinese and English scores of zhangsan and zhaoliu:\n{iloc_sco}')
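
One difference between the two functions is easy to trip over: a loc slice includes both end labels, while an iloc slice excludes the end position, like ordinary Python slicing. A minimal sketch, which also shows loc with a boolean condition (the threshold of 80 is arbitrary):

```python
from pandas import DataFrame

df = DataFrame(
    {'Chinese': [66, 88, 93, 11], 'Math': [30, 20, 40, 50]},
    index=['zhangsan', 'lisi', 'wangwu', 'zhaoliu'],
)

by_label = df.loc['lisi':'wangwu']   # both end labels are included
by_position = df.iloc[1:3]           # position 3 is excluded

# loc also accepts a boolean condition for the rows
by_mask = df.loc[df['Chinese'] > 80, 'Math']

print(by_label)
print(by_mask)
```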


2.4 Data grouping

Using groupby

import pandas as pd

# Read the CSV file, which is encoded as GBK
data = pd.read_csv('data_info.csv', encoding='gbk')
# Group by sex, then compute the sum and mean of age within each group
result = data.groupby('sex')['age'].agg(['sum', 'mean'])
# Print the result
print(f'The result is:\n{result}')
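
Since data_info.csv is not included with the article, here is the same grouping sketched on a small in-memory table. The sex and age values are made up for illustration.

```python
import pandas as pd

people = pd.DataFrame({
    'sex': ['F', 'M', 'F', 'M'],
    'age': [20, 30, 40, 50],
})

# One row per group, one column per aggregate
result = people.groupby('sex')['age'].agg(['sum', 'mean'])
print(result)
```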

2.5 Data sorting

Sorting with sort_values()

# Sort by column A11 in descending order
df = df.sort_values('A11', ascending=False)

Restoring the index with reset_index()

'''
reset_index(): restores the default integer index
inplace=True: modifies the original object instead of creating a new one
'''
df.reset_index(inplace=True)

Note:
sort_values() works much like ORDER BY in SQL.
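
The two functions are often used together: after sorting, the old index labels are out of order, and reset_index renumbers the rows. A sketch on a made-up table, also showing the drop parameter, which discards the old index instead of keeping it as a column:

```python
import pandas as pd

df = pd.DataFrame({'A11': [3, 1, 2]}, index=['x', 'y', 'z'])

# Sorting reorders the rows but keeps their original labels
ordered = df.sort_values('A11', ascending=False)

# reset_index() moves the old labels into a column named 'index'
kept = ordered.reset_index()

# drop=True discards the old labels entirely
renumbered = ordered.reset_index(drop=True)

print(ordered)
print(kept)
print(renumbered)
```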

2.6 Read and write files

Reading a CSV file

# Read a CSV file into a DataFrame
df = pd.read_csv('file_name')

Writing a CSV file

# Write the DataFrame to a CSV file without saving the index
df.to_csv('file_name', index=False)
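
To try the round trip without touching the disk, the sketch below writes to an in-memory io.StringIO buffer instead of a file name. The table contents are made up; with a real file you would pass a path as in the snippets above.

```python
import io
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b'], 'score': [66, 88]})

# to_csv is a DataFrame method; index=False leaves the row labels out
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# read_csv is a module-level function and returns a new DataFrame
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```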

2.7 Combining two DataFrames

Use merge to combine two DataFrames by index

# Merge two DataFrames on their indexes
df2 = df.merge(df2, left_index=True, right_index=True, how='left')
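
A small sketch of the same call on two made-up tables makes the effect of how='left' visible: every row of the left table is kept, and rows with no match on the right get NaN.

```python
import pandas as pd

chinese = pd.DataFrame({'Chinese': [66, 88, 93]}, index=['a', 'b', 'c'])
math_scores = pd.DataFrame({'Math': [20, 40]}, index=['b', 'c'])

# Match rows by index label instead of by a column
combined = chinese.merge(
    math_scores, left_index=True, right_index=True, how='left'
)
print(combined)
```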

3. Pandas hands-on code and references

Pandas hands-on code:
"Pandas: reading and writing Excel in 5 lines of code"
"Python3: automatically processing Excel data with pandas and automatically sending mail with yagmail"

Pandas reference material:
"Pandas Chinese Net"


Origin blog.csdn.net/wuyoudeyuer/article/details/108274818