A data analysis tool: a guided tour of the pandas library with code examples

pandas is a powerful data analysis library for Python. It provides tools for data cleaning, data manipulation, and data visualization that make data analysis and processing more efficient and convenient. This article covers how to use pandas in five parts: basic concepts, fundamentals, advanced features, practical cases, and a summary.

1. Basic concepts

The core of the pandas library is the DataFrame, a two-dimensional tabular data structure similar to an Excel sheet. Each column is a variable and each row is a data record. A DataFrame supports access by row or by column, as well as complex filtering and calculations. pandas also provides the Series object, a one-dimensional labeled array that can be thought of as a single column of a DataFrame.
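As a minimal sketch of how the two structures relate (the sample names and ages are made up for illustration), selecting one column or one row of a DataFrame yields a Series:

```python
import pandas as pd

# Hypothetical sample data: three people with a name and an age.
df = pd.DataFrame({"name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]})

# Selecting one column returns a Series that shares the DataFrame's row index.
ages = df["age"]
print(type(ages).__name__)   # Series
print(df.shape)              # (3, 2): three rows, two columns

# A single row is also a Series, indexed by the column names.
first_row = df.iloc[0]
print(first_row["name"])     # Alice
```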

2. Basic knowledge

Data types

pandas supports a variety of data types, including numeric, string, and Boolean types. Commonly used types include float (floating-point number), int (integer), str (string), and bool (Boolean value). Here is sample code that creates objects with some of these types:

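import pandas as pd

# Create a DataFrame object
data = {'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35]}
df = pd.DataFrame(data)

# Create a Series object
series = pd.Series([1, 2, 3, 4], dtype='int')

Operators

pandas supports the basic arithmetic operators, such as +, -, *, and /, as well as comparison operators, such as ==, !=, >, and <. For DataFrame objects, Boolean indexing can also be used to filter data. Here is sample code using operators: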

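# Compute the mean of the numeric columns
# (numeric_only=True skips the string column 'name')
mean = df.mean(numeric_only=True)
print(mean)

# Filter the DataFrame with Boolean indexing
filtered_df = df[df['age'] > 30]
print(filtered_df)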
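To make the operator behavior concrete, here is a small sketch that reuses the same sample data. Arithmetic and comparison operators work element by element, and Boolean masks can be combined with & and |:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]})

# Arithmetic operators apply to every element of a column at once.
age_next_year = df["age"] + 1

# Comparison operators return a Boolean Series (a mask).
is_over_28 = df["age"] > 28

# Masks combine with & (and), | (or), ~ (not); each condition needs parentheses.
selected = df[(df["age"] > 28) & (df["name"] != "Charlie")]

print(age_next_year.tolist())     # [26, 31, 36]
print(is_over_28.tolist())        # [False, True, True]
print(selected["name"].tolist())  # ['Bob']
```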

Array operations

Both Series and DataFrame objects support array-style operations such as slicing, indexing, and merging. Data can also be aggregated, filtered, and sorted with a variety of functions. Here is sample code using array operations:

# Slice the DataFrame (first two rows)
print(df.iloc[0:2])

# Slice the Series by position (elements 1 to 3)
indexed_series = series[1:4]
print(indexed_series)

# Sort the DataFrame by the 'age' column
sorted_df = df.sort_values('age')
print(sorted_df)
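Merging is mentioned above but not shown, so here is a hedged sketch using pd.concat and DataFrame.merge. The city column and its values are hypothetical, added only for illustration:

```python
import pandas as pd

people = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})
more_people = pd.DataFrame({"name": ["Charlie"], "age": [35]})
# Hypothetical lookup table; Bob is deliberately missing.
cities = pd.DataFrame({"name": ["Alice", "Charlie"], "city": ["Paris", "Oslo"]})

# Stack the rows of two frames; ignore_index renumbers the rows from 0.
all_people = pd.concat([people, more_people], ignore_index=True)

# Join on the shared "name" column. A left join keeps every person and
# leaves NaN where no city is known.
merged = all_people.merge(cities, on="name", how="left")
print(merged)
```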

String operations

pandas string operations, available through the .str accessor of a Series, are similar to Python's built-in string methods and support concatenation, slicing, and replacement. Here is sample code using string operations:

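# Concatenate the strings of a Series. The .str accessor only works on
# string data, so the 'name' column is used here, not the integer series.
concatenated_names = df['name'].str.cat(sep=', ')
print(concatenated_names)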
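The other operations named above (slicing and replacement) can be sketched the same way. The sample names are illustrative:

```python
import pandas as pd

names = pd.Series(["Alice", "Bob", "Charlie"])

# Each .str method is applied to every element of the Series.
upper = names.str.upper()
first_three = names.str[:3]                              # slicing
replaced = names.str.replace("li", "LI", regex=False)    # literal replacement
joined = names.str.cat(sep=", ")                         # concatenation

print(upper.tolist())        # ['ALICE', 'BOB', 'CHARLIE']
print(first_three.tolist())  # ['Ali', 'Bob', 'Cha']
print(replaced.tolist())     # ['ALIce', 'Bob', 'CharLIe']
print(joined)                # Alice, Bob, Charlie
```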

Functions

pandas provides a wealth of functions for calculating, summarizing, and analyzing data. For example, mean() computes the mean, std() computes the standard deviation, and groupby() groups the data by a specified column. Here is sample code using these functions:

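# Use mean() to compute the mean of the numeric columns
mean = df.mean(numeric_only=True)
print(mean)

# Use std() to compute the standard deviation of the numeric columns
std = df.std(numeric_only=True)
print(std)

# Use groupby() to group the data by a column, then count the rows per group
grouped_df = df.groupby('age')
print(grouped_df.size())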
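Grouping becomes more useful when paired with an aggregation. The sketch below assumes a hypothetical city column and computes several statistics per group in one call:

```python
import pandas as pd

# Hypothetical data: two cities with two people each.
df = pd.DataFrame({
    "city": ["Paris", "Paris", "Oslo", "Oslo"],
    "age": [20, 30, 40, 50],
})

# agg() applies each named function to every group; the result has one row
# per city. std() in pandas is the sample standard deviation (ddof=1).
stats = df.groupby("city")["age"].agg(["mean", "std", "count"])
print(stats)
```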

3. Advanced features

Built-in type conversion

pandas provides convenient type conversion methods, for example converting strings to dates or floating-point numbers to integers. Here is sample code using type conversion:

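# Hypothetical 'date' column of date strings, added so the example runs
df['date'] = ['2023-01-01', '2023-02-01', '2023-03-01']

# Convert strings to a date type
df['date'] = pd.to_datetime(df['date'])

# Convert floating-point numbers to integers
df['age'] = df['age'].astype('int')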
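Real data often contains values that cannot be converted. The following self-contained sketch, with made-up values, shows one way to handle that using the errors="coerce" option of pd.to_datetime:

```python
import pandas as pd

# Hypothetical data: one date string is deliberately malformed.
df = pd.DataFrame({
    "date": ["2023-01-15", "2023-02-20", "not a date"],
    "age": [25.0, 30.0, 35.0],
})

# errors="coerce" turns unparseable strings into NaT instead of raising.
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# Casting float to int only works when the column has no missing values.
df["age"] = df["age"].astype("int64")

print(df.dtypes)
```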

Multidimensional arrays and matrices

A pandas DataFrame can be treated as a two-dimensional array or matrix, and it supports matrix operations such as transposition and matrix multiplication. More complex array and linear algebra operations are available through the numpy library. Here is sample code using matrix operations:

import numpy as np

# Create a numpy array
numpy_array = np.array([[1, 2], [3, 4]])

# Convert the numpy array to a pandas DataFrame
df = pd.DataFrame(numpy_array)
print(df)

# Matrix-multiply the DataFrame by the numpy array
result = np.dot(df, numpy_array)
print(result)
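The same product can also be computed with pandas' own methods. This is a small sketch using DataFrame.dot, the @ operator, and the .T attribute, with the same 2x2 values as above:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame([[1, 2], [3, 4]])
b = np.array([[1, 2], [3, 4]])

# DataFrame.dot returns a DataFrame that keeps a's row index.
product = a.dot(b)

# The @ operator is shorthand for the same matrix product.
same = a @ b

# .T transposes rows and columns.
transposed = a.T

print(product)
```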

Interpolation and missing values

pandas supports several data-repair operations, such as interpolation, missing-value filling, and outlier handling. These can improve the accuracy and reliability of the data. Here is sample code using interpolation:
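Interpolation only has a visible effect when the data actually contains gaps. The sketch below uses a made-up Series with two missing values and compares three common ways of filling them:

```python
import numpy as np
import pandas as pd

# Hypothetical series with two gaps.
s = pd.Series([1.0, np.nan, 3.0, np.nan, 7.0])

# Linear interpolation (the default) fills each gap from its neighbors.
linear = s.interpolate()

# Fill every gap with the mean of the known values (1, 3, 7).
with_mean = s.fillna(s.mean())

# Forward fill repeats the last known value.
forward = s.ffill()

print(linear.tolist())   # [1.0, 2.0, 3.0, 5.0, 7.0]
print(forward.tolist())  # [1.0, 1.0, 3.0, 3.0, 7.0]
```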

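# Fill missing values by interpolation
interpolated_df = df.interpolate()
print(interpolated_df)

Image processing

pandas has no image-processing features of its own, but it can be combined with image-processing libraries such as OpenCV to read, analyze, and process images. Here is sample code that uses OpenCV for image processing: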

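import cv2

# Read an image from disk (cv2.imread returns None if the file cannot be read)
image = cv2.imread('image.jpg')

# Use pandas to read a CSV file that holds image information
df = pd.read_csv('image_data.csv')

# Assumption: the 'image' column stores image file paths, so the first
# entry can be loaded back into a pixel array with cv2.imread
restored_image = cv2.imread(df['image'].values[0])
cv2.imshow('Restored Image', restored_image)
cv2.waitKey(0)
cv2.destroyAllWindows()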

4. Practical cases

The following simple example shows how to use pandas for data analysis. Suppose a CSV file contains user purchase records, and we want to analyze users' purchase preferences and purchase frequency.

Reading data

Use the pandas read_csv() function to read the CSV file into a DataFrame object. Here is sample code for reading data:

import pandas as pd

# Read the CSV file
df = pd.read_csv('purchase_data.csv')
print(df)
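Since purchase_data.csv is not included with this article, here is a self-contained sketch that reads equivalent data from an in-memory string. The column names and values are assumptions made for illustration, not the layout of the real file:

```python
import io
import pandas as pd

# Hypothetical file contents; the columns are illustrative only.
csv_text = """user,category,price,date
alice,books,12.5,2023-01-03
bob,games,59.9,2023-01-04
alice,books,8.0,2023-01-09
"""

# read_csv accepts a path or any file-like object.
# parse_dates converts the named columns to datetime values while reading.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(df.head())
print(df.dtypes)
```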

Reading different kinds of data: Excel, text, and CSV

#-*-coding:utf-8-*-
import pandas as pd
# Keep column names aligned with their values when printing East Asian text
pd.set_option('display.unicode.ambiguous_as_wide', True)
pd.set_option('display.unicode.east_asian_width', True)
aa = '../data/TB2018.xlsx'
df = pd.read_excel(aa)
# Columns: buyer member name, amount actually paid by the buyer
df1 = df[['买家会员名', '买家实际支付金额']]
print(df1)

print('---------Stock data-----------')

bb = '../data/000001.csv'
df = pd.read_csv(bb, encoding='gbk')
# .copy() so that renaming the columns does not act on a view of df
df1 = df[['date', 'open', 'high', 'close', 'low']].copy()
# Chinese display labels for: date, open, high, close, low
df1.columns = ['日期', '开盘价', '最高价', '闭市价', '最低价']
print(df1)
print('---------Text data-----------')
cc = '../data/fl4_name.txt'
df = pd.read_csv(cc, encoding='gbk')
print(df)

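How to select specific rows and/or columns from Excel data

import pandas as pd
aa = '../data/TB2018.xls'
df = pd.read_excel(aa)
print('------------------Select by row-----------------')
print(df[0:1])    # row 0
print(df[:5])     # rows before row 5 (row 5 excluded)
print(df[1:5])    # rows 1 to 4 (row 5 excluded)
print(df[-1:])    # the last row
print(df[-3:-1])  # third-to-last and second-to-last rows (last row excluded)

print('------------------Select by column-----------------')
# Select several columns by passing their names in a list
# (buyer member name, amount actually paid, order status)
df1 = df[['买家会员名', '买家实际支付金额', '订单状态']]
print(df1)


print('------------------Select by row and column-----------------')
# Select the member name and amount paid for one row (the row labelled 2)
print(df.loc[[2], ['买家会员名', '买家实际支付金额']])
# Select the member name and amount paid for the rows labelled 2 and 3
print(df.loc[[2, 3], ['买家会员名', '买家实际支付金额']])
# When column names are long, iloc selects by integer position instead
print(df.iloc[0:3, [0, 3, 4, 5]])

print('------------------')
# at selects a single value: the row labelled 3 of the member name column
print(df.at[3, '买家会员名'])
# iat does the same with integer positions instead of labels
print(df.iat[3, 0])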
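The difference between label-based selection (loc, at) and position-based selection (iloc, iat) is easiest to see when the row labels are not 0, 1, 2. The sketch below uses made-up English column names and a custom index:

```python
import pandas as pd

# Hypothetical orders, labelled 101..103 instead of the default 0..2.
df = pd.DataFrame(
    {"buyer": ["ann", "ben", "cat"], "paid": [10.0, 20.0, 30.0]},
    index=[101, 102, 103],
)

by_label = df.loc[102, "paid"]          # row label 102, column "paid"
by_position = df.iloc[1, 1]             # second row, second column
rows = df.loc[[101, 103], ["buyer"]]    # two rows, one column, by label
scalar = df.at[103, "buyer"]            # single value by label
scalar_pos = df.iat[2, 0]               # single value by position

print(by_label, by_position)  # 20.0 20.0
print(scalar, scalar_pos)     # cat cat
```

With the default index the labels and positions coincide, which is why the two styles look interchangeable in the Excel example above.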

Data cleaning

Clean the data by removing invalid records, filling missing values, and handling outliers. Here is sample code for data cleaning:

# Remove invalid data (.copy() so later assignments modify this frame safely)
df = df[df['age'] > 0].copy()
print(df)

# Fill missing values and handle outliers. Choose whichever method suits your data; the rules used here are:
# - ages below 18 are treated as invalid, set to NaN, then filled with the mean of the valid ages
# - prices of 1000 yuan or more are treated as outliers and replaced with the median of the valid prices

# Data cleaning
df['age'] = df['age'].where(df['age'] >= 18)
df['age'] = df['age'].fillna(df['age'].mean())
valid_price = df['price'] < 1000
df['price'] = df['price'].where(valid_price, df.loc[valid_price, 'price'].median())
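To show the effect of these rules on concrete numbers, here is a self-contained sketch. The sample values are invented, and the thresholds (age 18, price 1000) are the ones stated above:

```python
import numpy as np
import pandas as pd

# Hypothetical records: one under-age entry, one missing age, one price outlier.
df = pd.DataFrame({
    "age": [16, 25, 35, np.nan],
    "price": [100.0, 200.0, 5000.0, 300.0],
})

# Keep ages >= 18, turn everything else into NaN, then fill with the mean
# of the remaining valid ages (25 and 35).
df["age"] = df["age"].where(df["age"] >= 18)
df["age"] = df["age"].fillna(df["age"].mean())

# Replace prices >= 1000 with the median of the valid prices (100, 200, 300).
is_valid_price = df["price"] < 1000
median_price = df.loc[is_valid_price, "price"].median()
df["price"] = df["price"].where(is_valid_price, median_price)

print(df)
```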

Data manipulation

Operate on the data by filtering, sorting, and aggregating it. Here is sample code for data manipulation:

# Filter the data
filtered_df = df[df['age'] > 30]
print(filtered_df)

# Sort the data
sorted_df = df.sort_values('sales')
print(sorted_df)

# Aggregate the data: total sales per category
grouped_df = df.groupby('category')
print(grouped_df['sales'].sum())
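The stated goal of this case is to analyze purchase frequency and purchase preference. One possible way to compute both is sketched below. The user, category, and sales columns and their values are assumptions, since the layout of the real file is not given:

```python
import pandas as pd

# Hypothetical purchase records.
purchases = pd.DataFrame({
    "user": ["alice", "bob", "alice", "alice", "bob"],
    "category": ["books", "games", "books", "music", "games"],
    "sales": [12.5, 59.9, 8.0, 15.0, 30.0],
})

# Purchase frequency: number of orders per user.
frequency = purchases.groupby("user").size()

# Purchase preference: the category each user bought most often.
preference = purchases.groupby("user")["category"].agg(
    lambda s: s.value_counts().idxmax()
)

# Total sales per category, as in the aggregation step above.
category_sales = purchases.groupby("category")["sales"].sum()

print(frequency)
print(preference)
print(category_sales)
```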

Data visualization

Use a library such as matplotlib to visualize the analysis results and better understand users' purchase preferences and purchase frequency. Here is sample code for data visualization:

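import matplotlib.pyplot as plt

# Bar chart of total sales per category
# (a GroupBy object has no .index, so aggregate first and plot the result)
category_sales = df.groupby('category')['sales'].sum()
plt.bar(category_sales.index, category_sales.values)
plt.xlabel('Category')
plt.ylabel('Sales Total')
plt.show()

# Histogram of the price distribution
plt.hist(df['price'], bins=20)
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()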

5. Summary

pandas offers an easy-to-use API, a broad feature set, and efficient, stable data handling. This article covered its basic concepts, fundamentals, advanced features, and a practical case. Mastering these concepts and operations makes data analysis faster and less error-prone, and working through real cases helps clarify where pandas fits and what it can achieve. In short, pandas is a highly practical library for data analysts and data scientists.