Tips for using Pandas

Pandas is a powerful Python library for data analysis. It provides APIs for cleaning, transforming, analyzing, and visualizing data. Commonly used functions include:

  1. Data reading and parsing

read_csv() : Used to read CSV and other delimited text files and convert them into Pandas DataFrame objects.

import pandas as pd

# Read a CSV file into a DataFrame
df = pd.read_csv('data.csv')

# Print the DataFrame
print(df)
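
As a minimal sketch of some commonly used read_csv() options, the example below reads from an in-memory string instead of a file. The CSV content and column names are hypothetical; usecols, dtype, and parse_dates are standard read_csv() parameters.

```python
import io
import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a file on disk.
csv_text = """name,age,joined
John,25,2021-01-15
Jane,30,2022-06-01
"""

df_csv = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["name", "age", "joined"],  # load only the columns you need
    dtype={"age": "int32"},             # fix a column's type up front
    parse_dates=["joined"],             # parse this column as datetimes
)
print(df_csv.dtypes)
```

Declaring types while reading avoids a second conversion pass after the data is loaded.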

read_excel() : Used to read an Excel file and convert it to a Pandas DataFrame object.

  1. Read multiple Excel sheets with pandas

import pandas as pd
# Open the workbook
excel_reader = pd.ExcelFile('file.xlsx')
# Get the list of all sheet names in the workbook
sheet_names = excel_reader.sheet_names
# Read one sheet by name; i is an index into sheet_names. This is equivalent to
# pd.read_excel('file.xlsx', sheet_name=sheet_names[i])
i = 0
df_data = excel_reader.parse(sheet_name=sheet_names[i])

# Close the reader
excel_reader.close()
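
  1. Write multiple Excel sheets with pandas

import pandas as pd
# Create the writer and choose the output file (it does not need to exist yet)
excel_writer = pd.ExcelWriter('file.xlsx')
# Write each DataFrame to its own sheet (df1 and df2 are assumed to exist already)
df1.to_excel(excel_writer, sheet_name='custom_sheet_name1', index=False)
df2.to_excel(excel_writer, sheet_name='custom_sheet_name2', index=False)
# Closing the writer saves the file; ExcelWriter.save() was removed in pandas 2.0
excel_writer.close()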

read_json() : Used to read data files in JSON format and convert them into Pandas DataFrame objects.

import pandas as pd

# Read a JSON Lines file (one JSON object per line) into a DataFrame.
# pd.read_json() parses the file itself, so json.load() is not needed.
df = pd.read_json('data.json', lines=True, dtype={'Name': str})
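
The lines=True argument matters because it changes the layout read_json() expects. The sketch below, using hypothetical in-memory data rather than a file, reads the same records once as JSON Lines and once as a single JSON array.

```python
import io
import pandas as pd

# Hypothetical JSON Lines content: one JSON object per line.
jsonl_text = '{"Name": "John", "Age": 25}\n{"Name": "Jane", "Age": 30}\n'
df_lines = pd.read_json(io.StringIO(jsonl_text), lines=True)

# The same records as a single JSON array (the default layout).
json_text = '[{"Name": "John", "Age": 25}, {"Name": "Jane", "Age": 30}]'
df_array = pd.read_json(io.StringIO(json_text))

print(df_lines.shape, df_array.shape)
```
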
  1. Data Transformation and Filtering

  • apply() : Applies a function to each element of a Series, or to each row or column of a DataFrame, and returns the results as a new Series or DataFrame.

Example:

# Define a function that doubles its argument
def double(x):
    return x * 2

# Create a DataFrame
data = {'A': [1, 2, 3, 4, 5],
        'B': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Apply the function to each element of column A
df['C'] = df['A'].apply(double)
print(df)

The resulting output:

   A   B   C
0  1  10   2
1  2  20   4
2  3  30   6
3  4  40   8
4  5  50  10
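
apply() can also work on whole rows of a DataFrame. The following sketch, with a hypothetical DataFrame name and column names, passes axis=1 so that each row reaches the function as a Series, and compares the result with the equivalent vectorized expression.

```python
import pandas as pd

df_apply = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# axis=1 passes each row to the function as a Series.
df_apply["row_sum"] = df_apply.apply(lambda row: row["A"] + row["B"], axis=1)

# The vectorized equivalent avoids a Python-level call per row.
df_apply["vec_sum"] = df_apply["A"] + df_apply["B"]

print(df_apply)
```

When a vectorized form exists it is usually the faster choice; apply() is most useful for logic that cannot be expressed with column arithmetic.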
  • agg() : Used to aggregate data with one or more functions, such as the mean or the sum.

Example:

# Create a DataFrame
data = {'A': [1, 2, 3, 4, 5],
        'B': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Compute the mean and the sum of column A
result = df['A'].agg(['mean', 'sum'])
print(result)

The resulting output:

mean    3.0
sum    15.0
Name: A, dtype: float64
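
agg() also accepts a dictionary that maps each column to its own aggregations. A small sketch, reusing the same sample values under a hypothetical variable name:

```python
import pandas as pd

df_agg = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [10, 20, 30, 40, 50]})

# Map each column to the aggregations it should receive.
summary = df_agg.agg({"A": ["mean", "sum"], "B": ["min", "max"]})
print(summary)
```

Cells for aggregations that were not requested for a column are filled with NaN.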
  • groupby() : Used to group data by one or more fields so that each group can be aggregated or processed separately.

Example:

# Create a DataFrame
data = {'Name': ['John', 'Jane', 'Mike', 'Emily', 'John'],
        'Age': [25, 30, 35, 40, 45],
        'City': ['New York', 'Paris', 'London', 'Paris', 'New York']}
df = pd.DataFrame(data)

# Group by the City field and compute the mean age of each group
result = df.groupby('City')['Age'].mean()
print(result)

The resulting output:

City
London      35.0
New York    35.0
Paris       35.0
Name: Age, dtype: float64
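
Several statistics per group can be computed in one call with named aggregation. This is a sketch on the same sample data; the output column names mean_age, oldest, and people are illustrative choices, not part of the original example.

```python
import pandas as pd

df_people = pd.DataFrame({
    "Name": ["John", "Jane", "Mike", "Emily", "John"],
    "Age": [25, 30, 35, 40, 45],
    "City": ["New York", "Paris", "London", "Paris", "New York"],
})

# Named aggregation: output_column=(input_column, function).
city_stats = df_people.groupby("City").agg(
    mean_age=("Age", "mean"),
    oldest=("Age", "max"),
    people=("Name", "count"),
)
print(city_stats)
```
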
  • head() : Returns the first n rows of the data (5 by default).

Example:

# Create a DataFrame
data = {'A': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

# Get the first 3 rows
result = df.head(3)
print(result)

The resulting output:

   A
0  1
1  2
2  3
  • tail() : Returns the last n rows of the data (5 by default).

Example:

# Create a DataFrame
data = {'A': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

# Get the last 3 rows
result = df.tail(3)
print(result)

The resulting output:

   A
2  3
3  4
4  5
  1. Data sorting and filtering

  • sort_values() : Used to sort data by one or more specified columns.

Example:

# Create a DataFrame
data = {'Name': ['John', 'Jane', 'Mike', 'Emily'],
        'Age': [25, 30, 35, 40]}
df = pd.DataFrame(data)

# Sort by the Age column in ascending order
df_sorted = df.sort_values('Age')
print(df_sorted)

The resulting output:

    Name  Age
0   John   25
1   Jane   30
2   Mike   35
3  Emily   40
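
sort_values() can also sort by several columns with a separate direction for each. A sketch with hypothetical ages chosen to create ties:

```python
import pandas as pd

df_sort = pd.DataFrame({
    "Name": ["John", "Jane", "Mike", "Emily"],
    "Age": [30, 25, 30, 25],
})

# Sort by Age descending, then by Name ascending to break ties.
df_multi = df_sort.sort_values(["Age", "Name"], ascending=[False, True])

# reset_index(drop=True) renumbers the rows after sorting.
df_multi = df_multi.reset_index(drop=True)
print(df_multi)
```
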
  • drop_duplicates() : Used to delete duplicate rows, keeping the first occurrence of each by default.

Example:

# Create a DataFrame
data = {'Name': ['John', 'Jane', 'Mike', 'Emily', 'John'],
        'Age': [25, 30, 35, 40, 25]}
df = pd.DataFrame(data)

# Remove duplicate rows
df_unique = df.drop_duplicates()
print(df_unique)

The resulting output:

    Name  Age
0   John   25
1   Jane   30
2   Mike   35
3  Emily   40
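
By default every column must match for a row to count as a duplicate. The sketch below, with hypothetical data, restricts the comparison to one column with subset and keeps the last occurrence with keep.

```python
import pandas as pd

df_dup = pd.DataFrame({
    "Name": ["John", "Jane", "John", "Jane"],
    "Age": [25, 30, 26, 30],
})

# Only the Name column decides what counts as a duplicate;
# keep="last" retains the final occurrence of each name.
latest = df_dup.drop_duplicates(subset="Name", keep="last")

# duplicated() returns the boolean mask that drop_duplicates() acts on.
mask = df_dup.duplicated()
print(latest)
print(mask.tolist())
```
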
  • dropna() : Used to delete rows (or columns) that contain missing values and return the remaining data.

Example:

import numpy as np

# Create a DataFrame
data = {'Name': ['John', 'Jane', np.nan, 'Emily'],
        'Age': [25, np.nan, 35, 40]}
df = pd.DataFrame(data)

# Drop the rows that contain missing values
df = df.dropna()
print(df)

The resulting output:

    Name   Age
0   John  25.0
3  Emily  40.0
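
Dropping every incomplete row can discard more data than intended. As a sketch of one alternative, the example below drops rows only when Name is missing and fills the remaining missing ages with the column mean; whether a mean is a suitable fill value depends on the data.

```python
import numpy as np
import pandas as pd

df_na = pd.DataFrame({
    "Name": ["John", "Jane", np.nan, "Emily"],
    "Age": [25, np.nan, 35, 40],
})

# Drop a row only when Name is missing; missing ages are kept.
named = df_na.dropna(subset=["Name"])

# Fill missing ages with the column mean instead of dropping the row.
filled = named.fillna({"Age": named["Age"].mean()})
print(filled)
```
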
  1. Data Aggregation and Joining

  • join() : Used to combine the columns of two DataFrames by matching on the index. With the on parameter, a column of the left DataFrame is matched against the index of the right DataFrame.

Example:

# Create two DataFrames
data1 = {'Name': ['John', 'Jane', 'Mike'],
         'Age': [25, 30, 35]}
df1 = pd.DataFrame(data1)

data2 = {'Name': ['Emily', 'Tom', 'Kate'],
         'City': ['New York', 'Paris', 'London']}
df2 = pd.DataFrame(data2)

# Use join() to combine the two DataFrames on the Name column
df_merged = df1.join(df2.set_index('Name'), on='Name')
print(df_merged)

The resulting output (no names match, so City is NaN in every row):

   Name  Age      City
0  John   25       NaN
1  Jane   30       NaN
2  Mike   35       NaN
  • merge() : Used to combine two DataFrames by matching values in one or more key columns, in the style of a database join.

Example:

# Create two DataFrames
data1 = {'Name': ['John', 'Jane', 'Mike'],
         'Age': [25, 30, 35]}
df1 = pd.DataFrame(data1)

data2 = {'Name': ['Emily', 'Tom', 'Kate'],
         'City': ['New York', 'Paris', 'London']}
df2 = pd.DataFrame(data2)

# Use merge() to combine the two DataFrames on the Name column
df_merged = pd.merge(df1, df2, on='Name')
print(df_merged)

The resulting output is empty, because merge() keeps only matching rows by default and the two DataFrames share no names:

Empty DataFrame
Columns: [Name, Age, City]
Index: []

Note that the two functions have different defaults: join() performs a left join, keeping all rows of the left DataFrame, while merge() performs an inner join, keeping only the rows whose keys appear in both DataFrames. Other join types (such as left, right, or outer joins) can be selected with the how parameter. For example, using the merge() function for a left join:

df_merged = pd.merge(df1, df2, on='Name', how='left')

It is also possible to merge DataFrames whose key columns have different names by specifying the left_on and right_on parameters.
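
The how, left_on, and right_on parameters can be sketched together as follows. The homes table and its Person column are hypothetical; indicator=True is a standard merge() option that adds a _merge column recording where each row came from.

```python
import pandas as pd

people = pd.DataFrame({"Name": ["John", "Jane", "Mike"], "Age": [25, 30, 35]})
# Hypothetical lookup table whose key column has a different name.
homes = pd.DataFrame({"Person": ["John", "Kate"], "City": ["New York", "London"]})

# Left join on differently named keys; indicator=True adds a _merge column
# that records whether each row matched.
left = pd.merge(people, homes, left_on="Name", right_on="Person",
                how="left", indicator=True)

# An inner join keeps only the names present in both tables.
inner = pd.merge(people, homes, left_on="Name", right_on="Person", how="inner")

print(left[["Name", "City", "_merge"]])
print(len(inner))
```

Inspecting the _merge column is a quick way to check how many rows failed to match before relying on the merged result.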

  1. Multidimensional analysis

pivot_table() : Used to summarize data in a spreadsheet-style pivot table for multidimensional analysis.

Example:

# Create a DataFrame
data = {'Name': ['John', 'Jane', 'John', 'Jane', 'John', 'Jane'],
        'Subject': ['Math', 'Math', 'Science', 'Science', 'Math', 'Science'],
        'Score': [80, 90, 85, 95, 75, 80]}
df = pd.DataFrame(data)

# Use pivot_table() to pivot the data
pivot_table = df.pivot_table(index='Name', columns='Subject', values='Score', aggfunc='mean')
print(pivot_table)

The resulting output:

Subject  Math  Science
Name                  
Jane     90.0     87.5
John     77.5     85.0

In the example above, index='Name' sets the rows, columns='Subject' sets the columns, values='Score' selects the scores as the values to aggregate, and aggfunc='mean' sets the aggregation function to the mean.

The pivot_table() function summarizes and calculates the data according to the specified row, column and value, and generates a new table, where the row is the name, the column is the subject, and the value is the corresponding average score. In this way, we can easily perform multi-dimensional analysis, such as looking at the average scores of different students on different subjects.

In addition to the average value, the aggfunc parameter can also specify other aggregation functions, such as 'sum', 'count', 'min', 'max', etc., which can be set according to specific needs.
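
As a sketch of the other options, the example below reuses the same scores with aggfunc='max' and then adds overall means with margins=True, a standard pivot_table() parameter. The variable names are illustrative.

```python
import pandas as pd

scores = pd.DataFrame({
    "Name": ["John", "Jane", "John", "Jane", "John", "Jane"],
    "Subject": ["Math", "Math", "Science", "Science", "Math", "Science"],
    "Score": [80, 90, 85, 95, 75, 80],
})

# Highest score per student and subject.
best = scores.pivot_table(index="Name", columns="Subject",
                          values="Score", aggfunc="max")

# margins=True appends an "All" row and column with overall means.
with_totals = scores.pivot_table(index="Name", columns="Subject",
                                 values="Score", aggfunc="mean", margins=True)
print(best)
print(with_totals)
```
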

  1. Data visualization

plot() : Used to draw various types of charts, such as histograms, line charts, scatter plots, etc. It is based on the Matplotlib library and can be called directly in Pandas.

Example:

import pandas as pd
import matplotlib.pyplot as plt

# Create a DataFrame
data = {'Year': [2017, 2018, 2019, 2020],
        'Sales': [100, 150, 200, 180]}
df = pd.DataFrame(data)

# Draw a bar chart
df.plot(x='Year', y='Sales', kind='bar')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.title('Sales by Year')
plt.show()

The output is a bar chart with years on the horizontal axis and sales on the vertical axis. In this example, x='Year' and y='Sales' select the columns to plot, and kind='bar' selects a bar chart.

The same data can be drawn as a line chart by passing kind='line':

# Draw a line chart
df.plot(x='Year', y='Sales', kind='line')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.title('Sales by Year')
plt.show()

hist() : A function used to draw histograms in Pandas. A histogram is a visualization tool used to show the distribution of data, especially for data with continuous variables.

Example:

import pandas as pd
import matplotlib.pyplot as plt

# Create a DataFrame
data = {'Scores': [70, 85, 90, 75, 80, 95, 85, 90, 80, 85]}
df = pd.DataFrame(data)

# Draw a histogram
df.hist(column='Scores', bins=5)
plt.xlabel('Score')
plt.ylabel('Frequency')
plt.title('Score Distribution')
plt.show()

Here column='Scores' selects the column to plot, and bins=5 sets the number of bins. The histogram therefore divides the data into 5 equal-width intervals; the horizontal axis is the score, and the vertical axis is the frequency (the number of values that fall in each interval).

boxplot() : A function used to draw boxplots in Pandas. A boxplot is a visualization tool used to show distributions and outliers in data.

Example:

import pandas as pd
import matplotlib.pyplot as plt

# Create a DataFrame
data = {'Group': ['A', 'A', 'B', 'B', 'C', 'C'],
        'Values': [10, 15, 20, 25, 30, 35]}
df = pd.DataFrame(data)

# Draw a boxplot
df.boxplot(column='Values', by='Group')
plt.xlabel('Group')
plt.ylabel('Values')
plt.title('Boxplot')
plt.show()

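Here column='Values' selects the column to plot, and by='Group' sets the grouping column. The boxplot therefore draws one box per group, with the groups on the horizontal axis and the values on the vertical axis.

Before drawing the boxplot, import pandas and matplotlib.pyplot. Then call boxplot() to draw the plot, and use xlabel(), ylabel(), and title() to set the axis labels and the chart title.

A boxplot summarizes the data with five statistics: the lower whisker, the first quartile (Q1), the median (Q2), the third quartile (Q3), and the upper whisker. The box spans the range from Q1 to Q3, and the line inside it marks the median. By default the whiskers extend to the most extreme values within 1.5 times the interquartile range of the box; points beyond them are outliers, usually drawn as circles or crosses.

Note that boxplot() accepts further parameters for customizing the plot, such as the colors, the fill color, and the whisker length. Choose the parameters that suit your needs. boxplot() can also draw several boxplots at once, for example one per column or one per group.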

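  1. Data processing and optimization

Used together, groupby() and agg() aggregate one or more columns within each group in a single call, so only the summarized result needs to be kept in memory.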

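Pandas has no parameter that learns a memory layout automatically. To reduce memory usage, choose compact data types explicitly, for example by converting repeated strings with astype('category') or by shrinking numeric columns with pd.to_numeric() and its downcast parameter.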

Pandas also provides windowed computations, such as rolling() for calculating the mean over a rolling window.
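
A minimal sketch of a rolling mean, using hypothetical sales figures; window and min_periods are standard rolling() parameters.

```python
import pandas as pd

sales = pd.Series([100, 150, 200, 180, 220])

# Mean over a sliding window of 3 values; the first two results are NaN
# because the window is not yet full.
rolling_mean = sales.rolling(window=3).mean()

# min_periods=1 emits a value as soon as one observation is available.
partial_mean = sales.rolling(window=3, min_periods=1).mean()
print(rolling_mean.tolist())
print(partial_mean.tolist())
```
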

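In addition to the functions above, Pandas has many other commonly used functions, such as:

  1. describe(): shows summary statistics of the data, such as the count, mean, standard deviation, minimum, quartiles, and maximum.

  1. groupby() and agg() can be used together to group the data by a field and aggregate each group.

  1. rolling() computes statistics such as the mean over a rolling window.

  1. drop_duplicates() deletes duplicate rows and returns the unique data.

  1. tail() returns the rows at the end of the data, which is convenient for a quick inspection.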
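
As a short sketch of describe() on hypothetical data: it reports the standard deviation but not the variance, which is available separately through var().

```python
import pandas as pd

df_desc = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [10, 20, 30, 40, 50]})

# describe() summarizes every numeric column.
summary_stats = df_desc.describe()

# The variance is not part of describe(); var() computes the sample variance.
variance_a = df_desc["A"].var()

print(summary_stats)
print(variance_a)
```
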

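When processing data with Pandas, also keep the following points in mind:

  1. Handle missing values: Pandas provides functions such as fillna() for handling missing values, but missing values can still affect aggregation results. Choose a handling method that suits the data.

  1. Mind the data types: Pandas columns can hold many data types, including integers, floats, strings, booleans, datetimes, and categoricals. Choose functions that suit the type of the data being processed.

  1. Mind the index: Pandas works well with labeled, tabular data, but the index also affects processing efficiency. Set the index sensibly, and pay attention to its type and number of levels.

  1. Watch for outliers: Statistics such as the mean and the standard deviation are sensitive to outliers, so outliers in the data can produce misleading results. Check for them and decide how to handle and report them.

  1. Mind performance: Performance matters when processing data at scale. read_csv() options such as chunksize, usecols, and dtype limit how much data is loaded at once, and vectorized operations are usually faster than calling apply() row by row.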
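
The effect of data types on memory can be measured with memory_usage(). The sketch below uses a hypothetical DataFrame with heavily repeated values; actual savings depend on the data and the pandas version.

```python
import pandas as pd

df_mem = pd.DataFrame({
    "city": ["Paris", "London", "Paris", "London"] * 1000,
    "count": [1, 2, 3, 4] * 1000,
})
before = df_mem.memory_usage(deep=True).sum()

# Repeated strings compress well as a categorical column.
df_mem["city"] = df_mem["city"].astype("category")
# downcast picks the smallest integer type that holds the values.
df_mem["count"] = pd.to_numeric(df_mem["count"], downcast="integer")

after = df_mem.memory_usage(deep=True).sum()
print(before, after)
```
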

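In short, processing data with Pandas requires familiarity with its functions and techniques, as well as attention to the various problems in the data, in order to get accurate and efficient results.

Beyond data processing, Pandas is also used in fields such as data mining and machine learning. Some of its applications there are: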

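  1. Classification and regression: Pandas can prepare the feature tables and labels that classification and regression models consume, which makes complex datasets easier to handle.

  1. Clustering: Pandas can organize the input for clustering algorithms such as hierarchical clustering and k-means, and can summarize the resulting clusters with groupby().

  1. Feature engineering: Pandas can derive new features, for example with get_dummies() or cut(); dimensionality reduction itself is done in other libraries such as scikit-learn.

  1. Model training: Pandas can load and preprocess training data as well as the output of machine learning models, which helps when working with large datasets.

  1. Data visualization: Pandas can draw scatter plots, boxplots, and other charts directly; heatmaps are usually drawn from a DataFrame with Matplotlib or seaborn.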
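
As a sketch of the preparation step for a machine learning model, the example below one-hot encodes a categorical column with get_dummies(). The column names and the bought label are hypothetical, and the resulting features and labels would be passed on to a modeling library, which is outside the scope of Pandas.

```python
import pandas as pd

raw = pd.DataFrame({
    "city": ["Paris", "London", "Paris"],
    "age": [30, 35, 40],
    "bought": [1, 0, 1],   # hypothetical target label
})

# One-hot encode the categorical column; numeric columns pass through.
features = pd.get_dummies(raw[["city", "age"]], columns=["city"])
labels = raw["bought"]

print(features.columns.tolist())
```
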

In short, Pandas is a powerful Python library for data cleaning, transformation, analysis, and visualization. It is a great help in both data processing and machine learning work.

Origin blog.csdn.net/anonymous_me/article/details/129532969