Pandas library

Pandas is a widely used Python library for data processing and analysis. It provides efficient, flexible, and easy-to-use data structures and analysis tools. Its main concepts and capabilities are:

1. Series: A Series is a one-dimensional labeled array, similar to a single column of a table. It can hold any data type, and its labels (the index) are used to access and manipulate the data.

2. DataFrame: A DataFrame is a two-dimensional tabular data structure, similar to a spreadsheet or a SQL table. It consists of rows and columns, and each column can have a different data type. A DataFrame can be created from many data sources, such as CSV files, Excel files, and databases.

3. Index: An Index holds the labels used to identify and access data in Pandas. Labels can be integers, strings, or other data types. If no index is specified, a Series or DataFrame gets a default integer index (0, 1, 2, ...); a custom index can also be supplied.

4. Data selection and filtering: Pandas provides flexible ways to select, filter, and manipulate data. Specific rows and columns can be selected by label, by position, or by boolean condition.

5. Missing data handling: Pandas can detect, drop, or replace missing values in the data.

6. Data aggregation and grouping: Pandas can summarize data through grouping and aggregation operations. It supports common statistical functions such as sum, mean, maximum, and minimum.

7. Data sorting and ranking: Pandas can sort data by one or more columns and assign a rank to each element.

8. Data merging and joining: Pandas can combine multiple DataFrame objects, either by concatenating them along rows or columns or by joining them on key columns.

9. Time series data processing: Pandas provides extensive support for processing time series data, including operations such as date range generation, timestamp indexing, and resampling.
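The sections below focus on DataFrame operations, so here is a minimal sketch of the Series, Index, and time series ideas from the list above. The labels, values, and dates are made up for illustration.

```python
import pandas as pd

# A Series with a custom string index (labels are illustrative)
ages = pd.Series([25, 30, 35], index=["alice", "bob", "charlie"])
bob_age = ages["bob"]        # access by label
first_age = ages.iloc[0]     # access by position

# Without an explicit index, Pandas assigns a default integer index 0..n-1
default_index = list(pd.Series([10, 20, 30]).index)

# A small daily time series indexed by timestamps
dates = pd.date_range("2024-01-01", periods=6, freq="D")
daily = pd.Series([1, 2, 3, 4, 5, 6], index=dates)

# Resample the daily values into 3-day bins and sum each bin
three_day_totals = daily.resample("3D").sum()
print(three_day_totals)
```

The resampled Series has one entry per 3-day bin, labeled by the bin's start date.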

Common operations

Create DataFrame

import pandas as pd

# Create an empty DataFrame
df = pd.DataFrame()

# Create a DataFrame from a list of rows
data = [['Alice', 25], ['Bob', 30], ['Charlie', 35]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

# Create a DataFrame from a dictionary (keys become column names)
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
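A DataFrame can also be loaded from a CSV source. The sketch below reads CSV text from an in-memory buffer so that it runs without a file on disk; with a real file, a path would be passed to `pd.read_csv` instead. The CSV content is an assumption that mirrors the sample data above.

```python
import io

import pandas as pd

# In-memory CSV text standing in for a file on disk (illustrative data)
csv_text = "Name,Age\nAlice,25\nBob,30\nCharlie,35\n"

# read_csv accepts a path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))

# Use the Name column as a custom index instead of the default 0..n-1
df_indexed = df_csv.set_index("Name")
print(df_indexed.loc["Bob", "Age"])
```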

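View data

# Show the first rows of the DataFrame (5 by default)
df.head()

# Show the last rows of the DataFrame (5 by default)
df.tail()

# Show the column names
df.columns

# Show the index
df.index

# Show summary statistics (numeric columns by default)
df.describe()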
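As a small runnable illustration of these inspection calls, the sketch below rebuilds the sample DataFrame and stores each result in a variable. The `shape` attribute is not mentioned above; it is added here as a quick way to check the number of rows and columns.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]})

# (number of rows, number of columns)
n_rows, n_cols = df.shape

# head(n) returns the first n rows instead of the default 5
first_two = df.head(2)

# describe() on a numeric column returns count, mean, std, min, quartiles, max
age_stats = df["Age"].describe()
print(age_stats)
```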

Data Selection and Filtering

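# Select a single column (returns a Series)
df['Name']

# Select multiple columns (returns a DataFrame)
df[['Name', 'Age']]

# Select rows that satisfy a condition
df[df['Age'] > 30]

# Combine conditions with logical operators (wrap each condition in parentheses)
df[(df['Age'] > 25) & (df['Age'] < 35)]

# Select rows whose value is in a given list, using isin()
df[df['Name'].isin(['Alice', 'Bob'])]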
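Selection by label and by position is usually done with the `loc` and `iloc` indexers. The following is a minimal sketch; the index labels `"a"`, `"b"`, `"c"` are made up so that labels and positions are easy to tell apart.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]},
    index=["a", "b", "c"],  # illustrative custom labels
)

# loc selects by label: row "b", column "Age"
by_label = df.loc["b", "Age"]

# iloc selects by integer position: first row, first column
by_position = df.iloc[0, 0]

# Label slices include the end label; position slices exclude the end position
label_slice = df.loc["a":"c"]
position_slice = df.iloc[0:2]

# A boolean condition and a column label can be combined in loc
older_names = df.loc[df["Age"] >= 30, "Name"]
print(older_names)
```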

Data sorting and ranking

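# Sort by the values of one column (ascending by default)
df.sort_values('Age')

# Sort by multiple columns (ties in Age are broken by Name)
df.sort_values(['Age', 'Name'])

# Rank the values of a column (with ascending=False, the largest Age gets rank 1)
df['Rank'] = df['Age'].rank(ascending=False)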
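Sorting direction can be set per column, and `rank` handles ties according to its `method` parameter. The sketch below uses a slightly extended sample (the row for "Dana" and the repeated age are assumptions added to produce a tie).

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie", "Dana"], "Age": [30, 25, 30, 35]}
)

# Sort by Age descending, then by Name ascending to break ties
sorted_df = df.sort_values(["Age", "Name"], ascending=[False, True])

# Default method="average": tied values share the mean of their ranks
avg_rank = df["Age"].rank(ascending=False)

# method="min": tied values all get the lowest rank in the tie
min_rank = df["Age"].rank(ascending=False, method="min")

print(sorted_df)
print(avg_rank.tolist(), min_rank.tolist())
```

Note that `sort_values` returns a new DataFrame and leaves `df` unchanged unless the result is assigned back.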

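Missing data handling

# Detect missing values (returns a boolean DataFrame of the same shape)
df.isnull()

# Drop rows that contain missing values (returns a new DataFrame)
df.dropna()

# Replace missing values with a chosen fill value (0 here)
df.fillna(0)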
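The sample DataFrame used so far has no missing values, so the calls above have nothing to act on. Here is a minimal sketch with one missing age introduced on purpose; filling with the column mean is just one possible choice.

```python
import pandas as pd

# One Age value is deliberately missing (NaN) for illustration
df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, float("nan"), 35]}
)

# Count missing values per column
missing_per_column = df.isnull().sum()

# Drop rows with any missing value; df itself is not modified
dropped = df.dropna()

# Fill missing ages with the mean of the non-missing ages
filled = df.fillna({"Age": df["Age"].mean()})

print(missing_per_column)
print(filled)
```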

Data Aggregation and Grouping

# Sum a column
df['Age'].sum()

# Compute the mean of a column
df['Age'].mean()

# Group by one column and compute the mean of another within each group
df.groupby('Name')['Age'].mean()
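Grouping is more informative when several rows share a group key. The sketch below assumes a hypothetical `City` column (not part of the sample data above) and computes several statistics per group in one call.

```python
import pandas as pd

# "City" is a made-up grouping column for illustration
df = pd.DataFrame(
    {
        "City": ["Paris", "Paris", "Rome", "Rome", "Rome"],
        "Age": [25, 35, 30, 40, 50],
    }
)

# Several aggregations at once; the result has one row per city
summary = df.groupby("City")["Age"].agg(["count", "mean", "min", "max"])

# Named aggregation gives the output columns explicit names
named = df.groupby("City").agg(mean_age=("Age", "mean"), oldest=("Age", "max"))

print(summary)
print(named)
```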

Data Merging and Joining

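# df1 and df2 stand for two existing DataFrames

# Concatenate side by side (along columns)
pd.concat([df1, df2], axis=1)

# Concatenate by stacking rows
pd.concat([df1, df2], axis=0)

# Join on a single key column
pd.merge(df1, df2, on='key')

# Join on multiple key columns
pd.merge(df1, df2, on=['key1', 'key2'])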
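The snippets above leave `df1` and `df2` undefined. The following sketch defines two small DataFrames with assumed column names (`key`, `left_val`, `right_val`) and shows how the `how` parameter of `pd.merge` controls which keys are kept.

```python
import pandas as pd

# Two illustrative DataFrames that share the "key" column
df1 = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
df2 = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

# Default how="inner": keep only keys present in both frames
inner = pd.merge(df1, df2, on="key")

# how="left": keep every key from df1; unmatched right values become NaN
left = pd.merge(df1, df2, on="key", how="left")

# how="outer": keep keys from either frame
outer = pd.merge(df1, df2, on="key", how="outer")

# Stack rows and rebuild a clean 0..n-1 index
stacked = pd.concat([df1, df1], axis=0, ignore_index=True)

print(inner)
print(left)
```

Without `ignore_index=True`, the stacked result would keep the original index values, so labels 0, 1, 2 would each appear twice.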
