Pandas data selection and manipulation
Pandas is a Python library for data analysis and manipulation that provides rich functionality to select, filter, add, delete, and modify data.
1 Select columns and rows
Pandas provides a variety of ways to select rows and columns, depending on the type and structure of the data you wish to obtain.
1.1 Select columns
(1) Use column labels
Pass a single column label, or a list of labels, to the DataFrame's indexing operator [], e.g. df['A'] or df[['A', 'B']].
(2) Use the .loc[] indexer
The .loc[] indexer selects rows and columns by label. For column selection, pass : in the row position to select all rows, e.g. df.loc[:, 'A'].
1.2 Select rows
(1) Use the row index
Use the row index to select one or more rows, with either the .loc[] indexer or the .iloc[] indexer.
(2) Use the .iloc[] indexer
The .iloc[] indexer selects rows and columns by integer position. It differs from .loc[] in that it uses integer positions instead of labels, and its slices exclude the end position, whereas .loc[] slices include the end label.
Sample code:
import pandas as pd
data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data)
# Select a single column
column_A = df['A']
print("Single column 'A':\n", column_A)
# Result:
# Single column 'A':
# 0 1
# 1 2
# 2 3
# Name: A, dtype: int64
# Select multiple columns
columns_AB = df[['A', 'B']]
print("Multiple columns 'A' and 'B':\n", columns_AB)
# Result:
# Multiple columns 'A' and 'B':
# A B
# 0 1 4
# 1 2 5
# 2 3 6
# Select a column with .loc[]
column_A_loc = df.loc[:, 'A']
print("Column 'A' selected with .loc[]:\n", column_A_loc)
# Result:
# Column 'A' selected with .loc[]:
# 0 1
# 1 2
# 2 3
# Name: A, dtype: int64
# Select multiple columns with .loc[]
columns_AB_loc = df.loc[:, ['A', 'B']]
print("Columns 'A' and 'B' selected with .loc[]:\n", columns_AB_loc)
# Result:
# Columns 'A' and 'B' selected with .loc[]:
# A B
# 0 1 4
# 1 2 5
# 2 3 6
# Select a single row with .loc[]
row_0_loc = df.loc[0]
print("Single row selected with .loc[] (index label 0):\n", row_0_loc)
# Result:
# Single row selected with .loc[] (index label 0):
# A 1
# B 4
# C 7
# Name: 0, dtype: int64
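# Select multiple rows with .loc[] (the end label is included)
rows_01_loc = df.loc[0:1]
print("Multiple rows selected with .loc[] (index labels 0 to 1):\n", rows_01_loc)
# Result:
# Multiple rows selected with .loc[] (index labels 0 to 1):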
# A B C
# 0 1 4 7
# 1 2 5 8
# Select a single row with .iloc[]
row_0_iloc = df.iloc[0]
print("Single row selected with .iloc[] (integer position 0):\n", row_0_iloc)
# Result:
# Single row selected with .iloc[] (integer position 0):
# A 1
# B 4
# C 7
# Name: 0, dtype: int64
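# Select multiple rows with .iloc[] (the end position is excluded)
rows_01_iloc = df.iloc[0:2]
print("Multiple rows selected with .iloc[] (integer positions 0 to 1):\n", rows_01_iloc)
# Result:
# Multiple rows selected with .iloc[] (integer positions 0 to 1):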
# A B C
# 0 1 4 7
# 1 2 5 8
# Select rows and columns together
subset = df.loc[0:1, ['A', 'B']]
print("Selected rows and columns:\n", subset)
# Result:
# Selected rows and columns:
# A B
# 0 1 4
# 1 2 5
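With the default integer index used above, labels and positions happen to coincide, which can hide the difference between .loc[] and .iloc[]. The following is a minimal sketch that assumes a hypothetical string index ('x', 'y', 'z') to make the distinction visible:

```python
import pandas as pd

# Hypothetical data: the same values as above, but with string row labels.
df = pd.DataFrame(
    {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]},
    index=['x', 'y', 'z'],
)

# .loc[] slices by label and includes the end label 'y'.
by_label = df.loc['x':'y', ['A', 'B']]

# .iloc[] slices by integer position and excludes the end position 2.
by_position = df.iloc[0:2, 0:2]

# Both selections contain rows 'x' and 'y' and columns 'A' and 'B'.
print(by_label.equals(by_position))
```

Here df.loc[0] would raise a KeyError because 0 is not a label in this index, while df.iloc[0] still returns the first row.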
2 Filter data
In Pandas, you can filter a DataFrame in several ways to keep only the rows that meet given criteria. Here are some common approaches:
2.1 Condition-based filtering
By creating a conditional expression, you can select rows in a DataFrame that satisfy the condition.
import pandas as pd
data = {'A': [1, 2, 3, 4, 5],
'B': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
# Select rows that satisfy a condition, e.g. rows where column 'A' is greater than 3
filtered_data = df[df['A'] > 3]
print(filtered_data)
Output result:
A B
3 4 40
4 5 50
2.2 Using multiple conditions
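You can combine multiple conditions with the element-wise logical operators & (and) and | (or). Wrap each condition in parentheses, because & and | bind more tightly than the comparison operators.
# Select rows that satisfy several conditions at once, e.g. rows where 'A' is greater than 2 and 'B' is less than 40
filtered_data = df[(df['A'] > 2) & (df['B'] < 40)]
print(filtered_data)
Output result:
A B
2 3 30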
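The same pattern works for "or" and for negation. The sketch below reuses the A/B data from section 2.1; the variable names either and not_large are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]})

# OR: rows where 'A' is less than 2 or 'B' is greater than 40.
either = df[(df['A'] < 2) | (df['B'] > 40)]

# NOT: the ~ operator inverts a boolean mask,
# so this keeps rows where 'A' is not greater than 3.
not_large = df[~(df['A'] > 3)]

print(either)
print(not_large)
```

Python's and, or, and not keywords do not work element-wise on a Series, which is why &, |, and ~ are used instead.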
2.3 Using isin() to filter
You can use the isin() method to select rows whose values match any entry in a specified list.
# Select rows whose 'A' value matches one of the given values
filtered_data = df[df['A'].isin([2, 4])]
print(filtered_data)
Output result:
A B
1 2 20
3 4 40
2.4 Using string methods
If your data contains string columns, you can use string methods for filtering.
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40]}
df = pd.DataFrame(data)
# Select rows whose 'Name' contains a given string (case-insensitive)
filtered_data = df[df['Name'].str.contains('b', case=False)]
print(filtered_data)
Output result:
Name Age
1 Bob 30
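String columns in real data often contain missing values, which str.contains() propagates as missing results instead of True or False. A minimal sketch, assuming a hypothetical missing name in the data, uses the na parameter to treat missing entries as non-matches:

```python
import pandas as pd

# Hypothetical data: the third name is missing.
df = pd.DataFrame({'Name': ['Alice', 'Bob', None, 'David'],
                   'Age': [25, 30, 35, 40]})

# na=False makes missing names count as non-matches,
# so the mask contains only True and False.
mask = df['Name'].str.contains('a', case=False, na=False)

filtered_data = df[mask]
print(filtered_data)
```

Without na=False, the mask would contain a missing value for the third row, and indexing with it may raise an error.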
3 Adding, deleting and modifying data
3.1 Add data
(1) Add row
To add a new row to a DataFrame, you typically create the new data item and then append it to the DataFrame. In pandas 2.0 and later this is done with pd.concat(), because the older DataFrame.append method was removed. Make sure to set ignore_index=True to reset the index.
(2) Add column
To add a new column, simply assign a new column name and provide the corresponding data. This allows new columns to be added to the DataFrame to store additional information.
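Besides a list of values, the assigned data can be a single scalar or a Series computed from existing columns. The following sketch illustrates this; the column names 'Country', 'AgeNextYear', and 'IsOver28' are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})

# Assign a scalar: every row receives the same value.
df['Country'] = 'USA'

# Assign a computed Series: values are derived row by row from 'Age'.
df['AgeNextYear'] = df['Age'] + 1

# assign() returns a new DataFrame and leaves df itself unchanged.
df_with_flag = df.assign(IsOver28=df['Age'] > 28)
print(df_with_flag)
```

Direct assignment modifies the DataFrame in place, while assign() is convenient when the original should stay untouched.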
3.2 Delete data
(1) Delete row
Use the drop method to delete the specified rows. Pass the index label (or a list of labels) of the rows to delete; axis=0, which is the default, indicates that rows are deleted.
(2) Delete column
To delete a column, use the drop method with axis=1 and specify the name of the column to delete. This lets you remove unneeded columns from the DataFrame.
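As an alternative to the axis parameter, drop also accepts the index and columns keyword arguments, which some readers find easier to read. A short sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'Los Angeles', 'Chicago']})

# Equivalent to df.drop([0, 2], axis=0): remove the rows labelled 0 and 2.
remaining_rows = df.drop(index=[0, 2])

# Equivalent to df.drop(['Age', 'City'], axis=1): remove two columns.
remaining_cols = df.drop(columns=['Age', 'City'])

print(remaining_rows)
print(remaining_cols)
```

In both cases drop returns a new DataFrame, so the result must be assigned to a variable for the deletion to take effect.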
3.3 Modify data
(1) Modify the value of a specific cell
To modify the value of a specific cell in a DataFrame, use the .loc[] indexer with the row label and column label of that cell and assign the new value.
(2) Update multiple values
To update data in batches, you can usually use conditions to select the rows to update and then assign new values. This can help you update multiple data points at once instead of manually modifying them one by one.
3.4 Code example
import pandas as pd
# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]}
df = pd.DataFrame(data)
# Add a new row (DataFrame.append was removed in pandas 2.0, so use pd.concat)
new_row = pd.DataFrame([{'Name': 'David', 'Age': 40}])
df = pd.concat([df, new_row], ignore_index=True)
# Result:
# Name Age
# 0 Alice 25
# 1 Bob 30
# 2 Charlie 35
# 3 David 40
# Add a new column
df['City'] = ['New York', 'Los Angeles', 'Chicago', 'Houston']
# Result:
# Name Age City
# 0 Alice 25 New York
# 1 Bob 30 Los Angeles
# 2 Charlie 35 Chicago
# 3 David 40 Houston
# Delete a row
df = df.drop(2)  # delete the row with index label 2
# Result:
# Name Age City
# 0 Alice 25 New York
# 1 Bob 30 Los Angeles
# 3 David 40 Houston
# Delete a column
df = df.drop('City', axis=1)  # delete the column named 'City'
# Result:
# Name Age
# 0 Alice 25
# 1 Bob 30
# 3 David 40
# Modify the value of a specific cell
df.loc[1, 'Age'] = 31
# Result:
# Name Age
# 0 Alice 25
# 1 Bob 31
# 3 David 40
# Update multiple values
df.loc[df['Age'] > 30, 'Age'] = 32  # set 'Age' to 32 for every row where it is greater than 30
# Result:
# Name Age
# 0 Alice 25
# 1 Bob 32
# 3 David 32
# Print the final result
print(df)
4 Data sorting
In Pandas, you can use the sort_values() method to sort data in a DataFrame. Here's how to sort by a column in ascending or descending order, and how to sort by multiple columns.
4.1 Sort by column
To sort data by a column, pass its name to the sort_values() method. By default, the data is sorted in ascending order.
Sort ascending: Use sort_values(by='column_name'), where 'column_name' is the name of the column you want to sort by. For example, df.sort_values(by='Age') sorts in ascending order on the 'Age' column.
Sort descending: Use sort_values(by='column_name', ascending=False). For example, df.sort_values(by='Age', ascending=False) sorts in descending order on the 'Age' column.
4.2 Sorting by multiple columns
If you need to sort by multiple columns, provide a list of column names. Rows are sorted by the first column in the list, and ties are broken by the next column in the list.
For example, to sort ascending by the 'City' column and then ascending by the 'Age' column, you can use sort_values(by=['City', 'Age']).
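Each sort column can also have its own direction by passing a list to ascending. The sketch below assumes hypothetical data with repeated cities, so that the second sort key actually affects the order:

```python
import pandas as pd

# Hypothetical data: each city appears twice.
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Age': [25, 30, 35, 40],
                   'City': ['Chicago', 'Houston', 'Chicago', 'Houston']})

# 'City' ascending, then 'Age' descending within each city.
df_sorted = df.sort_values(by=['City', 'Age'], ascending=[True, False])
print(df_sorted)
```

The ascending list must have the same length as the by list, with one flag per column.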
4.3 Reset index
Note that a sorted DataFrame keeps each row's original index label, so the index no longer runs in order. To renumber the index to match the new sort order, use reset_index(drop=True), which drops the old index and creates a new integer index.
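To see the effect, it helps to look at the index before and after the reset. This is a small sketch with illustrative data; it also shows the ignore_index=True option of sort_values(), available in pandas 1.0 and later, which renumbers the index in the same call:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [35, 25, 30]})

# After sorting, each row keeps its original index label.
df_sorted = df.sort_values(by='Age')
print(list(df_sorted.index))

# reset_index(drop=True) discards the old labels and renumbers from 0.
df_reset = df_sorted.reset_index(drop=True)
print(list(df_reset.index))

# Sorting and renumbering in one step.
df_one_step = df.sort_values(by='Age', ignore_index=True)
print(df_one_step.equals(df_reset))
```

Without drop=True, reset_index() keeps the old labels as a new column named 'index'.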
4.4 Code example
import pandas as pd
# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']}
df = pd.DataFrame(data)
# Sort by a column
# Ascending order is the default
df_sorted = df.sort_values(by='Age')
# Sorted by 'Age' in ascending order
print("Sorted by 'Age' in ascending order:\n", df_sorted)
# Sort by 'Age' in descending order
df_sorted_desc = df.sort_values(by='Age', ascending=False)
print("\nSorted by 'Age' in descending order:\n", df_sorted_desc)
# Sort by multiple columns
# Sort by 'City' ascending, then by 'Age' ascending
df_multi_sorted = df.sort_values(by=['City', 'Age'])
print("\nSorted by 'City' and then 'Age' in ascending order:\n", df_multi_sorted)
# Reset the index
df_multi_sorted = df_multi_sorted.reset_index(drop=True)
print("\nDataFrame after resetting the index:\n", df_multi_sorted)
This example demonstrates how to sort data by columns in Pandas, including ascending and descending sorts and sorting by multiple columns. You can also use the reset_index() method to reset the index of a sorted DataFrame.