[100 days proficient in Python] Day55: Python data analysis_Pandas data selection and common operations

Table of contents

Pandas data selection and manipulation

1 Select columns and rows

2 Filter data

3 Adding, deleting and modifying data

4 Data sorting


Pandas data selection and manipulation

        Pandas is a Python library for data analysis and manipulation that provides rich functionality to select, filter, add, delete, and modify data.

1 Select columns and rows

Pandas provides a variety of ways to select rows and columns, depending on the type and structure of the data you wish to obtain.

1.1 Select columns

(1) Use column labels

Use column labels to select one or more columns. Pass a single label, or a list of labels, to the DataFrame's indexing operator [].

(2) Use the .loc[] method

The .loc[] method selects rows and columns by label. To select columns only, use : in the row position to keep all rows.

1.2 Select row

(1) Use row index

Use the row index to select one or more rows. You can use either the .loc[] method or the .iloc[] method.

(2) Use the .iloc[] method

The .iloc[] method selects rows and columns by integer position. It differs from the .loc[] method in that it uses integer positions instead of labels, and its slices exclude the stop position, whereas .loc[] label slices include both endpoints.

Sample code:

import pandas as pd
 
data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data)
 
# Select a single column
column_A = df['A']
print("Single column 'A':\n", column_A)
# Result:
# Single column 'A':
# 0    1
# 1    2
# 2    3
# Name: A, dtype: int64

# Select multiple columns
columns_AB = df[['A', 'B']]
print("Multiple columns 'A' and 'B':\n", columns_AB)
# Result:
# Multiple columns 'A' and 'B':
#    A  B
# 0  1  4
# 1  2  5
# 2  3  6

# Select a column with .loc[]
column_A_loc = df.loc[:, 'A']
print("Column 'A' selected with .loc[]:\n", column_A_loc)
# Result:
# Column 'A' selected with .loc[]:
# 0    1
# 1    2
# 2    3
# Name: A, dtype: int64

# Select multiple columns with .loc[]
columns_AB_loc = df.loc[:, ['A', 'B']]
print("Columns 'A' and 'B' selected with .loc[]:\n", columns_AB_loc)
# Result:
# Columns 'A' and 'B' selected with .loc[]:
#    A  B
# 0  1  4
# 1  2  5
# 2  3  6

# Select a single row with .loc[]
row_0_loc = df.loc[0]
print("Single row selected with .loc[] (label 0):\n", row_0_loc)
# Result:
# Single row selected with .loc[] (label 0):
# A    1
# B    4
# C    7
# Name: 0, dtype: int64

# Select multiple rows with .loc[] (the label slice 0:1 includes both 0 and 1)
rows_01_loc = df.loc[0:1]
print("Multiple rows selected with .loc[] (labels 0 to 1):\n", rows_01_loc)
# Result:
# Multiple rows selected with .loc[] (labels 0 to 1):
#    A  B  C
# 0  1  4  7
# 1  2  5  8

# Select a single row with .iloc[]
row_0_iloc = df.iloc[0]
print("Single row selected with .iloc[] (integer position 0):\n", row_0_iloc)
# Result:
# Single row selected with .iloc[] (integer position 0):
# A    1
# B    4
# C    7
# Name: 0, dtype: int64

# Select multiple rows with .iloc[] (the position slice 0:2 excludes position 2)
rows_01_iloc = df.iloc[0:2]
print("Multiple rows selected with .iloc[] (integer positions 0 to 1):\n", rows_01_iloc)
# Result:
# Multiple rows selected with .iloc[] (integer positions 0 to 1):
#    A  B  C
# 0  1  4  7
# 1  2  5  8

# Select rows and columns together
subset = df.loc[0:1, ['A', 'B']]
print("Specific rows and columns:\n", subset)
# Result:
# Specific rows and columns:
#    A  B
# 0  1  4
# 1  2  5
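
The sample above uses the default integer index, where labels and positions happen to coincide, so .loc[] and .iloc[] look almost interchangeable. The following minimal sketch uses a DataFrame with string row labels to make the difference visible. The name df_labeled and the labels 'x', 'y', 'z' are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical data with string row labels (not part of the original example)
df_labeled = pd.DataFrame(
    {'A': [1, 2, 3], 'B': [4, 5, 6]},
    index=['x', 'y', 'z'],
)

# .loc[] selects by label; a label slice includes both endpoints ('x' and 'y')
by_label = df_labeled.loc['x':'y', 'A']

# .iloc[] selects by integer position; the stop position (2) is excluded
by_position = df_labeled.iloc[0:2, 0]

print(by_label.tolist())     # [1, 2]
print(by_position.tolist())  # [1, 2]
```

Both selections return the same two values, but one is addressed by label and the other by position.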

2 Filter data

        In Pandas, you can filter a DataFrame in several ways to keep only the rows that meet given criteria. Here are some common approaches with examples:

2.1 Condition-based filtering

By creating a conditional expression, you can select rows in a DataFrame that satisfy the condition.

import pandas as pd

data = {'A': [1, 2, 3, 4, 5],
        'B': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Select rows that satisfy a condition, e.g. rows where column 'A' is greater than 3
filtered_data = df[df['A'] > 3]
print(filtered_data)

Output result:

   A   B
3  4  40
4  5  50

2.2 Using multiple conditions

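You can combine multiple conditions with the logical operators & (and) and | (or). Wrap each condition in parentheses, because & and | bind more tightly than comparison operators such as > and <.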

# Select rows that satisfy several conditions at once,
# e.g. rows where column 'A' is greater than 2 and column 'B' is less than 40
filtered_data = df[(df['A'] > 2) & (df['B'] < 40)]
print(filtered_data)

Output result:

   A   B
2  3  30
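
The example above covers & only. As a minimal sketch of the | (or) operator, the snippet below rebuilds the same data under the assumed name df_demo and keeps rows where at least one of two conditions holds. The names df_demo, is_low, and is_high are illustrative, not from the original.

```python
import pandas as pd

# Same values as the example data above, under an illustrative name
df_demo = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                        'B': [10, 20, 30, 40, 50]})

# Each condition produces a boolean Series (a mask)
is_low = df_demo['A'] < 2
is_high = df_demo['B'] > 40

# | keeps rows where at least one mask is True
either = df_demo[is_low | is_high]
print(either)
```

Storing each condition in a named mask first can make longer filters easier to read than one long bracketed expression.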

2.3 Use isin() to filter

You can use the isin() method to select rows whose value in a column matches any value in a given list.

# Select rows whose value in column 'A' matches one of the listed values
filtered_data = df[df['A'].isin([2, 4])]
print(filtered_data)

Output result:

   A   B
1  2  20
3  4  40
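
The inverse filter, keeping rows whose value is not in the list, can be sketched by negating the isin() mask with the ~ operator. The name df_demo is an assumption used for illustration; the values mirror the data above.

```python
import pandas as pd

# Same values as the example data above, under an illustrative name
df_demo = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                        'B': [10, 20, 30, 40, 50]})

# ~ flips every True/False in the mask, so this keeps rows NOT matching 2 or 4
excluded = df_demo[~df_demo['A'].isin([2, 4])]
print(excluded)
```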

2.4 Using string methods

If your data contains string columns, you can use string methods for filtering.

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40]}
df = pd.DataFrame(data)

# Select rows whose 'Name' contains a given substring (case-insensitive)
filtered_data = df[df['Name'].str.contains('b', case=False)]
print(filtered_data)

Output result:

   Name  Age
1   Bob   30
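
Real string columns often contain missing values, and a mask that contains missing entries may raise an error when used for filtering. The sketch below shows one way to handle this with the na=False argument, plus a prefix match with str.startswith(). The data and the names df_names, has_a, and starts_with_d are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical data with one missing name
df_names = pd.DataFrame({'Name': ['Alice', 'Bob', None, 'David'],
                         'Age': [25, 30, 35, 40]})

# na=False treats missing names as non-matches, so the mask is purely boolean
has_a = df_names[df_names['Name'].str.contains('a', case=False, na=False)]

# str.startswith() matches a literal prefix (case-sensitive)
starts_with_d = df_names[df_names['Name'].str.startswith('D', na=False)]

print(has_a)
print(starts_with_d)
```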

3 Adding, deleting and modifying data

3.1 Add data

(1) Add row

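        To add a new row to a DataFrame, you typically build the new row as a one-row DataFrame and concatenate it onto the existing one with pd.concat(). Set ignore_index=True so that the result gets a fresh 0..n-1 index. Older tutorials use the DataFrame.append() method for this, but it was deprecated in pandas 1.4 and removed in pandas 2.0.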

(2) Add column

        To add a new column, simply assign a new column name and provide the corresponding data. This allows new columns to be added to the DataFrame to store additional information.
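
Besides assigning a list, a new column can be filled from a single value or computed from existing columns. The following is a minimal sketch under assumed data; the names df_people, Country, AgeNextYear, and IsAdult are hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration
df_people = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# A scalar is broadcast to every row
df_people['Country'] = 'USA'

# A new column can be computed from an existing one
df_people['AgeNextYear'] = df_people['Age'] + 1

# assign() returns a new DataFrame and leaves df_people unchanged
with_flag = df_people.assign(IsAdult=df_people['Age'] >= 18)

print(df_people)
print(with_flag)
```

Direct assignment modifies the DataFrame in place, while assign() is convenient when the original should stay untouched.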

3.2 Delete data

(1) Delete row

        Use the drop() method to delete rows. Pass the index label (or a list of labels) of the rows to delete; axis=0, the default, indicates that rows are being dropped.

(2) Delete column

        To delete a column, use the drop() method with axis=1 and pass the name of the column to delete. This removes unneeded columns from the DataFrame. In both cases drop() returns a new DataFrame, so assign the result back if you want to keep the change.
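
As an alternative to the axis argument, drop() also accepts the index= and columns= keywords, which name the axis explicitly. The sketch below uses assumed data; the names df_people, without_rows, and without_cols are illustrative.

```python
import pandas as pd

# Hypothetical data for illustration
df_people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                          'Age': [25, 30, 35],
                          'City': ['New York', 'Los Angeles', 'Chicago']})

# index= drops rows by label; columns= drops columns by name
without_rows = df_people.drop(index=[0, 2])
without_cols = df_people.drop(columns=['City'])

# drop() returns a new DataFrame; df_people itself still has 3 rows and 3 columns
print(without_rows)
print(without_cols.columns.tolist())
print(df_people.shape)
```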

3.3 Modify data

(1) Modify the value of a specific cell

        To modify the value of a specific cell in a DataFrame, use the .loc[] method with the row label and column label of that cell, and assign the new value.

(2) Update multiple values

        To update data in batches, you can usually use conditions to select the rows to update and then assign new values. This can help you update multiple data points at once instead of manually modifying them one by one.

3.4 Code example

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35]}
df = pd.DataFrame(data)

# Add a new row
# DataFrame.append() was removed in pandas 2.0, so use pd.concat() instead
new_row = pd.DataFrame([{'Name': 'David', 'Age': 40}])
df = pd.concat([df, new_row], ignore_index=True)
# Result:
#       Name  Age
# 0    Alice   25
# 1      Bob   30
# 2  Charlie   35
# 3    David   40

# Add a new column (one value per row)
df['City'] = ['New York', 'Los Angeles', 'Chicago', 'Houston']
# Result:
#       Name  Age         City
# 0    Alice   25     New York
# 1      Bob   30  Los Angeles
# 2  Charlie   35      Chicago
# 3    David   40      Houston

# Delete a row
df = df.drop(2)  # drop the row with index label 2
# Result:
#     Name  Age         City
# 0  Alice   25     New York
# 1    Bob   30  Los Angeles
# 3  David   40      Houston

# Delete a column
df = df.drop('City', axis=1)  # drop the column named 'City'
# Result:
#     Name  Age
# 0  Alice   25
# 1    Bob   30
# 3  David   40

# Modify the value of a specific cell
df.loc[1, 'Age'] = 31
# Result:
#     Name  Age
# 0  Alice   25
# 1    Bob   31
# 3  David   40

# Update multiple values
df.loc[df['Age'] > 30, 'Age'] = 32  # set Age to 32 in every row where Age is greater than 30
# Result:
#     Name  Age
# 0  Alice   25
# 1    Bob   32
# 3  David   32

# Print the final result
print(df)

4 Data sorting

        In Pandas, you can use the sort_values() method to sort the data in a DataFrame. This section shows how to sort by a single column in ascending or descending order, and how to sort by multiple columns.

4.1 Sort by column

To sort data by a column, pass the column name to the sort_values() method. By default, the data is sorted in ascending order.

  • Sort ascending: Use sort_values(by='column_name'), where 'column_name' is the name of the column you want to sort by. For example, df.sort_values(by='Age') sorts in ascending order on the 'Age' column.

  • Sort descending: Use sort_values(by='column_name', ascending=False), where 'column_name' is the name of the column you want to sort by. For example, df.sort_values(by='Age', ascending=False) sorts in descending order on the 'Age' column.

4.2 Sorting by multiple columns

        If you need to sort by multiple columns, pass a list of column names. Rows are ordered by the first column in the list, and rows that tie on that column are then ordered by the next column in the list.

        For example, to sort ascending by the 'City' column and then ascending by the 'Age' column, you can use sort_values(by=['City', 'Age']).
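
Each column in the list can also have its own direction by passing a list to ascending. The sketch below sorts by 'City' ascending and then by 'Age' descending within each city. The data and the names df_cities and mixed are assumptions chosen so that the cities contain ties.

```python
import pandas as pd

# Hypothetical data in which each city appears twice
df_cities = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                          'Age': [25, 30, 35, 40],
                          'City': ['Chicago', 'Boston', 'Chicago', 'Boston']})

# One ascending flag per column: City A-Z, then Age from highest to lowest
mixed = df_cities.sort_values(by=['City', 'Age'], ascending=[True, False])
print(mixed)
```

The ascending list must have the same length as the by list.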

4.3 Reset index

        Note that sorting moves each row together with its original index label, so the index of a sorted DataFrame no longer runs in order. If you want the index to match the new row order, use the reset_index(drop=True) method to drop the old index and create a new integer index.
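
The effect on the index can be sketched as follows. The data and variable names are assumptions for illustration. The last call uses the ignore_index=True argument of sort_values(), which is assumed to be available (it was added in pandas 1.0) and gives the same result in one step.

```python
import pandas as pd

# Hypothetical data that is not already sorted by Age
df_ages = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                        'Age': [35, 25, 30]})

# After sorting, each row keeps its original index label
sorted_keep = df_ages.sort_values(by='Age')
print(sorted_keep.index.tolist())   # [1, 2, 0]

# reset_index(drop=True) replaces the labels with 0, 1, 2
sorted_reset = sorted_keep.reset_index(drop=True)
print(sorted_reset.index.tolist())  # [0, 1, 2]

# Sorting and renumbering in a single call
sorted_direct = df_ages.sort_values(by='Age', ignore_index=True)
print(sorted_direct.index.tolist())  # [0, 1, 2]
```

Without drop=True, reset_index() keeps the old index as an extra column named 'index'.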

4.4 Code example 

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']}
df = pd.DataFrame(data)

# Sort by a single column
# Ascending order is the default
df_sorted = df.sort_values(by='Age')
print("Sorted by 'Age' in ascending order:\n", df_sorted)

# Sort by 'Age' in descending order
df_sorted_desc = df.sort_values(by='Age', ascending=False)
print("\nSorted by 'Age' in descending order:\n", df_sorted_desc)

# Sort by multiple columns
# Sort by 'City' ascending first, then by 'Age' ascending
df_multi_sorted = df.sort_values(by=['City', 'Age'])
print("\nSorted by 'City' and 'Age' in ascending order:\n", df_multi_sorted)

# Reset the index
df_multi_sorted = df_multi_sorted.reset_index(drop=True)
print("\nDataFrame after resetting the index:\n", df_multi_sorted)

This example demonstrates how to sort data by columns in Pandas, including ascending and descending sorts and sorting by multiple columns. It also uses the reset_index() method to reset the index of a sorted DataFrame.


Origin blog.csdn.net/qq_35831906/article/details/132700706