Using the pandas library in Python to process Excel table data

Read file

Pandas is a powerful data analysis library that provides a rich set of data structures and analysis tools, including the read_* family of functions for reading files in different formats. Here is a simple example of reading an Excel file with Pandas:

  1. Read Excel file:

    import pandas as pd
    
    # Read the first sheet of the Excel file
    df = pd.read_excel('file_path.xlsx')
    
    # Print the first few rows of the DataFrame
    print(df.head())
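
The other read_* functions are called in much the same way. As a minimal sketch, the snippet below uses read_csv on an in-memory string (the sample data is made up for illustration) so that it runs without any file on disk. read_excel takes a path in the same way, but for .xlsx files it also needs an Excel engine such as openpyxl to be installed.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk
csv_text = "Name,Age,City\nAlice,25,New York\nBob,30,San Francisco\n"

# read_csv accepts a path or any file-like object; StringIO avoids touching the disk
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())
print(df.shape)  # (number of rows, number of columns)
```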
    

View data

Once an Excel file is loaded into a DataFrame, Pandas offers several ways to inspect its contents and data types. Here are some common methods:

  1. View the first few rows of data (or the last few rows):

    Use the head() function to view the first few rows of a DataFrame, and tail() to view the last few. Both default to five rows; pass a number to change how many rows are displayed.

    import pandas as pd
    
    # Read the Excel file
    df = pd.read_excel('file_path.xlsx')
    
    # View the first five rows
    print(df.head(5))
    # View the last five rows
    print(df.tail(5))
    
    
  2. View data types:

    Use the dtypes property to view the data type of each column in the DataFrame.

    # View the data type of each column
    print(df.dtypes)
    
  3. View statistical summary:

    Use the describe() function to generate a statistical summary of the numeric columns, including the count, mean, standard deviation, minimum, 25th, 50th, and 75th percentiles, and maximum.

    # View the statistical summary
    print(df.describe())
    
  4. View unique values for a single column:

    If you want to know the unique values of a column, you can use the unique() function.

    # View the unique values of a column
    unique_values = df['column_name'].unique()
    print(unique_values)
    
  5. View the information of the entire DataFrame:

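    Use the info() function to view the overall information of the DataFrame, including the number of non-null values in each column, the data types, and so on.

    # View information about the entire DataFrame
    # (info() prints its report directly and returns None, so no print() is needed)
    df.info()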
    

These methods help you quickly understand the data loaded from an Excel file, including its structure and data types. Choose whichever one fits what you need to inspect.
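
As a rough sketch of how these inspection calls fit together, the snippet below runs them on a small hand-made DataFrame (hypothetical data standing in for a sheet loaded with read_excel). It also shows one detail that is easy to miss: info() writes its report to standard output, so capturing the text requires passing a buffer.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a sheet loaded with read_excel
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Age": [25, 30, 35, 30],
    "City": ["New York", "San Francisco", "Los Angeles", "New York"],
})

first_two = df.head(2)            # first two rows
last_one = df.tail(1)             # last row
column_types = df.dtypes          # one dtype per column
age_stats = df["Age"].describe()  # count, mean, std, min, quartiles, max
cities = df["City"].unique()      # distinct values in order of first appearance

# info() prints its report and returns None; pass a buffer to capture the text
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()
```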

Data selection

There are several ways to select data from a DataFrame. Here is a simple example of each:

  1. Select columns:

    # Create a simple DataFrame
    import pandas as pd
    
    data = {
        'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles'],
    }
    
    df = pd.DataFrame(data)
    
    # Select a single column by its name
    single_column = df['Name']
    
  2. Select multiple columns:

    # Select multiple columns
    multiple_columns = df[['Name', 'Age']]
    
  3. Select a row:

    # Select a single row by index label
    single_row_by_label = df.loc[0]
    
    # Select a single row by integer position
    single_row_by_integer = df.iloc[0]
    
  4. Select rows with specific criteria:

    # Select the rows that satisfy a condition
    selected_rows = df[df['Age'] > 25]
    
  5. Combine conditions:

    # Select rows using a combination of conditions
    selected_data = df[(df['Age'] > 25) & (df['City'] == 'San Francisco')]
    
  6. Select an element at a specific position:

    # Select an element by row label and column label
    element_by_label = df.at[0, 'Name']
    
    # Select an element by integer row and column positions
    element_by_integer = df.iat[0, 0]
    

These examples demonstrate how to use Pandas to perform simple data selection on a DataFrame. You can use these methods flexibly based on your specific data and needs.
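
With the default index, loc[0] and iloc[0] return the same row, which can hide the difference between the two. The sketch below assumes a DataFrame with made-up index labels (10, 20, 30) to show that loc and at look up by label, while iloc and iat look up by position.

```python
import pandas as pd

# The index labels 10, 20, 30 are hypothetical, chosen so labels differ from positions
df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie"],
        "Age": [25, 30, 35],
        "City": ["New York", "San Francisco", "Los Angeles"],
    },
    index=[10, 20, 30],
)

row_by_label = df.loc[20]     # the row whose index label is 20
row_by_position = df.iloc[0]  # the first row, whatever its label

# loc also takes a boolean condition for rows and a list of names for columns
subset = df.loc[df["Age"] > 25, ["Name", "City"]]

cell_by_label = df.at[30, "Age"]  # single value by row label and column name
cell_by_position = df.iat[0, 0]   # single value by row and column position
```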

Data filtering

In Pandas, data filtering means selecting the rows or columns that meet certain criteria. Here are some common uses of data filtering:

  1. Filter rows based on criteria:

    Set conditions to select rows in a DataFrame that meet the conditions.

    # Create a simple DataFrame
    import pandas as pd
    
    data = {
        'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles'],
    }
    
    df = pd.DataFrame(data)
    
    # Select the rows where age is greater than 30
    filtered_rows = df[df['Age'] > 30]
    
  2. Filter rows using isin method:

    Use the isin method to select rows that contain a specific value in a column.

    # Select the rows whose city is one of the given cities
    selected_cities = df[df['City'].isin(['New York', 'Los Angeles'])]
    
  3. Filter rows based on a combination of multiple criteria:

    Use the logical operators & (AND), | (OR), and ~ (NOT) to combine multiple conditions. Wrap each condition in parentheses.

    # Select the rows where age is greater than 25 and the city is 'San Francisco'
    selected_data = df[(df['Age'] > 25) & (df['City'] == 'San Francisco')]
    
  4. Filter rows based on string criteria:

    Use string methods such as str.contains to filter rows in a text column that contain a specific string.

    # Select the rows whose name contains 'Bob'
    selected_rows = df[df['Name'].str.contains('Bob')]
    
  5. Filter rows based on index labels:

    Use the loc indexer to filter rows based on index labels.

    # Set the 'Name' column as the index
    df.set_index('Name', inplace=True)
    
    # Select the row labeled 'Bob'
    selected_row = df.loc['Bob']
    
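  6. Select columns by name:

    Use column names to select specific columns.

    # 'Name' became the index in the previous step, so restore it as a column first
    # and then select the 'Name' and 'Age' columns
    selected_columns = df.reset_index()[['Name', 'Age']]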
    

These examples demonstrate common uses for data filtering in Pandas. You can use these methods to perform flexible data filtering based on actual conditions and needs.
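
The following sketch extends the filters above with the | and ~ operators and with two details worth knowing. It assumes a small made-up DataFrame in which one name is missing: str.contains gives a missing result for that row, so na=False is passed to treat it as a non-match. It also uses set_index without inplace=True, which returns a new DataFrame and leaves the original columns available for later steps.

```python
import pandas as pd

# Hypothetical sample data; the last row has a missing name on purpose
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None],
    "Age": [25, 30, 35, 40],
    "City": ["New York", "San Francisco", "Los Angeles", "San Francisco"],
})

# NOT: every row whose city is not San Francisco
not_sf = df[~(df["City"] == "San Francisco")]

# OR: rows that are aged 30 or under, or located in Los Angeles
young_or_la = df[(df["Age"] <= 30) | (df["City"] == "Los Angeles")]

# na=False makes rows with a missing name count as non-matches
has_ob = df[df["Name"].str.contains("ob", na=False)]

# set_index without inplace=True returns a new DataFrame and leaves df unchanged
by_name = df.dropna(subset=["Name"]).set_index("Name")
bob_age = by_name.loc["Bob", "Age"]
```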

Create new column

In Pandas, you create a new column by assigning values to a new column name on the DataFrame. The values can come from existing columns directly or from a calculation on them. Here are some common ways to create new columns:

  1. Create new columns using existing columns for calculations:

    # Create a simple DataFrame
    import pandas as pd
    
    data = {
        'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles'],
    }
    
    df = pd.DataFrame(data)
    
    # Use the existing 'Age' column to create a new column 'Age_in_2_years'
    df['Age_in_2_years'] = df['Age'] + 2
    
  2. Create a new column using a function for calculation:

    You can use functions to operate on a column of a DataFrame and store the results in a new column.

    # Define a function that computes the length of a string
    def calculate_name_length(name):
        return len(name)
    
    # Use the function to create the new column 'Name_Length'
    df['Name_Length'] = df['Name'].apply(calculate_name_length)
    
  3. Create new columns using conditional statements:

    You can use a condition on the values of a column to create a new column. The result is a column of True/False values.

    # Use a condition to create the new column 'Is_Adult'
    df['Is_Adult'] = df['Age'] >= 18
    
  4. Create new columns based on multiple columns:

    Perform calculations using data from multiple existing columns and create new columns.

    # Use the 'Age' and 'Name_Length' columns to create the new column 'Combined_Column'
    df['Combined_Column'] = df['Age'] * df['Name_Length']
    
  5. Create a new column using the assign method:

    Use the assign method to chain operations and create multiple new columns at once.

    # Use the assign method to create several new columns
    df = df.assign(Double_Age=df['Age'] * 2, Triple_Age=df['Age'] * 3)
    

These methods provide a variety of flexible ways to create new columns; choose the one that fits your needs. When creating a new column, consider the source of the data, the calculation logic, and the name of the new column.
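
As one possible variation on the approaches above, the sketch below builds the columns with vectorized operations and chained assign calls. The column names Age_Group and Age_Per_Letter are made up for illustration. It uses str.len() in place of apply, numpy.where for a two-way label, and a lambda inside assign so that a column created in the previous call can be used right away.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "San Francisco", "Los Angeles"],
})

result = df.assign(
    # Vectorized string length, equivalent to apply(len) on this column
    Name_Length=df["Name"].str.len(),
    # Two-way label chosen element by element from a condition
    Age_Group=np.where(df["Age"] >= 30, "30+", "under 30"),
).assign(
    # The lambda receives the DataFrame produced by the previous assign call
    Age_Per_Letter=lambda d: d["Age"] / d["Name_Length"],
)

print(result)
```

Because assign returns a new DataFrame, df itself keeps only its original three columns.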

Calculate and summarize data

In Pandas, you can use built-in functions to calculate summary statistics such as the mean, median, and standard deviation. The examples below use the Name/Age/City DataFrame from the previous sections. Here are some common ways to calculate summary data:

  1. Calculate the mean of a column:

    # Compute the mean of the 'Age' column
    mean_age = df['Age'].mean()
    
  2. Calculate the median of a column:

    # Compute the median of the 'Age' column
    median_age = df['Age'].median()
    
  3. Calculate the standard deviation of a column:

    # Compute the standard deviation of the 'Age' column
    std_dev_age = df['Age'].std()
    
  4. Calculate the sum of a column:

    # Compute the sum of the 'Age' column
    sum_age = df['Age'].sum()
    
  5. Calculate the minimum and maximum values of a column:

    # Compute the minimum and maximum of the 'Age' column
    min_age = df['Age'].min()
    max_age = df['Age'].max()
    
  6. Get statistical summary using describe method:

    # Use the describe method to get a statistical summary of the numeric columns
    summary_stats = df.describe()
    
  7. Count the number of unique values:

    # Count the number of unique values in the 'City' column
    unique_cities_count = df['City'].nunique()
    
  8. Count unique values and their occurrences:

    # Count how many times each unique value appears in the 'City' column
    city_counts = df['City'].value_counts()
    

These methods cover the most common summary statistics for a data set. Choose the function that matches the statistic you need.
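
Several of these statistics can also be requested in a single call. The sketch below, on made-up data, uses agg with a list of function names. It also illustrates that std() computes the sample standard deviation by default (ddof=1), and that value_counts(normalize=True) returns proportions instead of counts.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "Age": [25, 30, 35, 30],
    "City": ["New York", "San Francisco", "Los Angeles", "New York"],
})

# Several statistics in one call; the result is a Series indexed by function name
age_summary = df["Age"].agg(["mean", "median", "min", "max", "sum"])

# std() defaults to the sample standard deviation (ddof=1);
# pass ddof=0 for the population standard deviation
sample_std = df["Age"].std()
population_std = df["Age"].std(ddof=0)

# normalize=True turns the counts into proportions of the total
city_share = df["City"].value_counts(normalize=True)

print(age_summary)
print(city_share)
```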

Group statistics

In Pandas, grouped statistics are a powerful analysis tool: you group the data by the values of one or more columns and then compute statistics for each group. The following are some common methods:

  1. Group by a single column and calculate statistics:

    # Create a simple DataFrame
    import pandas as pd
    
    data = {
        'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
        'Value': [10, 20, 30, 15, 25, 18],
    }
    
    df = pd.DataFrame(data)
    
    # Group by the 'Category' column and compute the mean of each group
    group_means = df.groupby('Category')['Value'].mean()
    
  2. Group by multiple columns and calculate statistics:

    # Create a DataFrame with more than one grouping column
    data = {
        'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
        'City': ['X', 'Y', 'X', 'Y', 'X', 'Y'],
        'Value': [10, 20, 30, 15, 25, 18],
    }
    
    df = pd.DataFrame(data)
    
    # Group by the 'Category' and 'City' columns and compute the mean of each group
    group_means = df.groupby(['Category', 'City'])['Value'].mean()
    
  3. Compute multiple statistics simultaneously:

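    # Compute the mean and standard deviation at the same time
    group_stats = df.groupby('Category')['Value'].agg(['mean', 'std'])
    
  4. Apply multiple functions using the agg method:

    # Use the agg method to apply several different statistical functions
    custom_stats = df.groupby('Category')['Value'].agg(['sum', 'mean', 'count'])
    
  5. Apply a custom function using the apply method:

    # Use the apply method to apply a custom function (here, the range of each group)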
    def custom_function(group):
        return group.max() - group.min()
    
    custom_result = df.groupby('Category')['Value'].apply(custom_function)
    
  6. Build a pivot table:

    # Use a pivot table to compute the mean of 'Value' for each combination of 'Category' and 'City'
    pivot_table = df.pivot_table(values='Value', index='Category', columns='City', aggfunc='mean')
    

These methods provide a rich set of grouped statistics and can be customized to different needs. Grouped statistics are very useful for understanding the distribution of data and for comparative analysis.
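
When several statistics are computed per group, naming the output columns can make the result easier to read. The sketch below uses the keyword form of agg on a groupby (named aggregation), where each keyword maps to a (column, function) pair. The output names total, average, and spread are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["A", "B", "A", "B", "A", "B"],
    "Value": [10, 20, 30, 15, 25, 18],
})

# Each keyword becomes an output column: name=(source column, aggregation)
stats = df.groupby("Category").agg(
    total=("Value", "sum"),
    average=("Value", "mean"),
    spread=("Value", lambda s: s.max() - s.min()),  # range within each group
)

# reset_index turns the group labels back into an ordinary column
flat = stats.reset_index()

print(flat)
```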

Origin blog.csdn.net/weixin_74850661/article/details/134773028