Pandas library for Python data processing and analysis

Introduction

  • Pandas (Python Data Analysis Library) is a popular third-party Python library and one of the indispensable tools for data processing and data analysis.
  • It provides efficient analysis methods and flexible, high-performance data structures. Compared with other data processing libraries, Pandas is better suited to relational or labeled data, and it also performs well in time series analysis.
  • Whenever you need to manipulate, clean, transform, or analyze data, Pandas is usually a very useful tool.

Core data types

Series

  • Series: a one-dimensional data structure in Pandas, similar to a one-dimensional array or list.

    A Series can store any data type, and each element has a label associated with it, called an index. Indexes help label and name data, making data access more convenient and intuitive.

    Unlike traditional arrays and lists, Pandas indexes can be of any data type, including integers, strings, dates, etc., and the indexes of different elements can also be the same.

    When creating a Series, you can name each element by specifying an index so that the elements can be accessed and manipulated by index.

    When accessing elements in a Series, you also need to use indexes to specify the location to be accessed.

  • Create a Series object:

    pandas.Series(data=None, index=None, dtype: Dtype = None, name=None, copy: bool=False, 
                  fastpath: bool = False)
    
    • data: Specifies the data in the Series, which can be a list, array, dictionary, scalar value, etc.

    • index: Specifies the index of the Series, used to identify and access data.

      Indexes can be lists, arrays, range objects, or other Series

      If no index is specified explicitly, Pandas will automatically generate a default integer index

    • dtype: Specify the data type of the data in the Series. If not specified, Pandas will attempt to automatically infer the data type

    • name: Specified Series name

    • copy: Default is False. If set to True, the data is copied instead of using a reference to the original data.

  • Example

    import pandas as pd
    import numpy as np
    
    # Create a Series (default integer index)
    data = pd.Series([1, np.nan, 6])
    # Create a Series (custom index)
    data = pd.Series([1, np.nan, 6], index=[0, 3, 4])
    # Create a Series from a dictionary (keys become the index)
    data = pd.Series({0: 1, 3: np.nan, 4: 6})
    print(data)
    
    # Output
    0    1.0
    3    NaN
    4    6.0
    dtype: float64
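The label-based access described earlier can be sketched briefly; the labels and values below are invented for illustration:

```python
import pandas as pd

# A Series whose index is a set of string labels
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Read a single element by its label
print(s['b'])                      # 20
# Read several elements at once with a list of labels
print(s[['a', 'c']].tolist())      # [10, 30]
# Labels also work for assignment
s['a'] = 99
print(s['a'])                      # 99
```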
    

DataFrame

  • DataFrame: It is a two-dimensional data structure in Pandas, similar to an Excel table or SQL table, consisting of rows and columns, and can store different types of data.

    DataFrame can be viewed as a collection of Series objects, each Series object represents a column of data

    In DataFrame, the role of index is even more important. In addition to specifying an index for each row and column, DataFrame also supports multi-level indexes, that is, multiple indexes can be specified for rows and columns at the same time. This provides more flexibility and functionality for processing multidimensional data.

    Indexes make it easy to select a specific number of rows and columns from a DataFrame, select any subset of data by specifying row and column indexes, or filter data that meets specific criteria by using conditional expressions.

  • Create a DataFrame object:

    pandas.DataFrame(data=None, index=None, columns: Axes = None, dtype: Dtype = None, copy: bool = False)
    
    • data: Specifies the data in the DataFrame, which can be a list, array, dictionary, another DataFrame, etc.

    • index: Specifies the row index of the DataFrame, used to identify and access rows.

      Indexes can be lists, arrays, range objects, or other Series

      If no index is specified explicitly, Pandas will automatically generate a default integer index

    • dtype: Specifies the data type of the data in the DataFrame. If not specified, Pandas will attempt to infer the data types automatically

    • columns: Specify the column labels of the DataFrame, used to identify and access the columns.

      Column labels can be lists, arrays, range objects, scalar values, or other Series.

      If no column labels are specified explicitly, Pandas will automatically generate default integer column labels.

    • copy: Default is False. If set to True, the data is copied instead of using a reference to the original data.

  • Example:

    import pandas as pd
     
    data = {'name': ['John', 'Emma', 'Mike', 'Lisa'],
            'age': [28, 24, 32, 35],
            'city': ['New York', 'London', 'Paris', 'Tokyo']}
    df1 = pd.DataFrame(data)
    
    # Specify the DataFrame's column labels via columns
    data = [[1, 'Bob', 24, 'American'], [2, 'Nancy', 23, 'Australia'], [3, 'Lili', 22, 'China'],
            [4, 'Leo', 27, 'M78'], [5, 'David', 24, 'moon']]
    df2 = pd.DataFrame(data, columns=['serial', 'name', 'age', 'from'])
    
    # Custom row index
    df3 = pd.DataFrame(data, columns=['serial', 'name', 'age', 'from'], index=['a', 'b', 'c', 'd', 'e'])
    
    # df1 output
       name  age      city
    0  John   28  New York
    1  Emma   24    London
    2  Mike   32     Paris
    3  Lisa   35     Tokyo
    # df3 output
       serial   name  age       from
    a       1    Bob   24   American
    b       2  Nancy   23  Australia
    c       3   Lili   22      China
    d       4    Leo   27        M78
    e       5  David   24       moon
    

Commonly used functions and methods

Data import and export

Pandas can import data from a variety of data sources, including CSV, Excel, SQL databases, JSON, and more, and can export data to these formats.

  • pandas.read_csv(): Import data from a CSV file and return a DataFrame object (df)

    Parameter Description:

    • filepath_or_buffer : CSV file path or file object.
    • sep: optional, separator, default is comma
    • header: optional, specify which row is used as the column name, the default is the first row
    • index_col: optional, specifies which column to use as the index
  • pandas.read_excel(): Import data from an Excel file and return a DataFrame object (df)

    Parameter Description:

    • io : Excel file path, file object or URL
    • sheet_name: optional, worksheet name
    • header: optional, specify which row is used as the column name, the default is the first row
  • df.to_csv() : Export data to a CSV file

    Parameter Description:

    • path_or_buf: Exported file path or file object
    • sep: optional, separator, default is comma
    • index: optional, whether to include index, default is True
  • df.to_excel() : Export data to an Excel file

    Parameter Description:

    • excel_writer : Excel file path, file object or ExcelWriter object
    • sheet_name: optional, worksheet name
    • index: optional, whether to include index, default is True
  • Example

    import pandas as pd
    # Import data from a CSV file
    df = pd.read_csv('data.csv')
    # Export the data to an Excel file
    df.to_excel('data.xlsx', index=False)
    

Data processing and transformation

Pandas provides a variety of methods to handle missing, duplicate, and abnormal data, to transform and filter data, and to combine data from different sources through operations such as concatenation, merging, and joining.

  • df.isnull() and df.notnull() : Detect missing values

  • df.drop() : Delete rows or columns

  • df.dropna() : delete rows containing missing values

  • df.drop_duplicates() : delete duplicate rows

  • df.fillna(value): fill in missing values

  • df.apply(func): Apply function to rows or columns

  • df.groupby('column_name').mean(): Group data and compute an aggregate (here, the mean of each group)

  • df.pivot_table(): Create a pivot table

  • df.melt() : Convert wide format data to long format

  • Example

    # Aggregation
    data = [[1, 'Bob', 24, 'high-school'], [2, 'Nancy', 23, 'college'], [3, 'Lili', 22, 'college']]
    df = pd.DataFrame(data, columns=['serial', 'name', 'age', 'grade'], index=['a', 'b', 'c'])
    # Group by grade and compute the mean age of each group
    xdf = df.groupby('grade')['age'].mean()
    
    # Pivot table
    pd.pivot_table(df, values='value_column', index='index_column', columns='column_to_pivot')
    # Apply a custom function to each row
    df.apply(custom_function, axis=1)
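The missing-value and duplicate-handling methods listed above can be sketched on a tiny frame; the column names and values below are invented for illustration:

```python
import pandas as pd
import numpy as np

# A frame with missing values and a duplicate row
df = pd.DataFrame({'name': ['Bob', 'Nancy', 'Nancy'],
                   'age': [24, np.nan, np.nan]})

# Count missing values per column
print(df.isnull().sum()['age'])        # 2
# Fill missing values with a default
filled = df.fillna({'age': 0})
# Drop rows that contain missing values
dropped = df.dropna()
# Drop duplicate rows
unique_rows = df.drop_duplicates()
print(len(dropped), len(unique_rows))  # 1 2
```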
    

Data merging and splitting

  • pd.concat(): Used to stack data by rows (vertical stacking) or join data by columns (horizontal concatenation), usually to combine rows or columns from different data sets. It does not perform a column-based merge and does not check for or handle duplicate data; it simply stacks the data together

    Main parameter description:

    • objs : List of data objects to be merged, which can be a list of DataFrame or Series. The only required parameter

    • axis : Specifies the axis direction of the merge. The default is 0, which means merging by rows (vertical stacking). If set to 1, it means merging by columns (horizontal connection)

    • join: Specify the connection method, the default is 'outer'. Possible values are:

      • 'outer': Performs an outer join, retaining all rows or columns, and filling non-existent data with missing values.
      • 'inner': Perform an inner join and keep only common rows or columns.
    • ignore_index : Defaults to False. If set to True, the original index is ignored and a new consecutive integer index is created.

    • keys : Tag used to create hierarchical index, can be a string, list or array. If keys are provided, a MultiIndex will be created.

    **Applicable scenarios:** Mainly used for simple data stacking operations, merging data from different sources or processing methods, such as stacking multiple similar data sets together by rows, or splicing columns of different data sets together.

  • pd.merge(): Used for column-based merging, similar to the JOIN operation in SQL; it combines two or more DataFrames according to one or more shared key columns.

    pd.merge() matches rows on the join columns and behaves differently depending on the join type, such as INNER JOIN, LEFT JOIN, RIGHT JOIN, or FULL JOIN.

    Main parameter description:

    • left: DataFrame on the left, used for merged left data.
    • right: DataFrame on the right, used for merged right data.
    • how: Specify the connection method, the default is 'inner'. Can take the following values:
      • 'inner': Performs an inner join and only retains rows common to both DataFrames.
      • 'outer': Performs an outer join, retaining all rows in both DataFrames and filling missing values with NaN.
      • 'left': Performs a left join, retaining all rows in the left DataFrame, and filling unmatched rows in the right DataFrame with NaNs.
      • 'right': Performs a right join, retaining all rows in the right DataFrame, and filling unmatched rows in the left DataFrame with NaNs.
    • on: Column names used for the join (columns with the same name in the left and right DataFrames). Can be a string of a single column name, or a list of column names if a multi-column join is required.
    • left_on: The column name used for connection in the left DataFrame. If the connection column names on the left and right sides are different, you can use this parameter to specify the column name on the left.
    • right_on: The column name used for connection in the right DataFrame. If the connection column names on the left and right sides are different, you can use this parameter to specify the column name on the right side.
    • left_index: Default is False. If set to True, the index of the left DataFrame is used as the join key.
    • right_index: Default is False. If set to True, the index of the right DataFrame is used as the join key.
    • suffixes: Default is ('_x', '_y'). A tuple of suffixes appended to overlapping (non-key) column names to distinguish them when both DataFrames contain a column with the same name.
    • sort: Default is False. If set to True, the results are sorted after merging.
    • copy: Default is True. If set to False, attempts to perform the join operation without copying the data, which can improve performance.

    **Applicable scenarios:** Mainly used for more complex column-based data connection and merging operations, merging data from different data sets based on shared columns, usually used for data association, data connection and database-style merging operations.

  • df.join(): DataFrame method that joins another DataFrame based on the index (or on a key column)

  • df['column'].str.split(expand=True) : Split a single column containing multiple values into multiple columns, making the data more organized and easier to process

  • Example:

    # Stack two DataFrames by rows
    merged_df = pd.concat([df1, df2], axis=0)
    # Database-style join
    merged_df = pd.merge(df1, df2, on='key_column')
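To make the join types concrete, here is a minimal sketch with two frames sharing a 'key' column (all names and values invented):

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

# Inner join: only keys present in both frames survive
inner = pd.merge(left, right, on='key', how='inner')
print(inner['key'].tolist())           # ['b', 'c']

# Left join: every left row kept; unmatched right values become NaN
left_join = pd.merge(left, right, on='key', how='left')
print(left_join['y'].isnull().sum())   # 1  (key 'a' has no match)
```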
    

Data viewing and overview

  • df.head(n): View the first n rows of the DataFrame, the default is 5 rows
  • df.tail(n): View the last n rows of the DataFrame, the default is 5 rows
  • df.info(): View the basic information of DataFrame, including data type and number of non-null values
  • df.describe(): Generate descriptive statistical information, including mean, standard deviation, etc.
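A brief sketch of these inspection methods on an invented single-column frame:

```python
import pandas as pd

df = pd.DataFrame({'age': [28, 24, 32, 35, 30, 27]})

print(df.head(3))       # first 3 rows
print(df.tail(2))       # last 2 rows
df.info()               # column dtypes and non-null counts
print(df.describe())    # count, mean, std, min, quartiles, max
```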

Data selection and retrieval

  • df['column_name']: Select a single column

  • df[['col1', 'col2']]: Select multiple columns

  • df.loc[row_label] : Select rows using labels

  • df.iloc[row_index] : Select rows using integer index

  • df.query(): Query data using conditions

  • Example:

    # Select a column by its name
    df['name']
    # Select a row by its index label
    df.loc['a']
    # Slice by integer position: rows 2 through 3
    df.iloc[1:3] 
    

Data screening and filtering

  • df[df['column_name'] > value] : Filter rows by condition

  • df[(condition1) & (condition2)] : Combine conditions with logical operators to filter rows

  • Example

    # Filter rows with a conditional expression
    df[df['age'] > 30]
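When combining conditions, each clause needs its own parentheses and the element-wise operators & and | (not Python's and/or). A sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Emma', 'Mike', 'Lisa'],
                   'age': [28, 24, 32, 35]})

# Rows where age is strictly between 25 and 33
subset = df[(df['age'] > 25) & (df['age'] < 33)]
print(subset['name'].tolist())     # ['John', 'Mike']

# The same filter expressed with query()
subset2 = df.query('25 < age < 33')
print(subset2['name'].tolist())    # ['John', 'Mike']
```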
    

Data sorting and ranking

  • df.sort_values('column_name') : Sort by column value

  • df.sort_index() : Sort by index

  • df.rank(): Assign ranking to data

  • Example:

    df = df.sort_values(by='age', ascending=False)
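df.sort_index() and df.rank() can be sketched the same way (index labels and values invented):

```python
import pandas as pd

df = pd.DataFrame({'age': [28, 24, 32]}, index=['c', 'a', 'b'])

# Sort rows by index label
by_index = df.sort_index()
print(by_index.index.tolist())    # ['a', 'b', 'c']

# Rank each value (1 = smallest by default)
df['age_rank'] = df['age'].rank()
print(df['age_rank'].tolist())    # [2.0, 1.0, 3.0]
```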
    

Time series processing

Pandas provides powerful support for time series data, including date parsing, time indexing, and rolling window operations.

  • pd.to_datetime(arg, format): Convert string to datetime type

    • arg: Date and time string, timestamp, Series, etc.
    • format: optional, specify date and time format
  • df.resample(): Resample time series data

  • df.shift(periods, freq): Shift time series data

  • Example:

    # Parse the date column
    df['date_column'] = pd.to_datetime(df['date_column'])
    # Use the date column as the index
    df.set_index('date_column', inplace=True)
    # Rolling window operation
    df['rolling_mean'] = df['value_column'].rolling(window=3).mean()
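df.resample() and df.shift() from the list above can be sketched on an invented daily series:

```python
import pandas as pd

# Four days of invented values, indexed by date
ts = pd.DataFrame({'value': [1, 2, 3, 4]},
                  index=pd.date_range('2020-01-01', periods=4, freq='D'))

# Downsample daily data into 2-day totals
print(ts.resample('2D').sum()['value'].tolist())   # [3, 7]

# Shift values forward one period; the first slot becomes NaN
shifted = ts['value'].shift(1)
print(shifted.isnull().sum())                      # 1
```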
    

Data visualization

Pandas integrates with the Matplotlib library to easily visualize data.

  • df.plot(): Draw data visualization charts

    • Chart type (e.g. 'line', 'bar', 'scatter', etc.) and other plotting options can be specified via parameters
  • df.hist(): draw histogram

    • The number of columns, color, etc. of the histogram can be specified through parameters.
  • df.boxplot() : Draw a box plot

    • Options such as the columns to plot and how to group them can be specified through parameters
  • Example:

    import matplotlib.pyplot as plt
    # Bar chart
    df['column_name'].plot(kind='bar')
    # Scatter plot
    df.plot.scatter(x='x_column', y='y_column')
    # Combine Pandas with Matplotlib for more visualization options
    

Advanced usage

Multi-level index

  • Pandas' multi-level indexing feature is very powerful, allowing the creation of complex hierarchical indexes in a DataFrame, allowing for more flexible organization and analysis of data.

    A common application scenario is to use multi-level indexes to represent time series data, such as year and quarter as the two levels of the index.

  • By creating multi-level indexes, data can be divided and aggregated at different levels.

    For example, you can group your data by year, and then within each year by quarter. In this way, various statistical analyses can be performed more conveniently, such as calculating the average or total for each quarter.

  • When creating a multi-level index, you can use Pandas's MultiIndex class to specify the index level and label.

    You can easily create a DataFrame with multiple levels of indexing by specifying the name of the level and corresponding label values.

  • Using multi-level indexes can bring many benefits, such as improving data query efficiency and simplifying data operations and analysis.

    But at the same time, you also need to pay attention to avoiding index confusion and excessive data structure complexity when using multi-level indexes.

    Therefore, when using multi-level indexes, it needs to be applied flexibly according to specific needs and data characteristics.

  • Example:

    import pandas as pd
     
    # Create a multi-level index
    index = pd.MultiIndex.from_tuples([('2019', 'Q1'), ('2019', 'Q2'), ('2020', 'Q1'), ('2020', 'Q2')])
    data = pd.DataFrame({'Sales': [100, 200, 150, 250]}, index=index)
    # Query the sales of a specific quarter
    print(data.loc[('2020', 'Q1')])
    print("==================")
    # Query the sales of a specific year
    print(data.loc['2020'])
    
    # Output:
    Sales    150
    Name: (2020, Q1), dtype: int64
    ==================
        Sales
    Q1    150
    Q2    250
    

Pivot table

  • A pivot table is a method of creating a summary table based on one or more columns in your data.

    Pandas provides the pivot_table() function to easily aggregate and analyze data.

    With the pivot_table() function, you can specify one or more columns as the row index and another or more columns as the column index, and then summarize the data based on the specified aggregation function. In this way, statistics corresponding to each row and column can be quickly calculated, such as average, sum, count, etc.

  • Pandas' pivot table functionality provides a convenient and flexible method of aggregating and analyzing data to help better understand and utilize data.

    The benefit of a pivot table is that it provides an intuitive, concise way to view and analyze data.

    Pivot tables make it easy to slice, dice, and filter your data to gain a deeper understanding of its characteristics and relationships.

  • When using a pivot table, you can select appropriate aggregation functions, row and column indexes, and filter conditions according to specific needs to obtain the desired analysis results.

    Pivot tables are not only suitable for a single DataFrame, but can also be used for merging and analyzing multiple DataFrames.

  • Example:

    import pandas as pd
     
    # Create a DataFrame of sales data
    data = pd.DataFrame({'Year': ['2019', '2019', '2020', '2020'],
                         'Quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
                         'Product': ['A', 'B', 'A', 'B'],
                         'Sales': [100, 200, 150, 250]})
    # Create a pivot table
    pivot_table = data.pivot_table(index='Year', columns='Quarter', values='Sales', aggfunc='sum')
    # Print the pivot table
    print(pivot_table)
    
    # Output
    Quarter   Q1   Q2
    Year
    2019     100  200
    2020     150  250
    

Time series analysis

  • Pandas provides flexible and efficient functionality when it comes to processing time series data.

    Its date and time processing functions include date range generation, date indexing, date addition and subtraction, date formatting, etc. You can easily create date ranges and use these dates as indexes into your data, making it easier to manipulate and analyze time series data.

  • Pandas also supports resampling operations, which convert time series data from one frequency to another.

    For example, you can convert data sampled by day to data sampled by month, or from data sampled by hour to data sampled by minute. Resampling capabilities allow the flexibility to adjust the granularity and frequency of data as needed.

  • Pandas also provides sliding window operations that can perform sliding window statistical calculations on time series data.

    You can define the size of the window and the sliding step, and perform summary, aggregation or other calculation operations on the data within the window. This is useful for tasks such as moving averages, rolling sums, etc. in time series data.

  • Example:

    import pandas as pd
     
    # Create a DataFrame of time series data
    df = pd.DataFrame({'Date': pd.date_range(start='2020-01-01', periods=10),
                       'Sales': [100, 200, 150, 250, 180, 120, 300, 350, 400, 250]})
     
    # Set the date column as the index
    df.set_index('Date', inplace=True)
    # Compute weekly total sales
    weekly_sales = df.resample('W').sum()
    # Print the weekly totals
    print(weekly_sales)
    

Work with Excel files

  • When working with Excel files using Pandas, you can use the read_excel() function to read the Excel data and load it into a DataFrame.

    Various operations and processing can then be performed on the read data, such as filtering data for specific columns, filtering data based on conditions, sorting data, adding new columns to the DataFrame, and so on.

  • Example:

    import pandas as pd
     
    # Read the Excel file
    df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
     
    # Show the first few rows of the DataFrame
    print(df.head())
    
    # Select specific columns
    selected_columns = ['Name', 'Age']
    filtered_data = df[selected_columns]
     
    # Filter rows by condition
    condition = df['Age'] > 25
    filtered_data = df[condition]
     
    # Sort the data
    sorted_data = df.sort_values(by='Age', ascending=False)
     
    # Add a new column
    df['Gender'] = ['Male', 'Female', 'Male', 'Female', 'Male']
     
    # Write to a new Excel file
    df.to_excel('new_data.xlsx', index=False)
    


Origin blog.csdn.net/footless_bird/article/details/134776677