Data Science Essentials: Pandas DataFrame Data Sorting Explained

Learning the pandas sorting methods is a great way to start or practice basic data analysis with Python. Data analysis is most commonly done with Excel, SQL, or pandas. One of the great advantages of pandas is that it can handle large amounts of data and provides high-performance data manipulation capabilities.

This article describes how to use .sort_values() and .sort_index() to efficiently sort data in a DataFrame.

Getting Started with Pandas Sorting Methods

A DataFrame is a data structure with labeled rows and columns. A DataFrame can be sorted by row or column value and by row or column index.

Rows and columns each have an index: a set of labels that identify positions along that axis. By default, the row index consists of integers starting at zero, so each label matches the row's position. You can retrieve data from a specific row or column by its label or by its position, and you can assign your own index labels.
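
As a minimal sketch of the default and custom indexes described above, the following uses a small made-up DataFrame. The names cars and labeled and all values are hypothetical, not taken from the EPA data.

```python
import pandas as pd

# Hypothetical three-row data; values are made up for illustration.
data = {"make": ["Audi", "BMW", "Volvo"], "city08": [17, 14, 19]}

# With no index given, pandas assigns the integer labels 0, 1, 2.
cars = pd.DataFrame(data)
default_labels = cars.index.tolist()

# The same data with manually assigned row labels.
labeled = pd.DataFrame(data, index=["a", "b", "c"])

first_by_position = labeled.iloc[0]["make"]  # lookup by position
first_by_label = labeled.loc["a", "make"]    # lookup by label
```

Here .iloc selects by position and .loc selects by label, so both lookups return the first row's make.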

Data preparation

The examples use the EPA Fuel Economy Dataset: fuel economy data compiled by the U.S. Environmental Protection Agency (EPA) for vehicles built between 1984 and 2021.

For the analysis, the MPG (miles per gallon) data is read into a DataFrame together with each vehicle's make, model, year, and other attributes, one column per attribute.

import pandas as pd

column_subset = [
    "id",
    "make",
    "model",
    "year",
    "cylinders",
    "fuelType",
    "trany",
    "mpgData",
    "city08",
    "highway08"
]

df = pd.read_csv(
    "数据科学必备Pandas DataFrame:数据排序详解/vehicles.csv",
    usecols=column_subset,
    nrows=100
)

df.head()

.sort_values()

Values in a DataFrame can be sorted along either axis (columns or rows) using .sort_values(), similar to how values are sorted in Excel.

.sort_index()

You can use .sort_index() to sort a DataFrame by its row index or by its column labels.

DataFrame single column data sorting

Use .sort_values() to sort by the values in a single column. By default it returns a new DataFrame sorted in ascending order and does not modify the original DataFrame.

Sort by column in ascending order

To sort by a single column, pass the column name as the only argument to .sort_values().

df.sort_values("city08")
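
To check that the call returns a new DataFrame and leaves the original untouched, here is a small self-contained sketch. The cars frame and its values are hypothetical stand-ins for the EPA data.

```python
import pandas as pd

# Hypothetical data standing in for the vehicles DataFrame.
cars = pd.DataFrame({"make": ["Audi", "BMW", "Volvo"], "city08": [17, 14, 19]})

# sort_values returns a new, sorted DataFrame.
by_city = cars.sort_values("city08")

sorted_values = by_city["city08"].tolist()   # values in ascending order
original_values = cars["city08"].tolist()    # original order is unchanged
sorted_labels = by_city.index.tolist()       # each row keeps its index label
```

Note that the rows carry their original index labels with them, which is why the index of the sorted result is no longer in sequence.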

Sort order adjustment

By default, the ascending parameter of .sort_values() is True, which sorts in ascending order. Set it to False to sort in descending order.

df.sort_values(by="city08",ascending=False)

Choosing the sorting algorithm

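The kind parameter selects the sorting algorithm. Available options are quicksort (the default), mergesort, heapsort, and stable. pandas applies kind only when sorting on a single column or label.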

df.sort_values(by="city08",ascending=False,kind="mergesort")
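
One practical reason to pick mergesort is that it is a stable sort: rows with equal sort values keep their original relative order. The sketch below illustrates this on a small made-up frame; the model names and values are hypothetical.

```python
import pandas as pd

# Hypothetical data with tied city08 values.
cars = pd.DataFrame({
    "model": ["A", "B", "C", "D"],
    "city08": [17, 14, 17, 14],
})

# A stable sort keeps tied rows in their original relative order:
# B stays ahead of D (both 14), and A stays ahead of C (both 17).
stable = cars.sort_values(by="city08", kind="mergesort")
stable_models = stable["model"].tolist()
```

With the default quicksort, the order of tied rows is not guaranteed.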

DataFrame multi-column data sorting

To sort by more than one key, pass a list of column names to by.

Ascending order by column

To sort a DataFrame on multiple columns, provide a list of column names. Rows are sorted by the first column, and ties are broken by the next column in the list.

df.sort_values(by=["make", "model"])[["make", "model"]]

          make               model
0   Alfa Romeo  Spider Veloce 2000
18        Audi                 100
19        Audi                 100
20         BMW                740i
21         BMW               740il
..         ...                 ...
12  Volkswagen      Golf III / GTI
13  Volkswagen           Jetta III
15  Volkswagen           Jetta III
16       Volvo                 240
17       Volvo                 240
[100 rows x 2 columns]

Change column sort order

Change the order of the names in the by list to change the sort priority.

df.sort_values(by=["model", "make"])[["make", "model"]]
             make        model
18           Audi          100
19           Audi          100
16          Volvo          240
17          Volvo          240
75          Mazda          626
..            ...          ...
62           Ford  Thunderbird
63           Ford  Thunderbird
88     Oldsmobile     Toronado
42  CX Automotive        XM v6
43  CX Automotive       XM v6a
[100 rows x 2 columns]

Descending sort by multiple columns

df.sort_values(by=["make", "model"],ascending=False)[["make", "model"]]
          make               model
16       Volvo                 240
17       Volvo                 240
13  Volkswagen           Jetta III
15  Volkswagen           Jetta III
11  Volkswagen      Golf III / GTI
..         ...                 ...
21         BMW               740il
20         BMW                740i
18        Audi                 100
19        Audi                 100
0   Alfa Romeo  Spider Veloce 2000
[100 rows x 2 columns]

Sorting on multiple columns with different sort orders

You can sort on multiple columns and give each column its own sort direction in a single method call. To sort some columns in ascending order and others in descending order, pass a list of booleans to ascending, one for each column in by.

df.sort_values(
    by=["make", "model", "city08"],
    ascending=[True, True, False]
)[["make", "model", "city08"]]

          make               model  city08
0   Alfa Romeo  Spider Veloce 2000      19
18        Audi                 100      17
19        Audi                 100      17
20         BMW                740i      14
21         BMW               740il      14
..         ...                 ...     ...
11  Volkswagen      Golf III / GTI      18
15  Volkswagen           Jetta III      20
13  Volkswagen           Jetta III      18
17       Volvo                 240      19
16       Volvo                 240      18
[100 rows x 3 columns]
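
The same mixed-direction sort can be tried on a small made-up frame, where the result is easy to verify by eye. The data below is hypothetical.

```python
import pandas as pd

# Hypothetical data: two makes, each with two city08 values.
cars = pd.DataFrame({
    "make": ["Volvo", "Audi", "Volvo", "Audi"],
    "model": ["240", "100", "240", "100"],
    "city08": [18, 16, 19, 17],
})

# make and model ascending, city08 descending within each make/model group.
mixed = cars.sort_values(
    by=["make", "model", "city08"],
    ascending=[True, True, False],
)

mixed_makes = mixed["make"].tolist()
mixed_city = mixed["city08"].tolist()
```

The list passed to ascending lines up position by position with the list passed to by.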

DataFrame index ordering

A DataFrame has an .index attribute, which by default holds integer labels that match each row's position. You can think of the index as row numbers that help you find and identify rows quickly.

Sort by index in ascending order

DataFrames can be sorted by row index using .sort_index(). Sorting by column values, as in the previous examples, reorders the rows, so the index labels are no longer in sequence. Filtering a DataFrame, or adding or deleting rows, also leaves the index non-sequential.
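
A short sketch of how the index stops being sequential after sorting or filtering, using hypothetical data:

```python
import pandas as pd

# Hypothetical data with the default index 0..3.
cars = pd.DataFrame({
    "make": ["Volvo", "Audi", "BMW", "Audi"],
    "city08": [19, 17, 14, 16],
})

# Sorting by value moves the rows, and the labels move with them.
by_city = cars.sort_values("city08")
labels_after_sort = by_city.index.tolist()

# Filtering keeps only some rows, leaving gaps in the labels.
audis = cars[cars["make"] == "Audi"]
labels_after_filter = audis.index.tolist()
```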

Use .sort_values() to create a new sorted DataFrame for the following steps.

sorted_df = df.sort_values(by=["make", "model"])
sorted_df

Use .sort_index() to restore the original row order of the DataFrame.

sorted_df.sort_index()

Compare the result with the original DataFrame. The comparison returns a DataFrame of booleans, one per cell.

sorted_df.sort_index() == df
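
One caveat with the == comparison is that a missing value never compares equal to itself, so cells holding NaN show False even when the two DataFrames match. As a hedged alternative, DataFrame.equals() treats NaN values in the same location as equal. The sketch below uses hypothetical data with one missing value.

```python
import pandas as pd

# Hypothetical data with one missing cylinders value.
cars = pd.DataFrame({
    "make": ["Volvo", "Audi", "BMW"],
    "cylinders": [4.0, float("nan"), 6.0],
})

# Sort by value, then restore the original order by index.
restored = cars.sort_values("make").sort_index()

# Element-wise comparison reports False for the NaN cell.
all_cells_equal = bool((restored == cars).all().all())

# equals() treats NaN values in the same location as equal.
frames_equal = restored.equals(cars)
```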

You can also assign a custom index by passing a list of column names to .set_index().

assigned_index_df = df.set_index(["make", "model"])
assigned_index_df 

Use .sort_index() to sort by the new index.

assigned_index_df.sort_index()

Sort by index descending

assigned_index_df.sort_index(ascending=False)
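
To make the effect of sorting a two-level index concrete, here is a small sketch with hypothetical data. The index levels are compared in order: make first, then model.

```python
import pandas as pd

# Hypothetical data; make and model become a two-level index.
cars = pd.DataFrame({
    "make": ["Volvo", "Audi", "BMW", "Audi"],
    "model": ["240", "A6", "740i", "100"],
    "city08": [19, 17, 14, 16],
})
indexed = cars.set_index(["make", "model"])

# Each index entry is a (make, model) tuple.
ascending_labels = indexed.sort_index().index.tolist()
descending_labels = indexed.sort_index(ascending=False).index.tolist()
```

Because the model labels are strings, "100" sorts before "A6" by character order, not numerically.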

DataFrame column ordering

You can also sort the columns of a DataFrame by their labels. Calling .sort_index() with the optional parameter axis set to 1 sorts the DataFrame by column label. The sorting algorithm is applied to the axis labels, not to the data itself, which can make a DataFrame easier to inspect visually.

When you call .sort_index() without explicit arguments, it uses axis=0 by default. The axes of a DataFrame are the index (axis=0) and the columns (axis=1). You can use both axes to index, select, and sort data in the DataFrame.

Column label sorting

df.sort_index(axis=1)

df.sort_index(axis=1, ascending=False)
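
A minimal check of what axis=1 does, using a hypothetical one-row frame: only the column order changes, and each value stays under its own label.

```python
import pandas as pd

# Hypothetical one-row frame with columns in non-alphabetical order.
cars = pd.DataFrame({"year": [1985], "make": ["Audi"], "city08": [17]})

# Sort the column labels, not the data.
columns_ascending = cars.sort_index(axis=1).columns.tolist()
columns_descending = cars.sort_index(axis=1, ascending=False).columns.tolist()

# The value under each label is unchanged.
make_after_sort = cars.sort_index(axis=1).loc[0, "make"]
```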

Handling missing data when sorting

Real-world data usually has many imperfections. While pandas has a variety of ways to clean data before sorting, sometimes it's good to see what data is missing while sorting. You can do this with the na_position parameter.

# Map "Y" to True; any other value has no mapping and becomes NaN.
df["mpgData_"] = df["mpgData"].map({"Y": True})

na_position in .sort_values()

.sort_values() accepts a parameter called na_position that controls where missing data in the sort column is placed: "first" or "last" (the default).

df.sort_values(by="mpgData_",na_position="first")

    city08  cylinders fuelType  ...            trany  year mpgData_
1        9         12  Regular  ...     Manual 5-spd  1985      NaN
3       10          8  Regular  ...  Automatic 3-spd  1985      NaN
4       17          4  Premium  ...     Manual 5-spd  1993      NaN
5       21          4  Regular  ...  Automatic 3-spd  1993      NaN
11      18          4  Regular  ...  Automatic 4-spd  1993      NaN
..     ...        ...      ...  ...              ...   ...      ...
32      15          8  Premium  ...  Automatic 4-spd  1993     True
33      15          8  Premium  ...  Automatic 4-spd  1993     True
37      17          6  Regular  ...  Automatic 3-spd  1993     True
85      17          6  Regular  ...  Automatic 4-spd  1993     True
95      17          6  Regular  ...  Automatic 3-spd  1993     True
[100 rows x 11 columns]

With na_position="first", rows with missing data in the sort column appear at the top of the DataFrame. This is useful for inspecting a column's missing values.
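
The following sketch reproduces the mapping and both na_position settings on a small made-up frame, so the placement of the missing values is easy to see. The data is hypothetical.

```python
import pandas as pd

# Hypothetical data: only "Y" is mapped, so "N" rows become NaN.
cars = pd.DataFrame({
    "make": ["Audi", "BMW", "Volvo", "Ford"],
    "mpgData": ["Y", "N", "Y", "N"],
})
cars["mpgData_"] = cars["mpgData"].map({"Y": True})

# Missing values at the top.
na_first = cars.sort_values(by="mpgData_", na_position="first")
missing_mask_first = na_first["mpgData_"].isna().tolist()

# Missing values at the bottom (the default behavior).
na_last = cars.sort_values(by="mpgData_", na_position="last")
missing_mask_last = na_last["mpgData_"].isna().tolist()
```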

Sorting a DataFrame in place

Both .sort_values() and .sort_index() accept the parameter inplace=True, which modifies the original DataFrame directly instead of returning a new one.

df.sort_values("city08", inplace=True)
df.sort_index(inplace=True)

After each call, df itself holds the sorted result, and the method returns None.
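
A small sketch of the in-place behavior, using hypothetical data. It shows that the call returns None and that the original object is the one that changes.

```python
import pandas as pd

# Hypothetical data standing in for the vehicles DataFrame.
cars = pd.DataFrame({"make": ["Volvo", "Audi", "BMW"], "city08": [19, 17, 14]})

# With inplace=True the method returns None and reorders cars itself.
result = cars.sort_values("city08", inplace=True)
city_after_sort = cars["city08"].tolist()

# Sorting by index in place restores the original row order.
cars.sort_index(inplace=True)
makes_after_restore = cars["make"].tolist()
```

Because the return value is None, assigning the result of an in-place call back to a variable is a common mistake.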

Origin blog.csdn.net/qq_20288327/article/details/124223029