Learning the pandas sorting methods is a great way to start or practice basic data analysis with Python. Data analysis is most commonly done with Excel, SQL, or pandas. One of the main advantages of pandas is that it can handle large amounts of data and provides high-performance data manipulation capabilities.
This article describes how to use .sort_values() and .sort_index() to efficiently sort data in a DataFrame.
Getting Started with Pandas Sorting Methods
A DataFrame is a data structure with labeled rows and columns. A DataFrame can be sorted by row or column value and by row or column index.
Both rows and columns have an index. By default, the row index is a sequence of integers starting at zero that reflects each row's position in the DataFrame, and you can retrieve data from a specific row or column by its index position. You can also assign your own index labels.
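As a minimal sketch of the default index and a manually assigned one, the snippet below builds a tiny DataFrame. The name df_small and its three rows are hypothetical sample data, not values from the EPA dataset.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df_small = pd.DataFrame({"make": ["Audi", "BMW", "Volvo"], "city08": [17, 14, 19]})

# The default row index is 0, 1, 2, ...
default_labels = list(df_small.index)

# Retrieve a row by its integer position.
second_row = df_small.iloc[1]

# Assign custom index labels, then retrieve a row by label.
df_small.index = ["a", "b", "c"]
row_b = df_small.loc["b"]
```

Here .iloc selects by position and .loc selects by label, so both lookups return the BMW row.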
Data preparation
The examples use fuel economy data compiled by the U.S. Environmental Protection Agency (EPA) for vehicles built between 1984 and 2021.
EPA Fuel Economy Dataset
To analyze MPG (miles per gallon) by make, model, year, and other vehicle attributes, read the relevant columns of the dataset into a DataFrame.
import pandas as pd
column_subset = [
    "id",
    "make",
    "model",
    "year",
    "cylinders",
    "fuelType",
    "trany",
    "mpgData",
    "city08",
    "highway08"
]
df = pd.read_csv(
    "数据科学必备Pandas DataFrame:数据排序详解/vehicles.csv",
    usecols=column_subset,
    nrows=100
)
df.head()
.sort_values()
Use .sort_values() to sort the values in a DataFrame along either axis (rows or columns), similar to how you sort values in Excel.
.sort_index()
Use .sort_index() to sort a DataFrame by its row index or by its column labels.
DataFrame single column data sorting
By default, .sort_values() returns a new DataFrame sorted in ascending order and does not modify the original DataFrame.
Sort by column in ascending order
To sort on a single column, pass the name of that column to .sort_values() as a single argument.
df.sort_values("city08")
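To see that the original DataFrame is left untouched, here is a small self-contained sketch. The cars DataFrame and its values are made up for illustration and stand in for the EPA data.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
cars = pd.DataFrame({"model": ["A", "B", "C", "D"], "city08": [19, 9, 23, 10]})

# Returns a new DataFrame sorted by city08 in ascending order.
by_city = cars.sort_values("city08")

# The sorted copy keeps the original row labels, now out of order.
sorted_values = by_city["city08"].tolist()
sorted_labels = by_city.index.tolist()

# The original DataFrame still has its rows in the original order.
original_values = cars["city08"].tolist()
```

Note that the row labels travel with their rows, which is why the index of the sorted result is no longer sequential.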
Sort order adjustment
By default, the ascending parameter of .sort_values() is True, which sorts in ascending order. Set it to False to sort in descending order.
df.sort_values(by="city08", ascending=False)
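A common use of a descending sort is to pick the top few rows. The sketch below chains .head() onto the sorted result; the cars DataFrame is hypothetical sample data.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
cars = pd.DataFrame({"model": ["A", "B", "C", "D"], "city08": [19, 9, 23, 10]})

# Sort from highest to lowest city MPG, then keep the first two rows.
top_two = cars.sort_values(by="city08", ascending=False).head(2)
top_models = top_two["model"].tolist()
```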
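Choosing the sorting algorithm
The kind parameter selects the sorting algorithm. The available algorithms are quicksort (the default), mergesort, and heapsort, and recent pandas versions also accept stable. Of these, mergesort and stable are stable sorts. pandas applies this option only when sorting on a single column or label.
df.sort_values(by="city08", ascending=False, kind="mergesort")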
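The practical reason to choose mergesort is that it is a stable algorithm: rows with equal sort keys keep their original relative order. The sketch below illustrates this with hypothetical sample data that contains ties.

```python
import pandas as pd

# Hypothetical sample data with tied city08 values.
cars = pd.DataFrame({"model": ["A", "B", "C", "D"], "city08": [17, 14, 17, 14]})

# With a stable sort, B stays ahead of D and A stays ahead of C,
# because that is the order in which they appear in the input.
stable = cars.sort_values(by="city08", kind="mergesort")
stable_models = stable["model"].tolist()
```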
DataFrame multi-column data sorting
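To sort on more than one key, pass a list of column names to the by parameter.
Ascending order by column
To sort a DataFrame on multiple columns, provide a list of column names. Rows are ordered by the first column in the list, and ties are broken by the following columns in turn.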
df.sort_values(by=["make", "model"])[["make", "model"]]
make model
0 Alfa Romeo Spider Veloce 2000
18 Audi 100
19 Audi 100
20 BMW 740i
21 BMW 740il
.. ... ...
12 Volkswagen Golf III / GTI
13 Volkswagen Jetta III
15 Volkswagen Jetta III
16 Volvo 240
17 Volvo 240
[100 rows x 2 columns]
Change column sort order
The order of the names in the by list sets the sort priority, so reordering the list changes the result.
df.sort_values(by=["model", "make"])[["make", "model"]]
make model
18 Audi 100
19 Audi 100
16 Volvo 240
17 Volvo 240
75 Mazda 626
.. ... ...
62 Ford Thunderbird
63 Ford Thunderbird
88 Oldsmobile Toronado
42 CX Automotive XM v6
43 CX Automotive XM v6a
[100 rows x 2 columns]
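The effect of the list order is easier to see on a handful of rows. The sketch below sorts the same hypothetical DataFrame with both key orders. The model values are strings, so they compare character by character ("100" sorts before "90").

```python
import pandas as pd

# Hypothetical sample data for illustration only.
cars = pd.DataFrame({
    "make": ["Volvo", "Audi", "Volvo", "Audi"],
    "model": ["240", "90", "850", "100"],
})

# make is the primary key, model breaks ties within each make.
make_first = cars.sort_values(by=["make", "model"])

# model is the primary key, make breaks ties within each model.
model_first = cars.sort_values(by=["model", "make"])

make_first_labels = make_first.index.tolist()
model_first_labels = model_first.index.tolist()
```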
Descending sort by multiple columns
df.sort_values(by=["make", "model"], ascending=False)[["make", "model"]]
make model
16 Volvo 240
17 Volvo 240
13 Volkswagen Jetta III
15 Volkswagen Jetta III
11 Volkswagen Golf III / GTI
.. ... ...
21 BMW 740il
20 BMW 740i
18 Audi 100
19 Audi 100
0 Alfa Romeo Spider Veloce 2000
[100 rows x 2 columns]
Sorting on multiple columns with different sort orders
You can sort on multiple columns and give each column its own sort direction in a single method call. To sort some columns in ascending order and others in descending order, pass a list of Booleans to ascending, with one value for each column in by.
df.sort_values(
by=["make", "model", "city08"],
ascending=[True, True, False]
)[["make", "model", "city08"]]
make model city08
0 Alfa Romeo Spider Veloce 2000 19
18 Audi 100 17
19 Audi 100 17
20 BMW 740i 14
21 BMW 740il 14
.. ... ... ...
11 Volkswagen Golf III / GTI 18
15 Volkswagen Jetta III 20
13 Volkswagen Jetta III 18
17 Volvo 240 19
16 Volvo 240 18
[100 rows x 3 columns]
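As a compact illustration of mixed sort directions, the sketch below uses four hypothetical rows so that the whole result is visible. Within each make and model, the higher city08 value comes first.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
cars = pd.DataFrame({
    "make": ["Volvo", "Audi", "Volvo", "Audi"],
    "model": ["240", "100", "240", "100"],
    "city08": [18, 17, 19, 16],
})

# make and model ascending, city08 descending within each group.
mixed = cars.sort_values(
    by=["make", "model", "city08"],
    ascending=[True, True, False],
)
mixed_city = mixed["city08"].tolist()
```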
DataFrame index ordering
A DataFrame has an .index property which by default is a numerical representation of its row position. An index can be thought of as a row number, which helps to find and identify rows quickly.
Sort by index in ascending order
DataFrames can be sorted by row index using .sort_index(). Sorting by column values, as in the previous examples, reorders the rows, so the index is no longer in order. The same happens when you filter a DataFrame or add or delete rows.
Use .sort_values() to create a new sorted DataFrame for subsequent operations.
sorted_df = df.sort_values(by=["make", "model"])
sorted_df
Use .sort_index() to restore the original order of the DataFrame.
sorted_df.sort_index()
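Compare the result with df element by element to check that the original order is restored. The comparison returns a DataFrame of Boolean values.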
sorted_df.sort_index() == df
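One caveat with an element-wise comparison is that a missing value never compares equal to itself, so a cell holding NaN shows False even when the two DataFrames match. The sketch below, on hypothetical sample data, contrasts == with .equals(), which treats NaN values in the same position as equal.

```python
import pandas as pd

# Hypothetical sample data with one missing value.
cars = pd.DataFrame({
    "make": ["Volvo", "Audi", "BMW"],
    "cylinders": [4.0, None, 8.0],
})

sorted_cars = cars.sort_values(by="make")
restored = sorted_cars.sort_index()

# Element-wise comparison: the NaN cell compares as False.
elementwise = restored == cars
cylinders_match = elementwise["cylinders"].tolist()

# Whole-frame comparison that treats NaN in the same position as equal.
same = restored.equals(cars)
```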
You can assign a custom index by passing a list of column names to .set_index().
assigned_index_df = df.set_index(["make", "model"])
assigned_index_df
Use .sort_index() to sort.
assigned_index_df.sort_index()
Sort by index descending
assigned_index_df.sort_index(ascending=False)
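The following sketch shows both index sort directions on a small two-level index. The data is hypothetical, and the lookup at the end is just one example of what the custom index makes possible.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
cars = pd.DataFrame({
    "make": ["Volvo", "Audi", "BMW"],
    "model": ["240", "100", "740i"],
    "city08": [18, 17, 14],
})

# Use make and model together as the row index.
indexed = cars.set_index(["make", "model"])

asc = indexed.sort_index()
desc = indexed.sort_index(ascending=False)

asc_labels = asc.index.tolist()
desc_labels = desc.index.tolist()

# Look up a row by its (make, model) label pair.
bmw_city = asc.loc[("BMW", "740i"), "city08"]
```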
DataFrame column ordering
You can also sort the columns of a DataFrame by their labels. Calling .sort_index() with the optional parameter axis set to 1 sorts the DataFrame by column label. The sort is applied to the axis labels rather than to the data itself, which can make a DataFrame easier to inspect visually.
When you call .sort_index() without explicit arguments, it uses axis=0 by default. A DataFrame has two axes: the index (axis=0) and the columns (axis=1). You can use both axes to index, select, and sort data in the DataFrame.
Column label sorting
df.sort_index(axis=1)
df.sort_index(axis=1, ascending=False)
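A one-row sketch is enough to show that only the column order changes. The cars DataFrame below is hypothetical sample data.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
cars = pd.DataFrame({"year": [1985], "make": ["Audi"], "city08": [17]})

# Sort the columns alphabetically by label, then in reverse.
cols_asc = cars.sort_index(axis=1)
cols_desc = cars.sort_index(axis=1, ascending=False)

asc_columns = list(cols_asc.columns)
desc_columns = list(cols_desc.columns)
```

The values stay attached to their labels, so cols_asc.loc[0, "make"] is still "Audi".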
Handling missing data when sorting
Real-world data usually has many imperfections. While pandas has a variety of ways to clean data before sorting, sometimes it's good to see what data is missing while sorting. You can do this with the na_position parameter.
df["mpgData_"] = df["mpgData"].map({"Y": True})
na_position in .sort_values()
.sort_values() accepts a parameter called na_position that controls where missing data in the sort column is placed.
df.sort_values(by="mpgData_", na_position="first")
city08 cylinders fuelType ... trany year mpgData_
1 9 12 Regular ... Manual 5-spd 1985 NaN
3 10 8 Regular ... Automatic 3-spd 1985 NaN
4 17 4 Premium ... Manual 5-spd 1993 NaN
5 21 4 Regular ... Automatic 3-spd 1993 NaN
11 18 4 Regular ... Automatic 4-spd 1993 NaN
.. ... ... ... ... ... ... ...
32 15 8 Premium ... Automatic 4-spd 1993 True
33 15 8 Premium ... Automatic 4-spd 1993 True
37 17 6 Regular ... Automatic 3-spd 1993 True
85 17 6 Regular ... Automatic 4-spd 1993 True
95 17 6 Regular ... Automatic 3-spd 1993 True
[100 rows x 11 columns]
With na_position="first", any missing data in the sort column appears at the top of the DataFrame, which makes it easy to see which rows are missing a value. The default, na_position="last", places missing data at the end.
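To compare the two settings side by side, the sketch below repeats the mapping step on a small hypothetical DataFrame. Values of mpgData other than "Y" are not in the mapping, so they become NaN.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
cars = pd.DataFrame({
    "model": ["A", "B", "C", "D"],
    "mpgData": ["Y", "N", "Y", "N"],
})

# "Y" maps to True; anything missing from the mapping becomes NaN.
cars["mpgData_"] = cars["mpgData"].map({"Y": True})

na_first = cars.sort_values(by="mpgData_", na_position="first")
na_last = cars.sort_values(by="mpgData_")  # default is na_position="last"

first_is_missing = na_first["mpgData_"].isna().tolist()
last_is_missing = na_last["mpgData_"].isna().tolist()
```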
Sorting a DataFrame in place
Pass inplace=True to .sort_values() to modify the original DataFrame directly instead of returning a new one.
df.sort_values("city08", inplace=True)
df.sort_index(inplace=True)
With inplace=True, each call modifies df itself and returns None, so the sorted result replaces the original df every time the code runs.
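A short sketch of the in-place behavior on hypothetical sample data follows. Because the method returns None in this mode, assigning its result to a variable is a common mistake.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
cars = pd.DataFrame({"model": ["A", "B", "C"], "city08": [19, 9, 23]})

# Sorts cars itself; the return value is None.
result = cars.sort_values("city08", inplace=True)
labels_after_value_sort = cars.index.tolist()

# Restores the original row order, again modifying cars itself.
cars.sort_index(inplace=True)
labels_after_index_sort = cars.index.tolist()
```

An equivalent approach that avoids inplace is to reassign the result, as in cars = cars.sort_values("city08").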