[Repost] Pandas Basic Exercises (2): Conditional Selection and Sorting

The previous article covered the basic functions of Pandas. This one covers conditional filtering and sorting in Pandas, again with the help of a small case.

1. Read in the data

First, load the data with read_csv(). The file is tab-separated, so sep='\t' is passed. This is product data with attributes such as quantity (quantity), name (item_name), description (choice_description), and price (item_price):

import pandas as pd
url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv"

chipo = pd.read_csv(url,sep = '\t')
chipo
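To try read_csv() without a network connection, here is a minimal sketch that parses a few tab-separated rows from an in-memory string. The sample rows are made up for illustration and only imitate the column layout of the real file.

```python
import io
import pandas as pd

# Hypothetical sample rows imitating the layout of chipotle.tsv (not real data).
sample_tsv = (
    "order_id\tquantity\titem_name\tchoice_description\titem_price\n"
    "1\t1\tChips and Salsa\tNaN\t$2.39 \n"
    "1\t1\tIzze\t[Clementine]\t$3.39 \n"
    "2\t2\tChicken Bowl\t[Tomatillo Salsa]\t$16.98 \n"
)

# sep="\t" tells read_csv that fields are separated by tabs rather than commas.
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

print(sample.shape)    # number of rows and columns
print(sample.dtypes)   # item_price is still text because of the "$" prefix
```

Checking dtypes right after loading is a quick way to spot columns, like the price here, that were read as text and need cleaning.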

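2. Change the data type of a column and strip a character from it

The values in the price column are read in as strings with a leading "$" character. Before the data can be processed numerically, this symbol has to be removed and the values converted to float. The slice [1:-1] drops the first character (the "$") and the last character of each value, which assumes every value ends in one extra character such as a trailing space. Here are two equivalent ways; run only one of them, because after the conversion the values are floats and can no longer be sliced:

1. Build a list and assign it back: convert each value in the column with a list comprehension, then assign the resulting list to the column: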

prices = [float(value[1:-1]) for value in chipo.item_price]
chipo.item_price = prices

2. Use apply() with a lambda function:

chipo['item_price'] = chipo['item_price'].apply(lambda x:float(x[1:-1]))
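As a third option, not used in the original article, the same cleanup can be sketched with pandas' vectorized string methods. This version does not depend on every value having exactly one trailing character. The price strings below are hypothetical stand-ins for the item_price column.

```python
import pandas as pd

# Hypothetical price strings in the same "$x.xx " form as the item_price column.
prices = pd.Series(["$2.39 ", "$3.39 ", "$16.98 "], name="item_price")

# Remove the literal "$" (regex=False treats it as plain text, not a regex anchor),
# strip surrounding whitespace, then convert the column to float.
clean = prices.str.replace("$", "", regex=False).str.strip().astype(float)

print(clean.tolist())
```

On the real data this would correspond to applying the same chain to chipo['item_price'] instead of the slice-based conversion.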

3. Remove rows duplicated across multiple columns

Redundant data often turns up during processing, and it should be removed in advance. Here, a row counts as redundant when its values in several chosen columns all repeat those of an earlier row. The function to use is drop_duplicates(['column name 1','column name 2']), which keeps the first occurrence of each combination by default and returns a new DataFrame.

Here, rows that repeat the same combination of item_name, quantity, and choice_description are removed:

chipo_filtered = chipo.drop_duplicates(['item_name','quantity','choice_description'])
chipo_filtered
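A minimal sketch of how drop_duplicates() treats a subset of columns, using a small made-up table in place of the real data:

```python
import pandas as pd

# Hypothetical rows: rows 0 and 1 agree on all three subset columns,
# row 2 differs in quantity, row 3 differs in everything.
df = pd.DataFrame({
    "item_name": ["Chicken Bowl", "Chicken Bowl", "Chicken Bowl", "Izze"],
    "quantity": [1, 1, 2, 1],
    "choice_description": ["[Salsa]", "[Salsa]", "[Salsa]", "[Clementine]"],
    "item_price": [8.49, 8.49, 16.98, 3.39],
})

# Only the listed columns are compared; the first row of each combination is kept.
deduped = df.drop_duplicates(["item_name", "quantity", "choice_description"])

print(deduped.index.tolist())  # original row labels are preserved
```

Note that the surviving rows keep their original index labels, so the index of the result may have gaps.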

4. Conditional filtering

1. Select the rows whose quantity is 1:

chipo_one_prod = chipo_filtered[chipo_filtered.quantity==1]
chipo_one_prod

2. Building on 1, select the rows whose item_price is greater than 10, and use nunique() to count the distinct values of item_name:

chipo_one_prod[chipo_one_prod['item_price']>10].item_name.nunique()


# Output
# 25

3. The item_price condition in 2 can also be written with query() (applied here to the full chipo table):

chipo.query("item_price>10")

4. Filter with multiple conditions: select the rows where item_name is Chicken Bowl and quantity is 1:

chipo[(chipo['item_name']=='Chicken Bowl')&(chipo['quantity']==1)]

5. Step 4 joined the conditions with & (and). The | (or) operator works the same way:

Select the rows where item_name is not Chicken Bowl or quantity is not 1:

chipo[(chipo['item_name']!='Chicken Bowl')|(chipo['quantity']!=1)]
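The filtering steps above can be sketched on a small made-up table, which makes it easy to check by eye which rows each condition keeps. The data is hypothetical; only the column names match the article.

```python
import pandas as pd

# Hypothetical rows standing in for the cleaned chipo table.
df = pd.DataFrame({
    "item_name": ["Chicken Bowl", "Chicken Bowl", "Steak Bowl", "Izze"],
    "quantity": [1, 2, 1, 1],
    "item_price": [10.98, 21.96, 11.75, 3.39],
})

# A boolean mask and query() express the same condition.
mask_result = df[df["item_price"] > 10]
query_result = df.query("item_price > 10")

# Count distinct item names among the rows that passed the filter.
n_names = mask_result["item_name"].nunique()

# Each condition needs its own parentheses, because & and | bind
# more tightly than comparison operators such as == and !=.
both = df[(df["item_name"] == "Chicken Bowl") & (df["quantity"] == 1)]
either = df[(df["item_name"] != "Chicken Bowl") | (df["quantity"] != 1)]

print(mask_result.equals(query_result))
print(both.index.tolist(), either.index.tolist())
```

In this sample the & filter and the | filter are exact complements: every row falls into one result or the other, which is what De Morgan's law predicts for negated conditions.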

5. Sort the data by a specified column

1. Sort a single column; the result contains only that column, sorted. The form is data.column_name.sort_values(). For example, sorting the item_name column:

chipo.item_name.sort_values()  # sort the values

2. data.sort_values(by=column name) sorts the whole table by one column: the other columns move together with their rows, and the full sorted data is displayed:

chipo.sort_values(by = 'item_name')

3. A more advanced use of 2. To find the name of the most expensive item in the data, sort by item_price in descending order as in 2, then take the item_name from the first row of the sorted data:

chipo.sort_values(by = 'item_price',ascending = False).head(1).item_name

# Output
# 3598    Chips and Fresh Tomato Salsa
# Name: item_name, dtype: object
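The "sort, then take the first row" pattern can be sketched on a small made-up table. The second approach, idxmax(), is an alternative that is not in the original article; it finds the label of the maximum directly without sorting the whole table.

```python
import pandas as pd

# Hypothetical rows; prices are assumed to be numeric already.
df = pd.DataFrame({
    "item_name": ["Izze", "Steak Bowl", "Chicken Bowl"],
    "item_price": [3.39, 11.75, 10.98],
})

# Sort the whole table by price, highest first; rows keep their original labels.
by_price = df.sort_values(by="item_price", ascending=False)
top_name = by_price.head(1)["item_name"].iloc[0]

# Alternative: idxmax() returns the index label of the largest price.
same_name = df.loc[df["item_price"].idxmax(), "item_name"]

print(by_price.index.tolist())
print(top_name, same_name)
```

Both approaches return a single item even when several rows share the maximum price, so ties need separate handling if they matter.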

6. Selecting data with data.loc

Note: data.loc selects by label, so the row and column selectors are row names (index labels) and column names. A boolean condition on the rows also works, as in item 4 below.

1. Select the rows named 2 and 3 and the columns quantity, item_name, and item_price:

chipo.loc[[2,3],['quantity','item_name','item_price']]

2. With no restriction on the rows, select only the quantity and item_name columns:

chipo.loc[:,['quantity','item_name']]

3. With no restriction on the columns, select only the rows named 5 and 6:

chipo.loc[[5,6],:]

4. Combined use: select the item_name and item_price columns for the rows whose row name is divisible by 8:

chipo.loc[chipo.index%8==0,['item_name','item_price']]
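To make the label-based behaviour of loc visible, the following sketch uses a made-up table whose index labels are deliberately out of order, so that labels and positions differ.

```python
import pandas as pd

# Hypothetical table with non-sequential index labels.
df = pd.DataFrame(
    {
        "quantity": [1, 2, 1, 3],
        "item_name": ["Izze", "Chicken Bowl", "Steak Bowl", "Chips"],
    },
    index=[0, 8, 16, 5],
)

# loc looks rows up by label: 8 and 5 are labels here, not positions.
by_label = df.loc[[8, 5], ["item_name"]]

# A boolean array built from the index selects rows whose label is divisible by 8.
by_mask = df.loc[df.index % 8 == 0, ["quantity", "item_name"]]

print(by_label["item_name"].tolist())
print(by_mask.index.tolist())
```

With the default 0, 1, 2, ... index of the chipo table, labels and positions happen to coincide, which is why the distinction is easy to miss there.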

7. Selecting data with data.iloc

data.iloc works like data.loc, but it selects only by integer position, counting from 0; row names and column names cannot be used as selectors. Slices exclude the end position, so 4:8 covers positions 4 through 7.

1. Select the 5th to 8th rows and the 2nd to 3rd columns:

chipo.iloc[4:8,1:3]

2. Select the 2nd to 4th rows:

chipo.iloc[1:4,:]

3. Select the first two columns:

chipo.iloc[:,:2]
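The difference between position-based iloc and label-based loc can be sketched side by side on a made-up table whose index starts at 100, so the two cannot be confused.

```python
import pandas as pd

# Hypothetical table; labels 100..103 sit at positions 0..3.
df = pd.DataFrame(
    {"a": [10, 20, 30, 40], "b": [1, 2, 3, 4], "c": [5, 6, 7, 8]},
    index=[100, 101, 102, 103],
)

# iloc: positions 1 and 2 (end position 3 is excluded), first two columns.
pos = df.iloc[1:3, :2]

# loc: labels 101 through 102 (label slices include the end), columns by name.
lab = df.loc[101:102, ["a", "b"]]

print(pos.equals(lab))
print(pos.index.tolist(), list(pos.columns))
```

The two selections return the same rows and columns; the main trap is that a slice in iloc excludes its end while a label slice in loc includes it.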

That is all for this article. If some of these methods are unfamiliar, type the code out yourself and run it to deepen your understanding.
