8 magic operations for filtering data in Pandas

In day-to-day Python data analysis, the most common task is querying and filtering: picking out the data we want according to various conditions, dimensions, and combinations, so that it is easier to analyze and mine.

Today I have summarized the common operations for everyday querying and filtering, for you to learn from and refer to. The examples use the boston dataset from sklearn.

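# Note: load_boston was removed in scikit-learn 1.2, so this needs an older version
from sklearn import datasets
import pandas as pd

boston = datasets.load_boston()
# One column per feature, named after boston.feature_names
df = pd.DataFrame(boston.data, columns=boston.feature_names)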

1. []

The first method, and the quickest and most convenient, is []. Write the filter condition, or a combination of conditions, directly inside the brackets on the DataFrame. For example, the code below selects all rows where NOX is greater than the mean of that variable, then sorts them by NOX in descending order.

df[df['NOX']>df['NOX'].mean()].sort_values(by='NOX',ascending=False).head()

Combined conditions can also be used, joined by logical operators such as & and |. The example below adds to the previous condition the requirement that CHAS equals 1. Note that each condition joined by a logical operator must be wrapped in ().

df[(df['NOX']>df['NOX'].mean())& (df['CHAS'] ==1)].sort_values(by='NOX',ascending=False).head()
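
To make the role of the parentheses concrete, here is a minimal sketch on a small made-up DataFrame. The column names follow the article, but the values are hypothetical and chosen only so the result is easy to check by hand.

```python
import pandas as pd

# Hypothetical toy data: column names mirror the article, values are made up.
df = pd.DataFrame({
    "NOX": [0.40, 0.50, 0.70, 0.45, 0.65],
    "CHAS": [0, 1, 0, 1, 1],
})

# Each condition is a boolean Series aligned with df's index.
above_mean = df["NOX"] > df["NOX"].mean()   # the mean here is 0.54
on_river = df["CHAS"] == 1

# Naming each mask avoids the parentheses that inline conditions need,
# because & and | bind more tightly than > and ==.
both = df[above_mean & on_river]
either = df[above_mean | on_river]

print(both)
print(either)
```

Writing the masks inline gives the same result as long as each comparison keeps its own parentheses, as in the article's example.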

2. loc/iloc

Besides [], loc and iloc are probably the two most commonly used query methods. loc accesses data by label (column names and row index values), while iloc accesses data by integer position; both support single-value access and slicing. In addition to filtering rows by condition as [] does, loc can also specify which columns to return, so it filters along both the row and column dimensions.

For example, the code below selects rows by condition, restricts the selection to the specified column, and then assigns a value to it.

df.loc[(df['NOX']>df['NOX'].mean()),['CHAS']] = 2
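
The difference between label-based and position-based access is easiest to see with a non-numeric index. The sketch below uses a hypothetical three-row DataFrame with string labels; the data and the 0.45 threshold are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy data; string labels make loc and iloc easy to tell apart.
df = pd.DataFrame(
    {"NOX": [0.40, 0.50, 0.70], "CHAS": [0, 1, 0]},
    index=["a", "b", "c"],
)

# loc slices by label and includes both endpoints.
by_label = df.loc["a":"b", "NOX"]

# iloc slices by integer position and excludes the end, like a Python list.
by_position = df.iloc[0:2, 0]

# Row condition plus column list, followed by assignment,
# in the same shape as the article's example.
df.loc[df["NOX"] > 0.45, ["CHAS"]] = 2

print(by_label)
print(by_position)
print(df)
```

Assigning through a single loc call, rather than chaining two selections, lets pandas write into the original DataFrame.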

3. isin

The filter conditions above (< > == !=) are all comparisons, but often you need to match a specific set of values, and this is where isin comes in. For example, suppose we want to keep only the rows where NOX is one of 0.538, 0.713, or 0.437.

df.loc[df['NOX'].isin([0.538,0.713,0.437]),:].sample(5)

You can also invert the condition: just add the ~ symbol before it.

df.loc[~df['NOX'].isin([0.538,0.713,0.437]),:].sample(5)
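
As a rough sketch of what isin does, the example below compares it with the equivalent chain of == conditions on a small made-up column. The values are hypothetical and reuse numbers from the article only for familiarity.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"NOX": [0.538, 0.713, 0.437, 0.500, 0.538]})
wanted = [0.538, 0.713]

is_wanted = df["NOX"].isin(wanted)

# isin behaves like several == tests joined with |.
same_mask = (df["NOX"] == 0.538) | (df["NOX"] == 0.713)

matches = df[is_wanted]
others = df[~is_wanted]   # ~ flips every True/False in the mask

print(matches)
print(others)
```

With more than two or three values, the list passed to isin is much easier to read and maintain than a long chain of comparisons.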

4. str.contains

The examples above all filter by comparing numeric values. Strings often need to be queried as well, and pandas provides .str.contains() for this, which is a bit like LIKE in SQL statements.

The following uses the Titanic data as an example to select the rows where the person's name contains Mrs or Lily. Note that the OR symbol | goes inside the quotation marks.

train.loc[train['Name'].str.contains('Mrs|Lily'),:].head()

.str.contains() also accepts parameters that control the matching logic.

  • case=True: whether the match is case sensitive

  • na=True: the value to use for missing entries (NaN); here they are treated as True

  • flags=re.IGNORECASE: flags passed through to the re module, e.g. re.IGNORECASE

  • regex=True: if True, the pattern is treated as a regular expression; otherwise it is treated as a literal string
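
The effect of these parameters can be sketched on a handful of made-up names. The Series below is hypothetical and is given object dtype explicitly so that the missing value behaves the same across pandas versions; na=False is used here instead of na=True so that missing names are simply excluded.

```python
import re
import pandas as pd

# Hypothetical names, including one missing value and one lowercase variant.
names = pd.Series(
    ["Mrs. Smith", "Mr. Jones", "Lily Brown", None, "mrs. lower"],
    dtype=object,
)

# Regex alternation, case sensitive; missing values count as no match.
sensitive = names.str.contains("Mrs|Lily", na=False)

# Two ways to ignore case.
by_case = names.str.contains("Mrs|Lily", case=False, na=False)
by_flags = names.str.contains("Mrs|Lily", flags=re.IGNORECASE, na=False)

# regex=False looks for the literal text "Mrs|Lily", which no name contains.
literal = names.str.contains("Mrs|Lily", regex=False, na=False)

print(sensitive.tolist())
print(by_case.tolist())
print(literal.tolist())
```

Passing na explicitly is a reasonable habit whenever the result is used as a row mask, because a mask that still contains missing values cannot be used for indexing.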

5. where/mask

In SQL, as we know, where filters out the rows that meet a condition. pandas also has a where method for filtering, but its usage is slightly different.

The condition that where accepts must be of boolean type. Values where the condition is not met are replaced with NaN by default, or with another specified value. For example, with Sex equal to male as the filter condition, cond is a boolean Series, and the non-male values are replaced with the default NaN.

cond = train['Sex'] == 'male'
# Assign the result back: inplace=True on a selected column may not update train in recent pandas
train['Sex'] = train['Sex'].where(cond)
train.head()

You can also use the other parameter to fill in a specified value instead.

cond = train['Sex'] == 'male'
train['Sex'] = train['Sex'].where(cond, other='FEMALE')

You can even write combined conditions.

train['quality'] = ''
cond1 = train['Sex'] == 'male'
cond2 = train['Age'] > 25

# Rows that do NOT satisfy both conditions receive the label
train['quality'] = train['quality'].where(cond1 & cond2, other='low-quality male')

mask and where are a pair of operations: mask does exactly the opposite of where, replacing the values where the condition is met.

train['quality'] = train['quality'].mask(cond1 & cond2, other='low-quality male')
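
Because where and mask are easy to mix up, here is a small side-by-side sketch. The four-row DataFrame and the label "other" are hypothetical and stand in for the Titanic data.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data.
train = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Age": [30, 40, 20, 22],
})
cond = (train["Sex"] == "male") & (train["Age"] > 25)   # True only for row 0

# where keeps values where cond is True and replaces the rest.
kept = train["Sex"].where(cond)                    # default replacement is NaN
labelled = train["Sex"].where(cond, other="other")

# mask replaces values where cond is True and keeps the rest.
masked = train["Sex"].mask(cond, other="other")

print(labelled.tolist())
print(masked.tolist())
```

One way to remember the pair: mask(cond) gives the same result as where(~cond). Both return a new Series here, so train is left unchanged unless the result is assigned back to a column.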

6. query

This is a very elegant way of filtering data: the whole filter expression is written inside a quoted string.

# The usual way
train[train.Age > 25]
# The query way
train.query('Age > 25')

The two approaches above give the same result. A more complicated example adds the str.contains usage from earlier as a combined condition. Note that when the condition itself contains '' quotes, the whole expression should be wrapped in "" instead.

train.query("Name.str.contains('William') & Age > 25")

Inside query you can also reference Python variables by prefixing them with @.

name = 'William'
train.query("Name.str.contains(@name)")
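
The following sketch checks that query and ordinary boolean indexing agree, and combines a string test with a numeric one. The three-row DataFrame and the variable names min_age and name are hypothetical; engine="python" is passed explicitly on the assumption that string methods inside query are most reliably evaluated by that engine.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data.
train = pd.DataFrame({
    "Name": ["William Ford", "Anna Lee", "William Hart"],
    "Age": [30, 40, 20],
})
min_age = 25
name = "William"

# The same filter written two ways.
by_query = train.query("Age > @min_age")
by_mask = train[train["Age"] > min_age]

# A string condition combined with a numeric one, both using @ variables.
both = train.query(
    "Name.str.contains(@name) and Age > @min_age",
    engine="python",
)

print(by_query)
print(both)
```

Keeping thresholds and patterns in variables and referring to them with @ avoids building the query string by hand with string formatting.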

7. filter

filter is another distinctive filtering feature. Rather than filtering on the data values, filter selects rows or columns by their labels. It supports three selection methods, plus an axis parameter:

  • items: exact label names

  • regex: labels matching a regular expression

  • like: fuzzy match, i.e. labels containing a given substring

  • axis: whether to filter on the row index or on the column names

An example is given below.

train.filter(items=['Age', 'Sex'])

train.filter(regex='S', axis=1) # column names containing S

train.filter(like='2', axis=0) # index labels containing 2

train.filter(regex='^2', axis=0).filter(like='S', axis=1)
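
Since filter works on labels rather than values, a sketch with a deliberately irregular index makes the behaviour easier to follow. The DataFrame below is hypothetical; only the column names are borrowed from the Titanic data.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data, with a non-consecutive index.
train = pd.DataFrame(
    {
        "Age": [22, 38, 26],
        "Sex": ["male", "female", "female"],
        "SibSp": [1, 1, 0],
        "Fare": [7.25, 71.28, 7.92],
    },
    index=[2, 13, 21],
)

exact_cols = train.filter(items=["Age", "Sex"])      # exact column names
s_cols = train.filter(regex="^S", axis=1)            # columns starting with S
ar_cols = train.filter(like="ar", axis=1)            # columns containing "ar"
rows_with_2 = train.filter(like="2", axis=0)         # index labels containing "2"

# Filters can be chained: rows first, then columns.
chained = train.filter(regex="^2", axis=0).filter(like="S", axis=1)

print(s_cols.columns.tolist())
print(rows_with_2.index.tolist())
print(chained)
```

Note that index label 13 is dropped by like='2' even though it is the second row, because the match is against the label text, not the position.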

8. any/all

The any method returns True if at least one value is True, while all returns True only if every value is True, as in the following.

>>> train['Cabin'].all()
False
>>> train['Cabin'].any()
True

any and all are generally used together with other operations, for example to check which columns contain null values.

train.isnull().any(axis=0)

Another example is counting the number of rows that contain at least one null value.

>>> train.isnull().any(axis=1).sum()
708
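
The axis argument decides whether any and all summarize down each column or across each row. The sketch below illustrates this on a small made-up DataFrame with a few missing values; the data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data, with some missing values.
train = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan],
    "Fare": [7.25, 71.28, 7.92],
})
nulls = train.isnull()

# axis=0 collapses the rows: one answer per column.
cols_with_null = nulls.any(axis=0)
cols_all_null = nulls.all(axis=0)

# axis=1 collapses the columns: one answer per row.
rows_with_null = nulls.any(axis=1).sum()

# The row mask can be inverted to keep only complete rows.
complete = train[~nulls.any(axis=1)]

print(cols_with_null)
print(rows_with_null)
print(complete)
```

Summing a boolean Series counts its True values, which is why any(axis=1).sum() gives the number of rows with at least one null.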

Origin blog.csdn.net/qq_34160248/article/details/124390932