One of the most common data-analysis tasks in daily Python work is querying and filtering: picking out the data we want by various conditions, dimensions, and combinations so that we can analyze and mine it.
This article summarizes the common types of daily query and filtering operations for your reference, using the Boston housing dataset from sklearn as the example data.
from sklearn import datasets
import pandas as pd

# Note: load_boston was removed in scikit-learn 1.2
boston = datasets.load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
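On scikit-learn >= 1.2 the loading snippet raises an error because load_boston was removed. None of the techniques below depend on that particular dataset; as a sketch, any small DataFrame carrying the column names used in the examples works as a stand-in (the values here are made up for illustration):

```python
import pandas as pd

# Minimal stand-in with the two columns used in the examples below
# (load_boston no longer exists in scikit-learn >= 1.2).
df = pd.DataFrame({
    'NOX':  [0.538, 0.469, 0.437, 0.713, 0.524],
    'CHAS': [0.0, 0.0, 1.0, 0.0, 1.0],
})
print(df.shape)  # (5, 2)
```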
1. []
The first method is the quickest and most convenient: write the filter condition, or a combination of conditions, directly inside the DataFrame's square brackets []. For example, the following filters out all rows where NOX is greater than that column's mean, then sorts by NOX in descending order.
df[df['NOX']>df['NOX'].mean()].sort_values(by='NOX',ascending=False).head()
Of course, combined conditions also work, joined by logical operators such as & and |. For example, the following adds the condition CHAS == 1 to the one above; note that each condition joined by a logical operator must be wrapped in parentheses ().
df[(df['NOX']>df['NOX'].mean())& (df['CHAS'] ==1)].sort_values(by='NOX',ascending=False).head()
2. loc/iloc
Besides [], loc and iloc are probably the two most commonly used query methods. loc accesses by label (column name and row index value), while iloc accesses by integer position; both support single-value access and slice queries. In addition to filtering rows by condition, loc can also restrict which columns are returned, filtering on both the row and column dimensions.
For example, the following selects rows by condition, restricts the result to the specified column, and assigns a new value to it.
df.loc[(df['NOX']>df['NOX'].mean()),['CHAS']] = 2
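Since iloc is described above but not demonstrated, here is a minimal sketch contrasting the two on a toy DataFrame (the column names are illustrative, not from the Boston data):

```python
import pandas as pd

# Toy DataFrame to contrast loc (label-based) with iloc (position-based)
df = pd.DataFrame({'a': [10, 20, 30], 'b': [1, 2, 3]})

# iloc slices by integer position: rows 0-1 (end exclusive), first column
print(df.iloc[0:2, 0].tolist())   # [10, 20]

# loc slices by label: row labels 0 through 1 (end INCLUSIVE), column 'a'
print(df.loc[0:1, 'a'].tolist())  # [10, 20]
```

Note the asymmetry: iloc slices exclude the endpoint like ordinary Python slices, while loc slices include it.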
3. isin
The filter conditions above (<, >, ==, !=) are all range comparisons, but often we need to lock onto certain specific values, and that is where isin comes in. For example, suppose we want to keep only the rows where NOX takes one of the values 0.538, 0.713, or 0.437.
df.loc[df['NOX'].isin([0.538,0.713,0.437]),:].sample(5)
You can also invert the selection by prefixing the filter condition with the ~ symbol.
df.loc[~df['NOX'].isin([0.538,0.713,0.437]),:].sample(5)
4. str.contains
The examples above all filter by comparing numeric values, but there are also query requirements on strings. In pandas, .str.contains() can be used for this; it works a bit like LIKE in SQL statements.
The following uses the Titanic data as an example to filter out the rows whose Name contains Mrs or Lily; the | OR operator goes inside the quotation marks.
train.loc[train['Name'].str.contains('Mrs|Lily'),:].head()
.str.contains() also accepts parameters that control the matching logic:
- case: whether the match is case sensitive (True by default)
- na: the value to fill in for missing entries, e.g. na=False treats NaN as a non-match
- flags: flags to pass to the re module, e.g. re.IGNORECASE
- regex: if True (the default), treat the pattern as a regular expression; otherwise match it as a literal string
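A small self-contained sketch of these parameters on a toy Series (the names are made up, not Titanic data):

```python
import re
import pandas as pd

# Toy Series with a missing value to show the na parameter
s = pd.Series(['Mrs. Smith', 'mr. Jones', None, 'Lily'])

# case=False makes the match case-insensitive
print(s.str.contains('mrs', case=False, na=False).tolist())
# [True, False, False, False]

# na=False fills missing entries with False instead of NaN;
# regex=True (the default) means '|' works as OR
print(s.str.contains('Mrs|Lily', na=False).tolist())
# [True, False, False, True]

# flags are forwarded to the re module
print(s.str.contains('MRS', flags=re.IGNORECASE, na=False).tolist())
# [True, False, False, False]
```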
5. where/mask
In SQL, the WHERE clause we know filters out the rows that meet a condition. pandas also has a where, but its usage is slightly different.
where accepts a boolean condition; entries that do not match the condition are replaced, by default with NaN, or with another specified value. For example, with Sex == 'male' as the filter condition, cond is a boolean Series, and the non-male entries are replaced with the default NaN null value.
cond = train['Sex'] == 'male'
train['Sex'] = train['Sex'].where(cond)
train.head()
You can also use other to supply the replacement value.
cond = train['Sex'] == 'male'
train['Sex'] = train['Sex'].where(cond, other='FEMALE')
You can even write combined conditions.
train['quality'] = ''
cond1 = train['Sex'] == 'male'
cond2 = train['Age'] > 25
train['quality'] = train['quality'].where(cond1 & cond2, other='low-quality male')
mask and where are a pair of operations: mask does exactly the opposite, replacing the entries where the condition is True.
train['quality'] = train['quality'].mask(cond1 & cond2, other='low-quality male')
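The where/mask symmetry is easiest to see on a tiny self-contained Series (toy values, not the Titanic data):

```python
import pandas as pd

s = pd.Series([10, 25, 40])
cond = s > 20

# where KEEPS entries where the condition is True and replaces the rest
print(s.where(cond, other=0).tolist())  # [0, 25, 40]

# mask REPLACES entries where the condition is True -- the exact opposite
print(s.mask(cond, other=0).tolist())   # [10, 0, 0]
```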
6. query
This is a very elegant way to filter data: the whole filter expression is written inside a quoted string.
# the usual way
train[train.Age > 25]
# with query
train.query('Age > 25')
The two methods above produce the same result. A more complicated example combines query with the str.contains usage from above; note that when the condition itself contains single quotes '', the whole expression should be wrapped in double quotes "".
train.query("Name.str.contains('William') & Age > 25")
Inside query you can also reference Python variables with @.
name = 'William'
train.query("Name.str.contains(@name)")
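A self-contained sketch of both query forms on toy data (the names are illustrative; engine='python' is added because .str methods inside query are not supported by the numexpr engine on some pandas setups):

```python
import pandas as pd

# Toy data standing in for the Titanic frame
df = pd.DataFrame({'Name': ['William Smith', 'Anna Lee'], 'Age': [30, 22]})

print(df.query('Age > 25')['Name'].tolist())
# ['William Smith']

# @ references a Python variable inside the query string
name = 'Anna'
print(df.query('Name.str.contains(@name)', engine='python')['Name'].tolist())
# ['Anna Lee']
```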
7. filter
filter is another distinctive filtering feature: instead of filtering the data by value, it selects specific rows or columns by their labels. It supports three selection methods, plus an axis argument:
- items: select by exact label names
- regex: select by regular expression
- like: fuzzy (substring) matching
- axis: whether to filter the row index (axis=0) or the columns (axis=1)
Some examples are given below.
train.filter(items=['Age', 'Sex'])
train.filter(regex='S', axis=1)   # columns whose name contains S
train.filter(like='2', axis=0)    # index labels containing 2
train.filter(regex='^2', axis=0).filter(like='S', axis=1)
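The same three selection methods can be checked on a tiny self-contained DataFrame (labels here are made up):

```python
import pandas as pd

# Toy DataFrame with string row labels
df = pd.DataFrame({'Sex': ['m', 'f'], 'Age': [22, 30], 'Survived': [0, 1]},
                  index=['p1', 'p2'])

print(df.filter(items=['Age', 'Sex']).columns.tolist())  # ['Age', 'Sex']
print(df.filter(regex='^S', axis=1).columns.tolist())    # ['Sex', 'Survived']
print(df.filter(like='1', axis=0).index.tolist())        # ['p1']
```

Note that items also controls the output order, while regex and like keep the frame's original order.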
8. any/all
any returns True if at least one value is True, while all requires every value to be True. For example:
>>> train['Cabin'].all()
False
>>> train['Cabin'].any()
True
any and all are generally used in combination with other operations, for example to check which columns contain null values.
train.isnull().any(axis=0)
Another example counts the number of rows that contain null values.
>>> train.isnull().any(axis=1).sum()
708
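Both patterns can be verified on a tiny self-contained frame (toy data with one null):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3], 'b': [4, 5, 6]})

# Per-column: does each column contain at least one null?
print(df.isnull().any(axis=0).tolist())  # [True, False]

# Per-row any, then sum: how many rows contain at least one null?
print(df.isnull().any(axis=1).sum())     # 1
```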