A pitfall with the cut function in pandas

Today a colleague reported a problem: after binning, the total number of rows differed from column to column. In a DataFrame (where the data is aligned) every column has the same length, so although different columns may be split into different numbers of groups, the total count across the groups should be the same. So I took another look at the cut function. cut has the following parameters:

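x :              The one-dimensional array to bin; it can be a list or a single column of a DataFrame
bins :           How to split the data. Pass a list of edges [a, b, c] to bin into the intervals a-b and b-c, or pass a number n to split into n groups
right :          True/False. If True, each interval includes its right edge but not its left edge, i.e. (]; if False, [)
labels :         Labels for the resulting bins, e.g. ["low", "medium", "high"]
retbins :        True/False. If True, the bin edges are returned as well, which is useful when bins is a single number (because in that case we do not otherwise know where the edges are)
precision :      The precision at which bin labels are stored and displayed. Default is 3
include_lowest : True/False. Whether the first interval should be left-inclusive
duplicates :     'raise'/'drop'. If the bin edges contain duplicates, either raise an error or drop the duplicates so that only one of each remains
ordered :        Whether the labels are ordered. Applies to the returned Categorical and Series (with Categorical dtype). If True, the resulting categorical is ordered; if False, it is unordered (labels must be provided)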
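
To make a few of these parameters concrete, here is a small sketch. The sample values and the label names are made up for illustration and are not from the data discussed in this post; it only shows how bins, retbins, right and labels interact.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration
values = pd.Series([1, 5, 10, 15, 20])

# bins as a single number: pandas chooses the edges itself,
# and retbins=True hands those edges back to us
binned, edges = pd.cut(values, bins=2, retbins=True)
print(edges)

# bins as explicit edges, left-closed intervals [a, b), with custom labels
labeled = pd.cut(
    values,
    bins=[0, 10, 20, 30],
    right=False,
    labels=["low", "mid", "high"],
)
print(labeled.tolist())  # ['low', 'low', 'mid', 'mid', 'high']
```

With right=False the value 10 lands in the second bin ([10, 20)) rather than the first, which is exactly the kind of edge behaviour the rest of this post is about.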

If the totals differ, the binned count is either too high or too low. Too high would mean that some values were counted twice. But the interval parameter right allows only two cases, left-open/right-closed or left-closed/right-open, and in neither case can a value be counted twice. So the count cannot be too high; it can only be too low. Which parameter could cause values to go missing? After some digging, I found that the include_lowest parameter is easy to overlook. For example, take the following data:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a':np.arange(10),
    'b':np.arange(10),
})
df_cut = pd.cut(df['a'], bins=[0, 3, 6, 9])
df_cut

This gives:

0           NaN
1    (0.0, 3.0]
2    (0.0, 3.0]
3    (0.0, 3.0]
4    (3.0, 6.0]
5    (3.0, 6.0]
6    (3.0, 6.0]
7    (6.0, 9.0]
8    (6.0, 9.0]
9    (6.0, 9.0]
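
A quick way to confirm that a row has been dropped is to compare the number of rows with the number of binned values and to list the values that ended up as NaN. This is only a sketch of one possible check (the variable names are my own), not something pandas does for you:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(10)})
binned = pd.cut(df["a"], bins=[0, 3, 6, 9])

total_rows = len(df)
# value_counts() skips NaN by default, so this is the number of binned rows
binned_rows = int(binned.value_counts().sum())
# values that did not fall into any interval
unbinned = df.loc[binned.isna(), "a"].tolist()

print(total_rows, binned_rows, unbinned)  # 10 9 [0]
```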

Although I passed 0 in bins, the intervals are left-open and right-closed by default, so 0 is not included in the first interval. Because I did not set include_lowest, the value 0 becomes NaN and is missing from the counts. This is tied to the open/closed setting: with half-open intervals, one outer edge is always left out, either the lowest or the highest. With the default left-open, right-closed intervals, a minimum value that falls exactly on the left edge of the first interval is dropped, and setting include_lowest=True makes up for exactly that case. After adding this parameter, the result is:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a':np.arange(10),
    'b':np.arange(10),
})
df_cut = pd.cut(df['a'],bins=[0,3,6,9],include_lowest=True)
df_cut

0    (-0.001, 3.0]
1    (-0.001, 3.0]
2    (-0.001, 3.0]
3    (-0.001, 3.0]
4       (3.0, 6.0]
5       (3.0, 6.0]
6       (3.0, 6.0]
7       (6.0, 9.0]
8       (6.0, 9.0]
9       (6.0, 9.0]

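Now 0 is included in the first bin. The first interval is displayed as (-0.001, 3.0] because pandas shifts its left edge slightly so that 0 falls inside it.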
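
The mirror-image case is worth checking too. With right=False the intervals are left-closed and right-open, so it is the maximum value on the last edge that goes missing. As far as I can tell from the documentation, include_lowest only concerns the first interval, so one simple workaround (an assumption of mine, not the only option) is to push the last edge past the maximum:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10))

# Left-closed, right-open intervals: [0, 3), [3, 6), [6, 9)
left_closed = pd.cut(s, bins=[0, 3, 6, 9], right=False)
print(s[left_closed.isna()].tolist())  # [9]

# Widen the last edge so that 9 falls inside [6, 10)
widened = pd.cut(s, bins=[0, 3, 6, 10], right=False)
print(int(widened.isna().sum()))  # 0
```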


Origin blog.csdn.net/zy1620454507/article/details/132473911