Detailed explanation of binning with the pandas cut and qcut methods

Project GitHub address: bitcarmanlee easy-algorithm-interview-and-practice
Everyone is welcome to star the project, leave a comment, and learn together.

1. Binning

Binning data is a very common need in practice. A set of continuous values is divided into several segments, and each segment is treated as one category. This process is called binning, and it is essentially a way of discretizing continuous values.

A common example is binning ages. Suppose ages range from 0 to 120. We treat 0-5 as infants, 6-15 as juveniles, 16-30 as youths, 31-50 as middle-aged, 51-60 as late middle-aged, and over 60 as elderly. In this process the continuous ages are divided into six categories (infant, juvenile, youth, middle-aged, late middle-aged, and elderly), that is, into six "boxes" or bins, each representing one category.
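The age example can be sketched in plain Python before turning to pandas. This is only an illustration: the helper name `age_to_category`, the label strings, and the choice of inclusive upper edges are assumptions made for the example, not part of any library.

```python
from bisect import bisect_left

# Upper (inclusive) edge of every bin except the last, which is open-ended.
UPPER_EDGES = [5, 15, 30, 50, 60]
LABELS = ["infant", "juvenile", "youth", "middle-aged", "late middle-aged", "elderly"]


def age_to_category(age):
    """Map a continuous age to the label of its bin (hypothetical helper)."""
    # bisect_left returns the index of the first edge >= age,
    # so an age equal to an edge stays in the lower bin.
    return LABELS[bisect_left(UPPER_EDGES, age)]


ages = [3, 15, 16, 42, 55, 78]
categories = [age_to_category(a) for a in ages]
print(categories)
```

Each continuous age is replaced by one of six discrete labels, which is all that binning means.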

2. The cut method

pandas provides two methods for binning, cut and qcut. Let's look at cut first.

import pandas as pd

def t1():
    scores = [80, 55, 78, 99, 60, 35, 82, 57]
    # An integer asks for that many equal-width bins over the range of the data.
    cut = pd.cut(scores, 3)
    print(cut)

The code above divides the scores into three equal-width intervals. The result is

[(77.667, 99.0], (34.936, 56.333], (77.667, 99.0], (77.667, 99.0], (56.333, 77.667], (34.936, 56.333], (77.667, 99.0], (56.333, 77.667]]
Categories (3, interval[float64]): [(34.936, 56.333] < (56.333, 77.667] < (77.667, 99.0]]

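The first line of the output shows which bin each original value falls into, and the second line lists the three bins in order. The bins have equal width, and the left edge of the first bin sits slightly below the minimum score (35) so that the minimum is included.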
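The edges in this output can be reproduced by hand. The sketch below assumes that pandas spaces the edges evenly between the minimum and maximum and then lowers the first edge by 0.1% of the range; that assumption is checked against the edges pd.cut reports through its retbins parameter.

```python
import numpy as np
import pandas as pd

scores = [80, 55, 78, 99, 60, 35, 82, 57]
n_bins = 3

# Equal-width edges between the minimum and the maximum.
lo, hi = min(scores), max(scores)
edges = np.linspace(lo, hi, n_bins + 1)

# Assumption: lower the first edge by 0.1% of the range so that the
# minimum falls inside the first right-closed interval.
edges[0] -= 0.001 * (hi - lo)
print(edges)

# retbins=True asks pd.cut to also return the edges it actually used.
_, pandas_edges = pd.cut(scores, n_bins, retbins=True)
print(pandas_edges)
```

The first edge comes out as 35 - 0.064 = 34.936, which is the value shown in the printed categories.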

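def t2():
    scores = [80, 55, 78, 99, 60, 35, 82, 57]
    # A list gives the bin edges explicitly.
    bins = [0, 60, 80, 100]
    cut = pd.cut(scores, bins)
    print(cut)

    print(cut.codes)       # integer index of the bin each score fell into
    print(cut.categories)  # the bins themselves, as an IntervalIndex
    # Number of scores per bin. Recent pandas versions deprecate pd.value_counts
    # in favor of pd.Series(cut).value_counts().
    print(pd.value_counts(cut))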

The output is

[(60, 80], (0, 60], (60, 80], (80, 100], (0, 60], (0, 60], (80, 100], (0, 60]]
Categories (3, interval[int64]): [(0, 60] < (60, 80] < (80, 100]]
[1 0 1 2 0 0 2 0]
IntervalIndex([(0, 60], (60, 80], (80, 100]],
              closed='right',
              dtype='interval[int64]')
(0, 60]      4
(80, 100]    2
(60, 80]     2
dtype: int64

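The code above specifies the bin edges explicitly, so the intervals are (0, 60], (60, 80], and (80, 100]. By default each interval is open on the left and closed on the right, which is why 60 falls into (0, 60] and 80 into (60, 80].
The codes attribute gives the index of the bin each score fell into, and categories lists the bins. The value_counts method counts how many values fall into each interval.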
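One consequence of the interval boundaries is worth a small sketch: a value that lies in no bin, including one equal to the lowest edge, gets no category. The scores below are made up for illustration; include_lowest is a parameter of pd.cut.

```python
import pandas as pd

bins = [0, 60, 80, 100]
scores = [0, 60, 101]  # illustrative values on or outside the edges

# Default: the first interval is open on the left, so 0 falls in no bin.
default_cut = pd.cut(scores, bins)
print(default_cut.codes)  # a code of -1 marks a value that fell in no bin

# include_lowest=True makes the first interval include its left edge.
inclusive_cut = pd.cut(scores, bins, include_lowest=True)
print(inclusive_cut.codes)
```

With the default settings only 60 is assigned a bin. With include_lowest=True the score 0 joins the first bin, while 101 still falls outside every bin.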

def t3():
    scores = [80, 55, 78, 99, 60, 35, 82, 57]
    bins = [0, 60, 80, 100]
    # labels names the bins in order; one label per bin.
    cut = pd.cut(scores, bins, labels=["low", "mid", "high"])
    print(pd.value_counts(cut))
    print()

    # right=False makes each bin closed on the left and open on the right.
    cut2 = pd.cut(scores, bins, labels=["low", "mid", "high"], right=False)
    print(pd.value_counts(cut2))

low     4
high    2
mid     2
dtype: int64

high    3
low     3
mid     2
dtype: int64

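In the code above, the labels parameter gives each bin a name, so the result shows labels instead of intervals.
If you specify right=False, each interval changes from the default open-left, closed-right form to closed-left, open-right: [0, 60), [60, 80), [80, 100). The scores 60 and 80 therefore each move up one bin, which is why low drops from 4 to 3 and high rises from 2 to 3.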
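The effect of right=False is easiest to see on values that sit exactly on a bin edge. A minimal sketch, using the same bins and labels as above and two edge values picked for illustration:

```python
import pandas as pd

bins = [0, 60, 80, 100]
labels = ["low", "mid", "high"]
edge_scores = [60, 80]  # values that sit exactly on a bin edge

# Default right=True: bins are (0, 60], (60, 80], (80, 100].
right_closed = pd.cut(edge_scores, bins, labels=labels)
# right=False: bins are [0, 60), [60, 80), [80, 100).
left_closed = pd.cut(edge_scores, bins, labels=labels, right=False)

print(list(right_closed))
print(list(left_closed))
```

With the default setting the two scores land in low and mid; with right=False they land in mid and high.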

3. The qcut method

def t4():
    # Squares of 0..10: unevenly spaced values.
    scores = [x**2 for x in range(11)]
    # An integer asks for that many quantile-based bins.
    cut = pd.qcut(scores, 5)
    print(cut)
    print()
    print(pd.value_counts(cut))

[(-0.001, 4.0], (-0.001, 4.0], (-0.001, 4.0], (4.0, 16.0], (4.0, 16.0], ..., (16.0, 36.0], (36.0, 64.0], (36.0, 64.0], (64.0, 100.0], (64.0, 100.0]]
Length: 11
Categories (5, interval[float64]): [(-0.001, 4.0] < (4.0, 16.0] < (16.0, 36.0] < (36.0, 64.0] <
                                    (64.0, 100.0]]

(-0.001, 4.0]    3
(64.0, 100.0]    2
(36.0, 64.0]     2
(16.0, 36.0]     2
(4.0, 16.0]      2
dtype: int64

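The difference from the cut method is that cut chooses the bin edges from the values of the variable (for example, equal-width intervals), while qcut chooses them from the quantiles of the data, so that each bin holds roughly the same number of values. The call above divides the input into five such bins. Because 11 values cannot be split evenly into five bins, one bin holds 3 values and the others hold 2.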
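The contrast can be made concrete by running both methods on the same data. The sketch below counts values per bin from the integer codes and, as an assumption about how the qcut edges arise, compares them with the 0%, 20%, ..., 100% quantiles computed by np.quantile.

```python
import numpy as np
import pandas as pd

scores = [x**2 for x in range(11)]

# qcut: edges are sample quantiles, so the bins hold similar numbers of values.
by_quantile, q_edges = pd.qcut(scores, 5, retbins=True)
# cut: edges are equally spaced over the value range.
by_width = pd.cut(scores, 5)

# Count values per bin, in bin order, from the integer codes.
quantile_counts = np.bincount(by_quantile.codes, minlength=5)
width_counts = np.bincount(by_width.codes, minlength=5)
print(quantile_counts)
print(width_counts)

# Compare the qcut edges with the quantiles of the data.
expected_edges = np.quantile(scores, np.linspace(0, 1, 6))
print(q_edges)
print(expected_edges)
```

On this data qcut gives nearly balanced bins, while cut puts five of the eleven values into its first bin because the small squares are crowded together at the low end of the range.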