How can I efficiently assign a value to a bin after creating the bins with pandas.cut()?

user3768495 :

Say I have a column in a dataframe which is 'user_age', and I have created 'user_age_bin' by something like:

df['user_age_bin'] = pd.cut(df['user_age'], bins=[10, 15, 20, 25, 30])

Then I build a machine learning model using the 'user_age_bin' feature.

Next, I get a new record that I need to feed into the model to make a prediction. I don't want to use user_age as it is, because the model uses user_age_bin. So how can I convert a single user_age value (say 28) into its user_age_bin? I know I can write a function like this:

def assign_bin(age):
    if age < 10:
        return '<10'
    elif age< 15:
        return '10-15'
    # ... etc. etc.

and then do:

user_age_bin = assign_bin(28)

But this solution is not elegant at all. I guess there must be a better way, right?

Edit: I changed the code and added explicit bin range. Edit2: Edited wording and hopefully the question is clearer now.

user3768495 :

tl;dr: np.digitize is a good solution.

After reading all the comments and answers here and doing some more Googling, I found a solution that I am pretty satisfied with. Thank you all!

Setup

import pandas as pd
import numpy as np
np.random.seed(42)

bins = [0, 10, 15, 20, 25, 30, np.inf]
labels = bins[1:]
ages = list(range(5, 90, 5))
df = pd.DataFrame({"user_age": ages})
df["user_age_bin"] = pd.cut(df["user_age"], bins=bins, labels=False)

# sort by age 
print(df.sort_values('user_age'))

Output:

    user_age  user_age_bin
0          5             0
1         10             0
2         15             1
3         20             2
4         25             3
5         30             4
6         35             5
7         40             5
8         45             5
9         50             5
10        55             5
11        60             5
12        65             5
13        70             5
14        75             5
15        80             5
16        85             5
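
For a single new record, one option is to pass the same bin edges back to pd.cut. This is only a minimal sketch, assuming the edges from the setup above were kept around; the helper name age_to_bin is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd

# same edges as in the setup above
bins = [0, 10, 15, 20, 25, 30, np.inf]

def age_to_bin(age, bins=bins):
    """Return the integer bin code pd.cut assigns to one age (hypothetical helper)."""
    # pd.cut expects array-like input, so wrap the scalar in a list
    codes = pd.cut([age], bins=bins, labels=False)
    return codes[0]

print(age_to_bin(28))  # 28 falls in (25, 30], which is bin 4
```

This keeps the binning logic in one place, at the cost of building a small array for every call.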

Assign category:

# a new age value
new_age = 30

# right=True makes the intervals right-closed, like pd.cut's default;
# subtracting 1 shifts the result to the 0-based codes from labels=False
print(np.digitize(new_age, bins=bins, right=True) - 1)

Output:

4
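
np.digitize also accepts arrays, so a batch of new ages can be binned in one call. The sketch below checks the result against pd.cut for the same edges; the sample ages are made up for illustration. One caveat worth checking in your own data: a value at or below the first edge comes back as -1 here, whereas pd.cut returns NaN for it.

```python
import numpy as np
import pandas as pd

bins = [0, 10, 15, 20, 25, 30, np.inf]
new_ages = np.array([5, 10, 11, 28, 30, 31, 85])  # made-up sample values

# right=True makes intervals right-closed like pd.cut; -1 shifts to 0-based codes
digitized = np.digitize(new_ages, bins=bins, right=True) - 1
cut_codes = pd.cut(new_ages, bins=bins, labels=False)

print(digitized)                             # [0 0 1 4 4 5 5]
print(np.array_equal(digitized, cut_codes))  # True

# out-of-range check: 0 is not inside (0, 10], so the shifted result is -1
print(np.digitize(0, bins=bins, right=True) - 1)
```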
