Say I have a column in a dataframe which is 'user_age', and I have created 'user_age_bin' by something like:
df['user_age_bin'] = pd.cut(df['user_age'], bins=[10, 15, 20, 25, 30])
Then I build a machine learning model by using the 'user_age_bin' feature.
Next, I get a single record that I need to feed into my model to make a prediction. I don't want to use user_age as it is, because the model uses user_age_bin. So how can I convert a user_age value (say 28) into user_age_bin? I know I can create a function like this:
def assign_bin(age):
    if age < 10:
        return '<10'
    elif age < 15:
        return '10-15'
    # ... etc. etc.
and then do:
user_age_bin = assign_bin(28)
But this solution is not elegant at all. I guess there must be a better way, right?
Edit: I changed the code and added explicit bin range. Edit2: Edited wording and hopefully the question is clearer now.
tl;dr: np.digitize is a good solution.
After reading all the comments and answers here and doing some more Googling, I think I have a solution that I am pretty satisfied with. Thank you all!
Setup
import pandas as pd
import numpy as np
np.random.seed(42)
bins = [0, 10, 15, 20, 25, 30, np.inf]
labels = bins[1:]
ages = list(range(5, 90, 5))
df = pd.DataFrame({"user_age": ages})
df["user_age_bin"] = pd.cut(df["user_age"], bins=bins, labels=False)
# sort by age
print(df.sort_values('user_age'))
Output:
    user_age  user_age_bin
0          5             0
1         10             0
2         15             1
3         20             2
4         25             3
5         30             4
6         35             5
7         40             5
8         45             5
9         50             5
10        55             5
11        60             5
12        65             5
13        70             5
14        75             5
15        80             5
16        85             5
Assign category:
# a new age value
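new_age = 30
# right=True matches pd.cut's default right-closed intervals, e.g. (25, 30];
# subtracting 1 shifts np.digitize's result to the 0-based codes from pd.cut(..., labels=False)
print(np.digitize(new_age, bins=bins, right=True) - 1)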
Output:
4
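As a sanity check, here is a minimal sketch that compares the np.digitize trick against pd.cut on a handful of ages, using the same bin edges as the setup above. The helper name age_to_bin and the sample ages are made up for illustration; they are not part of the original code.

```python
import numpy as np
import pandas as pd

# Same edges as in the setup above
bins = [0, 10, 15, 20, 25, 30, np.inf]


def age_to_bin(age, edges=bins):
    """Hypothetical helper: map one age (or a list of ages) to an integer bin code."""
    # right=True makes each interval closed on the right, like pd.cut's default;
    # the -1 shifts the result to the 0-based codes of pd.cut(..., labels=False).
    return np.digitize(age, bins=edges, right=True) - 1


# Sample ages chosen to hit bin edges and interior values (illustrative only)
new_ages = [5, 10, 11, 28, 30, 31, 85]

from_digitize = age_to_bin(new_ages)
from_cut = pd.cut(new_ages, bins=bins, labels=False)

print(from_digitize)
print(from_cut)

# Both approaches should agree for every age inside the bin range
assert (from_digitize == from_cut).all()
```

One caveat: for an age at or below the lowest edge (0 here), this helper returns -1, whereas pd.cut returns NaN, so out-of-range values probably need separate handling before they reach the model.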