Hello everyone. In machine learning, many algorithms require categorical features to be transformed (encoded) before they can be used.
For ease of explanation, all of the examples below work on a single example DataFrame, df.
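The contents of the example table are not shown in this text, so the following is only a stand-in. The column names and their order (Sex, Course Name, Score) are inferred from the code later in the article; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: column names are inferred from the article's code,
# the values themselves are invented.
df = pd.DataFrame({
    'Sex': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'Course Name': ['Python', 'Java', 'Python', 'C', 'Java', 'Python'],
    'Score': [95, 82, 76, 64, 90, 51],
})
print(df)
```

Score sits at column position 2, which is what the iloc-based loop in the first method relies on.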
Numeric data
Let us first discuss transforming continuous data: based on the value of the Score column, add a column of labels. A score greater than 90 is marked A, a score from 80 to 90 is marked B, and so on.
Custom function + loop traversal
First, of course, comes the simplest and clumsiest method: write a function yourself and traverse the rows with a loop. That means one def plus one for.
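df1 = df.copy()

def myfun(x):
    # Map a numeric score to a letter grade
    if x > 90:
        return 'A'
    elif x >= 80:  # 80 to 90 inclusive, so exactly 90 is B and no longer falls through to E
        return 'B'
    elif x >= 70:
        return 'C'
    elif x >= 60:
        return 'D'
    else:
        return 'E'

df1['Score_Label'] = None
# Score is column 2, the new Score_Label is column 3
for i in range(len(df1)):
    df1.iloc[i, 3] = myfun(df1.iloc[i, 2])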
I believe everyone can understand this code. It is simple and easy to come up with, but cumbersome.
Is there an easier way? Of course: pandas provides many efficient functions for this, so read on.
Custom function + map
Now you can use map to get rid of the explicit loop (although it is still a loop under the hood).
df2 = df.copy()

def mapfun(x):
    # Same grading rule as before
    if x > 90:
        return 'A'
    elif x >= 80:  # 80 to 90 inclusive
        return 'B'
    elif x >= 70:
        return 'C'
    elif x >= 60:
        return 'D'
    else:
        return 'E'

df2['Score_Label'] = df2['Score'].map(mapfun)
The result is the same.
Anonymous function + apply
If you want to shorten the code further, you can use an anonymous function (lambda) with apply and drop the named custom function altogether.
df3 = df.copy()
# Each branch is only reached when the previous test failed, so no upper bounds are needed
df3['Score_Label'] = df3['Score'].apply(
    lambda x: 'A' if x > 90 else ('B' if x >= 80 else ('C' if x >= 70 else ('D' if x >= 60 else 'E'))))
The result is the same as above, but nesting conditionals like this is hard to read and is likely to get you beaten up by whoever maintains the code.
Use pd.cut
Now let's move on to more advanced pandas functions. We are still encoding Score: give pd.cut the bin edges and it groups the values for you directly. By default each interval is open on the left and closed on the right.
df4 = df.copy()
bins = [0, 59, 70, 80, 100]
df4['Score_Label'] = pd.cut(df4['Score'], bins)
You can also use the labels parameter to name the groups directly. Isn't that more convenient?
df4['Score_Label_new'] = pd.cut(df4['Score'], bins, labels=['low', 'middle', 'good', 'perfect'])
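As a rough sketch, pd.cut can also produce letter grades similar to the hand-written functions. The scores below are hypothetical, and the bin edges and right=False are my own choices: with left-closed bins a score of exactly 90 falls in the top bin, which may not be the boundary rule you want.

```python
import pandas as pd

# Hypothetical scores for illustration
scores = pd.Series([95, 90, 85, 72, 60, 45])

# right=False makes every bin closed on the left: [0, 60), [60, 70), ..., [90, 101)
grades = pd.cut(
    scores,
    bins=[0, 60, 70, 80, 90, 101],
    labels=['E', 'D', 'C', 'B', 'A'],
    right=False,
)
print(grades.tolist())
```

Values outside the outermost edges come back as NaN, so the first and last edges should cover the whole score range.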
Binarization using sklearn
Since this is about machine learning, sklearn cannot be left out. If you need a new column that says whether a score is a pass, you can use the Binarizer class. The code is concise and easy to understand.
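import numpy as np
from sklearn.preprocessing import Binarizer

df5 = df.copy()
# Values strictly greater than the threshold become 1, everything else 0
binarizer = Binarizer(threshold=60)
trans = binarizer.fit_transform(np.array(df5['Score']).reshape(-1, 1))
df5['Score_Label'] = trans.ravel()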
Text data
Next comes the more common case: transforming and labeling text data. For example, add a new column that encodes the genders Male and Female as 0 and 1 respectively.
Use replace
First up is replace. Note that the custom-function methods described above still work here as well.
df6 = df.copy()
df6['Sex_Label'] = df6['Sex'].replace(['Male','Female'],[0,1])
The example above handles gender: since there are only two values, 0 and 1 can be assigned by hand. If there are many categories, you can use value_counts() to assign the labels automatically, for example to encode the Course Name column.
df6 = df.copy()
value = df6['Course Name'].value_counts()
# value_counts sorts by frequency, so the most common course gets 0, the next gets 1, ...
value_map = {v: i for i, v in enumerate(value.index)}
df6['Course Name_Label'] = df6.replace({'Course Name': value_map})['Course Name']
Use map
To emphasize: whenever you need to add a new column, map should be one of the first things that comes to mind.
df7 = df.copy()
Map = {elem: index for index, elem in enumerate(set(df["Course Name"]))}
df7['Course Name_Label'] = df7['Course Name'].map(Map)
Use astype
Probably not many people know this method. As with the Zhihu question mentioned earlier, there are simply many ways to achieve the same thing.
df8 = df.copy()
value = df8['Course Name'].astype('category')
df8['Course Name_Label'] = value.cat.codes
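To see which code stands for which value, the categories can be read back from the same accessor. This is a small sketch with hypothetical course names; for strings, pandas sorts the categories, so the codes follow alphabetical order rather than order of appearance.

```python
import pandas as pd

# Hypothetical values for illustration
courses = pd.Series(['Python', 'Java', 'Python', 'C'])
cat = courses.astype('category')

codes = cat.cat.codes.tolist()
# Position in cat.categories is the integer code assigned to that value
mapping = dict(enumerate(cat.cat.categories))
print(codes)
print(mapping)
```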
Using sklearn
As with numeric data, this is a classic operation in machine learning, so sklearn naturally has a tool for it: LabelEncoder encodes categorical data.
from sklearn.preprocessing import LabelEncoder
df9 = df.copy()
le = LabelEncoder()
le.fit(df9['Sex'])
df9['Sex_Label'] = le.transform(df9['Sex'])
le.fit(df9['Course Name'])
df9['Course Name_Label'] = le.transform(df9['Course Name'])
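It is also possible to convert two columns at once with OrdinalEncoder:
from sklearn.preprocessing import OrdinalEncoder

df9 = df.copy()
oe = OrdinalEncoder()
oe.fit(df9[['Sex', 'Course Name']])
# OrdinalEncoder returns floats (0.0, 1.0, ...), one output column per input column
df9[['Sex_Label', 'Course Name_Label']] = oe.transform(df9[['Sex', 'Course Name']])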
Use factorize
Finally, let's introduce a small but handy pandas function. Note that in the methods above, the automatically generated Course Name_Label column does assign one code per language, but because the codes are generated automatically (which is what saves us from writing a custom function or dictionary), they mostly come out in no particular order.
What if we want them ordered, so that Python corresponds to 0, Java to 1, and so on? Besides specifying the mapping ourselves, is there an elegant way? This is where factorize comes in: it encodes values in order of first appearance.
df10 = df.copy()
df10['Course Name_Label'] = pd.factorize(df10['Course Name'])[0]
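pd.factorize returns two things, the codes and the unique values, and the second is useful for decoding. A minimal sketch with hypothetical course names:

```python
import pandas as pd

# Hypothetical values for illustration
courses = pd.Series(['Python', 'Java', 'Python', 'C'])

codes, uniques = pd.factorize(courses)
print(codes.tolist())   # codes follow order of first appearance
print(list(uniques))    # uniques[i] is the value encoded as i

# Indexing uniques with the codes recovers the original values
decoded = uniques[codes]
print(list(decoded))
```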
Combined with an anonymous function, factorize can encode several columns at once, each in order of appearance:
df10 = df.copy()
cat_columns = df10.select_dtypes(['object']).columns
df10[['Sex_Label', 'Course Name_Label']] = df10[cat_columns].apply(
    lambda x: pd.factorize(x)[0])
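If you would rather not hard-code the names of the new columns, one possible variation is to derive them with add_suffix and join the result back. The small df_demo frame and the explicit cat_columns list are assumptions made for this sketch.

```python
import pandas as pd

# Hypothetical data for illustration
df_demo = pd.DataFrame({
    'Sex': ['Male', 'Female', 'Male'],
    'Course Name': ['Python', 'Java', 'Python'],
    'Score': [95, 82, 76],
})
cat_columns = ['Sex', 'Course Name']  # listed explicitly for this sketch

# Encode each categorical column in order of first appearance
encoded = df_demo[cat_columns].apply(lambda col: pd.factorize(col)[0])
# Name each new column "<original>_Label" and attach it to the original frame
result = df_demo.join(encoded.add_suffix('_Label'))
print(result)
```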
Summary
That covers the ten data encoding methods I wanted to introduce. The code can be reused by simply changing the variable names. If you know other ways to solve this problem, feel free to leave a comment.