The 10 Most Commonly Used Data Encoding Methods in Machine Learning

Hello everyone. In machine learning, many algorithms require us to transform (encode) categorical features before they can be used.

To make the explanation easier, the examples below work on an example DataFrame, df, with the columns Sex, Course Name, and Score.
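The original DataFrame was shown only as an image, so the following is a hypothetical stand-in rather than the author's data. The column names and their order (Score at position 2) are inferred from the code later in the article; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical example data: only the column names and their order
# are inferred from the article's code, the values are invented.
df = pd.DataFrame({
    "Sex": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "Course Name": ["Python", "Java", "Python", "C", "Java", "Python"],
    "Score": [95, 90, 82, 75, 60, 45],
})

print(df)
```

With this layout, Sex and Course Name are the only object (text) columns, which matters for the select_dtypes example near the end.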

Numeric data

Let us first look at transforming continuous data: based on the value of the Score column, we add a column of labels. A score greater than 90 is marked A, a score from 80 to 90 is marked B, and so on.

Custom function + loop traversal

First, of course, comes the simplest and most naive method: write a function yourself and apply it in a loop. That means one def plus one for.

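df1 = df.copy()

def myfun(x):
    # Branches are checked top-down, so each one only needs a lower bound.
    if x > 90:
        return 'A'
    elif x >= 80:  # 80 to 90 inclusive
        return 'B'
    elif x >= 70:
        return 'C'
    elif x >= 60:
        return 'D'
    else:
        return 'E'

df1['Score_Label'] = None
# Positional indexing: Score is column 2, the new Score_Label is column 3.
for i in range(len(df1)):
    df1.iloc[i, 3] = myfun(df1.iloc[i, 2])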

I believe everyone can follow this code. It is simple and easy to come up with, but tedious to write.

Is there an easier way? Of course: pandas provides many efficient functions for this. Read on.

Custom function + map

Now you can use map to get rid of the explicit loop (although it is still a loop under the hood).

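df2 = df.copy()

def mapfun(x):
    # Branches are checked top-down, so each one only needs a lower bound.
    if x > 90:
        return 'A'
    elif x >= 80:  # 80 to 90 inclusive
        return 'B'
    elif x >= 70:
        return 'C'
    elif x >= 60:
        return 'D'
    else:
        return 'E'

# map calls mapfun once per value in the Score column.
df2['Score_Label'] = df2['Score'].map(mapfun)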

The result is the same.

Custom function + apply

If you want to shorten the code further, you can use apply with a lambda and do away with the named function.

df3 = df.copy()
# Nested conditional expressions, checked left to right; 80 to 90 inclusive is 'B'.
df3['Score_Label'] = df3['Score'].apply(lambda x: 'A' if x > 90 else ('B' if x >= 80 else ('C' if x >= 70 else ('D' if x >= 60 else 'E'))))

The result is the same as above, but nested one-liners like this are hard to read, and writing code this way is a good way to annoy your colleagues.
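The claim that the loop, map, and apply variants agree can be checked with a small sketch. The helper name grade and the scores are hypothetical, and the sketch assumes a score of exactly 90 belongs to grade B, as the description of the scheme states.

```python
import pandas as pd

def grade(x):
    """Map a numeric score to a letter grade (hypothetical helper)."""
    if x > 90:
        return "A"
    elif x >= 80:
        return "B"
    elif x >= 70:
        return "C"
    elif x >= 60:
        return "D"
    return "E"

# Made-up scores that include the boundary values 90 and 60.
scores = pd.Series([95, 90, 82, 75, 60, 45], name="Score")

by_loop = [grade(s) for s in scores]                 # explicit Python loop
by_map = scores.map(grade).tolist()                  # Series.map
by_apply = scores.apply(lambda x: grade(x)).tolist() # Series.apply with a lambda

print(by_map)
```

All three lists hold the same labels; the only difference is how much code is needed to produce them.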

Use pd.cut

Now let us move on to more advanced pandas functions. We are still encoding Score: pass pd.cut the bin edges and it does the grouping for you.

df4 = df.copy()
# Bin edges; by default each interval is right-closed: (0, 59], (59, 70], (70, 80], (80, 100].
bins = [0, 59, 70, 80, 100]
df4['Score_Label'] = pd.cut(df4['Score'], bins)

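One detail worth knowing is that pd.cut builds right-closed intervals by default, so with these bins a score of exactly 0 falls outside the first interval and becomes NaN unless include_lowest=True is passed. A minimal sketch with made-up scores:

```python
import pandas as pd

# Made-up scores chosen to sit on or next to the bin edges.
scores = pd.Series([0, 45, 59, 60, 70, 71, 100])
bins = [0, 59, 70, 80, 100]

# Default: intervals are (0, 59], (59, 70], (70, 80], (80, 100].
binned = pd.cut(scores, bins)

# include_lowest=True widens the first interval so that 0 is included.
binned_incl = pd.cut(scores, bins, include_lowest=True)

print(binned)
print(binned_incl)
```

Whether 0 should count as a valid score depends on the data, so this is a choice to make deliberately rather than a fix to apply blindly.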

You can also use the labels parameter to name each group directly. More convenient, isn't it?

df4['Score_Label_new'] = pd.cut(df4['Score'], bins, labels=['low', 'middle', 'good', 'perfect'])

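The labelled result is an ordered categorical, which means the labels keep their rank and can be turned into integer codes or compared. The sketch below uses made-up scores with the same bins and label names as above.

```python
import pandas as pd

scores = pd.Series([45, 65, 75, 95])  # made-up scores, one per bin
bins = [0, 59, 70, 80, 100]
labels = ["low", "middle", "good", "perfect"]  # one label per interval

labelled = pd.cut(scores, bins, labels=labels)

codes = labelled.cat.codes               # integer position of each label
at_least_good = labelled >= "good"       # comparison works because the categories are ordered

print(labelled.tolist())
print(codes.tolist())
print(at_least_good.tolist())
```

Note that the number of labels must be one less than the number of bin edges.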

Binarization using sklearn

Since this is about machine learning, sklearn naturally has something to offer. If you need a new column that indicates whether a score is a pass, you can use the Binarizer class; the code is concise and easy to understand.

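import numpy as np
from sklearn.preprocessing import Binarizer

df5 = df.copy()
# Binarizer maps values strictly greater than the threshold to 1 and the rest to 0.
binarizer = Binarizer(threshold=60)
trans = binarizer.fit_transform(np.array(df5['Score']).reshape(-1, 1))
df5['Score_Label'] = trans.ravel()  # flatten the (n, 1) result to fit one column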

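Binarizer marks a value as 1 only when it is strictly greater than the threshold, so with threshold=60 a score of exactly 60 is labelled 0. The plain pandas comparison below mirrors that rule and shows an inclusive variant; the scores are made up, and whether 60 counts as a pass is an assumption you need to settle for your own data.

```python
import pandas as pd

scores = pd.Series([45, 60, 61, 95], name="Score")  # made-up scores around the threshold

# Same rule as Binarizer(threshold=60): strictly greater than 60 -> 1.
passed_strict = (scores > 60).astype(int)

# Inclusive variant: a score of exactly 60 also counts as a pass.
passed_inclusive = (scores >= 60).astype(int)

print(passed_strict.tolist())
print(passed_inclusive.tolist())
```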

Text data

Next comes the more common task of transforming and labeling text data. For example, add a new column that marks the genders Male and Female as 0 and 1 respectively.

Use replace

The first option is replace. Note that the custom-function approaches described above still work here as well.

df6 = df.copy()
df6['Sex_Label'] = df6['Sex'].replace(['Male','Female'],[0,1])


That handles gender: since there are only two values, Male and Female, you can assign 0 and 1 by hand. If there are many categories, you can use value_counts() to generate the labels automatically, for example for the Course Name column.

df6 = df.copy()
# value_counts() sorts by frequency, so the most common course gets label 0.
value = df6['Course Name'].value_counts()
value_map = dict((v, i) for i, v in enumerate(value.index))
df6['Course Name_Label'] = df6.replace({'Course Name': value_map})['Course Name']

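To see which label each course ends up with, here is a small sketch on made-up data. It builds the mapping from value_counts() the same way, but does the final lookup with Series.map, which gives the same result here because every value has a key in the dictionary.

```python
import pandas as pd

# Made-up course names with distinct frequencies (3, 2, 1) to avoid ties.
courses = pd.Series(
    ["Python", "Java", "Python", "C", "Java", "Python"], name="Course Name"
)

counts = courses.value_counts()  # sorted from most to least frequent
value_map = {name: code for code, name in enumerate(counts.index)}
labels = courses.map(value_map)

print(value_map)
print(labels.tolist())
```

Because the codes follow frequency, they can change when the data changes, so the mapping should be stored if the labels need to stay stable.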

Use map

It bears repeating: whenever you need to add a derived column, map should come to mind.

df7 = df.copy()
# A set has no guaranteed order, so which course gets which label is arbitrary.
Map = {elem: index for index, elem in enumerate(set(df7["Course Name"]))}
df7['Course Name_Label'] = df7['Course Name'].map(Map)

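Building the dictionary from a set means the labels can differ from run to run. One possible variation, sketched below on made-up data, is to sort the unique values first so the mapping is reproducible; the name label_map is hypothetical. The sketch also shows that map returns NaN for values missing from the dictionary.

```python
import pandas as pd

courses = pd.Series(["Python", "Java", "Python", "C"])  # made-up data

# Sorting the unique values makes the label assignment deterministic.
label_map = {name: code for code, name in enumerate(sorted(courses.unique()))}
encoded = courses.map(label_map)

# A value that is not a key in the dictionary ("Go") is mapped to NaN.
unseen = pd.Series(["Python", "Go"]).map(label_map)

print(label_map)
print(encoded.tolist())
print(unseen.tolist())
```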

Use astype

Few people seem to know this method. It relates to the Zhihu question mentioned above: there are simply too many ways to achieve the same thing.

df8 = df.copy()
value = df8['Course Name'].astype('category')
df8['Course Name_Label'] = value.cat.codes

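The codes produced this way follow the sorted order of the categories, and missing values receive the code -1. A minimal sketch on made-up data:

```python
import pandas as pd

courses = pd.Series(["Python", "Java", None, "C", "Python"])  # made-up data with a missing value

as_cat = courses.astype("category")
categories = list(as_cat.cat.categories)  # sorted unique values
codes = as_cat.cat.codes.tolist()         # position of each value in categories, -1 for missing

# The categories double as a lookup table for decoding the codes again.
decode = dict(enumerate(categories))

print(categories)
print(codes)
print(decode)
```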

Use sklearn

As with numeric data, sklearn has a tool for this classic machine learning step: LabelEncoder encodes categorical data.

from sklearn.preprocessing import LabelEncoder
df9 = df.copy()
le = LabelEncoder()
le.fit(df9['Sex'])
df9['Sex_Label'] = le.transform(df9['Sex'])
le.fit(df9['Course Name'])
df9['Course Name_Label'] = le.transform(df9['Course Name'])


It is also possible to convert both columns at once with OrdinalEncoder.

from sklearn.preprocessing import OrdinalEncoder

df9 = df.copy()
le = OrdinalEncoder()
le.fit(df9[['Sex','Course Name']])
df9[['Sex_Label','Course Name_Label']] = le.transform(df9[['Sex','Course Name']])
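Two properties of OrdinalEncoder are easy to miss: it returns floating-point codes, and it numbers each column's categories in sorted order, which it exposes through the categories_ attribute. The sketch below uses a small made-up frame with the same column names.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up data with the same column names as the article's DataFrame.
frame = pd.DataFrame({
    "Sex": ["Male", "Female", "Male"],
    "Course Name": ["Python", "Java", "C"],
})

encoder = OrdinalEncoder()
codes = encoder.fit_transform(frame[["Sex", "Course Name"]])

# One sorted array of categories per input column.
learned = [list(c) for c in encoder.categories_]

print(codes)                # float array, one column of codes per input column
print(learned)
print(codes.astype(int))    # cast if integer labels are preferred
```

Because the categories are sorted, Female gets 0 and Male gets 1 here, which is the opposite of the manual replace mapping shown earlier.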

Use factorize

Finally, let us introduce a small but handy pandas function. Note that in the methods above, the automatically generated Course Name_Label column does assign one code per course, but because the codes are generated automatically rather than through a hand-written function or dictionary, their order is mostly arbitrary.

What if we want the codes to be ordered, with Python mapped to 0 and Java to 1? Besides specifying the mapping ourselves, is there an elegant way? This is where factorize comes in: it assigns codes in order of first appearance.

df10 = df.copy()
df10['Course Name_Label'] = pd.factorize(df10['Course Name'])[0]

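pd.factorize returns a pair: the integer codes and the unique values in order of first appearance. Missing values get the code -1 and are left out of the unique values. A minimal sketch on made-up data:

```python
import pandas as pd

courses = pd.Series(["Python", "Java", "Python", None, "C"])  # made-up data with a missing value

codes, uniques = pd.factorize(courses)

print(codes.tolist())   # one integer per row, -1 for the missing value
print(list(uniques))    # unique values in order of first appearance
```

Keeping uniques around makes it possible to decode the labels later, since uniques[code] gives back the original value.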

Combined with an anonymous function, factorize can encode several columns at once, each in order of first appearance.

df10 = df.copy()
cat_columns = df10.select_dtypes(['object']).columns

df10[['Sex_Label', 'Course Name_Label']] = df10[cat_columns].apply(
    lambda x: pd.factorize(x)[0])


Summary

That concludes the ten pandas data encoding methods I wanted to introduce; the code can be reused by changing the variable names. If you know other approaches to this problem, feel free to leave a comment.


Origin blog.csdn.net/qq_34160248/article/details/124292890