One article to understand one-hot encoding in Pandas

What is one-hot encoding

One-hot encoding is a data preprocessing step used to convert categorical values into compatible numerical representations.

categorical_column bool_col col_1 col_2 label
value_A True 9 4 0
value_B False 7 2 0
value_D True 9 5 0
value_D False 8 3 1
value_D False 9 0 1
value_D False 5 4 1
value_B True 8 1 1
value_D True 6 6 1
value_C True 0 5 0
value_D True 1 8 0

For example, in this dummy dataset, the categorical column holds several distinct string values. Many machine learning algorithms require numeric input, so this attribute has to be converted into a form those algorithms can consume. One-hot encoding does this by breaking a categorical column down into multiple binary columns, one per category.
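
As a rough illustration of the idea, independent of any library, the mapping from one category to a row of binary values can be sketched as follows. The helper name one_hot and the fixed category order are assumptions made for this sketch.

```python
# Hypothetical helper, for illustration only: encode one value against a
# fixed list of categories. Exactly one position is 1 (the "hot" one).
def one_hot(value, categories):
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


categories = ["value_A", "value_B", "value_C", "value_D"]

for value in ["value_A", "value_D", "value_C"]:
    print(value, one_hot(value, categories))
```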

One-hot encoding using the Pandas library

First, read the .csv file or any other relevant file into a Pandas dataframe.

import pandas as pd

df = pd.read_csv("data.csv")
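
If no data.csv is at hand, the dummy dataset can be rebuilt from an in-memory string. This is a sketch under assumptions: the CSV text is retyped from the tables in this article (the tenth row is taken from the final result table), and io.StringIO stands in for a real file.

```python
import io

import pandas as pd

# Dummy dataset retyped from the article's tables; stands in for data.csv.
csv_text = """categorical_column,bool_col,col_1,col_2,label
value_A,True,9,4,0
value_B,False,7,2,0
value_D,True,9,5,0
value_D,False,8,3,1
value_D,False,9,0,1
value_D,False,5,4,1
value_B,True,8,1,1
value_D,True,6,6,1
value_C,True,0,5,0
value_D,True,1,8,0
"""

# read_csv accepts any file-like object, so StringIO behaves like a file here.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)
print(df.dtypes)
```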

To check the unique values and understand the data better, you can use the following Pandas functions.

df['categorical_column'].nunique()
df['categorical_column'].unique()

For this dummy data, these calls return the following output:

>>> 4
>>> array(['value_A', 'value_B', 'value_D', 'value_C'], dtype=object)
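
To see how often each category occurs, and not only which categories exist, value_counts() can be used alongside the two calls above. A small sketch, assuming the categorical values from the tables in this article (the tenth value is taken from the final result table):

```python
import pandas as pd

# Categorical values retyped from the article's tables.
categories = pd.Series(
    ["value_A", "value_B", "value_D", "value_D", "value_D",
     "value_D", "value_B", "value_D", "value_C", "value_D"],
    name="categorical_column",
)

print(categories.nunique())       # number of distinct values
print(list(categories.unique()))  # distinct values, in order of first appearance
print(categories.value_counts())  # rows per category, most frequent first
```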

A categorical column can be broken into multiple columns with the pandas.get_dummies() method. Among others, it accepts the following parameters:

Parameters
data: array-like, Series, or DataFrame. The data to encode; here, the original Pandas dataframe.
columns: list-like, default None. The categorical columns to be one-hot encoded.
drop_first: Boolean, default False. Whether to remove the first level of each encoded column.
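
The effect of columns can be seen on a small frame. The sketch below assumes a three-row excerpt of the dummy data; dtype=int is an optional get_dummies argument added here so that the output holds integers on any pandas version. When columns is left at None, pandas encodes only the columns with object, string, or category dtype, so the Boolean column is left untouched.

```python
import pandas as pd

# Three-row excerpt of the dummy data, for illustration only.
small = pd.DataFrame({
    "categorical_column": ["value_A", "value_B", "value_D"],
    "bool_col": [True, False, True],
    "col_1": [9, 7, 9],
})

# columns=None (the default): only the string column is encoded.
auto = pd.get_dummies(small, dtype=int)

# columns given explicitly: the Boolean column is encoded as well.
explicit = pd.get_dummies(
    small, columns=["categorical_column", "bool_col"], dtype=int
)

print(list(auto.columns))
print(list(explicit.columns))
```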

To better understand this function, let's perform a one-hot encoding on this dummy dataset.

  • One-hot encoding of categorical columns

We call the get_dummies method with the original dataframe as the data input, and for columns we pass a list containing only the categorical_column header.

df_encoded = pd.get_dummies(df, columns=['categorical_column'])

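This command removes categorical_column and creates a new column for each unique value. The single categorical column is thus converted into 4 new columns; in every row exactly one of them holds 1 and the other 3 hold 0, which is why it is called one-hot encoding. Note that pandas 2.0 and later return these columns as Boolean (True/False) by default; passing dtype=int to get_dummies gives the 0/1 output shown here.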

categorical_column_value_A categorical_column_value_B categorical_column_value_C categorical_column_value_D
1 0 0 0
0 1 0 0
0 0 0 1
0 0 0 1
0 0 0 1
0 0 0 1
0 1 0 0
0 0 0 1
0 0 1 0
0 0 0 1
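
The "exactly one 1 per row" property can be checked directly. A minimal sketch, assuming the categorical values from the table above and passing dtype=int (an optional argument not used in the article) so the new columns hold integers on any pandas version:

```python
import pandas as pd

df = pd.DataFrame({
    "categorical_column": [
        "value_A", "value_B", "value_D", "value_D", "value_D",
        "value_D", "value_B", "value_D", "value_C", "value_D",
    ],
})

df_encoded = pd.get_dummies(df, columns=["categorical_column"], dtype=int)

# The original column is gone; one new column exists per unique value.
print(list(df_encoded.columns))

# Every row sums to 1 because each row belongs to exactly one category.
row_sums = df_encoded.sum(axis=1)
print(row_sums.tolist())
```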

A problem arises when one-hot encoding a Boolean column: it creates two new columns, each of which is simply the inverse of the other.

  • One-hot encoding of binary columns

df_encoded = pd.get_dummies(df, columns=['bool_col'])

bool_col_False bool_col_True
0 1
1 0
0 1
1 0

Rather than adding a redundant column, you can keep a single column in which True is encoded as 1 and False is encoded as 0. The drop_first parameter does this by dropping the first level (here, bool_col_False).

df_encoded = pd.get_dummies(df, columns=['bool_col'], drop_first=True)

bool_col_True
1
0
1
0
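
For a Boolean column there is also a route that skips get_dummies entirely: casting with astype(int). This alternative is named here for comparison only; it is not the method the article uses. The sketch assumes the first four bool_col values from the dummy data and checks that both routes agree.

```python
import pandas as pd

df = pd.DataFrame({"bool_col": [True, False, True, False]})

# Route 1 (the article's approach): drop the first level, bool_col_False.
dropped = pd.get_dummies(df, columns=["bool_col"], drop_first=True, dtype=int)

# Route 2 (alternative): cast the Boolean column to integers directly.
cast = df["bool_col"].astype(int)

print(dropped["bool_col_True"].tolist())
print(cast.tolist())
```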

Conclusion

After one-hot encoding is applied to the dummy dataset, the final result is as follows:

col_1 col_2 bool A B C D label
9 4 1 1 0 0 0 0
7 2 0 0 1 0 0 0
9 5 1 0 0 0 1 0
8 3 0 0 0 0 1 1
9 0 0 0 0 0 1 1
5 4 0 0 0 0 1 1
8 1 1 0 1 0 0 1
6 6 1 0 0 0 1 1
0 5 1 0 0 1 0 0
1 8 1 0 0 0 1 0

Categorical and Boolean values are converted into numerical values that can be used as input to machine learning algorithms.
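
Putting the steps together, the final table can be reproduced roughly as follows. This is a sketch under assumptions: the data is retyped from the tables in this article, dtype=int is added so the output holds 0 and 1 on any pandas version, and the short headers (bool, A, B, C, D) are obtained by renaming, since get_dummies itself produces names such as bool_col_True and categorical_column_value_A.

```python
import pandas as pd

# Dummy dataset retyped from the article's tables.
df = pd.DataFrame({
    "categorical_column": [
        "value_A", "value_B", "value_D", "value_D", "value_D",
        "value_D", "value_B", "value_D", "value_C", "value_D",
    ],
    "bool_col": [True, False, True, False, False, False, True, True, True, True],
    "col_1": [9, 7, 9, 8, 9, 5, 8, 6, 0, 1],
    "col_2": [4, 2, 5, 3, 0, 4, 1, 6, 5, 8],
    "label": [0, 0, 0, 1, 1, 1, 1, 1, 0, 0],
})

# Encode the categorical column fully and the Boolean column with drop_first.
encoded = pd.get_dummies(df, columns=["categorical_column"], dtype=int)
encoded = pd.get_dummies(encoded, columns=["bool_col"], drop_first=True, dtype=int)

# Shorten the generated names and order the columns as in the final table.
encoded = encoded.rename(columns={
    "bool_col_True": "bool",
    "categorical_column_value_A": "A",
    "categorical_column_value_B": "B",
    "categorical_column_value_C": "C",
    "categorical_column_value_D": "D",
})
final = encoded[["col_1", "col_2", "bool", "A", "B", "C", "D", "label"]]

print(final)
```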

Origin blog.csdn.net/csdn1561168266/article/details/132328362