What is one-hot encoding?
One-hot encoding is a data preprocessing step used to convert categorical values into compatible numerical representations.
categorical_column | bool_col | col_1 | col_2 | label |
---|---|---|---|---|
value_A | True | 9 | 4 | 0 |
value_B | False | 7 | 2 | 0 |
value_D | True | 9 | 5 | 0 |
value_D | False | 8 | 3 | 1 |
value_D | False | 9 | 0 | 1 |
value_D | False | 5 | 4 | 1 |
value_B | True | 8 | 1 | 1 |
value_D | True | 6 | 6 | 1 |
value_C | True | 0 | 5 | 0 |
value_D | True | 1 | 8 | 0 |
For example, in the dummy dataset above, the categorical column holds string values. Many machine learning algorithms require numeric input, so this attribute has to be converted into a form such algorithms can use. This article shows how to break a categorical column down into multiple binary-valued columns.
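Before turning to Pandas, the idea can be illustrated without any library. The following is a minimal sketch, not the method used in the rest of the article: the helper name `one_hot` and the choice to sort the categories alphabetically are assumptions made only for this illustration.

```python
# Minimal, library-free sketch of one-hot encoding (illustrative only).
def one_hot(values):
    """Return (categories, rows); each row holds a single 1 and 0s elsewhere."""
    categories = sorted(set(values))                 # fixed, reproducible column order
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1                        # mark the column of this category
        rows.append(row)
    return categories, rows


categories, rows = one_hot(["value_A", "value_B", "value_D", "value_C"])
print(categories)
for row in rows:
    print(row)
```

Each input value becomes one row in which exactly one position is set, which is the property the Pandas-based steps below rely on.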
One-hot encoding using Pandas library
First, read the .csv file or any other relevant file into a Pandas dataframe.
import pandas as pd
df = pd.read_csv("data.csv")
To check the unique values and understand the data better, you can use the following Pandas functions.
df['categorical_column'].nunique()
df['categorical_column'].unique()
For this dummy data, the functions return the following output (unique() lists values in order of first appearance):
>>> 4
>>> array(['value_A', 'value_B', 'value_D', 'value_C'], dtype=object)
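If no data.csv is at hand, the same inspection can be tried on a small in-memory dataframe. This is only a sketch: the values below are a hypothetical stand-in in the same form as the example table, and `value_counts()` is an extra call, not used in the article, that shows how many rows fall into each category.

```python
import pandas as pd

# Hypothetical stand-in for data.csv with values shaped like the example table.
df = pd.DataFrame({
    "categorical_column": [
        "value_A", "value_B", "value_D", "value_D", "value_B", "value_C",
    ]
})

n_unique = df["categorical_column"].nunique()     # number of distinct values
uniques = df["categorical_column"].unique()       # distinct values, in order of first appearance
counts = df["categorical_column"].value_counts()  # rows per category

print(n_unique)
print(list(uniques))
print(counts)
```

Checking the counts up front is useful because every distinct value becomes its own column once the data is one-hot encoded.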
A categorical column can be broken into multiple columns with the pandas.get_dummies() method, which this article uses. Its main parameters are:
parameter | description |
---|---|
data: array-like, Series or DataFrame | The data to encode, here the raw Pandas dataframe |
columns: list-like, defaults to None | List of categorical columns to be one-hot encoded |
drop_first: Boolean, defaults to False | Whether to drop the first category level, leaving k-1 columns for k levels |
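The effect of `columns` and `drop_first` is easiest to see on the resulting column names. The sketch below uses a small hypothetical dataframe (four rows, one per category) rather than the article's data.

```python
import pandas as pd

# Hypothetical four-row dataframe: one row per category plus a numeric column.
df = pd.DataFrame({
    "categorical_column": ["value_A", "value_B", "value_C", "value_D"],
    "col_1": [9, 7, 0, 8],
})

# columns=[...] restricts encoding to the listed columns; col_1 is left untouched.
full = pd.get_dummies(df, columns=["categorical_column"])

# drop_first=True drops the first level (value_A here), leaving k-1 columns.
reduced = pd.get_dummies(df, columns=["categorical_column"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With `drop_first=True`, a row of all zeros stands for the dropped level, so no information is lost.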
To better understand this function, let's perform a one-hot encoding on this dummy dataset.
- One-hot encoding of categorical columns
We use the get_dummies method with the original dataframe as the data input; in columns, we pass a list containing only the categorical_column header.
df_encoded = pd.get_dummies(df, columns=['categorical_column', ])
The command above removes categorical_column and creates a new column for each unique value. The single categorical column is thus converted into 4 new columns; in every row exactly one of them has the value 1 and the other 3 have the value 0, which is why it is called one-hot encoding.
categorical_column_value_A | categorical_column_value_B | categorical_column_value_C | categorical_column_value_D |
---|---|---|---|
1 | 0 | 0 | 0 |
0 | 1 | 0 | 0 |
0 | 0 | 0 | 1 |
0 | 0 | 0 | 1 |
0 | 0 | 0 | 1 |
0 | 0 | 0 | 1 |
0 | 1 | 0 | 0 |
0 | 0 | 0 | 1 |
0 | 0 | 1 | 0 |
0 | 0 | 0 | 1 |
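One practical detail: recent pandas releases (2.0 and later) return Boolean True/False indicator columns by default instead of 1/0. The sketch below, on a small hypothetical dataframe, passes `dtype=int` to get integers like those in the table above and checks that each row contains exactly one 1.

```python
import pandas as pd

# Hypothetical four-row dataframe, one row per category.
df = pd.DataFrame({
    "categorical_column": ["value_A", "value_B", "value_D", "value_C"]
})

# dtype=int requests 0/1 integers instead of Boolean indicator columns.
df_encoded = pd.get_dummies(df, columns=["categorical_column"], dtype=int)

# Exactly one indicator column is "hot" in every row, so each row sums to 1.
row_sums = df_encoded.sum(axis=1)

print(df_encoded)
print(row_sums.tolist())
```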
A problem arises when one-hot encoding a Boolean column: it creates two new columns, although a single one would carry the same information.
- One-hot encoding of binary columns
df_encoded = pd.get_dummies(df, columns=['bool_col'])
bool_col_False | bool_col_True |
---|---|
0 | 1 |
1 | 0 |
0 | 1 |
1 | 0 |
Instead of adding an extra column, you can have just one column where True is encoded as 1 and False is encoded as 0. To achieve this, the drop_first parameter can be used.
df_encoded = pd.get_dummies(df, columns=['bool_col'], drop_first=True)
bool_col_True |
---|
1 |
0 |
1 |
0 |
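As a rough cross-check, the single column produced by `drop_first=True` should match a plain integer cast of the Boolean column. The cast is an alternative route, not the approach this article takes, and the four-row dataframe is a hypothetical stand-in.

```python
import pandas as pd

# Hypothetical stand-in for the Boolean column of the dummy dataset.
df = pd.DataFrame({"bool_col": [True, False, True, False]})

# Route 1: get_dummies with drop_first=True keeps only the "True" indicator.
via_dummies = pd.get_dummies(df, columns=["bool_col"], drop_first=True, dtype=int)

# Route 2 (alternative, not from the article): cast the Boolean column directly.
via_cast = df["bool_col"].astype(int)

print(via_dummies["bool_col_True"].tolist())
print(via_cast.tolist())
```

The cast keeps the original column name, whereas get_dummies renames it to bool_col_True; which is preferable depends on how the downstream code refers to the column.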
Conclusion
With one-hot encoding applied to the dummy dataset, the final result is as follows (column names shortened for readability):
col_1 | col_2 | bool | A | B | C | D | label |
---|---|---|---|---|---|---|---|
9 | 4 | 1 | 1 | 0 | 0 | 0 | 0 |
7 | 2 | 0 | 0 | 1 | 0 | 0 | 0 |
9 | 5 | 1 | 0 | 0 | 0 | 1 | 0 |
8 | 3 | 0 | 0 | 0 | 0 | 1 | 1 |
9 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
5 | 4 | 0 | 0 | 0 | 0 | 1 | 1 |
8 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
6 | 6 | 1 | 0 | 0 | 0 | 1 | 1 |
0 | 5 | 1 | 0 | 0 | 1 | 0 | 0 |
1 | 8 | 1 | 0 | 0 | 0 | 1 | 0 |
Categorical and Boolean values are converted into numerical values that can be used as input to machine learning algorithms.
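The whole procedure can be sketched end to end as follows. The rows are reconstructed from the final table above instead of being read from data.csv, and the renaming step exists only to reproduce that table's shortened headers; treat it as an illustration under those assumptions.

```python
import pandas as pd

# Rows reconstructed from the final table (stand-in for reading data.csv).
df = pd.DataFrame({
    "categorical_column": [
        "value_A", "value_B", "value_D", "value_D", "value_D",
        "value_D", "value_B", "value_D", "value_C", "value_D",
    ],
    "bool_col": [True, False, True, False, False, False, True, True, True, True],
    "col_1": [9, 7, 9, 8, 9, 5, 8, 6, 0, 1],
    "col_2": [4, 2, 5, 3, 0, 4, 1, 6, 5, 8],
    "label": [0, 0, 0, 1, 1, 1, 1, 1, 0, 0],
})

# Boolean column: a single indicator is enough, so drop the first level.
encoded = pd.get_dummies(df, columns=["bool_col"], drop_first=True, dtype=int)

# Categorical column: keep all four indicator columns.
encoded = pd.get_dummies(encoded, columns=["categorical_column"], dtype=int)

# Rename and reorder to match the shortened headers of the final table.
encoded = encoded.rename(columns={
    "bool_col_True": "bool",
    "categorical_column_value_A": "A",
    "categorical_column_value_B": "B",
    "categorical_column_value_C": "C",
    "categorical_column_value_D": "D",
})
encoded = encoded[["col_1", "col_2", "bool", "A", "B", "C", "D", "label"]]

print(encoded)
```

The two columns are encoded in separate calls because a single call with `drop_first=True` would also drop the first level of the categorical column.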