Representation
Mapping raw data to features
map values
map string value
Models cannot learn patterns from string values, so you'll need to do some feature engineering to convert these values to numeric form:
First, define a vocabulary for the string values of all the features you want to represent . For features, the vocabulary will contain all the streets you know.
Note : All other streets can be grouped into a general "Other" category known as the OOV (Out of Glossary) bucket.street_name
Then, use this vocabulary to create a one -hot encoding that represents the specified string value as a binary vector. In this vector (corresponding to the specified string value):
- Only one element is set to .
1
- All other elements are set to .
0
The length of this vector is equal to the number of elements in the vocabulary.
- Only one element is set to .
Figure 3 shows the one-hot encoding of a particular street (Shorebird Way). In this binary vector, the value of the element representing the Shorebird Way is and the value of the element representing all other streets . 1
0
Mapping categorical (enumeration) values
Categorical features have a discrete set of possible values. For example, a feature named contains only 3 possible values: Lowland Countries
{'Netherlands', 'Belgium', 'Luxembourg'}
You might encode categorical features (like ) as enumerated types or discrete sets of integers representing distinct values. E.g: Lowland Countries
- Represent the Netherlands as 0
- Represent Belgium as 1
- Represent Luxembourg as 2
However, machine learning models often represent each categorical feature as a separate Boolean value. For example, Lowland Countries
in the model can be represented as 3 separate boolean features:
- x 1 : Is it the Netherlands?
- x 2 : Is it Belgium?
- x 3 : Is it Luxembourg?
Coding in this way also simplifies situations where a value may belong to more than one classification (eg, "border with France" is True for both Belgium and Luxembourg).