Solving the dimension mismatch problem when one-hot encoding is used in production

You may have run into the following problem when using one-hot encoding. Suppose a discrete column in the training sample takes the values "green", "red", and "yellow". After one-hot encoding, that column becomes three indicator columns, one per value, so each sample is represented by a three-element vector such as [1, 0, 0] for "green".

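When new data is read in real time in the production environment, values that never appeared in the training sample may show up. For example, the column might contain "green" and "blue". One-hot encoding this data on its own yields only two columns, one of which ("blue") the model has never seen.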

The encoded production data then no longer has the same dimensions as the training data. Because the input dimensions of a trained model are fixed, the model may fail to compute on it. So how can this problem be solved?
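The mismatch described above can be reproduced with a minimal sketch. The variable names here are illustrative and not from the original post, and `dtype=int` is passed only so that the result uses 0/1 regardless of the pandas version.

```python
import pandas as pd

# Illustrative training and production columns, using the values from the text.
train_colors = pd.Series(["green", "red", "yellow"])
prod_colors = pd.Series(["green", "blue"])

# Encoding each set independently derives the columns from the data it sees.
train_encoded = pd.get_dummies(train_colors, dtype=int)
prod_encoded = pd.get_dummies(prod_colors, dtype=int)

print(list(train_encoded.columns))  # ['green', 'red', 'yellow']
print(list(prod_encoded.columns))   # ['blue', 'green']
print(train_encoded.shape, prod_encoded.shape)  # (3, 3) (2, 2)
```

The two frames differ both in the number of columns and in what each column means, which is why a model trained on the first cannot consume the second directly.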

You can use pandas's Categorical type to solve this problem: fix the list of categories to the values seen in training before encoding. The code is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
__author__ = 'Seven'
import pandas as pd

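# Categories seen in the training sample; this list fixes the output columns.
train_words = ['green', 'red', 'yellow']
# New data read in production; 'blue' was never seen during training.
product_words = pd.Series(['green', 'blue'])

# Values outside train_words become NaN instead of creating a new category.
product_words_op = pd.Categorical(product_words, categories=train_words)

# dtype=int prints 0/1; recent pandas versions default to True/False.
print(pd.get_dummies(product_words_op, dtype=int))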

Running it prints a table with the three training columns (green, red, yellow) and one row per production value.

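Because "green" is in the list of known categories, its row is encoded as usual, with a 1 in the green column. "blue" is not in that list, so every entry in its row is 0. In general, any value in the production data that was not seen in training produces an all-zero row. To a certain extent, this solves the dimension mismatch that would otherwise prevent the model from computing in the production environment, although the model receives no information about what the unseen value was.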
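Since unseen values are silently turned into all-zero rows, it can be useful to report them as well. The following is a minimal sketch under that assumption; the helper name `encode_with_known_categories` is hypothetical and not part of pandas or of the original post. It relies on the fact that a Categorical assigns the code -1 to values outside its category list.

```python
import pandas as pd


def encode_with_known_categories(values, known_categories):
    """One-hot encode values against a fixed category list.

    Returns the encoded frame and a sorted list of the values that
    were not in known_categories (hypothetical helper, for illustration).
    """
    values = pd.Series(values)
    cat = pd.Categorical(values, categories=known_categories)
    # Values outside known_categories get the code -1; exclude genuine NaNs.
    unseen_mask = (cat.codes == -1) & values.notna().to_numpy()
    unseen = sorted(values[unseen_mask].unique())
    encoded = pd.get_dummies(cat, dtype=int)
    return encoded, unseen


encoded, unseen = encode_with_known_categories(
    ["green", "blue"], ["green", "red", "yellow"]
)
print(encoded)
print(unseen)  # ['blue']
```

Logging the returned list of unseen values makes it easier to notice when production data drifts away from the training categories and the model may need retraining.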

Origin blog.csdn.net/gf19960103/article/details/102736828