Processing the KDD Cup 1999 dataset with one-hot encoding

Fields 1, 2, and 3 of the dataset (counting from 0) are categorical features, so a classification algorithm that expects numeric input cannot use them directly.

They therefore have to be converted to numeric fields.
One-hot encoding (also called 1-of-k encoding) is used here.

For example, suppose a field has four possible values: pear, apple, banana, and peach.
After encoding, this one field becomes 4 numeric fields, one per category, and exactly one of them is 1 for any given record.

Pear   1 0 0 0

Apple  0 1 0 0

Banana 0 0 1 0

Peach  0 0 0 1

That is roughly the idea.
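The fruit example can be reproduced with a few lines of Python. This is only a minimal sketch; the helper name `one_hot` and the category order are chosen here for illustration and are not from the original processing script.

```python
def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 everywhere else."""
    return [1 if value == c else 0 for c in categories]


# Category order decides which position each value occupies.
categories = ["pear", "apple", "banana", "peach"]

for fruit in categories:
    print(fruit, one_hot(fruit, categories))
```

Each category gets its own position, so no two categories share a vector and no artificial ordering is implied between them.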

Below is a comparison of the original data and the processed data. (This is the first record; the label normal is encoded as 1, and the output format is the label followed by the feature fields.)
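The label-first output format might be produced along these lines. This is a sketch under assumptions: the function name `to_label_first` is hypothetical, the record is truncated to a few fields for readability, and mapping every non-normal label to 0 is an assumption, since the text only states that normal becomes 1.

```python
def to_label_first(record, normal_label="normal"):
    """Move the trailing label to the front and turn it into a number.

    Assumption: any label other than `normal_label` is mapped to 0.
    The raw KDD files end each label with a dot ("normal."), so it is stripped.
    """
    *features, label = record
    y = 1 if label.rstrip(".") == normal_label else 0
    return [y] + features


# Truncated, illustrative records: a few feature fields followed by the label.
normal_record = [0, "tcp", "http", "SF", 181, 5450, "normal."]
attack_record = [0, "icmp", "ecr_i", "SF", 1032, 0, "smurf."]

print(to_label_first(normal_record))
print(to_label_first(attack_record))
```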

The first categorical field in the data set (the red part) takes three values: tcp, icmp, and udp. After the conversion, three numbers represent this field.

The field in green has 70 distinct values across the test set and training set combined, but only 50 of them appear in the randomly sampled training data, so it is encoded with 50 numbers.

Similarly, the third categorical field should have 11 values, but only 8 appear in the sample, so it is encoded with 8 numbers.

The remaining numeric fields are left unprocessed for now.
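The whole procedure can be sketched as follows: collect the distinct values of fields 1, 2, and 3 from the sampled data, expand each of those fields into one number per observed value, and pass the numeric fields through unchanged. The function names, the sorted category order, and the truncated sample rows are assumptions made for illustration; they are not taken from the original script.

```python
CATEGORICAL = (1, 2, 3)  # field indexes, counting from 0


def build_vocab(rows, columns=CATEGORICAL):
    """Collect the distinct values seen in each categorical field."""
    return {col: sorted({row[col] for row in rows}) for col in columns}


def encode_row(row, vocab):
    """One-hot encode the categorical fields; copy every other field as-is."""
    out = []
    for i, value in enumerate(row):
        if i in vocab:
            out.extend(1 if value == c else 0 for c in vocab[i])
        else:
            out.append(value)
    return out


# Illustrative rows truncated to six fields; real records have many more.
sample = [
    [0, "tcp", "http", "SF", 181, 5450],
    [0, "udp", "domain_u", "SF", 105, 146],
    [0, "icmp", "ecr_i", "SF", 1032, 0],
    [0, "tcp", "smtp", "REJ", 0, 0],
]

vocab = build_vocab(sample)
for row in sample:
    print(encode_row(row, vocab))
```

Because the vocabulary is built from the sample only, the width of each encoded field equals the number of values the sample happens to contain, which is why the counts above are 50 and 8 rather than 70 and 11. In this sketch a value that never occurred in the sample is encoded as all zeros, so building the vocabulary from the full training and test sets would be the safer choice if unseen values matter.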

