Some methods for handling categorical variables

1. Brief introduction

Categorical variables are similar to enums: they take on only a limited number of possible values.

For example: classifying elements by color as red or blue, or classifying elements by size as large, medium, or small.

Such values are usually given in the raw data as strings like "red", so before modeling we have to process them somehow: in the end they are either mapped directly to numeric values or deleted.
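For instance, a minimal illustration with made-up data (the column names here are hypothetical, not from the dataset used below):

import pandas as pd

# 'Color' and 'Size' are categorical variables: pandas stores the strings with dtype 'object'
df = pd.DataFrame({'Color': ['red', 'blue', 'red'],
                   'Size': ['medium', 'small', 'medium']})
print(df.dtypes)  # both columns have dtype 'object'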

 

2. Three methods (with code walkthroughs)

First, let's do the preprocessing:

 

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X = pd.read_csv('../input/train.csv', index_col='Id') 
X_test = pd.read_csv('../input/test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X.dropna(axis=0, subset=['SalePrice'], inplace=True)  # our target is SalePrice, so rows missing it can't be used

y = X.SalePrice  # select the dependent variable
X.drop(['SalePrice'], axis=1, inplace=True)

# To keep things simple, we'll drop columns with missing values
cols_with_missing = [col for col in X.columns if X[col].isnull().any()]
X.drop(cols_with_missing, axis=1, inplace=True)
X_test.drop(cols_with_missing, axis=1, inplace=True)
# Now we have dataframes without missing values

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                      train_size=0.8, test_size=0.2,
                                                      random_state=0)

 

 

1) Drop categorical variables

drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
# dtype 'object' is how pandas stores string/categorical columns, so excluding it drops them

You can then evaluate this approach by checking its mean_absolute_error.
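As a quick sketch of how you might do that (the score_dataset helper and the random-forest settings below are illustrative choices, not part of the original post):

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Hypothetical helper: train a random forest and report validation MAE.
# The model and its parameters are illustrative choices.
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

print("MAE (drop categorical columns):",
      score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))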

 

2) Label encoding

That is, we construct a mapping from each unique category to an integer.
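A minimal sketch of what this mapping looks like, with made-up values:

from sklearn.preprocessing import LabelEncoder

# Made-up example: LabelEncoder assigns each unique string an integer
# (in alphabetical order), so 'blue' -> 0 and 'red' -> 1
encoder = LabelEncoder()
print(encoder.fit_transform(['red', 'blue', 'red']))  # [1 0 1]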

 

 

① But pay particular attention here: we have already split the data into train and validation samples. If you brute-force a label encoding straight from the training set's categorical variables, it can raise an error, because the validation sample may contain categories that never appear anywhere in the training set.

② Label encoding assumes the categories have a meaningful order. Not every categorical variable has values with a clear ordering, but those that do are called ordinal variables. For tree-based models (such as decision trees and random forests), label encoding of ordinal variables can work well.
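One caveat worth noting: LabelEncoder assigns integers in alphabetical order, not by true rank, so for a genuinely ordinal variable you may prefer an explicit mapping (a sketch with made-up values):

import pandas as pd

# Sketch with made-up data: encode the true order explicitly instead of
# relying on LabelEncoder's alphabetical ordering
sizes = pd.Series(['small', 'large', 'medium'])
size_order = {'small': 0, 'medium': 1, 'large': 2}
print(sizes.map(size_order).tolist())  # [0, 2, 1]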

# All categorical columns
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

# Columns that can be safely label encoded
good_label_cols = [col for col in object_cols if 
                   set(X_train[col]) == set(X_valid[col])]
# Keep only columns whose training and validation values match exactly,
# so an encoder fitted on X_train can safely transform X_valid

# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(object_cols)-set(good_label_cols))
from sklearn.preprocessing import LabelEncoder

# Drop categorical columns that will not be encoded
label_X_train = X_train.drop(bad_label_cols, axis=1)
label_X_valid = X_valid.drop(bad_label_cols, axis=1)

# Apply label encoder: fit on the training column, then reuse the same
# fitted mapping to transform the validation column
label_encoder = LabelEncoder()
for col in good_label_cols:
    label_X_train[col] = label_encoder.fit_transform(label_X_train[col])
    label_X_valid[col] = label_encoder.transform(label_X_valid[col])
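Using the same hypothetical score_dataset helper sketched earlier, you could then compare this approach:

print("MAE (label encoding):",
      score_dataset(label_X_train, label_X_valid, y_train, y_valid))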

 

3) One-Hot encoding

 

① You can see that one-hot encoding adds columns to the data: one new column for every category. If a variable has many categories, the dataset expands enormously, so we usually one-hot encode only columns with relatively low cardinality. The high-cardinality columns can then be dropped from the dataset, or label encoding can be used for them instead. Under normal circumstances, 10 categories is chosen as the cutoff.

② Unlike label encoding, one-hot encoding does not assume any ordering of the categories. So if the categorical data has no clear ordering (e.g. "red" is neither more nor less than "yellow"), this method can be particularly effective. Categorical variables with no intrinsic ordering are called nominal variables.
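A minimal sketch of what one-hot encoding produces, with made-up values:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up example: each category becomes its own 0/1 indicator column
# (in scikit-learn >= 1.2, use sparse_output=False instead of sparse=False)
colors = pd.DataFrame({'Color': ['red', 'yellow', 'red']})
encoder = OneHotEncoder(sparse=False)
print(encoder.fit_transform(colors))
# [[1. 0.]
#  [0. 1.]
#  [1. 0.]]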

# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]

# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data;
# handle_unknown='ignore' prevents errors when the validation data
# contains categories not seen in training
# (note: in scikit-learn >= 1.2, the argument is sparse_output=False)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
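And once more with the hypothetical score_dataset helper from above, you could compare the final approach; whichever of the three gives the lowest MAE is the best fit for this dataset:

# Note: newer scikit-learn requires all-string column names, so you may
# first need: OH_X_train.columns = OH_X_train.columns.astype(str)
#             OH_X_valid.columns = OH_X_valid.columns.astype(str)
print("MAE (one-hot encoding):",
      score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))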

 
