Kaggle Intermediate Machine Learning: Data Processing and Feature Engineering

1 How to deal with categorical variables?

Method 1: Drop the categorical columns (generally not recommended)
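
A minimal sketch of this approach (assuming X_train / X_val are pandas DataFrames): simply drop every object-dtype column.

# Keep only non-categorical columns
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_val = X_val.select_dtypes(exclude=['object'])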

Method 2: LabelEncoder

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
X[col] = label_encoder.fit_transform(X[col])
X_val[col] = label_encoder.transform(X_val[col])
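
A fuller sketch of the same idea (assuming X_train / X_val DataFrames and a list object_cols of categorical column names): only encode columns whose validation values all appear in the training data, because transform() fails on labels it has not seen.

# Columns that can be safely label encoded
good_label_cols = [col for col in object_cols if set(X_val[col]).issubset(set(X_train[col]))]

label_X_train = X_train.copy()
label_X_val = X_val.copy()
for col in good_label_cols:
	encoder = LabelEncoder()
	label_X_train[col] = encoder.fit_transform(X_train[col])
	label_X_val[col] = encoder.transform(X_val[col])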

Method 3: OneHotEncoder

Use case: handles unordered (nominal) categorical features.
Note: this method is usually avoided when a feature has more than 15 distinct categories; see the cardinality sketch after the code below.

from sklearn.preprocessing import OneHotEncoder

One_H_encoder = OneHotEncoder(handle_unknown='ignore',sparse=False)
OH_cols_train = pd.DataFrame(One_H_encoder.fit_transform(X_train[object_cols]))
OH_cols_val = pd.DataFrame(One_H_encoder.transform(X_val[object_cols]))

OH_cols_train.index = X_train.index
OH_cols_val.index = X_val.index

num_X_train = X_train.drop(object_cols,axis=1)
num_X_val = X_val.drop(object_cols,axis=1)

OH_X_train = pd.concat([num_X_train,OH_cols_train],axis=1)
OH_X_val = pd.concat([num_X_val,OH_cols_val],axis=1)
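
One way to apply the cardinality rule from the note above (a sketch; X_train is assumed to be the training DataFrame): one-hot encode only the low-cardinality columns and handle the rest differently, e.g. drop them or use another encoding.

object_cols = [col for col in X_train.columns if X_train[col].dtype == 'object']
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 15]
high_cardinality_cols = list(set(object_cols) - set(low_cardinality_cols))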

Method 4: CountEncoding

  • Idea: replace each category value with the number of times that value occurs in the feature.
  • Why it works: rare values tend to have similar, small counts (such as 1 or 2), so rare values end up grouped together at prediction time. Common values have large counts that are unlikely to coincide exactly with other values' counts, so the common/important values each get their own grouping.

code:

import category_encoders as ce

cat_cols = ['currency','country','category']
count_encoder = ce.CountEncoder()
count_encoded = count_encoder.fit_transform(train_data[cat_cols])
data = baseline_data.join(count_encoded.add_suffix('_count'))

Method 5: TargetEncoding

  • Idea: replace each category value with the mean of the target for that category. For example, for country='A', compute the mean of the target over all samples where country='A'.

  • Note 1: fit the encoder on the training set only; including validation or test data here causes target leakage.

  • Note 2: if a feature has very many distinct values, the encoding has high variance. Target encoding attempts to measure the population mean of the target for each level of a categorical feature, so when there is little data per level the estimated mean is further from the "true" mean and has more variance. There is little data per IP address, so its estimates are much noisier than for the other features. Because the feature is extremely predictive, the model relies heavily on it, makes fewer splits on other features, and fits those features only on the errors left over after accounting for IP address. The model therefore performs very poorly on new IP addresses that were not in the training data (which is likely most new data). Going forward, the IP feature is left out when trying different encodings. One way to mitigate the variance is smoothing; see the sketch after the code below.

import category_encoders as ce
cat_cols = ['country','currency','category']
target_encoder = ce.TargetEncoder(cols=cat_cols)
target_encoder.fit(train[cat_cols],train['outcome'])

train = train.join(target_encoder.transform(train[cat_cols]).add_suffix('_target_encode'))
valid = valid.join(target_encoder.transform(valid[cat_cols]).add_suffix('_target_encode'))
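
To reduce the variance described in Note 2, category_encoders' TargetEncoder supports smoothing; a sketch (the parameter values are arbitrary):

# Blend each category's target mean with the global mean; rare categories are
# pulled more strongly toward the global mean, which lowers the variance.
smooth_encoder = ce.TargetEncoder(cols=cat_cols,min_samples_leaf=20,smoothing=10)
smooth_encoder.fit(train[cat_cols],train['outcome'])
train_smooth = train.join(smooth_encoder.transform(train[cat_cols]).add_suffix('_target_smooth'))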

Method 6: CatBoostEncoder

  • Idea: similar to TargetEncoder, but for each row the target mean is computed only from the rows that appear before it, which reduces leakage from the row's own target value.

code:

import category_encoders as ce
cat_cols = ['currency','country','category']
cat_encoder = ce.CatBoostEncoder(cols = cat_cols)
cat_encoder.fit(train[cat_cols],train['outcome'])
train = train.join(cat_encoder.transform(train[cat_cols]).add_suffix('_target'))
valid = valid.join(cat_encoder.transform(valid[cat_cols]).add_suffix('_target'))

Method 7: Create new features

  • Idea: combine existing features into new ones. For example, given the features country and population, we can create a combined country_population feature.

code:

import pandas as pd
import itertools
from sklearn.preprocessing import LabelEncoder

cat_features = ['currency','country','population']
interaction = pd.DataFrame(index = train.index)
for col1,col2 in itertools.combinations(cat_features,2):
	new_col_name = '_'.join([col1,col2])
	new_values = train[col1].map(str) + '_' + train[col2].map(str)
	
	encoder = LabelEncoder()
	interaction[new_col_name] = encoder.fit_transform(new_values)


# Count the number of events for each IP in the past six hours
def past_six_hours(series,time_window='6H'):
	series = pd.Series(series.index,index=series)
	count = series.rolling(time_window).count() - 1
	return count
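
A possible way to use this function per IP (a sketch; the clicks DataFrame with a datetime column click_time and a column ip are assumptions):

# Sort by time so the rolling window sees a monotonic index, then count, for each
# click, how many clicks the same IP made in the previous six hours.
clicks = clicks.sort_values('click_time')
clicks['ip_past_6h'] = clicks.groupby('ip')['click_time'].transform(lambda s: past_six_hours(s).values)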

Method 8: Feature Selection

  • SelectKBest using sklearn.feature_selection
from sklearn.feature_selection import SelectKBest,f_classif
feature_cols = dataset.columns.drop('outcome')
selector = SelectKBest(f_classif,k=5)
X_new = selector.fit_transform(dataset[feature_cols],dataset['outcome'])

# To see which 5 features were kept, use inverse_transform to map back to the original columns
selected_feature = pd.DataFrame(selector.inverse_transform(X_new),index = dataset.index,columns=feature_cols)

selected_columns = selected_feature.columns[selected_feature.var() != 0]


# L1 regularization (SelectFromModel with an L1-penalized LogisticRegression)
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

def select_features_l1(X, y):
    """ Return selected features using logistic regression with an L1 penalty """
    logistic = LogisticRegression(C=0.1,penalty='l1',solver='liblinear',random_state=7).fit(X,y)
    model = SelectFromModel(logistic,prefit=True)
    X_new = model.transform(X)
    selected_features = pd.DataFrame(model.inverse_transform(X_new),index=X.index,columns=X.columns)
    cols_to_keep = selected_features.columns[selected_features.var()!=0]
    return cols_to_keep
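
Hypothetical usage, following the naming used earlier in this section (feature_cols and the 'outcome' target are assumptions):

feature_cols = train.columns.drop('outcome')
selected_cols = select_features_l1(train[feature_cols],train['outcome'])
train_selected = train[selected_cols]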

2 Pipeline

Pipeline benefits:

  1. Makes the code cleaner and more intuitive
  2. Reduces the chance of bugs
  3. Makes it easier to apply the same preprocessing to new data in batches (e.g., when productionizing the model)

code:

import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_absolute_error

X_full = pd.read_csv('X_train.csv') 
X_test_full = pd.read_csv('X_test.csv')

X_full.dropna(axis=0,subset=['SalePrice'],inplace=True)
y = X_full.SalePrice
X_full.drop(['SalePrice'],axis=1,inplace=True)

X_train_full,X_val_full,y_train,y_valid = train_test_split(X_full,y,test_size=0.3,random_state=0)


categorical_cols = [cols for cols in X_train_full.columns if X_train_full[cols].nunique() <10 and X_train_full[cols].dtype == 'object']

numerical_cols = [cols for cols in X_train_full.columns if X_train_full[cols].dtype in ['int64','float64']]

my_cols = categorical_cols + numerical_cols

X_train = X_train_full[my_cols].copy()
X_val = X_val_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()

#Step1:
numerical_transform = SimpleImputer()

categorical_transform = Pipeline(steps=[('imputer',SimpleImputer(strategy='most_frequent')),
('onehot',OneHotEncoder(handle_unknown='ignore',sparse=False))])

processor = ColumnTransformer(transformers=[('numerical',numerical_transform,numerical_cols),('cat',categorical_transform,categorical_cols)])

#Step2:
model = RandomForestRegressor(n_estimators=100,random_state=0)

my_pipeline = Pipeline(steps=[('processor',processor),('model',model)])

my_pipeline.fit(X_train,y_train)
preds = my_pipeline.predict(X_val)
mean_error = mean_absolute_error(y_valid,preds)
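
A possible follow-up (sketch; whether the test index actually holds the Id column depends on how the CSV was read): reuse the fitted pipeline on the test set and write a submission file.

test_preds = my_pipeline.predict(X_test)
output = pd.DataFrame({'Id':X_test.index,'SalePrice':test_preds})
output.to_csv('submission.csv',index=False)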

3 Cross Validation

Good for: small datasets

code:

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

def get_score(n_estimators,X,y):
	my_pipeline = Pipeline(steps=[('imputer',SimpleImputer(strategy='median')),
	('model',RandomForestRegressor(n_estimators=n_estimators,random_state=0))])
	score = -1*cross_val_score(my_pipeline,X,y,cv=5,scoring='neg_mean_absolute_error')
	return score.mean()
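
A possible way to use get_score (sketch; the candidate values are arbitrary): compare several values of n_estimators and keep the one with the lowest average MAE.

results = {n: get_score(n,X,y) for n in range(50,401,50)}
best_n_estimators = min(results,key=results.get)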
	

4 XGBoost

Idea: start by initializing a weak learner; use the current ensemble to predict and compute the loss; train a new learner on that loss; add the new learner to the ensemble; then repeat these steps.


Important parameters:

  1. n_estimators: the number of base learners, which can also be seen as the number of boosting iterations; usually set between 100 and 1000. Too low underfits, too high overfits.
  2. early_stopping_rounds: stop training if the validation score has not improved for the given number of rounds; 5 is a common choice. A good practice is to set a high n_estimators and let early_stopping_rounds find the right stopping point.
  3. eval_set: used together with early_stopping_rounds to compute the validation score.
  4. n_jobs: worth setting on large datasets; it controls how many threads are used to build the trees in parallel.
  5. learning_rate: multiplies each base learner's contribution by a weight instead of adding it at full strength; the default is 0.1.

code:

from xgboost import XGBRegressor

model = XGBRegressor(n_estimators=500,learning_rate=0.01,random_state=0)
model.fit(X_train,y_train,early_stopping_rounds=5,eval_set=[(X_valid,y_valid)],verbose=False)
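
A minimal check of the early-stopped model on the validation set (sketch):

from sklearn.metrics import mean_absolute_error
predictions = model.predict(X_valid)
print('MAE:',mean_absolute_error(y_valid,predictions))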

5 Data leakage

1. Target Leakage:

  • Occurrence scenario: the training data includes features whose values are updated (or only become available) after the target is determined, so they will not be available in the same form when the prediction is actually made.

(figure: example data set with columns got_pneumonia and took_antibiotic_medicine)

  • In the data set above, took_antibiotic_medicine changes almost in lockstep with got_pneumonia, because patients take antibiotics after being diagnosed. A model trained on this data performs very well on the validation set, but its accuracy in the real world is very low. The reason: the purpose of the model is to predict whether a patient has the disease when they come in, before any medicine has been prescribed, so at prediction time the took_antibiotic_medicine feature does not carry the information it carried in the training data.

2. Train-Test Contamination

  • Occurrence scenario: we impute missing values or normalize using the full dataset before splitting it into training and validation sets, so information from the validation data leaks into the preprocessing; a sketch of the correct order follows below.
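
A minimal sketch of the correct order (the variable names are assumptions): split first, then fit the imputer on the training fold only.

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

X_train,X_val,y_train,y_val = train_test_split(X,y,random_state=0)

imputer = SimpleImputer()
X_train_imputed = imputer.fit_transform(X_train)   # statistics learned from training data only
X_val_imputed = imputer.transform(X_val)           # the same statistics reused, no leakage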
