Machine Learning - Data Preprocessing

For learning machine learning algorithms, it will certainly involve processing data, so from the beginning of the learning data preprocessing

For data preprocessing, probably following steps:

Step 1 - introducing the desired library

Import data processing required python library, the following two libraries are very important two libraries, each will be introduced

  • numpy

The library contains function mathematical function library

  • pandas

This library is used to import the data set and manage

Step 2 - Import datasets

Data sets usually .csv format to save, csv file in the form of plain text list of data storage, file data in each row is a record.

For csv file, using module pandas read_cvs method of reading.

Step 3 - missing data processing

Since the actual acquired data are rarely the same type, for various reasons can cause data loss, and therefore need to be addressed, so as not to degrade the performance of machine learning model.

We can replace the missing data using the mean or median entire column data, python used imputer class sklearn.preprocessing in to complete the task.

Step 4 - coding classification data

Classification data type generally comprise free tag value, for example, "Yes" or "No", rather than value, such as 0 or 1.

Because the tag value is not used in the mathematical equations in machine learning models, therefore, need to convert a numerical tag value.

python used LabelEncoder class sklearn.preprocessing library can accomplish this task.

Step 5 - Data is divided into training and test sets

Machine learning, it is necessary to call the data set into two parts, a training machine learning model training set , referred to as a trained test model performance test set . Usually in the ratio of 80/20 to need data set into training and test sets.

python using train_test_split () method sklearn.crossvalidation library divided.

Step 6 - feature scaling

Most of the machine learning algorithm using the Euclidean distance between two points in the data calculation process. If the variation range of the data set of feature values ​​is relatively large, then a large value is smaller than the value of the calculated distances will result in different weights. Hence the need for standardized or wherein Z-score normalization.

python can be used sklearn.preprocessing library StandardScalar

Code

# -*- coding: utf-8 -*-
"""
Author: wxer
"""
# step 1 - import the libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler


# step 2 - import dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[: 大专栏  机器学习 —— 数据预处理, 3].values


# step 3 - handing the missing data
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
imputer = imputer.fit(X[:, 1: 3])
X[:, 1: 3] = imputer.transform(X[:, 1: 3])


# step 4 - encoding  categorical data
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])

onehotencoder = OneHotEncoder(categorical_features=[0])
X = onehotencoder.fit_transform(X).toarray()
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)


# step 5 - splitting the datasets into training sets and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

# step 6 - feature scaling

sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.fit_transform(X_test)

reference

  1. Data PreProcessing

  2. Three common data normalization methods

Guess you like

Origin www.cnblogs.com/lijianming180/p/12099860.html