Machine Learning Topics: Getting Started (R)

Part One: Getting Started (reprinted)


1. Linear Regression

  • Python code
# Import Library
# Import other necessary libraries like pandas, numpy...
from sklearn import linear_model

# Load train and test datasets
# Identify feature and response variable(s); values must be numeric numpy arrays
x_train = input_variables_values_training_datasets
y_train = target_variables_values_training_datasets
x_test = input_variables_values_test_datasets

# Create linear regression object
linear = linear_model.LinearRegression()

# Train the model using the training sets and check the score (R^2)
linear.fit(x_train, y_train)
linear.score(x_train, y_train)

# Equation coefficients and intercept
print('Coefficients: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)

# Predict output
predicted = linear.predict(x_test)
  • R code
# Load train and test datasets
# Identify feature and response variable(s); values must be numeric
x_train <- input_variables_values_training_datasets
y_train <- target_variables_values_training_datasets
x_test <- input_variables_values_test_datasets
x <- cbind(x_train, y_train)

# Train the model using the training set and check the fit
linear <- lm(y_train ~ ., data = x)
summary(linear)

# Predict output
predicted <- predict(linear, x_test)

2. Logistic Regression

  • Python code
# Import Library
from sklearn.linear_model import LogisticRegression

# Assumes you have X (predictors) and y (target) for the training set
# and x_test (predictors) for the test set

# Create logistic regression object
model = LogisticRegression()

# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)

# Equation coefficients and intercept
print('Coefficients: \n', model.coef_)
print('Intercept: \n', model.intercept_)

# Predict output
predicted = model.predict(x_test)
  • R code
x <- cbind(x_train, y_train)

# Train the model using the training set and check the fit
logistic <- glm(y_train ~ ., data = x, family = 'binomial')
summary(logistic)

# Predict output (type = "response" returns predicted probabilities)
predicted <- predict(logistic, x_test, type = "response")

3. Decision Tree

  • Python code
# Import Library
# Import other necessary libraries like pandas, numpy...
from sklearn import tree

# Assumes you have X (predictors) and y (target) for the training set
# and x_test (predictors) for the test set

# Create tree object for classification; the split criterion can be
# 'gini' (the default) or 'entropy' (information gain)
model = tree.DecisionTreeClassifier(criterion='gini')

# model = tree.DecisionTreeRegressor() for regression

# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)

# Predict output
predicted = model.predict(x_test)
  • R code
library(rpart)
x <- cbind(x_train, y_train)

# Grow the tree
fit <- rpart(y_train ~ ., data = x, method = "class")
summary(fit)

# Predict output
predicted <- predict(fit, x_test)

4. Support Vector Machine (SVM)

  • Python code
# Import Library
from sklearn import svm

# Assumes you have X (predictors) and y (target) for the training set
# and x_test (predictors) for the test set

# Create SVM classification object; there are various options associated
# with it, this is a simple setup for classification
model = svm.SVC()

# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)

# Predict output
predicted = model.predict(x_test)
  • R code
library(e1071)
x <- cbind(x_train, y_train)

# Fit the model
fit <- svm(y_train ~ ., data = x)
summary(fit)

# Predict output
predicted <- predict(fit, x_test)

5. Naive Bayes

  • Python code
# Import Library
from sklearn.naive_bayes import GaussianNB

# Assumes you have X (predictors) and y (target) for the training set
# and x_test (predictors) for the test set

# Create Gaussian Naive Bayes object; other variants exist for other
# feature distributions, e.g. MultinomialNB and BernoulliNB
model = GaussianNB()

# Train the model using the training sets and check score
model.fit(X, y)

# Predict output
predicted = model.predict(x_test)
  • R code
library(e1071)
x <- cbind(x_train, y_train)

# Fit the model
fit <- naiveBayes(y_train ~ ., data = x)
summary(fit)

# Predict output
predicted <- predict(fit, x_test)

6. k-Nearest Neighbors (kNN)

  • Python code
# Import Library
from sklearn.neighbors import KNeighborsClassifier

# Assumes you have X (predictors) and y (target) for the training set
# and x_test (predictors) for the test set

# Create KNeighbors classifier object
model = KNeighborsClassifier(n_neighbors=6) # default value of n_neighbors is 5

# Train the model using the training sets and check score
model.fit(X, y)

# Predict output
predicted = model.predict(x_test)
  • R code
library(class)

# knn() from the class package classifies the test set directly;
# there is no separate fit/predict step
predicted <- knn(train = x_train, test = x_test, cl = y_train, k = 5)
summary(predicted)

7. K-Means

  • Python code
# Import Library
from sklearn.cluster import KMeans

# Assumes you have X (attributes) for the training set
# and x_test (attributes) for the test set

# Create KMeans object
k_means = KMeans(n_clusters=3, random_state=0)

# Train the model using the training set
k_means.fit(X)

# Predict output (cluster assignments)
predicted = k_means.predict(x_test)
  • R code
library(cluster)
fit <- kmeans(X, 3) # 3-cluster solution

8. Random Forest

For more details on this algorithm, and on comparing and tuning decision tree and random forest models, we recommend reading the following articles (a short parameter-tuning sketch follows the list):

  1. Getting Started with Random Forests - Lite

  2. Comparing the CART model with random forests (part 1)

  3. Comparing the CART model with random forests (part 2)

  4. Tuning your random forest model parameters
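
As a rough illustration of the tuning discussed in the fourth article, a cross-validated grid search over the forest's main parameters might look like the sketch below. It assumes scikit-learn's GridSearchCV, the same placeholder X and y used in the snippets in this post, and an illustrative parameter grid, not a recommended one.

# Import Library
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the values are assumptions, not recommendations
param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [None, 5, 10],
    'max_features': ['sqrt', 'log2']
}

# 5-fold cross-validated search over the grid
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_)
model = grid.best_estimator_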

  • Python code
# Import Library
from sklearn.ensemble import RandomForestClassifier

# Assumes you have X (predictors) and y (target) for the training set
# and x_test (predictors) for the test set

# Create random forest object
model = RandomForestClassifier()

# Train the model using the training sets and check score
model.fit(X, y)

# Predict output
predicted = model.predict(x_test)
  • R code
library(randomForest)
x <- cbind(x_train, y_train)

# Fit the model
fit <- randomForest(y_train ~ ., data = x, ntree = 500)
summary(fit)

# Predict output
predicted <- predict(fit, x_test)

9. Dimensionality Reduction Algorithms

To learn more about these algorithms, you can read the "Beginner's Guide to Dimensionality Reduction Algorithms".

  • Python code
# Import Library
from sklearn import decomposition

# Assumes you have training and test data sets as train and test

# Create PCA object; n_components defaults to min(n_samples, n_features)
pca = decomposition.PCA(n_components=k)

# For factor analysis:
# fa = decomposition.FactorAnalysis()

# Reduce the dimension of the training dataset using PCA
train_reduced = pca.fit_transform(train)

# Reduce the dimension of the test dataset
test_reduced = pca.transform(test)
  • R code
library(stats)
pca <- princomp(train, cor = TRUE)
train_reduced <- predict(pca, train)
test_reduced <- predict(pca, test)

※※※ Supplement on dimensionality reduction: https://blog.csdn.net/fnqtyr45/article/details/82836063

10. Gradient Boosting and AdaBoost

For more, see the detailed discussion of Gradient Boosting and AdaBoost.

  • Python code
# Import Library
from sklearn.ensemble import GradientBoostingClassifier

# Assumes you have X (predictors) and y (target) for the training set
# and x_test (predictors) for the test set

# Create gradient boosting classifier object
model = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)

# Train the model using the training sets and check score
model.fit(X, y)

# Predict output
predicted = model.predict(x_test)
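
The section title also mentions AdaBoost, for which this post shows no code. A minimal sketch, assuming scikit-learn's AdaBoostClassifier with illustrative parameter values:

# Import Library
from sklearn.ensemble import AdaBoostClassifier

# Create AdaBoost classifier object (parameter values are illustrative)
model = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=0)

# Train the model using the training sets
model.fit(X, y)

# Predict output
predicted = model.predict(x_test)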
  • R code
library(caret)
x <- cbind(x_train, y_train)

# Fit the model
fitControl <- trainControl(method = "repeatedcv", number = 4, repeats = 4)
fit <- train(y_train ~ ., data = x, method = "gbm", trControl = fitControl, verbose = FALSE)

# Predict output (class probabilities; the second column is the positive class)
predicted <- predict(fit, x_test, type = "prob")[,2]
  • GradientBoostingClassifier and RandomForestClassifier are two different kinds of tree ensembles, and people often ask about the difference between them: a random forest trains its trees independently on bootstrap samples and averages their votes (bagging), while gradient boosting grows trees sequentially, each one correcting the errors of the ensemble built so far (boosting).

Epilogue

By now you should have a general understanding of the commonly used machine learning algorithms. The sole purpose of providing the R and Python code in this write-up is to let you start learning immediately. If you want to master machine learning, start now: do the exercises, build a solid understanding of the whole process, apply this code, and feel the fun of it!

Source: http://blog.jobbole.com/92021/

t-SNE

For more on t-SNE, see: https://blog.csdn.net/scythe666/article/details/79203239
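
A minimal sketch, assuming scikit-learn's TSNE; note that scikit-learn's t-SNE has no transform step for unseen data, so it is fit directly on the data it embeds (the perplexity value is illustrative):

# Import Library
from sklearn.manifold import TSNE

# Embed X (attributes) into two dimensions
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_embedded = tsne.fit_transform(X)

print(X_embedded.shape)  # (n_samples, 2)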

Part Two: Project Template

    1. Prepare Problem

a) Load libraries: load the required packages

b) Load dataset: load the required data set

c) Split out a validation dataset

    2. Summarize Data

a) Descriptive statistics

b) Data visualizations

    3. Prepare Data

a) Data cleaning

b) Feature selection

c) Data transforms

    4. Evaluate Algorithms

a) Choose test options and an evaluation metric

b) Spot-check algorithms

c) Compare algorithms

    5. Improve Accuracy

a) Algorithm tuning

b) Ensembles

    6. Finalize Model

a) Make predictions on the validation dataset

b) Create a standalone model on the entire training dataset

c) Save the model for later use (a sketch of the whole template is shown below)
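
A minimal end-to-end sketch of this template, assuming Python with scikit-learn; the iris dataset and the two model choices are illustrative assumptions, and the visualization and tuning steps are only stubbed:

# 1. Prepare Problem: load libraries, load dataset, split out a validation set
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# 2. Summarize Data: descriptive statistics (visualizations omitted here)
print(X_train.mean(axis=0), X_train.std(axis=0))

# 3. Prepare Data: a simple transform (scaling) stands in for cleaning/selection
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)

# 4. Evaluate Algorithms: spot-check and compare with cross-validation
for name, model in [('LR', LogisticRegression(max_iter=1000)),
                    ('CART', DecisionTreeClassifier(random_state=0))]:
    scores = cross_val_score(model, X_train_s, y_train, cv=5)
    print(name, scores.mean())

# 5. Improve Accuracy: tuning and ensembles would go here (see section 8 above)

# 6. Finalize Model: refit on the full training set, check the validation set
final = LogisticRegression(max_iter=1000).fit(X_train_s, y_train)
print(accuracy_score(y_val, final.predict(X_val_s)))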

Part Three: Applications

Source: www.cnblogs.com/940310lxd/p/12370262.html