R Data Analysis: Building Predictive Models with R

Predictive models are becoming more and more popular across many fields. Today's example differs from the clinical predictive models covered previously, but the methods and ideas are the same. Seeing how the same techniques are applied in different fields keeps your perspective from becoming too narrow.

Today I use another example to introduce a unified framework for building predictive models. The same operation in R can be implemented in many different ways, so unifying the framework is especially important. This is not just my opinion; one of caret's stated goals is to:

eliminate syntactical differences between many of the functions for building and predicting models

Data division

Usually our data is limited, so the first step is to decide how to use it. There are several schools of thought here.

When the data set is relatively small, one school uses all of the data for training so that the model is as representative as possible, but the price is that there is no out-of-sample validation. As I mentioned in my earlier machine learning posts, out-of-sample validation is an important step in model evaluation, so the data is generally still split. My personal opinion: many students have only two hundred or so observations; in that case don't split the data set, use all of it to preserve the validity of the model.

The data I am working with here is a set with 4335 observations and 1579 variables. Now I want to split it into a training set and a test set. The code is as follows:

library(caret)

set.seed(1)  # make the random split reproducible
# stratified split: 75% of the rows go to the training set
inTrain <- createDataPartition(mutagen, p = 3/4, list = FALSE)

trainDescr <- descr[inTrain, ]   # predictors, training rows
testDescr  <- descr[-inTrain, ]  # predictors, test rows

trainClass <- mutagen[inTrain]   # outcome, training rows
testClass  <- mutagen[-inTrain]  # outcome, test rows

The key function here is createDataPartition. Its p parameter gives the proportion assigned to the training set: 75% of the data is used to train the model and the remaining 25% is held out for validation. Note that the split must be based on the outcome variable. For example, if I am building a death prediction model, I must split on survival status: if the ratio of deaths to non-deaths in the original data is 7:3, the stratified split ensures that the ratio in the training set is still 7:3.
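
A quick check confirms the stratification (a minimal sketch, using the mutagen outcome and trainClass from the code above):

# class proportions in the full data vs. the training split
prop.table(table(mutagen))
prop.table(table(trainClass))
# the two tables should be nearly identical thanks to stratification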

Two problems commonly arise with the predictors after partitioning: zero-variance (or near-zero-variance) predictors, and multicollinearity.

The first problem refers to variables that take only one value, or almost only one value, and therefore provide no information. For example, variable A takes the values y and n, with y at 90% and n at 10%, making it a near-zero-variance predictor. To avoid this problem we should first exclude single-valued variables and severely unbalanced variables from the data set; they can be found and deleted with the nearZeroVar function.
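
A sketch of nearZeroVar in use (shown with caret's default thresholds, as an alternative to the manual filter used further below):

# indices of zero- and near-zero-variance columns
nzv <- nearZeroVar(trainDescr)
if (length(nzv) > 0) {
  trainDescr <- trainDescr[, -nzv]
  testDescr  <- testDescr[, -nzv]  # apply the same filter to the test set
}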

The second problem is collinearity. One solution is to extract principal components and refit after dimensionality reduction. The other is to compute pairwise correlations and delete predictors whose correlation coefficient exceeds a chosen threshold, which is what the findCorrelation function does.

The following code removes the zero-variance predictors and the predictors with correlation coefficients greater than 0.9:

# flag columns that contain a single unique value (zero variance)
isZV <- apply(trainDescr, 2, function(u) length(unique(u)) == 1)
trainDescr <- trainDescr[, !isZV]
testDescr  <- testDescr[, !isZV]

# find columns whose pairwise correlation exceeds 0.90 and drop them
descrCorr <- cor(trainDescr)
highCorr  <- findCorrelation(descrCorr, 0.90)

trainDescr <- trainDescr[, -highCorr]
testDescr  <- testDescr[, -highCorr]
ncol(trainDescr)  # how many predictors remain

Running the code above dropped our predictors from 1575 to 640.

Next, some algorithms, such as partial least squares, neural networks and support vector machines, need the numeric variables to be centered or standardized. For this we preprocess the data with the preProcess function. The specific code is as follows:

# learn centering/scaling parameters from the training set only
xTrans <- preProcess(trainDescr)
# apply the same transformation to both the training and the test set
trainDescr <- predict(xTrans, trainDescr)
testDescr  <- predict(xTrans, testDescr)

Through its method parameter, preProcess can conveniently impute, center, standardize, or extract principal components from the training data. I would call it the strongest data preprocessing function there is.
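
As a sketch of the method parameter (the combination below is purely illustrative; the default is c("center", "scale")):

# impute missing values with column medians, then center, scale,
# and reduce to principal components, all in one step
pp <- preProcess(trainDescr,
                 method = c("medianImpute", "center", "scale", "pca"))
trainPC <- predict(pp, trainDescr)  # transformed training predictors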

Modeling and tuning

Modeling is carried out with the train function, and the essence of the unified framework caret provides for predictive modeling lies precisely in train. To build a predictive model with a different machine learning algorithm, we only need to change the method parameter of train, and there are a great many algorithms to choose from; the train documentation lists the available method values.

Another parameter worth noting is trControl, which can be used to set the cross-validation method:

trControl which specifies the resampling scheme, that is, how cross-validation should be performed to find the best values of the tuning parameters
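
A typical setting might look like this (a sketch; the exact scheme is up to you):

# 10-fold cross-validation, repeated 3 times
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)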

To train a logistic regression model (the one everyone uses most), for instance, I can write the following code:

default_glm_mod = train(
  form = default ~ .,                                   # model formula
  data = default_trn,                                   # training data
  trControl = trainControl(method = "cv", number = 5),  # 5-fold CV
  method = "glm",                                       # fitting algorithm
  family = "binomial"                                   # passed on to glm()
)

In the above code we set four important arguments: the model formula form, the data source data, the cross-validation scheme trControl (here 5-fold cross-validation), and the model algorithm method.
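
Once trained, the object can be inspected; these accessors are standard for train objects:

default_glm_mod$results     # resampled performance estimates
default_glm_mod$finalModel  # the fitted glm object itself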

I now want to train a support vector machine model; the code is as follows:

# bootstrap resampling with 200 iterations (the method defaults to "boot")
bootControl <- trainControl(number = 200)

svmFit <- train(
                trainDescr, trainClass,  # predictors and outcome
                method = "svmRadial",    # SVM with a radial basis kernel
                trControl = bootControl,
                scaled = FALSE)          # data were already scaled by preProcess

The above code takes a while to run. We are using bootstrap resampling with 200 iterations, that is, 200 resampled data sets are drawn, which is why it is time-consuming.

The support vector machine has two hyperparameters, sigma and C. Each line of the tuning output represents one hyperparameter combination, so we can read off the model's performance under the different combinations.

Among the reported metrics is the agreement statistic Kappa. If our outcome is very unbalanced, Kappa becomes a particularly important indicator of model performance.

Kappa is an excellent performance measure when the classes are highly unbalanced

We find that with sigma = 0.00137 and C = 2 the model performs best, so this combination becomes our final model.
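
If you want to control the candidate combinations yourself, train also accepts a tuneGrid (a sketch; the grid values are illustrative):

# explicit hyperparameter grid for the radial-kernel SVM
svmGrid <- expand.grid(sigma = c(0.0005, 0.001, 0.002),
                       C     = c(0.5, 1, 2, 4))
svmFit2 <- train(trainDescr, trainClass,
                 method = "svmRadial",
                 trControl = bootControl,
                 tuneGrid = svmGrid,
                 scaled = FALSE)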

Predict new samples

Now that the model is trained, we can use the best fitted model, stored in finalModel, to predict our test set and then evaluate performance. The code to predict new samples is as follows:

# predict the test set with the underlying kernlab model;
# predict(svmFit, newdata = testDescr) through caret works equally well
predict(svmFit$finalModel, newdata = testDescr)

Sometimes we train several models with different algorithms, say a support vector machine model and a gbm model. predict can then conveniently return the predictions of all of them: just put the models in a list.
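
A minimal sketch (assuming a gbmFit trained analogously to svmFit):

# collect the fitted models and predict the test set with each
models <- list(svm = svmFit, gbm = gbmFit)
lapply(models, predict, newdata = testDescr)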

In addition, there are two functions, extractPrediction and extractProb: extractPrediction conveniently extracts the predicted values from a set of models, and extractProb extracts the predicted class probabilities.
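
A hedged sketch of their use, reusing the models list from above and assuming both models can produce class probabilities:

# predicted classes and class probabilities for the test samples
predValues <- extractPrediction(models, testX = testDescr, testY = testClass)
probValues <- extractProb(models, testX = testDescr, testY = testClass)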

Evaluate model performance

For classification problems, we can use confusionMatrix to conveniently obtain accuracy, Kappa, sensitivity, specificity and the other standard model evaluation metrics:
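
For instance (a sketch, reusing the fitted svmFit and the test split):

svmPred <- predict(svmFit, newdata = testDescr)
confusionMatrix(data = svmPred, reference = testClass)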

And we can report an ROC curve:

# predicted probability of the "mutagen" class vs. the observed classes
svmROC <- roc(svmProb$mutagen, svmProb$obs)

aucRoc can then quickly give us the area under the curve. (Both roc and aucRoc here come from older versions of caret; the pROC package now fills the same role.)
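
A sketch of the equivalent with pROC, assuming svmProb comes from extractProb with an obs column of observed classes and a mutagen column of class probabilities:

library(pROC)
# pROC::roc takes the observed classes first, then the predicted probabilities
svmROC2 <- roc(response = svmProb$obs, predictor = svmProb$mutagen)
auc(svmROC2)   # area under the ROC curve
plot(svmROC2)  # draw the curve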

For regression problems there is no accuracy or Kappa statistic when evaluating model performance. Instead we care about R-squared, RMSE and MAE: plotObsVsPred conveniently plots actual values against predicted values, a calc_rmse helper can compute RMSE, and a get_best_result helper can output R-squared and the other indicators.
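
caret's built-in postResample covers the same ground, and calc_rmse can be written as a one-liner; lmFit and testOutcome below are hypothetical names for a regression fit and its numeric test outcome:

# RMSE, R-squared and MAE in one call (caret built-in)
postResample(pred = predict(lmFit, testDescr), obs = testOutcome)

# a minimal hand-rolled RMSE helper
calc_rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}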

Predictor Selection

We can also examine, and plot, the importance of the predictors. Sometimes our data has many variables, that is, many dimensions, which leads to the curse of dimensionality, and many predictors actually provide no information to the model. The point of predictor selection is to make the model as simple as possible without compromising its performance.

varImp shows how much each predictor contributes to the model, expressed as an importance score; exactly how the score is computed depends on the model type (see ?varImp). The code to get the importance score of each predictor is as follows:

# importance scores on their original scale (scale = TRUE would map them to 0-100)
varImp(gbmFit, scale = FALSE)

More intuitively, plotting the resulting object gives a needle plot of predictor importance:

The plot lists only the top 20 variables by importance. We actually have 655 variables, but more than 200 of them have an importance score of 0, so we could safely leave them out when running the model.
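
The plot itself can be produced like this (top = 20 restricts it to the 20 most important predictors):

# needle plot of the 20 most important predictors
plot(varImp(gbmFit, scale = FALSE), top = 20)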

Summary

Today I have given you a brief introduction to building prediction models with caret. Thank you for reading patiently. My articles are very detailed and the important code is all included, so I hope everyone will try it out themselves.
