[R] Day7 language study notes the data set is divided into training, validation and test sets

 

1. Objective: The method describes the data set into a training set and test set validation set.

 

2. Source: GitHub  https://github.com/reisanar/datasets/blob/master/WestRoxbury.csv

 

 

3. This blog describes a method of partitioning data, variables and therefore does not do too much introduction.

 

4. The method of division

 

4.1 The variable is divided into a training set, validation and test sets

Method 1:

## partitioning into training (50%), validation (30%), and test (20%) sets

# randomly sample 50% of the row IDs for training
train.rows <- sample(rownames(housing.df), dim(housing.df)[1]*0.5)

# sample 30% of the row IDs into the validation set, drawing only from records not already in the training set
# use setdiff() to find records not already in the training set
valid.rows <- sample(setdiff(rownames(housing.df), train.rows), dim(housing.df)[1]*0.3)

# assign the remaining 20% row IDs serve as test
test.rows <- setdiff(rownames(housing.df), union(train.rows, valid.rows))

# create the 3 data frames by collecting all columns from the appropriate rows 
housing.train <- housing.df[train.rows, ]
housing.valid <- housing.df[valid.rows, ]
housing.test <- housing.df[test.rows, ]

 

Method 2:

## alternative

train.rows <- sample(1:nrow(housing.df), dim(housing.df)[1]*0.5)
housing.train <- housing.df[train.rows,]

remain <- housing.df[-train.rows,]
valid.rows <- sample(1:nrow(remain), dim(housing.df)[1]*0.3) #dim(housing.df)[1]*0.3 -> be careful!
housing.valid <- remain[valid.rows,]
housing.test <- remain[-valid.rows,]

 

4.2 divides the data into training and test sets

Method 1:

## partitioning into training (60%) and validation (40%)
set.seed(1) ## to get the same sequence of numbers

# randomly sample 60% of the row IDs for training; the remaining 40% serve as validation
train.rows <- sample(rownames(housing.df), dim(housing.df)[1]*0.6)

# collect all the columns with training row ID into training set:
housing.train <- housing.df[train.rows, ]

# assign row IDs that are not already in the training set, into validation 
valid.rows <- setdiff(rownames(housing.df), train.rows) 
housing.valid <- housing.df[valid.rows, ]

  

Method 2:

## alternative 1
# collect all the columns without training row ID into validation set 
train.rows <- sample(1:nrow(housing.df), dim(housing.df)[1]*0.6)
housing.train <- housing.df[train.rows, ]
housing.valid <- housing.df[-train.rows, ] 

  

Method 3:

## alternative 2: generate random numbers
gp <- runif(nrow(housing.df)) # generate uniform random numbers
housing.train <- housing.df[gp < 0.6,]
housing.test <- housing.df[gp >=0.6,]

  

Method 4:

## alternative 3
n_obs <- nrow(housing.df) # get the number of observations
permuted_rows <- sample(n_obs) # shuffle row indices: permuted_rows
housing_shuffled <- housing.df[permuted_rows,] # randomly order data: Sonar
split <- round(n_obs * 0.6) # identify row to split on: split
housing.train <- housing_shuffled[1: split,] # create train
housing.test <- housing_shuffled[(split+1):nrow(housing_shuffled),] # create test

  

Guess you like

Origin www.cnblogs.com/shanshant/p/12239446.html