Incremental Learning: Processing Huge Data with Random Forests

Table of contents

1. Ordinary learning vs incremental learning

1.1 General learning

1.2 Incremental learning

2. Application of incremental learning on Kaggle data


        As an early open-source machine learning library, sklearn does not provide an interface for GPU computing; its algorithms run on the CPU only and cannot tap additional computing resources. As a result, when we want to train a random forest on a huge amount of data, we are likely to run short of computing resources. Fortunately, there are two ways around this problem:

        • Use another machine learning library that can run on the GPU to implement random forests, such as xgboost (a brief GPU sketch follows this list).

        • Continue training with sklearn, but use incremental learning.
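
As a quick illustration of the first option, here is a minimal sketch of a GPU-backed random forest with xgboost's XGBRFRegressor. The GPU flags depend on the xgboost version installed (device="cuda" is the xgboost 2.x style; older releases used tree_method="gpu_hist"), so the parameters below are an assumption to check against your own installation, not code from the original article.

# A minimal sketch (not from the original article): a random forest on the GPU via xgboost.
# The GPU flags assume xgboost >= 2.0; adjust them for your version.
from xgboost import XGBRFRegressor
from sklearn.datasets import fetch_california_housing

X_fc, y_fc = fetch_california_housing(return_X_y=True)

rf_gpu = XGBRFRegressor(
    n_estimators=100,      # number of trees in the forest
    subsample=0.8,         # row subsampling per tree
    colsample_bynode=0.8,  # feature subsampling per split
    tree_method="hist",
    device="cuda",         # build the trees on the GPU (xgboost >= 2.0 syntax)
    random_state=1412,
)
rf_gpu.fit(X_fc, y_fc)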

        Incremental learning is a very common approach in machine learning, used in both supervised and unsupervised settings. It allows an algorithm to keep absorbing new data to extend the current model; in other words, a huge dataset can be split into several subsets that are fed into the model for training one after another.

1. Ordinary learning vs incremental learning

1.1 General learning

        Generally speaking, once a model has been trained, fitting it again on new data replaces whatever was learned from the original data. To illustrate, we import two datasets: the California housing price dataset and the Kaggle housing price dataset.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor as RFR
from sklearn.tree import DecisionTreeRegressor as DTR
from sklearn.model_selection import cross_validate,KFold
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error

# Kaggle housing price dataset
data=pd.read_csv('F:\\Jupyter Files\\机器学习进阶\\集成学习\\datasets\\House Price\\train_encode.csv',encoding='utf-8')
data.drop('Unnamed: 0', axis=1, inplace=True)
x=data.iloc[:,:-1]
y=data.iloc[:,-1]

x.shape  #(1460, 80)

# California housing price dataset
X_fc = fetch_california_housing().data
y_fc = fetch_california_housing().target

X_fc.shape  #(20640, 8)

• Train on the California house price dataset:

model = RFR(n_estimators=3, warm_start=False)  # does not support incremental learning
model1 = model.fit(X_fc,y_fc)
#RMSE
(mean_squared_error(y_fc,model1.predict(X_fc)))**0.5
0.30123985583215596

• View all the trees in the forest; you can see the random seed of each tree:

model1.estimators_
[DecisionTreeRegressor(max_features='auto', random_state=1785210460),
 DecisionTreeRegressor(max_features='auto', random_state=121562514),
 DecisionTreeRegressor(max_features='auto', random_state=1271073231)]

• Let model1 continue training on the Kaggle housing price dataset x, y:

model1 = model1.fit(x.iloc[:,:8],y)
# Note!! x has 80 features while X_fc has only 8; data fed into the same model must have the same structure
model1.estimators_
[DecisionTreeRegressor(max_features='auto', random_state=349555903),
 DecisionTreeRegressor(max_features='auto', random_state=1253222501),
 DecisionTreeRegressor(max_features='auto', random_state=2145441582)]

The original trees in model1 have disappeared; the new trees have replaced them.

• Evaluate model1 on the California housing price dataset again:

(mean_squared_error(y_fc,model1.predict(X_fc)))**0.5
235232.2375340384

The RMSE is extremely large: the model no longer has any ability to predict y_fc. Clearly the original trees in model1 have disappeared, overwritten by trees trained on the Kaggle dataset, so model1 retains no memory of the California housing price data it had seen.

        This overwriting rule in sklearn is what makes cross-validation possible: because each fit is unaffected by previous fits, the same estimator can be reused across folds without data leakage. In incremental learning, however, the trees trained on earlier data are not replaced, and the model keeps its memory of previously seen data.
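
As a small aside (not in the original article), the snippet below shows this in action with the cross_validate and KFold tools already imported above: every fold fits a fresh clone of the estimator, so no fold inherits trees from another.

# A minimal sketch: cross-validation depends on each fit starting from scratch,
# because cross_validate clones the estimator for every fold.
cv = KFold(n_splits=5, shuffle=True, random_state=1412)
scores = cross_validate(RFR(n_estimators=3, warm_start=False), X_fc, y_fc,
                        cv=cv, scoring="neg_root_mean_squared_error")
abs(scores["test_score"].mean())  # average RMSE across the five folds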

1.2 Incremental learning

• Train on the California house price dataset:

# Incremental learning
model = RFR(n_estimators=3, warm_start=True)  # supports incremental learning
model2 = model.fit(X_fc,y_fc)
model2.estimators_
[DecisionTreeRegressor(max_features=1.0, random_state=1192338237),
 DecisionTreeRegressor(max_features=1.0, random_state=506683268),
 DecisionTreeRegressor(max_features=1.0, random_state=654939120)]
(mean_squared_error(y_fc,model2.predict(X_fc)))**0.5
0.29385313927085455

• Let model2 continue training on the Kaggle housing price dataset x, y:

model2 = model2.fit(x.iloc[:,:8],y)
model2.estimators_ 
[DecisionTreeRegressor(max_features=1.0, random_state=1192338237),
 DecisionTreeRegressor(max_features=1.0, random_state=506683268),
 DecisionTreeRegressor(max_features=1.0, random_state=654939120)]

In incremental learning, the existing trees do not change.

• Evaluate model2 on the California housing price dataset again:

(mean_squared_error(y_fc,model2.predict(X_fc)))**0.5
0.29385313927085455

Even though the model has since been trained on x and y, model2 still retains its memory of the California housing price dataset, so it can still score well when predicting on X_fc and y_fc.

        In incremental learning, the existing trees do not change and previously trained results are retained. For bagging models such as random forests, this means the trees trained on earlier data are kept, new trees are trained on the new data, and the old and new trees do not affect each other.

        However, there is a catch: although the original trees have not changed, incremental learning does not seem to have added any new trees either. In fact, for random forests, new trees have to be added manually:

# Model parameters can be modified this way, without re-instantiating the model
model2.n_estimators += 2  # add 2 trees for incremental learning
model2.fit(x.iloc[:,:8],y)
model2.estimators_  # the original trees are unchanged; the new trees were trained on the newly input data
[DecisionTreeRegressor(max_features=1.0, random_state=1192338237),
 DecisionTreeRegressor(max_features=1.0, random_state=506683268),
 DecisionTreeRegressor(max_features=1.0, random_state=654939120),
 DecisionTreeRegressor(max_features=1.0, random_state=1440840641),
 DecisionTreeRegressor(max_features=1.0, random_state=1050229920)]

2. Application of incremental learning on Kaggle data

        When facing large data, we read the contents of a huge csv or database file in batches inside a loop, preprocess each batch, and feed it into the model via incremental learning.

STEP1: Define the paths to the training and test data

trainpath = r"F:\Jupyter Files\机器学习进阶\集成学习\datasets\Big data\bigdata_train.csv"
testpath = r"F:\Jupyter Files\机器学习进阶\集成学习\datasets\Big data\bigdata_test.csv"

STEP2: Try to find out the total amount of data in csv

When we decide to use incremental learning, the data is usually so huge that it cannot be opened and inspected directly, cannot be trained on directly, and perhaps cannot even be imported in one go (for example, more than 20 GB). But to import the data in a loop, we must know its approximate size, which we can obtain for a csv that cannot be opened in the following ways:

  • If it is a competition dataset, the corresponding information can usually be found on the competition page.
  • If it is a database table, the row count can be obtained with a query in the database.
  • If no such information is available, you can use the deque class to import the last few lines of the csv file and check the index.
  • If the data has no index, you can only rely on pandas to probe for the approximate number of rows.

Method 1: The data set has an index

# Use deque and StringIO to import the last n lines of the csv file
from collections import deque  # deque: double-ended queue
from io import StringIO
with open(trainpath, 'r') as data:
    q = deque(data, 5)
pd.read_csv(StringIO(''.join(q)), header=None)
0 1 2 3 4 5 6 7 8 9 ... 101 102 103 104 105 106 107 108 109 110
0 995029 3.0 3.0 5.0 5.0 2.0 3.0 2.0 5.0 5.0 ... 291658.0 666.0 469.0 37.0 1954.0 33.0 0.0 41.0 865.0 -70.6503
1 995030 2.0 4.0 4.0 2.0 4.0 2.0 4.0 4.0 4.0 ... 968800.0 666.0 469.0 6.0 208.0 30.0 0.0 208.0 19838.0 -123.0867
2 995031 2.0 1.0 3.0 2.0 5.0 1.0 5.0 4.0 4.0 ... 567037.0 93.0 541.0 596.0 2892.0 1602.0 0.0 144.0 2745.0 112.5000
3 995032 1.0 4.0 1.0 5.0 2.0 2.0 1.0 5.0 2.0 ... 989963.0 57.0 441.0 13.0 520.0 29.0 0.0 208.0 10546.0 -97.0000
4 995033 3.0 2.0 4.0 3.0 4.0 2.0 4.0 3.0 4.0 ... 443675.0 36.0 272.0 3.0 285.0 15.0 0.0 208.0 9322.0 -76.3729

You can see that the index of the last row is 995033, so the training set contains roughly 995,000 rows of data.

Method 2: The data set has no index

If the data has no index, try using skiprows and nrows in pandas: skiprows skips the first skiprows lines of the file, and nrows imports only nrows rows. For example, with skiprows=1000 and nrows=1000, pandas imports rows 1001~2000. When skiprows exceeds the number of rows in the file, an EmptyDataError is raised.

for i in range(0,10**7,100000):
    df = pd.read_csv(trainpath,skiprows=i, nrows=1)
    print(i)
0
100000
200000
300000
400000
500000
600000
700000
800000
900000
---------------------------------------------------------------------------
EmptyDataError                            Traceback (most recent call last)

You can see that importing succeeded at 900,000 rows but raised an error at 1,000,000, so the amount of data is somewhere between 900,000 and 1,000,000 rows. You could keep narrowing this range down, but generally it is enough to know the size to within about 100,000 rows.
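
If a tighter estimate is ever needed, one option (not part of the original article) is a bisection-style probe over the range just found; the bounds 900000 and 1000000 below come from this example and would need adjusting for other files.

# A hedged sketch: bisect over skiprows to pin down the number of lines.
low, high = 900000, 1000000   # skiprows=low succeeded above, skiprows=high failed
while high - low > 1:
    mid = (low + high) // 2
    try:
        pd.read_csv(trainpath, skiprows=mid, nrows=1)
        low = mid    # there is still data after skipping `mid` lines
    except pd.errors.EmptyDataError:
        high = mid   # `mid` is already past the end of the file
print("the file has roughly", low + 1, "lines")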

STEP3: After confirming the data amount, prepare the loop range

looprange = range(0,10**6,50000)

STEP4: Establish a model for incremental learning and define a test set

reg = RFR(n_estimators=10
          ,random_state=1412
          ,warm_start=True
          ,verbose=True  # incremental learning is always slow; you can choose to display the training progress
         )
# define the test set
test = pd.read_csv(testpath,header="infer",index_col=0)
Xtest = test.iloc[:,:-1]
Ytest = test.iloc[:,-1]

STEP5: Start the looped import and incremental learning

Note: when skiprows + nrows goes past the end of the file, all remaining rows are read.

for line in looprange:
    if line == 0:
        # on the first read, keep the column names and do not add new trees
        header = "infer"
        newtree = 0
    else:
        # on subsequent reads, there is no header row; add 10 trees each time
        header = None
        newtree = 10
    
    trainsubset = pd.read_csv(trainpath, header = header, index_col=0, skiprows=line, nrows=50000)
    Xtrain = trainsubset.iloc[:,:-1]
    Ytrain = trainsubset.iloc[:,-1]
    reg.n_estimators += newtree
    reg = reg.fit(Xtrain,Ytrain)
    print("DONE",line+50000)
        
    # break the loop when the batch has fewer than 50000 rows (end of file reached)
    if Xtrain.shape[0] < 50000:
        break

After all the data has been used for training, evaluate on the test set:

reg.score(Xtest,Ytest)
0.9903482355083931

        When using incremental learning, if hyperparameter tuning is required, we need to wrap the incremental-learning loop into an estimator or a function so that it can be called repeatedly during tuning. The amount of computation involved is enormous, but at least we now have a way to train on huge data on the CPU.
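
A rough sketch of such a wrapper is shown below; the function name and its arguments are hypothetical, and the body simply reuses the batch-reading loop from STEP5, so it should be adapted rather than taken as-is.

# A hedged sketch: wrap the incremental-learning loop in a function so it can be
# called once per hyperparameter candidate. All names here are hypothetical.
def fit_incremental_rf(params, trainpath, chunksize=50000, trees_per_chunk=10):
    reg = RFR(n_estimators=trees_per_chunk, warm_start=True, **params)
    for line in range(0, 10**6, chunksize):
        header = "infer" if line == 0 else None
        subset = pd.read_csv(trainpath, header=header, index_col=0,
                             skiprows=line, nrows=chunksize)
        Xtrain, Ytrain = subset.iloc[:, :-1], subset.iloc[:, -1]
        if line != 0:
            reg.n_estimators += trees_per_chunk   # grow the forest for each new batch
        reg = reg.fit(Xtrain, Ytrain)
        if Xtrain.shape[0] < chunksize:           # last (incomplete) batch: stop
            break
    return reg

# Hypothetical usage during tuning:
# for candidate in ({"max_depth": 10}, {"max_depth": 25}):
#     model = fit_incremental_rf(candidate, trainpath)
#     print(candidate, model.score(Xtest, Ytest))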
