Using GBDT to predict time series data

       Recently I needed to use the GBDT algorithm to predict time series data (a regression task). The data consists of real estate transaction records for 6 different cities from January 2011 to April 2020; I could not find a matching dataset online. I also searched blogs and GitHub for material on applying GBDT to time series data, but most of what I found covers the standard setting of predicting a label from feature data for classification or regression. So this post gives a brief introduction to handling time series data with GBDT, followed by the code and data addresses.

 

1. Data introduction

     The 6 cities include Chongqing, Hangzhou, Luoyang, Nanchong, and Wuhu. The features are shown in the figure below. The regression task is to predict the feature values for January through April 2020 from the historical data, so a total of 13 feature values need to be predicted.

2. Model training

   The simple idea is to use the year and month as the input features and let GBDT learn the values of the remaining features (the 13 columns described above), training one model per target column. If you need the model to be more robust, you can of course add other columns as features when fitting each target. This is only a brief introduction, so only the simplest setup is explained.
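Since the post does not show how X and Y are built, here is a minimal sketch of that step under stated assumptions: the column names (`year`, `month`, `feat_0` … `feat_12`) and the synthetic values are hypothetical stand-ins for the real transaction data, which has one row per month and 13 target columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical layout: one row per month of historical data,
# first two columns are year and month, the remaining 13 are targets.
dates = pd.period_range("2011-01", "2019-12", freq="M")
df = pd.DataFrame({"year": dates.year, "month": dates.month})
for j in range(13):                       # 13 target feature columns
    df[f"feat_{j}"] = rng.normal(size=len(dates))

X = df[["year", "month"]].values          # input features: year and month
Y = df.iloc[:, 2:]                        # targets: the 13 feature columns
num_of_index = Y.shape[1]                 # one GBDT model per target column
```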

import joblib
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

def GBDT_train(X, Y):
    # `num_of_index` (number of target columns) and `name` (model filename
    # prefix) are module-level globals set elsewhere.
    for i in range(num_of_index):  # train one model per target column
        # train_test_split defaults to 75% for training and 25% for testing
        x_train, x_test, y_train, y_test = train_test_split(X, Y.iloc[:, i])

        '''GradientBoostingRegressor parameters
          @n_estimators: number of sub-models (default 100)
          @max_depth: maximum tree depth (default 3)
          @min_samples_split: minimum number of samples required to split a node
          @learning_rate: the learning rate
        '''
        gbr = GradientBoostingRegressor(n_estimators=200, max_depth=2,
                                        min_samples_split=2, learning_rate=0.1)
        gbr.fit(x_train, y_train)
        joblib.dump(gbr, name + "train_model_" + str(i) + "_result.m")  # save the model

        # score() returns R^2 for a regressor, not classification accuracy
        acc_train = gbr.score(x_train, y_train)
        acc_test = gbr.score(x_test, y_test)
        print(name + "train_model_" + str(i) + "_result.m" + " training R^2:", acc_train)
        print(name + "train_model_" + str(i) + "_result.m" + " validation R^2:", acc_test)
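To see one iteration of the training loop end to end, here is a self-contained run on synthetic data (the feature/target values are made up; only the split and the estimator settings match the post):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))              # stand-in for (year, month)
y = 2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=200)

# Default split: 75% training / 25% testing
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbr = GradientBoostingRegressor(n_estimators=200, max_depth=2,
                                min_samples_split=2, learning_rate=0.1)
gbr.fit(x_train, y_train)
print(round(gbr.score(x_test, y_test), 3))         # R^2 on the held-out 25%
```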

 

3. Model prediction

Prediction just uses the model's built-in predict method, so there is not much to say here.

import numpy as np
import joblib

# Load the saved models and predict
def GBDT_Predict():
    X_Pred = np.reshape([2020, 4], (1, -1))  # predict for April 2020
    print("Predicting: 2020-4")
    for i in range(num_of_index):
        gbr = joblib.load(name + "train_model_" + str(i) + "_result.m")  # load the model
        test_y = gbr.predict(X_Pred)
        test_y = np.reshape(test_y, (1, -1))
        print(test_y)
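The save/load/predict round trip above can be tested in isolation. A minimal sketch on synthetic data (the temp-file path and the target formula are assumptions; the `.m` filename suffix mirrors the post):

```python
import os
import tempfile
import numpy as np
import joblib
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(2011, 2021, 100),   # year
                     rng.integers(1, 13, 100)])       # month
y = X[:, 0] - 2011 + X[:, 1] / 12 + rng.normal(scale=0.1, size=100)

path = os.path.join(tempfile.gettempdir(), "train_model_0_result.m")
gbr = GradientBoostingRegressor(n_estimators=50).fit(X, y)
joblib.dump(gbr, path)                                # save the fitted model

loaded = joblib.load(path)                            # load it back
X_pred = np.reshape([2020, 4], (1, -1))               # predict for April 2020
print(loaded.predict(X_pred))
```

The loaded model is byte-for-byte the same estimator, so its predictions match the original exactly.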

 

4. Experimental results

The experimental results are reasonably good, and adding other feature columns might further improve the fit and the model's generalization ability.

The code and data are here!


Origin blog.csdn.net/qq_39463175/article/details/106296391