TPOT Automatically Selects Machine Learning Models and Parameters -- A Regression Example

The previous two posts covered installing the tpot library under Anaconda and using tpot for classification; this one works through a regression example.

Installing the tpot library under Anaconda

Using TPOT to automatically select scikit-learn machine learning models and parameters -- a classification example

Environment: Windows 10 + PyCharm + Anaconda

Dataset: the Boston housing dataset that ships with sklearn
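
(Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the code below needs an older scikit-learn. If yours no longer ships it, one workaround is to fetch the same data from OpenML; a minimal sketch, assuming the OpenML dataset named "boston" and network access:)

    from sklearn.datasets import fetch_openml

    # the classic Boston housing data: 13 features, MEDV as the target
    boston = fetch_openml(name="boston", version=1, as_frame=False)
    X, y = boston.data, boston.target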

Code:

    '''
    Regression: predict Boston house prices
    '''
    from tpot import TPOTRegressor
    from sklearn.datasets import load_boston
    from sklearn.model_selection import train_test_split

    housing = load_boston()
    # train_test_split defaults to train_size=0.75, test_size=0.25
    X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target)

    tpot = TPOTRegressor(generations=20, verbosity=2)  # search for 20 generations
    tpot.fit(X_train, y_train)
    # score() uses TPOT's own scoring function (default: neg_mean_squared_error)
    print(tpot.score(X_test, y_test))
    tpot.export('pipeline.py')  # write the best pipeline out as Python code

Output:


Best pipeline: XGBRegressor(RidgeCV(input_matrix), learning_rate=0.1, max_depth=5, min_child_weight=2, n_estimators=100, nthread=1, subsample=0.8)

As you can see, the model TPOT settled on uses XGBRegressor, with RidgeCV stacked in front of it as a feature-generating step (that is what RidgeCV(input_matrix) inside the expression means).
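
For reference, the pipeline.py written by tpot.export() wraps this expression in a scikit-learn Pipeline via TPOT's StackingEstimator. A rough sketch of what it contains (the exact template varies by TPOT version, and the train/test variables here just reuse the split from above):

    from sklearn.linear_model import RidgeCV
    from sklearn.pipeline import make_pipeline
    from tpot.builtins import StackingEstimator
    from xgboost import XGBRegressor

    # StackingEstimator appends RidgeCV's predictions to the feature matrix;
    # XGBRegressor is then trained on the augmented features.
    exported_pipeline = make_pipeline(
        StackingEstimator(estimator=RidgeCV()),
        XGBRegressor(learning_rate=0.1, max_depth=5, min_child_weight=2,
                     n_estimators=100, nthread=1, subsample=0.8)
    )
    exported_pipeline.fit(X_train, y_train)
    results = exported_pipeline.predict(X_test)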

Prediction code:

    import pandas as pd
    import numpy as np
    from sklearn.datasets import load_boston
    from sklearn.model_selection import train_test_split
    import xgboost as xgb

    housing = load_boston()
    # rebuild the dataset as a DataFrame with named feature and target columns
    da = pd.DataFrame(housing.data)
    da.columns = housing.feature_names
    ta = pd.DataFrame(housing.target)
    ta.columns = ['target']
    # remember: axis=0 operates on the index, axis=1 operates on the columns
    boston = pd.concat([da, ta], axis=1)

    features = np.array(boston.drop(['target'], axis=1))
    target = np.array(boston['target'])
    train_features, test_features, train_target, test_target = \
        train_test_split(features, target, random_state=42)

    # the XGBRegressor hyperparameters from TPOT's best pipeline
    xgbr = xgb.XGBRegressor(learning_rate=0.1, max_depth=5, min_child_weight=2,
                            n_estimators=100, nthread=1, subsample=0.8)
    xgbr.fit(train_features, train_target)
    result = xgbr.score(test_features, test_target)  # R^2 for sklearn-style regressors
    print("xgbr_result: %s" % result)

Output:

This result is not great.
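
Since score() on an sklearn-style regressor is plain R², it can also help to look at the error in the target's own units. A minimal sketch, reusing xgbr and the split from the code above (MEDV is in units of $1000s):

    from sklearn.metrics import mean_squared_error
    import numpy as np

    pred = xgbr.predict(test_features)
    rmse = np.sqrt(mean_squared_error(test_target, pred))
    print("RMSE: %.3f" % rmse)  # average error in MEDV units ($1000s)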

Question:

Earlier I ran tpot for 5 and 10 generations respectively and exported the models; I did not understand why the CV score printed during the run was negative.

5 generations:


10 generations:


Below is the prediction code based on the 10-generation model:

    from sklearn.ensemble import GradientBoostingRegressor

    # hyperparameters from the 10-generation TPOT run; reuses the
    # train/test split built in the prediction code above
    gdbt = GradientBoostingRegressor(alpha=0.9, learning_rate=0.1, loss='huber',
                                     max_depth=7, max_features=0.4,
                                     min_samples_leaf=3, min_samples_split=8,
                                     n_estimators=100, subsample=0.9000000000000001)
    gdbt.fit(train_features, train_target)
    result2 = gdbt.score(test_features, test_target)  # R^2, same split as above
    print("gdbt_result: %s" % result2)

Output:


You can see that the 20-generation model's predictions are slightly better. Even 20 generations is far from enough; tpot's default is 100 generations. In principle, the more generations, the better the resulting model, especially when the dataset is large, but the search also becomes very time-consuming.
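
For longer searches, a few TPOTRegressor knobs are worth knowing. A sketch with illustrative values (these are real parameters, but tune them to your hardware and time budget):

    from tpot import TPOTRegressor

    tpot = TPOTRegressor(generations=100,      # the default number of generations
                         population_size=100,  # pipelines evaluated per generation
                         n_jobs=-1,            # use all CPU cores
                         max_time_mins=60,     # optionally cap by wall-clock time
                         random_state=42,      # make the search reproducible
                         verbosity=2)
    tpot.fit(X_train, y_train)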

Question: why is the CV score negative? The reason is that TPOTRegressor scores regression pipelines with scikit-learn's neg_mean_squared_error by default: the MSE is negated so that "higher is better" holds for every metric, which makes regression CV scores come out negative, with values closer to 0 being better.
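
If you would rather see positive numbers during the search, pass a different scoring metric. A minimal sketch:

    from tpot import TPOTRegressor

    # R^2 is bounded above by 1 and positive for any reasonable fit,
    # so the per-generation CV scores will no longer look negative.
    tpot = TPOTRegressor(generations=5, verbosity=2, scoring='r2')
    tpot.fit(X_train, y_train)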

References:

TPOT API documentation

TPOT GitHub repository


Reposted from blog.csdn.net/tony_stark_wang/article/details/79886030