Reference articles:
https://wang-shuo.github.io/2017/02/21/%E5%9C%A8Windows%E4%B8%8B%E5%AE%89%E8%A3%85XGBoost/
https://www.ibm.com/developerworks/community/blogs/jfp/entry/Installing_XGBoost_For_Anaconda_on_Windows?lang=zh
http://blog.csdn.net/u012344939/article/details/68064084
XGBoost is an optimized implementation of the gradient boosting algorithm and has performed brilliantly in Kaggle competitions. Below is the process for installing XGBoost on Windows. My environment: Windows 7, 64-bit, Python 2.7, Anaconda2.
1. Software Installation
To use XGBoost from Python on Windows, you first need to install three pieces of software: Python, Git, and MinGW.
1.1 Installing Python and Git
For Python, download the version you want from the official Python website. There are several options for installing Git; one is Git for Windows (https://gitforwindows.org/), which can be installed with the default options.
1.2 Downloading XGBoost
Once Git is installed, a program called Git Bash appears in the Start menu. Opening it brings up a window similar to the Windows command line. In this Bash window, first use the cd command to enter the folder where you want to keep the XGBoost code, for example:
- cd C:/Users/Administrator.ZGC-20150403SJZ
Then download xgboost with the following command:
- git clone --recursive https://github.com/dmlc/xgboost
If the clone fails with an "RPC failed" error, one workaround is to run git init, then enter the following commands:
- cd xgboost
- git submodule init
- git submodule update
1.3 Installing MinGW-W64
The next step is to compile the XGBoost code you just downloaded, which requires MinGW-W64. Download its installer, then double-click it to install. When the installer window appears, click Next.
On the Architecture option, select x86_64 and leave the other options at their defaults.
Then click through the remaining steps to finish the installation.
I used the default install path C:\Program Files\mingw-w64\x86_64-6.3.0-posix-seh-rt_v5-rev1. The make command and runtime libraries are then in the following folder (the one containing mingw32-make): C:\Program Files\mingw-w64\x86_64-6.3.0-posix-seh-rt_v5-rev1\mingw64\bin. Next, add this path to the system Path environment variable.
After that, close the Git Bash window and reopen it. To confirm the environment variable was added successfully, type the following command in Bash:
- which mingw32-make
If it was added successfully, it should print the full path to mingw32-make (inside the mingw64\bin folder above).
For convenience, you can alias the mingw32-make command to make:
- alias make='mingw32-make'
2. Compiling XGBoost
Now you can start compiling XGBoost. First, enter the xgboost folder:
- cd F:/tools/xgboost
Use the following commands to build the submodules separately, one at a time. Note: wait for each command to finish compiling before entering the next one.
- cd dmlc-core
- make -j4
- cd ../rabit
- make lib/librabit_empty.a -j4
- cd ..
- cp make/mingw64.mk config.mk
- make -j4
Once the last command finishes, the whole build is complete.
Next, install the Python xgboost module under Anaconda. Open the Anaconda Prompt, change into the python-package subfolder of the XGBoost directory, and run:
- cd xgboost/python-package
- python setup.py install
Finally, I ran local code that calls xgboost, and it ran successfully:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

################### 1 original data ##################
train_df = pd.read_csv('input/train.csv', index_col=0)
test_df = pd.read_csv('input/test.csv', index_col=0)
print("type of train_df:" + str(type(train_df)))
print("shape of train_df:" + str(train_df.shape))
print("shape of test_df:" + str(test_df.shape))
train_df.head()

################### 2 smooth label ##################
prices = pd.DataFrame({"price": train_df["SalePrice"],
                       "log(price+1)": np.log1p(train_df["SalePrice"])})
print("shape of prices:" + str(prices.shape))
prices.hist()
plt.show()

y_train = np.log1p(train_df.pop('SalePrice'))
print("shape of y_train:" + str(y_train.shape))

################### 3 concatenate train and test data ##################
all_df = pd.concat((train_df, test_df), axis=0)
print("shape of all_df:" + str(all_df.shape))

################### 4 convert the categorical column to string ##################
print(all_df['MSSubClass'].dtypes)
all_df['MSSubClass'] = all_df['MSSubClass'].astype(str)
print(all_df['MSSubClass'].value_counts())

################### 5 one-hot encode and fill nulls ##################
all_dummy_df = pd.get_dummies(all_df)
print(all_dummy_df.head())
print(all_dummy_df.isnull().sum().sort_values(ascending=False).head())
mean_cols = all_dummy_df.mean()
print(mean_cols.head(10))
all_dummy_df = all_dummy_df.fillna(mean_cols)
print(all_dummy_df.isnull().sum().sum())

################### 6 standardize numeric columns ##################
numeric_cols = all_df.columns[all_df.dtypes != 'object']
print(numeric_cols)
numeric_col_means = all_dummy_df.loc[:, numeric_cols].mean()
numeric_col_std = all_dummy_df.loc[:, numeric_cols].std()
all_dummy_df.loc[:, numeric_cols] = (all_dummy_df.loc[:, numeric_cols] - numeric_col_means) / numeric_col_std

################### 7 train the model ##################
dummy_train_df = all_dummy_df.loc[train_df.index]
dummy_test_df = all_dummy_df.loc[test_df.index]
print("shape of dummy_train_df:" + str(dummy_train_df.shape))
print("shape of dummy_test_df:" + str(dummy_test_df.shape))

X_train = dummy_train_df.values
X_test = dummy_test_df.values

params = [1, 2, 3, 4, 5, 6]
test_scores = []
for param in params:
    clf = XGBRegressor(max_depth=param)
    test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10,
                                          scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))
plt.plot(params, test_scores)
plt.title("max_depth vs CV Error")
plt.show()
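One detail worth noting: because the target is transformed with np.log1p before training, any predictions on X_test must be mapped back to the original price scale with np.expm1, the inverse of log1p. A minimal sketch of this round trip (the prices here are made-up illustration values):

```python
import numpy as np

# Forward transform used in the script above: y = log(price + 1)
prices = np.array([100000.0, 250000.0])
y = np.log1p(prices)

# Inverse transform applied to model predictions before submission
recovered = np.expm1(y)
print(np.allclose(recovered, prices))  # True
```

Without this inverse step, the model's outputs would be log-scale values rather than prices.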