Reference articles:
https://wang-shuo.github.io/2017/02/21/%E5%9C%A8Windows%E4%B8%8B%E5%AE%89%E8%A3%85XGBoost/
https://www.ibm.com/developerworks/community/blogs/jfp/entry/Installing_XGBoost_For_Anaconda_on_Windows?lang=zh
http://blog.csdn.net/u012344939/article/details/68064084
XGBoost is an optimized implementation of the gradient boosting algorithm and has performed brilliantly in Kaggle competitions. Below is the process for installing XGBoost on Windows. My environment: Windows 7, 64-bit, Python 2.7, Anaconda2.
1. Software Installation
To use XGBoost from Python on Windows, you first need to install three pieces of software: Python, Git, and MinGW.
1.1 Installing Python and Git
For Python, download the version you want from the official Python website. There are several options for installing Git; one is Git for Windows (https://gitforwindows.org/), which can be installed with the default options.
1.2 Downloading XGBoost
Once Git is installed, a program called Git Bash appears in the Start menu. Opening it brings up a window similar to the Windows command line. In this Bash window, first use the cd command to enter the folder where you want to keep the XGBoost code, for example:
- cd C:/Users/Administrator.ZGC-20150403SJZ
Then download xgboost with the following command:
- git clone --recursive https://github.com/dmlc/xgboost
If the clone fails with an "RPC failed" error, one workaround is to run git init, then enter the following commands:
- cd xgboost
- git submodule init
- git submodule update
1.3 Installing MinGW-W64
The next step is to compile the XGBoost code you just downloaded, which requires MinGW-W64. Download its installer, then double-click it to install. When the installer window appears, click Next.
On the Architecture option, select x86_64 and leave the other options at their defaults.
Then click through the remaining steps to finish the installation.
I used the default install path C:\Program Files\mingw-w64\x86_64-6.3.0-posix-seh-rt_v5-rev1. The make command and runtime libraries are then in the following folder (the one containing mingw32-make): C:\Program Files\mingw-w64\x86_64-6.3.0-posix-seh-rt_v5-rev1\mingw64\bin. Next, add this path to the system Path environment variable.
After that, close the Git Bash window and reopen it. To confirm the environment variable was added successfully, type the following command in Bash:
- which mingw32-make
If it was added successfully, it should print the full path to mingw32-make (inside the mingw64\bin folder above).
For convenience, you can alias the mingw32-make command to make:
- alias make='mingw32-make'
2. Compiling XGBoost
Now you can start compiling XGBoost. First, enter the xgboost folder:
- cd F:/tools/xgboost
Use the following commands to build the submodules separately, one at a time. Note: wait for each command to finish compiling before entering the next one.
- cd dmlc-core
- make -j4
- cd ../rabit
- make lib/librabit_empty.a -j4
- cd ..
- cp make/mingw64.mk config.mk
- make -j4
Once the last command finishes, the whole build is complete.
Next, install the Python xgboost module under Anaconda. Open the Anaconda Prompt, change into the python-package subfolder of the XGBoost directory, and run:
- cd xgboost/python-package
- python setup.py install
Finally, I ran local code that calls xgboost, and it ran successfully:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

################### 1 original data ##################
train_df = pd.read_csv('input/train.csv', index_col=0)
test_df = pd.read_csv('input/test.csv', index_col=0)
print("type of train_df:" + str(type(train_df)))
print("shape of train_df:" + str(train_df.shape))
print("shape of test_df:" + str(test_df.shape))
train_df.head()

################### 2 smooth label ##################
prices = pd.DataFrame({"price": train_df["SalePrice"],
                       "log(price+1)": np.log1p(train_df["SalePrice"])})
print("shape of prices:" + str(prices.shape))
prices.hist()
plt.show()

y_train = np.log1p(train_df.pop('SalePrice'))
print("shape of y_train:" + str(y_train.shape))

################### 3 concatenate train and test data ##################
all_df = pd.concat((train_df, test_df), axis=0)
print("shape of all_df:" + str(all_df.shape))

################### 4 convert the categorical column to string ##################
print(all_df['MSSubClass'].dtypes)
all_df['MSSubClass'] = all_df['MSSubClass'].astype(str)
print(all_df['MSSubClass'].value_counts())

################### 5 one-hot encode and fill nulls ##################
all_dummy_df = pd.get_dummies(all_df)
print(all_dummy_df.head())
print(all_dummy_df.isnull().sum().sort_values(ascending=False).head())
mean_cols = all_dummy_df.mean()
print(mean_cols.head(10))
all_dummy_df = all_dummy_df.fillna(mean_cols)
print(all_dummy_df.isnull().sum().sum())

################### 6 standardize numeric columns ##################
numeric_cols = all_df.columns[all_df.dtypes != 'object']
print(numeric_cols)
numeric_col_means = all_dummy_df.loc[:, numeric_cols].mean()
numeric_col_std = all_dummy_df.loc[:, numeric_cols].std()
all_dummy_df.loc[:, numeric_cols] = (all_dummy_df.loc[:, numeric_cols] - numeric_col_means) / numeric_col_std

################### 7 train the model ##################
dummy_train_df = all_dummy_df.loc[train_df.index]
dummy_test_df = all_dummy_df.loc[test_df.index]
print("shape of dummy_train_df:" + str(dummy_train_df.shape))
print("shape of dummy_test_df:" + str(dummy_test_df.shape))

X_train = dummy_train_df.values
X_test = dummy_test_df.values

params = [1, 2, 3, 4, 5, 6]
test_scores = []
for param in params:
    clf = XGBRegressor(max_depth=param)
    test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10,
                                          scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))
plt.plot(params, test_scores)
plt.title("max_depth vs CV Error")
plt.show()
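One detail worth noting: because the target is transformed with np.log1p before training, any predictions on X_test must be mapped back to the original price scale with np.expm1, the inverse of log1p. A minimal sketch of this round trip (the prices here are made-up illustration values):

```python
import numpy as np

# Forward transform used in the script above: y = log(price + 1)
prices = np.array([100000.0, 250000.0])
y = np.log1p(prices)

# Inverse transform applied to model predictions before submission
recovered = np.expm1(y)
print(np.allclose(recovered, prices))  # True
```

Without this inverse step, the model's outputs would be log-scale values rather than prices.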