Installing XGBoost under Anaconda and Troubleshooting (environment: Windows 7, 64-bit, Python 2.7, Anaconda2)

References:

https://wang-shuo.github.io/2017/02/21/%E5%9C%A8Windows%E4%B8%8B%E5%AE%89%E8%A3%85XGBoost/

https://www.ibm.com/developerworks/community/blogs/jfp/entry/Installing_XGBoost_For_Anaconda_on_Windows?lang=zh

http://blog.csdn.net/u012344939/article/details/68064084

XGBoost is an optimized implementation of the gradient boosting algorithm and has performed remarkably well in Kaggle competitions. Below is the installation process for XGBoost on Windows; my environment is Windows 7, 64-bit, Python 2.7, Anaconda2.

1. Software Installation

To use XGBoost from Python on Windows, you first need to install three pieces of software: Python, Git, and MinGW-W64.

1.1 Installing Python and Git

For Python, download the version you want from the official Python website. For Git there are several options; one is Git for Windows (https://gitforwindows.org/), which can be installed with the default options.

1.2 Downloading XGBoost

After Git is installed, a program called Git Bash appears in the Start menu. Opening it brings up a window similar to the Windows command prompt. In this Bash window, first use cd to change into the folder where you want to keep the XGBoost source, for example:

  cd C:/Users/Administrator.ZGC-20150403SJZ

Download xgboost with the following command:

  git clone --recursive https://github.com/dmlc/xgboost


If the clone fails with an "RPC failed" error, running git init and then repeating the clone worked for me.


Then run the following commands:

  cd xgboost
  git submodule init
  git submodule update


1.3 Installing MinGW-W64

The next step is to compile the XGBoost source you just downloaded, which requires MinGW-W64. Download its installer from the MinGW-W64 download page, double-click it to start the installation, and click Next on the first screen:
[Screenshot: MinGW-W64 installer, first screen]

On the settings screen, set Architecture to x86_64 and leave the other options at their defaults, as shown below:
[Screenshot: MinGW-W64 installer, Architecture set to x86_64]

Then click Next through the remaining screens to finish the installation.


I used the default installation path C:\Program Files\mingw-w64\x86_64-6.3.0-posix-seh-rt_v5-rev1. The make command and runtime libraries are then in C:\Program Files\mingw-w64\x86_64-6.3.0-posix-seh-rt_v5-rev1\mingw64\bin (the folder containing mingw32-make). The next step is to add this path to the system Path environment variable.



After the steps above, close the Git Bash window and reopen it. To confirm that the environment variable was added successfully, type the following command in Bash:

  which mingw32-make

If the path was added successfully, the command should print the location of mingw32-make, something like /c/Program Files/mingw-w64/x86_64-6.3.0-posix-seh-rt_v5-rev1/mingw64/bin/mingw32-make.


For convenience, you can alias the mingw32-make command to make:

  alias make='mingw32-make'


2. Compiling XGBoost

Now you can start compiling XGBoost. First, change into the xgboost folder:

  cd F:/tools/xgboost


Use the following commands to build the components separately, one submodule at a time. Note that you must wait for each command to finish before typing the next one.

  cd dmlc-core
  make -j4
  cd ../rabit
  make lib/librabit_empty.a -j4
  cd ..
  cp make/mingw64.mk config.mk
  make -j4


Once the last command finishes, the whole build is complete.

Next, install the Python xgboost module under Anaconda. Open an Anaconda Prompt, change into the python-package subfolder of the XGBoost folder, and run:

  cd xgboost/python-package
  python setup.py install
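To verify the installation, a quick sanity check from Python confirms that the module imports and can train a model. This is only a minimal sketch: the tiny random dataset and the XGBRegressor parameters below are made up for illustration.

import numpy as np
import xgboost as xgb

# Confirm the package is importable and report its version
print(xgb.__version__)

# Tiny made-up regression problem, only to check that training works
X = np.random.rand(20, 3)
y = X.sum(axis=1)

model = xgb.XGBRegressor(n_estimators=10, max_depth=2)
model.fit(X, y)
print(model.predict(X[:3]))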



Finally, I ran a local script of mine that uses xgboost (a house-price regression example), and it ran successfully:

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
# %matplotlib inline  # only needed when running inside a Jupyter notebook

################### 1 original data ##################
train_df = pd.read_csv('input/train.csv', index_col=0)
test_df = pd.read_csv('input/test.csv', index_col=0)

print("type of train_df:" + str(type(train_df)))
#print(train_df.columns)
print("shape of train_df:" + str(train_df.shape))
print("shape of test_df:" + str(test_df.shape))

train_df.head()
#print(train_df.head())

##############################2 smooth label#################################
prices = pd.DataFrame({"price":train_df["SalePrice"], "log(price+1)":np.log1p(train_df["SalePrice"])})
print("shape of prices:" + str(prices.shape))
prices.hist()
# plt.plot(alphas, test_scores)
# plt.title("Alpha vs CV Error")
plt.show()

y_train = np.log1p(train_df.pop('SalePrice'))
print("shape of y_train:" + str(y_train.shape))

######################3 take train and test data together################
all_df = pd.concat((train_df, test_df), axis=0)
print("shape of all_df:" + str(all_df.shape))

###################### 4 convert categorical column to string ##########################
print(all_df['MSSubClass'].dtypes)
all_df['MSSubClass'] = all_df['MSSubClass'].astype(str)
all_df['MSSubClass'].value_counts()
print(all_df['MSSubClass'].value_counts())
 
#####################5 fill null#############################
all_dummy_df = pd.get_dummies(all_df)
print(all_dummy_df.head())
print(all_dummy_df.isnull().sum().sort_values(ascending=False).head())
 
mean_cols = all_dummy_df.mean()
print(mean_cols.head(10))
 
all_dummy_df = all_dummy_df.fillna(mean_cols)
print(all_dummy_df.isnull().sum().sum())
 
############### 6 standardize numeric cols ########################
numeric_cols = all_df.columns[all_df.dtypes != 'object']
print(numeric_cols)
 
numeric_col_means = all_dummy_df.loc[:, numeric_cols].mean()
numeric_col_std = all_dummy_df.loc[:, numeric_cols].std()
all_dummy_df.loc[:, numeric_cols] = (all_dummy_df.loc[:, numeric_cols] - numeric_col_means) / numeric_col_std
 
###############7 train model################################
dummy_train_df = all_dummy_df.loc[train_df.index]
dummy_test_df = all_dummy_df.loc[test_df.index]
print("shape of dummy_train_df:" + str(dummy_train_df))
print("shape of dummy_test_df:" + str(dummy_test_df))
 
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
 
X_train = dummy_train_df.values
X_test = dummy_test_df.values

from xgboost import XGBRegressor
params = [1,2,3,4,5,6]
test_scores = []
for param in params:
    clf = XGBRegressor(max_depth=param)
    test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))

plt.plot(params, test_scores)
plt.title("max_depth vs CV Error")
plt.show()
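
After inspecting the CV curve, the usual next step is to refit a single XGBRegressor with the chosen max_depth and predict on the test set. The snippet below continues the script above and is only a sketch of that step: max_depth=5 is a placeholder rather than a result reported here, and np.expm1 simply undoes the earlier np.log1p transform.

best_max_depth = 5  # placeholder; use the value with the lowest CV error in the plot
xgb_model = XGBRegressor(max_depth=best_max_depth)
xgb_model.fit(X_train, y_train)

# Predictions are on the log scale, so invert the log1p transform
predictions = np.expm1(xgb_model.predict(X_test))
print(predictions[:5])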


Reprinted from blog.csdn.net/weixin_41770169/article/details/79548005