机器学习算法--回归2--岭回归与IASSO回归

主要讨论岭回归 IASSO回归以及前向逐步回归。

之前解决线性回归时，得到最终解：

当XTX可逆时，可以直接求解，当XTX不可逆时（特征属性数大于样本数），可以采用岭回归以及IASSO回归解决，其实就是相当于引入了正则化。

1.岭回归

普通线性回归目标函数为：

引入岭回归之后，其实相当于添加L2正则化即

进行矩阵变换得：

E=(Y-XW)T*(Y-XW)+ $\lambda$ WTW

求导等于得W=：

由于XTX+ $\lambda$ I肯定可逆，因此可以直接矩阵求逆求解。

直接矩阵求逆求解上式值：

def ridgeRegres(xMat,yMat,lam=0.2):
    xTx = xMat.T*xMat
    denom = xTx + eye(shape(xMat)[1])*lam
    if linalg.det(denom) == 0.0:
        print ("This matrix is singular, cannot do inverse")
        return
    ws = denom.I * (xMat.T*yMat)
    return ws

测试不同 $\lambda$ 取值，对于回归系数的影响情况

def ridgeTest(xArr,yArr):
    xMat = mat(xArr); yMat=mat(yArr).T
	#压缩行，对于矩阵的各列求均值
    yMean = mean(yMat,0)
    yMat = yMat - yMean     #to eliminate X0 take mean off of Y
    #regularize X's  对于X的各列求均值
    xMeans = mean(xMat,0)   #calc mean then subtract it off
	#取0求样本方差的无偏估计值（除以N-1；对应取1求得的是方差（除以N）
    xVar = var(xMat,0)      #calc variance of Xi then divide by it
    xMat = (xMat - xMeans)/xVar
    numTestPts = 30
    wMat = zeros((numTestPts,shape(xMat)[1]))
    for i in range(numTestPts):
        ws = ridgeRegres(xMat,yMat,exp(i-10))
        wMat[i,:]=ws.T
    return wMat

其中mean函数：

numpy.mean(a, axis, dtype, out，keepdims ) axis参数设置：

axis 不设置值，对 m*n 个数求均值，返回一个实数
axis = 0：压缩行，对各列求均值，返回 1* n 矩阵
axis =1 ：压缩列，对各行求均值，返回 m *1 矩阵

var函数：var(X,W,dim) 求解方差

W 取0求样本方差的无偏估计值除以N-1；对应取1求得的是方差（除以N）

dim =1 对每列操作 dim = 2 对每行操作

样本均值，方差以及协方差的求解如下：

所有特征减去均值并除以方差来使每维特征具有相同的重要性，y值需要减去均值进行处理。

最终返回不同 $\lambda$ 参数时，回归系数矩阵。

def plotridge():
    x, y = loadDataSet('abalone.txt')
    rw = ridgeTest(x, y)
    import matplotlib.pyplot as plt
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.plot(rw)
    plt.show()

当 $\lambda$ 较小时，回归系数和线性回归基本一致，当 $\lambda$ 逐渐变大时，回归系数逐渐减为零。因此在中间某个位置，存在最优值 $\lambda$ 。

2.IASSO回归

IASSO回归，相当于引入L1正则化即

当 $\lambda$ 较大时，一些系数会被迫缩减至0，因此易产生稀疏解。

直接求解难以求解，可以使用前向逐步回归来代替，前向逐步回归属于一种贪心算法，每一步都尽可能减小误差。

数据标准化，使其分布满足0均值和单位方差

每轮迭代中：

设置当前最小误差lowerestError为正无穷

对每个特征：

增大或者减小：

改变一个系数得到一个新的W

计算新W下的误差

如果误差小于lowerestError,设置Wbest等于当前W

将W设置为新的Wbest

数据标准化处理：

def regularize(xMat):#regularize by columns
    inMat = xMat.copy()
    inMeans = mean(inMat,0)   #calc mean then subtract it off
    inVar = var(inMat,0)      #calc variance of Xi then divide by it
    inMat = (inMat - inMeans)/inVar
    return inMat

逐步前向算法：

def stageWise(xArr,yArr,eps=0.01,numIt=100):
    xMat = mat(xArr); yMat=mat(yArr).T
    yMean = mean(yMat,0)
    yMat = yMat - yMean     #can also regularize ys but will get smaller coef
    xMat = regularize(xMat)
    m,n=shape(xMat)
    returnMat = zeros((numIt,n)) #testing code remove
    ws = zeros((n,1)); wsTest = ws.copy(); wsMax = ws.copy()
    for i in range(numIt):
        print (ws.T)
        lowestError = inf; 
        for j in range(n):
            for sign in [-1,1]:
                wsTest = ws.copy()
                wsTest[j] += eps*sign
                yTest = xMat*wsTest
                rssE = rssError(yMat.A,yTest.A)
                if rssE < lowestError:
                    lowestError = rssE
                    wsMax = wsTest
        ws = wsMax.copy()
        returnMat[i,:]=ws.T
    return returnMat

运行，绘出回归系数变化图：

def plotstageWise():
    x, y = loadDataSet('abalone.txt')
    rw = stageWise(x, y, 0.005, 1000)
    import matplotlib.pyplot as plt
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.plot(rw)
    plt.show()

可知当迭代一定次数时，会发现有些系数接近于0，说明这些系数对应的特征对于结果并不重要，即说明解的稀疏性，只有部分属性特征对于结果会产生重要影响。

当使用缩减方式时，模型增加了偏差（防止了过拟合），减小了方差。

3.偏差方差分解

测试样本x yD表示样本在数据中的标记 y表示样本的真实标记 f（x：D）为模型在训练集D上学得的模型f在x上预测输出

期望预测为：

样本数相同训练集不同的方差为：

噪声为：

期望输出与真实标记差值为偏差：

假定噪声期望为零：

泛化误差：

即泛化误差等于偏差，方差以及噪声之和。

偏差：度量学习任务的期望预测与真实结果的偏离程度刻画学习算法本身的学习能力；

方差：度量训练集的变动对于学习性能的影响，刻画数据扰动造成的影响；

噪声：度量泛化误差的下界，刻画学习问题本身的难度。

训练不足时，拟合程度不够，数据扰动不足以影响性能，此时偏差主导了泛化性能。

训练充分时，拟合程度加深，数据扰动会影响性能，此时方差主导了泛化性能。

因此当训练强度达到某一值时，泛化误差最小，性能最优。

4.实践

以鲍鱼数据为例：

加载数据集，使用线性回归来拟合：

利用交叉验证，使用岭回归来对数据进行拟合：

def crossValidation(xArr,yArr,numVal=10):
    m = len(yArr)                           
    indexList = range(m)
	#30个lamda的参数值相应错误率保存
    errorMat = zeros((numVal,30))#create error mat 30columns numVal rows
    for i in range(numVal):
        trainX=[]; trainY=[]
        testX = []; testY = []
		#划分训练集与测试集
        random.shuffle(list(indexList))
        for j in range(m):#create training set based on first 90% of values in indexList
            if j < m*0.9: 
                trainX.append(xArr[indexList[j]])
                trainY.append(yArr[indexList[j]])
            else:
                testX.append(xArr[indexList[j]])
                testY.append(yArr[indexList[j]])
		#进行训练
        wMat = ridgeTest(trainX,trainY)    #get 30 weight vectors from ridge
        for k in range(30):#loop over all of the ridge estimates
            matTestX = mat(testX); matTrainX=mat(trainX)
            meanTrain = mean(matTrainX,0)
            varTrain = var(matTrainX,0)
            matTestX = (matTestX-meanTrain)/varTrain #regularize test with training params
            yEst = matTestX * mat(wMat[k,:]).T + mean(trainY)#test ridge results and store
			#计算误差
            errorMat[i,k]=rssError(yEst.T.A,array(testY))
            #print errorMat[i,k]
    meanErrors = mean(errorMat,0)#calc avg performance of the different ridge weight vectors
    minMean = float(min(meanErrors))
    bestWeights = wMat[nonzero(meanErrors==minMean)]
    #can unregularize to get model
    #when we regularized we wrote Xreg = (x-meanX)/var(x)
    #we can now write in terms of x not Xreg:  x*w/var(x) - meanX/var(x) +meanY
    xMat = mat(xArr); yMat=mat(yArr).T
    meanX = mean(xMat,0); varX = var(xMat,0)
    unReg = bestWeights/varX
    print ("the best model from Ridge Regression is:\n",unReg)
    print ("with constant term: ",-1*sum(multiply(meanX,unReg)) + mean(yMat))