Using randomized logistic regression for feature selection, then building a logistic regression model on the selected features

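import numpy as np
from sklearn.linear_model import LogisticRegression as LR
# Note: RandomizedLogisticRegression was deprecated in scikit-learn 0.19 and removed in 0.21,
# so this import only works on older versions.
from sklearn.linear_model import RandomizedLogisticRegression as RLR

# Assumes `data` is a pandas DataFrame whose first 8 columns are the features,
# and that x (feature matrix) and y (labels) have already been extracted from it.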

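rlr = RLR()  # build the randomized logistic regression model used to select features
rlr.fit(x, y)  # train the model
rlr.get_support()  # get the feature selection result (a boolean mask)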

selected = np.array(data.iloc[:, :8].columns)[rlr.get_support()]  # names of the selected features
print('Selected features: %s' % ', '.join(selected))
x = data[selected].values  # keep only the selected features (as_matrix() was removed in pandas 1.0)

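lr = LR()  # build the logistic regression model
lr.fit(x, y)  # train the model on the selected features
print('Logistic regression model trained')
print('Mean accuracy of the model: %s' % lr.score(x, y))  # mean accuracy on the training data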
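
Because RandomizedLogisticRegression is no longer available in current scikit-learn releases, the code above only runs on older versions. The following is a minimal sketch of a comparable workflow on a current version. It is not stability selection: it swaps in SelectFromModel wrapped around an L1-penalized LogisticRegression. The synthetic data from make_classification, the feature names f0..f7, and the value C=0.1 are assumptions for illustration and stand in for the original `data`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the original data: 8 features, 3 of them informative (assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, n_clusters_per_class=1,
                           random_state=0)
feature_names = np.array(["f%d" % i for i in range(X.shape[1])])  # hypothetical names

# An L1 penalty drives some coefficients to exactly zero;
# SelectFromModel keeps the features whose coefficients stay non-zero.
selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1))
selector.fit(X, y)
mask = selector.get_support()
print("Selected features: %s" % ", ".join(feature_names[mask]))

# Train the final model on the selected columns only.
X_selected = X[:, mask]
lr = LogisticRegression()
lr.fit(X_selected, y)
print("Mean training accuracy: %s" % lr.score(X_selected, y))
```

A smaller C means a stronger penalty and therefore fewer selected features, so C is the knob to tune here.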

scikit-learn API:

sklearn.linear_model: generalized linear models

sklearn.linear_model.LogisticRegression: logistic regression classifier

Methods:

score(X, y[, sample_weight]) Returns the mean accuracy on the given test data and labels.

Parameters:

X: array-like, test samples; y: array-like, true labels for X.

sample_weight: optional, sample weights.

Returns:

score: float, mean accuracy of self.predict(X) with respect to y. Note that this is the accuracy of the model as a whole, not a score for each feature.
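
To make the meaning of score concrete, here is a small sketch that compares it with an accuracy computed by hand. The synthetic data and the sample weights are assumptions chosen only for the demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (assumption, for illustration only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)

# score() is the fraction of samples whose predicted label matches y.
accuracy_from_score = clf.score(X, y)
accuracy_by_hand = np.mean(clf.predict(X) == y)
print(accuracy_from_score, accuracy_by_hand)

# With sample_weight, the result is a weighted average of per-sample correctness.
weights = np.where(y == 1, 2.0, 1.0)  # hypothetical weights: class 1 counts double
weighted_from_score = clf.score(X, y, sample_weight=weights)
weighted_by_hand = np.average(clf.predict(X) == y, weights=weights)
print(weighted_from_score, weighted_by_hand)
```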

sklearn.linear_model.RandomizedLogisticRegression: randomized logistic regression

The official documentation describes randomized logistic regression as follows:

Randomized Logistic Regression works by subsampling the training data and fitting a L1-penalized LogisticRegression model where the penalty of a random subset of coefficients has been scaled. By performing this double randomization several times, the method assigns high scores to features that are repeatedly selected across randomizations. This is known as stability selection. In short, features selected more often are considered good features.

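Interpretation: the training data is subsampled many times and an L1-penalized regression model is fitted on each subsample, with the penalty on a random subset of the coefficients rescaled each time. In other words, the selection procedure is run repeatedly under different perturbations of the samples and of the features, and the features with high scores are kept in the end. This is the stability selection method. A feature gets a high score when it is selected frequently, that is, its score is the number of runs in which it was selected divided by the number of runs in which it was tested.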
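
The double randomization described above can be sketched by hand with a plain L1-penalized LogisticRegression. This is an illustration under stated assumptions, not the library's implementation: the function name stability_scores, its parameter values, the 0.25 cutoff, and the synthetic data are all chosen here for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def stability_scores(X, y, n_resampling=50, sample_fraction=0.75,
                     scaling=0.5, C=0.1, random_state=0):
    """Return, for each feature, the fraction of runs in which it was selected."""
    rng = np.random.RandomState(random_state)
    n_samples, n_features = X.shape
    n_sub = int(sample_fraction * n_samples)
    counts = np.zeros(n_features)
    for _ in range(n_resampling):
        # First randomization: draw a subsample of the rows without replacement.
        rows = rng.choice(n_samples, size=n_sub, replace=False)
        # Second randomization: shrink a random subset of columns by `scaling`.
        # Shrinking a column is equivalent to raising that coefficient's L1 penalty.
        weights = np.where(rng.rand(n_features) < 0.5, 1.0, scaling)
        model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        model.fit(X[rows] * weights, y[rows])
        # A feature counts as selected when its coefficient is non-zero.
        counts += np.abs(model.coef_[0]) > 1e-10
    return counts / n_resampling

# Synthetic data with 3 informative features out of 8 (assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, n_clusters_per_class=1,
                           random_state=0)
scores = stability_scores(X, y)
selected = scores > 0.25  # hypothetical cutoff on the selection frequency
print(scores)
print(selected)
```

Features that carry real signal tend to be picked in almost every run, while noise features are picked only occasionally, which is why thresholding the selection frequency works as a filter.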

Methods:

fit(X, y) Fit the model using X, y as training data.

Parameters:

X: array-like, training samples; y: array-like, target values, cast to the dtype of X if necessary.

Returns:

self: object, returns an instance of self.
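
Returning self is what allows calls to be chained. A short sketch, using LogisticRegression and synthetic data as assumptions because the same convention applies to scikit-learn estimators in general:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption, for illustration only).
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

model = LogisticRegression()
fitted = model.fit(X, y)
print(fitted is model)  # True: fit returns the estimator itself, not a copy

# Because fit returns self, fitting and predicting can be written in one line.
predictions = LogisticRegression().fit(X, y).predict(X[:5])
print(predictions)
```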

Methods:

get_support([indices=False]) Get a mask, or integer index, of the features selected.

Parameters:

indices: boolean, default False. If True, the return value will be an array of integers rather than a boolean mask.

Returns:

support: array, an index that selects the retained features from a feature vector.

If indices is False, this is a boolean array of shape [# input features], in which an element is True if and only if its corresponding feature is selected for retention.

If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
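
The two return forms of get_support can be compared on a tiny example. Since get_support is shared by scikit-learn's feature selectors, this sketch uses VarianceThreshold as a stand-in (an assumption made so the result is deterministic and runs on current versions); the small array is made up for the demonstration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Columns 0 and 2 are constant, columns 1 and 3 vary.
X = np.array([[0.0, 1.0, 5.0, 2.0],
              [0.0, 2.0, 5.0, 4.0],
              [0.0, 3.0, 5.0, 6.0]])

selector = VarianceThreshold()  # default threshold=0.0 drops constant columns
selector.fit(X)

mask = selector.get_support()                 # boolean mask, one entry per input feature
indices = selector.get_support(indices=True)  # integer positions of the kept features
print(mask)     # [False  True False  True]
print(indices)  # [1 3]

# Both forms select the same columns.
print(np.array_equal(X[:, mask], X[:, indices]))  # True
```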

References:

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RandomizedLogisticRegression.html#sklearn.linear_model.RandomizedLogisticRegression
