Scikit-learn and Algorithm Notes

See all APIs:
XGBoost's Python API includes an sklearn-style API:
Just do from xgboost import XGBClassifier, XGBRegressor and you can call fit and predict the same way as with any other sklearn estimator.
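A minimal sketch of this sklearn-style usage (assumes xgboost is installed; the synthetic data and hyperparameter values below are only for illustration):

from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy binary classification data, for demonstration only.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
clf.fit(X_train, y_train)           # same interface as any sklearn estimator
print(clf.predict(X_test[:5]))      # class predictions
print(clf.score(X_test, y_test))    # accuracy via the usual sklearn scoring convention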
Regression:
Linear Regression : # ordinary least squares
Ridge Regression : # ridge regression; the objective adds an L2 penalty on w
Lasso : # the objective adds an L1 penalty on w, with the loss scaled by the number of samples
# works well for linear fitting when the underlying model is sparse
MultiTaskLasso : # multi-task lasso; ordinary lasso recovers the coefficients w from y = X*w (y is the target, X the design matrix, w the coefficients), while in multi-task lasso y goes from one dimension to two, i.e. several targets are fitted jointly.

Elastic Net : # elastic net; penalizes w with both L1 and L2 at the same time, again with the loss scaled by the number of samples (a short usage sketch follows the list)
......
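A minimal sketch of fitting a few of the linear models above, assuming synthetic data and arbitrary alpha values:

from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.datasets import make_regression

# Sparse synthetic regression problem, for illustration only.
X, y = make_regression(n_samples=200, n_features=50, n_informative=5, noise=1.0, random_state=0)

for model in (LinearRegression(),
              Ridge(alpha=1.0),                       # L2 penalty on w
              Lasso(alpha=0.1),                       # L1 penalty on w
              ElasticNet(alpha=0.1, l1_ratio=0.5)):   # mix of L1 and L2
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))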
Bayesian Regression:

Logistic regression:
The form above is what the cross-entropy loss function reduces to after the transformation.
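A minimal sketch, assuming the point is that LogisticRegression minimizes a regularized cross-entropy (log loss); the C value is arbitrary:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, random_state=0)
clf = LogisticRegression(C=1.0).fit(X, y)    # C is the inverse of the regularization strength
# Up to the regularization term, the training objective is the cross-entropy below.
print(log_loss(y, clf.predict_proba(X)))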

Classification:
There are three SVM implementations: SVC, NuSVC, and LinearSVC. SVC is based on libsvm; NuSVC is similar to SVC but adds a parameter that controls the number of support vectors; LinearSVC is the implementation for the linear-kernel case (based on liblinear), and its parameters are more tunable.
For multi-class problems, SVC and NuSVC implement both one-vs-one and one-vs-rest, while LinearSVC implements only one-vs-rest.
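A hedged sketch putting the three classifiers side by side on toy data (the dataset and parameter values are placeholders):

from sklearn.svm import SVC, NuSVC, LinearSVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=10, n_classes=3, n_informative=5, random_state=0)

svc = SVC(kernel='rbf', C=1.0).fit(X, y)      # libsvm-based; multi-class handled one-vs-one internally
nusvc = NuSVC(nu=0.5).fit(X, y)               # nu gives a handle on the fraction of support vectors
linsvc = LinearSVC(C=1.0).fit(X, y)           # liblinear-based; multi-class is one-vs-rest
print(svc.predict(X[:5]), linsvc.predict(X[:5]))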
Regression problems:
There are three different implementations of Support Vector Regression: SVR, NuSVR, and LinearSVR. LinearSVR provides a faster implementation than SVR but only considers linear kernels, while NuSVR implements a slightly different formulation than SVR and LinearSVR. See Implementation details for further details.

SVR:
The optimization problem being solved:
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an R^2 score of 0.0.
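A minimal sketch showing that the score method of an SVR matches this R^2 definition (u as the residual sum of squares, v as the total sum of squares); the data and C value are arbitrary:

from sklearn.svm import SVR
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
reg = SVR(kernel='rbf', C=10.0).fit(X, y)

y_pred = reg.predict(X)
u = ((y - y_pred) ** 2).sum()        # residual sum of squares
v = ((y - y.mean()) ** 2).sum()      # total sum of squares
print(1 - u / v, reg.score(X, y))    # the two values agree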


Ensemble Methods
Besides Random Forest, the Forests of randomized trees module also provides the Extremely Randomized Trees method, via ExtraTreesClassifier. The paper for this classifier is:
P. Geurts, D. Ernst, and L. Wehenkel, “Extremely randomized trees”, Machine Learning, 63(1), 3-42, 2006. (Random Forest was published in 2001.)
The idea of Extremely Randomized Trees is as follows:

The difference from random forests is in how the split value is chosen. Here the method is Pick_a_random_split: for each of the K selected features, a cut-point is drawn uniformly at random (every value equally likely) from that feature's [min, max] range, and the best of the K resulting splits is kept as the final split node. (If there are K features, would it be better to draw several random cut-points per feature instead of just one???)
A bias/variance analysis of the Extra-Trees algorithm is also provided as well as a geometrical and a kernel characterization of the models induced.
The Extra-Trees paper also gives a bias/variance analysis, looking at the bias of a single tree and the variance across trees: the larger the minimum number of samples n required to split a node, the smaller the trees, the larger the bias, and the smaller the variance.
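A hedged sketch comparing the two forests; the dataset and hyperparameters are placeholders:

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)   # searches the best threshold for each candidate feature
et = ExtraTreesClassifier(n_estimators=100, random_state=0)     # draws one random cut-point per candidate feature, keeps the best
print(cross_val_score(rf, X, y, cv=5).mean())
print(cross_val_score(et, X, y, cv=5).mean())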

There are two related papers:
Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. “Isolation forest.” Data Mining, 2008. ICDM '08. Eighth IEEE International Conference on.
Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. “Isolation-based anomaly detection.” ACM Transactions on Knowledge Discovery from Data (TKDD) 6.1 (2012): 3.
Isolation forest can be used for anomaly (outlier) detection. The algorithm is as follows:
Algorithm (grow one isolation tree):
Input: a dataset D
Output: an iTree
While the stopping rule is not satisfied:
    Randomly select one attribute among the non-constant attributes.
    For this attribute, randomly select a cut-point uniformly in [min, max].
    Split the dataset by this cut-point and recurse on the two parts.

About the stopping rule: stop at the maximum depth or when |D| = 1.
An iTree is a binary tree.
Finally, grow M such trees, each on a random subsample of the data (usually 256 samples), and ensemble them.

It is an unsupervised method for spotting outliers. It differs from Extra-Trees in steps 2 and 3: a single feature is chosen at random and then a single split point is chosen at random, splitting stops once conditions such as the maximum depth are met, and the initial training set for each tree is drawn by subsampling, typically 256 points.
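A minimal sketch of the sklearn implementation; max_samples=256 echoes the subsample size above, and the synthetic data is only for illustration:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_inliers = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
X_outliers = rng.uniform(low=-6, high=6, size=(10, 2))
X = np.vstack([X_inliers, X_outliers])

iso = IsolationForest(n_estimators=100, max_samples=256, random_state=0).fit(X)
labels = iso.predict(X)          # +1 for inliers, -1 for outliers
scores = iso.score_samples(X)    # the lower the score, the more anomalous
print((labels == -1).sum())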


NMF (Non-Negative Matrix Factorization)
Find two non-negative matrices (W, H) whose product approximates the non-negative matrix X. This factorization can be used for example for dimensionality reduction, source separation or topic extraction.
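A hedged sketch of the decomposition; the random non-negative data and n_components=5 are arbitrary choices:

import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
X = rng.rand(100, 40)                     # non-negative input matrix

nmf = NMF(n_components=5, init='nndsvda', random_state=0, max_iter=500)
W = nmf.fit_transform(X)                  # (100, 5) representation of the samples
H = nmf.components_                       # (5, 40) basis / topics
print(np.abs(X - W @ H).mean())           # reconstruction error of the product W*H against X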


Multi-class and multi-label problems:
One-vs-rest: take one class at a time, treat all the other classes as a single class so the task becomes binary classification, and at prediction time take the class whose classifier scores highest as the prediction.
This strategy, also known as one-vs-all, is implemented in OneVsRestClassifier. The strategy consists in fitting one classifier per class. For each classifier, the class is fitted against all the other classes. In addition to its computational efficiency (only n_classes classifiers are needed), one advantage of this approach is its interpretability. Since each class is represented by one and only one classifier, it is possible to gain knowledge about the class by inspecting its corresponding classifier. This is the most commonly used strategy and is a fair default choice.
One-vs-one: a classifier is trained for every pair of classes, and the final class is decided by majority vote.
OneVsOneClassifier constructs one classifier per pair of classes. At prediction time, the class which received the most votes is selected. In the event of a tie (among two classes with an equal number of votes), it selects the class with the highest aggregate classification confidence by summing over the pair-wise classification confidence levels computed by the underlying binary classifiers.
Since it requires to fit n_classes * (n_classes - 1) / 2 classifiers, this method is usually slower than one-vs-the-rest, due to its O(n_classes^2) complexity.
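A minimal sketch of the two meta-estimators wrapping a linear SVM; the base estimator and dataset are arbitrary choices:

from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

ovr = OneVsRestClassifier(LinearSVC(C=1.0)).fit(X, y)   # n_classes classifiers
ovo = OneVsOneClassifier(LinearSVC(C=1.0)).fit(X, y)    # n_classes * (n_classes - 1) / 2 classifiers
print(len(ovr.estimators_), len(ovo.estimators_))       # 3 and 3 for iris (3 * 2 / 2 = 3)
print(ovr.predict(X[:5]), ovo.predict(X[:5]))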

Feature importance in sklearn's random forests:
The relative rank (i.e. depth) of a feature used as a decision node in a tree can be used to assess the relative importance of that feature with respect to the predictability of the target variable. Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features.
By averaging those expected activity rates over several randomized trees one can reduce the variance of such an estimate and use it for feature selection.
Within a single tree, a feature's importance is determined by the depth of the nodes where it is used (the node closest to the root matters most); the feature importances are then averaged over all the trees.
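A minimal sketch of reading these averaged importances (synthetic data and forest size are arbitrary):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, n_informative=3, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = forest.feature_importances_        # already averaged over all the trees
for idx in np.argsort(importances)[::-1][:5]:    # top five features
    print('feature %d: %.3f' % (idx, importances[idx]))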


Reposted from blog.csdn.net/zhangweijiqn/article/details/53216045