Quantitative Investment Study Notes 27 - "Python Applications of Machine Learning" Course Notes 01

Beijing Institute of Technology online course:
http://www.icourse163.org/course/BIT-1001872001
Categories of machine learning:
supervised learning
unsupervised learning
semi-supervised learning
reinforcement learning
deep learning
Scikit-learn algorithm classification

Built-in datasets in sklearn
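sklearn ships a number of small toy datasets in sklearn.datasets; a minimal sketch of loading one of them (iris, picked here only as an example):

```python
from sklearn.datasets import load_iris

iris = load_iris()           # classic toy dataset bundled with sklearn
print(iris.data.shape)       # (150, 4): 150 samples, 4 features
print(iris.target.shape)     # (150,): one class label per sample
print(iris.feature_names)    # names of the 4 features
```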

The six tasks of sklearn: classification, regression, clustering, dimensionality reduction, model selection, and data preprocessing.
First, unsupervised learning: learning from unlabeled data. The most commonly used techniques are clustering and dimensionality reduction.
Clustering: the process of dividing data into multiple classes according to the similarity of the data. The "distance" between samples is used to estimate their similarity, and different ways of computing the distance lead to different clustering results. Common distance measures are Euclidean distance, Manhattan distance, Mahalanobis distance, and cosine similarity.
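As a quick illustration of these distance measures, here is a minimal sketch using sklearn.metrics.pairwise; the two sample vectors are made up for the example:

```python
import numpy as np
from sklearn.metrics.pairwise import (
    euclidean_distances, manhattan_distances, cosine_similarity
)

# Two made-up sample vectors (illustration only)
a = np.array([[1.0, 2.0, 3.0]])
b = np.array([[2.0, 4.0, 6.0]])

print(euclidean_distances(a, b))  # Euclidean distance, sqrt(1+4+9) ~ 3.74
print(manhattan_distances(a, b))  # Manhattan distance, 1+2+3 = 6
print(cosine_similarity(a, b))    # cosine similarity, 1.0 (same direction)
```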
sklearn's clustering functionality is contained in the sklearn.cluster module. Applying different algorithms to the same dataset may give different results, and the running times also differ.
It accepts two data input formats:
Standard input format: a matrix of shape [number of samples, number of features].
Similarity-matrix format: a matrix of shape [number of samples, number of samples], whose elements are the pairwise similarities (or distances) between samples.
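A minimal sketch of the two input forms, using DBSCAN as the example; the data points are synthetic, and note that DBSCAN's precomputed variant expects a distance matrix rather than a similarity matrix:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import euclidean_distances

# Standard format: [n_samples, n_features] matrix (synthetic points)
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]], dtype=float)
labels_std = DBSCAN(eps=3, min_samples=2).fit_predict(X)

# Precomputed format: [n_samples, n_samples] pairwise distance matrix
D = euclidean_distances(X)
labels_pre = DBSCAN(eps=3, min_samples=2, metric="precomputed").fit_predict(D)

print(labels_std)  # -1 marks noise points
print(labels_pre)  # same labels, since the same distances are used
```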
Common clustering algorithms

Dimensionality reduction: converting high-dimensional data into low-dimensional data while preserving the distribution and characteristics that the data represents.
It is used for visualizing data or for data compression.
sklearn's dimensionality reduction algorithms are contained in the sklearn.decomposition module, which includes seven kinds of dimensionality reduction algorithms.
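As an example, PCA (one of the algorithms in sklearn.decomposition) can project data down to two dimensions for visualization; a minimal sketch on the built-in iris data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                   # shape (150, 4)
pca = PCA(n_components=2)              # keep the 2 directions of largest variance
X_2d = pca.fit_transform(X)            # shape (150, 2), ready for a scatter plot
print(pca.explained_variance_ratio_)   # fraction of variance kept by each component
```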

1. Clustering
① The k-means algorithm and its applications
With k as a parameter, divide n objects into k clusters so that similarity within each cluster is high while similarity between clusters is low.
Procedure:
Randomly select k points as the initial cluster centers.
For each remaining point, assign it to the nearest cluster according to its distance from the cluster centers.
For each cluster, compute the mean of all its points as the new cluster center.
Repeat the previous two steps until the cluster centers no longer change.
Example: classifying resident income in 31 provinces and municipalities. See the article's GitHub code repository for details.
Extension and improvement: KMeans uses Euclidean distance by default; to use a different distance measure you would have to modify the source code. A minimal usage sketch follows below.
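The full 31-province income example is in the GitHub repository; the sketch below only shows the basic sklearn KMeans usage pattern, on feature values invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented "income" feature vectors, one row per region (illustration only)
X = np.array([
    [7000, 1200],
    [6800, 1100],
    [3000,  400],
    [3200,  450],
    [1500,  200],
    [1600,  210],
], dtype=float)

km = KMeans(n_clusters=3, n_init=10, random_state=0)  # k = 3 clusters
labels = km.fit_predict(X)     # cluster index assigned to each sample
print(labels)
print(km.cluster_centers_)     # the final cluster centers (cluster means)
```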
② The DBSCAN algorithm
A density-based clustering algorithm. The number of clusters does not need to be specified in advance.
Data points are divided into three categories:
Core points: have more than MinPts points within radius Eps.
Border points: have fewer than MinPts points within radius Eps, but fall within the neighborhood of a core point.
Noise points: neither core points nor border points.
Procedure:
Label every point as a core point, border point, or noise point.
Remove the noise points.
Add an edge between every pair of core points that are within distance Eps of each other.
Each connected group of core points forms a cluster.
Assign each border point to the cluster of one of its associated core points (a core point within whose radius it falls).
Example: classifying students' online time. See the article's GitHub code repository for details.
Tip: long-tailed data is not well suited to clustering; a logarithmic transformation can be applied first (see the sketch below).
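The student online-time example lives in the repository; the sketch below only illustrates the DBSCAN call together with the log-transform tip, on invented long-tailed duration values:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Invented long-tailed "online hours" values (illustration only)
hours = np.array([0.5, 1, 1.2, 2, 2.5, 3, 20, 22, 100], dtype=float)

# Log transform compresses the long tail before clustering
X = np.log1p(hours).reshape(-1, 1)

db = DBSCAN(eps=0.3, min_samples=2)  # Eps and MinPts chosen by hand here
labels = db.fit_predict(X)
print(labels)                        # -1 marks noise points
```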
Code for this article:
https://github.com/zwdnet/MyQuant/tree/master/25

The four places where I publish my articles; you are welcome to share them on WeChat Moments and elsewhere, and to click "Looking".
My personal blog: https://zwdnet.github.io
My Zhihu articles: https://www.zhihu.com/people/zhao-you-min/posts
My cnblogs blog: https://www.cnblogs.com/zwdnet/
My personal WeChat subscription account: 赵瑜敏的口腔医学学习园地
