Machine Learning | (3) Scikit-learn Classification Algorithms: k-Nearest Neighbors



The k-Nearest Neighbors Classification Algorithm

The k-nearest neighbors (kNN) algorithm classifies a sample by measuring the distances between its feature values and those of samples with known labels.

Advantages: high accuracy, insensitive to outliers, no assumptions about the input data

Disadvantages: high computational complexity and high space complexity

Applicable data types: numeric and nominal

Understanding kNN through an example

Movies can be classified by genre, but how is each genre defined? Suppose there are two genres: action movies and romance movies. What features do action movies have in common, and how do they differ significantly from romance movies? We find that action movies contain more fight scenes, while romance movies contain relatively more kissing scenes. Of course, some action movies have kissing scenes and some romance movies have fight scenes, so we cannot judge a movie's genre simply by whether it contains a fight scene or a kissing scene. Now suppose we have six movies whose genres are already known, together with their counts of fight scenes and kissing scenes, plus one movie whose genre is unknown.

Movie name                   Fight scenes   Kissing scenes   Movie type
California Man               3              104              Romance
He's Not Really into Dudes   2              100              Romance
Beautiful Woman              1              81               Romance
Kevin Longblade              101            10               Action
Robo Slayer 3000             99             5                Action
Amped II                     98             2                Action
?                            18             90               Unknown

We then use the k-nearest neighbors algorithm to classify the unknown movie as romance or action. We have a sample dataset, also called the training sample set, containing M samples, and we know which category each sample's features correspond to. Given one sample of unknown category, we compute the distance between it and each of the M training samples, sort the training samples by distance, and select the K nearest ones (K is generally no larger than 20). The unknown sample is then assigned to the category that appears most often among these K most similar samples. Following this principle, we can classify the unknown movie.

Movie name                   Distance to the unknown movie
California Man               20.5
He's Not Really into Dudes   18.7
Beautiful Woman              19.2
Kevin Longblade              115.3
Robo Slayer 3000             117.4
Amped II                     118.9

Assume K is 3. The three movies nearest to the unknown one are all romance movies, so we determine that the unknown movie is a romance movie.
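As a quick illustration, here is a minimal sketch of this nearest-neighbor vote in Python, using the fight-scene and kissing-scene counts from the table above (the classify helper is our own illustration, not part of scikit-learn; the distances are the Euclidean distances explained next):

import numpy as np
from collections import Counter

# Known movies: (fight scenes, kissing scenes) -> genre, from the table above
movies = {
    "California Man": ((3, 104), "Romance"),
    "He's Not Really into Dudes": ((2, 100), "Romance"),
    "Beautiful Woman": ((1, 81), "Romance"),
    "Kevin Longblade": ((101, 10), "Action"),
    "Robo Slayer 3000": ((99, 5), "Action"),
    "Amped II": ((98, 2), "Action"),
}

def classify(point, k=3):
    # Distance from the unknown point to every known movie, sorted ascending
    dists = sorted(
        (float(np.linalg.norm(np.array(point) - np.array(feats))), genre)
        for feats, genre in movies.values()
    )
    # Majority vote among the k nearest neighbors
    votes = Counter(genre for _, genre in dists[:k])
    return votes.most_common(1)[0][0]

print(classify((18, 90), k=3))  # -> 'Romance'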

So how is the distance calculated? Euclidean distance: for two points $a_{1} = (x_{1}, y_{1})$ and $a_{2} = (x_{2}, y_{2})$, the distance between them can be expressed by the formula:

$\sqrt{(x_{1}-x_{2})^{2}+(y_{1}-y_{2})^{2}}$

If there are four input features, the distance between, say, (1, 3, 5, 2) and (7, 6, 9, 4) is calculated as:

$\sqrt{(1-7)^{2}+(3-6)^{2}+(5-9)^{2}+(2-4)^{2}} = \sqrt{65}$
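As a quick check, the same four-dimensional distance can be computed with NumPy (the variable names a and b are our own):

import numpy as np

a = np.array([1, 3, 5, 2])
b = np.array([7, 6, 9, 4])

print(np.sqrt(np.sum((a - b) ** 2)))  # 8.062..., i.e. sqrt(65)
print(np.linalg.norm(a - b))          # the same value via NumPy's norm helper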

sklearn.neighbors

sklearn.neighbors provides supervised, neighbors-based learning methods; sklearn.neighbors.KNeighborsClassifier is the nearest-neighbors classifier. KNeighborsClassifier is a class, so let's look at its parameters before working through examples.

class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1, **kwargs)
  """
  :param n_neighbors: int, optional (default = 5), number of neighbors to use by default for kneighbors queries

  :param algorithm: {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional, algorithm used to compute the nearest neighbors: 'ball_tree' uses BallTree, 'kd_tree' uses KDTree, 'brute' uses a brute-force search. 'auto' attempts to decide the most appropriate algorithm based on the values passed to the fit method.

  :param n_jobs: int, optional (default = 1), number of parallel jobs to run for the neighbor search. If -1, the number of jobs is set to the number of CPU cores. Does not affect the fit method.

  """
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

neigh = KNeighborsClassifier(n_neighbors=3)

Methods

fit(X, y)

Fit the model using X as the training data and y as the class labels of X. Both X and y may be arrays or matrices.

X = np.array([[1,1],[1,1.1],[0,0],[0,0.1]])  # four training points, two features each
y = np.array([1,1,0,0])                      # class labels of the four points
neigh.fit(X,y)

kneighbors(X=None, n_neighbors=None, return_distance=True)

Find the n_neighbors nearest neighbors of the given points X; if return_distance is False, the distances are not returned.

neigh.kneighbors(np.array([[1.1,1.1]]), return_distance=False)  # indices of the 3 nearest training points

neigh.kneighbors(np.array([[1.1,1.1]]), return_distance=False, n_neighbors=2)  # only the 2 nearest

predict(X)

Predict the class labels for the provided data.

neigh.predict(np.array([[0.1,0.1],[1.1,1.1]]))  # -> array([0, 1])

predict_proba(X)

Return probability estimates for the test data X belonging to each class.

neigh.predict_proba(np.array([[1.1,1.1]]))  # probability of class 0 and class 1 (here 1/3 and 2/3)

 


k-Nearest Neighbors Case Study

This case study uses the famous Iris dataset, which Fisher used in his classic paper and which is now widely used as a textbook sample dataset; it ships with the Scikit-learn package.

Reading the Iris dataset and inspecting its details

from sklearn.datasets import load_iris
# Read the data with the loader and store it in the variable iris
iris = load_iris()

# Check the size of the dataset
iris.data.shape

# Read the dataset description (a good habit)
print(iris.DESCR)

From the dataset description we learn that the Iris dataset contains 150 samples, uniformly distributed across 3 different subspecies; each sample is described by 4 shape features of the petals and sepals. Since no test set is provided, as usual we need to split off 25% of the samples for testing and use the remaining 75% to train the model.
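For example, the feature names and class names can be checked directly (a small illustrative snippet using standard load_iris attributes):

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)     # (150, 4): 150 samples, 4 features
print(iris.feature_names)  # sepal/petal length and width, in cm
print(iris.target_names)   # the 3 subspecies: setosa, versicolor, virginica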

Since it is not clear whether the dataset is already randomly ordered, the samples may be arranged by category, which would leave the training set unbalanced. We therefore need to shuffle when splitting the data; fortunately, the split function below already performs random sampling by default.

Splitting the Iris dataset

# Note: train_test_split moved from sklearn.cross_validation to sklearn.model_selection in newer scikit-learn versions
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(iris.data,iris.target,test_size=0.25,random_state=42)

Standardizing the features

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
X_train = ss.fit_transform(X_train)
# Use the scaler fitted on the training data to transform the test data
X_test = ss.transform(X_test)

The k-nearest neighbors algorithm is a very intuitive machine learning model. Notice that kNN has no parameter-training process: it does not learn any parameters by analyzing the training data, but instead makes classification decisions directly from the distribution of the training data around the test sample. k-nearest neighbors therefore belongs to the family of non-parametric models, and is one of the simplest of them.
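Putting the preceding steps together, here is a minimal sketch of training and evaluating the model (assuming the X_train, X_test, y_train, y_test variables produced by the split and scaling steps above):

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)         # "fitting" simply stores the training data
print(knn.score(X_test, y_test))  # mean accuracy on the 25% held-out test set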

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

def knniris():
    """
    Iris classification
    :return: None
    """

    # Load and split the dataset
    lr = load_iris()

    x_train, x_test, y_train, y_test = train_test_split(lr.data, lr.target, test_size=0.25)

    # Standardize the features

    std = StandardScaler()

    x_train = std.fit_transform(x_train)
    x_test = std.transform(x_test)

    # Estimator workflow
    knn = KNeighborsClassifier()

    # # Fit the model
    # knn.fit(x_train,y_train)
    #
    # # Make predictions or compute the accuracy
    # y_predict = knn.predict(x_test)
    #
    # # score = knn.score(x_test,y_test)

    # Grid search; n_neighbors is the list of candidate values
    param = {"n_neighbors": [3, 5, 7]}

    gs = GridSearchCV(knn, param_grid=param, cv=10)

    # Build the model
    gs.fit(x_train,y_train)

    # print(gs)

    # Score on the test data

    print(gs.score(x_test,y_test))

    # Precision and recall of the classification model

    # print("Precision and recall for each class:", classification_report(y_test, y_predict, target_names=lr.target_names))

    return None

if __name__ == "__main__":
    knniris()
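After the grid search has been fitted, it is often useful to inspect which K won; a short follow-up sketch using standard GridSearchCV attributes (these lines would go inside knniris() after gs.fit):

print(gs.best_params_)     # the best n_neighbors found by cross-validation, e.g. {'n_neighbors': 5}
print(gs.best_score_)      # mean cross-validated accuracy for that setting
print(gs.best_estimator_)  # the KNeighborsClassifier refitted with the best parameters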

 

