Introduction to Machine Learning (5): Classification Algorithms - the KNN Algorithm


1. Sklearn transformers and estimators

1. Transformer (the parent class for feature engineering)
2. Estimator (how sklearn implements its machine learning algorithms)
Step 1: Instantiate an estimator
Step 2: Call estimator.fit(x_train, y_train) to train on the data (once the call completes, the model is generated)
Step 3: Evaluate the model
      1. Directly compare the true values and the predicted values
             y_predict = estimator.predict(x_test) (predict on the test set)
             y_test == y_predict
      2. Calculate the accuracy
             accuracy = estimator.score(x_test, y_test)
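
A minimal sketch of this three-step workflow, using KNN on the iris dataset (the full version with grid search appears in section 3):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Step 1: instantiate an estimator
iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=22)
estimator = KNeighborsClassifier(n_neighbors=3)
# Step 2: train; the model is generated once fit() returns
estimator.fit(x_train, y_train)
# Step 3: evaluate, by direct comparison or via score()
y_predict = estimator.predict(x_test)
print(y_test == y_predict)
print(estimator.score(x_test, y_test))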

2. KNN algorithm (K-nearest neighbors algorithm)

Principle: for a given sample, find the k samples that are most similar to it in the feature space (i.e., its nearest neighbors in the feature space). The sample is assigned to the category that the majority of those k samples belong to.
Distance formulas: Euclidean distance, Manhattan distance, Minkowski distance (illustrated below)
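
All three are members of one family: the Minkowski distance with p = 2 gives the Euclidean distance and p = 1 gives the Manhattan distance. A quick numpy sketch:

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))          # sqrt(3^2 + 4^2) = 5.0
manhattan = np.sum(np.abs(a - b))                  # 3 + 4 = 7.0
p = 3
minkowski = np.sum(np.abs(a - b) ** p) ** (1 / p)  # general form
print(euclidean, manhattan, minkowski)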
Example: use the KNN algorithm to classify the iris dataset (see the full code in section 3)

Advantages:
     Simple, easy to understand, and easy to implement; no training phase is required (distances are recomputed whenever test samples arrive).
Disadvantages:
     It is a lazy algorithm: classifying test samples is computationally intensive and the memory overhead is large.
     The value of k must be specified: when k is too small, the result is easily affected by outliers; when k is too large, it is easily affected by class imbalance.
Usage scenario: small-data scenarios.

3. Model selection and tuning

Application: it conveniently helps us choose the value of k in the KNN algorithm.
1. What is cross-validation?
      The training data is divided into 4 parts: 3 parts serve as the training set and 1 part as the validation set. Each round a different part is used for validation; after 4 rounds, the 4 model results are averaged. This is 4-fold cross-validation.
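A minimal sketch with sklearn's cross_val_score, which handles the fold splitting and scoring (using the iris data; n_neighbors=5 is just an illustrative choice):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
# cv=4 splits the data into 4 folds and rotates the validation fold
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), iris.data, iris.target, cv=4)
print(scores)         # one score per fold
print(scores.mean())  # the averaged result
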
2. Hyperparameter search (grid search)
      Principle: try the candidate hyperparameter values one by one, then select the optimal hyperparameter to build the model.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

def knn_iris_gscv():
    """Classify the iris dataset with KNN, adding grid search and cross-validation.
    :return:"""
    # Get the data
    iris = load_iris()
    # Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=22)
    # Feature engineering: standardization
    transfer = StandardScaler()  # instantiate
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)  # no fit() here: the test set must be standardized with the training set's parameters
    # KNN estimator
    estimator = KNeighborsClassifier()  # instantiate
    # Add grid search and cross-validation
    # Prepare the parameters
    param_dict = {'n_neighbors': [1, 3, 5, 7, 9, 11]}  # candidate values of k
    estimator = GridSearchCV(estimator, param_grid=param_dict, cv=10)  # 10-fold cross-validation
    estimator.fit(x_train, y_train)  # feed in the training data
    # Model evaluation
    # Method 1: directly compare the true values and the predicted values
    y_predict = estimator.predict(x_test)
    print('y_predict:\n', y_predict)
    print('Compare true and predicted values:\n', y_test == y_predict)
    # Method 2: calculate the accuracy
    score = estimator.score(x_test, y_test)
    print('Accuracy:\n', score)

    print('Best parameters:\n', estimator.best_params_)
    print('Best result:\n', estimator.best_score_)
    print('Best estimator:\n', estimator.best_estimator_)
    print('Cross-validation results:\n', estimator.cv_results_)

if __name__ == '__main__':
    knn_iris_gscv()


Case: predicting the Facebook check-in location

Reduce the data from more than 20 million rows to about 80,000:
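A sketch of this step with pandas. The dataset is the Kaggle "Facebook V: Predicting Check Ins" training data (columns row_id, x, y, accuracy, time, place_id); the file path and the x/y query bounds here are illustrative assumptions:

import pandas as pd

# Load the full check-in data (more than 20 million rows)
data = pd.read_csv('./FBlocation/train.csv')
# Keep only a small spatial square to shrink the data to a workable size
data = data.query('x < 2.5 & x > 2 & y < 1.5 & y > 1.0')
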
Convert the timestamp to year, month, day, hour, minute and second (pd.to_datetime() can parse different date representations):
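Continuing the sketch, assuming the time column holds Unix-style timestamps in seconds:

# Convert the integer timestamps into datetime objects
time_value = pd.to_datetime(data['time'], unit='s')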


Keep only the day, the hour and the day of the week (the year and month are the same for every row, so there is no need to keep them):
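Continuing the sketch, a DatetimeIndex makes the components easy to pull out (the new column names are assumptions):

# Extract day, weekday and hour as new feature columns
date = pd.DatetimeIndex(time_value)
data['day'] = date.day
data['weekday'] = date.weekday
data['hour'] = date.hour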
Group by place_id with groupby() and use .count() to count the number of check-ins per place_id; keeping just the row_id column of the result is enough:
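Continuing the sketch (count() fills every column with the group size, so any single column such as row_id carries the check-in count):

# Number of check-ins per place, kept as a Series indexed by place_id
place_count = data.groupby('place_id').count()['row_id']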
Keep the ids of the places with more than three check-ins:
For the place_id values in the data, look them up in place_count[place_count > 3]; only the index of that result is needed, since the index values are the ids themselves (the condition place_count > 3 marks them as True):
Boolean-index the data, keeping the rows whose place_id satisfies place_count > 3 (i.e., is True above):
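Continuing the sketch, isin() builds the boolean mask:

# Ids of places with more than 3 check-ins
popular_ids = place_count[place_count > 3].index.values
# Boolean index: keep only the rows whose place_id is in that set
data_final = data[data['place_id'].isin(popular_ids)]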
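From here the steps mirror the iris example above; a sketch assuming the features are the coordinates plus the extracted time columns and the target is place_id:

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Features and target
x = data_final[['x', 'y', 'accuracy', 'day', 'weekday', 'hour']]
y = data_final['place_id']
# Split, standardize, then fit KNN with grid search and cross-validation
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22)
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)
estimator = GridSearchCV(KNeighborsClassifier(), param_grid={'n_neighbors': [3, 5, 7, 9]}, cv=3)
estimator.fit(x_train, y_train)
print('Accuracy:\n', estimator.score(x_test, y_test))
print('Best parameters:\n', estimator.best_params_)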

Summary of classification algorithms:

Origin blog.csdn.net/qq_45234219/article/details/114832673