Machine Learning: KNN Algorithm Principle and Spark Implementation

A data developer who doesn't understand algorithms is not a good algorithm engineer. I still remember some of the data mining algorithms my instructor covered when I was a graduate student. I found them quite interesting, but I have had little contact with them since starting work. My path has run through offline data warehouse, ETL engineer, and BI engineer (for what it's worth); my current work is mainly on the offline data warehouse, though I did some ETL work early on. For long-term professional growth I want to broaden my technical boundaries and gradually go deeper into real-time processing and modeling, so starting with this article I am planting a flag: to study the real-time and modeling parts in depth.

To change yourself, start by improving what you are not good at.

1. Introduction to KNN-K Nearest Neighbor Algorithm

First of all, KNN is a supervised classification algorithm: the training set carries category labels. When a test object exactly matches a training object, it can be classified directly. But what if a test object is close to training objects from several different classes? We cannot simply assign it the class of a single matching training object, and this is the problem KNN solves. KNN classifies by measuring the distance between feature values. The idea: if most of the k samples most similar to a given sample in feature space (i.e., its nearest neighbors) belong to a certain category, then the sample also belongs to that category, where k is usually an integer not greater than 20. In the KNN algorithm, the selected neighbors are all objects that have already been correctly classified; the classification decision depends only on the categories of the one or few nearest samples.

The core idea of the KNN algorithm: if most of the K samples nearest to a sample in feature space belong to a certain category, then the sample also belongs to that category and shares the characteristics of samples in that category. When making a category decision, the method relies only on a very small number of neighboring samples. Because KNN depends on the limited set of surrounding samples rather than on discriminating class domains, it is better suited than other methods to sample sets whose class domains heavily cross or overlap.

2.KNN algorithm flow

2.1 Prepare the data and preprocess it.

2.2 Calculate the distance from the test sample point (the point to be classified) to every other sample point.

2.3 Sort the distances and select the K points with the smallest distances.

2.4 Compare the categories of those K points and, by majority vote, assign the test sample point to the category with the highest proportion among the K points.
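The steps above can be sketched in plain Scala, without Spark. This is a minimal illustration, not the article's implementation; the names `KnnSketch` and `classify` are my own, and samples are assumed to be (featureVector, label) pairs:

```scala
object KnnSketch {
  // Step 2.2: Euclidean distance between two feature vectors
  def distance(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => math.pow(x - y, 2) }.sum)

  // Steps 2.3 and 2.4: sort by distance, take the K nearest, majority vote
  def classify(samples: Seq[(Array[Double], String)],
               query: Array[Double], k: Int): String =
    samples
      .map { case (p, label) => (distance(query, p), label) }
      .sortBy(_._1)
      .take(k)
      .groupBy(_._2)
      .maxBy(_._2.size)
      ._1
}
```

A query near a cluster of class-A points will be labeled "A" as long as at least a majority of its k nearest neighbors are class A.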

3. Advantages and disadvantages of KNN algorithm

Advantages: easy to understand, simple to implement, no parameters to estimate, no training phase.

Disadvantages: if one class dominates the data set, test samples will tend to be assigned to that class, simply because points of that class are more likely to appear among the nearest neighbors.
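One common mitigation for this imbalance problem, sketched here as an illustrative variant rather than part of the original article, is distance-weighted voting: each of the K neighbors votes with weight 1/(distance + ε), so a few very close neighbors can outvote many distant ones from an over-represented class. The object name `WeightedKnn` is my own:

```scala
object WeightedKnn {
  // Euclidean distance between two feature vectors
  def distance(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Inverse-distance weighting: nearer neighbors get larger votes,
  // softening the bias toward classes with many samples
  def classify(samples: Seq[(Array[Double], String)],
               query: Array[Double], k: Int, eps: Double = 1e-9): String =
    samples
      .map { case (p, label) => (distance(query, p), label) }
      .sortBy(_._1)
      .take(k)
      .groupBy(_._2)
      .map { case (label, ds) => (label, ds.map(d => 1.0 / (d._1 + eps)).sum) }
      .maxBy(_._2)
      ._1
}
```

With k = 5, two class-A points at distance 0.1 (weight ≈ 10 each) outvote three class-B points at distance ≈ 1.4 (weight ≈ 0.7 each), whereas an unweighted majority vote would have picked B.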

4. KNN algorithm Spark implementation

4.1 Data download and description

Link: https://pan.baidu.com/s/1FmFxSrPIynO3udernLU0yQ Extraction code: hell

The Iris flower data set contains 150 records in 3 classes, 50 records per class. Each record has 4 features: sepal length, sepal width, petal length, and petal width.

These 4 features are used to predict which species an iris belongs to (Iris-setosa, Iris-versicolour, Iris-virginica).
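As a quick sanity check on the record format, a parse helper might look like the sketch below. It assumes a plain comma-separated line with the label in the last field; note that the demo code later in this article expects 6 columns for labeled rows, which suggests the downloaded file may carry an extra column, so this helper is illustrative only (`IrisParse` and `parseRecord` are my own names):

```scala
object IrisParse {
  // Hypothetical helper: split "5.1,3.5,1.4,0.2,Iris-setosa" into
  // (feature vector, label), with the label assumed to be the last field
  def parseRecord(line: String): (Array[Double], String) = {
    val fields = line.split(",")
    (fields.init.map(_.toDouble), fields.last)
  }
}
```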

4.2 Implementation

package com.hoult.work

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object KNNDemo {
  def main(args: Array[String]): Unit = {

    // 1. Initialization
    val conf = new SparkConf().setAppName("SimpleKnn").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val K = 15

    // 2. Read the data and wrap each line in a LabelPoint
    val data: RDD[LabelPoint] = sc.textFile("data/lris.csv")
      .map(line => {
        val arr = line.split(",")
        if (arr.length == 6) {
          // Labeled sample record: the last field is the label
          LabelPoint(arr.last, arr.init.map(_.toDouble))
        } else {
          // Unlabeled test record: mark it with a blank label
          println(arr.toBuffer)
          LabelPoint(" ", arr.map(_.toDouble))
        }
      })


    // 3. Split into labeled sample data and unlabeled test data
    val sampleData = data.filter(_.label != " ")
    val testData = data.filter(_.label == " ").map(_.point).collect()

    // 4. For each test record, compute its distance to every sample record
    testData.foreach(elem => {
      val distance = sampleData.map(x => (getDistance(elem, x.point), x.label))
      // Take the k nearest samples
      val minDistance = distance.sortBy(_._1).take(K)
      // The most frequent label among these k samples is the predicted label
      val labels = minDistance.map(_._2)
        .groupBy(x => x)
        .mapValues(_.length)
        .toList
        .sortBy(_._2).reverse
        .take(1)
        .map(_._1)
      println(s"${elem.mkString(",")},${labels.mkString(",")}")
    })
    sc.stop()

  }

  case class LabelPoint(label:String,point:Array[Double])

  import scala.math._

  // Euclidean distance between two feature vectors
  def getDistance(x: Array[Double], y: Array[Double]): Double = {
    sqrt(x.zip(y).map(z => pow(z._1 - z._2, 2)).sum)
  }
}

Complete code: https://github.com/hulichao/bigdata-spark/blob/master/src/main/scala/com/hoult/work/KNNDemo.scala

Wu Xie, Xiao San Ye, a backend rookie in the fields of big data and artificial intelligence. Follow me for more.

Origin blog.csdn.net/hu_lichao/article/details/113101236