Machine Learning: KMeans Algorithm Principles and Spark Implementation

A data developer who doesn't understand algorithms is not a good algorithm engineer. I still remember the data mining algorithms my advisor covered when I was a graduate student; I found them quite interesting, but have had little contact with them since starting work. My path has run offline data warehouse > ETL engineer > BI engineer (and so on); my current work is mainly on the offline data warehouse, though I did some ETL work early on. For long-term professional growth I need to broaden my technical boundaries and gradually go deeper into real-time processing and modeling, so starting with this article I am planting a flag: to study the real-time and modeling sides in depth.

To change yourself, start by improving what you are not good at.

1. KMeans: Introduction to the Algorithm

The K-Means algorithm is an unsupervised clustering algorithm. It is relatively simple to implement, produces good clustering results, and is therefore widely used.

  • The K-means algorithm, also known as K-means clustering or the K-averages algorithm, is generally the first algorithm used to learn clustering.

  • Here K is a constant that must be set in advance. In plain terms, the algorithm iteratively groups M unlabeled samples into K clusters.

  • Samples are usually grouped using the distance between them as the measure of similarity.

Core idea: K-means is an iterative clustering-analysis algorithm. Its steps are to randomly select K objects as the initial cluster centers, then compute the distance between every object and each cluster center and assign each object to its nearest center. A center together with the objects assigned to it forms a cluster. Each time samples are assigned, every cluster's center is recalculated from the objects currently in that cluster. The process repeats until a termination condition is met: no (or a minimal number of) objects are reassigned to a different cluster, no (or a minimal number of) cluster centers change, or the sum of squared errors reaches a local minimum.
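Put formally, what K-means minimizes is the within-cluster sum of squared errors mentioned in the termination condition. For reference (this is the textbook objective, added here, not taken from the original post), with \mu_i denoting the mean of cluster C_i:

J = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

Both halves of an iteration, reassigning each point to its nearest center and recomputing each center as the cluster mean, can only decrease J or leave it unchanged, which is why the procedure always converges, though only to a local minimum.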

2. KMeans Algorithm Flow

2.1 Read the file and preprocess the data

2.2 Randomly select K points as the initial centers

2.3 Traverse the data set and, for each point, compute its distance to each of the K centers; assign the point to its nearest center

2.4 Recompute each cluster's center from the new assignments

2.5 Use the new centers to start the next iteration (go back to step 2.3)

Conditions for exiting the loop:

1. A specified maximum number of iterations is reached

2. The centers barely move any more (i.e., the total distance moved by all the centers is less than a given constant, such as 0.00001); a plain-Scala sketch of the whole loop follows this list
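To make the flow concrete before the Spark version in section 4, here is a minimal non-distributed sketch in plain Scala. It is my own illustration, not code from the post: the names KMeansSketch, dist, mean, and kmeans are made up for the example, and the exit test combines both conditions above.

import scala.util.Random

object KMeansSketch {
  // Euclidean distance between two points
  def dist(a: Seq[Double], b: Seq[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // component-wise mean of a set of points
  def mean(ps: Seq[Seq[Double]]): Seq[Double] =
    ps.transpose.map(_.sum / ps.size)

  def kmeans(data: Seq[Seq[Double]], k: Int, maxIter: Int = 40, tol: Double = 1e-5): Seq[Seq[Double]] = {
    var centers: Seq[Seq[Double]] = Random.shuffle(data).take(k)   // step 2.2: random initial centers
    var moved = Double.MaxValue
    var iter = 0
    while (iter < maxIter && moved > tol) {                        // both exit conditions
      val clusters = data.groupBy(p => centers.minBy(c => dist(p, c)))            // step 2.3: nearest center
      val newCenters = centers.map(c => clusters.get(c).map(mean).getOrElse(c))   // step 2.4: recompute centers
      moved = centers.zip(newCenters).map { case (a, b) => dist(a, b) }.sum       // total center movement
      centers = newCenters                                         // step 2.5: next round
      iter += 1
    }
    centers
  }
}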

3. Advantages and Disadvantages of the KMeans Algorithm

The choice of K: the value of k strongly influences the final result, yet it must be given in advance. Choosing a suitable k requires prior knowledge; it is hard to estimate out of thin air, and a poor choice may lead to poor results.

Sensitivity to outliers: during iteration, K-means uses the mean of all points in a cluster as the new centroid (center point). If a cluster contains outliers, the mean deviates badly. For example, for a cluster containing the five values 2, 4, 6, 8, and 100, the new centroid is 24, which is clearly far from most of the points. In such cases using the median, 6, works better than using the mean; the clustering method that uses the median is called K-Medoids (K-median clustering).
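The arithmetic in that example is easy to check in a couple of lines of Scala, using the post's own five values:

val cluster = Seq(2.0, 4.0, 6.0, 8.0, 100.0)
val mean = cluster.sum / cluster.size                 // (2+4+6+8+100)/5 = 24, dragged toward the outlier
val median = cluster.sorted.apply(cluster.size / 2)   // 6, barely affected by the outlier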

Sensitivity to initial values: K-means is sensitive to the initial centers, and different initial values may produce different cluster partitions. To avoid anomalous final results caused by this sensitivity, one can initialize several sets of initial centers, build a partition from each, and then pick the best one. From this idea the following variants were derived: bisecting K-Means, K-Means++, K-Means||, the Canopy algorithm, etc.
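As an illustration of how one of those variants changes the seeding, here is a sketch of standard K-Means++ initialization (my own code, not from the post, reusing the dist helper from the sketch in section 2): the first center is chosen uniformly at random, and each later center is drawn with probability proportional to its squared distance to the nearest center already chosen.

import scala.util.Random

object KMeansPlusPlus {
  import KMeansSketch.dist   // Euclidean distance from the earlier sketch

  def init(data: Seq[Seq[Double]], k: Int, rnd: Random): Seq[Seq[Double]] = {
    val first = data(rnd.nextInt(data.size))          // first center: uniform at random
    (1 until k).foldLeft(Seq(first)) { (centers, _) =>
      // squared distance from every point to its nearest already-chosen center
      val d2 = data.map(p => centers.map(c => dist(p, c)).min).map(d => d * d)
      // sample the next center with probability proportional to d2
      val target = rnd.nextDouble() * d2.sum
      val idx = d2.scanLeft(0.0)(_ + _).tail.indexWhere(_ >= target)
      centers :+ data(math.max(idx, 0))
    }
  }
}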

Simple implementation, portability, and good scalability make K-means one of the most commonly used clustering algorithms.

4. Spark Implementation of the KMeans Algorithm

4.1 Data download and description

Link: https://pan.baidu.com/s/1FmFxSrPIynO3udernLU0yQ (extraction code: hell)

The iris flower data set contains 150 records in 3 classes, 50 per class. Each record has 4 features: sepal length, sepal width, petal length, and petal width.

Using these 4 features, we cluster the flowers with K set to 3, then compare the clusters with the actual classes.
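For reference, rows of the standard iris CSV look like this: four numeric columns plus a text label (the parsing code in 4.2 keeps only the numeric fields).

5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica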

4.2 Implementation

Rather than Spark's MLlib library, this is a native Scala implementation:

package com.hoult.work

import org.apache.commons.lang3.math.NumberUtils
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

import scala.collection.mutable.ListBuffer
import scala.math.{pow, sqrt}
import scala.util.Random

object KmeansDemo {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder()
      .master("local[*]")
      .appName(this.getClass.getCanonicalName)
      .getOrCreate()

    val sc = spark.sparkContext
    val dataset = spark.read.textFile("data/lris.csv")
      .rdd.map(_.split(",").filter(NumberUtils.isNumber _).map(_.toDouble))
      .filter(!_.isEmpty).map(_.toSeq)


    val res: RDD[(Seq[Double], Int)] = train(dataset, 3)

    res.sample(false, 0.1, 1234L)
      .map(tp => (tp._1.mkString(","), tp._2))
      .foreach(println)
  }

  // Defines a method whose parameters are the dataset, K, the maximum number of iterations, and the tolerance for the change in the cost function.
  // The maximum iteration count and the tolerance have default values and can be adjusted as needed.
  def train(data: RDD[Seq[Double]], k: Int, maxIter: Int = 40, tol: Double = 1e-4) = {

    val sc: SparkContext = data.sparkContext

    var i = 0 // iteration counter
    var cost = 0D // initial cost
    var convergence = false   // convergence flag: true once the change in cost is below the threshold tol

    // step 1: randomly pick k initial cluster centers
    var initk: Array[(Seq[Double], Int)] = data.takeSample(false, k, Random.nextLong()).zip(Range(0, k))

    var res: RDD[(Seq[Double], Int)] = null

    while (i < maxIter && !convergence) {

      val bcCenters = sc.broadcast(initk)

      val centers: Array[(Seq[Double], Int)] = bcCenters.value

      val clustered: RDD[(Int, (Double, Seq[Double], Int))] = data.mapPartitions(points => {

        val listBuffer = new ListBuffer[(Int, (Double, Seq[Double], Int))]()

        // compute the distance from each sample point to every cluster center
        points.foreach { point =>

          // for each center, build (cluster id, (distance to that center, the sample point, a count of 1))
          val cost: (Int, (Double, Seq[Double], Int)) = centers.map(ct => {

            ct._2 -> (getDistance(ct._1.toArray, point.toArray), point, 1)

          }).minBy(_._2._1)  // assign the sample to its nearest cluster center
          listBuffer.append(cost)
        }

        listBuffer.toIterator
      })
      // aggregate by cluster id: sum the costs, the coordinates, and the counts, then average to get the new centers
      val mpartition: Array[(Int, (Double, Seq[Double]))] = clustered
        .reduceByKey((a, b) => {
          val cost = a._1 + b._1   // accumulated cost
          val count = a._3 + b._3   // accumulated sample count for the cluster
          val newCenters = a._2.zip(b._2).map(tp => tp._1 + tp._2)    // element-wise sum of the cluster's points
          (cost, newCenters, count)
        })
        .map {
          case (clusterId, (costs, point, count)) =>
            clusterId -> (costs, point.map(_ / count))   // new cluster center: the mean of the cluster's points
        }
        .collect()
      val newCost = mpartition.map(_._2._1).sum   // total cost for this iteration
      convergence = math.abs(newCost - cost) <= tol    // converged when the change in cost is below the threshold tol
      // carry the new cost into the next iteration
      cost = newCost
      // replace the cluster centers with the newly computed ones
      initk = mpartition.map(tp => (tp._2._2, tp._1))
      // clustering result: each sample point paired with the id of its assigned cluster
      res = clustered.map(tp => (tp._2._2, tp._1))
      i += 1
    }
    // return the clustering result
    res
  }

  // Euclidean distance between two points
  def getDistance(x: Array[Double], y: Array[Double]): Double = {
    sqrt(x.zip(y).map(z => pow(z._1 - z._2, 2)).sum)
  }


}
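One way to run it is to submit the packaged jar with spark-submit (the jar path below is illustrative and depends on your build; the master is already hard-coded to local[*] in the code):

spark-submit --class com.hoult.work.KmeansDemo target/bigdata-spark.jar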

Complete code: https://github.com/hulichao/bigdata-spark/blob/master/src/main/scala/com/hoult/work/KmeansDemo.scala

Screenshot of the result: [image]

Origin: blog.csdn.net/hu_lichao/article/details/113101229