A KNN algorithm example with Spark MLlib

The idea behind the kNN algorithm:
The k-nearest-neighbor (kNN, k-NearestNeighbor) classification algorithm is one of the simplest classification methods in data mining. "K nearest neighbors" means that each sample can be represented by its k closest neighbors: a sample is assigned the class that is most common among them (you are known by the company you keep).
Distance formulas used to find the neighbors:
Manhattan distance,
Euclidean distance
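
As a minimal sketch of the two distance formulas named above, the plain Scala functions below compute them for two equal-length feature arrays. The object name `Distances` and both function names are hypothetical and are not part of the Spark code later in this post.

```scala
object Distances {
  // Manhattan distance: sum of the absolute coordinate differences
  def manhattan(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    a.zip(b).map { case (x, y) => math.abs(x - y) }.sum
  }

  // Euclidean distance: square root of the sum of squared coordinate differences
  def euclidean(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
  }

  def main(args: Array[String]): Unit = {
    val p = Array(0.0, 0.0)
    val q = Array(3.0, 4.0)
    println(manhattan(p, q)) // 7.0
    println(euclidean(p, q)) // 5.0
  }
}
```

Either function can serve as the distance in kNN; the Spark code in this post uses the Euclidean one (in its squared form).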

Requirements:
Sample data:
label, F1, F2, F3, F4, F5
0,10,20,30,40,30
0,12,22,29,42,35
0,11,21,31,40,34
0,13,22,30,42,32
0,12,22,32,41,33
0,10,21,33,45,35
1,30,11,21,40,34
1,33,10,20,43,30
1,30,12,23,40,33
1,32,10,20,42,33
1,30,13,20,42,30
1,30,09,22,41,32

Write a Spark program that predicts the class of each of the following unknown vectors:
B1, B2, B3, B4, B5
11,21,31,44,32
14,26,32,39,30
32,14,21,42,32
34,12,22,42,34

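Code implementation:
// Imports are assumed from the identifiers used below; the original post omits them.
// org.apache.spark.mllib.linalg offers the same Vectors.dense and Vectors.sqdist.
import org.apache.spark.ml.linalg
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.types.{DataTypes, StructType}
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.collection.mutable

object suanfa {
def main(args: Array[String]): Unit = {
// create the Spark environment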

val spark: SparkSession = SparkSession.builder()
  .appName(this.getClass.getSimpleName)
  .master("local[*]")
  .getOrCreate()

Define custom schemas with new StructType()

    // schema of the sample data
val schemal: StructType = new StructType()
  .add("label", DataTypes.DoubleType)
  .add("f1", DataTypes.DoubleType)
  .add("f2", DataTypes.DoubleType)
  .add("f3", DataTypes.DoubleType)
  .add("f4", DataTypes.DoubleType)
  .add("f5", DataTypes.DoubleType)

  // schema of the unknown data
val schema2: StructType = new StructType()
  .add("id", DataTypes.DoubleType)
  .add("b1", DataTypes.DoubleType)
  .add("b2", DataTypes.DoubleType)
  .add("b3", DataTypes.DoubleType)
  .add("b4", DataTypes.DoubleType)
  .add("b5", DataTypes.DoubleType)

Cross join the two datasets (a Cartesian-product join) to build one combined table, zhj

// load the sample data
val yb: DataFrame = spark.read.option("header","true").schema(schemal).csv("data/demo/yangben")
// load the unknown data
val wz: DataFrame = spark.read.schema(schema2).option("header","true").csv("data/demo/weizhi")
    // cross join the unknown data with the sample data (crossJoin is a Cartesian-product join)
    val zhj: DataFrame = wz.crossJoin(yb)

Wrap Vectors.sqdist in a udf() to compute the Euclidean distance between the two arrays in each row of the combined table.
If the UDF parameters are declared as plain Array, the job fails at run time with "Failed to execute user defined function" (the data type does not match).
Solution: declare the parameters as mutable.WrappedArray instead of Array, and convert them back with toArray when calling Vectors.dense.

        // define a UDF that computes the Euclidean distance
   import org.apache.spark.sql.functions._
    val osjl: UserDefinedFunction = udf((arr1:mutable.WrappedArray[Double], arr2:mutable.WrappedArray[Double]) =>{
       val v1: linalg.Vector = Vectors.dense(arr1.toArray)
       val v2: linalg.Vector = Vectors.dense(arr2.toArray)
        Vectors.sqdist(v1,v2) // sqdist = squared distance; the result is the squared Euclidean distance, which does not change the nearest-neighbor ordering
    })
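
The comment on Vectors.sqdist says the squared distance is good enough for ranking neighbors. A small check of that claim, as a sketch in plain Scala without Spark: the square root is monotonic, so sorting by squared distance and sorting by true distance give the same order. The object name `SquaredDistanceOrder` and the helper `ranking` are hypothetical; the vectors are taken from the sample data and the first unknown vector above.

```scala
object SquaredDistanceOrder {
  // Squared Euclidean distance: the same quantity Vectors.sqdist returns
  def sqdist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Indices of `points`, ordered from nearest to farthest from `query` under `dist`
  def ranking(query: Array[Double], points: Seq[Array[Double]])(
      dist: (Array[Double], Array[Double]) => Double): Seq[Int] =
    points.indices.sortBy(i => dist(points(i), query))

  def main(args: Array[String]): Unit = {
    val query = Array(11.0, 21.0, 31.0, 44.0, 32.0)
    val points = Seq(
      Array(10.0, 20.0, 30.0, 40.0, 30.0), // squared distance 23
      Array(12.0, 22.0, 29.0, 42.0, 35.0), // squared distance 19
      Array(30.0, 11.0, 21.0, 40.0, 34.0)  // squared distance 581
    )
    val bySquared = ranking(query, points)(sqdist)
    val byTrue = ranking(query, points)((a, b) => math.sqrt(sqdist(a, b)))
    println(bySquared == byTrue) // true
  }
}
```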

Call .select() on the combined table to compute the Euclidean distance between the feature columns of the two source tables

// compute the distance between each sample and each unknown vector
    import spark.implicits._
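val fra: DataFrame = zhj.select(
  $"label", // the $"..." syntax requires the implicit conversions from import spark.implicits._
  // col("label") works instead if you import org.apache.spark.sql.functions._
  $"id",
  // the array columns reach the UDF as WrappedArray, not as plain Array; declaring Array parameters causes the error
  osjl(array("f1", "f2", "f3", "f4", "f5"), array("b1", "b2", "b3", "b4", "b5")) as "dist"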
)

Create a temporary view and process it with Spark SQL

// for each unknown vector, find the 5 samples with the smallest distance
fra.createTempView("top")
    val top5 = spark.sql(
      """
        |
        |select
        |*
        |from
        |(
        |    select
        |    id,
        |    label,
        |    dist,
        |    row_number() over(partition by id order by dist)as rn
        |    from
        |    top
        |)t
        |where rn<6
        |order by id
        |
      """.stripMargin)

    // among the 5 nearest samples, find the label that occurs most often

top5.createTempView("top5")
spark.sql(
  """
    |
    |select
    |id,
    |label
    |from
    |top5
    |group by id,label
    |having count(1)> 2
    |
  """.stripMargin)

      .show(100,false)

    // stop the SparkSession
      spark.stop()
  }
}
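
To cross-check what the Spark job should return, the same steps (squared distance, 5 nearest per unknown vector, label with more than 2 votes) can be sketched with plain Scala collections. This is not the post's Spark code: the object `KnnSketch`, the helper `row`, and the ids 1 to 4 given to the unknown vectors are assumptions made for illustration, since the listing above shows the unknown data without an id column.

```scala
object KnnSketch {
  // Hypothetical helper: builds a (key, features) pair; key is a label or an id
  private def row(key: Int, fs: Double*): (Int, Array[Double]) = (key, fs.toArray)

  // (label, features) rows copied from the sample data above
  val samples: Seq[(Int, Array[Double])] = Seq(
    row(0, 10, 20, 30, 40, 30), row(0, 12, 22, 29, 42, 35), row(0, 11, 21, 31, 40, 34),
    row(0, 13, 22, 30, 42, 32), row(0, 12, 22, 32, 41, 33), row(0, 10, 21, 33, 45, 35),
    row(1, 30, 11, 21, 40, 34), row(1, 33, 10, 20, 43, 30), row(1, 30, 12, 23, 40, 33),
    row(1, 32, 10, 20, 42, 33), row(1, 30, 13, 20, 42, 30), row(1, 30, 9, 22, 41, 32)
  )

  // (id, features) rows; the ids 1..4 are assumed, not taken from the post
  val unknowns: Seq[(Int, Array[Double])] = Seq(
    row(1, 11, 21, 31, 44, 32), row(2, 14, 26, 32, 39, 30),
    row(3, 32, 14, 21, 42, 32), row(4, 34, 12, 22, 42, 34)
  )

  // Squared Euclidean distance, the quantity Vectors.sqdist computes
  def sqdist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Returns id -> label for every unknown vector whose k nearest samples
  // contain one label more than k / 2 times
  def classify(k: Int): Map[Int, Int] =
    unknowns.flatMap { case (id, b) =>
      val nearest = samples
        .map { case (label, f) => (label, sqdist(f, b)) }
        .sortBy(_._2)
        .take(k) // mirrors: row_number() over (... order by dist), where rn < 6
      nearest
        .groupBy(_._1)
        .collect { case (label, hits) if hits.size > k / 2 => id -> label } // mirrors: having count(1) > 2
    }.toMap

  def main(args: Array[String]): Unit =
    classify(5).toSeq.sortBy(_._1).foreach { case (id, label) =>
      println(s"id=$id label=$label")
    }
}
```

With k = 5 this assigns label 0 to the first two unknown vectors and label 1 to the last two, which is what the two clusters in the sample data suggest.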
Origin blog.csdn.net/weixin_45896475/article/details/104426110