Spark reads MySQL (or Oracle) data and saves it in libsvm format

libsvm data format:

The training data and test data file formats used by libsvm are as follows:

 [label] [index1]:[value1] [index2]:[value2] …
 [label] [index1]:[value1] [index2]:[value2] …

label is the target value, i.e. the class (which category the sample belongs to) you want to predict; it is usually an integer.

index is the feature number: a sequential, usually consecutive, integer, and the indices must be arranged in ascending order.

value is the feature value, the data used for training, usually a real number.

which is:

target value   index of feature 1:value of feature 1   index of feature 2:value of feature 2 …

target value   index of feature 1:value of feature 1   index of feature 2:value of feature 2 …

…

target value   index of feature 1:value of feature 1   index of feature 2:value of feature 2 …

For example: 5 1:0.6875 2:0.1875 3:0.015625 4:0.109375

This means the sample has 4 feature dimensions: the first is 0.6875, the second 0.1875, the third 0.015625, and the fourth 0.109375; the target value is 5.

Note: training and test data must use the same format, as shown above. The target value in the test data is used to compute the prediction error.
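
As a quick check of the format, Spark's MLUtils.loadLibSVMFile can parse a file like this back into LabeledPoints; a minimal sketch, where the path sample.libsvm is a hypothetical local file in the format above:

import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.SparkSession

object LibsvmCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("libsvm-check").master("local[2]").getOrCreate()
    // parses lines of the form "label index1:value1 index2:value2 ..." into RDD[LabeledPoint]
    // note: the 1-based indices in the file become 0-based indices in the resulting sparse vectors
    val points = MLUtils.loadLibSVMFile(spark.sparkContext, "sample.libsvm")
    points.take(3).foreach(println)
    spark.stop()
  }
}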

Dependencies:

    <properties>
        <scala.version>2.11.8</scala.version>
        <spark.version>2.2.0</spark.version>
    </properties>
    <repositories>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
    </repositories>
    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.6.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.6.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib-local_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.47</version>
        </dependency>
    </dependencies>
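
For reference, if the project is built with sbt instead of Maven, a roughly equivalent dependency list (same versions as the POM above) would be:

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.2.0",
  "org.apache.spark" %% "spark-sql"   % "2.2.0",
  "org.apache.spark" %% "spark-mllib" % "2.2.0",
  "mysql"            %  "mysql-connector-java" % "5.1.47"
)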

Code example:

package com.spark.milib

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object SqlDataToLibsvm {

  def main(args: Array[String]): Unit = {

    val sparkSession: SparkSession = SparkSession.builder().appName("test").master("local[4]").getOrCreate()

    val dataFrame: DataFrame = sparkSession.read.format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/test11")
      .option("dbtable", "tablename2")
      .option("user", "root")
      .option("password", "123")
      .load()

    dataFrame.createTempView("dataFrame")
    val frame: DataFrame = sparkSession.sql("select T_FACTOR,MP_ID,TG_ID,SJD,P0,P1,P2,P3,I1,I2,I3,U1,U2,U3 from dataFrame")

    val rdd: RDD[Row] = frame.rdd

    val data = rdd.map { line =>
      // at this point line looks like: [22.0,11.0,11.0,12.0,22.0,33.0,22.0,23.0,24.0,11.0,23.0,13.0,25.0,23.0]
      // strip the enclosing square brackets
      val lineStr: String = line.toString().substring(1, line.toString().length - 1)
      val values = lineStr.split(",").map(_.toDouble)
      // init returns every element except the last one, used here as the feature vector
      // Vectors.dense builds a dense vector from it; the last column becomes the label
      val feature = Vectors.dense(values.init)
      val label = values.last
      LabeledPoint(label, feature)
    }

    MLUtils.saveAsLibSVMFile(data,"D:\\test")
  }
}
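
Note that parsing line.toString() is brittle: it breaks if any column is NULL or its string form contains a comma or brackets. A more robust variant of the map step, sketched under the assumption that every selected column is numeric, reads the values straight out of the Row:

    val data = rdd.map { row =>
      // convert each column value to Double directly, instead of parsing the Row's string form
      val values = row.toSeq.map(v => v.toString.toDouble).toArray
      // all but the last column become the feature vector; the last column is the label
      LabeledPoint(values.last, Vectors.dense(values.init))
    }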

Result:

After running, the directory D:\test contains the data as libsvm-formatted part files.

Origin: blog.csdn.net/weixin_44455388/article/details/107342126