libsvm data format:
The training and test data files used by libsvm have the following format:
[label] [index1]:[value1] [index2]:[value2] …
[label] [index1]:[value1] [index2]:[value2] …
label is the target value, i.e., the class the sample belongs to (which category it should be classified into); it is usually an integer.
index is the feature number, a positive integer; indices must appear in ascending order.
value is the feature value, i.e., the data used for training, usually a real number.
That is:
target-value index-of-feature-1:value-of-feature-1 index-of-feature-2:value-of-feature-2 …
target-value index-of-feature-1:value-of-feature-1 index-of-feature-2:value-of-feature-2 …
……
target-value index-of-feature-1:value-of-feature-1 index-of-feature-2:value-of-feature-2 …
For example: 5 1:0.6875 2:0.1875 3:0.015625 4:0.109375
This sample has 4 feature dimensions: the first is 0.6875, the second is 0.1875, the third is 0.015625, and the fourth is 0.109375; the target value is 5.
Note: the training and test data must use the same format, as shown above. The target value in the test data is used only to compute the prediction error.
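To make the mapping concrete, here is a minimal sketch of producing such a line by hand (the formatLibsvm helper is my own illustration, not part of any library):
object LibsvmLineDemo {
  // Hypothetical helper: render one sample as a libsvm-format line.
  // Indices are 1-based and emitted in ascending order, as the format requires.
  def formatLibsvm(label: Double, features: Array[Double]): String = {
    val parts = features.zipWithIndex.map { case (v, i) => s"${i + 1}:$v" }
    (label.toString +: parts).mkString(" ")
  }

  def main(args: Array[String]): Unit = {
    // Reproduces the example line above (the label prints as 5.0 because it is a Double):
    // 5.0 1:0.6875 2:0.1875 3:0.015625 4:0.109375
    println(formatLibsvm(5.0, Array(0.6875, 0.1875, 0.015625, 0.109375)))
  }
}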
Dependencies:
<properties>
    <scala.version>2.11.8</scala.version>
    <spark.version>2.2.0</spark.version>
</properties>
<repositories>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.6.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.6.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib-local_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.47</version>
    </dependency>
</dependencies>
Code example:
package com.spark.milib

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object SqlDataToLibsvm {
  def main(args: Array[String]): Unit = {
    val sparkSession: SparkSession = SparkSession.builder()
      .appName("test")
      .master("local[4]")
      .getOrCreate()

    // Read the source table from MySQL over JDBC.
    val dataFrame: DataFrame = sparkSession.read.format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/test11")
      .option("dbtable", "tablename2")
      .option("user", "root")
      .option("password", "123")
      .load()

    dataFrame.createTempView("dataFrame")
    val frame: DataFrame = sparkSession.sql(
      "select T_FACTOR,MP_ID,TG_ID,SJD,P0,P1,P2,P3,I1,I2,I3,U1,U2,U3 from dataFrame")

    val rdd: RDD[Row] = frame.rdd
    val data = rdd.map { line =>
      // Here line prints as: [22.0,11.0,11.0,12.0,22.0,33.0,22.0,23.0,24.0,11.0,23.0,13.0,25.0,23.0]
      // Strip the surrounding brackets; substring's end index is exclusive,
      // so length - 1 removes only the closing bracket.
      val str = line.toString()
      val lineStr: String = str.substring(1, str.length - 1)
      val values = lineStr.split(",").map(_.toDouble)
      // init returns every element except the last; use them as the feature vector.
      // Vectors.dense builds a dense vector from the array.
      val feature = Vectors.dense(values.init)
      val label = values.last
      LabeledPoint(label, feature)
    }

    MLUtils.saveAsLibSVMFile(data, "D:\\test")
    sparkSession.stop()
  }
}
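To sanity-check the result, the saved files can be read back with MLUtils.loadLibSVMFile. A minimal sketch, assuming the same "D:\\test" output path as above:
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.SparkSession

object CheckLibsvmOutput {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("check").master("local[4]").getOrCreate()
    // Load the libsvm files written by SqlDataToLibsvm back into an RDD[LabeledPoint].
    val points = MLUtils.loadLibSVMFile(spark.sparkContext, "D:\\test")
    // Print a few samples; features come back as sparse vectors,
    // e.g. (23.0,(13,[0,1,...],[22.0,11.0,...]))
    points.take(5).foreach(println)
    spark.stop()
  }
}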