The Simplest Spark Tutorial in History, Chapter XXIII - Running Your First Machine Learning Program: Java and Python Code Examples

[Author's note]
This article was written by Zhang Yaofeng, drawing on his own production experience and organized into an easy-to-understand form.
Writing is not easy; please credit the source when reproducing. Thank you!
Code examples: https://github.com/Mydreamandreality/sparkResearch


Run your first machine learning program

Let's develop our first machine learning program together.
The case we will build is also very simple: automatically identifying spam email.
Before we start coding, let's defuse a few pitfalls you may run into ↓

System dependencies

If MLlib warns at runtime that it cannot find the gfortran libraries, you need to install them.
MLlib uses the linear algebra package Breeze, which depends on netlib-java and jblas; netlib-java and jblas in turn rely on native Fortran routines, so the gfortran runtime library must be installed.
If you develop Spark applications in Python, you need NumPy version 1.4 or later; NumPy is the foundational machine learning library for Python.

On CentOS you can install the runtime with: sudo yum install libgfortran.x86_64
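
If you want a quick sanity check of the Python requirement, you can print the NumPy version from the interpreter that will run your Spark jobs. This little check is my own addition, not part of the tutorial code:

# hedged sanity check: MLlib's Python API needs NumPy >= 1.4
import numpy
print(numpy.__version__)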



Coding ideas

  • We will roughly need the following steps:
    • First, represent the email messages as an RDD of strings
    • Run one of MLlib's feature extraction algorithms to convert the text into numeric feature values; this operation returns an RDD of vectors
    • Call a classification algorithm (logistic regression) on the RDD of vectors; this returns a model object that can be used to classify new data points
    • Use one of MLlib's evaluation functions to evaluate the model on a test data set

Java code example:

package v1;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.classification.LogisticRegressionModel;
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.mllib.feature.HashingTF;

import java.util.Arrays;
import java.util.regex.Pattern;

/**
 * Created by 張燿峰 (Zhang Yaofeng)
 * Introductory machine learning example:
 * spam email filtering
 *
 * @author 孤
 * @date 2019/5/7
 * @version 1.0
 */
public class SpamEmail {
    private static final Pattern SPACE = Pattern.compile(" ");

    public static void main(String[] args) {
        SparkSession sparkSession = SparkSession.builder().appName("spam-email").master("local[2]").getOrCreate();
        JavaSparkContext javaSparkContext = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());

        //spam email data
        JavaRDD<String> spamEmail = javaSparkContext.textFile("spam.json");
        //normal (non-spam) email data
        JavaRDD<String> normalEmail = javaSparkContext.textFile("normal.json");

        //create a HashingTF instance to map email text to vectors of 10,000 features
        final HashingTF hashingTF = new HashingTF(10000);

        //label each spam message with class 0, using its term-frequency vector as features
        JavaRDD<LabeledPoint> spamExamples = spamEmail.map(new Function<String, LabeledPoint>() {
            @Override
            public LabeledPoint call(String v1) throws Exception {
                return new LabeledPoint(0, hashingTF.transform(Arrays.asList(SPACE.split(v1))));
            }
        });

        //label each normal message with class 1
        JavaRDD<LabeledPoint> normaExamples = normalEmail.map(new Function<String, LabeledPoint>() {
            @Override
            public LabeledPoint call(String v1) throws Exception {
                return new LabeledPoint(1, hashingTF.transform(Arrays.asList(SPACE.split(v1))));
            }
        });

        //training data
        JavaRDD<LabeledPoint> trainData = spamExamples.union(normaExamples);
        trainData.cache();      //logistic regression is iterative, so cache the data first

        //logistic regression trained with stochastic gradient descent (SGD)
        LogisticRegressionModel model = new LogisticRegressionWithSGD().run(trainData.rdd());

        Vector spamModel = hashingTF.transform(Arrays.asList(SPACE.split("垃 圾 钱 恶 心 色 情 赌 博 毒 品 败 类 犯罪")));
        Vector normaModel = hashingTF.transform(Arrays.asList(SPACE.split("work 工作 你好 我们 请问 时间 领导")));
        System.out.println("Prediction for the negative (spam) example: " + model.predict(spamModel));
        System.out.println("Prediction for the positive (normal) example: " + model.predict(normaModel));

    }
}

This program uses two MLlib classes: HashingTF and LogisticRegressionWithSGD. The former builds term-frequency feature vectors from text data; the latter implements logistic regression trained with stochastic gradient descent (SGD). We assume we start with two files, spam.json and normal.json, which contain examples of spam and non-spam email respectively, one message per line. We convert the text in each file into a term-frequency feature vector and then train a logistic regression model to separate the two kinds of messages.
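
To make the hashing trick behind HashingTF more concrete, here is a minimal local Python sketch of my own (MLlib's HashingTF can transform a plain Python list of terms as well as an RDD; the terms below are arbitrary):

from pyspark.mllib.feature import HashingTF

tf = HashingTF(numFeatures=10000)
# each distinct term is hashed to a fixed index in [0, numFeatures)
print(tf.indexOf("spam"))
# transform counts term frequencies and stores them as a sparse vector
print(tf.transform(["spam", "spam", "hello"]))  # counts: "spam" -> 2.0, "hello" -> 1.0

The point of the hashing trick is that no dictionary of terms needs to be built or stored: the vector index of a term is simply its hash modulo the number of features.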


Python code example:

# assumes a running SparkContext `sc`, e.g. from the pyspark shell
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.classification import LogisticRegressionWithSGD

spam = sc.textFile("spam.txt")
normal = sc.textFile("normal.txt")

tf = HashingTF(numFeatures = 10000)

spamFeatures = spam.map(lambda email: tf.transform(email.split(" ")))
normalFeatures = normal.map(lambda email: tf.transform(email.split(" ")))

positiveExamples = spamFeatures.map(lambda features: LabeledPoint(1, features))
negativeExamples = normalFeatures.map(lambda features: LabeledPoint(0, features))

trainingData = positiveExamples.union(negativeExamples)
trainingData.cache()  # logistic regression is iterative, so cache the training data

# run logistic regression using the SGD algorithm
model = LogisticRegressionWithSGD.train(trainingData)

# apply the same HashingTF transform to get feature vectors, then run the model on them
posTest = tf.transform("垃 圾 钱 恶 心 色 情 赌 博 毒 品 败 类 犯罪".split(" "))
negTest = tf.transform("work 工作 你好 我们 请问 时间 领导".split(" "))

print("Prediction for positive test example: %g" % model.predict(posTest))
print("Prediction for negative test example: %g" % model.predict(negTest))
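
The coding ideas above list a fourth step, evaluating the model on a test data set, which neither code example actually performs. Below is a minimal sketch of that step, reusing the trainingData RDD from the Python example; the 80/20 split and the choice of MLlib's BinaryClassificationMetrics are my own assumptions, not part of the original tutorial:

from pyspark.mllib.evaluation import BinaryClassificationMetrics

# hold out part of the labeled data for testing
train, test = trainingData.randomSplit([0.8, 0.2], seed=42)
model = LogisticRegressionWithSGD.train(train)

# pair each predicted class with the true label: (score, label)
scoreAndLabels = test.map(lambda p: (float(model.predict(p.features)), p.label))

metrics = BinaryClassificationMetrics(scoreAndLabels)
print("Area under ROC: %g" % metrics.areaUnderROC)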

Summary

References: the Spark documentation

  • A machine learning algorithm attempts to maximize some mathematically defined objective based on its behavior on the training data, and uses what it learns to make predictions or decisions
  • Machine learning problems fall into several categories, including classification, regression, and clustering, each with a different goal. Take classification as a simple example: classification means identifying which of several categories a data point belongs to (such as judging whether a message is spam), based on other data points that have already been labeled (such as messages already marked as spam or non-spam). Every learning algorithm requires a feature set to be defined for each data point; the feature values are what get passed to the learning function.
    Most algorithms are defined only for numeric features (specifically, a vector of numbers, one value per feature), so extracting features and transforming them into feature vectors is a very important step in machine learning
  • For example, a classification algorithm may define a plane in the space of feature vectors that "best" separates spam from non-spam, where "best" has some given definition (for example, most data points are classified correctly by the plane). At the end of its run, the learning algorithm returns a model representing its decisions (such as the chosen plane), and this model can be used to make predictions on new points (for example, deciding which side of the plane a new email's feature vector falls on, and hence whether it is spam); a small sketch of this idea follows below
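
As a hedged illustration of the "side of the plane" idea, using the model and posTest vector from the Python example above: the logistic regression model exposes a weight vector and an intercept that define the hyperplane w · x + b = 0, and with the default 0.5 threshold predict() effectively reports which side of it a point falls on:

# the learned hyperplane is w . x + b = 0
w, b = model.weights, model.intercept
margin = posTest.dot(w) + b

# predict() returns 1 exactly when the margin is positive (default 0.5 threshold)
print(margin > 0, model.predict(posTest))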

Machine learning itself is a big topic; the hardest part is taking the first step.
Now that you have successfully taken that first step, keep it up!

Origin: blog.csdn.net/youbitch1/article/details/89922757