[Machine learning] Naive Bayes principle and examples based on Spark

Naive Bayes Classification

The origin of the Bayes principle: Bayes wrote a paper on a problem called "inverse probability", trying to answer how to make a mathematically sound guess when there is not much reliable evidence.

Inverse probability: inverse probability is defined relative to forward probability. A forward-probability problem is easy to state: we already know that a bag contains N balls, each either black or white, and that M of them are black; if we reach in and draw one ball, what is the probability that it is black? That is a judgment made with full knowledge of the situation. In real life it is hard to know the whole picture. Starting from that practical situation, Bayes asked the opposite question: if we do not know the ratio of black to white balls in the bag in advance, how can we infer that ratio from the colors of the balls we have drawn?

It was to answer this question that Bayes proposed the Bayes principle. The Bayesian approach is quite different from other methods of statistical inference: it builds on subjective judgment. Even without knowing all the objective facts, you can first estimate a value and then keep revising it as actual results come in.

Several concepts in Bayesian principle:

Prior probability: the probability of an event judged from experience. For example, in the north the heavy-snow season runs from November to December; this is based on the experience of previous years, and the probability of heavy snow in this period is much higher than at other times.

Posterior probability: the probability of a cause inferred after the result has been observed. For example, if someone is diagnosed with heart disease, the cause may be A, B, or C; the probability that the heart disease was caused by A is a posterior probability. It is a kind of conditional probability.

Conditional probability: the probability that event A occurs given that another event B has already occurred, written P(A|B) and read as "the probability of A given B". For example, the probability of suffering from heart disease given cause A is a conditional probability.

Likelihood function

In mathematical statistics, the likelihood function is a function of the parameters of a statistical model; it expresses how likely the observed data are under given parameter values. You can think of training a probability model as the process of estimating its parameters. For example, consider tossing a coin: if we know the probability of heads is 0.5, we can compute the probability of any sequence of outcomes; for instance, the probability of getting heads on two tosses in a row is 0.25.
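Conversely, if the heads probability θ is unknown and we observe, say, 7 heads in 10 tosses, the likelihood function is L(θ) = θ^7 * (1 - θ)^3; the value of θ that maximizes it, the maximum likelihood estimate, is θ = 7/10 = 0.7.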

The Bayes principle: essentially, the Bayes principle is about computing the posterior probability. Bayes' formula is as follows:

P(A|B) = P(B|A) * P(A) / P(B)

where P(A) is the prior probability, P(B|A) is the likelihood of observing B when A holds, and P(A|B) is the posterior probability we want.

Naive Bayes:

Naive Bayes classification is a commonly used Bayesian classification method. It is a simple but surprisingly powerful predictive modeling algorithm. It is called "naive" because it assumes that the input variables are independent of each other.

The Naive Bayes model consists of two types of probabilities:

  1. The probability of each category P(Cj);
  2. The conditional probability of each attribute P(Ai|Cj).

In Naive Bayes, the key quantities that have to be estimated from the training data are the conditional attribute probabilities, namely P(Ai|Cj).

Because an instance usually has several attributes A = (A1, A2, ..., An), what we actually need is

P(Cj | A1, A2, ..., An) = P(A1, A2, ..., An | Cj) * P(Cj) / P(A1, A2, ..., An)

Because Naive Bayes assumes the attributes are conditionally independent given the class,

P(A1, A2, ..., An | Cj) = P(A1|Cj) * P(A2|Cj) * ... * P(An|Cj)

and so, since the denominator is the same for every class, the predicted class is simply the Cj that maximizes

P(Cj) * P(A1|Cj) * P(A2|Cj) * ... * P(An|Cj)
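To make the counting concrete, here is a minimal plain-Java sketch of this decision rule (the class TinyNaiveBayes, its tiny data set, and the query values are made up purely for illustration; the Spark version appears later in this post):

// Minimal, illustration-only Naive Bayes "by counting" in plain Java.
public class TinyNaiveBayes {
    public static void main(String[] args) {
        // Each row: attribute values A1..An followed by the class label Cj.
        int[][] data = {
            {0, 0, 0},
            {0, 1, 0},
            {1, 1, 1},
            {1, 0, 1},
            {1, 1, 1}
        };
        int n = 2;                 // number of attributes
        int numClasses = 2;
        int[] query = {1, 0};      // instance to classify

        int bestClass = -1;
        double bestScore = -1.0;
        for (int c = 0; c < numClasses; c++) {
            int classCount = 0;
            for (int[] row : data) if (row[n] == c) classCount++;
            double score = (double) classCount / data.length;            // P(Cj)
            for (int i = 0; i < n; i++) {
                int match = 0;
                for (int[] row : data)
                    if (row[n] == c && row[i] == query[i]) match++;
                score *= classCount == 0 ? 0.0 : (double) match / classCount;  // P(Ai|Cj)
            }
            System.out.println("class " + c + ": P(Cj) * product of P(Ai|Cj) = " + score);
            if (score > bestScore) { bestScore = score; bestClass = c; }
        }
        System.out.println("predicted class = " + bestClass);
    }
}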

Naive Bayes classification is often used in text classification, such as spam text filtering, sentiment prediction, recommendation systems, etc.

Commonly used Naive Bayes algorithms:

Gaussian Naive Bayes: the feature variables are continuous and assumed to follow a Gaussian distribution, for example a person's height or the length of an object. For a continuous attribute we use a probability density function: assume p(xi|c) ~ N(μ(c,i), σ²(c,i)), where μ(c,i) and σ²(c,i) are the mean and variance of the i-th attribute over the samples of class c. Then

p(xi|c) = 1 / (sqrt(2π) * σ(c,i)) * exp( -(xi - μ(c,i))² / (2 * σ²(c,i)) )

Multinomial Naive Bayes: the feature variables are discrete and assumed to follow a multinomial distribution. In document classification, the feature is typically the number of times a word appears, or the word's TF-IDF value.

Bernoulli Naive Bayes: the feature variables are Boolean and follow a 0/1 (Bernoulli) distribution. In document classification, the feature is whether a word appears or not.

Bernoulli Naive Bayes works at the granularity of the document: a feature is 1 if the word appears in the document and 0 otherwise. Multinomial Naive Bayes works at the granularity of the word and counts how many times it appears in a document. Gaussian Naive Bayes is suitable when the feature variables are continuous and follow a normal (Gaussian) distribution; naturally continuous quantities such as height and weight are better handled by Gaussian Naive Bayes. Text classification usually uses Multinomial or Bernoulli Naive Bayes.

Complement Naive Bayes: Complement Naive Bayes (ComplementNB, CNB) is an adaptation of the standard Multinomial Naive Bayes (MNB) algorithm that is particularly suited to imbalanced data sets.
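In Spark ML these variants are selected through the modelType parameter of org.apache.spark.ml.classification.NaiveBayes; note that the "gaussian" and "complement" types require Spark 3.0 or later. A minimal sketch of just the relevant configuration lines (not a full program):

import org.apache.spark.ml.classification.NaiveBayes;

// "multinomial" is the default; "bernoulli", "gaussian" and "complement"
// select the other variants described above (the last two need Spark 3.0+).
NaiveBayes nb = new NaiveBayes()
        .setModelType("multinomial");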

Practical tips:

  1. If a continuous feature is not normally distributed, transform it (for example with a log or Box-Cox transform) into something closer to a normal distribution before using Gaussian Naive Bayes.
  2. If the test data has a "zero frequency" problem (a feature value never seen during training), apply a smoothing technique such as Laplace estimation to correct the estimates.

Zero-probability problem: when computing the probability of an instance, if some value x never appeared in the observed samples (the training set), the estimated probability of the whole instance becomes 0. In text classification, if a word never appears in the training samples, its conditional probability is 0, and because the probabilities are multiplied together, the probability of the whole document also becomes 0. This is unreasonable: you cannot assume that an event has probability zero just because it was never observed.

  1. Deleting duplicated or highly correlated features may lose frequency information and hurt the results.
  2. Naive Bayes classification has few hyperparameters to tune, so it is better to focus on data preprocessing and feature selection.

Laplace smoothing:

The idea is to add 1 to the count of every possible feature value under each class. When the training set is large enough this barely changes the estimates, but it avoids the awkward zero-probability situation. With smoothing parameter λ (λ = 1 gives Laplace smoothing), the conditional probability is estimated as

P(Xj = ajl | Y = ck) = ( count(Xj = ajl and Y = ck) + λ ) / ( count(Y = ck) + Sj * λ )

where ajl denotes the l-th possible value of the j-th feature, Sj denotes the number of possible values of the j-th feature, and K denotes the number of classes. The class prior is smoothed in the same way:

P(Y = ck) = ( count(Y = ck) + λ ) / ( N + K * λ )

This is easy to understand: after adding Laplace smoothing, no probability is ever 0, every estimate stays between 0 and 1, and the smoothed probabilities still sum to 1. For example:

In a toy "marry or not" data set, suppose the feature "looks" takes two values, handsome and not handsome, so Sj = 2. If "handsome" appears 3 times among the 6 samples whose class is "marry", the smoothed estimate is P(looks handsome | marry) = (3 + 1) / (6 + 2) = 4/8.
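In Spark ML, Laplace smoothing corresponds to the smoothing parameter of org.apache.spark.ml.classification.NaiveBayes, whose default value is 1.0; a minimal sketch of setting it explicitly (just the relevant configuration lines, not a full program):

import org.apache.spark.ml.classification.NaiveBayes;

// smoothing = 1.0 is add-one (Laplace) smoothing; 0.0 disables it.
NaiveBayes nb = new NaiveBayes()
        .setSmoothing(1.0);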

A worked Naive Bayes example:

The following data set is given:

Serial number    Age            Work    House    Loan         Class
1                middle aged    no      no       general      no
2                middle aged    no      no       good         no
3                middle aged    yes     yes      good         yes
4                middle aged    no      yes      very good    yes
5                elderly        no      yes      very good    yes

Now suppose someone whose Age is middle aged, Work is no, House is no, and Loan is general. Which class does this person belong to?

Analysis: using the Naive Bayes formula, the problem can be transformed into:

P(Class | Age, Work, House, Loan)
= P(Age, Work, House, Loan | Class) * P(Class) / P(Age, Work, House, Loan)
= P(Age|Class) * P(Work|Class) * P(House|Class) * P(Loan|Class) * P(Class) / P(Age, Work, House, Loan)

Since the denominator P(Age, Work, House, Loan) is the same for every class, it can be dropped when comparing classes; we only need to compare P(Age|Class) * P(Work|Class) * P(House|Class) * P(Loan|Class) * P(Class) for Class = no and Class = yes.
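As a quick hand calculation using the table above (a sketch, with the query Age = middle aged, Work = no, House = no, Loan = general):

P(Class = no) = 2/5 and P(Class = yes) = 3/5.

For Class = no: P(Age = middle aged | no) = 2/2, P(Work = no | no) = 2/2, P(House = no | no) = 2/2, P(Loan = general | no) = 1/2, so the score is 1 * 1 * 1 * 1/2 * 2/5 = 0.2.

For Class = yes: P(House = no | yes) = 0/3 and P(Loan = general | yes) = 0/3, so the score is 0 (this is exactly the zero-probability situation that Laplace smoothing addresses).

The larger score belongs to Class = no, so this person is predicted to be in class no.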

1. Data processing

   First convert each feature value into a numeric code:

Description      Code
middle aged      0
elderly          1
no               0
yes              1
very good        0
good             1
general          2

The table above is then expressed as:

Serial number    Age    Work    House    Loan    Class
1                0      0       0        2       0
2                0      0       0        1       0
3                0      1       1        1       1
4                0      0       1        0       1
5                1      0       1        0       1

Code example:

Input: naviebays.txt

0 0 0 2 0

0 0 0 1 0

0 1 1 1 1

0 0 1 0 1

1 0 1 0 1



package sparkmlNaiveBayes;

import java.util.*;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.ml.classification.NaiveBayes;
import org.apache.spark.ml.classification.NaiveBayesModel;
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class navieBayes {

    public static void main(String[] args) {

        SparkConf spconf = new SparkConf().setMaster("local[4]").setAppName("testNaiveByes");
        SparkSession spsession = SparkSession.builder().config(spconf).getOrCreate();

        // Build the schema: four integer feature columns and an integer label column.
        String schemaString = "a b c d label";
        List<StructField> fields = new ArrayList<>();
        for (String fieldName : schemaString.split(" ")) {
            StructField field = DataTypes.createStructField(fieldName, DataTypes.IntegerType, true);
            fields.add(field);
        }
        StructType schema = DataTypes.createStructType(fields);

        // Read the text file and parse each line "a b c d label" into a Row of integers.
        JavaRDD<String> javaRDDstr = spsession.sparkContext().textFile(".\\naviebays.txt", 2).toJavaRDD();
        JavaRDD<Row> rowRDD = javaRDDstr.map((Function<String, Row>) r -> {
            String[] attributes = r.split(" ");
            int[] arr = new int[attributes.length];
            for (int i = 0; i < attributes.length; i++) {
                arr[i] = Integer.parseInt(attributes[i]);
            }
            return RowFactory.create(arr[0], arr[1], arr[2], arr[3], arr[4]);
        });

        Dataset<Row> df1 = spsession.createDataFrame(rowRDD, schema);
        df1.show();

        // Assemble the four feature columns into a single vector column "items".
        VectorAssembler v = new VectorAssembler();
        v.setInputCols(new String[]{"a", "b", "c", "d"});
        v.setOutputCol("items");
        Dataset<Row> df2 = v.transform(df1);
        df2.show(false);

        // Train a Naive Bayes model on the assembled feature vectors.
        NaiveBayes nb = new NaiveBayes();
        nb.setFeaturesCol("items");
        nb.setLabelCol("label");
        NaiveBayesModel model = nb.fit(df2);

        // Apply the model back to the data and inspect the predictions.
        Dataset<Row> predictions = model.transform(df2);
        predictions.show(false);

        // Evaluate accuracy (here on the training set itself, since the data set is tiny).
        MulticlassClassificationEvaluator evaluator = new MulticlassClassificationEvaluator()
                .setLabelCol("label")
                .setPredictionCol("prediction")
                .setMetricName("accuracy");
        double accuracy = evaluator.evaluate(predictions);
        System.out.println("Test set accuracy = " + accuracy);

        spsession.stop();
    }
}
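To classify the person from the worked example with this trained model, a small extra step (a sketch, assuming the query encodes to a = 0, b = 0, c = 0, d = 2, and reusing the schema, v and model variables above) could be appended inside main() before spsession.stop():

// Hypothetical extra step: score the query row with the trained model.
// The label value 0 is only a placeholder required by the 5-column schema.
List<Row> queryRows = Arrays.asList(RowFactory.create(0, 0, 0, 2, 0));
Dataset<Row> queryDf = spsession.createDataFrame(queryRows, schema);
Dataset<Row> queryWithFeatures = v.transform(queryDf);
model.transform(queryWithFeatures).select("items", "prediction").show(false);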

 

Origin blog.csdn.net/henku449141932/article/details/110368360