Implementing information classification in Java with the naive Bayes algorithm

Table of contents
  • Bayesian classification algorithm 
  • Code example
    • Dataset data.txt
    • Code
    • Output result
  • Use cases

Bayesian classification algorithm 

The Bayesian classification algorithm is a statistical classification method: it uses probability theory and statistics to assign classes. In many settings, the naive Bayes (NB) classifier is comparable to decision tree and neural network classifiers. The method scales to large databases, is simple to implement, and achieves high classification accuracy at high speed.

Naive Bayes assumes that the effect of an attribute value on a given class is independent of the values of the other attributes. Because this assumption often does not hold in practice, classification accuracy can suffer. For this reason, many Bayesian classification algorithms that relax the independence assumption have been derived, such as the TAN (tree-augmented Bayes network) algorithm.

So, given that it is the naive Bayes classification algorithm, what is its core?

It is Bayes' theorem:

P(B|A) = P(A|B) × P(B) / P(A)

Rewriting it in terms of features and categories makes it much clearer:

P(category|features) = P(features|category) × P(category) / P(features)

Once we can compute P(category|features), our task is essentially done.
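As a quick sanity check, the rewritten formula can be evaluated directly. A minimal sketch, with probability values made up purely for illustration:

```java
public class BayesExample {
    /** Bayes' theorem: P(category|feature) = P(feature|category) * P(category) / P(feature). */
    static double posterior(double pFeatureGivenCategory, double pCategory, double pFeature) {
        return pFeatureGivenCategory * pCategory / pFeature;
    }

    public static void main(String[] args) {
        // Hypothetical values: P(feature|category) = 0.6, prior P(category) = 0.5,
        // evidence P(feature) = 0.4; the posterior works out to 0.75.
        System.out.println(posterior(0.6, 0.5, 0.4));
    }
}
```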

code example

Let's take girls choosing a partner as an example, and extract several key characteristics — appearance, personality, height, self-motivation, and assets — as the mate-selection features. Through prior research, we obtain a set of data samples, i.e. a dataset of feature values together with the selection result (the class). From this dataset, the naive Bayes function computes a score for each feature combination under each category, and the category with the largest score is taken as the classification. Since this is a probabilistic calculation, it is not always accurate; the larger the dataset, the higher the accuracy.
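The idea can be sketched with a toy, single-feature version of the problem (the in-memory samples below are hypothetical, not the dataset that follows): for each class, multiply the class prior by the class-conditional feature probability, then pick the class with the larger score.

```java
import java.util.List;

public class TinyNaiveBayes {
    // Each sample: {feature, label}. Hypothetical toy data.
    static final List<String[]> DATA = List.of(
            new String[]{"handsome", "likes"},
            new String[]{"handsome", "likes"},
            new String[]{"not handsome", "likes"},
            new String[]{"handsome", "dislikes"},
            new String[]{"not handsome", "dislikes"},
            new String[]{"not handsome", "dislikes"});

    /** Numerator of Bayes' theorem: P(label) * P(feature|label). */
    static double score(String feature, String label) {
        long inClass = DATA.stream().filter(s -> s[1].equals(label)).count();
        long match = DATA.stream()
                .filter(s -> s[1].equals(label) && s[0].equals(feature)).count();
        return ((double) inClass / DATA.size()) * ((double) match / inClass);
    }

    /** Pick the label whose score is larger. */
    static String classify(String feature) {
        return score(feature, "likes") > score(feature, "dislikes") ? "likes" : "dislikes";
    }

    public static void main(String[] args) {
        System.out.println(classify("handsome"));
        System.out.println(classify("not handsome"));
    }
}
```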

Dataset data.txt

Each line of the following dataset is one sample. The features within each sample are separated by commas ",", in the following order:

appearance, personality, height, self-motivation, assets, result (whether the girl likes the candidate)

handsome,good,tall,motivated,rich,likes
not handsome,good,tall,motivated,rich,likes
handsome,not good,tall,motivated,rich,likes
handsome,good,not tall,motivated,rich,likes
handsome,good,tall,not motivated,rich,likes
handsome,good,tall,motivated,not rich,likes
handsome,good,not tall,not motivated,rich,dislikes
not handsome,not good,not tall,motivated,rich,likes
not handsome,not good,not tall,motivated,not rich,dislikes
handsome,good,not tall,motivated,not rich,likes
not handsome,good,tall,not motivated,rich,dislikes
handsome,not good,tall,motivated,rich,dislikes
not handsome,good,tall,motivated,rich,dislikes
handsome,not good,tall,motivated,not rich,likes
handsome,not good,tall,not motivated,rich,likes
handsome,good,tall,motivated,not rich,dislikes
handsome,good,not tall,motivated,rich,likes
not handsome,not good,not tall,not motivated,rich,dislikes
handsome,good,tall,motivated,not rich,likes
handsome,good,not tall,not motivated,rich,likes
handsome,good,tall,not motivated,not rich,dislikes
handsome,not good,tall,not motivated,rich,dislikes

Code


import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.*;
import java.util.stream.Collectors;

/**
 * @author liuya
 */
public class NaiveBayesModel {

    // Sample data, one row per line of data.txt
    private static List<List<String>> data = new ArrayList<>();
    // Deduplicated sample rows
    private static Set<List<String>> dataSet = new HashSet<>();
    // Classification model: "feature1-...-feature5" -> predicted category
    public static Map<String, String> modelMap = new HashMap<>();
    // Path to the sample dataset
    private static String path = "./src/data.txt";

    public static void main(String[] args) {
        // Train the model
        trainingModel();
        // Classify two feature vectors
        classification("handsome", "good", "tall", "motivated", "rich");
        classification("not handsome", "not good", "not tall", "not motivated", "not rich");
    }

    /**
     * Load the sample data.
     * @param path path of the dataset file
     */
    public static void readData(String path) {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(path)))) {
            String str;
            while ((str = br.readLine()) != null) {
                List<String> row = new ArrayList<>();
                for (String feature : str.split(",")) {
                    row.add(feature.trim());
                }
                dataSet.add(row);
                data.add(row);
            }
        } catch (Exception e) {
            e.printStackTrace();
            System.out.println("Error while reading the file!");
        }
    }

    public static void trainingModel() {
        readData(path);
        String category1 = "likes";
        String category2 = "dislikes";
        dataSet.forEach(e -> {
            double categoryP1 = calculateBayesian(e.get(0), e.get(1), e.get(2), e.get(3), e.get(4), category1);
            double categoryP2 = calculateBayesian(e.get(0), e.get(1), e.get(2), e.get(3), e.get(4), category2);
            String result = categoryP1 > categoryP2 ? category1 : category2;
            modelMap.put(e.get(0) + "-" + e.get(1) + "-" + e.get(2) + "-" + e.get(3) + "-" + e.get(4), result);
        });
    }

    /**
     * Classify a feature vector by looking it up in the trained model.
     */
    public static void classification(String look, String character, String height, String progresses, String wealthy) {
        String key = look + "-" + character + "-" + height + "-" + progresses + "-" + wealthy;
        String result = modelMap.get(key);
        System.out.println("For a candidate whose features are " + look + "," + character + ","
                + height + "," + progresses + "," + wealthy + ", the girl " + result);
    }

    /**
     * The core of classification is comparing the naive Bayes scores: the category with the
     * larger score wins (there is some error; the larger the dataset, the higher the accuracy).
     * Since the denominator is the same for every category, comparing the numerators is enough.
     */
    public static double calculateBayesian(String look, String character, String height, String progresses, String wealthy, String category) {
        // Denominator of P(category|features) -- identical for all categories, so it can be skipped:
        // double denominator = getDenominator(look, character, height, progresses, wealthy);
        // Numerator of P(category|features)
        return getMolecule(look, character, height, progresses, wealthy, category);
    }

    /**
     * Numerator of Bayes' theorem: P(category) * product of P(feature_i | category)
     */
    public static double getMolecule(String look, String character, String height, String progresses, String wealthy, String category) {
        double resultCP = getProbability(5, category);
        double lookCP = getProbability(0, look, category);
        double characterCP = getProbability(1, character, category);
        double heightCP = getProbability(2, height, category);
        double progressesCP = getProbability(3, progresses, category);
        double wealthyCP = getProbability(4, wealthy, category);
        return lookCP * characterCP * heightCP * progressesCP * wealthyCP * resultCP;
    }

    /**
     * Denominator of Bayes' theorem: product of P(feature_i), using the independence assumption
     */
    public static double getDenominator(String look, String character, String height, String progresses, String wealthy) {
        double lookP = getProbability(0, look);
        double characterP = getProbability(1, character);
        double heightP = getProbability(2, height);
        double progressesP = getProbability(3, progresses);
        double wealthyP = getProbability(4, wealthy);
        return lookP * characterP * heightP * progressesP * wealthyP;
    }

    /**
     * Probability of a feature value over the whole dataset
     */
    private static double getProbability(int index, String feature) {
        int size = data.size();
        int num = 0;
        for (int i = 0; i < size; i++) {
            if (data.get(i).get(index).equals(feature)) {
                num++;
            }
        }
        return (double) num / size;
    }

    /**
     * Probability of a feature value within a given category
     */
    private static double getProbability(int index, String feature, String category) {
        List<List<String>> filterData = data.stream()
                .filter(e -> e.get(e.size() - 1).equals(category))
                .collect(Collectors.toList());
        int size = filterData.size();
        int num = 0;
        for (int i = 0; i < size; i++) {
            // Count how often the feature value appears among the rows of this category
            if (filterData.get(i).get(index).equals(feature)) {
                num++;
            }
        }
        return (double) num / size;
    }
}

Output result

Running main prints the girl's predicted result for each of the two feature vectors passed to classification. Note that modelMap only holds feature combinations that appear in the training data, so a combination never seen during training has no entry and prints null.

Use cases

For example: spam filtering, automatic article categorization, file classification, and so on.

Take spam filtering as an example of how a classification algorithm is used. First, batches of labeled mail samples (say 5,000 normal emails and 2,000 spam emails) are fed into the algorithm for training, producing a spam classification model. The algorithm then uses that model to classify and identify the mail to be processed.

From the labeled samples, the probabilities of a set of features are extracted — for example, the word "credit card" might appear in 20% of spam emails but in only 1% of normal emails — and these probabilities form the classification model. Feature values are then extracted from each email to be identified and combined with the model to judge whether it is spam. Since the Bayesian decision is based on a probability value, misclassification is possible.
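The decision above can be sketched as follows. This is a minimal model with a single "credit card" feature, using the 20% and 1% figures from the text; the class priors are an assumption, derived from the sample counts of 5,000 normal and 2,000 spam emails mentioned earlier:

```java
public class SpamSketch {
    // From the example: P("credit card" | spam) = 0.20, P("credit card" | ham) = 0.01.
    static final double P_WORD_GIVEN_SPAM = 0.20;
    static final double P_WORD_GIVEN_HAM  = 0.01;
    // Assumed priors from the sample sizes (2000 spam, 5000 normal out of 7000).
    static final double P_SPAM = 2000.0 / 7000.0;
    static final double P_HAM  = 5000.0 / 7000.0;

    /** Compare the un-normalized numerators of Bayes' theorem, as in the code above. */
    static String classify(boolean containsCreditCard) {
        double spamScore = (containsCreditCard ? P_WORD_GIVEN_SPAM : 1 - P_WORD_GIVEN_SPAM) * P_SPAM;
        double hamScore  = (containsCreditCard ? P_WORD_GIVEN_HAM  : 1 - P_WORD_GIVEN_HAM)  * P_HAM;
        return spamScore > hamScore ? "spam" : "not spam";
    }

    public static void main(String[] args) {
        System.out.println(classify(true));   // an email containing "credit card"
        System.out.println(classify(false));  // an email without it
    }
}
```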


Origin blog.csdn.net/qq_15509251/article/details/131513591