Table of contents
- Bayesian classification algorithm
- Code example
- Dataset data.txt
- Code
- Output result
- Use cases
Bayesian classification algorithm
The Bayesian classification algorithm is a statistical classification method: it uses knowledge of probability and statistics to assign samples to classes. In many settings, the Naïve Bayes (NB) classifier is comparable in accuracy to decision tree and neural network classifiers. It scales to large databases, and the method is simple, fast, and achieves high classification accuracy.
Naive Bayes assumes that the effect of an attribute value on a given class is independent of the values of the other attributes. Since this assumption often does not hold in practice, classification accuracy may suffer. For this reason, many Bayesian classification algorithms that relax the independence assumption have been derived, such as the TAN (tree-augmented naive Bayes) algorithm.
So what is the core of the naive Bayesian classification algorithm? It is the following Bayes formula:

P(B|A) = P(A|B) * P(B) / P(A)

It becomes much clearer when rewritten in terms of features and categories:

P(category|feature) = P(feature|category) * P(category) / P(feature)

Once we can compute P(category|feature), our task is essentially complete: a sample is assigned to the category with the largest posterior probability.
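The formula can be expressed directly in code. This is a minimal sketch; the probabilities below are hypothetical numbers chosen for illustration, not values estimated from the dataset in this article:

```python
def posterior(p_feature_given_cat, p_cat, p_feature):
    """Bayes' rule: P(category | feature) = P(feature | category) * P(category) / P(feature)."""
    return p_feature_given_cat * p_cat / p_feature

# Hypothetical numbers: P(tall | like) = 0.6, P(like) = 0.5, P(tall | dislike) = 0.4.
# By total probability: P(tall) = 0.6 * 0.5 + 0.4 * 0.5 = 0.5
p_tall = 0.6 * 0.5 + 0.4 * 0.5
print(posterior(0.6, 0.5, p_tall))  # P(like | tall) = 0.6
```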
Code example
Let's take girls choosing a partner as an example. We extract several key characteristics used in mate selection, such as appearance, personality, height, self-motivation, and assets, as the features. Through prior research we obtain a set of data samples, i.e. a dataset of feature values together with the selection results (the classes). From this dataset, the naive Bayes method computes, for each class, the probability of the given feature combination, and the class with the largest value is taken as the prediction. Since the decision is based on estimated probabilities, it is not always accurate; the larger the dataset, the higher the accuracy tends to be.
Dataset data.txt
Each line of the following dataset contains one sample. The features in each sample are separated by commas ",", in the following order:
appearance, personality, height, self-motivation, assets, girl's decision
handsome, good, tall, motivated, rich, like
not handsome, good, tall, motivated, rich, like
handsome, not good, tall, motivated, rich, like
handsome, good, not tall, motivated, rich, like
handsome, good, tall, not motivated, rich, like
handsome, good, tall, motivated, not rich, like
handsome, good, not tall, not motivated, rich, dislike
not handsome, not good, not tall, motivated, rich, like
not handsome, not good, not tall, motivated, not rich, dislike
handsome, good, not tall, motivated, not rich, like
not handsome, good, tall, not motivated, rich, dislike
handsome, not good, tall, motivated, rich, dislike
not handsome, good, tall, motivated, rich, dislike
handsome, not good, tall, motivated, not rich, like
handsome, not good, tall, not motivated, rich, like
handsome, good, tall, motivated, not rich, dislike
…, motivated, not rich, dislike
handsome, good, not tall, motivated, rich, like
not handsome, not good, not tall, not motivated, rich, dislike
handsome, good, tall, motivated, not rich, like
handsome, good, not tall, not motivated, rich, like
handsome, good, tall, not motivated, not rich, dislike
handsome, not good, tall, not motivated, rich, dislike
Code
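A naive Bayes classifier over data in this format can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function names, the inline subset of the data, and the add-one (Laplace-style) smoothing are my own choices, and the smoothing denominator treats one extra slot as "unseen value":

```python
from collections import Counter, defaultdict

# A few sample rows in the data.txt format:
# appearance, personality, height, self-motivation, assets, decision
DATA = """\
handsome, good, tall, motivated, rich, like
not handsome, good, tall, motivated, rich, like
handsome, not good, tall, motivated, rich, like
handsome, good, not tall, motivated, rich, like
handsome, good, not tall, not motivated, rich, dislike
not handsome, not good, not tall, motivated, not rich, dislike
not handsome, good, tall, not motivated, rich, dislike
handsome, not good, tall, motivated, rich, dislike"""

def parse_dataset(text):
    """Split each line into (feature list, class label)."""
    samples = []
    for line in text.strip().splitlines():
        fields = [x.strip() for x in line.split(",")]
        samples.append((fields[:-1], fields[-1]))
    return samples

def train(samples):
    """Count class frequencies and, per class, feature-value frequencies."""
    class_counts = Counter(label for _, label in samples)
    value_counts = defaultdict(Counter)  # (class, feature position) -> value counts
    for features, label in samples:
        for i, value in enumerate(features):
            value_counts[(label, i)][value] += 1
    return class_counts, value_counts

def predict(features, class_counts, value_counts):
    """Return the class maximizing P(class) * prod_i P(feature_i | class).
    Add-one smoothing keeps an unseen value from zeroing the product."""
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(features):
            counts = value_counts[(label, i)]
            score *= (counts[value] + 1) / (count + len(counts) + 1)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(parse_dataset(DATA))
print(predict(["handsome", "good", "tall", "motivated", "rich"], *model))  # like
```

Because the probabilities are products of many small factors, a production version would compare sums of log-probabilities instead, but the multiplicative form above matches the formula in the article most directly.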
Output result
Use cases
Typical applications include spam email classification, automatic article categorization, web page classification, document classification, and so on.
Take anti-spam filtering as an example of how the classification algorithm is used. First, batches of pre-classified mail samples (say, 5,000 normal emails and 2,000 spam emails) are fed into the algorithm for training, producing a spam classification model. The algorithm then uses this model to classify and identify the mails to be processed.
From the classified samples, the probabilities of a set of features are estimated. For example, the word "credit card" might appear in 20% of spam emails but only 1% of normal emails; these estimates make up the classification model. Feature values are then extracted from each mail to be identified and combined with the model to judge whether it is spam. Since the Bayesian algorithm yields a probability rather than a certainty, misjudgments can occur.
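Using the numbers above as a worked example (priors taken from the 2,000 spam / 5,000 normal training counts, and "credit card" appearing in 20% of spam versus 1% of normal mail), Bayes' rule gives the probability that a mail containing "credit card" is spam:

```python
# Priors from the training counts: 2,000 spam and 5,000 normal emails
p_spam = 2000 / 7000
p_normal = 5000 / 7000

# Likelihoods from the example: P("credit card" | spam) = 20%, P("credit card" | normal) = 1%
p_word_given_spam = 0.20
p_word_given_normal = 0.01

# Total probability of seeing "credit card", then Bayes' rule
p_word = p_word_given_spam * p_spam + p_word_given_normal * p_normal
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))  # 0.889
```

So under these assumptions, roughly 89% of mails containing "credit card" are spam: the word is strong evidence even though it appears in only a fifth of spam messages.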