How to use probabilistic thinking to solve programming problems?

Before getting into the main topic, I'd like to recommend a good book: "Hackers and Painters". It is a collection of essays by the author, and the Chinese edition was published in 2011. After reading the whole book you will be impressed by the author's vision: much of what has since happened in software development matches his predictions, and many of his ideas are worth pondering. Today's article is also inspired by this book.


In the book, the author describes how he developed his own spam filter. At first he filtered spam by looking for characteristic features, such as the sender's address, the subject line, and sensitive words in the body. A filter like that reached an accuracy of about 78%, but when the author tried to push the accuracy further, he found the filtering rules became too strict: it blocked spam more aggressively, but the number of misjudged emails (legitimate mail treated as spam) also went up.


So the author switched to a probabilistic filter, which is really a Bayesian classifier. The probability-based approach greatly improved the accuracy of spam identification while keeping misjudgments low. It seems the author already had the idea of machine learning back then, even if he did not realize it; he simply called it a probabilistic filter.


If all we took from this example were one implementation of a mail filter, this would be no different from an ordinary technical book. The author's deeper point is that programmers seem to have developed a habit of overemphasizing certainty: we expect deterministic outputs for our inputs, and if there are no explicit features to key on, we don't know where to start. In reality, a more direct approach is often to accumulate experience, that is, to use prior experience as the basis for judging whether something makes sense. Not everything needs to be deterministic; probabilistic approaches are often closer to how the real world works.


Come to think of it, I saw an article about Bayes' formula a few days ago, and one example in it is very apt. In the pursuit of certainty we first form a hypothesis and then look for evidence, and if the hypothesis is wrong it will probably lead us to a wrong conclusion. For example, suppose I hypothesize that the moon is made of cheese; I observe that the moon is yellow, and I conclude that the moon is indeed made of cheese. Obviously that conclusion is wrong, and you might exclaim that anyone with common sense would see it is wrong. When you think that way, you have brought in your prior knowledge, which is common sense, so you will not be fooled by a false conclusion. Of course, if someone's prior knowledge is all wrong, that's another story.
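To make the cheese example a little more concrete, here is a rough sketch of Bayes' formula in Python. The numbers are entirely made up for illustration; the point is only that a sensible prior keeps weak evidence from dragging you to a silly conclusion.

# Bayes' formula: P(H|E) = P(E|H) * P(H) / P(E)
# H: "the moon is made of cheese"; E: "the moon looks yellow".
# All of the numbers below are made up purely for illustration.
p_h = 1e-9           # prior: common sense says the hypothesis is almost certainly false
p_e_given_h = 0.9    # if the moon really were cheese, it would probably look yellow
p_e = 0.5            # a yellow-looking moon is common whether or not it is cheese

p_h_given_e = p_e_given_h * p_h / p_e
print(p_h_given_e)   # ~1.8e-09: one weak piece of evidence barely moves a sensible prior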


Some time ago I went through a round of learning about deep learning. Having gained some intuitive understanding of it, I realized that the field of machine learning has embraced probabilistic thinking, which is closer to the way humans reason. This gives us a new way to solve problems in code: we don't need complete certainty, and for many problems probability serves us better.


To deepen the understanding, below is a Bayesian classifier I wrote following the author's idea; it distinguishes whether an English sentence is a nice thing to say or an insult. Bayes' formula is a very important part of probability theory, used in fields such as quality inspection, insurance, and trend prediction, and its history is also quite interesting. Its power is that it lets us combine prior knowledge with evidence to obtain posterior knowledge; there are many articles on the Internet that can fill in the theory. Enough talk, let's look at the code. Our trusty Python comes into play.

good_sentences=[
    'i like you',
    'i love you',
    'you are so cool',
    'you are right',
    'you are beautiful'
]


bad_sentences=[
    'i hate you',
    'i beat you',
    'you are not cool',
    'you are wrong',
    'you are ugly'
]
To train the model I provide some basic data. There isn't much of it; if you want higher accuracy you can supply more sentences. The sentences are divided into two groups, good ones and bad ones.
# Flatten the two groups of sentences into word lists.
good_sentences_join = ' '.join(good_sentences)
good_words = good_sentences_join.split()

bad_sentences_join = ' '.join(bad_sentences)
bad_words = bad_sentences_join.split()

# Relative frequency of each word within its own group.
good_words_freq = {}
for word in good_words:
    good_words_freq.setdefault(word, round(good_words.count(word)/len(good_words), 2))

bad_words_freq = {}
for word in bad_words:
    bad_words_freq.setdefault(word, round(bad_words.count(word)/len(bad_words), 2))
Next we calculate, for each word, the probability of it appearing among the good words and among the bad words; this is our prior knowledge.
# For every known word, estimate how strongly it indicates a bad sentence.
p_bad = {}
keys = set(good_words + bad_words)
for key in keys:
    bad_freq = 0.0
    if key in bad_words_freq:
        bad_freq = bad_words_freq[key]

    good_freq = 0.0
    if key in good_words_freq:
        good_freq = good_words_freq[key]

    # Clamp to [0.01, 0.99] so that no single word dominates completely.
    p_bad.setdefault(key, max(0.01, min(0.99, bad_freq/(bad_freq+good_freq))))
This gives the model a weight for each word, expressing how much that word makes a sentence bad. You can add a print here to look at p_bad, and you will be pleasantly surprised to find that words such as "i" and "you" have a weight of 0.5. In other words, they are neutral words: they appear in both the good and the bad sentences, so it is natural to treat them as neutral.
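For example, a quick check could look like this (this print is not part of the final script, it is just a way to inspect the table):

# Inspect the learned weights: 'i' and 'you' should come out at 0.5 (neutral),
# while words like 'love' should be near 0.01 and 'hate' or 'ugly' near 0.99.
for word, weight in sorted(p_bad.items(), key=lambda kv: kv[1]):
    print(word, weight)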
def analyse_bad_sentence(sentence, p_not_found=0.5):
    """Return the confidence (between 0 and 1) that the sentence is a bad one."""
    words = sentence.lower().split()
    p_bad_word = 1.0
    p_bad_word_reverse = 1.0
    for word in words:
        if word in p_bad:
            p_bad_word *= p_bad[word]
            p_bad_word_reverse *= 1 - p_bad[word]
        else:
            # Words with no prior knowledge fall back to p_not_found.
            p_bad_word *= p_not_found
            p_bad_word_reverse *= 1 - p_not_found
    # Combine the per-word probabilities into one posterior confidence.
    return p_bad_word / (p_bad_word + p_bad_word_reverse)
Now we can use the model to define the function that analyses a sentence. It essentially applies the prior knowledge to the actual sentence in front of us and returns a confidence, using Bayes' theorem. Pay attention to the second parameter, p_not_found, which defaults to 0.5; that means a word with no prior knowledge is treated as neutral. You can try giving this parameter a different value. If it is greater than 0.5, your little robot is a bit cynical and will assume any word it has never seen is a bad one; if it is less than 0.5, your little robot is a bit naive and will treat unseen words as good ones.
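For instance, with a sentence made entirely of words the model has never seen, the confidence sits exactly on the neutral line with the default and tips over it when you raise p_not_found. The calls below are just for illustration and are not part of the script:

# 'he' and 'sleeps' do not appear in the training sentences, so both fall back to p_not_found.
print(analyse_bad_sentence('he sleeps'))                   # 0.5   -> neutral with the default
print(analyse_bad_sentence('he sleeps', p_not_found=0.6))  # ~0.69 -> judged as a bad sentence
print(analyse_bad_sentence('he sleeps', p_not_found=0.4))  # ~0.31 -> judged as a good sentence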
if __name__ == '__main__':
    import sys

    # Expect exactly one argument: the sentence to analyse.
    if len(sys.argv) != 2:
        print('Usage: ai_simulate sentence')
        sys.exit(1)

    sentence = sys.argv[1]
    confidence = analyse_bad_sentence(sentence)
    if confidence > 0.5:
        print('You should not swear at me!')
    else:
        print('Thank you!')

The final main program is very simple, so there is not much more to say. Let's try it out below; I am using Ubuntu.

$ ./ai_simulate.py "He like you"

Thank you!


$ ./ai_simulate.py "He hate you"

You should not swear at me!


$ ./ai_simulate.py "He love and hate you"

Thank you!


For the third example, if you change p_not_found to 0.6, the output becomes

You should not swear at me!


That is because, besides "he", the sentence also contains "and", which is not in our model data either, so the cynical robot thinks you are scolding it XD. The code is here.
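The change referred to above is just one line in the main block, for example (a sketch of the tweak, not a separate feature of the script):

    # Make the robot slightly suspicious of words it has never seen before.
    confidence = analyse_bad_sentence(sentence, p_not_found=0.6)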
