Data Mining - Naive Bayes Classifier demo

At a classmate's request, I helped write a naive Bayes classifier based on an example from their textbook; here is the result. The content is quite simple, so just have a look.


The naive Bayes rule used here is:
P(H|X) = P(X|H) * P(H) / P(X)

It only handles categorical attributes (naive indeed).
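
Since the attributes are assumed to be conditionally independent given the class (that is the "naive" part), P(X|Ci) is computed as a product over the individual attributes:
P(X|Ci) = P(x1|Ci) * P(x2|Ci) * ... * P(xn|Ci)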


The data transformation rules are as follows:
age : youth==1, middle_aged==2, senior==3
income : low==1, medium==2, high==3
student : no==0, yes==1
credit_rating : fair==0, excellent==1
buys_computer : no==0, yes==1
The training data is as follows:
age  income  student  credit_rating  buys_computer
1    3       0        0              0
1    3       0        1              0
2    3       0        0              1
3    2       0        0              1
3    1       1        0              1
3    1       1        1              0
2    1       1        1              1
1    2       0        0              0
1    1       1        0              1
3    2       1        0              1
1    2       1        1              1
2    2       0        1              1
2    3       1        0              1
3    2       0        1              0
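
Working this through by hand for the query X = (age=youth, income=medium, student=yes, credit_rating=fair), the same item used in the code below:
P(buys_computer=1) = 9/14 ≈ 0.643, P(buys_computer=0) = 5/14 ≈ 0.357
P(X|buys_computer=1) = 2/9 * 4/9 * 6/9 * 6/9 ≈ 0.044
P(X|buys_computer=0) = 3/5 * 2/5 * 1/5 * 2/5 ≈ 0.019
Since 0.044 * 0.643 ≈ 0.028 is larger than 0.019 * 0.357 ≈ 0.007, the prediction is buys_computer = 1.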
The code is as follows:
import re


def load_data(path):
    '''
    Read the file and load the training data
    :param path: training data path
    :return: (labels, training_data)
             labels are the attribute names, such as age, income, etc.
             training_data is the parsed rows, e.g.: [{'age': '1', 'income': '1', ...}, ...]; not all attributes are shown here
    '''
    training_data = list()
    with open(path, encoding='utf-8') as file:
        data = file.read().strip().split('\n')
        labels = re.split(r'\s+', data[0].strip())
        for content in data[1:]:
            # skip blank lines so a trailing newline does not produce a bogus record
            if not content.strip():
                continue
            parts = re.split(r'\s+', content.strip())
            item = dict()
            for index, part in enumerate(parts):
                item[labels[index]] = part
            training_data.append(item)
    return labels, training_data
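

# Example (assuming 'training_data.txt' contains the table above):
#   labels, rows = load_data('training_data.txt')
#   labels  -> ['age', 'income', 'student', 'credit_rating', 'buys_computer']
#   rows[0] -> {'age': '1', 'income': '3', 'student': '0',
#               'credit_rating': '0', 'buys_computer': '0'}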


class Classifier:
    '''
    According to the Naive Bayes classification formula P(H|X) = P(X|H)*P(H) / P(X)
    a classifier
    '''

    def __init__(self, classify_tag, decision=3, train_data_path='training_data.txt'):
        '''
        Initialize the classifier
        :param classify_tag: attribute used as the class label, here 'buys_computer'
        :param decision: number of decimal places kept during calculations, used as round(value, self.decision)
        :param train_data_path: path of the training data
        '''

        self.labels, self.training_data = load_data(train_data_path)
        self.labels.remove(classify_tag)
        self.load_all_clazzs(classify_tag)
        self.decision = decision

    def load_all_clazzs(self, tag):
        '''
        Read all classification results
        :param tag: attribute used for classification, here is buys_computer
        :return: nothing; stores all class values in self.clazz_values, e.g.: ['0', '1']
        '''
        self.clazz_tag = tag
        self.clazz_values = set()
        for item in self.training_data:
            self.clazz_values.add(item[tag])
        self.clazz_values = list(self.clazz_values)

    def predict(self, item):
        '''
        Predict the classification result for the input data
        :param item: object to be predicted, e.g.: {'age': '1', 'income': '2', ...}
        :return: classification result, e.g.: '1'
        '''

        PXs = list()
        Ps = list()

        # Compute P(Ci) and P(X|Ci) for each class
        for value in self.clazz_values:
            rules = {self.clazz_tag: value}

            p = self.get_P_on_base(rules)
            Ps.append(p)

            px = self.get_PX_on_base(item, rules)
            PXs.append(px)

        print(self.clazz_tag + ' :', self.clazz_values)
        print('P(Ci) :', Ps)
        print('P(X|Ci) :', PXs)

        # Find the maximum P(Ci) * P(X|Ci); since P(X) is identical for every class, it can be dropped from the comparison
        target = [0, 0]  # format: [index of the class, value of P(Ci) * P(X|Ci)]
        for index, p in enumerate(Ps):
            px = PXs[index]
            p_px = round(p * px, self.decision)
            if p_px > target[1]:
                target[0] = index
                target[1] = p_px

        print('Max P(Ci)P(X|Ci) :', target[1])

        return self.clazz_values[target[0]]

    def count(self, rules):
        '''
        Calculate the number of all eligible tuples in the training data
        :param rules: all conditions, as a dict, e.g.: {'age': '1', 'buys_computer': '1'}
        :return: the number of matches
        '''

        # Traverse the training data and check every condition against each row.
        # If no condition fails (the inner loop never breaks), the for-else
        # branch runs and the counter is incremented.
        hit_num = 0
        for item in self.training_data:
            for key, value in rules.items():
                if item[key] != value:
                    break
            else:
                hit_num += 1
        return hit_num
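
    # For the table above, count({'age': '1', 'buys_computer': '1'}) returns 2:
    # only two training rows are both youth and buyers.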

    def get_P_on_base(self, rules):
        '''
        Calculate the prior probability, P(Ci)
        :param rules: Category requirements, format for example: {'buys_computer':'1'}
        :return: The value of P(Ci)
        '''
        hit_num = self.count(rules)
        P = round(hit_num / len(self.training_data), self.decision)
        return P
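
    # For the data above, get_P_on_base({'buys_computer': '1'}) counts 9 of
    # the 14 rows, so it returns round(9 / 14, 3) = 0.643.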

    def get_PX_on_base(self, item, rules):
        '''
        Compute P(X|Ci), the class-conditional likelihood of the item
        :param item: equivalent to X, e.g.: {'age': '1', 'income': '1'}; not all attributes are listed here
        :param rules: the class requirement, e.g.: {'buys_computer': '1'}
        :return: the value of P(X|Ci)
        '''

        base_num = self.count(rules)
        PX_on_base = 1  # running product of the per-attribute P(xk|Ci), so it starts at 1
        for label in self.labels:
            # Get all statistical requirements, for example: {'buys_computer':'1','age':'1'}
            tmp_rule = {label: item[label]}
            tmp_rule.update(rules)

            hit_num = self.count(tmp_rule)

            # Multiply P(xk|Ci) into the running product (note: rounding at
            # every step can truncate very small probabilities to 0)
            PX_on_base = round(PX_on_base * hit_num / base_num, self.decision)

        return PX_on_base
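
    # For the item in __main__ and rules = {'buys_computer': '1'}, the factors
    # are 2/9, 4/9, 6/9 and 6/9, which (with per-step rounding) gives 0.044.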


if __name__ == '__main__':
    # data transformation
    # age : youth==1, middle_aged==2, senior==3
    # income : low==1, medium==2, high==3
    # student : no==0, yes==1
    # credit_rating : fair==0, excellent==1
    # buys_computer : no==0, yes==1

    # Data to be judged
    item = {'age': '1', 'income': '2', 'student': '1', 'credit_rating': '0'}

    # Category labels
    clazz_tag = 'buys_computer'

    classifier = Classifier(clazz_tag)

    # Predict classification result
    ans = classifier.predict(item)

    print('\nresult:')
    print(clazz_tag + ':' + ans)
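
For the query above, a run should print something like the following (the class order can differ between runs, since clazz_values is built from a set):

buys_computer : ['0', '1']
P(Ci) : [0.357, 0.643]
P(X|Ci) : [0.019, 0.044]
Max P(Ci)P(X|Ci) : 0.028

result:
buys_computer:1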

Quite a lot of comments, right? Not my style at all, but there was no way around it, you know.

Follow up:

This code is too simplistic for real use. From my rough understanding, it could at least be extended to handle continuous-valued attributes, and to use Laplace correction to deal with zero-probability events, and so on. It also does nothing at all about efficiency. If you are interested, you can modify it later.
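
For reference, here is a minimal sketch of the Laplace correction, written as an extra method for the Classifier above (the method name and this code are my addition, not part of the original):

    def get_PX_on_base_smoothed(self, item, rules):
        '''
        Like get_PX_on_base, but with Laplace (add-one) smoothing, so an
        attribute value never seen together with a class gives a small
        probability instead of zeroing out the whole product.
        '''
        base_num = self.count(rules)
        PX_on_base = 1
        for label in self.labels:
            # number of distinct values this attribute takes in the training data
            n_values = len({row[label] for row in self.training_data})
            tmp_rule = {label: item[label]}
            tmp_rule.update(rules)
            # add 1 to each count and n_values to the base
            PX_on_base *= (self.count(tmp_rule) + 1) / (base_num + n_values)
        return PX_on_base

And for a continuous attribute, the usual approach is to fit a normal distribution to that attribute's values within each class and use its density as P(xk|Ci), roughly:

import math

def gaussian_likelihood(x, mean, std):
    # density of x under a normal distribution with the class's mean and std
    return math.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)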


Also, I don't explain the concepts used in this post; they will only make sense if you already share the same background (after all, I don't understand them very well myself).
