Bayes decision: gender classification based on height and weight characteristics

Please download the code and files from here: Auorui/Pattern-recognition-programming: Pattern recognition programming (github.com)

Brief description

Using height and weight separately as single features, estimate the parameters of the class-conditional densities by maximum likelihood under a normal-distribution assumption, build a minimum-error-rate Bayes classifier, write out the resulting decision rule, apply the classifier to the test samples, and examine the test errors. When designing the classifier, experiment with different prior probabilities (such as 0.5 vs. 0.5, 0.75 vs. 0.25, 0.9 vs. 0.1) and examine their effect on the decision rule and the error rate.

Then use height and weight together as features: estimate the probability density under a normal-distribution assumption, build a minimum-error-rate Bayes classifier, write out the resulting decision rule, apply the classifier to the training and test samples, and examine the training and test errors. Compare the results obtained when the two features are assumed to be correlated with those obtained when they are assumed to be uncorrelated. Here too, different prior probabilities can be tried to examine their effect on the decisions and the error rate.

Minimum Error Bayesian Decision Making

To classify the male and female samples, we first need the prior probability P(w_{i}) of each class. It can be estimated from statistics or set from experience, so it is a value we are free to choose. Here you can test 0.5 vs. 0.5, 0.75 vs. 0.25, and 0.9 vs. 0.1.
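When labelled sample counts are available, the prior can be estimated as P(w_{i}) = m_{i} / m. The following is a minimal sketch of that relation; the function name `estimate_priors` and the counts are illustrative assumptions, not part of the original code.

```python
def estimate_priors(n_male, n_female):
    """Estimate class priors from sample counts: P(w_i) = m_i / m."""
    total = n_male + n_female
    if total == 0:
        raise ValueError("at least one sample is required")
    return n_male / total, n_female / total


# Hypothetical counts, for illustration only.
p_male, p_female = estimate_priors(75, 25)
print(p_male, p_female)
```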

In Bayesian statistics, the posterior probability is the revised or updated probability of an event occurring after taking into account new information. The posterior probability is calculated by updating the prior probability using Bayes' theorem.

P(w_{i}|x)=\frac{p(x|w_{i})\times P(w_{i})}{p(x)}

where p(x) is the total probability density of x:

p(x)=\sum_{i=1}^{2}p(x|w_{i})P(w_{i})

Because p(x) is the same for both classes, comparing the posterior probabilities is equivalent to checking

p(x|w_{1})P(w_{1})>p(x|w_{2})P(w_{2})

If this condition holds, x is assigned to w_{1}; otherwise it is assigned to w_{2}. This is the minimum-error Bayesian decision rule.
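The rule above can be sketched in a few lines. This is only an illustration: the function name `min_error_decision` and the normal parameters are assumptions and are not fitted to the article's data.

```python
from scipy.stats import norm


def min_error_decision(x, dist_1, dist_2, prior_1):
    """Return 1 if p(x|w1)P(w1) > p(x|w2)P(w2), otherwise 2."""
    prior_2 = 1.0 - prior_1
    score_1 = dist_1.pdf(x) * prior_1
    score_2 = dist_2.pdf(x) * prior_2
    return 1 if score_1 > score_2 else 2


# Hypothetical class-conditional densities (mean and standard deviation in cm).
w1 = norm(loc=175.0, scale=6.0)
w2 = norm(loc=162.0, scale=6.0)

for prior_1 in (0.5, 0.75, 0.9):
    print(prior_1, min_error_decision(168.0, w1, w2, prior_1))
```

With these assumed parameters, the sample at 168 is assigned to w_{2} under equal priors and to w_{1} once P(w_{1}) is 0.75 or higher, which shows how the prior shifts the decision boundary.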

Minimum Risk Bayesian Decision Making

In practice, the smallest classification error rate is not necessarily the best criterion. Different kinds of classification errors may have different consequences, and misclassifying one class may be more serious than misclassifying another. For example, in medical diagnosis, diagnosing a sick patient as healthy may be more serious than diagnosing a healthy person as sick. When decisions carry such risks, the regions R_{1} and R_{2} are reselected to minimize the risk rather than the error probability P_{e}. The risk or loss associated with w_{k} is defined as:

r_{k}=\sum_{i=1}^{c}\lambda _{ki}\int_{R_{i}}p(x|w_{k})dx

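Here \lambda _{ki} is the loss incurred when a sample of class w_{k} is assigned to class w_{i}. For this data there are only two classes, and the losses of deciding w_{1} or w_{2} for a given x are: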

l_{1}=\lambda _{11}p(x|w_{1})p(w_{1})+\lambda _{21}p(x|w_{2})p(w_{2})

l_{2}=\lambda _{12}p(x|w_{1})p(w_{1})+\lambda _{22}p(x|w_{2})p(w_{2})

If l_{1}<l_{2}, then x is assigned to class w_{1}, that is, when

(\lambda _{12}-\lambda _{11})p(x|w_{1})P(w_{1})>(\lambda _{21}-\lambda _{22})p(x|w_{2})P(w_{2})

This can be simplified further. When misclassifying samples of class w_{2} has more serious consequences, we can set \lambda _{21}>\lambda _{12}. With \lambda _{11}=\lambda _{22}=0 and equal priors, x is assigned to w_{2} if p(x|w_{2})>p(x|w_{1})\frac{\lambda _{12}}{\lambda _{21}}.
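The inequality for l_{1}<l_{2} can also be written as a likelihood-ratio test. The form below assumes \lambda _{12}>\lambda _{11} and \lambda _{21}>\lambda _{22} (an error costs more than a correct decision) and p(x|w_{2})>0, so both sides can be divided without flipping the inequality. The numbers in the second line are illustrative values, not taken from the article.

```latex
\frac{p(x|w_{1})}{p(x|w_{2})}>\frac{(\lambda _{21}-\lambda _{22})P(w_{2})}{(\lambda _{12}-\lambda _{11})P(w_{1})}

% Illustrative values: \lambda_{11}=\lambda_{22}=0, \lambda_{12}=1, \lambda_{21}=2, P(w_{1})=0.75, P(w_{2})=0.25
\frac{p(x|w_{1})}{p(x|w_{2})}>\frac{2\times 0.25}{1\times 0.75}=\frac{2}{3}
```

In this form the priors and the losses only move the threshold on the right-hand side, while the left-hand side depends on the data alone.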

Data preprocessing

First, look at the data. Each row holds the height and weight of one sample. You can read the rows in Python and clean them; here np.loadtxt can be used directly, and it returns a two-dimensional array. Slicing then separates the height and weight columns so that the mean and variance of each can be computed.
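As a small illustration of the loading and slicing step, the sketch below reads a few rows from an in-memory string instead of a file. The sample values and the units (cm, kg) are assumptions for the example; the real files come from the repository.

```python
import io

import numpy as np

# Hypothetical sample rows: height (cm) and weight (kg), whitespace separated.
sample_text = """170 65
180 75
160 55
"""

data = np.loadtxt(io.StringIO(sample_text))  # two-dimensional array, shape (3, 2)
height = data[:, 0]  # first column
weight = data[:, 1]  # second column

mean_height = np.mean(height)
var_height = np.var(height)  # divides by N (ddof=0), the maximum likelihood estimate
print(mean_height, var_height)
```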

# @Auorui
import numpy as np
from scipy.stats import norm


class Datasets:
    # A simple data loader
    def __init__(self, datapath, t):
        self.datapath = datapath
        self.data = np.loadtxt(self.datapath)  # two-dimensional array
        self.height = self.data[:, 0]
        self.weight = self.data[:, 1]
        self.length = len(self.data)
        self.t = t  # number of decimal places to keep

    def __len__(self):
        return self.length

    def mean(self, data):
        # Mean; np.mean can be used instead
        total = 0
        for x in data:
            total += x
        return total / len(data)

    def var(self, data):
        # Variance (divides by N); np.var can be used instead
        mean = self.mean(data)
        sq_diff_sum = 0
        for x in data:
            diff = x - mean
            sq_diff_sum += diff ** 2
        return sq_diff_sum / len(data)

    def retain(self, *args):
        # Round every value to self.t decimal places
        formatted_args = [round(arg, self.t) for arg in args]
        return tuple(formatted_args)

    def __call__(self):
        # Returns (mean height, height variance, mean weight, weight variance)
        mean_height = self.mean(self.height)
        var_height = self.var(self.height)
        mean_weight = self.mean(self.weight)
        var_weight = self.var(self.weight)
        return self.retain(mean_height, var_height, mean_weight, var_weight)

Data loading

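def Dataloader(maledata, femaledata):
    # Each Datasets object returns (mean height, height variance, mean weight, weight variance)
    mmh, mvh, mmw, mvw = maledata()
    fmh, fvh, fmw, fvw = femaledata()

    # norm expects the standard deviation as scale, hence the square root of the variance
    male_height_dist = norm(loc=mmh, scale=mvh**0.5)
    male_weight_dist = norm(loc=mmw, scale=mvw**0.5)
    female_height_dist = norm(loc=fmh, scale=fvh**0.5)
    female_weight_dist = norm(loc=fmw, scale=fvw**0.5)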

    data_dist = {
        'mh': male_height_dist,
        'mw': male_weight_dist,
        'fh': female_height_dist,
        'fw': female_weight_dist
    }

    return data_dist

Here, a dictionary is used to store the fitted normal distributions for the male and female data.
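To show how such a dictionary might be used, the sketch below builds two frozen distributions by hand and evaluates their densities at one point. The means and variances are made-up values; in the article they come from the Datasets objects.

```python
from scipy.stats import norm

# Hypothetical parameters: norm takes the standard deviation, so variances are square-rooted.
data_dist = {
    'mh': norm(loc=173.0, scale=30.0 ** 0.5),  # male height
    'fh': norm(loc=162.0, scale=25.0 ** 0.5),  # female height
}

# A frozen distribution exposes pdf(), so each class-conditional density is one call.
x = 170.0
p_x_given_male = data_dist['mh'].pdf(x)
p_x_given_female = data_dist['fh'].pdf(x)
print(p_x_given_male, p_x_given_female)
```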

Calculate probability density functions (pdf values) and Bayesian decisions

Here we will use height for minimum risk Bayesian decision making, weight for minimum error Bayesian decision making, and height and weight for minimum error Bayesian decision making.

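def classify(height=None, weight=None, ways=1):
    """
    Classify gender from height, weight, or both.

    :param height: height
    :param weight: weight
    :param ways: 1 - use height
                 2 - use weight
                 3 - use height and weight
    :return: 1 for male, 2 for female
    """
    # Prior probability: P(w1) = m1 / m, where m is the total number of samples and m1 the number in class w1.
    # male_height_dist, male_weight_dist, female_height_dist and female_weight_dist are expected to
    # exist at module level, for example taken from the dictionary returned by Dataloader.

    p_male = 0.5
    p_female = 1 - p_male

    cost_male = 0  # cost of predicting male; 0 means it is ignored
    cost_female = 0  # cost of predicting female
    cost_false_negative = 10  # cost when the sample is male but predicted female
    cost_false_positive = 5  # cost when the sample is female but predicted male

    assert ways in [1, 2, 3], "Invalid value for 'ways'. Use 1, 2, or 3."
    assert abs(p_male + p_female - 1.) < 1e-9, "Invalid prior probability, the sum of categories must be 1"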

    # if ways == 1:
    #     assert height is not None, "If mode 1 is selected, the height parameter cannot be set to None"
    #     p_height_given_male = male_height_dist.pdf(height)
    #     p_height_given_female = female_height_dist.pdf(height)
    #
    #
    #     return 1 if p_height_given_male * p_male > p_height_given_female * p_female else 2

    if ways == 1:
        assert height is not None, "If mode 1 is selected, the height parameter cannot be set to None"
        p_height_given_male = male_height_dist.pdf(height)
        p_height_given_female = female_height_dist.pdf(height)

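        # With the costs above, the constants are not weighted by the densities, so this branch
        # returns the class with the larger p(x|w) * P(w); a tie goes to female.
        risk_male = cost_male + cost_false_negative if p_height_given_male * p_male <= p_height_given_female * p_female else cost_female
        risk_female = cost_female + cost_false_positive if p_height_given_male * p_male >= p_height_given_female * p_female else cost_male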

        return 1 if risk_male <= risk_female else 2

    if ways == 2:
        assert weight is not None, "If mode 2 is selected, the weight parameter cannot be set to None"
        p_weight_given_male = male_weight_dist.pdf(weight)
        p_weight_given_female = female_weight_dist.pdf(weight)

        return 1 if p_weight_given_male * p_male > p_weight_given_female * p_female else 2

    if ways == 3:
        assert height is not None and weight is not None, "If mode 3 is selected, the height and weight parameters cannot be set to None"
        p_height_given_male = male_height_dist.pdf(height)
        p_height_given_female = female_height_dist.pdf(height)
        p_weight_given_male = male_weight_dist.pdf(weight)
        p_weight_given_female = female_weight_dist.pdf(weight)

        return 1 if p_height_given_male * p_weight_given_male * p_male > p_height_given_female * p_weight_given_female * p_female else 2

    return 3
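For comparison, the textbook minimum-risk rule weights each loss by p(x|w_{k})P(w_{k}), as in l_{1} and l_{2} earlier. The following is a sketch of that formulation, not the author's implementation: the function name, the parameter names, and the normal parameters are assumptions, and correct decisions are assumed to cost nothing.

```python
from scipy.stats import norm


def min_risk_decision(x, male_dist, female_dist, p_male,
                      cost_male_as_female=10.0, cost_female_as_male=5.0):
    """Return 1 (male) or 2 (female) by comparing expected losses."""
    p_female = 1.0 - p_male
    joint_male = male_dist.pdf(x) * p_male        # p(x|male) P(male)
    joint_female = female_dist.pdf(x) * p_female  # p(x|female) P(female)

    # Correct decisions are assumed to cost 0, so only the error terms remain.
    risk_decide_male = cost_female_as_male * joint_female
    risk_decide_female = cost_male_as_female * joint_male
    return 1 if risk_decide_male <= risk_decide_female else 2


# Hypothetical height distributions (cm), for illustration only.
male = norm(loc=173.0, scale=6.0)
female = norm(loc=162.0, scale=6.0)
print(min_risk_decision(167.0, male, female, p_male=0.5))
```

With these assumed values, a height of 167 is slightly more likely under the female density, yet the rule returns 1 because predicting female for a male sample is twice as costly. With equal costs the same call returns 2.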

Validate and calculate prediction accuracy using test set

def test(test_path, ways=3):
    test_data = np.loadtxt(test_path)
    true_gender_label = []
    pred_gender_label = []
    for data in test_data:
        # Each row: height, weight, gender label (1 = male, 2 = female)
        height, weight, gender = data
        true_gender_label.append(int(gender))
        pred_gender = classify(height, weight, ways)
        pred_gender_label.append(pred_gender)
        if pred_gender == 1:
            print('Male')
        elif pred_gender == 2:
            print('Female')
        else:
            print('Unknown\t')
    return true_gender_label, pred_gender_label

def accuracy(true_labels, predicted_labels):
    assert len(true_labels) == len(predicted_labels), "Input lists must have the same length"
    correct_predictions = sum(1 for true, pred in zip(true_labels, predicted_labels) if true == pred)
    total_predictions = len(true_labels)
    accuracy = correct_predictions / total_predictions
    return accuracy
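Since the task also asks for the error conditions, a per-class breakdown can be more informative than a single accuracy value. The helper below is a hypothetical addition, not part of the original code; it assumes the labels 1 and 2 used by the classifier.

```python
def error_report(true_labels, predicted_labels):
    """Return the overall error rate and the error rate within each true class."""
    assert len(true_labels) == len(predicted_labels), "Input lists must have the same length"
    assert len(true_labels) > 0, "Input lists must not be empty"
    pairs = list(zip(true_labels, predicted_labels))
    report = {'overall': sum(1 for t, p in pairs if t != p) / len(pairs)}
    for label in (1, 2):
        preds = [p for t, p in pairs if t == label]
        if preds:  # skip classes that do not occur in the true labels
            report[label] = sum(1 for p in preds if p != label) / len(preds)
    return report


# Hypothetical labels, for illustration only.
print(error_report([1, 1, 2, 2, 2], [1, 2, 2, 2, 1]))
```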

Prediction results

Minimum risk Bayesian decision making using height

When using height for minimum risk Bayesian decision-making, the accuracy is 94.29% on the test1 data and 91.0% on the test2 data.

Minimum error Bayesian decision making using body weight

When weight is used for minimum error Bayesian decision-making, the accuracy is 94.29% on the test1 data and 85.33% on the test2 data.

Minimum error rate Bayesian decision-making using height and weight

When using height and weight for minimum error rate Bayesian decision-making, the accuracy rate on test1 data is 97.14%, and the accuracy rate on test2 data is 90.33%.
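The combined rule multiplies the height and weight densities, which treats the two features as independent. To compare this with the correlated case, one option is to fit a bivariate normal with a full covariance matrix. The sketch below uses `scipy.stats.multivariate_normal`; the sample values are invented for illustration and only one class is shown.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical training samples for one class: columns are height (cm) and weight (kg).
samples = np.array([
    [170.0, 65.0],
    [180.0, 78.0],
    [165.0, 58.0],
    [175.0, 70.0],
    [185.0, 82.0],
])

mean = samples.mean(axis=0)
# Maximum likelihood covariance estimate (divides by N, hence bias=True).
cov_full = np.cov(samples, rowvar=False, bias=True)
# Independence assumption: keep the variances and drop the covariance terms.
cov_diag = np.diag(np.diag(cov_full))

correlated = multivariate_normal(mean=mean, cov=cov_full)
independent = multivariate_normal(mean=mean, cov=cov_diag)

x = np.array([178.0, 75.0])
print(correlated.pdf(x), independent.pdf(x))
```

Fitting one such pair of densities per class and plugging them into the same comparison of p(x|w_{i})P(w_{i}) gives the two classifiers to compare.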

Add new features

In addition to combining height and weight, we can also derive new features, such as BMI.

def calculate_bmi(height, weight):
    # Compute BMI as a new feature
    height_meters = height / 100  # convert height from centimeters to meters
    bmi = weight / (height_meters ** 2)  # BMI formula
    return bmi

More features can be built in the same way, and interested readers may wish to continue along this line of thinking.
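One possible way to continue is to treat BMI like height or weight and fit a normal distribution to it per class. The sketch below is an assumption about how that could look: the helper name `fit_bmi_distribution` and the sample values are hypothetical, and it makes no claim about whether BMI improves accuracy.

```python
import numpy as np
from scipy.stats import norm


def fit_bmi_distribution(heights_cm, weights_kg):
    """Fit a normal distribution to BMI values computed from paired arrays."""
    heights_m = np.asarray(heights_cm, dtype=float) / 100.0
    bmi = np.asarray(weights_kg, dtype=float) / heights_m ** 2
    # np.std divides by N by default, matching the maximum likelihood estimate.
    return norm(loc=np.mean(bmi), scale=np.std(bmi))


# Hypothetical samples for one class.
bmi_dist = fit_bmi_distribution([170.0, 180.0, 165.0], [65.0, 78.0, 58.0])
print(bmi_dist.mean(), bmi_dist.std())
```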

Origin blog.csdn.net/m0_62919535/article/details/134065370