An explanation of important classification concepts in machine learning (you'll understand after reading!)

Thanks to new computing techniques, today's machine learning is not the machine learning of the past. The field grew out of pattern recognition and the theory that computers can learn to perform specific tasks without being explicitly programmed; researchers interested in artificial intelligence wanted to see whether computers could learn from data. The iterative aspect of machine learning matters because models can adapt independently as they are exposed to new data: they learn from previous computations to produce reliable, repeatable decisions and results. Machine learning is not a new science, but it is one that has gained fresh momentum.

1. Specify the threshold

Logistic regression returns probabilities. You can use the returned probability "as is" (e.g., the probability that a user clicks on this ad is 0.00023), or you can convert it into a binary value (e.g., this email is spam).

If a logistic regression model returns a probability of 0.9995 for a given email, the model is predicting that the email is very likely spam. Conversely, another email with a predicted score of 0.0003 from the same model is very likely not spam. But what if an email has a prediction score of 0.6? To map logistic regression values to binary categories, you must specify a classification threshold (also known as a decision threshold): values above the threshold indicate "spam", and values below it indicate "not spam". It is tempting to assume the classification threshold should always be 0.5, but the threshold is problem-dependent, so you have to tune it.
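To make the mapping concrete, here is a minimal sketch in NumPy (the probabilities and the 0 = "not spam" / 1 = "spam" encoding are illustrative, not the output of a real model):

```python
import numpy as np

# Hypothetical probabilities returned by a logistic regression model.
probs = np.array([0.9995, 0.0003, 0.6])

threshold = 0.5  # classification (decision) threshold; problem-dependent
labels = (probs >= threshold).astype(int)  # 1 = "spam", 0 = "not spam"

print(labels)  # [1 0 1]; raising the threshold to 0.7 would flip the 0.6 email
```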

Later sections detail the metrics that can be used to evaluate a classification model's predictions, as well as the effect that changing the classification threshold has on those predictions.

⭐Note:

"Tuning" a threshold for logistic regression is not the same as tuning a hyperparameter such as the learning rate. Part of choosing a threshold is assessing how much you will suffer for making a mistake. For example, mislabeling a non-spam message as spam would be very bad, whereas mistakenly marking spam as not spam is unpleasant, but it shouldn't cost you your job.

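One way to make this concrete: if you assign costs to each type of mistake, you can sweep candidate thresholds and keep the one with the lowest average cost on labeled data. A minimal sketch, with all numbers invented for illustration:

```python
import numpy as np

def expected_cost(probs, y_true, threshold, cost_fp=10.0, cost_fn=1.0):
    """Average cost of the mistakes made at a given threshold.

    cost_fp > cost_fn encodes "flagging non-spam as spam hurts more
    than letting spam through" (both costs are made up here).
    """
    preds = probs >= threshold
    fp = np.sum(preds & ~y_true)   # non-spam flagged as spam
    fn = np.sum(~preds & y_true)   # spam that slipped through
    return (cost_fp * fp + cost_fn * fn) / len(y_true)

# Toy labels and model scores, invented for illustration.
y_true = np.array([0, 0, 1, 1, 0, 1], dtype=bool)
probs = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9])

thresholds = np.linspace(0.05, 0.95, 19)
best = min(thresholds, key=lambda t: expected_cost(probs, y_true, t))
print(f"lowest-cost threshold on this toy data: {best:.2f}")
```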
2. True vs. False and Positive vs. Negative

In this section, we will first define the main components of the metrics used to evaluate classification models. Let's start with a fable:

Aesop's Fable: The Boy Who Cried Wolf (abridged). A shepherd boy was tasked with watching the town's sheep, but he grew tired of the job. Just for fun, he shouted "Wolf is coming!" when no wolf had appeared at all. The villagers rushed out to protect the sheep and were furious when they discovered the boy was joking. (This scene repeated itself many times.)

...

One night, the shepherd boy saw a real wolf approaching the flock and shouted, "Wolf is coming!" The villagers, unwilling to be fooled again, all stayed home. The hungry wolf slaughtered the flock and ate its fill, and panic spread through the whole town.

We make the following definitions:

"Wolf is coming" is the positive class.

"No wolf" is the negative class.

We can summarize our "wolf prediction" model with a 2x2 confusion matrix describing all four possible outcomes:

A true positive is when the model correctly predicts a sample of the positive class as positive. Likewise, a true negative is when the model correctly predicts a sample of the negative class as negative.

A false positive is when the model incorrectly predicts a sample of the negative class as positive, while a false negative is when the model incorrectly predicts a sample of the positive class as negative.
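Counting the four outcomes from predictions is straightforward. A minimal sketch (the labels below are invented, with 1 = "wolf is coming" and 0 = "no wolf"):

```python
import numpy as np

# Hypothetical ground truth and model predictions.
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 0, 0, 0, 1, 1, 0, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives: predicted wolf, wolf came
tn = np.sum((y_pred == 0) & (y_true == 0))  # true negatives: predicted no wolf, no wolf
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives: cried wolf, no wolf
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives: missed a real wolf

print(np.array([[tp, fp],
                [fn, tn]]))  # one common 2x2 layout of the confusion matrix
```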

In later sections, we describe how to evaluate classification models using metrics derived from these four outcomes.

3. K-nearest neighbor algorithm (KNN, K-NearestNeighbor)

The K-nearest neighbor (KNN) classification algorithm is one of the simplest methods in data-mining classification. The core idea of KNN is that if most of the k nearest neighbors of a sample in feature space belong to a certain category, then the sample belongs to that category as well and shares the characteristics of the samples in it. When making a classification decision, the method determines the category of the sample to be classified based only on the categories of the nearest one or few samples, so KNN depends on only a very small number of adjacent samples. Because KNN relies mainly on these limited surrounding samples rather than on discriminating class regions, it is better suited than other methods to sample sets whose class regions intersect or overlap heavily.
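A minimal sketch of the idea in NumPy (Euclidean distance and majority vote; real libraries such as scikit-learn's KNeighborsClassifier add spatial indexing, distance weighting, and more):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every training point
    nearest = np.argsort(dists)[:k]              # indices of the k closest samples
    votes = Counter(y_train[nearest])            # count the neighbors' labels
    return votes.most_common(1)[0][0]            # majority class wins

# Toy 2-D data with labels 0 and 1, invented for illustration.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.0], [4.2, 3.9]])
y_train = np.array([0, 0, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.1, 0.9])))  # -> 0
```

With k = 3, two of the query point's three nearest neighbors belong to class 0, so the majority vote assigns it class 0.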

Machine learning (ML) is a multidisciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and more. Here we have introduced a very important topic in machine learning: classification, that is, finding a function that determines which category an input belongs to. The problem can be binary (yes/no) or multi-class (deciding which of several categories the input belongs to). Compared with regression, the output of a classification problem is no longer a continuous value but a discrete value designating the category. Classification problems are extremely common in practice, for example spam detection, handwritten digit recognition, face recognition, and speech recognition.
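As a small illustration of "discrete output" (the scores are invented, and softmax followed by argmax is just one common way to turn scores into a class decision), a multi-class prediction for handwritten digits might end like this:

```python
import numpy as np

# Hypothetical class scores for one input image across the 10 digit classes 0-9.
scores = np.array([0.1, 0.3, 2.5, 0.2, 0.0, 1.1, 0.4, 0.2, 0.9, 0.1])

probs = np.exp(scores) / np.exp(scores).sum()  # softmax: scores -> probabilities
label = int(np.argmax(probs))                  # discrete class decision

print(label)  # 2: the image is classified as the digit "2"
```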

Origin blog.csdn.net/HB_id01289/article/details/128901598