Machine Learning Crash Course | Google Developers
- https://developers.google.com/machine-learning/crash-course/
- Google's fast-paced, practical introduction to machine learning
Classification
- Thresholding
- Logistic regression returns a probability. You can use the returned probability "as is" (for example, the probability that the user will click on this ad is 0.00023) or convert the returned probability to a binary value (for example, this email is spam).
- In order to map a logistic regression value to a binary category, you must define a classification threshold (also called the decision threshold). A value above that threshold indicates "spam"; a value below indicates "not spam." It is tempting to assume that the classification threshold should always be 0.5, but thresholds are problem-dependent, and are therefore values that you must tune.
- Note: "Tuning" a threshold for logistic regression is different from tuning hyperparameters such as learning rate. Part of choosing a threshold is assessing how much you'll suffer for making a mistake. For example, mistakenly labeling a non-spam message as spam is very bad. However, mistakenly labeling a spam message as non-spam is unpleasant, but hardly the end of your job.
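The thresholding idea above can be sketched in a few lines of Python (the probabilities and the `classify` helper are hypothetical, not from the course):

```python
# Hypothetical probabilities returned by a logistic regression model.
probabilities = [0.00023, 0.71, 0.45, 0.93]

def classify(prob, threshold=0.5):
    """Map a probability to a binary label using a tunable decision threshold."""
    return 1 if prob >= threshold else 0

# With the default threshold of 0.5:
print([classify(p) for p in probabilities])                  # [0, 1, 0, 1]

# Raising the threshold makes the classifier more conservative
# about predicting the positive class (e.g. "spam"):
print([classify(p, threshold=0.8) for p in probabilities])   # [0, 0, 0, 1]
```

Choosing the threshold value (0.5 vs. 0.8 here) is exactly the problem-dependent tuning the note describes: it depends on the relative cost of false positives and false negatives.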
- True vs. False and Positive vs. Negative
- A true positive is an outcome where the model correctly predicts the positive class. Similarly, a true negative is an outcome where the model correctly predicts the negative class.
- A false positive is an outcome where the model incorrectly predicts the positive class. And a false negative is an outcome where the model incorrectly predicts the negative class.
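The four outcome types can be counted directly from true and predicted labels. A minimal sketch (the `confusion_counts` helper and the sample labels are illustrative assumptions, with 1 as the positive class):

```python
def confusion_counts(y_true, y_pred):
    """Count (TP, TN, FP, FN) for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # (2, 2, 1, 1)
```

These four counts are the building blocks for every metric that follows (accuracy, precision, recall).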
- Accuracy
- Accuracy is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions our model got right. Formally, accuracy has the following definition:
- Accuracy = Number of correct predictions / Total number of predictions
- For binary classification, accuracy can also be calculated in terms of positives and negatives as follows:
- Accuracy = (TP+TN) / (TP+TN+FP+FN)
- Where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives.
- Accuracy alone doesn't tell the full story when you're working with a class-imbalanced data set, where there is a significant disparity between the number of positive and negative labels.
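A quick illustration of the class-imbalance pitfall (the data here is made up for the sketch): a model that always predicts the majority class can score high accuracy while being useless.

```python
# Class-imbalanced data: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A degenerate model that always predicts the negative class:
y_pred = [0] * 100

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.95 -- looks strong, yet the model finds zero positives
```

This is why precision and recall, introduced next, matter on imbalanced problems.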
- Precision and Recall
- Precision attempts to answer the following question:
- What proportion of positive identifications was actually correct?
- Precision is defined as follows:
- Precision = TP / (TP+FP)
- Note: A model that produces no false positives has a precision of 1.0.
- Recall attempts to answer the following question:
- What proportion of actual positives was identified correctly?
- Mathematically, recall is defined as follows:
- Recall = TP / (TP+FN)
- Note: A model that produces no false negatives has a recall of 1.0.
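Both formulas above can be computed from the confusion counts. A small sketch, assuming hypothetical counts (the zero-division guards are a practical addition, not part of the course's definitions):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); Recall = TP/(TP+FN).
    Returns 0.0 for an undefined (0/0) metric."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
p, r = precision_recall(tp=8, fp=2, fn=4)
print(p, r)  # 0.8 and roughly 0.667
```

Note how the two metrics diverge: this model is fairly precise (few of its positive calls are wrong) but misses a third of the actual positives.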
- To fully evaluate the effectiveness of a model, you must examine both precision and recall. Unfortunately, precision and recall are often in tension. That is, improving precision typically reduces recall and vice versa.
- Various metrics have been developed that rely on both precision and recall. For example, see F1 score.
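One such combined metric is the F1 score, the harmonic mean of precision and recall. A minimal sketch (the helper function is illustrative):

```python
def f1_score(precision, recall):
    """F1 = harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The harmonic mean punishes imbalance: a model with high precision
# but low recall scores well below the arithmetic mean of the two.
print(f1_score(0.8, 0.5))  # roughly 0.615, vs. arithmetic mean 0.65
```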