Notes on 百面机器学习 (Hundred-Interview Machine Learning), part 4

Model Assessment

  ROC curve

  question: How is AUC calculated?

  answer: AUC is the area under the ROC curve. This single value quantifies the model performance that the ROC curve reflects. To compute AUC, integrate the ROC curve along its horizontal axis. Because the ROC curve generally lies above the straight line y = x (if it does not, flipping the model's predicted probability p to 1-p yields a better classifier), AUC generally lies between 0.5 and 1. The larger the AUC, the more likely the classifier is to rank true positive samples ahead of negative ones, and the better the classification performance.
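  The integration described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `roc_curve_points` and `auc_trapezoid` and the toy labels and scores are made up for the example. The ROC points are obtained by sweeping the threshold from high to low, and the area is accumulated with the trapezoidal rule.

```python
def roc_curve_points(labels, scores):
    """Return ROC points (FPR, TPR), sweeping the threshold from high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Samples that share a score cross the threshold together (ties).
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc_trapezoid(points):
    """Integrate the ROC curve along the horizontal (FPR) axis."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data (hypothetical): 1 = positive, 0 = negative.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = auc_trapezoid(roc_curve_points(labels, scores))
# Flipping p to 1 - p mirrors the ranking, so the two AUC values sum to 1.
auc_flipped = auc_trapezoid(roc_curve_points(labels, [1 - s for s in scores]))
print(auc, auc_flipped)
```

  On this toy data 8 of the 9 positive/negative pairs are ranked correctly, so the AUC is 8/9, and the flipped scores give 1/9.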

  Question: What are the characteristics of the ROC curve compared with the PR ( https://wordpress.aberttsy.cn/index.php/2020/04/01/machine-learning-3/ ) curve?

  answer: Compared with the PR curve, the ROC curve has one notable characteristic: when the distribution of positive and negative samples changes, the shape of the ROC curve stays largely unchanged, whereas the shape of the PR curve generally changes drastically.

 

  As the comparison shows, the PR curve changes significantly while the shape of the ROC curve stays largely the same. This property lets the ROC curve minimize the interference caused by different test sets and measure the performance of the model itself more objectively. What does this mean in practice? In many real problems, the numbers of positive and negative samples are highly imbalanced. For example, computational advertising often involves conversion-rate models, where the number of positive samples is often 1/1000 or even 1/10000 of the number of negative samples. If a different test set is chosen, the PR curve can change a great deal, whereas the ROC curve reflects the quality of the model itself more stably. The ROC curve therefore applies to more scenarios and is widely used in ranking, recommendation, advertising, and similar fields. Note, however, that the choice between the PR curve and the ROC curve depends on the practical problem: if researchers want to see how the model performs on a particular dataset, the PR curve reflects that performance more directly.
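  A small experiment can illustrate the stability claim. The sketch below is only an illustration under made-up scores: it keeps the positives fixed, repeats the negatives ten times, and compares AUC (computed as the fraction of correctly ordered positive/negative pairs) with precision and recall at one threshold. The function names and the threshold 0.5 are assumptions chosen for the example.

```python
def pairwise_auc(pos_scores, neg_scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def precision_recall_at(pos_scores, neg_scores, threshold):
    """Precision and recall when every score >= threshold is predicted positive."""
    tp = sum(s >= threshold for s in pos_scores)
    fp = sum(s >= threshold for s in neg_scores)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / len(pos_scores)
    return precision, recall


# Hypothetical model scores for positive and negative samples.
pos = [0.9, 0.8, 0.6, 0.55]
neg = [0.7, 0.4, 0.3, 0.2]
neg_x10 = neg * 10  # same score distribution, ten times as many negatives

auc_before = pairwise_auc(pos, neg)
auc_after = pairwise_auc(pos, neg_x10)
prec_before, rec_before = precision_recall_at(pos, neg, 0.5)
prec_after, rec_after = precision_recall_at(pos, neg_x10, 0.5)

print("AUC:", auc_before, "->", auc_after)
print("precision:", prec_before, "->", prec_after, "recall:", rec_before, "->", rec_after)
```

  Here the AUC stays at 0.875 and recall stays at 1.0, while precision at the same threshold falls from 0.8 to 4/14, which is why a PR curve drawn from the enlarged test set looks very different.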

  Cosine distance

  question: Why is cosine similarity used instead of Euclidean distance in some scenarios?

  answer: For two vectors A and B, the cosine similarity is defined as $\cos(A, B) = \frac{A \cdot B}{\|A\|_2 \|B\|_2}$, i.e., the cosine of the angle between the two vectors. It captures the angular relationship between the vectors and ignores their absolute magnitudes; its range is [-1, 1]. When two texts differ greatly in length but have similar content, and word frequencies or word vectors are used as features, their Euclidean distance in feature space is usually large, whereas the angle between them may be small, so their cosine similarity is high. Furthermore, in fields such as text, images, and video, the feature dimensionality of the objects under study is often high. Cosine similarity keeps the property "1 for the same direction, 0 for orthogonal, -1 for opposite" even in high dimensions, whereas the value of Euclidean distance is affected by the dimensionality, has no fixed range, and is harder to interpret.

  In some scenarios, e.g. Word2Vec, the vector lengths (norms) are normalized. In that case Euclidean distance is a monotonic function of cosine distance, i.e.,

$$\|A - B\|_2 = \sqrt{2\,(1 - \cos(A, B))}$$

where $\|A - B\|_2$ is the Euclidean distance, $\cos(A, B)$ is the cosine similarity, and $1 - \cos(A, B)$ is the cosine distance. In this scenario, selecting the nearest neighbor by smallest distance (largest similarity) gives the same result whether cosine similarity or Euclidean distance is used.
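  The relationship for normalized vectors can be checked numerically. The following is a minimal sketch with arbitrary example vectors (not taken from any real Word2Vec model): it normalizes the vectors to unit length, compares the Euclidean distance with sqrt(2 * (1 - cos)), and confirms that both measures pick the same nearest neighbor.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Arbitrary example vectors, normalized to unit length.
query = normalize([1.0, 2.0, 3.0])
candidates = [normalize(v) for v in ([3.0, 2.0, 1.0], [1.0, 2.0, 2.5], [-1.0, 0.5, 2.0])]

# Largest gap between the Euclidean distance and sqrt(2 * (1 - cos)).
max_gap = max(
    abs(
        euclidean_distance(query, c)
        - math.sqrt(max(0.0, 2.0 * (1.0 - cosine_similarity(query, c))))
    )
    for c in candidates
)

nearest_by_euclid = min(
    range(len(candidates)), key=lambda i: euclidean_distance(query, candidates[i])
)
nearest_by_cosine = max(
    range(len(candidates)), key=lambda i: cosine_similarity(query, candidates[i])
)
print(max_gap, nearest_by_euclid, nearest_by_cosine)
```

  The gap is zero up to floating-point error, and both criteria select the same candidate. Without the normalization step the two rankings are not guaranteed to agree.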

  Overall, Euclidean distance reflects the absolute difference in values, while cosine distance reflects the relative difference in direction. For example, consider users' viewing behavior for two dramas: user A's viewing vector is (0,1) and user B's is (1,0). The cosine distance between them is large, while the Euclidean distance is small. When analyzing two users' preferences for different videos, we care more about relative differences, so cosine distance should clearly be used. When analyzing user activity with the number of logins (unit: times) and the average viewing duration (unit: minutes) as features, cosine distance would consider the users (1,10) and (10,100) to be very close, yet their activity levels clearly differ greatly. Here we care more about the absolute difference in values, so Euclidean distance should be used.
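  The two examples above can be computed directly. This is a small sketch using the vectors from the text; the helper names and the dictionary keys are made up for the illustration.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


examples = {
    "viewing preference": ((0, 1), (1, 0)),
    "user activity": ((1, 10), (10, 100)),
}

results = {
    name: (cosine_distance(a, b), euclidean_distance(a, b))
    for name, (a, b) in examples.items()
}
for name, (cos_d, euc_d) in results.items():
    print(f"{name}: cosine distance = {cos_d:.4f}, Euclidean distance = {euc_d:.4f}")
```

  For the viewing vectors the cosine distance is 1 (orthogonal directions) with a Euclidean distance of about 1.41. For the activity vectors the cosine distance is essentially 0 while the Euclidean distance is about 90, which matches the intuition that the second pair differs in magnitude rather than direction.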

  question: Is cosine distance a strictly defined distance?

  answer: First, recall the definition of a distance: on a set, if every pair of elements uniquely determines a real number such that the three distance axioms (positive definiteness, symmetry, triangle inequality) hold, then that real number can be called the distance between the pair. Cosine distance satisfies positive definiteness (in the sense that it is never negative and is 0 exactly when the two vectors point in the same direction) and symmetry, but it does not satisfy the triangle inequality, so it is not a strictly defined distance.
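  A concrete counterexample makes the failure of the triangle inequality easy to see. The sketch below uses the vectors A = (1, 0), B = (1, 1), C = (0, 1), chosen for this illustration, and checks whether dist(A, C) <= dist(A, B) + dist(B, C).

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


A, B, C = (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)

d_ab = cosine_distance(A, B)  # 1 - sqrt(2)/2
d_bc = cosine_distance(B, C)  # 1 - sqrt(2)/2
d_ac = cosine_distance(A, C)  # 1 (orthogonal vectors)

triangle_inequality_holds = d_ac <= d_ab + d_bc
print(d_ab, d_bc, d_ac, triangle_inequality_holds)
```

  Here dist(A, B) and dist(B, C) are each about 0.293, so their sum is about 0.586, which is smaller than dist(A, C) = 1. The inequality therefore fails for this triple.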

  Pitfalls of A/B testing

  question: After a model has been fully evaluated offline, why is an online A/B test still needed?

  answer:

  (1) Offline evaluation cannot completely eliminate the effect of model overfitting, so offline evaluation results cannot fully replace online evaluation results.
  (2) Offline evaluation cannot fully reproduce the online engineering environment. In general, offline evaluation does not account for latency, data loss, missing labels, and similar conditions in the online environment. Offline evaluation results are therefore results under ideal engineering conditions.
  (3) Some business metrics of the online system cannot be computed in offline evaluation. Offline evaluation generally targets the model itself, while other metrics related to the model, especially business metrics, often cannot be obtained directly. For example, when a new recommendation algorithm goes online, offline evaluation tends to focus on improvements in the ROC curve and PR curve, whereas online evaluation gives a full picture of the changes the algorithm brings to user click-through rate, retention duration, page views (PV), and so on. All of these require a comprehensive evaluation through A/B testing.

  question: How is an online A/B test carried out?

  answer: The primary means of A/B testing is user bucketing, i.e., splitting users into an experimental group and a control group, applying the new model to users in the experimental group and the old model to users in the control group. During bucketing, pay attention to sample independence and unbiased sampling: make sure the same user can only ever be assigned to one bucket, and make sure the user_id used in bucketing is a random number, so that the samples in each bucket are unbiased.
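  One common way to meet both requirements (a user always lands in the same bucket, and the assignment is effectively random) is to hash the user_id. This is an assumption for illustration rather than the method the text prescribes; the function name `assign_bucket`, the salt string, and the 50/50 split are all hypothetical.

```python
import hashlib


def assign_bucket(user_id, salt="exp_model_a_vs_b", treatment_share=0.5):
    """Deterministically map a user_id to 'experiment' or 'control'.

    The salt (hypothetical experiment name) keeps assignments in different
    experiments independent of each other.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # Use the first 8 hex digits as a pseudo-random position in [0, 1).
    position = int(digest[:8], 16) / 16 ** 8
    return "experiment" if position < treatment_share else "control"


# The same user always gets the same bucket.
first = assign_bucket("user_42")
second = assign_bucket("user_42")

# Across many users the split is close to the requested share.
users = [f"user_{i}" for i in range(10_000)]
experiment_share = sum(assign_bucket(u) == "experiment" for u in users) / len(users)
print(first, second, experiment_share)
```

  Because the bucket depends only on the salt and the user_id, no assignment table needs to be stored, and changing the salt reshuffles users for a new experiment.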

  question: How should the experimental group and the control group be divided (a new model A has been developed while existing users are served by model B; how should users be divided so that model A can be validated)?

  answer: Split users by user_id into an experimental group and a control group, which use model A and model B respectively, in order to verify the effect of model A.

Model evaluation methods

  question: What are the main validation methods used in model evaluation, and what are their advantages and disadvantages?

  answer:

  (1) The holdout test is the simplest and most direct validation method: the original sample set is randomly split into a training set and a validation set. For example, for a click-through-rate prediction model, we split the samples 70%/30%: 70% of the samples are used for model training, and 30% are used for model validation, including plotting the ROC curve and computing precision, recall, and other metrics to evaluate model performance. The drawback of the holdout test is obvious: the evaluation metric computed on the validation set depends heavily on the original split. To eliminate this randomness, researchers introduced the idea of "cross-validation".

  (2) k-fold cross-validation: first, divide all samples into k subsets of equal size; then traverse the k subsets in turn, each time using the current subset as the validation set and all remaining subsets as the training set to train and evaluate the model; finally, take the average of the k evaluation metrics as the final metric. In practice, k is often set to 10. Leave-one-out validation: each time, hold out 1 sample as the validation set and use all other samples as the training set. With n samples in total, traverse all n samples in turn, validate n times, and average the evaluation metrics to obtain the final metric. When the total number of samples is large, leave-one-out validation has a very high time cost. In fact, leave-one-out validation is a special case of leave-p-out validation, which holds out p samples as the validation set each time. Choosing p elements from n elements has $C_n^p$ possibilities, so its time cost is far higher than that of leave-one-out validation, and it is rarely used in practice.

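  Supplement: Both the holdout test and cross-validation evaluate the model by splitting the data into a training set and a test set. However, when the sample size is small, splitting the sample set shrinks the training set further, which may hurt model training.

  (3) The bootstrap method is a validation method based on bootstrap sampling. For a sample set of size n, draw n times at random with replacement to obtain a training set of size n. During the n draws, some samples are drawn repeatedly and some are never drawn; the samples that are never drawn serve as the validation set for model validation. This is the validation process of the bootstrap method.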
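  For the two split-based schemes, (1) and (2), the index bookkeeping can be sketched as follows. This is a minimal illustration using only the Python standard library, not a replacement for an established library; the function names and the `evaluate` callback (which would train on one index list and score on the other) are hypothetical. Leave-one-out falls out as the special case k = n.

```python
import random


def holdout_split(n, train_fraction=0.7, seed=0):
    """Randomly split indices 0..n-1 into a training part and a validation part."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = round(n * train_fraction)
    return indices[:cut], indices[cut:]


def k_fold_splits(n, k, seed=0):
    """Yield (training, validation) index lists; each fold is the validation set once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, folds[i]


def cross_validate(n, k, evaluate, seed=0):
    """Average the metric returned by evaluate(training, validation) over k folds."""
    scores = [evaluate(train, val) for train, val in k_fold_splits(n, k, seed)]
    return sum(scores) / len(scores)


train_idx, val_idx = holdout_split(10)            # 70% / 30%
five_fold = list(k_fold_splits(10, k=5))          # 5 splits of 8 / 2
leave_one_out = list(k_fold_splits(10, k=10))     # k = n: 10 splits of 9 / 1

# Placeholder metric for the demo: the share of samples held out for validation.
mean_share = cross_validate(10, 5, lambda train, val: len(val) / (len(train) + len(val)))
print(len(train_idx), len(val_idx), len(five_fold), len(leave_one_out), mean_share)
```

  In a real experiment the placeholder lambda would be replaced by a function that fits the model on the training indices and returns a metric such as AUC on the validation indices.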

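  question: In bootstrap sampling, n samples are drawn n times with replacement. As n tends to infinity, what fraction of the data is never selected?

  answer: The probability that a given sample is not drawn in a single draw is $1 - \frac{1}{n}$, so the probability that it is never drawn in n draws is $\left(1 - \frac{1}{n}\right)^n$. As n tends to infinity,

$$\lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^n = \frac{1}{e} \approx 0.368$$

Therefore, when the number of samples is large, about 36.8% of the samples are never selected and can be used as the validation set.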
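  The 36.8% figure can be checked both analytically and by simulation. The sketch below is an illustration with arbitrarily chosen sample sizes and seed: it evaluates (1 - 1/n)^n for growing n, compares it with 1/e, and then simulates one round of bootstrap sampling to measure the share of samples that are never drawn.

```python
import math
import random


def never_selected_fraction(n, seed=0):
    """Draw n times with replacement from n samples; return the share never drawn."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1.0 - len(drawn) / n


# Probability that a given sample is never drawn, for increasing n.
analytic = {n: (1.0 - 1.0 / n) ** n for n in (10, 100, 10_000)}
limit = math.exp(-1.0)

simulated = never_selected_fraction(100_000)

for n, value in analytic.items():
    print(f"n = {n}: (1 - 1/n)^n = {value:.5f}")
print(f"1/e = {limit:.5f}, simulated out-of-sample share = {simulated:.5f}")
```

  The analytic values rise toward 1/e as n grows, and the simulated share for a large n lands close to the same value, so roughly a third of the data is available for validation without an explicit split.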

Origin www.cnblogs.com/tsy-0209/p/12629699.html