Machine learning basics review

2021 ByteDance Spring Recruitment Internship - Data/Advertising - Recommendation Algorithm - Internship Experience

Derive formulas and implement algorithms from scratch on the spot

kmeans

# coding=utf-8
def distance(pt1, pt2):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(pt1, pt2))

# Build the samples: a 10 x 10 grid of 2-D points
pts = []
for i in range(10):
    for j in range(10):
        pts.append([i, j])

# Initial centers
centers = [[0, 2], [3, 3]]
k = len(centers)
m = 2  # number of dimensions
# Iterate
print(centers)
for _ in range(5):
    # Assign each sample the label of its nearest center
    labels = []
    for pt in pts:
        best_i = -1
        best_dis = float("inf")
        for idx, center in enumerate(centers):
            dis = distance(pt, center)
            if dis < best_dis:
                best_i = idx
                best_dis = dis
        labels.append(best_i)
    # Recompute the centers
    cts = [[] for _ in range(k)]
    for pt, label in zip(pts, labels):
        cts[label].append(pt)
    for label in range(k):
        n_samples = len(cts[label])
        if n_samples == 0:
            continue  # keep the old center if a cluster ends up empty
        for dim in range(m):
            centers[label][dim] = sum(p[dim] for p in cts[label]) / n_samples
    print(centers)
print(labels)
    

logistic regression

Mainly memorize these snippets

  • Cross entropy
loss = -(y @ np.log(y_hat) + (1 - y) @ np.log(1 - y_hat)) / n
  • y_hat calculation

Just the sigmoid

return 1 / (1 + np.exp(-X @ self.w))
  • Gradient calculation (dw is the negative gradient of the loss, so the update adds it to w)
dw = (y - y_hat) @ X
dw /= n

import numpy as np

class LR:
    def __init__(self, lr=0.1, max_iter=1000, eps=1e-5):
        self.eps = eps
        self.max_iter = max_iter
        self.lr = lr

    def predict_proba(self, X):
        # X must already contain the bias column of ones that fit() appends
        return 1 / (1 + np.exp(-X @ self.w))

    def fit(self, X: np.ndarray, y: np.ndarray):
        n, m = X.shape
        X = np.hstack([X, np.ones([n, 1])])  # append a bias column
        m += 1
        self.w = np.zeros([m])
        pre_loss = float("inf")
        for i in range(self.max_iter):
            y_hat = self.predict_proba(X)
            # Cross entropy
            loss = -(y @ np.log(y_hat) + (1 - y) @ np.log(1 - y_hat)) / n
            if abs(loss - pre_loss) < self.eps:
                print('loss has stopped decreasing')
                break
            pre_loss = loss
            print(loss)
            # Gradient (dw is the negative gradient of the loss)
            dw = (y - y_hat) @ X
            dw /= n
            # Gradient descent step
            self.w += dw * self.lr
            print(self.w)

Linear regression

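Hand-written coding problems

The well-digging problem (every household can either dig a well, at cost c[i], or lay a water pipe; a pipe between households i and j costs dp[i][j]. Find the minimum cost for all households to have water)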
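One common way to approach this problem is to add a virtual water source connected to household i by an edge of weight c[i], and then take a minimum spanning tree over the households plus the source. The sketch below uses Prim's algorithm; the function name `min_water_cost` and the dense-matrix input are assumptions for illustration, not the solution given in the interview.

```python
import heapq

def min_water_cost(c, dp):
    """Minimum total cost so that every household gets water.

    c[i]     : cost of digging a well at household i
    dp[i][j] : cost of a pipe between households i and j
    """
    n = len(c)
    source = n  # virtual node; edge (source, i) with weight c[i] means "dig a well at i"
    visited = [False] * (n + 1)
    heap = [(0, source)]  # (edge cost, node), Prim's algorithm starts at the source
    total = 0
    while heap:
        cost, u = heapq.heappop(heap)
        if visited[u]:
            continue
        visited[u] = True
        total += cost
        for v in range(n):
            if not visited[v]:
                weight = c[v] if u == source else dp[u][v]
                heapq.heappush(heap, (weight, v))
    return total

# One well at household 0, then pipes 0-1 and 1-2
print(min_water_cost([1, 2, 2], [[0, 1, 100], [1, 0, 1], [100, 1, 0]]))  # 3
```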

Output the samples in a database that meet four requirements (I have forgotten the specific requirements; hashing was involved)

Data on the order of a hundred million records: temperatures at locations around the Earth. Sort the array in O(n) time. The interviewer was very nice and gave a lot of hints, and I finally got it written... I was embarrassed
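The notes do not record the accepted solution. One standard way to sort in linear time is counting sort, which works here because temperatures fall in a small bounded range. The sketch below assumes each temperature is stored as an integer number of tenths of a degree between `lo` and `hi`; that encoding and the function name `counting_sort` are assumptions.

```python
def counting_sort(values, lo, hi):
    """Sort integers known to lie in [lo, hi] in O(n + (hi - lo)) time."""
    counts = [0] * (hi - lo + 1)
    for v in values:
        counts[v - lo] += 1          # one pass to count each value
    out = []
    for offset, cnt in enumerate(counts):
        out.extend([lo + offset] * cnt)  # emit values in increasing order
    return out

# Temperatures in tenths of a degree, e.g. 251 means 25.1 degrees
print(counting_sort([251, -103, 0, 251, 37], lo=-1000, hi=1000))
# [-103, 0, 37, 251, 251]
```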

Computer Basics

Python deep copy vs. shallow copy
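A minimal illustration of the difference, using the standard `copy` module: a shallow copy creates a new outer container but shares the nested objects, while a deep copy duplicates them recursively. The variable names are only for illustration.

```python
import copy

original = [[1, 2], [3, 4]]
shallow = copy.copy(original)    # new outer list, inner lists are shared
deep = copy.deepcopy(original)   # inner lists are copied recursively

original[0].append(99)

print(shallow[0])  # [1, 2, 99] because shallow[0] is the same object as original[0]
print(deep[0])     # [1, 2]
```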

Regular expression

Hash table

Which Python structure corresponds to a hash table, and how does it resolve hash collisions

Ways to resolve hash collisions

Metric

AUC

Pick one positive and one negative sample at random; AUC is the probability that the positive sample is ranked ahead of the negative one (equivalent to the Wilcoxon-Mann-Whitney test). So the larger the AUC, the more likely a positive sample is ranked ahead of a negative one, that is, the better the classification result.

So AUC reflects the classifier's ability to sort samples.

Plotting/calculation methods

Method 1

Since the predicted probability of each sample is known, sort the probabilities from small to large and use each one in turn as the cutoff. Samples with a predicted probability below the cutoff are predicted negative and those above it are predicted positive, so every sample gets a predicted label, and the true positive rate and false positive rate can be computed and plotted as one point. Moving the cutoff through all the values gives a set of (FPR, TPR) points; the area between the resulting curve and the horizontal axis is the AUC.

Method 2

Suppose there are m + n samples in total, of which m are positive and n are negative, giving m × n positive-negative pairs. For each pair, count 1 if the predicted probability of the positive sample is greater than that of the negative sample. Sum the counts and divide by m × n to get the AUC.

import numpy as np
from sklearn.metrics import roc_auc_score

def auc(labels, probs):
    n_samples = len(labels)
    pos_cnt = sum(labels)
    neg_cnt = n_samples - pos_cnt

    total_comb = pos_cnt * neg_cnt  # number of positive-negative pairs

    pos_index = np.where(labels == 1)[0]  # indices of the positive samples
    neg_index = np.where(labels == 0)[0]  # indices of the negative samples

    cnt = 0
    for pos_i in pos_index:
        for neg_j in neg_index:
            if probs[pos_i] > probs[neg_j]:
                cnt += 1
            elif probs[pos_i] == probs[neg_j]:
                cnt += 0.5  # a tie counts as half
    return cnt / total_comb

labels = np.array([1,1,0,0,1,1,0])
probs= np.array([0.8,0.7,0.5,0.5,0.5,0.5,0.3])
print('ours:', auc(labels,probs))
print('sklearn:', roc_auc_score(labels,probs))

Method 3

Set the tick spacing of the horizontal axis to 1/N and that of the vertical axis to 1/P, where N and P are the numbers of negative and positive samples; then sort the samples by the predicted probability output by the model, from high to low.

Traverse the samples in order and draw the ROC curve starting from the origin: each positive sample moves the curve up one tick along the vertical axis, and each negative sample moves it right one tick along the horizontal axis. After all samples have been traversed the curve ends at the point (1,1), and the whole ROC curve is drawn.
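The traversal described in Method 3 can be sketched as follows. The function name `roc_by_traversal` is hypothetical, and the sketch assumes no positive and negative sample share the same predicted probability (ties would need to be handled as one diagonal step).

```python
def roc_by_traversal(labels, probs):
    """Return the ROC points and the area under them by stepping through sorted samples."""
    P = sum(labels)               # number of positive samples
    N = len(labels) - P           # number of negative samples
    order = sorted(range(len(labels)), key=lambda i: probs[i], reverse=True)
    x = y = 0.0
    points = [(0.0, 0.0)]
    area = 0.0
    for i in order:
        if labels[i] == 1:
            y += 1 / P            # positive sample: one tick up
        else:
            x += 1 / N            # negative sample: one tick right
            area += y / N         # rectangle under the curve for this step
        points.append((x, y))
    return points, area

points, area = roc_by_traversal([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
print(area)  # 0.75
```

Summing the rectangles gives the same value as counting pairs in Method 2: here 3 of the 4 positive-negative pairs are ordered correctly.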

Why is ROC more robust to unbalanced samples than PR?

We know that the ROC curve is drawn by plotting (FPR, TPR) points and the PR curve by plotting (Precision, Recall) points.

When the ratio of positive to negative samples changes (for example, many more negative samples are added), at the same TPR the FPR changes relatively little, so the ROC curve barely moves. The PR curve changes a lot, because at the same TPR the Precision differs greatly.
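A small numeric illustration of this point, assuming a classifier whose TPR and FPR stay fixed at 0.8 and 0.1 while only the number of negative samples changes. The function name `precision_at` and the numbers are made up for illustration.

```python
def precision_at(n_pos, n_neg, tpr, fpr):
    """Precision implied by a given TPR/FPR operating point and class counts."""
    tp = tpr * n_pos   # true positives
    fp = fpr * n_neg   # false positives
    return tp / (tp + fp)

balanced = precision_at(n_pos=100, n_neg=100, tpr=0.8, fpr=0.1)
imbalanced = precision_at(n_pos=100, n_neg=10_000, tpr=0.8, fpr=0.1)

# The ROC point (FPR=0.1, TPR=0.8) is identical in both cases,
# but the PR point moves from about 0.89 precision to about 0.07.
print(round(balanced, 3), round(imbalanced, 3))  # 0.889 0.074
```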

When to choose PR and when to choose ROC?

Essentially, the difference in the first question comes from ROC and PR focusing on different things. ROC looks at positive and negative samples at the same time, while PR looks only at positive samples. This is easy to see: TPR measures the positive samples and FPR measures the negative samples, whereas Precision and Recall both measure the positive samples.

For example, for cancer prediction we would prefer PR, because we want to find as many cancer patients as possible while being as accurate as possible, without missing any. FPR matters less there, because a positive result can always be verified further by other means.

But for a cat-vs-dog image classifier we would prefer ROC, because the goal is to identify both cats and dogs accurately, and ROC pays attention to both classes at once.

Linear model

What to do when L1 is non-differentiable

When the loss function is not differentiable and gradient descent no longer applies, coordinate descent can be used. Gradient descent updates the parameters along the negative gradient direction at the current point, while coordinate descent moves along the coordinate axes. With m features, each coordinate descent update fixes m-1 of the parameters and finds the local optimum of the remaining one, which avoids the problem of the loss function being non-differentiable.
A proximal algorithm can also be used to solve the L1 problem; this method optimizes an upper bound of the loss function.
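As a rough sketch of the coordinate descent idea for an L1-penalized least-squares (lasso) objective, each one-dimensional subproblem has a closed-form solution given by the soft-thresholding operator, which is also the proximal operator of the L1 norm. The objective scaling (1/(2n))·||y − Xw||² + lam·||w||₁ and the function names are assumptions chosen for illustration.

```python
import numpy as np

def soft_threshold(rho, lam):
    """Proximal operator of lam * |w|: shrink rho toward zero by lam."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    n, m = X.shape
    w = np.zeros(m)
    for _ in range(n_iter):
        for j in range(m):
            # Residual with feature j's current contribution removed
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual / n
            z = X[:, j] @ X[:, j] / n
            w[j] = soft_threshold(rho, lam) / z  # exact minimizer along axis j
    return w

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([2.0, 0.05, 2.0, 0.05])
w = lasso_coordinate_descent(X, y, lam=0.1)
print(w)  # the weakly correlated second coefficient is driven exactly to zero
```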

Probability statistics

Probability question: A and B take turns trying to eat a candy, and on each turn the candy gets eaten with probability 1/2. Whoever eats first wins; find the probability that A wins. Follow-up: with two candies, find the expected number of candies.

$\frac{2}{3}$
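One way to arrive at the 2/3, assuming A takes the first turn: let $p$ be the probability that A wins. A either eats on the first turn (probability 1/2), or both A and B fail (probability 1/4) and the game starts over.

```latex
p = \frac{1}{2} + \frac{1}{2}\cdot\frac{1}{2}\cdot p
\quad\Longrightarrow\quad
\frac{3}{4}\,p = \frac{1}{2}
\quad\Longrightarrow\quad
p = \frac{2}{3}
```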

The expectation part: I could not solve it

The incidence of a certain disease is 1/1000. A patient has a 95% probability of testing positive, and a healthy person has a 5% probability of being misdiagnosed as sick. If a person tests positive, what is the actual probability that they have the disease?

Bayes' formula
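Plugging the numbers from the question into Bayes' formula, P(sick | positive) = P(positive | sick)·P(sick) / P(positive), gives a result of just under 2%. The function and parameter names below are only for illustration.

```python
def posterior_sick(prior, sensitivity, false_positive_rate):
    """P(sick | positive test) by Bayes' formula."""
    # Total probability of a positive test: true positives plus false positives
    p_positive = prior * sensitivity + (1 - prior) * false_positive_rate
    return prior * sensitivity / p_positive

p = posterior_sick(prior=1 / 1000, sensitivity=0.95, false_positive_rate=0.05)
print(round(p, 4))  # 0.0187
```

The posterior is small because healthy people outnumber patients 999 to 1, so false positives dominate the positive results.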

Various fancy rand conversions

470. Implement Rand10() with Rand7()

class Solution {
public:
    int rand10() {
        int row, col, idx;
        do {
            row = rand7();
            col = rand7();
            idx = col + (row - 1) * 7;  // uniform over 1..49
        } while (idx > 40);             // reject 41..49
        return 1 + (idx - 1) % 10;
    }
};


  • rand5 implements rand7
class Solution {
public:
    int rand7() {
        int row, col, idx;
        do {
            row = rand5();
            col = rand5();
            idx = col + (row - 1) * 5;  // uniform over 1..25
        } while (idx > 21);             // reject 22..25
        return 1 + (idx - 1) % 7;
    }
};

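Expected number of rand5() calls (2 calls per round, and a round is rejected with probability 4/25):

$2\cdot \frac{1}{1-\frac{4}{25}} = \frac{50}{21} \approx 2.38$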

  • rand11 implements rand7

This one is odd; as I understand it, it is equivalent to the grid above having only one row
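Read that way, a single draw already covers 1..11, so rejection sampling needs no grid: redraw until the value is at most 7. A minimal sketch, with `rand11` as a stand-in for the given generator (assumed uniform over the integers 1..11):

```python
import random

def rand11():
    """Stand-in for the given generator: uniform integer in 1..11 (assumed)."""
    return random.randint(1, 11)

def rand7():
    """Uniform integer in 1..7 by rejecting draws of 8..11."""
    while True:
        x = rand11()
        if x <= 7:
            return x

print([rand7() for _ in range(10)])
```

Each draw is accepted with probability 7/11, so the expected number of rand11() calls is 11/7.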

The difference between maximum likelihood estimation and maximum posterior probability

Maximum likelihood estimation (MLE) gives a way to estimate model parameters from observed data; the sampling in MLE assumes that all samples are independent and identically distributed.

Maximum a posteriori (MAP) estimation gives a point estimate of a quantity that is hard to observe, based on empirical data. The biggest difference from maximum likelihood estimation is that MAP incorporates the prior distribution of the quantity being estimated, so MAP can be regarded as regularized maximum likelihood estimation.
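Written out for i.i.d. samples $x_1,\dots,x_n$ and parameter $\theta$ (notation introduced here for illustration), the two estimates differ only by the log-prior term:

```latex
\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i \mid \theta)
\qquad
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \left[ \sum_{i=1}^{n} \log p(x_i \mid \theta) + \log p(\theta) \right]
```

The $\log p(\theta)$ term plays the role of the regularizer: a Gaussian prior on $\theta$ gives an L2 penalty and a Laplace prior gives an L1 penalty.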

What is the conjugate prior distribution

Let θ be a parameter of the population distribution and π(θ) its prior density function. If the posterior density function computed from the sample information has the same functional form as π(θ), then π(θ) is called a conjugate prior distribution.
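A standard example is the Beta prior for a Bernoulli/binomial success probability: a Beta(a, b) prior combined with observed successes and failures gives a Beta posterior again, so the update only changes the two parameters. The function name below is hypothetical.

```python
def beta_posterior(a, b, successes, failures):
    """Posterior Beta parameters after observing Bernoulli outcomes under a Beta(a, b) prior."""
    return a + successes, b + failures

# Uniform prior Beta(1, 1), then 7 successes and 3 failures
a_post, b_post = beta_posterior(1, 1, successes=7, failures=3)
posterior_mean = a_post / (a_post + b_post)
print(a_post, b_post, round(posterior_mean, 3))  # 8 4 0.667
```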

How do you turn a random number generator that is uniform on 0~1 into one with mean 0 and variance 1?

Uniform distribution produces normal distribution

Probabilistic algorithm-uniform distribution produces normal distribution

Inverse function method

In general, if a probability distribution has distribution function $y=F(x)$, then $y$ ranges over 0~1. Find the inverse function $G$, then generate a random number between 0 and 1 as the input; the output is a random number that follows the distribution:

$y=G(x)$
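The normal distribution function has no closed-form inverse, so as an illustration of the method itself, here is a sketch for the exponential distribution, where $F(x) = 1 - e^{-\lambda x}$ and therefore $G(u) = -\ln(1-u)/\lambda$. The function name and the choice of distribution are assumptions for the example.

```python
import numpy as np

def sample_exponential(rate, size, rng):
    """Inverse function method: push uniform samples through G(u) = -ln(1 - u) / rate."""
    u = rng.random(size)              # uniform on [0, 1)
    return -np.log(1.0 - u) / rate    # 1 - u is in (0, 1], so the log is finite

rng = np.random.default_rng(0)
samples = sample_exponential(rate=2.0, size=100_000, rng=rng)
print(samples.mean())  # close to 1 / rate = 0.5
```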

Central limit theorem

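import numpy as np
import matplotlib.pyplot as plt

n = 12    # uniforms summed per sample; the sum has variance n/12 = 1
N = 5000  # number of samples
x = np.zeros([N])
for j in range(N):
    a = np.random.rand(n)
    u = a.sum()
    x[j] = u - n * 0.5  # subtract the mean n/2 so the result has mean 0
plt.hist(x)
plt.show()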

Box Muller

import numpy as np
import matplotlib.pyplot as plt

N = 1000
x1 = 1.0 - np.random.rand(N)  # in (0, 1], so log(x1) is finite
x2 = np.random.rand(N)
r = np.sqrt(-2 * np.log(x1))
y1 = r * np.cos(2 * np.pi * x2)
y2 = r * np.sin(2 * np.pi * x2)
y = np.hstack([y1, y2])  # 2N samples as a 1-D array, so hist treats them as one dataset
plt.hist(y)
plt.show()

37% rule

At an event, n girls, each holding a rose of a different length, stand in a row in random order. A boy walks from the first to the last and wants to end up with as long a rose as possible. Once he takes one he cannot take another, and once he passes one he cannot go back. What is the best strategy?
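This setup is the classic secretary problem, and the 37% rule in the heading is its usual answer: let the first n/e (about 37%) of the roses pass while remembering the longest, then take the first rose longer than all of those. The simulation below is a sketch that estimates how often this picks the longest rose; the function names and trial counts are arbitrary choices.

```python
import math
import random

def pick_rose(lengths, skip):
    """Skip the first `skip` roses, then take the first one longer than all skipped."""
    best_seen = max(lengths[:skip]) if skip > 0 else float("-inf")
    for length in lengths[skip:]:
        if length > best_seen:
            return length
    return lengths[-1]  # nothing better appeared: forced to take the last rose

def success_rate(n, skip, trials, rng):
    """Fraction of random orderings in which the longest rose is picked."""
    wins = 0
    for _ in range(trials):
        lengths = list(range(n))
        rng.shuffle(lengths)
        if pick_rose(lengths, skip) == n - 1:
            wins += 1
    return wins / trials

rng = random.Random(0)
n = 100
rate = success_rate(n, skip=round(n / math.e), trials=20_000, rng=rng)
naive = success_rate(n, skip=0, trials=20_000, rng=rng)  # always take the first rose
print(rate, naive)  # roughly 0.37 versus roughly 1/n
```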


Origin blog.csdn.net/TQCAI666/article/details/114994979