100 questions and answers for the 2023 autumn recruitment: algorithm positions for a master's candidate from a non-elite ("double non") university with a non-computer-science background

Self introduction

A good self-introduction leaves a strong, positive impression on the interviewer. This part is entirely in your own words, and you control which points get highlighted, so the interviewer judges your familiarity with each project and the depth of your thinking from how you tell it. That makes preparing in advance particularly important. During the interview, describe each project in a fixed logical order. For algorithm projects the most important things are the data, features, models, and results. If you explain them clearly within this framework, the interviewer can follow easily and the rest of the interview goes more smoothly, because the interviewer picks keywords out of your self-introduction and asks a series of questions about them in the subsequent Q&A session. This implies a tip: you must understand whatever you mention better than the interviewer does. Do not bring up things you only vaguely understand, or you will be found out sooner or later.

  • Project Introduction

This is the top priority. The projects reflect the overall quality of a candidate. So what should those of us without internship experience do? It is simple: enter competitions, aim for a good ranking, and become thoroughly familiar with the algorithms used in the competition. The interviewer will certainly dig into the details of your project and ask you to analyze the points they have doubts about, such as positive and negative sample selection, feature processing, and model details. If you used a tree model in your competition, you need to know every knowledge point related to tree models. Let me give you a few examples: Why is XGBoost insensitive to missing values? Compared with ordinary GBDT, how does XGBoost handle missing values? Why can xgboost/gbdt achieve high accuracy with a very small tree depth when tuning parameters? If you answer detailed questions like these vaguely, the interviewer will deduct points, so do not gamble on the interviewer not asking. Murphy's law tells us that anything that can go wrong will go wrong. All of this can be prepared. It is almost an open-book exam, so think about your answers in advance rather than racking your brains on the spot. As for competition platforms, Alibaba Tianchi, Kaggle, and similar platforms are where people usually compete.

  • Algorithm details

In addition to examining the details of the algorithms that appeared in your projects, the interviewer will also ask about your machine learning fundamentals. I summarize the more important topics here. Traditional algorithms: logistic regression, naive Bayes, tree models (random forest/Adaboost/xgboost/lightgbm), SVM, PageRank, clustering. Machine learning theory: class imbalance, over-fitting, cross-validation, model selection. Recommendation systems: collaborative filtering, FM/FFM, LS-PLM, Wide&Deep, DeepFM, DIN, DIEN, ESMM, Embedding, recall, EE (exploration and exploitation), performance evaluation. These are the core of algorithm positions. In addition, some interviewers also test programming languages and tools, for example C++/python/Spark. To prepare for this part, I converted every question I could think of into question-and-answer form and prepared by asking and answering them myself. All the questions are listed at the end of the article. You can prepare according to this list, or choose some of them. Because I am applying for a recommendation algorithm position, I focus more on that area. If I have time in the future, I may expand to other fields, such as natural language processing and computer vision. The recommended books also come from the experience of experts on Niuke: Li Hang's "Statistical Learning Methods", "Hundred Faces of Machine Learning", "Hundred Faces of Deep Learning", "Deep Learning Recommendation System", and Zhou Zhihua's "Machine Learning". Of course, reading these books is absolutely not enough, because reading does not mean you have mastered the material. Go through the list of questions I give here and answer them in your head, or write the answers down. This has the best effect.
By the time autumn recruitment came around, I could answer almost all of them instantly and felt fully in control. I will also publish the answers to these questions one after another on my official account. I have basically finished writing them. You can also check out my website. The website and official account are introduced below. Everyone is welcome to communicate.

  • Data structure and algorithm questions

This is also critical. Some companies decide whether you pass or fail based on your performance in this area. Toutiao is famous for its love of dynamic programming and tends to pick medium or hard questions, which is a headache. Many students without a computer science background cannot come up with a clear idea on the spot because they lack the fundamentals and have not practiced enough. My suggestion is to first practice by topic, such as dynamic programming, sliding window, two pointers, fast and slow pointers, and topK, for about 200 questions, and then practice random questions. Be sure to do many questions. This is an interview rule that cannot be emphasized enough. The recommended book is "Jianzhi Offer" (literally "Sword Pointing to Offer"), and for a website you can use the leetcode Chinese site.

1. Machine Learning Questions and Answers
1. Logistic Regression

  • Linear regression
    Linear regression is usually divided into two types: simple linear regression (y=ax) and multiple linear regression (y=a1x1+a2x2+…+anxn). The basic principle is to minimize the least-squares error function, for example with gradient descent.
  • Logistic regression definition
    Logistic regression is a generalized linear classification model. Its structure can be regarded as a single-layer neural network: an input layer and an output layer with a single neuron that uses the sigmoid activation function, and no hidden layers. The model works in two steps: a linear sum of the input features [x] weighted by the model weights [w], followed by a sigmoid activation that outputs a probability.

The predicted value, with b denoting the bias term, is:

$$\hat{y} = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}}$$

  • Maximum likelihood estimation:
    Condition: Assume that the samples are independent and identically distributed;
    Goal: Estimate the parameter theta in this distribution;
    Method: choose the parameter value under which the observed samples have the highest probability; that value is the estimate.
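
As a minimal sketch of the method above, assume the samples are independent coin flips from a Bernoulli distribution with unknown parameter theta (the sample values and the grid search are illustrative assumptions, not part of the original notes). The log-likelihood is evaluated on a grid of candidate values, and the maximizer coincides with the sample mean.

```python
import math

def log_likelihood(theta, samples):
    """Log-probability of i.i.d. 0/1 samples under Bernoulli(theta)."""
    return sum(math.log(theta if s == 1 else 1.0 - theta) for s in samples)

# Hypothetical observations: 7 ones out of 10 draws.
samples = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

# Candidate parameter values 0.01, 0.02, ..., 0.99.
grid = [i / 100 for i in range(1, 100)]

# Maximum likelihood estimate: the candidate with the largest log-likelihood.
theta_hat = max(grid, key=lambda t: log_likelihood(t, samples))
sample_mean = sum(samples) / len(samples)
print(theta_hat, sample_mean)
```

In practice the maximizer is usually found analytically or with gradient-based optimization; the grid is used here only to keep the idea visible.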

  • Derive the loss function of logistic regression and explain its meaning.
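    The loss function for a binary classification problem, with label y ∈ {0, 1} and predicted probability ŷ, can be written in the following form:
    $$L(\hat{y}, y) = -\left[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\right]$$
    To get the loss over m samples, sum the m individual losses and take the average:
    $$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L\left(\hat{y}^{(i)}, y^{(i)}\right)$$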

This is the learning objective of LR derived by the maximum likelihood method: the cross-entropy loss (also called the logarithmic loss). Minimizing it pushes the predicted probability distribution toward the distribution of the true labels, and the closer the two distributions are, the better the model. One point is worth noting: because logistic regression outputs a sigmoid probability, the predicted value can only approach 0 or 1 and never reach them, so the cross-entropy loss never becomes exactly 0.
LR seeks the maximum likelihood estimate, that is, the predicted distribution produced by the model should be as close as possible to the actual distribution of the data.

KL divergence (relative entropy) measures the difference between two probability distributions. Finding the maximum likelihood estimate is equivalent to minimizing the negative log-likelihood, which in turn is equivalent to minimizing the KL divergence between the true distribution and the predicted distribution. So minimizing the KL divergence gives the maximum likelihood estimate. KL divergence contains two parts, cross entropy and information entropy, that is, KL = cross entropy − information entropy. Information entropy measures the amount of information required to eliminate uncertainty. Here it is the entropy of the true probability distribution, which is fixed by the data and does not depend on the model. So minimizing the KL divergence is equivalent to minimizing the cross entropy.

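In machine learning we have the concept of a loss function, which measures how wrong a model's predictions are. Taking the average negative log-likelihood over the entire dataset of m samples, with label y and predicted probability ŷ, we get:

$$J(w) = -\frac{1}{m} \sum_{i=1}^{m} \left[\, y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]$$

That is, in the logistic regression model, maximizing the likelihood function and minimizing the loss function are equivalent.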
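
The equivalence can be checked on a small example. The sketch below is only illustrative: the function names, the toy data, and the weights are assumptions, not part of the original notes. It computes the average cross-entropy loss of a logistic regression model and compares it with the negative average log-likelihood of the labels.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Linear sum of the features followed by the sigmoid."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def log_loss(w, b, X, y):
    """Average cross-entropy over the m samples."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(w, b, xi)
        total += -(yi * math.log(p) + (1 - yi) * math.log(1 - p))
    return total / len(X)

def log_likelihood(w, b, X, y):
    """Log of the Bernoulli likelihood of the labels under the model."""
    ll = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(w, b, xi)
        ll += math.log(p if yi == 1 else 1 - p)
    return ll

# Toy data and weights, chosen only for illustration.
X = [[0.5, 1.0], [2.0, -1.0], [-1.5, 0.3], [1.0, 1.0]]
y = [1, 1, 0, 1]
w, b = [0.8, 0.4], -0.1

loss = log_loss(w, b, X, y)
ll = log_likelihood(w, b, X, y)
print(loss, -ll / len(X))  # the two numbers agree up to rounding
```

The loss stays strictly positive here, which matches the remark that a sigmoid output never reaches exactly 0 or 1.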

  • In the advertising LR model, why do we need to combine features?
    Feature combination refers to the composite feature formed by multiplying two or more features.
    Multiplicative combinations of features can provide predictive power beyond what these features alone can provide.
    If you simply combine all features in pairs, you easily end up with too many parameters and overfitting, and not all feature combinations are meaningful. So how do you combine features effectively? The most commonly used approach is a feature-combination search based on decision trees.
  • Why does the LR model use the sigmoid function? What is the mathematical principle behind it? Why not use other functions?

The Sigmoid function commonly used in deep learning is a special case of the Logistic distribution function: it is the cumulative distribution function of the standard logistic distribution (location 0, scale 1).

From the purpose of LR, a candidate function must meet two conditions (points 1 and 2), and the sigmoid additionally has a convenient derivative (point 3):

  1. The value range is between 0 and 1.
  2. For the occurrence of an event, 50% is the watershed of its outcome, so the function should be centrally symmetric about the point (0, 0.5).
  3. The derivative of the sigmoid function can be expressed through the function itself: f′(x)=f(x)(1−f(x)).
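
The symmetry condition and the derivative identity listed above can be checked numerically. This is a small sketch with hypothetical helper names; it compares the closed-form derivative with a central finite difference at a few test points.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Closed form: f'(x) = f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def numeric_grad(f, x, h=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

points = [-3.0, -0.5, 0.0, 0.5, 3.0]

# Largest gap between the closed form and the numerical derivative.
max_gap = max(abs(sigmoid_grad(x) - numeric_grad(sigmoid, x)) for x in points)

# Central symmetry about (0, 0.5): f(x) + f(-x) = 1.
symmetric = all(abs(sigmoid(x) + sigmoid(-x) - 1.0) < 1e-12 for x in points)
print(max_gap, symmetric)
```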

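Among distributions over two values with a given expected value, the Bernoulli distribution is the one with maximum entropy. The Bernoulli distribution also belongs to the exponential family, and the sigmoid is what you get when its mean is expressed in terms of its natural parameter (the log-odds).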

https://blog.csdn.net/saltriver/article/details/57531963

  • Why can LR be used for click-through rate prediction?
  • What kind of data is best to use LR for? In other words, in order for LR to work better, what should be done to the data?
  • Can logistic regression solve nonlinear classification problems?
    Yes, after applying a nonlinear transformation (such as a kernel mapping) to the features. The linear kernel mapping is just the inner product, which is a projection and keeps the model linear.
  • Given a data set with m samples and n-dimensional features, what are the dimensions of the gradient in the LR algorithm?
  • Why does the logistic regression loss function use maximum likelihood estimation instead of least squares?
  • How to solve for the parameters of logistic regression?
  • SVM
    SVM is a binary classifier. Its goal is to find a hyperplane that keeps both classes of data as far from it as possible, so that new data can be classified more accurately, that is, so that the classifier is more robust.

Support vectors: the points closest to the separating hyperplane.
Finding the maximum margin: maximize the distance from the support vectors to the separating hyperplane, and find the separating hyperplane under this condition.

  • work process:

  • When SVM is linearly inseparable:
    Use a kernel function: map the input space to a high-dimensional feature space through a nonlinear transformation φ(x). Kernel functions commonly used in machine learning:
    Linear: inner product

  • Slack variable (soft margin):

    The situations discussed before are based on the assumption that the samples are linearly separable. When the samples are linearly inseparable, we can try to use the kernel function to map the features to high dimensions, so that they are likely to be separable. However, we cannot 100% guarantee separability after mapping. So what should we do? We need to adjust the model to ensure that the separating hyperplane can be found as much as possible even if it is inseparable.

Soft margin: after introducing non-negative slack variables ξ_i, the functional margin of some sample points is allowed to be less than 1, that is, those points may lie inside the maximum margin or even on the other class's side. After relaxing the constraints, we need to readjust the objective function, where C > 0 is the penalty coefficient:

$$\min_{w,\,b,\,\xi} \ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{m} \xi_i \quad \text{s.t.} \quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0$$

With the penalty term added, the more outliers there are, the larger the objective function value, while what we want is the smallest possible objective function value.
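
A minimal sketch of how outliers raise the objective, assuming the standard soft-margin form 1/2||w||^2 + C·Σξ_i with labels in {−1, +1} (the hyperplane, the data points, and the value of C are illustrative assumptions). Each slack is the amount by which a point violates the margin constraint.

```python
def slack(w, b, x, y):
    """Slack xi = max(0, 1 - y * (w.x + b)); zero when the margin constraint holds."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)

def soft_margin_objective(w, b, X, Y, C=1.0):
    """0.5 * ||w||^2 plus C times the total slack."""
    reg = 0.5 * sum(wi * wi for wi in w)
    return reg + C * sum(slack(w, b, x, y) for x, y in zip(X, Y))

# Hypothetical hyperplane x1 = 0 and four labelled points.
w, b = [1.0, 0.0], 0.0
X = [[2.0, 0.0], [-2.0, 1.0], [0.5, 0.0], [1.0, 0.0]]
Y = [1, -1, 1, -1]

slacks = [slack(w, b, x, y) for x, y in zip(X, Y)]
print(slacks)                                   # third point is inside the margin, fourth is misclassified
print(soft_margin_objective(w, b, X, Y, C=1.0))
print(soft_margin_objective(w, b, X, Y, C=10.0))
```

A larger C makes each violation more expensive, so the optimizer tolerates fewer outliers at the cost of a narrower margin.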

  • What are the similarities and differences between SVM and LR?
    Differences:
    LR is a statistical method, while SVM is a geometric method.
    SVM only considers the support vectors, that is, the few points most relevant to the classification, when learning the classifier. Logistic regression uses a nonlinear mapping to reduce the weight of points far away from the classification plane and relatively increase the weight of the data points most relevant to the classification.
    The loss functions are different: LR uses cross entropy and SVM uses hinge loss. Both loss functions aim to increase the weight of data points that have a greater impact on classification and to reduce the weight of data points that matter less. The zero region of the hinge loss corresponds to ordinary samples that are not support vectors, so ordinary samples do not take part in determining the final hyperplane. This is the biggest advantage of support vector machines: dependence on the number of training samples is greatly reduced, and training efficiency improves.
    LR is a parametric model, and SVM is a non-parametric model. A parametric model assumes that the data obeys a certain distribution determined by some parameters (for example, the normal distribution is determined by its mean and variance), and the model is built on that basis. A non-parametric model makes no assumption about the distribution of the population: it only knows that the population is a random variable whose distribution exists (and may have parameters), but neither the form of the distribution nor its parameters are known, so inference can only rely on non-parametric statistical methods applied to the given samples. LR is therefore affected by the data distribution, especially when the samples are unbalanced, in which case they need to be balanced first, while SVM does not directly depend on the distribution.
    LR can output probabilities, while SVM does not do so directly.
    LR does not rely on the distance between samples, while SVM is based on distance.
    The LR model is simpler and easier to understand, and parallel computing is more convenient, especially for large-scale linear classification. SVM is more complicated to understand and optimize. After SVM is converted into its dual problem, classification only needs the distances to a few support vectors, which is a clear advantage with complex kernel functions and greatly simplifies the model and the computation.
    The loss function of SVM comes with regularization built in (the 1/2||w||^2 term), while LR requires an additional regularization term.

Similarities:
Linear SVM and LR are both linear classifiers.
They are both classification models, and their essence is to find the best separating hyperplane.
They are both discriminative models. A discriminative model does not care how the data is generated; it only cares about the differences between classes and uses those differences to classify a given sample.
They are both supervised learning algorithms.
Different regularization terms can be added to both.

  • Under what circumstances are Linear SVM and LR used?
    Both methods are common classification algorithms. From the perspective of the objective function, the difference is that logistic regression uses logistic loss and svm uses hinge loss. Both loss functions aim to increase the weight of data points that have a greater impact on classification and reduce the weight of data points that matter less. SVM only considers the support vectors, that is, the few points most relevant to the classification, when learning the classifier. Logistic regression uses a nonlinear mapping to greatly reduce the weight of points far away from the classification plane and relatively increase the weight of the data points most relevant to the classification. The fundamental purpose of both is the same. In addition, both methods can add different regularization terms as needed, such as l1 and l2. Therefore, in many experiments the results of the two algorithms are very close. However, the logistic regression model is simpler, easier to understand, and easier to implement, especially for large-scale linear classification, while SVM is more complicated to understand and optimize. On the other hand, the theoretical foundation of SVM is more solid: it rests on structural risk minimization, although people who use it in practice rarely pay attention to this. Another very important point is that after SVM is converted into its dual problem, classification only needs the distances to a few support vectors. This is a clear advantage with complex kernel functions and greatly simplifies the model and the amount of computation.
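
The difference between the two losses is easy to see as a function of the margin m = y·f(x), with labels in {−1, +1}. The sketch below is illustrative only (the margin values are arbitrary): the hinge loss is exactly zero once the margin reaches 1, while the logistic loss keeps a small positive value for every point.

```python
import math

def hinge_loss(margin):
    """SVM loss: zero for points outside the margin (margin >= 1)."""
    return max(0.0, 1.0 - margin)

def logistic_loss(margin):
    """LR loss written in terms of the margin: log(1 + exp(-margin))."""
    return math.log(1.0 + math.exp(-margin))

margins = [-2.0, 0.0, 0.5, 1.0, 3.0, 10.0]
table = [(m, hinge_loss(m), logistic_loss(m)) for m in margins]
for m, h, l in table:
    print(f"margin={m:5.1f}  hinge={h:.4f}  logistic={l:.6f}")
```

This is why points far from the boundary drop out of the SVM solution entirely but still contribute, with shrinking weight, to the LR solution.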

  • Why is LR not suitable for MSE?

  • Why does logistic regression need to discretize features first?

  • Implementation of parallel LR

  • What are the applications of logistic regression in the financial field?

Naive Bayes
Prior probability: a probability obtained from past experience and analysis, before the experiment or sampling.
Posterior probability: the probability, once something has happened, that it was caused by a certain factor. The posterior probability can be regarded as a "more detailed characterization and update" of the prior probability, because at this point we have observed X and have additional information. The posterior probability cannot be obtained directly, so we need a way to calculate it, and the solution is Bayes' formula.
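Conditional probability: generally written as P(A|B), the probability that A occurs given that event B has occurred:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Rearranging gives $P(A \cap B) = P(A \mid B)\,P(B)$, and by the same logic $P(A \cap B) = P(B \mid A)\,P(A)$, so

$$P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$$

We get the Bayesian formula:

$$P(A \mid B) = P(A)\,\frac{P(B \mid A)}{P(B)}$$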
We call P(A) the "prior probability", which is our subjective judgment of the probability of event A without knowing event B.
P(B|A)/P(B) is called the "likelihood function". It is an adjustment factor that reflects the new information B: it moves the prior probability (the earlier subjective judgment) closer to the true probability.
How to calculate the adjustment factor: the denominator P(B) can be obtained through the law of total probability.
If the likelihood function P(B|A)/P(B) > 1, the prior probability is strengthened and event A becomes more likely.
If the likelihood function = 1, event B does not help to judge the likelihood of event A.
If the likelihood function < 1, the prior probability is weakened and event A becomes less likely.

The premise of naive Bayes is that the features are conditionally independent of each other given the class.

  • Do you know what Naive Bayes is?

Types of NB models:
Bernoulli NB: the features are Boolean variables following a 0/1 (Bernoulli) distribution. In text classification, the feature is whether a word appears.
Multinomial NB: the features are discrete values following a multinomial distribution. In text classification, the feature is the number of times a word appears.
Gaussian NB: the features are continuous values following a Gaussian distribution (also called the normal distribution). In text classification, the feature is the TF-IDF value of the word.
Advantages and disadvantages of NB
Advantages:
The algorithm principle is simple.
There are few parameters to estimate.
The conditional probabilities are assumed to be independent of each other, so they can be computed in a distributed way.
It is a generative model and converges faster than a discriminative model.
It is not very sensitive to missing data.
It handles multi-class problems naturally.
Disadvantages:
The assumption that the features are independent of each other often does not hold in practical applications.
It cannot learn interactions between features.
It is sensitive to the form in which the input data is expressed.

  • There are 60 men and 40 women in the company, 25 men wear leather shoes, 35 wear sneakers, 10 women wear leather shoes, and 30 wear high heels. Now you only know that a person wears leather shoes, now you need to guess his gender. If it is inferred that the probability of him being a male is greater than that of a female, then he is considered to be a male, otherwise he is considered to be a female.
  • "Naive" is the shortcoming of Naive Bayes when making predictions. So with such an obvious hypothetical shortcoming, why can Naive Bayes' predictions still achieve better results?
  • What is Laplace smoothing method?
  • Are there any hyperparameters that can be adjusted in Naive Bayes?
  • Do you know what applications Naive Bayes has?
  • Is Naive Bayes a high or low variance model?
  • What are the assumptions of Naive Bayes? What are the advantages and disadvantages?
  • How does Naive Bayes perform parameter estimation?
  • What is the difference between Bayesianism and Frequentism?
  • What is the difference between logistic regression and naive Bayes?
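
The leather-shoes question in the list above can be worked through directly with Bayes' formula. The sketch below uses only the counts given in the question; the variable names are illustrative. Exact fractions are used so that the adjustment factor P(B|A)/P(B) and the posterior are easy to read.

```python
from fractions import Fraction

# Counts from the question.
men, women = 60, 40
men_leather, women_leather = 25, 10
total = men + women

# Prior and class-conditional probabilities.
prior_male = Fraction(men, total)
p_leather_given_male = Fraction(men_leather, men)
p_leather_given_female = Fraction(women_leather, women)

# Law of total probability: P(leather shoes).
p_leather = (p_leather_given_male * prior_male
             + p_leather_given_female * (1 - prior_male))

# Adjustment factor P(B|A)/P(B) and posterior P(male | leather shoes).
adjustment = p_leather_given_male / p_leather
posterior_male = prior_male * adjustment
posterior_female = 1 - posterior_male

guess = "male" if posterior_male > posterior_female else "female"
print(posterior_male, posterior_female, adjustment, guess)
```

The adjustment factor is greater than 1, so observing leather shoes strengthens the prior belief that the person is male.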

Tree models

  • Talk about your understanding of entropy, information gain and information gain ratio?
    The essence of entropy is the expectation of the Shannon information log(1/p), that is, H = Σ p·log(1/p):
    the lower the probability of an outcome, the longer the bit length of its encoding (the more detailed the description of the event must be).

1) Information entropy: the shortest average coding length, achieved when the coding scheme is perfect.
2) Cross entropy: the average coding length when the coding scheme is not necessarily perfect (because the estimate of the probability distribution is not necessarily correct). Average coding length = shortest average coding length + an increment.
3) Relative entropy: the increase in the average coding length relative to the minimum when the coding scheme is not necessarily perfect (that is, the increment above). It is also called KL divergence and measures the difference between two distributions.
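
The relationship between the three quantities, cross entropy = information entropy + relative entropy, can be verified on a small example. In this sketch the two distributions are arbitrary illustrative choices, logarithms are base 2 so the results are in bits, and q is assumed to be positive wherever p is positive.

```python
import math

def entropy(p):
    """Shortest average code length for the true distribution p."""
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average code length when codes are designed for q but data follows p."""
    return sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """Extra code length caused by using q instead of p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # true distribution
q = [0.25, 0.25, 0.5]   # estimated distribution

h, ce, kl = entropy(p), cross_entropy(p, q), kl_divergence(p, q)
print(h, ce, kl)  # ce equals h + kl
```

Since the entropy of p does not depend on q, minimizing the cross entropy over q is the same as minimizing the KL divergence.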

  • What is the splitting criterion of the ID3 algorithm?
  • What are the flaws of the ID3 algorithm? How does the C4.5 algorithm solve the shortcomings of ID3? What are the flaws in ID3 and C4.5?
  • How does C4.5 handle missing values?
  • What is the splitting criterion of C4.5?
  • What are the flaws of the C4.5 algorithm?
  • What is the definition of the Gini index, and what are its advantages?
  • How does CART select features for classification when feature values ​​are missing?
  • After selecting this dividing feature, how should the CART model handle samples missing this feature value?
  • What are the causes of overfitting in decision trees and their solutions?
  • What are the strategies for decision tree pruning? What are its advantages and disadvantages?
  • What is the pruning method used in C4.5?
  • How does CART handle the class imbalance problem?
  • How does CART handle continuous values?
  • Could you please tell me the differences between ID3, C4.5 and CART.
  • Why does the CART algorithm choose gini index?
  • How does the C4.5 algorithm handle continuous values?
  • How do decision trees handle missing values?
  • How to calculate the importance of each feature of a decision tree?
  • If there are many features, are the features that are not used in the decision tree necessarily useless?
  • Do decision trees need to be normalized?
  • Since classification problems can also be solved using neural networks, what is the significance of algorithms such as SVM and decision trees?
  • What is the relationship between decision trees and conditional probability distributions?
  • What is CART’s pruning strategy?
  • What impact do outliers or unevenly distributed data have on a decision tree?
  • What are the advantages of decision trees compared with other models?
  • What is the difference between decision tree and logistic regression?
  • What is the difference between classification trees and regression trees?
  • How to understand the loss function of decision tree?
  • Should decision trees in sklearn be encoded with one-hot?
  • Briefly describe the steps of random forest
  • Will Random Forest Overfit?
  • Why doesn't random forest need to be divided into training set and test set?
  • How does random forest handle missing values?
  • Difference between Random Forest and GBDT
  • Comparison of Random Forest and SVM
  • Let’s talk about the advantages and disadvantages of random forest
  1. The difference between decision trees and random forests:
    Decision trees + Bagging + randomly selected features = random forests. Random forests can effectively prevent overfitting.

  2. Which decision tree is used in random forest?
    CART decision trees, or other kinds.

  3. What is the principle of random forest? How to adjust parameters? How is the depth of a tree generally determined, and what is it generally?
    Principle: RF is an ensemble algorithm in the bagging family. It combines multiple decision tree models and obtains the final result by voting on or averaging the outputs of the base models, so that the overall model has higher accuracy and better generalization performance.
    Parameter adjustment:
    Let’s read this article by Teacher Liu Jianping: https://www.cnblogs.com/pinard/p/6160412.html

The parameter tuning processes of RF and GBDT are similar and can be remembered together:
fit with default parameters –> tune n_estimators –> tune max_depth and min_samples_split –> tune min_samples_split and min_samples_leaf –> tune max_features

How to determine the depth of the tree: when there are many training samples and many feature dimensions, max_depth needs to be limited. The value depends on the data distribution and is generally between 10 and 100.
3. The difference between Bagging and Boosting
Sample selection: Bagging draws each training set from the original data set with replacement, and the training sets of different rounds are independent of each other; Boosting uses all the data in every round and only changes the weight of each sample.
Sample weight: Bagging uses uniform sampling, so every sample has the same weight; Boosting updates the sample weights in each round according to the previous round's training results, and samples with larger errors get larger weights.
Prediction function: in Bagging the predictions of all base models have the same weight; in Boosting a base model with smaller prediction error gets a larger weight.
Bias and variance: Bagging reduces variance, and Boosting reduces bias.
Parallel computing: Bagging can train its base models in parallel, while Boosting must train them sequentially, because each round's model needs the results of the previous round.
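
The two sampling schemes can be sketched side by side. The helper names below are hypothetical, and the reweighting rule is the AdaBoost one, used here as one concrete instance of Boosting rather than as the only possible scheme. Bagging draws indices uniformly with replacement; AdaBoost keeps all samples and raises the weight of the misclassified ones.

```python
import math
import random

def bootstrap_indices(n, rng):
    """Bagging: draw n sample indices uniformly, with replacement."""
    return [rng.randrange(n) for _ in range(n)]

def adaboost_reweight(weights, misclassified):
    """One AdaBoost-style round. Returns (alpha, new normalized weights).

    alpha is the weight of the base model: the smaller its weighted
    error, the larger alpha. Assumes 0 < weighted error < 1.
    """
    err = sum(w for w, bad in zip(weights, misclassified) if bad)
    alpha = 0.5 * math.log((1.0 - err) / err)
    scaled = [w * math.exp(alpha if bad else -alpha)
              for w, bad in zip(weights, misclassified)]
    z = sum(scaled)
    return alpha, [w / z for w in scaled]

rng = random.Random(0)
sample = bootstrap_indices(8, rng)          # some indices repeat, some are left out
alpha, new_w = adaboost_reweight([0.25] * 4, [True, False, False, False])
print(sample)
print(alpha, new_w)                         # the misclassified sample now carries half the weight
```
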
4. GBDT parameter adjustment
In my view, if the interviewer asks this kind of question, the expectation is that you have actually used the algorithm. If you have used it, you should understand and remember the process; if you have not, just say so honestly. You can follow the link below and learn from Liu Jianping's article.

Specific example: https://www.cnblogs.com/pinard/p/6143927.html

Parameter classification:
Boosting framework parameters: n_estimators, learning_rate, subsample
CART regression tree parameters (similar to decision trees): max_features, max_depth, min_samples_split, min_samples_leaf
General steps:
Fit with default parameters –> fix learning_rate and tune n_estimators –> tune max_depth and min_samples_split –> tune min_samples_split and min_samples_leaf –> check the fit –> tune max_features –> tune subsample –> keep reducing learning_rate while increasing n_estimators by the same multiple, and refit
5. The difference between RF and GBDT (important)
To fully understand this question, read these:
https://blog.csdn.net/data_scientist/article/details/79022025
https://blog.csdn.net/xwd18280820053/article/details/68927422
https://blog.csdn.net/m510756230/article/details/82051807

Similarities: both are composed of multiple trees, and the final result is determined by all the trees together.
Differences:
The trees in GBDT are regression trees, while RF can use regression trees or classification trees.
GBDT is particularly sensitive to outliers, but RF is not.
GBDT can only build its trees serially, while RF can build them in parallel.
GBDT sums (or takes a weighted sum of) the outputs of its trees, while RF chooses the result by voting.
GBDT improves model performance by reducing bias, while RF does so by reducing variance.
RF treats all training samples equally, while GBDT is an ensemble of weak learners based on weights.
6. Advantages and disadvantages of random forest
Advantages:
Compared with other algorithms, it has clear advantages in training speed and prediction accuracy.
It can handle very high-dimensional data without feature selection, and after training it can report the importance of each feature.
It is easy to parallelize.
Disadvantages: it tends to overfit on noisy classification and regression problems.
Dimensionality reduction

  • PCA
    1. Used for: feature dimensionality reduction, removing redundant and weakly discriminative features;
    2. Goal: the features after dimensionality reduction are uncorrelated, that is, the covariance between features is 0;
    3. Principle: based on the k-order matrix U composed of the eigenvectors of the covariance matrix of the training data;
    4. Physical meaning: the greater the variance, the greater the entropy, and the more information the feature contains;
    5. Algorithm steps:
    calculate the covariance matrix C of the training samples;
    calculate the eigenvalues and eigenvectors of C;
    arrange the eigenvalues of C in descending order, with the corresponding eigenvectors arranged in the same order;
    to reduce to k dimensions, take the first k eigenvectors as rows to form the k-order dimensionality reduction matrix;
    multiply this matrix by the data matrix to obtain the data after dimensionality reduction.

Example

Taking this as an example, we use the PCA method to reduce this set of two-dimensional data to one dimension.
Because each row of this matrix already has zero mean, we can directly find the covariance matrix:

Then find its eigenvalues ​​and eigenvectors. The eigenvalues ​​after solving are:
λ1=2, λ2=2/5
and their corresponding eigenvectors are:

Each eigenvector is only determined up to a scalar factor, so c1 and c2 can be any nonzero real numbers. The normalized eigenvectors are:

So our matrix P is:

The diagonalization of the covariance matrix C can be verified:

Finally, multiply the first row of P by the data matrix to obtain the dimensionality-reduced representation:

The projection result after dimensionality reduction is as follows:
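The PCA steps can be sketched in a few lines of NumPy. The example matrix is not given in the text, so the code assumes the 2×5 zero-mean matrix below; it is an assumption, chosen because its covariance matrix has exactly the eigenvalues λ1=2 and λ2=2/5 quoted above. The function name `pca` and the 1/n covariance normalization are also illustrative choices.

```python
import numpy as np

# Assumed example data (not taken from the original figures):
# 2 features x 5 samples, each row already has zero mean.
X = np.array([[-1.0, -1.0, 0.0, 2.0, 0.0],
              [-2.0,  0.0, 0.0, 1.0, 1.0]])

def pca(X, k):
    """Reduce a zero-mean (features x samples) matrix X to k dimensions."""
    n = X.shape[1]
    C = X @ X.T / n                        # covariance matrix (1/n normalization)
    eigvals, eigvecs = np.linalg.eigh(C)   # C is symmetric; eigh returns ascending order
    order = np.argsort(eigvals)[::-1]      # indices for descending eigenvalues
    eigvals = eigvals[order]
    P = eigvecs[:, order].T                # rows of P are unit eigenvectors
    return eigvals, P, P[:k] @ X           # P[:k] @ X is the reduced data

eigvals, P, Y = pca(X, k=1)
print(eigvals)   # eigenvalues in descending order
print(Y)         # 1-D representation (the overall sign depends on the eigenvector sign)
```

The sign of an eigenvector is arbitrary, so the projected values may come out negated; the spread along the projected axis is the same either way.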

LDA
Principle: after projection, the within-class variance is as small as possible and the between-class variance is as large as possible.
Workflow:

LDA vs PCA:
Similarities: 1) Both can reduce the dimensionality of data.
    2) Both use matrix eigendecomposition when reducing dimensions.
    3) Both assume that the data follows a Gaussian (normal) distribution.
Differences: 1) LDA is a supervised dimensionality reduction method, while PCA is unsupervised.
    2) LDA can reduce the data to at most k-1 dimensions, where k is the number of classes, while PCA has no such limit.
    3) In addition to dimensionality reduction, LDA can also be used for classification.
    4) LDA selects the projection direction that best separates the classes, while PCA selects the direction along which the projected sample points have the largest variance.
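To make difference 4) concrete, here is a minimal sketch of two-class Fisher LDA next to the first PCA direction. The toy data and the function names `fisher_lda_direction` and `first_principal_direction` are made up for illustration; the LDA direction uses the standard two-class result that w is proportional to Sw^{-1}(m1 - m0), where Sw is the within-class scatter matrix.

```python
import numpy as np

def fisher_lda_direction(X0, X1):
    """Two-class Fisher LDA: unit direction proportional to Sw^{-1} (m1 - m0)."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: sum of the scatter matrices of both classes.
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)

def first_principal_direction(X):
    """Direction of largest variance of the pooled data (labels ignored)."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / len(X))
    return eigvecs[:, -1]   # eigh sorts ascending, so the last column is the largest

# Hypothetical toy data: the classes differ along axis 0,
# but most of the variance lies along axis 1.
X0 = np.array([[0.0, 0.0], [0.0, 8.0], [1.0, 4.0], [-1.0, 4.0]])
X1 = X0 + np.array([3.0, 0.0])

w_lda = fisher_lda_direction(X0, X1)
w_pca = first_principal_direction(np.vstack([X0, X1]))
print(w_lda, w_pca)
```

On this data LDA picks the axis that separates the two classes, while PCA picks the axis with the most spread, which carries no class information here.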

  • Briefly describe the weight update method of Adaboost
  • Derive the sample weight update formula of Adaboost
  • During the training process, why is there always a problem of classification errors in each round of training, but the entire Adaboost can converge quickly?
  • What are the pros and cons of Adaboost?
  • What are the similarities and differences between AdaBoost and GBDT?
  • Please briefly describe the principle of GBDT
  • Why can regression trees be used as iterative learners for GBDT?
  • How is GBDT used for classification problems?
  • Why does GBDT use m CART regression trees that are each a two-leaf binary tree, instead of one (m+1)-level binary tree (with up to 2^m leaf nodes)?
  • How does GBDT perform regularization?
  • Why is the residual of gbdt replaced by negative gradient?
  • What are the advantages of GBDT?
  • What is the role of shrinkage in GBDT?
  • Why is residual-based GBDT not a good choice?
  • Why is it said in the gradient boosting tree that the negative gradient of the objective function with respect to the current model is an approximation of the residual?
  • Why can xgboost/gbdt achieve high accuracy with a very small tree depth when adjusting parameters?
  • Why do GBDT and Random Forest work so well in actual kaggle competitions?
  • How is GBDT used in click-through rate prediction?
  • How is the gradient in GBDT calculated? Whose gradient, with respect to what?
  • For an m×n data set, if GBDT is used, what is the dimension of the gradient? Is it related to the depth of the tree, or to the number of leaf nodes in the tree?
  • Similarities and Differences between Random Forest and GBDT
  • What are the differences and connections between GBDT and Adaboost in machine learning algorithms?
  • Introduce the principle of XGBoost
  • What is the difference between XGBoost and GBDT
  • Similarities and Differences between RF and GBDT
  • Why does XGBoost use Taylor second-order expansion?
  • How is the parallelization part of XGBoost implemented?
  • Why is XGBoost fast?
  • How to calculate the weight of leaf nodes in XGBoost? Why can leaf node scores be used to measure tree complexity?
  • Conditions under which a tree stops growing in XGBoost
  • Please derive Xgboost
  • What are the methods for preventing overfitting in the XGBoost algorithm?
  • How XGBoost handles imbalanced data
  • Comparing LR and GBDT, tell us under what circumstances GBDT is inferior to LR
  • How to prune trees in XGBoost
  • When using XGBoost to train a model, how to adjust the parameters if it is overfitted?
  • How does XGBoost choose the best split point?
  • How is the scalability of XGBoost reflected?
  • How XGBoost evaluates the importance of features
  • General steps for XGBoost parameter tuning
  • How to solve the problem if the XGBoost model is overfitted?
  • Why is XGBoost insensitive to missing values? Compared with ordinary GBDT, how does XGBoost handle missing values?
  • How is XGBoost's regularization implemented?
  • The difference between XGBoost and LightGBM
  • How does XGBoost find the inverse of the Hessian matrix?
  • How to understand the approximation algorithm used in xgboost algorithm to find the split points?
  • What are the advantages and disadvantages of LightGBM compared to XGBoost?
  • Please introduce several common ensemble learning frameworks: boosting/bagging/stacking
  • Why is ensemble learning better than a single learner?
  • Please briefly describe the meaning of variance and bias of the model?
  • Is the base model in ensemble learning necessarily a weak model?
  • Please calculate the overall expectation and overall variance of the model
  • Why must the base model in Bagging be a strong model?
  • Why must the base model in the Boosting framework be a weak model?


EM algorithm for regression problems

  1. Definition:
    The EM (Expectation-Maximization) algorithm is an iterative optimization strategy that alternates an expectation (E) step and a maximization (M) step.
  2. What models can be solved using the EM algorithm? Why not use Newton's method or gradient descent? (I feel the second question is slightly off; marked for review.)
    Models: Gaussian mixture model, collaborative filtering, and KMeans. The number of terms in the summation grows exponentially with the number of hidden variables, which makes the gradient troublesome to compute, whereas EM is a non-gradient optimization algorithm.
  3. Use the EM algorithm to derive and explain KMeans.
    In KMeans, the cluster that each sample belongs to is the hidden variable, and the cluster centers are the parameters.
    Initialization: randomly initialize k cluster centers.
    Step E: find the nearest centroid of each sample point and assign the sample to that centroid's cluster.
    Step M: recompute each cluster center as the mean of the samples assigned to it.

Repeat the E and M steps until the cluster centers no longer change.

The principle, time complexity, advantages, disadvantages and improvements of KMeans
Principle: for a given sample set, divide the samples into k clusters according to the distance between samples, so that distances within a cluster are as small as possible and distances between clusters are as large as possible.
Steps: randomly select k samples as the initial cluster centers; calculate the distance between each sample and each cluster center, and assign the sample to the nearest cluster; calculate the new cluster centers; iterate until the cluster centers are no longer updated or the maximum number of iterations is reached.
Time complexity: O(k*n*d*t), where k is the number of clusters, n is the number of samples, d is the cost of calculating the distance between two samples, and t is the number of iterations.
Advantages: 1. The principle is easy to understand, the implementation is simple, convergence is fast, and the results are good. 2. Strong interpretability. 3. There is only one parameter to tune, k.
Disadvantages: 1. The clustering result depends heavily on the value of k. 2. Non-convex data sets are hard to fit, and discrete data is awkward to handle. 3. The result is poor when the hidden classes are unbalanced. 4. As an iterative algorithm it only reaches a local optimum. 5. It is sensitive to noise and outliers.
Improvements: reduce the impact of the random initialization, and speed up the time-consuming calculation of distances from sample points to centroids.
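The loop described above is short enough to write out directly. This is a minimal sketch with NumPy; the function name `kmeans`, its parameters and the toy data are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly pick k distinct samples as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign every sample to its nearest center (n x k distance matrix).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its samples;
        # an empty cluster keeps its old center.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):   # centers stopped moving
            break
        centers = new_centers
    return centers, labels

# Hypothetical toy data: two small, well-separated groups.
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)
centers, labels = kmeans(X, k=2)
print(centers, labels)
```

Each pass costs one distance computation per (sample, center) pair, which is where the O(k*n*d*t) estimate comes from.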

KMeans++ algorithm
KMeans randomly selects k points as the initial cluster centers, while KMeans++ uses the following method:
suppose n cluster centers have already been selected; when selecting the (n+1)-th cluster center, points that are farther from the current n centers have a greater probability of being selected. The first cluster center (n=1) is still selected at random, as in KMeans.
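A minimal sketch of this seeding rule follows. The text only says that farther points are more likely to be chosen; the code assumes the usual choice of sampling proportionally to the squared distance to the nearest chosen center. The function name `kmeans_pp_init` and the toy data are illustrative.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]        # first center: uniformly at random
    while len(centers) < k:
        # Squared distance from every point to its nearest chosen center.
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        probs = d2 / d2.sum()                  # farther points are more likely
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

# Hypothetical toy data with six distinct points.
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)
init_centers = kmeans_pp_init(X, k=3)
print(init_centers)
```

A point that is already a center has distance 0 and therefore probability 0, so with distinct points the same center is never picked twice.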

Gaussian Mixture Model
The joint probability distribution, referred to as the joint distribution, is the probability distribution of a random vector composed of two or more random variables. The probability distribution of the coordinates (x, y) hit during target shooting is the joint probability distribution (involving two random variables).

The joint distribution of variables X and Y completely determines the probability distribution of X and the probability distribution of Y (called the marginal distribution of the joint distribution):

Gaussian distribution (normal distribution) follows the following probability density function (Probability Density Function):
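For reference, the univariate density in standard notation is written out below. The symbols μ (mean) and σ² (variance) are assumed here, since the text does not name them.

```latex
f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)
```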

The Gaussian mixture model can be regarded as a model composed of K single Gaussian models, and these K sub-models are the hidden variables of the mixture model. Generally speaking, a mixture model can use any probability distribution. The Gaussian mixture model is used here because the Gaussian distribution has good mathematical properties and good computational performance.
Steps: first assign random initial values to the parameters.
E step: calculate the probability that each point belongs to each sub-model based on the current parameters.
M step: update the mean, variance and weight of each sub-model based on the probabilities obtained in the E step.
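The E and M steps above can be sketched for one-dimensional data as follows. The function name `gmm_em_1d`, the small variance floor and the toy data are illustrative assumptions. The initial parameter values are passed in by the caller, who can draw them at random; the usage example uses rough hand-picked guesses so that the run is reproducible.

```python
import numpy as np

def gmm_em_1d(x, mu, var, pi, n_iter=50):
    """EM for a 1-D Gaussian mixture, starting from the given parameters."""
    for _ in range(n_iter):
        # E step: probability that each point belongs to each sub-model.
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate weight, mean and variance from those probabilities.
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6  # floor avoids zero variance
    return pi, mu, var, resp

# Hypothetical data: five points near 0 and five points near 10.
x = np.array([-0.2, -0.1, 0.0, 0.1, 0.2, 9.8, 9.9, 10.0, 10.1, 10.2])
pi, mu, var, resp = gmm_em_1d(
    x,
    mu=np.array([1.0, 8.0]),    # rough initial guesses for the means
    var=np.array([1.0, 1.0]),
    pi=np.array([0.5, 0.5]),
)
print(pi, mu, var)
```

The rows of `resp` are the soft assignments: each row sums to 1 and gives the share of a sample that belongs to each sub-model.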

Similarities and differences between Kmeans and Gmm:
Same: GMM and K-means are very similar. The result of both is affected by the initial values; both may get stuck in a local optimum; and in both the number of clusters has to be chosen in advance, largely by guesswork.

Difference: K-means is hard clustering, where a sample belongs to exactly one cluster, while GMM is soft clustering, where for example a sample belongs 70% to A and 30% to B. In addition, a multi-dimensional GMM uses the covariance when estimating the mean and variance, so the mutual constraints between different dimensions are taken into account.

Kmeans minimizes the distance between cluster center k and point xi.
Gmm considers the probability that xi belongs to a certain class. When that probability is 100%, gmm degenerates into kmeans.
Gmm can compute probabilities and can generate new samples.
In terms of distance, K-means uses Euclidean distance and GMM uses Mahalanobis distance. When the covariance matrix is the identity (off-diagonal elements 0 and unit variances), Mahalanobis distance becomes Euclidean distance. See the optimization functions of K-means and GMM in the figure below:

optimization

  • What is gradient descent?
    The gradient is a vector. At a given point, the directional derivative of a function is largest along the gradient direction, that is, the function changes fastest along that direction, and the maximum rate of change equals the modulus of the gradient. For a function of one variable the gradient is the derivative; for several variables it is the vector of partial derivatives.

Gradient descent: 1. Determine in which direction the minimum point x lies from the sign of the gradient (derivative);
2. Move in that direction so that the function value decreases.
The purpose of gradient descent is to find the value of the independent variable at which the function attains its minimum (the x at the lowest point of the curve). Remember that the goal is to find x.
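As a small illustration of the two steps, the sketch below minimizes f(x) = (x - 3)^2, whose derivative is 2(x - 3). The function name `gradient_descent`, the learning rate and the step count are arbitrary illustrative choices.

```python
def gradient_descent(grad, x0, lr=0.1, n_steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(n_steps):
        # The sign of the gradient gives the direction; moving against it lowers f.
        x = x - lr * grad(x)
    return x

# f(x) = (x - 3)^2 has derivative 2 * (x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)
```

With this learning rate the distance to the minimum shrinks by a factor of 0.8 at every step, so the result is very close to 3.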

  • What are the problems with training SVM using gradient descent?
    Many machine learning methods, such as least squares, logistic regression and support vector machines, can be formulated as convex optimization problems.
    Training an SVM is a convex optimization problem. Convex optimization has been studied thoroughly, so solving it with convex optimization theory trains faster than plain gradient descent.
    Convex optimization studies the problem of minimizing a convex function defined on a convex set. In a sense, convex optimization is simpler than general mathematical optimization. For example, in convex optimization a local optimum must also be the global optimum.
    A convex set is a set that contains the entire line segment between any two of its points.

A convex function is a real-valued function defined on a convex subset C (an interval, in one dimension) of a vector space.

From a geometric point of view, a function is convex if the chord between any two points of its graph (the line segment joining them) lies on or above the graph between those two points (only the part of the graph between the two points, not the whole graph).
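The chord condition can be written as a single inequality. The symbol θ for the interpolation weight is an assumed name; C is the convex set from the definition above.

```latex
f(\theta x + (1-\theta) y) \le \theta f(x) + (1-\theta) f(y),
\qquad \forall\, x, y \in C,\ \theta \in [0, 1]
```

The left side is the function value at a point between x and y, and the right side is the height of the chord at the same point.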

  • What is the difference between least squares, maximum likelihood, and gradient descent?
  • Why does Newton's method require fewer iterations than the gradient descent method to solve optimization problems?
  • Since the biggest problem of a neural network is that it falls into local optima, why not use a convex function as the activation function?

loss function

  • Please explain the definition of the loss function.
    The loss function is used to measure the degree of inconsistency between the model's predicted value f(x) and the true value Y. It is a non-negative real-valued function, usually written L(Y, f(x)). The smaller the loss function, the better the robustness of the model.

  • Please tell me about your understanding of the logistic regression loss function

  • Please talk about your understanding of the square loss function.

  • Please talk about your understanding of exponential loss functions.

  • Please talk about your understanding of Hinge loss function.

  • Please compare the loss functions of logistic regression and SVM.

  • For logistic regression, why is the squared loss function said to be non-convex?

  • How to connect the derivation of SVM to the loss function?

  • How does a neural network design its own loss function? If you need to modify or design your own loss, what rules need to be followed?

  • What is the relationship between softmax and cross-entropy?

  • Why is the loss function of neural network non-convex?

  • What are the commonly used loss functions (optimization objective functions) in deep learning?

  • In neural networks, what are the techniques for designing loss function?

  • In a neural network, why not directly derive the partial derivative of the loss function and make it equal to zero to find the optimal weight, but use the gradient descent method (iteration) to calculate the weight?

  • When using the cross-entropy loss function, I only want to penalize ambiguous values such as 0.4~0.6. How should I change it?

Neural network

  1. Convolutional layer

regularization

  • Please explain what you mean by regularization.
    Regularization is also called penalization. The purpose of adding a regularization term is to avoid overfitting. Adding a regularization term is equivalent to adding prior knowledge; when there is little data, prior knowledge can prevent overfitting.

With alpha = 0, that is, without a regularization constraint, the equivalent Gaussian prior on the parameters has infinite covariance, so the prior constraint is very weak: in order to fit all the training data, w can vary widely and be extremely unstable. The larger alpha is, the smaller the covariance of the Gaussian prior, the more stable the model, and the smaller its variance.

  • How does regularization relate to the prior distribution of the data?
  • Why is it easier to obtain sparse solutions in L1 than in L2?
                          L1 norm                            L2 norm
    Definition            sum of the absolute values         square root of the sum of the squares
                          of the parameters                  of the parameters
    Prior distribution    Laplace distribution               Gaussian (normal) distribution
    Effect                makes the parameters sparse;       makes the parameters close to 0;
                          acts as feature selection          prevents overfitting (the simpler the model,
                                                             the less likely it is to overfit)
    Regression            Lasso regression                   Ridge regression
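The difference in sparsity can be seen in the simplest one-parameter case. As a sketch, assume each unregularized weight w is shrunk independently: an L1 penalty lam*|v| added to 0.5*(v - w)^2 gives soft thresholding, while an L2 penalty 0.5*lam*v^2 gives proportional shrinkage. The function names `l1_shrink` and `l2_shrink` and the weight values are illustrative.

```python
import numpy as np

def l1_shrink(w, lam):
    """Minimizer of 0.5*(v - w)**2 + lam*|v|: soft thresholding."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def l2_shrink(w, lam):
    """Minimizer of 0.5*(v - w)**2 + 0.5*lam*v**2: proportional shrinkage."""
    return w / (1.0 + lam)

w = np.array([3.0, 0.5, -0.2, -2.0])   # hypothetical unregularized weights
w_l1 = l1_shrink(w, lam=1.0)
w_l2 = l2_shrink(w, lam=1.0)
print(w_l1)   # small weights are set exactly to zero
print(w_l2)   # every weight shrinks but none becomes zero
```

L1 subtracts a fixed amount and clips at zero, so any weight smaller than lam in magnitude disappears; L2 only rescales, so a nonzero weight stays nonzero.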

Laplace:

  • Why can L1 regularization make the coefficient become 0? How does the L1 regular handle the situation where the zero point is not differentiable?
  • How to prevent overfitting in deep learning?
    Causes: 1. The data is noisy; 2. The training data is insufficient or limited; 3. Over-training makes the model overly complex.
    Methods to prevent overfitting:
    1. Early stopping: stop iterating before the model fully converges on the training data.
    Specific method: at the end of each epoch, calculate the accuracy on validation_data, and stop training when the accuracy no longer improves. (Observe over several epochs: stop only after the accuracy has failed to improve several times in a row.)
    2. Dropout: during training, certain nodes in the hidden layers are ignored with a certain probability.
    3. Regularization.
    4. Data set expansion: (a) obtain more data from the source; (b) data augmentation (expand the data through certain rules); (c) estimate distribution parameters from the current data and use the distribution to generate more data.
  • How to solve the problem when multiple L1 and L2 regularization terms are used in the objective function at the same time?
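The early stopping rule from the overfitting answer above (stop once the validation accuracy has failed to improve several times in a row) can be sketched without any training framework. The function name `early_stopping_epoch`, the `patience` parameter and the accuracy history are hypothetical.

```python
def early_stopping_epoch(val_accuracies, patience=3):
    """Return the index of the epoch at which training would stop."""
    best, bad_epochs = float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            best, bad_epochs = acc, 0       # improvement: reset the counter
        else:
            bad_epochs += 1                 # no improvement in this epoch
            if bad_epochs >= patience:
                return epoch                # stop: `patience` misses in a row
    return len(val_accuracies) - 1          # never triggered: all epochs ran

# Hypothetical validation accuracy after each epoch.
history = [0.70, 0.75, 0.78, 0.77, 0.78, 0.76, 0.79, 0.80]
stop_at = early_stopping_epoch(history, patience=3)
print(stop_at)
```

A larger `patience` tolerates longer plateaus: with `patience=4` the same history is trained to the end, because the accuracy recovers at epoch 6 before the counter runs out.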

AUC

  • Please explain AUC.
    AUC is the probability that a randomly chosen positive example receives a higher predicted probability than a randomly chosen negative example. Equivalently, it is the area under the ROC curve: AUC=∫y(t)dx(t), where x is the false positive rate and y is the true positive rate.

  • Are AUC and accuracy necessarily positively related? Is there any intrinsic relationship?
    The accuracy is calculated based on a certain threshold, and the AUC is calculated based on all possible thresholds, which is more robust. AUC does not pay attention to the performance under a certain threshold, but combines the prediction performance of all thresholds. Therefore, if the accuracy is high, the AUC may not be large, and vice versa.

  • Why is AUC more commonly used than accuracy in binary classification? Why is AUC insensitive to sample class proportion?
    To calculate accuracy, the probability needs to be converted into categories, which requires manually setting a threshold. Those above this threshold are placed in category A, and those below this threshold are placed in category B. This threshold greatly affects the calculation of accuracy. AUC avoids converting probabilities into categories.
    ROC curve:
    A curve drawn with the true positive rate (TPR) as the ordinate and the false positive rate (FPR) as the abscissa. The larger the AUC, the better, and it is at most 1.
    TPR = TP / (TP + FN)   (true positive rate)
    FPR = FP / (FP + TN)   (false positive rate)
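The two views of AUC, the ranking probability and the area under the ROC curve, give the same number. The sketch below checks this on a tiny made-up example; the function names `auc_by_ranking` and `roc_points`, the labels and the scores are all illustrative.

```python
import numpy as np

def auc_by_ranking(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

def roc_points(y_true, scores):
    """(FPR, TPR) at every threshold, from the highest score down."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    n_pos, n_neg = (y_true == 1).sum(), (y_true == 0).sum()
    fpr, tpr = [0.0], [0.0]
    for t in np.unique(scores)[::-1]:
        pred = scores >= t
        tpr.append((pred & (y_true == 1)).sum() / n_pos)   # TP / (TP + FN)
        fpr.append((pred & (y_true == 0)).sum() / n_neg)   # FP / (FP + TN)
    return np.array(fpr), np.array(tpr)

# Hypothetical labels and predicted probabilities.
y = [1, 0, 1, 1, 0, 0]
s = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

auc_rank = auc_by_ranking(y, s)
fpr, tpr = roc_points(y, s)
auc_area = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))   # trapezoid rule
print(auc_rank, auc_area)
```

No threshold is ever fixed here, which is why AUC avoids the threshold choice that accuracy depends on.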

  • What are the advantages and disadvantages of precision, recall, F1 value, ROC, and AUC?
    Accuracy. As the name suggests, it is the proportion of all predictions (positive and negative) that are correct.

Precision. The proportion of correctly predicted positives among all predicted positives, TP / (TP + FP). Personal understanding: of everything predicted positive, the share that is truly positive.

Recall. The proportion of correctly predicted positives among all actual positives, TP / (TP + FN). Personal understanding: of everything actually positive, the share that is found.

F1 value (harmonic mean). The F1 value is the harmonic mean of Precision and Recall, F1 = 2PR / (P + R), that is, the squared geometric mean divided by the arithmetic mean; the bigger the better. Substituting the formulas for Precision and Recall shows that when the F1 value is large, True Positives are relatively many while False Positives and False Negatives are relatively few, so both Precision and Recall are relatively high; that is, F1 weighs Precision and Recall together.
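A quick numeric check of these definitions, as a sketch: the function name `precision_recall_f1` and the confusion-matrix counts are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)    # share of predicted positives that are correct
    recall = tp / (tp + fn)       # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
    return precision, recall, f1

# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(p, r, f1)
```

The harmonic mean sits closer to the smaller of the two values, so a model cannot reach a high F1 by pushing only Precision or only Recall.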

  • In machine learning, F1 and ROC/AUC, how to evaluate indicators for multi-classification?
  • How to solve the problem of inconsistent offline and online AUC and online click-through rates?
  • Why is AUC insensitive to the ratio of positive and negative samples?

Other

What are discriminative and generative algorithms, and what is the difference?
Difference: the most essential difference between the two is what they model.
A discriminative model models the conditional probability P(Y|X) directly, and is characterized by high accuracy;
a generative model models the joint probability P(X,Y), and is characterized by fast convergence.
Discriminative models: linear regression, decision tree, support vector machine (SVM), k-nearest neighbor, neural network, etc.;
generative models: hidden Markov model (HMM), naive Bayes model, Gaussian mixture model (GMM), LDA.

Imbalanced data

  • What are the methods for processing unbalanced data sets in machine learning?
  • Please briefly describe how the SMOTE sampling method handles unbalanced data?
  • What's wrong with the original SMOTE algorithm? How to improve?
  • Please briefly describe the Tomek Links undersampling method.
  • Please briefly describe the NearMiss method
  • How does the EasyEnsemble algorithm solve the problem of unbalanced data?
  • How does the BalanceCascade algorithm solve the problem of unbalanced data?
  • Can SMOTE oversampling and Tomek Links undersampling algorithms be combined?

3. Deep Learning

  • Please write down the commonly used loss functions, square loss, cross-entropy loss, softmax loss function and hinge loss function.
  • Why is it so difficult to train deep neural networks? What are the main reasons?
  • Could you please explain forward propagation and back propagation with examples?
  • What is the role of introducing nonlinear activation functions in deep learning?
  • Please name the commonly used activation functions and draw their corresponding images.
  • How to choose activation function? Please explain the characteristics of various activation functions.
  • What are the advantages of the ReLU activation function?
  • Please explain the definition and function of Softmax activation function? How is the Softmax activation function applied to multi-classification?
  • Why is batch size needed during deep model training? How to choose the appropriate batch size and its impact on the results?
  • Please explain the principle of BN and why batch normalization is needed?
  • What is fine tuning? Please explain the three states of the fine-tuning model and what are their respective characteristics?
  • Why can unsupervised pre-training help deep learning?
  • What are the methods for initializing weight bias? Explain their characteristics respectively.
  • What is the purpose of setting the learning rate? What are the commonly used learning rate decay methods? Explain their respective characteristics
  • What are some methods to prevent overfitting in deep learning?
  • Please name several commonly used optimization algorithms and their respective characteristics.
  • How to balance variance and bias in deep learning? What should we do if the bias is too large? What if the variance is too large?
  • Please explain the principle of Dropout. What is the difference between dropout during training and testing?
  • What are the commonly used data enhancement methods in deep learning?
  • How to understand Internal Covariate Shift?

4. C++ Questions and Answers
Basics

  • What is the role of variables? What is the syntax for creating variables?
  • What is the role of constants in C++? Please write down two ways to define constants.
  • Please give some examples of pre-reserved keywords in C++
  • What is the memory space occupied by short type, int type, long type and long long type respectively?
  • What is the role of sizeof keyword?
  • How much memory space does a character variable occupy? What characteristics does it have when stored?
  • Please name a few escape characters you commonly use in C++?
  • What is the difference between pre-increment and post-increment in C++?
  • Write an example of the ternary operator and explain it.
  • What is the function of break in switch case statement?
  • What is the execution order of the initialization expression, the condition expression, the end-of-iteration expression and the loop body in a for loop?
  • What are the functions of break statement and continue statement?

array

  • What are the characteristics of arrays? How to define an array?
  • What is the relationship between the name of a one-dimensional array and its memory address?
  • How to define a two-dimensional array? What is the relationship between the name of a two-dimensional array and its memory address?

function

  • Explain the meaning of formal and actual parameters.
  • What is the meaning of passing by value? What is the impact on formal parameters and actual parameters?
  • What is the purpose of function declaration?

pointer

  • What is the role of pointers? What is the difference between pointer variables and ordinary variables?
  • How much memory space do pointers occupy?
  • What is the difference between a pointer to const (constant pointer) and a const pointer (pointer constant)?
  • What is the difference between passing by value and passing by address?

Structure

  • How to create a structure? Please write down two methods.
  • How to create an array of structures?
  • How does a structure pointer access the members of a structure?
  • How do structures nest structures? Give an example
  • Can a structure be passed as a parameter to a function?

Memory

  • Please briefly describe the functional characteristics of each memory area (code area, global area, stack area, and heap area) when a C++ program is executed.
  • What is the function of the new operator? How is it used?

Reference

  • What is the role of references? What is their essence?
  • When a reference is used as a function parameter, what is the difference between passing by value and passing by address?
  • What are the functions and writing methods of constant references?
  • What should I pay attention to when writing default parameters for functions?

Overload

  • What conditions need to be met for function overloading?

encapsulation

  • What is the meaning of encapsulation?
  • What are the access rights to the members and behaviors of a class? What are the differences?
  • What is the difference between a class and a structure?
  • What are the advantages of making member properties private?

initialization

  • What are the functions of constructors and destructors?
  • What is the constructor syntax? What are the characteristics of constructor?
  • What is the destructor syntax? What are the characteristics of destructor?
  • What are the rules for constructor calling?
  • Please explain deep copy and shallow copy in C++?
  • What is the initialization list syntax in C++?
  • Class B has an object of class A as a member. When a B object is created and destroyed, which of A and B is constructed first, and which is destructed first?
  • What are the characteristics of static members?
  • Are member variables and member functions within a class stored separately? Do non-static member variables occupy object space?
  • What is the role of this pointer?
  • What effect does const modified member function have? What is the function of the keyword mutable?
  • What is the role of friends in C++? How are global functions, classes, and member functions implemented as friends?
  • What are the methods of inheritance? What are its permissions?
  • Can a subclass inherit private members of a parent class?
  • What is the order of constructors and destructors of parent and child classes?
  • When there are members with the same name in a subclass and a parent class, how to access data with the same name in the subclass or parent class through the subclass object?
  • What problems does diamond inheritance cause? How is it solved in C++?
  • What is the difference between static polymorphism and dynamic polymorphism?
  • What are the conditions for satisfying and using polymorphism?
  • What are the advantages of polymorphism?
  • What is the meaning of pure virtual function? What does the syntax look like? What does it have to do with abstract classes?
  • Explain the meaning, syntax and differences of virtual destructors and pure virtual destructors.
  • How to create a function template? What is its purpose? What should I pay attention to?
  • What is the difference between ordinary functions and function templates? What are its calling rules?
  • What problem does an explicitly specialized function template solve?
  • What is the role of class templates? What does the syntax look like? What is the difference between it and function template?
  • What is the timing for creating member functions in class templates?
  • Please explain containers, algorithms and iterators in STL.

5. Python Questions and Answers

  • What are the differences between list, tuple, dict, set and other types in python?
  • What are the forms of function parameter passing? What are their characteristics?
  • Please explain Python's default parameter trap problem.
  • Please give an example of the difference between shallow copy and deep copy
  • What are the concepts of generators and iterators?
  • Please briefly describe the usage of the built-in function zip. How is it handled when the iterator lengths are inconsistent? Is there any alternative?
  • What are the usages of the higher-order functions map/reduce/filter/sorted? Give examples.
  • What is the concept of closure? Give an example.
  • What are the benefits of anonymous functions? Please give an example to illustrate its usage.
  • What is the concept of decorator? How is it used?
  • What is the concept of partial function? How is it used?
  • What are the advantages of enumerate over range?
  • What is a factory function? Give an example.
  • Give an example to illustrate the difference between class attributes and instance attributes.
  • Please explain the concepts of inheritance and polymorphism with examples.
  • How to set access restrictions on properties within a class?
  • How to use __slots__?
  • What are the functions of the customized class methods __str__, __iter__, __getitem__, __getattr__ and __call__ respectively?
  • What is the difference between static methods, class methods and member methods?
  • What are @classmethod, @staticmethod, @property?
  • What is the difference between __init__ and __new__?
  • What is Python introspection?
  • How does python manage memory?
  • What is GIL?
  • Please briefly describe python’s exception handling mechanism.
  • How do you locate bugs in python programs? How to implement single-step execution in python?
  • What is the use of assert assertion?
  • What built-in properties does the class have?
  • How to convert a list of strings into a space-separated string?
  • How does the is operator in python compare?
  • Please write a regular expression that matches email addresses.
  • How to pass command line parameters in python?
  • How to understand threads in python?
  • Please briefly describe multi-processing in python.

Origin blog.csdn.net/qq_41950533/article/details/129181675