Classic machine learning 200 interview questions and answers

This article summarizes machine learning interview questions from BAT (Baidu, Alibaba, Tencent) in previous years. It is packed with useful material and worth bookmarking.

If you want to join a big tech company, the competition is like thousands of troops crossing a single-plank bridge.

To pass the written tests and interviews, practicing questions is essential. Grab this question bank!

1. Please briefly introduce SVM.

SVM stands for support vector machine. It is a data-driven classification algorithm whose goal is to determine a separating hyperplane that divides different classes of data.

extension:

Support vector machine learning methods form a family of models from simple to complex: the linearly separable SVM, the linear SVM, and the nonlinear SVM. When the training data are linearly separable, a linear classifier is learned by maximizing the hard margin; this is the linearly separable SVM, also called the hard-margin SVM. When the training data are approximately linearly separable, a linear classifier is learned by maximizing the soft margin; this is the linear SVM, also called the soft-margin SVM. When the training data are linearly inseparable, a nonlinear SVM is learned by combining the kernel trick with soft-margin maximization.
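
As a rough illustration of these three settings, here is a minimal scikit-learn sketch (the library choice and the synthetic data are assumptions, not part of the original answer):

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

hard = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C approximates a hard margin
soft = SVC(kernel="linear", C=1.0).fit(X, y)   # soft margin tolerates some violations
kern = SVC(kernel="rbf", C=1.0).fit(X, y)      # kernel trick for linearly inseparable data
print(len(hard.support_), len(soft.support_), len(kern.support_))  # numbers of support vectors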

A popular introduction to support vector machines (understanding the three-level realm of SVM)

https://www.cnblogs.com/v-July-v/archive/2012/06/01/2539022.html

In-depth understanding of machine learning SVM

http://blog.csdn.net/sinat_35512245/article/details/54984251

2. Please briefly introduce the calculation graph of Tensorflow.

@寒小阳: TensorFlow is a programming system that expresses computations in the form of computational graphs (also called data flow graphs). A computational graph can be seen as a directed graph: every computation in TensorFlow is a node on the graph, and the edges between nodes describe the dependencies between computations.
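
A tiny sketch of inspecting such a graph (assuming TensorFlow 2.x, where tracing a tf.function builds the computational graph; this example is illustrative and not from the original answer):

import tensorflow as tf

@tf.function  # tracing this function builds a computation (data flow) graph
def f(x, y):
    return x * y + y

concrete = f.get_concrete_function(tf.constant(2.0), tf.constant(3.0))
print([op.name for op in concrete.graph.get_operations()])  # the graph's nodes; edges are tensor dependencies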

3. What is the difference between GBDT and XGBoost?

@Xijun LI: XGBoost can be seen as an optimized version of GBDT, improving on it in both accuracy and efficiency. Compared with GBDT, its specific advantages are (a brief code sketch follows the list):

  • The loss function is approximated with a second-order Taylor expansion, rather than only the first-order derivative as in GBDT;

  • Regularization constraints are applied to the structure of the tree to prevent the model from being overly complex and reduce the possibility of overfitting;

  • The node-splitting criterion is different: GBDT uses the Gini coefficient, while XGBoost derives its own criterion from the optimized objective.
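
A minimal sketch of the two libraries' interfaces (scikit-learn and the xgboost package are assumed; the parameters shown merely illustrate the regularization and subsampling knobs mentioned above):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
import xgboost as xgb

X, y = make_classification(n_samples=500, random_state=0)

gbdt = GradientBoostingClassifier(n_estimators=100).fit(X, y)   # classic GBDT

bst = xgb.XGBClassifier(
    n_estimators=100,
    reg_lambda=1.0,   # L2 penalty on leaf weights (tree-structure regularization)
    reg_alpha=0.0,    # L1 penalty on leaf weights
    subsample=0.8,    # row subsampling per boosting round
).fit(X, y)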

Links to Knowledge Points: Summary of Integrated Learning

https://xijunlee.github.io/2017/06/03/summary of integrated learning/

4. In k-means or kNN, we use Euclidean distance to calculate the distance between nearest neighbors. Why not use Manhattan distance?

The Manhattan distance only measures horizontal and vertical displacements, so it carries a directional restriction, whereas the Euclidean distance can be used for distance calculations in any space. Since data points can lie in any space, the Euclidean distance is the more generally applicable choice. For example, imagine a chessboard: the moves made by a bishop or a rook are naturally measured by the Manhattan distance, because they are made along the board's own directions.
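
For concreteness, a small numpy sketch of the two distances (illustrative only):

import numpy as np

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
euclidean = np.sqrt(np.sum((a - b) ** 2))   # sqrt(3^2 + 4^2) = 5.0, usable in any space
manhattan = np.sum(np.abs(a - b))           # |3| + |4| = 7.0, sum of axis-aligned moves
print(euclidean, manhattan)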

5. Baidu 2015 school recruitment machine learning written test questions.

Knowledge point link: Baidu 2015 school recruitment machine learning written test questions

http://www.itmian4.com/thread-7042-1-1.html

6. Briefly talk about feature engineering.

7. About LR.

@rickjin: Explain LR from head to toe: the modeling, the mathematical derivation on the spot, the principle behind each solver, regularization, the relationship between LR and the maximum entropy model, and why LR is better than linear regression. Quite a few candidates can recite answers but get confused when asked about the logical details. If the principles are fine, then ask about engineering: how to parallelize LR, how many parallelization schemes there are, and which open-source implementations they have read. If they can still answer, they are just about ready for an offer, and you can finish by asking about the development history of the LR model.

Knowledge point link: Logistic regression of machine learning (logistic regression)

http://blog.csdn.net/sinat_35512245/article/details/54881672

8. How to solve overfitting?

Dropout, regularization, and batch normalization.
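
A minimal sketch of these remedies in PyTorch (the framework is an assumed choice; weight_decay supplies the L2 regularization term):

import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(100, 64),
    nn.BatchNorm1d(64),   # batch normalization
    nn.ReLU(),
    nn.Dropout(p=0.5),    # dropout
    nn.Linear(64, 2),
)
optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)  # L2 regularization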

9. What is the connection and difference between LR and SVM?

@朝阳在看看, Similarities:

1. Both LR and SVM can handle classification problems, and both are generally used for linear binary classification (both can be extended to multi-class problems). 

2. Both methods can add different regularization terms, such as L1 and L2. As a result, the two algorithms often give very similar results in practice.

Differences:

1. LR is a parametric model, and SVM is a non-parametric model.

2. From the perspective of the objective function, the difference is that logistic regression uses the logistic loss while SVM uses the hinge loss. Both loss functions aim to increase the weight of the data points that matter most for classification and to reduce the weight of the points less relevant to it.

3. SVM only considers the support vectors, i.e. the few points most relevant to the classification boundary, when learning the classifier. Logistic regression, through its nonlinear mapping, greatly reduces the weight of points far from the decision boundary and relatively increases the weight of the points most relevant to the classification.

4. The logistic regression model is relatively simple and easy to understand, and especially convenient for large-scale linear classification. Understanding and optimizing SVM is more involved, but after converting SVM to its dual problem, classification only requires computing against a small number of support vectors. This is a clear advantage when complex kernel functions are involved, and it can greatly simplify both the model and the computation.

5. What logistic regression can do, SVM can also do (although possibly with some difference in accuracy); but some things SVM can do, logistic regression cannot.

Answer source: Common interview questions about machine learning (1)

http://blog.csdn.net/timcompp/article/details/62237986

10. What is the difference and connection between LR and linear regression?

@nishizhen: Personally, I regard both logistic regression and linear regression as generalized linear regression. The optimization objective of the classic linear model is least squares, while logistic regression maximizes a likelihood function. In addition, linear regression makes predictions over the entire real line with uniform sensitivity, whereas a classification output needs to lie in [0,1]; logistic regression is a regression model that narrows the prediction range and restricts the predicted value to [0,1]. For this kind of problem, logistic regression is therefore more robust than linear regression.

@善善猫皮狗: The logistic regression model is essentially a linear regression model; logistic regression is built on the theory of linear regression. However, linear regression by itself cannot produce the nonlinear sigmoid form, and the sigmoid makes 0/1 classification easy to handle.

11. Why does XGBoost use Taylor expansion, and what are the advantages?

@AntZ: XGBoost uses both the first-order and the second-order partial derivatives, and the second-order derivative helps the descent converge faster and more accurately. Using the Taylor expansion to obtain a second-order form of the loss allows the algorithm to be optimized and analyzed without committing to a specific loss function: in essence, the choice of loss function is separated from the model/parameter optimization, and this decoupling increases XGBoost's applicability.
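
For reference, a sketch of what that second-order expansion looks like (standard XGBoost notation, not reproduced in the original answer): at boosting round t the objective is approximated as

L(t) ≈ Σ_i [ l(y_i, ŷ_i(t−1)) + g_i · f_t(x_i) + (1/2) · h_i · f_t(x_i)^2 ] + Ω(f_t)

where g_i and h_i are the first and second derivatives of the loss l with respect to the previous round's prediction ŷ_i(t−1), and Ω(f_t) is the regularization term on the new tree. Only g_i and h_i enter the optimization, which is why the concrete form of the loss is decoupled from the tree-building algorithm.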

12. How does XGBoost find the optimal features? Does it sample with or without replacement?

@AntZ: During training, XGBoost assigns a score to each feature, which indicates that feature's importance to the model. XGBoost optimizes with gradient boosting and does not draw samples with replacement (imagine a sample being drawn over and over again while the gradient steps back and forth; that would not help). However, XGBoost supports subsampling, so each boosting round need not use all of the samples.
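
A short sketch of reading those feature scores (the xgboost package and synthetic data are assumptions):

import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
bst = xgb.XGBClassifier(n_estimators=50, subsample=0.8).fit(X, y)
print(bst.feature_importances_)                               # normalized importance per feature
print(bst.get_booster().get_score(importance_type="gain"))    # raw split-gain scores per feature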

13. Talk about discriminative models and generative models.

Discriminative approach: directly learn the decision function Y = f(X), or the conditional probability distribution P(Y|X), from the data as the prediction model; this is the discriminative model.

Generative approach: learn the joint probability distribution P(X,Y) from the data, then derive the conditional probability distribution P(Y|X) as the prediction model; this is the generative model.

A discriminative model can be obtained from a generative model, but a generative model cannot be obtained from a discriminative model.

Common discriminant models are: K nearest neighbor, SVM, decision tree, perceptron, linear discriminant analysis (LDA), linear regression, traditional neural network, logistic regression, boosting, conditional random field

Common generative models are: Naive Bayesian, Hidden Markov Model, Gaussian Mixture Model, Document Topic Generation Model (LDA), Restricted Boltzmann Machine

14. The difference between L1 and L2.

The L1 norm refers to the sum of the absolute values of the elements of a vector; its regularizer also goes by the name "Lasso regularization".

For example, vector A=[1, -1, 3], then the L1 norm of A is |1|+|-1|+|3|. 

A brief summary is:

L1 norm: the sum of the absolute values of the elements of the vector x.

L2 norm: the square root of the sum of squares of the elements of the vector x; the L2 norm is also called the Euclidean norm (its matrix analogue is the Frobenius norm).

Lp norm: the 1/p power of the sum of the p-th powers of the absolute values of the elements of the vector x.

In the learning process of a model such as the support vector machine, solving for the cost function is an optimization process. L1 regularization adds the L1 norm to the cost function so that the learned solution is sparse, which makes it easier to extract and select features.

The L1 norm can make the weight sparse and facilitate feature extraction.

The L2 norm can prevent overfitting and improve the generalization ability of the model.
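
A small scikit-learn sketch (assumed library, synthetic data) showing L1 driving weights exactly to zero while L2 only shrinks them:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 20)
y = 3.0 * X[:, 0] + 0.1 * rng.randn(100)   # only the first feature matters

lasso = Lasso(alpha=0.1).fit(X, y)         # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)         # L2 penalty
print((lasso.coef_ == 0).sum())            # many coefficients are exactly zero (sparse)
print((ridge.coef_ == 0).sum())            # typically none are exactly zero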

15. What distributions do the L1 and L2 regular priors obey?

@齐同学: I was asked in an interview which distributions the L1 and L2 regularization priors correspond to: L1 corresponds to a Laplace prior, and L2 to a Gaussian prior.

16. CNN's most successful applications are in CV, so why can many problems in NLP and speech also be solved with CNN? Why is CNN also used in AlphaGo? What do these seemingly unrelated problems have in common? By what means does CNN capture this commonality?

@xuhan

Links to Knowledge Points (Answer Analysis): Deep Learning Job Interview Questions Organize Notes

https://zhuanlan.zhihu.com/p/25005808

17. Talk about AdaBoost and its weight update formula. Given weak classifiers Gm and sample weights w1, w2, ..., write down the final decision formula.

answer analysis

http://www.360doc.com/content/14/1109/12/20290918_423780183.shtml
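
For reference, a sketch of the standard AdaBoost formulas the question asks for (textbook notation; the original answer only links out). With weak classifier Gm(x) of weighted error e_m:

α_m = (1/2) · ln((1 − e_m) / e_m)
w_{m+1,i} = (w_{m,i} / Z_m) · exp(−α_m · y_i · Gm(x_i)),  where Z_m normalizes the weights to sum to 1
f(x) = sign( Σ_m α_m · Gm(x) )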

18. LSTM structure derivation, why is it better than RNN?

Derive the updates of the forget gate, input gate, cell state, and hidden state. Because the LSTM cell state is updated additively (the new information, after being gated by the input gate, is added to the state retained by the forget gate), whereas a vanilla RNN updates its state through repeated multiplication, the LSTM can prevent gradients from vanishing or exploding.
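
As a sketch of the standard gate equations behind that derivation (conventional notation, with σ the sigmoid and ⊙ element-wise multiplication; not reproduced in the original answer):

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)        (forget gate)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)        (input gate)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)     (candidate cell state)
C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t           (additive cell-state update)
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)        (output gate)
h_t = o_t ⊙ tanh(C_t)                      (hidden state)

The additive update of C_t is the "superposition" referred to above: gradients flow through it without being repeatedly multiplied by the same weight matrix as in a vanilla RNN.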

19. Anyone who often searches the web knows that if you accidentally type a word that does not exist, the search engine will suggest the word you probably meant. For example, if you type "Julw" into Google, the system guesses your intention: did you mean to search for "July"? As shown in the picture below:

This is called spell checking. According to How to Write a Spelling Corrector, an article written by a Google employee, Google's spell check is based on a Bayesian approach. Please explain your understanding of how Google uses the Bayesian method to implement "spell check".

When a user enters a word, it may be spelled correctly or it may be misspelled. Denote the correct spelling by c (for correct) and the typed, possibly wrong, spelling by w (for wrong). "Spell checking" then has to infer c given w; in other words, knowing w, find the most likely c among the candidates, i.e. the c that maximizes P(c|w). By Bayes' theorem, P(c|w) = P(w|c) · P(c) / P(w).

Since all candidate c correspond to the same w, P(w) is the same for every candidate, so we only need to maximize P(w|c)P(c). Here:

P(c) is the "probability" of a correct word, which can be approximated by its frequency. If we have a large enough text corpus, then each word's frequency in the corpus serves as its probability of occurrence; the higher a word's frequency, the larger P(c). For example, when you type the wrong word "Julw", the system is more inclined to guess that you meant "July" rather than "Jult", because "July" is more common.

P(w|c) is the probability of typing w when the intended word was c. To simplify the problem, assume that the closer two words are in form, the more likely the misspelling and the larger P(w|c). For example, a spelling that differs by one letter is more likely than one that differs by two letters: if you intend to type July, you are more likely to mistype it as Julw (one letter off) than as Jullw (two letters off). Measuring this kind of closeness is what "edit distance" is about; see:

http://blog.csdn.net/v_july_v/article/details/8701148#t4

Therefore, we compare the frequencies of all words whose spelling is close to the input and pick the one with the highest frequency; that is the word the user most likely intended to type. See How to Write a Spelling Corrector for the exact computation and the limitations of this method.

http://norvig.com/spell-correct.html
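
A much-simplified sketch in the spirit of Norvig's corrector (the corpus file name is an assumption, and only single-edit candidates are considered; the real article also handles transposes and distance-2 edits):

import re
from collections import Counter

WORDS = Counter(re.findall(r"[a-z]+", open("big.txt").read().lower()))  # word frequencies ~ P(c)

def P(word):
    return WORDS[word] / sum(WORDS.values())

def edits1(word):  # all candidates one edit away, a crude stand-in for P(w|c)
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + replaces + inserts)

def correct(word):
    candidates = [w for w in edits1(word) if w in WORDS] or [word]
    return max(candidates, key=P)  # pick the most frequent nearby word, i.e. maximize P(w|c)P(c)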

20. Why is Naive Bayes so "naive"?

Because it assumes that all features in the data set are equally important and mutually independent. As we know, this assumption rarely holds in the real world, so Naive Bayes really is "naive".

21. In machine learning, why do we often normalize data?

@zhanlijun

The source of this question analysis: Why do some machine learning models need to normalize the data?

http://www.cnblogs.com/LBSer/p/4440590.html
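
For concreteness, a tiny scikit-learn sketch (assumed library) of the two most common normalizations:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
print(StandardScaler().fit_transform(X))  # zero mean, unit variance per column
print(MinMaxScaler().fit_transform(X))    # rescaled to [0, 1] per column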

22. Talk about the normalization problem in deep learning.

See this video for details: Normalization in Deep Learning

http://www.julyedu.com/video/play/69/686

23. Please briefly talk about the process of a complete machine learning project.

1. Abstract into mathematical problems 

Identifying the problem is the first step in machine learning. Training a model is usually very time-consuming, so the cost of random trial and error is high.

Abstracting the task as a mathematical problem means clarifying what kind of data we can obtain, and whether the goal is classification, regression, or clustering; and if it is none of these, whether it can be reduced to one of them.

2. Get data 

The data determines the upper limit of what machine learning can achieve; the algorithm can only approach that limit. The data must be representative, otherwise overfitting is inevitable. For classification problems, the class skew should not be too severe: the sample counts of different classes should not differ by several orders of magnitude.

Moreover, estimate the scale of the data: how many samples and how many features there are, how much memory training will roughly consume, and whether it fits in memory. If it does not, consider improving the algorithm or applying dimensionality reduction techniques; and if the data volume is too large, consider going distributed.

3. Feature preprocessing and feature selection 

Good data must be able to extract good features in order to be truly effective.

Feature preprocessing and data cleaning are critical steps that often significantly improve both the effectiveness and the performance of the algorithm. Normalization, discretization, factorization, missing value handling, and collinearity removal take up a great deal of time in data mining. These tasks are simple and reproducible, and their returns are stable and predictable; they are the basic, necessary steps of machine learning.

Screening out salient features and discarding insignificant ones requires the machine learning engineer to repeatedly revisit the business problem, and it has a decisive influence on many outcomes. Once good features are selected, even very simple algorithms can give good, stable results. This calls for feature-validity analysis techniques such as the correlation coefficient, chi-square test, mutual information, conditional entropy, posterior probability, and logistic regression weights.
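
A short scikit-learn sketch (assumed library and toy data) of two of the feature-validity checks just mentioned:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

X, y = load_iris(return_X_y=True)
print(chi2(X, y)[0])                                  # chi-square statistic per feature
print(mutual_info_classif(X, y, random_state=0))      # mutual information per feature
X_best = SelectKBest(chi2, k=2).fit_transform(X, y)   # keep the two strongest features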

4. Training model and tuning 

It is not until this step that the algorithm we mentioned above is used for training. Now many algorithms can be packaged into black boxes for human use. But the real test is to adjust the (hyper)parameters of these algorithms to make the results better. This requires us to have a deep understanding of the principles of the algorithm. The deeper the understanding, the more you can find the crux of the problem and propose a good tuning solution.

5. Model diagnosis 

How to determine the direction and thinking of model tuning? This requires techniques for diagnosing the model.

Overfitting and underfitting judgments are a crucial step in model diagnosis. Common methods such as cross-validation, drawing learning curves, etc. The basic tuning idea of ​​overfitting is to increase the amount of data and reduce the complexity of the model. The basic tuning idea of ​​underfitting is to improve the quantity and quality of features and increase the complexity of the model.

Error analysis is also a crucial step in machine learning. By observing the error samples, we can comprehensively analyze the cause of the error: whether it is a parameter problem or an algorithm selection problem, is it a feature problem or a problem with the data itself...

The diagnosed model needs to be tuned, and the tuned new model needs to be re-diagnosed. This is a process of repeated iterations and constant approximation, which requires continuous attempts to reach the optimal state.
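
A minimal scikit-learn sketch (assumed library) of the diagnosis tools mentioned above, cross-validation and the learning curve:

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, learning_curve

X, y = load_digits(return_X_y=True)
model = LogisticRegression(max_iter=1000)
print(cross_val_score(model, X, y, cv=5).mean())            # average validation score

sizes, train_scores, val_scores = learning_curve(model, X, y, cv=5)
# A large gap between train_scores and val_scores suggests overfitting;
# low scores on both suggest underfitting.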

6. Model Fusion 

Generally speaking, after model fusion, the effect can be improved to a certain extent. And it works great.

In engineering, the main method to improve the accuracy of the algorithm is to work hard on the front end (feature cleaning and preprocessing, different sampling modes) and the back end (model fusion) of the model respectively. Because they are relatively standard and reproducible, the effect is relatively stable. And the work of directly adjusting parameters will not be much, after all, it is too slow to train with a large amount of data, and the effect is difficult to guarantee.

7. Go online 

This part of the content is mainly related to the realization of the project. Engineering is result-oriented, and the effect of the model running online directly determines the success or failure of the model. It not only includes its accuracy, error, etc., but also includes its running speed (time complexity), resource consumption (space complexity), and whether the stability is acceptable.

These workflows are mainly some experience summed up in engineering practice. Not every project contains a complete process. The part here is just a guiding explanation. Only when you practice more and accumulate more project experience, will you have a deeper understanding of yourself.

Based on this, every ML algorithm course at July Online adds related content such as feature engineering and model tuning; for example, there is an open-class video, "Feature Processing and Feature Selection".

24. What is the difference between new and malloc?

Knowledge point link: the difference between new and malloc

https://www.cnblogs.com/fly1988happy/archive/2012/04/26/2470542.html

25. Hash conflicts and solutions?

@Sommer_Xia

Elements with different keys may be mapped to the same address in a hash table, causing a hash collision. Solutions:

1) Open addressing: when a collision occurs, use a probing technique to form a probe sequence in the hash table, and search cell by cell along this sequence until the given key is found or an open address (an empty cell) is reached. For insertion, the new node is stored in the first open address found during probing; for search, reaching an open address means the key is not in the table and the search fails.

2) Rehashing: construct several different hash functions and switch to the next one when a collision occurs.

3) Chaining (separate chaining): all elements whose hash address is i form a singly linked list called a synonym chain, and the head pointer of this list is stored in cell i of the hash table, so search, insertion, and deletion are done mainly within the synonym chain. Chaining is suitable when insertions and deletions are frequent (a toy sketch follows below).

4) Public overflow area: split the hash table into a basic table and an overflow table; all elements that collide in the basic table are placed in the overflow table.
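
A toy Python sketch of strategy 3), separate chaining (purely illustrative):

class ChainedHashTable:
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]   # one "synonym chain" per slot

    def put(self, key, value):
        chain = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)            # update an existing key
                return
        chain.append((key, value))                 # colliding keys share the chain

    def get(self, key):
        for k, v in self.buckets[hash(key) % len(self.buckets)]:
            if k == key:
                return v
        raise KeyError(key)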

26. How to deal with vanishing and exploding gradients?

(1) Vanishing gradients:

According to the chain rule, if the partial derivative of each layer with respect to the previous layer's output, multiplied by the weight, has magnitude less than 1, then even if each factor is as large as 0.99, after propagation through enough layers the partial derivative of the error with respect to the input layer tends to 0. Using the ReLU activation function effectively mitigates vanishing gradients.

(2) Exploding gradients:

According to the chain rule, if the partial derivative of each layer with respect to the previous layer's output is greater than 1, then after propagation through enough layers the partial derivative of the error with respect to the input layer tends to infinity.

This can also be mitigated through the choice of activation function.
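
A tiny numeric illustration of the chain-rule products described above (illustrative only):

print(0.99 ** 100)   # ≈ 0.37: even factors close to 1 shrink over many layers
print(0.5 ** 100)    # ≈ 7.9e-31: the gradient effectively vanishes
print(1.5 ** 100)    # ≈ 4.1e17: the gradient explodes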

27. Which of the following does not belong to the advantages of the CRF model over the HMM and MEMM models ( )

A. Features are flexible 

B. fast 

C. Can accommodate more contextual information 

D. Global Optimum 

Answer: First of all, CRF, HMM (Hidden Markov Model), and MEMM (Maximum Entropy Markov Model) are all commonly used to model sequence labeling.

One of the biggest shortcomings of HMM is that, because of its output independence assumption, it cannot take context features into account, which limits the choice of features.

MEMM solves this problem of HMM and can use arbitrary features, but because it must normalize at each node, it can only find a local optimum, and it also suffers from the label bias problem: any situation not seen in the training corpus is ignored.

The conditional random field solves this nicely: it does not normalize at each node but performs a global normalization over all features, so a global optimum can be obtained.

The answer is B.

28. Briefly explain the difference between supervised learning and unsupervised learning?

Supervised learning: Learning from labeled training samples to classify and predict data outside the training sample set as much as possible. (LR, SVM, BP, RF, GBDT)

Unsupervised learning: train on unlabeled samples to discover structural knowledge in them. (KMeans, DL)

29. Do you understand regularization?

Regularization is aimed at overfitting. Originally the best model is taken to be the one that minimizes the empirical risk; regularization adds a model-complexity term to the empirical risk (the regularization term is a norm of the model's parameter vector) and uses a coefficient to balance model complexity against empirical risk: the more complex the model, the larger the structural risk. The goal then becomes minimizing the structural risk, which prevents the model from becoming overly complex during training and effectively reduces the risk of overfitting.
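
In standard notation (a sketch following Statistical Learning Methods, not reproduced in the original answer), the regularized objective is

min_f  (1/N) · Σ_i L(y_i, f(x_i)) + λ · J(f)

where the first term is the empirical risk, J(f) measures model complexity (e.g. a norm of the parameter vector), and λ ≥ 0 weighs the two.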

By the principle of Occam's razor, the best model is the one that explains the known data well and is also as simple as possible.

30. What is the difference between covariance and correlation?

Correlation is the standardized form of covariance. Covariances on their own are hard to compare. For example, if we compute the covariance of salary ($) and age (years), the two variables have different units, so we obtain covariance values that cannot be compared. To solve this, we compute the correlation, a value between -1 and 1 that ignores the variables' different units.
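
A tiny numpy check (illustrative numbers) that correlation is just covariance rescaled by the two standard deviations:

import numpy as np

x = np.array([30.0, 40.0, 50.0, 60.0])          # e.g. age in years
y = np.array([3000.0, 4200.0, 4800.0, 6100.0])  # e.g. salary in dollars
cov = np.cov(x, y)[0, 1]
corr = cov / (x.std(ddof=1) * y.std(ddof=1))
print(cov, corr, np.corrcoef(x, y)[0, 1])       # corr is unit-free and matches np.corrcoef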

31. The difference and pros and cons of linear classifiers and nonlinear classifiers.

If the model is a linear function of the parameters and there is a linear classification surface, then it is a linear classifier, otherwise it is not.

Common linear classifiers are: LR, Bayesian classification, single-layer perceptron, and linear regression.

Common nonlinear classifiers: decision tree, RF, GBDT, multi-layer perceptron.

SVM has both (see linear kernel or Gaussian kernel).

Linear classifiers are fast and easy to program, but may not fit very well.

Nonlinear classifiers are more complex to implement, but their fitting ability is stronger.

32. The logical storage structure of data (such as array, queue, tree, etc.) has a very important impact on software development. Try to briefly analyze the various storage structures you know from the aspects of running speed, storage efficiency and applicable occasions .

33. What is a distributed database?

The distributed database system is developed on the basis of the mature technology of the centralized database system, but it does not simply implement the centralized database in a decentralized manner, it has its own nature and characteristics. Many concepts and technologies of centralized database systems, such as data independence, data sharing and redundancy reduction, concurrency control, integrity, security and recovery, have different and richer content in distributed database systems .

34. Briefly talk about Bayes' theorem.

Before introducing Bayes' theorem, learn a few definitions:

Conditional probability (also known as posterior probability) is the probability of event A occurring under the condition that another event B has occurred. Conditional probability is denoted as P(A|B), read as "probability of A given B".

For example, for events (subsets) A and B in the same sample space Ω, if an element randomly chosen from Ω belongs to B, then the probability that it also belongs to A is defined as the conditional probability of A given B, so P(A|B) = |A∩B| / |B|; dividing numerator and denominator by |Ω| gives P(A|B) = P(A∩B) / P(B).

The joint probability expresses the probability that two events occur together. The joint probability of A and B is expressed as P(A∩B) or P(A, B).

Marginal probability (also called prior probability) is the probability that an event occurs on its own. It is obtained by marginalization: in the joint probability, the events not needed in the final result are summed out (for discrete random variables) or integrated out (for continuous random variables); this is called marginalization. For example, the marginal probability of A is written P(A), and the marginal probability of B is written P(B).

Next, consider a problem: P(A|B) is the probability of A happening given B happening.

1) First, before event B occurs, we have a basic probability judgment on the occurrence of event A, which is called the prior probability of A, denoted by P(A);

2) Second, after event B occurs, we re-evaluate the probability of event A, which is called the posterior probability of A, represented by P(A|B);

3) Similarly, before event A occurs, we have a basic probability judgment on the occurrence of event B, which is called the prior probability of B, denoted by P(B);

4) Similarly, after event A occurs, we re-evaluate the probability of event B, which is called the posterior probability of B, denoted by P(B|A).

Bayes' theorem is then expressed as: P(A|B) = P(B|A) · P(A) / P(B).

35. What is the difference between #include <filename.h> and #include "filename.h"?

Knowledge point link: What is the difference between #include<filename.h> and #include "filename.h"

http://blog.csdn.net/u010339647/article/details/77825788

36. After studying the sales record data of a supermarket, it is found that people who buy beer will also buy diapers with a high probability. What kind of problem does this belong to in data mining? (A) 

A. Association rule discovery B. Clustering C. Classification D. Natural language processing

37. Which of the following steps is the task of integrating, transforming, dimensionally reducing, and numerically reducing the original data? (C) 

A. Frequent pattern mining B. Classification and prediction C. Data preprocessing D. Data stream mining

38. Which of the following is not a data preprocessing method?  (D) 

A Variable substitution B Discretization C Aggregation D Estimation of missing values

39. What is KDD?  (A) 

A. Data Mining and Knowledge Discovery B. Domain Knowledge Discovery C. Document Knowledge Discovery D. Dynamic Knowledge Discovery

40. When the label of the data is not known, which technique can be used to separate data with the same label from data with other labels? (B) 

A. Classification B. Clustering C. Association analysis D. Hidden Markov chain

41. Build a model, and use this model to predict which other variable values ​​belong to which type of task in data mining based on known variable values? (C) 

A. Retrieval by content B. Modeling description 

C. Predictive modeling D. Finding patterns and rules

42. Which of the following methods is not a standard method of feature selection? (D) 

A. Embedded B. Filter C. Wrapper D. Sampling

43. Please write the function find_string in python, search and print the content from the text, and require support for wildcards asterisk and question mark.

>>> find_string('hello\nworld\n', 'wor')
['wor']
>>> find_string('hello\nworld\n', 'l*d')
['ld']
>>> find_string('hello\nworld\n', 'o.')
['or']

Answer:

def find_string(s, pat):
    import re
    return re.findall(pat, s, re.I)

Source: qinjianhuang, CSDN, https://huangqinjian.blog.csdn.net/article/details/78796328

44. Talk about the five properties of red-black trees.

Teach you a preliminary understanding of red-black trees

http://blog.csdn.net/v_july_v/article/details/6105630

45. Briefly talk about the sigmoid activation function.

Commonly used nonlinear activation functions include sigmoid, tanh, relu, etc. The former two sigmoid/tanh are more common in fully connected layers, and the latter relu is common in convolutional layers. Here is a brief introduction to the most basic sigmoid function (btw, mentioned at the beginning of the SVM article in this blog).

The function expression of the sigmoid is g(z) = 1 / (1 + e^(-z)).

In other words, the sigmoid function squashes a real number into the interval (0, 1): when z is a very large positive number, g(z) tends to 1, and when z is a very negative number, g(z) tends to 0.

What is the use of compressing to 0 to 1? The usefulness is that in this way, the activation function can be regarded as a "classification probability". For example, if the output of the activation function is 0.9, it can be interpreted as a 90% probability of being a positive sample.
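
A tiny numeric sketch of this squashing behaviour:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ≈ [4.5e-05, 0.5, 0.99995]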

For example, as shown in the figure below (the figure is quoted from the Stanford Machine Learning Open Course):

46. What is convolution?

Taking the inner product (element-wise multiplication followed by summation) between a data window of the image and a filter matrix (a set of fixed weights; since each neuron's weights are fixed, it can be regarded as a constant filter) is the so-called "convolution" operation, which is also where the convolutional neural network gets its name.

In a non-strict sense, the part framed by the red box in the figure below can be understood as a filter, that is, a set of neurons with fixed weights. Multiple filters are stacked to form a convolutional layer.

OK, let's give a specific example. For example, in the figure below, the left part of the figure is the original input data, the middle part of the figure is the filter filter, and the right part of the figure is the new output two-dimensional data.

The figure above can be broken down step by step.
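
A minimal numpy sketch of this sliding inner-product (really cross-correlation, as CNNs implement it; illustrative only):

import numpy as np

def conv2d(image, kernel):
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = image[i:i + kh, j:j + kw]   # the current data window
            out[i, j] = np.sum(window * kernel)  # inner product with the filter
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0                   # a simple averaging filter
print(conv2d(image, kernel))                     # a 3x3 output feature map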

47. What is the pooling pool layer of CNN?

Pooling, in short, is to take the average or maximum of the area, as shown in the figure below (the figure is quoted from cs231n):

The figure shows max pooling: 6 is the largest value in the top-left 2x2 block, 8 in the top-right block, 3 in the bottom-left block, and 4 in the bottom-right block, giving the result shown on the right: 6 8 3 4. Simple, isn't it?
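
A toy numpy sketch of 2x2 max pooling with stride 2 (the input matrix is chosen so that the per-block maxima are 6, 8, 3, and 4, matching the numbers in the text):

import numpy as np

def max_pool_2x2(x):
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool_2x2(x))   # [[6 8] [3 4]] -- the per-block maxima described above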

48. Briefly describe what a generative adversarial network (GAN) is.

GAN is called adversarial because it is internally competitive. One side is the generator, whose main job is to generate images and try to make them look as if they came from the training samples. The other side is the discriminator, whose goal is to judge whether an input image is a real training sample.

To put it more bluntly, imagine the generator as a counterfeiter and the discriminator as the police. The generator's goal is to make its counterfeit money as realistic as possible so that it can fool the discriminator, i.e. to generate samples that look like they come from the real training data.

The figure below illustrates these two sides.

See more in this course: Generative Adversarial Networks

https://www.julyedu.com/course/getDetail/83

49. What is the principle behind learning to paint in Van Gogh's style?

Here is a hands-on tutorial that teaches you, from start to finish, how to use deep learning to paint in Van Gogh's style (GTX 1070, CUDA 8.0, GPU version of TensorFlow). As for the principle, see this video: Neural Style artistic pictures.

http://blog.csdn.net/v_july_v/article/details/52658965

http://www.julyedu.com/video/play/42/523

50. Now there are 26 elements from a to z, write a program to print any combination of 3 elements from a to z (for example, print abc, dyz, etc.).

An interview question for Baidu machine learning engineer position

http://blog.csdn.net/lvonve/article/details/53320680
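
A direct sketch using the standard library (one of many possible answers):

from itertools import combinations
from string import ascii_lowercase

for combo in combinations(ascii_lowercase, 3):
    print("".join(combo))   # abc, abd, ..., xyz -- C(26, 3) = 2600 combinations in total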

51. Which machine learning algorithms do not require normalization?

Probabilistic models do not need normalization, because they care not about the values of the variables but about their distributions and the conditional probabilities between them; examples are decision trees and random forests (RF). Optimization-based problems such as AdaBoost, GBDT, XGBoost, SVM, LR, KNN, and KMeans do need normalization.

52. Talk about the gradient descent method.

@LeftNotEasy

Mathematics in machine learning (1) - regression (regression), gradient descent (gradient descent)

http://www.cnblogs.com/LeftNotEasy/archive/2010/12/05/mathmatic_in_machine_learning_1_regression_and_gradient_descent.html
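
A minimal numpy sketch of batch gradient descent on a least-squares objective (synthetic data and step size are assumptions):

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.randn(100)

w, lr = np.zeros(3), 0.1
for _ in range(500):
    grad = 2.0 / len(y) * X.T @ (X @ w - y)   # gradient of the mean squared error
    w -= lr * grad                            # step against the gradient
print(w)                                      # close to [1.0, -2.0, 0.5]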

53. Does the gradient descent method necessarily find the direction of the fastest decline?

The gradient descent direction is not necessarily the fastest-descending direction overall; it is only the fastest-descending direction of the objective function on the tangent plane at the current point (in high-dimensional problems it is of course not literally a plane). In practical implementations, the Newton direction (which takes the Hessian matrix into account) is generally considered the fastest-descending direction and can reach superlinear convergence, while the convergence rate of gradient descent is generally linear or even sublinear (for some problems with complex constraints).

Knowledge point link: This article clearly explains the gradient descent algorithm (including its variant algorithm) in machine learning

http://blog.csdn.net/wemedia/details.html?id=45460

54. What is the difference between Newton's method and gradient descent method?

@wtq1993 knowledge point link: Common optimization algorithms in machine learning

http://blog.csdn.net/wtq1993/article/details/51607040

55. What are Quasi-Newton Methods?

@wtq1993 Common optimization algorithms in machine learning

56. Please talk about the problems and challenges of the stochastic gradient descent method?

57. Talk about the conjugate gradient method?

@wtq1993 Common optimization algorithms in machine learning

http://blog.csdn.net/wtq1993/article/details/51607040

58. For all optimization problems, is it possible to find better algorithms than the currently known algorithms?

answer link

https://www.zhihu.com/question/41233373/answer/145404190

59. What is the least squares method?

We often say: generally speaking, on average. For example, on average, non-smokers are healthier than smokers. The word "average" is added because there are exceptions: there is always someone who smokes but, thanks to regular exercise, may be healthier than his non-smoking friends. One of the simplest examples of the least squares method is the arithmetic mean.

The method of least squares is a mathematical optimization technique. It finds the best functional fit to the data by minimizing the sum of squared errors, so that the sum of the squared differences between the fitted values and the actual data is as small as possible. Expressed as a function, it minimizes E = Σ_i (y_i − f(x_i))^2.

Since the arithmetic mean is a tried and tested method, and the reasoning above shows that the arithmetic mean is a special case of least squares, this illustrates the merit of the least squares method from another angle and gives us more confidence in it.

After the least squares method was published, it was quickly recognized and accepted, and widely applied in data analysis practice. However, some in history have attributed the invention of least squares to Gauss. What happened is that Gauss also published the method, in 1809, and claimed to have used it for many years: he invented the mathematics of asteroid orbit determination, used least squares in the data analysis, and accurately predicted the position of Ceres.

By the way, what is the connection between the least squares method and SVM? Please refer to a popular introduction to support vector machines (understanding the three-level realm of SVM).

http://blog.csdn.net/v_july_v/article/details/7624837

60. Look at the print on your T-shirt: Life is too short, I use Python, can you tell me what kind of language Python is? You can compare other technologies or languages ​​to answer your question.

15 important Python interview questions to test whether you are suitable for Python?

http://nooverfit.com/wp/15 important Python interview questions to test whether you are suitable for Python? /

61. How does Python manage memory?

2017 Python latest interview questions and answers 16 questions

http://www.cnblogs.com/tom-gao/p/6645859.html

62. Please write a piece of Python code to delete duplicate elements in a list.

1. Use the set function, e.g. list(set(a));

2. Use the dictionary function:

a = [1, 2, 4, 2, 4, 5, 6, 5, 7, 8, 9, 0]
b = {}
b = b.fromkeys(a)
c = list(b.keys())
c

63. Remove duplicates by sorting first, then checking from the last element backward.

a = [1, 2, 4, 2, 4, 5, 7, 10, 5, 5, 7, 8, 9, 0, 3]

a.sort()
last = a[-1]
for i in range(len(a) - 2, -1, -1):
    if last == a[i]:
        del a[i]
    else:
        last = a[i]
print(a)

64. How to generate random numbers in Python?

@Tom_junsong

random module

Random integer: random.randint(a, b): returns a random integer x with a <= x <= b.

random.randrange(start, stop[, step]): returns a random integer from range(start, stop, step), excluding the end value.

Random real number: random.random(): returns a floating-point number in [0, 1).

random.uniform(a, b): returns a floating-point number within the specified range.

65. Talk about common loss functions.

For a given input X, the corresponding output Y is predicted by f(X). This prediction f(X) may or may not agree with the true value Y (some loss or error is sometimes unavoidable), and a loss function is used to measure the degree of prediction error. The loss function is written L(Y, f(X)).

The commonly used loss functions are the following (essentially quoted from Statistical Learning Methods):
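
As a sketch of that standard list (the 0-1, squared, absolute, and log losses, in the book's notation):

0-1 loss:               L(Y, f(X)) = 1 if Y ≠ f(X), else 0
Squared loss:           L(Y, f(X)) = (Y − f(X))^2
Absolute loss:          L(Y, f(X)) = |Y − f(X)|
Log (likelihood) loss:  L(Y, P(Y|X)) = −log P(Y|X)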

66. Briefly introduce the logistic regression.

The purpose of logistic regression is to learn a 0/1 classification model from the features. This model takes a linear combination of the features as its argument; since a linear combination ranges from negative infinity to positive infinity, the logistic function (sigmoid function) is used to map it into (0, 1), and the mapped value is interpreted as the probability of belonging to class y = 1.

The hypothesis function is h_θ(x) = g(θ^T x),

where x is an n-dimensional feature vector and g is the logistic function, g(z) = 1 / (1 + e^(-z)), whose graph is:

As the graph shows, the entire real line is mapped into (0, 1). The hypothesis function can then be interpreted as the probability that the features belong to class y = 1.

67. Since you work on computer vision, which CV frameworks are you familiar with? And, by the way, how has CV developed over the past five years?

Answer analysis
https://mp.weixin.qq.com/s?__biz=MzI3MTA0MTk1MA==&mid=2651986617&idx=1&sn=fddebd0f2968d66b7f424d6a435c84af&scene=0#wechat_redirect

68. What is the frontier progress of deep learning in the field of vision?

@元峰 Source of analysis of this question: Frontier progress of deep learning in the field of computer vision

https://zhuanlan.zhihu.com/p/24699780

69. What is the difference between HashMap and HashTable?

The difference between HashMap and Hashtable

http://oznyang.iteye.com/blog/30690

70. In classification problems, we often encounter unequal amounts of positive and negative samples. For example, the positive class has 100,000 samples while the negative class has only 10,000. The most suitable way to handle this is ( )

A. Repeat the negative samples 10 times to reach a sample size of 100,000, shuffle, and train 

B. Classify directly, to make maximal use of the data 

C. Randomly select 10,000 of the 100,000 positive samples to participate in classification 

D. Give each negative sample a weight of 10 and each positive sample a weight of 1, and train with these weights

@实医生: To be precise, each of the methods in the options has its own advantages and disadvantages; the specific problem needs concrete analysis. There is an article that analyzes the pros and cons of the various methods; interested readers can refer to it:

How to handle Imbalanced Classification Problems in machine learning?

https://www.analyticsvidhya.com/blog/2017/03/imbalanced-classification-problem/

71. Deep learning is currently a very popular machine learning approach that involves a large number of matrix multiplications. Suppose we need to compute the product ABC of three dense matrices A, B, and C, whose dimensions are m×n, n×p, and p×q respectively, with m < n < p < q. The most efficient computation order is (A)

A.(AB)C 

B.AC(B) 

C. A(BC) 

D. All orderings are equally efficient

Correct answer: A

@BlackEyes_SGC: The cost of (AB)C is m*n*p + m*p*q, and the cost of A(BC) is n*p*q + m*n*q. Since m*n*p < m*n*q and m*p*q < n*p*q, (AB)C is the cheapest.

72. Naive Bayes is a special Bayes classifier. With feature variable X and class label C, one of its assumptions is: ( C )

A. The prior probability P(C) of each category is equal 

B. A normal distribution with mean 0 and standard deviation sqrt(2)/2 

C. Each dimension of the feature variable X is a category conditional independent random variable 

D. P(X|C) is a Gaussian distribution

Correct answer: C

@BlackEyes_SGC: The premise of Naive Bayes is that the feature variables are conditionally independent of each other.

73. Regarding the support vector machine (SVM), which of the following statements is wrong? (C)

A. L2 regular term, the function is to maximize the classification interval, so that the classifier has stronger generalization ability 

B. Hinge loss function, the role is to minimize the empirical classification error 

C. The classification margin is 1/||w||, where ||w|| denotes the norm of the vector 

D. When the parameter C is smaller, the classification margin is larger, more classification errors are made, and the model tends to underfit

Correct answer: C

@BlackEyes_SGC:

A is correct. Consider why the regularization term is added: imagine a perfect data set where y > 1 is the positive class, y < -1 is the negative class, and the decision surface is y = 0; now add one positive-class noise sample at y = -30. The decision surface then becomes badly "tilted", the classification margin shrinks, and generalization drops. After adding the regularization term, tolerance to noise samples is stronger: in the example above, the decision surface tilts less, the margin stays larger, and generalization improves.

B is correct.

C is wrong. The margin should be 2/||w||; the second half of the statement is correct. The norm of a vector usually refers to its 2-norm.

D is correct. With soft margins, the effect of C on the optimization problem is to restrict the range of each multiplier a from [0, +inf] to [0, C]. The smaller C is, the smaller a can be. Setting the derivative of the Lagrangian of the objective to zero gives w = Σ_i a_i * y_i * x_i; a smaller a makes w smaller, so the margin 2/||w|| becomes larger.

74. In HMM, if the observation sequence and the state sequence that produces the observation sequence are known, which of the following methods can be used to directly estimate the parameters ( D )

A.EM algorithm 

B. Viterbi Algorithm 

C. Forward-backward algorithm 

D. Maximum Likelihood Estimation

correct answer: D

@BlackEyes_SGC:

EM algorithm: learns the model parameters when only the observation sequence is available and the state sequence is not, i.e. the Baum-Welch algorithm.

Viterbi algorithm: solves the HMM prediction (decoding) problem with dynamic programming; it is not parameter estimation.

Forward-backward algorithm: used to compute probabilities.

Maximum likelihood estimation: a supervised learning algorithm for estimating the parameters when both the observation sequence and the corresponding state sequence are available.

Note that when the observation sequence and its corresponding state sequence are both given, the model parameters can be estimated by maximum likelihood. If the state sequence is not given, EM is used, treating the state sequence as unobservable hidden data.

75. Suppose a classmate accidentally duplicated two dimensions (features) of the training data when using a Naive Bayes (NB) classification model. Which of the following statements about NB are correct? (BD)

A. The decisive role of this repeated feature in the model will be strengthened 

B. The accuracy of the model effect will be reduced compared to the case of no repeated features 

C. If all the features are repeated, the obtained model prediction results are the same as the model prediction results without repetition.

D. When the two columns of features are highly correlated, the conclusions obtained when the two columns of features are the same cannot be used to analyze the problem 

E.NB can be used for least squares regression 

F. None of the above statements are correct

Correct answer: BD

@BlackEyes_SGC: The core of NB is its assumption that all components of the feature vector are independent. In the Bayesian framework this is the important conditional independence assumption: all features are assumed to be independent of each other, so that the joint probability can be factorized.

76. Which of the following methods cannot directly classify text? (A)

A、Kmeans 

B. Decision tree 

C. Support vector machine 

D. KNN

Correct Answer: A Classification is different from clustering.

@BlackEyes_SGC: A: Kmeans is a clustering method, a typical unsupervised learning method. Classification is a supervised learning method, and BCD is a common classification method.

77. Knowing the covariance matrix P of a set of data, the following statement about the principal component is wrong ( C )

A. The optimality criterion of principal component analysis is to decompose a set of data on an orthogonal basis such that, when only a fixed number of components is retained, the truncation error measured by mean squared error is minimized. 

B. After principal component decomposition, the covariance matrix becomes a diagonal matrix 

C. Principal component analysis is KL transformation 

D. The principal component is obtained by finding the eigenvalue of the covariance matrix

Correct answer: C 

@BlackEyes_SGC: The KL transform and PCA are different concepts. The transformation matrix of PCA is the covariance matrix, while the KL transform can use many kinds of matrices (second-order moment matrix, covariance matrix, total within-class scatter matrix, etc.). When the KL transform matrix is the covariance matrix, it is equivalent to PCA.

78. What is the complexity of Kmeans?

Time complexity: O(tKmn), where t is the number of iterations, K is the number of clusters, m is the number of records, and n is the number of dimensions. Space complexity: O((m+K)n), where K is the number of clusters, m is the number of records, and n is the number of dimensions.

Specific reference: In-depth understanding of machine learning K-means, difference from KNN algorithm and its code implementation

http://blog.csdn.net/sinat_35512245/article/details/55051306

79. What is incorrect about Logit regression and SVM is (A)

A. Logit regression essentially estimates the weights from the samples by maximum likelihood, and the posterior probability is proportional to the product of the prior probability and the likelihood function. Logit regression only maximizes the likelihood function; it does not maximize the posterior probability, let alone minimize it. So A is wrong.

B. The output of Logit regression is the probability that the sample belongs to the positive category, the probability can be calculated, correct 

C. The goal of SVM is to find the hyperplane that separates the training data as well as possible with the largest classification margin, which corresponds to structural risk minimization.

D. SVM can control the complexity of the model through the regularization coefficient to avoid overfitting.

@BlackEyes_SGC: The claim that the objective of logit regression is to minimize the posterior probability is wrong; logit regression is used to predict the probability that an event occurs. The goal of SVM is structural risk minimization, and SVM can effectively avoid model overfitting.

80. The size of the input image is 200×200, and it goes through a layer of convolution (kernel size 5×5, padding 1, stride 2), pooling (kernel size 3×3, padding 0, stride 1), and another layer of convolution (kernel size 3×3, padding 1, stride 1), the output feature map size is: ()

Correct answer: 97

@BlackEyes_SGC: Computing dimensions not divisible is only encountered in GoogLeNet. Convolution is rounded down, and pooling is rounded up.

This question (200-5+2*1)/2+1 is 99.5, take 99 

(99-3)/1+1 is 97 

(97-3+2*1)/1+1 is 97

If you have studied such networks, you will notice that with stride 1, a 3x3 kernel with padding 1 (or a 5x5 kernel with padding 2) leaves the spatial size unchanged before and after convolution. The same rule applies when computing the sizes throughout GoogLeNet.
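
A small helper implementing the size formula used in this calculation (floor for convolution, ceil for pooling, as stated above):

import math

def out_size(size, kernel, padding, stride, ceil_mode=False):
    x = (size - kernel + 2 * padding) / stride + 1
    return math.ceil(x) if ceil_mode else math.floor(x)

s = out_size(200, 5, 1, 2)                # 99  (99.5 rounded down for convolution)
s = out_size(s, 3, 0, 1, ceil_mode=True)  # 97  (pooling would round up if fractional)
s = out_size(s, 3, 1, 1)                  # 97
print(s)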

81. The main factors affecting the results of the clustering algorithm are (BCD) 

A. Sample quality of known classes;

B. Classification criteria;

C. Feature selection;

D. Pattern Similarity Measure

82. In pattern recognition, the advantages of the Mahalanobis distance over the Euclidean distance are (CD) 

A. Translation invariance;

B. Rotation invariance;

C. Scale invariance;

D. Considering the distribution of modes

83. The main factors affecting the basic K-means algorithm are (ABD) 

A. Sample input order;

B. Pattern similarity measure;

C. Clustering criteria;

D. Selection of initial class centers

84. In statistical pattern classification problems, when the prior probability is unknown, you can use (BD) 

A. Minimum loss criterion;

B. Minimax loss criterion;

C. The minimum probability of misjudgment criterion;

D. Neyman-Pearson decision criterion

85. If the correlation coefficient of the eigenvector is used as the pattern similarity measure, the main factors affecting the results of the clustering algorithm are (BC) 

A. Quality of the known-class samples;

B. Classification criteria;

C. Feature selection;

D. Dimension

86. The Euclidean distance has properties (AB); the Mahalanobis distance has properties (ABCD).

A. Translation invariance;

B. Rotation invariance;

C. Scale invariance;

D. Dimension-independent properties

87. What experience do you have in Deep Learning (RNN, CNN) tuning?

Answer analysis, from Zhihu

https://www.zhihu.com/question/41631631

88. Briefly talk about the principle of RNN.

When we entered the third year of high school to prepare for the college entrance examination, what we knew at that point was a synthesis of what we had learned up to the second year plus what we learned in the third year. In the same way, when "I am" appears in a movie subtitle, you naturally continue it as "I am Chinese": the current output depends on the current input together with the memory of what came before.
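
A minimal numpy sketch of a vanilla RNN step, where the new hidden state mixes the previous state ("what was learned before") with the current input (random weights, purely illustrative):

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.RandomState(0)
W_xh, W_hh, b_h = rng.randn(3, 4), rng.randn(4, 4), np.zeros(4)
h = np.zeros(4)
for x_t in rng.randn(5, 3):       # a length-5 sequence of 3-dimensional inputs
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h)                          # the final hidden state summarizes the whole sequence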

89. What is RNN?

@一个鸟的天空, the source of this question analysis:

Introduction to Recurrent Neural Networks (RNN)

http://blog.csdn.net/heyongluoyao8/article/details/48636251

90. How is RNN constructed step by step from a single-layer network?

@何之源, source of analysis for this question:

Fully illustrated RNN, RNN variants, Seq2Seq, Attention mechanism

https://zhuanlan.zhihu.com/p/28054589

101. Deep learning (CNN RNN Attention) solves large-scale text classification problems.

Using Deep Learning (CNN RNN Attention) to Solve Large-Scale Text Classification Problems-Summary and Practice

https://zhuanlan.zhihu.com/p/25928551

102. How to solve the problems of gradient explosion and gradient vanishing in RNN?

Deep learning and natural language processing (7)_Stanford cs224d language model, RNN, LSTM and GRU

http://blog.csdn.net/han_xiaoyang/article/details/51932536

103. How to improve the performance of deep learning?

Machine Learning Series (10)_How to Improve the Performance of Deep Learning (and Machine Learning)

http://blog.csdn.net/han_xiaoyang/article/details/52654879

104. What is the difference between RNN, LSTM, and GRU?

@我爱大糖糕, the source of the analysis of this question: Interview Written Examination 3: Deep Learning Machine Learning Interview Question Preparation (must) http://blog.csdn.net/woaidapaopao/article/details/77806273

105. When machine learning performance encounters a bottleneck, how do you optimize it?

You can try four directions: working on the data, trying different algorithms, tuning the algorithms' hyperparameters, and using model fusion (ensembling). Of course, how detailed and in-depth you can go depends on your experience.

Here is a reference list: Machine Learning Series (20)_Machine Learning Performance Improvement Cheat Sheet

http://blog.csdn.net/han_xiaoyang/article/details/53453145

106. What kind of machine learning projects have you done? For example, how to build a recommendation system from scratch?

Recommended: the open class on recommender systems http://www.julyedu.com/video/play/18/148, and another course, the machine learning project class [10 pure project walkthroughs, 100% hands-on] (https://www.julyedu.com/course/getDetail/48).

107. What kind of data set is not suitable for deep learning?

@abstract monkey, source:

Zhihu answer

https://www.zhihu.com/question/41233373

108. How is the generalized linear model applied in deep learning?

@许汉, Source: Zhihu Answers

https://www.zhihu.com/question/41233373/answer/145404190

109. What theoretical knowledge should I know to prepare for a machine learning interview?

@Mu Wen

Zhihu answer

https://www.zhihu.com/question/62482926

Expand on three aspects:

[Theoretical skills] Mainly examines understanding of machine learning models, with questions chosen selectively (if the candidate's research direction is something I don't know but find interesting, I am happy to take the chance to learn). The questions here are fairly detailed and are points I have thought through myself (memorized answers don't help, since any point can be probed further). I will list them all here.

Overfitting and underfitting (give a few examples to judge; also ask about the purpose of cross-validation, hyperparameter search methods, early stopping), L1 vs. L2 regularization and the idea behind regularization (also BatchNorm, covariate shift), why L1 regularization produces sparse solutions, why logistic regression is a linear model (also how LR handles low-dimensional inseparability, LR vs. naive Bayes and unsupervised methods from a graphical-model perspective), the connections and differences among the parameter estimation methods MLE/MAP/Bayesian, the support vectors of SVM (also the KKT conditions, why the dual is used, an intuitive understanding of kernels), whether GBDT and random forests can be parallelized (also bagging vs. boosting), examples of generative vs. discriminative models, mastery of clustering methods (also the EM derivation of k-means, understanding of spectral clustering and graph cuts), the difference between gradient descent methods and Newton's method (also the ideas behind Adam and L-BFGS), semi-supervised learning ideas (also how specific semi-supervised algorithms use unlabeled data, semi-supervised learning from a MAP perspective), common evaluation metrics for classification models (also cross-entropy, how to draw a ROC curve, the physical meaning of AUC, class-imbalanced samples)

a. The convolution operation and convolution kernels in CNNs, max pooling, the connection between convolutional layers and fully connected layers, the concepts of gradient explosion and vanishing (also how to initialize neural network weights and why that mitigates explosion/vanishing, what the remedies are in CNNs, how LSTM addresses it, how to clip gradients, how dropout is used in RNN-family networks, how dropout prevents overfitting), why convolution can be used on images/speech/sentences (also what "channel" means for different types of data sources)

b. If the candidate works on NLP and recommendation systems, like me, I continue with: the relationship between CRF and the logistic-regression/maximum-entropy model, optimization methods for CRF, the relationship between CRF and MRF, the relationship between HMM and CRF (also the connection between naive Bayes and HMM, the principle of LSTM+CRF for sequence labeling, the point and edge functions of a CRF, the empirical distribution of a CRF), several common word-embedding methods and their principles (also language models, the perplexity metric, similarities and differences between word2vec and GloVe), topic models, why CNNs work for text classification, examples of syntactic vs. semantic problems, common sentence-embedding methods, the attention mechanism (also its different variants, why it was introduced, the seq2seq principle), evaluation metrics for sequence labeling, word-sense disambiguation methods, common word-level features, factorization machines, common matrix-factorization models, how to use a classification model for product recommendation (including data-set splitting, model validation, etc.), sequence learning, the wide&deep model (also why wide and why deep)

[Coding ability] Mainly examines the ability to implement algorithms and optimize code. I usually look at the candidate's GitHub repo (if the resume includes one) to see coding style and architectural ability (when I meet an expert I study it carefully, haha). Without a GitHub, I avoid typical test-prep questions and instead ask small algorithm problems abstracted from real work, for example:

a. Given a matrix of node scores and a matrix of edge scores, find the highest-scoring path (derived from the Viterbi algorithm, essentially dynamic programming); at least give the idea and pseudocode (also discuss forward and backward propagation)

b. Given an array whose elements are pairs <parent node, child node> representing a directed acyclic graph, use an optimal method to turn it into a new ordered array containing all nodes of the DAG, where every parent node appears before its children (derived from a small trick used when implementing Bayesian networks)

[Project ability] Mainly examines the approach to solving practical problems and the ability to handle messy details. This part tests the interviewer's own skill the most: you have to find the meaningful points in the candidate's (possibly exaggerated) description and dig deeper step by step. A lot of dirty work (data preprocessing, text cleaning, parameter-tuning experience, algorithm complexity optimization, bad-case analysis, loss-function modification, etc.) also surfaces during this digging.

110. What is the difference between standardization and normalization?

To put it simply, standardization processes the data column-wise over the feature matrix, converting each feature to the same scale by computing z-scores. Normalization processes the data row-wise over the feature matrix; its purpose is to give sample vectors a common standard when computing dot-product similarity or other kernel functions, i.e. every sample is converted into a "unit vector". With the L2 norm, the normalization formula is x' = x / ||x||2.
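A minimal scikit-learn sketch contrasting the two (the library and the toy matrix are assumed for illustration; they are not from the original text):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

X_std = StandardScaler().fit_transform(X)        # per column: (x - mean) / std, i.e. z-score
X_norm = Normalizer(norm="l2").fit_transform(X)  # per row: x / ||x||_2, each sample becomes a unit vector

print(X_std.mean(axis=0))              # ~0 for every feature column
print(np.linalg.norm(X_norm, axis=1))  # 1.0 for every sample row
```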

Missing value handling for feature vectors:

1. If a feature has many missing values, discard the feature directly; otherwise it may introduce a lot of noise and adversely affect the result.

2. If a feature has relatively few missing values (the missing values of the remaining features are all within 10%), there are several ways to handle it:

1) Take NaN directly as a feature, assuming it is represented by 0;

2) fill with the mean value;

3) Predict filling with algorithms such as random forest

111. How Random Forest handles missing values.

Method 1 (na.roughfix) is simple and crude: within the training set, for data of the same class, fill a missing categorical variable with the mode and a missing continuous variable with the median.

Method 2 (rfImpute) is computationally expensive; whether it is better than Method 1 is hard to judge. First fill the missing values with na.roughfix, then build a forest and compute the proximity matrix, and revisit the missing values: if the variable is categorical, vote using the proximities of the non-missing observations as weights; if it is continuous, fill with a proximity-weighted average. Then iterate 4-6 times. This way of filling missing values is somewhat similar to KNN, with the proximities playing the role of the neighbor weights.
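A rough pandas sketch of the na.roughfix idea (mode for categorical columns, median for numeric ones); the DataFrame and column names are illustrative, not from the original:

```python
import pandas as pd

def rough_fix(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing categoricals with the mode and missing numerics with the median."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object or str(out[col].dtype) == "category":
            out[col] = out[col].fillna(out[col].mode().iloc[0])   # categorical: mode
        else:
            out[col] = out[col].fillna(out[col].median())         # continuous: median
    return out

df = pd.DataFrame({"color": ["red", None, "red"], "size": [1.0, None, 3.0]})
print(rough_fix(df))
```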

112. How Random Forests Evaluate Feature Importance.

There are two ways to measure the importance of variables, Decrease GINI and Decrease Accuracy:

1) Decrease GINI: for regression problems, the criterion is argmax(Var − VarLeft − VarRight), i.e. the variance Var of the training data at the current node minus the variance VarLeft of the left child node and the variance VarRight of the right child node.

2) Decrease Accuracy: for a tree Tb(x), first compute a test error (error 1) on its OOB samples; then randomly permute the j-th column of the OOB samples, keeping the other columns unchanged, and compute the error again (error 2). We can then use error 2 − error 1 to describe the importance of variable j. The basic idea is that if variable j is important enough, permuting it will greatly increase the test error; conversely, if permuting it does not increase the test error, the variable is not that important.
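A minimal scikit-learn sketch of the two kinds of importance (the dataset and the use of a held-out split instead of OOB samples are simplifications for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print(rf.feature_importances_)                        # mean decrease in impurity (Gini/variance)
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print(perm.importances_mean)                          # accuracy drop when a column is shuffled
```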

113. Optimize Kmeans.

Use a KD-tree or Ball tree to organize all observation instances into a tree. Previously, each cluster center had to compute its distance to every observation point in turn; with the tree, each cluster center only needs to examine the points in a nearby local region.
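A minimal sketch of the tree-based assignment idea using scipy's cKDTree (here the tree is built over the centers and each point queries its nearest center, a common variant of the arrangement described above; the data is random and illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((10_000, 2))
centers = rng.random((50, 2))

tree = cKDTree(centers)                  # build the tree over the current cluster centers
dist, labels = tree.query(points, k=1)   # nearest center index for every observation
print(labels[:10])
```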

114. Selection of KMeans initial cluster center points.

The basic idea of the K-means++ algorithm for selecting the initial seeds is that the initial cluster centers should be as far away from each other as possible (a small sketch follows the steps below).

1. Randomly select a point from the input data point set as the first cluster center 

2. For each point x in the data set, calculate the distance D(x) between it and the nearest cluster center (referring to the selected cluster center) 

3. Select a new data point as the new cluster center. The principle of selection is: the point with a larger D(x) has a higher probability of being selected as the cluster center

4. Repeat 2 and 3 until k cluster centers are selected 

5. Use the k initial cluster centers to run the standard k-means algorithm
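A minimal sketch of the seeding steps above (purely illustrative; the helper name is hypothetical):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=None):
    """k-means++ seeding: pick the next center with probability proportional to D(x)^2."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]                       # step 1: random first center
    for _ in range(k - 1):
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1), axis=1)
        probs = d2 / d2.sum()                                  # step 3: larger D(x) -> higher probability
        centers.append(X[rng.choice(len(X), p=probs)])         # sample the next center
    return np.array(centers)

X = np.random.default_rng(0).random((500, 2))
print(kmeans_pp_init(X, k=3).shape)   # (3, 2); then run standard k-means from these centers
```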

115. Explain the concept of duality.

An optimization problem can be viewed from two angles: the primal problem and the dual problem. In general, the dual problem gives a lower bound on the optimal value of the primal problem; under strong duality, the dual problem attains the optimal value of the primal problem. The dual problem is a convex optimization problem and can be solved relatively well. In SVM, the primal problem is converted into a dual problem for solving, which in turn makes it possible to introduce the kernel function.

116. How to perform feature selection?

Feature selection is an important data preprocessing step. There are two main reasons for it: one is to reduce the number of features and the dimensionality, which strengthens the model's generalization ability and reduces overfitting; the other is to improve our understanding of the relationship between features and the target values.

Common feature selection methods:

1. Remove features with smaller variance.

2. Regularization. L1 regularization can produce sparse models. L2 regularization behaves more stably, because useful features tend to receive non-zero coefficients.

3. Random forest. For classification problems, Gini impurity or information gain is usually used; for regression problems, variance or least-squares fit is usually used. It generally does not require cumbersome steps such as feature engineering or parameter tuning. Its two main problems are: (1) important features may receive a low score (the correlated-features problem), and (2) the method is biased toward features with more categories (the bias problem).

4. Stability selection. This is a relatively new method based on combining subsampling with a selection algorithm, where the selection algorithm can be regression, SVM, or a similar method. Its main idea is to run the feature selection algorithm on different data subsets and feature subsets, repeat this many times, and finally aggregate the selection results, for example by counting how often a feature is selected as important (the number of times it is selected as important divided by the number of subsets it was tested in). Ideally, important features would score close to 100%, slightly weaker features will have non-zero scores, and the most useless features will have scores close to 0. (A small sketch of several of these methods follows.)
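A minimal scikit-learn sketch of three of the methods above: variance thresholding, L1-based selection, and random-forest importances (the dataset and thresholds are assumptions for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, VarianceThreshold
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

X_var = VarianceThreshold(threshold=0.01).fit_transform(X)                # 1. drop near-constant features

l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
X_l1 = SelectFromModel(l1, prefit=True).transform(X)                      # 2. keep features with non-zero weights

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
X_rf = SelectFromModel(rf, prefit=True, threshold="median").transform(X)  # 3. keep high-importance features

print(X.shape, X_var.shape, X_l1.shape, X_rf.shape)
```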

117. Data preprocessing.

1. Missing value, fill the missing value fillna:

i. Discrete features: fill with a placeholder category such as None;

ii. Continuous features: fill with the mean;

iii. If there are too many missing values, drop the column directly.

2. Continuous values: discretization. Some models (such as decision trees) require discrete values.

3. Binarize quantitative features. The core idea is to set a threshold: values greater than the threshold become 1, values less than or equal to it become 0 (used, for example, in image processing).

4. Pearson correlation coefficient: remove highly correlated columns. (A small sketch of these steps follows.)
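A small pandas sketch of the preprocessing steps above (fillna, discretization, binarization, and checking Pearson correlation); the toy data and thresholds are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"cat": ["a", None, "b"], "x": [1.0, np.nan, 5.0], "y": [2.0, 4.0, 10.0]})

df["cat"] = df["cat"].fillna("None")                  # 1.i discrete: placeholder category
df["x"] = df["x"].fillna(df["x"].mean())              # 1.ii continuous: mean

df["x_bin"] = pd.cut(df["x"], bins=3, labels=False)   # 2. discretize a continuous value
df["y_flag"] = (df["y"] > 3.0).astype(int)            # 3. binarize with a threshold

corr = df[["x", "y"]].corr(method="pearson")          # 4. inspect Pearson correlation and drop
print(corr)                                           #    one column of any highly correlated pair
```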

118. Briefly talk about feature engineering.

119. Do you know what kind of data processing and feature engineering processing?

120. Please compare the three activation functions of Sigmoid, Tanh, and ReLu?

121. What are the shortcomings or deficiencies of the three activation functions of Sigmoid, Tanh, and ReLu? Is there any improved activation function?

@我爱大糖糕, Source: Interview Written Examination 3: Deep Learning Machine Learning Interview Question Preparation (must)

http://blog.csdn.net/woaidapaopao/article/details/77806273

122. How to understand the decision tree and xgboost can handle missing values? And some models (svm) are more sensitive to missing values?

Zhihu answer

https://www.zhihu.com/question/58230411

123. Why introduce a nonlinear activation function?

@Begin Again, Source: Zhihu Answers

https://www.zhihu.com/question/29021768

If you do not use an activation function (equivalently, the activation function is f(x) = x), then the output of each layer is a linear function of the previous layer's input. It is easy to verify that, no matter how many layers the network has, the output is just a linear combination of the inputs, which is equivalent to having no hidden layer at all. This is essentially the original perceptron.

For this reason, we introduce a nonlinear function as the activation function, so that a deep neural network becomes meaningful (the output is no longer a linear combination of the inputs and can approximate arbitrary functions). The earliest choices were the sigmoid and tanh functions: their outputs are bounded, which makes them convenient as inputs to the next layer (and some people also give them a biological interpretation).
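A tiny numeric check of the point above: with the identity "activation", stacking layers collapses into a single linear map (the weights are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.random((4, 3)), rng.random((2, 4))
x = rng.random(3)

two_layers = W2 @ (W1 @ x)                   # a "deep" network with identity activation
one_layer = (W2 @ W1) @ x                    # a single equivalent linear layer
print(np.allclose(two_layers, one_layer))    # True
```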

124. Why is ReLu better than Tanh and Sigmoid function in artificial neural network?

@Begin Again, Source: Zhihu Answers

https://www.zhihu.com/question/29021768

125. Why are there two activation functions of Sigmoid and Tanh in the LSTM model?

Source of analysis of this question: Zhihu Answers

https://www.zhihu.com/question/46197687

@beanfrog: The two serve different purposes. Sigmoid is used on the gates to produce a value between 0 and 1, for which sigmoid is the most natural choice. Tanh is used on the cell state and the output, i.e. for transforming the data, and other activation functions could possibly be used there as well.

@hhhh: See also Section 4.1 of A Critical Review of Recurrent Neural Networks for Sequence Learning, which notes that both uses of tanh can be replaced by other functions.

126. Measure how good a classifier is.

@我爱大糖糕, source: answer analysis

http://blog.csdn.net/woaidapaopao/article/details/77806273

First we need to know the four counts: TP (a positive predicted as positive), FN (a positive predicted as negative), FP (a negative predicted as positive), and TN (a negative predicted as negative); drawing a 2×2 table helps.

Several commonly used indicators:

Precision: precision = TP / (TP + FP), i.e. TP divided by the number of samples predicted as positive.

Recall: recall = TP / (TP + FN) = TP / P, where P is the number of actual positives.

F1 score: 2 / F1 = 1 / recall + 1 / precision, i.e. F1 is the harmonic mean of precision and recall.

ROC curve: the ROC space is a two-dimensional plane with the false positive rate (FPR) on the X axis and the true positive rate (TPR) on the Y axis, where TPR = TP / P = recall and FPR = FP / N. (A short sketch computing these metrics follows.)
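A minimal scikit-learn sketch computing the indicators above (the library is assumed; the labels and scores are toy values):

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                    # hard predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]    # scores used for the ROC/AUC

print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_score))    # area under the ROC curve
```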

127. What is the physical meaning of auc in machine learning and statistics?

For details, see How to understand auc in Machine Learning and Statistics?

https://www.zhihu.com/question/39840928

128. Observe the gain gain, the larger the alpha and gamma, the smaller the gain?

@AntZ: XGBoost's criterion for finding split points is to maximize the gain. Since the traditional greedy method of enumerating every possible split point of every feature is too inefficient, XGBoost implements an approximate algorithm: it lists several candidate split points according to percentiles, then computes the gain of each candidate and takes the maximum to find the best split point. The gain formula consists of four terms and can be adjusted through the regularization parameters (lambda is the coefficient on the sum of squared leaf weights, and gamma is the coefficient on the number of leaves):
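For reference, the standard split-gain expression from the XGBoost paper, whose terms match the description below (G and H are the sums of first- and second-order gradients of the loss over the samples falling into each child):

$$
\mathrm{Gain} = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma
$$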

The first term is the score of the left child after the split, the second term is that of the right child, the third term is the score of the node without splitting, and the last term is the complexity cost of introducing a new leaf.

From the formula, the larger gamma is, the smaller the gain; for lambda, a larger value may make the gain either smaller or larger.

The original question asks about alpha rather than lambda; alpha is not covered in this part of the paper, but the parameter does exist in the XGBoost implementation. The above is my understanding from the paper; below is a reference found by searching:

How to tune the parameters of the XGBoost model

https://zhidao.baidu.com/question/2121727290086699747.html?fr=iks&word=xgboost%20lamda&ie=gbk

129. What causes the gradient disappearance problem? Deduce it.

@许汉, source: In neural network training, the weights of the neurons are adjusted so that the network's output gets as close as possible to the label, reducing the error. Training generally uses the BP algorithm, whose core idea is to compute the loss between the output and the label, then compute the gradient of the loss with respect to each neuron's weights, and iteratively update the weights.

Vanishing gradients make weight updates slow and model training more difficult. One cause is that many activation functions squeeze their output into a small interval, and their gradient is close to 0 over large regions at both ends of the domain, causing learning to stop.

130. What are gradient disappearance and gradient explosion?

@寒小阳: Backpropagation multiplies many factors together via the chain rule. If the factors are small and tend to 0, the product becomes very small (vanishing gradient); if the factors are relatively large, the product may become very large (exploding gradient).

@车车, Gradient disappearance and gradient explosion in neural network training

https://zhuanlan.zhihu.com/p/25631496

131. How to solve gradient disappearance and gradient expansion?

(1) The gradient disappears:

  By the chain rule, if the product of each layer's partial derivative with respect to the previous layer's output and the corresponding weight is less than 1 (even as close to 1 as 0.99), then after propagation through enough layers, the partial derivative of the error with respect to the input layer tends to 0. Using the ReLU activation function can effectively alleviate the vanishing-gradient situation.

(2) Gradient expansion 

  By the chain rule, if that per-layer product is greater than 1, then after propagation through enough layers, the partial derivative of the error with respect to the input layer tends to infinity. This can be mitigated by the choice of activation function.
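A toy numeric illustration of the chain-rule products described above (the factors and layer count are made up):

```python
# Repeatedly multiplying per-layer factors below 1 shrinks the gradient toward 0,
# while factors above 1 blow it up.
factor_small, factor_large, layers = 0.9, 1.1, 100

grad_vanish = factor_small ** layers    # ~2.7e-5: almost no gradient reaches the early layers
grad_explode = factor_large ** layers   # ~1.4e+4: the gradient grows without bound
print(grad_vanish, grad_explode)
```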

132. Derivation of Backpropagation.

@我爱大糖糕, source: derivation process

http://blog.csdn.net/woaidapaopao/article/details/77806273

133. SVD and PCA.

The idea of PCA is to maximize the variance of the projected data: find a projection vector that satisfies the maximum-variance condition. After removing the mean, SVD can be used to solve for such projection vectors by selecting the directions with the largest singular values (equivalently, the largest eigenvalues of the covariance matrix).
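A minimal sketch of PCA via SVD as described (the random data and the choice of two components are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 5))

Xc = X - X.mean(axis=0)                      # remove the mean
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:2]                          # top-2 directions (largest singular values)
X_proj = Xc @ components.T                   # projection with maximal variance along these axes
print(X_proj.shape)                          # (100, 2)
```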

134. Data imbalance problem.

This is mainly caused by an unbalanced data distribution. Possible solutions are listed below (a small sketch of two of them follows the list):

1) Sampling: oversample the minority class (for example by adding noise-perturbed copies) and downsample the majority class 

2) Perform special weighting, such as in Adaboost or SVM 

3) Adopt an algorithm that is not sensitive to imbalanced data sets 

4) Change the evaluation standard: use AUC/ROC to evaluate

5) Using methods such as Bagging/Boosting/Ensemble 

6) Consider the prior distribution of the data
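Two minimal scikit-learn examples of the ideas above, class re-weighting (2) and evaluating with AUC instead of accuracy (4); the synthetic data is an assumption for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)  # special weighting
print(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))                         # AUC as the metric
```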

135. Briefly describe the development of neural networks.

MP model + sgn —> single-layer perceptron (linear only) + sgn —(Minsky trough)—> multilayer perceptron + BP + sigmoid —(second trough)—> deep learning + pretraining + ReLU/sigmoid

136. Common methods of deep learning.

@SmallisBig, Source: Machine Learning Job Interview Question Summary - Deep Learning

http://blog.csdn.net/u010496169/article/details/73550487

137. The neural network model (Neural Network) got its name because it was inspired by the human brain. The neural network is composed of many neurons (Neuron), each neuron accepts an input, processes the input and gives an output. Which of the following statements about neurons is correct? (E)

A. Each neuron has only one input and one output 

B. Each neuron has multiple inputs and one output 

C. Each neuron has one input and multiple outputs 

D. Each neuron has multiple inputs and multiple outputs 

E. All of the above are correct

Answer: (E)

Each neuron can have one or more inputs, and one or more outputs

138. The figure below is a mathematical representation of a neuron.

139. In a neural network, knowing the weights and biases of each neuron is the most important step. If you know the exact weights and biases of the neurons, you can approximate any function. But how do you find the weights and biases of each neuron? (B)

A. Search every possible combination of weights and biases until the best value is obtained 

B. Give an initial value, then check the difference from the optimal value, and iteratively adjust the weight 

C. Random assignment, resign to fate 

D. None of the above is correct

Answer: (B)

Option B is a description of gradient descent.

140. What are the correct steps for the gradient descent algorithm? (D)

1. Calculate the error between the predicted value and the true value 

2. Repeat iterations until the optimal value of the network weight is obtained 

3. Pass the input into the network and get the output value 

4. Initialize weights and biases with random values 

5. For each neuron that produces an error, adjust the corresponding (weight) value to reduce the error

A. 1, 2, 3, 4, 5 

B. 5, 4, 3, 2, 1 

C. 3, 2, 1, 5, 4 

D. 4, 3, 1, 5, 2

Answer: (D)

141. Known:

- The brain is composed of many things called neurons, and the neural network is a simple mathematical expression of the brain.

- Each neuron has an input, a processing function and an output.

- Neurons are combined to form a network that can fit any function.

- In order to get the best neural network, we use the gradient descent method to continuously update the model 

Given the above description about neural networks, when is a neural network model called a deep learning model?

A. Add more layers to increase the depth of the neural network 

B. Have higher dimensional data 

C. When it is a pattern recognition problem 

D. None of the above is correct

Answer: (A)

More layers mean a deeper network. There is no strict definition of how many layers a model must have to be called deep; at present, a network with more than two hidden layers can already be called a deep model.

142. The convolutional neural network can perform multiple transformations (rotation, translation, scaling) on ​​an input. Is this statement correct?

Answer: wrong

A series of data preprocessing (that is, rotation, translation, and scaling) needs to be done before the data is passed into the neural network, and the neural network itself cannot complete these transformations.

143. Which of the following operations can achieve a similar effect to Dropout in neural networks? (B)

A. Boosting 

B. Bagging 

C. Stacking 

D. Mapping 

Answer: B

Dropout can be viewed as an extreme form of bagging: each sub-model is trained on different data, while strong regularization of the parameters is achieved by sharing them with the other sub-models.

144. Which of the following introduces nonlinearity in a neural network? (B)

A. Stochastic Gradient Descent 

B. Rectified Linear Unit (ReLU) 

C. Convolution function 

D. None of the above is correct

Answer: (B)

Rectified linear units are nonlinear activation functions.

145. When training the neural network, the loss function (loss) did not decrease in the first few epochs. What is the possible reason? (A)

A. The learning rate is too low 

B. The regularization parameter is too high 

C. Stuck in a local minimum 

D. All of the above are possible

Answer: (A)

146. Which of the following statements about model capacity is correct? (Refers to the ability of the neural network model to fit complex functions) (A)

A. The number of hidden layers increases, and the model ability increases 

B. The proportion of Dropout increases, and the model ability increases 

C. As the learning rate increases, the model capacity increases 

D. Neither is correct

Answer: (A)

147. If you increase the number of hidden layers of the Multilayer Perceptron, the classification error will decrease. Is this statement true or false?

Answer: wrong

Not always true. Overfitting can lead to increased error.

148. Construct a neural network that takes as input the output of the previous layer and itself. Which of the following architectures has a feedback connection? (A)

A. Recurrent Neural Networks 

B. Convolutional Neural Networks 

C. Restricted Boltzmann Machine 

D. neither

Answer: (A)

149. What is the order of tasks in a perceptron?

1. Randomly initialize the weights of the perceptron 

2. Go to the next batch of the data set (batch) 

3. If the predicted value and output are inconsistent, adjust the weight 

4. For an input sample, calculate the output value

Answer: 1 - 4 - 3 - 2

150. Suppose you need to tune parameters to minimize the cost function, which of the following techniques can be used? (D)

A. Exhaustive search 

B. Random search 

C. Bayesian optimization 

D. Any of the above

Answer: (D)

151. In which of the following situations, first-order gradient descent does not necessarily work correctly (may get stuck)? (B)

Answer: (B)

This is the classic saddle-point problem for gradient descent. This question comes from the source below:

https://www.analyticsvidhya.com/blog/2017/01/must-know-questions-deep-learning/

152. The figure below shows the relationship between the accuracy of the trained 3-layer convolutional neural network and the number of parameters (number of feature cores).

It can be seen from the trend in the figure that if the width of the neural network is increased, the accuracy will increase to a certain threshold and then begin to decrease. What is the possible reason for this phenomenon? (C)

A. Even if the number of convolution kernels is increased, only a small number of kernels will be used for prediction 

B. When the number of convolution kernels increases, the predictive ability (Power) of the neural network will decrease 

C. When the number of convolution kernels increases, the correlation between them increases (correlate), resulting in overfitting 

D. None of the above is correct

Answer: (C)

As option C points out, the likely cause is correlation between the kernels.

153. Suppose we have a hidden layer as shown in the figure below. The hidden layer plays a certain dimensionality reduction role in this network. If now we use another method of dimensionality reduction, such as principal component analysis (PCA) to replace this hidden layer. So, is the output of the two the same?

Answer: Different, because PCA extracts correlated features while the hidden layer learns features that are useful for prediction.

154. Can a neural network represent the function y = 1/x?

Answer: Yes, because the activation function can be chosen to be a reciprocal function.

155. Which of the following neural network structures will share weights? (D)

A. Convolutional Neural Networks 

B. Recurrent Neural Networks 

C. Fully connected neural network 

D. Options A and B 

Answer: (D)

156. What are the benefits of Batch Normalization? (A)

A. Normalize (change) all inputs before passing them to the next layer 

B. It takes the normalized mean and standard deviation of the weights 

C. It is a very efficient backpropagation (BP) method 

D. None of these

Answer: (A)

157. In a neural network, which of the following methods can be used to deal with overfitting? (D) 

A. Dropout 

B. Batch Normalization 

C. Regularization 

D. All of the above

Answer: (D)

158. What happens if we use an excessively large learning rate? (D) 

A. The neural network will converge 

B. hard to say 

C. Neither 

D. The neural network will not converge

Answer: (D)

159. The network shown in the figure below is used to train the recognition characters H and T, as follows:

What is the output of the network? (D)

D. Could be A or B, depending on the weight settings of the neural network

Answer: (D)

Without knowing what the weights and biases of a neural network are, it is impossible to tell what output it will give.

160. Suppose we have trained a convolutional neural network on the ImageNet dataset (object recognition). Then feed this convolutional neural network an all-white picture. The output for this input is equally likely to be any kind of object, right? (D)

A. Right 

B. Don't know 

C. It depends 

D. No

Answer: (D) The responses of each neuron are different

 

161. When a pooling layer is added to a convolutional neural network, the invariance of the transformation is preserved, isn't it? (C)

A. don't know 

B. It depends 

C. is 

D. no

Answer: (C) Invariance occurs when pooling is used.

162. Which gradient descent method is more efficient when the data is too large to be processed simultaneously in RAM? (A)

A. Stochastic Gradient Descent 

B. Don't know 

C. Full Batch Gradient Descent 

D. neither

Answer: (A)

163. The figure below is a gradient descent diagram of a neural network training with four hidden layers using the sigmoid function as the activation function. This neural network suffers from the problem of vanishing gradients. Which of the following statements is correct? (A)

A. The first hidden layer corresponds to D, the second hidden layer corresponds to C, the third hidden layer corresponds to B, and the fourth hidden layer corresponds to A 

B. The first hidden layer corresponds to A, the second hidden layer corresponds to C, the third hidden layer corresponds to B, and the fourth hidden layer corresponds to D 

C. The first hidden layer corresponds to A, the second hidden layer corresponds to B, the third hidden layer corresponds to C, and the fourth hidden layer corresponds to D 

D. The first hidden layer corresponds to B, the second hidden layer corresponds to D, the third hidden layer corresponds to C, and the fourth hidden layer corresponds to A

Answer: (A) As backpropagation proceeds back toward the initial layers, their ability to learn decreases; this is the vanishing gradient.

164. For a classification task, if the weights of the neural network are not randomly initialized at the beginning but are instead all set to 0, which of the following statements is correct? (C)

A. None of the other options are correct 

B. No problem, the neural network will start training normally 

C. Neural networks can be trained, but all neurons end up recognizing the same thing 

D. The neural network will not start training because no gradient changes

Answer: (C)

165. The figure below shows that when training starts, the error is consistently high because the neural network is stuck in a local minimum before progressing towards the global minimum. To avoid this situation, which of the following strategies can we adopt? (A)

A. Change the learning rate, such as changing the learning rate continuously for the first few training cycles 

B. Initially reduce the learning rate by a factor of 10, then use the momentum term (momentum) 

C. Increase the number of parameters so that the neural network will not be stuck at the local optimum 

D. Everything else is wrong

Answer: (A)

Option A can pull the neural network out of the local minimum.

166. For an image recognition problem (find a cat in a photo), which of the following neural networks can better solve this problem? (D)

A. Recurrent Neural Networks 

B. Perceptron 

C. Multilayer Perceptron 

D. Convolutional Neural Networks

Convolutional neural networks are better suited to image-related problems because they inherently take into account the spatial structure of nearby positions in the image.

Answer: (D)

167. Suppose that during training we suddenly run into a problem: after a few iterations the error drops abruptly. You think there is something wrong with the data, so you plot it and find that the data may be too skewed, which is causing the problem.

What are you going to do to deal with this problem? (D)

A. Normalize the data 

B. Take the logarithmic change of the data 

C. Neither 

D. Perform principal component analysis (PCA) and normalization on the data

Answer: (D)

First remove the correlations in the data (PCA), and then zero-center and scale it (normalization).

168. Which decision boundary below is generated by a neural network? (E)

A. A 

B. D 

C. C 

D. B 

E. All of the above

Answer: (E)

169. In the figure below, we can observe many small "fluctuations" in the error. Should we be worried about this situation? (B)

A. Yes, this may mean that there is a problem with the learning rate of the neural network 

B. No, as long as there is a cumulative drop on the training set and the cross-validation set 

C. don't know 

D. hard to say

Answer: (B)

Option B is correct. To reduce these "fluctuations", you can try increasing the batch size.

170. When choosing the depth of the neural network, which of the following parameters need to be considered? (C)

1 Types of neural networks (such as MLP, CNN) 

2 Enter data 

3 Computing capabilities (determined by hardware and software capabilities) 

4 Learning Rate 

5 Mapped output functions

A. 1,2,4,5 

B. 2,3,4,5 

C. All of them need to be considered 

D. 1,3,4,5

Answer: (C)

All of the above factors are important for choosing the depth of a neural network model.

171. When considering a specific problem, you may have only a small amount of data to solve it. But luckily you have a pre-trained neural network for a similar problem. Which of the following methods can be used to utilize this pre-trained network? (C)

A. Freeze all layers except the last layer and retrain the last layer 

B. Retrain the entire model on new data 

C. Only tune the last few layers (fine tune) 

D. Evaluate each layer model and select a few of them to use

Answer: (C)

172. Is increasing the size of the convolution kernel necessary to improve the effect of the convolutional neural network?

Answer: No, increasing the kernel size does not necessarily improve performance. This question depends heavily on the dataset.

173. Please briefly describe the development history of neural networks.

@SIY.Z. The source of the analysis of this question: Analysis of the Capsule plan recently proposed by Hinton

https://zhuanlan.zhihu.com/p/29435406

174. Talk about spark performance tuning.

https://tech.meituan.com/spark-tuning-basic.html 

https://tech.meituan.com/spark-tuning-pro.html

175. In machine learning, what are the engineering methods for feature selection?

Data and features determine the upper limit of machine learning, and models and algorithms only approach this upper limit

1. Compute the correlation between each feature and the response variable: common engineering approaches are the Pearson coefficient and the mutual information coefficient. The Pearson coefficient can only measure linear correlation, while the mutual information coefficient captures various kinds of dependence well but is more complex to compute; fortunately many toolkits provide it (such as MINE). After obtaining the correlations, you can sort them and select features (a small sketch follows this list);

2. Build a model of a single feature, sort the features by the accuracy of the model, and use this to select features;

3. Feature selection via the L1 regularization term: L1 regularization yields sparse solutions and therefore performs feature selection naturally. Note, however, that a feature not selected by L1 is not necessarily unimportant, because only one of two highly correlated features may be retained; to determine which features really matter, cross-check with L2 regularization;

4. Train a pre-selected model that can score features: RandomForest and Logistic Regression can score the features of the model, and then train the final model after obtaining the correlation through scoring;

5. Feature selection after feature combination: for example, combine user ID with user features to obtain a much larger feature set and then select from it. This approach is common in recommendation and advertising systems and is the main source of the so-called billion- or even ten-billion-scale features: user data is relatively sparse, and combined features allow the model to capture both global and personalized signals. This topic can be expanded on if there is an opportunity.

6. Feature selection through deep learning: with the popularity of deep learning, this approach is becoming increasingly common, especially in computer vision, because deep learning can learn features automatically (which is also why part of deep learning is called unsupervised feature learning). Features taken from a chosen layer of a deep model can then be used to train the final target model.
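A minimal sketch of method 1 above, ranking features by Pearson correlation and by mutual information (scipy/scikit-learn and the dataset are assumptions for illustration):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

pearson = np.array([abs(pearsonr(X[:, j], y)[0]) for j in range(X.shape[1])])  # linear correlation only
mi = mutual_info_classif(X, y, random_state=0)                                 # captures nonlinear relations too

print(np.argsort(pearson)[::-1][:5])   # top-5 features by |Pearson|
print(np.argsort(mi)[::-1][:5])        # top-5 features by mutual information
```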

176. What are the common classification algorithms?

SVM, neural network, random forest, logistic regression, KNN, Bayesian

177. What are the common supervised learning algorithms?

Perceptrons, SVMs, Artificial Neural Networks, Decision Trees, Logistic Regression

178. Under the premise that other conditions remain unchanged, which of the following practices is likely to cause overfitting problems in machine learning (D)

A. Increase the amount of training set 

B. Reduce the number of hidden layer nodes in the neural network 

C. Remove sparse features 

D. Use Gaussian kernel/RBF kernel instead of linear kernel in SVM algorithm

Correct answer: (D)

@刘玄320

In general, the more complex the model, the higher the possibility of overfitting; a relatively simple model usually generalizes better.

B. It is generally believed that increasing the number of hidden-layer nodes can reduce network error and improve accuracy (although some literature argues it may not always help), but it also makes the network more complex, increasing training time and the tendency to overfit. For SVM, the Gaussian kernel yields a more complex model than the linear kernel and overfits more easily.

D. About the radial basis (RBF)/Gaussian kernel: it can map the original space into an infinite-dimensional space. If its bandwidth parameter is chosen very large, the weights on the higher-order features actually decay very quickly, so the mapping is (numerically, approximately) equivalent to a low-dimensional subspace; conversely, if the parameter is chosen very small, arbitrary data can be mapped so as to be linearly separable, which is not necessarily a good thing because it may come with very serious overfitting. Overall, though, by tuning this parameter the Gaussian kernel is quite flexible and is one of the most widely used kernel functions.

179. Which of the following time series models can better fit the analysis and prediction of volatility? (D)

A. AR model 

B. MA model 

C. ARMA model 

D. GARCH model

Correct answer: (D)

@刘玄320

The AR model (autoregressive model) is a kind of linear prediction: given N data points, the model can extrapolate the data before or after the N-th point (a set of P points), so in essence it is similar to interpolation.

The MA model (moving average model) uses trend moving averages to establish a linear trend prediction model.

The ARMA model (autoregressive moving average model) is one of the high-resolution spectral-analysis methods among parametric approaches and a typical method for studying the rational spectrum of stationary stochastic processes. Compared with the AR and MA model methods, it gives more accurate spectrum estimates and better spectral resolution, but its parameter estimation is more cumbersome.

The GARCH model, the generalized ARCH model, is an extension of the ARCH model developed by Bollerslev (1986); GARCH(p,0) is equivalent to ARCH(p). GARCH is a regression model tailored to financial data: beyond what an ordinary regression model does, it additionally models the variance of the error, which makes it especially suitable for analyzing and forecasting volatility. Such analysis can play a very important guiding role in investors' decisions, and its significance often exceeds that of analyzing and forecasting the values themselves.

180. Which of the following belongs to the optimal criterion for linear classifiers? (ACD)

A. Perceptual criterion function 

B. Bayesian Classification 

C. Support vector machine 

D.Fisher criterion

Correct answer: (ACD)

@刘玄320

The optimal criteria for linear classifiers fall into three major categories: the perceptron criterion function, SVM, and the Fisher criterion. The Bayesian classifier is not a linear classifier.

Perceptual criterion function: The principle of the criterion function is to minimize the sum of the distances from misclassified samples to the interface. Its advantage is that the classifier function is corrected by the information provided by the misclassified samples. This criterion is the basis of the artificial neural network multilayer perceptron.

Support Vector Machine: The basic idea is that under the condition of two types of linear separability, the designed classifier interface maximizes the interval between the two types, and its basic starting point is to minimize the risk of expected generalization. (Non-linear problems can be solved using kernel functions)

Fisher's criterion: more broadly known as linear discriminant analysis (LDA), it projects all samples onto a straight line through the origin so that samples of the same class are as close as possible and samples of different classes are as far apart as possible; concretely, it maximizes the "generalized Rayleigh quotient".

Based on the fact that the two classes of samples are generally dense within each class and well separated between classes, the best normal-vector direction of the linear classifier is found so that, projected onto this direction, each class is as compact as possible and the two classes are as separated as possible. This measure is realized through the within-class scatter matrix Sw and the between-class scatter matrix Sb.

181. What is the advantage of the HK algorithm based on the quadratic criterion function compared to the perceptron algorithm (BD)?

A. Small amount of calculation 

B. Can determine whether the problem is linearly separable 

C. Its solution is fully applicable to the case of nonlinear separability 

D. The adaptability of its solution is better

Correct answer: (BD)

@刘玄320

The idea of the HK algorithm is very simple: obtain the weight vector under the minimum mean squared error criterion.

Its advantage over the perceptron algorithm is that it applies to both the linearly separable and the non-linearly separable case: for a linearly separable problem it yields the optimal weight vector, and for a non-linearly separable problem it can detect that fact and exit the iterative process.

182. Which of the following statements is correct (BD)?

A. SVM is robust to noise (such as noisy samples from other distributions) 

B. In the AdaBoost algorithm, the weight update ratio of all misclassified samples is the same 

C. Boosting and Bagging are both methods of combining multiple classifier votes, both of which determine their weight based on the correct rate of a single classifier 

D. Given n data points, if half are used for training and half for testing, the difference between training error and test error will decrease as n increases

Correct answer: (BD)

@刘玄320

A. SVM is robust to noise (such as noise samples from other distributions) 

  SVM itself has a certain robustness to noise, but experiments have proved that when the noise rate is lower than a certain level, the noise does not have much impact on SVM, but as the noise rate continues to increase, the recognition rate of the classifier will decrease.

B. In the AdaBoost algorithm, the weight update ratio of all misclassified samples is the same 

  The different training sets in the AdaBoost algorithm are realized by adjusting the weight of each sample. At the beginning every sample has the same weight, namely 1/n where n is the number of samples, and a weak classifier is trained under this sample distribution. For misclassified samples the weights are increased, and for correctly classified samples the weights are reduced, so that misclassified samples are highlighted and a new sample distribution is obtained. Under the new distribution the samples are used to train another weak classifier. Continuing in this way, all weak classifiers are combined to obtain a strong classifier.

C. Boosting and Bagging are both methods of combining the votes of multiple classifiers, but it is not true that both determine a single classifier's weight from its accuracy.

  The differences between Bagging and Boosting:

  The sampling methods are different.

  Bagging uses uniform sampling, while Boosting samples according to the error rate.

  The prediction functions in Bagging carry no weights, while those in Boosting are weighted.

  The prediction functions in Bagging can be generated in parallel, while those in Boosting can only be generated sequentially.

183. The size of the input image is 200×200. It goes through a layer of convolution (kernel size 5×5, padding 1, stride 2), pooling (kernel size 3×3, padding 0, stride 1), and another layer of convolution (kernel size 3×3, padding 1, stride 1). The output feature map size is (C):

A. 95 

B. 96 

C. 97 

D. 98

Correct answer: (C)

@刘玄320

First of all, we should know the formula for calculating the size after convolution or pooling:

out_height=((input_height - filter_height + padding_top+padding_bottom)/stride_height )+1 

out_width=((input_width - filter_width + padding_left+padding_right)/stride_width )+1

Among them, padding refers to the size of the edge that expands outward, and stride is the step size, that is, the length of each movement.

With the formula this is straightforward. Since the height and width are equal here, we only need to compute one dimension. The size after the first convolution is (200-5+2)/2+1 = 99.5, rounded down to 99; the size after pooling is (99-3)/1+1 = 97; the size after the second convolution is (97-3+2)/1+1 = 97.

184. In the basic analysis module of SPSS, the function whose role is to "reveal the relationships between data in the form of rows and columns" is (C)

A. Data description 

B. Correlation

C. Crosstab 

D. Multiple Correspondence

Correct answer: (C)

185. A prison face recognition access system is used to identify the identities of persons waiting to enter. This system includes the identification of 4 different persons: prison guards, thieves, food delivery staff, and others. Which of the following learning methods is most suitable for this application requirement : (B).

A. Binary Classification Problem 

B. Multiple Classification Problems 

C. Hierarchical clustering problems 

D. k-centroid clustering problem 

E. Regression problems 

F. Structural Analysis Issues 

Correct answer: (B)

@刘玄320

Binary classification: each classifier can only divide samples into two categories. The people in the prison scenario are prison guards, thieves, food delivery staff, and others, so a single binary classifier will not work. The basic support vector machine proposed by Vapnik in 1995 is a binary classifier; its learning process solves an optimization problem (the dual problem) derived from the positive/negative two-class setting. Multi-class problems can be handled by cascading binary classifiers in a decision-tree fashion. (The VC dimension is a measure of the complexity involved.)

Hierarchical clustering: creates a hierarchy of nested clusters to decompose the given dataset, operating either top-down (divisive) or bottom-up (agglomerative). The categories here (prison guards, thieves, food delivery staff, others) are at the same level, so it does not fit.

K-medoid clustering: Pick actual objects to represent clusters, using one representative object for each cluster. It's a rule of division around a center point, so it doesn't fit here.

Regression analysis: A statistical method to deal with the correlation between variables. There is no direct relationship between prison guards, thieves, food delivery staff, and others.

Structural analysis: a statistical method that computes the proportion of each component on the basis of statistical grouping and then analyzes the internal structural characteristics of an overall phenomenon, the nature of the whole, and how its internal structure changes over time. Its basic form is to compute structural indicators. It does not apply here either.

Multi-classification problem: train several different weak classifiers for different attributes, and then integrate them into one strong classifier. Here, prison guards, thieves, food delivery staff, and others are set according to their characteristics, and then differentiated and identified.

186. Which is incorrect about Logit regression and SVM is (A).

A. Logit regression objective function is to minimize the posterior probability 

B. Logit regression can be used to predict the probability of event occurrence 

C. The goal of SVM is to minimize structural risk 

D. SVM can effectively avoid model overfitting

Correct answer: (A)

@刘玄320

A. Logit regression essentially performs maximum likelihood estimation of the weights from the samples; the posterior probability is proportional to the product of the prior probability and the likelihood. Logit regression only maximizes the likelihood, it does not maximize the posterior probability, let alone minimize it (working with the posterior probability is what the naive Bayes algorithm does). So A is wrong.

B. The output of Logit regression is the probability that the sample belongs to the positive category, the probability can be calculated, correct 

C. The goal of SVM is to find the hyperplane that separates the training data with the largest classification margin, which corresponds to structural risk minimization.

D. SVM can control the complexity of the model through the regularization coefficient to avoid overfitting.

187. There are two sample points: the first is a positive sample with feature vector (0, -1); the second is a negative sample with feature vector (2, 3). For a linear SVM classifier constructed from the training set consisting of these two points, the equation of the classification boundary is (C)

A. 2x+y=4 

B. x+2y=5 

C. x+2y=3 

D. 2x-y=0

Correct answer: (C)

Analysis: this question simplifies nicely. For two points, the maximum-margin boundary is their perpendicular bisector, so we only need its equation: the midpoint is (1, 1) and the normal direction is (2, 3) − (0, −1) = (2, 4) ∝ (1, 2), which gives x + 2y = 3.
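A quick numeric check of the answer using a hard-margin linear SVC (the large C value approximating a hard margin is an assumption for illustration):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, -1.0], [2.0, 3.0]])
y = np.array([1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates a hard margin
w, b = clf.coef_[0], clf.intercept_[0]
print(w / w[0], b / w[0])                     # ~[1, 2] and ~-3, i.e. x + 2y - 3 = 0
```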

188. Which of the following descriptions about the accuracy rate, recall rate, and F1 value of the classification algorithm is wrong? ( C)

A. The accuracy rate is the ratio of the number of retrieved relevant documents to the total number of retrieved documents, and measures the precision rate of the retrieval system 

B. The recall rate refers to the ratio of the number of relevant documents retrieved to the number of all relevant documents in the document library, and measures the recall rate of the retrieval system 

C. The correct rate, recall rate and F value are all between 0 and 1. The closer the value is to 0, the higher the precision or recall rate 

D. In order to solve the conflict between precision and recall, the F1 score is introduced

Correct answer: (C)

Analysis: The commonly used evaluation indicators for two-class classification problems are precision and recall. Usually the class of interest is the positive class, and other classes are the negative class. The prediction of the classifier on the test data set is either correct or incorrect. The total numbers of the four cases are recorded as:

  TP - predict the positive class as the number of positive classes 

  FN - predict positive class as negative class number 

  FP——Predict the negative class as the number of positive classes 

  TN - predict the negative class as the number of negative classes 

thus:

  Precision is defined as: P = TP / (TP + FP)

  Recall is defined as: R = TP / (TP + FN)

  The F1 value is defined as: F1 = 2 PR / (P + R)

The precision rate, recall rate, and F1 value are all between 0 and 1. If the precision rate and recall rate are high, the F1 value will also be high. There is no saying that the closer the value is to 0, the higher the value. It should be that the closer the value is to 1, the higher the value.

189. The following model methods belong to the Discriminative Model ( A) 

1) Gaussian mixture model 

2) Conditional random field model 

3) Discriminative training 

4) Hidden Markov model 

A. 2,3 

B. 3,4 

C. 1,4 

D. 1,2

Correct answer: (A)

@刘玄320

Common discriminative models: logistic regression, 

  linear discriminant analysis, 

  support vector machines, 

  boosting (ensemble learning), 

  conditional random fields, 

  linear regression, 

  neural networks. 

Common generative models: Gaussian mixture models and other types of mixture model, 

  hidden Markov models, 

  naive Bayes, 

  AODE (averaged one-dependence estimators), 

  latent Dirichlet allocation (the LDA topic model), 

  restricted Boltzmann machines.

A generative model learns the joint distribution P(X, Y) and obtains P(Y|X) through Bayes' rule, whereas a discriminative model learns P(Y|X), or a direct mapping from the input to the output, directly.
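
A minimal sketch of the contrast (assuming scikit-learn; the data set is synthetic): GaussianNB is generative, modeling P(X, Y) as P(X|Y)P(Y), while LogisticRegression is discriminative, modeling P(Y|X) directly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

gen = GaussianNB().fit(X, y)            # learns class priors and per-class feature densities
disc = LogisticRegression().fit(X, y)   # learns P(y | x) = sigmoid(w·x + b) directly

print(gen.predict_proba(X[:1]))
print(disc.predict_proba(X[:1]))
```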

190. In SPSS, data-management functions are mainly concentrated in which of the following menus? (AD)

A. Data 

B. Direct Marketing 

C. Analyze 

D. Transform 

Correct answer: (AD)

@刘玄320

Analysis: Data-management functions are mainly found in the Data and Transform menus.

191. Deep learning is currently a very popular machine learning method that involves a large number of matrix multiplications. Suppose we need to compute the product ABC of three dense matrices A, B, and C whose dimensions are m∗n, n∗p, and p∗q respectively, with m<n<p<q. The most efficient order of computation is (A)

A. (AB)C 

B. AC(B) 

C. A(BC) 

D. So the efficiency is the same

Correct answer: (A)

@刘玄320

First, by basic matrix algebra, for a product A*B to be defined the number of columns of A must equal the number of rows of B. In option B, A (m∗n) would be multiplied directly by C (p∗q), which is not defined, so option B is ruled out.

Then compare options A and C. In option A, multiplying the m∗n matrix A by the n∗p matrix B gives the m∗p matrix AB; each element of AB requires n multiplications and n-1 additions, so ignoring additions the product costs m∗n∗p multiplications. Multiplying AB by C then costs m∗p∗q multiplications, so option A, (AB)C, requires m∗n∗p + m∗p∗q multiplications in total. Similarly, option C, A(BC), requires n∗p∗q + m∗n∗q multiplications.

Since m∗n∗p < m∗n∗q (because p < q) and m∗p∗q < n∗p∗q (because m < n), option A clearly requires fewer operations, so A is chosen.
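
A minimal sketch of the cost comparison (the concrete dimensions are illustrative, chosen only to satisfy m < n < p < q):

```python
m, n, p, q = 10, 20, 30, 40              # illustrative dimensions with m < n < p < q

cost_AB_then_C = m * n * p + m * p * q   # option A: (AB)C
cost_BC_then_A = n * p * q + m * n * q   # option C: A(BC)

print(cost_AB_then_C, cost_BC_then_A)    # 18000 vs 32000: (AB)C is cheaper
```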

192. Naive Bayes is a special Bayes classifier. Let the feature variable be X and the class label be C. One of its assumptions is: ( C )

A. The prior probability P(C) of each category is equal 

B. A normal distribution with mean 0 and standard deviation sqrt(2)/2 

C. Each dimension of the feature variable X is a category conditional independent random variable 

D. P(X|C) is a Gaussian distribution

Correct answer: ( C )

@刘玄320

Analysis: The Naive Bayes assumption is that the dimensions of the feature vector are conditionally independent of one another given the class.
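
A minimal sketch of the assumption (the per-class probabilities below are made up purely for illustration): under conditional independence, P(X|C) factorizes into a product over the feature dimensions.

```python
# Naive Bayes with two binary features: P(C | x) ∝ P(C) * P(x1 | C) * P(x2 | C).
p_x1_given_c = {0: 0.2, 1: 0.7}   # assumed P(x1 = 1 | C)
p_x2_given_c = {0: 0.4, 1: 0.9}   # assumed P(x2 = 1 | C)
prior = {0: 0.5, 1: 0.5}          # assumed P(C)

def posterior(x1, x2):
    scores = {}
    for c in (0, 1):
        px1 = p_x1_given_c[c] if x1 else 1 - p_x1_given_c[c]
        px2 = p_x2_given_c[c] if x2 else 1 - p_x2_given_c[c]
        scores[c] = prior[c] * px1 * px2        # product of the per-dimension terms
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

print(posterior(1, 1))   # class 1 dominates: 0.5*0.7*0.9 vs 0.5*0.2*0.4
```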

193. Regarding the support vector machine SVM, which of the following statements is wrong (C)

A. The L2 regularization term serves to maximize the classification margin, giving the classifier stronger generalization ability 

B. The hinge loss function serves to minimize the empirical classification error 

C. The classification margin is 1/||w||, where ||w|| denotes the norm of the vector 

D. The smaller the parameter C, the larger the classification margin, the more samples are misclassified, and the more the model tends toward under-fitting

Correct answer: (C)

@刘玄320

A is correct. Consider why a regularization term is added: imagine a perfect data set where y>1 is the positive class, y<-1 is the negative class, and the decision surface is y=0. Add a positive noise sample at y=-30, and the decision surface becomes heavily tilted, the classification margin shrinks, and generalization drops. With the regularization term, tolerance to noisy samples improves: in the example above the decision surface tilts less, the margin stays larger, and generalization is better.

B is correct.

C is wrong. The margin should be 2/||w||; the second half of the statement is correct, since the modulus of a vector usually refers to its L2 norm.

D is correct. With soft margins, the effect of C on the optimization problem is to restrict the dual variables a from the range [0, +inf) to [0, C]. The smaller C is, the smaller each a_i can be. Setting the derivative of the Lagrangian objective to zero gives w = Σ_i a_i y_i x_i, so smaller a_i make ||w|| smaller and the margin 2/||w|| larger.
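
A minimal sketch of point D (assuming scikit-learn; the blob data set and the specific C values are illustrative): a smaller C gives a smaller ||w|| and hence a larger margin 2/||w||.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w_norm = np.linalg.norm(clf.coef_)
    print(f"C={C:>6}: ||w||={w_norm:.3f}, margin 2/||w||={2 / w_norm:.3f}")
```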

194. In an HMM, if both the observation sequence and the state sequence that produced it are known, which of the following methods can be used to estimate the parameters directly? ( D )

A. EM algorithm 

B. Viterbi Algorithm 

C. Forward-backward algorithm 

D. Maximum Likelihood Estimation

Correct answer: ( D )

@刘玄320

EM algorithm: learns the model parameters when only the observation sequence is available (no state sequence); this is the Baum-Welch algorithm

Viterbi algorithm: uses dynamic programming to solve the HMM prediction (decoding) problem, not parameter estimation

Forward-backward algorithm: used to compute the probability of an observation sequence

Maximum likelihood estimation: a supervised learning method used to estimate the parameters when both the observation sequence and the corresponding state sequence are given

Note that when both the observation sequence and the corresponding state sequence are given, the model parameters can be estimated by maximum likelihood. If only the observation sequence is given without the corresponding state sequence, EM is used, treating the state sequence as unobservable hidden data.
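
A minimal sketch of supervised maximum likelihood estimation for an HMM (the state and observation sequences are made up): with both sequences observed, the transition and emission matrices are just normalized counts.

```python
import numpy as np

states = [0, 0, 1, 1, 0, 1, 1, 0]   # hypothetical hidden-state sequence
obs    = [0, 1, 1, 0, 0, 1, 0, 1]   # hypothetical observation sequence
n_states, n_obs = 2, 2

A = np.zeros((n_states, n_states))   # transition counts
B = np.zeros((n_states, n_obs))      # emission counts
for t in range(len(states) - 1):
    A[states[t], states[t + 1]] += 1
for s, o in zip(states, obs):
    B[s, o] += 1

A /= A.sum(axis=1, keepdims=True)    # row-normalize counts into probabilities
B /= B.sum(axis=1, keepdims=True)
print(A, B, sep="\n")
```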

195. Suppose that while using the Naive Bayes (NB) classification model, a student accidentally duplicated a dimension of the training data so that two feature columns are identical. Which of the following statements about NB is then correct: ( BD )

A. The decisive role of this repeated feature in the model will be strengthened 

B. The model's accuracy will be lower than it would be without the repeated feature 

C. If all the features are repeated, the resulting model's predictions are the same as those of the model without repetition 

D. When two feature columns are highly correlated, the conclusions drawn for the case where the two columns are identical cannot be used to analyze the problem 

E. NB can be used for least squares regression 

F. None of the above statements are correct 

Correct answer: (BD)
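
A minimal sketch of answer B (assuming scikit-learn; the data set is synthetic): duplicating a feature column makes Naive Bayes count that evidence twice, so the predicted probabilities shift compared with the original model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_dup = np.hstack([X, X[:, [0]]])   # repeat the first column

p_orig = GaussianNB().fit(X, y).predict_proba(X[:1])
p_dup = GaussianNB().fit(X_dup, y).predict_proba(X_dup[:1])
print(p_orig, p_dup)                # probabilities differ once the column is doubled
```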

196. L1 and L2 norms: in logistic regression, what effect does adding both the L1 and L2 norms have? (A)

A. Can do feature selection and prevent overfitting to a certain extent 

B. Can solve the curse of dimensionality problem 

C. Can speed up calculation 

D. More accurate results can be obtained

Correct answer: ( A )

@刘玄320

The L1 norm has the property of producing sparse solutions, but note that the features not selected by L1 are not necessarily unimportant, because only one of two highly correlated features may be retained. If you need to determine which features are important, use cross-validation.

Adding a regularization term to the cost function gives Lasso regression for L1 and ridge regression for L2. The L1 norm is the sum of the absolute values of the elements of the vector and is used for feature selection. The L2 norm is the square root of the sum of the squares of the elements of the vector and is used to prevent overfitting and improve the model's generalization ability. So choose A.

For detailed answers to norm regularization in machine learning, that is, L0, L1, L2 norms, please refer to norm regularization.
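
A minimal sketch of answer A (assuming scikit-learn; the penalty strength C and l1_ratio are illustrative): combining L1 and L2 in logistic regression corresponds to the elastic-net penalty, which drives some coefficients to exactly zero (feature selection) while shrinking the rest (less overfitting).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

clf = LogisticRegression(penalty="elasticnet", solver="saga",
                         l1_ratio=0.5, C=0.5, max_iter=5000).fit(X, y)
print((clf.coef_ == 0).sum(), "of", clf.coef_.size, "coefficients are exactly zero")
```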

197. What is the difference between L1 regularization and L2 regularization in machine learning? (AD)

A. Use L1 to get sparse weights 

B. Use L1 to get smooth weights 

C. Use L2 to get sparse weights 

D. Use L2 to get smooth weights

Correct answer: (AD)

@刘玄320

L1 regularization tends to produce sparse solutions; it automatically performs feature selection and removes useless features by setting their weights to 0.

The main function of L2 is to prevent overfitting: when the parameters are forced to be smaller, the model is simpler, and a simpler, smoother model is less prone to overfitting.

L1 regularization/Lasso 

L1 regularization adds the L1 norm of the coefficient vector w to the loss function as a penalty term. Because the penalty is non-zero whenever a coefficient is non-zero, it forces the coefficients of weak features to become exactly 0. L1 regularization therefore tends to produce a very sparse model (many coefficients equal to 0), which makes it a good feature-selection method.

L2 regularization/Ridge regression 

L2 regularization adds the L2 norm of the coefficient vector to the loss function. Because the coefficients enter the L2 penalty quadratically, L2 behaves quite differently from L1. The most obvious difference is that L2 regularization spreads the coefficient values out more evenly, so correlated features obtain similar coefficients. Take Y = X1 + X2 as an example and assume that X1 and X2 are strongly correlated. With L1 regularization, whether the learned model is Y = X1 + X2 or Y = 2X1, the penalty is the same, namely 2α. With L2, however, the penalty of the first model is 2α while that of the second is 4α. In other words, when the sum of the coefficients is fixed, the penalty is smallest when the coefficients are equal, which is why L2 tends to make the coefficients of correlated features similar.

It follows that L2 regularization is a stable choice for feature selection: unlike L1 regularization, its coefficients do not fluctuate with small changes in the data. So L2 and L1 regularization provide different value: L2 regularization is more useful for understanding features, since the coefficients of strongly expressive features remain non-zero.

In one sentence: L1 tends to keep a small number of features and drive the rest to 0, while L2 keeps more features, all with coefficients close to 0. Lasso is very useful for feature selection, while Ridge is just a regularizer.
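
A minimal sketch of this summary (assuming scikit-learn; the regression data set and alpha values are illustrative): Lasso (L1) typically zeroes out most of the uninformative coefficients, while Ridge (L2) merely shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0), "/ 10")  # usually several exact zeros
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0), "/ 10")  # typically none, just small values
```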

198. In the potential function method, the cumulative potential function K(x) plays a role equivalent to which of the following in Bayes decision theory? ( AD )

A. Posterior probability 

B. Prior probability 

C. Class probability density 

D. The product of the class probability density and the prior probability

Correct Answer: (AD)

In fact, options A and D say the same thing.

Analysis: the potential function is mainly used to determine the classification surface, and the idea comes from physics.

199. Which of the following statements correctly pair the three basic problems of the hidden Markov model with the corresponding algorithms? (ABC)

A. Evaluation—Forward-Backward Algorithm 

B. Decoding—Viterbi Algorithm 

C. Learning—Baum-Welch Algorithm 

D. Learning—Forward-Backward Algorithm

Correct Answer: (ABC)

Analysis: The evaluation problem can be solved with the forward algorithm, the backward algorithm, or the combined forward-backward algorithm.
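
A minimal sketch of the evaluation problem (the transition, emission, and initial probabilities are made up): the forward algorithm computes P(observations | model) by recursion.

```python
import numpy as np

A = np.array([[0.7, 0.3],       # transition probabilities (hypothetical)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],       # emission probabilities (hypothetical)
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])       # initial state distribution
obs = [0, 1, 0]                 # observation sequence

alpha = pi * B[:, obs[0]]       # initialization: alpha_1(i) = pi_i * b_i(o_1)
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]   # induction: alpha_{t+1}(j) = sum_i alpha_t(i) a_ij * b_j(o_{t+1})
print("P(obs | model) =", alpha.sum())
```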

200. When there are more features than data points, what kind of classifier should be chosen?

Answer: A linear classifier. When the dimensionality is high, the data are generally sparse in that space and are very likely to be linearly separable.
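
A minimal sketch of this regime (assuming scikit-learn; the 50-sample, 1000-feature data set is synthetic): with far more features than samples, a linear SVM typically separates the training data perfectly.

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# 50 samples, 1000 features: the "more features than data" regime discussed above.
X, y = make_classification(n_samples=50, n_features=1000, n_informative=20, random_state=0)
clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
print("training accuracy:", clf.score(X, y))   # typically 1.0 in this regime
```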
