12 basic questions about machine learning

1. Explain the significance of batch normalization

[Algorithm 1: the batch normalization transformation, applied to the activations x of a mini-batch.]

Batch normalization is an effective method for training neural network models. Its goal is to normalize the features (that is, the activations produced by each layer of the network) so that they have zero mean and a standard deviation of 1. The opposite situation is a distribution with a non-zero mean. How does that affect model training?

First of all, a non-zero mean means the data is not centred around 0: most of the values are greater than 0 or less than 0. Combined with a high variance, the values can also become very large or very small. This problem is common when training neural networks with many layers. If the features are not kept within a stable interval, they disturb the optimization of the network, because, as we all know, optimizing a neural network requires computing derivatives.

Assume a simple layer computes y = Wx + b; the derivative of y with respect to W is dy/dW = x. So the value of x directly affects the size of the derivative (the gradients in a real network are of course not this simple, but in principle x still scales the derivative). Therefore, if x fluctuates wildly, the derivative will be either too large or too small, and the learned model becomes unstable. Conversely, this also means that with batch normalization we can use higher learning rates during training.

Batch normalization also helps us avoid the value of x saturating the nonlinear activation function; in other words, it keeps the activations from becoming too high or too low. This helps the weights get learned; without it, some weights might never be updated effectively. It also reduces the dependence on the initial values of the parameters.

Batch normalization also acts as a form of regularization that helps reduce overfitting. When using batch normalization we no longer need as much dropout, which is helpful because we lose less information; however, it is still advisable to combine the two techniques.
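To make the transformation in Algorithm 1 concrete, here is a minimal NumPy sketch of a batch-normalization forward pass (the function name, the gamma/beta initialization and the toy activations are illustrative assumptions, not part of the original article):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch x of shape (batch_size, features), then scale and shift."""
    mu = x.mean(axis=0)                     # per-feature mini-batch mean
    var = x.var(axis=0)                     # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit standard deviation
    return gamma * x_hat + beta             # learnable scale and shift

# Activations with a strongly non-zero mean and a large variance
x = np.random.randn(32, 4) * 10 + 50
gamma, beta = np.ones(4), np.zeros(4)
y = batch_norm_forward(x, gamma, beta)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ~0 and ~1 for every feature
```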

2. Explain the concepts of bias and variance and the trade-off between them

What is bias? Simply put, bias is the difference between the model's average prediction and the true value we are trying to predict. When a model's bias is high, it is not paying enough attention to the training data: the model is too simple and cannot achieve good accuracy on either the training set or the test set. This phenomenon is called "underfitting".

Variance can be understood as the spread of the model's outputs around a data point. The larger the variance, the more likely it is that the model is paying too much attention to the training data and fails to generalize to data it has never seen. As a result, the model performs very well on the training set but very poorly on the test set. This phenomenon is called overfitting.

The relationship between these two concepts is often illustrated with the classic bulls-eye diagram:

In that diagram, the centre of the circle represents a model that predicts the true values perfectly. In practice you will never find such a model; the further we move from the centre, the worse the predictions become.

We can change the model so that as many of its guesses as possible land close to the centre of the circle. A balance between bias and variance is needed: if the model is too simple and has very few parameters, it will tend to have high bias and low variance.

On the other hand, if our model has a large number of parameters, it will tend to have high variance and low bias. This trade-off is the basis for reasoning about model complexity when designing algorithms.
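One simple way to see this trade-off is to fit polynomials of different degrees to the same noisy data and compare training and test error. The sketch below uses made-up sine data with arbitrary degrees and noise level, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
true_fn = lambda x: np.sin(2 * np.pi * x)

x_train = np.sort(rng.uniform(0, 1, 20))
y_train = true_fn(x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.sort(rng.uniform(0, 1, 200))
y_test = true_fn(x_test) + rng.normal(0, 0.2, x_test.size)

for degree in (1, 4, 15):   # too simple, reasonable, too flexible
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")

# Typical outcome: degree 1 has high error on both sets (high bias, underfitting),
# while degree 15 fits the training set almost perfectly but does much worse on
# the test set (high variance, overfitting).
```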

3. Suppose a deep learning model has produced 10 million face vectors. How can a new face be found as quickly as possible at query time?

This question concerns the practical application of deep learning algorithms, and the key point is how the data is indexed. Indexing is the final step in applying one-shot learning to face recognition, but it is also the step that determines whether the application is easy to deploy in practice.

Basically, you should first give an overview of face recognition with one-shot learning. It can be summarized as converting each face into a vector; recognizing a new face then means finding the stored vector that is closest (most similar) to the vector of the input face. Typically, a deep learning model trained with a custom loss function called the triplet loss is used for this task.

However, when the number of images grows to the scale mentioned in the question, computing the distance to all 10 million vectors for every recognition is not a smart solution and makes the system very slow. We need to index the data in the vector space so that queries become cheaper.

The main idea of these methods is to partition the data into simple structures (often tree-like), so that when a new vector arrives, traversing the structure quickly narrows the search down to the closest vectors.

Some methods that can be used for this purpose are Locality Sensitive Hashing (LSH), Annoy (Approximate Nearest Neighbors Oh Yeah) indexing, Faiss, and so on.
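As a rough sketch of how such an index is used, here is an example with the Annoy library mentioned above; the embedding dimension, distance metric, tree count and random vectors are all illustrative assumptions, and in a real system the vectors would come from the face-embedding model:

```python
import numpy as np
from annoy import AnnoyIndex    # pip install annoy

dim = 128                                  # assumed size of the face embeddings
index = AnnoyIndex(dim, "euclidean")       # "angular" (cosine-like) is also common

# Random vectors and a smaller count stand in for the 10 million faces here.
for i in range(100_000):
    index.add_item(i, np.random.rand(dim).tolist())

index.build(10)                            # 10 trees; more trees = better recall, bigger index

query = np.random.rand(dim).tolist()       # embedding of the new face to look up
print(index.get_nns_by_vector(query, 5))   # ids of the 5 most similar stored faces
```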

4. Is accuracy a completely reliable metric for classification problems? What metrics do you typically use to evaluate your models?

There are many ways to evaluate a classification model. Accuracy is a simple one: the number of correct predictions divided by the total number of samples. This sounds reasonable, but in reality it is not informative enough for imbalanced data. Suppose we are building a model to detect network attacks, and attack requests make up roughly 1/100,000 of all requests.

A model that predicts every request as normal would be 99.999% accurate, yet that number says nothing useful about this classifier. Accuracy only gives the percentage of correctly predicted samples; it does not break the predictions down by class. Instead, we can use a confusion matrix. Basically, a confusion matrix shows, for each class a data point actually belongs to, which class the model predicted.
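A small sketch with scikit-learn (using a made-up data set where the rare class is 0.1% of the samples, scaled down from the 1/100,000 example above) shows how accuracy can look excellent while the confusion matrix reveals that the rare class is never detected:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Toy imbalanced labels: 1 = attack (rare), 0 = normal request (frequent)
y_true = np.array([0] * 9990 + [1] * 10)
y_pred = np.zeros_like(y_true)              # a "model" that predicts normal for everything

print(accuracy_score(y_true, y_pred))       # 0.999 -- looks excellent, but is meaningless here
print(confusion_matrix(y_true, y_pred))
# [[9990    0]   rows = actual class, columns = predicted class
#  [  10    0]]  every one of the 10 attacks is missed
print(classification_report(y_true, y_pred, zero_division=0))
```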

In addition to the confusion matrix, we also have a curve called the Receiver Operating Characteristic (ROC), which shows how the true positive and false positive rates vary across every threshold used to define the classification. From the ROC curve we can judge whether the model is effective.

The closer the ROC curve gets to the upper-left corner (i.e., a high true positive rate at a low false positive rate), the better the model.

5. What do you understand by backpropagation? Explain how it works.

The goal of this question is to test whether the interviewee understands how neural networks work. You need to explain the following points:

The forward pass propagates the input through the weights of each layer and produces a prediction y_p. At that point the value of the loss function is computed; it reflects how good the model currently is. The loss function L(y_p, y_t) measures the difference between the model's output y_p and the true label y_t, and the training objective of the neural network is precisely to minimize this loss. If the loss is still too high, we need a way to reduce it.

To reduce the value of the loss function, we need derivatives. Backpropagation computes the derivatives for every layer of the network; based on those gradients, an optimizer (Adam, SGD, AdaDelta, etc.) updates the weights of the network via gradient descent.

Backpropagation applies the chain rule to compute the gradient of each layer, working from the last layer back to the first.
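The following minimal NumPy sketch shows the forward pass, the chain-rule backward pass and a gradient-descent update for a single linear layer with a mean-squared-error loss; the layer, loss and learning rate are chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))             # mini-batch of 8 samples with 3 features
y_t = rng.normal(size=(8, 1))           # target labels
W = rng.normal(size=(3, 1))
b = np.zeros(1)
lr = 0.1

for step in range(100):
    # forward pass: compute the prediction y_p and the loss L(y_p, y_t)
    y_p = x @ W + b                     # y = Wx + b
    loss = np.mean((y_p - y_t) ** 2)    # mean squared error

    # backward pass: chain rule from the loss back to the parameters
    dyp = 2 * (y_p - y_t) / len(x)      # dL/dy_p
    dW = x.T @ dyp                      # dL/dW, via dy_p/dW which involves x
    db = dyp.sum(axis=0)                # dL/db

    # gradient descent update (the optimizer's job)
    W -= lr * dW
    b -= lr * db

print(f"final loss: {loss:.4f}")
```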

6. What is the meaning of the activation function? What is the saturation range of an activation function?

1. The meaning of activation function

The purpose of the activation function is to introduce non-linearity into the neural network. These functions can be thought of as a kind of gate that decides whether information passes through a neuron. During training, the activation function also shapes the slope of the derivative, which plays a very important role.

Using nonlinear activation functions allows a neural network to learn far more complex representations than a purely linear model could; but to use them effectively, we need to understand their properties. Most activation functions are continuous and differentiable.

Being continuous and differentiable (having a derivative at every point of the domain) means that a small change in the input produces a small change in the output. As mentioned earlier, this derivative is crucial: it determines whether the neurons can be trained at all. A few activation functions worth mentioning are Sigmoid, Softmax and ReLU.

2. Saturation range of activation function

Nonlinear activations such as Tanh, Sigmoid, and ReLU all have saturation intervals.

The saturation range of an activation function is simply the interval of inputs over which the output no longer changes when the input changes. This causes two problems.

The first problem is that, in the forward direction, layers whose values fall in the saturation range of the activation function produce more and more identical outputs, so the same values flow through the rest of the model. This phenomenon is called covariate shift.

The second problem is that, in the backward direction, the derivative in the saturation range is zero, so the network can hardly learn anything anymore. This is why, in the discussion of batch normalization, we said that the values should be kept around a zero mean.
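A small numeric sketch of this saturation behaviour (the input values are arbitrary): far from zero, the sigmoid output stops changing and its derivative collapses toward zero, while ReLU saturates only on the negative side:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

s = sigmoid(x)
ds = s * (1 - s)                    # derivative of the sigmoid
print(np.round(s, 4))               # outputs flatten toward 0 and 1 at the extremes
print(np.round(ds, 4))              # derivative is largest at 0 and ~0 in the saturated regions

relu = np.maximum(0.0, x)
drelu = (x > 0).astype(float)
print(relu, drelu)                  # ReLU saturates only for negative inputs (output 0, gradient 0)
```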

7. What are the hyperparameters of a model? What is the difference between hyperparameters and parameters?

1. What are the model parameters?

Let's first recall the nature of machine learning: we need a data set, because without data there is nothing to learn from. Once the data is available, the machine needs to find the relationships hidden within it.

Suppose our data consists of weather measurements such as temperature and humidity, and the task we give the machine is to find a relationship between these factors and whether our loved one is angry. That may sound far-fetched, but the problems machine learning is asked to solve can be whimsical at times. Now let the variable y represent whether our loved one is angry, and let the variables x_1, x_2, x_3, ... represent the weather factors. We express the relationship between these variables with a function f(x) of the form:

y = f(x) = w_1·x_1 + w_2·x_2 + w_3·x_3 + …

See the coefficients w_1, w_2, w_3? They encode the relationship between the data and the result, and they are called the model parameters. We can therefore define "model parameters" as follows:

Model parameters are the values the model derives from the training data; they express the relationships between the quantities in the data.

So when we say we want to find the best model for a problem, we mean we want to find the most appropriate model parameters for the problem based on the existing data set. Model parameters have the following characteristics:

  • They can be used to predict new data;

  • They reflect the capability of the model, which is usually measured with metrics such as accuracy;

  • They are learned directly from the training data set;

  • They are not set manually by humans.

Model parameters also come in different forms, such as weights in neural networks, support vectors in support vector machines, and coefficients in linear regression and logistic regression algorithms.

2. What are model hyperparameters?

Some people assume that model hyperparameters are, or behave like, model parameters, but that is not the case: the two concepts are completely different. While model parameters are derived from the training data set, model hyperparameters are not; they lie entirely outside the model and do not depend on the training data. So what is the role of model hyperparameters? They have the following tasks:

  • They are used during training to help the model find the most suitable parameters;

  • They are usually chosen manually by the person designing the model;

  • They can be set using heuristic strategies.

For a specific problem, we generally have no idea in advance what the optimal hyperparameter values are. In practice we therefore use a technique such as grid search to estimate a good range for them (for example, the coefficient k in the k-nearest neighbor model); a small grid-search sketch follows the list of examples below. Here are some examples of model hyperparameters:

  • The learning rate when training artificial neural networks;

  • The C and σ parameters when training support vector machines;

  • The coefficient k in the k-nearest neighbor model.
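Here is the grid-search sketch referred to above, tuning k for a k-nearest neighbor classifier with scikit-learn; the data set and the candidate values of k are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# The candidate values for the hyperparameter k are chosen manually
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)                 # the k that did best in cross-validation
print(round(search.best_score_, 3))        # its mean cross-validated accuracy
```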

8. What happens when the learning rate is too high or too low?

When the learning rate is too low, training becomes very slow because each weight update is tiny; the model needs a very large number of updates to reach the local optimum.

If the learning rate is too high, the model is likely to fail to converge because the weight updates are too large. Each step may overshoot the optimum, so the model keeps jumping back and forth around the local optimal point instead of settling into it.
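This behaviour is easy to reproduce with gradient descent on a toy function. The sketch below minimizes f(x) = x², whose gradient is 2x; the three learning rates are arbitrary examples of "too low", "reasonable" and "too high":

```python
def gradient_descent(lr, steps=20, x0=5.0):
    """Minimize f(x) = x**2, whose gradient is 2*x, starting from x0."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x             # gradient descent update
    return x

for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr:<6} final x = {gradient_descent(lr):10.4f}")

# lr=0.001: x barely moves away from 5 (training is very slow)
# lr=0.1  : x gets close to the optimum x = 0
# lr=1.1  : |x| grows and the sign flips every step, i.e. the updates overshoot and diverge
```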

9. By what factor does the number of parameters of a CNN increase when the size of the input image is doubled? Why?

This question is quite misleading for candidates, because most people immediately start working out by how much the parameter count will grow. But let's look at the architecture of a CNN:

[Figure: a typical CNN architecture.]

As can be seen, the number of parameters of a CNN model depends on the number and size of its filters, not on the input image. Therefore, doubling the size of the input image does not change the number of parameters of the model. (The exception is a fully connected layer applied to a flattened feature map: its input, and therefore its parameter count, does grow with the image size.)
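A minimal sketch of the arithmetic (kernel size and channel counts chosen only for illustration): the parameter count of a convolutional layer involves the kernel size and the channel counts, but not the input height or width:

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Parameters of a convolutional layer: one kernel per output channel, plus one bias each."""
    return (kernel_h * kernel_w * in_channels + 1) * out_channels

# A 3x3 convolution from 3 input channels to 64 filters:
print(conv_params(3, 3, 3, 64))   # 1792 parameters

# Nothing in the formula depends on the input resolution, so a 224x224 image
# and a 448x448 image lead to exactly the same parameter count.
```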

10. What methods can be used to deal with data imbalance?

This question tests whether the candidate knows how to work with real data. Real data and sample data sets (standard benchmarks that require no adjustment) often differ greatly in nature and volume. With real data sets the classes are frequently imbalanced, that is, the amount of data differs widely between categories. To address this problem, we can consider the following techniques:

Choose appropriate evaluation metrics: with an imbalanced data set, evaluating with accuracy alone is very inappropriate (as mentioned earlier). Instead, choose metrics such as precision, recall, F1 score, or AUC.

Resample the training data set: besides switching metrics, one can also rebalance the data itself. There are two ways to build a balanced data set from an imbalanced one, undersampling and oversampling, using techniques such as repetition, bootstrapping, or SMOTE (Synthetic Minority Oversampling Technique).

Ensemble several models: in practice it is often impossible to simply collect more data to even out the classes. Suppose, for example, there is a rare class with 1,000 samples and a common class with 10,000 samples. Instead of trying to find 9,000 extra samples of the rare class, we can train 10 models, each on the 1,000 rare samples plus a different 1,000 samples from the common class, and then combine them with ensemble techniques to get the best result.

Redesign the model's cost function: apply penalties in the cost function so that mistakes on the data-rich class are penalized differently from mistakes on the rare class, which pushes the model to learn the rare class better. This lets the loss function cover all classes more evenly.
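Two of the techniques above, resampling with SMOTE and re-weighting the cost function, can be sketched as follows; the synthetic 5%-minority data set is an assumption for illustration, and the SMOTE implementation comes from the separate imbalanced-learn package:

```python
import numpy as np
from collections import Counter
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE    # pip install imbalanced-learn

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.05).astype(int)   # roughly 5% rare class
print(Counter(y))

# Option 1: oversample the rare class with SMOTE to balance the training set
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))                       # both classes now have the same count

# Option 2: keep the data as-is, but re-weight the cost function so that
# mistakes on the rare class are penalized more heavily
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```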

11. What do the concepts of epoch, batch and iteration mean when training a deep learning model?

These are very basic concepts in neural network training, yet many candidates confuse them. Specifically, you should answer as follows (a short training-loop sketch follows the list):

  • epoch: one full pass of the entire training data set through the model;

  • batch: when the whole data set cannot be fed into the neural network at once, it is split into several smaller batches;

  • iteration: the number of batches needed to complete one epoch. For example, if the data set contains 10,000 images and the batch size is 200, then one epoch consists of 50 iterations (10,000 divided by 200).
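Here is the training-loop sketch mentioned above, mirroring the 10,000-image / batch-size-200 example; the dataset array and the commented update step are placeholders, not a specific framework API:

```python
import numpy as np

dataset = np.arange(10_000)                    # stands in for 10,000 training images
batch_size = 200
num_epochs = 3

for epoch in range(num_epochs):                # one epoch = one full pass over the data set
    np.random.shuffle(dataset)
    iterations = len(dataset) // batch_size    # 10,000 / 200 = 50 iterations per epoch
    for it in range(iterations):
        batch = dataset[it * batch_size:(it + 1) * batch_size]
        # one forward/backward pass and one weight update would happen here
    print(f"epoch {epoch + 1}: {iterations} iterations of {batch_size} samples each")
```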

12. What is a data generator? When do we need to use one?

Generator functions are important in programming in general. A data generator produces the data for each training batch on the fly, in a form the model can consume directly.

Using a generator is helpful when training on large data sets: the entire data set does not have to be loaded into RAM, which would waste memory, and for a data set that is too large it avoids memory overflow and long preprocessing times for the input data.
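A minimal sketch of such a generator in Python (load_image, the file names and the image shape are hypothetical placeholders for real loading and preprocessing):

```python
import numpy as np

def load_image(path):
    """Hypothetical placeholder for real image loading and preprocessing."""
    return np.zeros((64, 64, 3), dtype=np.float32)

def data_generator(image_paths, labels, batch_size=32):
    """Yield (x, y) batches on the fly instead of loading the whole data set into RAM."""
    n = len(image_paths)
    while True:                                       # loop forever; the training loop decides when to stop
        order = np.random.permutation(n)              # reshuffle at the start of every epoch
        for start in range(0, n - batch_size + 1, batch_size):
            idx = order[start:start + batch_size]
            x = np.stack([load_image(image_paths[i]) for i in idx])
            y = np.asarray([labels[i] for i in idx])
            yield x, y

paths = [f"img_{i}.jpg" for i in range(1000)]         # dummy file names
labels = np.random.randint(0, 2, size=1000)
gen = data_generator(paths, labels, batch_size=32)
x_batch, y_batch = next(gen)
print(x_batch.shape, y_batch.shape)                   # (32, 64, 64, 3) (32,)
```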
