Common interview questions: NLP (continuously updated)

  1. The principle of Word2Vec and the techniques used?  https://mp.weixin.qq.com/s/lerKdFXkhqQaaVl4BGgblA
  2. How is Word2Vec's hierarchical softmax implemented, what is the idea behind it, and how does the loss function change?  https://zhuanlan.zhihu.com/p/56382372
  3. Word2vec's loss function  http://zh.d2l.ai/chapter_natural-language-processing/approx-training.html  (When a resume has no projects or internships, interviewers like to ask about Word2vec, which gets tedious)
  4. Why does skip-gram-based word2vec work better than CBOW on low-frequency words, and why is that? http://sofasofa.io/forum_main_post.php?postid=1002735
  5. How does Word2Vec turn the learned word vectors into a sentence vector, and how do you measure the quality of word vectors? https://blog.csdn.net/Matrix_cc/article/details/105138478
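    As a minimal illustration of the simplest answer to question 5, averaging the word vectors of a sentence's tokens gives a sentence vector; the toy 4-dimensional embeddings below are made up, not from the linked post:

```python
# A minimal sketch: sentence vector = mean of the word vectors of its tokens.
# The embeddings here are random toy vectors for illustration only.
import numpy as np

rng = np.random.default_rng(0)
embeddings = {w: rng.standard_normal(4) for w in ["the", "cat", "sat"]}  # toy 4-d vectors

def sentence_vector(tokens, embeddings, dim=4):
    vecs = [embeddings[t] for t in tokens if t in embeddings]  # skip OOV tokens
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(sentence_vector(["the", "cat", "sat", "down"], embeddings))
```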
  6. Why is Word2vec useful? The essence of word2vec is that it maps words that are similar, or that appear in similar contexts, to almost the same point, so the vectors of semantically similar words are close, and downstream tasks do better with them than with randomly initialized word vectors.
  7. What are the disadvantages of word2vec? 1. The structure is too simple to learn much syntactic and semantic knowledge. 2. It cannot handle OOV words. 3. It cannot model word order. 4. It cannot handle polysemy.
  8. Why does hierarchical softmax build a binary tree with word frequency  https://www.zhihu.com/question/398884697
  9. How does the weighted sampling in negative sampling work, and why is the sampling weight raised to the 3/4 power? https://zhuanlan.zhihu.com/p/144563199 (see the sketch after question 10)
  10. What is the difference between the negative samples in negative sampling and those in hierarchical softmax? In negative sampling the number of negative samples is fixed, while in hierarchical softmax it is unbalanced: high-frequency words sit close to the root of the tree, so they produce fewer negative samples, while low-frequency words produce more.
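    For questions 9 and 10, a small sketch (mine, not from the linked posts) of how negatives are drawn from the unigram distribution raised to the 3/4 power, which flattens the distribution so rare words are sampled a bit more often and very frequent words a bit less; the word counts are made up:

```python
# word2vec-style negative sampling: P(w) ∝ count(w)^0.75
import numpy as np

word_counts = {"the": 1000, "cat": 50, "sat": 30, "mat": 5}   # toy counts
words = list(word_counts)
counts = np.array([word_counts[w] for w in words], dtype=np.float64)

probs = counts ** 0.75
probs /= probs.sum()                              # normalized sampling distribution

rng = np.random.default_rng(0)
negatives = rng.choice(words, size=5, p=probs)    # draw 5 negative samples
print(dict(zip(words, probs.round(3))), negatives)
```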
  11. The differences between fastText, word2vec, GloVe, ELMo, BERT, GPT, and XLNet; what advantages does fastText have over word2vec, and how does it do classification?
  12. The principle of TextCNN, and why convolution kernels of different sizes (2, 3, 4) are used
  13. How does BiLSTM+Attention do classification? https://blog.csdn.net/google19890102/article/details/94412928
  14. The structure of the Transformer and the role of multi-head attention. PS: implement multi-head attention by hand in PyTorch  https://blog.csdn.net/Matrix_cc/article/details/104868571 ([NLP] Detailed Transformer)
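    A minimal hand-written multi-head self-attention sketch in PyTorch for question 14; the shapes and default sizes (d_model=512, 8 heads) are conventional choices, not taken from the linked implementation:

```python
# Minimal multi-head self-attention: project to Q/K/V, split into heads,
# scaled dot-product attention per head, concatenate, final output projection.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads, self.d_head = num_heads, d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):               # x: (batch, seq_len, d_model)
        b, t, _ = x.shape
        # project, then reshape to (batch, heads, seq_len, d_head)
        q = self.w_q(x).view(b, t, self.num_heads, self.d_head).transpose(1, 2)
        k = self.w_k(x).view(b, t, self.num_heads, self.d_head).transpose(1, 2)
        v = self.w_v(x).view(b, t, self.num_heads, self.d_head).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5   # scaled dot product
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)      # concatenate heads
        return self.w_o(out)

x = torch.randn(2, 10, 512)
print(MultiHeadSelfAttention()(x).shape)            # torch.Size([2, 10, 512])
```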
  15. The difference between the Transformer's position encoding and BERT's position encoding, and why add position encoding at all? The Transformer's position encoding is defined by hand (fixed sinusoids), while BERT learns its own during training, picking up word-order information the same way a word embedding does. Position encoding is needed because a traditional LSTM inherently carries word-order information, but the Transformer does not, so it has to be added.
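    For question 15, a sketch of the fixed sinusoidal position encoding used by the original Transformer (BERT instead learns a position embedding table); this is the standard formulation, not code from the article:

```python
# PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)
import math
import torch

def sinusoidal_position_encoding(max_len, d_model):
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)          # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))                      # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe

print(sinusoidal_position_encoding(50, 512).shape)   # torch.Size([50, 512])
```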
  16. In the Transformer's self-attention, what goes wrong if the word vectors are not multiplied by the Q/K/V parameter matrices?

    The core of self-attention is to use the other words in the text to enhance the semantic representation of the target word, so as to make better use of contextual information. In self-attention, every word in the sequence takes a dot product with every word in the sequence, including itself, to compute similarity. If the word vectors are not multiplied by the Q/K/V parameter matrices, then a word's q, k, and v are exactly the same vector. With values of the same magnitude, the dot product of q_i with k_i will be the largest (by analogy with "when the sum of two numbers is fixed, the product is largest when the two numbers are equal"). In the weighted average after the softmax, the word itself then takes the largest share and the other words get very small weights, so contextual information cannot effectively enhance the current word's representation. Multiplying by the Q/K/V parameter matrices gives each word distinct q, k, and v, which greatly reduces this effect.
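    A toy check of the argument above (my own illustration, not from the article): without the Q/K/V projections, q_i = k_i = x_i, so each token's dot product with itself tends to dominate the softmax:

```python
# With q = k = x (no projections), the diagonal of q @ k^T is ||x_i||^2,
# which is much larger than the cross terms, so each token mostly attends to itself.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(5, 64)                  # 5 tokens, dim 64, no Q/K/V projections
scores = x @ x.t() / 64 ** 0.5
attn = F.softmax(scores, dim=-1)
print(attn.diag())                      # the self-attention weights dominate each row
print(attn.argmax(dim=-1))              # typically tensor([0, 1, 2, 3, 4]): self wins
```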

  17. The effect of residual connections in the Transformer: they alleviate vanishing and exploding gradients, and also address the degradation problem. The degradation problem means that as the network gains more hidden layers, its accuracy saturates and then degrades sharply, and this degradation is not caused by overfitting.
  18. How does Transformer solve the long text problem?
  19. The principles of LN and BN. PS: LN is what the Transformer uses; implement it by hand in PyTorch
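    A minimal manual LayerNorm for question 19; the parameter names gamma/beta and the eps value are conventional choices, and the comparison against nn.LayerNorm is only a sanity check:

```python
# LayerNorm normalizes over the feature (last) dimension of each token,
# then applies a learned per-feature scale (gamma) and shift (beta).
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, d_model, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))
        self.beta = nn.Parameter(torch.zeros(d_model))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = ((x - mean) ** 2).mean(dim=-1, keepdim=True)   # biased variance
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta

x = torch.randn(2, 10, 512)
print(torch.allclose(LayerNorm(512)(x), nn.LayerNorm(512)(x), atol=1e-5))  # expected: True
```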
  20. Where is dropout mainly used in Transformer model
  21. Why divide by √d in self-attention?  To scale down the value of Q·K and keep it out of the saturation region of the softmax function, because the gradient in softmax's saturation region is almost 0 and gradients easily vanish.
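    A quick numeric illustration of the scaling (my own, not from the article): without dividing by √d_k the logits grow with the dimension and the softmax saturates:

```python
# Unscaled dot products have standard deviation ~sqrt(d_k), so the softmax
# output collapses toward one-hot; dividing by sqrt(d_k) keeps it smooth.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 512
q, k = torch.randn(d_k), torch.randn(10, d_k)
print(F.softmax(q @ k.t(), dim=-1))               # typically close to one-hot (saturated)
print(F.softmax(q @ k.t() / d_k ** 0.5, dim=-1))  # a much smoother distribution
```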
  22. Why does Transformer perform better than LSTM and CNN? https://blog.csdn.net/None_Pan/article/details/106485896
  23. Why does BERT perform so well?  https://blog.csdn.net/jlqCloud/article/details/104742091  1. Large pre-training corpus and two pre-training tasks 2. A model structure stronger than LSTM and CNN 3. Model depth
  24. Why are BERT's three embeddings (token, segment, and position) added together?
  25. Disadvantages of BERT: 1. It cannot handle long text. 2. The [MASK] input noise creates a mismatch between the pre-training and fine-tuning stages. 3. Poor performance on generation tasks, because the pre-training procedure is inconsistent with the generation procedure. 4. Its position encoding is absolute.  https://www.jiqizhixin.com/articles/2019-08-26-16
  26. The masking mechanism in BERT? 15% of the tokens in the corpus are randomly selected; of those, 80% are replaced with the [MASK] token, 10% are replaced with a random token, and 10% are left unchanged.
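    A simplified sketch of the 15% / 80-10-10 rule described in question 26; the vocabulary and token handling are purely illustrative, not BERT's actual WordPiece pipeline:

```python
# Pick ~15% of tokens as prediction targets; of those, 80% -> [MASK],
# 10% -> random token, 10% -> unchanged. Labels store the original tokens.
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                        # the model must predict the original token
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = random.choice(vocab)   # random replacement
            # else: keep the original token unchanged
    return masked, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
print(mask_tokens(["the", "cat", "sat", "on", "the", "mat"], vocab))
```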
  27. What is the difference between the masking in BERT and CBOW in word2vec?

    Similarities: the core idea of CBOW is to predict a target word from its surrounding context (the words before and after it). BERT is essentially doing the same thing.

    Differences: first, in CBOW every word becomes a target (input) word, while in BERT only 15% of the words do. Second, on the input side, CBOW's input contains only the context of the word to be predicted, while BERT's input is a "complete" sentence containing [MASK] tokens; that is, on the input side BERT replaces the words to be predicted with the [MASK] token.

    In addition, after CBOW training each word has a single fixed embedding, so it cannot handle polysemy; the word embedding (token embedding) produced by BERT incorporates contextual information, so even the same word gets different embeddings in different contexts.

  28. Why does BERT use character granularity rather than word granularity (for Chinese)? Because the MLM pre-training task predicts the masked token with a softmax over the vocabulary: with character granularity the vocabulary is only around 20,000, whereas with word granularity it would contain hundreds of thousands of words, and GPU memory would blow up during training.

  29. The principles and differences between HMM and CRF, and the difference in the complexity of the Viterbi algorithm:

    1. HMM is a generative model, CRF is a discriminative model

    2. HMM is a directed probabilistic graphical model, while CRF is an undirected probabilistic graphical model

    3. The HMM solution process may reach only a local optimum, while CRF can reach a global optimum

    4. HMM makes the Markov assumption, while CRF relies on the Markov property, because the Markov property is the condition used to guarantee or judge whether a probabilistic graph is an undirected graphical model.

    HMM principle, the three classic problems: 1. Probability calculation: given the model λ=(A,B,π) and an observation sequence Q={q1,q2,...,qT}, compute the probability P(Q|λ) of the sequence under the model with the forward-backward algorithm. 2. Learning: given the observation sequence Q={q1,q2,...,qT} with the states unknown, estimate the parameters of λ=(A,B,π) with the Baum-Welch algorithm so that P(Q|λ) is maximized. 3. Prediction (decoding): given the model λ=(A,B,π) and an observation sequence Q={q1,q2,...,qT}, find the state sequence I that maximizes the conditional probability P(I|Q,λ), using the Viterbi algorithm.
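    A compact Viterbi decoder for the prediction problem above, working in log space; the toy A, B, π values are made up for illustration:

```python
# Viterbi: dynamic programming over the best log-probability path to each state,
# with back-pointers to recover the most likely state sequence.
import numpy as np

def viterbi(pi, A, B, obs):
    """pi: (N,), A: (N, N) transitions, B: (N, M) emissions, obs: observation indices."""
    N, T = len(pi), len(obs)
    delta = np.log(pi) + np.log(B[:, obs[0]])        # best log-prob ending in each state
    psi = np.zeros((T, N), dtype=int)                # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)          # (prev_state, state)
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                    # backtrack
        path.append(int(psi[t][path[-1]]))
    return path[::-1], float(delta.max())

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi(pi, A, B, [0, 1, 2]))                  # best state sequence and its log-prob
```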

  30. Why is the SOP in Albert effective?

    ALBERT argues that NSP (Next Sentence Prediction) conflates topic prediction with coherence prediction. For reference, NSP uses two sentences: in a positive pair the second sentence comes from the same document, and in a negative pair it comes from another document. The ALBERT authors argue that inter-sentence coherence, not topic prediction, is the task/loss that really matters, so SOP works as follows:

    Two sentences are used, both from the same document: a positive example keeps the two sentences in their original order, and a negative example swaps their order.
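    A minimal illustration (my own) of how SOP training pairs can be constructed from consecutive segments of one document:

```python
# SOP pair construction: take two consecutive segments from the same document;
# keep the order for a positive example (label 1), swap for a negative (label 0).
import random

def make_sop_pair(segments):
    i = random.randrange(len(segments) - 1)
    a, b = segments[i], segments[i + 1]
    if random.random() < 0.5:
        return (a, b), 1        # correct order -> positive
    return (b, a), 0            # swapped order -> negative

doc = ["Sentence one.", "Sentence two.", "Sentence three."]
print(make_sop_pair(doc))
```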
     

  31. What is Add & Norm in BERT and what is its function?
  32. The difference between local attention and global attention: https://easyai.tech/ai-definition/attention/
  33. Understanding of attention, and its advantages and disadvantages: attention selects a small amount of important information out of a large amount of information and focuses on it, ignoring most of the unimportant information. The larger the weight, the more the model focuses on the corresponding Value; that is, the weight represents the importance of a piece of information, and the Value is that piece of information. Advantages:

    Few parameters

    Compared with CNN and RNN, the model complexity is lower and there are fewer parameters, so the demand on computing power is smaller.

    High speed

    Attention solves the problem that an RNN cannot be computed in parallel: each step of the attention mechanism does not depend on the result of the previous step, so, like a CNN, it can be processed in parallel.

    Good effect

    Before the attention mechanism was introduced, there was a long-standing pain point: long-range information gets weakened, just as someone with a poor memory cannot recall the distant past.

    Disadvantages: it cannot capture position information, i.e. it cannot learn the order of elements in the sequence. This can be remedied by adding positional information, such as position vectors/encodings.
  34. The difference between the two Attention mechanisms of Bahdanau and Luong: https://zhuanlan.zhihu.com/p/129316415
  35. The principle of graph embedding
  36. The principle of TF-IDF   https://blog.csdn.net/zrc199021/article/details/53728499
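    A small from-scratch TF-IDF sketch for question 36, using the textbook definition tf × log(N / df); real libraries such as scikit-learn add smoothing and normalization, so their numbers differ slightly:

```python
# TF-IDF: term frequency within a document times inverse document frequency
# across the corpus; words that appear in every document get weight 0 here.
import math
from collections import Counter

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "dog", "barked"]]

def tf_idf(docs):
    n_docs = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))   # document frequency
    result = []
    for doc in docs:
        tf = Counter(doc)
        result.append({tok: (cnt / len(doc)) * math.log(n_docs / df[tok])
                       for tok, cnt in tf.items()})
    return result

for weights in tf_idf(docs):
    print({k: round(v, 3) for k, v in weights.items()})     # "the" gets 0: it is everywhere
```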
  37. The principle of n-gram language models and the smoothing methods available  https://blog.csdn.net/songbinxu/article/details/80209197
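    A minimal bigram model with add-one (Laplace) smoothing, one of the classic smoothing methods touched on in question 37; the corpus is a toy example:

```python
# Add-one smoothing: P(w2 | w1) = (count(w1, w2) + 1) / (count(w1) + V),
# so unseen bigrams still get a small nonzero probability.
from collections import Counter

corpus = [["<s>", "the", "cat", "sat", "</s>"], ["<s>", "the", "dog", "sat", "</s>"]]
unigrams = Counter(tok for sent in corpus for tok in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))
vocab_size = len(unigrams)

def bigram_prob(w1, w2):
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)

print(bigram_prob("the", "cat"))   # seen bigram
print(bigram_prob("cat", "dog"))   # unseen bigram, still nonzero thanks to smoothing
```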
  38. Solutions to OOV: how does mainstream NLP research deal with out-of-vocabulary words?
  39. Dimensionality reduction of word vectors
  40. What word segmentation techniques are used in NLP, and how do they segment text?
  41. What data augmentation methods are there for NLP?  https://blog.csdn.net/Matrix_cc/article/details/104864223
  42. What are the methods of text preprocessing

Origin: blog.csdn.net/Matrix_cc/article/details/105513836