Common interview questions: NLP (continuously updated)

  1. The principle of Word2Vec and the techniques used?  https://mp.weixin.qq.com/s/lerKdFXkhqQaaVl4BGgblA
  2. How is Word2Vec's hierarchical softmax implemented, what is the idea behind it, and how does the loss function change?  https://zhuanlan.zhihu.com/p/56382372
  3. Word2vec's loss function  http://zh.d2l.ai/chapter_natural-language-processing/approx-training.html  (When a resume has no projects or internships, interviewers like to ask about Word2vec, which gets tedious)
  4. Why does skip-gram-based word2vec work better than CBOW on low-frequency words, and why is that? http://sofasofa.io/forum_main_post.php?postid=1002735
  5. How does Word2Vec turn the learned word vectors into a sentence vector, and how do you measure the quality of word vectors? https://blog.csdn.net/Matrix_cc/article/details/105138478
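    As a minimal illustration of the simplest answer to question 5, averaging the word vectors of a sentence's tokens gives a sentence vector; the toy 4-dimensional embeddings below are made up, not from the linked post:

```python
# A minimal sketch: sentence vector = mean of the word vectors of its tokens.
# The embeddings here are random toy vectors for illustration only.
import numpy as np

rng = np.random.default_rng(0)
embeddings = {w: rng.standard_normal(4) for w in ["the", "cat", "sat"]}  # toy 4-d vectors

def sentence_vector(tokens, embeddings, dim=4):
    vecs = [embeddings[t] for t in tokens if t in embeddings]  # skip OOV tokens
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(sentence_vector(["the", "cat", "sat", "down"], embeddings))
```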
  6. Why is Word2vec useful? The essence of word2vec is that it maps words that are similar, or that appear in similar contexts, to almost the same point, so the vectors of semantically similar words are close, and downstream tasks do better with them than with randomly initialized word vectors.
  7. What are the disadvantages of word2vec? 1. The structure is too simple to learn much syntactic and semantic knowledge. 2. It cannot handle OOV words. 3. It cannot model word order. 4. It cannot handle polysemy.
  8. Why does hierarchical softmax build a binary tree with word frequency  https://www.zhihu.com/question/398884697
  9. How does the weighted sampling in negative sampling work, and why is the sampling weight raised to the 3/4 power? https://zhuanlan.zhihu.com/p/144563199 (see the sketch after question 10)
  10. What is the difference between the negative samples in negative sampling and those in hierarchical softmax? In negative sampling the number of negative samples is fixed, while in hierarchical softmax it is unbalanced: high-frequency words sit close to the root of the tree, so they produce fewer negative samples, while low-frequency words produce more.
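    For questions 9 and 10, a small sketch (mine, not from the linked posts) of how negatives are drawn from the unigram distribution raised to the 3/4 power, which flattens the distribution so rare words are sampled a bit more often and very frequent words a bit less; the word counts are made up:

```python
# word2vec-style negative sampling: P(w) ∝ count(w)^0.75
import numpy as np

word_counts = {"the": 1000, "cat": 50, "sat": 30, "mat": 5}   # toy counts
words = list(word_counts)
counts = np.array([word_counts[w] for w in words], dtype=np.float64)

probs = counts ** 0.75
probs /= probs.sum()                              # normalized sampling distribution

rng = np.random.default_rng(0)
negatives = rng.choice(words, size=5, p=probs)    # draw 5 negative samples
print(dict(zip(words, probs.round(3))), negatives)
```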
  11. The differences between fastText, word2vec, GloVe, ELMo, BERT, GPT, and XLNet; what advantages does fastText have over word2vec, and how does it do classification?
  12. The principle of TextCNN, and why convolution kernels of different sizes (2, 3, 4) are used
  13. How does BiLSTM+Attention do classification? https://blog.csdn.net/google19890102/article/details/94412928
  14. The structure of the Transformer and the role of multi-head attention. PS: implement multi-head attention by hand in PyTorch  https://blog.csdn.net/Matrix_cc/article/details/104868571 ([NLP] Detailed Transformer)
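    A minimal hand-written multi-head self-attention sketch in PyTorch for question 14; the shapes and default sizes (d_model=512, 8 heads) are conventional choices, not taken from the linked implementation:

```python
# Minimal multi-head self-attention: project to Q/K/V, split into heads,
# scaled dot-product attention per head, concatenate, final output projection.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads, self.d_head = num_heads, d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):               # x: (batch, seq_len, d_model)
        b, t, _ = x.shape
        # project, then reshape to (batch, heads, seq_len, d_head)
        q = self.w_q(x).view(b, t, self.num_heads, self.d_head).transpose(1, 2)
        k = self.w_k(x).view(b, t, self.num_heads, self.d_head).transpose(1, 2)
        v = self.w_v(x).view(b, t, self.num_heads, self.d_head).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5   # scaled dot product
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)      # concatenate heads
        return self.w_o(out)

x = torch.randn(2, 10, 512)
print(MultiHeadSelfAttention()(x).shape)            # torch.Size([2, 10, 512])
```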
  15. The difference between the Transformer's position encoding and BERT's position encoding, and why add position encoding at all? The Transformer's position encoding is defined by hand (fixed sinusoids), while BERT learns its own during training, picking up word-order information the same way a word embedding does. Position encoding is needed because a traditional LSTM inherently carries word-order information, but the Transformer does not, so it has to be added.
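    For question 15, a sketch of the fixed sinusoidal position encoding used by the original Transformer (BERT instead learns a position embedding table); this is the standard formulation, not code from the article:

```python
# PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)
import math
import torch

def sinusoidal_position_encoding(max_len, d_model):
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)          # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))                      # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe

print(sinusoidal_position_encoding(50, 512).shape)   # torch.Size([50, 512])
```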
  16. In the Transformer's self-attention, what goes wrong if the word vectors are not multiplied by the Q/K/V parameter matrices?

    The core of self-attention is to use the other words in the text to enhance the semantic representation of the target word, so as to make better use of contextual information. In self-attention, every word in the sequence takes a dot product with every word in the sequence, including itself, to compute similarity. If the word vectors are not multiplied by the Q/K/V parameter matrices, then a word's q, k, and v are exactly the same vector. With values of the same magnitude, the dot product of q_i with k_i will be the largest (by analogy with "when the sum of two numbers is fixed, the product is largest when the two numbers are equal"). In the weighted average after the softmax, the word itself then takes the largest share and the other words get very small weights, so contextual information cannot effectively enhance the current word's representation. Multiplying by the Q/K/V parameter matrices gives each word distinct q, k, and v, which greatly reduces this effect.
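    A toy check of the argument above (my own illustration, not from the article): without the Q/K/V projections, q_i = k_i = x_i, so each token's dot product with itself tends to dominate the softmax:

```python
# With q = k = x (no projections), the diagonal of q @ k^T is ||x_i||^2,
# which is much larger than the cross terms, so each token mostly attends to itself.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(5, 64)                  # 5 tokens, dim 64, no Q/K/V projections
scores = x @ x.t() / 64 ** 0.5
attn = F.softmax(scores, dim=-1)
print(attn.diag())                      # the self-attention weights dominate each row
print(attn.argmax(dim=-1))              # typically tensor([0, 1, 2, 3, 4]): self wins
```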

  17. The effect of residual connections in the Transformer: they alleviate vanishing and exploding gradients, and also address the degradation problem. The degradation problem means that as the network gains more hidden layers, its accuracy saturates and then degrades sharply, and this degradation is not caused by overfitting.
  18. How does Transformer solve the long text problem?
  19. The principles of LN and BN. PS: LN is what the Transformer uses; implement it by hand in PyTorch
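    A minimal manual LayerNorm for question 19; the parameter names gamma/beta and the eps value are conventional choices, and the comparison against nn.LayerNorm is only a sanity check:

```python
# LayerNorm normalizes over the feature (last) dimension of each token,
# then applies a learned per-feature scale (gamma) and shift (beta).
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, d_model, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))
        self.beta = nn.Parameter(torch.zeros(d_model))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = ((x - mean) ** 2).mean(dim=-1, keepdim=True)   # biased variance
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta

x = torch.randn(2, 10, 512)
print(torch.allclose(LayerNorm(512)(x), nn.LayerNorm(512)(x), atol=1e-5))  # expected: True
```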
  20. Where is dropout mainly used in Transformer model
  21. Why divide by √d in self-attention?  To scale down the value of Q·K and keep it out of the saturation region of the softmax function, because the gradient in softmax's saturation region is almost 0 and gradients easily vanish.
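    A quick numeric illustration of the scaling (my own, not from the article): without dividing by √d_k the logits grow with the dimension and the softmax saturates:

```python
# Unscaled dot products have standard deviation ~sqrt(d_k), so the softmax
# output collapses toward one-hot; dividing by sqrt(d_k) keeps it smooth.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 512
q, k = torch.randn(d_k), torch.randn(10, d_k)
print(F.softmax(q @ k.t(), dim=-1))               # typically close to one-hot (saturated)
print(F.softmax(q @ k.t() / d_k ** 0.5, dim=-1))  # a much smoother distribution
```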
  22. Why does Transformer perform better than LSTM and CNN? https://blog.csdn.net/None_Pan/article/details/106485896
  23. Why does BERT perform so well?  https://blog.csdn.net/jlqCloud/article/details/104742091  1. Large pre-training corpus and two pre-training tasks 2. A model structure stronger than LSTM and CNN 3. Model depth
  24. Why are BERT's three embeddings (token, segment, and position) added together?
  25. Disadvantages of BERT: 1. It cannot handle long text. 2. The [MASK] input noise creates a mismatch between the pre-training and fine-tuning stages. 3. Poor performance on generation tasks, because the pre-training procedure is inconsistent with the generation procedure. 4. Its position encoding is absolute.  https://www.jiqizhixin.com/articles/2019-08-26-16
  26. The masking mechanism in BERT? 15% of the tokens in the corpus are randomly selected; of those, 80% are replaced with the [MASK] token, 10% are replaced with a random token, and 10% are left unchanged.
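    A simplified sketch of the 15% / 80-10-10 rule described in question 26; the vocabulary and token handling are purely illustrative, not BERT's actual WordPiece pipeline:

```python
# Pick ~15% of tokens as prediction targets; of those, 80% -> [MASK],
# 10% -> random token, 10% -> unchanged. Labels store the original tokens.
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                        # the model must predict the original token
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = random.choice(vocab)   # random replacement
            # else: keep the original token unchanged
    return masked, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
print(mask_tokens(["the", "cat", "sat", "on", "the", "mat"], vocab))
```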
  27. What is the difference between the masking in BERT and CBOW in word2vec?

    Similarities: the core idea of CBOW is to predict a target word from its surrounding context (the words before and after it). BERT is essentially doing the same thing.

    Differences: first, in CBOW every word becomes a target (input) word, while in BERT only 15% of the words do. Second, on the input side, CBOW's input contains only the context of the word to be predicted, while BERT's input is a "complete" sentence containing [MASK] tokens; that is, on the input side BERT replaces the words to be predicted with the [MASK] token.

    In addition, after CBOW training each word has a single fixed embedding, so it cannot handle polysemy; the word embedding (token embedding) produced by BERT incorporates contextual information, so even the same word gets different embeddings in different contexts.

  28. Why does BERT use character granularity rather than word granularity (for Chinese)? Because the MLM pre-training task predicts the masked token with a softmax over the vocabulary: with character granularity the vocabulary is only around 20,000, whereas with word granularity it would contain hundreds of thousands of words, and GPU memory would blow up during training.

  29. The principles and differences between HMM and CRF, and the difference in the complexity of the Viterbi algorithm:

    1. HMM is a generative model, CRF is a discriminative model

    2. HMM is a directed probabilistic graphical model, while CRF is an undirected probabilistic graphical model

    3. The HMM solution process may reach only a local optimum, while CRF can reach a global optimum

    4. HMM makes the Markov assumption, while CRF relies on the Markov property, because the Markov property is the condition used to guarantee or judge whether a probabilistic graph is an undirected graphical model.

    HMM principle, the three classic problems: 1. Probability calculation: given the model λ=(A,B,π) and an observation sequence Q={q1,q2,...,qT}, compute the probability P(Q|λ) of the sequence under the model with the forward-backward algorithm. 2. Learning: given the observation sequence Q={q1,q2,...,qT} with the states unknown, estimate the parameters of λ=(A,B,π) with the Baum-Welch algorithm so that P(Q|λ) is maximized. 3. Prediction (decoding): given the model λ=(A,B,π) and an observation sequence Q={q1,q2,...,qT}, find the state sequence I that maximizes the conditional probability P(I|Q,λ), using the Viterbi algorithm.
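    A compact Viterbi decoder for the prediction problem above, working in log space; the toy A, B, π values are made up for illustration:

```python
# Viterbi: dynamic programming over the best log-probability path to each state,
# with back-pointers to recover the most likely state sequence.
import numpy as np

def viterbi(pi, A, B, obs):
    """pi: (N,), A: (N, N) transitions, B: (N, M) emissions, obs: observation indices."""
    N, T = len(pi), len(obs)
    delta = np.log(pi) + np.log(B[:, obs[0]])        # best log-prob ending in each state
    psi = np.zeros((T, N), dtype=int)                # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)          # (prev_state, state)
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                    # backtrack
        path.append(int(psi[t][path[-1]]))
    return path[::-1], float(delta.max())

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi(pi, A, B, [0, 1, 2]))                  # best state sequence and its log-prob
```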

  30. Why is the SOP in Albert effective?

    ALBERT argues that NSP (Next Sentence Prediction) conflates topic prediction with coherence prediction. For reference, NSP uses two sentences: in a positive pair the second sentence comes from the same document, and in a negative pair it comes from another document. The ALBERT authors argue that inter-sentence coherence, not topic prediction, is the task/loss that really matters, so SOP works as follows:

    Two sentences are used, both from the same document: a positive example keeps the two sentences in their original order, and a negative example swaps their order.
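    A minimal illustration (my own) of how SOP training pairs can be constructed from consecutive segments of one document:

```python
# SOP pair construction: take two consecutive segments from the same document;
# keep the order for a positive example (label 1), swap for a negative (label 0).
import random

def make_sop_pair(segments):
    i = random.randrange(len(segments) - 1)
    a, b = segments[i], segments[i + 1]
    if random.random() < 0.5:
        return (a, b), 1        # correct order -> positive
    return (b, a), 0            # swapped order -> negative

doc = ["Sentence one.", "Sentence two.", "Sentence three."]
print(make_sop_pair(doc))
```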
     

  31. What is Add & Norm in BERT and what is its function?
  32. The difference between local attention and global attention: https://easyai.tech/ai-definition/attention/
  33. Understanding of attention, and its advantages and disadvantages: attention selects a small amount of important information out of a large amount of information and focuses on it, ignoring most of the unimportant information. The larger the weight, the more the model focuses on the corresponding Value; that is, the weight represents the importance of a piece of information, and the Value is that piece of information. Advantages:

    Few parameters

    Compared with CNN and RNN, the model complexity is lower and there are fewer parameters, so the demand on computing power is smaller.

    High speed

    Attention solves the problem that an RNN cannot be computed in parallel: each step of the attention mechanism does not depend on the result of the previous step, so, like a CNN, it can be processed in parallel.

    Good effect

    Before the attention mechanism was introduced, there was a long-standing pain point: long-range information gets weakened, just as someone with a poor memory cannot recall the distant past.

    Disadvantages: it cannot capture position information, i.e. it cannot learn the order of elements in the sequence. This can be remedied by adding positional information, such as position vectors/encodings.
  34. The difference between the two Attention mechanisms of Bahdanau and Luong: https://zhuanlan.zhihu.com/p/129316415
  35. The principle of graph embedding
  36. The principle of TF-IDF   https://blog.csdn.net/zrc199021/article/details/53728499
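    A small from-scratch TF-IDF sketch for question 36, using the textbook definition tf × log(N / df); real libraries such as scikit-learn add smoothing and normalization, so their numbers differ slightly:

```python
# TF-IDF: term frequency within a document times inverse document frequency
# across the corpus; words that appear in every document get weight 0 here.
import math
from collections import Counter

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "dog", "barked"]]

def tf_idf(docs):
    n_docs = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))   # document frequency
    result = []
    for doc in docs:
        tf = Counter(doc)
        result.append({tok: (cnt / len(doc)) * math.log(n_docs / df[tok])
                       for tok, cnt in tf.items()})
    return result

for weights in tf_idf(docs):
    print({k: round(v, 3) for k, v in weights.items()})     # "the" gets 0: it is everywhere
```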
  37. The principle of n-gram language models and the smoothing methods available  https://blog.csdn.net/songbinxu/article/details/80209197
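    A minimal bigram model with add-one (Laplace) smoothing, one of the classic smoothing methods touched on in question 37; the corpus is a toy example:

```python
# Add-one smoothing: P(w2 | w1) = (count(w1, w2) + 1) / (count(w1) + V),
# so unseen bigrams still get a small nonzero probability.
from collections import Counter

corpus = [["<s>", "the", "cat", "sat", "</s>"], ["<s>", "the", "dog", "sat", "</s>"]]
unigrams = Counter(tok for sent in corpus for tok in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))
vocab_size = len(unigrams)

def bigram_prob(w1, w2):
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)

print(bigram_prob("the", "cat"))   # seen bigram
print(bigram_prob("cat", "dog"))   # unseen bigram, still nonzero thanks to smoothing
```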
  38. Solutions to OOV: how does mainstream NLP research deal with out-of-vocabulary words?
  39. Dimensionality reduction of word vectors
  40. What word segmentation techniques are used in NLP, and how do they segment text?
  41. What data augmentation methods are there for NLP?  https://blog.csdn.net/Matrix_cc/article/details/104864223
  42. What are the methods of text preprocessing

Origin: blog.csdn.net/Matrix_cc/article/details/105513836