1. Introduction
Paper: Learning Deep Structured Semantic Models for Web Search using Clickthrough Data
The core idea of this deep learning model for computing text similarity, proposed by Microsoft in 2013, is to map the query and documents into a common low-dimensional semantic space. By maximizing the cosine similarity between the semantic vectors of a query and its clicked documents, it learns an implicit semantic model that serves the retrieval task. DSSM has a wide range of applications, such as search-engine retrieval, ad relevance, question answering, and machine translation.
The network architecture is as follows:
2. Principle
The one-hot (bag-of-words) vector of the input text is first reduced in dimensionality through Word Hashing, then fed into a standard feed-forward neural network that extracts semantic features; the similarity between the resulting semantic vectors is then computed.
2.1 Word hashing
Note that DSSM in the original paper processes English data. The English vocabulary is very large and can be regarded as effectively unbounded, but the set of n-grams over the 26 English letters is a finite set. The paper therefore segments each word into letter n-grams to achieve an initial dimensionality reduction.
For example, for the word good, boundary markers are first added to form "#good#", which is then split into trigrams: [#go, goo, ood, od#]. One potential problem is that the letter-n-gram word hashing of two different words may be identical, i.e., collisions can occur. The author ran statistics and drew the following conclusions:
1. Word hashing effectively achieves dimensionality reduction
2. The proportion of collisions is low
In addition, Word Hashing has another advantage: whereas word-level feature representations have difficulty handling unseen words, letter n-grams can represent them effectively, making the model robust to out-of-vocabulary words.
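As a minimal sketch (not the paper's exact implementation), the letter-trigram split described above can be written as:

```python
def letter_trigrams(word):
    """Split a word into letter trigrams after adding '#' boundary markers."""
    s = f"#{word}#"
    return [s[i:i + 3] for i in range(len(s) - 2)]

print(letter_trigrams("good"))  # ['#go', 'goo', 'ood', 'od#']
```

Each text can then be represented as a sparse count vector over the (finite) trigram vocabulary, which is what gives the dimensionality reduction.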
2.2 Fully connected layer
Each fully connected layer multiplies the input by a weight matrix W and adds a bias b; the activation function is tanh:
After N fully connected layers, the output y is the semantic feature vector extracted by the network.
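In the original paper's notation, with input term vector $x$, hidden layers $l_i$, and output $y$, the forward pass is:

```latex
l_1 = W_1 x, \qquad
l_i = \tanh\!\left(W_i\, l_{i-1} + b_i\right),\; i = 2, \dots, N-1, \qquad
y = \tanh\!\left(W_N\, l_{N-1} + b_N\right)
```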
2.3 Similarity calculation
Similarity calculation:
The similarity between a query and a document is computed as the cosine similarity of their semantic vectors:
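With $y_Q$ and $y_D$ the semantic vectors of the query and the document, this is:

```latex
R(Q, D) = \cos\!\left(y_Q, y_D\right) = \frac{y_Q^{\mathsf{T}} y_D}{\lVert y_Q \rVert \, \lVert y_D \rVert}
```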
Probability value calculation:
Finally, normalize by softmax:
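With smoothing factor $\gamma$, the posterior probability of a document given the query is:

```latex
P(D \mid Q) = \frac{\exp\!\bigl(\gamma\, R(Q, D)\bigr)}{\sum_{D' \in \mathbf{D}} \exp\!\bigl(\gamma\, R(Q, D')\bigr)}
```

where $\mathbf{D}$ is the candidate set, consisting of the clicked document plus randomly sampled unclicked ones.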
Note the smoothing factor γ added here; it is very important. The larger γ is, the more the softmax sharpens the distribution, amplifying high similarities and suppressing low ones. This has a large effect on the training gradients: sometimes it speeds up convergence, and sometimes it leads to overfitting.
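A small sketch (using NumPy, with made-up cosine scores) illustrates how γ sharpens the softmax distribution:

```python
import numpy as np

def posterior(sims, gamma):
    """Softmax over cosine similarities, scaled by the smoothing factor gamma."""
    z = np.exp(gamma * np.asarray(sims, dtype=float))
    return z / z.sum()

sims = [0.9, 0.3, 0.1]        # hypothetical cosine scores; positive doc first
print(posterior(sims, 1.0))   # relatively flat distribution
print(posterior(sims, 10.0))  # sharply peaked on the positive doc
```

With a small γ all candidates receive similar probability mass and gradients are weak; with a large γ nearly all mass concentrates on the top-scoring document.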
2.4 Loss function
The loss function is the negative log-likelihood, which maximizes only the probability of the positive (clicked) documents:
The log loss is used because the softmax above yields a probability, so training minimizes the loss via maximum-likelihood estimation, and the model parameters are optimized by gradient descent.
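In the paper's notation, with $\Lambda$ the network parameters and $D^{+}$ the clicked document for query $Q$, the minimized loss is:

```latex
L(\Lambda) = -\log \prod_{(Q,\, D^{+})} P\!\left(D^{+} \mid Q\right)
```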
3. Experimental results
4. Advantages and disadvantages
The DSSM algorithm has the following advantages and disadvantages:
Advantages:
- It solves one of the biggest problems of LSA, LDA, Autoencoder, and similar methods: dictionary explosion (and the resulting high computational cost). The number of English words is effectively unbounded, but the number of letter n-grams is limited
- Word-based feature representations struggle with unseen words, while letter n-grams can represent them effectively, giving strong robustness
- Use supervised methods to optimize the mapping problem of semantic embedding
- Eliminates manual feature engineering
Disadvantages:
- Word hashing may cause collisions
- DSSM uses a bag of words model, which loses contextual information
- Search-engine ranking is determined by many factors, and the higher a document is ranked, the more likely it is to be clicked. Treating clicks alone as positive/negative labels therefore introduces considerable noise, making training hard to converge
5. Expansion
Chinese processing
As noted above, the paper deals with English data; Word Hashing in particular is effective only for English. Applied to Chinese it backfires, causing a dimensionality explosion. However, the overall idea of using deep learning to obtain feature representations and then computing similarity by cosine is still worth borrowing. It is also worth mentioning that, as the paper points out, the semantic vectors y can be precomputed and stored to reduce online serving cost.
In practice, for Chinese we usually replace the DNN with BiLSTM+attention or BERT, and feed word embeddings directly as input.
Model variants
There are many optimized variants of DSSM, such as CNN-DSSM, LSTM-DSSM, and MV-DSSM; most of them modify the hidden layers. The original papers are as follows:
CNN-DSSM:http://www.iro.umontreal.ca/~lisa/pointeurs/ir0895-he-2.pdf
LSTM-DSSM:https://arxiv.org/pdf/1412.6629.pdf
MV-DSSM:https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/frp1159-songA.pdf
6. Reference links:
DSSM paper reading and summary https://zhuanlan.zhihu.com/p/53326791
DSSM (and CNN-DSSM, LSTM-DSSM) https://blog.csdn.net/jokerxsy/article/details/107169406
DSSM algorithm-calculate text similarity https://www.cnblogs.com/wmx24/p/10157154.html
Several models of semantic matching based on deep learning DSSM, ESIM, BIMPM, ABCNN https://blog.csdn.net/pengmingpengming/article/details/88534968
My knowledge is limited; if there are any mistakes, please point them out. Thank you!
Parts of this post draw on other blog posts; if there is any infringement, please contact me!