Semantic similarity matching (1)-DSSM model

1. Introduction

Paper: Learning Deep Structured Semantic Models for Web Search using Clickthrough Data

DSSM is a deep learning model for computing text similarity proposed by Microsoft in 2013. Its core idea is to map the query and the document into a common low-dimensional semantic space; by maximizing the cosine similarity between the query and doc semantic vectors on clickthrough data, it learns a latent semantic model that can then be used for retrieval. DSSM has a wide range of applications, such as search engine retrieval, ad relevance, question answering, and machine translation.

The network framework is as follows:

2. Principle

The one-hot representation of the input text is first reduced in dimension by Word Hashing, then fed into a conventional feed-forward neural network that extracts semantic features; finally, the similarity between the resulting semantic vectors is computed.

2.1 Word hashing

Note that the original paper works with English data. The English vocabulary is huge and can essentially be treated as an unbounded set, whereas the set of n-grams over the 26 letters is finite. The paper therefore splits each word into letter n-grams, which already provides a substantial dimensionality reduction.

For example, for the word good, start and end markers are first added to give "#good#", which is then split into letter trigrams: [#go, goo, ood, od#]. One potential problem is that different words may produce the same set of trigrams, i.e. word hashing can collide. The authors ran statistics and reached the following conclusions:

1. Word hashing achieves an effective dimensionality reduction

2. The proportion of collisions is low

Word Hashing has a further advantage: word-level feature representations have difficulty handling new words, whereas letter n-grams can still represent them effectively, which makes the model robust to out-of-vocabulary input.
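As an illustration, here is a minimal Python sketch of letter-trigram word hashing (the function names and the toy trigram vocabulary are mine, not from the paper):

```python
from collections import Counter

def letter_trigrams(word):
    """Add boundary markers and split into letter trigrams: good -> #go, goo, ood, od#."""
    marked = f"#{word}#"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]

def word_hash_vector(text, trigram_vocab):
    """Bag-of-trigrams count vector for a whitespace-tokenized text over a fixed trigram vocabulary."""
    counts = Counter(t for w in text.lower().split() for t in letter_trigrams(w))
    return [counts.get(t, 0) for t in trigram_vocab]

if __name__ == "__main__":
    print(letter_trigrams("good"))  # ['#go', 'goo', 'ood', 'od#']
    vocab = sorted({t for w in ["good", "best", "buy"] for t in letter_trigrams(w)})
    print(word_hash_vector("good buy", vocab))
```

The resulting sparse trigram-count vector is what replaces the raw one-hot word vector as the network input.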

2.2 Fully connected layer

Each fully connected layer multiplies its input by a weight matrix W and adds a bias b, with tanh as the activation function. In the paper's notation, with x the word-hashed input and N layers:

l_1 = W_1 x
l_i = f(W_i l_{i-1} + b_i),  i = 2, ..., N-1
y = f(W_N l_{N-1} + b_N)
f(x) = (1 - e^{-2x}) / (1 + e^{-2x})  (i.e. tanh)

After the N layers, the output y is the semantic feature vector extracted by the network.
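A minimal NumPy sketch of such a tanh tower (the layer sizes and random initialization below are illustrative, not necessarily the paper's exact configuration):

```python
import numpy as np

def init_tower(layer_sizes, seed=0):
    """Random weights/biases for a fully connected tower, e.g. [30000, 300, 300, 128]."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x, params):
    """Apply tanh(W h + b) layer by layer; x is the word-hashed input vector."""
    h = x
    for W, b in params:
        h = np.tanh(h @ W + b)
    return h  # the semantic vector y

if __name__ == "__main__":
    params = init_tower([30000, 300, 300, 128])
    x = np.zeros(30000)
    x[[3, 17, 256]] = 1.0            # toy sparse word-hashing vector
    print(forward(x, params).shape)  # (128,)
```

The query and the doc each go through such a tower, producing y_Q and y_D respectively.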

2.3 Similarity calculation

Similarity calculation:

The relevance of a document to a query is measured by the cosine similarity between their semantic vectors:

R(Q, D) = cos(y_Q, y_D) = y_Q^T y_D / (||y_Q|| ||y_D||)

Probability calculation:

The cosine scores are then normalized into posterior probabilities with a softmax:

P(D | Q) = exp(γ R(Q, D)) / Σ_{D'} exp(γ R(Q, D'))

where the sum runs over the candidate documents for the query (the clicked document plus several randomly sampled unclicked ones) and γ is a smoothing factor.

Note the smoothing factor γ: it matters a great deal. The larger γ is, the sharper the softmax becomes, pushing the probability of the positive example up and those of the negatives down. This has a large impact on the training gradients: it can speed up convergence, but it can also lead to overfitting.
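A small NumPy sketch of this relevance-plus-softmax step (variable names and the γ value are mine; in practice γ is tuned on held-out data):

```python
import numpy as np

def cosine(a, b):
    """R(Q, D) = cos(y_Q, y_D)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def doc_posteriors(query_vec, doc_vecs, gamma=10.0):
    """P(D | Q): softmax over gamma * R(Q, D) across the candidate docs."""
    scores = gamma * np.array([cosine(query_vec, d) for d in doc_vecs])
    scores -= scores.max()            # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=128)
    docs = rng.normal(size=(5, 128))  # e.g. 1 clicked doc + 4 sampled negatives
    print(doc_posteriors(q, docs))
```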

2.4 Loss function

The loss function is a logarithmic (negative log-likelihood) loss that maximizes only the probability of the clicked, positive documents:

L(Λ) = -log Π_{(Q, D+)} P(D+ | Q)

The log loss is used because the model outputs probabilities: training amounts to maximum likelihood estimation over the clicked documents, i.e. minimizing the negative log-likelihood above, and the model parameters are optimized with gradient descent.
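Continuing the NumPy sketch, the loss over a batch could be computed as below (here index 0 in each candidate group is assumed to be the clicked document; that convention is mine):

```python
import numpy as np

def dssm_loss(query_vecs, doc_vec_groups, gamma=10.0):
    """Mean of -log P(D+ | Q) over queries; doc index 0 in each group is the positive."""
    total = 0.0
    for q, docs in zip(query_vecs, doc_vec_groups):
        # cosine similarity between the query vector and each candidate doc vector
        sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q) + 1e-12)
        logits = gamma * sims
        log_probs = logits - logits.max() - np.log(np.exp(logits - logits.max()).sum())
        total += -log_probs[0]        # negative log-likelihood of the clicked doc
    return total / len(query_vecs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    queries = rng.normal(size=(8, 128))
    groups = rng.normal(size=(8, 5, 128))  # 1 positive + 4 negatives per query
    print(dssm_loss(queries, groups))
```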

3. Experimental results

4. Advantages and disadvantages

The DSSM algorithm has the following advantages and disadvantages:

Advantages:

  • Solves one of the biggest problems of LSA, LDA, autoencoders and similar methods: vocabulary explosion and the resulting high computational cost, because while the number of English words is essentially unbounded, the number of letter n-grams is limited
  • Word-level feature representations handle new words poorly, whereas letter n-grams can still represent them effectively, so the model is robust
  • Uses a supervised objective (clickthrough data) to learn the semantic embedding mapping
  • Eliminates manual feature engineering

Disadvantages:

  • Word hashing may cause collisions
  • DSSM uses a bag-of-words representation, so word order and contextual information are lost
  • A search engine's ranking is determined by many factors, and a higher-ranked doc is more likely to be clicked regardless of its relevance; using clicks alone to label positive and negative samples therefore introduces considerable noise and makes training hard to converge

5. Extensions

Chinese processing

As noted, the paper works with English data. Word Hashing in particular only makes sense for English; applied to Chinese it would backfire and blow up the dimensionality, since Chinese has far more characters than English has letters. However, the overall idea of using a deep network to obtain feature representations and then computing similarity with cosine is still worth borrowing. It is also worth mentioning that, as the paper points out, the document-side semantic vectors y can be precomputed and stored to reduce online serving cost.

In practice, when working with Chinese we usually replace the DNN towers with BiLSTM + attention or BERT, and feed word (or character) embeddings directly as the input.
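As a rough illustration of this swap, here is a minimal PyTorch sketch of a BiLSTM-based encoder tower with simple mean pooling instead of attention (all names and hyperparameters are mine; a pre-trained BERT encoder could be plugged in at the same place):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLSTMEncoder(nn.Module):
    """Encodes a token-id sequence into a fixed-size semantic vector, replacing the DNN tower."""
    def __init__(self, vocab_size=30000, emb_dim=128, hidden=128, out_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, out_dim)

    def forward(self, token_ids):
        states, _ = self.lstm(self.emb(token_ids))   # (batch, seq_len, 2 * hidden)
        pooled = states.mean(dim=1)                  # mean pooling; an attention layer could go here
        return torch.tanh(self.proj(pooled))

if __name__ == "__main__":
    encoder = BiLSTMEncoder()
    queries = torch.randint(1, 30000, (4, 12))       # batch of tokenized queries
    docs = torch.randint(1, 30000, (4, 20))          # batch of tokenized docs
    sims = F.cosine_similarity(encoder(queries), encoder(docs), dim=-1)
    print(sims.shape)                                # torch.Size([4])
```

The cosine/softmax scoring and the loss stay exactly as in the original DSSM; only the encoder changes.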

Model variants

DSSM has many optimized variants, such as CNN-DSSM, LSTM-DSSM, and MV-DSSM; most of them modify the hidden (representation) layers. The original papers are:

CNN-DSSM: http://www.iro.umontreal.ca/~lisa/pointeurs/ir0895-he-2.pdf

LSTM-DSSM: https://arxiv.org/pdf/1412.6629.pdf

MV-DSSM: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/frp1159-songA.pdf

 

6. References:

DSSM paper reading and summary: https://zhuanlan.zhihu.com/p/53326791

DSSM (and CNN-DSSM, LSTM-DSSM): https://blog.csdn.net/jokerxsy/article/details/107169406

DSSM algorithm - calculating text similarity: https://www.cnblogs.com/wmx24/p/10157154.html

Several deep-learning-based semantic matching models (DSSM, ESIM, BIMPM, ABCNN): https://blog.csdn.net/pengmingpengming/article/details/88534968

 

 

My knowledge is limited; if there is any mistake, please point it out, thank you!

Parts of this post draw on other blog posts; if there is any infringement, please contact me!

 
