DSSM paper reading notes: Learning Deep Structured Semantic Models for Web Search


Summary

[Semantic vectors] doc and query are mapped into a latent semantic space, and relevance is measured by vector similarity
[Discriminative model] DSSM is proposed as a discriminative model (so the training idea is to maximize a conditional probability, as opposed to maximum a posteriori, i.e. maximizing the likelihood times a prior over the model parameters, or plain maximum likelihood)
[Word hashing] a method for handling massive vocabularies is proposed

Brief introduction

Conventional matching methods in the web search scenario:

First, word-level matching (literal matching)
  • Representatives: TF-IDF, BM25 (a minimal BM25 sketch follows this list)
  • Shortcomings: cannot handle polysemy or synonymy, and ignores word order and deeper semantics (e.g. "deep learning" vs. "learning depth")
Second, document–query semantic matching
  • Latent semantic models
  • Representatives: LSA, PLSA, LDA
  • Shortcomings: trained with unsupervised objectives, so the objective function is only loosely connected to the actual retrieval task
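As a reference point for the lexical-matching family above, here is a minimal BM25 scorer in plain Python; the k1/b defaults and the toy numbers are illustrative, not from this post:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avgdl, k1=1.5, b=0.75):
    """Okapi BM25 score of one doc for a query -- purely literal term matching."""
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        if t not in tf:
            continue  # no exact term overlap -> no credit (the weakness noted above)
        df = doc_freq.get(t, 0)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        denom = tf[t] + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf[t] * (k1 + 1) / denom
    return score

# toy usage: word order is ignored -- both docs get the same score
df, N, avgdl = {"deep": 3, "learning": 5}, 10, 2.0
print(bm25_score(["deep", "learning"], ["deep", "learning"], df, N, avgdl))
print(bm25_score(["deep", "learning"], ["learning", "deep"], df, N, avgdl))
```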

There are two ways to extend latent semantic models:

[Click-through data] (I did not fully understand this part)

[Deep auto-encoder]
- learn the hierarchy between query and doc with deep learning methods
- shortcomings: still unsupervised, so in practice the model is not much better than keyword matching; training it also requires large-scale matrix computation

Third, the new method: DSSM
  • [DNN] used to rank docs for a given query: query and doc are each non-linearly mapped into a common low-dimensional semantic space, and relevance is computed as cosine similarity there; the DNN is a discriminative model trained on click-through data, and unlike latent semantic models it is optimized directly for ranking web docs (see the sketch below)
  • [Word hashing] handles very large vocabularies: the high-dimensional query/doc vectors are mapped to lower-dimensional vectors built from letter tri-grams
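A minimal PyTorch sketch of this idea; the 30K → 300 → 300 → 128 layer sizes and tanh activations follow the paper, while using one tower shared by query and doc is my simplification, not something stated in this post:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSSMTower(nn.Module):
    """Maps a word-hashed (letter-tri-gram) vector to a low-dimensional semantic vector."""
    def __init__(self, trigram_dim=30_000, hidden=300, out=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(trigram_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, out), nn.Tanh(),
        )

    def forward(self, x):
        return self.net(x)

def relevance(query_vec, doc_vec):
    """R(Q, D): cosine similarity between the two semantic vectors."""
    return F.cosine_similarity(query_vec, doc_vec, dim=-1)
```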

Related work

DSSM builds on two lines of prior research.

Latent semantic models and click-through data

Latent semantic models typically use SVD to map the term–document matrix into a low-dimensional space and then compute cosine similarity there.
In addition, translation models can be used for semantic matching. Such a model is trained on pairs consisting of a query and a document clicked for it. The difference from latent semantic models is that the translation model directly learns the relationship between pairs of terms (a term in the query and a term in the doc). Given a large amount of click data, this kind of model performs very well.

Deep learning

Semantic hashing was proposed for information retrieval, i.e. using the features of the final layer of a deep auto-encoder. It is a two-step process:

  • the term-vector representation of a doc is mapped into a low-dimensional space by a stack of RBMs;
  • the model parameters are then fine-tuned to minimize a cross-entropy (reconstruction) loss.

The output of the intermediate (bottleneck) layer can then be used as a feature for ranking documents.
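A rough sketch of this two-step idea in PyTorch, with the RBM pre-training replaced by a plain auto-encoder fine-tuned with a cross-entropy reconstruction loss; the hidden layer sizes are illustrative assumptions, and only the 2,000-word input dimension comes from the paper:

```python
import torch
import torch.nn as nn

class DocAutoEncoder(nn.Module):
    """Auto-encoder over doc term vectors; the bottleneck code is the 'semantic hash'."""
    def __init__(self, vocab=2000, bottleneck=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(vocab, 500), nn.Sigmoid(),
            nn.Linear(500, bottleneck), nn.Sigmoid(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 500), nn.Sigmoid(),
            nn.Linear(500, vocab),
        )

    def forward(self, x):
        code = self.encoder(x)       # low-dimensional doc feature, later used for ranking
        return code, self.decoder(code)

model = DocAutoEncoder()
x = torch.rand(8, 2000)
x = x / x.sum(dim=1, keepdim=True)   # treat the doc term vector as a distribution
code, recon = model(x)
# cross-entropy between the doc's term distribution and its reconstruction
loss = -(x * torch.log_softmax(recon, dim=1)).sum(dim=1).mean()
loss.backward()
```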

Shortcomings:
1. Given a query, it cannot distinguish docs that are relevant to the query from those that are not (because the model is an auto-encoder over the docs, i.e. it only learns to reconstruct them);
2. To reduce computation, only the 2,000 most frequent words are used to represent a doc.

DSSM

Obtaining semantic features with a DNN

Input: a high-dimensional bag-of-words (BoW) feature, i.e. the raw (unnormalized) count of each term in the query or doc.
Output: a low-dimensional semantic feature vector.
The high-dimensional feature is not fed into the DNN directly; it first goes through word hashing.

Word hashing

Each word is padded with a leading and trailing #, then split into letter tri-grams. Two different words will rarely produce the same tri-gram set; the paper gives statistics showing the collision probability is very low: a 500K-word vocabulary can be reduced to about 30K dimensions with a collision rate of only 0.0044%.
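A minimal implementation of this letter-tri-gram word hashing in plain Python; in practice the tri-gram vocabulary (~30K entries) would be built over the whole corpus, and the toy vocabulary below is only for illustration:

```python
def letter_trigrams(word):
    """'good' -> ['#go', 'goo', 'ood', 'od#'] after adding the # boundary markers."""
    padded = f"#{word}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def word_hash(text, trigram_index):
    """Bag-of-letter-tri-grams vector for a query or doc (raw counts, not normalized)."""
    vec = [0] * len(trigram_index)
    for word in text.lower().split():
        for tri in letter_trigrams(word):
            if tri in trigram_index:
                vec[trigram_index[tri]] += 1
    return vec

# toy usage with a tiny tri-gram vocabulary
vocab = sorted(set(letter_trigrams("good") + letter_trigrams("food")))
index = {t: i for i, t in enumerate(vocab)}
print(word_hash("good food", index))
```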

DSSM training process

Assume that a query and a doc clicked for it are relevant (at least partially). The model parameters are then learned in a supervised way, i.e. by maximizing the conditional probability that a doc is clicked given the query. The objective function is the cross-entropy based on the clicked docs' probabilities. The network parameters {W, b} are differentiable, so the objective is optimized by gradient descent.
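In the paper's notation ($y_Q$, $y_D$ are the DNN outputs, $\gamma$ a smoothing factor, $\mathbf{D}$ the candidate set containing the clicked doc $D^+$, $\Lambda$ the parameters $\{W, b\}$):

$$R(Q, D) = \cos(y_Q, y_D) = \frac{y_Q^\top y_D}{\lVert y_Q \rVert \, \lVert y_D \rVert}$$

$$P(D^+ \mid Q) = \frac{\exp\left(\gamma\, R(Q, D^+)\right)}{\sum_{D' \in \mathbf{D}} \exp\left(\gamma\, R(Q, D')\right)}$$

$$L(\Lambda) = -\log \prod_{(Q, D^+)} P(D^+ \mid Q)$$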

Overall flow chart

Positive and negative samples are labeled by whether the doc was clicked. The positive-to-negative ratio is 1 : 4, with the negative samples drawn at random. The paper notes that how the negative samples are drawn has no significant effect on the final result (honestly, this was my biggest takeaway from reading the paper). A sketch of one such training step follows below.
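A sketch of one training step with 1 clicked doc and 4 randomly sampled negatives per query, reusing the hypothetical DSSMTower from the earlier sketch; the smoothing factor gamma and the learning rate are assumed values, not taken from this post:

```python
import torch
import torch.nn.functional as F

tower = DSSMTower()                              # shared-tower assumption, see earlier sketch
optimizer = torch.optim.SGD(tower.parameters(), lr=0.1)
gamma = 10.0                                     # softmax smoothing factor (assumed value)

# fake word-hashed inputs: one query, its clicked doc, and 4 randomly sampled negatives
query = torch.rand(1, 30_000)
docs = torch.rand(5, 30_000)                     # row 0 = clicked doc, rows 1-4 = negatives

q_vec = tower(query)                             # (1, 128)
d_vec = tower(docs)                              # (5, 128)
cos = F.cosine_similarity(q_vec, d_vec, dim=-1)  # R(Q, D) for the 5 candidates

# softmax over the candidates; cross-entropy with the clicked doc (index 0) as target
loss = -F.log_softmax(gamma * cos, dim=0)[0]

optimizer.zero_grad()
loss.backward()
optimizer.step()
```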
