Factors that affect relevance scoring in ES

Relevance score
  The degree to which a document is relevant to the query. The list of documents that match the query is obtained from the inverted index.
 
How do we put the documents that best meet the user's query at the front of the results?
  In essence this is a sorting problem: the documents found through the inverted index are ordered by relevance score, and the score decides which documents come first.
 
Parameters that affect the relevance score:
  A. TF (Term Frequency): the number of times a term appears in the document. The higher the term frequency, the higher the relevance. Formula: tf(t in d) = √frequency
  B. DF (Document Frequency): the number of documents in the collection in which the term appears.
  C. IDF (Inverse Document Frequency): the opposite of document frequency, roughly 1 / DF. The fewer documents a term appears in, the higher the relevance; a term that appears in many documents of the collection carries less weight. Formula: idf(t) = 1 + log(numDocs / (docFreq + 1))
  D. Field-length Norm: normalization by field length, i.e. how long the field is. The shorter the field, the higher the weight. A term that appears in a short field such as title is more likely to be what that field is about than the same term in a much longer body field. Formula: norm(d) = 1 / √numTerms
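The three formulas above can be tried out with a few lines of Python. This is only a minimal sketch of the formulas as written here, not Lucene's actual code; the function names and the sample numbers are made up for illustration, and log is taken to be the natural logarithm.

```python
import math


def tf(frequency: int) -> float:
    """Term frequency factor: square root of the number of occurrences."""
    return math.sqrt(frequency)


def idf(num_docs: int, doc_freq: int) -> float:
    """Inverse document frequency: 1 + log(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_norm(num_terms: int) -> float:
    """Field-length norm: 1 / sqrt(numTerms)."""
    return 1.0 / math.sqrt(num_terms)


# Hypothetical numbers, chosen only for illustration.
print("tf, 4 occurrences:        ", tf(4))
print("idf, 9 of 1000 docs:      ", idf(1000, 9))
print("idf, 999 of 1000 docs:    ", idf(1000, 999))
print("norm, 4-term title field: ", field_norm(4))
print("norm, 400-term body field:", field_norm(400))
```

The rare term gets a much larger idf than the term that appears in almost every document, and the short title field gets a larger norm than the long body field, which matches the descriptions in B to D.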
 
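•  TF/IDF model
  score(q, d) = queryNorm(q) · coord(q, d) · Σ (t in q) [ tf(t in d) · idf(t)² · t.getBoost() · norm(t, d) ]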
  a) score(q, d): the relevance score of document d for query q
  b) queryNorm(q): the query normalization factor, an attempt to normalize a query so that results from two different queries can be compared
  c) coord(q, d): the coordination factor, which rewards documents that contain more of the query terms
  d) tf(t in d): the frequency of term t in document d
  e) idf(t): the inverse document frequency of term t
  f) t.getBoost(): the custom boost applied to the term in the query
  g) norm(t, d): the field-length normalization value of document d
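To see how factors a) to g) fit together, here is a rough Python sketch that multiplies them as score(q, d) = queryNorm(q) · coord(q, d) · Σ tf · idf² · boost · norm. It is a simplification under stated assumptions, not the Lucene implementation: practical_score and its parameters are hypothetical names, queryNorm is simply passed in (default 1.0), coord is taken as the fraction of query terms found in the document, and a single field with one length is assumed.

```python
import math


def practical_score(query_terms, doc_term_freqs, num_terms, num_docs,
                    doc_freqs, boosts=None, query_norm=1.0):
    """Sketch of queryNorm * coord * sum(tf * idf^2 * boost * norm)."""
    if not query_terms:
        return 0.0
    boosts = boosts or {}
    norm = 1.0 / math.sqrt(num_terms)      # g) field-length norm
    matched = 0
    total = 0.0
    for term in query_terms:
        freq = doc_term_freqs.get(term, 0)
        if freq == 0:
            continue                       # term not in the document
        matched += 1
        tf = math.sqrt(freq)               # d) term frequency factor
        idf = 1.0 + math.log(num_docs / (doc_freqs[term] + 1))  # e)
        boost = boosts.get(term, 1.0)      # f) query-time boost
        total += tf * idf ** 2 * boost * norm
    coord = matched / len(query_terms)     # c) share of query terms matched
    return query_norm * coord * total      # a) the final score


# Hypothetical index statistics and documents.
stats = {"quick": 9, "fox": 9}
one_term = practical_score(["quick", "fox"], {"quick": 1}, 4, 1000, stats)
both_terms = practical_score(["quick", "fox"], {"quick": 1, "fox": 1}, 4, 1000, stats)
print(one_term, both_terms)
```

The document that contains both query terms scores higher twice over: it collects a second summand, and its coord factor is 1.0 instead of 0.5.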
 
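• BM25 model (the default model since 5.x)
  score(D, Q) = Σ (q in Q) [ IDF(q) · TF(q, D) · (k1 + 1) / (TF(q, D) + k1 · (1 - b + b · |D| / avgdl)) ]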
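  a) |D|: the length of the document
  b) avgdl: the average document length over all documents
  c) k1, b: free parameters; the Lucene defaults are k1 = 1.2, b = 0.75
  d) IDF = log((#Docs - #DocsHit + 0.5) / (#DocsHit + 0.5)), where #Docs is the total number of documents and #DocsHit is the number of documents that contain the term
  e) TF: the number of times the query term occurs in the document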
 
 
 
Compared with TF/IDF, a major improvement of BM25 is that it reduces the weight of tf when the term frequency gets very high, so that an excessive term frequency does not dominate the query score.
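The saturation effect is easy to see by printing both tf factors side by side. The sketch below is only an illustration: it compares √frequency from the TF/IDF model with the BM25 tf part, assumes the document has exactly average length (so the length factor is 1), and uses k1 = 1.2; the function names are made up.

```python
import math

K1 = 1.2  # Lucene's default k1, as listed above


def classic_tf(freq: int) -> float:
    """tf factor of the TF/IDF model: grows without bound."""
    return math.sqrt(freq)


def bm25_tf(freq: int, k1: float = K1) -> float:
    """tf part of BM25 for a document of average length (|D| == avgdl)."""
    return freq * (k1 + 1) / (freq + k1)


for freq in (1, 2, 5, 10, 100, 1000):
    print(f"freq={freq:>5}  tf/idf={classic_tf(freq):7.2f}  bm25={bm25_tf(freq):5.2f}")
```

The TF/IDF factor keeps growing with the frequency, while the BM25 factor approaches k1 + 1 = 2.2 and never exceeds it, so repeating a term hundreds of times adds almost nothing to the score.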


Origin www.cnblogs.com/sx66/p/11885441.html