Solr score formula

The following is summarized from Solr in Action.

The score is made up of:

  • Term frequency (tf). How many times the query term appears in the current document.
  • Inverse document frequency (idf). In how many of all the documents the query term appears.
  • Term boost. The weight of the term itself.
  • Normalization factors:
    • Field norm:
      • Document weight (document boost).
      • Field weight (field boost).
      • Length normalization. Removes the advantage of long documents, since term frequencies in long documents are generally higher.
    • Coordination factor. Prevents a term that occurs many times in a single document from inflating the total score; the goal is to rank documents that contain all of the query terms higher.

Each factor is described in detail below.

The following is reproduced from the web. Original address: http://tec.5lulu.com/detail/110d8n2ehpg2j85ec.html

Brief

 

One of the retrieval models commonly used by search engines to compute content similarity is the:

Vector Space Model

    Rather than describing every similarity calculation method, note that the Solr index files .tvx, .tvd, and .tvf store the term vector information; let us first look at how term vectors are used to reflect the degree of similarity.

 

    A document d1 is represented by its term vector v(d1). Given a query and a document, how is their similarity computed? Consider the following formula, which uses these concepts: term frequency (tf), inverse document frequency (idf), term boosts (t.getBoost), field normalization (norm), coordination factor (coord), and query normalization (queryNorm).
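
    Putting these factors together, the classic Lucene TF-IDF practical scoring formula (the same one the step-by-step derivation later in this post arrives at) is:

    score(q,d) = coord(q,d) • queryNorm(q) • ∑ ( tf(t in d) • idf(t)^2 • t.getBoost() • norm(t,d) )

    where the sum runs over every term t of the query q.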

 

 

  • tf(t in d): the frequency of term t in document d.
  • idf(t): the inverse document frequency of t.
  • t.getBoost(): the boost of the term itself, also called the term weight.
  • norm(t,d): the normalization factor, which includes three parameters:
    • d.getBoost(): the higher this value, the more important the document. Also known as the document weight.
    • f.getBoost(): the higher this value, the more important the field. Also known as the field weight.
    • lengthNorm: the more terms a field contains (i.e. the longer the document), the smaller this value; the shorter the document, the larger the value. It is also called length normalization, and its purpose is to remove the advantage of long fields.

 coord(q, d): a query may contain several terms, and a document may contain several of them; the more query terms a document contains, the higher that document's score: numTermsInDocumentFromQuery / numTermsInQuery

 queryNorm(q): a normalization value computed from the sum of squared weights of the query terms. It does not affect the ranking within a single query; it only makes scores from different queries comparable.

PS: a rough understanding is enough at this point; the details below can be studied later.

Scoring mechanism

tf represents how well a term matches a document: the more times the term appears in a document, the more important it is to that document and the better the match; the fewer times it appears, the weaker the match. Note, however, that importance is not proportional to the raw count: a term that appears n times does not make the document n times as important, so the square root of the count is used.

tf(t in d) = numTermOccurrencesInDocument^(1/2) = sqrt(numTermOccurrencesInDocument)
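
For example, a term that occurs 4 times in a document gets tf = sqrt(4) = 2 and one that occurs 9 times gets tf = 3, so nine occurrences count only three times as much as a single occurrence rather than nine times as much.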

  • idf indicates how many documents contain the term. Unlike tf, the more documents a term appears in, the less important the term is: a term contained in a large fraction of the documents is not very helpful for telling them apart. By analogy with the skills we learn as programmers: for the programmer personally, the deeper a technology is mastered, the better (mastering it more deeply means spending more time on it, i.e. a larger tf), and the more competitive he or she is when job hunting; but across all programmers, the fewer people who know the technology, the better (a small document frequency), which also makes one more competitive. Being hard to replace is what creates value.

    idf(t) = 1 + log (numDocs / (docFreq +1)) 
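
    For example (using the natural logarithm, which is what the Math.log call in DefaultSimilarity below uses), with numDocs = 1000 a term found in 9 documents gets idf(t) = 1 + ln(1000/10) ≈ 5.61, while a term found in 999 documents gets idf(t) = 1 + ln(1000/1000) = 1.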

  • t.getBoost: boosting is the process of manually raising the weight of a term. A boost can be added either at index time or at query time; since query-time boosting is more flexible, only query boosts are discussed here. A boost can be applied to a phrase as well as to a single term, and is expressed in the query with ^ followed by a number (a programmatic sketch follows these examples):
  • title:(solr in action)^2.5 applies a boost of 2.5 to the phrase solr in action in the title field
  • title:(solr in action) uses the default boost of 1.0
  • title:(solr^2 in^.01 action^1.5)^3 OR "solr in action"^2.5
  • norm(t, d), i.e. the field norm, is composed of the document boost, the field boost and lengthNorm. Unlike t.getBoost(), which can be set dynamically at query time, the f.getBoost() and d.getBoost() inside the norm can only be set at index time; changing either of them requires rebuilding the index. Their values are stored in the .nrm file.

     

    norm(t,d) = d.getBoost() • lengthNorm(f) • f.getBoost() 
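
    For example (ignoring for the moment the byte-level precision loss described below), a document boost of 2.0, a field boost of 1.5 and a 16-term field give norm(t,d) = 2.0 • (1/sqrt(16)) • 1.5 = 0.75.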

  • d.getBoost() is the document boost; it is implemented by setting the boost on every Field of the document.
  • f.getBoost() is the field boost. One thing worth mentioning here is Solr/Lucene's multi-valued field indexing, i.e. a document in which the same field appears with several values, as in the code below. When several fields with the same name appear in one document, the inverted index and the term vector logically append their tokens to the same field. The values themselves are stored separately and in order within the document, so when you retrieve the document during a search you will find multiple Field instances. In the example below, querying author:lucene returns a document with two author fields; this is the multi-valued field situation.
  • // General pattern: adding several values under the same field name creates a multi-valued field
    Document doc = new Document();
    for (String author : authors) {
        doc.add(new Field("author", author, Field.Store.YES, Field.Index.ANALYZED));
    }
     
    // First, index a document with a multi-valued "author" field
    Directory dir = FSDirectory.open(new File("/Users/rcf/workspace/java/solr/Lucene"));
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_48, new WhitespaceAnalyzer(Version.LUCENE_48));
    @SuppressWarnings("resource")
    IndexWriter writer = new IndexWriter(dir, indexWriterConfig);
    Document doc = new Document();
    doc.add(new Field("author", "lucene", Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("author", "solr", Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("text", "helloworld", Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);
    writer.commit();
    // Then query the multi-valued field
    IndexReader reader = DirectoryReader.open(dir); // DirectoryReader.open replaces IndexReader.open in Lucene 4.x
    IndexSearcher search = new IndexSearcher(reader);
    Query query = new TermQuery(new Term("author", "lucene"));
    TopDocs docs = search.search(query, 1);
    Document hit = search.doc(docs.scoreDocs[0].doc); // renamed so it does not redeclare doc
    for (IndexableField field : hit.getFields()) {
        System.out.println(field.name() + ":" + field.stringValue());
    }
    System.out.print(docs.totalHits);
    // Output:
    author:lucene
    author:solr
    text:helloworld
    2
  • When boosts are set on a multi-valued field, how is the overall field boost computed? The boosts of the individual values are simply multiplied together. For the title field in this example, the first value has boost 3.0, the second 1.0 and the third 0.5, so the result is 3 • 1 • 0.5 = 1.5.
  • Boost: (3) • (1) • (0.5) = 1.5
  • lengthNorm is the reciprocal of the square root of the number of terms in the field (the number of terms in a field is defined as the field length). The longer the field, the smaller its norm and the less a single matching term counts; the shorter the field, the more each matching term counts. This is intuitive: if "Beijing" appears once in a 10-word title and twice in a 200-word body, the title is still the better-matching field.
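
    Written out (and ignoring the field boost and the byte-level precision loss noted below), lengthNorm(f) = 1 / sqrt(numTermsInField). For the example above, the 10-word title gets lengthNorm ≈ 1/sqrt(10) ≈ 0.32 while the 200-word body gets ≈ 1/sqrt(200) ≈ 0.07, so tf • lengthNorm is about sqrt(1) • 0.32 = 0.32 for the title versus sqrt(2) • 0.07 ≈ 0.10 for the body.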

     

  • Finally, note that the document boost, field boost and lengthNorm are encoded into a single byte when stored in the index; some precision is lost during encoding and decoding, but the impact of this loss on the similarity calculation is minimal.
  • queryNorm is computed from the sum of the squared weights of the query terms. It does not affect the ranking: for a given query it multiplies every document's score by the same factor, so it does not change the order of the results; its only purpose is to make scores from different queries comparable.

    queryNorm(q) = 1 / sqrt(sumOfSquaredWeights)

    sumOfSquaredWeights = q.getBoost()^2 • ∑ ( idf(t) • t.getBoost() )^2
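
    Continuing the earlier illustration: for a one-term query with idf(t) ≈ 5.61, t.getBoost() = 1 and q.getBoost() = 1, sumOfSquaredWeights = 1^2 • (5.61 • 1)^2 ≈ 31.5 and queryNorm(q) = 1/sqrt(31.5) ≈ 0.178.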

    coord(q, d) reflects how many of the query terms a document contains: the more query terms appear in a document, the higher that document's score.

    coord(q, d) = numTermsInDocumentFromQuery / numTermsInQuery 

            For example, for the query: Accountant AND ("San Francisco" OR "New York" OR "Paris"),

            a document containing three of the four terms gets coord = 3/4, and one containing all of them gets coord = 4/4.

Source code

    The formulas above describe how the similarity value is calculated; now let us look at how Solr implements them. This part is implemented in the DefaultSimilarity class.

@Override
public float coord(int overlap, int maxOverlap) {
    // fraction of the query terms that matched in this document
    return overlap / (float) maxOverlap;
}

@Override
public float queryNorm(float sumOfSquaredWeights) {
    return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
}

@Override
public float lengthNorm(FieldInvertState state) {
    final int numTerms;
    if (discountOverlaps)
        numTerms = state.getLength() - state.getNumOverlap();
    else
        numTerms = state.getLength();
    // field boost * 1/sqrt(number of terms in the field)
    return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));
}

@Override
public float tf(float freq) {
    return (float) Math.sqrt(freq);
}

@Override
public float idf(long docFreq, long numDocs) {
    return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
}
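
A quick way to get a feel for these methods is to call them directly. The following is only a sanity-check sketch, assuming Lucene 4.x is on the classpath (where DefaultSimilarity lives in org.apache.lucene.search.similarities); the input numbers are the illustrative ones used earlier in this post:

import org.apache.lucene.search.similarities.DefaultSimilarity;
 
public class SimilarityCheck {
    public static void main(String[] args) {
        DefaultSimilarity sim = new DefaultSimilarity();
        System.out.println(sim.tf(4f));           // 2.0   = sqrt(4)
        System.out.println(sim.idf(9, 1000));     // ~5.61 = 1 + ln(1000/10)
        System.out.println(sim.coord(3, 4));      // 0.75  = 3/4
        System.out.println(sim.queryNorm(31.5f)); // ~0.178 = 1/sqrt(31.5)
    }
}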
 

The process by which Solr computes score(q, d) is as follows:

1: Call IndexSearcher.createNormalizedWeight() to compute queryNorm():

public Weight createNormalizedWeight(Query query) throws IOException {
    query = rewrite(query);
    Weight weight = query.createWeight(this);
    float v = weight.getValueForNormalization();
    float norm = getSimilarity().queryNorm(v);
    if (Float.isInfinite(norm) || Float.isNaN(norm)) {
        norm = 1.0f;
    }
    weight.normalize(norm, 1.0f);
    return weight;
}

 

  1. The specific steps are as follows:

  • Weight weight = query.createWeight(this);
  • This creates a BooleanWeight -> new TermWeight() -> this.stats = similarity.computeWeight(...) -> this.weight = idf * t.getBoost()
  • public IDFStats(String field, Explanation idf, float queryBoost) {
        // TODO: Validate?
        this.field = field;
        this.idf = idf;
        this.queryBoost = queryBoost;
        this.queryWeight = idf.getValue() * queryBoost; // compute query weight
    }
    Next sumOfSquaredWeights is computed: s = weights.get(i).getValueForNormalization() calculates (idf(t) • t.getBoost())^2, as in the code below (queryWeight was computed above):
    public float getValueForNormalization() {
        // TODO: (sorta LUCENE-1907) make non-static class and expose this squaring via a nice method to subclasses?
        return queryWeight * queryWeight; // sum of squared weights
    }
  • BooleanWeight -> getValueForNormalization -> sum = q.getBoost()^2 • ∑(this.weight)^2 = q.getBoost()^2 • ∑( idf(t) • t.getBoost() )^2
  • public float getValueForNormalization() throws IOException {
        float sum = 0.0f;
        for (int i = 0; i < weights.size(); i++) {
            // call sumOfSquaredWeights for all clauses in case of side effects
            float s = weights.get(i).getValueForNormalization(); // sum sub weights
            if (!clauses.get(i).isProhibited()) {
                // only add to sum for non-prohibited clauses
                sum += s;
            }
        }
     
        sum *= getBoost() * getBoost(); // boost each sub-weight
     
        return sum;
    }

     

  • Then queryNorm() = 1 / Math.sqrt(sumOfSquaredWeights) is computed:
  • public float queryNorm(float sumOfSquaredWeights) {
    return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
    }
  • weight.normalize(norm, 1.0f) then applies the normalization:
  • topLevelBoost *= getBoost();
  • value = queryWeight * idf.getValue() = idf(t)^2 • t.getBoost() • queryNorm  (queryWeight was computed above)

  • public void normalize(float queryNorm, float topLevelBoost) {
    this.queryNorm = queryNorm * topLevelBoost;
    queryWeight *= this.queryNorm; // normalize query weight
    value = queryWeight * idf.getValue(); // idf for document
    }

    2: Call Weight.bulkScorer() (invoked from IndexSearcher) to compute coord(q, d); this step also obtains the docFreq of each term and orders the terms by docFreq from small to large.

  

if (optional.size() == 0 && prohibited.size() == 0) {
    float coord = disableCoord ? 1.0f : coord(required.size(), maxCoord);
    return new ConjunctionScorer(this, required.toArray(new Scorer[required.size()]), coord);
}

 

  3: scorer.score() computes the score, i.e. the similarity value, and keeps the highest-scoring doc ids in a priority queue.

  • weightValue = value = idf(t)^2 • t.getBoost() • queryNorm
  • score = ∑ ( tf() • weightValue • norm(t,d) ) • coord is the final similarity value
  • lengthNorm does not appear as a separate factor here because it is folded into the stored norm that decodeNormValue(norms.get(doc)) reads back in the code below.
  • public float score(int doc, float freq) {
        final float raw = tf(freq) * weightValue; // compute tf(f)*weight
        return norms == null ? raw : raw * decodeNormValue(norms.get(doc)); // normalize for field
    }
       
    public float score() throws IOException {
        // TODO: sum into a double and cast to float if we ever send required clauses to BS1
        float sum = 0.0f;
        for (DocsAndFreqs docs : docsAndFreqs) {
            sum += docs.scorer.score();
        }
        return sum * coord;
    }
     
    public void collect(int doc) throws IOException {
        float score = scorer.score();
       
        // This collector cannot handle these scores:
        assert score != Float.NEGATIVE_INFINITY;
        assert !Float.isNaN(score);
       
        totalHits++;
        if (score <= pqTop.score) {
            // Since docs are returned in-order (i.e., increasing doc Id), a document
            // with equal score to pqTop.score cannot compete since HitQueue favors
            // documents with lower doc Ids. Therefore reject those docs too.
            return;
        }
        pqTop.doc = doc + docBase;
        pqTop.score = score;
        pqTop = pq.updateTop();
    }
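
    To tie the pieces together, here is a small end-to-end illustration with made-up numbers (ignoring the byte-level precision loss of the stored norm). For a one-term query on the title field with numDocs = 1000, docFreq = 9, t.getBoost() = 1 and q.getBoost() = 1:

    idf(t) = 1 + ln(1000/10) ≈ 5.61
    sumOfSquaredWeights = 1^2 • (5.61 • 1)^2 ≈ 31.5, so queryNorm ≈ 0.178
    weightValue = idf(t)^2 • t.getBoost() • queryNorm ≈ 31.5 • 1 • 0.178 ≈ 5.61
    tf = sqrt(4) = 2 for a document in which the term occurs 4 times
    norm(t,d) = 1/sqrt(16) = 0.25 for a 16-term title field with all boosts left at 1
    coord = 1/1 = 1
    score ≈ 2 • 5.61 • 0.25 • 1 ≈ 2.8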

     

Origin www.cnblogs.com/LCharles/p/11413550.html