Elasticsearch: Practical BM25 - Part 2: The BM25 Algorithm and Its Variations

This is a continuation of the first part, "Elasticsearch: Practical BM25 - Part 1: How Sharding Affects Relevance Scoring in Elasticsearch".

BM25 algorithm

I'll dive into the math here only as far as is needed to explain what's happening; this is the part where we look at the structure of the BM25 formula to get some insight into what each piece does. First let's look at the formula, then I'll break down each component into understandable parts:
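Written out (this is the form Lucene's BM25 similarity implements), it is:

$$\text{score}(D, Q) = \sum_{i=1}^{n} IDF(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{fieldLen}{avgFieldLen}\right)}$$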

We can see some familiar components like qi, IDF(qi), f(qi,D), k1, b, and something about field lengths. Here's what each one is all about:

1) qi is the ith query term.

For example, if I search for "shane", there is only 1 query term, so q0 is "shane". If I search for "shane connelly" in English, Elasticsearch will see the space and tokenize this as 2 terms: q0 will be "shane" and q1 will be "connelly". Each query term is plugged into the rest of the equation, and the results are all summed up.

2) IDF(qi) is the inverse document frequency of the ith query term.

For those of you who have used TF/IDF before, the concept of IDF may be familiar. If not, no worries! (If so, note that there's a difference between the IDF formula in TF/IDF and the IDF in BM25.) The IDF component of our formula measures how often a term occurs across all documents and "penalizes" terms that are common. The actual formula Lucene/BM25 uses for this part is:
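$$IDF(q_i) = \ln\left(1 + \frac{docCount - f(q_i) + 0.5}{f(q_i) + 0.5}\right)$$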

where docCount is the total number of documents that have a value for that field in the shard (across shards, if you use search_type=dfs_query_then_fetch), and f(qi) is the number of documents containing the ith query term. We can see in the example that "shane" occurs in all 4 documents, so for the term "shane" we end up with an IDF("shane") of:
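$$IDF(\text{"shane"}) = \ln\left(1 + \frac{4 - 4 + 0.5}{4 + 0.5}\right) = \ln\left(1 + \frac{0.5}{4.5}\right) \approx 0.105$$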

However, we can see that "connelly" only occurs in 2 documents, so we get an IDF("connelly") of:
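$$IDF(\text{"connelly"}) = \ln\left(1 + \frac{4 - 2 + 0.5}{2 + 0.5}\right) = \ln(2) \approx 0.693$$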

We can see here that queries containing rarer terms ("connelly" is rarer than "shane" in our 4-document corpus) get a higher multiplier, so they contribute more to the final score. This makes intuitive sense: the term "the" is likely to occur in nearly all English documents, so when a user searches for something like "the elephant", "elephant" is likely more important to the user's intent, and we want it to contribute far more to the score than "the", which appears in almost every document.

3) fieldLen/avgFieldLen is the length of the field divided by the average field length, which appears in the denominator.

We can think of this as how long a document is relative to the average document length. If a document is longer than average, the denominator gets bigger (decreasing the score), and if it's shorter than average, the denominator gets smaller (increasing the score). Note that the field-length implementation in Elasticsearch is based on the number of terms (rather than something else, like character length). This is exactly as described in the original BM25 paper, though we do have a special flag (discount_overlaps) to handle synonyms specially if you'd like. The way to think about this is that the more terms in the document, at least ones not matching the query, the lower the document's score. Again, this makes intuitive sense: if a document is 300 pages long and mentions my name once, it's less likely to be about me than a short tweet that mentions me once.

4) We see a variable b in the denominator, which is multiplied by the ratio of field lengths we just discussed. The larger b is, the more the document's length relative to the average influences the score. Seeing this, you can imagine that if b is set to 0, the length ratio is nullified entirely and document length has no bearing on the score. By default, b has a value of 0.75 in Elasticsearch.
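To make this concrete, b enters the formula only through the length-normalization factor in the denominator:

$$1 - b + b \cdot \frac{fieldLen}{avgFieldLen}$$

With b = 0 this factor is exactly 1 regardless of document length; with b = 1 it becomes fieldLen/avgFieldLen, so length differences hit the score at full strength.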

5) Finally, we see the last two components of the formula: k1 and f(qi,D), both of which appear in the numerator and the denominator of the fraction. Their presence on both sides makes it hard to see what they do just by looking at the formula, but let's jump in.

  • f(qi,D) is "how many times does the ith query term occur in document D?" In all of these documents, f("shane",D) is 1, but f("connelly",D) varies: it's 1 for documents 3 and 4, and 0 for documents 1 and 2. If there were a 5th document with the text "shane shane", its f("shane",D) would be 2. We can see that f(qi,D) is in both the numerator and the denominator, along with that special k1 factor we'll talk about next. The way to think about f(qi,D) is that the more times the query term occurs in a document, the higher its score. This makes intuitive sense: a document that has our name many times is more likely to be related to us than a document that mentions it only once.
  • k1 is a variable that helps determine term-frequency saturation characteristics. That is, it limits how much a single query term can affect a given document's score. It does this through an asymptote: where TF/IDF's term-frequency contribution keeps growing with each additional occurrence, BM25's tf() curve flattens out and approaches a ceiling of k1 + 1.

Higher or lower k1 values change the slope of the "tf() for BM25" curve, and with it how much "extra occurrences of a term" add to the score. One interpretation of k1 is that, for documents of average length, it is the term frequency at which a term's contribution reaches half of its maximum possible score. The effect of tf() on the score grows quickly while tf() ≤ k1 and more and more slowly when tf() > k1.

Continuing with our example, k1 controls the answer to the question "how much more should adding a second 'shane' to the document contribute to the score than the first, or the third compared to the second?" A higher k1 means the score for each additional instance of a term keeps rising relatively quickly, while a k1 of 0 means everything except IDF(qi) cancels out. By default, k1 has a value of 1.2 in Elasticsearch.
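To see this saturation numerically, take a document of exactly average length, so the length factor is 1 and the term-frequency part of the formula reduces to f·(k1 + 1)/(f + k1). With the default k1 = 1.2:

$$\frac{1 \cdot 2.2}{1 + 1.2} = 1.0, \qquad \frac{2 \cdot 2.2}{2 + 1.2} \approx 1.375, \qquad \frac{3 \cdot 2.2}{3 + 1.2} \approx 1.571, \qquad \lim_{f \to \infty} \frac{f \cdot 2.2}{f + 1.2} = 2.2$$

Each additional occurrence adds a little less than the one before, and no number of occurrences can push the term's contribution past k1 + 1.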

Revisit our search with our new knowledge

We'll delete our people index and recreate it with only 1 shard so that we don't have to use search_type=dfs_query_then_fetch. We'll test our knowledge by setting up three indices: the first (people) with k1 set to 0 and b set to 0.5, a second (people2) with b set to 0 and k1 set to 10, and a third (people3) with b set to 1 and k1 set to 5.

DELETE people
PUT people
{
  "settings": {
    "number_of_shards": 1,
    "index" : {
        "similarity" : {
          "default" : {
            "type" : "BM25",
            "b": 0.5,
            "k1": 0
          }
        }
    }
  }
}
PUT people2
{
  "settings": {
    "number_of_shards": 1,
    "index" : {
        "similarity" : {
          "default" : {
            "type" : "BM25",
            "b": 0,
            "k1": 10
          }
        }
    }
  }
}
PUT people3
{
  "settings": {
    "number_of_shards": 1,
    "index" : {
        "similarity" : {
          "default" : {
            "type" : "BM25",
            "b": 1,
            "k1": 5
          }
        }
    }
  }
}
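
As a quick, optional sanity check, you can confirm the similarity settings took effect by retrieving each index's settings:

GET people/_settings
GET people2/_settings
GET people3/_settings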

Now we'll add some documents to all three indexes:

POST people/_doc/_bulk
{ "index": { "_id": "1" } }
{ "title": "Shane" }
{ "index": { "_id": "2" } }
{ "title": "Shane C" }
{ "index": { "_id": "3" } }
{ "title": "Shane P Connelly" }
{ "index": { "_id": "4" } }
{ "title": "Shane Connelly" }
{ "index": { "_id": "5" } }
{ "title": "Shane Shane Connelly Connelly" }
{ "index": { "_id": "6" } }
{ "title": "Shane Shane Shane Connelly Connelly Connelly" }

POST people2/_doc/_bulk
{ "index": { "_id": "1" } }
{ "title": "Shane" }
{ "index": { "_id": "2" } }
{ "title": "Shane C" }
{ "index": { "_id": "3" } }
{ "title": "Shane P Connelly" }
{ "index": { "_id": "4" } }
{ "title": "Shane Connelly" }
{ "index": { "_id": "5" } }
{ "title": "Shane Shane Connelly Connelly" }
{ "index": { "_id": "6" } }
{ "title": "Shane Shane Shane Connelly Connelly Connelly" }

POST people3/_doc/_bulk
{ "index": { "_id": "1" } }
{ "title": "Shane" }
{ "index": { "_id": "2" } }
{ "title": "Shane C" }
{ "index": { "_id": "3" } }
{ "title": "Shane P Connelly" }
{ "index": { "_id": "4" } }
{ "title": "Shane Connelly" }
{ "index": { "_id": "5" } }
{ "title": "Shane Shane Connelly Connelly" }
{ "index": { "_id": "6" } }
{ "title": "Shane Shane Shane Connelly Connelly Connelly" }

Now, when we do:

GET /people/_search
{
  "query": {
    "match": {
      "title": "shane"
    }
  }
}

We can see in people that all documents have a score of 0.074107975. This fits with our understanding of setting k1 to 0: only the IDF of the search term has an effect on the score!
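We can even verify this number by hand. With k1 = 0 the term-frequency fraction is always 1, so the score reduces to the IDF alone, and all 6 documents in this index contain "shane":

$$IDF(\text{"shane"}) = \ln\left(1 + \frac{6 - 6 + 0.5}{6 + 0.5}\right) \approx 0.0741$$

which matches the 0.074107975 Elasticsearch reports.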

Now let's check people2, which has b = 0 and k1 = 10:

GET /people2/_search
{
  "query": {
    "match": {
      "title": "shane"
    }
  }
}

Two things can be seen from the results of this search.

First, we can see that the scores are ordered purely by the number of times "shane" occurs. Documents 1, 2, 3, and 4 all contain "shane" once and thus share the same score of 0.074107975. Document 5 has "shane" twice, so it gets a higher score (0.13586462) because f("shane",D5) = 2, and document 6 gets a higher score yet (0.18812023) because f("shane",D6) = 3. This fits our intuition for setting b to 0 in people2: document length, the total number of terms in the document, doesn't affect scoring; only the count of matching terms matters.

The second thing to note is that the differences between these scores are non-linear, although across these 6 documents they do look very close to linear.

  • The score difference between no occurrences of our search term and the first is 0.074107975
  • The score difference between the second occurrence and the first is 0.13586462 - 0.074107975 = 0.061756645
  • The score difference between the third occurrence and the second is 0.18812023 - 0.13586462 = 0.05225561

0.074107975 is fairly close to 0.061756645, which is fairly close to 0.05225561, but the differences are clearly shrinking. The reason this looks almost linear is that k1 is large. We can at least see that the score doesn't grow linearly with the number of occurrences: if it did, we'd expect to see the same difference for each additional occurrence. We'll come back to this idea after looking at people3.
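In the meantime, we can reproduce these numbers from the formula. With b = 0 the length normalization drops out, so each score is just the IDF (0.074107975, as in people) multiplied by f·(k1 + 1)/(f + k1) with k1 = 10:

$$0.0741 \cdot \frac{1 \cdot 11}{1 + 10} \approx 0.0741, \qquad 0.0741 \cdot \frac{2 \cdot 11}{2 + 10} \approx 0.1359, \qquad 0.0741 \cdot \frac{3 \cdot 11}{3 + 10} \approx 0.1881$$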

Now let's examine people3, which has k1 = 5 and b = 1:

GET /people3/_search
{
  "query": {
    "match": {
      "title": "shane"
    }
  }
}

We get the following hits:

"hits": [
      {
        "_index": "people3",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.16674294,
        "_source": {
          "title": "Shane"
        }
      },
      {
        "_index": "people3",
        "_type": "_doc",
        "_id": "6",
        "_score": 0.10261105,
        "_source": {
          "title": "Shane Shane Shane Connelly Connelly Connelly"
        }
      },
      {
        "_index": "people3",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.102611035,
        "_source": {
          "title": "Shane C"
        }
      },
      {
        "_index": "people3",
        "_type": "_doc",
        "_id": "4",
        "_score": 0.102611035,
        "_source": {
          "title": "Shane Connelly"
        }
      },
      {
        "_index": "people3",
        "_type": "_doc",
        "_id": "5",
        "_score": 0.102611035,
        "_source": {
          "title": "Shane Shane Connelly Connelly"
        }
      },
      {
        "_index": "people3",
        "_type": "_doc",
        "_id": "3",
        "_score": 0.074107975,
        "_source": {
          "title": "Shane P Connelly"
        }
      }
    ]

We can see in people3 that now the ratio of matching terms ("shane") to non-matching terms is the only factor affecting the relative score. So a document like document 3, in which only 1 of 3 terms matches, scores lower than documents 2, 4, 5, and 6, in all of which exactly half the terms match, and those in turn score lower than document 1, which matches exactly.
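Once more we can check the math. The average field length across the 6 documents is (1 + 2 + 3 + 2 + 4 + 6)/6 = 3 terms, so with b = 1 and k1 = 5 the multiplier on the IDF is f·6/(f + 5·fieldLen/3), which depends only on the ratio of f to fieldLen. For documents 1, 2, and 3 (with 1/1, 1/2, and 1/3 of their terms matching):

$$0.0741 \cdot \frac{1 \cdot 6}{1 + 5 \cdot \frac{1}{3}} \approx 0.1667, \qquad 0.0741 \cdot \frac{1 \cdot 6}{1 + 5 \cdot \frac{2}{3}} \approx 0.1026, \qquad 0.0741 \cdot \frac{1 \cdot 6}{1 + 5 \cdot \frac{3}{3}} \approx 0.0741$$

Documents 4, 5, and 6 have the same 1:2 match ratio as document 2, which is why they land on (essentially) the same 0.1026 score.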

Again, we notice a fairly large difference between the higher- and lower-scoring documents in people2 and people3. This is (again) due to the large value of k1. As an extra exercise, try deleting people2 and people3 and recreating them with something like k1 = 0.01, and you'll see the scores of these documents move much closer together. With b = 0 and k1 = 0.01:

  • The score difference between no occurrences of our search term and the first is 0.074107975
  • The score difference between the second occurrence and the first is 0.074476674 - 0.074107975 = 0.000368699
  • The score difference between the third occurrence and the second is 0.07460038 - 0.074476674 = 0.000123706

So with k1 = 0.01 we can see that the impact of each additional occurrence fades much faster than with k1 = 5 or k1 = 10. The 4th occurrence would add much less to the score than the 3rd, and so on. In other words, these smaller k1 values saturate term scores much faster. Just as we expected!

Hopefully this helps in understanding what these parameters are doing to various document sets. Armed with this knowledge, next we'll jump into how to pick an appropriate b and k1 and how Elasticsearch provides tools to understand scores and iterate on your approach.
