Understanding Solr's scoring mechanism

 
 
In Solr's document definition (schema.xml), each field is declared with the indexed and stored attributes, which mean the following:
 
 
Attribute   Meaning
indexed     A field must be indexed before it can be queried. This is independent of how the text is tokenized; tokenization (word segmentation) is controlled by the field's type attribute.
stored      A stored field is returned in Solr query results. If a field has stored=false, query results will not include it.
 
 
Only indexed fields can be used for queries. Solr also has a sort parameter that orders results by a field, but it is generally used with exact-match queries, such as browsing by category or brand. When the user performs fuzzy matching with keywords, ordering by a fixed field is inappropriate and does not help the user find the results they want (a result list controlled mostly by sort resembles Baidu's paid ranking system).
 
According to the documentation on query scoring, Solr uses the basic vector space model:
 


 
 
The index files .tvx, .tvd, and .tvf store term vector information. Term vectors are what express the degree of similarity: in the vector space model, the smaller the angle between two vectors, the greater the similarity, and that similarity is computed as the cosine of the angle.
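As a rough illustration of the vector space idea, the sketch below treats a query and two documents as term-weight vectors and compares them by the cosine of the angle between them. The terms and weights are made-up values, not taken from a real index, and cosine_similarity is an illustrative helper, not a Solr API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical term weights, for illustration only.
query = {"handbag": 1.0, "women": 1.0}
doc_close = {"handbag": 2.0, "women": 2.0}   # same direction as the query
doc_far = {"handbag": 1.0, "shoes": 3.0}     # shares only one term

sim_close = cosine_similarity(query, doc_close)
sim_far = cosine_similarity(query, doc_far)
print(sim_close, sim_far)
```

The document that points in the same direction as the query gets a cosine close to 1, while the one sharing a single term scores much lower.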

 

 
 
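Similarity formula (t = term, d = document, q = query, f = field):

score(q,d) = coord(q,d) • queryNorm(q) • ∑ ( tf(t in d) • idf(t)^2 • t.getBoost() • norm(t,d) ), summed over every term t in q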
 
  • tf(t in d): term frequency, a measure of how often the term appears in the document (the more occurrences, the higher the value).
  • idf(t): inverse document frequency, derived from the number of documents in which the term appears; the more documents contain the term, the lower the value.
  • t.getBoost(): the weight of each term in the query; it lets you mark a term as more important.
  • norm(t,d): the normalization factor d.getBoost() • lengthNorm(f) • f.getBoost(), made up of three parts:
    • Document boost: the larger the value, the more important the document.
    • Field boost: the larger the value, the more important the field.
    • lengthNorm(field) = (1.0 / Math.sqrt(numTerms)): the more terms a field contains (the longer the document), the smaller the value; the shorter the document, the larger the value.
  • coord(q,d): a query may contain several search terms, and a document may contain several of them. The more of the query's terms a document contains, the higher it scores: numTermsInDocumentFromQuery / numTermsInQuery.
  • queryNorm(q): computed from the sum of squared weights of the query terms. It does not affect ranking; it only makes scores from different queries comparable.
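The factors listed above can be combined in a few lines. The following is a minimal sketch of the classic TF-IDF score, assuming all document, field, and term boosts are 1.0; the function names, the toy term counts, and the field length of 29 are illustrative assumptions, and this is not Solr's actual implementation.

```python
import math

def tf(freq):
    return math.sqrt(freq)

def idf(doc_freq, max_docs):
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def length_norm(num_terms):
    return 1.0 / math.sqrt(num_terms)

def coord(matched_terms, query_terms):
    return matched_terms / query_terms

def classic_score(query_terms, doc_freqs, max_docs, term_counts, num_terms):
    """query_terms: terms in the query
    doc_freqs:   term -> number of documents containing the term
    term_counts: term -> occurrences in this document's field
    num_terms:   total number of terms in this document's field
    All boosts are assumed to be 1.0."""
    idfs = {t: idf(doc_freqs[t], max_docs) for t in query_terms}
    query_norm = 1.0 / math.sqrt(sum(v * v for v in idfs.values()))
    norm = length_norm(num_terms)
    matched = [t for t in query_terms if term_counts.get(t, 0) > 0]
    total = sum(tf(term_counts[t]) * idfs[t] ** 2 * norm for t in matched)
    return coord(len(matched), len(query_terms)) * query_norm * total

# Toy data: document frequencies chosen for illustration.
doc_freqs = {"women": 37139, "handbag": 4570}
terms = ["women", "handbag"]
score_both = classic_score(terms, doc_freqs, 57987, {"women": 2, "handbag": 2}, 29)
score_one = classic_score(terms, doc_freqs, 57987, {"women": 2}, 29)
print(score_both, score_one)
```

A document containing both query terms outranks one containing only the more common term, both because it accumulates more term weight and because coord halves the score of the partial match.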
 
The /select configuration in our current environment:
 
 
<!-- SearchHandler -->
  <requestHandler name="/select" class="com.zp.solr.handler.component.ZpSearchHandler">
    <lst name="defaults">
      <str name="defType">edismax</str>
      <str name="echoParams">explicit</str>
      <str name="wt">json</str>
      <str name="indent">true</str>
      <str name="df">text</str>
      <str name="bf">
          map(psfixstock,0,0,0,100)
      </str>
    </lst>
    <shardHandlerFactory class="HttpShardHandlerFactory">
      <int name="maxConnectionsPerHost">1000</int>
      <int name="corePoolSize">50</int>
    </shardHandlerFactory>
  </requestHandler>
 
  <!-- A request handler that returns indented JSON by default -->
  <requestHandler name="/query" class="solr.SearchHandler">
     <lst name="defaults">
       <str name="echoParams">explicit</str>
       <str name="wt">json</str>
       <str name="indent">true</str>
       <str name="df">text</str>
     </lst>
  </requestHandler>
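The bf entry in the /select defaults above relies on Solr's map(x,min,max,target,default) function. The sketch below mimics that behavior in plain Python to show what map(psfixstock,0,0,0,100) evaluates to for different stock levels; solr_map and stock_boost are illustrative names, not part of Solr.

```python
def solr_map(x, min_value, max_value, target, default=None):
    """Mimic of map(x,min,max,target[,default]): values of x inside
    [min, max] become target; other values become default, or stay
    unchanged when no default is given."""
    if min_value <= x <= max_value:
        return target
    return x if default is None else default

def stock_boost(psfixstock):
    # Same arguments as the bf setting: map(psfixstock,0,0,0,100)
    return solr_map(psfixstock, 0, 0, 0, 100)

for stock in (0, 1, 500):
    print(stock, stock_boost(stock))
```

Every item with any stock at all receives the same additive value of 100, while out-of-stock items receive 0, so the function separates results into two bands regardless of how well they match the keywords.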
 
 
As the configuration shows, edismax is currently used, the default query field df is text, and a boost function is set. There is a big problem, however: the ranking is dominated by the stock quantity of the goods, which inevitably keeps users from getting the most accurate results (even when an item is out of stock, the user probably wants the correct item returned, not an irrelevant item that happens to be in stock).

Common Solr query parameters:
 
 
Parameter    Meaning
df           Default field: the field queried when none is specified
wt           Writer type: the response output format; the default is xml
defType      Name of the query parser to use
bf           Boost functions: one or more function queries, separated by spaces
qf           Query fields: the index fields to search; if not specified, df is used
q            Query string; required
q.op         Default query operator: AND or OR
sort         Sort order: sort=<field_name>+<desc|asc>,...
start        Pagination: offset of the first record to return; the default is 0
rows         Pagination: number of records returned per page; the default is 10
fq           Filter query: restricts the results matched by q to those that also match fq; it makes full use of the filter cache, which improves retrieval performance
fl           Field list: the fields to return, separated by spaces or commas
timeAllowed  Query timeout
bq           Boost query: a term or phrase whose matches receive extra weight
mm           Minimum Should Match: the minimum number of clauses in the query that must match. If mm is specified neither in the query nor in solrconfig.xml, the effective value of q.op determines it: if q.op is AND, mm=100%; if q.op is OR, mm=1 (100% means every clause must match, 1 means a single matching clause is enough). To change this behavior, define the mm parameter in solrconfig.xml
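To show how these parameters fit together in a request, here is a small sketch that builds a /select URL with Python's standard library. The host, the core name products, and the id field are assumptions made for illustration; the qf field names and psfixstock are the ones used elsewhere in this article.

```python
from urllib.parse import urlencode

# Hypothetical host and core name; adjust to the actual deployment.
SOLR_SELECT_URL = "http://localhost:8983/solr/products/select"

def build_select_url(keywords, page=0, page_size=10, in_stock_only=False):
    params = {
        "q": keywords,                      # required query string
        "defType": "edismax",               # query parser
        "qf": "product_name^2.0 brand_name^0.9 category_name^0.8",
        "fl": "id,product_name,score",      # fields to return
        "start": page * page_size,          # offset of the first record
        "rows": page_size,                  # records per page
        "wt": "json",
        "debugQuery": "true",               # include scoring details
    }
    if in_stock_only:
        # Filter instead of boosting: does not change relevance scores.
        params["fq"] = "psfixstock:[1 TO *]"
    return SOLR_SELECT_URL + "?" + urlencode(params)

url = build_select_url("handbag", page=1, in_stock_only=True)
print(url)
```

Putting the stock condition in fq rather than in a boost keeps it out of the score entirely, which is one way to separate "is it available" from "is it relevant".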
 
 
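Solr supports several query parsers, giving search engine developers flexible query parsing. The main ones are the standard query parser, the DisMax query parser, and the Extended DisMax (eDisMax) query parser.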
 
 
When querying Solr, debugQuery prints the details of the score calculation so that it can be analyzed properly:
 
 
"1046888": "
26.279617 = sum of:
  0.9810601 = sum of:
    0.1401725 = weight(text:女士 in 431) [DefaultSimilarity], result of:
      0.1401725 = score(doc=431,freq=2.0), product of:
        0.3656968 = queryWeight, product of:
          1.4455243 = idf(docFreq=37139, maxDocs=57987)
          0.25298557 = queryNorm
        0.3833025 = fieldWeight in 431, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.4455243 = idf(docFreq=37139, maxDocs=57987)
          0.1875 = fieldNorm(doc=431)
    0.8408876 = weight(text:手提包 in 431) [DefaultSimilarity], result of:
      0.8408876 = score(doc=431,freq=2.0), product of:
        0.89569205 = queryWeight, product of:
          3.5404868 = idf(docFreq=4570, maxDocs=57987)
          0.25298557 = queryNorm
        0.9388133 = fieldWeight in 431, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          3.5404868 = idf(docFreq=4570, maxDocs=57987)
          0.1875 = fieldNorm(doc=431)
  25.298557 = FunctionQuery(map(int(psfixstock),0.0,0.0,const(0))), product of:
    100.0 = map(int(psfixstock)=1,min=0.0,max=0.0,target=const(0))
    1.0 = boost
    0.25298557 = queryNorm
",
  
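The arithmetic in this explain output can be replayed by hand. The sketch below plugs the printed tf, idf, fieldNorm, and queryNorm values into plain Python and sums the three parts; it only re-multiplies numbers shown in the output and assumes nothing else about the index.

```python
import math

QUERY_NORM = 0.25298557   # queryNorm from the explain output
FIELD_NORM = 0.1875       # fieldNorm(doc=431)

def term_weight(term_freq, idf):
    query_weight = idf * QUERY_NORM                          # queryWeight
    field_weight = math.sqrt(term_freq) * idf * FIELD_NORM   # fieldWeight
    return query_weight * field_weight

weight_first = term_weight(2.0, 1.4455243)    # first term, docFreq=37139
weight_second = term_weight(2.0, 3.5404868)   # second term, docFreq=4570
function_part = 100.0 * 1.0 * QUERY_NORM      # map(...) value * boost * queryNorm

total = weight_first + weight_second + function_part
print(weight_first, weight_second, function_part, total)
print("share of score from the stock boost:", function_part / total)
```

The total comes out at about 26.28, of which roughly 25.3 is contributed by the function query alone; the two keyword matches together account for less than 1.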
 
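The search string 女士手提包 (women's handbag) is split into two tokens, "女士" (women) and "手提包" (handbag). q.op defaults to OR (setting mm can change this behavior).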
 
 
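idf reflects how common a term is. Searching for "女士" alone matches 37139 documents and searching for "手提包" alone matches 4570; the more documents a term appears in, the less important it is. The idf formula: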
 
 
idf(t) = 1 + log (numDocs / (docFreq +1))
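As a quick check, the idf formula can be evaluated with the docFreq and maxDocs values from the explain output. This sketch assumes log is the natural logarithm and that numDocs corresponds to the maxDocs value printed by Solr.

```python
import math

def idf(doc_freq, max_docs):
    # idf(t) = 1 + log(numDocs / (docFreq + 1)), natural logarithm
    return 1.0 + math.log(max_docs / (doc_freq + 1))

idf_common = idf(37139, 57987)   # term found in 37139 of 57987 documents
idf_rare = idf(4570, 57987)      # term found in 4570 of 57987 documents
print(idf_common, idf_rare)
```

The results agree with the 1.4455243 and 3.5404868 shown in the explain output to within floating-point rounding, and the rarer term gets the larger weight.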
  
 
 
termFreq=2.0. tf is the term frequency raised to the power 1/2, which gives 1.414 here:
 
 
tf(t in d) = numTermOccurrencesInDocument^(1/2)
 
 
 
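fieldNorm depends on the total number of terms in the matched document's field, roughly 29 here (no boost was set, so only the length norm contributes). The formula: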
 
lengthNorm(field) = (1.0 / Math.sqrt(numTerms))
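The relationship between field length and fieldNorm can be checked in both directions. The sketch below evaluates lengthNorm for a 29-term field and inverts the stored value 0.1875 to estimate the field length; terms_for_norm is an illustrative helper. The stored norm is kept in a compact, rounded encoding, so the inverse should be read as an approximation only.

```python
import math

def length_norm(num_terms):
    return 1.0 / math.sqrt(num_terms)

def terms_for_norm(norm):
    # Inverse of length_norm: estimated term count for a given norm value.
    return 1.0 / (norm * norm)

print(length_norm(29))          # exact formula for a 29-term field
print(terms_for_norm(0.1875))   # length implied by the stored fieldNorm
```

The two numbers do not line up exactly: a 29-term field gives a norm slightly below 0.1875, and 0.1875 corresponds to a little over 28 terms, which is consistent with the stored value being rounded.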
 
 
 
queryNorm is computed from the sum of squared weights of the query terms; it makes scores from different queries comparable:

queryNorm(q) = 1 / Math.sqrt(sumOfSquaredWeights)
sumOfSquaredWeights = q.getBoost()^2 • ∑ ( idf(t) • t.getBoost() )^2
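The queryNorm printed in the explain output can be reproduced from this formula. The sketch below assumes that the function query contributes its own boost (1.0) as a third weight next to the two term idf values; that assumption is not stated in the output, but it is the combination that reproduces the printed 0.25298557.

```python
import math

def query_norm(weights, query_boost=1.0):
    """weights: idf(t) * t.getBoost() for each clause of the query."""
    sum_of_squared_weights = query_boost ** 2 * sum(w * w for w in weights)
    return 1.0 / math.sqrt(sum_of_squared_weights)

# idf of the two terms, plus 1.0 for the function query (assumed, see above).
norm_with_function = query_norm([1.4455243, 3.5404868, 1.0])
norm_terms_only = query_norm([1.4455243, 3.5404868])
print(norm_with_function, norm_terms_only)
```

Because every clause of the same query is multiplied by the same queryNorm, the value rescales all scores equally and leaves the ranking unchanged.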
 
 
The effect of Solr Copy Field on scoring
 
 
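What effect does using copyField have on scoring? With copyField, Solr copies several different fields into one field so that a search only needs to query the combined field. The drawback is that the combined content contains a very large number of terms.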
 
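According to an answer on StackOverflow, if each field needs its own boost, a single unified copyField cannot be used; the alternative is to split the copy fields into groups: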
 
 
 
Our default field (df) is currently set with <str name="df">text</str>, the combined field. We can apply custom boosts by changing qf instead, for example:
 
SearchText, SearchText2^3, SearchText3^10, SearchText4^100
 
 
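Another way to look at it: a multiValued field in Solr is essentially the same thing as a copyField, with one field name holding several values. When a document contains several values under the same field name, the inverted index and the term vectors logically append the tokens of each value one after another. When a multivalued field is stored, its values are stored separately, in order, so retrieving the document during a search returns multiple instances of the field. How is the field's boost computed in the end? The boosts of the individual values are multiplied together.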
 
Using Query Field
 
 
We set the query fields so that the following three fields are used for matching, each with its own boost weight:
 
 
      <str name="qf">brand_name^0.9 category_name^0.8 product_name^2.0</str>
 
 
 
Now we search for "胸饰" (brooch) with debugQuery turned on, and the parsed query shows how the fields are combined:
 
 
"debug": { "rawquerystring": "胸饰", "querystring": "胸饰", "parsedquery": "(+DisjunctionMaxQuery((brand_name:胸饰^0.9 | category_name:胸饰^0.8 | product_name:胸饰^2.0)) FunctionQuery(map(int(psfixstock),0.0,0.0,const(0),const(100))))/no_coord", "parsedquery_toString": "+(brand_name:胸饰^0.9 | category_name:胸饰^0.8 | product_name:胸饰^2.0) map(int(psfixstock),0.0,0.0,const(0),const(100))",
 
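The parsed query wraps the three qf fields in a DisjunctionMaxQuery. As a rough sketch of how such a query combines per-field scores: it takes the best-scoring field and adds a fraction (the tie parameter, 0.0 by default) of the remaining field scores. The per-field scores below are invented for illustration, and dismax_score is not a Solr function.

```python
def dismax_score(field_scores, tie=0.0):
    """Best field score plus tie * the sum of the other field scores."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)

# Hypothetical scores of one document in brand_name, product_name, category_name.
scores = [0.3, 1.2, 0.5]
print(dismax_score(scores))            # only the best field counts
print(dismax_score(scores, tie=0.1))   # other fields contribute 10%
```

With tie left at 0.0, a document is ranked purely by its strongest field, which is why the per-field boosts in qf matter so much.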
 
 
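We can define multiple requestHandlers, each dedicated to one kind of request, for example fuzzy search, locating a product by its product id, and internal back-end queries. Keeping them separate lets each kind of query be handled on its own terms.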
 
 
 

