Solr: a wrong analyzer setting in a fieldType causes queries to miss

Today, in our production environment, I ran into a problem where a Solr query failed to match a record it should have matched. The problem itself is small, but tracking it down was rather involved. In the end the cause turned out to be trivial once seen, so I am recording it here as a memo.

The business side told me that one query could not match its target record. The query was: name: Y9 Portland Cake Luto. With the problem report in hand, I started troubleshooting.

First, I confirmed that the original value of the field is "Y9 Portland Cake Luto". The column is called name, and its field type is a custom type defined as follows:

 

<fieldType name="like" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false" omitNorms="true" omitPositions="true">
      <analyzer type="index">
        <tokenizer class="solr.PatternTokenizerFactory" pattern=",\s*"/>
        <filter class="com.dfire.tis.solrextend.fieldtype.pinyin.AllWithNGramTokenFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.TrimFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.PatternTokenizerFactory" pattern=",\s*"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

This fieldType exists to let users run database LIKE-style queries on the search engine. (A wildcardQuery can also do this in Solr, but it is more expensive: it is fine for small data volumes, while for large data volumes its cost needs careful consideration.)

 

 

The filter AllWithNGramTokenFactory extends Solr's default NGram filter: in addition to the n-grams, it emits the entire field value as one extra term during analysis.
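To make the idea concrete, here is a minimal sketch in plain Java of what such a filter produces. It is not the source of AllWithNGramTokenFactory (which is not shown in this post); the class name NGramSketch and the gram sizes 2 to 4 are assumptions chosen only for illustration. It shows why a LIKE-style search becomes an exact term lookup: every substring within the gram range is already a term, and so is the whole value.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class NGramSketch {
    // Hypothetical helper: emits every substring of length minGram..maxGram,
    // plus the whole value as one extra term.
    static Set<String> terms(String value, int minGram, int maxGram) {
        Set<String> out = new LinkedHashSet<>();
        for (int start = 0; start < value.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= value.length(); len++) {
                out.add(value.substring(start, start + len));
            }
        }
        out.add(value); // the extra whole-field term
        return out;
    }

    public static void main(String[] args) {
        Set<String> indexed = terms("Y9 Cake", 2, 4);
        // A substring within the gram range is found by exact term match.
        System.out.println(indexed.contains("Cake"));
        // The whole value is found too, even though it is longer than maxGram.
        System.out.println(indexed.contains("Y9 Cake"));
    }
}
```

Note that the sketch does no case folding, just like the index analyzer configured above.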

To troubleshoot, I wrote a small program that checks whether the term "Y9 Portland Cake Luto" exists for the name field in the index files:

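import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.core.SolrCore;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.util.RefCounted;

// The server is passed in by the caller; the original snippet did not show how it is created.
private void readIteraveTerm(EmbeddedSolrServer server) throws Exception {
    SolrCore core = server.getCoreContainer().getCore("supplygood");
    RefCounted<SolrIndexSearcher> searcherRef = core.getSearcher();
    try {
        IndexReader rootreader = searcherRef.get().getIndexReader();
        BytesRef find = new BytesRef("Y9 Portland Cake Luto");
        // Look the term up segment by segment.
        for (LeafReaderContext leaf : rootreader.leaves()) {
            LeafReader reader = leaf.reader();
            Terms terms = reader.terms("name");
            if (terms == null) {
                continue; // this segment has no terms for the field
            }
            TermsEnum termEnum = terms.iterator();
            if (termEnum.seekExact(find)) {
                System.out.println(termEnum.term().utf8ToString());
                PostingsEnum posting = termEnum.postings(null);
                int docid;
                // Stop before printing the NO_MORE_DOCS sentinel.
                while ((docid = posting.nextDoc()) != PostingsEnum.NO_MORE_DOCS) {
                    System.out.println("docid:" + docid + "_" + leaf.docBase);
                }
            }
        }
    } finally {
        searcherRef.decref(); // release the searcher reference
        core.close();         // release the core reference taken by getCore()
    }
}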

Running this code shows that the index does contain a record with "Y9 Portland Cake Luto" as a term. So why can the query name: Y9 Portland Cake Luto not find the corresponding record?

 

Next I decided to debug Solr's code. Ultimately, the underlying Lucene executes the query through the TermQuery class:

package org.apache.lucene.search;

public class TermQuery extends Query {
  .....
  public TermQuery(Term t) {
    term = Objects.requireNonNull(t);
    perReaderTermState = null;
  }
}

So I set a breakpoint in the constructor of the TermQuery class, ran the query, and looked at the value of the Term parameter passed to the constructor. Strangely, the uppercase Y had become lowercase. I then looked up "y9 Portland Cake Luto" in the index and found no such term. Looking at the fieldType in the schema again, sure enough, the query analyzer contains <filter class="solr.LowerCaseFilterFactory"/>. In other words, at query time every uppercase letter in the value for the name column is converted to lowercase, but no such conversion is performed when the index is built. The query term therefore never equals the indexed term, which is exactly the problem described at the beginning of this article.
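The mismatch can be reproduced without Solr. The following is a minimal sketch in plain Java, assuming the index holds the whole field value as a single term; the class and method names are hypothetical and only stand in for the two analyzers. It shows that a lowercased query term cannot equal a term that kept its case, and that lowercasing on both sides makes them equal.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class AnalyzerMismatchSketch {
    // Stand-in for the index analyzer: keeps case unless 'lowercase' is set.
    static Set<String> indexTerms(String value, boolean lowercase) {
        Set<String> terms = new HashSet<>();
        terms.add(lowercase ? value.toLowerCase(Locale.ROOT) : value);
        return terms;
    }

    // Stand-in for the query analyzer: always lowercases the term.
    static String queryTerm(String input) {
        return input.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String name = "Y9 Portland Cake Luto";
        // Index keeps case, query lowercases: no match.
        System.out.println(indexTerms(name, false).contains(queryTerm(name))); // false
        // Both sides lowercase: match.
        System.out.println(indexTerms(name, true).contains(queryTerm(name)));  // true
    }
}
```

Term lookup is an exact comparison, so any normalization applied on one side only (case folding, trimming, and so on) is enough to make a query miss.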

The fix is to change the fieldType to the following:

  <fieldType name="like" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false" omitNorms="true" omitPositions="true">
      <analyzer type="index">
        <tokenizer class="solr.PatternTokenizerFactory" pattern=",\s*"/>
        <filter class="com.dfire.tis.solrextend.fieldtype.pinyin.AllWithNGramTokenFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.PatternTokenizerFactory" pattern=",\s*"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

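Adding a <filter class="solr.LowerCaseFilterFactory"/> filter to the index analyzer, so that it lowercases terms just as the query analyzer does, solves the problem. Because this changes index-time analysis, existing documents have to be reindexed before they match.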

 

 
