Synonym configuration in Solr and a walkthrough of the key source code

Since I needed to set up synonyms for work, I spent today going through Solr's synonym support and its source code, and took these notes. The Solr version examined here is 5.5.3.

Solr's schema.xml (called managed-schema in 5.x) already ships with an example; the relevant configuration is:

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100"> <!-- Note: separate type="index" and type="query" analyzers can only be configured on a TextField; I checked the source, and other field types ignore them -->
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/> <!-- the tokenizer factory to use -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" /> <!-- add the stopword filter factory -->
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> <!-- add the synonym filter factory -->
        <filter class="solr.LowerCaseFilterFactory"/> <!-- add the lowercasing filter factory -->
      </analyzer>
    </fieldType>
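For reference, the synonyms.txt file that this configuration points at uses two rule formats (these examples follow the format that Solr's shipped sample file documents):

```
# synonyms.txt -- two rule formats

# equivalent synonyms (comma-separated); with expand="true" every word
# in the group becomes a synonym of every other word
GB, gib, gigabyte, gigabytes

# explicit mapping: tokens on the left are replaced by tokens on the right
i-pod, i pod => ipod
```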

The key piece is the configured SynonymFilterFactory, so let's take a look at its source code.

SynonymFilterFactory extends TokenFilterFactory, the abstract base class of all TokenFilter factories; LowerCaseFilterFactory in the configuration above extends it as well. The heart of TokenFilterFactory is its create(TokenStream) method, which wraps further operations around the output of the tokenizer. This is easy to understand. TokenizerFactory also has a create method, but it takes no arguments, because at that point no TokenStream has been produced yet.
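To make the two factory shapes concrete, here is a minimal sketch using hypothetical stand-in types (these are simplified illustrations, not the real Lucene classes):

```java
// Hypothetical, simplified stand-ins for the real Lucene types, only to
// illustrate the shape of the two factories described above.
interface TokenStream { }

// A TokenizerFactory creates the stream itself, so create() takes no input.
abstract class TokenizerFactory {
    abstract TokenStream create();
}

// A TokenFilterFactory wraps an existing stream, so create() takes one.
abstract class TokenFilterFactory {
    abstract TokenStream create(TokenStream input);
}

public class FactorySketch {
    public static void main(String[] args) {
        TokenizerFactory tok = new TokenizerFactory() {
            TokenStream create() { return new TokenStream() { }; }
        };
        TokenFilterFactory filter = new TokenFilterFactory() {
            TokenStream create(TokenStream input) { return input; } // pass-through filter
        };
        // The analyzer chain: the tokenizer runs first, then filters wrap its output.
        TokenStream chain = filter.create(tok.create());
        System.out.println(chain != null);
    }
}
```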

With TokenFilterFactory understood, look at the structure of the SynonymFilterFactory class, starting with its constructor:

  public SynonymFilterFactory(Map<String,String> args) { // args holds the parameters from the configuration, e.g. synonyms, ignoreCase, expand above
    super(args);
    ignoreCase = getBoolean(args, "ignoreCase", false); // whether to ignore case when matching tokens against the dictionary
    synonyms = require(args, "synonyms"); // the location of the synonym dictionary file
    format = get(args, "format"); // the format object used when parsing the synonym dictionary, i.e. how synonyms are read from the file
    expand = getBoolean(args, "expand", true); // also a dictionary-parsing parameter; it is hard to describe in one line, so it is explained below

    analyzerName = get(args, "analyzer"); // the analyzer used to tokenize the strings read from the dictionary
    tokenizerFactory = get(args, "tokenizerFactory"); // same purpose as above, but specified via a factory
    if (analyzerName != null && tokenizerFactory != null) {
      throw new IllegalArgumentException("Analyzer and TokenizerFactory can't be specified both: " +
                                         analyzerName + " and " + tokenizerFactory);
    }

    // ... (less important parameters omitted)
  }
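The getBoolean/require/get helpers used above are essentially Map lookups with defaults and validation. A rough, self-contained sketch of that behavior (simplified; not the real TokenFilterFactory code) might look like:

```java
import java.util.HashMap;
import java.util.Map;

public class ArgsSketch {
    // require(): the key must be present, otherwise the configuration is invalid.
    static String require(Map<String, String> args, String key) {
        String value = args.remove(key);
        if (value == null)
            throw new IllegalArgumentException("Configuration Error: missing parameter '" + key + "'");
        return value;
    }

    // getBoolean(): optional key with a default value.
    static boolean getBoolean(Map<String, String> args, String key, boolean defaultVal) {
        String value = args.remove(key);
        return value == null ? defaultVal : Boolean.parseBoolean(value);
    }

    public static void main(String[] argv) {
        // The attributes of the <filter .../> element arrive as a String map.
        Map<String, String> args = new HashMap<>();
        args.put("synonyms", "synonyms.txt");
        args.put("ignoreCase", "true");

        System.out.println(require(args, "synonyms"));             // the mandatory dictionary path
        System.out.println(getBoolean(args, "ignoreCase", false)); // explicitly configured
        System.out.println(getBoolean(args, "expand", true));      // falls back to the default
    }
}
```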

Next, look at how the synonym dictionary is loaded. The org.apache.lucene.analysis.synonym.SynonymFilterFactory.inform(ResourceLoader) method calls a loadSynonyms method which, as the name suggests, loads the synonym dictionary:

protected SynonymMap loadSynonyms(ResourceLoader loader, String cname, boolean dedup, Analyzer analyzer) throws IOException, ParseException { // cname is the name of the format (parser) class to use, dedup is whether to drop duplicates while loading, analyzer is the tokenizer to use
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);

    SynonymMap.Parser parser;
    Class<? extends SynonymMap.Parser> clazz = loader.findClass(cname, SynonymMap.Parser.class);
    try {
      parser = clazz.getConstructor(boolean.class, boolean.class, Analyzer.class).newInstance(dedup, expand, analyzer); // the parser produces the final SynonymMap, whose core is an FST plus a BytesRefHash (a HashMap-like structure)
    } catch (Exception e) {
      throw new RuntimeException(e);
    }

    List<String> files = splitFileNames(synonyms); // multiple synonym dictionary files may be passed
    for (String file : files) {
      decoder.reset();
      try (final Reader isr = new InputStreamReader(loader.openResource(file), decoder)) { // read one dictionary file
        parser.parse(isr); // parse it, entering the synonyms into the FST; this is the most important method
      }
    }
    return parser.build(); // build the SynonymMap
  }
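Note the CodingErrorAction.REPORT settings above: they make a malformed byte sequence in a dictionary file fail loudly instead of being silently replaced. A small self-contained demonstration of that decoder behavior:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

public class StrictDecoderDemo {
    public static void main(String[] args) throws IOException {
        byte[] bad = {(byte) 0xC3, (byte) 0x28}; // 0xC3 starts a 2-byte UTF-8 sequence, 0x28 is not a valid continuation
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bad), decoder)) {
            r.read();
            System.out.println("decoded");
        } catch (MalformedInputException e) {
            // with REPORT, bad bytes raise an exception instead of becoming U+FFFD
            System.out.println("malformed input reported");
        }
    }
}
```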

The most important piece now is the parser.parse method. The parser here is the SolrSynonymParser class, which (via SynonymMap.Parser) inherits from SynonymMap.Builder and is used to construct the FST:

  public void parse(Reader in) throws IOException, ParseException {
    LineNumberReader br = new LineNumberReader(in); // read line by line
    try {
      addInternal(br); // delegate to the addInternal method
    // ... (error handling omitted)
  }

  private void addInternal(BufferedReader in) throws IOException {
    String line = null;
    while ((line = in.readLine()) != null) {
      if (line.length() == 0 || line.charAt(0) == '#') { // empty or comment line
        continue; // ignore empty lines and comments
      }
      
      // TODO: we could process this more efficiently.
      String sides[] = split(line, "=>"); // split on =>
      if (sides.length > 1) { // a => is present, i.e. an explicit rule of the form aa=>bb
        if (sides.length != 2) {
          throw new IllegalArgumentException("more than one explicit mapping specified on the same line");
        }
        String inputStrings[] = split(sides[0], ","); // split the left-hand side on commas
        CharsRef[] inputs = new CharsRef[inputStrings.length];
        for (int i = 0; i < inputs.length; i++) {
          inputs[i] = analyze(unescape(inputStrings[i]).trim(), new CharsRefBuilder()); // tokenize each left-hand entry with the analyzer configured above
        }
        // do the same for the right-hand side
        String outputStrings[] = split(sides[1], ",");
        CharsRef[] outputs = new CharsRef[outputStrings.length];
        for (int i = 0; i < outputs.length; i++) {
          outputs[i] = analyze(unescape(outputStrings[i]).trim(), new CharsRefBuilder());
        }
        // these mappings are explicit and never preserve original
        for (int i = 0; i < inputs.length; i++) { // every left-hand word gets every right-hand word as a synonym: for a=>b,c the SynonymMap receives a->b and a->c, but never b->c, b->a, or c->a
          for (int j = 0; j < outputs.length; j++) {
            add(inputs[i], outputs[j], false);
          }
        }
      } else { // no => on the line, just a, b, c: these words are synonyms of one another
        String inputStrings[] = split(line, ",");
        CharsRef[] inputs = new CharsRef[inputStrings.length];
        for (int i = 0; i < inputs.length; i++) {
          inputs[i] = analyze(unescape(inputStrings[i]).trim(), new CharsRefBuilder());
        }
        if (expand) { // expand decides whether mappings are added in both directions: for a, b, c with expand=false only mappings to the first word (a->a, b->a, c->a) are recorded, never a->b, a->c, b->c, or c->b
          // all pairs
          for (int i = 0; i < inputs.length; i++) { // these two loops form every ordered synonym pair, as described above
            for (int j = 0; j < inputs.length; j++) {
              if (i != j) {
                add(inputs[i], inputs[j], true); // add() stores the output in a BytesRefHash (an array-backed hash) and records its position in the FST, so a term looked up in the FST first yields one or more positions in the BytesRefHash, from which the synonyms are then fetched
              }
            }
          }
        } else {
          // all subsequent inputs map to first one; we also add inputs[0] here
          // so that we "effectively" (because we remove the original input and
          // add back a synonym with the same text) change that token's type to
          // SYNONYM (matching legacy behavior):
          for (int i = 0; i < inputs.length; i++) {
            add(inputs[i], inputs[0], false); // without expand, every word maps only to the first word
          }
        }
      }
    }
  }
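The mappings each rule type produces can be simulated with plain Java. This is a sketch of the rule semantics only (string splitting instead of the real analyzer, no FST), just to make the expand behavior visible:

```java
import java.util.ArrayList;
import java.util.List;

public class SynonymRuleSketch {
    // Return the input->output mappings one rule line produces, mirroring
    // the branching of addInternal above.
    static List<String> mappings(String line, boolean expand) {
        List<String> result = new ArrayList<>();
        String[] sides = line.split("=>");
        if (sides.length == 2) {
            // explicit rule a=>b,c: every left word maps to every right word
            for (String in : sides[0].split(","))
                for (String out : sides[1].split(","))
                    result.add(in.trim() + "->" + out.trim());
        } else {
            String[] words = line.split(",");
            for (int i = 0; i < words.length; i++) words[i] = words[i].trim();
            if (expand) {
                // all ordered pairs: a,b,c gives a->b, a->c, b->a, b->c, c->a, c->b
                for (int i = 0; i < words.length; i++)
                    for (int j = 0; j < words.length; j++)
                        if (i != j) result.add(words[i] + "->" + words[j]);
            } else {
                // everything (including the first word itself) maps to the first word
                for (String w : words) result.add(w + "->" + words[0]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(mappings("a=>b,c", true));  // explicit rule, one direction only
        System.out.println(mappings("a,b,c", true));   // expand: all pairs
        System.out.println(mappings("a,b,c", false));  // no expand: map to first word
    }
}
```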

At this point the usage of synonyms is clear, even though the FST itself has not been covered yet; synonyms can already be used. Two problems remain, however: how do you update the synonym dictionary without restarting Solr, and how do you avoid the hassle of keeping the dictionary files inside Solr, i.e. modify the dictionary dynamically? I will leave that for the next post: with only a few modifications, dictionaries and synonyms can be added dynamically.

