Elasticsearch pitfall postmortem: a complex usage scenario combining highlight + fvh + copy_to + JSON field order

1. Background

The index used by the business was switched from the old index (on the old cluster) to a new index (on the new cluster). The number of primary shards was changed; nothing else was modified. On the day of the switch, both the developers and the testers ran their tests and everything passed.

2. Problem description

The day after the index switch, the developers reported that a query was returning an error:

[Screenshot: the error returned by the query]

Preliminary debugging showed that the error was raised in the highlight module when the fvh highlighter type was used. The failing request mainly queries the field a-name.

A quick refresher on the highlighter types follows.

highlight supports three highlighter types: unified, plain and fvh. Of these, fvh is better suited to large text fields.

The fvh highlighter (Fast Vector Highlighter) directly uses the term vectors created at index time to locate the matched query terms and build the highlight fragments. This requires the mapping setting "term_vector" : "with_positions_offsets" on the field.
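As a minimal sketch of the setup described above, the request bodies could look like the following. The field name a-name comes from this article; the `match` query and the search term are assumptions chosen only for illustration, and the real index also uses custom analyzers that are not shown here.

```python
import json

# Mapping body: the field that fvh will highlight must store term vectors
# with positions and offsets.
mapping = {
    "mappings": {
        "properties": {
            "a-name": {
                "type": "text",
                "term_vector": "with_positions_offsets",
            }
        }
    }
}

# Search body: request the fvh highlighter explicitly for that field.
# The match query and its search term are illustrative assumptions.
search_body = {
    "query": {"match": {"a-name": "科技"}},
    "highlight": {"fields": {"a-name": {"type": "fvh"}}},
}

print(json.dumps(mapping, indent=2))
print(json.dumps(search_body, indent=2, ensure_ascii=False))
```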

For the properties of each highlighter type, see the official documentation:

https://www.elastic.co/guide/en/elasticsearch/reference/current/highlighting.html

I also located the source code related to this error on GitHub:

https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/FastVectorHighlighter.java

This confirms that the error is caused by the term vector computation.

Fortunately, the old index is still there, and the stored term vector information can be inspected through the API:

GET <index_name>/_termvectors/<id>?fields=<field_name>
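For scripting the same check against both clusters, a small helper can build that request path. This is only a sketch: the function name `termvectors_path` and the example index name are hypothetical, and sending the request (with whatever HTTP client and authentication the cluster needs) is left out.

```python
from urllib.parse import quote


def termvectors_path(index: str, doc_id: str, field: str) -> str:
    """Build the path for GET <index>/_termvectors/<id>?fields=<field>."""
    return (
        f"/{quote(index, safe='')}/_termvectors/"
        f"{quote(doc_id, safe='')}?fields={quote(field, safe='')}"
    )


# Hypothetical index name and document id, for illustration only.
print(termvectors_path("my-index", "1", "a-name"))
```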

The term vector of the search keyword in the new index looks like this:

[Screenshot: term vector of the search keyword in the new index]

In the old index it looks like this:

[Screenshot: term vector of the search keyword in the old index]

It can be clearly seen that the position and offset information stored in the old and new indexes is completely different.
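When comparing the two outputs by eye gets tedious, the responses can be flattened and diffed. The sketch below assumes the usual shape of a `_termvectors` response (`term_vectors` → field → `terms` → term → `tokens`); the two sample responses are made-up values for illustration, not the data from the screenshots.

```python
def flatten_term_vector(response: dict, field: str) -> list:
    """Return sorted (position, start_offset, end_offset, term) tuples for one field."""
    terms = response["term_vectors"][field]["terms"]
    rows = []
    for term, info in terms.items():
        for tok in info.get("tokens", []):
            rows.append((tok["position"], tok["start_offset"], tok["end_offset"], term))
    return sorted(rows)


def make_sample(position: int, start: int, end: int) -> dict:
    """Build a tiny made-up response containing a single term (illustration only)."""
    token = {"position": position, "start_offset": start, "end_offset": end}
    return {"term_vectors": {"a-name": {"terms": {"科技": {"term_freq": 1, "tokens": [token]}}}}}


sample_old = make_sample(2, 2, 4)    # made-up values
sample_new = make_sample(8, 24, 26)  # made-up values

old_rows = flatten_term_vector(sample_old, "a-name")
new_rows = flatten_term_vector(sample_new, "a-name")
print("identical" if old_rows == new_rows else "different", old_rows, new_rows)
```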

We also tried computing the term vectors in real time (on the fly, see https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html#docs-termvectors-api-generate-termvectors) through the termvectors API, and still got the same two results.

So the problem is now narrowed down to this: the term vector results differ.

3. Investigation direction

After discussing with the developers, we investigated in two directions:

  • 1. Copy the index configuration to both the new and the old cluster, to first rule out a problem with the cluster environment.

  • 2. The index uses many custom tokenizers and complex parameters, so analyze the index configuration further.

After creating a new test index, we found that the test index on the old cluster produced the same error, and that its term vector information was consistent with that of the new index.

[Screenshot: term vector of the test index on the old cluster]

A cluster environment problem can therefore be ruled out.

At the same time, the developers noticed that the term vector information contained terms that did not come from the field's own text:

"text": ["某某科技公司"]
"term_vectors": ["mou","ke","ji","gong","si","某","科技","公司"]

The extra term vector entries come from another field, b-name, which stores the pinyin. That field is configured with the copy_to attribute, so its content is copied into the field in question, a-name.
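A minimal sketch of a mapping with this shape is shown below. Only the field names, copy_to and term_vector come from the article; the field types are assumptions, and the custom analyzers of the real index are omitted.

```python
import json

# b-name (pinyin) is copied into a-name, so the term vectors of a-name
# contain tokens from both fields.
mapping = {
    "mappings": {
        "properties": {
            "a-name": {
                "type": "text",  # assumed type
                "term_vector": "with_positions_offsets",
            },
            "b-name": {
                "type": "text",  # assumed type
                "copy_to": "a-name",
            },
        }
    }
}

print(json.dumps(mapping, indent=2))
```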

[Screenshot: the copy_to setting on b-name]

At this point the developers found the cause: during the switch from the old index to the new one, the client had also been replaced. Previously the documents were sent as JSON from Node.js, which kept the fields in a fixed order. After the switch the client was written in Go, and jsoniter in Go did not keep a fixed order, so the field order was random.

Therefore, with this complex use of copy_to, JSON fields in a different order produce different term vectors. We reproduced this with a test index. The test data is as follows:

{
  "b-name" : "mou mou ke ji gong si",
  "a-name" : "某某科技公司"
}

With b-name before a-name, the term vector is as follows:

[Screenshot: term vector when b-name comes before a-name]

After the fields are swapped:

[Screenshot: term vector when a-name comes before b-name]
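The effect can be illustrated with a toy model. This is a deliberately simplified sketch, not Elasticsearch's implementation: it assumes that values reaching a-name (its own value and the value copied from b-name) are indexed in the order the fields appear in the JSON source, with positions simply counting up, and it ignores offsets and position gaps. The Chinese token list is an assumed analyzer output.

```python
def simulate_target_tokens(source_fields, target="a-name", copied=("b-name",)):
    """Assign increasing positions to tokens landing in `target`, in source order."""
    position = 0
    result = []
    for field, tokens in source_fields:
        if field != target and field not in copied:
            continue
        for token in tokens:
            result.append((position, token))
            position += 1
    return result


def positions_of(rows, term):
    """Return every position at which `term` occurs."""
    return [pos for pos, tok in rows if tok == term]


pinyin = "mou mou ke ji gong si".split()
chinese = ["某", "某", "科技", "公司"]  # assumed tokenization of 某某科技公司

b_first = simulate_target_tokens([("b-name", pinyin), ("a-name", chinese)])
a_first = simulate_target_tokens([("a-name", chinese), ("b-name", pinyin)])

# The same term lands at a different position depending on field order.
print(positions_of(b_first, "科技"), positions_of(a_first, "科技"))
```

Under this model the term set is identical in both cases; only the positions change, which matches the symptom seen in the two term vector outputs.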

4. Conclusion and review

The mind map of our actual troubleshooting process is as follows:

[Image: mind map of the troubleshooting process]

Complex usage scenarios like this one hide many pitfalls. The devil really is in the details: even the order of the JSON fields can lead to such an obscure, hard-to-find bug.
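One possible guard, sketched here under assumptions, is to make the client serialize fields in an explicitly declared order instead of relying on whatever the JSON library happens to emit. The `FIELD_ORDER` list and the `ordered_body` helper are hypothetical; which order matches the behaviour of the old index has to be determined from the old client.

```python
import json

# Hypothetical fixed order; choose the one the old client actually produced.
FIELD_ORDER = ["a-name", "b-name"]


def ordered_body(doc: dict, order=FIELD_ORDER) -> str:
    """Serialize `doc` with known fields first, in a fixed order, then the rest sorted."""
    known = {key: doc[key] for key in order if key in doc}
    rest = {key: doc[key] for key in sorted(doc) if key not in known}
    return json.dumps({**known, **rest}, ensure_ascii=False)


# The input order no longer matters: both calls produce the same body.
print(ordered_body({"b-name": "mou mou ke ji gong si", "a-name": "某某科技公司"}))
print(ordered_body({"a-name": "某某科技公司", "b-name": "mou mou ke ji gong si"}))
```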

One small question remains: the on-the-fly term vectors API cannot reproduce the problem in this scenario. Is that because it computes from data already written to the Lucene files, or because it simulates writing the data? Is this a functional bug, or a gap in our understanding?

5. About the author

Jin Duoan: Elastic Certified Engineer, senior Elastic operations engineer, guest of the Sike Elasticsearch Knowledge Planet community and one of its top active technical experts, and responsible editor of the Elastic Chinese Community Daily.


Origin blog.csdn.net/wojiushiwo987/article/details/129543380