Engineering practice of Elasticsearch vector search

1. Background

As a search engine company, we rely heavily on ES for tasks including article recall, data source partitioning, and entity and tag management, and have achieved good results.

Recently, we needed to model an industry knowledge base, which involves various recall and scoring methods such as entity matching, fuzzy search, and vector search. In the end, we chose the dense_vector field type newly introduced in ES 7.x (we settled on 7.10) to help meet these requirements.

2. Technical selection

2.1 Solution Requirements

  1. Support for vector search

  2. Support for multi-dimensional filtering

  3. Adequate throughput

  4. Low learning and usage cost

  5. Low operation and maintenance cost

2.2 Use scene design

  1. Offline data preparation

    1. After the offline data is constructed, it is stored in the engine

    2. The engine indexes each field in the data

  2. Online data recall

    1. Recall data using the query constructed from the query-understanding results

    2. Filter the results

    3. Sort the results by a certain score

2.3 Data Structure Design

After determining the usage scenarios of the data, we decided that the data structure would roughly contain the following fields:

  1. Unique id: used for deduplication and quick acquisition of knowledge

  2. Entity, attribute, value: used to describe the specific content of knowledge

  3. Confidence: describes how credible the knowledge item is

  4. Classification flag: main classification of knowledge and recommended category, etc.

  5. Vector representation: as a basis for knowledge similarity, relevance recall, and scoring

  6. ref information: source information used to trace back how the knowledge was parsed/obtained

  7. Other properties: general supporting properties such as availability, deletion flag, and modification time
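
To make this design concrete, here is an illustrative shape of a single knowledge item written as a Python dict. The field names anticipate the index mapping described in section 4.1, and all values (including the vector dimensionality) are made-up examples, not real data.

# Illustrative knowledge item; all values are placeholders
knowledge_item = {
    "kid": "k_000001",                      # unique id for deduplication and direct lookup
    "knowledge": "some entity",             # entity part of the knowledge
    "attribute": "some attribute",          # attribute part
    "value": "some value",                  # value part
    "confidence": 0.92,                     # credibility of the item
    "category": "type_1",                   # main classification / recommended category
    "vector": [0.0] * 512,                  # vector representation used for similarity recall
    "ref": "source article id or snippet",  # trace back to the source
    "available": "1",                       # general supporting properties
    "del": "0",
    "update_timestamp": "2022-01-01 00:00:00",
}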

2.4 Solution Comparison

In order to support the above usage requirements, we compared various solutions including ES, Faiss, Milvus, OpenDistro Elasticsearch KNN, and SPTAG. Among them, Faiss and SPTAG are only core algorithm libraries and would need additional development to wrap them into services; Milvus 1.x can only store an id and a vector, which does not fully meet our requirements; and out of consideration for cluster stability and maintainability, we prefer native ES functionality over plugin-based deployments. We therefore chose ES's native vector search as our final solution.

Comparison reference:

| Type | Implementation language | Client support | Multi-condition recall | Learning cost | Integration cost | O&M cost | Distributed | Performance | Community | Remarks |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Elasticsearch | Java | Java / Python | Yes | Low | Low | Medium | Yes | Medium | Active | Native feature |
| Faiss | Python | Python | No | Medium | High | High | No | High | Average | Needs further development |
| Milvus | Python + GoLang | Python / Java / GoLang | No | Medium | Medium | Medium | No | High | Average | 1.x is not fully functional |
| OpenDistro Elasticsearch KNN | Java + C++ | Java / Python | Yes | Medium | Medium | Medium | Yes | Medium | Average | Built-in plugin |
| SPTAG | C++ | Python + C# | No | High | Medium | Medium | No | High | Average | Needs further development |

3. Data flow process

3.1 Offline data processing part

  1. Collect data from multiple data sources

  2. Data cleaning and preprocessing

  3. Extract knowledge through algorithmic engines

  4. Convert knowledge to vectors through an algorithmic engine

  5. Store the basic information of knowledge together with vector data in ES
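
As a rough sketch of step 5, the following shows how knowledge items (with their vectors) could be written into ES in bulk with the official Python client. The cluster address, index name, and item shape are assumptions for illustration; the upstream extraction and vectorization steps are represented by an already-built list of dicts.

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumed cluster address

def bulk_index(items, index_name="knowledge_current"):  # hypothetical index name
    # Each item is a dict containing the knowledge fields plus a 512-dim "vector"
    actions = (
        {"_op_type": "index", "_index": index_name, "_id": item["kid"], "_source": item}
        for item in items
    )
    return helpers.bulk(es, actions)

# knowledge_items would come from the extraction / vectorization steps above
# bulk_index(knowledge_items)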

3.2 Online Data Recall Section

  1. Get search criteria from frontend

  2. Parse the retrieval conditions with the query-understanding module

  3. Search in ES

  4. Score adjustments to results

  5. Return the results to the frontend

4. Example of using ES vector search

4.1 Index Design

Settings

{
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 2,
        "index": {
            "routing": {
                "allocation": {
                    "require": {
                        "node_group": "hot" // 1)
                    }
                }
            },
            "store": {
                "preload": [ // 2)
                    "knowledge",
                    "category",
                    "available",
                    "confidence",
                    "del",
                    "kid"
                ]
            },
            "search": {
                "slowlog": {
                    "threshold": {
                        "query": {
                            "warn": "1s" // 3)
                        },
                        "fetch": {
                            "warn": "1s" // 3)
                        }
                    }
                }
            },
            "translog": {
                "flush_threshold_size": "512mb", // 4)
                "sync_interval": "5m", // 4)
                "durability": "async" // 4)
            },
            "sort": {
                "field": [ // 5)
                    "kid",
                    "confidence"
                ],
                "order": [ // 5)
                    "asc",
                    "desc"
                ]
            }
        }
    }
}
  • Notes:

  1. Because the vector data is large, we prefer to place the entire index on nodes with better hardware

  2. To support high-performance filtering, frequently used fields are preloaded in memory

  3. Enable logs for slow queries to facilitate follow-up performance investigations

  4. The knowledge base is rebuilt offline, and a lot of writing happens during each update, so the translog commit interval is lengthened to speed up writes

  5. In actual use, kid is an auto-incrementing id, and results may also be sorted by the confidence of the knowledge, so these two fields are configured as index sort fields

Mapping

{
    "mappings": {
        "properties": {
            "kid": {
                "type": "keyword"
            },
            "knowledge": {
                "type": "keyword"
            },
            "knowledge_phrase": { // 1)
                "type": "text",
                "analyzer": "faraday"
            },
            "attribue": { // 1)
                "type": "keyword",
                "fields": {
                    "phrase": {
                        "type": "text",
                        "analyzer": "faraday"
                    }
                }
            },
            "value": { // 1)
                "type": "keyword",
                "fields": {
                    "phrase": {
                        "type": "text",
                        "analyzer": "faraday"
                    }
                }
            },
            "confidence": { // 2)
                "type": "double"
            },
            "category": {
                "type": "keyword"
            },
            "vector": { // 3)
                "type": "dense_vector",
                "dims": 512
            },
            "ref": {
                "type": "text",
                "index": false
            },
            "available": {
                "type": "keyword"
            },
            "del": {
                "type": "keyword"
            },
            "create_timestamp": {
                "type": "date",
                "format": [
                    "strict_date_hour_minute_second",
                    "yyyy-MM-dd HH:mm:ss"
                ]
            },
            "update_timestamp": {
                "type": "date",
                "format": [
                    "strict_date_hour_minute_second",
                    "yyyy-MM-dd HH:mm:ss"
                ]
            }
        }
    }
}
  • Notes:

  1. In addition to exact matching of knowledge items, fuzzy retrieval is also required, so we use a self-developed faraday analyzer to segment each part of the knowledge item.

  2. Some knowledge items in the knowledge base will be reviewed and maintained by experts/humans, so different confidence levels will be set for different items

  3. After data preprocessing, each item is converted into a 512-dimensional vector and stored in this field
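
Putting the settings and mapping together, creating the index with the Python client might look like the sketch below. The index name is hypothetical, and the settings/mappings body is abbreviated here; the full versions are the JSON documents shown above.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed cluster address

index_body = {
    # Abbreviated: the complete settings and mappings are shown above
    "settings": {"number_of_shards": 3, "number_of_replicas": 2},
    "mappings": {
        "properties": {
            "kid": {"type": "keyword"},
            "knowledge": {"type": "keyword"},
            "confidence": {"type": "double"},
            "category": {"type": "keyword"},
            "vector": {"type": "dense_vector", "dims": 512},
        }
    },
}
es.indices.create(index="knowledge_v1", body=index_body)  # hypothetical index name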

4.2 Data flow

  • Offline part:

  1. Data collection and cleaning

  2. Model A extracts knowledge items from articles

  3. Model B converts knowledge items into vectors

    1. Model A and Model B are self-developed models, built with algorithms such as knowledge density calculation and frameworks including BERT and TensorFlow

  4. Insert core content such as original text and knowledge items into the database

  5. Assemble the core knowledge content, vectors, etc. into retrieval units and insert them into ES

  6. A team of experts reviews, revises and iterates on knowledge items in the database

  7. The algorithm team iterates the models in the pipeline based on knowledge-item updates and other annotations, and refreshes the online knowledge base

  • Online part:

  1. After the frontend receives a request, the query-understanding component is called to analyze it

  2. After removing invalid content and identifying the classification and other intents in the query, construct the recall vector and the related filter conditions

  3. The knowledge base is filtered with the combined ES query, and the results are adjusted with confidence and other factors

  4. Apply the different scoring strategies to the recalled results, re-rank them, and finally return them to the frontend

4.3 Example query

POST knowledge_current_reader/_search
{
    "query": {
        "script_score": {
            "query": {
                "bool": {
                    "filter": [
                        {
                            "term": {
                                "del": 0
                            }
                        },
                        {
                            "term": {
                                "available": 1
                            }
                        }
                    ],
                    "must": {
                        "bool": {
                            "should": [
                                {
                                    "term": {
                                        "category": "type_1",
                                        "boost": 10
                                    }
                                },
                                {
                                    "term": {
                                        "category": "type_2",
                                        "boost": 5
                                    }
                                }
                            ]
                        }
                    },
                    "should": [
                        {
                            "match_phrase": {
                                "knowledge_phrase": {
                                    "query": "some_query",
                                    "boost": 10
                                }
                            }
                        },
                        {
                            "match": {
                                "attribute": {
                                    "query": "some_query",
                                    "boost": 5
                                }
                            }
                        },
                        {
                            "match": {
                                "value": {
                                    "query": "some_query",
                                    "boost": 5
                                }
                            }
                        },
                        {
                            "term": {
                                "knowledge": {
                                    "value": "some_query",
                                    "boost": 30
                                }
                            }
                        },
                        {
                            "term": {
                                "attribute": {
                                    "value": "some_query",
                                    "boost": 15
                                }
                            }
                        },
                        {
                            "term": {
                                "value": {
                                    "value": "some_query",
                                    "boost": 10
                                }
                            }
                        }
                    ]
                }
            },
            "script": {
                "source": "cosineSimilarity(params.query_vector, 'vector') + sigmoid(1, Math.E, _score) + (1 / Math.log(doc['confidence'].value))",
                "params": {
                    "query_vector": [ ... ]
                }
            }
        }
    }
}
  • Notes:

  1. The above query conditions and parameters are for illustration only; they are a desensitized and simplified version of what is actually used online.

  2. The scoring formula is one version from our iterations; subsequent adjustments and upgrades are not reflected here.

  3. Boundary conditions and null values are handled in the auxiliary services and pipelines, which simplifies part of the boundary-condition handling and checking logic here
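
For completeness, a simplified sketch of issuing such a query with the Python client is shown below. Only the filters and the vector part are kept, the query vector is a placeholder, and adding 1.0 to the cosine similarity is just a simple way to keep the script score non-negative; it is not the online formula.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed cluster address

query_vector = [0.0] * 512  # in practice, produced by the same embedding model used offline

body = {
    "size": 10,
    "query": {
        "script_score": {
            "query": {
                "bool": {
                    "filter": [
                        {"term": {"del": 0}},
                        {"term": {"available": 1}},
                    ]
                }
            },
            "script": {
                "source": "cosineSimilarity(params.query_vector, 'vector') + 1.0",
                "params": {"query_vector": query_vector},
            },
        }
    },
}
resp = es.search(index="knowledge_current_reader", body=body)
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"])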

5. Problems encountered

5.1 Long response time

Because vector calculations are required, ES spends a lot of time and resources computing distances. For this reason, we carried out the following optimizations:

  1. Truncate decimal places of feature values:

    1. To preserve the representational power of the features, we did not reduce the dimensionality of the vectors output by the BERT-based framework

    2. After weighing access efficiency, data precision, and calculation speed, we truncate each vector component to a fixed number of decimal places (see the sketch after this list)

    3. In this way, although some precision is lost (approximately X%), the access and calculation time are greatly reduced (approximately Y%)

  2. Pre-analyze the intent and possible classifications of the query before searching:

    1. To reduce the amount of data involved in score calculation and ranking, we analyze the raw query content before assembling the ES query

    2. Combining user behavior tracking data with expert prior knowledge, we roughly classify the knowledge and match the query against these classifications with different weights (see the sketch after this list)

    3. This reduces recall (approximately X%), but increases accuracy (approximately Y%), and also improves some of the computational efficiency (approximately Z%)

  3. Simplify the scoring formula:

    1. Externalize part of the score calculation logic and keep the computation ES has to perform as simple as possible (see the sketch after this list)

    2. After recall, a variety of scoring strategies are applied, with their application and weights adjusted through configuration

    3. This reduces the ES response time (approximately X%), and tuning the external scoring formula also indirectly improves accuracy (approximately Y%)
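
A minimal sketch of the decimal-place truncation from optimization 1; the number of digits kept here is illustrative, not the precision we actually settled on.

def truncate_vector(vec, ndigits=5):
    # Round each vector component to a fixed number of decimal places before indexing,
    # trading a little precision for smaller storage and faster score computation
    return [round(x, ndigits) for x in vec]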
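
For optimization 2, the classifications predicted during query understanding can be turned into weighted term clauses in the bool query; the (category, weight) interface of the understanding module is a hypothetical example.

def category_should_clauses(predicted_categories):
    # predicted_categories: list of (category, weight) pairs from query understanding
    return [
        {"term": {"category": {"value": category, "boost": weight}}}
        for category, weight in predicted_categories
    ]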
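
For optimization 3, part of the scoring is moved outside of ES. The sketch below re-scores the recalled hits with configurable weights; the weight names and the linear combination are assumptions, not the exact online formula.

def rescore(hits, weights):
    # Combine the ES relevance score with per-document confidence outside of ES,
    # keeping the in-engine Painless script as simple as possible
    ranked = []
    for hit in hits:
        source = hit["_source"]
        score = (
            weights["es_score"] * hit["_score"]
            + weights["confidence"] * source.get("confidence", 0.0)
        )
        ranked.append((score, source))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked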

5.2 The quality of knowledge varies

Since knowledge items are extracted by algorithms, and knowledge has a certain shelf life, problems such as inaccurate knowledge can arise. For this reason, we carried out the following optimizations:

  1. Continuous algorithm iterations:

    1. Continuously iterate the models based on user behavior tracking data and annotation information

    2. Select higher-quality knowledge extraction results for full/incremental update of online data

    3. After X batches of iteration, the correctness of the knowledge improved from Z% to Y%

  2. Post-processing the knowledge output by the model

    1. Filter and merge knowledge items that differ only in certain auxiliary words

    2. Set expiration times for some popular knowledge items, and intervene in the production of knowledge items through partial manual review

    3. Mark and promote trusted knowledge by maintaining an expert knowledge base

    4. We maintained X categories and Y items of expert knowledge, and manually intervened in approximately Z% of the general knowledge items, which improved the correctness of the knowledge from K% to W%

Conclusion and Outlook

Based on our company's usage scenarios, this article has given a general description of a system built around the ES vector field (dense_vector), and discussed some common problems and their solutions.

At present, this solution supports the search features related to our knowledge base. Compared with the previous ngram solution based purely on entity recognition and matching, the overall accuracy and recall have improved by nearly double-digit percentages.

In the future, we will improve the response speed and stability of the entire system, and continue to iterate on the construction efficiency of the knowledge base and the accuracy of knowledge.

about the author

The mortal enemy wen, Elastic Certified Engineer, search architect, 10+ years of work experience, graduated from Fudan University.

Blog: https://blog.csdn.net/weixin_40601534

Github:https://github.com/godlockin

Reprinted from: Engineering Practice of Elasticsearch Vector Search

 


Origin: blog.csdn.net/yangbindxj/article/details/123911972