1. Background
As a search engine company, we rely heavily on Elasticsearch (ES) for tasks including article recall, data source partitioning, and entity and tag management, with good results.
Recently, we needed to model an industry knowledge base, which involves several recall and scoring methods such as entity matching, fuzzy search, and vector search. In the end, we chose the dense_vector feature introduced in ES 7.x (we finally settled on 7.10) to help meet these needs.
2. Technical selection
2.1 Solution Requirements
- Support for vector search
- Support for multi-dimensional filtering
- Throughput
- Learning and usage cost
- Operations and maintenance cost
2.2 Usage Scenario Design
- Offline data preparation
  - After the offline data is constructed, it is stored in the engine
  - The engine indexes each field of the data
- Online data recall
  - Recall data with queries constructed from the query-understanding results
  - Filter the results
  - Sort the results by score
2.3 Data Structure Design
After determining the usage scenario, we decided that the data structure would roughly contain the following fields:
- Unique id: used for deduplication and fast retrieval of knowledge items
- Entity, attribute, value: describe the specific content of a knowledge item
- Confidence: describes the credibility of a knowledge item
- Classification flags: the main category of the knowledge, recommended categories, etc.
- Vector representation: the basis for knowledge similarity, relevance recall, and scoring
- Ref information: source information used to trace back where the knowledge was parsed/obtained
- Other properties: general supporting fields such as valid, delete, and modification time
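As a sketch, the fields above could map to a record like the following (the field names and sample values are illustrative, not the exact production schema):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class KnowledgeItem:
    """One knowledge-base entry as described above (illustrative field names)."""
    kid: str                      # unique id, for dedup and fast lookup
    knowledge: str                # entity
    attribute: str                # attribute of the entity
    value: str                    # value of the attribute
    confidence: float             # credibility of this item
    category: List[str] = field(default_factory=list)  # classification flags
    vector: List[float] = field(default_factory=list)  # embedding for similarity recall
    ref: str = ""                 # source info for tracing the item back
    available: bool = True        # general supporting flags
    deleted: bool = False

item = KnowledgeItem(kid="k_001", knowledge="Elasticsearch",
                     attribute="initial_release", value="2010",
                     confidence=0.92, vector=[0.0] * 512)
```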
2.4 Solution Comparison
To support the above requirements, we compared several solutions, including ES, Faiss, Milvus, OpenDistro Elasticsearch KNN, and SPTAG. Among them, Faiss and SPTAG are only core algorithm libraries and would require additional development to package them as services; Milvus 1.x can only store an id and a vector, which cannot fully meet our requirements. Considering cluster stability and maintainability, we prefer native functionality over plugin-based deployment, so we chose ES's native vector search as our final solution.
Comparison reference:
| Type | Implementation language | Client support | Multi-condition recall | Learning cost | Introduction cost | O&M cost | Distributed | Performance | Community | Remarks |
|---|---|---|---|---|---|---|---|---|---|---|
| Elasticsearch | Java | Java / Python | Yes | Low | Low | Medium | Yes | Medium | Active | Native feature |
| Faiss | Python | Python | No | Medium | High | High | No | High | Average | Needs secondary development |
| Milvus | Python + Go | Python / Java / Go | No | Medium | Medium | Medium | No | High | Average | 1.x not fully functional |
| OpenDistro Elasticsearch KNN | Java + C++ | Java / Python | Yes | Medium | Medium | Medium | Yes | Medium | Average | Built-in plugin |
| SPTAG | C++ | Python + C# | No | High | Medium | Medium | No | High | Average | Needs secondary development |
3. Data flow process
3.1 Offline data processing part
- Collect data from multiple data sources
- Clean and preprocess the data
- Extract knowledge items with the algorithm engine
- Convert the knowledge items into vectors with the algorithm engine
- Store the basic information of the knowledge together with the vector data in ES
3.2 Online Data Recall Section
- Receive search criteria from the front end
- Parse the retrieval conditions with the query-understanding module
- Search in ES
- Adjust the scores of the results
- Return the results to the front end
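The online steps above can be sketched as a small pipeline; the helper names (`understand_query`, `es_search`, `rescore`) are hypothetical stand-ins for the real modules, and the search backend is stubbed out:

```python
def understand_query(raw_query):
    """Hypothetical query-understanding step: strip noise, guess categories."""
    cleaned = raw_query.strip().lower()
    categories = ["type_1"] if "elastic" in cleaned else []
    return {"text": cleaned, "categories": categories}

def es_search(parsed, search_fn):
    """Build filter conditions from the parsed query and delegate to ES.

    `search_fn` stands in for the actual Elasticsearch client call."""
    filters = [{"term": {"del": 0}}, {"term": {"available": 1}}]
    for cat in parsed["categories"]:
        filters.append({"term": {"category": cat}})
    return search_fn({"query": {"bool": {"filter": filters}}})

def rescore(hits):
    """Adjust scores (here: by confidence) and sort before returning to the front end."""
    for h in hits:
        h["score"] *= h.get("confidence", 1.0)
    return sorted(hits, key=lambda h: h["score"], reverse=True)

# Stubbed end-to-end run with a fake search backend:
fake_hits = [{"kid": "a", "score": 1.0, "confidence": 0.5},
             {"kid": "b", "score": 0.8, "confidence": 0.9}]
parsed = understand_query("Elastic search tips")
results = rescore(es_search(parsed, lambda body: fake_hits))
```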
4. Example of using ES vector search
4.1 Index Design
Settings:
```json
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2,
    "index": {
      "routing": {
        "allocation": {
          "require": {
            "node_group": "hot" // 1)
          }
        }
      },
      "store": {
        "preload": [ // 2)
          "knowledge",
          "category",
          "available",
          "confidence",
          "del",
          "kid"
        ]
      },
      "search": {
        "slowlog": {
          "threshold": {
            "query": {
              "warn": "1s" // 3)
            },
            "fetch": {
              "warn": "1s" // 3)
            }
          }
        }
      },
      "translog": {
        "flush_threshold_size": "512mb", // 4)
        "sync_interval": "5m", // 4)
        "durability": "async" // 4)
      },
      "sort": {
        "field": [ // 5)
          "kid",
          "confidence"
        ],
        "order": [ // 5)
          "asc",
          "desc"
        ]
      }
    }
  }
}
```
- Notes:
  - 1) Because the vector data is large, we prefer to place the entire index on nodes with better hardware.
  - 2) To support high-performance filtering, frequently used fields are preloaded into memory.
  - 3) Slow-query logging is enabled to make later performance investigations easier.
  - 4) The knowledge base is rebuilt offline, with heavy writing during updates, so the translog commit interval is lengthened to speed up writes.
  - 5) In practice, kid is an auto-increment id, and items may also be sorted by confidence, so these two fields are used as the index sort fields.
Mapping:
```json
{
  "mappings": {
    "properties": {
      "kid": {
        "type": "keyword"
      },
      "knowledge": {
        "type": "keyword"
      },
      "knowledge_phrase": { // 1)
        "type": "text",
        "analyzer": "faraday"
      },
      "attribute": { // 1)
        "type": "keyword",
        "fields": {
          "phrase": {
            "type": "text",
            "analyzer": "faraday"
          }
        }
      },
      "value": { // 1)
        "type": "keyword",
        "fields": {
          "phrase": {
            "type": "text",
            "analyzer": "faraday"
          }
        }
      },
      "confidence": { // 2)
        "type": "double"
      },
      "category": {
        "type": "keyword"
      },
      "vector": { // 3)
        "type": "dense_vector",
        "dims": 512
      },
      "ref": {
        "type": "text",
        "index": false
      },
      "available": {
        "type": "keyword"
      },
      "del": {
        "type": "keyword"
      },
      "create_timestamp": {
        "type": "date",
        "format": "strict_date_hour_minute_second||yyyy-MM-dd HH:mm:ss"
      },
      "update_timestamp": {
        "type": "date",
        "format": "strict_date_hour_minute_second||yyyy-MM-dd HH:mm:ss"
      }
    }
  }
}
```
- Notes:
  - 1) Besides exact matches on knowledge items, fuzzy retrieval is also required; we use our self-developed faraday tokenizer to segment each part of a knowledge item.
  - 2) Some knowledge items are reviewed and maintained by experts, so different items are assigned different confidence levels.
  - 3) After preprocessing, the data is converted into a 512-dimension vector and stored in this field.
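Putting the settings and mapping together, the create-index request body could be assembled as in the following sketch (a trimmed version using plain dicts; the `elasticsearch` client call is shown but not executed here, and the index name is illustrative):

```python
# Trimmed settings/mapping mirroring the JSON above (not the full production config)
settings = {
    "number_of_shards": 3,
    "number_of_replicas": 2,
    "index": {"sort": {"field": ["kid", "confidence"], "order": ["asc", "desc"]}},
}
mappings = {
    "properties": {
        "kid": {"type": "keyword"},
        "knowledge": {"type": "keyword"},
        "confidence": {"type": "double"},
        "category": {"type": "keyword"},
        "vector": {"type": "dense_vector", "dims": 512},
    }
}
body = {"settings": settings, "mappings": mappings}

# Against a live cluster this would be, roughly:
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")
# es.indices.create(index="knowledge_current", body=body)
```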
4.2 Data flow
- Offline part:
  - Data collection and cleaning
  - Model A finds knowledge items in articles
  - Model B converts knowledge items into vectors
    - Model A and Model B are self-developed models, built with algorithms including knowledge density calculation and BERT, on frameworks such as TensorFlow
  - Insert core content such as the original text and knowledge items into the database
  - Assemble the core knowledge content, vectors, etc. into retrieval units and insert them into ES
  - A team of experts reviews, revises, and iterates on the knowledge items in the database
  - The algorithm team iterates the models in the data pipeline based on knowledge-item updates and other annotations, and updates the online knowledge base
- Online part:
  - After the front end receives a request, the query-understanding component is called for analysis
  - After invalid content is removed and the classification information and other intents are extracted from the query, the recall vector and related filter conditions are constructed
  - The knowledge base is filtered through the combined ES query, and the results are adjusted by confidence, etc.
  - The scores of the recalled results are adjusted and sorted under different strategies, and finally output to the front end
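The "assemble retrieval units and insert them into ES" step might look like the following sketch; the index name and item fields are illustrative, and the actual bulk call is left commented out:

```python
def to_action(item, index_name="knowledge_current"):
    """Turn one knowledge item (a dict) into a bulk-index action for ES."""
    return {
        "_op_type": "index",
        "_index": index_name,
        "_id": item["kid"],      # kid doubles as the document id for dedup
        "_source": item,
    }

items = [
    {"kid": "k_001", "knowledge": "ES", "confidence": 0.9, "vector": [0.1] * 512},
    {"kid": "k_002", "knowledge": "Faiss", "confidence": 0.8, "vector": [0.2] * 512},
]
actions = [to_action(i) for i in items]

# With a running cluster, the insert itself would be along the lines of:
# from elasticsearch import Elasticsearch, helpers
# helpers.bulk(Elasticsearch("http://localhost:9200"), actions)
```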
4.3 Example query
```json
POST knowledge_current_reader/_search
{
  "query": {
    "script_score": {
      "query": {
        "bool": {
          "filter": [
            { "term": { "del": 0 } },
            { "term": { "available": 1 } }
          ],
          "must": {
            "bool": {
              "should": [
                {
                  "term": {
                    "category": {
                      "value": "type_1",
                      "boost": 10
                    }
                  }
                },
                {
                  "term": {
                    "category": {
                      "value": "type_2",
                      "boost": 5
                    }
                  }
                }
              ]
            }
          },
          "should": [
            {
              "match_phrase": {
                "knowledge_phrase": {
                  "query": "some_query",
                  "boost": 10
                }
              }
            },
            {
              "match": {
                "attribute": {
                  "query": "some_query",
                  "boost": 5
                }
              }
            },
            {
              "match": {
                "value": {
                  "query": "some_query",
                  "boost": 5
                }
              }
            },
            {
              "term": {
                "knowledge": {
                  "value": "some_query",
                  "boost": 30
                }
              }
            },
            {
              "term": {
                "attribute": {
                  "value": "some_query",
                  "boost": 15
                }
              }
            },
            {
              "term": {
                "value": {
                  "value": "some_query",
                  "boost": 10
                }
              }
            }
          ]
        }
      },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'vector') + sigmoid(1, Math.E, _score) + (1 / Math.log(doc['confidence'].value))",
        "params": {
          "query_vector": [ ... ]
        }
      }
    }
  }
}
```
- Notes:
  - The query conditions and parameters above are for illustration only; they are a desensitized and simplified version of what actually runs online.
  - The scoring formula is one version from an ongoing iteration; subsequent adjustments and upgrades are not reflected here.
  - Boundary conditions and null values are handled in an auxiliary service and pipeline, which simplifies part of the boundary-handling and validation logic here.
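For reference, the painless script above can be reproduced in plain Python to sanity-check scores offline. This is a sketch under the assumption that painless's `sigmoid(value, k, a)` computes `value^a / (k^a + value^a)`; the vectors, score, and confidence below are sample inputs:

```python
import math

def cosine_similarity(a, b):
    """Equivalent of painless cosineSimilarity(params.query_vector, 'vector')."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def sigmoid(value, k, a):
    """Assumed painless form: sigmoid(value, k, a) = value^a / (k^a + value^a)."""
    return value ** a / (k ** a + value ** a)

def final_score(query_vector, doc_vector, bm25_score, confidence):
    """cosineSimilarity(...) + sigmoid(1, Math.E, _score) + 1 / log(confidence)."""
    return (cosine_similarity(query_vector, doc_vector)
            + sigmoid(1.0, math.e, bm25_score)
            + 1.0 / math.log(confidence))
```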
5. Problems encountered
5.1 Long response time
Because vector calculations are required, ES takes considerable time and resources to compute distances. For this reason, we carried out the following optimizations:
- Truncate the decimal places of feature values:
  - To preserve the representational power of the features, we did not change the dimensionality of the vectors output by the BERT framework
  - After weighing access efficiency, data precision, and computation speed, we truncated the precision of each dimension from 16 decimal places to 5
  - Although some precision is lost (approximately X%), access and computation time are greatly reduced (approximately Y%)
- Pre-analyze the query's intent and possible classifications before retrieval:
  - To reduce the amount of data included in the ranking computation, we analyze the raw query before assembling the ES query
  - Combining user-behavior tracking data with experts' prior knowledge, we roughly classify the knowledge and match the query's classification with different weights
  - This reduces recall (approximately X%) but increases precision (approximately Y%), and also improves part of the computational efficiency (approximately Z%)
- Simplify the scoring formula:
  - Externalize part of the score-calculation logic and simplify the computation ES has to perform as much as possible
  - After recall, multiple scoring strategies are applied, with application and weight adjustments controlled through configuration
  - This reduces ES response time (approximately X%), and tuning the external scoring formula indirectly improves precision (approximately Y%)
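The truncation step above can be sketched as follows; the 16-to-5 decimal figures follow the text, while the vector values themselves are made-up samples:

```python
def truncate_vector(vec, places=5):
    """Cut each dimension to `places` decimals to shrink storage and speed up math."""
    factor = 10 ** places
    # int() truncates toward zero, matching a hard precision cut rather than rounding
    return [int(x * factor) / factor for x in vec]

raw = [0.1234567890123456, -0.9876543210987654]  # ~16 decimal places, as from BERT
cut = truncate_vector(raw)                        # 5 decimal places
```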
5.2 Uneven knowledge quality
Since knowledge items are extracted by algorithms and knowledge has a certain timeliness, items may be inaccurate. For this reason, we carried out the following optimizations:
- Continuous algorithm iteration:
  - Continuously iterate the models based on user tracking data and annotation information
  - Select higher-quality knowledge-extraction results for full/incremental updates of the online data
  - After X batches of iterations, the correctness of the knowledge improved by Y% from Z%
- Post-process the knowledge output by the models:
  - Filter and merge knowledge items that differ only in some auxiliary words (such as "的")
  - Set expiration times for some popular knowledge items, and intervene in the production of knowledge items through partial manual review
  - Mark and promote trusted knowledge by maintaining an expert knowledge base
  - We maintained X categories and Y items of expert knowledge, and manually intervened in Z% of general knowledge items, improving the correctness of the knowledge by W% from K%
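The auxiliary-word merge in the first post-processing step can be sketched like this; the particle list, normalization rule, and sample items are all illustrative:

```python
PARTICLES = ("的", "了", "之")  # illustrative auxiliary words to ignore

def normalize(text):
    """Drop auxiliary words so near-duplicate items collapse to one key."""
    for p in PARTICLES:
        text = text.replace(p, "")
    return text

def merge_items(items):
    """Keep only the highest-confidence item per normalized knowledge string."""
    best = {}
    for item in items:
        key = normalize(item["knowledge"])
        if key not in best or item["confidence"] > best[key]["confidence"]:
            best[key] = item
    return list(best.values())

items = [
    {"knowledge": "上海的天气", "confidence": 0.8},
    {"knowledge": "上海天气", "confidence": 0.9},   # same item modulo "的"
    {"knowledge": "北京天气", "confidence": 0.7},
]
merged = merge_items(items)
```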
6. Conclusion and Outlook
Based on our company's usage scenarios, this article has given a general description of a system built around the ES vector field (dense_vector) and discussed some common problems and their solutions.
At present, this solution supports the search functionality of our knowledge base. Compared with the previous solution based purely on ngram entity recognition and matching, overall precision and recall have improved by nearly double-digit percentages.
In the future, we will improve the response speed and stability of the entire system, and continue to iterate on the construction efficiency of the knowledge base and the accuracy of its knowledge.
About the author
The Mortal Enemy Wen, Elastic Certified Engineer, search architect, 10+ years of work experience, graduated from Fudan University.
Blog: https://blog.csdn.net/weixin_40601534
Github:https://github.com/godlockin
Reprinted from: Dry Goods | Engineering Practice of Elasticsearch Vector Search