Alibaba Cloud OpenSearch launches an LLM question-and-answer search product to help enterprises efficiently build conversational search services

1. Enterprise-specific Q&A search

1.1. World knowledge vs company-specific knowledge

ChatGPT and Tongyi Qianwen are leading a revolution in search technology, and the key to their apparent ability to "understand everything and talk about anything" is the world knowledge compressed into the underlying large language model (LLM). But no matter how powerful an LLM is, the amount of knowledge it can compress is still limited.

The questions in the screenshot below concern Alibaba's internal technology products, which are company-exclusive knowledge. Even the answers given by the powerful ChatGPT model are completely wrong and irrelevant.

In response to this problem, OpenAI proposed chatgpt-retrieval-plugin and WebGPT, and the open-source community has produced solutions such as DocsGPT, ChatPDF, and langchain-based retrieval-augmented chatbots, which is enough to show how strong the industry demand is for combining LLMs with personal and enterprise-specific data.

1.2. Retrieval augmentation for LLMs

Drawing on years of search practice, the OpenSearch team proposes retrieval augmentation for LLMs, providing users with a one-stop SaaS industry Q&A search solution.

For a query entered by the user, feeding it into the LLM together with the results retrieved from the business data yields a more accurate answer.

For example:

Query: What is Ali's TPP platform

The result retrieved from the enterprise's internal documents is as follows:

TPP is Ali's personalized algorithm development platform. Relying on Ali's AI OS engine (features, recall, scoring, etc.), it provides serverless online service capabilities for many personalized services (search, recommendation, advertising, etc.). Users write business code on the TPP platform, run A/B experiments, and serve external traffic without caring about machine resources, application deployment structure, or writing a service framework. On the TPP product page, the entire life cycle of business code can be managed, including compilation, debugging, release and launch, monitoring and alerting, and troubleshooting. Combined with the AI OS engine suite interfaces and a high-performance graphical development framework, users only need to implement their own business logic to obtain stable, high-performance personalized online services.

After the search result is fed into the model as part of the prompt, the model gives a more precise and concise answer:
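The flow above can be sketched as a simple prompt-assembly helper. The function and prompt wording here are illustrative assumptions, not the actual OpenSearch implementation:

```python
def build_prompt(query: str, passages: list[str]) -> str:
    """Compose the retrieved passages and the user query into one prompt."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the reference passages below. "
        "If they do not contain the answer, say you don't know.\n\n"
        f"References:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

passages = ["TPP is Ali's personalized algorithm development platform, "
            "providing serverless online service capabilities."]
prompt = build_prompt("What is Ali's TPP platform", passages)
```

The assembled prompt is then sent to the LLM in place of the bare query.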

For retrieval-augmented LLMs, the following two points need special attention:

  1. Helpfulness: the generated answer should summarize the parts of the retrieved results that are most relevant to the query.
  2. Harmlessness: the generated answer must not fabricate content beyond the retrieved results, since wrong information misleads users.

For this scenario, OpenSearch Intelligent Q&A Edition fine-tunes the large model in advance and adjusts model parameters and prompt format in a targeted way, ensuring the accuracy and reliability of the Q&A results as far as possible.

2. Technical implementation

2.1. System Architecture

The system architecture of OpenSearch Intelligent Q&A Edition consists of three parts: business data processing, large model pre-training, and the online Q&A search service.

2.1.1. Business data processing

Compared with a traditional search engine, the biggest change in the offline data pipeline of OpenSearch Intelligent Q&A Edition lies in how business data is processed:

  1. A traditional search engine's data source is structured text; here the input is often unstructured text in more diverse formats (HTML, Markdown, plain text, PDF, etc.)
  2. A traditional search engine builds its index on each document's unique primary key; here, because data sources differ, a document must first be split into paragraphs, and a new paragraph primary key is generated for each split paragraph
  3. A traditional search engine matches content with text indexes; here vector indexes are used, which adapt more easily to rich data formats and long-text search
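Points 1 and 2 above can be sketched as follows: split unstructured text into paragraphs and derive a new paragraph primary key from the document id and position. This is a minimal illustration under assumed conventions, not the actual splitting model (described in section 2.2.1):

```python
import hashlib

def split_into_paragraphs(doc_id: str, text: str) -> list[dict]:
    """Split a document on blank lines and give each paragraph its own
    primary key, since the original document key no longer identifies
    a single indexable unit."""
    parts = [p.strip() for p in text.split("\n\n") if p.strip()]
    records = []
    for i, part in enumerate(parts, start=1):
        # Derive a stable paragraph key from the document id and position.
        pk = hashlib.md5(f"{doc_id}#{i}".encode("utf-8")).hexdigest()
        records.append({"paragraph_id": pk, "doc_id": doc_id,
                        "index": i, "text": part})
    return records

records = split_into_paragraphs("doc-1", "First paragraph.\n\nSecond paragraph.")
```

Each record can then be embedded and written to the vector index under its paragraph key.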

2.1.2. Online Services

Compared with a traditional search engine, the online service architecture of OpenSearch Intelligent Q&A Edition changes considerably. The main differences are:

  1. Traditional search generally returns more than 10 results and often supports paging queries. Here the goal is to find the most relevant paragraphs: the N in Top N should not be large (generally within 3), and relevance must be controlled so that paragraphs of low relevance are not recalled to mislead the model
  2. After retrieval completes and the Top N results are obtained, the results are added to the prompt and fed into the large model. This stage is generally time-consuming, so OpenSearch Intelligent Q&A Edition supports streaming output to ease the long-wait experience
  3. When results are returned, the relevant search results for the given Query and the Q&A answer generated by the model are both output through the API, based on the user's business data
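Points 1 and 2 above can be sketched as capping Top N at 3, dropping low-relevance passages via a score threshold, and streaming the answer in chunks. The threshold value and chunk size below are illustrative assumptions:

```python
def retrieve_top_n(scored_passages, n=3, min_score=0.5):
    """Keep at most n passages and drop low-relevance ones outright,
    rather than padding the prompt with misleading context."""
    kept = [(s, p) for s, p in scored_passages if s >= min_score]
    kept.sort(key=lambda sp: sp[0], reverse=True)
    return [p for _, p in kept[:n]]

def stream_answer(answer, chunk_size=8):
    """Yield the generated answer in small chunks, simulating the
    streaming output that shortens perceived waiting time."""
    for i in range(0, len(answer), chunk_size):
        yield answer[i:i + chunk_size]

top = retrieve_top_n([(0.9, "p1"), (0.4, "p2"), (0.7, "p3"), (0.6, "p4")])
streamed = "".join(stream_answer("TPP is Ali's personalized algorithm platform."))
```

Note that the low-scoring passage is dropped entirely rather than filling the remaining Top N slot.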

2.2. Retrieval enhancement

2.2.1. Paragraph splitting model: ops-text-ace-001

"A clever woman can't cook without rice", under the framework of the retrieval-enhanced LLM, the final effect of the model is largely determined by the retrieval results given in the prompt.

A traditional document retrieval system only needs to return the list of documents most relevant to the Query; the specific information filtering and summarization are left to the user.

A retrieval-augmented LLM, by contrast, needs the specific paragraphs related to the query: a paragraph should neither drop semantic information nor run too long, and ideally carries one complete unit of meaning.

Weighing efficiency against effectiveness, the paragraph splitting model of OpenSearch Intelligent Q&A Edition has the following characteristics:

The final splitting effect can be seen in the following example:

Glossary
<a name="ucuiH"></a>
## Instance management
| **Name** | **Description** |
| --- | --- |
| Instance | An instance is a user's set of data configurations, including the data source structure, index structure, and other property settings. One instance is one search service. |
| Document | A document is a searchable structured data unit. A document contains one or more fields and must have a primary key field; OpenSearch identifies a unique document by its primary key value. If a primary key repeats, the document is overwritten. |
| Field | A field is a component unit of a document, consisting of a field name and field content. |
| Plugin | To process data during import, the system has several built-in data processing plugins, which can be selected when defining the application structure or configuring the data source. |
| Source data | Raw data, consisting of one or more source fields. |
| Source field | The smallest unit of source data, consisting of a field name and a field value; for the available data types, see [Application structure & index structure]. |
| Index | An index is a data structure used to speed up retrieval; one instance can create multiple indexes. |
| Combined index | Multiple fields of TEXT or SHORT_TEXT type can be configured into the same index as a combined index. For example, a forum search that needs both title-based search and combined title-and-body search can build a title_search index on title and a default combined index on title and body; querying title_search then gives title-based search, and querying default gives the combined search. |
| Index field | Used in the [query clause]; index fields must be defined to achieve high-performance retrieval and recall. |
| Attribute field | Used in the [filter], [sort], [aggregate], and [distinct] clauses to implement filtering, statistics, and similar functions. |
| Default display fields | Used for result display. The API parameter fetch_fields controls which fields each result returns; note that setting fetch_fields in the program overrides the default display field configuration, and when fetch_fields is not set, the default display fields apply. |
| Tokenization | Splits a document into terms: TEXT is split by retrieval unit, SHORT_TEXT by single character. For example, "浙江大学" (Zhejiang University) is split into two terms, "浙江" and "大学", as TEXT, and into four terms, "浙", "江", "大", "学", as SHORT_TEXT. |
| term | A token produced by tokenization is called a term. |
| Index building | After tokenization, the index is built so that documents can be located quickly for a query. The search engine builds two types of lists: inverted lists and forward lists. |
| Inverted list | A list of term-to-document mappings; the query clause uses this ordering for lookups. For example: term1->doc1,doc2,doc3; term2->doc1,doc2. |
| Forward list | A list of document-to-field mappings; the filter clause uses this ordering, which is slightly slower than the inverted list. For example: doc1->id,type,create_time. |
| Recall | Tokenizes the query keywords and quickly locates documents by looking the resulting terms up in the inverted lists. |
| Recall count | The number of documents recalled. |

<a name="aLREa"></a>
## 数据同步
| **名称** | **说明** |
| --- | --- |
| 数据源 | 数据来源,目前支持阿里云RDS、MaxCompute、PolarDB的数据同步。 |
| 索引重建 | 重新构建索引。在配置/修改应用结构、数据源后需要索引重建。 |

<a name="wuTSI"></a>
## 配额管理
| **名称** | **说明** |
| --- | --- |
| 文档容量 | 实例中各个表的总文档大小累加值(不考虑字段名,字段内容按照string来计算容量)。 |
| QPS | 每秒查询请求数。 |
| LCU | LCU(逻辑计算单元)是**衡量搜索计算能力的单位**,一个LCU代表搜索集群中1/100个核的计算能力。 |

{"text": "实例管理:名称:实例,说明:实例是用户的一套数据配置,包括数据源结构、索引结构及其它属性配置。一个实例即一个搜索服务。|名称:文档,说明:文档是可搜索的结构化数据单元。文档包含一个或多个字段,但必须有主键字段,OpenSearch通过主键值来确定唯一的文档。主键重复则文档会被覆盖。|名称:字段,说明:字段是文档的组成单元,包含字段名称和字段内容。|名称:插件,说明:为了在导入过程中进行一些数据处理,系统内置了若干数据处理插件,可以在定义应用结构或者配置数据源时选择。|名称:源数据,说明:原始数据,包含一个或多个源字段。|名称:源字段,说明:组成源数据的最小单元,包含字段名称和字段值,可选数据类型请参见应用结构&索引结构。|名称:索引,说明:索引是用于加速检索速度的数据结构,一个实例可以创建多个索引。|名称:组合索引,说明:可将多个TEXT或SHORT_TEXT文本类型的字段配置到同一个索引,用来做组合索引。如一个论坛搜索,需要提供基于标题(title)的搜索及基于标题(title)和内容(body)的综合搜索,那么可以将title建立title_search索引,将title和body建立default组合索引。那么,在title_search上查询即可实现基于标题的搜索,在default上查询即可实现基于标题和内容的综合搜索。|名称:索引字段,说明:在query子句中使用,需要定义索引字段,通过索引字段来做高性能的检索召回。|名称:属性字段,说明:在filter、sort、aggregate、distinct子句使用,用来实现过滤、统计等功能。|名称:默认展示字段,说明:用来做结果展示。可以通过API参数fetch_fields来控制每次结果的返回字段,需注意在程序中配置fetch_fields该参数后会覆盖默认展示字段配置,以程序中的fetch_fields设置为主;若程序中不设置fetch_fields参数则以默认展示字段为主。", "index": 1, "source": {"title": "名词解释", "url": "url"}}
{"text": "实例管理:名称:分词,说明:对文档进行词组切分,TEXT类型按检索单元切分,SHORT_TEXT按单字切分。如“浙江大学”,TEXT类型会切分成2个词组:“浙江”、“大学”SHORT_TEXT会切分成4个词组:“浙”、“江”、“大”、“学”。|名称:term,说明:分词后的词组称为term。|名称:构建索引,说明:分词后会进行索引构建,以便根据查询请求,快速定位到文档。搜索引擎会构建出两种类型的链表:倒排和正排链表。|名称:倒排,说明:词组到文档的对应关系组成的链表,query子句采用这种排序方式进行查询。例如:term1->doc1,doc2,doc3;term2->doc1,doc2。|名称:正排,说明:文档到字段对应关系组成的链表,filter子句采用这种排序方式,性能略慢于倒排。例如:doc1->id,type,create_time。|名称:召回,说明:通过查询的关键词进行分词,将分词后的词组通过查找倒排链表快速定位到文档。|名称:召回量,说明:召回得到的文档数为召回量。", "index": 2, "source": {"title": "名词解释", "url": "url"}}
{"text": "数据同步:名称:数据源,说明:数据来源,目前支持阿里云RDS、MaxCompute、PolarDB的数据同步。|名称:索引重建,说明:重新构建索引。在配置/修改应用结构、数据源后需要索引重建。配额管理:名称:文档容量,说明:实例中各个表的总文档大小累加值(不考虑字段名,字段内容按照string来计算容量)。|名称:QPS,说明:每秒查询请求数。|名称:LCU,说明:LCU(逻辑计算单元)是衡量搜索计算能力的单位,一个LCU代表搜索集群中1/100个核的计算能力。", "index": 3, "source": {"title": "名词解释", "url": "url"}}

2.2.2. Text vectorization model: ops-text-embedding-001

Compared with traditional search, a major change in interacting with an LLM is that users can type very natural language rather than the keywords of traditional search. For natural-language input, a semantics-based vector retrieval architecture is a natural fit.

Driven by the wave of large models, semantic vector models built on large-model bases have also brought considerable change to the retrieval field. On the Massive Text Embedding Benchmark (MTEB), the series of large-model-based semantic vector models represented by OpenAI's text-embedding-ada-002 shows an epoch-making improvement on retrieval tasks.

To better fit multi-language, multi-industry Q&A search scenarios, the OpenSearch algorithm team carried out customized training and effect optimization on top of its self-developed semantic vector large model, and optimized model efficiency to meet the real-time requirements of search, finally producing the ops-text-embedding-001 model. We verified it on the Chinese dataset Multi-CPR: on key metrics such as retrieval relevance MRR@10, it outperforms OpenAI's text-embedding-ada-002 overall:

| Industry | Model | MRR@10 |
| --- | --- | --- |
| E-commerce | text-embedding-ada-002 | 0.386 |
| E-commerce | ops-text-embedding-001 | 0.429 |
| Entertainment | text-embedding-ada-002 | 0.346 |
| Entertainment | ops-text-embedding-001 | 0.411 |
| Medical | text-embedding-ada-002 | 0.355 |
| Medical | ops-text-embedding-001 | 0.310 |
| Overall | text-embedding-ada-002 | 0.362 |
| Overall | ops-text-embedding-001 | 0.383 |
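For reference, MRR@10, the metric in the table, averages the reciprocal rank of the first relevant result within the top 10 for each query; a minimal implementation:

```python
def mrr_at_10(rankings: list[list[str]], relevant: list[str]) -> float:
    """Mean Reciprocal Rank@10: for each query, score 1/rank of the first
    relevant hit within the top 10 results (0 if absent), then average."""
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranked[:10], start=1):
            if doc_id == rel:
                total += 1.0 / rank
                break
    return total / len(rankings)

# Two queries: first relevant hit at rank 2 and rank 3 respectively.
score = mrr_at_10([["a", "b", "c"], ["x", "y", "z"]], ["b", "z"])
```

A higher MRR@10 means relevant passages surface nearer the top of the ranking.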

In addition to its advantage in Chinese retrieval, the text vectorization model of OpenSearch Intelligent Q&A Edition has the following characteristics:

For vector retrieval, OpenSearch Intelligent Q&A Edition has a built-in, self-developed high-performance vector retrieval engine that is well suited to the higher vector dimensions of large-model scenarios.

Compared with open-source vector search engines, OpenSearch is better suited to intelligent search scenarios and achieves several times their search performance with a higher recall rate.
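Conceptually, vector retrieval ranks paragraphs by the similarity between the query vector and the indexed vectors. The brute-force cosine-similarity scan below illustrates the idea only; a production engine such as OpenSearch's replaces this linear scan with approximate nearest-neighbour indexes:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn(query_vec: list[float], index: dict, k: int = 3) -> list[str]:
    """Return the ids of the k indexed vectors most similar to the query."""
    ranked = sorted(index, key=lambda doc_id: cosine(query_vec, index[doc_id]),
                    reverse=True)
    return ranked[:k]

index = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.7, 0.7]}
nearest = knn([1.0, 0.1], index, k=2)
```

The linear scan is O(number of vectors) per query, which is exactly the cost an ANN index amortizes away.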

2.2.3. Image vectorization model: ops-image-embedding-001

For content industries, especially product documents and articles, a large amount of key information is presented as images, and multi-modal display combining images and text can greatly improve the search experience of enterprise-specific intelligent Q&A.

For example, the documentation on how to access OpenSearch products includes a corresponding access flow chart, shown below, which lets users understand the process more intuitively.

To realize this image search capability, the image retrieval model of OpenSearch Intelligent Q&A Edition has the following characteristics:

The model combines multi-modal information to compute the image-text correlation between the query and the images in a document, and finally returns the image with the highest correlation as the reference image result.
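Assuming the multi-modal model supplies embeddings for the query and for each image in the document (the vectors and file names below are made-up stand-ins), returning the reference image reduces to picking the highest-scoring one:

```python
def best_reference_image(query_vec: list[float], image_vecs: dict) -> str:
    """Score each document image against the query by inner product and
    return the name of the most correlated image."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return max(image_vecs, key=lambda name: dot(query_vec, image_vecs[name]))

images = {"access-flow.png": [0.9, 0.1], "logo.png": [0.1, 0.2]}
picked = best_reference_image([1.0, 0.0], images)
```

In practice a minimum-correlation threshold would also apply, so that no image is returned when none is actually related to the query.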

2.3. Large model (LLM)

2.3.1. Model training

With the LLM base in place, to improve the model's helpfulness and reduce its harmfulness in the retrieval-augmented scenario, OpenSearch Intelligent Q&A Edition also applies supervised fine-tuning (SFT) to the model to further strengthen its retrieval-augmentation capability.

Specifically, a carefully constructed retrieval-augmented SFT dataset is used: each prompt is composed of a Query together with the paragraphs returned for it by the retrieval system, and the Answer is the target output for the model.

Comparing the Q&A search model after SFT with the original LLM, the fine-tuned model is better at summarizing the content of the input documents, so it answers user questions accurately and concisely and achieves the intended intelligent Q&A search effect.
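The dataset construction described above can be sketched as a single record builder; the field names and prompt wording are illustrative assumptions, not the actual SFT format:

```python
def make_sft_record(query: str, passages: list[str], answer: str) -> dict:
    """One supervised fine-tuning example: the prompt pairs the query with
    the passages the retrieval system returned; the answer is the target."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = f"References:\n{context}\n\nQuestion: {query}\nAnswer:"
    return {"prompt": prompt, "answer": answer}

record = make_sft_record(
    "What is Ali's TPP platform",
    ["TPP is Ali's personalized algorithm development platform."],
    "TPP is Alibaba's serverless platform for personalized algorithm development.",
)
```

Training on many such records teaches the model to ground its answers in the supplied passages rather than in its parametric memory.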

3. Product capability

3.1. Demonstration of Q&A search effect

To demonstrate the effect of OpenSearch Intelligent Q&A Edition, we used Alibaba Cloud product documentation as the business data and built a Q&A search system on top of it.

The following is a demonstration of the question and answer search effect:

In the demonstration above, you only need to import the corresponding document data into OpenSearch Intelligent Q&A Edition; for each user Query, it returns the answer generated by the Q&A model together with the corresponding reference links, realizing an intelligent Q&A search experience.

Built on OpenSearch Intelligent Q&A Edition, the product-documentation Q&A search system achieves a Q&A accuracy of more than 70%, over 10% higher than the original system, while greatly reducing manual maintenance costs.

3.2. Knowledge Base Intervention

In addition, for the high-frequency Query intervention and effect-tuning needs common in search, OpenSearch Intelligent Q&A Edition supports manual intervention through a knowledge base. Users can specify intervention questions and their answers; the system then recognizes similar questions and returns the corresponding preset answers from the knowledge base, enabling operational intervention for specified queries, campaigns, and other scenarios.

For example, because no relevant content for the following question can be found in the documentation, the system's generated result fails to answer it:

With knowledge-base intervention, the same question is automatically matched by semantic similarity to a preset question in the knowledge base, and the corresponding intervention result is returned:
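A toy version of this matching, using token overlap (Jaccard similarity) as a stand-in for the actual semantic similarity model, with an illustrative threshold:

```python
def kb_answer(query: str, kb: dict, threshold: float = 0.5):
    """Return the preset answer whose question is most similar to the query,
    or None when nothing in the knowledge base is similar enough."""
    def similarity(a: str, b: str) -> float:
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / len(sa | sb)
    best = max(kb, key=lambda q: similarity(query, q))
    return kb[best] if similarity(query, best) >= threshold else None

kb = {"how do i apply for the invitation test":
      "Apply for the invitation test on the product console page."}
answer = kb_answer("how do i apply for the invitation test", kb)
```

When no preset question clears the threshold, the system falls back to the normal retrieval-augmented generation path.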

3.3. Application scenarios

  • Enterprise internal search: internal information and document search, with generation of answers from the relevant content
  • Content search: content communities, education question search, and similar scenarios, directly returning the answer and related content for a question
  • E-commerce and marketing: Q&A search around products, prices, and demands, answering customer questions more accurately and promptly

3.4. Use process

(1) Obtain the invitation test qualification

(2) Purchase an instance on the sales page of OpenSearch LLM Intelligent Q&A Edition

(3) Import business data through the data upload API

(4) Enter a Query on the console test page or through the API to obtain the corresponding Q&A search results

For more product details, see the product documentation.

4. Summary and planning

This article has introduced the technical implementation and capabilities of OpenSearch LLM Intelligent Q&A Edition. The invitation test is now open; users with enterprise-specific Q&A search needs can apply for the invitation test qualification.

In the future, OpenSearch LLM Intelligent Q&A Edition will launch more industry Q&A search functions and support more large models in search scenarios, so stay tuned.


This article is the original content of Alibaba Cloud and may not be reproduced without permission.


Origin blog.csdn.net/yunqiinsight/article/details/131073306