CodeFuse Open-Sources ModelCache, a Semantic Cache for Large Language Models

CodeFuse's open-source effort is in full swing! This release is ModelCache, a semantic cache for large language models that can significantly reduce the inference cost of LLM applications and improve the user experience.

CodeFuse-ModelCache project address:

https://github.com/codefuse-ai/CodeFuse-ModelCache

0 Background

In the context of the LLM technology wave sweeping the world, the rapidly growing parameter scale of large models poses great challenges for the inference resources needed to deploy them. To improve inference performance and efficiency, we approach the problem of large-scale LLM service deployment from the perspective of caching. As with traditional applications, user access to large models exhibits locality in time and space (for example, content related to trending topics or popular GitHub repositories). With a caching layer in place, similar requests no longer need to invoke the large model service: existing results can be returned to the user directly from the cache, which greatly reduces inference cost and improves the user experience.

1 The significance of caching for large models

Currently, large model services face the following three challenges:

  1. High cost: Large models have hundreds of billions of parameters, and a single instance requires multiple A10 GPUs, so large-scale deployment is expensive. As a result, current large model services are generally billed by the number of tokens processed, which makes usage costly on the user side.
  2. High latency: Inference speed is another critical issue. Many real-time applications, such as dialogue systems and business assistants, have strict response-time requirements, usually at the millisecond level, while large model inference is often on the order of seconds. Results therefore cannot be returned in real time and the user experience suffers.
  3. Unstable service: Because single-instance deployment is costly, current large model services resort to rate limiting under heavy traffic to avoid becoming unavailable.

2 Solution research

We investigated the open-source solution GPTCache, a project dedicated to building a semantic cache for storing LLM responses. It provides a semantic similarity matching framework together with fairly complete functional modules and interfaces, and it has the following advantages (a minimal usage sketch follows the list):

  • The project is active and constantly introduces new features, which lets us keep up with the latest developments.
  • Its functional modules are open and can be adjusted and optimized, which makes business extension easy.
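For reference, GPTCache's documented usage pattern wraps the OpenAI client so that a cache lookup happens before the model is called. The sketch below is based on GPTCache's public examples (not on ModelCache); the default configuration uses SQLite with exact matching, and a similarity-based cache requires additional setup.

```python
# Minimal GPTCache sketch, adapted from the project's public examples.
# Requires OPENAI_API_KEY in the environment; defaults to SQLite + exact match.
from gptcache import cache
from gptcache.adapter import openai  # drop-in wrapper around the OpenAI client

cache.init()            # initialize the cache layer
cache.set_openai_key()  # read the API key from the environment

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What is a semantic cache?"}],
)
# The adapter returns an OpenAI-style response; repeated similar requests are
# answered from the cache instead of calling the model again.
print(response["choices"][0]["message"]["content"])
```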

The overall architecture of GPTCache is shown in Figure 1:

Figure 1. GPTCache architecture

However, GPTCache still has a number of shortcomings in practical applications, including:

  1. Architecturally, large model calls and data write-backs are black-boxed from the user, which complicates streaming output, security audits, and troubleshooting for large model products.
  2. By default it uses FAISS and SQLite for storage, which cannot be deployed in a distributed manner; in particular, on the relational side the SQLAlchemy framework cannot support Ant's OceanBase.
  3. In terms of data and resource isolation, it cannot handle multi-model and multi-version scenarios.
  4. Multi-turn sessions are not supported, and compatibility is poor when the model carries system instructions. Further functional gaps are covered in detail in Section 3.2.

3 Building ModelCache

In response to the above problems, we carried out secondary development on top of GPTCache and built Ant's internal cache product, ModelCache. The overall architecture is shown in Figure 2. Below we introduce our work in detail, covering 3.1 Overall architecture and 3.2 Function upgrades; the function upgrade section describes the new features in ModelCache in detail.

3.1 Overall architecture

Figure 2. ModelCache architecture and its upstream and downstream systems

3.1.1 Technical architecture

In the initial architecture, large model calls and data write-backs were black-boxed from the user. This made troubleshooting cumbersome and made it difficult to meet enterprise-level requirements for streaming output and data security audits.

We therefore re-adjusted the architecture and adopted a lightweight access method that leaves the functionality of large model products undisturbed. ModelCache is designed as a Redis-like structure that exposes open APIs for data query, data write-back, data management, and more. It decouples the large model call and can be embedded into large model products as an independent module. Through ModelCache, the product side can manage and use large models more flexibly, improving the maintainability and scalability of the system. A minimal sketch of this decoupled query-then-write-back flow is shown below.
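The sketch below illustrates the flow under assumed endpoint and field names (they are illustrative, not the actual ModelCache API): the product queries the cache first, calls the LLM only on a miss, and then writes the answer back for future hits.

```python
# Hypothetical sketch of the decoupled query / write-back flow.
# The endpoint path and JSON fields are illustrative, not the actual ModelCache API.
import requests

CACHE_URL = "http://127.0.0.1:5000/modelcache"  # assumed local ModelCache service

def answer_with_cache(model: str, query: str, call_llm) -> str:
    # 1) Ask the cache first, scoped to one model.
    hit = requests.post(CACHE_URL, json={
        "type": "query", "scope": {"model": model}, "query": query,
    }).json()
    if hit.get("answer"):                     # illustrative response field
        return hit["answer"]

    # 2) On a miss, the product calls the LLM itself; ModelCache stays decoupled.
    answer = call_llm(query)

    # 3) Write the new query/answer pair back so similar future queries can hit.
    requests.post(CACHE_URL, json={
        "type": "insert", "scope": {"model": model},
        "chat_info": [{"query": query, "answer": answer}],
    })
    return answer
```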

3.1.2 Core modules

ModelCache contains a series of core modules, including adapter, embedding, rank, and data_manager. Their functions are as follows (a sketch of how they fit together appears after this list):

  1. adapter module: handles the business logic of the various tasks and chains the embedding, rank, data_manager, and other modules together.
  2. embedding module: converts text into a semantic vector representation; it turns user queries into vectors used for subsequent recall and storage.
  3. rank module: ranks and evaluates the similarity of recalled vectors; it can score the similarity of two vectors using L2 distance, cosine similarity, or an evaluation model.
  4. data_manager module: manages the databases, including the vector database and the relational database, and is responsible for storing, querying, updating, and deleting data.
    1. Vector database (Milvus): a high-performance, scalable, multi-purpose vector database suitable for a wide range of vector-retrieval scenarios.
    2. Relational database (OceanBase): we use Ant's OceanBase database to store user queries, LLM responses, model versions, and other information.
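The sketch below shows how an adapter might chain these modules for a single query. The class and method names are hypothetical; they only illustrate the responsibilities described above, not ModelCache's real interfaces.

```python
# Hypothetical sketch of the adapter chaining embedding -> data_manager -> rank.
class CacheAdapter:
    def __init__(self, embedding, data_manager, rank, threshold: float = 0.9):
        self.embedding = embedding        # text -> semantic vector
        self.data_manager = data_manager  # Milvus + OceanBase access
        self.rank = rank                  # similarity scoring of recalled vectors
        self.threshold = threshold        # minimum score to count as a hit

    def query(self, model: str, text: str):
        vec = self.embedding.encode(text)
        # Recall candidate vectors for this model from the vector database.
        candidates = self.data_manager.search(model, vec, top_k=5)
        for cand in candidates:
            # Score each candidate, e.g. by cosine similarity or an evaluation model.
            if self.rank.score(vec, cand.vector) >= self.threshold:
                # Fetch the stored LLM response from the relational database.
                return self.data_manager.get_answer(model, cand.id)
        return None  # cache miss: the caller invokes the LLM and writes back
```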

3.1.3 Function comparison

In terms of functionality, local embedding inference was added to avoid Hugging Face network problems and improve inference speed. Given some limitations of the SQLAlchemy framework, we rewrote the relational database interaction module so that database operations can be implemented more flexibly. In practice, large model products need to serve multiple users and multiple models, so ModelCache adds multi-tenancy support and is also initially compatible with system instructions and multi-turn sessions. See Table 1 for a more detailed feature comparison.

Table 1. Comparison of function points between ModelCache and GPTCache

3.2 Function upgrade

To bring the cache product to enterprise-level users and deliver real results, we have iterated extensively on its functionality. The core functions are shown in Figure 3.

Figure 3. ModelCache core functions

3.2.1 Data management

A cache needs to ensure that outdated or unnecessary data does not accumulate, so cache management is a key part of the product. To this end, we implemented two important capabilities (a hypothetical usage sketch follows this list):

  • One-click clearing: ModelCache provides a data removal interface that lets users clear their cache with one click. This ensures that when the model version or parameters change, data from the previous version does not interfere with online answers.
  • Cache eviction policy: ModelCache supports configurable cache eviction policies, allowing users to tailor the cache to their own needs.
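A hypothetical sketch of the one-click clearing call (the endpoint and fields are illustrative, not the actual interface): clearing is scoped to a model, so an upgraded model version starts from an empty cache.

```python
# Hypothetical sketch: clear all cached entries for one model with a single call.
# The endpoint path and JSON fields are illustrative, not the actual ModelCache API.
import requests

CACHE_URL = "http://127.0.0.1:5000/modelcache"  # assumed local ModelCache service

def clear_model_cache(model: str) -> None:
    resp = requests.post(CACHE_URL, json={
        "type": "remove",
        "scope": {"model": model},
        "remove_type": "truncate_by_model",  # illustrative: drop everything for this model
    })
    resp.raise_for_status()

# Example: the model was upgraded, so stale answers must not leak into new responses.
clear_model_cache("codegpt-v2")
```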

3.2.2 Data isolation

In practical applications, data isolation is an important capability. To meet the needs of different scenarios, ModelCache implements several kinds of data isolation (a sketch follows this list):

  • Environment isolation: supports data isolation across environments, including development, pre-release, and production. These environments may differ in model versions and parameters, so their data is kept independent.
  • Model isolation: supports model-level isolation, using separate vector database collections and OceanBase table fields for independent storage. Data belonging to different models is thus effectively isolated, ensuring data security and integrity.
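One way such isolation can be implemented, sketched below with a hypothetical naming convention, is to derive a separate Milvus collection per environment and model, so vectors from different models or environments never share an index.

```python
# Hypothetical sketch: derive an isolated Milvus collection name per (env, model) pair.
# The naming scheme is illustrative; the real collections and table fields may differ.
from pymilvus import connections, Collection

def collection_for(env: str, model: str) -> Collection:
    # e.g. "modelcache_prod_codegpt_v2" -- one collection per environment and model
    name = f"modelcache_{env}_{model}".replace("-", "_").lower()
    connections.connect(host="127.0.0.1", port="19530")  # assumed Milvus address
    return Collection(name)  # assumes the collection has already been created

dev_col = collection_for("dev", "codegpt-v2")
prod_col = collection_for("prod", "codegpt-v2")  # same model, fully separate storage
```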

3.2.3 Data reflow

The data reflow function persists knowledge: it ensures that important system data is saved effectively and can keep being used, thereby supporting the long-term evolution of the system. To this end, ModelCache provides a data reflow capability so that data generated in the system is persisted reliably. Reflow is performed asynchronously to minimize the impact on system performance, as sketched below.
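The mechanism below is an assumption (the source only states that reflow is asynchronous): write-backs are queued and persisted by a background worker so the request path is never blocked.

```python
# Minimal sketch of asynchronous write-back ("data reflow"): the request thread only
# enqueues, and a background worker persists entries so user-facing latency is unaffected.
# The persist() call is a placeholder; ModelCache's real implementation may differ.
import queue
import threading

reflow_queue: "queue.Queue[dict]" = queue.Queue()

def persist(entry: dict) -> None:
    # Placeholder for writing the query/answer/embedding to Milvus and OceanBase.
    print("persisted:", entry["query"])

def reflow_worker() -> None:
    while True:
        entry = reflow_queue.get()
        try:
            persist(entry)
        finally:
            reflow_queue.task_done()

threading.Thread(target=reflow_worker, daemon=True).start()

# On the request path: enqueue and return immediately.
reflow_queue.put({"model": "codegpt-v2", "query": "hello", "answer": "hi there"})
reflow_queue.join()  # in this demo, wait for the worker to drain the queue
```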

3.2.4 System instruction and multi-turn dialogue support

ModelCache provides system instruction and multi-turn dialogue support to meet user needs, as follows (a sketch follows this list):

  • System instruction support: ModelCache supports system instructions, in particular for the case where users later customize them. Semantic similarity is evaluated separately for conversations under different system instructions, which keeps the cache stable. In the future, we also plan to separate system instructions from sessions to further improve flexibility and scalability.
  • Multi-turn dialogue capability: ModelCache also supports multi-turn dialogue, i.e., it can match the semantic similarity of consecutive turns in a conversation.
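One plausible way to make the cache aware of system instructions and prior turns, sketched below under the assumption that they are folded into the text that gets embedded (the source does not specify the exact mechanism):

```python
# Hypothetical sketch: fold the system instruction and recent turns into one string
# before embedding, so conversations under different system instructions or with
# different histories do not collide in the cache. ModelCache's actual mechanism
# may differ.
from typing import Dict, List

def build_cache_key_text(system_instruction: str, turns: List[Dict[str, str]],
                         max_turns: int = 3) -> str:
    parts = [f"[system] {system_instruction}"]
    for turn in turns[-max_turns:]:          # keep only the most recent turns
        parts.append(f"[{turn['role']}] {turn['content']}")
    return "\n".join(parts)

text = build_cache_key_text(
    "You are a coding assistant that answers in Python.",
    [{"role": "user", "content": "Sort a list of dicts by a key"},
     {"role": "assistant", "content": "Use sorted(data, key=lambda d: d['k'])"},
     {"role": "user", "content": "Now do it in descending order"}],
)
# `text` is then embedded and used for similarity search instead of the last query alone.
print(text)
```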

3.2.5 Portability

ModelCache has good portability and can adapt to different scenarios. OceanBase can be seamlessly replaced by products such as MySQL, and Milvus is a database service that can be deployed quickly, so whether on a private or public cloud, ModelCache can be brought up quickly and provide high-quality service. This portability gives users more flexible and scalable deployment options for different needs and scenarios; the sketch below illustrates the idea.
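Portability can be illustrated by swapping only the relational-backend connection settings. The configuration values below are hypothetical; the point is that OceanBase (in MySQL mode) and MySQL speak the same wire protocol, so the application code stays unchanged.

```python
# Hypothetical sketch: the relational backend is selected purely by configuration.
# OceanBase (MySQL mode) and MySQL are protocol-compatible, so pymysql works for both.
import pymysql

RELATIONAL_BACKENDS = {
    "oceanbase": {"host": "ob.internal", "port": 2881,
                  "user": "cache", "password": "...", "database": "modelcache"},
    "mysql":     {"host": "127.0.0.1", "port": 3306,
                  "user": "cache", "password": "...", "database": "modelcache"},
}

def connect(backend: str = "mysql"):
    cfg = RELATIONAL_BACKENDS[backend]
    return pymysql.connect(**cfg)  # identical code path regardless of backend
```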

3.2.6 Embedding capabilities

In the current cache, users can use a general-purpose Chinese embedding model (text2vec-base-chinese). We also support using the embedding layer of the large model itself, which lets the embedding better match the model's own semantics. However, using only the embedding layer of the large model degenerates into a bag-of-words representation and cannot capture per-token weights. To address this, we are training SGPT (GPT Sentence Embeddings for Semantic Search) to better support ModelCache. A minimal sketch of local embedding inference follows.
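The sketch assumes the sentence-transformers library and a locally downloaded copy of the shibing624/text2vec-base-chinese checkpoint (both are assumptions; the source names only "text2vec-base-chinese"); loading from a local directory avoids the Hugging Face network issue mentioned earlier.

```python
# Minimal sketch: run the Chinese general-purpose embedding model locally.
# The checkpoint name and local path are assumptions; any compatible copy works.
from sentence_transformers import SentenceTransformer

# Point at a locally downloaded copy (e.g. pre-fetched from
# "shibing624/text2vec-base-chinese") so inference needs no network access.
model = SentenceTransformer("/models/text2vec-base-chinese")

vec = model.encode("如何用 Python 读取一个 JSON 文件？", normalize_embeddings=True)
print(vec.shape)  # (768,) for this BERT-base model family
```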

 

4 Effectiveness statistics

4.1 Efficiency statistics

Based on the GOC log information of Ant's internal large model products, we measured the latency of cache hits versus direct model calls. Because the product uses streaming output, latencies are somewhat inflated. According to actual system statistics, a cache hit reduces average latency by a factor of 10, and the overall effective speed improvement reaches 14.5%. Effective speed improvement is defined as follows:
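One plausible form of this definition (an assumption: $p$ denotes the cache hit rate, $t_{\text{hit}}$ the average latency of a cache hit, and $t_{\text{llm}}$ the average latency of a direct model call):

$$\text{effective speedup} = p \times \left(1 - \frac{t_{\text{hit}}}{t_{\text{llm}}}\right)$$

Under this assumed form, the reported 10x per-hit reduction ($t_{\text{hit}}/t_{\text{llm}} \approx 0.1$) combined with a hit rate of roughly 16% would yield the reported 14.5% figure.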

Based on the reflow data (which excludes the delay introduced by streaming output), we also evaluated cache latency: the latency of cache misses has been brought down to the order of hundreds of milliseconds, and we are still optimizing query latency.

4.2 Continuous optimization of the embedding model

In the caching scenario, we found that evaluating semantic similarity alone is not enough. The core goal should be to determine whether the large model outputs corresponding to two queries are consistent; semantic similarity of queries is not equivalent to consistency of the model's replies. For example, the following two queries differ by only one word, yet the generated results are completely different (see the sketch after these examples):

    • query: Traverse from 1 to 1000 and find all numbers that are divisible by 13 and 23, implemented in Python
    • query: Traverse from 1 to 1000 and find all numbers that are divisible by 13 and 23, implemented in Java
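The short sketch below illustrates the mismatch (assuming the shibing624/text2vec-base-chinese checkpoint via sentence-transformers; neither is named by the source, and the exact score will vary): a plain sentence encoder rates these two queries as highly similar even though the desired outputs differ completely.

```python
# Sketch: a generic sentence encoder scores the two prompts as near-duplicates,
# yet the desired LLM outputs (Python code vs. Java code) are entirely different.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("shibing624/text2vec-base-chinese")  # assumed checkpoint

queries = [
    "Traverse from 1 to 1000 and find all numbers divisible by 13 and 23, implemented in Python",
    "Traverse from 1 to 1000 and find all numbers divisible by 13 and 23, implemented in Java",
]

emb = model.encode(queries, convert_to_tensor=True)
score = util.cos_sim(emb[0], emb[1]).item()
print(f"cosine similarity: {score:.3f}")  # typically very high despite divergent answers
```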

We investigated many models from the SentenceTransformer ecosystem, but none of them met the needs of the caching scenario. We therefore trained an embedding model for enterprise-level applications, and on this basis we hope to further improve the accuracy of semantic similarity evaluation and, in turn, the accuracy of the cache.

5 Future Outlook

In the future, we aim to provide solutions with stronger performance and higher accuracy for LLM cache scenarios, and we will continue research and optimization to ensure that the cache system achieves the best possible performance and accuracy in practical applications.

In terms of performance, we will achieve faster recall through in-depth optimization of every aspect, including algorithms, data, and computing resources. The goal is to compress overall processing time to under 300 milliseconds to provide a faster and more efficient user experience.

In terms of accuracy, we will focus on building better semantic models. Through in-depth research on and improvement of semantic representation techniques, we aim to improve the model's ability to accurately understand complex semantics so that user queries are matched more precisely. In addition, the similarity evaluation module will be optimized to further improve recall. We will consider multiple evaluation metrics, such as precision, recall, and F1 score, to ensure the model improves across the board.

 

To learn more about CodeFuse, visit the official CodeFuse website: https://codefuse.alipay.com
