When the large model itself is no longer the problem, how do we deal with the engineering challenges of implementing LLM applications?

A few months ago, in Thoughtworks' internal AIGC workshops, we reached a series of consensus points, such as: if there are no open-source models to lower the cost of applying LLMs in the enterprise, LLMs will fade quickly. We therefore believed that open-source LLMs plus LoRA fine-tuning would become a mainstream approach for enterprises. Today, models such as LLaMA 2 and Code LLaMA keep confirming this possibility.


Once the model itself is no longer the problem, as architects and developers we should commit to implementing LLMs in an engineering way. Over the past few months we have therefore built a series of LLM application PoCs in different domains, trying to think about how to build LLM applications from different perspectives, such as:

  • From the perspective of language and ecosystem, how do we explore and optimize interaction across languages?

  • How should the technical architecture be designed?

  • How should prompts be modeled and optimized?

  • What are some patterns for building better model contexts?

  • What should a language API contain?

Other questions include: how do we reduce the cost of large models through smaller models and more traditional approaches? Each of these is an interesting question in its own right, and one we have to consider when putting LLMs into production.

From the perspective of language and ecosystem: LLM Service as an API vs. FFI

A large number of enterprises have tried to build PoCs of knowledge-enhancement tools with Python + LangChain. From an engineering point of view, this means we need to consider:

  1. Should all intelligent services be written in Python and exposed externally as APIs?

  2. Or should we look for solutions within our existing languages and infrastructure?

Moreover, Python's dynamic nature limits the IDE's static analysis, which in turn hurts development efficiency, even with a typing library like Pydantic. So my first consideration for language choice is: fit with the enterprise's existing infrastructure. In particular, all of my existing libraries and frameworks are written in JVM languages.

Language AI Infrastructure


So we used Java/Kotlin, TypeScript, and Rust to develop applications of different types and for different scenarios, to see whether these languages can be used to build LLM applications.

Judging from existing ecosystems, mainstream programming languages all have deep-learning-related libraries, with the Python ecosystem as the reference point. For example:

  • In the Java ecosystem, Deep Java Library (DJL) provides a large set of deep-learning-related libraries that let us quickly build LLM-based applications.

  • In the Kotlin ecosystem, KInference is optimized specifically for inference, mainly for running ONNX models on the server side and locally on the client side.

  • In the Rust ecosystem, when building CoUnit we used Rust as the development language and needed a multidimensional array library such as ndarray.

But in most scenarios we don't need much AI infrastructure at all. A basic AI application may only need a way to count token lengths, so as to avoid unnecessary LLM costs.
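
As an illustration, here is a minimal Kotlin sketch that only counts tokens before deciding whether a call to the model is worth making. It assumes the JTokkit library (a JVM port of tiktoken); which tokenizer library you pick is an implementation detail, not part of the pattern itself:

import com.knuddels.jtokkit.Encodings
import com.knuddels.jtokkit.api.EncodingType

fun main() {
    // Build the default registry and pick cl100k_base, the encoding used by GPT-3.5/GPT-4.
    val registry = Encodings.newDefaultEncodingRegistry()
    val encoding = registry.getEncoding(EncodingType.CL100K_BASE)

    val prompt = "You are a helpful assistant. Summarize the following document ..."
    val tokenCount = encoding.countTokens(prompt)

    // Skip or trim the LLM call when the prompt would blow past the context window.
    val contextWindow = 4096
    if (tokenCount > contextWindow) {
        println("Prompt too long ($tokenCount tokens); truncate or split before calling the LLM")
    } else {
        println("Prompt fits: $tokenCount tokens")
    }
}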

FFI as an interface approach


Here, Tiktoken/Tokenizer is the first library for which we need FFI (Foreign Function Interface) support, in order to calculate token lengths. FFI lets code in different programming languages call and interact with each other, for example calling an underlying library implemented in Rust from Python to get faster computation.

In addition, on the application side, whether client or server, some small inference models need to be introduced. Two commonly used FFI-based libraries are:

  • Tokenizer. Both OpenAI's Tiktoken and HuggingFace's Tokenizers choose Rust as the underlying language. A slight difference is that JetBrains' AI Assistant implements its tokenizer in Kotlin, perhaps out of FFI performance considerations.

  • ONNX Runtime. ONNX Runtime is a cross-platform machine-learning inference accelerator. It is usually used to bring small-model inference to the client and server, for example running SentenceTransformers locally to perform similarity search. The runtime is implemented in C++, so other languages access it via FFI; a minimal usage sketch follows below.
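
To make the second point concrete, here is a minimal Kotlin sketch against the official ONNX Runtime Java bindings (ai.onnxruntime). The model path, input names, and token ids are placeholders and depend entirely on how the model was exported; treat it as the shape of the code rather than a drop-in implementation:

import ai.onnxruntime.OnnxTensor
import ai.onnxruntime.OrtEnvironment
import ai.onnxruntime.OrtSession

fun main() {
    val env = OrtEnvironment.getEnvironment()
    // "model.onnx" is a placeholder path to an exported sentence-embedding model.
    val session = env.createSession("model.onnx", OrtSession.SessionOptions())

    // Token ids would normally come from the tokenizer above; hard-coded here for brevity.
    val inputIds = arrayOf(longArrayOf(101, 7592, 2088, 102))
    val attentionMask = arrayOf(longArrayOf(1, 1, 1, 1))

    val inputs = mapOf(
        // Input names depend on the export; check session.inputNames for the real ones.
        "input_ids" to OnnxTensor.createTensor(env, inputIds),
        "attention_mask" to OnnxTensor.createTensor(env, attentionMask),
    )

    session.run(inputs).use { result ->
        // For a typical sentence-transformers export, the first output holds the embeddings.
        val embeddings = result.get(0).value
        println("Embeddings: $embeddings")
    }
}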

The main drawback for us is that in some languages there is far less reference code and fewer reference architectures to learn from, so developing this type of application takes longer.

Technical Architecture of LLM Applications

There is not much difference between LLM application development and regular application development; we just need to consider the impact of the LLM and how to organize these structures in different scenarios.

Three modes of AI application in different business scenarios


Generally speaking, different business scenarios call for different degrees of AI integration. We summarize them into three modes.

  • Basic LLM applications. Usually only prompt engineering is required to interact with the pre-trained model.

  • Co-pilot-style applications. Prompt engineering is used to interact with agents, which combine pre-trained models with external toolsets. Typically, workflows are written against specific patterns to build context automatically from user intent.

  • Autonomous agents. Advanced agents automatically generate prompts to control the pre-trained model and external tools. Generally speaking, the LLM generates the workflow from user intent and controls the external tools on its own.

Based on the PoCs we built earlier, we summarize four basic principles of architecture design.

LLM-first software architecture principles


The four architectural principles here are, in fact, constrained by the capabilities of the LLM:

  • User-intent-driven design. Design new human-computer interaction experiences and build domain-specific AI roles to better understand user intent. For example, we can use a DSL to gradually guide users into providing more context.

  • Context engineering. Build an application architecture that is good at gathering business context, in order to generate more accurate prompts, and explore engineering approaches that keep response times low.

  • Atomic capability mapping. Analyze the atomic capabilities the LLM is good at, combine them with the capabilities the application lacks, and map between the two.

  • Language interface. Explore and find a suitable new generation of APIs so that the LLM can understand, schedule, and orchestrate service capabilities.

In practice, guiding users and improving the context are the biggest difficulties in our engineering implementation.

Prompt Modeling and Optimization

The prompt model is closely tied to the prompt strategy itself, and, generally speaking, to the way we break down complex problems.

Context-sensitive Prompt Modeling

Writing prompts at development time is a pain: after building different chain-of-thought templates, we need to provide better examples. To this end, we model prompts so that we can manage and test them more easily. Take a prompt from the LangChain source code as an example:

Human: What is 2+2?
AI: 4
Human: What is 2+3?
AI: 5
Human: What is 4+4?

The corresponding Python code is as follows:

examples = [
    {"input": "2+2", "output": "4"},
    {"input": "2+3", "output": "5"},
]

For this reason, different PromptTemplate types are needed depending on the prompt pattern. LangChain, for example, has a series of complex prompt-strategy templates built in, such as FewShotPromptWithTemplates, FunctionExplainerPromptTemplate, and so on.
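
In a JVM codebase, the same idea can be modeled without LangChain. The Kotlin sketch below (a hypothetical FewShotPrompt type of our own, not a LangChain API) renders the example list above back into the Human/AI transcript shown earlier:

// A hypothetical few-shot template: render each example through an example template,
// then append the real question, which is roughly what LangChain's few-shot templates do.
data class FewShotPrompt(
    val examples: List<Map<String, String>>,
    val exampleTemplate: (Map<String, String>) -> String,
    val suffix: (String) -> String,
) {
    fun format(input: String): String =
        examples.joinToString("\n") { exampleTemplate(it) } + "\n" + suffix(input)
}

fun main() {
    val prompt = FewShotPrompt(
        examples = listOf(
            mapOf("input" to "2+2", "output" to "4"),
            mapOf("input" to "2+3", "output" to "5"),
        ),
        exampleTemplate = { "Human: What is ${it["input"]}?\nAI: ${it["output"]}" },
        suffix = { "Human: What is $it?" },
    )
    println(prompt.format("4+4"))
}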

Continuous optimization of prompt templates

In our PoC projects there is more than one type of example, which raises the question: how do we model them consistently? Here is an example of a QA template from a project we built with Kotlin:

@Serializable
data class QAUpdateExample(
    override val question: String,
    override val answer: String,
    val nextAction: String = "",
    val finalOutput: String = "",
) : PromptExample {
}

This template is continuously optimized as we iterate. And because the prompt needs to weave in context information through variables, we also need a template engine such as Apache Velocity to replace the variables in the prompt with real data, as in the sketch that follows.
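
A minimal Velocity-based sketch looks like this; the template string and variables are made up for illustration, and in a real project the template would live in a resource file:

import org.apache.velocity.VelocityContext
import org.apache.velocity.app.VelocityEngine
import java.io.StringWriter

fun main() {
    val engine = VelocityEngine().apply { init() }

    // A prompt template with Velocity placeholders ($context, $question).
    val template = "You are a coding assistant.\n" +
        "Context:\n\$context\n\n" +
        "Question: \$question\nAnswer:"

    val velocityContext = VelocityContext().apply {
        put("context", "fun add(a: Int, b: Int) = a + b")
        put("question", "Write a unit test for the add function.")
    }

    val writer = StringWriter()
    engine.evaluate(velocityContext, writer, "qa-prompt", template)
    println(writer.toString())
}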

Patterns for Context Construction: RAG and Domain Specific Patterns

There is no doubt that among existing LLM (large language model) applications, intelligent customer service and knowledge question-answering are the most common scenarios. In these scenarios, developers have been exploring better RAG (retrieval-augmented generation) patterns to make the LLM answer more accurately.

In fact, in most scenarios we can build domain-proprietary patterns, and their quality is often better. But they are not very general and rely on experts to build, so the cost-effectiveness can seem poor.

Domain-specific language abstraction

Without a doubt, code-generation tools are the second most popular type of LLM tool. Unlike knowledge scenarios, you don't necessarily need RAG to construct the context. Here, each question is very "specific", and enough context can be obtained from the IDE or editor.

Whether it is GitHub Copilot, JetBrains AI Assistant, or our open-source AutoDev, they all generate their prompts from user behavior and context. For example, Copilot computes code chunks similar to the current code from the 20 most recently opened files and builds the prompt from them. In AutoDev, we believe the team's coding specification should be written into the code-generation prompt so that the generated code follows the standard; a sketch of this idea follows below.
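
The Kotlin sketch below shows the shape of that idea. It is a hypothetical prompt builder, not AutoDev's actual implementation: the team specification is injected next to the code context collected from the IDE.

// Hypothetical context type and prompt builder for spec-aware code generation.
data class CodeContext(val language: String, val filePath: String, val relatedCode: String)

fun buildCodeGenPrompt(spec: String, context: CodeContext, userRequest: String): String = """
    |You are a senior ${context.language} developer.
    |Follow the team specification strictly:
    |$spec
    |
    |Related code from ${context.filePath}:
    |${context.relatedCode}
    |
    |Task: $userRequest
    """.trimMargin()

fun main() {
    val prompt = buildCodeGenPrompt(
        spec = "- Use constructor injection\n- Controller methods must return ResponseEntity",
        context = CodeContext("Kotlin", "src/main/kotlin/UserController.kt", "class UserController { /* ... */ }"),
        userRequest = "Add an endpoint that returns a user by id.",
    )
    println(prompt)
}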

Therefore, we believe that in a specific domain, the most reasonable approach is to design a DSL around the domain's context, design a prompt strategy, and then combine it with RAG.

Retrieval-Augmented Generation and Prompt Strategy


In our internal training materials, I treat RAG as one kind of prompt strategy. The basic RAG pattern needs to be combined with a vector database and a knowledge index. With the basic RAG pattern alone, the resulting prompts will not achieve satisfactory results.

Users are novices and don't operate the system the way we expect. Their input is ambiguous, and our challenge is: how do we make an ambiguous question concrete?

  • In CoUnit, we convert the user's intent into a DSL that includes a Chinese query, an English query, and a HyDE (hypothetical document embeddings) document, and then perform semantic search to retrieve candidate results.

  • Scenarios like CoUnit's are essentially a form of query expansion, for which there is also a series of related patterns such as Query2Doc. And as we store effective historical conversations, the retrieved results will become more and more accurate.

  • When the LLM does not contain our knowledge but holds a lot of similar knowledge, we need to take "Lost in the Middle" into account: how do we distribute our chunks efficiently in the prompt? That is, placing the most relevant results at the head and tail so that the LLM can catch the focus (see the sketch after this list).
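
A minimal sketch of that reordering idea (our own assumed implementation, similar in spirit to the "long context reorder" approach) could look like this:

// Chunks come in ranked by relevance (index 0 = most relevant). We alternate them between
// the head and the tail of the prompt, so the least relevant ones sink into the middle,
// where the LLM pays the least attention.
fun <T> reorderForPrompt(rankedChunks: List<T>): List<T> {
    val head = mutableListOf<T>()
    val tail = mutableListOf<T>()
    rankedChunks.forEachIndexed { index, chunk ->
        if (index % 2 == 0) head.add(chunk) else tail.add(chunk)
    }
    return head + tail.reversed()
}

fun main() {
    val ranked = listOf("chunk-1 (most relevant)", "chunk-2", "chunk-3", "chunk-4", "chunk-5")
    // => [chunk-1, chunk-3, chunk-5, chunk-4, chunk-2]: the two most relevant chunks
    //    end up at the head and the tail of the context.
    println(reorderForPrompt(ranked))
}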

Like LLMs half a year ago, everything related to RAG will keep evolving drastically over the next few months, and we need to keep learning.

A language API for converting non-determinism into determinism

When interacting with an LLM, natural language serves as the API. In general, this falls into two categories of scenarios:

  1. LLM + workflow. The LLM analyzes the user's intent and selects the appropriate tools and APIs.

  2. LLM DSL generation. The LLM analyzes the user's intent, combines it with the specific context, and outputs a DSL, which the application parses and uses as program input.

Language is wonderful: combined with the nature of the LLM, it turns non-determinism into deterministic function-call parameters, DSLs, and so on.

Function calling: choosing the appropriate tool based on user intent


Simply put, it is similar to the following prompt:

Your task is to answer questions about the codebase. You should use a set of tools to gather information to help you answer the question. The following tools are available:

In practice, this can usually be divided into three modes.

  • Tooling mode. This is the approach above: provide a set of optional tools. When developing an application, you often need to generate a dynamic tool list from the context so the LLM can choose the appropriate tool.

  • Function-calling mode. During an LLM chat, detect when a function should be called, pass the input to it, and invoke the function.

  • Small intent-recognition models. That is, fine-tune a model in a way similar to OpenAI's approach to achieve similar functionality in specific scenarios.

Beyond these scenarios, the LLM can also generate a DSL such as JSON, which the program then parses to achieve the same effect, as sketched below.
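
Here is a hedged Kotlin sketch of that JSON-DSL pattern using kotlinx.serialization (the tool names and JSON shape are invented for illustration): the prompt asks the model to answer with a single JSON object, and the application deserializes it into a typed command.

import kotlinx.serialization.Serializable
import kotlinx.serialization.decodeFromString
import kotlinx.serialization.json.Json

@Serializable
data class ToolCall(val tool: String, val arguments: Map<String, String> = emptyMap())

private val json = Json { ignoreUnknownKeys = true }

fun dispatch(rawResponse: String) {
    // In practice, validate the response and re-ask the model when parsing fails.
    val call = json.decodeFromString<ToolCall>(rawResponse)
    when (call.tool) {
        "semantic_search" -> println("Searching for: ${call.arguments["query"]}")
        "read_file" -> println("Reading file: ${call.arguments["path"]}")
        else -> println("Unknown tool: ${call.tool}")
    }
}

fun main() {
    // Example of what the LLM might return when asked to pick a tool for a user question.
    dispatch("""{"tool": "semantic_search", "arguments": {"query": "payment service API"}}""")
}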

Data-driven: DSL patterns and guiding users

Generally speaking, we use a DSL as the intermediate language between the user and the LLM. It is not only easy for the LLM to understand, but also easy for humans to read, and it lends itself to program analysis. Of course, some DSLs don't even need to be parsed, such as:

----------------------------------------------
|      Navigation(10x)                       |
----------------------------------------------
| Empty(2x) | ChatHeader(8x) | Empty(2x)     |
----------------------------------------------
| MessageList(10x)                           |
----------------------------------------------
| MessageInput(10x)                          |
----------------------------------------------
| Footer(10x)                                |
----------------------------------------------

It can serve as an intermediate rendering of the UI, from which the LLM then generates the actual code.

In addition, how to guide users based on the DSL model is a very interesting question.

Summary and next steps

In this article, we have summarized some of our experience building LLM applications over the past few months, from which we keep finding more and more reusable patterns.

Next, we will explore how to better distill these patterns to support faster LLM application development.


Origin blog.csdn.net/gmszone/article/details/132658359