How to build an app based on large models

The emergence of ChatGPT has once again made large models a focus of attention in the industry. However, not every organization needs to train its own large model, nor does every organization have the technical expertise and computing resources to do so. More often than not, we develop business applications on top of large models. The term AI Native usually refers to applications that could not exist without large models, and these represent new business opportunities and challenges. In many cases, though, we are still doing Applied AI, that is, empowering existing applications with AI, and with large models in particular.

Whether an application is AI-native or AI-enabled, it faces the same question: how do you build an app on top of a large model, and how do its system architecture and development methods differ from those of a conventional application?


1. Understand the capability boundaries of LLMs

There is nothing mysterious or magical about AI. When applying any technology, you should understand the boundaries of its capabilities: not only what it can do, but also what it cannot do, or at least what its current limitations are.

1.1 Basic capabilities of LLM

So far, the main capabilities of large models are as follows:

Use LLM for language understanding and processing rather than as a source of knowledge

An LLM is pre-trained on large amounts of text from the Internet, which is where its knowledge comes from. This also allows large models to generalize across a wide range of tasks and then be fine-tuned for downstream ones.

When building apps based on large models, it is tempting to use the LLM simply as a knowledge/fact source, i.e. as a search engine. Instead, we should take advantage of the LLM's powerful language understanding and processing capabilities, letting it "understand" the user request and formulate a "response", while the application supplies the knowledge and data relevant to the use case and the LLM returns facts/information only from the data we provide.

LLMs can also be used for basic reasoning

In addition to language understanding, some LLMs perform well on basic reasoning when working with step-by-step prompts. This allows us to leverage LLM to break down user requests/responses into smaller tasks.

Using LLM for review, evaluation and feedback

An LLM is much more effective at reviewing a piece of text and reporting problems in it than at generating that text from scratch. We should exploit this as much as possible: send the LLM's output back to the LLM and ask it to double-check the output.

Text transformation, expansion, summarization using LLM

These are classic NLP abilities: converting unstructured text into JSON (and vice versa), expanding short text, and summarizing long text.

1.2 Letting the LLM answer questions it cannot know

To experienced programmers, the emergent abilities of LLMs are still inferences of latent relationships in existing data; genuine "creation out of nothing" is usually the source of hallucinations. Broadly speaking, there are two approaches to getting a large language model to answer questions it cannot know on its own: model fine-tuning and context injection.

The process of tuning a pre-trained model is called fine-tuning: using additional data to train an existing language model so that it performs better on a specific task. Instead of training a language model from scratch, we take an already pre-trained model, such as LLaMA, and adapt it with use-case-specific training data. Fine-tuning adjusts the model to fit a specific task, but it does not really inject new domain knowledge into the model: the model has already been trained on a huge amount of general language data, and domain-specific data is usually too small to override what it has learned. In other words, fine-tuning helps the model adapt to how it should communicate, but not necessarily to what it should know.

When using context injection, we do not modify the language model, but focus on modifying the prompt itself and inserting relevant context into the prompt, which might work like this:

(figure: how context injection works)

Therefore, you need to think about how to supply the prompt with the right information, which requires a process that can identify the most relevant data. Using embedding techniques, we can convert text into vectors and thereby represent it in a multi-dimensional embedding space, where points that are close together tend to be used in similar contexts. To keep similarity searches fast, the vectors are generally stored and indexed in a vector database.
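To make this concrete, here is a minimal Python sketch of context injection. The `search_context` and `complete` functions are hypothetical stand-ins for a vector-store query and an LLM completion call, not any particular library's API:

```python
# A minimal sketch of context injection, assuming hypothetical helpers.
def search_context(question: str, top_n: int = 3) -> list[str]:
    # In a real app: embed the question and run a similarity search
    # against a vector database (see section 4.4).
    return ["Refunds are processed within 14 days."]

def complete(prompt: str) -> str:
    return "(model reply)"          # replace with your LLM provider call

def answer(question: str) -> str:
    context = search_context(question)
    prompt = (
        "Answer using ONLY the context below. "
        "If the answer is not there, say you do not know.\n\n"
        "Context:\n" + "\n".join(context) +
        f"\n\nQuestion: {question}\nAnswer:"
    )
    return complete(prompt)
```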

2. Problems faced in building simple applications based on large model APIs

The most straightforward way to build a large model app is to create a thin application layer on top of the LLM API. This layer connects the LLM with the application's use cases, data, and user sessions; it can maintain the memory and state of previous interactions with the user, or break a goal down into smaller tasks and subtasks.
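As an illustration of such a thin layer, here is a minimal sketch in which the application, not the model, keeps the conversation state. The `chat` function is a hypothetical stand-in for whatever chat-completion API you use:

```python
# A minimal "thin application layer" over an LLM API: the app keeps the state.
def chat(messages: list[dict]) -> str:
    # Replace with a real provider call (OpenAI, Anthropic, a local model, ...).
    return "(model reply)"

history = [{"role": "system", "content": "You are the assistant inside our app."}]

def ask(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    reply = chat(history)                       # the model sees all prior turns
    history.append({"role": "assistant", "content": reply})
    return reply
```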

However, there are some drawbacks to building a simple application layer on top of LLM:

  • Responses to users will be unpredictable and contain hallucinations.

  • The response will not be relevant to the target application's data and use cases.

  • There is no way to build a moat for your product; anyone can easily achieve the same results.

  • LLM API calls are expensive, and the costs can add up quickly.

  • The LLM is stateless and has no agent capabilities.

Instead, we need to build a moat with our own proprietary data and knowledge, reduce unnecessary calls and use cheaper models where possible, and iteratively orchestrate and automate the underlying LLM to improve its task planning and reasoning ability.

So, for apps based on large models, is there a universal or instructive reference architecture?

3. Thinking about the system architecture of large model App

LLM-based application development frameworks such as LangChain provide a structured approach to building applications around large models. Here, however, we try to describe the system architecture of a large model app at a more abstract level.

(figure: system architecture of an LLM-based app)

3.1 Application Orchestrator

The orchestrator sits beneath the application stack and connects the other modules together. Building it as a multi-tenant component is particularly important. This ensures:

  • Personalization for each user

  • Privacy protection, ensuring memories, context, etc. are retrieved only for the correct user.

Furthermore, each of the modules below needs to be designed with multi-tenancy in mind.

3.2 Task Planner

A good approach is to take the user's request/goal and use a model to break it down into subtasks. Each subtask can be further broken down into smaller tasks/goals depending on the application. This is an ongoing process: as the user completes their goals, the LLM can be used to expand the current tasks and subtasks, or to prune tasks that are no longer necessary. Many frameworks and open source projects provide such functionality; a typical example is AutoGPT.

Generally, it can be handled as follows, with a minimal code sketch after the list:

  • Get the user's goal and send it to an LLM with good reasoning capabilities

  • Prompt the LLM to break the goal into subtasks and return them as a JSON list

  • Save the subtasks to the database

  • Update the application's user interface based on the subtasks

  • Iterate, breaking tasks into smaller subtasks as needed
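A minimal sketch of this decomposition step might look like the following; `complete` is a hypothetical stand-in for an LLM call, and its stubbed return value only illustrates the expected JSON shape:

```python
import json

def complete(prompt: str) -> str:
    # Hypothetical LLM call; assume it returns the raw completion text.
    return '["research venues", "draft invitations", "book catering"]'

def plan(goal: str) -> list[str]:
    prompt = (
        "Break the following goal into 3-7 concrete subtasks.\n"
        f"Goal: {goal}\n"
        'Return ONLY a JSON list of strings, e.g. ["subtask 1", "subtask 2"].'
    )
    raw = complete(prompt)
    try:
        subtasks = json.loads(raw)
    except json.JSONDecodeError:
        subtasks = []        # malformed output: re-prompt or fall back
    return subtasks

# Each subtask could be stored in the database and expanded recursively.
print(plan("Organize a team offsite"))
```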

3.3 Context data vector storage

In general, we probably should not rely on the LLM's pre-trained knowledge in the target application. Instead, provide the necessary contextual information in each prompt and instruct the LLM to respond based only on the information contained in the prompt. This ensures that responses are relevant to the target application and use case. Using vector embeddings and a vector database, the relevant subset of contextual data can be retrieved semantically for each prompt, which improves efficiency and performance and lowers costs.

The method looks like this:

  • Whenever new contextual information arrives, split it into chunks and generate vector embeddings for them. Store the embeddings in a vector database, along with additional information (e.g. URL, image, source text, etc.)

  • Before sending a request to the LLM, always query the vector store with it first. Take the top N relevant results, add them to the request prompt, specify that the LLM should use only the information in the prompt, and then submit the prompt.

  • Once the response is received, compare it with the context data that was sent, to ensure there are no hallucinations and that it is relevant to the target application's data.

  • Iterate: use the responses to generate new queries to the vector database, and feed the results into the next LLM prompt.

  • It is also possible to ask the LLM itself to generate a query to the vector store to obtain additional information it needs.

Note that the records retrieved from the vector database contain more than text: they may include images, URLs, video URLs, etc. The target application can use this information to enrich the response shown in the user interface.

3.4  Vector storage of memory data

The vector store for memory data is similar to the one for contextual data, except that it is populated with key-value pairs of prompts and responses previously generated while using the application. The goal is to let the LLM reference previous interactions so it can personalize responses to the user and steer in the right direction.

Memory data can also be tagged with timestamps, locations, etc. to allow filtering or pruning of relevant memory data.

General use cases, with a minimal code sketch after the list:

  • A request is made based on a user action in the user interface. The request is converted into a vector embedding and sent to the memory vector store to retrieve any relevant memories.

  • Memories may include specific interactions, for example a comment the user previously posted.

  • The memories are then added to the prompt, along with the user request and any context fetched from the context store. In the prompt, the memories may be prefixed with text such as "Here is a list of previous interactions; please consider these when responding to ensure you comply with previous requests and preferences."

  • Then the prompt is sent to the LLM.

  • Prompts and responses generated during the current session are converted into vector embeddings and stored in the memory vector store, to be retrieved in future LLM interactions whenever they are semantically relevant.
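A rough sketch of such a memory store, assuming a toy `embed` function in place of a real embedding model and a plain Python list in place of a real vector database:

```python
import time
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy embedding (character counts); swap in a real embedding model.
    v = np.zeros(64)
    for ch in text.lower():
        v[ord(ch) % 64] += 1
    return v / (np.linalg.norm(v) or 1.0)

memory: list[dict] = []   # stand-in for the memory vector store

def remember(user_id: str, prompt: str, response: str) -> None:
    memory.append({
        "user": user_id, "ts": time.time(),       # timestamp allows pruning/filtering
        "text": f"User: {prompt}\nAssistant: {response}",
        "vec": embed(prompt + " " + response),
    })

def recall(user_id: str, query: str, top_n: int = 3) -> list[str]:
    # Filter by tenant first (privacy), then rank by semantic similarity.
    candidates = [m for m in memory if m["user"] == user_id]
    ranked = sorted(candidates, key=lambda m: -float(m["vec"] @ embed(query)))
    return [m["text"] for m in ranked[:top_n]]
```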

3.5  Prompt Manager

Often, especially in relatively complex scenarios, prompts become long and complicated. Build a prompt manager that accepts many properties and assembles prompts with the correct structure.

Additionally, in order to use the response inside the target application, the format of the response must be predictable. The best way is to specify the expected JSON format in the prompt; this JSON can include attributes such as the UI element to be modified and the action to be taken.
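A minimal sketch of a prompt builder along these lines; the field names in the JSON format are illustrative, not a fixed schema:

```python
def build_prompt(user_request: str, context: list[str], memories: list[str]) -> str:
    # Assemble the pieces in a fixed order so downstream parsing stays predictable.
    parts = [
        "You are the assistant embedded in our application.",
        "Here is a list of previous interactions; respect these preferences:",
        *memories,
        "Use ONLY the following context when answering:",
        *context,
        f"User request: {user_request}",
        'Respond as JSON: {"answer": "...", "ui_element": "...", "action": "..."}',
    ]
    return "\n".join(parts)
```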

3.6 Response Manager

The response manager is similar to the prompt manager, but it is used to validate responses and can handle the following:

  • Check the response format to ensure it meets the requirements sent in the prompt. (e.g. validating JSON format)

  • Verify that the response conforms to the loaded context and memory data, to ensure it is not a hallucination.

  • Send the response back to LLM, along with the original prompt, and ask LLM to decide if we have a good quality response.

  • Check LLM's responses for objectionable content, negative sentiment, etc.

If the response manager thinks there is a problem with the current LLM response, then it can generate a new prompt with a rejection reason and submit it to the LLM for a new response. This can be done iteratively until the response meets all criteria and security checks.
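A rough sketch of such a validate-and-retry loop. The grounding check here is deliberately naive (a real system might ask a second LLM call to judge grounding), and `complete` is a hypothetical LLM call passed in by the caller:

```python
import json

def validate(response_text: str, context: list[str]) -> tuple[bool, str]:
    # 1) Structural check: is it the JSON shape we asked for in the prompt?
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        return False, "Response was not valid JSON."
    if not isinstance(data, dict) or not isinstance(data.get("answer"), str):
        return False, 'Missing or malformed "answer" field.'
    # 2) Naive grounding check: the answer should echo words from the context.
    context_text = " ".join(context).lower()
    if not any(word in context_text for word in data["answer"].lower().split()[:5]):
        return False, "Answer does not appear to be grounded in the provided context."
    return True, ""

def respond(prompt: str, context: list[str], complete, max_retries: int = 3) -> str:
    for _ in range(max_retries):
        reply = complete(prompt)
        ok, reason = validate(reply, context)
        if ok:
            return reply
        # Re-prompt with the rejection reason, as described above.
        prompt += f"\n\nYour previous answer was rejected: {reason} Please try again."
    return '{"answer": "I am not able to answer this reliably."}'
```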

3.7  Performance Evaluator

LLMs do a good job of evaluating user feedback and scoring it against predefined criteria. A common approach is to ask the user for feedback after they complete a task, and then have the LLM evaluate that feedback against criteria such as:

  • Have users reported any dissatisfaction? (-1=Unknown, 0=No dissatisfaction, 10=Severe dissatisfaction)

  • Do users like the experience? (-1=Unknown, 0=Don’t like it at all, 10=Like it very much)

  • Does the user feel like they accomplished their goal? (-1=unknown, 0=none, 10=fully achieved)

  • etc.

Finally, the LLM returns its evaluation in JSON format; the results can be stored in a database and used to build new features.
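A minimal sketch of such an evaluator; the criteria names and the `complete` helper are illustrative assumptions:

```python
import json

EVAL_PROMPT = """Rate the user feedback below on each criterion from 0 to 10,
or -1 if it cannot be determined. Return ONLY JSON like:
{{"dissatisfaction": 0, "enjoyment": 7, "goal_achieved": 9}}

Feedback: {feedback}"""

def evaluate_feedback(feedback: str, complete) -> dict:
    """Ask the LLM to score free-text feedback; store the result for analytics."""
    raw = complete(EVAL_PROMPT.format(feedback=feedback))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {}        # malformed output: log it and skip, or re-prompt
```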

3.8 Large Model Manager

Each large model has its own strengths and weaknesses, and we may need to use multiple LLMs in one application to take full advantage of each. When choosing which model to use, general considerations include:

  • LLM inference cost and API cost

  • Match the model type to the use case: for example, an encoder model for sentiment analysis and a decoder model for text generation or chat; for basic text manipulation, choose a smaller, faster, cheaper model.

  • Text embedding models for generating vector embeddings and powering semantic search

  • Fine-tune models to achieve better performance on specific tasks

  • Instruction-tuned models (e.g. trained with RLHF) that can serve as assistants

LLM providers generally allow us to select which model to use for each request, and the output of one request can also be chained into a second model for text manipulation or review. For example, GPT-4 can be used when important reasoning is required, followed by GPT-3 for basic text manipulation or completion. This helps control API costs and ensures the most appropriate model is used for each request. We can also use cheaper open source models for certain tasks.

The large model manager abstracts the differences between APIs and models away from the application, and new models can easily be introduced in a plug-in fashion.
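A rough sketch of such a routing layer; the model names and task types are placeholders, and `call_model` stands in for a provider-agnostic client:

```python
# Hypothetical routing table: task type -> model name. The names are placeholders.
MODEL_TABLE = {
    "reasoning": "large-reasoning-model",    # e.g. a GPT-4-class model
    "embedding": "text-embedding-model",
    "summarize": "small-cheap-model",
    "default":   "small-cheap-model",
}

def route(task_type: str) -> str:
    """Pick the cheapest model that is good enough for the given task type."""
    return MODEL_TABLE.get(task_type, MODEL_TABLE["default"])

def run(task_type: str, prompt: str, call_model) -> str:
    # call_model is a provider-agnostic stand-in, e.g. call_model(model=..., prompt=...)
    return call_model(model=route(task_type), prompt=prompt)
```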

4. Simple example of building a large model App

To build an app based on a large model, you can probably take the following steps:

  1. Introduce into the new or existing app an entry point where users can interact explicitly in natural language (implicit methods can also be used);

  2. Clarify the problem domain to be solved, load the documents in the target domain, and split the text;

  3. Use an embedding model to generate vectors from the text data;

  4. Build a vector database to store the vectors and create an index;

  5. Select the target model and introduce the API into the system;

  6. Create prompt templates and support configuration and optimization;

(figure: steps for building an LLM-based app)

4.1 Introducing natural language interaction

Each app has its own user interaction design (UI/UX). When empowering an application with large models, we face unstructured data such as job descriptions, resumes, emails, text documents, PowerPoint slides, voice recordings, video, and social media, so natural language interaction has a wide range of application scenarios.

Natural language interaction is generally introduced in the form of an assistant that can be used directly as a chat. Put simply, it adds to the product an input box with a viewable conversation history.

4.2 Document loading and file splitting

There are many ready-made document loaders for HTML pages, S3, PDFs, Office documents, etc. Generally, you can load the target dataset from the enterprise's existing corpus or knowledge base in batch, and then use event triggers to load new data in real time.

Then the text needs to be split into smaller chunks. Each chunk represents a data point in the embedding space, allowing the computer to determine how similar the chunks are to each other. A common approach is to use fairly large chunks, but some experimentation is needed to find the size that best suits your use case. Remember that each LLM has a token limit (GPT-3.5 has a context window of roughly 4,096 tokens), so the entire prompt must stay within the token limit of a single LLM API call.
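A minimal sketch of a naive splitter; the chunk size and the characters-per-token rule of thumb are assumptions to be tuned for your own use case:

```python
def split_text(text: str, max_chars: int = 1500, overlap: int = 200) -> list[str]:
    """Naive fixed-size splitter with overlap. Roughly 4 characters per token is a
    common rule of thumb, so 1500-character chunks stay well inside a 4k-token budget."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap       # overlap so sentences are not cut blindly
    return chunks
```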

4.3 Vector generation of text data

We need to convert text into a form that algorithms can understand and compare, i.e. turn human language into bits and bytes. Embedding models learn to do this by analyzing the contexts in which words typically appear, giving us a vector for each word in the embedding space. Once text is represented as vectors, we can do mathematics with it, for example computing the similarity between words as the distance between data points.

Common methods for converting text into embeddings include Word2Vec, GloVe, fastText, and ELMo. Taking Word2Vec as an example: to capture the similarity between words in the embedding space, Word2Vec uses a simple neural network. Today's models are much larger and therefore represent words in a higher-dimensional space; OpenAI's Ada embedding model, for example, uses 1536 dimensions. After training, the learned weights describe each word's position in the embedding space. These pre-trained vectors capture the relationships between words and their meanings precisely enough that we can compute with them. We can also choose among different embedding models such as Ada, Davinci, Curie, and Babbage; Ada-002 is currently the fastest and lowest-cost model, while Davinci generally provides higher accuracy and performance.

Our goal in using an embedding model is to convert text chunks into vectors. With Ada-002 these vectors have 1536 dimensions, i.e. each one is a point with a specific position and orientation in 1536-dimensional space. When text chunks and user questions are represented as vectors, determining the similarity between two data points means computing their proximity in this multidimensional space, which is done with a distance measure such as cosine similarity.
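As a small worked example, here is cosine similarity between two toy 1536-dimensional vectors (random stand-ins for real Ada-002 embeddings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 = same direction, ~0 = unrelated, -1.0 = opposite."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 1536-dimensional vectors standing in for real embeddings.
rng = np.random.default_rng(0)
question_vec = rng.normal(size=1536)
chunk_vec = question_vec + rng.normal(scale=0.1, size=1536)    # a "nearby" chunk
print(round(cosine_similarity(question_vec, chunk_vec), 3))    # close to 1.0
```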

4.4 Build vector database and create index

Converting text data into vectors is only the first step. To be able to efficiently search our embedding vectors, we need to store and index them in a vector database.

A vector database is a type of data storage optimized for storing and retrieving large amounts of data that can be represented as vectors. These types of databases allow efficient querying and retrieval of subsets of data based on various criteria, such as similarity measures or other mathematical operations. Indexes are an important part of vector databases, providing a way to map queries to the most relevant documents or items in the vector store without having to calculate the similarity between each query and each document.
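A small sketch using the faiss library (assumed to be installed) as an in-process index; hosted vector databases expose equivalent add/query operations:

```python
import faiss                      # pip install faiss-cpu
import numpy as np

dim = 1536                        # e.g. the Ada-002 embedding size
index = faiss.IndexFlatIP(dim)    # inner product == cosine on normalized vectors

def add(vectors: np.ndarray) -> None:
    vectors = np.ascontiguousarray(vectors, dtype="float32")
    faiss.normalize_L2(vectors)   # normalize so inner product equals cosine
    index.add(vectors)

def query(vector: np.ndarray, k: int = 3):
    q = np.ascontiguousarray(vector.reshape(1, -1), dtype="float32")
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)
    return ids[0], scores[0]      # positions map back to the stored text chunks

add(np.random.rand(10, dim))      # in practice: embeddings of your text chunks
print(query(np.random.rand(dim)))
```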

Some common vector databases are as follows:

(figure: comparison of common vector databases)

4.5 Model selection and API application

When choosing a model, you can follow the principles above. Taking the OpenAI platform as an example, text-davinci-003 is the largest and most capable model of that family, while models like text-ada-001 are smaller, faster, and more cost-effective. Ada is much cheaper than Davinci, so if Ada's performance meets our needs, we not only save money but also get shorter response times.

We can also prototype with Davinci first and then evaluate whether Ada gives good enough results. After selecting the model, set the API key to gain access, then experiment with the various parameters and tune them.

4.6 Configuration optimization and use of Prompt template

A prompt specifies the pattern in which the model should answer, i.e. the style of behavior we expect the LLM to adopt when generating the answer. Here are some simple examples of prompts:

  • Summary generation: "Summarize the following text into 3 paragraphs for executives to read: [text]"

  • Knowledge extraction: "Based on this article: [text], what factors should people consider before buying a home?"

  • Writing content (e.g. emails, messages, code): "Send [name] an email asking about progress on the project documentation. Use an informal, friendly tone."

  • Grammar and style improvements: "Correct this text to standard English and change the tone to a friendlier one: [text]"

  • Classification: "Categorize each message by order type: [text]"

Beyond managing prompts and their templates, we can also restrict the large model so that the LLM may use only the information stored in a specified database. This restriction lets us name the sources the LLM relied on to generate an answer, which is critical for traceability and building trust. It also helps reduce unreliable output and makes the answers usable for decision-making in a corporate environment.
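A minimal sketch of such a grounded prompt template with numbered sources; the template text and function names are illustrative:

```python
GROUNDED_TEMPLATE = """You are an assistant for {app_name}.
Answer the question using ONLY the numbered sources below.
Cite the source number for every statement, e.g. [1].
If the sources do not contain the answer, reply "Not found in the knowledge base."

Sources:
{sources}

Question: {question}
Answer:"""

def render(question: str, docs: list[str], app_name: str = "our app") -> str:
    # Number the retrieved documents so the model can cite them as sources.
    sources = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(docs))
    return GROUNDED_TEMPLATE.format(app_name=app_name, sources=sources, question=question)
```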

5. Summary

Of course, we can also make use of existing or emerging application frameworks and platforms, such as the open source framework LangChain. If LangChain does not feel simple enough to use, you can try LLMFarm, which can loosely be described as a visual LangChain. If that is still too complicated, you can also try Lanying IM's enterprise knowledge base built on large models.

In short, building an application based on a large model is not as hard as it may seem, but fully exploiting the capabilities of large models to empower a business is not easy either; we still need to explore and establish best practices.


