A detailed look at the technology stack and design patterns of LLM applications

Large language models are powerful new primitives for building software. But because they are so new and behave so differently from ordinary computing resources, it's not always obvious how to use them.

In this article, we share a reference architecture for emerging LLM applications. It showcases the most common systems, tools, and design patterns we've seen used by AI startups and sophisticated tech companies. This stack is still early and may change significantly as the underlying technology advances, but we hope it serves as a useful reference for developers working with LLMs today.

This work is based on conversations with AI startup founders and engineers.

1. LLM App technology stack

Here is our current view of the LLM application stack:
(Figure: the emerging LLM application stack.)

There are many ways to build with LLMs, including training a model from scratch, fine-tuning an open-source model, or using a hosted API. The stack we present here is based on in-context learning, which is the design pattern we see most developers starting with (and which is only possible now because of foundation models).

This pattern is briefly explained in the next section; experienced LLM developers can skip this section.

2. LLM App Design Pattern: In-Context Learning

The core idea of in-context learning is to use LLMs off the shelf (i.e., without any fine-tuning) and then control their behavior through clever prompting and conditioning on private "contextual" data.

For example, suppose you're building a chatbot to answer questions about a set of legal documents. The naive approach is to paste all the documents into a ChatGPT or GPT-4 prompt and then ask questions about them at the end. This may work for very small datasets, but it doesn't scale. The largest GPT-4 models can only handle about 50 pages of input text, and performance (measured by inference time and accuracy) degrades severely as you approach this limit, known as the context window.

In-context learning solves this problem with a neat trick: instead of sending all the documents with every LLM prompt, it sends only the handful of documents that are most relevant. And the most relevant documents are identified with the help of... you guessed it... LLMs.

At a very high level, the workflow can be broken down into three phases:

  • Data preprocessing/embedding: This phase involves storing private data (legal documents, in our example) for later retrieval. Typically, documents are split into chunks, passed through an embedding model, and then stored in a specialized database called a vector database.
  • Prompt construction/retrieval: When a user submits a query (in this case, a legal question), the application constructs a series of prompts to submit to the language model. A compiled prompt typically combines a prompt template hard-coded by the developer; examples of valid output, known as few-shot examples; any necessary information retrieved from external APIs; and a set of relevant documents retrieved from the vector database.
  • Prompt execution/inference: Once the prompts have been compiled, they are submitted to a pre-trained LLM for inference, including both proprietary model APIs and open-source or self-trained models. Some developers also add operational systems such as logging, caching, and validation at this stage. (A minimal end-to-end sketch of this workflow follows this list.)
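
Here is a minimal, self-contained sketch of those three phases in Python, assuming the sentence-transformers package for embeddings; the chunking, the prompt template, and the call_llm helper are illustrative placeholders rather than any particular product's API:

```python
# A minimal sketch of the three-phase in-context learning workflow described above.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# 1. Data preprocessing / embedding: chunk documents and embed each chunk.
chunks = [
    "Clause 4.2: The lessee is responsible for routine maintenance.",
    "Clause 9.1: Either party may terminate with 60 days written notice.",
]
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

# 2. Prompt construction / retrieval: embed the query, pick the most similar chunks,
#    and assemble them into a prompt template.
question = "How can the lease be terminated?"
query_vec = embedder.encode([question], normalize_embeddings=True)[0]
scores = chunk_vecs @ query_vec                  # cosine similarity (vectors are normalized)
top = [chunks[i] for i in np.argsort(-scores)[:2]]

prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n" + "\n".join(top) + "\n\n"
    f"Question: {question}\nAnswer:"
)

# 3. Prompt execution / inference: submit the compiled prompt to the LLM of your choice.
# answer = call_llm(prompt)   # hypothetical helper wrapping whichever model API you use
print(prompt)
```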

This may seem like a lot of work, but it's usually easier than the alternatives: training or fine-tuning the LLM itself. You don't need a dedicated team of ML engineers to do in-context learning, nor do you need to host your own infrastructure or buy expensive dedicated instances from OpenAI. This pattern effectively reduces an AI problem to a data engineering problem that most startups and large corporations already know how to solve. It also tends to outperform fine-tuning for relatively small datasets, since a specific piece of information needs to appear at least ~10 times in the training set before an LLM will remember it through fine-tuning, and it can incorporate new data in near real time.

One of the biggest open questions around in-context learning is: what happens if we just change the underlying model to increase the context window? It is indeed possible, and it's an active area of research (see, for example, the Hyena paper or this recent work). But it comes with some trade-offs, mainly that the cost and time of inference scale quadratically with the length of the prompt. Today, even linear scaling (the best theoretical result) would be cost-prohibitive for many applications. At current API rates, a single GPT-4 query over 10,000 pages would cost hundreds of dollars. As such, we do not anticipate large-scale changes to the tech stack based on expanded context windows, but we will comment more on this in the body of the post.
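
As a rough sanity check on that figure (assuming roughly 500 tokens per page and the gpt-4-32k input price of about $0.06 per 1K tokens at the time of writing): 10,000 pages is on the order of 5 million tokens, and 5,000 × $0.06 ≈ $300 per query, before output tokens are even counted.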

If you want to go deeper on in-context learning, there are many great resources in the AI Canon (especially the "Practical Guides to Building with LLMs" section). In the rest of this article, we'll use the workflow above as a guide to walk through the reference stack.

(Figure: data preprocessing / embedding.)

Contextual data for LLM applications includes text documents, PDFs, and even structured formats like CSV or SQL tables. Data loading and transformation solutions vary widely among the developers we interviewed. Most use traditional ETL tools such as Databricks or Airflow. Some also use document loaders built into orchestration frameworks, such as LangChain (powered by Unstructured) and LlamaIndex (powered by Llama Hub). However, we believe this part of the stack is relatively underdeveloped, and there is an opportunity to build data-replication solutions purpose-built for LLM applications.

For embeddings, most developers use the OpenAI API, specifically the text-embedding-ada-002 model. It's easy to use (especially if you're already using other OpenAI APIs), delivers reasonably good results, and is getting cheaper and cheaper. Some larger enterprises are also exploring Cohere, whose product efforts focus more narrowly on embeddings and which performs better in certain scenarios. For developers who prefer open source, Hugging Face's Sentence Transformers library is the standard. It's also possible to create different types of embeddings tailored to different use cases; this is a niche practice today, but a promising area of research.
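
For reference, here is a minimal sketch of calling text-embedding-ada-002, written against the 0.x openai Python SDK that was current when this article was written (newer SDK versions expose the same endpoint through a client object):

```python
# Minimal sketch of generating embeddings with text-embedding-ada-002 (0.x openai SDK).
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

texts = [
    "The lessee is responsible for routine maintenance.",
    "Either party may terminate with 60 days written notice.",
]

resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
embeddings = [item["embedding"] for item in resp["data"]]  # one 1536-dim vector per input
print(len(embeddings), len(embeddings[0]))
```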

From a systems perspective, the most important part of the preprocessing pipeline is the vector database. It is responsible for efficiently storing, comparing, and retrieving up to billions of embeddings (i.e., vectors). The most common choice we see in the market is Pinecone. It tends to be the default because it's fully cloud-hosted, so it's easy to get started with, and it has many of the features large enterprises need in production (e.g., good performance at scale, SSO, and uptime SLAs).

However, a large number of vector databases are available. Notably:

  • Open source systems such as Weaviate, Vespa, and Qdrant: These often have excellent single-node performance and can be customized for specific applications, making them popular with experienced AI teams who like to build custom platforms.
  • Local vector management libraries like Chroma and Faiss: They offer a great local developer experience and are easy to spin up for small apps and development experiments, but they don't necessarily replace a full database at scale. (A minimal Chroma sketch appears after this list.)
  • OLTP extensions like pgvector: For developers who see a database-shaped hole and try to fill it with Postgres, or for enterprises that buy most of their data infrastructure from a single cloud provider, this is a good option for vector support. In the long run, it's not clear that tightly coupling vector and scalar workloads makes sense.
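
As an illustration of the local-library workflow, here is a minimal sketch using Chroma's in-memory client (API as of the time of writing); Pinecone, Weaviate, and the others follow a similar add-then-query pattern behind different client libraries:

```python
# Minimal sketch of storing and querying documents with Chroma's in-memory client.
import chromadb

client = chromadb.Client()                      # in-memory instance, good for experiments
collection = client.create_collection(name="legal_docs")

# Chroma embeds the documents with its default embedding function unless you pass
# your own vectors via the `embeddings` argument.
collection.add(
    ids=["clause-4-2", "clause-9-1"],
    documents=[
        "Clause 4.2: The lessee is responsible for routine maintenance.",
        "Clause 9.1: Either party may terminate with 60 days written notice.",
    ],
)

results = collection.query(query_texts=["How can the lease be terminated?"], n_results=1)
print(results["documents"][0])                  # most similar chunk(s) for the query
```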

Going forward, most open-source vector database companies are developing cloud offerings. Our research shows that achieving robust performance in the cloud, across the broad design space of possible use cases, is a very hard problem. So the set of options may not change dramatically in the short term, but it likely will in the long run. The key question is whether vector databases will consolidate around one or two popular systems, as OLTP and OLAP databases have.

Another open question is how embeddings and vector databases will evolve as the usable context window of most models grows. It's tempting to argue that embeddings will become less relevant, since contextual data can simply be dropped into the prompt. However, feedback from experts on this topic suggests the opposite: that embedding pipelines may become more important over time. Large context windows are a powerful tool, but they also entail significant computational cost, so using them efficiently becomes imperative. We may start to see different types of embedding models become popular, trained directly for model relevance, along with vector databases designed to enable and exploit this.

(Figure: prompt construction / retrieval.)

Strategies for prompting LLMs and incorporating contextual data are becoming increasingly sophisticated, and increasingly important as a source of product differentiation. Most developers start new projects by experimenting with simple prompts consisting of direct instructions (zero-shot prompting) or possibly some example outputs (few-shot prompting). These prompts generally yield good results, but not the level of accuracy required for production deployments.

The next level of prompting jiu-jitsu is to ground the model's responses in some source of truth and to provide external context the model wasn't trained on. The Prompt Engineering Guide catalogs no fewer than a dozen (!) more advanced prompting strategies, including chain-of-thought, self-consistency, generated knowledge, tree of thoughts, directional stimulus, and more. These strategies can also be combined to support different LLM use cases, such as document Q&A, chatbots, and so on.
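
As a hedged illustration, here is how two of those strategies, few-shot examples and a chain-of-thought instruction, might be combined into a single prompt; the template and example are invented for illustration, and real prompts are tuned per use case:

```python
# Illustrative prompt template combining few-shot examples with a chain-of-thought trigger.
FEW_SHOT_EXAMPLES = """\
Q: The contract requires 60 days notice. Notice was given on March 1. When can it end?
A: Let's think step by step. 60 days after March 1 is April 30. The contract can end on April 30.
"""

def build_prompt(context: str, question: str) -> str:
    return (
        "You are a careful legal assistant. Use only the provided context.\n\n"
        f"{FEW_SHOT_EXAMPLES}\n"
        f"Context:\n{context}\n\n"
        f"Q: {question}\n"
        "A: Let's think step by step."          # chain-of-thought trigger
    )

print(build_prompt("Clause 9.1: Either party may terminate with 60 days written notice.",
                   "How much notice is required to terminate?"))
```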

This is where orchestration frameworks like LangChain and LlamaIndex shine. They abstract away many of the details of prompt chaining; interfacing with external APIs (including determining when an API call is needed); retrieving contextual data from vector databases; and maintaining memory across multiple LLM calls. They also provide templates for many of the common applications mentioned above. Their output is a prompt, or a series of prompts, to submit to a language model. These frameworks are widely used by hobbyists and startups looking to get an app off the ground, with LangChain the leader.
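
For a sense of what that abstraction buys you, here is a sketch of the same retrieve-then-prompt flow expressed with LangChain, written against the 0.0.x API current at the time of writing (the library moves quickly, so newer versions organize these imports differently):

```python
# Sketch of a retrieval QA chain with LangChain 0.0.x (assumes faiss-cpu and an OpenAI key).
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA

docs = [
    "Clause 4.2: The lessee is responsible for routine maintenance.",
    "Clause 9.1: Either party may terminate with 60 days written notice.",
]

# The framework handles embedding, storage, retrieval, and prompt assembly.
vectorstore = FAISS.from_texts(docs, OpenAIEmbeddings())
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo"),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 1}),
)

print(qa.run("How can the lease be terminated?"))
```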

LangChain is still a relatively new project (currently at version 0.0.201), but we are already starting to see applications built with it go into production. Some developers, especially early adopters of LLMs, prefer to switch to raw Python in production to remove the extra dependency. But we expect this DIY approach to decline over time for most use cases, just as it did for the traditional web application stack.

Eagle-eyed readers will notice a seemingly odd entry in the orchestration box: ChatGPT. In its normal incarnation, ChatGPT is an application, not a developer tool. But it can also be accessed as an API. And, if you look closely, it performs some of the same functions as the other orchestration frameworks, such as abstracting away the need for custom prompts, maintaining state, and retrieving contextual data through plugins, APIs, or other sources. While ChatGPT is not a direct competitor to the other tools listed here, it can be considered an alternative solution, and it may end up as a viable, simple alternative to prompt construction.

(Figure: prompt execution / inference.)

Today, OpenAI is the leader among language model providers. Nearly every developer we interviewed starts new LLM applications with the OpenAI API, usually the gpt-4 or gpt-4-32k model. This gives a best-case scenario for application performance and is easy to use, in that it operates on a wide range of input domains and usually requires no fine-tuning or self-hosting.

A wider set of options comes into play when projects go into production and start to scale. Some common choices we hear about include:

  • Switch to gpt-3.5-turbo: it is about 50 times cheaper than GPT-4 and significantly faster. Many applications do not require GPT-4-level accuracy, but do require low-latency inference and cost-effective support for free users.
  • Experiment with other proprietary vendors, especially Anthropic's Claude models: Claude offers fast inference, roughly GPT-3.5-level accuracy, more customization options for large customers, and context windows of up to 100k tokens (although we found that accuracy degrades as input length grows).
  • Triage some requests to open-source models: This can be especially effective in high-volume B2C use cases like search or chat, where query complexity varies widely and free users need to be served cheaply. (A minimal routing sketch appears after this list.) This often makes the most sense in conjunction with fine-tuning open-source base models. We won't go deep on that tooling stack in this article, but a growing number of engineering teams use platforms like Databricks, Anyscale, Mosaic, Modal, and RunPod.
  • Open-source models can be served through a variety of inference options, including simple API interfaces from Hugging Face and Replicate; raw compute resources from the major cloud providers; and more opinionated cloud offerings like those listed above.
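
To make the triage idea concrete, here is a minimal routing sketch against the 0.x openai Python SDK; the complexity heuristic is purely illustrative, and production systems typically use a classifier or the user's tier instead:

```python
# Minimal sketch: send cheap/simple queries to a cheaper model, reserve GPT-4 for harder ones.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

def choose_model(query: str) -> str:
    # Hypothetical heuristic: long or multi-part questions go to the stronger model.
    hard = len(query.split()) > 60 or "step by step" in query.lower()
    return "gpt-4" if hard else "gpt-3.5-turbo"

def answer(query: str) -> str:
    resp = openai.ChatCompletion.create(
        model=choose_model(query),
        messages=[{"role": "user", "content": query}],
        temperature=0,
    )
    return resp["choices"][0]["message"]["content"]

print(answer("Summarize the termination clause in one sentence."))
```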

Currently, open-source models lag behind proprietary offerings, but the gap is starting to close. Meta's LLaMa models set a new bar for open-source accuracy and kicked off a flurry of variants. Since LLaMa was licensed for research use only, a number of new providers have stepped in to train alternative base models (e.g., Together, Mosaic, Falcon, Mistral). Meta is also debating a truly open-source release of LLaMa 2.

When (not if) open-source models reach accuracy levels comparable to GPT-3.5, we expect to see a Stable Diffusion-like moment for text, including large-scale experimentation, sharing, and productionization of fine-tuned models. Hosting companies like Replicate are already adding tooling to make these models easier for software developers to consume. There is a growing belief among developers that smaller, fine-tuned models can reach state-of-the-art accuracy in narrow use cases.

Most developers we interviewed have not yet gone deep on operational tooling for LLMs. Caching is relatively common (often built on Redis), because it improves application response times and reduces cost. Tools like Weights & Biases and MLflow (ported from traditional machine learning) or PromptLayer and Helicone (purpose-built for LLMs) are also fairly widely used. They can log, track, and evaluate LLM outputs, usually to improve prompt construction, tune pipelines, or select models. There are also a number of new tools being developed to validate LLM outputs (e.g., Guardrails) or detect prompt injection attacks (e.g., Rebuff). Most of these operational tools encourage use of their own Python clients to make LLM calls, so it will be interesting to see how these solutions coexist over time.
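
As an example of the caching pattern, here is a minimal sketch of prompt-level caching with Redis; call_llm stands in for whichever model call you use, and exact-match caching like this only pays off when identical prompts recur:

```python
# Minimal sketch of prompt-level caching with Redis (exact-match on a hash of the prompt).
import hashlib
import redis

cache = redis.Redis(host="localhost", port=6379)

def cached_completion(prompt: str, call_llm) -> str:
    key = "llm:" + hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit.decode("utf-8")                # cache hit: skip the model call entirely
    answer = call_llm(prompt)                     # fall through to the real model call
    cache.set(key, answer, ex=24 * 3600)          # expire after a day
    return answer
```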

Finally, the static portions of LLM applications (i.e., everything other than the model) also need to be hosted somewhere. The most common solutions we've seen so far are standard options like Vercel or the major cloud providers. However, two new categories are emerging. Startups like Steamship provide end-to-end hosting for LLM applications, including orchestration (LangChain), multi-tenant data contexts, asynchronous tasks, vector storage, and key management. Companies like Anyscale and Modal allow developers to host models and Python code in one place.

3. About Agents

The most important components missing from this reference architecture are AI agent frameworks. AutoGPT, described as "an experimental open-source attempt to make GPT-4 fully autonomous," was the fastest-growing GitHub repository in history this spring, and nearly every AI project or startup we talk to today incorporates agents in some form.

Most of the developers we talked to are very excited about the potential of agents. The in-context learning pattern we describe in this article is effective at addressing hallucination and data-freshness problems in order to better support content-generation tasks. Agents, on the other hand, give AI applications a fundamentally new set of capabilities: solving complex problems, acting on the outside world, and learning from experience post-deployment. They do this through a combination of advanced reasoning/planning, tool use, and memory/recursion/self-reflection.

So agents have the potential to become a core part of the LLM application architecture (or even take over the whole stack, if you believe in recursive self-improvement). Existing frameworks like LangChain already incorporate some agent concepts. There's only one problem: agents don't really work yet. Most agent frameworks today are at the proof-of-concept stage, capable of incredible demos but not yet of reliable, repeatable task completion. We are watching closely to see how they develop in the near term.

4. Looking to the future

Pre-trained AI models represent the most significant change to software architecture since the internet. They make it possible for individual developers to build incredible AI applications in a matter of days, surpassing supervised machine learning projects that took large teams months to build.

The tools and patterns we've listed here are likely a starting point for integrating LLMs, not the end state. We'll post updates when there are major changes (e.g., a shift toward model training) and publish new reference architectures where it makes sense. If you have any feedback or suggestions, please get in touch.


Link to the original text: Reference Architecture for LLM Application Development—BimAnt
