Emerging Architectures for LLM Applications


Large language models are powerful new primitives for building software. But because they are so new and behave so differently from regular computing resources, it's not always obvious how to use them.

In this post, we share a reference architecture for the emerging LLM application stack. It shows the most common systems, tools, and design patterns we've seen used by AI startups and sophisticated tech companies. This stack is still very early and may change substantially as the underlying technology advances, but we hope it will be a useful reference for developers working with large language models today.

LLM application stack

Here is the current view of the LLM application stack:
Emerging LLM App Stack
Below is a list of links to each project for quick reference:

  • Data pipelines: Databricks, Airflow, Unstructured
  • Embedding models: OpenAI, Cohere, Hugging Face
  • Vector database: Pinecone, Weaviate, ChromaDB, pgvector
  • Playground: OpenAI, nat.dev, Humanloop
  • Orchestration: LangChain, LlamaIndex, ChatGPT
  • APIs/plugins: Serp, Wolfram, Zapier
  • LLM cache: Redis, SQLite, GPTCache
  • Logging/LLMops: Weights & Biases, MLflow, PromptLayer, Helicone
  • Validation: Guardrails, Rebuff, Microsoft Guidance, LMQL
  • App hosting: Vercel, Steamship, Streamlit, Modal
  • LLM APIs (proprietary): OpenAI, Anthropic
  • LLM APIs (open): Hugging Face, Replicate
  • Cloud providers: AWS, GCP, Azure, CoreWeave
  • Opinionated clouds: Databricks, Anyscale, Mosaic, Modal, RunPod

There are many ways to build with LLMs, including training a model from scratch, fine-tuning an open-source model, or using a managed API. The stack we present here is based on in-context learning, which is the design pattern we see most developers starting with (and one that's only possible now with foundation models).

We briefly explain this pattern in the next section; experienced LLM developers can skip ahead.

Design Patterns: In-context learning

The core idea of in-context learning is to use LLMs off the shelf (i.e., without any fine-tuning), then control their behavior through clever prompting and conditioning on private "contextual" data.

For example, suppose you're building a chatbot to answer questions about a set of legal documents. Taking a naive approach, you could paste all the documents into a ChatGPT or GPT-4 prompt, then ask a question about them at the end. This may work for very small datasets, but it doesn't scale. The biggest GPT-4 model can only process about 50 pages of input text, and performance (measured by inference time and accuracy) degrades badly as you approach this limit, called the context window.

In-context learning solves this problem with a clever trick: instead of sending all the documents with each LLM prompt, it sends only the handful of most relevant documents. And the most relevant documents are determined with the help of... you guessed it, LLMs.

At a high level, the workflow can be broken down into three phases:

  • Data preprocessing/embedding: This stage involves storing private data (legal documents, in our example) to be retrieved later. Typically, the documents are broken into chunks, passed through an embedding model, then stored in a specialized database called a vector database.
  • Prompt construction/retrieval: When a user submits a query (a legal question, in this case), the application constructs a series of prompts to submit to the language model. A compiled prompt typically combines a prompt template hard-coded by the developer; examples of valid outputs called few-shot examples; any necessary information retrieved from external APIs; and a set of relevant documents retrieved from the vector database.
  • Prompt execution/inference: Once the prompts have been compiled, they are submitted to a pre-trained LLM for inference, including both proprietary model APIs and open-source or self-trained models. Some developers also add operational systems like logging, caching, and validation at this stage. (A minimal end-to-end sketch of these three phases follows this list.)
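
To make the three phases concrete, here is a minimal sketch of the pattern in Python. It is not how any particular framework implements it: the pre-1.0 `openai` client, the in-memory list standing in for a vector database, and the toy documents and prompt template are all illustrative assumptions.

```python
# Minimal sketch of the three-phase workflow above, for illustration only.
# Assumptions: pre-1.0 `openai` Python client, a plain in-memory list with
# cosine similarity standing in for a real vector database, toy documents.
import numpy as np
import openai

EMBED_MODEL = "text-embedding-ada-002"
CHAT_MODEL = "gpt-4"

def embed(text: str) -> np.ndarray:
    resp = openai.Embedding.create(model=EMBED_MODEL, input=[text])
    return np.array(resp["data"][0]["embedding"])

# Phase 1: data preprocessing/embedding -- chunk documents and store their vectors.
documents = [
    "Clause 4.2: Either party may terminate with 30 days written notice...",
    "Clause 7.1: All disputes are governed by the laws of Delaware...",
]
index = [(doc, embed(doc)) for doc in documents]

# Phase 2: prompt construction/retrieval -- find relevant chunks and fill a template.
def retrieve(query: str, k: int = 2) -> list:
    q = embed(query)
    def cosine(v):
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    ranked = sorted(index, key=lambda item: cosine(item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

# Phase 3: prompt execution/inference -- submit the compiled prompt to the LLM.
def answer(query: str) -> str:
    resp = openai.ChatCompletion.create(
        model=CHAT_MODEL,
        messages=[{"role": "user", "content": build_prompt(query)}],
    )
    return resp["choices"][0]["message"]["content"]

print(answer("How much notice is required to terminate the agreement?"))
```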

This may seem like a lot of work, but it's usually easier than the alternative: training or fine-tuning the LLM itself. You don't need a specialized team of ML engineers to do in-context learning. You also don't need to host your own infrastructure or buy an expensive dedicated instance from OpenAI. This pattern effectively reduces an AI problem to a data engineering problem that most startups and big companies already know how to solve. It also tends to outperform fine-tuning for relatively small datasets, since a specific piece of information needs to occur at least ~10 times in the training set before an LLM will remember it through fine-tuning, and it can incorporate new data in near real time.

One of the biggest open questions around in-context learning is: what happens if we just change the underlying model to increase the context window? This is indeed possible, and it's an active area of research (see, for example, the Hyena paper or this recent post). But it comes with trade-offs, chiefly that the cost and time of inference scale quadratically with the length of the prompt. Today, even linear scaling (the best theoretical outcome) would be cost-prohibitive for many applications. At current API rates, a single GPT-4 query over 10,000 pages would cost hundreds of dollars. So we don't expect large-scale changes to the stack based on expanded context windows alone, but we'll comment more on this in the body of the post.
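
As a rough sanity check on that figure, here is the back-of-the-envelope arithmetic; the ~665 tokens per page and the mid-2023 list price of roughly $0.06 per 1K prompt tokens for gpt-4-32k are both approximations that will drift over time.

```python
# Back-of-the-envelope cost of stuffing a huge corpus into a single prompt.
# Assumptions: ~665 tokens per page, ~$0.06 per 1K prompt tokens (gpt-4-32k,
# mid-2023 list price). Both numbers are illustrative and change over time.
pages = 10_000
tokens_per_page = 665
price_per_1k_prompt_tokens = 0.06  # USD

prompt_tokens = pages * tokens_per_page
cost = prompt_tokens / 1000 * price_per_1k_prompt_tokens
print(f"{prompt_tokens:,} prompt tokens ≈ ${cost:,.0f} per query")  # ~$400
```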

If you want to go deeper on in-context learning, there are many great resources in the AI canon (especially the "Practical guides to building with LLMs" section). In the rest of this post, we'll walk through the reference stack, using the workflow above as a guide.

Data Preprocessing/Embedding

Contextual data for LLM apps includes text documents, PDFs, and even structured formats like CSV or SQL tables. Data-loading and transformation solutions for this data vary widely across the developers we spoke with. Most use traditional ETL tools like Databricks or Airflow. Some also use document loaders built into orchestration frameworks, such as LangChain (powered by Unstructured) and LlamaIndex (powered by Llama Hub). We believe this piece of the stack is relatively underdeveloped, though, and there's an opportunity for data-replication solutions purpose-built for LLM apps.

For embeddings, most developers use the OpenAI API, specifically the text-embedding-ada-002 model. It's easy to use (especially if you're already using other OpenAI APIs), gives reasonably good results, and is becoming increasingly cheap. Some larger enterprises are also exploring Cohere, which focuses its product work more narrowly on embeddings and has better performance in certain scenarios. For developers who prefer open source, the Sentence Transformers library from Hugging Face is a standard. It's also possible to create different types of embeddings tailored to different use cases; this is a niche practice today, but a promising area of research.
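For the open-source route, a minimal sketch with the Sentence Transformers library might look like this; the model name is just a commonly used default, not a recommendation.

```python
# Minimal open-source embedding sketch using the Sentence Transformers library.
# The model name is an illustrative default; swap in whatever fits your use case.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
chunks = [
    "Clause 4.2: Either party may terminate with 30 days written notice...",
    "Clause 7.1: All disputes are governed by the laws of Delaware...",
]
embeddings = model.encode(chunks)   # shape: (len(chunks), 384) for this model
print(embeddings.shape)
```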

From a systems standpoint, the most important piece of the preprocessing pipeline is the vector database. It's responsible for efficiently storing, comparing, and retrieving up to billions of embeddings (i.e., vectors). The most common choice we see in the market is Pinecone. It's the default because it's fully cloud-hosted and easy to get started with, and it has many of the features larger enterprises need in production (e.g., good performance at scale, SSO, and uptime SLAs).

There's a wide range of vector databases available, though. Notably:

  • Open-source systems like Weaviate, Vespa, and Qdrant: They generally have excellent single-node performance and can be tailored for specific applications, so they're popular with experienced AI teams that prefer to build bespoke platforms.
  • Local vector management libraries like Chroma and Faiss: They offer a great developer experience and are easy to spin up for small apps and dev experiments (a minimal Faiss sketch follows this list). But they don't necessarily substitute for a full-blown database at scale.
  • OLTP extensions like pgvector: A good vector-support solution for developers who see every database-shaped hole and try to stuff Postgres in it, or for enterprises that buy most of their data infrastructure from a single cloud provider. It's not clear whether tightly coupling vector and scalar workloads makes sense in the long run.
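
As a sketch of the local-library option mentioned above, here is what a minimal Faiss index might look like; the dimensionality and random vectors are placeholders standing in for real embeddings from the preprocessing step.

```python
# Minimal local vector index with Faiss; the vectors here are random
# placeholders standing in for real embeddings.
import faiss
import numpy as np

dim = 384                                # must match the embedding model's output size
index = faiss.IndexFlatIP(dim)           # exact inner-product search
vectors = np.random.rand(1000, dim).astype("float32")
faiss.normalize_L2(vectors)              # normalize so inner product == cosine similarity
index.add(vectors)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)     # top-5 nearest chunks
print(ids[0], scores[0])
```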

Looking ahead, most of the open-source vector database companies are developing cloud offerings. Our research suggests that achieving strong performance in the cloud, across a broad design space of possible use cases, is a very hard problem. So the set of options may not change massively in the short term, but it likely will over the long term. The key question is whether vector databases will consolidate around one or two popular systems, similar to what happened with OLTP and OLAP.

Another open question is how embeddings and vector databases will evolve as the usable context window grows for most models. It's tempting to argue that embeddings will become less relevant, because contextual data can just be dropped into the prompt directly. However, feedback from experts on this topic suggests the opposite: that the embedding pipeline may become more important over time. Large context windows are a powerful tool, but they also entail significant computational cost, so making efficient use of them is a priority. We may start to see different types of embedding models become popular, trained directly for model relevance, and vector databases designed to enable and exploit this.

Prompt Construction/Retrieval

Strategies for prompting LLMs and incorporating contextual data are becoming increasingly complex, and increasingly important as a source of product differentiation. Most developers start new projects by experimenting with simple prompts, consisting of direct instructions (zero-shot prompting) or possibly some example outputs (few-shot prompting). These prompts often give good results, but fall short of the accuracy levels required for production deployments.

The next level of prompting jiu-jitsu aims to ground model responses in some source of truth and provide external context the model wasn't trained on. The Prompt Engineering Guide catalogs no fewer than 12 more advanced prompting strategies, including chain-of-thought, self-consistency, generated knowledge, tree of thoughts, directional stimulus, and many more. These strategies can also be combined to support different LLM use cases, such as document question answering, chatbots, and so on.
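
As a small illustration of grounding plus few-shot prompting, a compiled prompt might be assembled along these lines; the template, example, and clauses are made up for the sketch.

```python
# Illustrative prompt assembly: a hard-coded template, a few-shot example, and
# retrieved context combined into one grounded prompt. All strings are made up.
FEW_SHOT = """Q: What is the governing law of the agreement?
Context: "Clause 7.1: All disputes are governed by the laws of Delaware."
A: The agreement is governed by Delaware law (Clause 7.1)."""

def compile_prompt(question: str, retrieved_chunks: list) -> str:
    context = "\n".join(retrieved_chunks)
    return (
        "Answer strictly from the context. If the answer is not in the context, say so.\n\n"
        f"{FEW_SHOT}\n\n"
        f"Q: {question}\nContext: {context}\nA:"
    )

print(compile_prompt("How much notice is needed to terminate?",
                     ["Clause 4.2: Either party may terminate with 30 days written notice..."]))
```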

This is where orchestration frameworks like LangChain and LlamaIndex shine. They abstract away many of the details of prompt chaining; interfacing with external APIs (including determining when an API call is needed); retrieving contextual data from vector databases; and maintaining memory across multiple LLM calls. They also provide templates for many of the common applications mentioned above. Their output is a prompt, or series of prompts, to submit to the language model. These frameworks are widely used among hobbyists and startups looking to get an app off the ground, with LangChain the leader.
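For a sense of the shape of these abstractions, here is a toy chain in the 0.0.x-era LangChain API referenced below; the interfaces have changed considerably in later releases, so treat this as a sketch rather than current usage.

```python
# Sketch of a simple LangChain chain using the 0.0.x-era API; newer LangChain
# versions have reorganized these imports and interfaces.
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="Answer using only this context:\n{context}\n\nQuestion: {question}",
)
chain = LLMChain(llm=ChatOpenAI(model_name="gpt-3.5-turbo"), prompt=prompt)
print(chain.run(context="Clause 4.2: ...30 days written notice...",
                question="How much notice is required?"))
```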

LangChain is still a relatively new project (currently at version 0.0.201), but we're already starting to see apps built with it move into production. Some developers, especially early adopters of LLMs, prefer to switch to raw Python in production to eliminate the added dependency. But we expect this DIY approach to decline over time for most use cases, much as it did for the traditional web app stack.

Eagle-eyed readers will notice a seemingly odd entry in the orchestration box: ChatGPT. In its normal incarnation, ChatGPT is an app, not a developer tool. But it can also be accessed as an API. And, if you squint, it performs some of the same functions as other orchestration frameworks, such as: abstracting away the need for bespoke prompts; maintaining state; and retrieving contextual data via plugins, APIs, or other sources. While not a direct competitor to the other tools listed here, ChatGPT can be considered a substitute solution, and it may end up as a viable, simple alternative to prompt construction.

Prompt Execution/Inference

Today, OpenAI is the leader among language models. Nearly every developer we interviewed starts new LLM apps using the OpenAI API, usually with the gpt-4 or gpt-4-32k model. This gives a sweet spot for app performance and is easy to use, in that it operates on a wide range of input domains and usually requires no fine-tuning or self-hosting.

When projects go into production and start to scale, a broader set of options comes into play. Some of the common questions we heard include:

  • Switching to gpt-3.5-turbo: It's roughly 50x cheaper and significantly faster than GPT-4. Many apps don't need GPT-4-level accuracy, but do need low-latency inference and cost-effective support for free users.
  • Experimenting with other proprietary vendors (especially Anthropic's Claude models): Claude offers fast inference, GPT-3.5-level accuracy, more customization options for large customers, and up to a 100k context window (though we've found accuracy degrades as input length increases).
  • Triaging some requests to open-source models: This can be especially effective in high-volume B2C use cases like search or chat, where query complexity varies widely and free users need to be served cheaply. (A minimal routing sketch follows this list.)
    • This usually makes the most sense in conjunction with fine-tuning open-source base models. We don't go deep on that tooling stack in this post, but a growing number of engineering teams use platforms like Databricks, Anyscale, Mosaic, Modal, and RunPod.
    • A variety of inference options are available for open-source models, including simple API interfaces from Hugging Face and Replicate; raw compute resources from the major cloud providers; and more opinionated cloud offerings like those listed above.
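
A lightweight version of this cost/latency triage is to route each request by difficulty, for example as below; the heuristic, model choices, and pre-1.0 `openai` client usage are all assumptions for the sketch.

```python
# Illustrative request routing: cheap/fast model by default, escalate hard queries.
# The heuristic, models, and pre-1.0 `openai` client usage are assumptions.
import openai

def route_model(prompt: str) -> str:
    hard = len(prompt) > 2000 or "step by step" in prompt.lower()
    return "gpt-4" if hard else "gpt-3.5-turbo"

def complete(prompt: str) -> str:
    resp = openai.ChatCompletion.create(
        model=route_model(prompt),
        messages=[{"role": "user", "content": prompt}],
    )
    return resp["choices"][0]["message"]["content"]
```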

Open-source models trail proprietary offerings right now, but the gap is starting to close. Meta's LLaMa models set a new bar for open-source accuracy and kicked off a flurry of variants. Since LLaMa was licensed for research use only, a number of new providers have stepped in to train alternative base models (e.g., Together, Mosaic, Falcon, Mistral). Meta is also debating a truly open-source release of LLaMa 2.

When (not if) open-source LLMs reach accuracy levels comparable to GPT-3.5, we expect to see a Stable Diffusion-like moment for text, including massive experimentation, sharing, and productionizing of fine-tuned models. Hosting companies like Replicate are already adding tooling to make these models easier for software developers to consume. There's a growing belief among developers that smaller, fine-tuned models can reach state-of-the-art accuracy in narrow use cases.

Most developers we spoke with haven't gone deep on operational tooling for LLMs yet. Caching is relatively common, usually based on Redis, because it improves application response times and cost. Tools like Weights & Biases and MLflow (ported from traditional machine learning) or PromptLayer and Helicone (purpose-built for LLMs) are also fairly widely used. They can log, track, and evaluate LLM outputs, usually for the purpose of improving prompt construction, tuning pipelines, or selecting models. There are also a number of new tools being developed to validate LLM outputs (e.g., Guardrails) or detect prompt injection attacks (e.g., Rebuff). Most of these operational tools encourage use of their own Python clients to make LLM calls, so it will be interesting to see how these solutions coexist over time.
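
A minimal version of the Redis-backed caching pattern might look like the following; exact-match caching is the simplest variant (tools like GPTCache layer semantic matching on top), and the key scheme and TTL here are illustrative choices.

```python
# Minimal exact-match LLM cache on Redis: hash the (model, prompt) pair and
# reuse the stored completion if present. Key scheme and TTL are illustrative.
import hashlib
import redis

r = redis.Redis()

def cached_completion(model: str, prompt: str, generate) -> str:
    key = "llmcache:" + hashlib.sha256(f"{model}|{prompt}".encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return hit.decode()
    result = generate(model, prompt)       # call the LLM only on a cache miss
    r.set(key, result, ex=60 * 60 * 24)    # expire after 24 hours
    return result
```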

Finally, the static portions of LLM apps (i.e., everything other than the model) also need to be hosted somewhere. The most common solutions we've seen so far are standard options like Vercel or the major cloud providers. However, two new categories are emerging. Startups like Steamship provide end-to-end hosting for LLM apps, including orchestration (LangChain), multi-tenant data contexts, async tasks, vector storage, and key management. And companies like Anyscale and Modal allow developers to host models and Python code in one place.

What about agents?

The most important components missing from this reference architecture are AI agent frameworks. AutoGPT, described as "an experimental open-source attempt to make GPT-4 fully autonomous," was the fastest-growing GitHub repo in history this spring, and practically every AI project or startup out there today includes agents in some form.

Most developers we spoke with are incredibly excited about the potential of agents. The in-context learning pattern described in this post is effective at solving hallucination and data-freshness problems, in order to better support content-generation tasks. Agents, on the other hand, give AI apps a fundamentally new set of capabilities: to solve complex problems, to act on the outside world, and to learn from experience post-deployment. They do this through a combination of advanced reasoning/planning, tool usage, and memory/recursion/self-reflection.

So agents have the potential to become a central piece of the LLM app architecture (or even take over the whole stack, if you believe in recursive self-improvement). Existing frameworks like LangChain have already incorporated some agent concepts. There's only one problem: agents don't really work yet. Most agent frameworks today are in the proof-of-concept phase, capable of incredible demos but not yet of reliable, reproducible task completion. We're watching closely how they develop in the near future.

Outlook

Pre-trained AI models represent the most important architectural change in software since the internet. They make it possible for individual developers to build incredible AI apps in a matter of days, surpassing supervised machine learning projects that took big teams months to build.

The tools and patterns listed here are likely the starting point for integrating LLMs, not the end state. We'll update this as major changes take place (e.g., a shift toward model training) and publish new reference architectures where it makes sense.

References

  1. Emerging Architectures for LLM Applications
  2. in-context learning
