Exploring Open Source AI Infrastructure for Large Language Models: A Review of the Beijing Open Source AI Meetup

Original article:
Explore open source AI Infra for Large Language Models: Highlights from the Open Source AI Meetup Beijing | Cloud Native Computing Foundation

 

Background

The recent surge in popularity of large language models and their applications, fueled by the success of ChatGPT, has sparked significant interest in the techniques behind these models. To explore the infrastructure behind large language models and related applications, WasmEdge organized a developer meetup in Beijing on July 8 with the support of the Cloud Native Computing Foundation (CNCF). The event brought together experts and developers from the AI and cloud-native open source communities to discuss the technologies involved across the life cycle of large language model development.

The meetup covered the following talks:

Michael Yuan - Building Lightweight AI Applications with Rust and Wasm
 

Michael Yuan, founder of the CNCF WasmEdge runtime, explored how to leverage the WebAssembly (Wasm) container infrastructure to build plugins for large language models (LLMs).

He outlined several key problems with current LLM functions and plugins:

  • LLM lock-in forces users to stay within a single provider's ecosystem, limiting flexibility.
  • Model workflow lock-in means components such as tokenizers or inference engines cannot easily be swapped out; everything must stay within one monolithic framework.
  • UI lock-in limits the UI/UX to what the vendor provides, leaving little room for customization.
  • Lack of support for machine input: today's large language models are built for dialogue with human input and do not work well with structured, machine-generated data.
  • Large language models cannot initiate conversations or volunteer information; the user must drive every interaction.

Existing open source frameworks also present challenges:

  • Even for basic applications, developers must build and manage the infrastructure themselves; there is no serverless option.
  • Everything relies on Python, which is slow for inference compared with compiled languages like Rust.
  • Developers must write custom authentication and connectors to external services such as databases, and this overhead slows down development.

To overcome these limitations, WebAssembly and serverless functions offer a compelling way to build lightweight LLM applications. Wasm provides a portable runtime that starts quickly, supports multiple languages including Rust, and is well suited to computationally intensive inference.

WasmEdge has built flows.network, a platform that lets developers run serverless Rust functions on WasmEdge for scenarios such as R&D management, DevRel, marketing automation, and training/learning. It gives large language models memory, "ears", and "hands" (the ability to perceive external events and take actions), so developers can build LLM applications in a serverless manner within minutes instead of months, enabling a new generation of customizable, vertical LLM applications.

Through this talk, the audience learned how to use flows.network to build an AI application in a serverless manner within three minutes.
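
To make the pattern concrete, here is a minimal sketch of the kind of serverless function the talk describes: a small, stateless Rust function that receives an event (say, a new GitHub comment) and answers it with an LLM. The handler name, event shape, and endpoint are illustrative assumptions, not the actual flows.network API, which additionally manages triggers, secrets, and outbound integrations for you.

```rust
// Sketch of a stateless, serverless LLM function (hypothetical handler;
// not the flows.network API). Requires reqwest with the "blocking" and
// "json" features, plus serde_json.
use serde_json::{json, Value};

/// Hypothetical entry point: the platform would invoke this with event text.
fn handle_event(event_text: &str) -> Result<String, Box<dyn std::error::Error>> {
    let payload = json!({
        "model": "gpt-3.5-turbo",
        "messages": [
            {"role": "system", "content": "You are a helpful code-review bot."},
            {"role": "user", "content": event_text}
        ]
    });

    // Call an OpenAI-compatible chat-completion endpoint.
    let resp: Value = reqwest::blocking::Client::new()
        .post("https://api.openai.com/v1/chat/completions")
        .bearer_auth(std::env::var("OPENAI_API_KEY")?)
        .json(&payload)
        .send()?
        .json()?;

    // Extract the assistant's reply from the response JSON.
    Ok(resp["choices"][0]["message"]["content"]
        .as_str()
        .unwrap_or("")
        .to_string())
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    println!("{}", handle_event("Summarize this pull request: ...")?);
    Ok(())
}
```

Because the function holds no state between invocations, the platform can scale it to zero and cold-start it quickly, which is exactly the property Wasm's fast startup enables.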

Fangchi Wang - FATE-LLM: Federated Learning Meets Large Language Models


 

Wang Fangchi, a senior engineer in the VMware CTO Office and a maintainer of the FATE project, introduced FATE-LLM, a forward-looking solution that combines federated learning with large language model technology. FATE-LLM allows multiple participants to collaboratively fine-tune large models on their private data, preserving privacy because raw data never leaves each participant's local domain. The presentation covered the latest results in applying federated learning to large language models such as ChatGLM and LLaMA, and discussed technical challenges, design concepts, and future plans.

Federated learning is a promising approach to the data-privacy problems of large language models. It helps address the following challenges:

  • Using private data when public data is exhausted or insufficient
  • Maintaining privacy while building and using LLMs

FATE-LLM (FATE Federated Large Language Model) allows participants to fine-tune a shared model using their own private data without transferring the original data. This could allow more organizations to benefit from large language models.

  • Supports horizontal federated learning across multiple clients, which fine-tune FATE's built-in pre-trained models on their own private data;
  • Supports collaborative training with 30+ participants (a minimal sketch of the underlying aggregation idea follows this list).
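
To illustrate the principle (not the FATE-LLM API itself), here is a minimal sketch of the aggregation step at the heart of horizontal federated learning: each participant fine-tunes locally on private data and shares only parameter (or adapter, e.g. LoRA) updates, which a coordinator combines by weighted averaging (FedAvg). Real deployments such as FATE add secure communication and privacy-preserving protocols on top.

```rust
/// Weighted average of per-client parameter vectors (FedAvg).
/// `updates` pairs each client's parameters with its local sample count.
fn fed_avg(updates: &[(Vec<f32>, usize)]) -> Vec<f32> {
    let dim = updates[0].0.len();
    let total: usize = updates.iter().map(|(_, n)| n).sum();
    let mut global = vec![0.0f32; dim];
    for (params, n) in updates {
        let w = *n as f32 / total as f32; // weight by data volume
        for (g, p) in global.iter_mut().zip(params) {
            *g += w * p; // only parameters are shared, never raw data
        }
    }
    global
}

fn main() {
    // Three participants with different amounts of private data.
    let updates = vec![
        (vec![0.1, 0.2], 100),
        (vec![0.3, 0.0], 300),
        (vec![0.2, 0.4], 600),
    ];
    println!("{:?}", fed_avg(&updates)); // aggregated global parameters
}
```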

Li Chen - Vector Database: Long-term Memory for Large Models


 

Li Chen, head of operations and ecosystem development at Milvus, emphasized the importance of vector databases for organizations building custom large language models. Milvus is an open-source vector database designed for cloud-native environments. It adopts a Kubernetes (K8s)-based microservice architecture for distributed, cloud-native operation, and separates storage from compute to provide elastic scalability, expanding and contracting seamlessly with workload demands. Its high-availability design ensures fast recovery from failures, typically within minutes.

One of Milvus's notable capabilities is handling billions of vectors, demonstrating its scalability and applicability to large-scale applications. Milvus uses message queues to support real-time insertion and deletion of data while keeping data management efficient.
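
To see what a vector database fundamentally does, here is a minimal brute-force sketch of similarity search: given a query embedding, return the stored vectors closest to it. Milvus performs this at billion-vector scale with approximate nearest neighbor (ANN) indexes such as IVF or HNSW rather than the exhaustive scan shown here; the toy vectors below are made up for illustration.

```rust
// Cosine similarity between two embedding vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

/// Return (index, score) of the top-k most similar stored vectors.
fn search(store: &[Vec<f32>], query: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = store
        .iter()
        .enumerate()
        .map(|(i, v)| (i, cosine(v, query)))
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap()); // best first
    scored.truncate(k);
    scored
}

fn main() {
    // Toy "collection" of document embeddings (real ones come from a model).
    let store = vec![
        vec![1.0, 0.0, 0.0],
        vec![0.7, 0.7, 0.0],
        vec![0.0, 1.0, 0.0],
    ];
    let query = vec![0.9, 0.1, 0.0];
    println!("{:?}", search(&store, &query, 2)); // nearest neighbors
}
```

This is the "long-term memory" pattern from the talk's title: an application embeds its documents, stores the vectors, and retrieves the most relevant ones to feed into an LLM prompt.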

Milvus integrates with the popular AI ecosystem, including OpenAI, LangChain, Hugging Face, and PyTorch, providing seamless compatibility with widely used frameworks and libraries. It also ships a comprehensive set of ecosystem tools, such as a GUI, a CLI, and monitoring and backup utilities, giving users a powerful toolkit for managing and optimizing Milvus deployments.

To sum up, Milvus provides a distributed, cloud-native vector database solution that excels in scalability, fault tolerance, and integration with diverse AI ecosystems. Its microservice design, combined with its expansive ecosystem of tools, makes Milvus a powerful option for managing large-scale AI applications.

Zhang Zhi - Model Quantization Techniques in Practice


Zhang Zhi, a model quantization framework engineer at SenseTime, gave an in-depth discussion of widely used neural network quantization techniques. The presentation focused on quantization techniques used in large language models, such as weight-only quantization and grouped KV-cache quantization, discussed their application scenarios and performance benefits, and offered insights on deploying models on servers, optimizing performance, and reducing storage and computation costs.
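
As a concrete illustration of the weight-only idea (a simplified sketch, not PPQ's actual API), the following shows symmetric int8 quantization of one weight group: weights are stored as int8 values plus one shared scale and dequantized on the fly, cutting storage roughly 4x versus float32. Production frameworks add calibration, per-channel or per-group granularity, and fused kernels.

```rust
/// Symmetric int8 quantization of one weight group: w ≈ q * scale.
fn quantize(weights: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    // Guard against an all-zero group to avoid division by zero.
    let scale = (max_abs / 127.0).max(f32::MIN_POSITIVE);
    let q = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Recover approximate float weights during inference.
fn dequantize(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}

fn main() {
    let weights = [0.02, -0.13, 0.07, 0.11];
    let (q, scale) = quantize(&weights);
    // 4 bytes/weight (f32) shrink to 1 byte/weight plus one shared scale.
    println!("int8: {:?}, scale: {}", q, scale);
    println!("restored: {:?}", dequantize(&q, scale));
}
```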

Model quantization and compression are critical for deploying large language models, especially on resource-constrained hardware such as edge devices. Tools such as PPQ, developed by the OpenPPL team, can quantize neural networks to reduce their size and computational cost, allowing them to run on a wider range of hardware. The talk was packed with practical technical details of large-model quantization; a recording was later released on Bilibili, where it was warmly received.

The tea break offered pizza and fruit.

Summary


The meetup was an exciting event for attendees passionate about cloud-native and AI technologies. The speakers focused on large language models and discussed in depth the open source projects around them, including lightweight AI application development, federated learning for large models, vector databases, model quantization, and LLM evaluation. Attendees gained valuable insight into the intricate details of these technologies, enabling everyone to take advantage of the synergy between open source cloud-native and AI projects and applications.

Overall, the meetup highlighted how open source technologies can help organizations build and apply large language models. By sharing knowledge and collaborating, the AI and cloud-native communities can work together to address the challenges of advancing and productizing next-generation AI systems.
