Distributed Inference and Fine-tuning of Large Language Models Over The Internet

This article is part of a series on LLMs; it presents a translation of the paper "Distributed Inference and Fine-tuning of Large Language Models Over The Internet".

Abstract

Large language models (LLMs) are useful in many NLP tasks and become more capable as they scale, with the best open-source models having over 50 billion parameters. However, using these 50B+ models requires high-end hardware, which puts them out of reach for most researchers. In this work, we study cost-efficient methods for inference and fine-tuning of LLMs, comparing local and distributed strategies. We observe that a sufficiently large model (50B+) can run efficiently even on geographically distributed devices over a consumer-grade network. This makes it possible to run LLMs efficiently by pooling the idle compute resources of multiple research groups and volunteers. We address two open problems: (1) how to perform inference and fine-tuning reliably if any device can disconnect abruptly, and (2) how to partition an LLM among devices with uneven hardware that can join and leave at will. To that end, we develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices so as to maximize the total system throughput. We showcase these algorithms in PETALS, a decentralized system that runs Llama 2 (70B) and BLOOM (176B) over the Internet up to 10x faster than offloading for interactive generation. We evaluate the performance of our system in simulated conditions and in a real-world setup spanning two continents.
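
To make the abstract concrete, here is a minimal sketch of what client-side usage of such a system looks like, based on the publicly documented petals Python client that accompanies PETALS; the model identifier, prompt, and generation settings are illustrative choices, not prescribed by the paper.

```python
# Minimal sketch of distributed inference from the client's point of view,
# using the petals client (pip install petals). The model name below is an
# illustrative assumption; any model served by a public swarm would work.
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "bigscience/bloom"  # assumption: a model with active swarm servers

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Only the embeddings live locally; the transformer blocks are executed by
# remote servers discovered over the Internet.
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))
```

From the caller's perspective this looks like ordinary transformers usage; the difference is that the blocks run on remote, possibly unreliable servers that may join or leave while the session is in progress.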

1 Introduction

2 Background: Efficient training and inference

3 Methods

4 Experiments

5 Conclusion

In this paper, we introduce a novel fault-tolerant algorithm for inference with large language models and, most importantly, a decentralized system for running LLMs on distributed, unreliable devices connected over the Internet, which significantly outperforms other approaches to running inference on consumer-grade hardware. We demonstrate that the proposed system scales to the largest publicly available language models, with hundreds of billions of trainable parameters.
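
The fault-tolerance idea mentioned above can be illustrated with a toy sketch: the client remembers what it sent for the current session, so when a server holding some span of blocks disconnects mid-generation, the same span can be replayed on a replacement server. This is a simplified illustration under assumptions of my own (toy activations, hard-coded server spans), not the system's actual protocol.

```python
import random

# Toy stand-in for a server: it hosts a contiguous span of transformer
# blocks and may "disconnect" (raise) on any request.
class ToyServer:
    def __init__(self, name, start, end, fail_prob=0.5):
        self.name, self.start, self.end = name, start, end
        self.fail_prob = fail_prob

    def forward(self, h):
        if random.random() < self.fail_prob:
            raise ConnectionError(f"{self.name} disconnected")
        for _ in range(self.start, self.end):
            h = h * 1.01 + 0.1  # placeholder for one transformer block
        return h

def fault_tolerant_forward(spans, h):
    """Run an activation through every span of blocks, falling back to another
    server that holds the same span if the current one fails. The client keeps
    the activation it sent, so it can simply resend it to the replacement."""
    for span, candidates in spans.items():
        for server in candidates:
            try:
                h = server.forward(h)
                break
            except ConnectionError:
                continue  # try the next server that holds the same blocks
        else:
            raise RuntimeError(f"no live server holds blocks {span}")
    return h

# Two spans of blocks, each replicated on an unreliable and a reliable server.
spans = {
    (0, 8):  [ToyServer("A", 0, 8),  ToyServer("A-backup", 0, 8,  fail_prob=0.0)],
    (8, 16): [ToyServer("B", 8, 16), ToyServer("B-backup", 8, 16, fail_prob=0.0)],
}
print(fault_tolerant_forward(spans, h=1.0))
```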
While our work focuses on technical aspects, it is important to consider the limitations of our approach, such as the privacy of data processed by outside peers, as well as the broader implications of making LLMs more accessible. We discuss these issues and outline directions for future work in Appendix H.

Reprinted from: blog.csdn.net/c_cpp_csharp/article/details/135064268