Setting up an open source LLM for local development

While ChatGPT remains popular, a leaked internal Google document suggests the open source community is catching up and making major breakthroughs: we can now run large language models on consumer GPUs.

So, if you are a developer who wants to try these LLMs in your local environment and build applications with them, this article walks through some options that can help you get started.

First option:

https://github.com/oobabooga/text-generation-webui

This Gradio-based web UI can be used to run almost any available LLM. It supports multiple model formats, such as GGML and GPTQ.
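
If you want to call it from code rather than from the browser, the project can also expose an HTTP API. Below is a minimal sketch assuming the server was launched with its API enabled (for example via the `--api` flag) and is listening on the default local port; the endpoint path and payload fields have changed between versions, so treat the names here as illustrative:

```python
import requests

# Hypothetical example: text-generation-webui started with --api,
# exposing its legacy generate endpoint on localhost:5000.
# The endpoint path and payload fields may differ in your version.
API_URL = "http://localhost:5000/api/v1/generate"

payload = {
    "prompt": "What is GPTQ quantization?",
    "max_new_tokens": 200,
    "temperature": 0.7,
}

response = requests.post(API_URL, json=payload, timeout=120)
response.raise_for_status()

# The legacy API returns the generated text under results[0]["text"].
print(response.json()["results"][0]["text"])
```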

Second option:

https://github.com/ggerganov/llama.cpp

A C/C++ library originally focused on running LLM inference on the CPU only, though it has recently added support for GPU acceleration. It is designed as a standalone library, so if you want to build an application that integrates with it, you may have to write your own bindings or use a community bindings library such as llama-cpp-python.

Note: For llama-cpp-python, if you are using an Apple Silicon (M1) Mac, make sure your installed Python supports the arm64 architecture. Otherwise, the installation will build an x86 version of llama.cpp, which runs roughly 10 times slower on Apple Silicon Macs.
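
To make the bindings route concrete, here is a minimal sketch using llama-cpp-python. The model path is a placeholder for whatever GGML/GGUF model file you have downloaded locally, and `n_gpu_layers` only matters if your build was compiled with GPU support:

```python
from llama_cpp import Llama

# Load a locally downloaded model. The path below is a placeholder;
# point it at any GGML/GGUF model file you have on disk.
llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",
    n_ctx=2048,        # context window size
    n_gpu_layers=0,    # raise this if your build has GPU acceleration
)

# Run a single completion and read the generated text.
output = llm(
    "Q: Name three advantages of running an LLM locally. A:",
    max_tokens=128,
    stop=["Q:"],
)
print(output["choices"][0]["text"])
```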

Third option:

If you have a decent GPU with more than 8 GB of VRAM, you can run GPTQ-quantized models on the GPU, for example with GPTQ-for-LLaMa.

However, GPTQ-for-LLaMa only provides CLI-like examples and limited documentation. Therefore, I created a sample repository that uses the GPTQ-for-LLaMa implementation and serves the generated text through an HTTP API.

https://github.com/mzbac/GPTQ-for-LLaMa-API
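
To try a server like this from code, a client sketch might look like the following. Note that the host, port, endpoint path, and payload fields here are assumptions for illustration; check the repository's README for the actual interface:

```python
import requests

# Hypothetical client for a local text-generation HTTP API.
# Host, port, endpoint, and field names are illustrative assumptions;
# consult the GPTQ-for-LLaMa-API README for the real interface.
API_URL = "http://localhost:8000/generate"

resp = requests.post(
    API_URL,
    json={"prompt": "Explain GPTQ in one sentence.", "max_new_tokens": 64},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```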

In short, whether it is the Gradio web UI, llama.cpp, or GPTQ-for-LLaMa, each option suits different hardware capabilities for running LLMs locally. Make your selection based on your hardware resources. Dive into the exciting world of LLMs, and happy coding!
