Building a production-grade Llama LLM service

For anyone who wants to try AI or local LLMs without being shocked by unexpected cloud bills or API fees, I'll share what my own journey was like and how I got started running Llama2 inference on cheap consumer-grade hardware.

This project is developing at a very active pace, which makes it exciting. It also means that anything you read here is subject to change, as happened with the flawed model conversion I used, which produced the drunken Llama responses described below.



1. Introduction

Although it isn't well publicized, Meta has quietly contributed at least four of the most revolutionary open source projects of the past decade. The first is Zstd compression, a faster, superior algorithm that has made just about any ZIP file saved since 2015 look obsolete. The second and third came through collaboration with the Alliance for Open Media: the Opus and AV1 media codecs, without which modern global 4K streaming would be almost impossible. The fourth and most recent contribution, and my favorite, is the ultra-portable Llama LLM and its associated pre-trained models. Training is by far the trickiest and most resource-intensive part of an LLM, which is why releasing the raw trained models is so useful.

Google open-sourced its TensorFlow project years ago, but training your own models with it is not something most people can do easily. Running inference against publicly available, pre-trained models from the ecosystem is much easier and can be done on commodity compute resources or GPUs. In theory, anything compatible with the OpenCL CLBlast library can do it, and this build of llama.cpp supports CPU and GPU hardware alike. Note that we will be using a build from the master branch, which should be considered beta and may have issues.
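As a point of reference, here is a minimal build sketch. The flag names reflect the project's Makefile at the time of writing and may change as quickly as everything else here; the CLBlast variant assumes the clblast development package is already installed.

# Minimal sketch: build llama.cpp from the master branch.
# LLAMA_CLBLAST is the Makefile option I'm assuming for OpenCL offload.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make                   # plain CPU build
make LLAMA_CLBLAST=1   # CPU + OpenCL/CLBlast GPU offload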

Making these models public and available to anyone makes it very simple to set up your own AI lab for local LLM inference. It doesn't even require the latest GPUs or resources that are hard to find on the market; you can run all of this on cheap second-hand hardware. A lot of this was inspired by Chandler Lofland's excellent article "How to Run Llama 2 on Anything", but I'm going to try to simplify things even further by pre-packaging a high-performance llama.cpp build for Fedora Linux.

Fedora remains the upstream for most RPM-based distributions, including Red Hat Enterprise Linux and derivatives such as Amazon Linux, Oracle Linux, Rocky Linux, the former CentOS, Scientific Linux, and Fermi Linux. That means RPM packaging done in Fedora eventually trickles down into all of these distributions, so this is also a story of how to prepare releases for future environments. The project moves so fast that it's hard to keep up, and most users don't want to recompile every time something changes. One of my specialties has always been packaging automation for upstream projects, so this is a bit of a journey from commit to production server. If you do plan on using this in production, I'd recommend hardened Enterprise Linux rather than the bleeding-edge Fedora I'm using upstream.

While the article above covers running Llama on a MacBook for a single user, I'm going to approach this from a server perspective on a server-grade workstation. The goal is to build a repeatable environment that multiple clients on a team can connect to and share, in the cloud or on-premises. You can also try the public sandbox, which performs very well in my experience.

It is worth noting that Llama2 has two components: the application and the data (the model weights). The official way to run Llama2 is a Python application, but the C++ port is noticeably faster and more efficient, and RAM is the most critical resource you'll run up against when trying to serve Llama2 on either CPU or GPU.

In my case, the workstation is a 2-socket HP Z840 running at half capacity with 512GB of DDR4 LRDIMMs. Using only the CPU gives access to much more RAM, but CPU inference is slower, especially when memory has to be reached across sockets. Additionally, LRDIMMs trade a little speed for higher capacity compared to regular RDIMMs. Normally I recommend disabling NUMA, because keeping a workload pinned to one socket and its local memory often drives that CPU so hot that it throttles down to unrealistic speeds. In this case LLM inference is a CPU workload that can benefit from some NUMA tuning (llama.cpp does support a NUMA flag), but most people will want to use a GPU anyway.

At the time of writing, the common GGML model file format has just been deprecated and replaced by GGUF, somewhat abruptly. The GGUF format is superior and future-proof, but some models need to be re-converted or re-downloaded. It's a moving target, and even the converter couldn't handle most of my GGML files. The good news is that GGUF files are a bit smaller on disk, so they are more efficient than the larger GGML files. The bad news is that CodeLlama happened to be released at the same time, and landing in the middle of this change slowed me down a lot. While wrestling with the competing model formats I realized it would be helpful to easily identify which files are in which format and version. To keep everything straight during testing I wrote a temporary magic file so the file command could identify the models. Through the resulting PR I found that the project's community is very active and friendly, and together we also submitted a formal MIME type definition.

jboero@xps ~/Downloads> file *llama*
codellama-7b.Q8_0.gguf:     GGUF LLM model version=1
llama-2-7b.ggmlv3.q8_0.bin: GGML/GGJT LLM model version=3
llama-cpp.srpm.spec:        ASCII text
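If you don't have updated magic definitions installed yet, peeking at the leading bytes is enough to tell the formats apart. A minimal sketch follows; the GGUF magic is the ASCII bytes "GGUF" per the spec, while the legacy magic value is my reading of llama.cpp's headers, so treat it as an assumption.

# Minimal sketch: dump the leading magic bytes of each model file.
# GGUF files begin with the ASCII bytes "GGUF"; legacy GGML/GGJT files use a
# little-endian magic (0x67676a74, 'ggjt'), so the raw bytes appear reversed.
for f in *.gguf *.bin; do
    [ -e "$f" ] || continue
    printf '%s: ' "$f"
    od -An -tx1 -N4 "$f"      # first four bytes in hex
done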

CodeLlama is mainly released as three different models, trained with 7B, 13B, and 34B parameters. Even the 7B model is very capable and can easily run on a small laptop.

2. Performance

The GPU is actually optional, but since I've replaced my old Nvidia cards I'm running a 16GB workstation-grade A4000 and a 12GB consumer-grade RTX 4070 Ti.

Any production AI deployment should use a server or workstation-class GPU with ECC RAM, such as the A4000. Do not run critical LLMs on old mining or gaming CPUs or GPUs without checking whether the RAM supports ECC. Faster HBM2/HBM3 RAM on server-grade GPUs will also provide a faster experience than the DDR/GDDR on commodity hardware, since access latency to the next model weight or tensor is always critical to performance.

Unlike the CPU, a GPU is free to choose its own RAM speed and bus, which is one of the reasons GPUs make such important co-processors. If you're not familiar with general-purpose GPU (GPGPU) history, see my previous series All Things GPU for background.

We have enough GPU memory to load most 13B Llama2 models at full performance, and enough CPU memory to test the large 70B model. So let’s take it for a test drive.
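For example, layers can be offloaded to the GPU with llama.cpp's -ngl/--n-gpu-layers option. A hedged sketch follows: the layer count and model path are illustrative, and how many layers fit depends entirely on the model and available VRAM.

# Sketch: offload as many layers as fit into VRAM (40 here is illustrative).
# With a CUDA or CLBlast build, any remaining layers stay on the CPU.
./main -m /mnt/backup/llama/llama-2-13b.Q8_0.gguf -ngl 40 -p "Hello"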

First, downloading the models can be daunting without a fast connection. They're huge: taken all together, currently about 200GB+ uncompressed, and converting between formats requires additional space. For this kind of storage you may want solid-state hardware, but in production you only need enough to start the service.

I find Fedora 38's btrfs file system and its zstd compression option to be of great benefit here. In my experience the default compress=zstd:1 is not enough: the kernel decides not to compress much of the data, which is a shame since most of the content is actually highly compressible, depending on the format you end up with. Switching the mount option to compress-force=zstd:1 works much better, needing only 46% of the real space on disk with almost no CPU overhead. Since this is read-only data it works perfectly, as shown with this copy of the 70B chat model:

sudo compsize llama-2-70b-chat/
Processed 3 files, 2105004 regular extents (2105004 refs), 2 inline.
Type       Perc     Disk Usage   Uncompressed Referenced  
TOTAL       46%      120G         256G         256G       
zstd        46%      120G         256G         256G
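A minimal sketch of applying the option is below. The mount point /mnt/backup reflects my setup; adjust it for yours, and note that only newly written files pick up the forced compression.

# Sketch: remount the btrfs volume holding the models with forced zstd (level 1).
# Only newly written files are affected; re-copy existing models to recompress them.
sudo mount -o remount,compress-force=zstd:1 /mnt/backup
sudo compsize llama-2-70b-chat/     # verify the ratio afterwards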

3. Initial thoughts on CodeLlama

My main interest in Llama2 is adding support to my AI Shell (aish) project via a Llama plugin, so I'll try it out with a simple bash test. I like asking code-oriented LLMs to simply show me the current time in a given city, so let's try the basic 34B CodeLlama model on a CPU server.

Starting the server is very simple: specify --host :: to listen on any IPv4 or IPv6 address (note that TLS is not supported yet!). My models are saved to a backup disk with forced zstd compression. Once a model is loaded into the disk cache and then into service or GPU memory, read speed no longer matters.

llamaserver --host :: -m /mnt/backup/llama/codellama-34b.gguf

This provides a convenient web interface for remote connections to quickly test-drive CodeLlama. Note that llama.cpp is not the only runtime for these models; an entire ecosystem of projects has sprung up offering UIs or REST API services for Llama inference. The simple UI in llama.cpp uses its own API and is completely unauthenticated. Wrapping it in a proxy with TLS, client certificates, or an external authentication method is straightforward, but I expect upstream will add something soon.
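Until then, here is a minimal sketch of putting something in front of the unauthenticated port. The hostnames are placeholders rather than anything from my setup, and the reverse proxy is just one option among many.

# Sketch: never expose the raw port. Either tunnel to it over SSH...
ssh -L 8080:localhost:8080 user@llama-server.example.com
# ...or terminate TLS in front of it with a reverse proxy such as Caddy:
caddy reverse-proxy --from llama.example.com --to localhost:8080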

I asked it to write a bash script or function that tells me the current time in Dubai. This tends to work very well with Google's Bison code model. In this case, my preferred litmus test exposed errors in the converted model: the response degenerated into repetitive nonsense. These bugs have since been fixed in the upstream project.

Well, this isn't what I'd call a production-grade Llama. The garbage output was caused by a model conversion error that landed at the same time as the GGUF format release, and I wasted a day troubleshooting it. There is a parameter that is supposed to penalize this kind of repetition, but it had no effect, leaving the CodeLlama model behaving like the glorified text prediction on a mobile phone keyboard.

First, I needed to re-convert the model from the original GGML format to GGUF using a patched conversion script. Then I adjusted some parameters and prompts to see if it could do better. The default settings in the server console are certainly not what we need; on top of that, this model seems particularly temperature-sensitive, so let's experiment.

Next I lowered the temperature to 0.5, but that was obviously not enough: there was still so much creative freedom that Llama called me "buddykins" for some reason. Lowering the temperature further to 0.1 and borrowing the settings from the public sandbox environment finally gave me the results I wanted to see. It appears the CodeLlama models don't want much temperature at all, which makes sense: code follows a fairly strict syntax that doesn't leave much room for randomness or creativity.
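The same knobs can be set per request through the API rather than the web console. A minimal sketch, with parameter names as exposed by the llama.cpp server's /completion endpoint at the time of writing:

# Sketch: a low-temperature completion request with a repetition penalty.
curl -s http://localhost:8080/completion \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Write a bash function that prints the current time in Dubai.",
         "temperature": 0.1, "repeat_penalty": 1.2, "n_predict": 256}' | jq -r .content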

The resulting output suits my own needs much better. This is exactly the type of output that aish can use to convert simple requests directly and transparently into local actions in the environment. Note that I absolutely do not recommend doing that blindly - run generated commands at your own risk. Sometimes Llama needs a little coaxing.


For an individual user, CPU-only performance isn't actually terrible. Long-running AI inference on a pure CPU system does max out a socket with a single query, though, so thermal issues can become prohibitive at scale.

Here I'm using the default number of threads, which is one thread per physical core. On my dual 14-core hyperthreaded box, that means 28 threads that stay pretty much localized to one socket (NUMA node) until that CPU gets too hot, at which point the OS shifts the work to the other socket. You can see this happening in the monitoring graphs below as the load migrates from one socket to the other.

If I manually set the thread count to 56, the CPU load maxes out but actual performance is slower, even with llama.cpp's NUMA flag. Whenever a thread needs to reach memory on the other NUMA node across the bus, it slows down the entire process. If you plan to use the CPU for some or all of your workload, I recommend a well-cooled single-socket server. Below you can see lengthy inference heating up the GPU, and the work hopping between CPU sockets when the active socket gets too hot.


Thermal NUMA balancing: inference migrates between sockets as they heat up. Temperatures are in degrees Celsius.
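If you do want to experiment with pinning, here is a minimal sketch. The thread counts match my 14-core-per-socket box and are otherwise illustrative, and --numa is the llama.cpp flag mentioned above.

# Sketch: keep inference on one socket and its local memory with numactl,
# using one thread per physical core on that socket (14 in my case).
numactl --cpunodebind=0 --membind=0 \
    llamaserver --host :: -m /mnt/backup/llama/codellama-34b.gguf -t 14
# Alternatively, let llama.cpp attempt its own NUMA optimizations:
llamaserver --host :: -m /mnt/backup/llama/codellama-34b.gguf -t 28 --numa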

4. Add CUDA and OpenCL GPU support

Now that we have build automation and packaging set up for standard CPU llama.cpp, it's time to add GPU support to the packaging for Fedora and its future derivatives (such as RHEL and Amazon Linux).

It turns out that OpenCL is very simple here. All the libraries are purely open source, so they are compatible with the free COPR build system, and all dependencies are in the standard repositories. Adding a few BuildRequires dependencies for the build and Requires dependencies for installation does the trick. Users just need to install the correct OpenCL libraries and drivers for their devices. Note that this also works with Nvidia's implementation, which in my experience is actually one of the most stable OpenCL implementations.
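For reference, a sketch of the runtime pieces on Fedora. The package names are assumptions from the standard repositories, and your device still needs its own OpenCL driver/ICD on top of these.

# Sketch: OpenCL/CLBlast runtime bits from the standard Fedora repos.
sudo dnf install clblast ocl-icd clinfo
clinfo | grep -i 'device name'   # confirm your GPU or CPU is visible to OpenCL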

Packaging a CUDA build is trickier for a number of reasons. The CUDA toolkit isn't ready for the current Fedora 38 and its newer GCC v13, and a dynamic build for Fedora 37 would drag in the older F37 libraries as dependencies. CUDA also isn't easy to compile statically, which would require linking the GPU driver into the binary. Then we face the following challenges:

  • Free public build systems do not allow proprietary licensing.
  • nVidia has not yet caught up with the latest GCC compiler version 13.
  • The nVidia CUDA toolkit repository currently only works on x86_64.

For now it's best to stick with Fedora 37 or build inside a Fedora 37 container. CUDA has often lagged behind stable GCC releases in the past, and that's not entirely nVidia's fault; major transitions between GCC versions cause dependency shifts across many major components.

Fedora usually leads and the rest of the ecosystem takes time to catch up, but Fedora 38's switch to GCC v13 caused trouble for many projects. This kind of struggle is a normal part of the ecosystem's progression and nothing new. Fedora has historically not made it easy to install multiple GCC versions side by side, while Ubuntu and Debian derivatives tend to make that quite simple.

Mixing compiler versions freely isn't recommended, as it can lead to support confusion. Many people end up asking on forums how to build their own copy of an older GCC. Don't bother: it has been compiled thousands of times already and still lives in the older release repositories. You can simply pull gcc and g++ v12 from the Fedora 37 repositories and copy them into a local bin directory. You'll also want the cross-g++ package for your architecture (x86_64 in this case), which fortunately is still available in the Fedora 38 repositories.

# Pull GCC/G++ 12 from the Fedora 37 repos into a throwaway root, then copy
# the binaries onto your PATH alongside the F38 toolchain.
sudo dnf --releasever=37 --installroot=/tmp/f37 install gcc-c++
cp /tmp/f37/usr/bin/{gcc,g++} ~/YOURPATHBIN/   # any directory early in your $PATH
sudo dnf install gcc-c++-x86_64-linux-gnu-12.2.1-5.fc38.x86_64

Now enable the Nvidia CUDA repository and install the correct development packages, including the cuBLAS libraries. The latest release they support is F37, since they aren't quite ready for F38 and GCC 13, but because we've added GCC 12 to the local environment we can use the F37 repository on F38.
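A sketch of that step follows. The repo URL follows nVidia's usual fedora37 naming and the cuda-toolkit package name is an assumption, so check their install guide for the current names.

# Sketch: add nVidia's Fedora 37 CUDA repo on an F38 host and pull the toolkit.
sudo dnf config-manager --add-repo \
    https://developer.download.nvidia.com/compute/cuda/repos/fedora37/x86_64/cuda-fedora37.repo
sudo dnf install cuda-toolkit    # includes the cuBLAS development libraries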

Finally, a cuBLAS-enabled build can be made using CUDA on our system. If your system has GCC < 13, you should be able to simply build your own RPM using the spec I added to the project here:
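A sketch of the RPM route, assuming the llama-cpp.srpm.spec file shown in the earlier file listing and a standard rpmbuild tree:

# Sketch: resolve build dependencies, fetch sources, and build the RPM locally.
sudo dnf install rpm-build rpmdevtools 'dnf-command(builddep)'
rpmdev-setuptree                          # creates ~/rpmbuild/...
sudo dnf builddep llama-cpp.srpm.spec     # pull the BuildRequires
spectool -g -R llama-cpp.srpm.spec        # download Source tarballs into SOURCES
rpmbuild -ba llama-cpp.srpm.spec          # build the binary and source RPMs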

Don't forget that even Nvidia devices can use the OpenCL implementation via CLBlast. Three different packages are now provided, each with a renamed binary. You can install all three and use whichever is appropriate, or just pick the one that suits your hardware. The OpenCL version works with most current CPUs and GPUs, in theory including mobile devices, which is nice, so it's my preferred install.

5. API usage

Remember that the server includes its own simple, unauthenticated API. Testing it with curl and jq is simple enough, which also means I can add Llama support directly to my AI Shell project using libcurl and JsonCpp.

PROMPT="This is a conversation between User and Llama, a friendly codebot. 
    Llama only writes code. Please write a hello world in C++."
curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data "$(jq -n --arg p "$PROMPT" '{prompt: $p, n_predict: 128}')" | jq .content
"\n\n#include <iostream>\n\nint main() {\n  std::cout << \"Hello world!\" << std::endl;\n}\n"

While some bugs have been fixed, the field is really taking off, and I find Llama2 is getting far more attention than earlier Llama versions. It isn't perfect yet, but it's an amazing donation from Meta to the open source community, and I suspect the user base will make it even better. Next, I will add speech recognition support to AI Shell using OpenAI Whisper or Julius.

6. Conclusion

Indeed, the results speak for themselves, and anyone can get started today. You should be able to easily evaluate Llama using the packaging we've put together in this article and some modest hardware. I still caution that you use AI inference results in production at your own risk, but the environment itself should be production-ready for any team that wants to get started.

