Fast, portable Llama2 inference on the heterogeneous edge

The Rust + Wasm technology stack can be a powerful Python alternative for AI inference.

Compared with Python, Rust + Wasm applications can be 1/100 the size, 100 times faster, and, best of all, run securely anywhere with full hardware acceleration and without any changes to the binary code. Rust is the language of AGI.

We created a very simple Rust program (40 lines of code) that performs inference on a llama2 model at native speed. When compiled to Wasm, the binary application (only 2MB) is fully portable across devices with heterogeneous hardware accelerators. The Wasm runtime (WasmEdge) also provides a safe and reliable execution environment for the cloud. In fact, WasmEdge works seamlessly with container tools to orchestrate and run portable applications across many different devices.

[Click here to view the demo video: how to chat with the llama2 model on a MacBook](https://www.bilibili.com/video/BV1BF411m73B/?share_source=copy_web&vd_source=71527f8e252ebbaa141625dd2b623396)

This work is based on the llama.cpp project created by Georgi Gerganov. We adapted the original C++ program to run on Wasm. It works with model files in GGUF format.

Run the Llama 2 model on your computer

Step 1. Install WasmEdge and GGML plug-in

Use the following command on a Linux or Mac (M1/M2) computer to install everything. See here for more details.

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugin wasi_nn-ggml
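
To verify the installation, you can check the runtime version (you may need to open a new terminal or source the environment file the installer prints so that wasmedge is on your PATH):

wasmedge --version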

Step 2. Download the pre-built Wasm application and the model

curl -LO https://github.com/second-state/WasmEdge-WASINN-examples/raw/master/wasmedge-ggml-llama-interactive/wasmedge-ggml-llama-interactive.wasm

You will also need to download the llama2 model in GGUF format. The example below downloads a llama2 7B model quantized to 5-bit weights and specifically optimized for conversation (see more here).

curl -LO https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf

Step 3. Run it and try it!

Use WasmEdge to run the Wasm inference application and pass it the GGUF model. You can now type questions to chat with the model.

wasmedge --dir .:. --nn-preload default:GGML:CPU:llama-2-7b-chat.Q5_K_M.gguf wasmedge-ggml-llama-interactive.wasm default

Question:

Here is a complete conversation example.

wasmedge --dir .:. --nn-preload default:GGML:CPU:llama-2-7b-chat.Q5_K_M.gguf wasmedge-ggml-llama-interactive.wasm default

Question:
Who is the "father of the atomic bomb"?
Answer:
The "father of the atomic bomb" is a term commonly associated with physicist J. Robert Oppenheimer. Oppenheimer was a leading figure in the development of the atomic bomb during World War II, serving as the director of the Manhattan Project, which was responsible for the development and deployment of the first nuclear weapons. He is often referred to as the "father of the atomic bomb" due to his significant contributions to the field of nuclear physics and his leadership in the development of the bomb.
Question:
Was he a communist?
Answer:
J. Robert Oppenheimer was not a communist. While he was associated with the Manhattan Project, which was a secret government project, and was involved in the development of the atomic bomb, he was not a member of the Communist Party or had any known political affiliations with communism. Oppenheimer was a physicist and a scientist, and his primary focus was on the development of nuclear weapons during World War II.

Optional: configure the model

You can use environment variables to configure model execution.

For example, you can set the context length to 4k tokens (the llama2 standard) and cap each response at 1k tokens. The command below tells WasmEdge to print the model's internal logs and statistics at runtime.

LLAMA_LOG=1 wasmedge --dir .:. --nn-preload default:GGML:CPU:llama-2-7b-chat.Q5_K_M.gguf wasmedge-ggml-llama-interactive.wasm default

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from llama-2-7b-chat.Q5_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q5_K     [  4096, 32000,     1,     1 ]
... ...
llm_load_tensors: mem required  = 4560.96 MB (+  256.00 MB per state)
...................................................................................................
Question:
Who is the "father of the atomic bomb"?
llama_new_context_with_model: kv self size  =  256.00 MB
... ...
llama_print_timings:      sample time =     3.35 ms /   104 runs   (    0.03 ms per token, 31054.05 tokens per second)
llama_print_timings: prompt eval time =  4593.10 ms /    54 tokens (   85.06 ms per token,    11.76 tokens per second)
llama_print_timings:        eval time =  3710.33 ms /   103 runs   (   36.02 ms per token,    27.76 tokens per second)
Answer:
The "father of the atomic bomb" is a term commonly associated with physicist J. Robert Oppenheimer. Oppenheimer was a leading figure in the development of the atomic bomb during World War II, serving as the director of the Manhattan Project, which was responsible for the development and deployment of the first nuclear weapons. He is often referred to as the "father of the atomic bomb" due to his significant contributions to the field of nuclear physics and his leadership in the development of the bomb.

llama on the edge. Image generated by Midjourney.

Why not choose Python?

Large language models such as llama2 are typically trained in Python frameworks like PyTorch, TensorFlow, and JAX. But using Python for inference applications (estimated to account for about 95% of AI computation) would be a serious mistake.

  • Python packages have complex dependencies. They are difficult to build and use.
  • Python itself is a heavyweight dependency. Docker images for Python or PyTorch are typically several GB, or even tens of GB, in size. This is especially problematic for AI inference on edge servers or devices.
  • Python is a very slow language, up to 35,000 times slower than compiled languages such as C, C++, and Rust.
  • Because Python is slow, most real-world workloads are delegated to native shared libraries beneath a Python wrapper. This makes Python inference applications great for demos, but hard to modify under the hood to meet business-specific needs.
  • The unwieldy reliance on native libraries and complex dependency management make it difficult to port Python AI programs across devices while taking advantage of each device's unique hardware capabilities.

Commonly used Python packages in LLM toolchains directly conflict with each other.

Chris Lattner, known for LLVM, TensorFlow, and the Swift language, gave a great interview on a recent episode of an entrepreneurship podcast. He discusses why Python is great for model training but the wrong choice for inference applications.

Advantages of Rust+Wasm

The Rust + Wasm stack provides a unified cloud computing infrastructure spanning devices, edge clouds, on-premises servers, and public clouds. It is a powerful alternative to the Python stack for AI inference applications. No wonder Elon Musk said that Rust is the language of AGI.

  • **Super lightweight.** The inference application, with all dependencies, is only 2MB, less than 1% of the size of a typical PyTorch container.
  • **Very fast.** Native C/Rust speed throughout the inference application: pre-processing, tensor computation, and post-processing.
  • **Portable.** The same Wasm bytecode application runs on all major computing platforms and supports heterogeneous hardware acceleration.
  • **Easy to set up, develop, and deploy.** No more complex dependencies. Build a single Wasm file with standard tools on your laptop and deploy it anywhere!
  • **Secure and cloud-ready.** The Wasm runtime is designed to isolate untrusted user code. It can be managed by container tools and easily deployed on cloud-native platforms.

The Rust inference program

Our demo inference program is written in Rust and compiled to Wasm. The Rust source code is very simple, only about 40 lines. The program manages user input, tracks the conversation history, wraps the text in the llama2 chat template, and runs inference through the WASI NN standard API.


use std::env;
use std::io::{self, BufRead};

fn main() {
    // The model alias registered via `--nn-preload` is passed as the first CLI argument.
    let args: Vec<String> = env::args().collect();
    let model_name: &str = &args[1];

    // Load the GGUF model through the WASI NN API and let the GGML backend
    // choose the best available execution target.
    let graph =
        wasi_nn::GraphBuilder::new(wasi_nn::GraphEncoding::Ggml, wasi_nn::ExecutionTarget::AUTO)
            .build_from_cache(model_name)
            .unwrap();
    let mut context = graph.init_execution_context().unwrap();

    let system_prompt = String::from("<<SYS>>You are a helpful, respectful and honest assistant. Always answer as short as possible, while being safe. <</SYS>>");
    let mut saved_prompt = String::new();

    loop {
        println!("Question:");
        let input = read_input();

        // Wrap the user input in the llama2 chat template, keeping the whole
        // conversation history in `saved_prompt`.
        if saved_prompt == "" {
            saved_prompt = format!("[INST] {} {} [/INST]", system_prompt, input.trim());
        } else {
            saved_prompt = format!("{} [INST] {} [/INST]", saved_prompt, input.trim());
        }

        // Set the prompt as the input tensor.
        let tensor_data = saved_prompt.as_bytes().to_vec();
        context
            .set_input(0, wasi_nn::TensorType::U8, &[1], &tensor_data)
            .unwrap();

        // Execute the inference.
        context.compute().unwrap();

        // Retrieve the output.
        let mut output_buffer = vec![0u8; 1000];
        let output_size = context.get_output(0, &mut output_buffer).unwrap();
        let output = String::from_utf8_lossy(&output_buffer[..output_size]).to_string();
        println!("Answer:\n{}", output.trim());

        // Append the answer to the conversation history so the next turn has full context.
        saved_prompt = format!("{} {} ", saved_prompt, output.trim());
    }
}

// Simple helper: read one line of user input from stdin.
fn read_input() -> String {
    let mut line = String::new();
    io::stdin().lock().read_line(&mut line).unwrap();
    line
}
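
For reference, here is the shape of the prompt string that the loop accumulates over a conversation. This is a sketch derived from the format! calls above; the placeholders in braces stand in for the actual text.

[INST] <<SYS>> {system prompt} <</SYS>> {question 1} [/INST] {answer 1} [INST] {question 2} [/INST] {answer 2} [INST] {question 3} [/INST] ...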

To build the application yourself, just install the Rust compiler and add the wasm32-wasi compiler target.

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup target add wasm32-wasi

Then, check out the source project and use the cargo command to build the Wasm file from the Rust source.

# Clone the source code
git clone https://github.com/second-state/WasmEdge-WASINN-examples/
cd WasmEdge-WASINN-examples/wasmedge-ggml-llama-interactive/

# Build the Rust program
cargo build --target wasm32-wasi --release

# The resulting Wasm file
cp target/wasm32-wasi/release/wasmedge-ggml-llama-interactive.wasm .

Run in the cloud or on the edge

Once you have the Wasm bytecode file, you can deploy it on any device that supports the WasmEdge Runtime. Just install WasmEdge with the GGML plug-in on the device. We currently provide GGML plug-ins for generic Linux, Ubuntu Linux, and Mac M1/M2.

Based on llama.cpp, the WasmEdge GGML plug-in automatically takes advantage of any hardware acceleration on the device to run your llama2 models. For example, the macOS build of the GGML plug-in uses the Metal API to run inference workloads on the M1/M2's built-in neural processing engine. The Linux CPU build uses the OpenBLAS library to automatically detect and exploit advanced computation features, such as AVX and SIMD, on modern CPUs.

This is how we achieve portability across heterogeneous AI hardware and platforms without sacrificing performance.
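
Concretely, deploying to a new device means copying over the 2MB .wasm file and the GGUF model, then repeating the install and run commands from Steps 1 and 3; nothing needs to be recompiled:

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugin wasi_nn-ggml

wasmedge --dir .:. --nn-preload default:GGML:CPU:llama-2-7b-chat.Q5_K_M.gguf wasmedge-ggml-llama-interactive.wasm default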

What's next

While the WasmEdge GGML tooling is already usable today (and indeed used by our cloud-native customers), it is still in its early stages. If you are interested in contributing to open-source projects and shaping the future direction of LLM inference infrastructure, here is some low-hanging fruit you may want to contribute to!

  • Add GGML plug-ins for more hardware and OS platforms. Nvidia CUDA is obviously an important target, and we will get there soon. But we are also interested in TPUs, ARM NPUs, and other specialized AI chips on Linux and Windows.
  • Support more llama.cpp configurations. We currently support passing a few configuration options from Wasm to the GGML plug-in, but we would like to support all the options GGML offers!
  • Support the WASI NN API in other Wasm-compatible languages. We are particularly interested in Go, Zig, Kotlin, JavaScript, C, and C++.
  • Support streaming text output from the model. The current WASI NN API returns the entire inference result at once; we want an alternative that returns the response piece by piece for a typewriter-like experience (a minimal client-side sketch follows this list).
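
As a stopgap until the API itself supports streaming, a purely client-side sketch like the one below (using only the Rust standard library) can replay a finished answer character by character to mimic the typewriter feel. It does not change when tokens are generated; it only changes how they are displayed.

use std::io::{self, Write};
use std::thread;
use std::time::Duration;

// Client-side simulation only: the WASI NN call still returns the whole
// answer at once; this just prints it gradually to feel like streaming.
fn print_like_typewriter(answer: &str) {
    let mut stdout = io::stdout();
    for ch in answer.chars() {
        print!("{}", ch);
        let _ = stdout.flush();                   // show each character immediately
        thread::sleep(Duration::from_millis(15)); // small delay between characters
    }
    println!();
}

fn main() {
    print_like_typewriter("The \"father of the atomic bomb\" is J. Robert Oppenheimer.");
}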

Other AI models

As a lightweight, fast, portable, and secure alternative to Python, WasmEdge and WASI NN can also be used to build inference applications around other popular AI models beyond large language models.

Lightweight AI inference on the edge has just begun!
