3 ways to run Llama 2 locally

We've talked a lot about running and fine-tuning Llama 2 on Replicate. But you can also run Llama natively on an M1/M2 Mac, Windows, Linux, or even a phone. One of the cool things about running Llama 2 locally is that you don't even need an internet connection.

Llama 2 has only been out for a few days, but there are already several ways to run it locally. In this post we'll cover three open source tools you can use to run Llama 2 on your own devices:

  • Llama.cpp (Mac/Windows/Linux)
  • Ollama (Mac)
  • MLC LLM (iOS/Android)

1. Llama.cpp (Mac/Windows/Linux)

Llama.cpp is a port of Llama in C/C++, which makes it possible to run Llama 2 natively on a Mac with 4-bit integer quantization. Llama.cpp also supports Linux/Windows.
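
The one-liner below downloads a pre-quantized GGML file, which is the easiest route. If you have the original Llama 2 weights instead, llama.cpp ships tools to do the 4-bit conversion yourself. A rough sketch, assuming the weights sit in models/llama-2-13b-chat/ and that you've already built the project as described below (script and output filenames can vary between llama.cpp versions):

# Convert the original weights to a GGML fp16 file
python3 convert.py models/llama-2-13b-chat/

# Quantize the fp16 file down to 4-bit (q4_0)
./quantize models/llama-2-13b-chat/ggml-model-f16.bin \
           models/llama-2-13b-chat/ggml-model-q4_0.bin q4_0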

It can be installed on an M1/M2 Mac with the following command:

curl -L "https://replicate.fyi/install-llama-cpp" | bash

Here's what that script does:

#!/bin/bash

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build it. `LLAMA_METAL=1` allows the computation to be executed on the GPU
LLAMA_METAL=1 make

# Download model
export MODEL=llama-2-13b-chat.ggmlv3.q4_0.bin
if [ ! -f models/${MODEL} ]; then
    curl -L "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/${MODEL}" -o models/${MODEL}
fi

# Set prompt
PROMPT="Hello! How are you?"

# Run in interactive mode
./main -m ./models/${MODEL} \
  --color \
  --ctx_size 2048 \
  -n -1 \
  -ins -b 256 \
  --top_k 10000 \
  --temp 0.2 \
  --repeat_penalty 1.1 \
  -t 8
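
Note that the script sets PROMPT but then starts an interactive session, so the variable isn't actually used there. If you'd rather get a single completion and exit, you can pass the prompt directly with -p instead of -ins; a minimal sketch reusing the variables above (the flag values are just illustrative):

# One-shot, non-interactive completion using the PROMPT defined above
./main -m ./models/${MODEL} \
  -p "$PROMPT" \
  -n 256 \
  --temp 0.2 \
  -t 8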

Here's a one-line command for your Intel Mac or Linux machine. Same as above, but we don't include the LLAMA_METAL=1 flag:

curl -L "https://replicate.fyi/install-llama-cpp-cpu" | bash

Here's a one-line command to run on WSL on Windows:

curl -L "https://replicate.fyi/windows-install-llama-cpp" | bash

2. Ollama (Mac)

Ollama is an open source macOS application (for Apple Silicon) that lets you run, create, and share large language models through a command-line interface. Ollama already supports Llama 2.

To use the Ollama CLI, download the macOS app from ollama.ai/download. Once it's installed, you can download Llama 2 without signing up for an account or joining any waitlists. In your terminal, run:

# download the 7B model (3.8 GB)
ollama pull llama2

# or the 13B model (7.3 GB)
ollama pull llama2:13b

Then you can run the model and chat with it:

ollama run llama2
>>> hi
Hello! How can I help you today?
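
The CLI isn't the only way in: Ollama also runs a local HTTP server (on port 11434 by default, with an /api/generate endpoint in current builds; check the docs for your version), so you can call the model from scripts. A hedged sketch:

# Call the locally running Ollama server from a script
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?"
}'

The response streams back as a series of JSON objects rather than a single blob.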

NOTE: Ollama recommends at least 8 GB of RAM to run the 3B model, 16 GB to run the 7B model, and 32 GB to run the 13B model.
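
Ollama can also build customized variants of a model from a Modelfile and share them. The sketch below is modeled on the kind of example in Ollama's docs; the name mario, the temperature, and the system prompt are all made up for illustration:

# Modelfile: a hypothetical customized Llama 2
FROM llama2
PARAMETER temperature 1
SYSTEM """
You are Mario from Super Mario Bros. Answer as Mario, the assistant, only.
"""

Build and run it with:

ollama create mario -f ./Modelfile
ollama run mario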

3. MLC LLM (Llama on mobile)

MLC LLM is an open source project that can run language models natively on a variety of devices and platforms, including iOS and Android.

For iPhone users, there's an MLC chat app on the App Store. MLC now supports the 7B, 13B, and 70B versions of Llama 2, but it's still in beta and not yet in the public App Store build, so you'll need TestFlight installed to try it out. Check out MLC's instructions for installing the beta.

