prima.cpp is a distributed implementation of llama.cpp.
First, prepare the cluster, then download the model, which takes some time (this walkthrough uses qwen2.5-7b-instruct-q8_0.gguf as an example). The currently supported model list covers only three quantization formats: Q4_K_M, Q6_K, and Q8_0. Downloading Q4_K_M models is not recommended for now: when loading a Q4_K_M model, the framework finds no operator registered for the tensor's ftype enumeration and trips the assertion "ggml_blck_size(type) == 0", resulting in a core dump.
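The crash can be understood as a failed table lookup: ggml keeps a per-type table of block sizes, and a quantization type with no registered entry yields a block size of 0, which trips the assertion. A minimal C++ sketch of that pattern (illustrative names and a deliberately incomplete table, not prima.cpp's actual code):

```cpp
#include <cassert>
#include <map>

// Illustrative subset of ggml's type enum (values match upstream ggml).
enum class ggml_type { Q8_0 = 8, Q4_K = 12, Q6_K = 14 };

// Hypothetical block-size lookup mirroring ggml_blck_size(): a type
// with no table entry returns 0, which is what the assertion catches.
int blck_size(ggml_type t) {
    static const std::map<ggml_type, int> table = {
        {ggml_type::Q8_0, 32},
        {ggml_type::Q6_K, 256},
        // Q4_K intentionally absent here to mimic the missing operator.
    };
    auto it = table.find(t);
    return it == table.end() ? 0 : it->second;
}
```

With this table, a Q8_0 tensor resolves to a block size of 32, while the unregistered Q4_K type resolves to 0 and would fail a `ggml_blck_size(type) != 0` style check.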
Use the fping and arp tools to find the cluster IP addresses
## Use fping to scan LAN clients
fping -a -g 192.168.2.1 192.168.2.255 -q
## Then use arp to view MAC addresses corresponding to IPs
arp
Based on MAC addresses, we can find the addresses:
- 192.168.2.113
- 192.168.2.116
- 192.168.2.117
- 192.168.2.120
We’ll use 120 as the master node and the rest as worker nodes
Environment Installation
Taking node 120 as an example:
sudo apt update -y && sudo apt install -y gcc-9 make cmake fio git wget libzmq3-dev
Download Model File
wget https://www.modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/master/qwen2.5-7b-instruct-q8_0.gguf
Install fio
prima.cpp introduces the Halda algorithm, which profiles compute capability, GPU performance, network bandwidth, and disk throughput before scheduling. The stock fio package on the k1 development board is built without libaio support, so we need to clone the fio repository and rebuild it:
sudo apt-get install build-essential pkg-config
git clone https://github.com/axboe/fio.git
cd fio
./configure --prefix=/usr
make -j$(nproc)
sudo make install
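After the rebuilt fio is installed, a quick job using the libaio engine confirms that the engine is now available (a hypothetical job file for illustration; adjust `directory` and `size` to suit your board):

```ini
; libaio-check.fio -- minimal smoke test for the libaio engine
[libaio-check]
ioengine=libaio
rw=randread
bs=4k
size=128m
directory=/tmp
direct=1
```

Run it with `fio libaio-check.fio`; if libaio were still missing, fio would fail to load the engine instead of running the job.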
Install HiGHS
HiGHS is a high-performance linear optimization solver; prima.cpp uses it to plan how many layers each node gets, which model layers go on the GPU, which layers are offloaded to CPU, and so on. Clone HiGHS:
git clone https://github.com/ERGO-Code/HiGHS.git
cd HiGHS
mkdir build && cd build
cmake ..
make -j$(nproc)
sudo make install
Clone prima.cpp Repository
git clone https://github.com/Lizonghang/prima.cpp.git
cd prima.cpp
make USE_HIGHS=1 -j$(nproc)
Repeat the same environment setup steps on the other three worker nodes.
Deploy Large Model
Run the following commands on the four hosts, with 120 as the master node (rank 0) and the others as workers:
##192.168.2.120: rank0
./llama-cli -m ~/prima/qwen2.5-7b-instruct-q8_0.gguf -c 1024 --world 4 -p "what is edge AI?" --rank 0 --master 192.168.2.120 --next 192.168.2.113
##192.168.2.113: rank1
./llama-cli -m ~/prima/qwen2.5-7b-instruct-q8_0.gguf -c 1024 --world 4 --rank 1 --prefetch --master 192.168.2.120 --next 192.168.2.116
##192.168.2.116: rank2
./llama-cli -m ~/prima/qwen2.5-7b-instruct-q8_0.gguf -c 1024 --world 4 --rank 2 --prefetch --master 192.168.2.120 --next 192.168.2.117
##192.168.2.117: rank3
./llama-cli -m ~/prima/qwen2.5-7b-instruct-q8_0.gguf -c 1024 --world 4 --rank 3 --prefetch --master 192.168.2.120 --next 192.168.2.120
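The --next flags above wire the four hosts into a ring: each rank forwards to the next rank, and the last worker points back at the master. A small sketch of that next-hop rule (a hypothetical helper for illustration, not prima.cpp's API), using the cluster addresses from this post:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Ring next-hop: rank r forwards to rank (r + 1) % world_size,
// so the last worker's --next wraps around to the master.
std::string next_host(int rank, const std::vector<std::string>& hosts) {
    return hosts[(rank + 1) % hosts.size()];
}
```

For the four nodes above, rank 0 forwards to 192.168.2.113 and rank 3 wraps back to the master at 192.168.2.120, matching the --next arguments in each command.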
The result of a successful run is shown in the screenshot:
Thanks to 麻瓜 for providing ideas and answering questions.