I. Introduction
To meet the wave of intelligence driven by large language models (LLMs) and AIGC, spacemit has built native AI compute into the RISC-V CPU through AI instruction extensions, creating what it calls the AI CPU. K1, the first chip in this line, was released in April this year. This article takes K1 as an example and, together with llama.cpp, demonstrates the advantages of the AI CPU for large-model workloads.
II. Tool Introduction
(I) llama.cpp
- GitHub Address: https://github.com/ggerganov/llama.cpp
- Functional Characteristics: An open-source, high-performance CPU/GPU inference framework for large language models, suited to consumer-grade and edge devices. Developers can convert and quantize various open-source large language models into GGUF-format files and run them locally through this framework (a conversion sketch follows this list).
- Optimization by spacemit: Building on the inference work contributed by the RISC-V community, spacemit has optimized the large-model operators. With only 4 CPU cores, it achieves 2–3 times the performance of the best 8-core community version, freeing up CPU headroom and making it easier for developers to build out AI applications.
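As an illustration of the conversion path above, here is a minimal sketch using llama.cpp's own tooling; the script and binary names (convert_hf_to_gguf.py, llama-quantize) follow recent llama.cpp releases and may differ in older checkouts, and the Hugging Face model directory is only an example:
# Convert a Hugging Face checkpoint to a 16-bit GGUF file
python convert_hf_to_gguf.py ./Qwen2.5-0.5B-Instruct --outfile qwen2.5-0.5b-f16.gguf
# Quantize to 4 bits, keeping token embeddings at q8_0 (as in the benchmarks below)
./llama-quantize --token-embedding-type q8_0 qwen2.5-0.5b-f16.gguf qwen2.5-0.5b-q4_0.gguf q4_0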
(II) Ollama
- GitHub Address: https://github.com/ollama/ollama
- Functional Characteristics: An open-source large language model service tool that helps users quickly run large models locally. After simple installation, an open-source model such as Llama, Qwen, or Gemma can be run with a single command.
III. Deployment Practice
(I) Tool and Model Preparation
# Install the precompiled ollama and llama.cpp packages on K1
sudo apt update
sudo apt install spacemit-ollama-toolkit
# Start the ollama service
ollama serve
# In another terminal:
# Download the model
wget -P /home/llm/ https://archive.spacemit.com/spacemit-ai/ModelZoo/gguf/qwen2.5-0.5b-q4_0_16_8.gguf
# Import the model, for example, qwen2.5-0.5b
# modelfile address: https://archive.spacemit.com/spacemit-ai/ollama/modelfile/qwen2.5-0.5b.modelfile
wget -P /home/llm/ https://archive.spacemit.com/spacemit-ai/ollama/modelfile/qwen2.5-0.5b.modelfile
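# The modelfile follows Ollama's standard Modelfile format. At minimum it points
# Ollama at the GGUF weights; an illustrative sketch only (the downloaded file may
# also carry TEMPLATE and PARAMETER lines):
#   FROM /home/llm/qwen2.5-0.5b-q4_0_16_8.gguf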
cd /home/llm
ollama create qwen2 -f qwen2.5-0.5b.modelfile
# Run the model
ollama run qwen2
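Besides the interactive prompt, the imported model can also be queried programmatically through Ollama's HTTP API, which ollama serve exposes on port 11434 by default. A minimal sketch (the prompt text is arbitrary):
# Send a one-shot, non-streaming generation request to the local Ollama service
curl http://localhost:11434/api/generate -d '{"model": "qwen2", "prompt": "Why is the sky blue?", "stream": false}'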
(II) Ollama Results
1. Performance
Representative edge-side large language models in the 0.5B–4B range were selected to demonstrate the acceleration from K1's AI extension instructions. The comparison covers the master branch of llama.cpp (the official version) and the RISC-V community's optimized fork (the RISC-V community version, GitHub: https://github.com/xctan/llama.cpp/tree/rvv_q4_0_8x8).
Throughput is in tokens/s; "8T"/"4T" gives the thread count.

| Model | Official (8T) prefill@64t | Official (8T) decode@64t | RISC-V community (8T) prefill@64t | RISC-V community (8T) decode@64t | spacemit (4T) prefill@64t | spacemit (4T) decode@64t |
|---|---|---|---|---|---|---|
| qwen2.5-0.5b | 13.7 | 7.7 | 29.4 | 10.9 | 105.4 | 15.2 |
| qwen2.5-1.5b | 3.9 | 2.8 | 9.9 | 4.4 | 32.7 | 5.5 |
| qwen2.5-3b | 1.8 | 1.4 | 4.8 | 2.2 | 15.4 | 3.0 |
| llama3.2-1b | 5.3 | 3.6 | 13.4 | 5.3 | 42.4 | 7.31 |
| minicpm3-4b | 1.3 | 1.0 | 2.9 | 1.6 | 10.3 | 1.9 |
All models use 4-bit quantization, and the official and RISC-V community versions each use the quantization layout that gives them their best acceleration. During quantization, token-embedding-type is set to q8_0.
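Throughput figures of this kind can be collected with llama.cpp's bundled llama-bench tool; a minimal sketch, assuming a recent llama.cpp build (binary name and flags vary across releases) and the model file downloaded earlier:
# prefill@64t corresponds to the 64-token prompt-processing test (-p 64);
# decode corresponds to the 64-token generation test (-n 64); -t sets the thread count
./llama-bench -m /home/llm/qwen2.5-0.5b-q4_0_16_8.gguf -p 64 -n 64 -t 4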
2. CPU Occupancy
- CPU occupancy of the spacemit version of llama.cpp: (figure)
- CPU occupancy of the RISC-V community version of llama.cpp: (figure)
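Per-core load like the above can be observed with any system monitor while a generation is running; a minimal sketch, assuming htop is available on the device:
# Terminal 1: load the cores with a generation request
ollama run qwen2 "Introduce the RISC-V instruction set."
# Terminal 2: watch per-core CPU occupancy
htop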
IV. Conclusion
spacemit has made notable progress in deploying large models on the K1 platform, combining strong performance with a high degree of openness that lets developers build on community resources. We look forward to more innovative large language model applications on the K1 platform, and spacemit will continue to invest in this work. The pre-released software package will be open-sourced at the end of the year so that developers can study and explore it.