[Project Log] Deploying a large language model with llama.cpp under QEMU-riscv64 using the Vector extension instructions


Overview

This post runs a large language model built with llama.cpp on the qemu-riscv64 platform, using the RISC-V Vector (RVV) extension instructions for acceleration.
Reference blog posts:
Accelerating llama.cpp with RISC-V Vector Extension
Demo of RVV-accelerated llama.cpp on the Banana Pi F3 RISC-V development board

llamacpp_6">llama.cpp工程

llama.cpp is a high-performance large-model inference framework written in C/C++ that aims to be fast, stable, and easy to use. It supports a range of computation primitives, including vector operations, matrix operations, and compute-graph algorithms, and is applicable to machine learning, image processing, data analysis, and similar workloads.

Directory structure

src: base library sources for building the model architectures
examples: sources of the example programs and models
ggml: the compute-operation library

Source code analysis

The following reference is useful (note: some function names have since changed):
CodeLeaner @ WeChat public account: llama.cpp source code analysis

llama_19">llama模型
```
// llama  @examples/main/main.cpp
main()
    gpt_params_parse(argc, argv, params, LLAMA_EXAMPLE_MAIN, print_usage)    // parse the parameters passed on the command line
    llama_init_from_gpt_params()
        llama_load_model_from_file(params.model.c_str(), mparams);           // load the model parameters
        llama_new_context_with_model(model, cparams);                        // ggml_backend_cpu_init(); *cpu_backend points at cpu_backend_i
    llama_tokenize(ctx, prompt, true, true)                                  // tokenize the prompt
    while                                                                    // loop producing tokens
        llama_decode(ctx, llama_batch_get_one(&embd[i], n_eval, n_past, 0))  // token-generation function
        llama_token_to_piece(ctx, id, params.special)
    gpt_perf_print(ctx, smpl);                                               // print the performance results
```
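To make the call chain concrete, here is a minimal, hypothetical driver in the same spirit, written against the C API as of this build (b3733, September 2024). Names and signatures drift between llama.cpp versions, and the sampling below is plain greedy argmax rather than main.cpp's sampler chain, so treat it as a sketch:

```cpp
#include "llama.h"
#include <cstdio>
#include <vector>

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 64;  // a small context keeps the KV-cache allocation tiny (see the 10/03 entry below)
    llama_context * ctx = llama_new_context_with_model(model, cparams);

    // tokenize the prompt (add BOS, parse special tokens)
    std::vector<llama_token> embd(64);
    int n = llama_tokenize(model, "Anything", 8, embd.data(), (int) embd.size(), true, true);
    embd.resize(n);

    // evaluate the prompt, then generate 9 tokens greedily
    int n_past = 0;
    llama_decode(ctx, llama_batch_get_one(embd.data(), n, n_past, 0));
    n_past += n;

    for (int i = 0; i < 9; i++) {
        const float * logits = llama_get_logits_ith(ctx, -1);
        llama_token best = 0;
        for (llama_token t = 1; t < llama_n_vocab(model); t++) {
            if (logits[t] > logits[best]) best = t;
        }
        char buf[128];
        int len = llama_token_to_piece(model, best, buf, sizeof(buf), 0, true);
        printf("%.*s", len, buf);
        fflush(stdout);

        llama_decode(ctx, llama_batch_get_one(&best, 1, n_past, 0));
        n_past += 1;
    }

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```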
llamacpp_35">llama.cpp模型库文件

```
// body of the decode function  @src/llama.cpp
int32_t llama_decode(struct llama_context * ctx, struct llama_batch batch)
    llama_decode_internal(*ctx, batch)
        while (lctx.sbatch.n_tokens > 0)
            llama_build_graph(lctx, ubatch, false);
            llama_graph_compute(lctx, gf, n_threads, threadpool);
                ggml_backend_cpu_set_n_threads(lctx.backend_cpu, n_threads);
                ggml_backend_cpu_set_threadpool(lctx.backend_cpu, threadpool);
                ggml_backend_cpu_set_abort_callback(lctx.backend_cpu, lctx.abort_callback, lctx.abort_callback_data);
                ggml_backend_sched_graph_compute_async(lctx.sched, gf);
```
ggml compute library functions
```
// backend graph-execution function  @ggml/src/ggml-backend.c
enum ggml_status ggml_backend_sched_graph_compute_async(ggml_backend_sched_t sched, struct ggml_cgraph * graph)
    ggml_backend_sched_alloc_graph(sched, graph)
        ggml_backend_sched_split_graph(sched, graph);    // splits the graph and hands the splits to cpu_backend, which in turn calls ggml_graph_compute

// main dispatch function that selects among the ops of a ggml compute graph  @ggml/src/ggml.c
enum ggml_status ggml_graph_compute(struct ggml_cgraph * cgraph, struct ggml_cplan * cplan)
    ggml_graph_compute_thread(&threadpool->workers[0])
        ggml_compute_forward(&params, node);
            ggml_compute_forward_dup(params, tensor);
            ggml_compute_forward_add1(params, tensor);
            ggml_compute_forward_repeat(params, tensor);
            ggml_compute_forward_mul_mat(params, tensor);
            ggml_compute_forward_soft_max(params, tensor);
            ggml_compute_forward_rms_norm(params, tensor);
```
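As a self-contained picture of what ggml_graph_compute actually executes, the following sketch builds a tiny graph with the public ggml API and runs it on the CPU. It assumes the ggml headers bundled with this llama.cpp tree; it is an illustration, not code from the repo:

```cpp
#include "ggml.h"
#include <cstdio>

int main() {
    // a context with a fixed arena; tensors and the graph are allocated inside it
    struct ggml_init_params ip = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * c = ggml_init(ip);

    // y = A * x; mul_mat is the op llama.cpp spends nearly all of its time in
    struct ggml_tensor * A = ggml_new_tensor_2d(c, GGML_TYPE_F32, 4, 4);
    struct ggml_tensor * x = ggml_new_tensor_1d(c, GGML_TYPE_F32, 4);
    ggml_set_f32(A, 1.0f);
    ggml_set_f32(x, 2.0f);
    struct ggml_tensor * y = ggml_mul_mat(c, A, x);

    struct ggml_cgraph * gf = ggml_new_graph(c);
    ggml_build_forward_expand(gf, y);

    // walks the graph and calls ggml_compute_forward on every node
    ggml_graph_compute_with_ctx(c, gf, /*n_threads =*/ 1);

    printf("y[0] = %f\n", ggml_get_f32_1d(y, 0));  // 4 * (1.0 * 2.0) = 8.0
    ggml_free(c);
    return 0;
}
```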

Structs

```
// CPU execution interface struct: the functions below are its members  @ggml/src/ggml-backend.c
cpu_backend_i
    ggml_backend_cpu_graph_plan_create()
        ggml_graph_plan()
    ggml_backend_cpu_graph_plan_compute()
        ggml_graph_compute()
    ggml_backend_cpu_graph_compute()
        ggml_graph_plan()
        ggml_graph_compute()
```
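Driving a graph through this interface instead of ggml_graph_compute_with_ctx exercises exactly these members. A short sketch, reusing gf from the ggml example above (names as in this tree's ggml-backend.h; a sketch, not repo code):

```cpp
ggml_backend_t backend = ggml_backend_cpu_init();  // installs cpu_backend_i
ggml_backend_cpu_set_n_threads(backend, 4);

// dispatches through .graph_compute -> ggml_backend_cpu_graph_compute,
// which builds a plan (ggml_graph_plan) and runs it (ggml_graph_compute)
enum ggml_status st = ggml_backend_graph_compute(backend, gf);

ggml_backend_free(backend);
```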

llamacppQn_0Qn_1_88">llama.cpp中量化方式(Qn_0、Qn_1等)含义

sgsprog @ hackmd.io: Linux kernel course project: llama.cpp performance analysis
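For orientation, the two simplest block formats can be summarized as follows. This is a paraphrase of the layouts in ggml's headers (field names from memory; worth checking against ggml-common.h in the tree):

```cpp
#include <cstdint>

typedef uint16_t ggml_half;   // f16 storage type, as in ggml-common.h

#define QK8_0 32
typedef struct {
    ggml_half d;              // per-block scale (f16)
    int8_t    qs[QK8_0];      // 32 quantized weights; x[i] ≈ d * qs[i]
} block_q8_0;

#define QK4_0 32
typedef struct {
    ggml_half d;              // per-block scale (f16)
    uint8_t   qs[QK4_0 / 2];  // two 4-bit weights per byte; x[i] ≈ d * (q[i] - 8)
} block_q4_0;
```

The Qn_1 variants additionally store a per-block minimum m, giving x[i] ≈ d * q[i] + m; the Qn_K ("K-quant") formats use 256-weight super-blocks with per-sub-block scales.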

Analysis of the RVV porting code

Related GitHub links:
Use of GGML's RVV support in llama.cpp, part 1
Use of GGML's RVV support in llama.cpp, part 2
Tameem-10xE @ llama.cpp GitHub: Added RISC-V Vector Intrinsics Support

The ported code initially lived in ggml.c and was later moved into ggml-quants.c.

Twelve functions were modified:
quantize_row_q8_0
quantize_row_q8_1
ggml_vec_dot_q4_0_q8_0
ggml_vec_dot_q4_1_q8_1
ggml_vec_dot_q5_0_q8_0
ggml_vec_dot_q5_1_q8_1
ggml_vec_dot_q8_0_q8_0
ggml_vec_dot_q2_K_q8_K
ggml_vec_dot_q3_K_q8_K
ggml_vec_dot_q4_K_q8_K
ggml_vec_dot_q5_K_q8_K
ggml_vec_dot_q6_K_q8_K

These functions are registered as members of the type-traits entries for the corresponding quantization formats:

```c
static const ggml_type_traits_t type_traits[GGML_TYPE_COUNT] = {
    [GGML_TYPE_Q8_0] = {
        .type_name                = "q8_0",
        .blck_size                = QK8_0,
        .type_size                = sizeof(block_q8_0),
        .is_quantized             = true,
        .to_float                 = (ggml_to_float_t) dequantize_row_q8_0,
        .from_float               = quantize_row_q8_0,
        .from_float_ref           = (ggml_from_float_t) quantize_row_q8_0_ref,
        .from_float_to_mat        = quantize_mat_q8_0,
        .vec_dot                  = ggml_vec_dot_q8_0_q8_0,
        .vec_dot_type             = GGML_TYPE_Q8_0,
        ...
    },
    ...
}
```

They are invoked from the compute functions of the various ggml operators:

```c
static void ggml_compute_forward_mul_mat_one_chunk(...)
    ggml_vec_dot_t const vec_dot = type_traits[type].vec_dot;
    for (int64_t iir1 = ir1_start; iir1 < ir1_end; iir1 += blck_1) {
        for (int64_t iir0 = ir0_start; iir0 < ir0_end; iir0 += blck_0) {
            for (int64_t ir1 = iir1; ir1 < iir1 + blck_1 && ir1 < ir1_end; ir1 += num_rows_per_vec_dot) {
                ...
                for (int64_t ir0 = iir0; ir0 < iir0 + blck_0 && ir0 < ir0_end; ir0 += num_rows_per_vec_dot) {
                    vec_dot(ne00, &tmp[ir0 - iir0], (num_rows_per_vec_dot > 1 ? 16 : 0),
                            src0_row + ir0 * nb01, (num_rows_per_vec_dot > 1 ? nb01 : 0),
                            src1_col, (num_rows_per_vec_dot > 1 ? src1_col_stride : 0),
                            num_rows_per_vec_dot);
                }
                ...
            }
        }
    }
```
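The RVV port replaces the scalar inner loop of the vec_dot kernels listed above with vector intrinsics. As a rough illustration, here is a per-block q8_0 × q8_0 dot product in the style of ggml's RVV path, not a verbatim copy; the intrinsic spellings follow the __riscv_ intrinsics API of recent GCC, and block_q8_0/QK8_0 are the layouts sketched in the quantization section above:

```cpp
#include <riscv_vector.h>
#include <cstdint>

// GGML_FP16_TO_FP32 stands in for ggml's f16-to-f32 conversion macro.
static float vec_dot_q8_0_block(const block_q8_0 * x, const block_q8_0 * y) {
    size_t vl = __riscv_vsetvl_e8m1(QK8_0);

    vint8m1_t  vx   = __riscv_vle8_v_i8m1(x->qs, vl);      // 32 int8 weights
    vint8m1_t  vy   = __riscv_vle8_v_i8m1(y->qs, vl);
    vint16m2_t prod = __riscv_vwmul_vv_i16m2(vx, vy, vl);  // widening int8*int8 -> int16

    vint32m1_t zero = __riscv_vmv_v_x_i32m1(0, vl);
    vint32m1_t acc  = __riscv_vwredsum_vs_i16m2_i32m1(prod, zero, vl);  // widening sum -> int32
    int32_t    sumi = __riscv_vmv_x_s_i32m1_i32(acc);

    return sumi * GGML_FP16_TO_FP32(x->d) * GGML_FP16_TO_FP32(y->d);
}
```

With the vlen=256 configuration passed to qemu below, e8 at LMUL=1 gives vl = 32 = QK8_0, so each block is consumed in a single vector pass.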

2024/10/02: tools ready, but the qemu run gets killed

Tool versions

Qemu:
(version screenshot)
GCC version:
Github Release

llama.cpp:
pulled from the llama.cpp GitHub on October 2

llama-7b model version:
Huggingface gguf file

Build

llamacpp_164">llama.cpp编译

```
cd llama.cpp
make RISCV_CROSS_COMPILE=1
```

Run command

```
qemu-riscv64 -L /home/kevin/data/projects/tools/riscv64_linux_gcc/sysroot -cpu rv64,v=true,vlen=256,elen=64,vext_spec=v1.0 ./llama-server -m /home/kevin/data/projects/kg_proj/rvv_transformer/codellama-7b.Q4_K_M.gguf -p "Anything" -n 9
```

Problem

Observed behavior

(screenshots: the run ends with the process being killed)

Possible cause

The available memory may be too small.

2024/10/03: using the 10xE team's latest version fixes the tokenizer problem, but the run is still killed

Latest-version GitHub link

Tameem-10xE/llama.cpp Github

Problem: running the 7B model gets killed

Observed behavior

(screenshots: the run is killed)

Possible cause

It may be related to the swapfile used when running under qemu.
See the issue a team member filed:
Github Issue: qemu-riscv64 unexpectedly reached EOF error

Approach

First try a smaller model; if that does not work, tackle the swapfile problem.

Running a 3B-scale model: "failed to allocate buffer of size"

```
kevin@BRICKHOUSE01:~/data/projects/kg_proj/rvv_transformer/llama.cpp$ qemu-riscv64  -L /home/kevin/data/projects/tools/riscv64_linux_gcc/sysroot  -cpu rv64,v=true,vlen=256,elen=64,vext_spec=v1.0 ./llama-cli -m /home/kevin/data/projects/kg_proj/rvv_transformer/Llama-3.2-3B-Instruct-IQ3_M.gguf -p "Anything" -n 9
Log start
main: build = 3733 (e5701063)
main: built with riscv64-unknown-linux-gnu-gcc () 13.2.0 for riscv64-unknown-linux-gnu
llama_model_loader: loaded meta data with 35 key-value pairs and 255 tensors from /home/kevin/data/projects/kg_proj/rvv_transformer/Llama-3.2-3B-Instruct-IQ3_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                            general.license str              = llama3.2
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 28
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  18:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  19:                          general.file_type u32              = 27
llama_model_loader: - kv  20:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  21:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  30:               general.quantization_version u32              = 2
llama_model_loader: - kv  31:                      quantize.imatrix.file str              = /models_out/Llama-3.2-3B-Instruct-GGU...
llama_model_loader: - kv  32:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  33:             quantize.imatrix.entries_count i32              = 196
llama_model_loader: - kv  34:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:   59 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq3_s:  137 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 3
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = IQ3_S mix - 3.66 bpw
llm_load_print_meta: model params     = 3.21 B
llm_load_print_meta: model size       = 1.48 GiB (3.96 BPW)
llm_load_print_meta: general.name     = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.12 MiB
llm_load_tensors:        CPU buffer size =  1518.09 MiB
.....................................................................
llama_new_context_with_model: n_ctx      = 131072
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 15032385568
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
llama_init_from_gpt_params: error: failed to create context with model '/home/kevin/data/projects/kg_proj/rvv_transformer/Llama-3.2-3B-Instruct-IQ3_M.gguf'
main: error: unable to load model
```

Possible cause

llama.cpp Github issue: Bug: ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 137438953504
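A back-of-the-envelope check points at the f16 KV cache: with no -c given, the failing run above uses the model's full training context (n_ctx = 131072 in the log), and

2 (K and V) × n_layer (28) × n_ctx (131072) × n_embd_k_gqa (1024) × 2 bytes (f16) = 15 032 385 536 bytes ≈ 14 GiB

which matches the 15032385568-byte buffer in the log up to a few bytes of overhead. Shrinking the context shrinks this allocation proportionally, which is what the fix below does.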

Solution

Try adjusting the model's runtime parameters:
llama.cpp GitHub parameter documentation

After lowering the context-size parameter the model runs successfully (the requested -c 50 is padded up to n_ctx = 64, as the log shows):

```
kevin@BRICKHOUSE01:~/data/projects/kg_proj/rvv_transformer/llama.cpp$ qemu-riscv64  -L /home/kevin/data/projects/tools/riscv64_linux_gcc/sysroot  -cpu rv64,v=true,vlen=256,elen=64,vext_spec=v1.0 ./llama-cli -m /home/kevin/data/projects/kg_proj/rvv_transformer/Llama-3.2-3B-Instruct-IQ3_M.gguf -p "Anything" -n 9 -c 50
Log start
main: build = 3733 (e5701063)
main: built with riscv64-unknown-linux-gnu-gcc () 13.2.0 for riscv64-unknown-linux-gnu
llama_model_loader: loaded meta data with 35 key-value pairs and 255 tensors from /home/kevin/data/projects/kg_proj/rvv_transformer/Llama-3.2-3B-Instruct-IQ3_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                            general.license str              = llama3.2
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 28
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  18:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  19:                          general.file_type u32              = 27
llama_model_loader: - kv  20:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  21:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  30:               general.quantization_version u32              = 2
llama_model_loader: - kv  31:                      quantize.imatrix.file str              = /models_out/Llama-3.2-3B-Instruct-GGU...
llama_model_loader: - kv  32:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  33:             quantize.imatrix.entries_count i32              = 196
llama_model_loader: - kv  34:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:   59 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq3_s:  137 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 3
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = IQ3_S mix - 3.66 bpw
llm_load_print_meta: model params     = 3.21 B
llm_load_print_meta: model size       = 1.48 GiB (3.96 BPW)
llm_load_print_meta: general.name     = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.12 MiB
llm_load_tensors:        CPU buffer size =  1518.09 MiB
.....................................................................
llama_new_context_with_model: n_ctx      = 64
llama_new_context_with_model: n_batch    = 64
llama_new_context_with_model: n_ubatch   = 64
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =     7.00 MiB
llama_new_context_with_model: KV self size  =    7.00 MiB, K (f16):    3.50 MiB, V (f16):    3.50 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:        CPU compute buffer size =    32.06 MiB
llama_new_context_with_model: graph nodes  = 902
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling seed: 2622092847
sampling params:
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler constr:
	logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
generate: n_ctx = 64, n_batch = 2048, n_predict = 9, n_keep = 1

Anything else?
Yes. You can also consider the
llama_perf_print:    sampling time =      10.96 ms /    11 runs   (    1.00 ms per token,  1003.37 tokens per second)
llama_perf_print:        load time =   73770.30 ms
llama_perf_print: prompt eval time =   15435.63 ms /     2 tokens ( 7717.81 ms per token,     0.13 tokens per second)
llama_perf_print:        eval time =   74267.24 ms /     8 runs   ( 9283.41 ms per token,     0.11 tokens per second)
llama_perf_print:       total time =   89764.85 ms /    10 tokens
Log end
```

llamacpp_459">2024/10/05:生成非向量支持的riscv版本llama.cpp,进行对比实验

llamacpp_461">llama.cpp编译

```
make CC="riscv64-unknown-linux-gnu-gcc -march=rv64gc -mabi=lp64d" CXX="riscv64-unknown-linux-gnu-g++ -march=rv64gc -mabi=lp64d"
```

Error: riscv64-unknown-linux-gnu-g++: error: '-march=native': ISA string must begin with rv32 or rv64

Solution:
GCC GitHub: riscv64-unknown-linux-gnu-g++: error: '-march=native': ISA string must begin with rv32 or rv64

That is, change lines 523 and 524 of the llama.cpp Makefile, the flag lines that originally enable RVV when cross-compiling, so that they compile for plain rv64gc instead, as sketched below.
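A sketch of the edit, assuming the Makefile layout of this checkout (line numbers and the exact surrounding conditionals vary between versions, so verify against your tree):

```makefile
# before: RVV enabled for cross builds (around lines 523-524)
	MK_CFLAGS   += -march=rv64gcv -mabi=lp64d
	MK_CXXFLAGS += -march=rv64gcv -mabi=lp64d

# after: plain rv64gc (no V extension) for the scalar baseline build
	MK_CFLAGS   += -march=rv64gc -mabi=lp64d
	MK_CXXFLAGS += -march=rv64gc -mabi=lp64d
```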

Build command after the fix

```
make RISCV_CROSS_COMPILE=1
```

Run output

```
kevin@BRICKHOUSE01:~/data/projects/kg_proj/rvv_transformer/llama.cpp.rv64gc$ qemu-riscv64  -L /home/kevin/data/projects/tools/riscv64_linux_gcc/sysroot  -cpu rv64 ./llama-cli -m /home/kevin/data/projects/kg_proj/rvv_transformer/codellama-7b.Q4_K_M.gguf -p "Anything" -n 100 -c 1024
Log start
main: build = 3733 (e5701063)
main: built with riscv64-unknown-linux-gnu-gcc () 13.2.0 for riscv64-unknown-linux-gnu
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /home/kevin/data/projects/kg_proj/rvv_transformer/codellama-7b.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = codellama_codellama-7b-hf
llama_model_loader: - kv   2:                       llama.context_length u32              = 16384
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 15
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32016]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32016]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32016]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1686 MB
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32016
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 16384
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 16384
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.80 GiB (4.84 BPW)
llm_load_print_meta: general.name     = codellama_codellama-7b-hf
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: PRE token        = 32007 '▁<PRE>'
llm_load_print_meta: SUF token        = 32008 '▁<SUF>'
llm_load_print_meta: MID token        = 32009 '▁<MID>'
llm_load_print_meta: EOT token        = 32010 '▁<EOT>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors:        CPU buffer size =  3891.33 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 1024
llama_new_context_with_model: n_batch    = 1024
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   512.00 MiB
llama_new_context_with_model: KV self size  =  512.00 MiB, K (f16):  256.00 MiB, V (f16):  256.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =    98.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling seed: 3630221793
sampling params:
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler constr:
	logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
generate: n_ctx = 1024, n_batch = 2048, n_predict = 100, n_keep = 1

Anything is possible if you believe it. The only thing you are not capable of doing is believing in anything.
- John L. Sullivan- To be honest, I've never had a problem with the fact that I'm a bad person and I'm a horrible human. I've always been okay with that.
- I think it's a mistake to think of yourself as being in the middle of the world, instead of at the
llama_perf_print:    sampling time =      94.53 ms /   104 runs   (    0.91 ms per token,  1100.12 tokens per second)
llama_perf_print:        load time =   44296.57 ms
llama_perf_print: prompt eval time =   79786.77 ms /     4 tokens (19946.69 ms per token,     0.05 tokens per second)
llama_perf_print:        eval time = 2360379.06 ms /    99 runs   (23842.21 ms per token,     0.04 tokens per second)
llama_perf_print:       total time = 2440911.31 ms /   103 tokens
Log end
```

