ggerganov/llama.cpp - Source files needed to build the main executable

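To build the llama.cpp project (the GitHub repository ggerganov/llama.cpp) on Windows, several source files from the repository have to be added to a Visual Studio project. This rough note records which source files the main executable currently depends on at build time, and where they live, so that I can look them up later. Consider it a memo to myself.

As of May 16, 2023, the latest git commit of llama.cpp is 2a5ee023ad3022bc0b505343394b9754587fb731.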

Author: sandyiscool <sandyiscool@gmail.com>
Date:   Tue May 16 14:00:15 2023 +0530

    Add alternate include path for openblas (#1476)

    In some linux distributions (fedora, for example), the include path for openblas is located at '/usr/local/include'

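Summary

The following files are needed:

main.cpp, found at examples/main/main.cpp
build-info.h, which has to be written by hand (see the contents below)
ggml.c, ggml.h and ggml-cuda.h, all three in the repository root
llama.cpp, llama.h and llama-util.h, in the repository root
common.cpp and common.h, both under examples/

Notes:

After copying the files above into the project, the build produces a pile of warnings and errors.
If the program fails at run time, see the cause below.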

Details

main.cpp

Located in examples/main/

build-info.h

This file can contain just about anything; I used the following:

#ifndef BUILD_INFO_H
#define BUILD_INFO_H

#define BUILD_NUMBER 1
#define BUILD_COMMIT "2a5ee02"

#endif // BUILD_INFO_H

Here 2a5ee02 is the first few characters of the commit ID shown by git log.
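As a rough illustration of why both macros need a value, the sketch below formats them into a banner string: BUILD_NUMBER is used as an integer and BUILD_COMMIT as a string literal. The helper build_banner is hypothetical and is not part of llama.cpp; the macros are redefined inline only so the sketch compiles on its own.

```cpp
#include <cstdio>
#include <string>

// Same values as the hand-written build-info.h; guarded so that a real
// build-info.h included earlier would take precedence.
#ifndef BUILD_NUMBER
#define BUILD_NUMBER 1
#endif
#ifndef BUILD_COMMIT
#define BUILD_COMMIT "2a5ee02"
#endif

// Hypothetical helper: formats the two macros into one banner string.
std::string build_banner() {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "build = %d (%s)", BUILD_NUMBER, BUILD_COMMIT);
    return std::string(buf);
}
```

Any integer and any short string will do, as long as the two names are defined before the sources that use them are compiled.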

ggml.c, ggml.h, ggml-cuda.h

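These three files are in the repository root. They are needed because the object file ggml.o is compiled from them, and the build of main depends on ggml.o.

ggml-cuda.h is a special case. ggml.c includes it only under this condition: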

#elif defined(GGML_USE_CUBLAS)
#include "ggml-cuda.h"

So if CUDA acceleration is not used, ggml-cuda.h can be left out.
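The same pattern can be sketched in isolation. The preprocessor skips an #include that sits inside an inactive branch, so the header does not even have to exist when the macro is undefined. DEMO_USE_CUBLAS, demo-cuda.h and backend_name are made-up names for this illustration, standing in for GGML_USE_CUBLAS and ggml-cuda.h.

```cpp
#include <string>

// DEMO_USE_CUBLAS is a hypothetical stand-in for GGML_USE_CUBLAS.
#if defined(DEMO_USE_CUBLAS)
#include "demo-cuda.h" // only looked up when the macro is defined
#endif

// Reports which code path was compiled in.
std::string backend_name() {
#if defined(DEMO_USE_CUBLAS)
    return "cublas";
#else
    return "cpu";
#endif
}
```

Compiled without the macro, this builds even though demo-cuda.h does not exist, which is the reason the CUDA header is optional in a CPU-only Visual Studio project.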

llama.cpp, ggml.h, ggml-cuda.h, llama.h, llama-util.h

llama.o is compiled from these files. As in ggml.c, ggml-cuda.h is only included when CUDA is used.

examples/common.cpp, examples/common.h

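common.o is compiled from these files.

Build errors and warnings

The code is not fully compatible with Visual Studio, so the build fails with the default settings. There are two ways around this:

Use my patched copy of the code: az13js/llama.cpp.
Or, in Visual Studio, right-click the project, go to Configuration Properties -> C/C++ -> General -> SDL checks, and select No.

Run-time failure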

error loading model: this format is no longer supported (see https://github.com/ggerganov/llama.cpp/pull/1305)
llama_init_from_file: c failed to load model
llama_init_from_gpt_params: error: failed to load model 'ggml-model-q4_0.bin'
main: error: unable to load model

The model format is not supported. Either update the model file, or go back to an older version of the code and build that:

git reset --hard b608b55a3ea8e4760c617418538465449175bdb8

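ggerganov/llama.cpp - How the main function executes

The model file I downloaded is an old one, so to save effort I switched the project to commit b608b55a3ea8e4760c617418538465449175bdb8. This part walks through the main function in examples/main/main.cpp, which is the source of the compiled executable main.exe.

1. Loading the model

This code obviously loads the model.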

// load the model and apply lora adapter, if any
ctx = llama_init_from_gpt_params(params);
if (ctx == NULL) {
    fprintf(stderr, "%s: error: unable to load model\n", __func__);
    return 1;
}

The key function llama_init_from_gpt_params is declared in examples/common.h (a C++ header) with this signature:

struct llama_context * llama_init_from_gpt_params(const gpt_params & params);

The struct gpt_params is also declared in examples/common.h. The full declaration is:

struct gpt_params {
    int32_t seed          = -1;   // RNG seed
    int32_t n_threads     = get_num_physical_cores();
    int32_t n_predict     = -1;  // new tokens to predict
    int32_t n_parts       = -1;   // amount of model parts (-1 = determine from model dimensions)
    int32_t n_ctx         = 512;  // context size
    int32_t n_batch       = 512;  // batch size for prompt processing (must be >=32 to use BLAS)
    int32_t n_keep        = 0;    // number of tokens to keep from initial prompt

    // sampling parameters
    std::unordered_map<llama_token, float> logit_bias; // logit bias for specific tokens
    int32_t top_k             = 40;    // <= 0 to use vocab size
    float   top_p             = 0.95f; // 1.0 = disabled
    float   tfs_z             = 1.00f; // 1.0 = disabled
    float   typical_p         = 1.00f; // 1.0 = disabled
    float   temp              = 0.80f; // 1.0 = disabled
    float   repeat_penalty    = 1.10f; // 1.0 = disabled
    int32_t repeat_last_n     = 64;    // last n tokens to penalize (0 = disable penalty, -1 = context size)
    float   frequency_penalty = 0.00f; // 0.0 = disabled
    float   presence_penalty  = 0.00f; // 0.0 = disabled
    int     mirostat          = 0;     // 0 = disabled, 1 = mirostat, 2 = mirostat 2.0
    float   mirostat_tau      = 5.00f; // target entropy
    float   mirostat_eta      = 0.10f; // learning rate

    std::string model  = "models/lamma-7B/ggml-model.bin"; // model path
    std::string prompt = "";
    std::string path_prompt_cache = "";  // path to file for saving/loading prompt eval state
    std::string input_prefix      = "";  // string to prefix user inputs with
    std::string input_suffix      = "";  // string to suffix user inputs with
    std::vector<std::string> antiprompt; // string upon seeing which more user input is prompted

    std::string lora_adapter = "";  // lora adapter path
    std::string lora_base = "";     // base model path for the lora adapter

    bool memory_f16        = true;  // use f16 instead of f32 for memory kv
    bool random_prompt     = false; // do not randomize prompt if none provided
    bool use_color         = false; // use color to distinguish generations and inputs
    bool interactive       = false; // interactive mode
    bool prompt_cache_all  = false; // save user input and generations to prompt cache
    bool embedding         = false; // get only sentence embedding
    bool interactive_first = false; // wait for user input immediately
    bool multiline_input   = false; // reverse the usage of `\`
    bool instruct          = false; // instruction mode (used for Alpaca models)
    bool penalize_nl       = true;  // consider newlines as a repeatable token
    bool perplexity        = false; // compute perplexity over the prompt
    bool use_mmap          = true;  // use mmap for faster loads
    bool use_mlock         = false; // use mlock to keep model in memory
    bool mem_test          = false; // compute maximum memory usage
    bool verbose_prompt    = false; // print prompt tokens before generation
};

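Most of the parameters are easy to understand.

top_k: sample only from the k most likely candidate tokens. If top_k is less than or equal to 0, the vocabulary size is used, i.e. every possible token takes part in sampling.
logit_bias: a logit bias promotes or suppresses specific tokens by adding a bias term to the logit of each affected token. A positive bias raises the probability of generating the token, a negative bias lowers it. This parameter is set when the command-line option --ignore-eos is given.
path_prompt_cache: path of a file that caches the model state after the prompt has been evaluated.
prompt_cache_all: also save the user's input and the generated text to the prompt cache.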
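To make logit_bias and top_k more concrete, here is a minimal sketch on a toy logits vector. It is not the llama.cpp implementation: token_id stands in for llama_token, and apply_logit_bias and top_k_candidates are hypothetical helper names.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_map>
#include <vector>

using token_id = int; // stand-in for llama_token

// Add a per-token bias to the raw logits, which is what logit_bias describes.
void apply_logit_bias(std::vector<float>& logits,
                      const std::unordered_map<token_id, float>& bias) {
    for (const auto& kv : bias) {
        if (kv.first >= 0 && static_cast<std::size_t>(kv.first) < logits.size()) {
            logits[kv.first] += kv.second;
        }
    }
}

// Return the ids of the k highest-scoring tokens, best first.
// k <= 0 (or k larger than the vocabulary) means "use the whole vocabulary".
std::vector<token_id> top_k_candidates(const std::vector<float>& logits, int k) {
    std::vector<token_id> ids(logits.size());
    for (std::size_t i = 0; i < ids.size(); ++i) {
        ids[i] = static_cast<token_id>(i);
    }
    if (k <= 0 || static_cast<std::size_t>(k) > ids.size()) {
        k = static_cast<int>(ids.size());
    }
    std::partial_sort(ids.begin(), ids.begin() + k, ids.end(),
                      [&](token_id a, token_id b) { return logits[a] > logits[b]; });
    ids.resize(k);
    return ids;
}
```

With logits {0.1, 2.0, 1.5, -0.3}, a large negative bias on token 1 removes it from the top two candidates, which become tokens 2 and 0. A sufficiently negative bias on the end-of-sequence token would suppress it in the same way.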

llama_init_from_gpt_params returns NULL on failure, and prints the error message directly to standard error.

Test code:

#include "common.h"
#include <iostream>

int main() {
    gpt_params params;
    params.model = "D:\\my_files\\llama7b\\ggml-model-q4_0.bin";
    auto llama_context = llama_init_from_gpt_params(params);
    if (nullptr == llama_context) {
        std::cout << "INIT FAIL" << std::endl;
        return 1;
    }
    std::cout << "INIT SUCCESS" << std::endl;
    return 0;
}

Output:

llama.cpp: loading model from D:\my_files\llama7b\ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 49954
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  68.20 KB
llama_model_load_internal: mem required  = 5897.00 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB
INIT SUCCESS

When something is wrong with the file, loading fails:

llama.cpp: loading model from D:\my_files\llama7b\tmp.txt
error loading model: unknown (magic, version) combination: 8890e509, e8b6b9e5; is this really a GGML file?
llama_init_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'D:\my_files\llama7b\tmp.txt'
INIT FAIL

Note that this function uses a lot of memory.

2. Releasing the model

Function:

void llama_free(struct llama_context * ctx);

Test code:

#include "common.h"
#include <iostream>

int main() {
    gpt_params params;
    params.model = "D:\\my_files\\llama7b\\ggml-model-q4_0.bin";
    auto llama_context = llama_init_from_gpt_params(params);
    llama_free(llama_context);
    return 0;
}
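Since every successful init has to be paired with a free, one option is to let std::unique_ptr make the call automatically. The sketch below uses stand-in types so it compiles without llama.cpp: demo_context, demo_init and demo_free are hypothetical and play the roles of llama_context, llama_init_from_gpt_params and llama_free.

```cpp
#include <memory>

// Stand-ins so the sketch compiles without llama.cpp.
struct demo_context { int id; };
static int g_live_contexts = 0; // counts contexts that have not been freed yet

demo_context* demo_init() { ++g_live_contexts; return new demo_context{1}; }
void demo_free(demo_context* ctx) { --g_live_contexts; delete ctx; }

// Deleter that forwards to the C-style free function.
struct demo_context_deleter {
    void operator()(demo_context* ctx) const { demo_free(ctx); }
};
using demo_context_ptr = std::unique_ptr<demo_context, demo_context_deleter>;

// Wrap the raw pointer right away; the deleter is not called for a null pointer.
demo_context_ptr make_context() { return demo_context_ptr(demo_init()); }
```

With the real API the same idea would wrap the pointer returned by llama_init_from_gpt_params, so that early returns cannot skip the release.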

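3. Working with tokens

llama_eval, the function that runs the language model, takes a llama_token* as its key argument; this is where the prompt goes. So the first step is to understand how to use this type. The header llama.h declares: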

typedef int llama_token;

So for now llama_token is simply an integer type. The function llama_tokenize converts a string into llama_token values. It is declared as:

std::vector<llama_token> llama_tokenize(struct llama_context * ctx, const std::string & text, bool add_bos);

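The model treats text as a sequence: <BOS> can be seen as its initial state, and <EOS> is usually the marker used to decide when to stop.

The declaration of llama_tokenize is in common.h. For usage, see this code from main.cpp: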

// tokenize the prompt
auto embd_inp = ::llama_tokenize(ctx, params.prompt, true);

For debugging, the following function from llama.h converts a llama_token back into text for output:

// Token Id -> String. Uses the vocabulary in the provided context
LLAMA_API const char * llama_token_to_str(const struct llama_context * ctx, llama_token token);

Test code:

#include "common.h"
#include "llama.h"
#include <iostream>

int main() {
    gpt_params params;
    params.model = "D:\\my_files\\llama7b\\ggml-model-q4_0.bin";
    auto llama_context = llama_init_from_gpt_params(params);
    auto tokens = llama_tokenize(llama_context, "Hello world.", true);
    std::cout << "Tokens:" << std::endl;
    for (const auto& token : tokens) {
        std::cout << "<";
        std::cout << token << ":" << llama_token_to_str(llama_context, token);
        std::cout << ">" << std::endl;
    }
    llama_free(llama_context);
    return 0;
}

Output:

Tokens:
<1:>
<10994:Hello>
<3186: world>
<29889:.>

With the third argument of llama_tokenize changed to false, the output is:

Tokens:
<10994:Hello>
<3186: world>
<29889:.>
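The round trip from text to ids and back, including the effect of add_bos, can be imitated with a toy tokenizer. This is only an illustration: it uses greedy longest-match over a tiny made-up vocabulary, which is not the algorithm llama.cpp uses, and toy_tokenize, toy_detokenize, kVocab and kBos are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

using token_id = int; // stand-in for llama_token

// Toy vocabulary: index = token id. Id 0 plays the role of BOS and has no text.
static const std::vector<std::string> kVocab = {"", "Hello", " world", ".", "H", "e", "l", "o"};
constexpr token_id kBos = 0;

// Greedy longest-match tokenizer over kVocab (NOT the algorithm llama.cpp uses).
std::vector<token_id> toy_tokenize(const std::string& text, bool add_bos) {
    std::vector<token_id> out;
    if (add_bos) {
        out.push_back(kBos);
    }
    std::size_t pos = 0;
    while (pos < text.size()) {
        token_id best = -1;
        std::size_t best_len = 0;
        for (std::size_t id = 1; id < kVocab.size(); ++id) {
            const std::string& piece = kVocab[id];
            if (piece.size() > best_len && text.compare(pos, piece.size(), piece) == 0) {
                best = static_cast<token_id>(id);
                best_len = piece.size();
            }
        }
        if (best < 0) { // character not covered by the toy vocabulary: skip it
            ++pos;
            continue;
        }
        out.push_back(best);
        pos += best_len;
    }
    return out;
}

// Inverse mapping: concatenate the text of each token id.
std::string toy_detokenize(const std::vector<token_id>& tokens) {
    std::string s;
    for (token_id t : tokens) {
        s += kVocab[static_cast<std::size_t>(t)];
    }
    return s;
}
```

As in the real output above, the only difference between add_bos = true and false is one extra leading token whose text is empty.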

4. Running the model

The core function is llama_eval, declared in llama.h.

// Run the llama inference to obtain the logits and probabilities for the next token.
// tokens + n_tokens is the provided batch of new tokens to process
// n_past is the number of tokens to use from previous eval calls
// Returns 0 on success
LLAMA_API int llama_eval(
        struct llama_context * ctx,
           const llama_token * tokens,
                         int   n_tokens,
                         int   n_past,
                         int   n_threads);

Stepping through with breakpoints in the debugger shows that, with interactive command-line input, main enters the following while loop (at line 309):

while (n_remain != 0 || params.interactive) {
    // code omitted here
}

llama_eval is called at line 366:

if (llama_eval(ctx, &embd[i], n_eval, n_past, params.n_threads)) {
    fprintf(stderr, "%s : failed to eval\n", __func__);
    return 1;
}

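Looking at the arguments leads to the following reading of llama_eval:

The first argument is the value returned by llama_init_from_gpt_params.
The second argument is the address of the first element of a vector<llama_token>; think of it as the position of the first llama_token the model should read, equivalent to the head of a llama_token[] array.
The third argument is the length of the second, that is, the number of tokens processed by one call.
The source comment says n_past is the number of tokens to use from previous llama_eval calls. My own understanding: it is the number of leading tokens in the sequence that were already evaluated and do not need to be recomputed, so the model (a transformer) can continue from element n_past+1 using the previous result.
n_threads is the number of threads used for the computation.

llama_eval returns 0 on success.

Because the model can only process a limited number of tokens at once, longer inputs have to be processed in several calls, one batch at a time. n_batch cannot exceed 512. The first token must be BOS; a BOS token can be obtained by calling this function (from llama.h):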

llama_token llama_token_bos();

It returns the token that represents BOS.
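The batching described above can be sketched without the real model: split the token vector into chunks of at most n_batch and advance n_past by the size of each chunk. In this sketch eval_fn and eval_in_batches are hypothetical names, and the callback only mirrors the tokens, n_tokens and n_past parameters of llama_eval.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

using token_id = int; // stand-in for llama_token

// Same shape as llama_eval minus the context and thread count; returns 0 on success.
using eval_fn = std::function<int(const token_id* tokens, int n_tokens, int n_past)>;

// Feed `tokens` to `eval` in chunks of at most n_batch, advancing n_past each time.
// Returns the final n_past, or -1 if n_batch is invalid or any call fails.
int eval_in_batches(const std::vector<token_id>& tokens, int n_batch, const eval_fn& eval) {
    if (n_batch <= 0) {
        return -1;
    }
    int n_past = 0;
    for (std::size_t i = 0; i < tokens.size(); i += static_cast<std::size_t>(n_batch)) {
        const int n_eval = static_cast<int>(
            std::min<std::size_t>(static_cast<std::size_t>(n_batch), tokens.size() - i));
        if (eval(&tokens[i], n_eval, n_past) != 0) {
            return -1;
        }
        n_past += n_eval; // everything before this point is now "in the past"
    }
    return n_past;
}
```

For 10 tokens and n_batch = 4 the callback sees chunk sizes 4, 4, 2 with n_past 0, 4, 8. With the real API, the callback body would be a call to llama_eval with the context and thread count added.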

Test code:

#include "common.h"
#include "llama.h"
#include <vector>
#include <string>
#include <iostream>

void showTokens(const llama_context* llama_context, const std::vector<llama_token>& tokens) {
    using namespace std;
    const auto bos = llama_token_bos();
    string tokenString;
    cout << "Token info:" << endl;
    cout << "Total:" << tokens.size() << endl;
    cout << "int[worlds]:" << endl;
    for (const auto& token : tokens) {
        auto words = llama_token_to_str(llama_context, token);
        if (bos == token) {
            words = "`BOS`";
        }
        tokenString += words;
        cout << (int)token << "[" << words << "]" << endl;
    }
    cout << "String:" << tokenString << endl;
}

int main() {
    using namespace std;
    gpt_params params;
    params.model = "D:\\my_files\\llama7b\\ggml-model-q4_0.bin";
    auto llama_context = llama_init_from_gpt_params(params);
    auto tokens = llama_tokenize(llama_context, "2,3,5,7,11,", true);
    showTokens(llama_context, tokens);
    auto result = llama_eval(llama_context, tokens.data(), tokens.size(), 0, 2);
    if (0 == result) {
        cout << "eval success" << endl;
    } else {
        cout << "fail" << endl;
    }
    llama_free(llama_context);
    return 0;
}

Result:

llama.cpp: loading model from D:\my_files\llama7b\ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 49954
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  68.20 KB
llama_model_load_internal: mem required  = 5897.00 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB
Token info:
Total:11
int[worlds]:
1[`BOS`]
29906[2]
29892[,]
29941[3]
29892[,]
29945[5]
29892[,]
29955[7]
29892[,]
32040[11]
29892[,]
String:`BOS`2,3,5,7,11,
eval success

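5. Getting the model's prediction

A language model predicts the next token of a text. Model initialization, token conversion and model evaluation have been covered above. Line 406 of main.cpp shows how to get the output text, which mainly involves the following function.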

float * llama_get_logits(struct llama_context * ctx);

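This function fetches the last row of the model's output and returns a pointer to a float array. The array is indexed by token value (of type llama_token), i.e. by vocabulary entry, and each element is the model's score for that token being the next one. Loosely speaking it is a "probability"; strictly it is an unnormalized logit. According to the comments in the code, the array may be modified: specific values can be lowered or raised, and later steps that pick the next token then work from the modified values.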
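Because the array holds logits rather than normalized probabilities, a softmax is the usual way to turn it into values that sum to 1. The sketch below works on a plain std::vector<float>; softmax here is a local helper written for this note, not a llama.cpp function.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Convert raw logits into probabilities that sum to 1.
// Subtracting the maximum first keeps std::exp from overflowing.
std::vector<float> softmax(const std::vector<float>& logits) {
    std::vector<float> probs(logits.size());
    if (logits.empty()) {
        return probs;
    }
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - max_logit);
        sum += probs[i];
    }
    for (float& p : probs) {
        p /= sum;
    }
    return probs;
}
```

Softmax preserves the ordering of the scores, so the token with the largest logit is also the one with the largest probability. That is why a greedy pick can compare logits directly.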

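main.cpp first applies penalties to counter the model repeating the same content, then samples from the penalized result. When the temperature parameter temp is less than or equal to 0, it simply picks the token with the highest probability. The test code below takes the most probable token directly. The input is the prime sequence 2,3,5,7,11, so the expected next number is 13.

Test code: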

#include "common.h"
#include "llama.h"
#include <iostream>

llama_token get_max_probability_token(llama_context* context) {
    auto tokens_probability = llama_get_logits(context);
    long tokens_total = llama_n_vocab(context);
    float max_probability = -100.0;
    llama_token token = llama_token_eos();
    for (long i = 0; i < tokens_total; i++) {
        if (tokens_probability[i] > max_probability) {
            max_probability = tokens_probability[i];
            token = (llama_token)i;
        }
    }
    return token;
}

int main() {
    using namespace std;
    gpt_params params;
    params.model = "D:\\my_files\\llama7b\\ggml-model-q4_0.bin";
    auto context = llama_init_from_gpt_params(params);
    auto tokens = llama_tokenize(context, "2,3,5,7,11,", true);
    llama_eval(context, tokens.data(), tokens.size(), 0, 2);
    cout << "next:" << llama_token_to_str(context, get_max_probability_token(context)) << endl;
    llama_free(context);
    return 0;
}

Output:

llama.cpp: loading model from D:\my_files\llama7b\ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 49954
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  68.20 KB
llama_model_load_internal: mem required  = 5897.00 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB
next:13

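next:13 matches the expectation. Note that picking the single most probable token is not necessarily the best choice; in practice there are many ways to choose the next token, for example:

top-k sampling
top-p sampling

On top of that, repetition has to be penalized somehow.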
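As a rough sketch of these ideas on plain vectors: a repetition penalty applied to the logits, a top-p (nucleus) cut on the probabilities, and a random draw from what remains. The helper names are hypothetical, the penalty rule shown (divide positive logits, multiply negative ones) is one common formulation, and none of this is claimed to match main.cpp line for line.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

using token_id = int; // stand-in for llama_token

// Make recently used tokens less likely: shrink positive logits and push
// negative ones further down. Each token is penalized at most once.
void apply_repeat_penalty(std::vector<float>& logits,
                          const std::vector<token_id>& recent, float penalty) {
    std::vector<bool> seen(logits.size(), false);
    for (token_id t : recent) {
        if (t < 0 || static_cast<std::size_t>(t) >= logits.size() || seen[t]) {
            continue;
        }
        seen[t] = true;
        logits[t] = logits[t] > 0.0f ? logits[t] / penalty : logits[t] * penalty;
    }
}

// Keep the smallest set of most probable tokens whose total mass reaches top_p.
std::vector<token_id> top_p_candidates(const std::vector<float>& probs, float top_p) {
    std::vector<token_id> ids(probs.size());
    std::iota(ids.begin(), ids.end(), 0);
    std::sort(ids.begin(), ids.end(),
              [&](token_id a, token_id b) { return probs[a] > probs[b]; });
    float cumulative = 0.0f;
    std::size_t keep = 0;
    while (keep < ids.size()) {
        cumulative += probs[ids[keep]];
        ++keep;
        if (cumulative >= top_p) {
            break;
        }
    }
    ids.resize(keep);
    return ids;
}

// Draw one token from the candidates, proportionally to their probabilities.
// `candidates` must not be empty.
token_id sample_from(const std::vector<float>& probs,
                     const std::vector<token_id>& candidates, std::mt19937& rng) {
    std::vector<float> weights;
    for (token_id t : candidates) {
        weights.push_back(probs[t]);
    }
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return candidates[static_cast<std::size_t>(dist(rng))];
}
```

With probabilities {0.5, 0.3, 0.15, 0.05} and top_p = 0.7, only tokens 0 and 1 survive the cut, and sample_from then picks one of them at random.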

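6. Running the model repeatedly to generate a passage of text

Append the predicted token to the end of the array passed as the second argument of llama_eval, update n_past, and call it again to keep predicting. The conversation below starts from this prompt: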

Jane: Hey, Michael, I've got a problem I need your help with.
Michael: Sure, what's the problem?

Code:

#include "common.h"
#include "llama.h"
#include <iostream>
#include <limits>

// Return the token with the highest logit (greedy decoding).
llama_token get_max_probability_token(llama_context* context) {
    auto tokens_probability = llama_get_logits(context);
    long tokens_total = llama_n_vocab(context);
    // lowest() is the most negative float. min() is the smallest positive float
    // and would ignore every token whose logit is negative.
    float max_probability = std::numeric_limits<float>::lowest();
    llama_token token = llama_token_eos();
    for (long i = 0; i < tokens_total; i++) {
        if (tokens_probability[i] > max_probability) {
            max_probability = tokens_probability[i];
            token = (llama_token)i;
        }
    }
    return token;
}

int main() {
    using namespace std;
    gpt_params params;
    params.model = "D:\\my_files\\llama7b\\ggml-model-q4_0.bin";
    auto context = llama_init_from_gpt_params(params);
    if (context == nullptr) {
        return 1;
    }
    auto tokens = llama_tokenize(context, "Jane: Hey, Michael, I've got a problem I need your help with.\nMichael: Sure, what's the problem?\n", true);
    auto eos_token = llama_token_eos();
    long n_past = 0;
    // Stop once the context window is full.
    while ((int)tokens.size() < llama_n_ctx(context)) {
        // Evaluate only the tokens that have not been processed yet.
        if (llama_eval(context, &tokens[n_past], tokens.size() - n_past, n_past, 2) != 0) {
            break;
        }
        auto predict_token = get_max_probability_token(context);
        if (predict_token == eos_token) {
            break;
        }
        cout << llama_token_to_str(context, predict_token);
        n_past = (long)tokens.size(); // everything so far has been evaluated
        tokens.push_back(predict_token);
    }
    llama_free(context);
    return 0;
}

Output:

llama.cpp: loading model from D:\my_files\llama7b\ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 49954
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  68.20 KB
llama_model_load_internal: mem required  = 5897.00 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB
Jane: I'm not sure I can trust my boyfriend anymore.
Michael: Why not?
Jane: Well, I think he's been hiding something from me.
Michael: What kind of something?
Jane: I don't know. I just have a feeling that something is wrong.
Michael: Well, I think you should talk to him about it.
Jane: Yeah, I guess I will. Thanks, Michael.
Michael: No problem.

Put together, the whole conversation reads:

Jane: Hey, Michael, I've got a problem I need your help with.
Michael: Sure, what's the problem?
Jane: I'm not sure I can trust my boyfriend anymore.
Michael: Why not?
Jane: Well, I think he's been hiding something from me.
Michael: What kind of something?
Jane: I don't know. I just have a feeling that something is wrong.
Michael: Well, I think you should talk to him about it.
Jane: Yeah, I guess I will. Thanks, Michael.
Michael: No problem.

llama_eval accepts only a limited number of tokens. After llama_n_ctx() tokens, earlier tokens have to be truncated and discarded in some way.
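One possible truncation scheme, sketched here as an assumption rather than as the method main.cpp uses: keep the first n_keep tokens (for example the initial prompt, in the spirit of the n_keep field in gpt_params), drop the older half of the rest, and keep the most recent half. truncate_context is a hypothetical helper name.

```cpp
#include <cstddef>
#include <vector>

using token_id = int; // stand-in for llama_token

// When the history no longer fits into n_ctx, keep the first n_keep tokens
// plus the most recent half of the remainder. The caller would then
// re-evaluate the returned tokens starting from n_past = 0.
std::vector<token_id> truncate_context(const std::vector<token_id>& tokens,
                                       std::size_t n_ctx, std::size_t n_keep) {
    if (tokens.size() < n_ctx) {
        return tokens; // still fits, nothing to drop
    }
    if (n_keep > tokens.size()) {
        n_keep = tokens.size();
    }
    const std::size_t n_left = tokens.size() - n_keep; // tokens after the kept prefix
    const std::size_t n_tail = n_left / 2;             // most recent half survives
    std::vector<token_id> out(tokens.begin(),
                              tokens.begin() + static_cast<std::ptrdiff_t>(n_keep));
    out.insert(out.end(), tokens.end() - static_cast<std::ptrdiff_t>(n_tail), tokens.end());
    return out;
}
```

For tokens 0..9 with n_ctx = 10 and n_keep = 2 this returns {0, 1, 6, 7, 8, 9}. The price of this simple approach is that the surviving tokens have to be evaluated again, because their cached state was computed at different positions.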