LangChain Evaluation: How to Evaluate LLMs and Their Applications (Part 3)

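TL;DR: There is no truly perfect solution yet. Classification has near-perfect evaluation metrics such as accuracy, but nothing comparable exists for LLMs so far.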

This section of the documentation covers how we approach and think about evaluation in LangChain: both the evaluation of internal chains/agents and how we would recommend that people building on top of LangChain approach evaluation.

The Problem

Evaluating LangChain chains and agents can be really hard. There are two main reasons:

#1: Lack of data

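Before starting a project, you generally do not have much data to evaluate your chains/agents on. This is usually because large language models (the core of most chains/agents) are excellent few-shot and zero-shot learners, which means you can almost always get started on a particular task (text-to-SQL, question answering, etc.) without a large dataset of examples. This is in stark contrast to traditional machine learning, where you had to collect a set of data points before you could even start using a model.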

#2: Lack of metrics

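Most chains/agents perform tasks for which there are no very good metrics to evaluate performance. For example, one of the most common use cases is generating text of some form. Evaluating generated text is much more complicated than evaluating a classification prediction or a numeric prediction.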
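To make the contrast concrete, here is a minimal sketch. The labels and sentences are made up for illustration, and `accuracy` and `exact_match` are hypothetical helper names, not LangChain APIs. Accuracy works well for classification because each prediction is simply right or wrong. Applying the same exact-match logic to generated text rejects an answer that is correct but worded differently.

```python
from typing import List


def accuracy(predictions: List[str], labels: List[str]) -> float:
    """Fraction of predictions that are identical to their label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)


def exact_match(prediction: str, reference: str) -> bool:
    """Case-insensitive string equality, the text analogue of a label match."""
    return prediction.strip().lower() == reference.strip().lower()


# Classification: the metric is unambiguous (illustrative data).
class_preds = ["spam", "ham", "spam", "ham"]
class_labels = ["spam", "ham", "ham", "ham"]
class_accuracy = accuracy(class_preds, class_labels)

# Generated text: same meaning, different wording (illustrative data).
reference = "Paris is the capital of France."
generated = "The capital of France is Paris."
text_match = exact_match(generated, reference)

print(f"classification accuracy: {class_accuracy:.2f}")
print(f"exact match on generated text: {text_match}")
```

Three of the four labels match, so the accuracy is 0.75, while the generated sentence fails the exact-match check even though a human would accept it.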

The Solution

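LangChain attempts to tackle both of these issues. What we have so far is only a first pass at solutions; we do not think we have a perfect solution. We therefore very much welcome feedback, contributions, integrations, and thoughts on this.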

Here is what we have so far for each problem:

#1: Lack of data

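We have started LangChainDatasets, a community space on Hugging Face. We intend it to be a collection of open-source datasets for evaluating common chains and agents. We have contributed five datasets of our own to start, but we very much want this to be a community effort. To contribute a dataset, you simply need to join the community, and then you will be able to upload datasets.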

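We are also working to make it as easy as possible for people to create their own datasets. As a first step, we have added a QAGenerationChain which, given a document, produces question-answer pairs that can be used to evaluate question-answering tasks over that document.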

#2: Lack of metrics

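We have two solutions to the lack of metrics. The first is to use no metrics at all and instead rely on inspecting results by eye to get a sense of how the chain/agent is performing. To assist with this, we have developed a UI-based visualizer, tracing, for following chain and agent runs.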

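The second solution we recommend is to use a language model itself to evaluate outputs. For this we have a few different chains and prompts aimed at tackling this issue. In the part marked with the blue box, Question Answering presumably targets the question-answering ability of the LLM itself, while Data Augmented Question Answering targets evaluating a question-answering system over specific documents.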
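The idea of grading with a language model can be sketched as follows. This is not LangChain's own evaluation chain. The prompt text, `grade_predictions`, and `fake_llm` are hypothetical names introduced only for illustration, and `fake_llm` is an offline stand-in for a real model call. The dictionary keys `query`, `answer`, and `result` follow the ones used in the code later in this post.

```python
from typing import Callable, Dict, List

# Hypothetical grading prompt; a real evaluation prompt may differ.
GRADING_TEMPLATE = (
    "You are a teacher grading a quiz.\n"
    "Question: {query}\n"
    "Reference answer: {answer}\n"
    "Student answer: {result}\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)


def grade_predictions(
    predictions: List[Dict[str, str]], llm: Callable[[str], str]
) -> List[str]:
    """Ask `llm` to grade each prediction and normalize its reply."""
    grades = []
    for pred in predictions:
        prompt = GRADING_TEMPLATE.format(
            query=pred["query"], answer=pred["answer"], result=pred["result"]
        )
        reply = llm(prompt).strip().upper()
        # Check INCORRECT first: "CORRECT" is a substring of "INCORRECT".
        if reply.startswith("INCORRECT"):
            grades.append("INCORRECT")
        elif reply.startswith("CORRECT"):
            grades.append("CORRECT")
        else:
            grades.append("UNKNOWN")
    return grades


def fake_llm(prompt: str) -> str:
    """Offline stand-in for a model call (an assumption, not a real grader)."""
    return "CORRECT" if "did not mention" in prompt else "INCORRECT"


predictions = [
    {
        "query": "What did the president say about Michael Jackson",
        "answer": "Nothing",
        "result": "The president did not mention Michael Jackson in this speech.",
    },
    {
        "query": "Who is the Ukrainian Ambassador to the United States?",
        "answer": "The Ukrainian Ambassador to the United States is here tonight.",
        "result": "I don't know.",
    },
]

grades = grade_predictions(predictions, fake_llm)
print(grades)
```

Passing the model in as a plain callable keeps the sketch independent of any particular client library. In practice `fake_llm` would be replaced by a function that sends the prompt to a real model and returns its text reply.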

Data Augmented Question Answering – LangChain中文网 (the code is at this link)

My own example:

Still in development... stay tuned

Evaluate with Other Metrics

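Besides using a language model to predict whether an answer is correct or incorrect, we can also use other metrics to get a more nuanced view of the quality of the answers. To do so, we can use the Critique library, which allows simple calculation of various metrics over generated text.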

First, you can get an API key from the Inspired Cognition Dashboard and do some setup:

export INSPIREDCO_API_KEY="..."
pip install inspiredco

import inspiredco.critique
import os

critique = inspiredco.critique.Critique(api_key=os.environ['INSPIREDCO_API_KEY'])

Then run the following code to set up the configuration and calculate ROUGE, chrf, BERTScore, and UniEval (you can choose other metrics too):

metrics = {
    "rouge": {
        "metric": "rouge",
        "config": {"variety": "rouge_l"},
    },
    "chrf": {
        "metric": "chrf",
        "config": {},
    },
    "bert_score": {
        "metric": "bert_score",
        "config": {"model": "bert-base-uncased"},
    },
    "uni_eval": {
        "metric": "uni_eval",
        "config": {"task": "summarization", "evaluation_aspect": "relevance"},
    },
}

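# `predictions` is not defined in this post; each item is expected to
# carry 'query', 'answer' (gold answer), and 'result' (model output) keys.
critique_data = [
    {"target": pred['result'], "references": [pred['answer']]} for pred in predictions
]
eval_results = {
    k: critique.evaluate(dataset=critique_data, metric=v["metric"], config=v["config"])
    for k, v in metrics.items()
}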

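Finally, we can print out the results. We can see that, overall, the scores are higher when the output is semantically correct and when it closely matches the gold-standard answer.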

for i, eg in enumerate(examples):
    score_string = ", ".join([f"{k}={v['examples'][i]['value']:.4f}" for k, v in eval_results.items()])
    print(f"Example {i}:")
    print("Question: " + predictions[i]['query'])
    print("Real Answer: " + predictions[i]['answer'])
    print("Predicted Answer: " + predictions[i]['result'])
    print("Predicted Scores: " + score_string)
    print()

Example 0:
Question: What did the president say about Ketanji Brown Jackson
Real Answer: He praised her legal ability and said he nominated her for the supreme court.
Predicted Answer:  The president said that she is one of the nation's top legal minds, a former top litigator in private practice, a former federal public defender, and from a family of public school educators and police officers. He also said that she is a consensus builder and that she has received a broad range of support from the Fraternal Order of Police to former judges appointed by both Democrats and Republicans.
Predicted Scores: rouge=0.0941, chrf=0.2001, bert_score=0.5219, uni_eval=0.9043

Example 1:
Question: What did the president say about Michael Jackson
Real Answer: Nothing
Predicted Answer:  The president did not mention Michael Jackson in this speech.
Predicted Scores: rouge=0.0000, chrf=0.1087, bert_score=0.3486, uni_eval=0.7802

Example 2:
Question: According to the document, what did Vladimir Putin miscalculate?
Real Answer: He miscalculated that he could roll into Ukraine and the world would roll over.
Predicted Answer:  Putin miscalculated that the world would roll over when he rolled into Ukraine.
Predicted Scores: rouge=0.5185, chrf=0.6955, bert_score=0.8421, uni_eval=0.9578

Example 3:
Question: Who is the Ukrainian Ambassador to the United States?
Real Answer: The Ukrainian Ambassador to the United States is here tonight.
Predicted Answer:  I don't know.
Predicted Scores: rouge=0.0000, chrf=0.0375, bert_score=0.3159, uni_eval=0.7493

Example 4:
Question: How many countries were part of the coalition formed to confront Putin?
Real Answer: 27 members of the European Union, France, Germany, Italy, the United Kingdom, Canada, Japan, Korea, Australia, New Zealand, and many others, even Switzerland.
Predicted Answer:  The coalition included freedom-loving nations from Europe and the Americas to Asia and Africa, 27 members of the European Union including France, Germany, Italy, the United Kingdom, Canada, Japan, Korea, Australia, New Zealand, and many others, even Switzerland.
Predicted Scores: rouge=0.7419, chrf=0.8602, bert_score=0.8388, uni_eval=0.0669

Example 5:
Question: What action is the U.S. Department of Justice taking to target Russian oligarchs?
Real Answer: The U.S. Department of Justice is assembling a dedicated task force to go after the crimes of Russian oligarchs and joining with European allies to find and seize their yachts, luxury apartments, and private jets.
Predicted Answer:  The U.S. Department of Justice is assembling a dedicated task force to go after the crimes of Russian oligarchs and to find and seize their yachts, luxury apartments, and private jets.
Predicted Scores: rouge=0.9412, chrf=0.8687, bert_score=0.9607, uni_eval=0.9718

Example 6:
Question: How much direct assistance is the United States providing to Ukraine?
Real Answer: The United States is providing more than $1 Billion in direct assistance to Ukraine.
Predicted Answer:  The United States is providing more than $1 billion in direct assistance to Ukraine.
Predicted Scores: rouge=1.0000, chrf=0.9483, bert_score=1.0000, uni_eval=0.9734
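To see why a lexical metric such as ROUGE-L gives Example 1 a score of zero even though the prediction is right, it helps to compute the metric by hand. The following is a minimal sketch of ROUGE-L as an F1 score over the longest common subsequence (LCS) of tokens. The tokenizer (lowercased word characters) and the plain F1 weighting are assumptions. Critique's actual implementation is not shown in this post, so its scores may differ on other inputs.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split into runs of word characters (an assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall."""
    pred_tokens = tokenize(prediction)
    ref_tokens = tokenize(reference)
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Example 1: correct answer, but no token shared with the reference.
score_ex1 = rouge_l_f1(
    "The president did not mention Michael Jackson in this speech.", "Nothing"
)
# Example 2: same content, partly reordered wording.
score_ex2 = rouge_l_f1(
    "Putin miscalculated that the world would roll over when he rolled into Ukraine.",
    "He miscalculated that he could roll into Ukraine and the world would roll over.",
)
# Example 6: identical apart from capitalization.
score_ex6 = rouge_l_f1(
    "The United States is providing more than $1 billion in direct assistance to Ukraine.",
    "The United States is providing more than $1 Billion in direct assistance to Ukraine.",
)

print(f"Example 1: {score_ex1:.4f}")  # 0.0000
print(f"Example 2: {score_ex2:.4f}")  # 0.5185
print(f"Example 6: {score_ex6:.4f}")  # 1.0000
```

On these three examples the sketch agrees with the rouge values printed above. That is why the post reports model-based scores such as BERTScore and UniEval alongside the lexical ones: a correct answer that shares no wording with the reference scores zero on ROUGE-L.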

References

Evaluation | 🦜️🔗 LangChain

Evaluation – LangChain中文网

LLM Application Development with LangChain, Part 6: Evaluation (bilibili)

基于LangChai... – @宝玉xp on Weibo (weibo.com)

Generic Agent Evaluation – LangChain中文网