deepeval - LLM Evaluation Framework

EAI工程笔记

License: CC 4.0 BY-SA

Category: AI open-source projects. Tags: deepeval, LLM evaluation, Confident AI, llamaindex, langchain, rag

Article link: https://blog.csdn.net/lovechris00/article/details/143783278

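1. About deepeval

DeepEval is a simple-to-use, open-source LLM evaluation framework for evaluating and testing large language model systems.
It is similar to Pytest, but specialized for unit testing LLM outputs.
DeepEval incorporates the latest research to evaluate LLM outputs based on metrics such as G-Eval, hallucination, answer relevancy, and RAGAS. These metrics use LLMs and various other NLP models that run locally on your machine.

Whether your application is built with RAG or fine-tuning, LangChain or LlamaIndex, DeepEval has you covered.
With it, you can easily determine the optimal hyperparameters to improve your RAG pipeline, prevent prompt drift, or even transition from OpenAI to hosting your own Llama2 with confidence.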

GitHub: https://github.com/confident-ai/deepeval
Website: https://docs.confident-ai.com/
Official documentation: https://docs.confident-ai.com/docs/getting-started
Confident AI website: https://www.confident-ai.com/
Try Quickstart in Colab: https://colab.research.google.com/drive/1PPxYEBa6eu__LquGoFFJZkhYgWVYE6kh?usp=sharing
Discord: https://discord.com/invite/a3K9c8GRGt
Authors: built by the founders of Confident AI
For all inquiries, contact jeffreyip@confident-ai.com

Roadmap

Features:

Implement G-Eval
Referenceless evaluation
Production evaluation and logging
Evaluation dataset creation

Integrations:

LlamaIndex (done)
LangChain
Guidance
Guardrails
EmbedChain

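2. 🔥 Metrics and Features

‼️ You can now run DeepEval's metrics on the cloud for free, directly on Confident AI's infrastructure 🥳

A large variety of ready-to-use LLM evaluation metrics (all with explanations), powered by any LLM of your choice, statistical methods, or NLP models that run locally on your machine: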
- G-Eval
- Summarization
- Answer Relevancy
- Faithfulness
- Contextual Recall
- Contextual Precision
- RAGAS
- Hallucination
- etc.
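Red team your LLM application for 40+ safety vulnerabilities in a few lines of code, including:
- Toxicity
- Bias
- SQL Injection
- etc., using 10+ advanced attack enhancement strategies such as prompt injection.
Bulk evaluate entire datasets in parallel in under 20 lines of Python code. Do this via the CLI in a Pytest-like manner, or through our evaluate() function.
Create your own custom metrics by inheriting DeepEval's base metric class; they are automatically integrated with the DeepEval ecosystem.
Integrates seamlessly with any CI/CD environment.
Easily benchmark any LLM on popular LLM benchmarks in under 10 lines of code. These include: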
- MMLU
- HellaSwag
- DROP
- BIG-Bench Hard
- TruthfulQA
- HumanEval
- GSM8K
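Automatically integrated with Confident AI for continuous evaluation throughout the lifetime of your LLM (application):
- Log evaluation results and analyze metric passes/failures
- Compare and pick the optimal hyperparameters (e.g. prompt templates, chunk size, models used) based on evaluation results
- Debug evaluation results via LLM traces
- Manage evaluation test cases/datasets in one place
- Track events to identify live LLM responses in production
- Real-time evaluation in production
- Add production events to existing evaluation datasets to strengthen evaluations over time

(Note that while some metrics are for RAG, others are better suited to fine-tuning use cases. Make sure to consult our docs to pick the right metric.)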
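
The custom-metric item in the list above can be pictured with a minimal, dependency-free sketch. It deliberately does not import DeepEval: `SimpleCase`, `SketchMetric`, and `LengthLimitMetric` are hypothetical stand-ins that only mirror the general shape of such a metric (a `measure` method that sets a score, plus a success check against a threshold). The real base class and its required methods are described in the DeepEval documentation.

```python
from dataclasses import dataclass


@dataclass
class SimpleCase:
    """Hypothetical stand-in for a test case (not DeepEval's LLMTestCase)."""
    input: str
    actual_output: str


class SketchMetric:
    """Hypothetical base class: subclasses compute a score in [0, 1]."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.score = None
        self.reason = None

    def measure(self, case: SimpleCase) -> float:
        raise NotImplementedError

    def is_successful(self) -> bool:
        # A metric passes once it has a score at or above the threshold.
        return self.score is not None and self.score >= self.threshold


class LengthLimitMetric(SketchMetric):
    """Scores 1.0 when the output fits within max_chars, else scales down."""

    def __init__(self, max_chars: int, threshold: float = 0.5):
        super().__init__(threshold)
        self.max_chars = max_chars

    def measure(self, case: SimpleCase) -> float:
        length = len(case.actual_output)
        # The else branch only runs when length > max_chars >= 0, so length > 0.
        self.score = 1.0 if length <= self.max_chars else self.max_chars / length
        self.reason = f"output has {length} characters (limit {self.max_chars})"
        return self.score
```

A subclass only has to implement `measure`; the threshold check is inherited, which is the point of deriving from a shared base class.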

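3. Integrations 🔌

🦄 LlamaIndex, to unit test RAG applications in CI/CD
🤗 Hugging Face, to enable real-time evaluations during LLM fine-tuning

4. Quickstart 🚀

Let's assume your LLM application is a RAG-based customer support chatbot; here is how DeepEval can help test what you have built.

Installation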

pip install -U deepeval

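Create an account (highly recommended)

Creating an account on our platform is optional, and you can run test cases without logging in, but we strongly recommend it: an account lets you log test results and easily track changes and performance across iterations.

To log in, run: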

deepeval login

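Follow the instructions in the CLI to create an account, copy your API key, and paste it into the CLI. All test cases will then be logged automatically (more information on data privacy is available in the official documentation).

Write your first test case

Create a test file: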

touch test_chatbot.py

Open test_chatbot.py and write your first test case using DeepEval:

import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_case():
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        # Replace this with the actual output from your LLM application
        actual_output="We offer a 30-day full refund at no extra costs.",
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
    )
    assert_test(test_case, [answer_relevancy_metric])

Set your OPENAI_API_KEY as an environment variable (you can also evaluate with your own custom model; see the official documentation for details):

export OPENAI_API_KEY="..."
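
Since a missing key tends to surface only once a metric calls the model, a small pre-flight check can be handy. The sketch below is not part of DeepEval; `missing_env_vars` is a hypothetical helper, and it assumes the default setup in which OPENAI_API_KEY is the only variable required.

```python
import os


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    # Fall back to the real process environment when no mapping is passed in.
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(["OPENAI_API_KEY"])
    if missing:
        print("Set these before running the tests:", ", ".join(missing))
```

Passing `env` explicitly keeps the helper easy to test without touching the real environment.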

Finally, run test_chatbot.py in the CLI:

deepeval test run test_chatbot.py

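Your test should have passed ✅ Let's break down what happened.

The variable input mimics user input, and actual_output is a placeholder for what your chatbot is supposed to output for this query.
The variable retrieval_context contains the relevant information from your knowledge base, and AnswerRelevancyMetric(threshold=0.5) is an out-of-the-box metric provided by DeepEval that helps evaluate the relevancy of your LLM output based on the provided context.
Metric scores range from 0 to 1. The threshold=0.5 setting ultimately determines whether your test passes.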
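
The pass/fail rule described above can be written out as a tiny sketch. It assumes a higher-is-better metric in which the threshold is a minimum; `passes` is a hypothetical helper, not a DeepEval function, and some metrics may treat the threshold as a maximum instead, so check the documentation for the metric you use.

```python
def passes(score: float, threshold: float = 0.5) -> bool:
    """Return True when a 0-1 metric score meets the threshold."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be within [0, 1]")
    # Assumption: higher scores are better, so the threshold is a minimum.
    return score >= threshold
```

Under this assumption, a score of 0.8 passes with threshold=0.7, while a score of 0.49 fails with the default threshold of 0.5.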

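Please read our documentation for more information on using additional metrics, creating your own custom metrics, and tutorials on integrating with other tools such as LangChain and LlamaIndex.

Evaluating without Pytest integration

Alternatively, you can evaluate without Pytest, which is better suited to a notebook environment.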

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    # Replace this with the actual output from your LLM application
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)
evaluate([test_case], [answer_relevancy_metric])

Using standalone metrics

DeepEval is extremely modular, making it easy for anyone to use any of our metrics. Continuing from the previous example:

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    # Replace this with the actual output from your LLM application
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)

answer_relevancy_metric.measure(test_case)
print(answer_relevancy_metric.score)
# Most metrics also offer an explanation
print(answer_relevancy_metric.reason)

Note that some metrics are for RAG pipelines, while others are for fine-tuning. Make sure to use our docs to pick the right one for your use case.

Evaluating a dataset / test cases in bulk

In DeepEval, a dataset is simply a collection of test cases. Here is how you can evaluate these in bulk:

import pytest
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset

first_test_case = LLMTestCase(input="...", actual_output="...", context=["..."])
second_test_case = LLMTestCase(input="...", actual_output="...", context=["..."])

dataset = EvaluationDataset(test_cases=[first_test_case, second_test_case])

@pytest.mark.parametrize(
    "test_case",
    dataset,
)
def test_customer_chatbot(test_case: LLMTestCase):
    hallucination_metric = HallucinationMetric(threshold=0.3)
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    assert_test(test_case, [hallucination_metric, answer_relevancy_metric])

# Run this in the CLI, you can also add an optional -n flag to run tests in parallel
deepeval test run test_<filename>.py -n 4

Alternatively, although we recommend using deepeval test run, you can evaluate a dataset/test cases without our Pytest integration:

from deepeval import evaluate
...

evaluate(dataset, [answer_relevancy_metric])
# or
dataset.evaluate([answer_relevancy_metric])
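
Once a dataset has been evaluated, a common next step is to summarize how often each metric passed. The sketch below is independent of DeepEval: `pass_rates` is a hypothetical helper, and the (metric_name, score, threshold) tuple layout is an assumption made for illustration, not the shape of DeepEval's own result objects.

```python
from collections import defaultdict


def pass_rates(results):
    """Compute the share of passing scores per metric name.

    `results` is an iterable of (metric_name, score, threshold) tuples;
    this tuple layout is an assumption made for the sketch.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for name, score, threshold in results:
        total[name] += 1
        # Assumption: higher is better, so the threshold is a minimum.
        if score >= threshold:
            passed[name] += 1
    return {name: passed[name] / total[name] for name in total}
```

With invented sample data such as two answer-relevancy results, one above and one below the threshold, the function reports a pass rate of 0.5 for that metric.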

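5. Real-time Evaluations on Confident AI

We offer a web platform for you to:

Log and view all test results / metric data from DeepEval test runs.
Debug evaluation results via LLM traces.
Compare and pick the optimal hyperparameters (prompt templates, models, chunk size, etc.).
Create, manage, and centralize your evaluation datasets.
Track events in production and augment your evaluation datasets for continuous evaluation.
Track events in production and view evaluation results and historical insights.

Everything about Confident AI, including how to use it, is covered in the official documentation.

To begin, log in from the CLI: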

deepeval login

Follow the instructions to log in, create your account, and paste your API key into the CLI.

Now, run your test file again:

deepeval test run test_chatbot.py

Once the test run finishes, you should see a link displayed in the CLI. Paste it into your browser to view the results!

2024-11-14 (Thu)