Lighteval

🤗 Lighteval is your all-in-one toolkit for evaluating Large Language Models (LLMs) across multiple backends with ease. Dive deep into your model’s performance by saving and exploring detailed, sample-by-sample results to debug and see how your models stack up.

Key Features

🚀 Multi-Backend Support

Evaluate your models using the most popular and efficient inference backends, including Transformers, vLLM, and Hugging Face Inference Providers.

📊 Comprehensive Evaluation

  • Extensive Task Library: thousands of pre-built evaluation tasks
  • Custom Task Creation: Build your own evaluation tasks
  • Flexible Metrics: Support for custom metrics and scoring
  • Detailed Analysis: Sample-by-sample results for deep insights

🔧 Easy Customization

Customization at your fingertips: create new tasks, metrics, or models tailored to your needs, or browse all of our existing tasks and metrics.
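
For a sense of what this looks like in practice, here is a minimal sketch of a community task definition. The dataset repository, column names, and metric below are illustrative assumptions, and the exact LightevalTaskConfig fields can differ slightly between Lighteval versions.

# custom_task.py -- a minimal community task sketch (names below are illustrative)
from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc


def prompt_fn(line: dict, task_name: str = None) -> Doc:
    # Map one dataset row to a Doc: the query, its candidate answers, and the gold index.
    return Doc(
        task_name=task_name,
        query=line["question"],                        # hypothetical column name
        choices=[line["option_a"], line["option_b"]],  # hypothetical column names
        gold_index=line["answer"],                     # index of the correct choice
    )


my_task = LightevalTaskConfig(
    name="my_custom_task",
    prompt_function=prompt_fn,
    suite=["community"],
    hf_repo="my-org/my-eval-dataset",    # hypothetical dataset repository
    hf_subset="default",
    evaluation_splits=["test"],
    metric=[Metrics.loglikelihood_acc],  # swap in any built-in or custom metric
)

# Lighteval picks up custom tasks through a module-level TASKS_TABLE.
TASKS_TABLE = [my_task]

Saved in its own Python file, a definition like this can then be passed to the CLI through its custom-tasks option (the exact flag name depends on your installed version) and run like any built-in task.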

☁️ Seamless Integration

Seamlessly experiment, benchmark, and store your results on the Hugging Face Hub, S3, or locally.

Quick Start

Installation

pip install lighteval

Basic Usage

Find a task

Run your benchmark and push the details to the Hub. The task is specified as suite|task|few-shot count, so lighteval|gpqa:diamond|0 runs GPQA Diamond with zero few-shot examples:

lighteval eval "hf-inference-providers/openai/gpt-oss-20b" \
  "lighteval|gpqa:diamond|0" \
  --bundle-dir gpt-oss-bundle \
  --repo-id OpenEvals/evals

Resulting Space: the pushed evaluation details can be browsed sample by sample in the OpenEvals/evals Space on the Hub.
