
# GAIA Evaluation

This folder contains the evaluation harness for evaluating agents on the GAIA benchmark.

## Setup Environment and LLM Configuration

Please follow the instructions here to set up your local development environment and configure your LLM.
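
For reference, the evaluation scripts look up your LLM settings by config group name in `config.toml` at the repository root (see the `model_config` argument described below). The snippet that follows is only an illustrative sketch: the section name, model, and keys are assumptions based on common OpenHands configuration examples, and the setup instructions linked above are authoritative.

```bash
# Illustrative sketch only: append an LLM config group to config.toml.
# The section name, model, and keys are placeholders -- adjust them to your
# provider and to the config group you pass as [model_config].
cat >> config.toml <<'EOF'
[llm.eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "<your-api-key>"
temperature = 0.0
EOF
```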

## Run the evaluation

We use the GAIA dataset hosted on Hugging Face. Please accept the dataset's terms and make sure you have logged in on your machine with `huggingface-cli login` before running the evaluation.
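
If you have not logged in yet, the following commands (assuming `pip` is available in your environment) install the Hugging Face CLI and store an access token locally:

```bash
# huggingface-cli ships with the huggingface_hub package; logging in stores a
# token so the gated GAIA dataset can be downloaded during the evaluation run.
pip install -U huggingface_hub
huggingface-cli login
```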

The following is the basic command to start the evaluation; here we evaluate on the validation set of the `2023_all` split. You can adjust `./evaluation/gaia/scripts/run_infer.sh` to change the subset you want to evaluate on.

```bash
./evaluation/gaia/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [gaia_subset]
# e.g., ./evaluation/gaia/scripts/run_infer.sh eval_gpt4_1106_preview 0.6.2 CodeActAgent 300
```

where `model_config` is mandatory, while `git-version`, `agent`, `eval_limit`, and `gaia_subset` are optional:

- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your LLM settings, as defined in your `config.toml` (defaults to `gpt-3.5-turbo`).

- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenHands version you would like to evaluate. It can also be a release tag like `0.6.2`.

- `agent`, e.g. `CodeActAgent`, is the name of the agent to run for the benchmark (defaults to `CodeActAgent`).

- `eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances (defaults to all instances).

- `gaia_subset`: the GAIA benchmark has multiple subsets: `2023_level1`, `2023_level2`, `2023_level3`, and `2023_all` (defaults to `2023_level1`).

For example,

```bash
./evaluation/gaia/scripts/run_infer.sh eval_gpt4_1106_preview 0.6.2 CodeActAgent 10
```

## Get score

You can then compute the score by running the following command:

```bash
python ./evaluation/gaia/get_score.py \
  --file <path_to/output.json>
```
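
For example, with a concrete output file (the path below is purely hypothetical; substitute the output file produced by your evaluation run):

```bash
# Hypothetical path -- replace it with the output file your run produced.
python ./evaluation/gaia/get_score.py \
  --file evaluation/evaluation_outputs/outputs/gaia/CodeActAgent/output.json
```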