This folder contains the evaluation harness for ScienceAgentBench (paper: https://arxiv.org/abs/2410.05080).
Please follow the instructions here to set up your local development environment and LLM.
To prevent benchmark data contamination, we only provide the annotation sheet on Huggingface, which includes all necessary inputs to run an agent.
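If you want to inspect the annotation sheet locally before running an agent, a minimal sketch is shown below. The repository id `osunlp/ScienceAgentBench` is an assumption; verify it against the benchmark's Hugging Face page.

```bash
# Optional: download the annotation sheet for local inspection.
# The dataset id "osunlp/ScienceAgentBench" is an assumption; verify it before running.
huggingface-cli download osunlp/ScienceAgentBench --repo-type dataset --local-dir ./scienceagentbench_data
```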
Run inference with:

```bash
./evaluation/benchmarks/scienceagentbench/scripts/run_infer.sh [model_config] [git-version] [use_knowledge] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]

# Example
./evaluation/benchmarks/scienceagentbench/scripts/run_infer.sh llm.eval_gpt4o 0.9.3
```
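For reference, a fuller invocation that sets the optional arguments explicitly could look like the sketch below; the values are illustrative, and `dataset` and `dataset_split` are left at their defaults.

```bash
# Illustrative only: evaluate the first 10 instances with expert knowledge enabled,
# a 30-iteration cap, and a single worker. Adjust the values to your setup.
./evaluation/benchmarks/scienceagentbench/scripts/run_infer.sh llm.eval_gpt4o HEAD true CodeActAgent 10 30 1
```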
where `model_config` is mandatory, and the rest are optional.

- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your LLM settings, as defined in your `config.toml`.
- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenHands version you would like to evaluate. It could also be a release tag like `0.6.2`.
- `use_knowledge`, e.g. `true`, specifies whether the agent is allowed to use expert-provided knowledge as additional input. By default, it is set to `false`.
- `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting to `CodeActAgent`.
- `eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances. By default, the script evaluates the entire test set. Note: in order to use `eval_limit`, you must also set `agent`.
- `max_iter`, e.g. `20`, is the maximum number of iterations for the agent to run. By default, it is set to `30`.
- `num_workers`, e.g. `3`, is the number of parallel workers to run the evaluation. By default, it is set to `1`.

After the inference is completed, you may use the following command to extract the necessary information from the output log for evaluation:
```bash
python post_proc.py [log_fname]
```
`log_fname`, e.g. `evaluation/.../output.jsonl`, is the automatically saved trajectory log of an OpenHands agent. The converted output will be written to, e.g., `evaluation/.../output.converted.jsonl`.
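As a quick sanity check, you can post-process the log and peek at the first converted record; the paths below are hypothetical and should be replaced with the files produced by your own run.

```bash
# Hypothetical paths; substitute the output.jsonl produced by your inference run.
python post_proc.py evaluation/outputs/scienceagentbench/output.jsonl
head -n 1 evaluation/outputs/scienceagentbench/output.converted.jsonl | python -m json.tool
```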
Please follow the steps here to evaluate the generated programs.