DiscoveryBench (Paper) contains 264 tasks collected across 6 diverse domains, such as biology, economics, and sociology. It incorporates discovery workflows from published papers to approximate the real-world challenges faced by researchers.
Please follow the instructions here to set up the OpenHands development environment and configure your LLMs locally.
Execute the bash script to start the DiscoveryBench evaluation:

```bash
./evaluation/benchmarks/discoverybench/scripts/run_infer.sh [YOUR MODEL CONFIG]
```
Replace `[YOUR MODEL CONFIG]` with the model config that you have set up in `config.toml`.
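For example, assuming your `config.toml` defines an LLM section named `llm.eval_gpt4o` (the name here is only illustrative), the invocation would look like:

```bash
# Hypothetical model config name; replace it with whatever section you defined in config.toml
./evaluation/benchmarks/discoverybench/scripts/run_infer.sh llm.eval_gpt4o
```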
When the `run_infer.sh` script is started, it automatically pulls the latest DiscoveryBench instances and sets up the agent environment. The OpenHands agent is invoked to process each task within this environment, producing a hypothesis. We then evaluate it against the "gold" hypothesis provided by DiscoveryBench. The evaluation result, along with the agent's chat history, is logged to `output.jsonl` under `evaluation_outputs`.
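To peek at the first logged instance, a one-liner like the one below works; the directory layout under `evaluation_outputs` depends on your model, agent, and run settings, so the path shown is a placeholder to adjust for your run:

```bash
# Pretty-print the first record of the results file (path below is a placeholder)
head -n 1 evaluation/evaluation_outputs/<your_run_directory>/output.jsonl | python3 -m json.tool
```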
The full set of arguments accepted by the script is:

```bash
./evaluation/benchmarks/discoverybench/scripts/run_infer.sh [MODEL_CONFIG] [GIT_COMMIT] [AGENT] [EVAL_LIMIT] [NUM_WORKERS]
```
- `MODEL_CONFIG`: Name of the model config you want to evaluate with.
- `GIT_COMMIT`: The git commit hash or release tag of OpenHands to use, e.g., `HEAD` or a specific tag like `0.6.2`.
- `AGENT`: Use `CodeActAgent`; it is currently the only supported agent.
- `EVAL_LIMIT`: Number of samples to evaluate.
- `NUM_WORKERS`: Number of workers to parallelize the evaluation process.
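As a concrete illustration (the model config name and limits below are placeholders, not defaults), evaluating 10 instances with a single worker might look like:

```bash
# All values are examples; substitute your own model config, commit, and limits
./evaluation/benchmarks/discoverybench/scripts/run_infer.sh llm.eval_gpt4o HEAD CodeActAgent 10 1
```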