@@ -6,20 +6,19 @@ This folder contains the evaluation harness that we built on top of the original

The evaluation consists of three steps:

-1. Environment setup: [install python environment](../README.md#development-environment), [configure LLM config](../README.md#configure-openhands-and-your-llm), and [pull docker](#openhands-swe-bench-instance-level-docker-support).
+1. Environment setup: [install python environment](../../README.md#development-environment), [configure LLM config](../../README.md#configure-openhands-and-your-llm), and [pull docker](#openhands-swe-bench-instance-level-docker-support).
2. [Run inference](#run-inference-on-swe-bench-instances): Generate an edit patch for each GitHub issue
3. [Evaluate patches using SWE-Bench docker](#evaluate-generated-patches)

## Setup Environment and LLM Configuration

-Please follow instruction [here](../README.md#setup) to setup your local development environment and LLM.
+Please follow the instructions [here](../../README.md#setup) to set up your local development environment and LLM.

## OpenHands SWE-Bench Instance-level Docker Support

OpenHands now supports using the [official evaluation docker](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md) for both **[inference](#run-inference-on-swe-bench-instances) and [evaluation](#evaluate-generated-patches)**.
This is now the default behavior.

-
## Run Inference on SWE-Bench Instances

Make sure your Docker daemon is running and that you have ample disk space (at least 200-500GB, depending on the SWE-Bench set you are running) for the [instance-level docker image](#openhands-swe-bench-instance-level-docker-support).
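
Before launching a long run, a quick preflight check can save time. The sketch below is illustrative only and not part of the harness: it verifies that the Docker daemon responds and shows free disk space.

```bash
# Illustrative preflight check, not part of the harness:
# confirm the Docker daemon is reachable, then show free disk space.
if docker info > /dev/null 2>&1; then
  echo "docker: ok"
else
  echo "docker: daemon not reachable"
fi
df -h .
```
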
@@ -52,7 +51,8 @@ default, it is set to 1.
- `dataset_split`, split of the Hugging Face dataset, e.g., `test`, `dev`. Defaults to `test`.

There are also two optional environment variables you can set.
-```
+
+```bash
export USE_HINT_TEXT=true # if you want to use hint text in the evaluation. Defaults to false. Ignore this if you are not sure.
export USE_INSTANCE_IMAGE=true # if you want to use instance-level docker images. Defaults to true
```
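
For reference, flags like these are typically consumed with bash parameter expansion. The variable names below come from the block above, but the fallback logic is an illustrative assumption, not the harness's actual code:

```bash
# Fall back to the documented defaults when the variables are unset:
# false for USE_HINT_TEXT, true for USE_INSTANCE_IMAGE.
use_hint_text="${USE_HINT_TEXT:-false}"
use_instance_image="${USE_INSTANCE_IMAGE:-true}"
echo "hints=${use_hint_text} instance_images=${use_instance_image}"
```

With both variables unset, this prints `hints=false instance_images=true`.
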
@@ -127,6 +127,7 @@ With `output.jsonl` file, you can run `eval_infer.sh` to evaluate generated patc
**This evaluation is performed using the official dockerized evaluation announced [here](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md).**

> If you want to evaluate existing results, you should first run this to clone existing outputs:
+>
>```bash
>git clone https://huggingface.co/spaces/OpenHands/evaluation evaluation/evaluation_outputs
>```
@@ -143,6 +144,7 @@ Then you can run the following:
```

The script now accepts optional arguments:
+
- `instance_id`: Specify a single instance to evaluate (optional)
- `dataset_name`: The name of the dataset to use (default: `"princeton-nlp/SWE-bench_Lite"`)
- `split`: The split of the dataset to use (default: `"test"`)
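
The defaults above can be mimicked with ordinary bash parameter expansion. The sketch below is illustrative only; the real `eval_infer.sh` may parse its arguments differently:

```bash
# Illustrative handling of the three optional arguments and their defaults;
# not the actual eval_infer.sh implementation.
instance_id="${1:-}"
dataset_name="${2:-princeton-nlp/SWE-bench_Lite}"
split="${3:-test}"
echo "dataset=${dataset_name} split=${split} instance=${instance_id:-<all>}"
```

Run with no arguments, this prints `dataset=princeton-nlp/SWE-bench_Lite split=test instance=<all>`.
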
@@ -179,7 +181,6 @@ To clean-up all existing runtimes that you've already started, run:
ALLHANDS_API_KEY="YOUR-API-KEY" ./evaluation/benchmarks/swe_bench/scripts/cleanup_remote_runtime.sh
```

-
## Visualize Results

First, you need to clone `https://huggingface.co/spaces/OpenHands/evaluation` and add your own run results from OpenHands into the `outputs` directory of the cloned repo.
@@ -189,6 +190,7 @@ git clone https://huggingface.co/spaces/OpenHands/evaluation
```

**(optional) set up a streamlit environment with conda**:
+
```bash
cd evaluation
conda create -n streamlit python=3.10