SWE-Bench Evaluation with OpenDevin SWE-Bench Docker Image

This folder contains the evaluation harness that we built on top of the original SWE-Bench benchmark (paper).

UPDATE (7/1/2024): We now support the official SWE-Bench dockerized evaluation as announced here.

The evaluation consists of three steps:

  1. Environment setup: install the Python environment, configure your LLM, and pull the Docker images.
  2. Run inference: generate an edit patch for each GitHub issue.
  3. Evaluate the patches using the SWE-Bench Docker images.

Setup Environment and LLM Configuration

Please follow the instructions here to set up your local development environment and LLM.

OpenDevin SWE-Bench Instance-level Docker Support

OpenDevin now supports using the official evaluation Docker images for both inference and evaluation. This is now the default behavior.

Download Docker Images

(Recommended for reproducibility) If you have enough local disk space (e.g., 100GB), you can pull the instance-level docker images we've prepared by running:

evaluation/swe_bench/scripts/docker/pull_all_eval_docker.sh instance

If you want to save some disk space (e.g., with ~50GB free) while still speeding up the image pre-build process, you can pull the environment-level docker images instead:

evaluation/swe_bench/scripts/docker/pull_all_eval_docker.sh env

Run Inference on SWE-Bench Instances

Make sure your Docker daemon is running, and you have pulled the instance-level docker image.

./evaluation/swe_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers]
# e.g., ./evaluation/swe_bench/scripts/run_infer.sh llm.eval_gpt4_1106_preview HEAD CodeActAgent 300

where model_config is mandatory, and the rest are optional.

  • model_config, e.g. eval_gpt4_1106_preview, is the config group name for your LLM settings, as defined in your config.toml.
  • git-version, e.g. HEAD, is the git commit hash of the OpenDevin version you would like to evaluate. It could also be a release tag like 0.6.2.
  • agent, e.g. CodeActAgent, is the name of the agent for benchmarks, defaulting to CodeActAgent.
  • eval_limit, e.g. 10, limits the evaluation to the first eval_limit instances. By default, the script evaluates the entire SWE-bench_Lite test set (300 issues). Note: in order to use eval_limit, you must also set agent.
  • max_iter, e.g. 20, is the maximum number of iterations for the agent to run. By default, it is set to 30.
  • num_workers, e.g. 3, is the number of parallel workers to run the evaluation. By default, it is set to 1.
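The note about `eval_limit` follows from the arguments being positional: a later argument can only be supplied if every argument before it is also given. A minimal Python sketch of that rule is shown here, assuming the script reads its arguments strictly by position as the usage line suggests. The helper name `build_run_infer_command` is hypothetical and not part of the repository; it only assembles an argument list and does not run anything.

```python
from typing import List, Optional

SCRIPT = "./evaluation/swe_bench/scripts/run_infer.sh"


def build_run_infer_command(
    model_config: str,
    git_version: Optional[str] = None,
    agent: Optional[str] = None,
    eval_limit: Optional[int] = None,
    max_iter: Optional[int] = None,
    num_workers: Optional[int] = None,
) -> List[str]:
    """Assemble the positional argument list for run_infer.sh (hypothetical helper)."""
    names = ["git-version", "agent", "eval_limit", "max_iter", "num_workers"]
    values = [git_version, agent, eval_limit, max_iter, num_workers]
    # Index of the last optional argument that was actually supplied (-1 if none).
    last = max((i for i, v in enumerate(values) if v is not None), default=-1)
    command = [SCRIPT, model_config]
    for i in range(last + 1):
        if values[i] is None:
            # A gap before a supplied argument would shift every later position.
            raise ValueError(f"{names[i]} must be set because {names[last]} is set")
        command.append(str(values[i]))
    return command


# Usage: 10 instances with CodeActAgent.
print(" ".join(build_run_infer_command("llm.eval_gpt4_1106_preview", "HEAD", "CodeActAgent", 10)))
```

Raising an error on a gap, rather than silently filling in a default, is a design choice of this sketch; the defaults themselves live in the shell script.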

There are also two optional environment variables you can set.

export USE_HINT_TEXT=true # if you want to use hint text in the evaluation. Defaults to false. Ignore this if you are not sure.
export USE_INSTANCE_IMAGE=true # if you want to use instance-level docker images. Defaults to true.
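How the harness parses these two variables is not shown in this README. As a rough illustration only, the sketch below assumes a flag is enabled when its value equals `true` (case-insensitively) and otherwise falls back to the documented default. The helper name `env_flag` is hypothetical.

```python
import os


def env_flag(name: str, default: bool) -> bool:
    """Read a boolean flag from the environment (hypothetical helper)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    # Assumption: only the string "true" (in any case) enables the flag.
    return raw.strip().lower() == "true"


# Defaults taken from the comments above: hint text off, instance images on.
use_hint_text = env_flag("USE_HINT_TEXT", default=False)
use_instance_image = env_flag("USE_INSTANCE_IMAGE", default=True)
print(f"USE_HINT_TEXT={use_hint_text}, USE_INSTANCE_IMAGE={use_instance_image}")
```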

For example, to run 10 instances using llm.eval_gpt4_1106_preview and CodeActAgent, your command would be:

./evaluation/swe_bench/scripts/run_infer.sh llm.eval_gpt4_1106_preview HEAD CodeActAgent 10

Specify a subset of tasks to run inference on

If you would like to benchmark on a specific list of tasks, create a config.toml under the ./evaluation/swe_bench/ folder and add a list attribute named selected_ids, e.g.

selected_ids = ['sphinx-doc__sphinx-8721', 'sympy__sympy-14774', 'scikit-learn__scikit-learn-10508']

Then only these tasks (rows whose instance_id is in the above list) will be evaluated. In this case, the eval_limit option applies only to the tasks that are in the selected_ids list.
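The order of the two filters matters: the ID filter runs first, and the limit is applied to what remains. A minimal sketch of that behavior, assuming each task is a dictionary with an `instance_id` key; the function name `select_instances` and the fourth sample ID are made up for illustration and do not come from the harness.

```python
from typing import Dict, List, Optional


def select_instances(
    rows: List[Dict[str, str]],
    selected_ids: Optional[List[str]] = None,
    eval_limit: Optional[int] = None,
) -> List[Dict[str, str]]:
    """Filter by selected_ids first, then keep the first eval_limit rows."""
    if selected_ids is not None:
        wanted = set(selected_ids)
        rows = [row for row in rows if row["instance_id"] in wanted]
    if eval_limit is not None:
        rows = rows[:eval_limit]
    return rows


# Illustrative rows; "example__repo-1" is a placeholder ID, not a real instance.
rows = [
    {"instance_id": "example__repo-1"},
    {"instance_id": "sphinx-doc__sphinx-8721"},
    {"instance_id": "sympy__sympy-14774"},
    {"instance_id": "scikit-learn__scikit-learn-10508"},
]
selected = ["sphinx-doc__sphinx-8721", "sympy__sympy-14774", "scikit-learn__scikit-learn-10508"]
print(select_instances(rows, selected_ids=selected, eval_limit=2))
```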

After running the inference, you will obtain an output.jsonl file (by default, it is saved to evaluation/evaluation_outputs).

Evaluate Generated Patches

With the output.jsonl file, you can run eval_infer.sh to evaluate the generated patches and produce a fine-grained report.

This evaluation is performed using the official dockerized evaluation announced here.

If you want to evaluate existing results, you should first clone the existing outputs:

```bash
git clone https://huggingface.co/spaces/OpenDevin/evaluation evaluation/evaluation_outputs
```

NOTE: you should have already pulled the instance-level OR env-level docker images following [this section](#opendevin-swe-bench-instance-level-docker-support).

Then you can run the following:

```bash
./evaluation/swe_bench/scripts/eval_infer.sh $YOUR_OUTPUT_JSONL
```

For example:

```bash
./evaluation/swe_bench/scripts/eval_infer.sh evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.jsonl
```


> You can also pass in a JSONL with [SWE-Bench format](https://github.com/princeton-nlp/SWE-bench/blob/main/tutorials/evaluation.md#-creating-predictions) to `./evaluation/swe_bench/scripts/eval_infer.sh`, where each line is a JSON of `{"model_patch": "XXX", "model_name_or_path": "YYY", "instance_id": "ZZZ"}`.
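If your patches come from another tool, a file in that format can be produced with a few lines of Python. The sketch below only relies on the three keys named above; the helper name `to_predictions_jsonl`, the output filename, and the sample values are hypothetical.

```python
import json
from typing import Dict, Iterable

REQUIRED_KEYS = ("model_patch", "model_name_or_path", "instance_id")


def to_predictions_jsonl(predictions: Iterable[Dict[str, str]]) -> str:
    """Serialize predictions as one JSON object per line (hypothetical helper)."""
    lines = []
    for prediction in predictions:
        missing = [key for key in REQUIRED_KEYS if key not in prediction]
        if missing:
            raise ValueError(f"prediction is missing keys: {missing}")
        # Keep only the three documented keys, in a fixed order.
        lines.append(json.dumps({key: prediction[key] for key in REQUIRED_KEYS}))
    return "".join(line + "\n" for line in lines)


# Placeholder values, mirroring the XXX/YYY/ZZZ example above.
sample = [{"model_patch": "XXX", "model_name_or_path": "YYY", "instance_id": "ZZZ"}]
with open("predictions.jsonl", "w", encoding="utf-8") as handle:
    handle.write(to_predictions_jsonl(sample))
```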

The final results will be saved to `evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/` with the following files/directory:

- `README.md`: a report showing which instances passed, failed, etc.
- `report.json`: a JSON file that contains keys like `"resolved_ids"` pointing to instance IDs that are resolved by the agent.
- `logs/`: a directory of test logs
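To pull the resolved instances out of `report.json` programmatically, something like the following may be enough. It is a sketch that assumes `"resolved_ids"` maps to a list of instance ID strings; the helper name `resolved_instances` is hypothetical and the inline report is made-up sample data, not a real result.

```python
import json
from typing import Any, Dict, List


def resolved_instances(report: Dict[str, Any]) -> List[str]:
    """Return the resolved instance IDs in sorted order (hypothetical helper)."""
    # Assumption: "resolved_ids" is a list of strings; a missing key means none resolved.
    return sorted(report.get("resolved_ids", []))


# Made-up sample; in practice, load the file with json.load(open(".../report.json")).
example_report = json.loads(
    '{"resolved_ids": ["sympy__sympy-14774", "sphinx-doc__sphinx-8721"]}'
)
ids = resolved_instances(example_report)
print(f"{len(ids)} resolved: {ids}")
```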

## Visualize Results

First, clone `https://huggingface.co/spaces/OpenDevin/evaluation` and add your own OpenDevin run results to the `outputs` directory of the cloned repo.

```bash
git clone https://huggingface.co/spaces/OpenDevin/evaluation
```


**(optional) set up a streamlit environment with conda**:

```bash
cd evaluation
conda create -n streamlit python=3.10
conda activate streamlit
pip install -r requirements.txt
```


**run the visualizer**:
Then, in a separate Python environment with the `streamlit` library installed, you can run the following:

```bash
# Make sure you are inside the cloned evaluation repo
conda activate streamlit # if you followed the optional conda env setup above
streamlit run 0_📊_OpenDevin_Benchmark.py --server.port 8501 --server.address 0.0.0.0
```

Then you can access the SWE-Bench trajectory visualizer at localhost:8501.

Submit your evaluation results

You can fork our Hugging Face evaluation outputs repository and submit a PR with your evaluation results, following the guide here.