This folder contains code and resources to run experiments and evaluations.
To better organize the evaluation folder, we should follow the rules below:
evaluation/SWE-bench should contain all the preprocessing/evaluation/analysis scripts.

- devin_eval_analysis.ipynb: notebook analyzing Devin's outputs
- prepare_devin_outputs_for_evaluation.py: script fetching and converting Devin's output into the desired JSON file for evaluation.
  - `python prepare_devin_outputs_for_evaluation.py <setting>` where `<setting>` can be `passed`, `failed` or `all`
- `wget https://huggingface.co/datasets/OpenDevin/Devin-SWE-bench-output/raw/main/devin_swe_passed.json`
- `wget https://huggingface.co/datasets/OpenDevin/Devin-SWE-bench-output/raw/main/devin_swe_outputs.json`

See SWE-bench/README.md for more details on how to run SWE-bench for evaluation.
We have refined the original SWE-bench evaluation pipeline to enhance its efficiency and reliability. The updates are as follows:
- Fall back to the `patch` command for patch application if the `git apply` command fails.

🤗 OpenDevin/SWE-bench-devin-passed
| Model/Agent | #instances | #init | #apply | #resolve |
|---|---|---|---|---|
| Gold | 79 | 79 | 79 | 79 |
| Devin | 79 | 79 | 76 | 76 |
#init: number of instances where testbeds have been successfully initialized.
In the 3 instances where Devin's patch fails to apply (listed below), Devin changed the tests. These changes are incompatible with the provided test patch and cause failures during patch application. The evaluation adopted by Devin does not seem to align with the original SWE-bench evaluation.
- django__django-11244
- scikit-learn__scikit-learn-10870
- sphinx-doc__sphinx-9367
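
The patch-application fallback mentioned in the pipeline updates can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the function name `apply_patch_with_fallback` and the injectable `run` parameter are assumptions made here for clarity and testability, and the real pipeline may pass different flags to `patch`.

```python
import subprocess


def apply_patch_with_fallback(patch_file, repo_dir, run=subprocess.run):
    """Try `git apply` first and fall back to `patch` if it fails.

    `patch_file` should be an absolute path (or relative to `repo_dir`),
    because both commands are executed with `repo_dir` as working directory.
    Returns the name of the tool that succeeded; raises RuntimeError if
    neither tool can apply the patch.
    """
    attempts = (
        ("git apply", ["git", "apply", patch_file]),
        # -p1 strips the leading a/ and b/ path components of git-style diffs.
        ("patch", ["patch", "-p1", "-i", patch_file]),
    )
    for name, cmd in attempts:
        result = run(cmd, cwd=repo_dir, capture_output=True, text=True)
        if result.returncode == 0:
            return name
    raise RuntimeError(f"could not apply {patch_file} with git apply or patch")
```

An instance would count toward `#apply` only if one of the two attempts succeeds. Passing a stub as `run` makes it possible to exercise the fallback logic without touching a real repository.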
| Model/Agent | #instances | #init | #apply | #resolve |
|---|---|---|---|---|
| Gold | 491 | 491 | 491 | 371 |
| Devin | 491 | 491 | 463 | 7 |
Devin resolves 7 instances on the SWE-bench-devin-failed subset. The SWE-bench dataset appears to be noisy, as evidenced by the 120 instances (491 − 371) where the gold patches do not pass.
We filtered out these 120 problematic instances, leaving 450 of the 570 instances (79 + 491), to create the SWE-bench-devin-full-filtered subset.
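
As a rough illustration of this filtering step, the sketch below keeps only instances whose gold patch resolves the issue. The record schema (`instance_id`, `gold_resolved`) and the instance ids are hypothetical, chosen only for this example; the actual evaluation outputs may be structured differently.

```python
def filter_gold_passing(gold_results):
    """Return the ids of instances whose gold patch resolves the issue."""
    return [r["instance_id"] for r in gold_results if r["gold_resolved"]]


# Toy records with made-up instance ids; not real evaluation output.
example = [
    {"instance_id": "repo__proj-1", "gold_resolved": True},
    {"instance_id": "repo__proj-2", "gold_resolved": False},
    {"instance_id": "repo__proj-3", "gold_resolved": True},
]

kept = filter_gold_passing(example)
print(kept)
```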
🤗 OpenDevin/SWE-bench-devin-full-filtered
| Model/Agent | #instances | #init | #apply | #resolve |
|---|---|---|---|---|
| Gold | 450 | 450 | 450 | 450 |
| Devin | 450 | 450 | 426 | 83 |
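
The counts in the three tables can be cross-checked with a few lines of arithmetic. The sketch below only restates numbers from the tables; the variable and function names are illustrative.

```python
# (instances, resolved) for Devin on each subset, copied from the tables above.
devin = {
    "passed": (79, 76),
    "failed": (491, 7),
    "full-filtered": (450, 83),
}
# (instances, resolved) for the gold patches on the devin-failed subset.
gold_failed = (491, 371)


def resolve_rate(instances, resolved):
    """Percentage of instances that were resolved."""
    return 100.0 * resolved / instances


# Instances whose gold patch does not pass.
noisy = gold_failed[0] - gold_failed[1]
# Instances remaining after removing the noisy ones from both subsets.
remaining = devin["passed"][0] + devin["failed"][0] - noisy
rates = {name: round(resolve_rate(n, r), 1) for name, (n, r) in devin.items()}

print(noisy)      # 120
print(remaining)  # 450
print(rates)      # {'passed': 96.2, 'failed': 1.4, 'full-filtered': 18.4}
```

Devin's 83 resolved instances on the filtered subset equal the 76 from the passed subset plus the 7 from the failed subset.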