This folder contains code and resources to run experiments and evaluations.
To better organize the evaluation folder, we should follow the rules below:
- Each benchmark folder, e.g. evaluation/swe_bench, should contain all the preprocessing/evaluation/analysis scripts for that benchmark.

The benchmark folders include:

- evaluation/swe_bench
- evaluation/ml_bench
- evaluation/humanevalfix
- evaluation/gaia
- evaluation/EDA
- evaluation/mint
- evaluation/agent_bench
- evaluation/bird
- evaluation/logic_reasoning

Please follow this document to set up a local development environment for OpenDevin.
Create a config.toml file if it does not exist at the root of the workspace. You can copy from config.template.toml if it is easier for you.
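As a minimal sketch of that copy step (run in a scratch directory here so the example is self-contained; in practice you would run it from the repository root, where config.template.toml already exists):

```shell
# Demo in a scratch directory standing in for the repository root
tmp=$(mktemp -d) && cd "$tmp"
printf '[llm]\n' > config.template.toml        # stand-in for the real template

# Copy the template only if config.toml does not already exist
[ -f config.toml ] || cp config.template.toml config.toml
cat config.toml
```

The `[ -f … ] ||` guard keeps an existing config.toml from being overwritten.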
Add the configuration for your LLM:
# TODO: Change these to the model you want to evaluate
[llm.eval_gpt4_1106_preview_llm]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0

[llm.eval_some_openai_compatible_model_llm]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0
Check this Hugging Face space to visualize existing experimental results.
You can fork our Hugging Face evaluation-outputs repo and submit your evaluation results to our hosted Hugging Face repo via a PR, following the guide here.