This folder contains code and resources to run experiments and evaluations.

Contents:

- `EDA/`
- `agent_bench/`
- `biocoder/`
- `bird/`
- `gaia/`
- `gorilla/`
- `gpqa/`
- `humanevalfix/`
- `logic_reasoning/`
- `miniwob/`
- `mint/`
- `ml_bench/`
- `regression/`
- `static/`
- `swe_bench/`
- `toolqa/`
- `utils/`
- `webarena/`
- `README.md`
- `TUTORIAL.md`
- `__init__.py`
To better organize the evaluation folder, we should follow the rules below: each benchmark subfolder (e.g. `evaluation/swe_bench`) should contain all of its preprocessing/evaluation/analysis scripts.

Supported benchmarks:

- `evaluation/swe_bench`
- `evaluation/ml_bench`
- `evaluation/humanevalfix`
- `evaluation/gaia`
- `evaluation/EDA`
- `evaluation/mint`
- `evaluation/agent_bench`
- `evaluation/bird`
- `evaluation/logic_reasoning`

Check this Hugging Face space for visualization of existing experimental results.
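As a minimal sketch, the per-benchmark layout implied by the organization rule above might look like the following; all script names here are hypothetical, not actual files in the repo.

```shell
# Hypothetical layout for one benchmark folder under evaluation/.
# The file names below are illustrative placeholders only.
mkdir -p evaluation/swe_bench
touch evaluation/swe_bench/__init__.py
touch evaluation/swe_bench/prepare_data.py   # preprocessing (hypothetical name)
touch evaluation/swe_bench/run_eval.py       # evaluation (hypothetical name)
touch evaluation/swe_bench/analyze.py        # analysis (hypothetical name)
ls evaluation/swe_bench
```

Keeping preprocessing, evaluation, and analysis scripts together in one subfolder keeps each benchmark self-contained and easy to run in isolation.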
You can start your own fork of our Hugging Face evaluation-outputs repo and submit your evaluation results to the hosted repo as a PR, following the guide here.