Author | Commit | Message | Date
super-dainiu | ebafb702e5 | Add ML-Bench Evaluation with OpenDevin (#2015) | 1 year ago
Leo | 2c231c57c9 | Add supported benchmarks to evaluation README (AgentBench, BIRD, LogicReasoning) (#2183) | 1 year ago
Ryan H. Tran | 9434bcce48 | Support MINT benchmark (MATH, GSM8K subset) (#1955) | 1 year ago
Yizhe Zhang | 0c829cd067 | Support Entity-Deduction-Arena (EDA) Benchmark (#1931) | 1 year ago
Jiayi Pan | 2d52298a1d | Support GAIA benchmark (#1911) | 1 year ago
Niklas Muennighoff | ef6cdb7532 | HumanEvalFix integration (#1908) | 1 year ago
Xingyao Wang | 2406b901df | feat(SWE-Bench environment) integrate SWE-Bench sandbox (#1468) | 1 year ago
Jirka Borovec | e32d95cb1a | lint: simplify hooks already covered by Ruff (#1204) | 1 year ago
hugehope | 9cd4ad3298 | chore: fix some typos in comments (#1013) | 1 year ago
libowen2121 | e256329e5e | Update SWE-bench eval results (#978) | 1 year ago
libowen2121 | 40a3614e80 | Add a roadmap for eval (#92) | 1 year ago
Xingyao Wang | 5ff96111f0 | A starting point for SWE-Bench Evaluation with docker (#60) | 1 year ago
Jiaxin Pei | dc88dac296 | adding a script to fetch and convert devin's output for evaluation (#81) | 1 year ago
Binyuan Hui | f99f4ebdaa | fix: typo in the evaluation folder name. (#66) | 1 year ago