Cheng Yang
|
b808a639d9
docs: improve evaluation README with proper links and formatting (#5221)
|
1 year ago |
OpenHands
|
678436da30
Fix issue #5222: [Refactor]: Refactor the evaluation directory (#5223)
|
1 year ago |
Xingyao Wang
|
1d2a616be7
Fix issue #4739: '[Bug]: The agent doesn't know its name' (#4740)
|
1 year ago |
Graham Neubig
|
54250e3fe2
Update evaluation README.md structure (#4516)
|
1 year ago |
Xingyao Wang
|
797f02ff6f
rename huggingface evaluation benchmark (#3845)
|
1 year ago |
mamoodi
|
6fcc4ca052
fix eval README link (#3692)
|
1 year ago |
tobitege
|
9c39f07430
(enh) Aider-Bench: make resumable with skip_num arg (#3626)
|
1 year ago |
Robert Brennan
|
01ae22ef57
Rename OpenDevin to OpenHands (#3472)
|
1 year ago |
Xingyao Wang
|
7270d21cf9
update documentation for evaluation tutorial
|
1 year ago |
Xingyao Wang
|
31b244f95e
[Refactor, Evaluation] Refactor and clean up evaluation harness to remove global config and use EventStreamRuntime (#3230)
|
1 year ago |
super-dainiu
|
ebafb702e5
Add ML-Bench Evaluation with OpenDevin (#2015)
|
1 year ago |
Leo
|
2c231c57c9
Add supported benchmarks to evaluation README (AgentBench, BIRD, LogicReasoning) (#2183)
|
1 year ago |
Ryan H. Tran
|
9434bcce48
Support MINT benchmark (MATH, GSM8K subset) (#1955)
|
1 year ago |
Yizhe Zhang
|
0c829cd067
Support Entity-Deduction-Arena (EDA) Benchmark (#1931)
|
1 year ago |
Jiayi Pan
|
2d52298a1d
Support GAIA benchmark (#1911)
|
1 year ago |
Niklas Muennighoff
|
ef6cdb7532
HumanEvalFix integration (#1908)
|
1 year ago |
Xingyao Wang
|
2406b901df
feat(SWE-Bench environment) integrate SWE-Bench sandbox (#1468)
|
1 year ago |
Jirka Borovec
|
e32d95cb1a
lint: simplify hooks already covered by Ruff (#1204)
|
1 year ago |
hugehope
|
9cd4ad3298
chore: fix some typos in comments (#1013)
|
1 year ago |
libowen2121
|
e256329e5e
Update SWE-bench eval results (#978)
|
1 year ago |
libowen2121
|
40a3614e80
Add a roadmap for eval (#92)
|
1 year ago |
Xingyao Wang
|
5ff96111f0
A starting point for SWE-Bench Evaluation with docker (#60)
|
1 year ago |
Jiaxin Pei
|
dc88dac296
adding a script to fetch and convert devin's output for evaluation (#81)
|
1 year ago |
Binyuan Hui
|
f99f4ebdaa
fix: typo in the evaluation folder name. (#66)
|
1 year ago |