Ketan Ramaneti
|
852c90f64a
[fix eval] Fix issues with miniwob remote runtime evaluation (#5001)
|
1 year ago |
Xingyao Wang
|
50c13aad98
[Eval] Improve SWE-Bench Eval harness: multi-run support & entry script simplification (#4396)
|
1 year ago |
Xingyao Wang
|
31b244f95e
[Refactor, Evaluation] Refactor and clean up evaluation harness to remove global config and use EventStreamRuntime (#3230)
|
1 year ago |
Graham Neubig
|
cab7a288ca
Add NUM_WORKERS variable to run_infer.sh scripts for configurable worker settings (#2597)
|
1 year ago |
Boxuan Li
|
feabc97aba
Evaluation time travel: build sandbox on the fly (#2491)
|
1 year ago |
Boxuan Li
|
6f235937cf
Evaluation time travel: allow evaluation on a specific version (#2356)
|
1 year ago |
Frank Xu
|
48151bdbb0
[feat] WebArena benchmark, MiniWoB++ benchmark and related arch changes (#2170)
|
1 year ago |