MINT Benchmark

This folder contains the evaluation harness for the MINT benchmark, which tests LLMs' ability to solve tasks through multi-turn interaction.

We support evaluation of the Eurus subset, which focuses on math and code reasoning and includes MATH, MMLU, TheoremQA, HumanEval, and MBPP.

Setup Environment and LLM Configuration

Please follow the instructions here to set up your local development environment and LLM.

Start the evaluation

We are using the MINT dataset hosted on Hugging Face.

The following is the basic command to start the evaluation. Currently, CodeActAgent is the only agent supported with MINT.

./evaluation/benchmarks/mint/scripts/run_infer.sh [model_config] [git-version] [subset] [eval_limit]

where model_config is mandatory and the other arguments are optional.

model_config, e.g. eval_gpt4_1106_preview, is the config group name for your LLM settings, as defined in your config.toml.
git-version, e.g. HEAD, is the git commit hash of the OpenHands version you would like to evaluate. It could also be a release tag like 0.6.2.
subset, e.g. math, is the subset of the MINT benchmark to evaluate on, defaulting to math. It can be one of: math, gsm8k, mmlu, theoremqa, mbpp, humaneval.
eval_limit, e.g. 2, limits the evaluation to the first eval_limit instances, defaulting to all instances.

Note: in order to use eval_limit, you must also set subset.
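
The ordering constraint in the note above can be sketched in Python. This is only an illustration, not part of OpenHands: `build_command` is a hypothetical helper, and it assumes the script reads its arguments strictly by position, as the usage line suggests, so an optional argument can be given only when every argument before it is also given.

```python
import shlex

# Path and subset names are copied from this README.
SCRIPT = "./evaluation/benchmarks/mint/scripts/run_infer.sh"
SUBSETS = ("math", "gsm8k", "mmlu", "theoremqa", "mbpp", "humaneval")


def build_command(model_config, git_version=None, subset=None, eval_limit=None):
    """Hypothetical helper: build the argument list for run_infer.sh.

    Assumes arguments are purely positional, so a later optional
    argument cannot be passed while an earlier one is omitted.
    """
    if subset is not None and subset not in SUBSETS:
        raise ValueError(f"unknown subset: {subset!r}")

    args = [SCRIPT, model_config]
    first_missing = None  # name of the first optional argument left unset
    for name, value in (
        ("git-version", git_version),
        ("subset", subset),
        ("eval_limit", eval_limit),
    ):
        if value is None:
            if first_missing is None:
                first_missing = name
        elif first_missing is not None:
            raise ValueError(f"{name} requires {first_missing} to be set")
        else:
            args.append(str(value))
    return args


# Reproduces the example invocation given in this README.
print(shlex.join(build_command("eval_gpt4_1106_preview", "0.6.2", "gsm8k", 3)))
```

Calling `build_command("eval_gpt4_1106_preview", eval_limit=2)` raises a `ValueError`, because the earlier positional slots are empty.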

For example,

./evaluation/benchmarks/mint/scripts/run_infer.sh eval_gpt4_1106_preview 0.6.2 gsm8k 3

Reference

@misc{wang2024mint,
    title={MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback},
    author={Xingyao Wang and Zihan Wang and Jiateng Liu and Yangyi Chen and Lifan Yuan and Hao Peng and Heng Ji},
    year={2024},
    eprint={2309.10691},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}