
MINT Benchmark

This folder contains the evaluation harness for the MINT benchmark, which evaluates LLMs' ability to solve tasks through multi-turn interactions.

We support evaluation of the Eurus subset, which focuses on math and code reasoning and includes MATH, MMLU, TheoremQA, HumanEval, and MBPP.

Setup Environment and LLM Configuration

Please follow the instructions here to set up your local development environment and LLM.
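
For reference, an LLM config group in config.toml could look roughly like the minimal sketch below. This is an illustrative example, not the authoritative configuration: the group name eval_gpt4_1106_preview and the exact keys are assumptions you should adjust to your provider and the OpenHands configuration guide.

    # Hypothetical config group; rename it and adjust the keys for your setup.
    [llm.eval_gpt4_1106_preview]
    model = "gpt-4-1106-preview"   # model identifier passed to your LLM provider
    api_key = "your-api-key"       # credential for the provider
    temperature = 0.0              # deterministic decoding for evaluation

The group name (the part after llm.) is what you pass as model_config in the commands below.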

Start the evaluation

We are using the MINT dataset hosted on Hugging Face.

The following is the basic command to start the evaluation. Currently, CodeActAgent is the only agent supported with MINT.

./evaluation/mint/scripts/run_infer.sh [model_config] [git-version] [subset] [eval_limit]

where model_config is mandatory, while the others are optional.

  • model_config, e.g. eval_gpt4_1106_preview, is the config group name for your LLM settings, as defined in your config.toml.

  • git-version, e.g. HEAD, is the git commit hash of the OpenHands version you would like to evaluate. It could also be a release tag like 0.6.2.

  • subset, e.g. math, is the subset of the MINT benchmark to evaluate on, defaulting to math. It can be one of: math, gsm8k, mmlu, theoremqa, mbpp, humaneval.

  • eval_limit, e.g. 2, limits the evaluation to the first eval_limit instances, defaulting to all instances.

Note: in order to use eval_limit, you must also set subset.

For example,

./evaluation/mint/scripts/run_infer.sh eval_gpt4_1106_preview 0.6.2 gsm8k 3

Reference

@misc{wang2024mint,
    title={MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback},
    author={Xingyao Wang and Zihan Wang and Jiateng Liu and Yangyi Chen and Lifan Yuan and Hao Peng and Heng Ji},
    year={2024},
    eprint={2309.10691},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}