AgentBench Evaluation

This folder contains the evaluation harness for running agents on the AgentBench benchmark (AgentBench: Evaluating LLMs as Agents).

Configure OpenDevin and your LLM

Create a config.toml file at the root of the workspace if one does not already exist. See README.md for setup instructions.

Here is an example config.toml file:

[core]
max_iterations = 100
cache_dir = "/path/to/cache"

workspace_base = "/path/to/workspace"
workspace_mount_path = "/path/to/workspace"

ssh_hostname = "localhost"

# AgentBench specific
run_as_devin = true

[sandbox]
use_host_network = false
enable_auto_lint = true
box_type = "ssh"
timeout = 120

[llm.eval_gpt35_turbo]
model = "gpt-3.5-turbo"
api_key = "sk-123"
temperature = 0.0

[llm.eval_gpt4o]
model = "gpt-4o"
api_key = "sk-123"
temperature = 0.0
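
In the example command in this README, the model_config argument (eval_gpt35_turbo) matches the name of an [llm.<name>] group in config.toml. As a minimal sketch, assuming that correspondence holds, the snippet below lists the group names found in a TOML string so a typo can be caught before a run. The helper list_llm_configs and the inline EXAMPLE_CONFIG are hypothetical illustrations, not part of OpenDevin, and this is not how OpenDevin itself loads its configuration.

```python
try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # assumption: the tomli backport is installed on older Pythons

# A trimmed copy of the example config above, inlined so the sketch is self-contained.
EXAMPLE_CONFIG = """
[core]
max_iterations = 100

[llm.eval_gpt35_turbo]
model = "gpt-3.5-turbo"
api_key = "sk-123"
temperature = 0.0

[llm.eval_gpt4o]
model = "gpt-4o"
api_key = "sk-123"
temperature = 0.0
"""


def list_llm_configs(toml_text):
    """Return the sorted names of the [llm.<name>] groups in the TOML text."""
    config = tomllib.loads(toml_text)
    # [llm.<name>] tables are parsed as nested dicts under the "llm" key.
    return sorted(config.get("llm", {}).keys())


names = list_llm_configs(EXAMPLE_CONFIG)
print(names)
```

To check a real file, read it first and pass its contents, for example list_llm_configs(open("config.toml").read()).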

Start the evaluation

    ./evaluation/agent_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit]

This is the basic command to start the evaluation. For now, only osbench is evaluated.

To change other arguments, such as --max-iterations or --eval-num-workers, edit the script evaluation/agent_bench/scripts/run_infer.sh.

  • --agent-cls: the agent to use. For example, CodeActAgent.
  • --llm-config: the LLM configuration to use. For example, eval_gpt4_1106_preview.
  • --max-iterations: the maximum number of iterations the agent may run. For example, 30.
  • --eval-num-workers: the number of workers to use for evaluation. For example, 5.
  • --eval-n-limit: the number of examples to evaluate. For example, 100.

    ./evaluation/agent_bench/scripts/run_infer.sh eval_gpt35_turbo 0.6.2 CodeActAgent 1
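
When the same run is repeated for several model configs, it can help to build the command line programmatically. The following is a minimal sketch, assuming the four positional arguments shown above ([model_config] [git-version] [agent] [eval_limit]); the helper build_run_infer_command is a hypothetical name introduced for illustration and is not part of the repository.

```python
import shlex

# Script path as given in this README.
RUN_INFER_SCRIPT = "./evaluation/agent_bench/scripts/run_infer.sh"


def build_run_infer_command(model_config, git_version, agent, eval_limit):
    """Return the argument list for run_infer.sh, in the positional order shown above."""
    return [RUN_INFER_SCRIPT, model_config, git_version, agent, str(eval_limit)]


cmd = build_run_infer_command("eval_gpt35_turbo", "0.6.2", "CodeActAgent", 1)
# shlex.join renders the list as a single shell-safe command string.
command_line = shlex.join(cmd)
print(command_line)
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the repository root to launch the evaluation.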