1 year ago · b2fdb963b6
--- a/evaluation/TUTORIAL.md
+++ b/evaluation/TUTORIAL.md
@@ -0,0 +1,166 @@
 
				+# Tutorial: How to add a New Evaluation Benchmark to OpenDevin
			
 
				+
			
 
				+This tutorial provides a general guide on how to integrate your own evaluation benchmark into the OpenDevin framework.
			
 
				+
			
 
				+You can read this for details, and also learn by example by looking at our existing evaluations:
			
 
				+- [swe_bench](swe_bench/)
			
 
				+
			
 
				+
			
 
				+## A quick walk-through of OpenDevin architecture
			
 
				+
			
 
				+### Before everything begins
			
 
				+
			
 
				+Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to set up a local development environment for OpenDevin.
			
 
				+
			
 
				+### Configuration file
			
 
				+
			
 
				+OpenDevin uses `config.toml` to keep track of most configurations.
			
 
				+
			
 
				+Here's an example configuration file you can use:
			
 
				+
			
 
				+```toml
			
 
				+[core]
			
 
				+max_iterations = 100
			
 
				+cache_dir = "/tmp/cache"
			
 
				+
			
 
				+# IMPORTANT: You should set these two paths to YOUR WORKSPACE directory,
			
 
				+# which will be mounted into the sandbox for the agent to interact with!
			
 
				+# The OpenDevin agent will be able to read and write any file (and even run rm -rf)
			
 
				+# in this directory, so be careful!!
			
 
				+workspace_base = "/path/to/your/workspace"
			
 
				+workspace_mount_path = "/path/to/your/workspace"
			
 
				+# ==========================
			
 
				+
			
 
				+sandbox_container_image = "ghcr.io/opendevin/sandbox:latest"
			
 
				+sandbox_type = "ssh"
			
 
				+sandbox_timeout = 120
			
 
				+ssh_hostname = "localhost"
			
 
				+
			
 
				+# SWEBench eval specific - but you can tweak it to your needs
			
 
				+use_host_network = false
			
 
				+run_as_devin = false
			
 
				+# linting python after editing helps LLM fix indentations
			
 
				+enable_auto_lint = true
			
 
				+
			
 
				+[llm]
			
 
				+# IMPORTANT: add your API key here, and set the model to the one you want to evaluate
			
 
				+model = "gpt-4o-2024-05-13"
			
 
				+api_key = "sk-XXX"
			
 
				+```
			
 
				+
			
 
				+### How to use OpenDevin programmatically
			
 
				+
			
 
				+To build an evaluation task, we do not use the standard OpenDevin web-based GUI; instead, we run the OpenDevin backend from the CLI.
			
 
				+
			
 
				+For example, you can run the following, which performs the specified task `-t`, with a particular model `-m` and agent `-c`, for a maximum number of iterations `-i`:
			
 
				+
			
 
				+```bash
			
 
				+poetry run python ./opendevin/core/main.py \
			
 
				+        -i 10 \
			
 
				+        -t "Write me a bash script that print hello world." \
			
 
				+        -c CodeActAgent \
			
 
				+        -m gpt-4o-2024-05-13
			
 
				+```
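
If you want to launch the same command from a Python script, for example to try several tasks in a row, the invocation can be assembled as an argument list. This is only a sketch: `build_command` is a hypothetical helper, and the flags are exactly the four shown in the command above.

```python
import shlex


def build_command(task: str, agent: str, model: str, max_iterations: int) -> list[str]:
    """Hypothetical helper: assemble the CLI invocation shown above as an argument list."""
    return [
        'poetry', 'run', 'python', './opendevin/core/main.py',
        '-i', str(max_iterations),
        '-t', task,
        '-c', agent,
        '-m', model,
    ]


cmd = build_command(
    task='Write me a bash script that print hello world.',
    agent='CodeActAgent',
    model='gpt-4o-2024-05-13',
    max_iterations=10,
)
# Print a shell-safe rendering of the command for inspection.
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` from the repository root would then start the run; keeping the arguments in a list avoids quoting problems with task strings that contain spaces.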
			
 
				+
			
 
				+After running the script, you will observe the following:
			
 
				+
			
 
				+![](./static/example_task_1.png)
			
 
				+
			
 
				+You can see the agent uses bash to write a script, makes it executable, and then tests it by running it to make sure it is working.
			
 
				+
			
 
				+At the end of the above screenshot, OpenDevin requests user input when it thinks it has finished the task. This causes issues in evaluation, since most evaluations don't assume additional user input. To fix this, we introduce `fake_user_response_fn` in the `main` function, which we describe in the next section.
			
 
				+
			
 
				+## The `main` function
			
 
				+
			
 
				+The signature of `main` (in file [`opendevin/core/main.py`](../opendevin/core/main.py)) is as follows:
			
 
				+
			
 
				+```python
			
 
				+async def main(
			
 
				+    task_str: str = '',
			
 
				+    exit_on_message: bool = False,
			
 
				+    fake_user_response_fn: Optional[Callable[[Optional[State]], str]] = None,
			
 
				+    sandbox: Optional[Sandbox] = None,
			
 
				+) -> Optional[State]:
			
 
				+```
			
 
				+
			
 
				+- `task_str`: The task instruction to run. In the above example, it is "Write me a bash script that print hello world."
			
 
				+- `exit_on_message`: Whether to quit if the agent asks for a message from the user.
			
 
				+- `fake_user_response_fn`: An optional function that receives the current state (could be None) and returns a fake user response.
			
 
				+- `sandbox`: An optional sandbox to run the agent in.
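
Because the signature allows the state to be `None`, a response function should tolerate that case. The following is a minimal sketch rather than OpenDevin code: `State` here is a stand-in dataclass with only a `history` field, `simple_user_response` is a hypothetical name, and the threshold of 20 steps is an arbitrary illustrative value.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class State:
    """Stand-in for OpenDevin's State, reduced to the one field used here."""
    history: list = field(default_factory=list)


def simple_user_response(state: Optional[State]) -> str:
    """Always tell the agent to keep going; tolerate a missing state."""
    msg = (
        'Please continue working on the task on whatever approach you think is suitable.\n'
        'IMPORTANT: YOU SHOULD NEVER ASK FOR HUMAN HELP.\n'
    )
    # Treat a missing state as an empty history.
    num_steps = len(state.history) if state is not None else 0
    if num_steps >= 20:  # arbitrary illustrative threshold
        msg += 'If you want to give up, run: <execute_bash> exit </execute_bash>.\n'
    return msg
```

A function like this would be passed to `main` as `fake_user_response_fn=simple_user_response`.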
			
 
				+
			
 
				+### `fake_user_response_fn`
			
 
				+
			
 
				+Here's an example of `fake_user_response_fn` in the implementation for SWE-Bench in [`evaluation/swe_bench/run_infer.py`](swe_bench/run_infer.py):
			
 
				+
			
 
				+```python
			
 
				+def codeact_user_response(state: State) -> str:
			
 
				+    msg = (
			
 
				+        'Please continue working on the task on whatever approach you think is suitable.\n'
			
 
				+        'If you think you have modified the code in a way that fixes the issue, please run the following command: <execute_bash> exit </execute_bash>.\n'
			
 
				+        'IMPORTANT: YOU SHOULD NEVER ASK FOR HUMAN HELP OR USE THE INTERNET TO SOLVE THIS TASK.\n'
			
 
				+    )
			
 
				+    if state.history:
			
 
				+        user_msgs = [
			
 
				+            action
			
 
				+            for action, _ in state.history
			
 
				+            if isinstance(action, MessageAction) and action.source == 'agent'
			
 
				+        ]
			
 
				+        if len(user_msgs) >= 2:
			
 
				+            # let the agent know that it can give up when it has tried 3 times
			
 
				+            return (
			
 
				+                msg
			
 
				+                + 'If you want to give up, run: <execute_bash> exit </execute_bash>.\n'
			
 
				+            )
			
 
				+    return msg
			
 
				+```
			
 
				+
			
 
				+
			
 
				+### Return value
			
 
				+
			
 
				+The `main` function returns a `State`, which is defined in [`opendevin/controller/state/state.py`](../opendevin/controller/state/state.py). We mainly use `state.history` here, which is its most important field. You can think of it as a more structured version of OpenAI's chat completion [messages](https://platform.openai.com/docs/guides/text-generation/chat-completions-api).
			
 
				+
			
 
				+`history: list[tuple[Action, Observation]] = field(default_factory=list)` is a list of (action, observation) tuples. All actions are defined in [`opendevin/events/action`](../opendevin/events/action), and all observations are defined in [`opendevin/events/observation`](../opendevin/events/observation).
			
 
				+
			
 
				+The agent can emit different actions, such as `CmdRunAction` (`opendevin/events/action/commands.py`) to execute bash commands and receive a `CmdOutputObservation` (`opendevin/events/observation/commands.py`), `IPythonRunCellAction` to run an IPython cell and receive an `IPythonRunCellObservation`, and `BrowseInteractiveAction` (`opendevin/events/action/browse.py`) to browse the web and receive a `BrowserOutputObservation` (`opendevin/events/observation/browse.py`).
			
 
				+
			
 
				+The action we used in this example is `MessageAction` (`opendevin/events/action/message.py`), which actually denotes a message from either `agent` or `user`. In the [CodeAct agent example](https://github.com/OpenDevin/OpenDevin/blob/7ca560471bd262f22513f3863995d0a8e6121c07/agenthub/codeact_agent/codeact_agent.py#L239-L273), an agent is considered to emit a `MessageAction` when it does not trigger a `CmdRunAction`, `IPythonRunCellAction`, and/or `BrowseInteractiveAction`.
			
 
				+
			
 
				+Typically, the agent returns a `MessageAction` when it is confused about the task and wants to ask a human for follow-up clarification. This is a good thing in real-world tasks, but not necessarily in evaluation. So in this example, we provide a dummy prompt that tells the agent "Please continue working on the task on whatever approach you think is suitable[...]".
			
 
				+
			
 
				+If your agent behaves this way on your benchmark, consider adding a similar function to your evaluation pipeline as well.
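
For benchmarks that only score the agent's final answer, the last agent `MessageAction` in `state.history` is usually what you want to parse. The sketch below uses stand-in dataclasses instead of the real OpenDevin event classes so that it runs on its own; the `content` and `command` fields and the helper name `last_agent_message` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


# Stand-ins for OpenDevin's event classes, reduced to what this sketch needs.
@dataclass
class Action:
    source: str = 'agent'


@dataclass
class MessageAction(Action):
    content: str = ''


@dataclass
class CmdRunAction(Action):
    command: str = ''


def last_agent_message(history: list) -> Optional[str]:
    """Return the content of the most recent agent MessageAction, if any."""
    for action, _observation in reversed(history):
        if isinstance(action, MessageAction) and action.source == 'agent':
            return action.content
    return None


# A toy history of (action, observation) tuples; observations are omitted here.
history = [
    (MessageAction(source='user', content='What is 6 * 7?'), None),
    (CmdRunAction(command='python3 -c "print(6 * 7)"'), None),
    (MessageAction(content='The answer is 42.'), None),
]
print(last_agent_message(history))  # The answer is 42.
```

With the real classes, the same loop would run over `state.history` as returned by `main`.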
			
 
				+
			
 
				+### `sandbox`
			
 
				+
			
 
				+The sandbox is a fully functioning Docker container where the agent can perform all sorts of tasks, e.g., using bash, calling Python, installing packages, and more. You can leave `sandbox` as `None` if you don't need to do anything special to pre-configure the `Sandbox`.
			
 
				+
			
 
				+In SWE-Bench, we need to copy the proper repository directory to the workspace and activate the right Python virtual environment before the agent can start performing the task, so we defined a custom [`SWEBenchSSHBox`](https://github.com/OpenDevin/OpenDevin/blob/7ca560471bd262f22513f3863995d0a8e6121c07/evaluation/swe_bench/swe_env_box.py#L12-L118) that inherits from the default sandbox [`SSHBox`](https://github.com/OpenDevin/OpenDevin/blob/7ca560471bd262f22513f3863995d0a8e6121c07/opendevin/runtime/docker/ssh_box.py#L188) and handles all of this initial setup. If you need to configure the `sandbox` for your evaluation, use `SWEBenchSSHBox` as a reference implementation.
			
 
				+
			
 
				+## How to put together an evaluation script?
			
 
				+
			
 
				+Now we know how to run the agent end-to-end, and how `fake_user_response_fn` and `sandbox` work. Next, we outline the general workflow of an evaluation script, based on a simplified view of SWE-Bench's [`run_infer.py`](https://github.com/OpenDevin/OpenDevin/blob/main/evaluation/swe_bench/run_infer.py):
			
 
				+
			
 
				+- Load the dataset and prepare the evaluation configuration.
			
 
				+- Filter out any instances that have already been processed.
			
 
				+- For each instance in the dataset:
			
 
				+  - Set up the sandbox environment.
			
 
				+  - Run the agent to generate a solution.
			
 
				+  - Apply the solution to the instance and execute the test command.
			
 
				+  - Collect the results and write them to the output file.
			
 
				+- Perform cleanup after the evaluation is complete.
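
The steps above can be sketched as a small, self-contained loop. This is an illustration under stated assumptions, not the actual `run_infer.py`: the dataset is a plain list of dicts, `process_instance` is a placeholder where the sandbox setup, agent run, and scoring would go, and `load_finished_ids` and `run_evaluation` are hypothetical helper names.

```python
import json
import os
import tempfile


def load_finished_ids(output_file: str) -> set:
    """Collect instance ids already written to the JSONL output file."""
    finished = set()
    if os.path.exists(output_file):
        with open(output_file) as f:
            for line in f:
                finished.add(json.loads(line)['instance_id'])
    return finished


def process_instance(instance: dict) -> dict:
    """Placeholder: set up the sandbox, run the agent, and score the result here."""
    return {'instance_id': instance['instance_id'], 'test_result': {'resolved': False}}


def run_evaluation(dataset: list, output_file: str) -> int:
    """Process every unfinished instance and append one JSON line per result."""
    finished = load_finished_ids(output_file)
    remaining = [inst for inst in dataset if inst['instance_id'] not in finished]
    with open(output_file, 'a') as output_fp:
        for instance in remaining:
            output = process_instance(instance)
            output_fp.write(json.dumps(output) + '\n')
            output_fp.flush()  # keep partial results if the run is interrupted
    return len(remaining)


if __name__ == '__main__':
    dataset = [{'instance_id': 'demo-1'}, {'instance_id': 'demo-2'}]
    out = os.path.join(tempfile.mkdtemp(), 'output.jsonl')
    print(run_evaluation(dataset, out))  # first run processes both instances
    print(run_evaluation(dataset, out))  # second run finds nothing left to do
```

Writing one JSON line per instance and flushing immediately is what makes the "filter out finished instances" step possible after an interrupted run.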
			
 
				+
			
 
				+You can see the [swe_bench/run_infer.py](swe_bench/run_infer.py) file for an example.
			
 
				+
			
 
				+Once you fully understand `run_infer.py`, you are ready to start the evaluation!
			
 
				+
			
 
				+
			
 
				+## Run the evaluation!
			
 
				+
			
 
				+You can write your `run_infer.sh` script mimicking SWE-Bench's [`run_infer.sh`](https://github.com/OpenDevin/OpenDevin/blob/main/evaluation/swe_bench/scripts/run_infer.sh).
			
 
				+
			
 
				+
			
 
				+You can start the evaluation by running:
			
 
				+
			
 
				+```bash
			
 
				+./run_infer.sh eval_gpt_4o_2024_05_13
			
 
				+```
			
 
				+where `eval_gpt_4o_2024_05_13` is the model config you defined in `config.toml`.
			
--- a/evaluation/static/example_task_1.png
+++ b/evaluation/static/example_task_1.png
--- a/evaluation/swe_bench/README.md
+++ b/evaluation/swe_bench/README.md
@@ -3,6 +3,11 @@
 
				 
			
 
				 This folder contains the evaluation harness we built on top of the original [SWE-Bench benchmark](https://www.swebench.com/) ([paper](https://arxiv.org/abs/2310.06770)). We created [a fork of SWE-Bench](https://github.com/OpenDevin/OD-SWE-bench.git), mostly built on top of [the original repo](https://github.com/princeton-nlp/SWE-bench), and [containerized](#opendevin-swe-bench-docker-image) it for easy evaluation.
			
 
				 
			
 
				+## Setup Environment
			
 
				+
			
 
				+Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to set up a local development environment for OpenDevin.
			
 
				+
			
 
				+
			
 
				 ## OpenDevin SWE-Bench Docker Image
			
 
				 
			
 
				 In the [OpenDevin-SWE-Bench fork](https://github.com/OpenDevin/OD-SWE-bench.git) (mostly from the [original repo](https://github.com/princeton-nlp/SWE-bench) with some fixes), we try to pre-build the **testbed** (i.e., the code of the repository we want the agent to edit) AND the **conda environment**, so that at evaluation (inference) time, we can directly leverage existing environments for efficient evaluation.
			
@@ -40,11 +45,11 @@ max_iterations = 100
 
				 cache_dir = "/tmp/cache"
			
 
				 sandbox_container_image = "ghcr.io/opendevin/sandbox:latest"
			
 
				 sandbox_type = "ssh"
			
 
				-use_host_network = true
			
 
				 ssh_hostname = "localhost"
			
 
				 sandbox_timeout = 120
			
 
				 
			
 
				 # SWEBench eval specific
			
 
				+use_host_network = false
			
 
				 run_as_devin = false
			
 
				 enable_auto_lint = true
			
 
				 
			
--- a/evaluation/swe_bench/run_infer.py
+++ b/evaluation/swe_bench/run_infer.py
@@ -68,6 +68,7 @@ AGENT_CLS_TO_INST_SUFFIX = {
 
				 
			
 
				 def get_test_result(instance, sandbox, workspace_dir_name):
			
 
				     test_result = {'result': {}, 'metadata': {}}
			
 
				+    # NOTE: if you need to do something in the sandbox to get the correctness metric, modify this function
			
 
				     try:
			
 
				         test_patch_parsed = whatthepatch.parse_patch(instance.test_patch)
			
 
				         # get a list of filepaths that are involved in the patch
			
@@ -187,10 +188,13 @@ def process_instance(
 
				 ):
			
 
				     workspace_mount_path = os.path.join(config.workspace_mount_path, '_eval_workspace')
			
 
				     # create process-specific workspace dir
			
 
				+    # if `not skip_workspace_mount` - we will create a workspace directory for EACH process
			
 
				+    # so that different agents don't interfere with each other.
			
 
				     if not skip_workspace_mount:
			
 
				         workspace_mount_path = os.path.join(workspace_mount_path, str(os.getpid()))
			
 
				         pathlib.Path(workspace_mount_path).mkdir(parents=True, exist_ok=True)
			
 
				 
			
 
				+    # Set up the logger properly, so you can run multi-processing to parallelize the evaluation
			
 
				     if reset_logger:
			
 
				         # Set up logger
			
 
				         log_file = os.path.join(
			
@@ -216,6 +220,8 @@ def process_instance(
 
				     if not skip_workspace_mount:
			
 
				         logger.info(f'Process-specific workspace mounted at {workspace_mount_path}')
			
 
				 
			
 
				+    # NOTE: this is something special we do for SWE-Bench due to the reason described in the previous section
			
 
				+    # You can omit this if you don't need to set up a specialized sandbox
			
 
				     workspace_dir_name = f'{instance.repo}__{instance.version}'.replace('/', '__')
			
 
				     sandbox = SWEBenchSSHBox.get_box_for_instance(
			
 
				         instance,
			
@@ -238,9 +244,10 @@ def process_instance(
 
				         'You should NOT modify any existing test case files. If needed, you can add new test cases in a NEW file to reproduce the issue.\n'
			
 
				         'You SHOULD INCLUDE PROPER INDENTATION in your edit commands.\n'
			
 
				     )
			
 
				+    # NOTE: You can actually set slightly different instruction for different agents
			
 
				     instruction += AGENT_CLS_TO_INST_SUFFIX.get(agent_class, '')
			
 
				 
			
 
				-    # Run the agent
			
 
				+    # Here's how you can run the agent (similar to the `main` function) and get the final task state
			
 
				     state: State = asyncio.run(
			
 
				         main(
			
 
				             instruction,
			
@@ -249,23 +256,28 @@ def process_instance(
 
				         )
			
 
				     )
			
 
				 
			
 
				+    # ======= THIS IS SWE-Bench specific =======
			
 
				     # Get git patch
			
 
				     git_patch = sandbox.get_diff_patch()
			
 
				     logger.info(f'Got git diff for instance {instance.instance_id}')
			
 
				+    # ==========================================
			
 
				 
			
 
				     # ======= Attempt to evaluate the agent's edits =======
			
 
				-    # Attempt to analyze the test patch to get involved filepaths
			
 
				+    # TODO: if you need to do something in the sandbox to get the correctness metric, modify this function
			
 
				     test_result = get_test_result(instance, sandbox, workspace_dir_name)
			
 
				 
			
 
				+    # If you are working on some simpler benchmark that only evaluates the final model output (e.g., in a MessageAction)
			
 
				+    # You can simply get the LAST `MessageAction` from the returned `state.history` and parse it for evaluation.
			
 
				+
			
 
				     if state is None:
			
 
				         raise ValueError('State should not be None.')
			
 
				 
			
 
				     # Save the output
			
 
				     output = {
			
 
				         'instance_id': instance.instance_id,
			
 
				-        'swe_instance': instance.to_dict(),
			
 
				+        'swe_instance': instance.to_dict(),  # SWE Bench specific
			
 
				         'instruction': instruction,
			
 
				-        'git_patch': git_patch,
			
 
				+        'git_patch': git_patch,  # SWE Bench specific
			
 
				         'metadata': metadata,
			
 
				         'history': [
			
 
				             (event_to_dict(action), event_to_dict(obs)) for action, obs in state.history
			
@@ -280,10 +292,13 @@ def process_instance(
 
				 
			
 
				 
			
 
				 if __name__ == '__main__':
			
 
				-    # Load the dataset
			
 
				+    # NOTE: It is preferable to load datasets from huggingface datasets and perform post-processing
			
 
				+    # so we don't need to manage file uploading to OpenDevin's repo
			
 
				     dataset = load_dataset('princeton-nlp/SWE-bench_Lite')
			
 
				     swe_bench_tests = dataset['test'].to_pandas()
			
 
				 
			
 
				+    # Check https://github.com/OpenDevin/OpenDevin/blob/main/evaluation/swe_bench/README.md#configure-opendevin-and-your-llm
			
 
				+    # for details of how to set `llm_config`
			
 
				     if args.llm_config:
			
 
				         specified_llm_config = get_llm_config_arg(args.llm_config)
			
 
				         if specified_llm_config:
			
@@ -319,7 +334,7 @@ if __name__ == '__main__':
 
				         'max_iterations': max_iterations,
			
 
				         'eval_output_dir': eval_output_dir,
			
 
				         'start_time': time.strftime('%Y-%m-%d %H:%M:%S'),
			
 
				-        # get the commit id of current repo
			
 
				+        # get the commit id of current repo for reproducibility
			
 
				         'git_commit': subprocess.check_output(['git', 'rev-parse', 'HEAD'])
			
 
				         .decode('utf-8')
			
 
				         .strip(),
			
@@ -352,6 +367,7 @@ if __name__ == '__main__':
 
				         f'Evaluation started with Agent {agent_class}, model {model_name}, max iterations {max_iterations}.'
			
 
				     )
			
 
				 
			
 
				+    # =============================================
			
 
				     # filter out finished instances
			
 
				     new_swe_bench_tests = []
			
 
				     for idx, instance in swe_bench_tests.iterrows():
			
@@ -366,9 +382,11 @@ if __name__ == '__main__':
 
				     logger.info(
			
 
				         f'Finished instances: {len(finished_instance_ids)}, Remaining instances: {len(swe_bench_tests)}'
			
 
				     )
			
 
				+    # =============================================
			
 
				 
			
 
				     pbar = tqdm(total=len(swe_bench_tests))
			
 
				 
			
 
				+    # This function tracks the progress AND writes the output to a JSONL file
			
 
				     def update_progress(future):
			
 
				         pbar.update(1)
			
 
				         output = future.result()
			
@@ -380,14 +398,18 @@ if __name__ == '__main__':
 
				         output_fp.write(json.dumps(output) + '\n')
			
 
				         output_fp.flush()
			
 
				 
			
 
				+    # This sets the number of worker processes used for multi-processing
			
 
				     num_workers = args.eval_num_workers
			
 
				     logger.info(f'Using {num_workers} workers for evaluation.')
			
 
				 
			
 
				+    # This is SWE-Bench specific - CodeActAgent doesn't require mounted workspace to work
			
 
				     skip_workspace_mount = agent_class == 'CodeActAgent'
			
 
				     logger.info(f'Skipping workspace mount: {skip_workspace_mount}')
			
 
				+
			
 
				     try:
			
 
				         with ProcessPoolExecutor(num_workers) as executor:
			
 
				             futures = []
			
 
				+            # This is how we perform multi-processing
			
 
				             for row_idx, instance in swe_bench_tests.iterrows():
			
 
				                 future = executor.submit(
			
 
				                     process_instance,
			
--- a/evaluation/swe_bench/scripts/run_infer.sh
+++ b/evaluation/swe_bench/scripts/run_infer.sh
@@ -1,6 +1,8 @@
 
				 #!/bin/bash
			
 
				 
			
 
				 AGENT=CodeActAgent
			
 
				+# IMPORTANT: Because Agent's prompt changes fairly often in the rapidly evolving codebase of OpenDevin
			
 
				+# We need to track the version of Agent in the evaluation to make sure results are comparable
			
 
				 AGENT_VERSION=v$(python3 -c "from agenthub.codeact_agent import CodeActAgent; print(CodeActAgent.VERSION)")
			
 
				 MODEL_CONFIG=$1
			
 
				 
			
--- a/opendevin/controller/agent_controller.py
+++ b/opendevin/controller/agent_controller.py
@@ -138,8 +138,10 @@ class AgentController:
 
				             if self._pending_action and self._pending_action.id == event.cause:
			
 
				                 await self.add_history(self._pending_action, event)
			
 
				                 self._pending_action = None
			
 
				+                logger.info(event, extra={'msg_type': 'OBSERVATION'})
			
 
				             elif isinstance(event, CmdOutputObservation):
			
 
				                 await self.add_history(NullAction(), event)
			
 
				+                logger.info(event, extra={'msg_type': 'OBSERVATION'})
			
 
				 
			
 
				     def reset_task(self):
			
 
				         self.agent.reset()