@@ -58,45 +58,24 @@ poetry run pytest -s ./tests/integration
Note: in order to run integration tests correctly, please ensure your workspace is empty.
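+
+For example, assuming your workspace is the `./workspace` folder in the repository root (adjust the path to wherever your workspace is mounted), you can reset it and then run the suite:
+
+```bash
+# Reset the workspace so the recorded prompts can be reproduced (path is an example)
+rm -rf workspace
+mkdir workspace
+# Run the integration tests
+poetry run pytest -s ./tests/integration
+```
+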
-## Write Integration Tests
-
-To write an integration test, there are essentially two steps:
-
-1. Decide your task prompt, and the result you want to verify.
-2. Either construct LLM responses by yourself, or run OpenDevin with a real LLM. The system prompts and
-LLM responses are recorded as logs, which you could then copy to test folder.
-The following paragraphs describe how to do it.
-
-Your `config.toml` should look like this:
-
-```toml
-LLM_MODEL="gpt-4-turbo"
-LLM_API_KEY="<your-api-key>"
-LLM_EMBEDDING_MODEL="openai"
-WORKSPACE_MOUNT_PATH="<absolute-path-of-your-workspace>"
-```
-
-You can choose any model you'd like to generate the mock responses.
-You can even handcraft mock responses, especially when you would like to test the behaviour of agent for corner cases. If you use a very weak model (e.g. 8B params), chance is most agents won't be able to finish the task.
-
+## Regenerate Integration Tests
+When you make changes to an agent's prompt, the integration tests will fail. You'll need to regenerate them
+by running:
```bash
-# Remove logs if you are okay to lose logs. This helps us locate the prompts and responses quickly, but is NOT a must.
-rm -rf logs
-# Clear the workspace, otherwise OpenDevin might not be able to reproduce your prompts in CI environment. Feel free to change the workspace name and path. Be sure to set `WORKSPACE_MOUNT_PATH` to the same absolute path.
-rm -rf workspace
-mkdir workspace
-# Depending on the complexity of the task you want to test, you can change the number of iterations limit. Change agent accordingly. If you are adding a new test, try generating mock responses for every agent.
-poetry run python ./opendevin/core/main.py -i 10 -t "Write a shell script 'hello.sh' that prints 'hello'." -c "MonologueAgent" -d "./workspace"
+./tests/integration/regenerate.sh
```
+Note that this will make several calls to your LLM_MODEL, potentially costing money! If you don't want
+to cover the cost, ask one of the maintainers to regenerate for you.
+You might also be able to fix the tests by hand.
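+
+Conceptually, regeneration automates the old manual flow: clear the logs and workspace, run the agent against the task prompt with a real LLM, and move the recorded prompts and responses into the matching mock folder. The sketch below is only an illustration of that flow; the actual behaviour of `regenerate.sh` may differ, and the agent name, iteration limit, and timestamp are just examples.
+
+```bash
+# Illustration only -- regenerate.sh automates roughly these steps
+rm -rf logs workspace && mkdir workspace
+poetry run python ./opendevin/core/main.py -i 10 \
+  -t "Write a shell script 'hello.sh' that prints 'hello'." \
+  -c "MonologueAgent" -d "./workspace"
+# Move the recorded prompts/responses (logs/llm/%y-%m-%d_%H-%M) into the matching mock folder
+mkdir -p tests/integration/mock/MonologueAgent/test_write_simple_script
+mv logs/llm/24-04-23_21-55/* tests/integration/mock/MonologueAgent/test_write_simple_script/
+```
+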
-**NOTE**: If your agent decides to support user-agent interaction via natural language (e.g., you will be prompted to enter user responses when running the above `main.py` command), you should create a file named `tests/integration/mock/<AgentName>/<TestName>/user_responses.log` containing, in order, all the responses you provided to the agent, delimited by newlines ('\n'). This will be used to mock STDIN during testing.
+## Write a New Integration Test
-After running the above commands, you should be able to locate the real prompts
-and responses logged. The log folder follows `logs/llm/%y-%m-%d_%H-%M.log` format.
+To write an integration test, there are essentially two steps:
-Now, move all files under that folder to `tests/integration/mock/<AgentName>/<TestName>` folder. For example, moving all files from `logs/llm/24-04-23_21-55/` folder to
-`tests/integration/mock/MonologueAgent/test_write_simple_script` folder.
+1. Decide your task prompt, and the result you want to verify.
+2. Add your prompt to `./tests/integration/regenerate.sh` (see the sketch below).
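+
+How a prompt is registered inside `regenerate.sh` depends on the current version of the script, so the snippet below is only a hypothetical illustration of the kind of entry you would add (the variable names and values are made up):
+
+```bash
+# Hypothetical example only -- check regenerate.sh for its actual convention
+TEST_NAME="test_write_simple_script"
+PROMPT="Write a shell script 'hello.sh' that prints 'hello'."
+```
+
+After regeneration, the recorded prompts and responses for the new test should appear under `tests/integration/mock/<AgentName>/<TestName>/`.
+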
+**NOTE**: If your agent decides to support user-agent interaction via natural language (e.g., you are prompted to enter user responses while the mock responses are being generated), you should create a file named `tests/integration/mock/<AgentName>/<TestName>/user_responses.log` containing, in order, all the responses you provided to the agent, delimited by newlines ('\n'). This file will be used to mock STDIN during testing.
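+
+For example, if you typed two answers while the mocks were being generated, the file could be created like this (the answers and the agent/test names are placeholders):
+
+```bash
+# Example only: store the answers you typed, one per line, in your agent/test mock folder
+printf 'yes\ndone\n' > tests/integration/mock/MonologueAgent/test_write_simple_script/user_responses.log
+```
+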
That's it, you are good to go! When you launch an integration test, mock
responses are loaded and used to replace a real LLM, so that we get