This folder contains backend integration tests that rely on a mock LLM. It serves two purposes:
Why don't we launch an open-source model, e.g. Llama 3? There are two reasons:
Note: integration tests are orthogonal to evaluations/benchmarks, as they serve different purposes. Although benchmarks can also catch bugs, some of which may not be caught by tests, benchmarks require real LLMs, which are non-deterministic and costly. We run the integration test suite for every commit, which is not feasible with benchmarks.
Known limitations:
The folder is organised as follows:
├── README.md
├── conftest.py
├── mock
│   ├── [AgentName]
│   │   └── [TestName]
│   │       ├── prompt_*.log
│   │       ├── response_*.log
└── [TestFiles].py
where conftest.py defines the infrastructure needed to load real-world LLM prompts
and responses for mocking purposes. Prompts and responses generated during real runs
of agents with real LLMs are stored under the mock/[AgentName]/[TestName] folders.
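For intuition, here is a minimal sketch of the kind of loading helper conftest.py could provide; the function name `load_mock_pairs` and the prompt/response pairing logic are illustrative assumptions, not the actual implementation.

```python
# Illustrative sketch only: the real conftest.py may load things differently.
from pathlib import Path

MOCK_DIR = Path(__file__).parent / "mock"

def load_mock_pairs(agent_name: str, test_name: str) -> dict[str, str]:
    """Map each recorded prompt to the response captured alongside it."""
    test_dir = MOCK_DIR / agent_name / test_name
    pairs: dict[str, str] = {}
    for prompt_file in sorted(test_dir.glob("prompt_*.log")):
        response_file = test_dir / prompt_file.name.replace("prompt_", "response_")
        pairs[prompt_file.read_text()] = response_file.read_text()
    return pairs
```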
Take a look at run-integration-tests.yml to learn how integration tests are
launched in the CI environment. Assuming you want to use the workspace folder for
testing, an example is as follows:
rm -rf workspace; AGENT=PlannerAgent \
WORKSPACE_BASE="/Users/admin/OpenDevin/workspace" WORKSPACE_MOUNT_PATH="/Users/admin/OpenDevin/workspace" MAX_ITERATIONS=10 \
poetry run pytest -s ./tests/integration
Note: in order to run integration tests correctly, please ensure your workspace is empty.
To write an integration test, there are essentially two steps:
Your config.toml should look like this:
LLM_MODEL="gpt-4-turbo"
LLM_API_KEY="<your-api-key>"
LLM_EMBEDDING_MODEL="openai"
WORKSPACE_MOUNT_PATH="<absolute-path-of-your-workspace>"
You can choose any model you'd like to generate the mock responses. You can even handcraft mock responses, especially when you want to test an agent's behavior in corner cases. If you use a very weak model (e.g. one with 8B parameters), chances are most agents won't be able to finish the task.
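If you do handcraft mock responses, the files live in the same place and follow the same naming convention as recorded ones. Below is a hedged sketch of creating such a pair by hand; the test name `test_my_corner_case` is hypothetical, and the file contents must mirror exactly what the agent would send and the response format the agent expects to parse.

```python
# Hypothetical example: handcrafting a single prompt/response pair for a
# corner-case test. File names follow the prompt_*.log / response_*.log
# convention; the contents must match what the agent actually sends and the
# exact response format the agent expects.
from pathlib import Path

mock_dir = Path("tests/integration/mock/MonologueAgent/test_my_corner_case")
mock_dir.mkdir(parents=True, exist_ok=True)

(mock_dir / "prompt_001.log").write_text("<the exact prompt the agent will send>")
(mock_dir / "response_001.log").write_text("<the handcrafted LLM response>")
```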
# Remove existing logs if you are okay with losing them. This helps locate the new prompts and responses quickly, but is NOT a must.
rm -rf logs
# Clear the workspace, otherwise OpenDevin might not be able to reproduce your prompts in the CI environment. Feel free to change the workspace name and path, but be sure to set `WORKSPACE_MOUNT_PATH` to the same absolute path.
rm -rf workspace
mkdir workspace
# Depending on the complexity of the task you want to test, you can change the iteration limit. Change the agent accordingly. If you are adding a new test, try generating mock responses for every agent.
poetry run python ./opendevin/core/main.py -i 10 -t "Write a shell script 'hello.sh' that prints 'hello'." -c "MonologueAgent" -d "./workspace"
NOTE: If your agent supports user-agent interaction via natural language (e.g., you will be prompted to enter user responses when running the above main.py command), you should create a file named tests/integration/mock/<AgentName>/<TestName>/user_responses.log containing all the responses you provided to the agent, in order, delimited by newlines ('\n'). This will be used to mock STDIN during testing.
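As a rough illustration, STDIN mocking along these lines could look like the sketch below; the helper name `feed_user_responses` is hypothetical and the real conftest.py may implement it differently.

```python
# Rough sketch of STDIN mocking based on user_responses.log; the real
# conftest.py may implement this differently.
import io
import sys
from pathlib import Path

def feed_user_responses(test_dir: Path) -> None:
    """Replace sys.stdin with the newline-delimited recorded user responses."""
    responses_file = test_dir / "user_responses.log"
    if responses_file.exists():
        sys.stdin = io.StringIO(responses_file.read_text())
```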
After running the above commands, you should be able to locate the logged real prompts
and responses. The log folder follows the logs/llm/%y-%m-%d_%H-%M.log naming format.
Now, move all files under that folder to the tests/integration/mock/<AgentName>/<TestName> folder. For example, move all files from the logs/llm/24-04-23_21-55/ folder to the
tests/integration/mock/MonologueAgent/test_write_simple_script folder.
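If you prefer to script the move, a small Python equivalent using the same example paths could look like this:

```python
# Python equivalent of moving the recorded logs into the mock folder,
# using the same example paths as above.
import shutil
from pathlib import Path

src = Path("logs/llm/24-04-23_21-55")
dst = Path("tests/integration/mock/MonologueAgent/test_write_simple_script")
dst.mkdir(parents=True, exist_ok=True)
for item in src.iterdir():
    shutil.move(str(item), str(dst / item.name))
```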
That's it, you are good to go! When you launch an integration test, the mock responses are loaded and used in place of a real LLM, so that we get deterministic and consistent behavior, and most importantly, without spending real money.
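For intuition, here is a conceptual sketch of how the recorded responses might stand in for a real LLM call during a test; the helper name `make_replay_completion` and the drop-in completion function are illustrative assumptions, not OpenDevin's actual internals.

```python
# Conceptual sketch only: a drop-in completion function that replays recorded
# responses instead of calling a real LLM API. Names are illustrative, not
# OpenDevin's actual internals.
from typing import Callable

def make_replay_completion(pairs: dict[str, str]) -> Callable[[str], str]:
    """Return a fake completion function backed by recorded prompt/response pairs."""
    def completion(prompt: str) -> str:
        # A KeyError here means the agent produced a prompt that was never
        # recorded, i.e. the mock data needs to be regenerated.
        return pairs[prompt]
    return completion
```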