Running SWE-bench with Harbor¶
Run SWE-bench evaluations using Harbor with OpenModal as the compute backend.
Install¶
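Assuming both packages are published under these names (an assumption — check each project's README for the authoritative command), a minimal install might be:

```shell
# Package names below are assumptions, not verified distribution names
pip install harbor openmodal
```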
Run¶
harbor run \
--agent mini-swe-agent \
--model openai/gpt-5.4 \
--environment-import-path openmodal.integrations.harbor_env:ModalEnvironment \
--dataset swe-bench/swe-bench-verified \
--n-tasks 1
This creates a sandbox, runs the agent against a SWE-bench task, verifies the patch, and reports results.
What happens¶
- Harbor downloads a SWE-bench task (e.g., a Django bug)
- OpenModal creates a container with the task's Docker image
- The agent runs inside the container — reads the bug, edits code, runs tests
- Harbor uploads test files, runs verification, and reports pass/fail
- The container is cleaned up
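The steps above amount to one lifecycle: build a sandbox from the task image, run the agent and verification inside it, and guarantee cleanup. A minimal sketch — all names below (`Sandbox`, `run_task`, the command strings) are illustrative stand-ins, not the real Harbor or OpenModal API:

```python
from contextlib import contextmanager

class Sandbox:
    """Stand-in for a container built from the task's Docker image."""
    def __init__(self, image):
        self.image = image
        self.closed = False

    def exec(self, cmd):
        # A real backend would run `cmd` inside the container.
        return f"ran: {cmd}"

    def close(self):
        self.closed = True

@contextmanager
def sandbox(image):
    sb = Sandbox(image)
    try:
        yield sb
    finally:
        sb.close()  # cleanup happens even if the agent crashes

def run_task(image, agent_cmd, verify_cmd):
    with sandbox(image) as sb:
        agent_log = sb.exec(agent_cmd)  # agent reads the bug, edits code
        verdict = sb.exec(verify_cmd)   # verification reports pass/fail
    return agent_log, verdict
```

The context manager is the important part of the sketch: cleanup runs in `finally`, so the container is torn down whether the agent succeeds, fails, or raises.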
Options¶
Different agents:
harbor run --agent claude-code --model anthropic/claude-sonnet-4-5-20250929 \
--environment-import-path openmodal.integrations.harbor_env:ModalEnvironment \
--dataset swe-bench/swe-bench-verified --n-tasks 5
harbor run --agent openhands --model openai/gpt-5.4 \
--environment-import-path openmodal.integrations.harbor_env:ModalEnvironment \
--dataset swe-bench/swe-bench-verified --n-tasks 5
Multiple attempts:
harbor run --agent mini-swe-agent --model openai/gpt-5.4 \
--environment-import-path openmodal.integrations.harbor_env:ModalEnvironment \
--dataset swe-bench/swe-bench-verified --n-tasks 10 --n-attempts 3
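With several attempts per task, results are typically summarized as pass@k: the probability that at least one of k randomly drawn attempts passes. A sketch of the standard unbiased estimator — illustrative only, Harbor's own reporting may differ:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """With n attempts of which c passed, estimate the probability
    that at least one of k randomly drawn attempts passes."""
    if n - c < k:
        return 1.0  # every size-k draw must include a passing attempt
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 3 attempts per task, 1 passing attempt:
pass_at_k(3, 1, 1)  # ≈ 0.333
```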
View results:
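Assuming Harbor writes per-task output to a local directory (the `jobs/` path and file names below are hypothetical, not Harbor's documented layout), you could inspect it with standard tools:

```shell
# Directory layout is an assumption; adjust to your Harbor version's output
ls jobs/
cat jobs/latest/results.json
```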
How it compares to Modal¶
With Modal:
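If Modal is Harbor's default backend (an assumption), the Modal-backed invocation would presumably be the same command with the environment flag dropped:

```shell
# Hypothetical: assumes Modal is used when no environment is specified
harbor run --agent mini-swe-agent \
  --dataset swe-bench/swe-bench-verified
```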
With OpenModal:
harbor run --agent mini-swe-agent \
--environment-import-path openmodal.integrations.harbor_env:ModalEnvironment \
--dataset swe-bench/swe-bench-verified
Same agents, same datasets, same results — runs on your own infrastructure.