Architecture¶
How OpenModal works under the hood — from `openmodal run` to a running container on your cloud.
The big picture¶
OpenModal is an orchestration layer on top of Kubernetes. Your code gets packaged into a Docker image, pushed to a container registry, and runs as a Kubernetes pod on your cloud.
```mermaid
graph LR
    Code[your code] --> Image[Docker image] --> Pod[K8s pod] --> Result[result]
```
Under the hood, three systems are involved:
```mermaid
graph TB
    subgraph Your Machine
        CLI[openmodal CLI]
    end
    subgraph Container Registry
        Image[Docker Image]
    end
    subgraph Kubernetes Cluster
        API[K8s API Server]
        Scheduler[Scheduler]
        Node1[Node 1]
        Node2[Node 2]
        Node3[GPU Node]
    end
    CLI -->|1. build & push| Image
    CLI -->|2. create pod| API
    API --> Scheduler
    Scheduler --> Node1
    Scheduler --> Node2
    Scheduler --> Node3
    Image -.->|3. pull| Node1
    Image -.->|3. pull| Node3
```
- Container Registry (Artifact Registry, ECR, or ACR) — stores your Docker images
- K8s API Server — accepts pod creation requests
- Scheduler — places pods on nodes with enough CPU, memory, and GPUs
What happens when you run `openmodal run app.py`¶
```mermaid
sequenceDiagram
    participant You
    participant CLI as openmodal CLI
    participant Registry as Container Registry
    participant K8s as K8s API
    participant Pod
    You->>CLI: openmodal run app.py
    CLI->>CLI: Generate Dockerfile from Image chain
    CLI->>Registry: Build & push image
    CLI->>K8s: Create pod with image
    K8s->>Pod: Schedule on node, pull image, start
    Pod->>Pod: Unpickle args → call function → pickle result
    Pod-->>CLI: Return result
    CLI-->>You: Print return value
```
Step by step¶
1. Image build. OpenModal reads the `Image` chain (`.apt_install()`, `.pip_install()`, etc.) and generates a Dockerfile. It builds the image and pushes it to a registry.
| Provider | Registry | Build method |
|---|---|---|
| GCP | Artifact Registry | Cloud Build (remote) |
| AWS | ECR | Local docker build + push |
| Azure | ACR | ACR Tasks (remote) |
| Local | None | Local docker build |
GCP and Azure build images remotely — no local Docker needed. AWS uses local Docker because CodeBuild requires admin IAM permissions.
2. Pod creation. OpenModal creates a Kubernetes pod spec with your image, resource requests, GPU requirements, env vars, and volumes, then submits it to the K8s API.
3. Scheduling. The scheduler finds a node with enough free resources. If nothing fits, the pod stays Pending and the cluster autoscaler adds a new node (see Cluster autoscaling).
4. Image pull. The node pulls the image from the registry. First pull is slow (2-30s depending on image size). Subsequent pulls on the same node use cached layers.
5. Execution. The container runs the OpenModal agent, which unpickles your function arguments, calls your function, pickles the result, and sends it back (see Remote function execution).
Image building¶
The `Image` class is a chainable Dockerfile generator. Each method call appends a line to the Dockerfile:
```python
image = (
    openmodal.Image.debian_slim()          # FROM ubuntu:24.04 + python 3.12
    .apt_install("git", "curl")            # RUN apt-get install -y git curl
    .pip_install("torch", "transformers")  # RUN pip install torch transformers
    .run_commands("echo setup done")       # RUN echo setup done
)
```
This generates:
```dockerfile
FROM ubuntu:24.04
ENV DEBIAN_FRONTEND=noninteractive
RUN curl -sSL <python-build-standalone-url> | tar xz -C /usr/local ...
RUN apt-get update && apt-get install -y git curl ...
RUN pip install torch transformers
RUN echo setup done
RUN pip install openmodal
COPY your_app.py /opt/your_app.py
CMD ["python", "-m", "openmodal.runtime.agent"]
```
Python is installed via python-build-standalone (pre-compiled binaries from Astral). This means any Python version (3.10–3.13) works on any base image — you're not tied to the distro's Python.
Image caching¶
Images are content-hashed. If the Dockerfile and source files haven't changed, OpenModal skips the build entirely and reuses the existing image from the registry.
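The caching check can be sketched roughly like this (a minimal sketch — `image_content_hash` and the 16-character tag are illustrative, not OpenModal's actual internals):

```python
import hashlib
from pathlib import Path

def image_content_hash(dockerfile: str, source_files: list[Path]) -> str:
    """Hash the Dockerfile plus all source files.

    Identical inputs always produce the identical tag, so an unchanged
    build can be skipped by checking the registry for that tag."""
    h = hashlib.sha256()
    h.update(dockerfile.encode())
    for path in sorted(source_files):
        h.update(path.name.encode())   # include filename so renames invalidate
        h.update(path.read_bytes())    # include contents so edits invalidate
    return h.hexdigest()[:16]
```

If an image tagged with this hash already exists in the registry, the build and push are skipped and that image is reused.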
Sandboxes¶
Sandboxes are long-running containers you can exec commands into — like SSH-ing into a machine. They're used by coding agents (CooperBench, Harbor/SWE-bench) that need to run bash commands, edit files, and run tests inside a codebase.
```mermaid
sequenceDiagram
    participant Agent as Your Code
    participant K8s as K8s API
    participant Pod as Sandbox Pod
    Agent->>K8s: Sandbox.create(image=..., timeout=300)
    K8s->>Pod: Start pod running "sleep 300"
    Agent->>Pod: sandbox.exec("git diff")
    Pod-->>Agent: stdout, stderr, returncode
    Agent->>Pod: sandbox.exec("python test.py")
    Pod-->>Agent: stdout, stderr, returncode
    Agent->>K8s: sandbox.terminate()
    K8s->>Pod: Delete pod
```
The pod runs `sleep <timeout>` as its main process — this keeps the container alive while you exec commands into it. Each exec call runs a separate process inside the same container, sharing the same filesystem. Under the hood, `exec_in_pod` uses the Kubernetes exec API (a websocket to the kubelet). On local Docker, it's just `docker exec`.
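The exec model — independent processes sharing one filesystem — can be illustrated locally with `subprocess` (a stand-in sketch; the real path goes through the Kubernetes exec API or `docker exec`):

```python
import subprocess
import tempfile

def exec_local(cmd: str, cwd: str) -> tuple[str, str, int]:
    """Local stand-in for sandbox.exec(): run one shell command and return
    the same (stdout, stderr, returncode) triple the exec API yields."""
    proc = subprocess.run(cmd, shell=True, cwd=cwd, capture_output=True, text=True)
    return proc.stdout, proc.stderr, proc.returncode

workdir = tempfile.mkdtemp()                           # plays the container filesystem
exec_local("echo data > state.txt", workdir)           # first exec writes state
out, err, code = exec_local("cat state.txt", workdir)  # a later exec sees it
```

Each call is a fresh process, but state written to the shared filesystem persists across calls — exactly the property coding agents rely on.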
Default resource requests¶
Every sandbox pod requests 0.25 CPU and 256 MB RAM. This matters for autoscaling — it tells the scheduler how many pods fit on a node. OpenModal sets these defaults automatically so the autoscaler works out of the box.
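In Kubernetes terms, the defaults translate to a requests block roughly like this (a sketch — `DEFAULT_RESOURCES` and `pods_per_node` are illustrative names, not OpenModal internals):

```python
# Defaults attached to every sandbox pod, in Kubernetes notation:
# 250m = 0.25 CPU, 256Mi = 256 MB. The scheduler bin-packs on "requests".
DEFAULT_RESOURCES = {"requests": {"cpu": "250m", "memory": "256Mi"}}

def pods_per_node(allocatable_cpu_m: int, request_cpu_m: int = 250) -> int:
    """Upper bound on sandbox pods per node, by CPU requests alone."""
    return allocatable_cpu_m // request_cpu_m
```

For example, a node with 8000m of allocatable CPU can hold at most 32 such pods by CPU; in practice memory and system pods reduce that further.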
Remote function execution¶
When you call `f.remote(x)`, your arguments are serialized (pickled), sent to a pod, and the result is pickled and sent back:
```mermaid
sequenceDiagram
    participant Client as Your machine
    participant Agent as Pod: openmodal agent
    participant Func as Your function
    Client->>Agent: Pickled (func_name, args, kwargs)
    Agent->>Agent: Import your module as "_user_app"
    Agent->>Agent: Unpickle args
    Agent->>Func: Call function(args, kwargs)
    Func-->>Agent: Return value
    Agent-->>Client: Pickled result
```
The agent registers your module as `_user_app` in `sys.modules` before unpickling. This is critical — when you pass a dataclass or Pydantic model as an argument, Python pickles it with its module path (e.g., `_user_app.TrainingConfig`). The agent needs that module to exist to reconstruct the object.
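The mechanism can be sketched like this (illustrative code, not the agent's actual source — but the ordering constraint is real):

```python
import importlib.util
import sys

def load_user_module(path: str):
    """Import the user's file under the fixed name '_user_app', so pickled
    objects that reference _user_app.<ClassName> can be reconstructed."""
    spec = importlib.util.spec_from_file_location("_user_app", path)
    module = importlib.util.module_from_spec(spec)
    sys.modules["_user_app"] = module  # register BEFORE unpickling anything
    spec.loader.exec_module(module)
    return module
```

Classes defined in the loaded file get `__module__ == "_user_app"`, so instances pickled on one side unpickle cleanly on the other as long as both sides registered the module under that name first.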
`f.map()` — parallel execution¶
`f.map(inputs)` creates one pod per input and runs them in parallel across the cluster:
```mermaid
graph TB
    Client[Your machine]
    Client -->|"f.map([a, b, c, d])"| Pool[ThreadPoolExecutor]
    Pool --> Pod1[Pod 1: f-a]
    Pool --> Pod2[Pod 2: f-b]
    Pool --> Pod3[Pod 3: f-c]
    Pool --> Pod4[Pod 4: f-d]
    Pod1 -.->|result| Client
    Pod2 -.->|result| Client
    Pod3 -.->|result| Client
    Pod4 -.->|result| Client
```
Each pod runs on potentially different nodes. Results are yielded as they complete — you don't wait for all pods to finish before getting the first result.
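The client-side fan-out can be sketched with `concurrent.futures` (here `run_one` is a stand-in for the real per-pod round trip of pickling, scheduling, and collecting a result):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_one(x):
    """Stand-in for one pod: ship x, run f remotely, return the result."""
    return x * x

def fan_out(inputs):
    """One task per input; yield results in completion order, like f.map()."""
    with ThreadPoolExecutor(max_workers=len(inputs)) as pool:
        futures = [pool.submit(run_one, x) for x in inputs]
        for fut in as_completed(futures):
            yield fut.result()
```

`as_completed` is what gives the streaming behavior: the first pod to finish produces the first result, regardless of input order.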
GPU serving and scale-to-zero¶
When you deploy a web server (e.g., vLLM), OpenModal creates a GPU pod and monitors it for idle connections. If nobody connects for `scaledown_window` seconds, the pod is deleted and the GPU is released.
```mermaid
stateDiagram-v2
    [*] --> Deployed: openmodal deploy
    Deployed --> Serving: requests arrive
    Serving --> Idle: no connections
    Idle --> Serving: new request
    Idle --> ScaledToZero: idle > scaledown_window
    ScaledToZero --> Deployed: openmodal deploy
```
How it works per provider¶
- GCP: A CronJob runs every 60 seconds, checks active connections via a shell script, and deletes the pod if idle
- AWS / Azure: KEDA (Kubernetes Event-Driven Autoscaler) watches metrics and scales the deployment to zero replicas when idle
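Whatever the mechanism, the decision made each cycle is the same; a minimal sketch (function and parameter names are illustrative):

```python
def should_scale_to_zero(active_connections: int,
                         seconds_since_last_activity: float,
                         scaledown_window: float) -> bool:
    """Scale to zero only when there are no connections AND the pod has
    been idle longer than the configured window."""
    return active_connections == 0 and seconds_since_last_activity > scaledown_window
```

Both conditions matter: a brief gap between requests shorter than `scaledown_window` never tears down the pod, so warm GPUs survive bursty traffic.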
Cost¶
| State | What's running | Approximate cost |
|---|---|---|
| Serving requests | GPU node + pod | ~$1.20/hr (H100 spot) |
| Idle, within scaledown window | Same | Same |
| Scaled to zero | Control plane + default node | ~$0.10/hr |
| Cluster deleted | Nothing | $0 |
Cluster autoscaling¶
When many pods are created at once (e.g., CooperBench running 60 agents), the cluster scales up automatically.
```mermaid
sequenceDiagram
    participant App as Your app
    participant Sched as K8s Scheduler
    participant CA as Cluster Autoscaler
    participant Cloud as Cloud API
    App->>Sched: Create 60 pods (0.25 CPU each)
    Sched->>Sched: Existing node fits ~12
    Note over Sched: 12 Running, 48 Pending
    CA->>Cloud: 48 Pending → add 2 nodes
    Cloud-->>CA: Nodes ready (~60s)
    Sched->>Sched: Schedule remaining pods
    Note over Sched: All 60 Running
    Note over App,Cloud: Pods complete, nodes idle 5 min...
    CA->>Cloud: Remove idle nodes
```
OpenModal sets default resource requests (0.25 CPU, 256 MB) on every sandbox pod, so the scheduler correctly distributes pods across nodes and the autoscaler fires when needed.
Provider comparison¶
| | GCP (GKE) | AWS (EKS) | Azure (AKS) |
|---|---|---|---|
| Autoscaler | GKE cluster autoscaler | Karpenter | AKS cluster autoscaler |
| Sandbox nodes | `e2-standard-8` pool | Karpenter picks best fit | `Standard_D8s_v5` |
| Max nodes | 100 per zone | 100 CPU limit | 100 |
| Scale-up time | ~60s | ~30-60s | ~60-90s |
| GPU nodes | Separate pool per GPU type | Karpenter auto-provisions | Separate pool per GPU |
Volumes¶
Volumes sync data between cloud storage and pod filesystems. No CSI drivers or IAM admin permissions needed — it uses init containers and sidecars.
```mermaid
sequenceDiagram
    participant Cloud as Cloud Storage
    participant Init as Init Container
    participant Main as Main Container
    participant Sidecar as Sidecar
    Note over Init,Main: Pod starts
    Init->>Cloud: Sync data down to /vol
    Init-->>Main: Done, volume ready
    Note over Main: Your code runs, reads/writes /vol
    Note over Main,Sidecar: Pod shutting down
    Sidecar->>Cloud: Sync /vol back up to cloud
```
All three containers (init, main, sidecar) share an emptyDir volume — an ephemeral disk on the node. The init container downloads data before your code starts. The sidecar uploads changes when the pod shuts down.
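A sketch of the generated pod spec, expressed as a Python dict (illustrative names and commands, GCP flavor — not OpenModal's literal output):

```python
# All three containers mount the same emptyDir at /vol.
VOL_MOUNT = {"name": "vol", "mountPath": "/vol"}

pod_spec = {
    "volumes": [{"name": "vol", "emptyDir": {}}],  # ephemeral disk on the node
    "initContainers": [{
        "name": "vol-download",  # runs to completion before the main container
        "command": ["gcloud", "storage", "rsync", "-r", "gs://bucket/vol", "/vol"],
        "volumeMounts": [VOL_MOUNT],
    }],
    "containers": [
        # Your code reads and writes /vol like a normal directory.
        {"name": "main", "volumeMounts": [VOL_MOUNT]},
        # Sidecar: on shutdown (SIGTERM), sync /vol back up to cloud storage.
        {"name": "vol-upload",
         "command": ["sh", "-c",
                     "trap 'gcloud storage rsync -r /vol gs://bucket/vol' TERM; "
                     "sleep infinity & wait"],
         "volumeMounts": [VOL_MOUNT]},
    ],
}
```

The `emptyDir` is what ties the three containers together: it lives as long as the pod does, so data staged by the init container is visible to both the main container and the sidecar.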
| Provider | Cloud storage | Sync tool |
|---|---|---|
| GCP | GCS bucket | gcloud storage rsync |
| AWS | S3 bucket | aws s3 sync |
| Azure | Azure Blob | az storage blob sync |
| Local | `~/.openmodal/volumes/` | Direct bind mount |
Networking¶
How your machine talks to pods differs by provider:
```mermaid
graph LR
    subgraph GCP
        You1[Your machine] -->|direct HTTP| PodGCP[Pod 10.x.x.x]
    end
    subgraph AWS / Azure
        You2[Your machine] -->|localhost:PORT| KPF[kubectl port-forward]
        KPF -->|tunnel| PodAWS[Pod 10.x.x.x]
    end
```
| Provider | How | Why | Latency overhead |
|---|---|---|---|
| GCP | Direct pod IP | GKE pods get routable IPs | ~0ms |
| AWS | `kubectl port-forward` | EKS pod IPs are VPC-internal | ~100ms |
| Azure | `kubectl port-forward` | AKS pod IPs are VPC-internal | ~100ms |
| Local | Container IP / `docker exec` | Docker bridge network | ~0ms |
This matters for web servers (vLLM, FastAPI). For sandboxes, all providers use the K8s exec API which has similar latency everywhere.
Provider abstraction¶
All providers implement the same `CloudProvider` interface. Your code never touches the provider directly.
```mermaid
classDiagram
    class CloudProvider {
        <<abstract>>
        +create_instance(spec, image_uri, name)
        +delete_instance(name)
        +create_sandbox_pod(name, image, timeout, gpu, cpu, memory)
        +exec_in_pod(pod_name, *args)
        +build_image(dockerfile_dir, name, tag)
        +copy_to_pod(pod_name, local, remote)
        +copy_from_pod(pod_name, remote, local)
        +ensure_volume(name)
        +stream_logs(instance_name)
    }
    CloudProvider <|-- GKEProvider
    CloudProvider <|-- EKSProvider
    CloudProvider <|-- AKSProvider
    CloudProvider <|-- LocalProvider
```
The provider is selected by:
- CLI flag: `--local`, `--aws`, `--azure` (GCP is the default)
- Environment variable: `OPENMODAL_PROVIDER=local|gcp|aws|azure`
Switching providers changes where your code runs, not how you write it.
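The selection logic can be sketched as follows (a sketch under one assumption not stated above: that the CLI flag takes precedence over the environment variable):

```python
FLAG_TO_PROVIDER = {"--local": "local", "--aws": "aws", "--azure": "azure"}

def select_provider(argv: list[str], env: dict[str, str]) -> str:
    """Resolve the provider: explicit CLI flag first, then the
    OPENMODAL_PROVIDER environment variable, then GCP as the default."""
    for flag, name in FLAG_TO_PROVIDER.items():
        if flag in argv:
            return name
    return env.get("OPENMODAL_PROVIDER", "gcp")
```

Because everything downstream goes through the `CloudProvider` interface, this one function is the only place the choice is made.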