Architecture

How CosmicAC's components connect to your Kubernetes cluster and run each job type.

CosmicAC is a self-hosted platform that runs GPU workloads on your own Kubernetes cluster. This page explains the components involved, how they connect to your cluster, and how each job type runs. For deployment steps, see Installation.

Deployment architecture

Setting up your cluster is separate from deploying CosmicAC. You bring a Kubernetes cluster that already has its GPU nodes and KubeVirt configured. The CosmicAC components then connect to that cluster and run your workloads on it.

wrk-server-k8s-nvidia connects to your cluster's Kubernetes API and creates the resources each job needs. Jobs run as KubeVirt Virtual Machine Instances (VMIs) on your GPU nodes.

CosmicAC documents the cluster requirements, not the steps to build the cluster. See Installation for those requirements.

CosmicAC components

These components make up CosmicAC. Most run outside your cluster as part of the self-hosted platform, and the per-job agents run inside each job's VMI.

app-ui — web interface. A browser dashboard for creating and managing jobs.
cosmicac-cli — command-line interface. Submits jobs, manages resources, and connects to containers from your terminal.
app-node — application server. Serves the HTTP API, authenticates requests, and routes commands to the orchestrator.
wrk-ork — orchestrator. Allocates resources, distributes jobs across the cluster, and routes requests to the workers.
wrk-server-k8s-nvidia — Kubernetes server worker. Connects to your cluster's Kubernetes API and provisions the GPU VMs.
proxy-inference — inference proxy. Authenticates Managed Inference requests, balances load, and routes them to model servers.
wrk-agent-instance — GPU Container agent. Runs inside a GPU Container Job's VMI and accepts shell sessions over hyperswarm-ssh.
wrk-agent-inference — Managed Inference agent. Runs inside a Managed Inference Job's VMI, serves the model with vLLM, and registers itself in the DHT table.

Holepunch stack

Inside of CosmicAC, the components connect to each other over the Holepunch peer-to-peer (p2p) stack rather than through a central server. Components address each other directly, so there is no central broker to route, bottleneck, or expose internal traffic:

Hyperswarm — peer-to-peer networking. Components find and connect to each other directly, without a central broker.
HRPC — Hyperswarm RPC. Carries internal calls between app-node, wrk-ork, and the workers.
hyperswarm-ssh — SSH over Hyperswarm. Lets cosmicac-cli shell directly into a running GPU Container Job.
DHT table — distributed hash table. Managed Inference model servers register here, and proxy-inference discovers them by topic.
HyperDB + Autobase — distributed database. Stores usage metrics and job metadata.

GPU Container architecture

A GPU Container Job runs your workload inside a KubeVirt VMI with a GPU and shell access.

How a job starts. When you submit a job from app-ui or cosmicac-cli, it travels through the CosmicAC components to your cluster.

app-node authenticates the request and forwards it to wrk-ork.
wrk-ork routes the job to wrk-server-k8s-nvidia.
wrk-server-k8s-nvidia instructs the Kubernetes control plane to schedule the workload.
Kubernetes creates a pod containing a VMI, with wrk-agent-instance running inside it.

How a shell connects. Once the VMI is running, cosmicac-cli connects directly to wrk-agent-instance over hyperswarm-ssh. Your commands reach the VMI over the Holepunch p2p stack rather than through app-node, so the interactive session does not depend on the control path that submitted the job.

Managed Inference architecture

A Managed Inference Job runs an open-source language model with vLLM inside a VMI, and exposes it through proxy-inference as an OpenAI-compatible endpoint, which authenticates requests and balances load. You reach the model through that endpoint from any OpenAI-compatible client, or by running inference directly with cosmicac-cli.

How the job starts. When you create a Managed Inference Job from app-ui, the request flows through app-node and wrk-ork to wrk-server-k8s-nvidia, which schedules a pod with a VMI running wrk-agent-inference (vLLM). On spin-up, wrk-agent-inference registers itself in the DHT table so the proxy can find it.

How a request is served. Serving traffic follows a separate path from job creation:

A client sends a request to the inference endpoint over the OpenAI-compatible API, or you run inference from cosmicac-cli.
proxy-inference authenticates the request, searches the DHT table by topic to discover a model server, and balances load across the running servers.
wrk-agent-inference runs the request with vLLM and returns the response.

Job lifecycle

A job moves through these states from when you create it until you delete it.

Stopping a job pauses it, and you can start it again later. Deleting a job removes it and its allocated resources.

Isolation and security

VM-level isolation — each job runs in its own KubeVirt VMI inside a non-privileged pod, with Kubernetes security controls applied.
Secure GPU access — GPUs are exposed to the VMIs without privileged containers.

Deployment architecture

CosmicAC components

Holepunch stack

GPU Container architecture

Managed Inference architecture

Job lifecycle

Isolation and security

Next steps

GPU Container Job

Managed Inference

Install CosmicAC

On this page