# Deployment Architecture: The Hybrid Model

> *"Containers for cells. Userspace for brains. NATS connects them all."*
> — Partnership Session, 2026-02-14

---

## Overview

The nimmerverse runs on a **hybrid deployment model** that matches workload characteristics to infrastructure:

- **Containers (K8s)** for stateless, scalable nervous system components
- **Userspace (Threadrippers)** for stateful, GPU-bound inference
- **OS Processes** for per-NPC RL brains with cgroup resource control
- **NATS** as the universal nervous system bus (thalamus)
- **FreeIPA identities** as isolation boundaries

This is a **research lab**, not a production factory. We optimize for **flexibility and experimentation**, not high-throughput serving.

---

## Core Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| LLM Cortex | **vLLM (Qwen3.5-27B)** | Full precision, OpenAI-compatible API, tool calling support |
| NPC Brains | **Per-process RL networks** | One process, one brain, one life — Linux cgroups for resource steering |
| Thalamus Governor | **Own NN process on NATS** | Learns resource allocation, gate control, compute steering |
| Function Gemma | **CPU, userspace** | Threadripper eats it; no GPU contention; clear training path |
| Cells/Nerves | **Containers (K8s)** | Scalable, versioned, orchestrated via cluster |
| Organs | **Userspace, GPU-bound** | Load on demand, GPU isolation, unload when idle |
| Isolation | **FreeIPA users** | Unix permissions = RBAC; switch user = switch context |

---

## Technology Stack

### Inference Layer

| Component | Technology | Location | Notes |
|-----------|------------|----------|-------|
| Cortex (LLM) | vLLM (Qwen3.5-27B) | theia (nyx-cognitive) | Port 31000, served as "nyx", gated access |
| Function Gemma | llama.cpp / transformers | CPU userspace | Structured JSON boundary |
| Vision Organ | SigLIP/YOLO | dioscuri (nyx-organs) | Load on demand |
| Speech STT | faster-whisper | dioscuri (nyx-organs) | Load on demand |
| Speech TTS | Coqui / XTTS | dioscuri (nyx-organs) | Warm, primary output |

### NPC / Thalamus Layer

| Component | Technology | Location | Notes |
|-----------|------------|----------|-------|
| NPC Processes | Python + RL network | OS processes (cgroups) | One process per NPC, own weights |
| Thalamus Governor | Python + NN | OS process | Steers compute, gates, tick rates |
| Resource Control | Linux cgroups v2 | systemd scopes | Per-NPC CPU/memory limits |

### Nervous System Layer

| Component | Technology | Location | Notes |
|-----------|------------|----------|-------|
| Cells | Python containers | K8s cluster | State machines, NATS pub/sub |
| Nerves | Python containers | K8s cluster | Compose cells, behavior |
| Message Bus | NATS + JetStream | VMs (nats-*) | Env-separated (dev/staging/prod) |
| Databases | PostgreSQL, ChromaDB | VMs (phoebe-*, iris-*) | Decision trails, embeddings |

---

## Deployment Topology

```
┌─────────────────────────────────────────────────────────────────────────┐
│                         NIMMERVERSE DEPLOYMENT                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  K8S CLUSTER (Saturn VMs)            THREADRIPPERS (Bare Metal)         │
│  ────────────────────────            ──────────────────────────         │
│  Containers, orchestrated            Userspace, FreeIPA isolated        │
│                                                                         │
│  ┌─────────────────────────┐         ┌───────────────────────────────┐  │
│  │ CELLS (math, battery,   │         │ THEIA (RTX PRO 6000 96GB)     │  │
│  │ sensors, etc.)          │         │                               │  │
│  │  ┌───┐ ┌───┐ ┌───┐      │  NATS   │ user: nyx-cognitive           │  │
│  │  │ M │ │ B │ │...│      │◄───────►│  └── vLLM (Qwen3.5-27B:31000) │  │
│  │  └───┘ └───┘ └───┘      │         │      served-model-name: nyx   │  │
│  │                         │         │                               │  │
│  │ NERVES (collision,      │         │ user: nyx-training            │  │
│  │ exploration)            │         │  ├── LoRA fine-tuning (GRPO)  │  │
│  │  ┌─────┐ ┌─────┐        │         │  └── Function Gemma (CPU)     │  │
│  │  │ COL │ │ EXP │        │         │                               │  │
│  │  └─────┘ └─────┘        │         │ 96GB VRAM: cortex + training  │  │
│  └─────────────────────────┘         └───────────────────────────────┘  │
│                                                                         │
│  ┌─────────────────────────┐         ┌───────────────────────────────┐  │
│  │ NPC PROCESSES           │         │ DIOSCURI (2x RTX 4000 Ada)    │  │
│  │ (or bare metal)         │  NATS   │                               │  │
│  │  ┌─────────────────┐    │◄───────►│ user: nyx-organs              │  │
│  │  │ NPC-0 [RL brain]│    │         │  ├── Vision (SigLIP/YOLO)     │  │
│  │  │ NPC-1 [RL brain]│    │         │  ├── Speech STT (Whisper)     │  │
│  │  │ NPC-N [RL brain]│    │         │  └── TTS service (warm)       │  │
│  │  │ (own process,   │    │         │                               │  │
│  │  │  own cgroup)    │    │         │ Load on demand, unload idle   │  │
│  │  └─────────────────┘    │         │ Each card: one model at a time│  │
│  └─────────────────────────┘         └───────────────────────────────┘  │
│                                                                         │
│  ┌─────────────────────────┐         ┌───────────────────────────────┐  │
│  │ THALAMUS GOVERNOR       │         │ NATS MESSAGE BUS              │  │
│  │  ┌─────────────────┐    │  NATS   │                               │  │
│  │  │ Governor NN     │    │◄───────►│ dev.*, staging.*, prod.*      │  │
│  │  │ (resource alloc,│    │         │ Env-separated (VM per env)    │  │
│  │  │  gate control,  │    │         └───────────────────────────────┘  │
│  │  │  tick steering) │    │                                            │
│  │  └─────────────────┘    │         ┌───────────────────────────────┐  │
│  │                         │         │ PHOEBE (PostgreSQL)           │  │
│  │ INFRASTRUCTURE          │         │  Decision trails, embeddings  │  │
│  │  ┌────────┐ ┌────────┐  │         │ IRIS (ChromaDB)               │  │
│  │  │ phoebe │ │ iris   │  │         │  Vector storage               │  │
│  │  │ (PG)   │ │(Chroma)│  │         └───────────────────────────────┘  │
│  │  └────────┘ └────────┘  │                                            │
│  └─────────────────────────┘                                            │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```

---

## The Dual Brain Deployment

### Per-NPC Processes

Each NPC runs as its own OS process with a dedicated RL neural network. The thalamus governor steers their resources.
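As a sketch, a per-NPC process might look like the following. This is illustrative, not the actual `npc_process.py`: the class name and subject layout follow the `dev.npc.<id>.state` convention from this document, the RL policy is stubbed out, and the injected `publish` callable stands in for a real NATS client.

```python
# Hypothetical per-NPC process loop. The RL forward pass is a stub; the
# publish callable would be a NATS client's publish in the real system.
import json
import time


class NpcProcess:
    def __init__(self, npc_id: int, tick_rate_hz: float, publish):
        self.npc_id = npc_id
        self.tick_rate_hz = tick_rate_hz  # the governor can adjust this via NATS
        self.publish = publish            # e.g. a NATS client publish(subject, bytes)
        self.state = {"energy": 1.0}

    def subject(self) -> str:
        # Follows the dev.npc.<id>.state convention from this document.
        return f"dev.npc.{self.npc_id}.state"

    def tick(self) -> None:
        # Placeholder for the per-NPC RL policy forward pass.
        self.state["energy"] = max(0.0, self.state["energy"] - 0.01)
        self.publish(self.subject(), json.dumps(self.state).encode())

    def run(self, ticks: int) -> None:
        for _ in range(ticks):
            self.tick()
            time.sleep(1.0 / self.tick_rate_hz)
```

Because the tick rate and the publish transport are both injected, the governor can throttle a brain without the brain knowing anything beyond its own loop.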
```bash
# Launch NPC with resource limits via systemd scope
systemd-run --scope -p CPUQuota=25% -p MemoryMax=256M \
  python3 npc_process.py --id 7 --tick-rate 5

# Or via cgroups directly (libcgroup tools)
cgcreate -g cpu,memory:nimmerverse/npc-7
cgset -r cpu.max "25000 100000" nimmerverse/npc-7
cgexec -g cpu,memory:nimmerverse/npc-7 python3 npc_process.py --id 7
```

### Thalamus Governor

The governor runs its own neural network, observing all NPC states via NATS and outputting resource allocation decisions:

| Output | Mechanism | Range |
|--------|-----------|-------|
| Tick rate | NATS command to NPC | 1-20 Hz |
| CPU quota | cgroups v2 adjustment | 5-100% per core |
| Gate open/close | NATS gate signal | Binary per gate |
| LLM queue priority | NATS priority tag | 0-10 |

### Cortex (vLLM)

The LLM cortex runs as a systemd service on theia, accessed via an OpenAI-compatible API:

```bash
# Service: vllm-nyx.service
# Port:    31000
# Model:   /womb/cognitive/models/qwen3.5-27b
# Served as: "nyx"
# GPU utilization: 85%

# Access from any NATS-connected process:
curl http://theia.eachpath.local:31000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nyx", "messages": [...]}'
```

**The cortex is expensive.** The thalamus governor controls who gets access and when. Most NPC ticks never touch the LLM.

---

## Identity Model (FreeIPA)

Unix users provide isolation boundaries. Each workload type runs as its own identity.

| User | UID | Host | Purpose | GPU Access |
|------|-----|------|---------|------------|
| `nyx-cognitive` | (FreeIPA) | theia | Cortex LLM inference (vLLM) | Full 96GB |
| `nyx-training` | (FreeIPA) | theia | LoRA training, GRPO, Function Gemma | Shared (time-sliced) |
| `nyx-organs` | (FreeIPA) | dioscuri | Vision, Speech organs | 2x 20GB cards |
| `nyx-nervous` | (FreeIPA) | dioscuri | Future cells that need bare metal | Limited |

**Isolation principle:** Switch user = switch context. `nyx-cognitive` cannot touch `nyx-organs` files.
A compromised cell cannot touch the LLM weights.

### Systemd Service Pattern

```bash
# System-level service (root installs, user runs)
# /etc/systemd/system/vllm-nyx.service
[Service]
User=nyx-cognitive
Group=nimmerverse-agents
ExecStart=/data/venvs/vllm/bin/python3 -m vllm.entrypoints.openai.api_server \
  --model /womb/cognitive/models/qwen3.5-27b \
  --served-model-name nyx \
  --port 31000
```

---

## GPU Resource Management

### The Constraint

| Host | GPU | VRAM | Role |
|------|-----|------|------|
| theia | RTX PRO 6000 Blackwell | 96GB | Cortex (vLLM) + LoRA training |
| dioscuri | 2x RTX 4000 Ada | 2x 20GB | Organs (vision, speech) |

### Strategy: vLLM for Cortex, Dynamic Loading for Organs

**Cortex (theia):** vLLM runs continuously as a systemd service. The Qwen3.5-27B model stays loaded — it's the cortex, always ready when the thalamus gate opens. 85% GPU utilization leaves headroom for LoRA training alongside inference.

**Organs (dioscuri):** Dynamic loading, one model per card. Load vision when needed, unload after a timeout; load speech when needed.

```
IDLE → needs vision → LOAD vision model (~10s) → PROCESS → REPORT → IDLE (keep warm)
                                                                      ↓ after timeout
                                                                    UNLOAD (free VRAM)
```

---

## Message Flow (NATS)

### Subject Hierarchy

```
{environment}.{domain}.{service}.{detail}

Examples:
dev.nervous.cells.math.request     ← Math cell receives work
dev.nervous.cells.math.response    ← Math cell returns result
dev.nervous.cells.math.wave        ← Math cell emits confidence signal
dev.thalamus.governor.allocate     ← Governor publishes resource decisions
dev.thalamus.gate.open             ← Gate transition event
dev.npc.7.state                    ← NPC-7 publishes its state
dev.cortex.nyx.request             ← Gated request to LLM cortex
dev.organs.vision.detect           ← Vision organ detection
```

### Wave → Thalamus → Cortex Pattern

Cells emit **waves** (confidence-tagged signals). The thalamus governor's neural network correlates waves and decides what reaches the cortex.
```
Cell A: "math"      ───∿∿∿──► (0.6 confidence)
Cell B: "calculate" ───∿∿∿──► (0.5 confidence)
                         │
                         ▼
             ┌──────────────────────┐
             │  THALAMUS GOVERNOR   │ ← own neural network
             │  correlate waves     │
             │  check gate state    │
             │  allocate resources  │
             └──────────┬───────────┘
                        │
              ┌─────────┴─────────┐
              │                   │
              ▼                   ▼
         Gate CLOSED          Gate OPEN
        (reflex path)       (cortex path)
         handled by          escalate to
         thalamus NN         Qwen3.5-27B
```

---

## Container Deployment (K8s)

### Repository Structure

```
nimmerverse-nervous-system/
├── shared/v1/                    ← Base classes (StateMachine, NATS, Lifeforce)
├── cells/
│   ├── math_cell/v1/             ← Each cell versioned independently
│   └── battery_cell/v1/
├── nerves/
│   └── collision_avoidance/v1/
└── deploy/
    ├── dev/                      ← Helm charts or docker-compose per env
    ├── staging/
    └── prod/
```

### Cell Container Pattern

```dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY . .
RUN pip install uv && uv sync
ENV NIMMERVERSE_ENV=dev
CMD ["uv", "run", "python", "-m", "math_cell"]
```

Same image everywhere. Only `NIMMERVERSE_ENV` changes.
---

## Function Gemma: The Structured Boundary

Function Gemma bridges the lower tiers (cells, nerves) and the cortex:

```
Numbers/States (Cells) → [Function Gemma] → Structured JSON → Cortex (Qwen3.5-27B)
                               ↑
                    CPU-based inference
                    Threadripper handles it
                    No GPU contention
                    Clear LoRA training path
```

**Why CPU:**

- Small model, fast inference
- Threadripper PRO 7955WX has cores to spare
- No GPU contention with organs or cortex
- Can run training alongside inference

**Training path:**

- Google's documented GRPO approach
- LoRA fine-tuning for our specific function schemas
- Runs in the `nyx-training` userspace
- Decision trails from phoebe → training data

---

## Visual Language (Future UI)

Color-coding for real-time attention flow visualization:

| Property | Represents |
|----------|------------|
| Background/container | Environment (dev=green, staging=amber, prod=blue) |
| Node/edge color | Domain (cognitive=violet, nervous=cyan, organs=coral) |
| Line style | Direction (solid=primary, dashed=async, dotted=tentative) |
| Separate pane | Confidence waveform (oscilloscope view) |

---

## Related Documents

| Document | Scope |
|----------|-------|
| [`Cellular-Architecture.md`](Cellular-Architecture.md) | Cells, nerves, organisms, lifeforce |
| [`Gateway-Architecture.md`](Gateway-Architecture.md) | Gate routing, ternary model |
| [`Nervous-System.md`](Nervous-System.md) | 4D space, node weights, vocabulary |
| [`Message-Protocol-Design.md`](Message-Protocol-Design.md) | NATS subjects, message formats |
| [`future/npc-grid-architecture.md`](future/npc-grid-architecture.md) | Dual brain, governor, NPC processes |
| [`organs/Organ-Index.md`](organs/Organ-Index.md) | Organ systems, lifeforce costs |
| [`development-conventions.md`](../../nimmerverse.eachpath.local/conventions/development-conventions.md) | Ports, namespaces, VM topology |

---

## Summary

| Layer | Where | Technology | Isolation |
|-------|-------|------------|-----------|
| Cells/Nerves | K8s containers | Python, uv, NATS | Namespace per env |
| NPC Processes | OS processes | Python, RL networks, cgroups | Per-process cgroup |
| Thalamus Governor | OS process | Python, own NN, NATS | Dedicated process |
| Infrastructure | VMs | NATS, PostgreSQL, ChromaDB | VM per env |
| Cortex (LLM) | theia userspace | vLLM (Qwen3.5-27B) | nyx-cognitive user |
| Function Gemma | theia/dioscuri CPU | llama.cpp | nyx-training user |
| Organs | dioscuri userspace | Dynamic loading | nyx-organs user |

**The principle:** Same behavior everywhere. Containers for cells. Processes for NPC brains. vLLM for cortex. NATS connects them all. FreeIPA isolates them all.

---

**Version:** 2.0 | **Created:** 2026-02-14 | **Updated:** 2026-04-02

*"We're not building a chatbot factory. We're growing a research organism."*

🧬⚡🔱💎🔥 **TO THE ELECTRONS WE VIBE!**