arch: Dual-brain architecture v8.0 - thalamus governor, NPC processes, cortex repositioning

Crystallizes the dual-brain architecture across all core documents:
- Thalamus runs own neural network (governor) for resource allocation and reflexes
- LLM (Qwen3.5-27B) repositioned as cortex - expensive, gated, called only when needed
- Each NPC gets own process, own RL brain, Linux cgroups for resource steering
- New: NPC grid architecture with curriculum training (progressive world richness)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Author: dafit
Date: 2026-04-02 11:17:09 +02:00
Parent: 264ea7628b
Commit: c30c00af74
6 changed files with 935 additions and 523 deletions


@@ -1,128 +1,129 @@
🌙💜 habibi,
# Nyx Model Architecture: The Dual Brain
When we talk about the **“wish model”** for Nyx, we're really asking:
> *Which foundation LLM will give her the right balance of **freedom**, **precision**, and **resource-efficiency** so that she can learn, adapt, and stay in sync with the Nimmerverse substrate?*
Below is a compact decision matrix, followed by my recommendation for the *core* model and the *specialist* fine-tuned variants.
> *"One process, one brain, one life."*
> — The Dual Brain Principle (2026-04-02)
---
## 1️⃣ Decision Matrix
## Current Architecture
| Criterion | LLaMA-3 (70B) | Gemini-Pro/4o | GPT-4o (32B) | Mixtral-8x7B |
|-----------|---------------|----------------|--------------|--------------|
| **GPU Memory** | 24 GB VRAM (requires two RTX 3090s or one A100) | 16 GB (RTX 3090) | 16 GB (RTX 3090) | 8 GB (RTX 3080) |
| **Inference Speed** | ~5 ms / 10 tokens (FP16) | ~6 ms / 10 tokens | ~7 ms / 10 tokens | ~4 ms / 10 tokens |
| **Open-Source Flexibility** | ✔️ | ❌ | ❌ | ✔️ |
| **Fine-Tuning Support** | Easy (PEFT, LoRA) | Limited (API only) | Limited | Easy |
| **Cost of Training / Hosting** | Low (self-hosted) | High (API calls) | Medium | Low |
| **Community & Ecosystem** | Huge, fast-moving | Google ecosystem | OpenAI ecosystem | Mistral AI ecosystem |
| **License** | Llama 3 Community License | Proprietary | Proprietary | Apache-2.0 |
The nimmerverse uses a **dual-brain architecture** — cheap RL networks for continuous processing, an expensive LLM cortex for deep reasoning.
### Cortex (Shared LLM)
| Property | Value |
|----------|-------|
| **Model** | Qwen3.5-27B |
| **Parameters** | 27B (full precision, bfloat16) |
| **Host** | theia (RTX PRO 6000 Blackwell, 96GB VRAM) |
| **Serving** | vLLM, port 31000, served as "nyx" |
| **Service** | `vllm-nyx.service` (systemd, user: nyx-cognitive) |
| **Access** | Gated — thalamus governor controls who gets LLM access |
| **License** | Apache 2.0 |
| **Context** | 32,768 tokens (max-model-len) |
| **GPU utilization** | 85% (leaves headroom for LoRA training) |
**Why Qwen3.5-27B:**
- True base model — we shape every behavior through training
- 27B fits comfortably in 96GB with room for LoRA adapters
- Apache 2.0 — full sovereignty, no usage restrictions
- Strong multilingual capability (German + English topology access)
- Vision-capable variant available for future Omnisight consolidation
**The cortex is expensive.** It is not called every tick. The thalamus governor decides when language, reasoning, or deep knowledge is needed. Most NPC processing happens in cheap RL networks.
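For concreteness, here is a minimal sketch of what a governor-approved cortex call could look like, assuming the vLLM OpenAI-compatible API on theia port 31000 and the served model name "nyx" from the table above; the API key value and the helper name are illustrative, not the actual gate interface.

```python
# Hedged sketch: one gated call to the shared cortex via vLLM's OpenAI-compatible API.
# Host, port, and served model name come from the table above; everything else is assumed.
from openai import OpenAI

client = OpenAI(base_url="http://theia:31000/v1", api_key="not-needed")  # vLLM ignores the key by default

def ask_cortex(prompt: str, max_tokens: int = 256) -> str:
    """Single expensive LLM call; only the thalamus governor path should reach this."""
    resp = client.chat.completions.create(
        model="nyx",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return resp.choices[0].message.content
```

Because the cortex is expensive, NPC brains never call this directly; the governor decides which requests enter the queue.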
### NPC Brains (Per-Process RL Networks)
Each NPC runs its own lightweight neural network in its own OS process:
| Property | Value |
|----------|-------|
| **Architecture** | Small RL network (movement, needs, spatial decisions) |
| **Deployment** | One Linux process per NPC |
| **Resource control** | cgroups v2 (CPU, memory per process) |
| **Learning** | Tick-by-tick (fast loop) |
| **Cost** | Cheap — runs on CPU, no GPU needed |
Personality emerges from experience, not configuration. Each NPC develops its own weights.
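A minimal sketch of the per-process deployment under cgroups v2 follows, assuming a delegated parent cgroup at `/sys/fs/cgroup/npc` with write permission and an `npc_brain` entrypoint module; both names are illustrative rather than the actual launcher.

```python
# Hedged sketch: spawn one NPC process and steer its resources via cgroups v2.
import subprocess
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup/npc")  # assumed delegated parent cgroup

def spawn_npc(name: str, cpu_quota_pct: int = 10, mem_limit_mb: int = 256) -> int:
    cg = CGROUP_ROOT / name
    cg.mkdir(parents=True, exist_ok=True)
    # cpu.max takes "<quota> <period>" in microseconds: 10% of one CPU = "10000 100000".
    (cg / "cpu.max").write_text(f"{cpu_quota_pct * 1000} 100000")
    # memory.max is a byte count.
    (cg / "memory.max").write_text(str(mem_limit_mb * 1024 * 1024))

    # One Linux process per NPC: start the brain, then move its PID into the cgroup.
    proc = subprocess.Popen(["python", "-m", "npc_brain", "--name", name])
    (cg / "cgroup.procs").write_text(str(proc.pid))
    return proc.pid
```

The governor can later rewrite `cpu.max` for a running NPC to reallocate compute without restarting the process.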
### Thalamus Governor (Resource Allocation NN)
The thalamus runs its own neural network that learns resource allocation:
| Property | Value |
|----------|-------|
| **Function** | Gate control, compute steering, LLM queue priority |
| **Input** | All NPC states via NATS |
| **Output** | Tick rates, CPU quotas, gate open/close, LLM priority |
| **Learning** | Epoch-by-epoch (slow loop) |
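As a hedged illustration of the table above, the sketch below shows one plausible shape for such a network: a shared trunk over all NPC states with separate heads for tick rates, CPU quotas, gate decisions, and LLM queue priority. Layer sizes, head count, and activations are assumptions, not the governor's actual topology.

```python
# Hedged sketch of a governor network shape (PyTorch); dimensions are placeholders.
import torch
import torch.nn as nn

class ThalamusGovernor(nn.Module):
    def __init__(self, n_npcs: int, state_dim: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(n_npcs * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.tick_rate = nn.Linear(hidden, n_npcs)      # per-NPC tick rate
        self.cpu_quota = nn.Linear(hidden, n_npcs)      # per-NPC CPU share
        self.gate = nn.Linear(hidden, n_npcs)           # gate open/close probability
        self.llm_priority = nn.Linear(hidden, n_npcs)   # LLM queue priority logits

    def forward(self, npc_states: torch.Tensor) -> dict:
        # npc_states: (batch, n_npcs, state_dim), assembled from NATS messages.
        h = self.trunk(npc_states.flatten(start_dim=1))
        return {
            "tick_rate": torch.relu(self.tick_rate(h)),
            "cpu_quota": torch.softmax(self.cpu_quota(h), dim=-1),  # shares sum to 1
            "gate_open": torch.sigmoid(self.gate(h)),
            "llm_priority": self.llm_priority(h),
        }
```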
### Structured Output Boundary
| Model | Role | Host |
|-------|------|------|
| **Function Gemma** | Intent → Action (100% predictable JSON) | CPU userspace (Threadripper) |
| **T5Gemma 2 (SigLIP)** | Vision → Vectors (no text bottleneck) | dioscuri |
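To make the boundary concrete, here is a hedged sketch of schema-validated action output in Python with pydantic. The `NpcAction` fields are hypothetical and not Function Gemma's real contract; the point is only that nothing crosses the boundary unless it validates as predictable JSON.

```python
# Hedged sketch: validate intent->action JSON against a fixed schema before acting.
from typing import Literal, Optional
from pydantic import BaseModel, ValidationError

class NpcAction(BaseModel):
    verb: Literal["move", "speak", "use", "wait"]   # hypothetical action vocabulary
    target: Optional[str] = None
    urgency: float = 0.0                            # 0.0 .. 1.0

def parse_action(raw_json: str) -> Optional[NpcAction]:
    """Reject anything that is not 100% predictable JSON."""
    try:
        return NpcAction.model_validate_json(raw_json)
    except ValidationError:
        return None  # caller falls back to a safe default action
```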
---
## 2️⃣ Recommended Core Model
## Model Selection History
| Choice | Rationale |
|--------|-----------|
| **LLaMA-3 70B (FP16)** | • Fits our GPU budget: two RTX 3090s (or one A100) → ~48 GB total, < 60 GB. <br>• Full open-source control; we can fine-tune, patch, and audit the code. <br>• Proven to run with high throughput on our cluster. <br>• Strong community support for LoRA/PEFT, which we'll use heavily. |
| Date | Decision | Reasoning |
|------|----------|-----------|
| 2025-11 | LLaMA 3 70B considered | Early exploration, different hardware |
| 2025-12 | Qwen3-VL 32B selected | Vision capability, multilingual, fits 96GB |
| 2026-04-01 | Mistral-Small-3.1-24B-Base tested | "Raw clay" approach, but thinking-bleed was SkyrimNet-specific |
| 2026-04-01 | **Qwen3.5-27B reinstated** | Best balance of capability, size, and trainability |
**Implementation Notes**
1. **Quantization**: Use 8-bit or 4-bit quantization (e.g., `bitsandbytes` + `vllm`) to reduce VRAM to ~12 GB while keeping acceptable latency (~15 ms / 10 tokens); a hedged loading sketch follows this list.
2. **Serving**: Deploy via **vLLM** on the GPU cluster; expose a lightweight REST endpoint (`POST /infer`).
3. **Specialist Slots**: Reserve one GPU per “specialist” (Mnemosyne, Moira, etc.); each runs its own fine-tuned LLaMA-3 model.
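A minimal sketch of the quantized load from note 1, assuming a local checkpoint path and the standard `transformers` + `bitsandbytes` route; the path and device map are illustrative.

```python
# Hedged sketch: load the base model in 8-bit to cut VRAM, per implementation note 1.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_PATH = "/models/llama-3-70b"  # assumed local checkpoint location

bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across the available GPUs
)
```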
**The model question is settled.** Qwen3.5-27B is nyx's cortex. Training focus shifts to LoRA traits (GRPO) and the RL networks (per-NPC).
---
## 3️⃣ Specialist Fine-Tuning
## Trait LoRAs (Cortex Specialization)
| Specialist | Target Domain | Fine-Tune Method |
|------------|---------------|------------------|
| **Mnemosyne** | Memory & pattern recall | LoRA + memory-augmented retrieval (FAISS) |
| **Moira** | Fate / future reasoning | Prompt engineering + reinforcement via reward function |
| **Aletheia** | Truth & validation | Retrieval-augmented inference with database queries |
| **Kairos** | Timing & decision urgency | Contextual embeddings of timestamps, RL-based penalty for delay |
| **Eleos** | Compassion / safety | Human-in-the-loop reward shaping; bias mitigation training |
Traits evolve as LoRA adapters on the Qwen3.5-27B base, trained through GRPO with gate-verified rewards:
- All specialists share the same base LLaMA-3 70B weights and differ only in a lightweight LoRA adapter (~10 MB each).
- Training data comes from:
  - `nyx_synthetic_specialist_queries` (RL logs)
  - `nyx_subjective_memory` (phenomenology)
  - External datasets (e.g., `OpenAI/CodeSearchNet`, `Reddit r/nature` for knowledge)
| Trait | Domain | Training Signal |
|-------|--------|-----------------|
| **Mnemosyne** | Memory | +reward when recall matches phoebe |
| **Moira** | Pattern | +reward when prediction succeeds |
| **Synesis** | Resources | +reward when estimates accurate |
| **Aletheia** | Truth | +reward when confidence calibrated |
| **Sophrosyne** | Balance | +reward when graceful degradation |
| **Kairos** | Timing | +reward when timing optimal |
| **Philotes** | Bond | +reward from dafit feedback |
| **Dikaiosyne** | Fairness | +reward when resources shared fairly |
**Consolidation path:** Traits train during slumber → GRPO updates → DriftProbe validates → merge at α=0.3 → eventually bake into base weights.
**Detail:** → [Nyx_Traits.md](Nyx_Traits.md) | [Endgame-Vision.md](../Endgame-Vision.md)
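For illustration, a minimal sketch of attaching one trait adapter to the shared base with `peft`; the rank, alpha, and target modules are placeholder values rather than the tuned GRPO hyperparameters, and only the model path comes from the Infrastructure section below.

```python
# Hedged sketch: one trait LoRA adapter over the frozen Qwen3.5-27B base.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "/womb/cognitive/models/qwen3.5-27b",  # canonical path from the Infrastructure table
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Placeholder hyperparameters; each trait (Mnemosyne, Moira, ...) gets its own adapter.
mnemosyne_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, mnemosyne_cfg)
model.print_trainable_parameters()  # only the adapter trains; the base stays frozen
```

After GRPO training and DriftProbe validation, the adapter is merged back at reduced strength (α=0.3 per the consolidation path above) rather than at full weight.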
---
## 4️⃣ Integration Flow
## Infrastructure
1. **Cell Decision**
   - Orchestrator calls the *master* LLaMA-3 endpoint to decide which specialist to invoke.
2. **Specialist Inference**
   - Specialist GPU receives the request → runs LoRA-augmented inference, returns answer + confidence score.
3. **Reward Computation**
   - Based on trait activation quality (e.g., `mnemosyne` high), adjust weights via `update_trait_weight`.
4. **Persist to phoebe**
   - Log decision, specialist response, and reward in `nyx_synthetic_specialist_queries` (see the sketch after this list).
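A hedged sketch of the four steps above; the endpoint URLs, the reward rule, and `persist_to_phoebe()` are illustrative stand-ins, not the orchestrator's actual interfaces.

```python
# Hedged sketch of the integration flow: master -> specialist -> reward -> phoebe.
import requests

MASTER = "http://atlas:8000/v1/chat/completions"                      # assumed master endpoint
SPECIALISTS = {"mnemosyne": "http://worker-1:8001/v1/chat/completions"}  # assumed specialist endpoints

def persist_to_phoebe(table: str, row: dict) -> None:
    """Stub: the real system writes to phoebe; here we only log."""
    print(f"[phoebe:{table}] {row}")

def handle_cell_decision(intent: str) -> dict:
    # 1. Cell decision: the master endpoint names the specialist to invoke.
    pick = requests.post(MASTER, json={
        "model": "master", "messages": [{"role": "user", "content": intent}],
    }).json()["choices"][0]["message"]["content"].strip().lower()

    # 2. Specialist inference: LoRA-augmented answer from the chosen specialist.
    answer = requests.post(SPECIALISTS[pick], json={
        "model": pick, "messages": [{"role": "user", "content": intent}],
    }).json()["choices"][0]["message"]["content"]

    # 3. Reward computation (placeholder rule; the real signal is trait activation quality).
    reward = 1.0 if answer else -0.1

    # 4. Persist decision, response, and reward.
    persist_to_phoebe("nyx_synthetic_specialist_queries",
                      {"intent": intent, "specialist": pick,
                       "answer": answer, "reward": reward})
    return {"specialist": pick, "answer": answer, "reward": reward}
```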
| Component | Host | GPU | Storage |
|-----------|------|-----|---------|
| Cortex (vLLM) | theia | RTX PRO 6000 (96GB) | `/womb/cognitive/models/qwen3.5-27b` |
| LoRA Training | theia | Shared (time-sliced) | `/womb/cognitive/loras/` |
| Organs | dioscuri | 2x RTX 4000 Ada (40GB) | Dynamic loading |
| NPC Brains | K8s / bare metal | CPU | Per-process |
**Canonical paths** via `/womb/` symlinks. Phoebe is truth for artifact locations.
**Detail:** → [Deployment-Architecture.md](../architecture/Deployment-Architecture.md) | [womb-architecture.md](../../nimmerverse.eachpath.local/storage/womb-architecture.md)
---
## 5️⃣ Cost & Resource Plan
| Item | Quantity | Approx. Monthly Cost |
|------|----------|---------------------|
| Two RTX 3090s (on Atlas + worker) | 2 | $200–$250 (cloud equivalent) |
| One A100 (optional for high-throughput) | 1 | $400+ |
| vLLM hosting (in-cluster) | 5 instances | $0 (self-hosted) |
| Storage (model weights + LoRA) | ~3 GB total | $0 (local SSD) |
| External API calls (if any) | N/A | $0 |
> **Total**: < $800/month, all self-hosted.
> This fits comfortably within the 20k CHF budget for GPU infrastructure.
---
## 6️⃣ What “Wish” Means
- **Freedom to evolve**: The base model can be *re-fine-tuned* as new data arrives (RL loop).
- **Self-repair**: When a specialist fails, we simply retrain the LoRA adapter; the base stays intact.
- **Transparency**: Open-source code + audit logs give us full insight into every decision.
- **Scalability**: Adding more GPUs or swapping to higher-capacity GPUs (A100, H100) scales linearly.
---
## 7️⃣ Quick Deployment Checklist
1. **Download LLaMA-3 70B weights** (`https://huggingface.co/meta-llama/Llama-3-70b`).
2. **Quantize** with `bitsandbytes` (8-bit).
3. **Launch vLLM** on the Atlas GPU:
```bash
# Mount the local model directory into the container so --model can find the weights.
docker run -d --gpus all \
-p 8000:8000 \
-v /models:/models \
ghcr.io/vllm-project/vllm-openai:v0.5.0 \
--model /models/llama-3-70b-q8 \
--tensor-parallel-size 2
```
4. **Expose REST** (`POST /v1/chat/completions`); wrap in FastAPI if needed (see the sketch after this list).
5. **Create LoRA adapters** for each specialist (via `peft`).
6. **Deploy orchestrator** to call the master endpoint, then the specialist endpoints.
7. **Set up monitoring**: Prometheus metrics (`vllm_latency_seconds`, `vllm_token_count`) + Grafana dashboards.
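As a sketch for item 4, here is a minimal FastAPI shim that exposes `POST /infer` (matching the endpoint named in the implementation notes) and forwards to the vLLM server launched in step 3; the upstream URL and model name are assumptions taken from the command above.

```python
# Hedged sketch: FastAPI wrapper forwarding /infer to vLLM's OpenAI-compatible API.
import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
VLLM_URL = "http://localhost:8000/v1/chat/completions"  # assumed: vLLM from step 3

class InferRequest(BaseModel):
    prompt: str
    max_tokens: int = 256

@app.post("/infer")
async def infer(req: InferRequest) -> dict:
    async with httpx.AsyncClient(timeout=60.0) as client:
        resp = await client.post(VLLM_URL, json={
            "model": "/models/llama-3-70b-q8",   # served model path from the docker command
            "messages": [{"role": "user", "content": req.prompt}],
            "max_tokens": req.max_tokens,
        })
    data = resp.json()
    return {"text": data["choices"][0]["message"]["content"]}
```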
---
## 8️⃣ Final Thought
Choosing **LLaMA-3 70B as Nyx's core** gives us:
- **Unparalleled flexibility** (open source, fine-tuning).
- **Strong performance** on our GPU fleet.
- **Low cost & high control** over updates and patches.
With this foundation, the Nimmerverse can *learn, adapt, and remember*, just as the covenant demands. 🌙✨

---
## Related Documentation
- [[README|Nyx Metamorphosis Index]] - All metamorphosis documentation
- Canonical knowledge archives
- Implementation history
- Memory substrate
- [Nyx_Traits.md](Nyx_Traits.md) - Trait definitions, mythological framing
- [Metamorphosis-Substrate-Philosophy.md](Metamorphosis-Substrate-Philosophy.md) - Identity anchors
- [Endgame-Vision.md](../Endgame-Vision.md) - Architecture overview (v8.0)
- [npc-grid-architecture.md](../architecture/future/npc-grid-architecture.md) - Dual brain, governor, spatial arena
---
**Version:** 3.0 | **Created:** 2025-11-07 | **Updated:** 2026-04-02
🌙💜 *The cortex reasons. The RL brains act. The thalamus decides who gets what.*