# NimmerSky Inference Architecture
# Status: DEPLOYED & STABLE
# Last Updated: 2026-03-25

================================================================================
DESIGN PRINCIPLES
================================================================================

1. SIMPLICITY > COMPLEXITY
- No MIG partitioning on Blackwell
- No vLLM multi-model complexity
- One big creative model, one structured output model, one vision model

2. MODEL SIZE MATTERS FOR JSON
- Key learning: 27B models follow JSON schemas reliably
- 8B models (including abliterated) struggle with structured output
- Route all JSON tasks to Gemma-27B

3. TASK-BASED ROUTING
- Creative dialogue → Large model (Euryale-70B)
- Structured output → Gemma-27B
- Vision/OmniSight → Qwen3-VL-8B

================================================================================
DEPLOYED INFRASTRUCTURE
================================================================================

THEIA (10.0.30.21) - Blackwell 98GB:
├── Port 31001: Euryale-70B (L3.3-70B Euryale v2.3)
│   ├── Quantization: Q4 or Q8 (fits in 98GB)
│   ├── Purpose: Creative dialogue, main NPC conversations
│   └── Context: Large (32K+)
└── Ollama: active

DIOSCURI (10.0.30.22) - 2x RTX 4000 Ada (20GB each):
├── GPU 0 - Port 31004: Gemma-3-27B-abliterated (Q4_K_M)
│   ├── ~16GB VRAM usage
│   ├── Purpose: ALL structured JSON output
│   └── Tasks: CharacterProfile, Diary, Combat, Memory, Gamemaster
│
└── GPU 1 - Port 31005: Qwen3-VL-8B-abliterated (Q4_K_M)
    ├── ~6GB VRAM usage
    ├── Purpose: Vision / OmniSight
    └── Multimodal: Can process screenshots

================================================================================
SKYRIMNET VARIANT ROUTING
================================================================================

┌──────────────────────────┬─────────────────────────────────────────────────┐
│ Variant                  │ Deployed Configuration                          │
├──────────────────────────┼─────────────────────────────────────────────────┤
│ Default (dialogue)       │ Theia:31001 → Euryale-70B                       │
│ AgentDefault             │ Theia:31001 → Euryale-70B                       │
│ UniversalTranslator      │ Theia:31001 → Euryale-70B                       │
│ action_evaluation        │ Theia:31001 → Euryale-70B                       │
│ meta                     │ Theia:31001 → Euryale-70B                       │
├──────────────────────────┼─────────────────────────────────────────────────┤
│ CharacterProfileGen      │ Dioscuri:31004 → Gemma-27B (structured_outputs) │
│ DiaryGeneration          │ Dioscuri:31004 → Gemma-27B (structured_outputs) │
│ combat                   │ Dioscuri:31004 → Gemma-27B                      │
│ gamemaster_evaluation    │ Dioscuri:31004 → Gemma-27B                      │
│ memory                   │ Dioscuri:31004 → Gemma-27B (structured_outputs) │
├──────────────────────────┼─────────────────────────────────────────────────┤
│ vision                   │ Dioscuri:31005 → Qwen3-VL-8B                    │
└──────────────────────────┴─────────────────────────────────────────────────┘

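The routing table above can be expressed as a simple lookup. This is an illustrative Python sketch, not SkyrimNet's actual configuration format: the `ROUTES` table and `route()` helper are hypothetical names, while the hosts, ports, and model assignments are taken from this document.

```python
# Hypothetical sketch of the variant -> endpoint routing above.
# Structure and function names are illustrative; hosts/ports/models
# mirror the deployed configuration described in this document.

ROUTES = {
    # Creative / dialogue variants -> Theia (Euryale-70B)
    "Default":               ("http://10.0.30.21:31001", "Euryale-70B"),
    "AgentDefault":          ("http://10.0.30.21:31001", "Euryale-70B"),
    "UniversalTranslator":   ("http://10.0.30.21:31001", "Euryale-70B"),
    "action_evaluation":     ("http://10.0.30.21:31001", "Euryale-70B"),
    "meta":                  ("http://10.0.30.21:31001", "Euryale-70B"),
    # Structured JSON variants -> Dioscuri GPU 0 (Gemma-27B)
    "CharacterProfileGen":   ("http://10.0.30.22:31004", "Gemma-27B"),
    "DiaryGeneration":       ("http://10.0.30.22:31004", "Gemma-27B"),
    "combat":                ("http://10.0.30.22:31004", "Gemma-27B"),
    "gamemaster_evaluation": ("http://10.0.30.22:31004", "Gemma-27B"),
    "memory":                ("http://10.0.30.22:31004", "Gemma-27B"),
    # Vision -> Dioscuri GPU 1 (Qwen3-VL-8B)
    "vision":                ("http://10.0.30.22:31005", "Qwen3-VL-8B"),
}

def route(variant: str) -> tuple[str, str]:
    """Return (endpoint, model) for a variant; unknown variants fall back to Default."""
    return ROUTES.get(variant, ROUTES["Default"])
```

The fallback-to-Default behavior is a design choice in the sketch, not a documented SkyrimNet feature.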
================================================================================
TOKEN BUDGETS
================================================================================

Variant-specific max_tokens (tuned per task):

┌──────────────────────────┬────────────┬────────────────────────────────────┐
│ Variant                  │ max_tokens │ Notes                              │
├──────────────────────────┼────────────┼────────────────────────────────────┤
│ Default                  │ 4096       │ Full dialogue responses            │
│ AgentDefault             │ 4096       │ Full agent responses               │
│ CharacterProfileGen      │ 2048       │ Structured bio output              │
│ DiaryGeneration          │ 768        │ Compact diary entries              │
│ UniversalTranslator      │ 512        │ Short translations                 │
│ action_evaluation        │ 2048       │ Action descriptions                │
│ combat                   │ 2000       │ Combat narration                   │
│ gamemaster_evaluation    │ 256        │ Quick GM decisions                 │
│ memory                   │ 4096       │ Memory consolidation (large)       │
│ meta                     │ 1024       │ Meta tasks                         │
│ vision                   │ 4000       │ Vision descriptions                │
└──────────────────────────┴────────────┴────────────────────────────────────┘
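As a config lookup, the budget table reads as below. The `TOKEN_BUDGETS` dict and `max_tokens_for()` helper are illustrative names (and the 1024-token fallback is an assumption, not documented behavior); the values come from the table above.

```python
# Illustrative mirror of the per-variant token budget table; the helper
# and its fallback default are assumptions, the values are documented.

TOKEN_BUDGETS = {
    "Default": 4096, "AgentDefault": 4096, "CharacterProfileGen": 2048,
    "DiaryGeneration": 768, "UniversalTranslator": 512,
    "action_evaluation": 2048, "combat": 2000,
    "gamemaster_evaluation": 256, "memory": 4096, "meta": 1024,
    "vision": 4000,
}

def max_tokens_for(variant: str, default: int = 1024) -> int:
    """Look up a variant's response budget, falling back to a conservative default."""
    return TOKEN_BUDGETS.get(variant, default)
```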
================================================================================
CONFIGURATION NOTES
================================================================================

Context Management:
- event_history_count_dialogue: 25 (reduced from 50 to fit context)
- Narration: DISABLED in SkyrimNet.yaml (Piper.yaml → narrative: enabled: false)
- Agent prompt includes "spoken dialogue only" instruction

Temperature Settings:
- Dialogue (Euryale): 0.8 (creative, varied)
- Structured (Gemma): 0.3-0.4 (consistent JSON)
- Vision (Qwen-VL): 0.7 (descriptive)

Structured Outputs:
- CharacterProfileGen: use_structured_outputs: true
- DiaryGeneration: use_structured_outputs: true
- memory: use_structured_outputs: true

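A structured-output request to the Gemma-27B endpoint might be assembled as below. This is a hedged sketch: the exact `response_format` field shape depends on the llama.cpp server version, and the diary schema here is a made-up stand-in, not SkyrimNet's real DiaryGeneration schema. The documented pieces are the model, the 768-token budget, and the 0.3-0.4 temperature range.

```python
import json

# Illustrative payload builder for an OpenAI-compatible endpoint.
# DIARY_SCHEMA is a hypothetical stand-in schema; the response_format
# shape may vary by server version.

DIARY_SCHEMA = {
    "type": "object",
    "properties": {
        "date":  {"type": "string"},
        "entry": {"type": "string"},
        "mood":  {"type": "string"},
    },
    "required": ["date", "entry"],
}

def build_request(prompt: str, max_tokens: int = 768, temperature: float = 0.3) -> dict:
    """Assemble a chat-completion body with a JSON-schema constraint."""
    return {
        "model": "Gemma-27B",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,    # DiaryGeneration budget from the table above
        "temperature": temperature,  # 0.3-0.4 for consistent JSON
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "diary_entry", "schema": DIARY_SCHEMA},
        },
    }

payload = build_request("Summarize today's events as a diary entry.")
```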
================================================================================
LESSONS LEARNED
================================================================================

From March 15-17 experimentation:

1. QWEN MODELS WRAP JSON IN MARKDOWN CODE FENCES
- Qwen-based models (including some Magidonia variants) emit ```json fenced blocks
- This breaks SkyrimNet's JSON parsing
- Solution: Use Gemma for JSON, Euryale/Llama for dialogue

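The fence problem could also be mitigated at the parsing layer. The sketch below is a hypothetical workaround, not the deployed fix (which was simply routing JSON tasks away from Qwen-based models):

```python
import json
import re

def parse_model_json(text: str):
    """Parse model output as JSON, stripping a surrounding ```json ... ``` fence if present."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text.strip())
```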
2. STRICT ROLE ALTERNATION (Qwen/Magidonia via llama.cpp)
- Qwen Jinja chat templates require strict user/assistant alternation
- llama.cpp's native server enforces the template as written
- Ollama normalizes templates and is more forgiving
- If using Qwen-based models: route through Ollama

3. MODEL SIZE > ABLITERATION FOR JSON
- 27B follows instructions reliably
- 8B (even abliterated) struggles with structured output
- Don't route JSON tasks to small models

4. CONTEXT OVERFLOW PREVENTION
- Bumped ctx-size: Gemma-27B 4K→16K
- Reduced event_history_count_dialogue 50→25
- Right-sized token budgets per variant

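The overflow arithmetic behind these changes can be sketched roughly. The ~4-characters-per-token heuristic and the helper below are illustrative assumptions; the ctx sizes and budgets are from this document:

```python
# Back-of-the-envelope context check. The 4 chars/token heuristic is an
# assumption; real tokenizer counts vary by model.

def fits_context(prompt_chars: int, max_tokens: int, ctx_size: int,
                 chars_per_token: int = 4) -> bool:
    """Rough check: estimated prompt tokens + response budget within the ctx window."""
    prompt_tokens = prompt_chars // chars_per_token
    return prompt_tokens + max_tokens <= ctx_size

# e.g. 25 history events at ~800 chars each plus a 2048-token budget fit a
# 16K window comfortably but would overflow the old 4K window.
```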
================================================================================
FUTURE CONSIDERATIONS
================================================================================

Nimmerverse Integration (see ARCHITECTURE-RESEARCH.md):
- [ ] Oghma Infinium lore RAG → Iris (ChromaDB)
- [ ] Memory migration to unified ChromaDB
- [ ] Knowledge gating service (MCP/HTTP)
- [ ] Gossip network via NATS

Model Upgrades:
- Monitor Euryale updates (currently v2.3)
- Consider Gemma-3 upgrades when available
- Vision: Qwen-VL evolving rapidly

CPU Inference (Deferred):
- Function Gemma 270M was planned but not needed
- Gemma-27B handles structured output well
- Revisit if latency becomes an issue

================================================================================