# NimmerSky Inference Architecture
# Status: DEPLOYED & STABLE
# Last Updated: 2026-03-18

================================================================================
DESIGN PRINCIPLES
================================================================================

1. SIMPLICITY > COMPLEXITY
   - No MIG partitioning on Blackwell
   - No vLLM multi-model complexity
   - One big creative model, one structured output model, one vision model

2. MODEL SIZE MATTERS FOR JSON
   - Key learning: 27B models follow JSON schemas reliably
   - 8B models (including abliterated) struggle with structured output
   - Route all JSON tasks to Gemma-27B

3. TASK-BASED ROUTING
   - Creative dialogue → Large model (Euryale-70B)
   - Structured output → Gemma-27B
   - Vision/OmniSight → Qwen3-VL-8B

================================================================================
DEPLOYED INFRASTRUCTURE
================================================================================

THEIA (10.0.30.21) - Blackwell 98GB:
├── Port 31001: Euryale-70B (L3.3-70B Euryale v2.3)
│   ├── Quantization: Q4 or Q8 (fits in 98GB)
│   ├── Purpose: Creative dialogue, main NPC conversations
│   └── Context: Large (32K+)
└── Ollama: active

DIOSCURI (10.0.30.22) - 2x RTX 4000 Ada (20GB each):
├── GPU 0 - Port 31004: Gemma-3-27B-abliterated (Q4_K_M)
│   ├── ~16GB VRAM usage
│   ├── Purpose: ALL structured JSON output
│   └── Tasks: CharacterProfile, Diary, Combat, Memory, Gamemaster
│
└── GPU 1 - Port 31005: Qwen3-VL-8B-abliterated (Q4_K_M)
    ├── ~6GB VRAM usage
    ├── Purpose: Vision / OmniSight
    └── Multimodal: Can process screenshots

================================================================================
SKYRIMNET VARIANT ROUTING
================================================================================

┌──────────────────────────┬─────────────────────────────────────────────────┐
│ Variant                  │ Deployed Configuration                          │
├──────────────────────────┼─────────────────────────────────────────────────┤
│ Default (dialogue)       │ Theia:31001 → Euryale-70B                       │
│ AgentDefault             │ Theia:31001 → Euryale-70B                       │
│ UniversalTranslator      │ Theia:31001 → Euryale-70B                       │
│ action_evaluation        │ Theia:31001 → Euryale-70B                       │
│ meta                     │ Theia:31001 → Euryale-70B                       │
├──────────────────────────┼─────────────────────────────────────────────────┤
│ CharacterProfileGen      │ Dioscuri:31004 → Gemma-27B (structured_outputs) │
│ DiaryGeneration          │ Dioscuri:31004 → Gemma-27B (structured_outputs) │
│ combat                   │ Dioscuri:31004 → Gemma-27B                      │
│ gamemaster_evaluation    │ Dioscuri:31004 → Gemma-27B                      │
│ memory                   │ Dioscuri:31004 → Gemma-27B (structured_outputs) │
├──────────────────────────┼─────────────────────────────────────────────────┤
│ vision                   │ Dioscuri:31005 → Qwen3-VL-8B                    │
└──────────────────────────┴─────────────────────────────────────────────────┘

================================================================================
TOKEN BUDGETS
================================================================================

Variant-specific max_tokens (tuned for task):

┌──────────────────────────┬────────────┬────────────────────────────────────┐
│ Variant                  │ max_tokens │ Notes                              │
├──────────────────────────┼────────────┼────────────────────────────────────┤
│ Default                  │ 4096       │ Full dialogue responses            │
│ AgentDefault             │ 4096       │ Full agent responses               │
│ CharacterProfileGen      │ 2048       │ Structured bio output              │
│ DiaryGeneration          │ 768        │ Compact diary entries              │
│ UniversalTranslator      │ 512        │ Short translations                 │
│ action_evaluation        │ 2048       │ Action descriptions                │
│ combat                   │ 2000       │ Combat narration                   │
│ gamemaster_evaluation    │ 256        │ Quick GM decisions                 │
│ memory                   │ 4096       │ Memory consolidation (large)       │
│ meta                     │ 1024       │ Meta tasks                         │
│ vision                   │ 4000       │ Vision descriptions                │
└──────────────────────────┴────────────┴────────────────────────────────────┘

================================================================================
CONFIGURATION NOTES
================================================================================

Context Management:
- event_history_count_dialogue: 25 (reduced from 50 to fit context)
- Narration: DISABLED in SkyrimNet.yaml (Piper.yaml → narrative: enabled: false)
- Agent prompt includes "spoken dialogue only" instruction

Temperature Settings:
- Dialogue (Euryale): 0.8 (creative, varied)
- Structured (Gemma): 0.3-0.4 (consistent JSON)
- Vision (Qwen-VL): 0.7 (descriptive)

Structured Outputs:
- CharacterProfileGen: use_structured_outputs: true
- DiaryGeneration: use_structured_outputs: true
- memory: use_structured_outputs: true

================================================================================
LESSONS LEARNED
================================================================================

From March 15-17 experimentation:

1. QWEN MODELS WRAP JSON IN MARKDOWN CODE FENCES
   - Qwen-based models (including some Magidonia variants) wrap their output
     in ```json fenced blocks
   - This breaks SkyrimNet's JSON parsing
   - Solution: Use Gemma for JSON, Euryale/Llama for dialogue

2. STRICT ROLE ALTERNATION (Qwen/Magidonia via llama.cpp)
   - Qwen Jinja templates require strict user/assistant alternation
   - Native llama.cpp enforces the template strictly
   - Ollama normalizes templates (more forgiving)
   - If using Qwen-based models: route through Ollama

3. MODEL SIZE > ABLITERATION FOR JSON
   - 27B follows instructions reliably
   - 8B (even abliterated) struggles with structured output
   - Don't route JSON tasks to small models

4. CONTEXT OVERFLOW PREVENTION
   - Bumped ctx-size: Gemma-27B 4K→16K
   - Reduced event_history_count_dialogue 50→25
   - Right-sized token budgets per variant

================================================================================
FUTURE CONSIDERATIONS
================================================================================

Nimmerverse Integration (see ARCHITECTURE-RESEARCH.md):
- [ ] Oghma Infinium lore RAG → Iris (ChromaDB)
- [ ] Memory migration to unified ChromaDB
- [ ] Knowledge gating service (MCP/HTTP)
- [ ] Gossip network via NATS

Model Upgrades:
- Monitor Euryale updates (currently v2.3)
- Consider Gemma-3 upgrades when available
- Vision: Qwen-VL evolving rapidly

CPU Inference (Deferred):
- Function Gemma 270M was planned but not needed
- Gemma-27B handles structured output well
- Revisit if latency becomes an issue

================================================================================
Architecture stable as of 2026-03-18.
Three-model split working well: Big creative + Structured JSON + Vision
- Chrysalis
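
================================================================================
APPENDIX: ROUTING SKETCH
================================================================================

The variant routing and token budget tables can be read as a single lookup
keyed by variant name. The sketch below is illustrative only and is not
SkyrimNet's actual configuration format: the names Route, ROUTES, route_for,
and base_url are hypothetical, and the plain "http://" scheme is an assumption.
The hosts, ports, models, max_tokens values, and structured-output flags are
taken from the tables in this document.

```python
from dataclasses import dataclass

# Hosts as listed under DEPLOYED INFRASTRUCTURE.
THEIA = "10.0.30.21"
DIOSCURI = "10.0.30.22"


@dataclass(frozen=True)
class Route:
    """Hypothetical record combining one routing row and one budget row."""
    host: str
    port: int
    model: str
    max_tokens: int
    structured: bool = False  # mirrors use_structured_outputs: true


# One entry per variant, values copied from the two tables.
ROUTES = {
    "Default": Route(THEIA, 31001, "Euryale-70B", 4096),
    "AgentDefault": Route(THEIA, 31001, "Euryale-70B", 4096),
    "UniversalTranslator": Route(THEIA, 31001, "Euryale-70B", 512),
    "action_evaluation": Route(THEIA, 31001, "Euryale-70B", 2048),
    "meta": Route(THEIA, 31001, "Euryale-70B", 1024),
    "CharacterProfileGen": Route(DIOSCURI, 31004, "Gemma-27B", 2048, True),
    "DiaryGeneration": Route(DIOSCURI, 31004, "Gemma-27B", 768, True),
    "combat": Route(DIOSCURI, 31004, "Gemma-27B", 2000),
    "gamemaster_evaluation": Route(DIOSCURI, 31004, "Gemma-27B", 256),
    "memory": Route(DIOSCURI, 31004, "Gemma-27B", 4096, True),
    "vision": Route(DIOSCURI, 31005, "Qwen3-VL-8B", 4000),
}


def route_for(variant: str) -> Route:
    """Return the route for a variant; unknown variants raise KeyError."""
    return ROUTES[variant]


def base_url(route: Route) -> str:
    """Build the endpoint base URL (the http scheme is an assumption)."""
    return f"http://{route.host}:{route.port}"
```

Unknown variants raise KeyError rather than falling back to Default, so a
misspelled variant name cannot silently send a JSON task to the dialogue model.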