reorg: distillation of Oghma knowledge packs out of iris-dev
guides-stack/inference_architecture_plan.txt (new file, 154 lines)
# NimmerSky Inference Architecture
# Status: DEPLOYED & STABLE
# Last Updated: 2026-03-25

================================================================================
DESIGN PRINCIPLES
================================================================================

1. SIMPLICITY > COMPLEXITY
   - No MIG partitioning on Blackwell
   - No vLLM multi-model complexity
   - One big creative model, one structured output model, one vision model

2. MODEL SIZE MATTERS FOR JSON
   - Key learning: 27B models follow JSON schemas reliably
   - 8B models (including abliterated) struggle with structured output
   - Route all JSON tasks to Gemma-27B

3. TASK-BASED ROUTING
   - Creative dialogue → Large model (Euryale-70B)
   - Structured output → Gemma-27B
   - Vision/OmniSight → Qwen3-VL-8B

================================================================================
DEPLOYED INFRASTRUCTURE
================================================================================

THEIA (10.0.30.21) - Blackwell 98GB:
├── Port 31001: Euryale-70B (L3.3-70B Euryale v2.3)
│   ├── Quantization: Q4 or Q8 (fits in 98GB)
│   ├── Purpose: Creative dialogue, main NPC conversations
│   └── Context: Large (32K+)
└── Ollama: active

DIOSCURI (10.0.30.22) - 2x RTX 4000 Ada (20GB each):
├── GPU 0 - Port 31004: Gemma-3-27B-abliterated (Q4_K_M)
│   ├── ~16GB VRAM usage
│   ├── Purpose: ALL structured JSON output
│   └── Tasks: CharacterProfile, Diary, Combat, Memory, Gamemaster
│
└── GPU 1 - Port 31005: Qwen3-VL-8B-abliterated (Q4_K_M)
    ├── ~6GB VRAM usage
    ├── Purpose: Vision / OmniSight
    └── Multimodal: Can process screenshots

================================================================================
SKYRIMNET VARIANT ROUTING
================================================================================

┌──────────────────────────┬─────────────────────────────────────────────────┐
│ Variant                  │ Deployed Configuration                          │
├──────────────────────────┼─────────────────────────────────────────────────┤
│ Default (dialogue)       │ Theia:31001 → Euryale-70B                       │
│ AgentDefault             │ Theia:31001 → Euryale-70B                       │
│ UniversalTranslator      │ Theia:31001 → Euryale-70B                       │
│ action_evaluation        │ Theia:31001 → Euryale-70B                       │
│ meta                     │ Theia:31001 → Euryale-70B                       │
├──────────────────────────┼─────────────────────────────────────────────────┤
│ CharacterProfileGen      │ Dioscuri:31004 → Gemma-27B (structured_outputs) │
│ DiaryGeneration          │ Dioscuri:31004 → Gemma-27B (structured_outputs) │
│ combat                   │ Dioscuri:31004 → Gemma-27B                      │
│ gamemaster_evaluation    │ Dioscuri:31004 → Gemma-27B                      │
│ memory                   │ Dioscuri:31004 → Gemma-27B (structured_outputs) │
├──────────────────────────┼─────────────────────────────────────────────────┤
│ vision                   │ Dioscuri:31005 → Qwen3-VL-8B                    │
└──────────────────────────┴─────────────────────────────────────────────────┘
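The routing table above reduces to a simple lookup with a fallback to the big creative model. A minimal sketch, assuming the hosts and ports from the deployment tables; the function and dict names are hypothetical, not part of SkyrimNet:

```python
# Hypothetical sketch of variant -> endpoint routing; hosts/ports are from
# the deployment tables above, names are illustrative.
DEFAULT_ENDPOINT = ("10.0.30.21", 31001)  # Theia -> Euryale-70B

VARIANT_ENDPOINTS = {
    # Structured JSON variants -> Dioscuri GPU 0 (Gemma-27B)
    "CharacterProfileGen": ("10.0.30.22", 31004),
    "DiaryGeneration": ("10.0.30.22", 31004),
    "combat": ("10.0.30.22", 31004),
    "gamemaster_evaluation": ("10.0.30.22", 31004),
    "memory": ("10.0.30.22", 31004),
    # Vision -> Dioscuri GPU 1 (Qwen3-VL-8B)
    "vision": ("10.0.30.22", 31005),
}

def endpoint_for(variant: str) -> tuple[str, int]:
    # Dialogue-style variants (Default, AgentDefault, meta, ...) all fall
    # through to the creative model, so only the exceptions are listed.
    return VARIANT_ENDPOINTS.get(variant, DEFAULT_ENDPOINT)
```

Listing only the exceptions keeps the table short and makes "new variant defaults to the creative model" the safe failure mode.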

================================================================================
TOKEN BUDGETS
================================================================================

Variant-specific max_tokens (tuned for task):

┌──────────────────────────┬────────────┬────────────────────────────────────┐
│ Variant                  │ max_tokens │ Notes                              │
├──────────────────────────┼────────────┼────────────────────────────────────┤
│ Default                  │ 4096       │ Full dialogue responses            │
│ AgentDefault             │ 4096       │ Full agent responses               │
│ CharacterProfileGen      │ 2048       │ Structured bio output              │
│ DiaryGeneration          │ 768        │ Compact diary entries              │
│ UniversalTranslator      │ 512        │ Short translations                 │
│ action_evaluation        │ 2048       │ Action descriptions                │
│ combat                   │ 2000       │ Combat narration                   │
│ gamemaster_evaluation    │ 256        │ Quick GM decisions                 │
│ memory                   │ 4096       │ Memory consolidation (large)       │
│ meta                     │ 1024       │ Meta tasks                         │
│ vision                   │ 4000       │ Vision descriptions                │
└──────────────────────────┴────────────┴────────────────────────────────────┘
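The budgets above can live in one table in code. A minimal sketch with values copied from the table; the dict and function names are illustrative, and the fallback value is an assumption, not part of the deployed config:

```python
# Per-variant max_tokens, copied from the budget table above.
MAX_TOKENS = {
    "Default": 4096,
    "AgentDefault": 4096,
    "CharacterProfileGen": 2048,
    "DiaryGeneration": 768,
    "UniversalTranslator": 512,
    "action_evaluation": 2048,
    "combat": 2000,
    "gamemaster_evaluation": 256,
    "memory": 4096,
    "meta": 1024,
    "vision": 4000,
}

def max_tokens_for(variant: str, fallback: int = 1024) -> int:
    # Unknown variants get a conservative mid-size budget (assumed default).
    return MAX_TOKENS.get(variant, fallback)
```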

================================================================================
CONFIGURATION NOTES
================================================================================

Context Management:
- event_history_count_dialogue: 25 (reduced from 50 to fit context)
- Narration: DISABLED in SkyrimNet.yaml (Piper.yaml → narrative: enabled: false)
- Agent prompt includes "spoken dialogue only" instruction

Temperature Settings:
- Dialogue (Euryale): 0.8 (creative, varied)
- Structured (Gemma): 0.3-0.4 (consistent JSON)
- Vision (Qwen-VL): 0.7 (descriptive)

Structured Outputs:
- CharacterProfileGen: use_structured_outputs: true
- DiaryGeneration: use_structured_outputs: true
- memory: use_structured_outputs: true
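For the structured-output variants, the request to the Gemma endpoint would look roughly like the payload below. This is a hedged sketch of an OpenAI-compatible /v1/chat/completions body with a JSON-schema response format; llama.cpp's server accepts a schema-constrained response_format, but exact field names can vary by server and version, and the function name here is hypothetical:

```python
def build_structured_request(prompt: str, schema: dict,
                             temperature: float = 0.3) -> dict:
    # Sketch of an OpenAI-style chat payload with schema-constrained output.
    # temperature 0.3 matches the "Structured (Gemma)" note above; the
    # response_format shape is an assumption about the serving stack.
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": 2048,
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "output", "schema": schema},
        },
    }
```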

================================================================================
LESSONS LEARNED
================================================================================

From March 15-17 experimentation:

1. QWEN MODELS WRAP JSON IN MARKDOWN CODE FENCES
   - Qwen-based models (including some Magidonia variants) output ```json blocks
   - This breaks SkyrimNet's JSON parsing
   - Solution: Use Gemma for JSON, Euryale/Llama for dialogue
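If a fenced model ever has to be used for JSON, the wrapper can be stripped before parsing. A minimal sketch (the function name is hypothetical; routing to Gemma remains the deployed fix):

```python
import re

def strip_json_fences(text: str) -> str:
    # Remove a ```json ... ``` (or bare ```) wrapper if present,
    # leaving the inner JSON for the normal parser.
    m = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    return m.group(1) if m else text.strip()
```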

2. STRICT ROLE ALTERNATION (Qwen/Magidonia via llama.cpp)
   - Qwen Jinja templates enforce strict user/assistant alternation
   - llama.cpp's native server applies the template as written, so it rejects
     non-alternating histories
   - Ollama normalizes templates (more forgiving)
   - If using Qwen-based models: route through Ollama
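An alternative to rerouting is to normalize the history client-side by merging consecutive same-role messages before the template sees them. A minimal sketch under that assumption; the function name is hypothetical:

```python
def enforce_alternation(messages: list[dict]) -> list[dict]:
    # Merge consecutive same-role messages so strict user/assistant
    # alternation templates (e.g. Qwen Jinja) accept the history.
    merged: list[dict] = []
    for msg in messages:
        if merged and merged[-1]["role"] == msg["role"]:
            merged[-1]["content"] += "\n" + msg["content"]
        else:
            merged.append({"role": msg["role"], "content": msg["content"]})
    return merged
```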

3. MODEL SIZE > ABLITERATION FOR JSON
   - 27B follows instructions reliably
   - 8B (even abliterated) struggles with structured output
   - Don't route JSON tasks to small models

4. CONTEXT OVERFLOW PREVENTION
   - Bumped ctx-size: Gemma-27B 4K→16K
   - Reduced event_history_count_dialogue 50→25
   - Right-sized token budgets per variant
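The overflow guard in point 4 comes down to one inequality: prompt tokens plus the variant's completion budget must fit the context window. A minimal sketch using the 16K Gemma-27B window from above; the function name is hypothetical:

```python
def fits_context(prompt_tokens: int, max_tokens: int,
                 ctx_size: int = 16384) -> bool:
    # Prompt plus the reserved completion budget must fit the model's
    # context window (16K for Gemma-27B after the 4K -> 16K bump).
    return prompt_tokens + max_tokens <= ctx_size
```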

================================================================================
FUTURE CONSIDERATIONS
================================================================================

Nimmerverse Integration (see ARCHITECTURE-RESEARCH.md):
- [ ] Oghma Infinium lore RAG → Iris (ChromaDB)
- [ ] Memory migration to unified ChromaDB
- [ ] Knowledge gating service (MCP/HTTP)
- [ ] Gossip network via NATS

Model Upgrades:
- Monitor Euryale updates (currently v2.3)
- Consider Gemma-3 upgrades when available
- Vision: Qwen-VL evolving rapidly

CPU Inference (Deferred):
- Function Gemma 270M was planned but not needed
- Gemma-27B handles structured output well
- Revisit if latency becomes an issue

================================================================================