Files
nimmersky/inference_architecture_plan.txt
2026-04-02 20:10:34 +02:00

155 lines
8.4 KiB
Plaintext

# NimmerSky Inference Architecture
# Status: DEPLOYED & STABLE
# Last Updated: 2026-03-25
================================================================================
DESIGN PRINCIPLES
================================================================================
1. SIMPLICITY > COMPLEXITY
- No MIG partitioning on Blackwell
- No vLLM multi-model complexity
- One big creative model, one structured output model, one vision model
2. MODEL SIZE MATTERS FOR JSON
- Key learning: 27B models follow JSON schemas reliably
- 8B models (including abliterated) struggle with structured output
- Route all JSON tasks to Gemma-27B
3. TASK-BASED ROUTING
- Creative dialogue → Large model (Euryale-70B)
- Structured output → Gemma-27B
- Vision/OmniSight → Qwen3-VL-8B
================================================================================
DEPLOYED INFRASTRUCTURE
================================================================================
THEIA (10.0.30.21) - Blackwell 98GB:
├── Port 31001: Euryale-70B (L3.3-70B Euryale v2.3)
│ ├── Quantization: Q4 or Q8 (fits in 98GB)
│ ├── Purpose: Creative dialogue, main NPC conversations
│ └── Context: Large (32K+)
└── Ollama: active
DIOSCURI (10.0.30.22) - 2x RTX 4000 Ada (20GB each):
├── GPU 0 - Port 31004: Gemma-3-27B-abliterated (Q4_K_M)
│ ├── ~16GB VRAM usage
│ ├── Purpose: ALL structured JSON output
│ └── Tasks: CharacterProfile, Diary, Combat, Memory, Gamemaster
└── GPU 1 - Port 31005: Qwen3-VL-8B-abliterated (Q4_K_M)
├── ~6GB VRAM usage
├── Purpose: Vision / OmniSight
└── Multimodal: Can process screenshots
================================================================================
SKYRIMNET VARIANT ROUTING
================================================================================
┌──────────────────────────┬─────────────────────────────────────────────────┐
│ Variant │ Deployed Configuration │
├──────────────────────────┼─────────────────────────────────────────────────┤
│ Default (dialogue) │ Theia:31001 → Euryale-70B │
│ AgentDefault │ Theia:31001 → Euryale-70B │
│ UniversalTranslator │ Theia:31001 → Euryale-70B │
│ action_evaluation │ Theia:31001 → Euryale-70B │
│ meta │ Theia:31001 → Euryale-70B │
├──────────────────────────┼─────────────────────────────────────────────────┤
│ CharacterProfileGen │ Dioscuri:31004 → Gemma-27B (structured_outputs) │
│ DiaryGeneration │ Dioscuri:31004 → Gemma-27B (structured_outputs) │
│ combat │ Dioscuri:31004 → Gemma-27B │
│ gamemaster_evaluation │ Dioscuri:31004 → Gemma-27B │
│ memory │ Dioscuri:31004 → Gemma-27B (structured_outputs) │
├──────────────────────────┼─────────────────────────────────────────────────┤
│ vision │ Dioscuri:31005 → Qwen3-VL-8B │
└──────────────────────────┴─────────────────────────────────────────────────┘
================================================================================
TOKEN BUDGETS
================================================================================
Variant-specific max_tokens (tuned for task):
│ Variant │ max_tokens │ Notes │
├──────────────────────────┼────────────┼────────────────────────────────────┤
│ Default │ 4096 │ Full dialogue responses │
│ AgentDefault │ 4096 │ Full agent responses │
│ CharacterProfileGen │ 2048 │ Structured bio output │
│ DiaryGeneration │ 768 │ Compact diary entries │
│ UniversalTranslator │ 512 │ Short translations │
│ action_evaluation │ 2048 │ Action descriptions │
│ combat │ 2000 │ Combat narration │
│ gamemaster_evaluation │ 256 │ Quick GM decisions │
│ memory │ 4096 │ Memory consolidation (large) │
│ meta │ 1024 │ Meta tasks │
│ vision │ 4000 │ Vision descriptions │
================================================================================
CONFIGURATION NOTES
================================================================================
Context Management:
- event_history_count_dialogue: 25 (reduced from 50 to fit context)
- Narration: DISABLED in SkyrimNet.yaml (Piper.yaml → narrative: enabled: false)
- Agent prompt includes "spoken dialogue only" instruction
Temperature Settings:
- Dialogue (Euryale): 0.8 (creative, varied)
- Structured (Gemma): 0.3-0.4 (consistent JSON)
- Vision (Qwen-VL): 0.7 (descriptive)
Structured Outputs:
- CharacterProfileGen: use_structured_outputs: true
- DiaryGeneration: use_structured_outputs: true
- memory: use_structured_outputs: true
================================================================================
LESSONS LEARNED
================================================================================
From March 15-17 experimentation:
1. QWEN MODELS USE PYTHON TRIPLE-QUOTES IN JSON
- Qwen-based models (including some Magidonia variants) output ```json blocks
- This breaks SkyrimNet's JSON parsing
- Solution: Use Gemma for JSON, Euryale/Llama for dialogue
2. STRICT ROLE ALTERNATION (Qwen/Magidonia via llama.cpp)
- Qwen Jinja templates enforce strict user/assistant alternation
- llama.cpp native enforces this strictly
- Ollama normalizes templates (more forgiving)
- If using Qwen-based models: route through Ollama
3. MODEL SIZE > ABLITERATION FOR JSON
- 27B follows instructions reliably
- 8B (even abliterated) struggles with structured output
- Don't route JSON tasks to small models
4. CONTEXT OVERFLOW PREVENTION
- Bumped ctx-size: Gemma-27B 4K→16K
- Reduced event_history_count_dialogue 50→25
- Right-sized token budgets per variant
================================================================================
FUTURE CONSIDERATIONS
================================================================================
Nimmerverse Integration (see ARCHITECTURE-RESEARCH.md):
- [ ] Oghma Infinium lore RAG → Iris (ChromaDB)
- [ ] Memory migration to unified ChromaDB
- [ ] Knowledge gating service (MCP/HTTP)
- [ ] Gossip network via NATS
Model Upgrades:
- Monitor Euryale updates (currently v2.3)
- Consider Gemma-3 upgrades when available
- Vision: Qwen-VL evolving rapidly
CPU Inference (Deferred):
- Function Gemma 270M was planned but not needed
- Gemma-27B handles structured output well
- Revisit if latency becomes an issue
================================================================================