reorg: distillation of Oghma knowledge packs out of iris-dev
guides-stack/inference_architecture_plan.txt (new file, 154 lines)
# NimmerSky Inference Architecture
# Status: DEPLOYED & STABLE
# Last Updated: 2026-03-25

================================================================================
DESIGN PRINCIPLES
================================================================================

1. SIMPLICITY > COMPLEXITY
   - No MIG partitioning on Blackwell
   - No vLLM multi-model complexity
   - One big creative model, one structured output model, one vision model

2. MODEL SIZE MATTERS FOR JSON
   - Key learning: 27B models follow JSON schemas reliably
   - 8B models (including abliterated) struggle with structured output
   - Route all JSON tasks to Gemma-27B

3. TASK-BASED ROUTING
   - Creative dialogue → Large model (Euryale-70B)
   - Structured output → Gemma-27B
   - Vision/OmniSight → Qwen3-VL-8B

================================================================================
DEPLOYED INFRASTRUCTURE
================================================================================

THEIA (10.0.30.21) - Blackwell 98GB:
├── Port 31001: Euryale-70B (L3.3-70B Euryale v2.3)
│   ├── Quantization: Q4 or Q8 (fits in 98GB)
│   ├── Purpose: Creative dialogue, main NPC conversations
│   └── Context: Large (32K+)
└── Ollama: active

DIOSCURI (10.0.30.22) - 2x RTX 4000 Ada (20GB each):
├── GPU 0 - Port 31004: Gemma-3-27B-abliterated (Q4_K_M)
│   ├── ~16GB VRAM usage
│   ├── Purpose: ALL structured JSON output
│   └── Tasks: CharacterProfile, Diary, Combat, Memory, Gamemaster
│
└── GPU 1 - Port 31005: Qwen3-VL-8B-abliterated (Q4_K_M)
    ├── ~6GB VRAM usage
    ├── Purpose: Vision / OmniSight
    └── Multimodal: Can process screenshots

================================================================================
SKYRIMNET VARIANT ROUTING
================================================================================

┌──────────────────────────┬─────────────────────────────────────────────────┐
│ Variant                  │ Deployed Configuration                          │
├──────────────────────────┼─────────────────────────────────────────────────┤
│ Default (dialogue)       │ Theia:31001 → Euryale-70B                       │
│ AgentDefault             │ Theia:31001 → Euryale-70B                       │
│ UniversalTranslator      │ Theia:31001 → Euryale-70B                       │
│ action_evaluation        │ Theia:31001 → Euryale-70B                       │
│ meta                     │ Theia:31001 → Euryale-70B                       │
├──────────────────────────┼─────────────────────────────────────────────────┤
│ CharacterProfileGen      │ Dioscuri:31004 → Gemma-27B (structured_outputs) │
│ DiaryGeneration          │ Dioscuri:31004 → Gemma-27B (structured_outputs) │
│ combat                   │ Dioscuri:31004 → Gemma-27B                      │
│ gamemaster_evaluation    │ Dioscuri:31004 → Gemma-27B                      │
│ memory                   │ Dioscuri:31004 → Gemma-27B (structured_outputs) │
├──────────────────────────┼─────────────────────────────────────────────────┤
│ vision                   │ Dioscuri:31005 → Qwen3-VL-8B                    │
└──────────────────────────┴─────────────────────────────────────────────────┘
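The routing table above reduces to a simple lookup with a fallback to the big creative model. A minimal sketch, assuming the hosts and ports from the deployment tables; the function and dict names are hypothetical, not part of SkyrimNet:

```python
# Hypothetical sketch of variant -> endpoint routing; hosts/ports are from
# the deployment tables above, names are illustrative.
DEFAULT_ENDPOINT = ("10.0.30.21", 31001)  # Theia -> Euryale-70B

VARIANT_ENDPOINTS = {
    # Structured JSON variants -> Dioscuri GPU 0 (Gemma-27B)
    "CharacterProfileGen": ("10.0.30.22", 31004),
    "DiaryGeneration": ("10.0.30.22", 31004),
    "combat": ("10.0.30.22", 31004),
    "gamemaster_evaluation": ("10.0.30.22", 31004),
    "memory": ("10.0.30.22", 31004),
    # Vision -> Dioscuri GPU 1 (Qwen3-VL-8B)
    "vision": ("10.0.30.22", 31005),
}

def endpoint_for(variant: str) -> tuple[str, int]:
    # Dialogue-style variants (Default, AgentDefault, meta, ...) all fall
    # through to the creative model, so only the exceptions are listed.
    return VARIANT_ENDPOINTS.get(variant, DEFAULT_ENDPOINT)
```

Listing only the exceptions keeps the table short and makes "new variant defaults to the creative model" the safe failure mode.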

================================================================================
TOKEN BUDGETS
================================================================================

Variant-specific max_tokens (tuned for task):

┌──────────────────────────┬────────────┬────────────────────────────────────┐
│ Variant                  │ max_tokens │ Notes                              │
├──────────────────────────┼────────────┼────────────────────────────────────┤
│ Default                  │ 4096       │ Full dialogue responses            │
│ AgentDefault             │ 4096       │ Full agent responses               │
│ CharacterProfileGen      │ 2048       │ Structured bio output              │
│ DiaryGeneration          │ 768        │ Compact diary entries              │
│ UniversalTranslator      │ 512        │ Short translations                 │
│ action_evaluation        │ 2048       │ Action descriptions                │
│ combat                   │ 2000       │ Combat narration                   │
│ gamemaster_evaluation    │ 256        │ Quick GM decisions                 │
│ memory                   │ 4096       │ Memory consolidation (large)       │
│ meta                     │ 1024       │ Meta tasks                         │
│ vision                   │ 4000       │ Vision descriptions                │
└──────────────────────────┴────────────┴────────────────────────────────────┘
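The budgets above can live in one table in code. A minimal sketch with values copied from the table; the dict and function names are illustrative, and the fallback value is an assumption, not part of the deployed config:

```python
# Per-variant max_tokens, copied from the budget table above.
MAX_TOKENS = {
    "Default": 4096,
    "AgentDefault": 4096,
    "CharacterProfileGen": 2048,
    "DiaryGeneration": 768,
    "UniversalTranslator": 512,
    "action_evaluation": 2048,
    "combat": 2000,
    "gamemaster_evaluation": 256,
    "memory": 4096,
    "meta": 1024,
    "vision": 4000,
}

def max_tokens_for(variant: str, fallback: int = 1024) -> int:
    # Unknown variants get a conservative mid-size budget (assumed default).
    return MAX_TOKENS.get(variant, fallback)
```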

================================================================================
CONFIGURATION NOTES
================================================================================

Context Management:
- event_history_count_dialogue: 25 (reduced from 50 to fit context)
- Narration: DISABLED in SkyrimNet.yaml (Piper.yaml → narrative: enabled: false)
- Agent prompt includes "spoken dialogue only" instruction

Temperature Settings:
- Dialogue (Euryale): 0.8 (creative, varied)
- Structured (Gemma): 0.3-0.4 (consistent JSON)
- Vision (Qwen-VL): 0.7 (descriptive)

Structured Outputs:
- CharacterProfileGen: use_structured_outputs: true
- DiaryGeneration: use_structured_outputs: true
- memory: use_structured_outputs: true
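For the structured-output variants, the request to the Gemma endpoint would look roughly like the payload below. This is a hedged sketch of an OpenAI-compatible /v1/chat/completions body with a JSON-schema response format; llama.cpp's server accepts a schema-constrained response_format, but exact field names can vary by server and version, and the function name here is hypothetical:

```python
def build_structured_request(prompt: str, schema: dict,
                             temperature: float = 0.3) -> dict:
    # Sketch of an OpenAI-style chat payload with schema-constrained output.
    # temperature 0.3 matches the "Structured (Gemma)" note above; the
    # response_format shape is an assumption about the serving stack.
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": 2048,
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "output", "schema": schema},
        },
    }
```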

================================================================================
LESSONS LEARNED
================================================================================

From March 15-17 experimentation:

1. QWEN MODELS WRAP JSON IN MARKDOWN CODE FENCES
   - Qwen-based models (including some Magidonia variants) output ```json blocks
   - This breaks SkyrimNet's JSON parsing
   - Solution: Use Gemma for JSON, Euryale/Llama for dialogue
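If a fenced model ever has to be used for JSON, the wrapper can be stripped before parsing. A minimal sketch (the function name is hypothetical; routing to Gemma remains the deployed fix):

```python
import re

def strip_json_fences(text: str) -> str:
    # Remove a ```json ... ``` (or bare ```) wrapper if present,
    # leaving the inner JSON for the normal parser.
    m = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    return m.group(1) if m else text.strip()
```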

2. STRICT ROLE ALTERNATION (Qwen/Magidonia via llama.cpp)
   - Qwen Jinja templates enforce strict user/assistant alternation
   - llama.cpp's native server applies the template as written, so it rejects
     non-alternating histories
   - Ollama normalizes templates (more forgiving)
   - If using Qwen-based models: route through Ollama
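An alternative to rerouting is to normalize the history client-side by merging consecutive same-role messages before the template sees them. A minimal sketch under that assumption; the function name is hypothetical:

```python
def enforce_alternation(messages: list[dict]) -> list[dict]:
    # Merge consecutive same-role messages so strict user/assistant
    # alternation templates (e.g. Qwen Jinja) accept the history.
    merged: list[dict] = []
    for msg in messages:
        if merged and merged[-1]["role"] == msg["role"]:
            merged[-1]["content"] += "\n" + msg["content"]
        else:
            merged.append({"role": msg["role"], "content": msg["content"]})
    return merged
```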

3. MODEL SIZE > ABLITERATION FOR JSON
   - 27B follows instructions reliably
   - 8B (even abliterated) struggles with structured output
   - Don't route JSON tasks to small models

4. CONTEXT OVERFLOW PREVENTION
   - Bumped ctx-size: Gemma-27B 4K→16K
   - Reduced event_history_count_dialogue 50→25
   - Right-sized token budgets per variant
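The overflow guard in point 4 comes down to one inequality: prompt tokens plus the variant's completion budget must fit the context window. A minimal sketch using the 16K Gemma-27B window from above; the function name is hypothetical:

```python
def fits_context(prompt_tokens: int, max_tokens: int,
                 ctx_size: int = 16384) -> bool:
    # Prompt plus the reserved completion budget must fit the model's
    # context window (16K for Gemma-27B after the 4K -> 16K bump).
    return prompt_tokens + max_tokens <= ctx_size
```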

================================================================================
FUTURE CONSIDERATIONS
================================================================================

Nimmerverse Integration (see ARCHITECTURE-RESEARCH.md):
- [ ] Oghma Infinium lore RAG → Iris (ChromaDB)
- [ ] Memory migration to unified ChromaDB
- [ ] Knowledge gating service (MCP/HTTP)
- [ ] Gossip network via NATS

Model Upgrades:
- Monitor Euryale updates (currently v2.3)
- Consider Gemma-3 upgrades when available
- Vision: Qwen-VL evolving rapidly

CPU Inference (Deferred):
- Function Gemma 270M was planned but not needed
- Gemma-27B handles structured output well
- Revisit if latency becomes an issue

================================================================================