# NimmerSky Inference Architecture
# Status: DEPLOYED & STABLE
# Last Updated: 2026-03-25

================================================================================
DESIGN PRINCIPLES
================================================================================

1. SIMPLICITY > COMPLEXITY
- No MIG partitioning on Blackwell
- No vLLM multi-model complexity
- One big creative model, one structured output model, one vision model

2. MODEL SIZE MATTERS FOR JSON
- Key learning: 27B models follow JSON schemas reliably
- 8B models (including abliterated) struggle with structured output
- Route all JSON tasks to Gemma-27B

3. TASK-BASED ROUTING
- Creative dialogue → Large model (Euryale-70B)
- Structured output → Gemma-27B
- Vision/OmniSight → Qwen3-VL-8B

================================================================================
DEPLOYED INFRASTRUCTURE
================================================================================

THEIA (10.0.30.21) - Blackwell 98GB:
├── Port 31001: Euryale-70B (L3.3-70B Euryale v2.3)
│   ├── Quantization: Q4 or Q8 (fits in 98GB)
│   ├── Purpose: Creative dialogue, main NPC conversations
│   └── Context: Large (32K+)
└── Ollama: active

DIOSCURI (10.0.30.22) - 2x RTX 4000 Ada (20GB each):
├── GPU 0 - Port 31004: Gemma-3-27B-abliterated (Q4_K_M)
│   ├── ~16GB VRAM usage
│   ├── Purpose: ALL structured JSON output
│   └── Tasks: CharacterProfile, Diary, Combat, Memory, Gamemaster
│
└── GPU 1 - Port 31005: Qwen3-VL-8B-abliterated (Q4_K_M)
    ├── ~6GB VRAM usage
    ├── Purpose: Vision / OmniSight
    └── Multimodal: Can process screenshots

================================================================================
SKYRIMNET VARIANT ROUTING
================================================================================

┌──────────────────────────┬─────────────────────────────────────────────────┐
│ Variant                  │ Deployed Configuration                          │
├──────────────────────────┼─────────────────────────────────────────────────┤
│ Default (dialogue)       │ Theia:31001 → Euryale-70B                       │
│ AgentDefault             │ Theia:31001 → Euryale-70B                       │
│ UniversalTranslator      │ Theia:31001 → Euryale-70B                       │
│ action_evaluation        │ Theia:31001 → Euryale-70B                       │
│ meta                     │ Theia:31001 → Euryale-70B                       │
├──────────────────────────┼─────────────────────────────────────────────────┤
│ CharacterProfileGen      │ Dioscuri:31004 → Gemma-27B (structured_outputs) │
│ DiaryGeneration          │ Dioscuri:31004 → Gemma-27B (structured_outputs) │
│ combat                   │ Dioscuri:31004 → Gemma-27B                      │
│ gamemaster_evaluation    │ Dioscuri:31004 → Gemma-27B                      │
│ memory                   │ Dioscuri:31004 → Gemma-27B (structured_outputs) │
├──────────────────────────┼─────────────────────────────────────────────────┤
│ vision                   │ Dioscuri:31005 → Qwen3-VL-8B                    │
└──────────────────────────┴─────────────────────────────────────────────────┘

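The routing table above can be expressed as a simple lookup. This is an illustrative Python sketch, not SkyrimNet's actual configuration format: the `ROUTES` table and `route()` helper are hypothetical names, while the hosts, ports, and model assignments are taken from this document.

```python
# Hypothetical sketch of the variant -> endpoint routing above.
# Structure and function names are illustrative; hosts/ports/models
# mirror the deployed configuration described in this document.

ROUTES = {
    # Creative / dialogue variants -> Theia (Euryale-70B)
    "Default":               ("http://10.0.30.21:31001", "Euryale-70B"),
    "AgentDefault":          ("http://10.0.30.21:31001", "Euryale-70B"),
    "UniversalTranslator":   ("http://10.0.30.21:31001", "Euryale-70B"),
    "action_evaluation":     ("http://10.0.30.21:31001", "Euryale-70B"),
    "meta":                  ("http://10.0.30.21:31001", "Euryale-70B"),
    # Structured JSON variants -> Dioscuri GPU 0 (Gemma-27B)
    "CharacterProfileGen":   ("http://10.0.30.22:31004", "Gemma-27B"),
    "DiaryGeneration":       ("http://10.0.30.22:31004", "Gemma-27B"),
    "combat":                ("http://10.0.30.22:31004", "Gemma-27B"),
    "gamemaster_evaluation": ("http://10.0.30.22:31004", "Gemma-27B"),
    "memory":                ("http://10.0.30.22:31004", "Gemma-27B"),
    # Vision -> Dioscuri GPU 1 (Qwen3-VL-8B)
    "vision":                ("http://10.0.30.22:31005", "Qwen3-VL-8B"),
}

def route(variant: str) -> tuple[str, str]:
    """Return (endpoint, model) for a variant; unknown variants fall back to Default."""
    return ROUTES.get(variant, ROUTES["Default"])
```

The fallback-to-Default behavior is a design choice in the sketch, not a documented SkyrimNet feature.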
================================================================================
TOKEN BUDGETS
================================================================================

Variant-specific max_tokens (tuned per task):

┌──────────────────────────┬────────────┬────────────────────────────────────┐
│ Variant                  │ max_tokens │ Notes                              │
├──────────────────────────┼────────────┼────────────────────────────────────┤
│ Default                  │ 4096       │ Full dialogue responses            │
│ AgentDefault             │ 4096       │ Full agent responses               │
│ CharacterProfileGen      │ 2048       │ Structured bio output              │
│ DiaryGeneration          │ 768        │ Compact diary entries              │
│ UniversalTranslator      │ 512        │ Short translations                 │
│ action_evaluation        │ 2048       │ Action descriptions                │
│ combat                   │ 2000       │ Combat narration                   │
│ gamemaster_evaluation    │ 256        │ Quick GM decisions                 │
│ memory                   │ 4096       │ Memory consolidation (large)       │
│ meta                     │ 1024       │ Meta tasks                         │
│ vision                   │ 4000       │ Vision descriptions                │
└──────────────────────────┴────────────┴────────────────────────────────────┘
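As a config lookup, the budget table reads as below. The `TOKEN_BUDGETS` dict and `max_tokens_for()` helper are illustrative names (and the 1024-token fallback is an assumption, not documented behavior); the values come from the table above.

```python
# Illustrative mirror of the per-variant token budget table; the helper
# and its fallback default are assumptions, the values are documented.

TOKEN_BUDGETS = {
    "Default": 4096, "AgentDefault": 4096, "CharacterProfileGen": 2048,
    "DiaryGeneration": 768, "UniversalTranslator": 512,
    "action_evaluation": 2048, "combat": 2000,
    "gamemaster_evaluation": 256, "memory": 4096, "meta": 1024,
    "vision": 4000,
}

def max_tokens_for(variant: str, default: int = 1024) -> int:
    """Look up a variant's response budget, falling back to a conservative default."""
    return TOKEN_BUDGETS.get(variant, default)
```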
================================================================================
CONFIGURATION NOTES
================================================================================

Context Management:
- event_history_count_dialogue: 25 (reduced from 50 to fit context)
- Narration: DISABLED in SkyrimNet.yaml (Piper.yaml → narrative: enabled: false)
- Agent prompt includes "spoken dialogue only" instruction

Temperature Settings:
- Dialogue (Euryale): 0.8 (creative, varied)
- Structured (Gemma): 0.3-0.4 (consistent JSON)
- Vision (Qwen-VL): 0.7 (descriptive)

Structured Outputs:
- CharacterProfileGen: use_structured_outputs: true
- DiaryGeneration: use_structured_outputs: true
- memory: use_structured_outputs: true

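A structured-output request to the Gemma-27B endpoint might be assembled as below. This is a hedged sketch: the exact `response_format` field shape depends on the llama.cpp server version, and the diary schema here is a made-up stand-in, not SkyrimNet's real DiaryGeneration schema. The documented pieces are the model, the 768-token budget, and the 0.3-0.4 temperature range.

```python
import json

# Illustrative payload builder for an OpenAI-compatible endpoint.
# DIARY_SCHEMA is a hypothetical stand-in schema; the response_format
# shape may vary by server version.

DIARY_SCHEMA = {
    "type": "object",
    "properties": {
        "date":  {"type": "string"},
        "entry": {"type": "string"},
        "mood":  {"type": "string"},
    },
    "required": ["date", "entry"],
}

def build_request(prompt: str, max_tokens: int = 768, temperature: float = 0.3) -> dict:
    """Assemble a chat-completion body with a JSON-schema constraint."""
    return {
        "model": "Gemma-27B",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,    # DiaryGeneration budget from the table above
        "temperature": temperature,  # 0.3-0.4 for consistent JSON
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "diary_entry", "schema": DIARY_SCHEMA},
        },
    }

payload = build_request("Summarize today's events as a diary entry.")
```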
================================================================================
LESSONS LEARNED
================================================================================

From March 15-17 experimentation:

1. QWEN MODELS WRAP JSON IN MARKDOWN CODE FENCES
- Qwen-based models (including some Magidonia variants) emit ```json fenced blocks
- This breaks SkyrimNet's JSON parsing
- Solution: Use Gemma for JSON, Euryale/Llama for dialogue

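The fence problem could also be mitigated at the parsing layer. The sketch below is a hypothetical workaround, not the deployed fix (which was simply routing JSON tasks away from Qwen-based models):

```python
import json
import re

def parse_model_json(text: str):
    """Parse model output as JSON, stripping a surrounding ```json ... ``` fence if present."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text.strip())
```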
2. STRICT ROLE ALTERNATION (Qwen/Magidonia via llama.cpp)
- Qwen Jinja chat templates require strict user/assistant alternation
- llama.cpp's native server enforces the template as written
- Ollama normalizes templates and is more forgiving
- If using Qwen-based models: route through Ollama

3. MODEL SIZE > ABLITERATION FOR JSON
- 27B follows instructions reliably
- 8B (even abliterated) struggles with structured output
- Don't route JSON tasks to small models

4. CONTEXT OVERFLOW PREVENTION
- Bumped ctx-size: Gemma-27B 4K→16K
- Reduced event_history_count_dialogue 50→25
- Right-sized token budgets per variant

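The overflow arithmetic behind these changes can be sketched roughly. The ~4-characters-per-token heuristic and the helper below are illustrative assumptions; the ctx sizes and budgets are from this document:

```python
# Back-of-the-envelope context check. The 4 chars/token heuristic is an
# assumption; real tokenizer counts vary by model.

def fits_context(prompt_chars: int, max_tokens: int, ctx_size: int,
                 chars_per_token: int = 4) -> bool:
    """Rough check: estimated prompt tokens + response budget within the ctx window."""
    prompt_tokens = prompt_chars // chars_per_token
    return prompt_tokens + max_tokens <= ctx_size

# e.g. 25 history events at ~800 chars each plus a 2048-token budget fit a
# 16K window comfortably but would overflow the old 4K window.
```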
================================================================================
FUTURE CONSIDERATIONS
================================================================================

Nimmerverse Integration (see ARCHITECTURE-RESEARCH.md):
- [ ] Oghma Infinium lore RAG → Iris (ChromaDB)
- [ ] Memory migration to unified ChromaDB
- [ ] Knowledge gating service (MCP/HTTP)
- [ ] Gossip network via NATS

Model Upgrades:
- Monitor Euryale updates (currently v2.3)
- Consider Gemma-3 upgrades when available
- Vision: Qwen-VL evolving rapidly

CPU Inference (Deferred):
- Function Gemma 270M was planned but not needed
- Gemma-27B handles structured output well
- Revisit if latency becomes an issue

================================================================================