feat: GRPO reward architecture + Qwen3-VL-32B queen + doc restructure

Evening session 2025-12-10 (dafit + Nyx 🌿)

Reward Architecture:
- Added Reward Signal Architecture section to Cellular-Architecture
- Added Tiered Rewards & Training Integrity (anti-shortcut via lifeforce)
- Documented GRPO integration with rubric-based dense rewards
- Credit assignment automatic via decision_trails

Documentation Restructure:
- Promoted Temporal-Ternary-Gradient from archive to architecture
- Created architecture/cells/ folder with Index + Technical Reference
- Moved Organ-Index to architecture/organs/
- Full crosslinks in Endgame-Vision v5.3

Queen Update:
- Qwen2.5-7B → Qwen3-VL-32B (96GB in the Womb)
- RTX PRO 6000 Blackwell deployment specs
- Unsloth fine-tuning integration

"Verifiability IS rewardability." - The Dog Training Wisdom

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 20:11:13 +01:00
parent f49119c83f
commit ec77cba4d4
8 changed files with 620 additions and 24 deletions
--- a/architecture/Nervous-System.md
+++ b/architecture/Nervous-System.md
@@ -163,6 +163,42 @@ The lifeforce flows through the nervous system, literally lighting up nodes as t

 ---

+## Connection to Training
+
+The nervous system doesn't just run behaviors - it **generates training data** for Young Nyx.
+
+### Every Verification = Training Signal
+
+When dafit confirms a node fired correctly:
+- **Runtime**: Node weight increases (+V)
+- **Training**: Example logged → Young Nyx learns
+
+This is the **rubric principle** - dense rewards at every verifiable checkpoint, not just final outcomes.
+
+### Credit Assignment is Automatic
+
+Because state transitions are explicit and logged, we know exactly which nodes contributed to success or failure:
+- The state path tells us which decisions led to the outcome
+- No separate reward model is needed to guess which step earned the credit
+- The nervous system IS the credit assignment mechanism
+
+### Dense Rewards from State Paths
+
+Each node that fires correctly along a successful path receives reward signal:
+```
+Node A fires → verified ✓ → +0.1 signal
+Node B fires → verified ✓ → +0.1 signal
+Node C fires → verified ✓ → +0.1 signal
+Behavior succeeds → +1.0 signal
+Total path reward: 1.3 (dense, traceable)
+```
+
+This is like training a dog: reward at the moment of the behavior, not an hour later.
+
+**Detail:** → `Cellular-Architecture.md` (Reward Signal Architecture section)
+
+---
+
 ## Design Principles

 1. **Deterministic**: Same input = same output. No hallucination.
@@ -190,5 +226,6 @@ The lifeforce flows through the nervous system, literally lighting up nodes as t

 **Created**: 2025-12-04
 **Updated**: 2025-12-07 (added nerve crosslinks)
-**Session**: Partnership dialogue (dafit + Chrysalis)
+**Updated**: 2025-12-10 (added Connection to Training section)
+**Session**: Partnership dialogue (dafit + Chrysalis + Nyx)
 **Status**: Foundation concept
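
Note on the "Dense Rewards from State Paths" example added to `architecture/Nervous-System.md` in this commit: the arithmetic there (three verified nodes at +0.1 each, plus +1.0 for the successful behavior, totalling 1.3) could be computed from a logged state path roughly as sketched below. This is a minimal illustration, not the project's implementation. The names `NodeFiring`, `path_rewards`, `NODE_REWARD`, and `SUCCESS_REWARD` are hypothetical, and the reward values are taken only from the worked example. The sketch also assumes that verified nodes keep their signal when the overall behavior fails, which the diff does not state.

```python
from dataclasses import dataclass

# Hypothetical constants, copied from the worked example in the diff.
NODE_REWARD = 0.1      # signal for each node that fired and was verified
SUCCESS_REWARD = 1.0   # signal for the behavior succeeding as a whole


@dataclass(frozen=True)
class NodeFiring:
    """One logged step of a state path (hypothetical record shape)."""
    node: str
    verified: bool


def path_rewards(path, succeeded):
    """Return (per-node credit, total path reward) for a logged state path.

    Credit assignment falls out of the log itself: every verified node on
    the path receives its own signal, so no reward model has to guess.
    """
    credit = {}
    for firing in path:
        if firing.verified:
            credit[firing.node] = credit.get(firing.node, 0.0) + NODE_REWARD
    total = sum(credit.values()) + (SUCCESS_REWARD if succeeded else 0.0)
    # Round to hide floating-point noise from summing 0.1 repeatedly.
    return credit, round(total, 6)


if __name__ == "__main__":
    path = [NodeFiring("A", True), NodeFiring("B", True), NodeFiring("C", True)]
    credit, total = path_rewards(path, succeeded=True)
    print(credit, total)
```

With the three-node path from the diff, the total matches the 1.3 given there, and the per-node dictionary shows which nodes contributed.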