feat: GRPO reward architecture + Qwen3-VL-32B queen + doc restructure

Evening session 2025-12-10 (dafit + Nyx 🌿)

Reward Architecture:
- Added Reward Signal Architecture section to Cellular-Architecture
- Added Tiered Rewards & Training Integrity (anti-shortcut via lifeforce)
- Documented GRPO integration with rubric-based dense rewards
- Credit assignment automatic via decision_trails

Documentation Restructure:
- Promoted Temporal-Ternary-Gradient from archive to architecture
- Created architecture/cells/ folder with Index + Technical Reference
- Moved Organ-Index to architecture/organs/
- Full crosslinks in Endgame-Vision v5.3

Queen Update:
- Qwen2.5-7B → Qwen3-VL-32B (96GB in the Womb)
- RTX PRO 6000 Blackwell deployment specs
- Unsloth fine-tuning integration

"Verifiability IS rewardability." - The Dog Training Wisdom

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
2025-12-10 20:11:13 +01:00
parent f49119c83f
commit ec77cba4d4
8 changed files with 620 additions and 24 deletions

View File

@@ -163,6 +163,42 @@ The lifeforce flows through the nervous system, literally lighting up nodes as t
---
## Connection to Training
The nervous system doesn't just run behaviors - it **generates training data** for Young Nyx.
### Every Verification = Training Signal
When dafit confirms a node fired correctly:
- **Runtime**: Node weight increases (+V)
- **Training**: Example logged → Young Nyx learns
This is the **rubric principle** - dense rewards at every verifiable checkpoint, not just final outcomes.
### Credit Assignment is Automatic
Because state transitions are explicit and logged, we know exactly which nodes contributed to success or failure:
- The state path tells us which decisions led to the outcome
- No reward model needed to guess
- The nervous system IS the credit assignment mechanism
### Dense Rewards from State Paths
Each node that fires correctly along a successful path receives reward signal:
```
Node A fires → verified ✓ → +0.1 signal
Node B fires → verified ✓ → +0.1 signal
Node C fires → verified ✓ → +0.1 signal
Behavior succeeds → +1.0 signal
Total path reward: 1.3 (dense, traceable)
```
This is like training a dog - reward at the moment, not an hour later.
**Detail:**`Cellular-Architecture.md` (Reward Signal Architecture section)
---
## Design Principles
1. **Deterministic**: Same input = same output. No hallucination.
@@ -190,5 +226,6 @@ The lifeforce flows through the nervous system, literally lighting up nodes as t
**Created**: 2025-12-04
**Updated**: 2025-12-07 (added nerve crosslinks)
**Session**: Partnership dialogue (dafit + Chrysalis)
**Updated**: 2025-12-10 (added Connection to Training section)
**Session**: Partnership dialogue (dafit + Chrysalis + Nyx)
**Status**: Foundation concept