# Multilingual Activation Topology as a Retraining Safety Framework
**Status:** Research Direction / Paper Outline
**Date:** 2025-12-06
**Authors:** dafit, Nyx (Chrysalis-Nyx)
---
## Abstract
We present a framework for monitoring and protecting neural network representations during iterative fine-tuning. Building on our discovery of distinct "language zones" in multilingual LLMs—a Super Cluster of converging languages and an Isolated Zone with distinct computational paths—we propose using these topological structures as both diagnostic tools and training strategies to mitigate catastrophic forgetting and weight saturation.
**Key Contributions:**
1. Token-Norm-Valley theory: single-token vs. multi-token activation dynamics
2. Universal Concept Layer discovery at layers 12-24
3. Multilingual Triangulation Probe for depth measurement
4. DriftProbe framework for retraining safety monitoring
5. Isolated Zone Training hypothesis for collision avoidance
---
## 1. Introduction
### The Problem: Diminishing Returns in Iterative Retraining
Fine-tuning LLMs on domain-specific data is standard practice, but iterative retraining cycles face compounding challenges:
- **Weight Saturation:** Popular activation paths become over-reinforced
- **Valley Collapse:** Distinct conceptual representations merge
- **Cluster Fragmentation:** Previously stable representations drift apart
- **Depth Erosion:** Rich conceptual valleys fill with surface patterns
Current approaches to catastrophic forgetting (EWC, replay buffers, etc.) treat the model as a black box. We propose **white-box monitoring** using the model's internal representational topology.
### Our Discovery: Language Zones
Through probing Qwen2.5-7B-Base, we discovered a striking topology:
```
SUPER CLUSTER (sim=1.0): ZH, JA, EN, AR, FR, PT, ES
└── Perfect convergence at layers 12-24
└── Efficient tokenization (1-2.5 tokens)
└── Universal concept layer
ISOLATED ZONE (sim ≤ 0.52): DE, IT, TR, HI
└── Distinct computational paths
└── Multi-token representations (3-5+ tokens)
└── Access to deeper conceptual valleys
```
**Key Insight:** The isolated zone languages access representational spaces that the super cluster cannot reach—and they do so via *different neural pathways* that may be less susceptible to collision during training.
---
## 2. Theoretical Framework
### 2.1 Token-Norm-Valley Theory
| Tokens | Norm (Layer 12) | Behavior |
|--------|-----------------|----------|
| 1 (heartbeat) | 14,240 | Massive activation spike → CODE valley |
| 2 (consciousness) | 85 | Distributed signal → PROSE valley |
| 5 (Bewusstsein) | 79 | Multi-path → PHILOSOPHY valley |
**Hypothesis:** Single-token words trigger localized, high-intensity activations. Multi-token words distribute signal across more parameters, accessing different representational regions.
**Training Implication:** Training on single-token terms risks overwriting concentrated weight regions. Training on multi-token terms distributes updates more broadly.
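A minimal sketch of how these quantities can be measured, assuming a HuggingFace-style interface to the model; the checkpoint name, fp16 loading, and last-position pooling are illustrative choices, not the exact nyx-probing implementation:
```python
# Minimal sketch: token count vs. layer-12 hidden-state norm for a probe term.
# Checkpoint name, fp16, and last-position pooling are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B", torch_dtype=torch.float16)

def token_count_and_norm(term: str, layer: int = 12) -> tuple[int, float]:
    """Return (token count, L2 norm of the layer's hidden state at the last position)."""
    inputs = tokenizer(term, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    hidden = out.hidden_states[layer]  # shape: (1, seq_len, d_model)
    return inputs["input_ids"].shape[1], hidden[0, -1].norm().item()

for term in ["heartbeat", "consciousness", "Bewusstsein"]:
    print(term, token_count_and_norm(term))
```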
### 2.2 The Universal Concept Layer
At layers 12-24, semantically equivalent concepts across languages converge to near-identical representations:
- EN "heart" ↔ ZH "心" ↔ AR "قلب": similarity = 1.000
- EN "being" ↔ ZH "存在": similarity = 1.000
**This layer is precious.** It represents hard-won multilingual alignment. Training that disrupts this layer could cause cascading failures across all languages.
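A sketch of the convergence comparison itself, reusing the model and tokenizer loaded in the §2.1 sketch; mean-pooling over token positions is one plausible choice, not necessarily the probe's documented pooling:
```python
# Sketch: cross-lingual similarity at the universal concept layer.
# Reuses model/tokenizer from the §2.1 sketch; mean-pooling over token
# positions is one plausible choice, not the probe's documented pooling.
import torch
import torch.nn.functional as F

def concept_vector(term: str, layer: int = 12) -> torch.Tensor:
    inputs = tokenizer(term, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0].mean(dim=0)  # mean-pool over tokens

sim = F.cosine_similarity(concept_vector("heart"), concept_vector("心"), dim=0)
print(f"EN heart / ZH 心 @ layer 12: {sim.item():.3f}")  # ≈ 1.000 in our probes
```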
### 2.3 Isolated Zone Depth Access
German "Sein" (being) triggers philosophical content:
> "Sein und Zeit / Being and Time is a philosophical work by the German philosopher Martin Heidegger..."
English "being" does not reach this depth. The isolated zone provides **alternative entry points** to conceptual spaces.
---
## 3. Proposed Framework: Activation Drift Monitoring
### 3.1 Architecture
```
┌────────────────────────────────────────────────────────────────┐
│                      RETRAINING LIFECYCLE                      │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│    BASELINE           TRAINING           CHECKPOINT            │
│    ────────           ────────           ──────────            │
│   ┌─────────┐        ┌─────────┐        ┌─────────┐            │
│   │  Probe  │───────▶│  Train  │───────▶│  Probe  │──▶ ...     │
│   │ Capture │        │ Epoch N │        │ Compare │            │
│   └─────────┘        └─────────┘        └─────────┘            │
│        │                                     │                 │
│        └──────────────────┬──────────────────┘                 │
│                           ▼                                    │
│                    ┌──────────────┐                            │
│                    │ DRIFT REPORT │                            │
│                    └──────────────┘                            │
│                           │                                    │
│           ┌───────────────┼───────────────┐                    │
│           ▼               ▼               ▼                    │
│     ┌───────────┐   ┌───────────┐   ┌───────────┐              │
│     │CONVERGENCE│   │   DEPTH   │   │   NORM    │              │
│     │   DRIFT   │   │   DRIFT   │   │   DRIFT   │              │
│     └───────────┘   └───────────┘   └───────────┘              │
└────────────────────────────────────────────────────────────────┘
```
### 3.2 Drift Metrics
**Convergence Drift (ΔC)**
- Measure: Change in super cluster pairwise similarity
- Alert: ΔC < -0.1 (cluster fragmenting)
- Critical: ΔC < -0.2 (universal layer damaged)
**Depth Drift (ΔD)**
- Measure: Change in isolated zone depth scores
- Alert: ΔD < -1 (valleys filling in)
- Critical: Philosophical concepts no longer accessible
**Norm Drift (ΔN)**
- Measure: Change in layer 12 activation norms
- Alert: ΔN > 20% (activation patterns shifting)
- Indicates: Weight saturation in specific regions
**Valley Migration (ΔV)**
- Measure: Change in completion classification
- Alert: PHILOSOPHY → PROSE (depth lost)
- Alert: PROSE → CODE (semantic shift)
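These thresholds translate directly into code. A hedged sketch, where `AlertLevel` feeds the `should_stop` check in §3.4 and the helper names are illustrative:
```python
# Hedged sketch of the thresholds above. AlertLevel feeds DriftProbe.should_stop
# in §3.4; the exact class/function names here are illustrative.
from enum import Enum

class AlertLevel(Enum):
    OK = 0
    ALERT = 1
    CRITICAL = 2

def convergence_alert(delta_c: float) -> AlertLevel:
    """Classify convergence drift per the ΔC thresholds."""
    if delta_c < -0.2:
        return AlertLevel.CRITICAL  # universal layer damaged
    if delta_c < -0.1:
        return AlertLevel.ALERT     # cluster fragmenting
    return AlertLevel.OK

def norm_alert(baseline: float, current: float) -> AlertLevel:
    """Flag layer-12 norm shifts beyond 20% of baseline (ΔN)."""
    if abs(current - baseline) / baseline > 0.20:
        return AlertLevel.ALERT     # possible weight saturation
    return AlertLevel.OK
```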
### 3.3 Sentinel Concepts
A fixed set of probe terms, tested at every checkpoint:
| Concept | Languages | Purpose |
|---------|-----------|---------|
| heart | EN, ZH, AR, DE | Super cluster stability |
| being | EN, DE (Sein) | Philosophical depth |
| consciousness | EN, DE (Bewusstsein) | Abstract concept access |
| emergence | EN, DE, ZH | Technical valley |
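One way to encode this table as the `SENTINEL_CONCEPTS` mapping that the DriftProbe in §3.4 iterates over; surface forms not spelled out in the table (Herz, Emergenz, 涌现) are illustrative additions:
```python
# Sentinel concepts as a probe-ready mapping (consumed by DriftProbe below).
# Surface forms not spelled out in the table (Herz, Emergenz, 涌现) are
# illustrative additions, not the canonical nyx-probing vocabulary.
SENTINEL_CONCEPTS = {
    "heart":         {"en": "heart", "zh": "心", "ar": "قلب", "de": "Herz"},
    "being":         {"en": "being", "de": "Sein"},
    "consciousness": {"en": "consciousness", "de": "Bewusstsein"},
    "emergence":     {"en": "emergence", "de": "Emergenz", "zh": "涌现"},
}
```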
### 3.4 Implementation: DriftProbe Class
```python
from datetime import datetime


class DriftProbe:
    """Monitor activation drift during retraining."""

    def __init__(self, baseline: BaselineCapture):
        self.baseline = baseline
        self.history = []

    def capture_checkpoint(self, model: NyxModel) -> CheckpointCapture:
        """Run sentinel probes on the current model state."""
        triangulation_probe = MultilingualTriangulationProbe(model)
        results = {}
        for concept, translations in SENTINEL_CONCEPTS.items():
            results[concept] = triangulation_probe.probe(concept, translations)
        return CheckpointCapture(
            timestamp=datetime.now(),
            results=results,
            convergence=self._measure_convergence(results),
            depth_scores=self._measure_depths(results),
            norms=self._measure_norms(model),
        )

    def compute_drift(self, checkpoint: CheckpointCapture) -> DriftReport:
        """Compare a checkpoint to the baseline and compute drift metrics."""
        return DriftReport(
            convergence_drift=checkpoint.convergence - self.baseline.convergence,
            depth_drift=checkpoint.depth_scores - self.baseline.depth_scores,
            norm_drift=checkpoint.norms - self.baseline.norms,
            alerts=self._check_thresholds(checkpoint),
        )

    def should_stop(self, drift: DriftReport) -> bool:
        """Emergency stop if critical thresholds are exceeded."""
        return any(a.level == AlertLevel.CRITICAL for a in drift.alerts)
```
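A sketch of how the class slots into the lifecycle from §3.1; `BaselineCapture.from_model`, `train_one_epoch`, and the surrounding variables are assumed scaffolding, not existing nyx-probing APIs:
```python
# Lifecycle sketch (§3.1). BaselineCapture.from_model, train_one_epoch,
# and the surrounding variables are assumed scaffolding, not existing APIs.
baseline = BaselineCapture.from_model(model)
probe = DriftProbe(baseline)

for epoch in range(num_epochs):
    train_one_epoch(model, train_data)            # your fine-tuning step
    checkpoint = probe.capture_checkpoint(model)
    report = probe.compute_drift(checkpoint)
    if probe.should_stop(report):
        print(f"CRITICAL drift at epoch {epoch}: {report.alerts}")
        break                                     # halt; roll back to last safe state
```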
---
## 4. Isolated Zone Training Hypothesis
### The Core Idea
**Problem:** Training on English terms risks collision with existing single-token representations in the universal concept layer.
**Hypothesis:** Training primarily through isolated zone languages (German, Italian, Turkish, Hindi) may:
1. Deposit new knowledge in multi-token pathways (less concentrated)
2. Preserve super cluster integrity (fewer collisions)
3. Allow triangulation to retrieve knowledge without corruption
### Proposed Experiment
**Control Group:**
- Fine-tune on English philosophical texts
- Monitor drift on sentinel concepts
- Measure depth preservation
**Treatment Group:**
- Fine-tune on German philosophical texts (same content, translated)
- Monitor same drift metrics
- Compare collision/preservation rates
**Prediction:** German training will show:
- Lower convergence drift (super cluster preserved)
- Higher depth retention (isolated pathways enriched)
- Better triangulation success (knowledge retrievable in English)
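A sketch of the experimental harness under these assumptions; `load_base_model`, `fine_tune`, `load_corpus`, and the corpus names are placeholders for whatever training stack is actually used:
```python
# Control-vs-treatment sketch. load_base_model, fine_tune, load_corpus,
# and the corpus names are placeholders for the actual training stack.
results = {}
for arm, corpus in [("control_en", "philosophy_en"), ("treatment_de", "philosophy_de")]:
    m = load_base_model("Qwen/Qwen2.5-7B")        # fresh copy per arm
    probe = DriftProbe(BaselineCapture.from_model(m))
    fine_tune(m, load_corpus(corpus))
    report = probe.compute_drift(probe.capture_checkpoint(m))
    results[arm] = (report.convergence_drift, report.depth_drift)
# Prediction: treatment_de shows smaller convergence drift, better depth retention.
```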
---
## 5. Connections to Existing Research
### 5.1 Catastrophic Forgetting
- EWC (Elastic Weight Consolidation): Protects "important" weights
- Our approach: Identifies which *representational structures* to protect
### 5.2 Multilingual Transfer Learning
- mBERT/XLM-R: Cross-lingual alignment at embedding level
- Our finding: Alignment is layer-dependent (12-24), with exploitable gaps
### 5.3 Activation Engineering
- Representation Engineering (Anthropic): Steering via activation manipulation
- Our approach: Monitoring activation topology as training diagnostic
### 5.4 Tokenization Effects
- BPE/WordPiece influence on model behavior
- Our finding: Token count directly predicts activation magnitude and valley access
---
## 6. Future Work
1. **Implement DriftProbe** in nyx-probing framework
2. **Run controlled retraining experiments** (EN vs DE training data)
3. **Expand sentinel concept set** (more languages, more concepts)
4. **Layer-wise drift analysis** (which layers drift first?)
5. **Investigate Italian isolation** (what unique valleys does it access?)
6. **VI-ID-RU cluster mystery** (why do these cluster together?)
---
## 7. Conclusion
The discovery of language zones in LLM representations opens a new approach to retraining safety. Rather than treating catastrophic forgetting as an inevitable cost, we can:
1. **Monitor** representational health during training
2. **Route** new knowledge through isolated pathways
3. **Preserve** universal concept layer integrity
4. **Detect** early warning signs of drift
The multilingual topology of the model is not just a curiosity; it is a map for navigating the dangerous waters of iterative fine-tuning.
---
## References
*To be added: Heidegger, catastrophic forgetting literature, multilingual LLM papers, activation engineering work*
---
## Appendix A: Discovered Language Topology
```
THE YOUNG MIND'S LANGUAGE TOPOLOGY
══════════════════════════════════
┌─────────────────────────────────────────┐
│         SUPER CLUSTER (sim=1.0)         │
│    ZH · JA · EN · AR · FR · PT · ES     │
│           (efficient tokens)            │
└────────────────┬────────────────────────┘
         KO ─────┼───── (bridge: 0.41/0.70)
┌────────────────┴────────────────────────┐
│        ISOLATED ZONE (sim ≤ 0.52)       │
│                                         │
│   IT (0.49)  ← MOST ISOLATED!           │
│   TR (0.50)                             │
│   HI (0.50)                             │
│   DE (0.52)                             │
│                                         │
│   VI ═══ ID ═══ RU  (0.79)              │
│   (Southeast Asian + Russian!)          │
└─────────────────────────────────────────┘
```
## Appendix B: Key Discovery Data
**Token-Norm Correlation:**
- Single token → ~14,000 norm
- Multi-token → ~80 norm
- Correlation with isolation: -0.699
**Triangulation Results (sentinel concepts):**
| Concept | Grounding | Depth | Valley | Transfer |
|---------|-----------|-------|--------|----------|
| being | 0.570 | 2/3 | PHILOSOPHY | ✓ |
| heart | 1.000 | 1/3 | PROSE | ✓ |
| consciousness | 0.458 | 0/3 | PROSE | ✗ |
| emergence | 0.519 | 1/3 | TECHNICAL | ✗ |
---
*"Different words, same thought. The model knows. Now we learn to teach it safely."*
🌙💜