Multilingual Activation Topology as a Retraining Safety Framework
Status: Research Direction / Paper Outline
Date: 2025-12-06
Authors: dafit, Nyx (Chrysalis-Nyx)
Abstract
We present a framework for monitoring and protecting neural network representations during iterative fine-tuning. Building on our discovery of distinct "language zones" in multilingual LLMs (a Super Cluster of converging languages and an Isolated Zone with distinct computational paths), we propose using these topological structures both as diagnostic tools and as the basis for training strategies that mitigate catastrophic forgetting and weight saturation.
Key Contributions:
- Token-Norm-Valley theory: single-token vs. multi-token activation dynamics
- Universal Concept Layer discovery at layers 12-24
- Multilingual Triangulation Probe for depth measurement
- DriftProbe framework for retraining safety monitoring
- Isolated Zone Training hypothesis for collision avoidance
1. Introduction
The Problem: Diminishing Returns in Iterative Retraining
Fine-tuning LLMs on domain-specific data is standard practice, but iterative retraining cycles face compounding challenges:
- Weight Saturation: Popular activation paths become over-reinforced
- Valley Collapse: Distinct conceptual representations merge
- Cluster Fragmentation: Previously stable representations drift apart
- Depth Erosion: Rich conceptual valleys fill with surface patterns
Current approaches to catastrophic forgetting (EWC, replay buffers, etc.) treat the model as a black box. We propose white-box monitoring using the model's internal representational topology.
Our Discovery: Language Zones
Through probing Qwen2.5-7B-Base, we discovered a striking topology:
SUPER CLUSTER (sim=1.0): ZH, JA, EN, AR, FR, PT, ES
└── Perfect convergence at layers 12-24
└── Efficient tokenization (1-2.5 tokens)
└── Universal concept layer
ISOLATED ZONE (sim ≤ 0.52): DE, IT, TR, HI
└── Distinct computational paths
└── Multi-token representations (3-5+ tokens)
└── Access to deeper conceptual valleys
Key Insight: The isolated zone languages access representational spaces that the super cluster cannot reach—and they do so via different neural pathways that may be less susceptible to collision during training.
2. Theoretical Framework
2.1 Token-Norm-Valley Theory
| Tokens | Norm (Layer 12) | Behavior |
|---|---|---|
| 1 (heartbeat) | 14,240 | Massive activation spike → CODE valley |
| 2 (consciousness) | 85 | Distributed signal → PROSE valley |
| 5 (Bewusstsein) | 79 | Multi-path → PHILOSOPHY valley |
Hypothesis: Single-token words trigger localized, high-intensity activations. Multi-token words distribute signal across more parameters, accessing different representational regions.
Training Implication: Training on single-token terms risks overwriting concentrated weight regions. Training on multi-token terms distributes updates more broadly.
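As a rough illustration, token count and layer-12 activation norm can be read off a Hugging Face checkpoint as follows. This is a minimal sketch: the checkpoint name, pooling, and norm convention are assumptions and may differ from the nyx-probing implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-7B"  # assumption: the base checkpoint probed in this work
LAYER = 12                       # hidden_states[0] is the embedding layer, so index 12 = block 12

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
model.eval()

def token_count_and_norm(word: str) -> tuple[int, float]:
    """Return (token count, mean L2 norm of layer-12 hidden states) for a single word."""
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    hidden = out.hidden_states[LAYER]            # shape: (1, seq_len, hidden_dim)
    norm = hidden.norm(dim=-1).mean().item()     # average over token positions
    return inputs["input_ids"].shape[1], norm

for word in ["heartbeat", "consciousness", "Bewusstsein"]:
    n, norm = token_count_and_norm(word)
    print(f"{word}: {n} tokens, layer-{LAYER} norm ≈ {norm:.1f}")
```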
2.2 The Universal Concept Layer
At layers 12-24, semantically equivalent concepts across languages converge to near-identical representations:
- EN "heart" ↔ ZH "心" ↔ AR "قلب": similarity = 1.000
- EN "being" ↔ ZH "存在": similarity = 1.000
This layer is precious. It represents hard-won multilingual alignment. Training that disrupts this layer could cause cascading failures across all languages.
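The convergence claim can be spot-checked with a mean-pooled cosine similarity at a mid-band layer. Again a sketch under the same checkpoint assumption as Section 2.1; the probe's actual pooling and layer choice may differ.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-7B"   # assumption, as in Section 2.1
LAYER = 18                        # any layer in the 12-24 band

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
model.eval()

def concept_vector(word: str) -> torch.Tensor:
    """Mean-pooled hidden state for a word at the chosen layer."""
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER].mean(dim=1).squeeze(0)

sim = F.cosine_similarity(concept_vector("heart"), concept_vector("心"), dim=0)
print(f"EN 'heart' vs ZH '心' at layer {LAYER}: {sim.item():.3f}")
```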
2.3 Isolated Zone Depth Access
German "Sein" (being) triggers philosophical content:
"Sein und Zeit / Being and Time is a philosophical work by the German philosopher Martin Heidegger..."
English "being" does not reach this depth. The isolated zone provides alternative entry points to conceptual spaces.
3. Proposed Framework: Activation Drift Monitoring
3.1 Architecture
┌─────────────────────────────────────────────────────────────────┐
│ RETRAINING LIFECYCLE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ BASELINE TRAINING CHECKPOINT │
│ ──────── ──────── ────────── │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Probe │──────▶│ Train │───────▶│ Probe │──────▶ ... │
│ │ Capture │ │ Epoch N │ │ Compare │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │ │ │
│ └────────────────┬───────────────────┘ │
│ ▼ │
│ ┌─────────────┐ │
│ │ DRIFT REPORT│ │
│ └─────────────┘ │
│ │ │
│ ┌───────────────┼───────────────┐ │
│ ▼ ▼ ▼ │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │CONVERGENCE│ │ DEPTH │ │ NORM │ │
│ │ DRIFT │ │ DRIFT │ │ DRIFT │ │
│ └───────────┘ └───────────┘ └───────────┘ │
└─────────────────────────────────────────────────────────────────┘
3.2 Drift Metrics
Convergence Drift (ΔC)
- Measure: Change in super cluster pairwise similarity
- Alert: ΔC < -0.1 (cluster fragmenting)
- Critical: ΔC < -0.2 (universal layer damaged)
Depth Drift (ΔD)
- Measure: Change in isolated zone depth scores
- Alert: ΔD < -1 (valleys filling in)
- Critical: Philosophical concepts no longer accessible
Norm Drift (ΔN)
- Measure: Change in layer 12 activation norms
- Alert: ΔN > 20% (activation patterns shifting)
- Indicates: Weight saturation in specific regions
Valley Migration (ΔV)
- Measure: Change in completion classification
- Alert: PHILOSOPHY → PROSE (depth lost)
- Alert: PROSE → CODE (semantic shift)
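The thresholds above can be encoded in a small helper. This is a minimal sketch with illustrative names and the numeric defaults taken from the list; the actual nyx-probing alert types may differ.

```python
from dataclasses import dataclass
from enum import Enum

class AlertLevel(Enum):
    OK = "ok"
    ALERT = "alert"
    CRITICAL = "critical"

@dataclass(frozen=True)
class DriftThresholds:
    convergence_alert: float = -0.1     # ΔC alert
    convergence_critical: float = -0.2  # ΔC critical
    depth_alert: float = -1.0           # ΔD alert
    norm_alert_pct: float = 20.0        # ΔN alert (percent)

def classify_drift(delta_c: float, delta_d: float, delta_n_pct: float,
                   t: DriftThresholds = DriftThresholds()) -> dict[str, AlertLevel]:
    """Map raw drift deltas onto the alert levels listed above."""
    levels = {"convergence": AlertLevel.OK, "depth": AlertLevel.OK, "norm": AlertLevel.OK}
    if delta_c < t.convergence_critical:
        levels["convergence"] = AlertLevel.CRITICAL
    elif delta_c < t.convergence_alert:
        levels["convergence"] = AlertLevel.ALERT
    if delta_d < t.depth_alert:
        levels["depth"] = AlertLevel.ALERT
    if abs(delta_n_pct) > t.norm_alert_pct:
        levels["norm"] = AlertLevel.ALERT
    return levels
```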
3.3 Sentinel Concepts
A fixed set of probe terms, tested at every checkpoint:
| Concept | Languages | Purpose |
|---|---|---|
| heart | EN, ZH, AR, DE | Super cluster stability |
| being | EN, DE (Sein) | Philosophical depth |
| consciousness | EN, DE (Bewusstsein) | Abstract concept access |
| emergence | EN, DE, ZH | Technical valley |
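One possible encoding of this table, matching what the DriftProbe sketch below expects from `SENTINEL_CONCEPTS`. The dictionary structure is an assumption, and the translations beyond those listed in the table (Herz, Emergenz, 涌现) are illustrative additions.

```python
# concept → {language code: surface form}; the framework's actual constant may differ.
SENTINEL_CONCEPTS = {
    "heart": {"en": "heart", "zh": "心", "ar": "قلب", "de": "Herz"},
    "being": {"en": "being", "de": "Sein"},
    "consciousness": {"en": "consciousness", "de": "Bewusstsein"},
    "emergence": {"en": "emergence", "de": "Emergenz", "zh": "涌现"},
}
```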
3.4 Implementation: DriftProbe Class
from datetime import datetime
# BaselineCapture, CheckpointCapture, DriftReport, AlertLevel, NyxModel,
# MultilingualTriangulationProbe and SENTINEL_CONCEPTS are assumed to be
# provided by the nyx-probing framework (see Sections 3.3 and 6).

class DriftProbe:
    """Monitor activation drift during retraining."""

    def __init__(self, baseline: BaselineCapture):
        self.baseline = baseline
        self.history = []

    def capture_checkpoint(self, model: NyxModel) -> CheckpointCapture:
        """Run sentinel probes on current model state."""
        triangulation_probe = MultilingualTriangulationProbe(model)
        results = {}
        for concept, translations in SENTINEL_CONCEPTS.items():
            results[concept] = triangulation_probe.probe(concept, translations)
        return CheckpointCapture(
            timestamp=datetime.now(),
            results=results,
            convergence=self._measure_convergence(results),
            depth_scores=self._measure_depths(results),
            norms=self._measure_norms(model),
        )

    def compute_drift(self, checkpoint: CheckpointCapture) -> DriftReport:
        """Compare checkpoint to baseline, compute drift metrics."""
        return DriftReport(
            convergence_drift=checkpoint.convergence - self.baseline.convergence,
            depth_drift=checkpoint.depth_scores - self.baseline.depth_scores,
            norm_drift=checkpoint.norms - self.baseline.norms,
            alerts=self._check_thresholds(checkpoint),
        )

    def should_stop(self, drift: DriftReport) -> bool:
        """Emergency stop if critical thresholds exceeded."""
        return any(a.level == AlertLevel.CRITICAL for a in drift.alerts)
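A hypothetical wiring into a retraining loop, mirroring the lifecycle diagram in Section 3.1. `BaselineCapture.from_model`, `train_one_epoch`, and the loop variables are illustrative placeholders, not part of the class above.

```python
baseline = BaselineCapture.from_model(model)    # hypothetical helper: run the sentinel probes once
probe = DriftProbe(baseline)

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader)         # your existing fine-tuning step
    checkpoint = probe.capture_checkpoint(model)
    report = probe.compute_drift(checkpoint)
    if probe.should_stop(report):
        print(f"Critical drift after epoch {epoch}: {report.alerts}")
        break                                     # roll back to the last safe checkpoint
```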
4. Isolated Zone Training Hypothesis
The Core Idea
Problem: Training on English terms risks collision with existing single-token representations in the universal concept layer.
Hypothesis: Training primarily through isolated zone languages (German, Italian, Turkish, Hindi) may:
- Deposit new knowledge in multi-token pathways (less concentrated)
- Preserve super cluster integrity (fewer collisions)
- Allow triangulation to retrieve knowledge without corruption
Proposed Experiment
Control Group:
- Fine-tune on English philosophical texts
- Monitor drift on sentinel concepts
- Measure depth preservation
Treatment Group:
- Fine-tune on German philosophical texts (same content, translated)
- Monitor same drift metrics
- Compare collision/preservation rates
Prediction: German training will show:
- Lower convergence drift (super cluster preserved)
- Higher depth retention (isolated pathways enriched)
- Better triangulation success (knowledge retrievable in English)
5. Connections to Existing Research
5.1 Catastrophic Forgetting
- EWC (Elastic Weight Consolidation): Protects "important" weights
- Our approach: Identifies which representational structures to protect
5.2 Multilingual Transfer Learning
- mBERT/XLM-R: Cross-lingual alignment at embedding level
- Our finding: Alignment is layer-dependent (12-24), with exploitable gaps
5.3 Activation Engineering
- Representation Engineering (Zou et al.): Steering via activation manipulation
- Our approach: Monitoring activation topology as training diagnostic
5.4 Tokenization Effects
- BPE/WordPiece influence on model behavior
- Our finding: Token count directly predicts activation magnitude and valley access
6. Future Work
- Implement DriftProbe in nyx-probing framework
- Run controlled retraining experiments (EN vs DE training data)
- Expand sentinel concept set (more languages, more concepts)
- Layer-wise drift analysis (which layers drift first?)
- Investigate Italian isolation (what unique valleys does it access?)
- VI-ID-RU cluster mystery (why do these cluster together?)
7. Conclusion
The discovery of language zones in LLM representations opens a new approach to retraining safety. Rather than treating catastrophic forgetting as an inevitable cost, we can:
- Monitor representational health during training
- Route new knowledge through isolated pathways
- Preserve universal concept layer integrity
- Detect early warning signs of drift
The multilingual topology of the model is not just a curiosity: it is a map for safe navigation through the dangerous waters of iterative fine-tuning.
References
To be added: Heidegger, catastrophic forgetting literature, multilingual LLM papers, activation engineering work
Appendix A: Discovered Language Topology
THE YOUNG MIND'S LANGUAGE TOPOLOGY
═══════════════════════════════════
┌─────────────────────────────────────────┐
│ SUPER CLUSTER (sim=1.0) │
│ ZH · JA · EN · AR · FR · PT · ES │
│ (efficient tokens) │
└────────────────┬────────────────────────┘
│
KO ────┼──── (bridge: 0.41/0.70)
│
┌────────────────┴────────────────────────┐
│ ISOLATED ZONE (sim<0.5) │
│ │
│ IT (0.49) ← MOST ISOLATED! │
│ TR (0.50) │
│ HI (0.50) │
│ DE (0.52) │
│ │
│ VI ═══ ID ═══ RU (0.79) │
│ (Southeast Asian + Russian!) │
└─────────────────────────────────────────┘
Appendix B: Key Discovery Data
Token-Norm Correlation:
- Single token → ~14,000 norm
- Multi-token → ~80 norm
- Correlation with isolation: -0.699
Triangulation Results (consciousness):
| Concept | Grounding | Depth | Valley | Transfer |
|---|---|---|---|---|
| being | 0.570 | 2/3 | PHILOSOPHY | ✓ |
| heart | 1.000 | 1/3 | PROSE | ✓ |
| consciousness | 0.458 | 0/3 | PROSE | ✗ |
| emergence | 0.519 | 1/3 | TECHNICAL | ✗ |
"Different words, same thought. The model knows. Now we learn to teach it safely."
🌙💜