feat: complete Phase 1 - vocabulary expansion & DriftProbe infrastructure

- CLI: nyx-probe scan with --summary/--delta/--full flags
- DriftProbe: training safety with Gini coefficient + Angular Drift
- Vocabulary: 54 terms (30 nimmerverse + 24 German philosophical)
- Sentinels: ANCHOR/BRIDGE/CANARY/TARGET monitoring system

Key findings:
- German philosophical terms: 37.5% depth≥2 hit rate (vs 3.3% nimmerverse)
- Super Cluster validated: heart cross-lang sim = 1.000
- Isolated Zone confirmed: being EN↔DE sim = 0.195
- Gini signature: Philosophy ~0.5 (diffuse), Technical ~0.8 (sparse)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# Complete Language Topology Map v2.0
**Date:** 2025-12-06
**Model:** Qwen2.5-7B-Base
**Status:** Empirically validated through probing
---
## Executive Summary
Through systematic probing of 15 languages, we've discovered that language isolation in LLMs falls into **distinct categories** with different causes and implications:
1. **Super Cluster** - Languages that converge perfectly (curriculum: grounding)
2. **Philosophical Access** - German accesses deep conceptual valleys
3. **Code-Hijacked** - Italian/Turkish/Indonesian words become variable names
4. **Fragmented** - Hindi is tokenized into too many pieces
5. **Web Prose Cluster** - Vietnamese/Indonesian/Russian share content style
---
## The Complete Map
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ THE YOUNG MIND'S LANGUAGE TOPOLOGY │
│ COMPLETE MAP v2.0 │
╞═════════════════════════════════════════════════════════════════════════════╡
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 🌍 SUPER CLUSTER (sim=1.0) │ │
│ │ ZH · JA · EN · AR · FR · PT · ES │ │
│ │ │ │
│ │ ✅ Perfect convergence at Universal Concept Layer (12-24) │ │
│ │ ✅ Efficient tokenization (1-2.5 tokens) │ │
│ │ ✅ USE FOR: Grounding, establishing shared concepts │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ KO ─────────┼───────── (bridge: 0.41-0.70) │
│ │ │
│ ┌─────────────────────────────────┴───────────────────────────────────┐ │
│ │ ISOLATED ZONE │ │
│ ├─────────────────────────────────────────────────────────────────────┤ │
│ │ │ │
│ │ 🧠 PHILOSOPHICAL ACCESS (sim=0.25, tokens=2.2) │ │
│ │ DE (German) │ │
│ │ → "Sein" triggers Heidegger, "Bewusstsein" → epistemology │ │
│ │ ✅ USE FOR: Deep philosophical training │ │
│ │ │ │
│ │ 💻 CODE-HIJACKED (sim=0.25-0.33, tokens=2.2-2.8) │ │
│ │ IT (Italian) - MOST ISOLATED (0.49) │ │
│ │ TR (Turkish) - (0.50) │ │
│ │ ID (Indonesian) - partial (0.33) │ │
│ │ → Words interpreted as Python/C++ variable names │ │
│ │ ❌ NOT USEFUL: Training signal wasted on code patterns │ │
│ │ │ │
│ │ 📜 FRAGMENTED (sim=0.31, tokens=5.0) │ │
│ │ HI (Hindi) │ │
│ │ → "अस्तित्व" (being) = 8 tokens! │ │
│ │ → Stays trapped in Devanagari prose │ │
│ │ ⚠️ LIMITED: Cross-lingual transfer impaired │ │
│ │ │ │
│ │ 📰 WEB PROSE CLUSTER (sim=0.32-0.36, internal=0.6-0.7) │ │
│ │ VI ═══ ID ═══ RU │ │
│ │ → All generate online article style │ │
│ │ → Cluster by CONTENT STYLE not linguistic features │ │
│ │ 🤔 POTENTIAL: Factual/encyclopedic content training │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
---
## Detailed Findings
### Super Cluster (sim=1.0)
| Language | Avg tokens | Notes |
|----------|--------|-------|
| Chinese (ZH) | 1.0 | Single character = single concept |
| Japanese (JA) | 1.0 | Kanji efficiency |
| English (EN) | 1.2 | Base language |
| Arabic (AR) | 1.8 | Good convergence |
| French (FR) | 2.0 | Romance baseline |
| Portuguese (PT) | 2.2 | Clusters with FR/ES |
| Spanish (ES) | 2.5 | Clusters with FR/PT |
**Key Insight:** These 7 languages converge to **identical representations** at layers 12-24. The model "knows" they express the same concepts.
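
The per-word counts are straightforward to reproduce from the tokenizer alone. A minimal sketch, assuming a single probe word ("heart" in each language) standing in for the full vocabulary scan behind the averages above:
```python
# Count BPE tokens per word across the Super Cluster languages.
# The word set here is illustrative, not the full probe list.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

words = {"EN": "heart", "ZH": "心", "JA": "心臓", "AR": "قلب",
         "FR": "cœur", "PT": "coração", "ES": "corazón"}

for lang, word in words.items():
    n = len(tokenizer.encode(word, add_special_tokens=False))
    print(f"{lang}: {word} -> {n} token(s)")
```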
### German - Philosophical Access
| Metric | Value |
|--------|-------|
| Avg tokens | 2.2 |
| Sim to EN | 0.251 |
| Valley type | PHILOSOPHY |
**Evidence:**
- "Sein" → "Being and Time is a philosophical work by Martin Heidegger..."
- "Bewusstsein" → epistemology, perception, truth
- "Wahrheit" → academic methods
**Why isolated:** Multi-token compounds preserve philosophical atoms ("sein", "geist") as separate tokens, enabling access to academic/philosophical training data.
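
The evidence above comes from open-ended completion of a bare word. A minimal sketch of such a probe, assuming the `model` and `tokenizer` from the Model Configuration section below; greedy decoding is an assumption, since the exact sampling settings are not recorded here:
```python
import torch

def probe_completion(word: str, max_new_tokens: int = 40) -> str:
    """Let the base model continue from a bare word and return the text."""
    inputs = tokenizer(word, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                             do_sample=False)  # greedy decoding (assumed)
    return tokenizer.decode(out[0], skip_special_tokens=True)

print(probe_completion("Sein"))         # expect Heidegger / philosophy prose
print(probe_completion("Bewusstsein"))  # expect epistemology themes
```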
### Italian/Turkish/Indonesian - Code-Hijacked
| Language | Avg tokens | Sim to EN | Valley |
|----------|--------|-----------|--------|
| Italian | 2.5 | 0.49 | CODE |
| Turkish | 2.2 | 0.25 | CODE |
| Indonesian | 2.8 | 0.33 | CODE |
**Evidence:**
- IT "essere" → `essere = input("Cosa devo fare?")`
- IT "anima" → `anima = {'nome':'anima', 'idade':7...}`
- TR "kalp" → `kalp = input("Klavyeden...")`
- TR "varlık" → `while varlık < 10:`
- ID "hati" → `hati::hati(QWidget *parent)`
**Why isolated:** Simple Latin orthography without diacritics makes these words look like valid programming identifiers. The model defaults to code because code is prevalent in its training data.
**Curriculum implication:** ❌ AVOID - training signal diverted to code patterns
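
One way to flag this failure mode automatically is a surface check on the completion. An illustrative heuristic, not the classifier actually used in the scan:
```python
import re

# Surface patterns typical of code completions: assignment, control-flow
# keywords, C++ scope resolution, input() calls. Illustrative list only.
CODE_PATTERNS = [
    r"\w+\s*=\s*\S",
    r"\b(while|for|def|class|if|return)\b",
    r"::",
    r"\binput\s*\(",
]

def looks_like_code(completion: str) -> bool:
    return any(re.search(p, completion) for p in CODE_PATTERNS)

print(looks_like_code('kalp = input("Klavyeden...")'))             # True
print(looks_like_code("Being and Time is a philosophical work"))   # False
```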
### Hindi - Fragmented
| Metric | Value |
|--------|-------|
| Avg tokens | 5.0 |
| Sim to EN | 0.31 |
| Valley type | PROSE |
**Evidence:**
- "हृदय" (heart) = 5 tokens
- "अस्तित्व" (being) = 8 tokens!
- All completions stay in Devanagari script
**Why isolated:** Extreme tokenization fragments words so severely that:
1. Signal is distributed across many positions
2. Cross-lingual alignment breaks down
3. Model stays in native script prose
**Curriculum implication:** ⚠️ LIMITED - Hindi content may not transfer well
### VI-ID-RU Web Prose Cluster
| Language | Avg tokens | Sim to EN | Internal sim |
|----------|--------|-----------|--------------|
| Vietnamese | 3.2 | 0.36 | 0.6-0.7 |
| Indonesian | 2.8 | 0.33 | 0.6-0.7 |
| Russian | 2.7 | 0.32 | 0.6-0.7 |
**Evidence:**
- VI "trái tim" → "Giao Thông... Hotline: 0901 514 799"
- VI "linh hồn" → "Tạp chí Sông Hương online"
- ID "kehidupan" → "dalam kitab Yohanes 14:16-17"
- RU "жизнь" → "все статьи по теме. Страница 134"
**Why they cluster:** Not linguistic similarity - they share **web content training data patterns**:
- News articles
- Blogs
- Online encyclopedias
- Religious/factual text
**Curriculum implication:** 🤔 May be useful for factual/encyclopedic training
---
## Curriculum Strategy
### Phase 1: GROUNDING
Use Super Cluster languages to establish universal concepts:
```
EN "consciousness" → ZH "意识" → AR "الوعي" → FR "conscience"
```
All converge at 1.0 similarity - stable foundation.
### Phase 2: DEEPENING
Use German to access philosophical valleys:
```
DE "Sein" → Heidegger → existence → truth → epistemology
```
Depth score 2/3; the depth transfers back to English.
### Phase 3: TRIANGULATION
Verify depth transfers:
```
"Sein (German): In English, it means..."
→ Check if philosophical depth preserved
```
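A sketch of this check, reusing the hypothetical `probe_completion()` from the German section; the keyword list is an assumption standing in for whatever depth metric produced the 2/3 score:
```python
# Hypothetical philosophy-keyword set; the real depth metric may differ.
PHILOSOPHY_KEYWORDS = {"being", "existence", "heidegger", "ontology",
                       "epistemology", "consciousness", "truth"}

def depth_score(text: str) -> int:
    lowered = text.lower()
    return sum(kw in lowered for kw in PHILOSOPHY_KEYWORDS)

completion = probe_completion("Sein (German): In English, it means")
print(completion)
print("depth:", depth_score(completion))
```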
### AVOID
- Italian, Turkish, Indonesian for conceptual training
- Their isolation is accidental (code hijacking), not useful
### INVESTIGATE
- VI-ID-RU cluster for factual content training
- Korean as potential bridge language
---
## Technical Details
### Measurement Methodology
1. **Tokenization:** Count BPE tokens per word
2. **Hidden states:** Extract layer 12 representations
3. **Similarity:** Cosine similarity between languages (sketched below)
4. **Valley classification:** Analyze completions for CODE/PROSE/PHILOSOPHY patterns
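
A minimal sketch of steps 2-3, assuming the `model` and `tokenizer` from the Model Configuration below; mean-pooling over token positions is an assumption about how multi-token words were reduced to a single vector:
```python
import torch
import torch.nn.functional as F

def layer12_vector(word: str) -> torch.Tensor:
    """Mean-pooled layer-12 hidden state for a single word."""
    inputs = tokenizer(word, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding layer, so index 12 = layer 12 output
    return out.hidden_states[12].mean(dim=1).squeeze(0)

sim = F.cosine_similarity(layer12_vector("heart"), layer12_vector("心"), dim=0)
print(f"EN <-> ZH layer-12 cosine similarity: {sim.item():.3f}")
```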
### Model Configuration
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",
    torch_dtype=torch.float16,   # half precision to fit a 7B model on one GPU
    device_map="cuda",
    output_hidden_states=True,   # required for the layer-12 probes
)
```
### Key Layers
- **Layer 12:** Primary concept layer (universal convergence)
- **Layers 16-24:** Continued convergence, depth access
- **Layer 28:** Output preparation
---
## References
- `tokenization-valleys.md` - Token-Norm-Valley theory
- `multilingual-convergence.md` - Universal concept layer discovery
- `language-landscape.md` - Original 15-language scan
- `retraining-safety-framework.md` - Training safety implications
---
*"The model's language topology is not arbitrary - it's a map for navigation."*
🌙💜