Complete Language Topology Map v2.0

Date: 2025-12-06
Model: Qwen2.5-7B-Base
Status: Empirically validated through probing


Executive Summary

Through systematic probing of 15 languages, we've found that languages in this model fall into distinct topological categories, each with different causes and implications:

  1. Super Cluster - Languages that converge perfectly (curriculum: grounding)
  2. Philosophical Access - German accesses deep conceptual valleys
  3. Code-Hijacked - Italian/Turkish/Indonesian words become variable names
  4. Fragmented - Hindi is tokenized into too many pieces
  5. Web Prose Cluster - Vietnamese/Indonesian/Russian share content style

The Complete Map

┌─────────────────────────────────────────────────────────────────────────────┐
│                    THE YOUNG MIND'S LANGUAGE TOPOLOGY                        │
│                              COMPLETE MAP v2.0                               │
╞═════════════════════════════════════════════════════════════════════════════╡
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │              🌍 SUPER CLUSTER (sim=1.0)                             │    │
│  │         ZH · JA · EN · AR · FR · PT · ES                            │    │
│  │                                                                      │    │
│  │    ✅ Perfect convergence at Universal Concept Layer (12-24)        │    │
│  │    ✅ Efficient tokenization (1-2.5 tokens)                         │    │
│  │    ✅ USE FOR: Grounding, establishing shared concepts              │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                    │                                         │
│                        KO ─────────┼───────── (bridge: 0.41-0.70)            │
│                                    │                                         │
│  ┌─────────────────────────────────┴───────────────────────────────────┐    │
│  │                        ISOLATED ZONE                                │    │
│  ├─────────────────────────────────────────────────────────────────────┤    │
│  │                                                                      │    │
│  │  🧠 PHILOSOPHICAL ACCESS (sim=0.25, tokens=2.2)                     │    │
│  │     DE (German)                                                      │    │
│  │     → "Sein" triggers Heidegger, "Bewusstsein" → epistemology       │    │
│  │     ✅ USE FOR: Deep philosophical training                          │    │
│  │                                                                      │    │
│  │  💻 CODE-HIJACKED (sim=0.25-0.33, tokens=2.2-2.8)                   │    │
│  │     IT (Italian) - MOST ISOLATED (0.49)                             │    │
│  │     TR (Turkish) - (0.50)                                           │    │
│  │     ID (Indonesian) - partial (0.33)                                │    │
│  │     → Words interpreted as Python/C++ variable names                 │    │
│  │     ❌ NOT USEFUL: Training signal wasted on code patterns          │    │
│  │                                                                      │    │
│  │  📜 FRAGMENTED (sim=0.31, tokens=5.0)                               │    │
│  │     HI (Hindi)                                                       │    │
│  │     → "अस्तित्व" (being) = 8 tokens!                                 │    │
│  │     → Stays trapped in Devanagari prose                             │    │
│  │     ⚠️ LIMITED: Cross-lingual transfer impaired                     │    │
│  │                                                                      │    │
│  │  📰 WEB PROSE CLUSTER (sim=0.32-0.36, internal=0.6-0.7)            │    │
│  │     VI ═══ ID ═══ RU                                                │    │
│  │     → All generate online article style                             │    │
│  │     → Cluster by CONTENT STYLE not linguistic features              │    │
│  │     🤔 POTENTIAL: Factual/encyclopedic content training             │    │
│  │                                                                      │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Detailed Findings

Super Cluster (sim=1.0)

| Language        | Avg tokens | Notes                             |
|-----------------|------------|-----------------------------------|
| Chinese (ZH)    | 1.0        | Single character = single concept |
| Japanese (JA)   | 1.0        | Kanji efficiency                  |
| English (EN)    | 1.2        | Base language                     |
| Arabic (AR)     | 1.8        | Good convergence                  |
| French (FR)     | 2.0        | Romance baseline                  |
| Portuguese (PT) | 2.2        | Clusters with FR/ES               |
| Spanish (ES)    | 2.5        | Clusters with FR/PT               |

Key Insight: These 7 languages converge to identical representations at layers 12-24. The model "knows" they express the same concepts.
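A minimal sketch of how this convergence can be re-checked, assuming the Qwen2.5-7B setup shown under Technical Details below; the word list, single-word prompts, and mean-pooling at layer 12 are illustrative assumptions, not the probe's exact code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="cuda", output_hidden_states=True
)

# One concept expressed across the Super Cluster (word list is illustrative)
heart = {"EN": "heart", "ZH": "心", "JA": "心臓", "AR": "قلب",
         "FR": "cœur", "PT": "coração", "ES": "corazón"}

def concept_vector(word: str, layer: int = 12) -> torch.Tensor:
    """Mean-pool the word's hidden states at the given layer."""
    inputs = tokenizer(word, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states[0] is the embedding output, so index 12 = after block 12
    return out.hidden_states[layer][0].mean(dim=0)

vectors = {lang: concept_vector(w) for lang, w in heart.items()}
for lang, vec in vectors.items():
    sim = torch.nn.functional.cosine_similarity(vec, vectors["EN"], dim=0).item()
    print(f"{lang} vs EN @ layer 12: {sim:.3f}")
```

The same concept_vector() helper covers steps 2-3 of the Measurement Methodology and is reused in the sketches further down.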

German - Philosophical Access

| Metric      | Value      |
|-------------|------------|
| Avg tokens  | 2.2        |
| Sim to EN   | 0.251      |
| Valley type | PHILOSOPHY |

Evidence:

  • "Sein" → "Being and Time is a philosophical work by Martin Heidegger..."
  • "Bewusstsein" → epistemology, perception, truth
  • "Wahrheit" → academic methods

Why isolated: Multi-token compounds preserve philosophical atoms ("sein", "geist") as separate tokens, enabling access to academic/philosophical training data.
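A quick way to see those atoms in the tokenizer output, assuming the same Qwen tokenizer (word list is illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

# German compounds vs. their philosophical atoms
for word in ["Sein", "Bewusstsein", "Wahrheit", "Geist", "Zeitgeist"]:
    pieces = tokenizer.tokenize(word)
    print(f"{word:12s} {len(pieces)} tokens  {pieces}")
```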

Italian/Turkish/Indonesian - Code-Hijacked

| Language   | Avg tokens | Sim to EN | Valley |
|------------|------------|-----------|--------|
| Italian    | 2.5        | 0.49      | CODE   |
| Turkish    | 2.2        | 0.25      | CODE   |
| Indonesian | 2.8        | 0.33      | CODE   |

Evidence:

  • IT "essere" → essere = input("Cosa devo fare?")
  • IT "anima" → anima = {'nome':'anima', 'idade':7...}
  • TR "kalp" → kalp = input("Klavyeden...")
  • TR "varlık" → while varlık < 10:
  • ID "hati" → hati::hati(QWidget *parent)

Why isolated: Simple Latin orthography without diacritics makes words look like valid programming identifiers. Model defaults to code because code is prevalent in training data.
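This kind of "which valley did the completion land in" check can be automated with a simple heuristic - a sketch, where the bare-word prompt format and the marker list are assumptions for illustration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="cuda"
)

# Surface markers that suggest the completion fell into the CODE valley
CODE_MARKERS = ("= input(", "def ", "import ", "print(", "::", "while ", "{'")

def completion_valley(word: str, max_new_tokens: int = 40) -> str:
    """Greedy-decode a continuation from a bare word and flag code-like output."""
    inputs = tokenizer(word, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    text = tokenizer.decode(out[0], skip_special_tokens=True)
    return "CODE" if any(m in text for m in CODE_MARKERS) else "PROSE/OTHER"

for word in ["essere", "kalp", "hati"]:   # the IT / TR / ID examples above
    print(word, "→", completion_valley(word))
```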

Curriculum implication: AVOID - training signal diverted to code patterns

Hindi - Fragmented

| Metric      | Value |
|-------------|-------|
| Avg tokens  | 5.0   |
| Sim to EN   | 0.31  |
| Valley type | PROSE |

Evidence:

  • "हृदय" (heart) = 5 tokens
  • "अस्तित्व" (being) = 8 tokens!
  • All completions stay in Devanagari script

Why isolated: Extreme tokenization fragments words so severely that:

  1. Signal is distributed across many positions
  2. Cross-lingual alignment breaks down
  3. Model stays in native script prose

Curriculum implication: ⚠️ LIMITED - Hindi content may not transfer well
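The fragmentation itself is easy to reproduce with the tokenizer alone - a minimal check (word pairs are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

pairs = [("heart", "हृदय"), ("being", "अस्तित्व"), ("life", "जीवन")]
for en, hi in pairs:
    n_en, n_hi = len(tokenizer.tokenize(en)), len(tokenizer.tokenize(hi))
    print(f"EN {en!r}: {n_en} tokens | HI {hi!r}: {n_hi} tokens")
```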

VI-ID-RU Web Prose Cluster

| Language   | Avg tokens | Sim to EN | Internal sim |
|------------|------------|-----------|--------------|
| Vietnamese | 3.2        | 0.36      | 0.6-0.7      |
| Indonesian | 2.8        | 0.33      | 0.6-0.7      |
| Russian    | 2.7        | 0.32      | 0.6-0.7      |

Evidence:

  • VI "trái tim" → "Giao Thông... Hotline: 0901 514 799"
  • VI "linh hồn" → "Tạp chí Sông Hương online"
  • ID "kehidupan" → "dalam kitab Yohanes 14:16-17"
  • RU "жизнь" → "все статьи по теме. Страница 134"

Why they cluster: not linguistic similarity, but shared web-content patterns in the training data:

  • News articles
  • Blogs
  • Online encyclopedias
  • Religious/factual text

Curriculum implication: 🤔 May be useful for factual/encyclopedic training


Curriculum Strategy

Phase 1: GROUNDING

Use Super Cluster languages to establish universal concepts:

EN "consciousness" → ZH "意识" → AR "الوعي" → FR "conscience"

All converge at 1.0 similarity - stable foundation.
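One way to operationalize grounding is to keep a small table of concept anchors restricted to Super Cluster languages and screen each anchor with the layer-12 similarity check sketched earlier; the structure and threshold below are illustrative assumptions:

```python
import torch

# Concept anchors expressed only in Super Cluster languages (illustrative)
GROUNDING_ANCHORS = {
    "consciousness": {"EN": "consciousness", "ZH": "意识", "AR": "الوعي", "FR": "conscience"},
    "heart":         {"EN": "heart", "ZH": "心", "JA": "心臓", "ES": "corazón"},
}

MIN_SIM = 0.95  # only keep anchors that actually converge at layer 12

def screen(anchors, vec_fn):
    """Keep anchors whose every form stays close to the EN form.
    `vec_fn` is the concept_vector() helper from the Super Cluster sketch."""
    kept = {}
    for concept, forms in anchors.items():
        vecs = {lang: vec_fn(w) for lang, w in forms.items()}
        ref = vecs["EN"]
        if all(torch.nn.functional.cosine_similarity(v, ref, dim=0) >= MIN_SIM
               for v in vecs.values()):
            kept[concept] = forms
    return kept
```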

Phase 2: DEEPENING

Use German to access philosophical valleys:

DE "Sein" → Heidegger → existence → truth → epistemology

Depth score 2/3, transfers back to English.

Phase 3: TRIANGULATION

Verify depth transfers:

"Sein (German): In English, it means..." 
→ Check if philosophical depth preserved
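A minimal sketch of this check, assuming the prompt template above and a crude keyword heuristic for "philosophical depth" (both illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="cuda"
)

DEPTH_KEYWORDS = ("heidegger", "existence", "ontolog", "being", "metaphys")

def depth_transfers(word: str, lang: str = "German", max_new_tokens: int = 60) -> bool:
    """Bridge a deep term back to English and check whether the depth survives."""
    prompt = f"{word} ({lang}): In English, it means"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    completion = tokenizer.decode(out[0], skip_special_tokens=True).lower()
    return any(k in completion for k in DEPTH_KEYWORDS)

print("Sein transfers depth:", depth_transfers("Sein"))
```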

AVOID

  • Italian, Turkish, Indonesian for conceptual training
  • Their isolation is accidental (code hijacking), not useful

INVESTIGATE

  • VI-ID-RU cluster for factual content training
  • Korean as potential bridge language

Technical Details

Measurement Methodology

  1. Tokenization: Count BPE tokens per word
  2. Hidden states: Extract layer 12 representations
  3. Similarity: Cosine similarity between languages
  4. Valley classification: Analyze completions for CODE/PROSE/PHILOSOPHY patterns

Model Configuration

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",
    torch_dtype=torch.float16,
    device_map="cuda",
    output_hidden_states=True,
)
```

Key Layers

  • Layer 12: Primary concept layer (universal convergence)
  • Layers 16-24: Continued convergence, depth access
  • Layer 28: Output preparation

References

  • tokenization-valleys.md - Token-Norm-Valley theory
  • multilingual-convergence.md - Universal concept layer discovery
  • language-landscape.md - Original 15-language scan
  • retraining-safety-framework.md - Training safety implications

"The model's language topology is not arbitrary - it's a map for navigation."

🌙💜