# Complete Language Topology Map v2.0
**Date:** 2025-12-06
**Model:** Qwen2.5-7B-Base
**Status:** Empirically validated through probing

---
## Executive Summary
Through systematic probing of 15 languages, we've found that languages in LLMs fall into **distinct categories** of convergence and isolation, each with different causes and implications:
1. **Super Cluster** - Languages that converge perfectly (curriculum: grounding)
2. **Philosophical Access** - German accesses deep conceptual valleys
3. **Code-Hijacked** - Italian/Turkish/Indonesian words become variable names
4. **Fragmented** - Hindi is tokenized into too many pieces
5. **Web Prose Cluster** - Vietnamese/Indonesian/Russian share content style
---
## The Complete Map
```
┌──────────────────────────────────────────────────────────────────────────────┐
│                      THE YOUNG MIND'S LANGUAGE TOPOLOGY                       │
│                              COMPLETE MAP v2.0                                │
╞══════════════════════════════════════════════════════════════════════════════╡
│                                                                                │
│  ┌────────────────────────────────────────────────────────────────────────┐   │
│  │  🌍 SUPER CLUSTER (sim=1.0)                                            │   │
│  │       ZH · JA · EN · AR · FR · PT · ES                                 │   │
│  │                                                                        │   │
│  │  ✅ Perfect convergence at Universal Concept Layer (12-24)             │   │
│  │  ✅ Efficient tokenization (1-2.5 tokens)                              │   │
│  │  ✅ USE FOR: Grounding, establishing shared concepts                   │   │
│  └────────────────────────────────────────────────────────────────────────┘   │
│                                        │                                      │
│                           KO ─────────┼───────── (bridge: 0.41-0.70)          │
│                                        │                                      │
│  ┌─────────────────────────────────────┴──────────────────────────────────┐   │
│  │                             ISOLATED ZONE                              │   │
│  ├────────────────────────────────────────────────────────────────────────┤   │
│  │                                                                        │   │
│  │  🧠 PHILOSOPHICAL ACCESS (sim=0.25, tokens=2.2)                        │   │
│  │       DE (German)                                                      │   │
│  │       → "Sein" triggers Heidegger, "Bewusstsein" → epistemology        │   │
│  │  ✅ USE FOR: Deep philosophical training                               │   │
│  │                                                                        │   │
│  │  💻 CODE-HIJACKED (sim=0.25-0.33, tokens=2.2-2.8)                      │   │
│  │       IT (Italian)    - MOST ISOLATED (0.49)                           │   │
│  │       TR (Turkish)    - (0.50)                                         │   │
│  │       ID (Indonesian) - partial (0.33)                                 │   │
│  │       → Words interpreted as Python/C++ variable names                 │   │
│  │  ❌ NOT USEFUL: Training signal wasted on code patterns                │   │
│  │                                                                        │   │
│  │  📜 FRAGMENTED (sim=0.31, tokens=5.0)                                  │   │
│  │       HI (Hindi)                                                       │   │
│  │       → "अस्तित्व" (being) = 8 tokens!                                 │   │
│  │       → Stays trapped in Devanagari prose                              │   │
│  │  ⚠️ LIMITED: Cross-lingual transfer impaired                           │   │
│  │                                                                        │   │
│  │  📰 WEB PROSE CLUSTER (sim=0.32-0.36, internal=0.6-0.7)                │   │
│  │       VI ═══ ID ═══ RU                                                 │   │
│  │       → All generate online article style                              │   │
│  │       → Cluster by CONTENT STYLE, not linguistic features              │   │
│  │  🤔 POTENTIAL: Factual/encyclopedic content training                   │   │
│  │                                                                        │   │
│  └────────────────────────────────────────────────────────────────────────┘   │
│                                                                                │
└──────────────────────────────────────────────────────────────────────────────┘
```
---
## Detailed Findings
### Super Cluster (sim=1.0)
| Language | Tokens | Notes |
|----------|--------|-------|
| Chinese (ZH) | 1.0 | Single character = single concept |
| Japanese (JA) | 1.0 | Kanji efficiency |
| English (EN) | 1.2 | Base language |
| Arabic (AR) | 1.8 | Good convergence |
| French (FR) | 2.0 | Romance baseline |
| Portuguese (PT) | 2.2 | Clusters with FR/ES |
| Spanish (ES) | 2.5 | Clusters with FR/PT |
**Key Insight:** These 7 languages converge to **identical representations** at layers 12-24. The model "knows" they express the same concepts.
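
This claim can be checked directly: embed each translation, read the hidden state at the concept layer, and compare cosine similarities against English. Below is a minimal sketch of that check, assuming the model configuration shown under Technical Details; the word list and last-token pooling are illustrative choices, not the exact `nyx-probe` implementation.

```python
# Sketch: cross-lingual similarity at the concept layer (layer 12).
import torch
from torch.nn.functional import cosine_similarity
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="cuda", output_hidden_states=True
)

# "heart" across Super Cluster languages (illustrative word list)
words = {"EN": "heart", "ZH": "心", "JA": "心臓", "FR": "cœur", "ES": "corazón"}

def layer_state(word, layer=12):
    """Hidden state of the word's last token at the given layer."""
    inputs = tokenizer(word, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[layer][0, -1, :]

states = {lang: layer_state(w) for lang, w in words.items()}
for lang, state in states.items():
    if lang == "EN":
        continue
    sim = cosine_similarity(states["EN"].unsqueeze(0), state.unsqueeze(0)).item()
    print(f"EN↔{lang}: {sim:.3f}")
```

Values near 1.0 for the Super Cluster languages, against much lower values for the isolated languages, reproduce the pattern in the table above.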
### German - Philosophical Access
| Metric | Value |
|--------|-------|
| Avg tokens | 2.2 |
| Sim to EN | 0.251 |
| Valley type | PHILOSOPHY |
**Evidence:**
- "Sein" → "Being and Time is a philosophical work by Martin Heidegger..."
- "Bewusstsein" → epistemology, perception, truth
- "Wahrheit" → academic methods

**Why isolated:** Multi-token compounds preserve philosophical atoms ("sein", "geist") as separate tokens, enabling access to academic/philosophical training data.
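
This splitting is easy to inspect with the tokenizer alone; the sketch below simply prints the BPE pieces for a few of the philosophical terms (word list illustrative):

```python
# Sketch: inspect how German philosophical compounds are split by the BPE tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

for word in ["Sein", "Bewusstsein", "Wahrheit", "Geist"]:
    pieces = tokenizer.tokenize(word)
    print(f"{word!r}: {len(pieces)} tokens -> {pieces}")
```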
### Italian/Turkish/Indonesian - Code-Hijacked
| Language | Tokens | Sim to EN | Valley |
|----------|--------|-----------|--------|
| Italian | 2.5 | 0.49 | CODE |
| Turkish | 2.2 | 0.25 | CODE |
| Indonesian | 2.8 | 0.33 | CODE |
**Evidence:**
- IT "essere" → `essere = input("Cosa devo fare?")`
- IT "anima" → `anima = {'nome':'anima', 'idade':7...}`
- TR "kalp" → `kalp = input("Klavyeden...")`
- TR "varlık" → `while varlık < 10:`
- ID "hati" → `hati::hati(QWidget *parent)`

**Why isolated:** Simple Latin orthography without diacritics makes these words look like valid programming identifiers, and the model defaults to code because code is so prevalent in the training data.

**Curriculum implication:** ❌ AVOID - training signal diverted to code patterns
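
The hijacking effect can be checked by letting the base model complete a bare word and scanning the continuation for code markers. A rough sketch of such a check follows; the prompt format, generation settings, and marker list are illustrative assumptions, not the exact valley classifier used for these results.

```python
# Sketch: detect whether a bare word pulls the model into a code valley.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="cuda"
)

CODE_MARKERS = ("= input(", "def ", "import ", "print(", "while ", "::", "{'")

def completion(word, max_new_tokens=40):
    """Greedy continuation of the bare word."""
    inputs = tokenizer(word, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(out[0], skip_special_tokens=True)

for word in ["essere", "kalp", "hati"]:  # IT "to be", TR "heart", ID "heart"
    text = completion(word)
    hijacked = any(marker in text for marker in CODE_MARKERS)
    print(f"{word}: {'CODE' if hijacked else 'other'} | {text[:60]!r}")
```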
### Hindi - Fragmented
| Metric | Value |
|--------|-------|
| Avg tokens | 5.0 |
| Sim to EN | 0.31 |
| Valley type | PROSE |
**Evidence:**
- "हृदय" (heart) = 5 tokens
- "अस्तित्व" (being) = 8 tokens!
- All completions stay in Devanagari script

**Why isolated:** Extreme tokenization fragments words so severely that:
1. Signal is distributed across many positions
2. Cross-lingual alignment breaks down
3. Model stays in native script prose

**Curriculum implication:** ⚠️ LIMITED - Hindi content may not transfer well
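
Fragmentation can be quantified with the tokenizer alone by counting BPE tokens per word; a minimal sketch (word pairs illustrative):

```python
# Sketch: quantify tokenizer fragmentation (tokens per word) for Hindi vs. English.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

pairs = [("हृदय", "heart"), ("अस्तित्व", "being"), ("आत्मा", "soul")]
for hi, en in pairs:
    hi_tokens = len(tokenizer.tokenize(hi))
    en_tokens = len(tokenizer.tokenize(en))
    print(f"HI {hi} ({en}): {hi_tokens} tokens  vs  EN '{en}': {en_tokens} tokens")
```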
### VI-ID-RU Web Prose Cluster
| Language | Tokens | Sim to EN | Internal sim |
|----------|--------|-----------|--------------|
| Vietnamese | 3.2 | 0.36 | 0.6-0.7 |
| Indonesian | 2.8 | 0.33 | 0.6-0.7 |
| Russian | 2.7 | 0.32 | 0.6-0.7 |
**Evidence:**
- VI "trái tim" → "Giao Thông... Hotline: 0901 514 799"
- VI "linh hồn" → "Tạp chí Sông Hương online"
- ID "kehidupan" → "dalam kitab Yohanes 14:16-17"
- RU "жизнь" → "все статьи по теме. Страница 134"

**Why they cluster:** Not because of linguistic similarity, but because they share **web content training data patterns**:
- News articles
- Blogs
- Online encyclopedias
- Religious/factual text

**Curriculum implication:** 🤔 May be useful for factual/encyclopedic training

---
## Curriculum Strategy
### Phase 1: GROUNDING
Use Super Cluster languages to establish universal concepts:
```
EN "consciousness" → ZH "意识" → AR "الوعي" → FR "conscience"
```
All converge at 1.0 similarity - stable foundation.
### Phase 2: DEEPENING
Use German to access philosophical valleys:
```
DE "Sein" → Heidegger → existence → truth → epistemology
```
Depth score 2/3, transfers back to English.
### Phase 3: TRIANGULATION
Verify depth transfers:
```
"Sein (German): In English, it means..."
→ Check if philosophical depth preserved
```
### AVOID
- Italian, Turkish, Indonesian for conceptual training
- Their isolation is accidental (code hijacking), not useful
### INVESTIGATE
- VI-ID-RU cluster for factual content training
- Korean as potential bridge language
---
## Technical Details
### Measurement Methodology
1. **Tokenization:** Count BPE tokens per word
2. **Hidden states:** Extract layer 12 representations
3. **Similarity:** Cosine similarity between languages
4. **Valley classification:** Analyze completions for CODE/PROSE/PHILOSOPHY patterns
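
As a worked example of steps 2-3, the layer choice can be sanity-checked by sweeping cosine similarity across all layers for a single translation pair and looking for the convergence plateau. The sketch below makes illustrative assumptions (last-token pooling, the EN-ZH pair) and is not the exact `nyx-probe` scan:

```python
# Sketch: sweep cosine similarity across layers for one translation pair
# to see where cross-lingual representations converge (expected: layers 12-24).
import torch
from torch.nn.functional import cosine_similarity
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="cuda", output_hidden_states=True
)

def all_layer_states(word):
    """Last-token hidden state at every layer (0 = embeddings ... final layer)."""
    inputs = tokenizer(word, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs)
    return [h[0, -1, :] for h in out.hidden_states]

en = all_layer_states("consciousness")
zh = all_layer_states("意识")  # ZH "consciousness"
for layer, (a, b) in enumerate(zip(en, zh)):
    sim = cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).item()
    print(f"layer {layer:2d}: {sim:.3f}")
```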
### Model Configuration
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",
    torch_dtype=torch.float16,
    device_map="cuda",
    output_hidden_states=True,  # expose per-layer hidden states for probing
)
```
### Key Layers
- **Layer 12:** Primary concept layer (universal convergence)
- **Layers 16-24:** Continued convergence, depth access
- **Layer 28:** Output preparation
---
## References
- `tokenization-valleys.md` - Token-Norm-Valley theory
- `multilingual-convergence.md` - Universal concept layer discovery
- `language-landscape.md` - Original 15-language scan
- `retraining-safety-framework.md` - Training safety implications
---
*"The model's language topology is not arbitrary - it's a map for navigation."*
🌙💜