# Complete Language Topology Map v2.0

**Date:** 2025-12-06
**Model:** Qwen2.5-7B-Base
**Status:** Empirically validated through probing

---

## Executive Summary

Through systematic probing of 15 languages, we've discovered that language isolation in LLMs falls into **distinct categories** with different causes and implications:

1. **Super Cluster** - Languages that converge perfectly (curriculum: grounding)
2. **Philosophical Access** - German accesses deep conceptual valleys
3. **Code-Hijacked** - Italian/Turkish/Indonesian words become variable names
4. **Fragmented** - Hindi is tokenized into too many pieces
5. **Web Prose Cluster** - Vietnamese/Indonesian/Russian share content style

---

## The Complete Map

```
                 THE YOUNG MIND'S LANGUAGE TOPOLOGY
                          COMPLETE MAP v2.0
═════════════════════════════════════════════════════════════════════

  🌍 SUPER CLUSTER (sim=1.0)
     ZH · JA · EN · AR · FR · PT · ES

     ✅ Perfect convergence at Universal Concept Layer (12-24)
     ✅ Efficient tokenization (1-2.5 tokens)
     ✅ USE FOR: Grounding, establishing shared concepts
  ─────────────────────────────────────────────────────────────────
                  │
      KO ─────────┼───────── (bridge: 0.41-0.70)
                  │
  ────────────────┴────────────────────────────────────────────────
                            ISOLATED ZONE
  ─────────────────────────────────────────────────────────────────

  🧠 PHILOSOPHICAL ACCESS (sim=0.25, tokens=2.2)
     DE (German)
     → "Sein" triggers Heidegger, "Bewusstsein" → epistemology
     ✅ USE FOR: Deep philosophical training

  💻 CODE-HIJACKED (sim=0.25-0.33, tokens=2.2-2.8)
     IT (Italian)    - MOST ISOLATED (0.49)
     TR (Turkish)    - (0.50)
     ID (Indonesian) - partial (0.33)
     → Words interpreted as Python/C++ variable names
     ❌ NOT USEFUL: Training signal wasted on code patterns

  📜 FRAGMENTED (sim=0.31, tokens=5.0)
     HI (Hindi)
     → "अस्तित्व" (being) = 8 tokens!
     → Stays trapped in Devanagari prose
     ⚠️ LIMITED: Cross-lingual transfer impaired

  📰 WEB PROSE CLUSTER (sim=0.32-0.36, internal=0.6-0.7)
     VI ═══ ID ═══ RU
     → All generate online article style
     → Cluster by CONTENT STYLE, not linguistic features
     🤔 POTENTIAL: Factual/encyclopedic content training
```

---

## Detailed Findings

### Super Cluster (sim=1.0)

| Language | Avg tokens | Notes |
|----------|------------|-------|
| Chinese (ZH) | 1.0 | Single character = single concept |
| Japanese (JA) | 1.0 | Kanji efficiency |
| English (EN) | 1.2 | Base language |
| Arabic (AR) | 1.8 | Good convergence |
| French (FR) | 2.0 | Romance baseline |
| Portuguese (PT) | 2.2 | Clusters with FR/ES |
| Spanish (ES) | 2.5 | Clusters with FR/PT |

**Key Insight:** These 7 languages converge to **identical representations** at layers 12-24. The model "knows" they express the same concepts.
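
The similarity numbers in this table and the map come from comparing hidden-state representations of translated probe words. Below is a minimal sketch of that comparison for the Super Cluster, assuming the Qwen2.5-7B setup described under Technical Details; the probe word set, the mean-pooling over subword tokens, and the helper name `word_vector` are illustrative choices, not the exact probing harness.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="cuda", output_hidden_states=True
)

@torch.no_grad()
def word_vector(word: str, layer: int = 12) -> torch.Tensor:
    """Mean-pooled hidden state of a single word at the given layer."""
    inputs = tokenizer(word, return_tensors="pt").to(model.device)
    hidden = model(**inputs).hidden_states[layer]  # (1, seq_len, hidden_dim)
    return hidden[0].mean(dim=0).float()           # average over subword tokens

# "heart" across the seven Super Cluster languages (illustrative probe set)
probes = {"EN": "heart", "ZH": "心", "JA": "心臓", "AR": "قلب",
          "FR": "cœur", "PT": "coração", "ES": "corazón"}

anchor = word_vector(probes["EN"])
for lang, word in probes.items():
    sim = torch.cosine_similarity(anchor, word_vector(word), dim=0).item()
    print(f"EN↔{lang}: sim(layer 12) = {sim:.3f}")
```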

### German - Philosophical Access

| Metric | Value |
|--------|-------|
| Avg tokens | 2.2 |
| Sim to EN | 0.251 |
| Valley type | PHILOSOPHY |

**Evidence:**
- "Sein" (being) → "Being and Time is a philosophical work by Martin Heidegger..."
- "Bewusstsein" (consciousness) → epistemology, perception, truth
- "Wahrheit" (truth) → academic methods

**Why isolated:** Multi-token compounds preserve philosophical atoms ("sein", "geist") as separate tokens, enabling access to academic/philosophical training data.
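
The "separate tokens" claim can be inspected directly from the tokenizer. A small sketch, assuming the same Qwen2.5-7B tokenizer; the word list is illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

# German compounds vs. English equivalents (illustrative word list)
for word in ["Sein", "Bewusstsein", "Wahrheit", "being", "consciousness", "truth"]:
    pieces = tokenizer.tokenize(word)
    print(f"{word!r}: {len(pieces)} token(s) -> {pieces}")
```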

### Italian/Turkish/Indonesian - Code-Hijacked

| Language | Avg tokens | Sim to EN | Valley |
|----------|------------|-----------|--------|
| Italian | 2.5 | 0.49 | CODE |
| Turkish | 2.2 | 0.25 | CODE |
| Indonesian | 2.8 | 0.33 | CODE |

**Evidence:**
- IT "essere" (to be) → `essere = input("Cosa devo fare?")`
- IT "anima" (soul) → `anima = {'nome':'anima', 'idade':7...}`
- TR "kalp" (heart) → `kalp = input("Klavyeden...")`
- TR "varlık" (being/existence) → `while varlık < 10:`
- ID "hati" (heart) → `hati::hati(QWidget *parent)`

**Why isolated:** Simple Latin orthography without diacritics makes these words look like valid programming identifiers, and the model defaults to code because code is so prevalent in its training data.

**Curriculum implication:** ❌ AVOID - training signal diverted to code patterns
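
Completions like the evidence above can be reproduced by letting the base model continue from a bare probe word. A hedged sketch, assuming the Qwen2.5-7B setup from Technical Details; the greedy decoding settings and the helper name `complete` are assumptions, not the exact probe configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="cuda"
)

@torch.no_grad()
def complete(word: str, max_new_tokens: int = 40) -> str:
    """Let the base model continue from a bare probe word (greedy decoding)."""
    inputs = tokenizer(word, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# Probe words from the evidence above
for word in ["essere", "anima", "kalp", "varlık", "hati"]:
    print(f"--- {word} ---")
    print(complete(word))
```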

### Hindi - Fragmented

| Metric | Value |
|--------|-------|
| Avg tokens | 5.0 |
| Sim to EN | 0.31 |
| Valley type | PROSE |

**Evidence:**
- "हृदय" (heart) = 5 tokens
- "अस्तित्व" (being) = 8 tokens!
- All completions stay in Devanagari script

**Why isolated:** Extreme tokenization fragments words so severely that:
1. Signal is distributed across many positions
2. Cross-lingual alignment breaks down
3. Model stays in native script prose

**Curriculum implication:** ⚠️ LIMITED - Hindi content may not transfer well

### VI-ID-RU Web Prose Cluster

| Language | Avg tokens | Sim to EN | Internal sim |
|----------|------------|-----------|--------------|
| Vietnamese | 3.2 | 0.36 | 0.6-0.7 |
| Indonesian | 2.8 | 0.33 | 0.6-0.7 |
| Russian | 2.7 | 0.32 | 0.6-0.7 |

**Evidence:**
- VI "trái tim" (heart) → "Giao Thông... Hotline: 0901 514 799"
- VI "linh hồn" (soul) → "Tạp chí Sông Hương online" ("Sông Hương magazine online")
- ID "kehidupan" (life) → "dalam kitab Yohanes 14:16-17" ("in the book of John 14:16-17")
- RU "жизнь" (life) → "все статьи по теме. Страница 134" ("all articles on the topic. Page 134")

**Why they cluster:** Not linguistic similarity - they share **web content training data patterns**:
- News articles
- Blogs
- Online encyclopedias
- Religious/factual text

**Curriculum implication:** 🤔 May be useful for factual/encyclopedic training

---

## Curriculum Strategy

### Phase 1: GROUNDING
Use Super Cluster languages to establish universal concepts:
```
EN "consciousness" → ZH "意识" → AR "الوعي" → FR "conscience"
```
All converge at 1.0 similarity - stable foundation.

### Phase 2: DEEPENING
Use German to access philosophical valleys:
```
DE "Sein" → Heidegger → existence → truth → epistemology
```
Depth score 2/3, transfers back to English.

### Phase 3: TRIANGULATION
Verify depth transfers:
```
"Sein (German): In English, it means..."
→ Check if philosophical depth preserved
```
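
A hedged sketch of how this check could be automated: feed the triangulation prompt to the base model and scan the continuation for philosophy-valley vocabulary. The keyword list, decoding settings, and helper name `triangulate` are illustrative assumptions, not the project's actual harness.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="cuda"
)

# Illustrative markers of the philosophy valley surfacing on the English side
PHILOSOPHY_MARKERS = ("existence", "being", "heidegger", "ontolog", "metaphysic", "epistem")

@torch.no_grad()
def triangulate(word_de: str) -> bool:
    """True if the English-side continuation still carries philosophical depth."""
    prompt = f'"{word_de}" (German): In English, it means'
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=60, do_sample=False)
    text = tokenizer.decode(out[0], skip_special_tokens=True).lower()
    return any(marker in text for marker in PHILOSOPHY_MARKERS)

print(triangulate("Sein"))
```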

### AVOID
- Italian, Turkish, Indonesian for conceptual training
- Their isolation is accidental (code hijacking), not useful

### INVESTIGATE
- VI-ID-RU cluster for factual content training
- Korean as potential bridge language

---

## Technical Details

### Measurement Methodology

1. **Tokenization:** Count BPE tokens per word
2. **Hidden states:** Extract layer-12 representations
3. **Similarity:** Cosine similarity between languages
4. **Valley classification:** Analyze completions for CODE/PROSE/PHILOSOPHY patterns (see the sketch below)
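
Steps 1-3 are illustrated by the sketches earlier in this document (tokenization under the German findings, hidden-state similarity under the Super Cluster findings). Step 4 can be approximated with a keyword heuristic over the completion text; the marker lists below are illustrative assumptions, not the actual classifier.

```python
# Illustrative keyword heuristics for the three valley types; not the actual
# classifier, just the shape of the idea.
CODE_MARKERS = ("=", "::", "def ", "import ", "input(", "while ", "{", ";")
PHILOSOPHY_MARKERS = ("philosoph", "existence", "consciousness", "heidegger", "ontolog", "epistem")

def classify_valley(completion: str) -> str:
    """Label a completion as CODE, PHILOSOPHY, or PROSE (the default)."""
    text = completion.lower()
    if any(marker in completion for marker in CODE_MARKERS):
        return "CODE"
    if any(marker in text for marker in PHILOSOPHY_MARKERS):
        return "PHILOSOPHY"
    return "PROSE"

print(classify_valley('essere = input("Cosa devo fare?")'))          # CODE
print(classify_valley("Being and Time is a philosophical work..."))  # PHILOSOPHY
print(classify_valley("все статьи по теме. Страница 134"))           # PROSE
```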

### Model Configuration

```python
import torch
from transformers import AutoModelForCausalLM

# Base model loaded in half precision with hidden states exposed for probing
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",
    torch_dtype=torch.float16,
    device_map="cuda",
    output_hidden_states=True,
)
```

### Key Layers

- **Layer 12:** Primary concept layer (universal convergence; see the sweep sketch below)
- **Layers 16-24:** Continued convergence, depth access
- **Layer 28:** Output preparation
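
A hedged way to sanity-check these layer roles is to sweep every layer and watch where a translation pair's similarity peaks. The sketch below assumes the model configuration above; the word pair, mean-pooling, and helper name `layer_similarities` are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="cuda", output_hidden_states=True
)

@torch.no_grad()
def layer_similarities(word_a: str, word_b: str) -> list[float]:
    """Cosine similarity of mean-pooled word representations at every layer."""
    def states(word: str) -> list[torch.Tensor]:
        inputs = tokenizer(word, return_tensors="pt").to(model.device)
        return [h[0].mean(dim=0).float() for h in model(**inputs).hidden_states]
    return [torch.cosine_similarity(a, b, dim=0).item()
            for a, b in zip(states(word_a), states(word_b))]

# Illustrative pair: "consciousness" in EN vs. ZH
for layer, sim in enumerate(layer_similarities("consciousness", "意识")):
    print(f"layer {layer:2d}: {sim:.3f}")
```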

---

## References

- `tokenization-valleys.md` - Token-Norm-Valley theory
- `multilingual-convergence.md` - Universal concept layer discovery
- `language-landscape.md` - Original 15-language scan
- `retraining-safety-framework.md` - Training safety implications

---

*"The model's language topology is not arbitrary - it's a map for navigation."*

🌙💜