feat: complete Phase 1 - vocabulary expansion & DriftProbe infrastructure

- CLI: nyx-probe scan with --summary/--delta/--full flags
- DriftProbe: training safety with Gini coefficient + Angular Drift
- Vocabulary: 54 terms (30 nimmerverse + 24 German philosophical)
- Sentinels: ANCHOR/BRIDGE/CANARY/TARGET monitoring system

Key findings:
- German philosophical terms: 37.5% depth≥2 hit rate (vs 3.3% nimmerverse)
- Super Cluster validated: heart cross-lang sim = 1.000
- Isolated Zone confirmed: being EN↔DE sim = 0.195
- Gini signature: Philosophy ~0.5 (diffuse), Technical ~0.8 (sparse)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-06 22:39:03 +01:00
parent 9853f4767b
commit f640dbdd65
29 changed files with 6164 additions and 1 deletions

docs/language-landscape.md Normal file

@@ -0,0 +1,238 @@
# Language Landscape: World, Internet, and Qwen 2.5
**Compiled:** 2025-12-06
**Purpose:** Reference for multilingual probing and curriculum design
---
## Overview
This document maps:
1. Most spoken languages worldwide (by total speakers)
2. Most used languages on the internet (web content)
3. Languages supported by Qwen 2.5-7B-Base
4. Token efficiency for each language
---
## 1. World's Most Spoken Languages (2024-2025)
### By Total Speakers (Native + Learners)
| Rank | Language | Total Speakers | Native Speakers | Notes |
|------|----------|----------------|-----------------|-------|
| 1 | **English** | 1.52 billion | 380 million | 25% native, 75% L2 |
| 2 | **Mandarin Chinese** | 1.14 billion | 941 million | Most native speakers |
| 3 | **Hindi** | 609 million | 345 million | Growing rapidly |
| 4 | **Spanish** | 560 million | 480 million | High native ratio |
| 5 | **Arabic** | 422 million | 313 million | Many dialects |
| 6 | **French** | 321 million | 77 million | 32 countries official |
| 7 | **Bengali** | 273 million | 230 million | South Asia |
| 8 | **Portuguese** | 264 million | 232 million | Brazil dominates |
| 9 | **Russian** | 255 million | 150 million | Wide L2 spread |
| 10 | **Urdu** | 232 million | 70 million | South Asia |
| 11 | **Indonesian** | 199 million | 43 million | Lingua franca |
| 12 | **German** | 135 million | 95 million | Central Europe |
| 13 | **Japanese** | 125 million | 123 million | Island isolation |
| 14 | **Vietnamese** | 85 million | 76 million | Southeast Asia |
| 15 | **Korean** | 82 million | 77 million | Two states |
*Sources: [Statista](https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/), [Ethnologue](https://www.ethnologue.com/insights/ethnologue200/), [Berlitz](https://www.berlitz.com/blog/most-spoken-languages-world)*
---
## 2. Internet Language Distribution (2024-2025)
### Web Content by Language (% of websites)
| Rank | Language | % of Web | Notes |
|------|----------|----------|-------|
| 1 | **English** | 49.4% | Dominant |
| 2 | **Spanish** | 6.0% | Growing |
| 3 | **German** | 5.6% | Overrepresented vs speakers |
| 4 | **Russian** | 5.3% | Strong tech presence |
| 5 | **Japanese** | 4.9% | Island content |
| 6 | **French** | 4.3% | Colonial spread |
| 7 | **Portuguese** | 2.6% | Brazil growing |
| 8 | **Italian** | 2.1% | |
| 9 | **Dutch** | 1.8% | Small population, high output |
| 10 | **Polish** | 1.7% | |
| 11 | **Chinese** | 1.4% | **Underrepresented!** |
| 12 | **Turkish** | 1.3% | |
| 13 | **Persian** | 1.0% | |
| 14 | **Vietnamese** | 0.9% | Growing |
| 15 | **Arabic** | 0.6% | **Severely underrepresented!** |
*Sources: [W3Techs](https://w3techs.com/technologies/overview/content_language), [Statista](https://www.statista.com/statistics/262946/most-common-languages-on-the-internet/)*
### The Paradox Languages
| Language | % World Speakers | % Web Content | Gap Factor |
|----------|------------------|---------------|------------|
| **Chinese** | 14.3% | 1.4% | 10× underrepresented |
| **Arabic** | 5.3% | 0.6% | 9× underrepresented |
| **Hindi** | 7.7% | <0.5% | ≥15× underrepresented |
| **German** | 1.7% | 5.6% | 3× **overrepresented** |
| **Dutch** | 0.3% | 1.8% | 6× **overrepresented** |
**Implication:** To the extent that Qwen's training mix mirrors web content, it is likely biased toward German/Dutch and underexposed to Hindi/Arabic relative to their speaker populations.
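
The Gap Factor column is the ratio between a language's share of world speakers and its share of web content, taken in whichever direction gives a value above 1. A minimal sketch of that arithmetic, using the percentages from the table; the function name `gap_factor` is illustrative, and Hindi's web share is entered as 0.5 because the table only gives an upper bound (`<0.5%`), so its factor is a lower bound:

```python
# Shares copied from the table: (% of world speakers, % of websites).
# Hindi's web share is listed only as "<0.5%", so 0.5 is an upper bound
# and the resulting factor is a lower bound.
SHARES = {
    "Chinese": (14.3, 1.4),
    "Arabic": (5.3, 0.6),
    "Hindi": (7.7, 0.5),
    "German": (1.7, 5.6),
    "Dutch": (0.3, 1.8),
}


def gap_factor(speaker_pct: float, web_pct: float) -> tuple[float, str]:
    """Return (factor, direction) with factor >= 1."""
    if web_pct >= speaker_pct:
        return web_pct / speaker_pct, "overrepresented"
    return speaker_pct / web_pct, "underrepresented"


for language, (speaker_pct, web_pct) in SHARES.items():
    factor, direction = gap_factor(speaker_pct, web_pct)
    print(f"{language}: {factor:.1f}x {direction}")
```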
---
## 3. Qwen 2.5 Supported Languages
### Officially Supported (29+ languages)
Qwen 2.5 explicitly supports multilingual content in:
| Family | Languages |
|--------|-----------|
| **East Asian** | Chinese (Simplified/Traditional), Japanese, Korean |
| **European** | English, German, French, Spanish, Portuguese, Italian, Russian, Dutch, Polish |
| **South Asian** | Hindi (limited?), Bengali |
| **Southeast Asian** | Thai, Vietnamese, Indonesian, Malay |
| **Middle Eastern** | Arabic, Turkish, Persian |
| **Other** | Hebrew, Ukrainian, Greek |
### Training Data
- **18 trillion tokens** total
- Enhanced code, math, and multilingual data
- Heavy English/Chinese bias (web scraping)
*Source: [Qwen Blog](https://qwenlm.github.io/blog/qwen2.5/), [HuggingFace](https://huggingface.co/Qwen/Qwen2.5-7B)*
---
## 4. Token Efficiency Analysis
### Tested in Our Probing (nyx-probing)
| Language | Avg Tokens/Concept | Script | Notes |
|----------|-------------------|--------|-------|
| **Chinese** | 1.0 | Hanzi | Most efficient |
| **Arabic** | 1.5 | Arabic | Compact |
| **Japanese** | 1.8 | Kanji/Kana | Mixed scripts |
| **English** | 2.5 | Latin | Medium |
| **German** | 4.5 | Latin | Compound words fragment |
| **Russian** | 4.5 | Cyrillic | Multi-token words |
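
"Avg Tokens/Concept" can be read as the mean number of tokens the tokenizer emits per concept word. The sketch below shows one way to compute it; it is not the nyx-probing implementation. The function name `avg_tokens_per_concept`, the `byte_encode` stand-in, and the sample words (including the German translations) are assumptions for illustration. The stand-in counts UTF-8 bytes so the example runs offline, which means its numbers are not comparable to the table.

```python
from statistics import mean
from typing import Callable, Iterable, Sequence


def avg_tokens_per_concept(
    encode: Callable[[str], Sequence[int]],
    concepts: Iterable[str],
) -> float:
    """Mean number of tokens the encoder produces per concept word."""
    return mean(len(encode(word)) for word in concepts)


def byte_encode(text: str) -> list[int]:
    """Hypothetical offline stand-in: one 'token' per UTF-8 byte."""
    return list(text.encode("utf-8"))


# With the real model one would pass the tokenizer instead, for example:
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
#   encode = lambda word: tok.encode(word, add_special_tokens=False)

# Illustrative word lists, not the project's vocabulary files.
SAMPLE = {"EN": ["heart", "being"], "DE": ["Herz", "Sein"]}
for lang, words in SAMPLE.items():
    print(lang, avg_tokens_per_concept(byte_encode, words))
```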
### Efficiency Implications
```
MORE TOKENS = DIFFERENT PATH
├── German (4.5) → Philosophical valleys, isolated from ZH/JA
└── Russian (4.5) → Expected to resemble German (isolation not yet confirmed)

FEWER TOKENS = FASTER CONVERGENCE
├── Chinese (1.0) → Direct concept mapping
├── Arabic (1.5) → Efficient encoding
├── Japanese (1.8) → Shared with Chinese
└── Low-token group (ZH/AR/EN) → Converge in layers 12-24
```
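
Saying that a group of languages converges in layers 12-24 implies comparing, layer by layer, the hidden states of the same concept in two languages. The sketch below shows one way to locate such a layer, under stated assumptions: cosine similarity is used as the metric (the document does not define its similarity measure), the 0.9 threshold and the names `cosine_similarity` and `first_converged_layer` are hypothetical, and the 2-D vectors are made-up toy data, not probe output.

```python
import math
from typing import Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def first_converged_layer(
    states_a: Sequence[Vector],
    states_b: Sequence[Vector],
    threshold: float = 0.9,  # assumed cut-off, not a value from the probes
) -> Optional[int]:
    """Index of the first layer whose similarity reaches the threshold."""
    for layer, (a, b) in enumerate(zip(states_a, states_b)):
        if cosine_similarity(a, b) >= threshold:
            return layer
    return None


# Toy per-layer "hidden states" for one concept in two languages.
zh = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
ja = [(0.0, 1.0), (1.0, 0.5), (0.0, 2.0)]
print(first_converged_layer(zh, ja))
```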
---
## 5. Master Language Matrix
### Priority Languages for Curriculum
| Language | World Rank | Web % | Qwen Support | Tokens | Priority |
|----------|------------|-------|--------------|--------|----------|
| **English** | 1 | 49.4% | ✅ Full | 2.5 | 🔴 Core |
| **Chinese** | 2 | 1.4% | ✅ Full | 1.0 | 🔴 Core |
| **Hindi** | 3 | <0.5% | ⚠️ Limited | ? | 🟡 Test |
| **Spanish** | 4 | 6.0% | ✅ Full | ~2.5 | 🟢 Include |
| **Arabic** | 5 | 0.6% | ✅ Full | 1.5 | 🔴 Core |
| **French** | 6 | 4.3% | ✅ Full | ~3.0 | 🟢 Include |
| **Bengali** | 7 | <0.5% | ⚠️ Limited | ? | 🟡 Test |
| **Portuguese** | 8 | 2.6% | ✅ Full | ~2.5 | 🟢 Include |
| **Russian** | 9 | 5.3% | ✅ Full | 4.5 | 🟢 Include |
| **German** | 12 | 5.6% | ✅ Full | 4.5 | 🔴 Core |
| **Japanese** | 13 | 4.9% | ✅ Full | 1.8 | 🔴 Core |
| **Korean** | 15 | ~1% | ✅ Full | ~2.0 | 🟢 Include |
### Recommended Probing Languages
**Tier 1 (Core - different cognitive paths):**
- English (EN) - baseline, medium tokens
- Chinese (ZH) - most efficient, single token
- Arabic (AR) - efficient, underrepresented in web
- German (DE) - multi-token, isolated path
- Japanese (JA) - shared with Chinese
**Tier 2 (Validation):**
- Spanish (ES) - high native speakers
- Russian (RU) - multi-token like German
- French (FR) - colonial spread
- Korean (KO) - isolated script
**Tier 3 (Edge cases):**
- Hindi (HI) - underrepresented, test support
- Bengali (BN) - underrepresented
- Indonesian (ID) - high L2 ratio
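
The three tiers can be captured as a small configuration so that a probing run selects languages by tier. This is a sketch only: the `ProbeLanguage` type, the `PROBE_LANGUAGES` list, and `languages_for_tier` are hypothetical names and not part of the nyx-probe CLI. The language codes and tier assignments are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeLanguage:
    code: str  # two-letter code as used in this document
    name: str
    tier: int  # 1 = core, 2 = validation, 3 = edge cases


PROBE_LANGUAGES = [
    ProbeLanguage("EN", "English", 1),
    ProbeLanguage("ZH", "Chinese", 1),
    ProbeLanguage("AR", "Arabic", 1),
    ProbeLanguage("DE", "German", 1),
    ProbeLanguage("JA", "Japanese", 1),
    ProbeLanguage("ES", "Spanish", 2),
    ProbeLanguage("RU", "Russian", 2),
    ProbeLanguage("FR", "French", 2),
    ProbeLanguage("KO", "Korean", 2),
    ProbeLanguage("HI", "Hindi", 3),
    ProbeLanguage("BN", "Bengali", 3),
    ProbeLanguage("ID", "Indonesian", 3),
]


def languages_for_tier(max_tier: int) -> list[str]:
    """Codes of all languages whose tier is at or below max_tier."""
    return [lang.code for lang in PROBE_LANGUAGES if lang.tier <= max_tier]


print(languages_for_tier(1))
```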
---
## 6. Research Questions
### Tokenization
- [ ] Map token counts for all 29+ Qwen languages
- [ ] Identify other "isolated" languages like German
- [ ] Test Hindi/Bengali token efficiency
### Convergence
- [ ] Do Spanish/Portuguese converge like ZH/JA?
- [ ] Does Arabic converge with any other language?
- [ ] Is Russian isolated like German?
### Valleys
- [ ] Which languages access philosophical valleys?
- [ ] Which languages trigger code valleys?
- [ ] Can we predict valley from token count?
### Curriculum
- [ ] Which language pairs enable cross-lingual transfer?
- [ ] Can we use Chinese efficiency for concept compression?
- [ ] Does teaching in German transfer to English?
---
## 7. Key Insights
1. **Web ≠ World**: German has about 3× the web content its share of speakers would predict, while Arabic and Hindi are roughly 9× and 15× underrepresented
2. **Qwen's bias**: To the extent its training mix follows web data, it likely inherits German/Dutch overrepresentation and Arabic/Hindi underrepresentation
3. **Token efficiency appears to correlate with convergence**: In our probes so far, low-token languages (ZH, AR) converge quickly, while multi-token German takes an isolated path; whether Russian behaves the same way is still an open question (Section 6)
4. **Strategic opportunities**:
- German for philosophical depth
- Chinese for concept compression
- Arabic as undertested efficient language
- Hindi as edge case for robustness
---
## References
### World Language Statistics
- [Statista: Most Spoken Languages](https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/)
- [Ethnologue 200](https://www.ethnologue.com/insights/ethnologue200/)
- [Berlitz: 25 Most Spoken Languages](https://www.berlitz.com/blog/most-spoken-languages-world)
### Internet Language Distribution
- [W3Techs: Content Languages](https://w3techs.com/technologies/overview/content_language)
- [Statista: Languages on Internet](https://www.statista.com/statistics/262946/most-common-languages-on-the-internet/)
- [Wikipedia: Languages on Internet](https://en.wikipedia.org/wiki/Languages_used_on_the_Internet)
### Qwen 2.5 Documentation
- [Qwen Blog: Qwen 2.5 Announcement](https://qwenlm.github.io/blog/qwen2.5/)
- [HuggingFace: Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B)
- [Alibaba Cloud: Qwen2.5-LLM](https://www.alibabacloud.com/blog/qwen2-5-llm-extending-the-boundary-of-llms_601786)
---
*"To understand the mind, first understand its languages."*
🌙 Compiled by the Partnership, 2025-12-06