Language Landscape: World, Internet, and Qwen 2.5
Compiled: 2025-12-06
Purpose: Reference for multilingual probing and curriculum design
Overview
This document maps:
- Most spoken languages worldwide (by total speakers)
- Most used languages on the internet (web content)
- Languages supported by Qwen 2.5-7B-Base
- Token efficiency for each language
1. World's Most Spoken Languages (2024-2025)
By Total Speakers (Native + Learners)
| Rank | Language | Total Speakers | Native Speakers | Notes |
|---|---|---|---|---|
| 1 | English | 1.52 billion | 380 million | 25% native, 75% L2 |
| 2 | Mandarin Chinese | 1.14 billion | 941 million | Most native speakers |
| 3 | Hindi | 609 million | 345 million | Growing rapidly |
| 4 | Spanish | 560 million | 480 million | High native ratio |
| 5 | Arabic | 422 million | 313 million | Many dialects |
| 6 | French | 321 million | 77 million | 32 countries official |
| 7 | Bengali | 273 million | 230 million | South Asia |
| 8 | Portuguese | 264 million | 232 million | Brazil dominates |
| 9 | Russian | 255 million | 150 million | Wide L2 spread |
| 10 | Urdu | 232 million | 70 million | South Asia |
| 11 | Indonesian | 199 million | 43 million | Lingua franca |
| 12 | German | 135 million | 95 million | Central Europe |
| 13 | Japanese | 125 million | 123 million | Island isolation |
| 14 | Vietnamese | 85 million | 76 million | Southeast Asia |
| 15 | Korean | 82 million | 77 million | Two states |
Sources: Statista, Ethnologue, Berlitz
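The Notes column quotes native/L2 splits for only a few rows. A minimal sketch of how those splits follow from the two speaker columns is shown here; the dictionary name `SPEAKERS` and the helper `native_share` are illustrative, and the figures are copied from the table for a handful of languages.

```python
# (total speakers, native speakers) in millions, copied from the table above.
SPEAKERS = {
    "English": (1520, 380),
    "Mandarin Chinese": (1140, 941),
    "Spanish": (560, 480),
    "French": (321, 77),
    "Indonesian": (199, 43),
}


def native_share(total_millions: float, native_millions: float) -> float:
    """Return native speakers as a fraction of total speakers."""
    return native_millions / total_millions


shares = {lang: native_share(total, native) for lang, (total, native) in SPEAKERS.items()}

# Print from the most L2-heavy language to the most native-heavy one.
for language, share in sorted(shares.items(), key=lambda item: item[1]):
    print(f"{language:<18} native {share:6.1%}   L2 {1 - share:6.1%}")
```

Languages with a low native share (Indonesian, French, English) are the ones whose written output is most likely to come from second-language speakers, which may matter when interpreting web-content shares.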
2. Internet Language Distribution (2024-2025)
Web Content by Language (% of websites)
| Rank | Language | % of Web | Notes |
|---|---|---|---|
| 1 | English | 49.4% | Dominant |
| 2 | Spanish | 6.0% | Growing |
| 3 | German | 5.6% | Overrepresented vs speakers |
| 4 | Russian | 5.3% | Strong tech presence |
| 5 | Japanese | 4.9% | Island content |
| 6 | French | 4.3% | Colonial spread |
| 7 | Portuguese | 2.6% | Brazil growing |
| 8 | Italian | 2.1% | |
| 9 | Dutch | 1.8% | Small population, high output |
| 10 | Polish | 1.7% | |
| 11 | Chinese | 1.4% | Underrepresented! |
| 12 | Turkish | 1.3% | |
| 13 | Persian | 1.0% | |
| 14 | Vietnamese | 0.9% | Growing |
| 15 | Arabic | 0.6% | Severely underrepresented! |
The Paradox Languages
| Language | % World Speakers | % Web Content | Gap Factor |
|---|---|---|---|
| Chinese | 14.3% | 1.4% | 10× underrepresented |
| Arabic | 5.3% | 0.6% | 9× underrepresented |
| Hindi | 7.7% | <0.5% | 15× underrepresented |
| German | 1.7% | 5.6% | 3× overrepresented |
| Dutch | 0.3% | 1.8% | 6× overrepresented |
Implication: to the extent that Qwen's training mix mirrors web content, the model is likely biased toward German/Dutch and underexposed to Hindi/Arabic.
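The Gap Factor column is the ratio between a language's share of world speakers and its share of web content. The sketch below recomputes it from the table; the names `PARADOX` and `gap_factor` are illustrative, and Hindi's "<0.5%" is treated as 0.5, so its factor is a lower bound.

```python
# (% of world speakers, % of web content), copied from the table above.
# Hindi's web share is listed as "<0.5%"; 0.5 is used as an upper bound,
# which makes the computed Hindi factor a lower bound.
PARADOX = {
    "Chinese": (14.3, 1.4),
    "Arabic": (5.3, 0.6),
    "Hindi": (7.7, 0.5),
    "German": (1.7, 5.6),
    "Dutch": (0.3, 1.8),
}


def gap_factor(speaker_pct: float, web_pct: float) -> tuple:
    """Return (factor, direction), where factor is always >= 1."""
    if web_pct >= speaker_pct:
        return web_pct / speaker_pct, "overrepresented"
    return speaker_pct / web_pct, "underrepresented"


gaps = {lang: gap_factor(*values) for lang, values in PARADOX.items()}

for lang, (factor, direction) in gaps.items():
    print(f"{lang:<8} {factor:4.1f}x {direction}")
```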
3. Qwen 2.5 Supported Languages
Officially Supported (29+ languages)
Qwen 2.5 explicitly supports multilingual content in:
| Family | Languages |
|---|---|
| East Asian | Chinese (Simplified/Traditional), Japanese, Korean |
| European | English, German, French, Spanish, Portuguese, Italian, Russian, Dutch, Polish |
| South Asian | Hindi (limited?), Bengali |
| Southeast Asian | Thai, Vietnamese, Indonesian, Malay |
| Middle Eastern | Arabic, Turkish, Persian |
| Other | Hebrew, Ukrainian, Greek |
Training Data
- 18 trillion tokens total
- Enhanced code, math, and multilingual data
- Heavy English/Chinese bias (web scraping)
Source: Qwen Blog, HuggingFace
4. Token Efficiency Analysis
Tested in Our Probing (nyx-probing)
| Language | Avg Tokens/Concept | Script | Notes |
|---|---|---|---|
| Chinese | 1.0 | Hanzi | Most efficient |
| Arabic | 1.5 | Arabic | Compact |
| Japanese | 1.8 | Kanji/Kana | Mixed scripts |
| English | 2.5 | Latin | Medium |
| German | 4.5 | Latin | Compound words fragment |
| Russian | 4.5 | Cyrillic | Multi-token words |
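One way the "Avg Tokens/Concept" figure could be computed is sketched below. This is not the nyx-probing implementation: the function names, the sample words, and the toy tokenizer are all assumptions made so the example runs offline. In practice the `tokenize` argument could be the `tokenize` method of a Hugging Face tokenizer loaded for the model under test.

```python
from statistics import mean
from typing import Callable, Iterable, List


def avg_tokens_per_concept(words: Iterable[str],
                           tokenize: Callable[[str], List[str]]) -> float:
    """Mean number of tokens the tokenizer produces per word."""
    return mean(len(tokenize(word)) for word in words)


def toy_tokenize(word: str, chunk: int = 4) -> List[str]:
    """Stand-in tokenizer that cuts a word into fixed-size character chunks.

    This is NOT how Qwen tokenizes; it only keeps the example self-contained.
    """
    return [word[i:i + chunk] for i in range(0, len(word), chunk)]


# Hypothetical probe words; the actual probing vocabulary is not reproduced here.
samples = {
    "EN": ["being", "heart", "truth"],
    "DE": ["Dasein", "Herz", "Wahrheit"],
}

averages = {lang: avg_tokens_per_concept(words, toy_tokenize)
            for lang, words in samples.items()}

for lang, value in averages.items():
    print(f"{lang}: {value:.2f} tokens per concept (toy tokenizer)")
```

Passing the tokenizer in as a callable keeps the measurement independent of any one library, so the same function can be reused across the languages still marked "?" in this document.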
Efficiency Implications
MORE TOKENS = DIFFERENT PATH
├── German (4.5) → Philosophical valleys, isolated from ZH/JA
├── Russian (4.5) → Similar to German, isolated
└── Low-token (ZH/AR/EN) → Converge in layers 12-24
FEWER TOKENS = FASTER CONVERGENCE
├── Chinese (1.0) → Direct concept mapping
├── Arabic (1.5) → Efficient encoding
└── Japanese (1.8) → Shared with Chinese
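The grouping in the diagram can be reproduced with a single cutoff on average token count. The sketch below assumes a cutoff of 3.0, which is an arbitrary illustrative value: any number between 2.5 (English) and 4.5 (German/Russian) yields the same split, and the source does not state a measured threshold.

```python
# Average tokens per concept, copied from the probing table.
AVG_TOKENS = {"ZH": 1.0, "AR": 1.5, "JA": 1.8, "EN": 2.5, "DE": 4.5, "RU": 4.5}

# Hypothetical cutoff, chosen only to separate EN (2.5) from DE/RU (4.5).
CUTOFF = 3.0


def path_group(avg_tokens: float, cutoff: float = CUTOFF) -> str:
    """Label a language as 'isolated' (many tokens) or 'converging' (few tokens)."""
    return "isolated" if avg_tokens > cutoff else "converging"


groups = {}
for lang, tokens in AVG_TOKENS.items():
    groups.setdefault(path_group(tokens), []).append(lang)

print(groups)
```

With only six measured languages this is a description of the observed split rather than a predictor; it would need the remaining languages tokenized before the cutoff means much.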
5. Master Language Matrix
Priority Languages for Curriculum
| Language | World Rank | Web % | Qwen Support | Tokens | Priority |
|---|---|---|---|---|---|
| English | 1 | 49.4% | ✅ Full | 2.5 | 🔴 Core |
| Chinese | 2 | 1.4% | ✅ Full | 1.0 | 🔴 Core |
| Hindi | 3 | <0.5% | ⚠️ Limited | ? | 🟡 Test |
| Spanish | 4 | 6.0% | ✅ Full | ~2.5 | 🟢 Include |
| Arabic | 5 | 0.6% | ✅ Full | 1.5 | 🔴 Core |
| French | 6 | 4.3% | ✅ Full | ~3.0 | 🟢 Include |
| Bengali | 7 | <0.5% | ⚠️ Limited | ? | 🟡 Test |
| Portuguese | 8 | 2.6% | ✅ Full | ~2.5 | 🟢 Include |
| Russian | 9 | 5.3% | ✅ Full | 4.5 | 🟢 Include |
| German | 12 | 5.6% | ✅ Full | 4.5 | 🔴 Core |
| Japanese | 13 | 4.9% | ✅ Full | 1.8 | 🔴 Core |
| Korean | 15 | ~1% | ✅ Full | ~2.0 | 🟢 Include |
Recommended Probing Languages
Tier 1 (Core - different cognitive paths):
- English (EN) - baseline, medium tokens
- Chinese (ZH) - most efficient, single token
- Arabic (AR) - efficient, underrepresented in web
- German (DE) - multi-token, isolated path
- Japanese (JA) - shared with Chinese
Tier 2 (Validation):
- Spanish (ES) - high native speakers
- Russian (RU) - multi-token like German
- French (FR) - colonial spread
- Korean (KO) - isolated script
Tier 3 (Edge cases):
- Hindi (HI) - underrepresented, test support
- Bengali (BN) - underrepresented
- Indonesian (ID) - high L2 ratio
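The three tiers translate naturally into a small configuration structure that a probing run could iterate over. The sketch below is one possible shape; `PROBE_TIERS` and `languages_up_to` are hypothetical names, not part of nyx-probing.

```python
# Language codes per tier, copied from the lists above.
PROBE_TIERS = {
    1: ["EN", "ZH", "AR", "DE", "JA"],  # core
    2: ["ES", "RU", "FR", "KO"],        # validation
    3: ["HI", "BN", "ID"],              # edge cases
}


def languages_up_to(tier: int) -> list:
    """Return all language codes in tiers 1..tier, in tier order."""
    return [code
            for level in sorted(PROBE_TIERS)
            if level <= tier
            for code in PROBE_TIERS[level]]


print(languages_up_to(1))  # core languages only
print(languages_up_to(3))  # full probing set
```

Keeping the tiers cumulative means a quick run can stop at Tier 1 while a full sweep reuses the same list without duplication.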
6. Research Questions
Tokenization
- Map token counts for all 29+ Qwen languages
- Identify other "isolated" languages like German
- Test Hindi/Bengali token efficiency
Convergence
- Do Spanish/Portuguese converge like ZH/JA?
- Does Arabic converge with any other language?
- Is Russian isolated like German?
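These questions need an operational definition of "converge". One option, assumed here and not confirmed by the source, is to compare per-layer representations of the same concept in two languages with cosine similarity and report the first layer where it crosses a threshold. The vectors below are made-up stand-ins for hidden states, and the 0.99 threshold is arbitrary.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def convergence_layer(states_a, states_b, threshold=0.99):
    """Index of the first layer whose similarity reaches the threshold, else None."""
    for layer, (a, b) in enumerate(zip(states_a, states_b)):
        if cosine_similarity(a, b) >= threshold:
            return layer
    return None


# Toy 3-dimensional vectors, one per layer, for the same concept in two languages.
toy_es = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
toy_pt = [[0.0, 1.0, 0.0], [1.0, 0.5, 0.0], [1.0, 1.0, 0.9]]

print(convergence_layer(toy_es, toy_pt))
```

A language pair that never reaches the threshold returns `None`, which gives "isolated" a concrete, checkable meaning under this definition.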
Valleys
- Which languages access philosophical valleys?
- Which languages trigger code valleys?
- Can we predict valley from token count?
Curriculum
- Which language pairs enable cross-lingual transfer?
- Can we use Chinese efficiency for concept compression?
- Does teaching in German transfer to English?
7. Key Insights
- Web ≠ World: German has about 3× more web content than its share of speakers would predict, while Arabic and Hindi are 9-15× underrepresented.
- Qwen's bias: to the extent that it was trained on web data, Qwen likely inherits German/Dutch overrepresentation and Arabic/Hindi underrepresentation.
- Token efficiency correlates with convergence: in our probes, low-token languages (ZH, AR) converge quickly, while multi-token languages (DE, RU) take isolated paths.
- Strategic opportunities:
  - German for philosophical depth
  - Chinese for concept compression
  - Arabic as an undertested efficient language
  - Hindi as an edge case for robustness
"To understand the mind, first understand its languages."
🌙 Compiled by the Partnership, 2025-12-06