feat: complete Phase 1 - vocabulary expansion & DriftProbe infrastructure

- CLI: nyx-probe scan with --summary/--delta/--full flags
- DriftProbe: training safety with Gini coefficient + Angular Drift
- Vocabulary: 54 terms (30 nimmerverse + 24 German philosophical)
- Sentinels: ANCHOR/BRIDGE/CANARY/TARGET monitoring system

Key findings:
- German philosophical terms: 37.5% depth≥2 hit rate (vs 3.3% nimmerverse)
- Super Cluster validated: heart cross-lang sim = 1.000
- Isolated Zone confirmed: being EN↔DE sim = 0.195
- Gini signature: Philosophy ~0.5 (diffuse), Technical ~0.8 (sparse)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
238
docs/language-landscape.md
Normal file
@@ -0,0 +1,238 @@
# Language Landscape: World, Internet, and Qwen 2.5

**Compiled:** 2025-12-06
**Purpose:** Reference for multilingual probing and curriculum design

---

## Overview

This document maps:
1. Most spoken languages worldwide (by total speakers)
2. Most used languages on the internet (web content)
3. Languages supported by Qwen 2.5-7B-Base
4. Token efficiency for each language

---

## 1. World's Most Spoken Languages (2024-2025)

### By Total Speakers (Native + Learners)

| Rank | Language | Total Speakers | Native Speakers | Notes |
|------|----------|----------------|-----------------|-------|
| 1 | **English** | 1.52 billion | 380 million | 25% native, 75% L2 |
| 2 | **Mandarin Chinese** | 1.14 billion | 941 million | Most native speakers |
| 3 | **Hindi** | 609 million | 345 million | Growing rapidly |
| 4 | **Spanish** | 560 million | 480 million | High native ratio |
| 5 | **Arabic** | 422 million | 313 million | Many dialects |
| 6 | **French** | 321 million | 77 million | 32 countries official |
| 7 | **Bengali** | 273 million | 230 million | South Asia |
| 8 | **Portuguese** | 264 million | 232 million | Brazil dominates |
| 9 | **Russian** | 255 million | 150 million | Wide L2 spread |
| 10 | **Urdu** | 232 million | 70 million | South Asia |
| 11 | **Indonesian** | 199 million | 43 million | Lingua franca |
| 12 | **German** | 135 million | 95 million | Central Europe |
| 13 | **Japanese** | 125 million | 123 million | Almost entirely native speakers |
| 14 | **Vietnamese** | 85 million | 76 million | Southeast Asia |
| 15 | **Korean** | 82 million | 77 million | Two states |

*Sources: [Statista](https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/), [Ethnologue](https://www.ethnologue.com/insights/ethnologue200/), [Berlitz](https://www.berlitz.com/blog/most-spoken-languages-world)*

---

## 2. Internet Language Distribution (2024-2025)

### Web Content by Language (% of websites)

| Rank | Language | % of Web | Notes |
|------|----------|----------|-------|
| 1 | **English** | 49.4% | Dominant |
| 2 | **Spanish** | 6.0% | Growing |
| 3 | **German** | 5.6% | Overrepresented vs speakers |
| 4 | **Russian** | 5.3% | Strong tech presence |
| 5 | **Japanese** | 4.9% | Island content |
| 6 | **French** | 4.3% | Colonial spread |
| 7 | **Portuguese** | 2.6% | Brazil growing |
| 8 | **Italian** | 2.1% | |
| 9 | **Dutch** | 1.8% | Small population, high output |
| 10 | **Polish** | 1.7% | |
| 11 | **Chinese** | 1.4% | **Underrepresented!** |
| 12 | **Turkish** | 1.3% | |
| 13 | **Persian** | 1.0% | |
| 14 | **Vietnamese** | 0.9% | Growing |
| 15 | **Arabic** | 0.6% | **Severely underrepresented!** |

*Sources: [W3Techs](https://w3techs.com/technologies/overview/content_language), [Statista](https://www.statista.com/statistics/262946/most-common-languages-on-the-internet/)*

### The Paradox Languages

| Language | % World Speakers | % Web Content | Gap Factor |
|----------|------------------|---------------|------------|
| **Chinese** | 14.3% | 1.4% | 10× underrepresented |
| **Arabic** | 5.3% | 0.6% | 9× underrepresented |
| **Hindi** | 7.7% | <0.5% | ≥15× underrepresented |
| **German** | 1.7% | 5.6% | 3× **overrepresented** |
| **Dutch** | 0.3% | 1.8% | 6× **overrepresented** |
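
The Gap Factor column is the ratio between a language's share of world speakers and its share of web content. A minimal sketch of that arithmetic follows; the function name `gap_factor` and the `SHARES` dictionary are illustrative, not part of nyx-probing, and the percentages are copied from the table above.

```python
def gap_factor(speaker_share: float, web_share: float) -> tuple[float, str]:
    """Return (factor, direction) comparing web share with speaker share."""
    if speaker_share <= 0 or web_share <= 0:
        raise ValueError("shares must be positive percentages")
    if web_share >= speaker_share:
        return web_share / speaker_share, "overrepresented"
    return speaker_share / web_share, "underrepresented"


# (% of world speakers, % of web content), copied from the table above.
# Hindi's web share is listed as "<0.5%", so 0.5 is an upper bound and the
# resulting factor is a lower bound.
SHARES = {
    "Chinese": (14.3, 1.4),
    "Arabic": (5.3, 0.6),
    "Hindi": (7.7, 0.5),
    "German": (1.7, 5.6),
    "Dutch": (0.3, 1.8),
}

for language, (speakers, web) in SHARES.items():
    factor, direction = gap_factor(speakers, web)
    print(f"{language}: {factor:.1f}x {direction}")
```

The table rounds these ratios to whole numbers.
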

**Implication:** If Qwen's training mix mirrors web content, it is likely biased toward German and Dutch and underexposed to Hindi and Arabic.

---

## 3. Qwen 2.5 Supported Languages

### Officially Supported (29+ languages)

Qwen 2.5 advertises support for over 29 languages. The announcement names only some of them, so treat the grouping below as a working list rather than an official one:

| Family | Languages |
|--------|-----------|
| **East Asian** | Chinese (Simplified/Traditional), Japanese, Korean |
| **European** | English, German, French, Spanish, Portuguese, Italian, Russian, Dutch, Polish |
| **South Asian** | Hindi (limited?), Bengali |
| **Southeast Asian** | Thai, Vietnamese, Indonesian, Malay |
| **Middle Eastern** | Arabic, Turkish, Persian |
| **Other** | Hebrew, Ukrainian, Greek |

### Training Data

- **18 trillion tokens** total
- Enhanced code, math, and multilingual data
- Heavy English/Chinese bias (web scraping)

*Source: [Qwen Blog](https://qwenlm.github.io/blog/qwen2.5/), [HuggingFace](https://huggingface.co/Qwen/Qwen2.5-7B)*

---

## 4. Token Efficiency Analysis

### Tested in Our Probing (nyx-probing)

| Language | Avg Tokens/Concept | Script | Notes |
|----------|-------------------|--------|-------|
| **Chinese** | 1.0 | Hanzi | Most efficient |
| **Arabic** | 1.5 | Arabic | Compact |
| **Japanese** | 1.8 | Kanji/Kana | Mixed scripts |
| **English** | 2.5 | Latin | Medium |
| **German** | 4.5 | Latin | Compound words fragment |
| **Russian** | 4.5 | Cyrillic | Multi-token words |
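
An "Avg Tokens/Concept" figure like the ones above can be obtained by tokenizing each concept word and averaging the token counts. The sketch below shows one way to do that; it is an assumption about the procedure, not the nyx-probing implementation. `avg_tokens_per_concept`, `toy_tokenize`, and the sample word list are hypothetical, and the toy tokenizer only exists so the example runs offline.

```python
from statistics import mean
from typing import Callable, Iterable


def avg_tokens_per_concept(
    words: Iterable[str],
    tokenize: Callable[[str], list],
) -> float:
    """Mean number of tokens the given tokenizer produces per word."""
    counts = [len(tokenize(word)) for word in words]
    if not counts:
        raise ValueError("need at least one word")
    return mean(counts)


def toy_tokenize(word: str) -> list:
    """Stand-in tokenizer: fixed 3-character chunks.

    This is NOT how Qwen's tokenizer behaves; it only makes the sketch runnable.
    """
    return [word[i:i + 3] for i in range(0, len(word), 3)]


# Hypothetical sample of German concept words.
concepts_de = ["Sein", "Dasein", "Weltanschauung"]
print(avg_tokens_per_concept(concepts_de, toy_tokenize))
```

To measure against the real model, the `tokenize` method of a Hugging Face tokenizer loaded with `AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")` could be passed in place of `toy_tokenize`.
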

### Efficiency Implications

```
MORE TOKENS = DIFFERENT PATH
├── German (4.5) → Philosophical valleys, isolated from ZH/JA
├── Russian (4.5) → Similar to German, isolated
└── Low-token (ZH/AR/EN) → Converge in layers 12-24

FEWER TOKENS = FASTER CONVERGENCE
├── Chinese (1.0) → Direct concept mapping
├── Arabic (1.5) → Efficient encoding
└── Japanese (1.8) → Shared with Chinese
```
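
One way to make "converge in layers 12-24" measurable is to compare the per-layer hidden states of the same concept in two languages and report the first layer whose cosine similarity crosses a threshold. The sketch below assumes that setup; `first_converged_layer`, the 0.9 threshold, and the toy vectors are all hypothetical and do not come from nyx-probing.

```python
import math
from typing import Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("zero-length vector")
    return dot / norm


def first_converged_layer(
    states_a: Sequence[Sequence[float]],
    states_b: Sequence[Sequence[float]],
    threshold: float = 0.9,  # hypothetical cut-off for "converged"
) -> Optional[int]:
    """Index of the first layer where the two representations align, else None."""
    for layer, (a, b) in enumerate(zip(states_a, states_b)):
        if cosine(a, b) >= threshold:
            return layer
    return None


# Toy 2-dimensional "hidden states" for three layers; real states would come
# from the model, one vector per layer for the same concept in each language.
en = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
zh = [[0.0, 1.0], [1.0, 0.0], [1.0, 2.1]]
print(first_converged_layer(en, zh))
```

A pair that never crosses the threshold returns `None`, which is how an "isolated" language pair would show up under this definition.
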

---

## 5. Master Language Matrix

### Priority Languages for Curriculum

| Language | World Rank | Web % | Qwen Support | Tokens | Priority |
|----------|------------|-------|--------------|--------|----------|
| **English** | 1 | 49.4% | ✅ Full | 2.5 | 🔴 Core |
| **Chinese** | 2 | 1.4% | ✅ Full | 1.0 | 🔴 Core |
| **Hindi** | 3 | <0.5% | ⚠️ Limited | ? | 🟡 Test |
| **Spanish** | 4 | 6.0% | ✅ Full | ~2.5 | 🟢 Include |
| **Arabic** | 5 | 0.6% | ✅ Full | 1.5 | 🔴 Core |
| **French** | 6 | 4.3% | ✅ Full | ~3.0 | 🟢 Include |
| **Bengali** | 7 | <0.5% | ⚠️ Limited | ? | 🟡 Test |
| **Portuguese** | 8 | 2.6% | ✅ Full | ~2.5 | 🟢 Include |
| **Russian** | 9 | 5.3% | ✅ Full | 4.5 | 🟢 Include |
| **German** | 12 | 5.6% | ✅ Full | 4.5 | 🔴 Core |
| **Japanese** | 13 | 4.9% | ✅ Full | 1.8 | 🔴 Core |
| **Korean** | 15 | ~1% | ✅ Full | ~2.0 | 🟢 Include |

### Recommended Probing Languages

**Tier 1 (Core - different cognitive paths):**
- English (EN) - baseline, medium tokens
- Chinese (ZH) - most efficient, single token
- Arabic (AR) - efficient, underrepresented in web
- German (DE) - multi-token, isolated path
- Japanese (JA) - shared with Chinese

**Tier 2 (Validation):**
- Spanish (ES) - high native speakers
- Russian (RU) - multi-token like German
- French (FR) - colonial spread
- Korean (KO) - isolated script

**Tier 3 (Edge cases):**
- Hindi (HI) - underrepresented, test support
- Bengali (BN) - underrepresented
- Indonesian (ID) - high L2 ratio
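
For scripting probe runs, the tiers above could be captured as a small lookup table. This is only a sketch: `PROBE_TIERS` and `languages_up_to` are hypothetical names, and nothing here reflects an existing nyx-probe configuration format.

```python
# Language codes per tier, copied from the lists above.
PROBE_TIERS = {
    1: ["EN", "ZH", "AR", "DE", "JA"],  # core
    2: ["ES", "RU", "FR", "KO"],        # validation
    3: ["HI", "BN", "ID"],              # edge cases
}


def languages_up_to(tier: int) -> list[str]:
    """All language codes in the given tier and every tier before it."""
    return [
        code
        for level in sorted(PROBE_TIERS)
        if level <= tier
        for code in PROBE_TIERS[level]
    ]


print(languages_up_to(2))
```
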

---

## 6. Research Questions

### Tokenization
- [ ] Map token counts for all 29+ Qwen languages
- [ ] Identify other "isolated" languages like German
- [ ] Test Hindi/Bengali token efficiency

### Convergence
- [ ] Do Spanish/Portuguese converge like ZH/JA?
- [ ] Does Arabic converge with any other language?
- [ ] Is Russian isolated like German?

### Valleys
- [ ] Which languages access philosophical valleys?
- [ ] Which languages trigger code valleys?
- [ ] Can we predict valley from token count?

### Curriculum
- [ ] Which language pairs enable cross-lingual transfer?
- [ ] Can we use Chinese efficiency for concept compression?
- [ ] Does teaching in German transfer to English?

---

## 7. Key Insights

1. **Web ≠ World**: German has about 3× the web content its share of speakers would predict, while Chinese, Arabic, and Hindi are roughly 9-15× underrepresented

2. **Qwen's bias**: If its training data mirrors the web, Qwen inherits German/Dutch overrepresentation and Arabic/Hindi underrepresentation

3. **Token efficiency correlates with convergence**: Low-token languages (ZH, AR) converge quickly; multi-token languages (DE, RU) take isolated paths

4. **Strategic opportunities**:
   - German for philosophical depth
   - Chinese for concept compression
   - Arabic as undertested efficient language
   - Hindi as edge case for robustness

---

## References

### World Language Statistics
- [Statista: Most Spoken Languages](https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/)
- [Ethnologue 200](https://www.ethnologue.com/insights/ethnologue200/)
- [Berlitz: 25 Most Spoken Languages](https://www.berlitz.com/blog/most-spoken-languages-world)

### Internet Language Distribution
- [W3Techs: Content Languages](https://w3techs.com/technologies/overview/content_language)
- [Statista: Languages on Internet](https://www.statista.com/statistics/262946/most-common-languages-on-the-internet/)
- [Wikipedia: Languages on Internet](https://en.wikipedia.org/wiki/Languages_used_on_the_Internet)

### Qwen 2.5 Documentation
- [Qwen Blog: Qwen 2.5 Announcement](https://qwenlm.github.io/blog/qwen2.5/)
- [HuggingFace: Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B)
- [Alibaba Cloud: Qwen2.5-LLM](https://www.alibabacloud.com/blog/qwen2-5-llm-extending-the-boundary-of-llms_601786)

---

*"To understand the mind, first understand its languages."*

🌙 Compiled by the Partnership, 2025-12-06