# Language Landscape: World, Internet, and Qwen 2.5
**Compiled:** 2025-12-06

**Purpose:** Reference for multilingual probing and curriculum design

---
## Overview
This document maps:
1. Most spoken languages worldwide (by total speakers)
2. Most used languages on the internet (web content)
3. Languages supported by Qwen 2.5-7B-Base
4. Token efficiency for the languages tested so far
---
## 1. World's Most Spoken Languages (2024-2025)
### By Total Speakers (Native + Learners)
| Rank | Language | Total Speakers | Native Speakers | Notes |
|------|----------|----------------|-----------------|-------|
| 1 | **English** | 1.52 billion | 380 million | 25% native, 75% L2 |
| 2 | **Mandarin Chinese** | 1.14 billion | 941 million | Most native speakers |
| 3 | **Hindi** | 609 million | 345 million | Growing rapidly |
| 4 | **Spanish** | 560 million | 480 million | High native ratio |
| 5 | **Arabic** | 422 million | 313 million | Many dialects |
| 6 | **French** | 321 million | 77 million | 32 countries official |
| 7 | **Bengali** | 273 million | 230 million | South Asia |
| 8 | **Portuguese** | 264 million | 232 million | Brazil dominates |
| 9 | **Russian** | 255 million | 150 million | Wide L2 spread |
| 10 | **Urdu** | 232 million | 70 million | South Asia |
| 11 | **Indonesian** | 199 million | 43 million | Lingua franca |
| 12 | **German** | 135 million | 95 million | Central Europe |
| 13 | **Japanese** | 125 million | 123 million | Almost entirely native speakers |
| 14 | **Vietnamese** | 85 million | 76 million | Southeast Asia |
| 15 | **Korean** | 82 million | 77 million | Two states |
*Sources: [Statista](https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/), [Ethnologue](https://www.ethnologue.com/insights/ethnologue200/), [Berlitz](https://www.berlitz.com/blog/most-spoken-languages-world)*
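
The native and L2 percentages quoted in the Notes column follow directly from the two speaker columns. A minimal Python sketch, using a few rows copied from the table above; the name `native_share` is ours for illustration and is not part of nyx-probing:

```python
# Total and native speakers in millions, copied from the table above.
SPEAKERS = {
    "English": (1520, 380),
    "Mandarin Chinese": (1140, 941),
    "Spanish": (560, 480),
    "French": (321, 77),
    "Indonesian": (199, 43),
    "Japanese": (125, 123),
}


def native_share(total_millions: float, native_millions: float) -> float:
    """Return the fraction of all speakers who are native speakers."""
    if total_millions <= 0:
        raise ValueError("total speakers must be positive")
    return native_millions / total_millions


for language, (total, native) in SPEAKERS.items():
    share = native_share(total, native)
    # The L2 share is simply the remainder.
    print(f"{language:18s} native {share:6.1%}   L2 {1 - share:6.1%}")
```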
---
## 2. Internet Language Distribution (2024-2025)
### Web Content by Language (% of websites)
| Rank | Language | % of Web | Notes |
|------|----------|----------|-------|
| 1 | **English** | 49.4% | Dominant |
| 2 | **Spanish** | 6.0% | Growing |
| 3 | **German** | 5.6% | Overrepresented vs speakers |
| 4 | **Russian** | 5.3% | Strong tech presence |
| 5 | **Japanese** | 4.9% | Island content |
| 6 | **French** | 4.3% | Colonial spread |
| 7 | **Portuguese** | 2.6% | Brazil growing |
| 8 | **Italian** | 2.1% | |
| 9 | **Dutch** | 1.8% | Small population, high output |
| 10 | **Polish** | 1.7% | |
| 11 | **Chinese** | 1.4% | **Underrepresented!** |
| 12 | **Turkish** | 1.3% | |
| 13 | **Persian** | 1.0% | |
| 14 | **Vietnamese** | 0.9% | Growing |
| 15 | **Arabic** | 0.6% | **Severely underrepresented!** |
*Sources: [W3Techs](https://w3techs.com/technologies/overview/content_language), [Statista](https://www.statista.com/statistics/262946/most-common-languages-on-the-internet/)*
### The Paradox Languages
| Language | % World Speakers | % Web Content | Gap Factor |
|----------|------------------|---------------|------------|
| **Chinese** | 14.3% | 1.4% | 10× underrepresented |
| **Arabic** | 5.3% | 0.6% | 9× underrepresented |
| **Hindi** | 7.7% | <0.5% | >15× underrepresented |
| **German** | 1.7% | 5.6% | 3× **overrepresented** |
| **Dutch** | 0.3% | 1.8% | 6× **overrepresented** |
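**Implication:** To the extent Qwen's training mix mirrors web content, it is likely biased toward German/Dutch and underexposed to Hindi/Arabic. Chinese is a probable exception, given the English/Chinese weighting noted in section 3.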
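
The Gap Factor column is the ratio between a language's share of world speakers and its share of web content. A small Python sketch of that arithmetic, using the percentages from the table above; `gap_factor` is a hypothetical helper name, and Hindi's web share is entered as its upper bound of 0.5%, so its factor is a lower bound:

```python
from typing import Tuple

# (% of world speakers, % of web content), copied from the table above.
# Hindi's web share is given only as "<0.5%", so 0.5 is an upper bound.
SHARES = {
    "Chinese": (14.3, 1.4),
    "Arabic": (5.3, 0.6),
    "Hindi": (7.7, 0.5),
    "German": (1.7, 5.6),
    "Dutch": (0.3, 1.8),
}


def gap_factor(world_pct: float, web_pct: float) -> Tuple[float, str]:
    """Return (factor, direction) comparing web share to speaker share."""
    if world_pct <= 0 or web_pct <= 0:
        raise ValueError("shares must be positive percentages")
    ratio = web_pct / world_pct
    if ratio >= 1:
        return ratio, "overrepresented"
    # Invert so the factor always reads as "N times".
    return 1 / ratio, "underrepresented"


for language, (world_pct, web_pct) in SHARES.items():
    factor, direction = gap_factor(world_pct, web_pct)
    print(f"{language:8s} {factor:5.1f}x {direction}")
```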
---
## 3. Qwen 2.5 Supported Languages
### Officially Supported (29+ languages)
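Qwen 2.5 advertises multilingual support for more than 29 languages. The announcement names only 13 of them (Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic); the other entries below are working assumptions to verify by probing: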
| Family | Languages |
|--------|-----------|
| **East Asian** | Chinese (Simplified/Traditional), Japanese, Korean |
| **European** | English, German, French, Spanish, Portuguese, Italian, Russian, Dutch, Polish |
| **South Asian** | Hindi (limited?), Bengali |
| **Southeast Asian** | Thai, Vietnamese, Indonesian, Malay |
| **Middle Eastern** | Arabic, Turkish, Persian |
| **Other** | Hebrew, Ukrainian, Greek |
### Training Data
- **18 trillion tokens** total
- Enhanced code, math, and multilingual data
- Presumably weighted heavily toward English and Chinese (our assumption; no per-language breakdown is cited here)
*Source: [Qwen Blog](https://qwenlm.github.io/blog/qwen2.5/), [HuggingFace](https://huggingface.co/Qwen/Qwen2.5-7B)*
---
## 4. Token Efficiency Analysis
### Tested in Our Probing (nyx-probing)
| Language | Avg Tokens/Concept | Script | Notes |
|----------|-------------------|--------|-------|
| **Chinese** | 1.0 | Hanzi | Most efficient |
| **Arabic** | 1.5 | Arabic | Compact |
| **Japanese** | 1.8 | Kanji/Kana | Mixed scripts |
| **English** | 2.5 | Latin | Medium |
| **German** | 4.5 | Latin | Compound words fragment |
| **Russian** | 4.5 | Cyrillic | Multi-token words |
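
An "average tokens per concept" figure like the ones above can be computed by tokenizing each concept word and averaging the token counts per language. The following is a sketch under stated assumptions, not the nyx-probing implementation: `avg_tokens_per_concept` and `toy_tokenize` are hypothetical names, and the toy tokenizer (fixed 4-character chunks) only keeps the example self-contained, so its counts say nothing about Qwen's real tokenizer.

```python
from statistics import mean
from typing import Callable, Dict, List

Tokenize = Callable[[str], List[str]]


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer: split text into 4-character chunks."""
    return [text[i:i + 4] for i in range(0, len(text), 4)]


def avg_tokens_per_concept(
    tokenize: Tokenize, concepts: Dict[str, List[str]]
) -> Dict[str, float]:
    """Map each language code to the mean token count of its concept words."""
    return {
        lang: mean(len(tokenize(word)) for word in words)
        for lang, words in concepts.items()
        if words  # skip languages with no words to avoid an empty mean
    }


# Illustrative word list; replace with the real probing vocabulary.
CONCEPTS = {"EN": ["heart", "being"], "DE": ["Herz", "Dasein"]}

# To measure with the real tokenizer instead (needs the transformers package
# and a model download), pass its tokenize method:
#   from transformers import AutoTokenizer
#   tokenize = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B").tokenize
print(avg_tokens_per_concept(toy_tokenize, CONCEPTS))
```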
### Efficiency Implications
```
MORE TOKENS = DIFFERENT PATH
├── German (4.5) → Philosophical valleys, isolated from ZH/JA
├── Russian (4.5) → Similar to German, likely isolated (to be confirmed)
└── Low-token (ZH/AR/EN) → Converge in layers 12-24
FEWER TOKENS = FASTER CONVERGENCE
├── Chinese (1.0) → Direct concept mapping
├── Arabic (1.5) → Efficient encoding
└── Japanese (1.8) → Shared with Chinese
```
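
One way to check a claim such as "converge in layers 12-24" is to compare the hidden states of the same concept in two languages, layer by layer, and note where their cosine similarity first crosses a threshold. The sketch below assumes one vector per layer has already been extracted for each language. The function names and the 0.9 threshold are hypothetical choices for illustration, and the 2-dimensional vectors are toy data, not probe results.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def layerwise_similarity(
    states_a: Sequence[Vector], states_b: Sequence[Vector]
) -> List[float]:
    """Similarity per layer for two equally long stacks of hidden states."""
    if len(states_a) != len(states_b):
        raise ValueError("both languages need one vector per layer")
    return [cosine_similarity(a, b) for a, b in zip(states_a, states_b)]


def first_converged_layer(
    sims: Sequence[float], threshold: float = 0.9
) -> Optional[int]:
    """Index of the first layer at or above the threshold, else None."""
    for layer, sim in enumerate(sims):
        if sim >= threshold:
            return layer
    return None


# Toy data: three "layers" of 2-dimensional states for two languages.
lang_a = [[1.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
lang_b = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
sims = layerwise_similarity(lang_a, lang_b)
print(sims, first_converged_layer(sims))
```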
---
## 5. Master Language Matrix
### Priority Languages for Curriculum
| Language | World Rank | Web % | Qwen Support | Tokens | Priority |
|----------|------------|-------|--------------|--------|----------|
| **English** | 1 | 49.4% | ✅ Full | 2.5 | 🔴 Core |
| **Chinese** | 2 | 1.4% | ✅ Full | 1.0 | 🔴 Core |
| **Hindi** | 3 | <0.5% | ⚠️ Limited | ? | 🟡 Test |
| **Spanish** | 4 | 6.0% | ✅ Full | ~2.5 | 🟢 Include |
| **Arabic** | 5 | 0.6% | ✅ Full | 1.5 | 🔴 Core |
| **French** | 6 | 4.3% | ✅ Full | ~3.0 | 🟢 Include |
| **Bengali** | 7 | <0.5% | ⚠️ Limited | ? | 🟡 Test |
| **Portuguese** | 8 | 2.6% | ✅ Full | ~2.5 | 🟢 Include |
| **Russian** | 9 | 5.3% | ✅ Full | 4.5 | 🟢 Include |
| **German** | 12 | 5.6% | ✅ Full | 4.5 | 🔴 Core |
| **Japanese** | 13 | 4.9% | ✅ Full | 1.8 | 🔴 Core |
| **Korean** | 15 | ~1% | ✅ Full | ~2.0 | 🟢 Include |
### Recommended Probing Languages
**Tier 1 (Core - different cognitive paths):**

- English (EN) - baseline, medium token count
- Chinese (ZH) - most efficient, about one token per concept
- Arabic (AR) - efficient, underrepresented on the web
- German (DE) - multi-token, isolated path
- Japanese (JA) - low token count, converges with Chinese

**Tier 2 (Validation):**

- Spanish (ES) - high native-speaker ratio
- Russian (RU) - multi-token like German
- French (FR) - colonial spread
- Korean (KO) - isolated script

**Tier 3 (Edge cases):**

- Hindi (HI) - underrepresented, test support
- Bengali (BN) - underrepresented
- Indonesian (ID) - high L2 ratio
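
For use in a probing script, the tiers above could be encoded as a small configuration table. This is only a sketch: `ProbeLanguage`, `PROBE_LANGUAGES`, and `languages_up_to_tier` are hypothetical names, not existing nyx-probing identifiers.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProbeLanguage:
    code: str  # two-letter code as used in this document
    name: str
    tier: int  # 1 = core, 2 = validation, 3 = edge cases


PROBE_LANGUAGES = (
    ProbeLanguage("EN", "English", 1),
    ProbeLanguage("ZH", "Chinese", 1),
    ProbeLanguage("AR", "Arabic", 1),
    ProbeLanguage("DE", "German", 1),
    ProbeLanguage("JA", "Japanese", 1),
    ProbeLanguage("ES", "Spanish", 2),
    ProbeLanguage("RU", "Russian", 2),
    ProbeLanguage("FR", "French", 2),
    ProbeLanguage("KO", "Korean", 2),
    ProbeLanguage("HI", "Hindi", 3),
    ProbeLanguage("BN", "Bengali", 3),
    ProbeLanguage("ID", "Indonesian", 3),
)


def languages_up_to_tier(max_tier: int) -> List[str]:
    """Return language codes for every tier up to and including max_tier."""
    return [lang.code for lang in PROBE_LANGUAGES if lang.tier <= max_tier]


# A quick run might probe only the core tier; a full sweep would use tier 3.
print(languages_up_to_tier(1))
```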
---
## 6. Research Questions
### Tokenization
- [ ] Map token counts for all 29+ Qwen languages
- [ ] Identify other "isolated" languages like German
- [ ] Test Hindi/Bengali token efficiency
### Convergence
- [ ] Do Spanish/Portuguese converge like ZH/JA?
- [ ] Does Arabic converge with any other language?
- [ ] Is Russian isolated like German?
### Valleys
- [ ] Which languages access philosophical valleys?
- [ ] Which languages trigger code valleys?
- [ ] Can we predict valley from token count?
### Curriculum
- [ ] Which language pairs enable cross-lingual transfer?
- [ ] Can we use Chinese efficiency for concept compression?
- [ ] Does teaching in German transfer to English?
---
## 7. Key Insights
1. **Web ≠ World**: German has roughly 3× more web content than its share of speakers would predict, while Chinese, Arabic, and Hindi are about 9-15× underrepresented
2. **Qwen's likely bias**: To the extent its training mix mirrors web content, Qwen inherits the German/Dutch overrepresentation and the Arabic/Hindi underrepresentation
3. **Token efficiency appears to track convergence**: In our probes so far, low-token languages (ZH, AR) appear to converge quickly, while multi-token languages (DE, and presumably RU) take isolated paths; section 6 lists the cases still to be confirmed
4. **Strategic opportunities**:
   - German for philosophical depth
   - Chinese for concept compression
   - Arabic as an undertested efficient language
   - Hindi as an edge case for robustness
---
## References
### World Language Statistics
- [Statista: Most Spoken Languages](https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/)
- [Ethnologue 200](https://www.ethnologue.com/insights/ethnologue200/)
- [Berlitz: 25 Most Spoken Languages](https://www.berlitz.com/blog/most-spoken-languages-world)
### Internet Language Distribution
- [W3Techs: Content Languages](https://w3techs.com/technologies/overview/content_language)
- [Statista: Languages on Internet](https://www.statista.com/statistics/262946/most-common-languages-on-the-internet/)
- [Wikipedia: Languages on Internet](https://en.wikipedia.org/wiki/Languages_used_on_the_Internet)
### Qwen 2.5 Documentation
- [Qwen Blog: Qwen 2.5 Announcement](https://qwenlm.github.io/blog/qwen2.5/)
- [HuggingFace: Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B)
- [Alibaba Cloud: Qwen2.5-LLM](https://www.alibabacloud.com/blog/qwen2-5-llm-extending-the-boundary-of-llms_601786)
---
*"To understand the mind, first understand its languages."*
🌙 Compiled by the Partnership, 2025-12-06