# Language Landscape: World, Internet, and Qwen 2.5

**Compiled:** 2025-12-06

**Purpose:** Reference for multilingual probing and curriculum design

---

## Overview

This document maps:

1. Most spoken languages worldwide (by total speakers)
2. Most used languages on the internet (web content)
3. Languages supported by Qwen 2.5-7B-Base
4. Token efficiency for each language

---

## 1. World's Most Spoken Languages (2024-2025)

### By Total Speakers (Native + Learners)

| Rank | Language | Total Speakers | Native Speakers | Notes |
|------|----------|----------------|-----------------|-------|
| 1 | **English** | 1.52 billion | 380 million | 25% native, 75% L2 |
| 2 | **Mandarin Chinese** | 1.14 billion | 941 million | Most native speakers |
| 3 | **Hindi** | 609 million | 345 million | Growing rapidly |
| 4 | **Spanish** | 560 million | 480 million | High native ratio |
| 5 | **Arabic** | 422 million | 313 million | Many dialects |
| 6 | **French** | 321 million | 77 million | 32 countries official |
| 7 | **Bengali** | 273 million | 230 million | South Asia |
| 8 | **Portuguese** | 264 million | 232 million | Brazil dominates |
| 9 | **Russian** | 255 million | 150 million | Wide L2 spread |
| 10 | **Urdu** | 232 million | 70 million | South Asia |
| 11 | **Indonesian** | 199 million | 43 million | Lingua franca |
| 12 | **German** | 135 million | 95 million | Central Europe |
| 13 | **Japanese** | 125 million | 123 million | Island isolation |
| 14 | **Vietnamese** | 85 million | 76 million | Southeast Asia |
| 15 | **Korean** | 82 million | 77 million | Two states |

*Sources: [Statista](https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/), [Ethnologue](https://www.ethnologue.com/insights/ethnologue200/), [Berlitz](https://www.berlitz.com/blog/most-spoken-languages-world)*

---

## 2. Internet Language Distribution (2024-2025)

### Web Content by Language (% of websites)

| Rank | Language | % of Web | Notes |
|------|----------|----------|-------|
| 1 | **English** | 49.4% | Dominant |
| 2 | **Spanish** | 6.0% | Growing |
| 3 | **German** | 5.6% | Overrepresented vs speakers |
| 4 | **Russian** | 5.3% | Strong tech presence |
| 5 | **Japanese** | 4.9% | Island content |
| 6 | **French** | 4.3% | Colonial spread |
| 7 | **Portuguese** | 2.6% | Brazil growing |
| 8 | **Italian** | 2.1% | |
| 9 | **Dutch** | 1.8% | Small population, high output |
| 10 | **Polish** | 1.7% | |
| 11 | **Chinese** | 1.4% | **Underrepresented!** |
| 12 | **Turkish** | 1.3% | |
| 13 | **Persian** | 1.0% | |
| 14 | **Vietnamese** | 0.9% | Growing |
| 15 | **Arabic** | 0.6% | **Severely underrepresented!** |

*Sources: [W3Techs](https://w3techs.com/technologies/overview/content_language), [Statista](https://www.statista.com/statistics/262946/most-common-languages-on-the-internet/)*

### The Paradox Languages

Gap Factor is the ratio between a language's share of world speakers and its share of web content, whichever way round gives a value above 1.

| Language | % World Speakers | % Web Content | Gap Factor |
|----------|------------------|---------------|------------|
| **Chinese** | 14.3% | 1.4% | 10× underrepresented |
| **Arabic** | 5.3% | 0.6% | 9× underrepresented |
| **Hindi** | 7.7% | <0.5% | 15× underrepresented |
| **German** | 1.7% | 5.6% | 3× **overrepresented** |
| **Dutch** | 0.3% | 1.8% | 6× **overrepresented** |

**Implication:** Qwen was trained largely on web data, so it is likely biased toward German/Dutch and underexposed to Hindi/Arabic.

---

## 3. Qwen 2.5 Supported Languages

### Officially Supported (29+ languages)

Qwen 2.5 explicitly supports multilingual content in:

| Region | Languages |
|--------|-----------|
| **East Asian** | Chinese (Simplified/Traditional), Japanese, Korean |
| **European** | English, German, French, Spanish, Portuguese, Italian, Russian, Dutch, Polish |
| **South Asian** | Hindi (limited?), Bengali |
| **Southeast Asian** | Thai, Vietnamese, Indonesian, Malay |
| **Middle Eastern** | Arabic, Turkish, Persian |
| **Other** | Hebrew, Ukrainian, Greek |

### Training Data

- **18 trillion tokens** total
- Enhanced code, math, and multilingual data
- Heavy English/Chinese bias (web scraping)

*Source: [Qwen Blog](https://qwenlm.github.io/blog/qwen2.5/), [HuggingFace](https://huggingface.co/Qwen/Qwen2.5-7B)*

---

## 4. Token Efficiency Analysis

### Tested in Our Probing (nyx-probing)

| Language | Avg Tokens/Concept | Script | Notes |
|----------|-------------------|--------|-------|
| **Chinese** | 1.0 | Hanzi | Most efficient |
| **Arabic** | 1.5 | Arabic | Compact |
| **Japanese** | 1.8 | Kanji/Kana | Mixed scripts |
| **English** | 2.5 | Latin | Medium |
| **German** | 4.5 | Latin | Compound words fragment |
| **Russian** | 4.5 | Cyrillic | Multi-token words |

### Efficiency Implications

```
MORE TOKENS = DIFFERENT PATH
├── German (4.5) → Philosophical valleys, isolated from ZH/JA
├── Russian (4.5) → Similar to German, isolated
└── Single-token (ZH/AR/EN) → Converge in layers 12-24

FEWER TOKENS = FASTER CONVERGENCE
├── Chinese (1.0) → Direct concept mapping
├── Arabic (1.5) → Efficient encoding
└── Japanese (1.8) → Shared with Chinese
```

---

## 5. Master Language Matrix

### Priority Languages for Curriculum

World Rank follows the total-speaker ranking in section 1.

| Language | World Rank | Web % | Qwen Support | Tokens | Priority |
|----------|------------|-------|--------------|--------|----------|
| **English** | 1 | 49.4% | ✅ Full | 2.5 | 🔴 Core |
| **Chinese** | 2 | 1.4% | ✅ Full | 1.0 | 🔴 Core |
| **Hindi** | 3 | <0.5% | ⚠️ Limited | ? | 🟡 Test |
| **Spanish** | 4 | 6.0% | ✅ Full | ~2.5 | 🟢 Include |
| **Arabic** | 5 | 0.6% | ✅ Full | 1.5 | 🔴 Core |
| **French** | 6 | 4.3% | ✅ Full | ~3.0 | 🟢 Include |
| **Bengali** | 7 | <0.5% | ⚠️ Limited | ? | 🟡 Test |
| **Portuguese** | 8 | 2.6% | ✅ Full | ~2.5 | 🟢 Include |
| **Russian** | 9 | 5.3% | ✅ Full | 4.5 | 🟢 Include |
| **German** | 12 | 5.6% | ✅ Full | 4.5 | 🔴 Core |
| **Japanese** | 13 | 4.9% | ✅ Full | 1.8 | 🔴 Core |
| **Korean** | 15 | ~1% | ✅ Full | ~2.0 | 🟢 Include |

### Recommended Probing Languages

**Tier 1 (Core - different cognitive paths):**

- English (EN) - baseline, medium tokens
- Chinese (ZH) - most efficient, single token
- Arabic (AR) - efficient, underrepresented in web
- German (DE) - multi-token, isolated path
- Japanese (JA) - shared with Chinese

**Tier 2 (Validation):**

- Spanish (ES) - high native speakers
- Russian (RU) - multi-token like German
- French (FR) - colonial spread
- Korean (KO) - isolated script

**Tier 3 (Edge cases):**

- Hindi (HI) - underrepresented, test support
- Bengali (BN) - underrepresented
- Indonesian (ID) - high L2 ratio

---

## 6. Research Questions

### Tokenization

- [ ] Map token counts for all 29+ Qwen languages
- [ ] Identify other "isolated" languages like German
- [ ] Test Hindi/Bengali token efficiency

### Convergence

- [ ] Do Spanish/Portuguese converge like ZH/JA?
- [ ] Does Arabic converge with any other language?
- [ ] Is Russian isolated like German?

### Valleys

- [ ] Which languages access philosophical valleys?
- [ ] Which languages trigger code valleys?
- [ ] Can we predict valley from token count?

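
The tokenization items in the checklist above come down to counting tokens per concept for each language. A minimal sketch of that bookkeeping follows, assuming the tokenizer is supplied as a plain callable. The names `avg_tokens_per_concept`, `rank_by_efficiency`, `utf8_byte_tokenizer`, and `SAMPLE_CONCEPTS` are hypothetical, and the byte-level stand-in is not Qwen's tokenizer, so its numbers will not match the table in section 4.

```python
from statistics import mean
from typing import Callable, Dict, List

# Any function that turns a string into a list of tokens.
Tokenizer = Callable[[str], List[str]]


def avg_tokens_per_concept(
    tokenize: Tokenizer, concepts: Dict[str, List[str]]
) -> Dict[str, float]:
    """Return the mean token count per concept word for each language code.

    Languages with an empty word list are skipped.
    """
    return {
        lang: mean(len(tokenize(word)) for word in words)
        for lang, words in concepts.items()
        if words
    }


def rank_by_efficiency(averages: Dict[str, float]) -> List[str]:
    """Order language codes from fewest to most tokens per concept."""
    return sorted(averages, key=averages.get)


def utf8_byte_tokenizer(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): one token per UTF-8 byte."""
    return [format(byte, "02x") for byte in text.encode("utf-8")]


# Hypothetical concept list, for illustration only.
SAMPLE_CONCEPTS = {
    "en": ["water", "mind"],
    "de": ["Wasser", "Geist"],
    "zh": ["水", "心"],
}

averages = avg_tokens_per_concept(utf8_byte_tokenizer, SAMPLE_CONCEPTS)
for lang in rank_by_efficiency(averages):
    print(lang, averages[lang])
```

To measure real counts, one option (untested here) is to pass `AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B").tokenize` from the `transformers` library in place of the stand-in. This is not necessarily how the nyx-probing figures were produced.
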
### Curriculum

- [ ] Which language pairs enable cross-lingual transfer?
- [ ] Can we use Chinese efficiency for concept compression?
- [ ] Does teaching in German transfer to English?

---

## 7. Key Insights

1. **Web ≠ World**: German has about 3× the web content its share of speakers would predict, while Arabic and Hindi are roughly 9× and 15× underrepresented
2. **Qwen's bias**: Trained on web data → likely inherits German/Dutch overrepresentation and Arabic/Hindi underrepresentation
3. **Token efficiency correlates with convergence**: Single-token languages (ZH, AR) converge quickly; multi-token (DE, RU) take isolated paths
4. **Strategic opportunities**:
   - German for philosophical depth
   - Chinese for concept compression
   - Arabic as undertested efficient language
   - Hindi as edge case for robustness

---

## References

### World Language Statistics

- [Statista: Most Spoken Languages](https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/)
- [Ethnologue 200](https://www.ethnologue.com/insights/ethnologue200/)
- [Berlitz: 25 Most Spoken Languages](https://www.berlitz.com/blog/most-spoken-languages-world)

### Internet Language Distribution

- [W3Techs: Content Languages](https://w3techs.com/technologies/overview/content_language)
- [Statista: Languages on Internet](https://www.statista.com/statistics/262946/most-common-languages-on-the-internet/)
- [Wikipedia: Languages on Internet](https://en.wikipedia.org/wiki/Languages_used_on_the_Internet)

### Qwen 2.5 Documentation

- [Qwen Blog: Qwen 2.5 Announcement](https://qwenlm.github.io/blog/qwen2.5/)
- [HuggingFace: Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B)
- [Alibaba Cloud: Qwen2.5-LLM](https://www.alibabacloud.com/blog/qwen2-5-llm-extending-the-boundary-of-llms_601786)

---

*"To understand the mind, first understand its languages."*

🌙 Compiled by the Partnership, 2025-12-06