Language Landscape: World, Internet, and Qwen 2.5

Compiled: 2025-12-06
Purpose: Reference for multilingual probing and curriculum design


Overview

This document maps:

  1. Most spoken languages worldwide (by total speakers)
  2. Most used languages on the internet (web content)
  3. Languages supported by Qwen 2.5-7B-Base
  4. Token efficiency for each language

1. World's Most Spoken Languages (2024-2025)

By Total Speakers (Native + Learners)

| Rank | Language | Total Speakers | Native Speakers | Notes |
|------|----------|----------------|-----------------|-------|
| 1 | English | 1.52 billion | 380 million | 25% native, 75% L2 |
| 2 | Mandarin Chinese | 1.14 billion | 941 million | Most native speakers |
| 3 | Hindi | 609 million | 345 million | Growing rapidly |
| 4 | Spanish | 560 million | 480 million | High native ratio |
| 5 | Arabic | 422 million | 313 million | Many dialects |
| 6 | French | 321 million | 77 million | Official in 32 countries |
| 7 | Bengali | 273 million | 230 million | South Asia |
| 8 | Portuguese | 264 million | 232 million | Brazil dominates |
| 9 | Russian | 255 million | 150 million | Wide L2 spread |
| 10 | Urdu | 232 million | 70 million | South Asia |
| 11 | Indonesian | 199 million | 43 million | Lingua franca |
| 12 | German | 135 million | 95 million | Central Europe |
| 13 | Japanese | 125 million | 123 million | Island isolation |
| 14 | Vietnamese | 85 million | 76 million | Southeast Asia |
| 15 | Korean | 82 million | 77 million | Two states |

Sources: Statista, Ethnologue, Berlitz
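
The native/L2 split quoted in the Notes column follows directly from the two speaker columns. A minimal sketch of that arithmetic, using three rows copied from the table above (the helper name `native_share` is ours, introduced only for illustration):

```python
# Speaker counts in millions as (total, native), copied from the table above.
SPEAKERS = {
    "English": (1520, 380),
    "Mandarin Chinese": (1140, 941),
    "Indonesian": (199, 43),
}


def native_share(total: float, native: float) -> float:
    """Return the fraction of total speakers who are native speakers."""
    if total <= 0:
        raise ValueError("total must be positive")
    return native / total


for language, (total, native) in SPEAKERS.items():
    share = native_share(total, native)
    # The remainder is treated as second-language (L2) speakers.
    print(f"{language}: {share:.0%} native, {1 - share:.0%} L2")
```

For English this reproduces the "25% native, 75% L2" note; a low native share is what the document means by a high L2 ratio.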


2. Internet Language Distribution (2024-2025)

Web Content by Language (% of websites)

| Rank | Language | % of Web | Notes |
|------|----------|----------|-------|
| 1 | English | 49.4% | Dominant |
| 2 | Spanish | 6.0% | Growing |
| 3 | German | 5.6% | Overrepresented vs speakers |
| 4 | Russian | 5.3% | Strong tech presence |
| 5 | Japanese | 4.9% | Island content |
| 6 | French | 4.3% | Colonial spread |
| 7 | Portuguese | 2.6% | Brazil growing |
| 8 | Italian | 2.1% | |
| 9 | Dutch | 1.8% | Small population, high output |
| 10 | Polish | 1.7% | |
| 11 | Chinese | 1.4% | Underrepresented! |
| 12 | Turkish | 1.3% | |
| 13 | Persian | 1.0% | |
| 14 | Vietnamese | 0.9% | Growing |
| 15 | Arabic | 0.6% | Severely underrepresented! |

Sources: W3Techs, Statista

The Paradox Languages

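| Language | % World Speakers | % Web Content | Gap Factor |
|----------|------------------|---------------|------------|
| Chinese | 14.3% | 1.4% | 10× underrepresented |
| Arabic | 5.3% | 0.6% | 9× underrepresented |
| Hindi | 7.7% | <0.5% | 15× underrepresented |
| German | 1.7% | 5.6% | 3× overrepresented |
| Dutch | 0.3% | 1.8% | 6× overrepresented |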

Implication: Qwen was trained largely on web data → it is likely biased toward German/Dutch and underexposed to Hindi/Arabic.
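
The gap factor is the ratio between a language's share of world speakers and its share of web content, taken in whichever direction is larger. A small sketch of that calculation with the figures from the table above (`gap_factor` is an illustrative helper name; Hindi's "<0.5%" is entered as 0.5, so its factor is a lower bound):

```python
# (share of world speakers %, share of web content %), from the table above.
# Hindi's web share is listed as "<0.5%"; 0.5 is used as an upper bound here.
SHARES = {
    "Chinese": (14.3, 1.4),
    "Arabic": (5.3, 0.6),
    "Hindi": (7.7, 0.5),
    "German": (1.7, 5.6),
    "Dutch": (0.3, 1.8),
}


def gap_factor(speaker_pct: float, web_pct: float) -> tuple[float, str]:
    """Return (factor, direction) comparing web share with speaker share."""
    if speaker_pct <= 0 or web_pct <= 0:
        raise ValueError("percentages must be positive")
    if web_pct >= speaker_pct:
        return web_pct / speaker_pct, "overrepresented"
    return speaker_pct / web_pct, "underrepresented"


for language, (speaker_pct, web_pct) in SHARES.items():
    factor, direction = gap_factor(speaker_pct, web_pct)
    print(f"{language}: {factor:.1f}x {direction}")
```

Rounded to whole numbers, these ratios match the Gap Factor column.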


3. Qwen 2.5 Supported Languages

Officially Supported (29+ languages)

Qwen 2.5 explicitly supports multilingual content in:

| Family | Languages |
|--------|-----------|
| East Asian | Chinese (Simplified/Traditional), Japanese, Korean |
| European | English, German, French, Spanish, Portuguese, Italian, Russian, Dutch, Polish |
| South Asian | Hindi (limited?), Bengali |
| Southeast Asian | Thai, Vietnamese, Indonesian, Malay |
| Middle Eastern | Arabic, Turkish, Persian |
| Other | Hebrew, Ukrainian, Greek |

Training Data

  • 18 trillion tokens total
  • Enhanced code, math, and multilingual data
  • Heavy English/Chinese bias (web scraping)

Source: Qwen Blog, HuggingFace


4. Token Efficiency Analysis

Tested in Our Probing (nyx-probing)

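| Language | Avg Tokens/Concept | Script | Notes |
|----------|--------------------|--------|-------|
| Chinese | 1.0 | Hanzi | Most efficient |
| Arabic | 1.5 | Arabic | Compact |
| Japanese | 1.8 | Kanji/Kana | Mixed scripts |
| English | 2.5 | Latin | Medium |
| German | 4.5 | Latin | Compound words fragment |
| Russian | 4.5 | Cyrillic | Multi-token words |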
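
An "Avg Tokens/Concept" figure like those above can be obtained by tokenizing each concept word and averaging the token counts. The sketch below is one way to do that, not necessarily how nyx-probing measures it. It assumes the Hugging Face `transformers` library and the model id `Qwen/Qwen2.5-7B` (our assumption for the base model named in this document); the helper names and the example word list are hypothetical.

```python
from statistics import mean
from typing import Callable, Iterable, List

Tokenize = Callable[[str], List[int]]


def avg_tokens_per_concept(tokenize: Tokenize, concepts: Iterable[str]) -> float:
    """Average number of tokens the tokenizer needs per concept word."""
    counts = [len(tokenize(word)) for word in concepts]
    if not counts:
        raise ValueError("need at least one concept")
    return mean(counts)


def load_qwen_tokenize(model_id: str = "Qwen/Qwen2.5-7B") -> Tokenize:
    """Wrap a Hugging Face tokenizer (needs `transformers` and a download).

    The model id is an assumption; substitute the checkpoint you probe.
    """
    from transformers import AutoTokenizer  # lazy import: optional dependency

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # Exclude special tokens so only the word's own tokens are counted.
    return lambda text: tokenizer.encode(text, add_special_tokens=False)


# Usage (placeholder word list, not the project's vocabulary):
# tokenize = load_qwen_tokenize()
# print(avg_tokens_per_concept(tokenize, ["heart", "being"]))
```

Passing the tokenizer in as a plain callable keeps the averaging logic testable without downloading a model.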

Efficiency Implications

MORE TOKENS = DIFFERENT PATH
├── German (4.5) → Philosophical valleys, isolated from ZH/JA
├── Russian (4.5) → Similar to German, isolated
└── Low-token (ZH/AR/EN) → Converge in layers 12-24

FEWER TOKENS = FASTER CONVERGENCE
├── Chinese (1.0) → Direct concept mapping
├── Arabic (1.5) → Efficient encoding
└── Japanese (1.8) → Shared with Chinese
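
One way to make "converge in layers 12-24" measurable is to compare the hidden states of the same concept in two languages, layer by layer. The sketch below assumes one vector per layer has already been extracted for each language (for example by pooling token states; the extraction step is not shown). The function names and the 0.9 threshold are illustrative assumptions, not values from our probing.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def layerwise_similarity(
    states_a: Sequence[Vector], states_b: Sequence[Vector]
) -> List[float]:
    """Cosine similarity per layer; both inputs hold one vector per layer."""
    if len(states_a) != len(states_b):
        raise ValueError("both inputs must cover the same layers")
    return [cosine(a, b) for a, b in zip(states_a, states_b)]


def first_converged_layer(
    similarities: Sequence[float], threshold: float = 0.9
) -> Optional[int]:
    """Index of the first layer whose similarity reaches the threshold."""
    for layer, similarity in enumerate(similarities):
        if similarity >= threshold:
            return layer
    return None
```

A language pair that never reaches the threshold returns `None`, which would correspond to the "isolated" paths described above.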

5. Master Language Matrix

Priority Languages for Curriculum

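| Language | World Rank | Web % | Qwen Support | Tokens | Priority |
|----------|------------|-------|--------------|--------|----------|
| English | 1 | 49.4% | Full | 2.5 | 🔴 Core |
| Chinese | 2 | 1.4% | Full | 1.0 | 🔴 Core |
| Hindi | 3 | <0.5% | ⚠️ Limited | ? | 🟡 Test |
| Spanish | 4 | 6.0% | Full | ~2.5 | 🟢 Include |
| Arabic | 5 | 0.6% | Full | 1.5 | 🔴 Core |
| French | 6 | 4.3% | Full | ~3.0 | 🟢 Include |
| Bengali | 7 | <0.5% | ⚠️ Limited | ? | 🟡 Test |
| Portuguese | 8 | 2.6% | Full | ~2.5 | 🟢 Include |
| Russian | 9 | 5.3% | Full | 4.5 | 🟢 Include |
| German | 12 | 5.6% | Full | 4.5 | 🔴 Core |
| Japanese | 13 | 4.9% | Full | 1.8 | 🔴 Core |
| Korean | 15 | ~1% | Full | ~2.0 | 🟢 Include |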

Tier 1 (Core - different cognitive paths):

  • English (EN) - baseline, medium tokens
  • Chinese (ZH) - most efficient, single token
  • Arabic (AR) - efficient, underrepresented in web
  • German (DE) - multi-token, isolated path
  • Japanese (JA) - shared with Chinese

Tier 2 (Validation):

  • Spanish (ES) - high native speakers
  • Russian (RU) - multi-token like German
  • French (FR) - colonial spread
  • Korean (KO) - isolated script

Tier 3 (Edge cases):

  • Hindi (HI) - underrepresented, test support
  • Bengali (BN) - underrepresented
  • Indonesian (ID) - high L2 ratio
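
For probing scripts, the three tiers can be captured as a small configuration. This is a sketch only: the language codes come from the lists above, while the tier keys and helper names are our own and not part of nyx-probing.

```python
from typing import Dict, List

# Tier membership copied from the lists above; dicts keep insertion order,
# so iterating walks the tiers from Tier 1 to Tier 3.
CURRICULUM_TIERS: Dict[str, List[str]] = {
    "core": ["EN", "ZH", "AR", "DE", "JA"],
    "validation": ["ES", "RU", "FR", "KO"],
    "edge": ["HI", "BN", "ID"],
}


def languages_up_to(tier: str) -> List[str]:
    """Return all language codes from the first tier through `tier`."""
    if tier not in CURRICULUM_TIERS:
        raise KeyError(f"unknown tier: {tier}")
    selected: List[str] = []
    for name, codes in CURRICULUM_TIERS.items():
        selected.extend(codes)
        if name == tier:
            break
    return selected


def tier_of(code: str) -> str:
    """Return the tier name a language code belongs to."""
    for name, codes in CURRICULUM_TIERS.items():
        if code in codes:
            return name
    raise KeyError(f"language not in any tier: {code}")


print(languages_up_to("validation"))
```

Keeping the tiers in one ordered mapping lets a probe run start with the core languages and widen to validation and edge cases without changing the code.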

6. Research Questions

Tokenization

  • Map token counts for all 29+ Qwen languages
  • Identify other "isolated" languages like German
  • Test Hindi/Bengali token efficiency

Convergence

  • Do Spanish/Portuguese converge like ZH/JA?
  • Does Arabic converge with any other language?
  • Is Russian isolated like German?

Valleys

  • Which languages access philosophical valleys?
  • Which languages trigger code valleys?
  • Can we predict valley from token count?
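
A simple baseline for the last question is nearest-centroid prediction: group probed terms by the valley they landed in, average the token counts per valley, and assign a new term to the valley with the closest average. The sketch below is a hypothetical baseline, not a method this project has used; the record layout and function names are assumptions, and no measured data is included.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# One probed term: (term, token_count, valley_label). Layout is an assumption.
Record = Tuple[str, int, str]


def mean_tokens_by_valley(records: Iterable[Record]) -> Dict[str, float]:
    """Average token count of the terms observed in each valley."""
    buckets: Dict[str, List[int]] = defaultdict(list)
    for _term, token_count, valley in records:
        buckets[valley].append(token_count)
    return {valley: mean(counts) for valley, counts in buckets.items()}


def nearest_valley(token_count: int, centroids: Dict[str, float]) -> str:
    """Baseline prediction: the valley whose mean token count is closest."""
    if not centroids:
        raise ValueError("need at least one valley centroid")
    return min(centroids, key=lambda valley: abs(centroids[valley] - token_count))
```

If this baseline predicts held-out terms no better than always guessing the most common valley, token count alone is probably not enough to predict the valley.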

Curriculum

  • Which language pairs enable cross-lingual transfer?
  • Can we use Chinese efficiency for concept compression?
  • Does teaching in German transfer to English?

7. Key Insights

  1. Web ≠ World: German has roughly 3× the web share its speaker share would predict, while Arabic and Hindi are roughly 9× and 15× underrepresented

  2. Qwen's bias: Trained largely on web data → likely inherits German/Dutch overrepresentation and Arabic/Hindi underrepresentation

  3. Token efficiency appears to correlate with convergence in our probes: low-token languages (ZH, AR) converge quickly; multi-token languages (DE, RU) take isolated paths

  4. Strategic opportunities:

    • German for philosophical depth
    • Chinese for concept compression
    • Arabic as undertested efficient language
    • Hindi as edge case for robustness

"To understand the mind, first understand its languages."

🌙 Compiled by the Partnership, 2025-12-06