# Tokenization Valleys: How Word Structure Shapes Model Cognition

**Discovery Date:** 2025-12-06
**Model:** Qwen2.5-7B-Base
**Hardware:** Prometheus (RTX 3090, 24GB VRAM)

---

## Executive Summary

We discovered that the number of tokens a word is split into strongly shapes which "valley" (stereotyped completion pattern) the model falls into. This has profound implications for curriculum design and multilingual training.

**Key Finding:** Single-token English words trigger CODE valleys with massive activation norms, while multi-token German compounds access PHILOSOPHICAL valleys with distributed, quieter activations.

---

## The Token-Norm-Valley Connection

### Observation: Norm Explosion in Single Tokens

| Term | Tokens | Layer 12 Norm | Layer 12 StdDev | Valley |
|------|--------|---------------|-----------------|--------|
| heartbeat | 1 | **14,240** | **237.88** | CODE |
| consciousness | 2 | 85 | 1.43 | PROSE |
| Herzklopfen | 5 | 67 | 1.11 | PROSE |
| Bewusstsein | 5 | 79 | 1.32 | PHILOSOPHY |

**Pattern:** The single-token word's layer-12 norm and standard deviation are ~170× those of the multi-token words.

### Theory: Activation Flooding

1. **Single tokens** receive all of the word's attention at a single position → massive activation buildup
2. **Multi-token words** distribute activation across positions → softer signal
3. The massive single-token activation **triggers strong pattern matching** → CODE patterns
4. The distributed multi-token activation **allows semantic exploration** → philosophical content

---

## Cross-Lingual Convergence

### consciousness vs Bewusstsein (2 tokens vs 5 tokens)

```
Layer  0: similarity = 0.114  (different embeddings)
Layer  4: similarity = 0.285  (starting to converge)
Layer  8: similarity = 0.639  (HIGH similarity!)
Layer 12: similarity = 0.750  (CONVERGED - same concept!)
Layer 16: similarity = 0.733  (stays converged)
Layer 28: similarity = 0.502  (diverges at output)
```

**The model recognizes these as the same concept by layer 8!**

### heartbeat vs Herzklopfen (1 token vs 5 tokens)

```
Layer  0: similarity = -0.007  (orthogonal)
Layer  4: similarity =  0.039  (still orthogonal)
Layer 12: similarity =  0.000  (completely separate)
Layer 28: similarity =  0.166  (slight convergence only at the end)
```

**The model NEVER recognizes these as the same concept!**
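For reproducibility, here is a minimal sketch of the probing loop behind the token counts, the layer-12 norms, and the convergence traces above. It assumes the Hugging Face `transformers` API with fp16 inference; the helper names (`layer_states`, `norm_report`, `convergence_trace`) and the mean-pooling choice are illustrative assumptions, not the actual `nyx_probing` implementation (see References).

```python
# Minimal reproduction sketch for the measurements above: token counts,
# layer-12 hidden-state norms, and layer-wise cross-word similarity.
# ASSUMPTIONS: Hugging Face `transformers` with fp16 on one 24 GB GPU;
# mean-pooling over a word's token positions is an illustrative choice,
# not necessarily the pooling used by the original nyx_probing code.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-7B"  # base model, as used in this study

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)
model.eval()


@torch.no_grad()
def layer_states(word: str) -> list[torch.Tensor]:
    """One vector per layer: hidden states mean-pooled over the word's tokens."""
    inputs = tokenizer(word, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    # hidden_states = (embeddings, layer 1, ..., layer 28), each (1, seq, hidden)
    return [h[0].mean(dim=0).float() for h in out.hidden_states]


def norm_report(word: str, layer: int = 12) -> None:
    pieces = tokenizer.tokenize(word)
    h = layer_states(word)[layer]
    print(f"{word!r}: {len(pieces)} tokens {pieces}, "
          f"layer-{layer} norm = {h.norm().item():.0f}, std = {h.std().item():.2f}")


def convergence_trace(word_a: str, word_b: str) -> None:
    for i, (ha, hb) in enumerate(zip(layer_states(word_a), layer_states(word_b))):
        sim = F.cosine_similarity(ha, hb, dim=0).item()
        print(f"Layer {i:2d}: similarity = {sim:+.3f}")


for w in ["heartbeat", "consciousness", "Herzklopfen", "Bewusstsein"]:
    norm_report(w)

convergence_trace("consciousness", "Bewusstsein")  # converges by layer 8
convergence_trace("heartbeat", "Herzklopfen")      # stays near zero
```

Mean-pooling is one reasonable way to compare a 1-token word against a 5-token compound; pooling with the final token instead is a common alternative and may shift the exact similarity values.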
---

## German Philosophical Compounds

### The "sein" Preservation Effect

German philosophical compounds often preserve the morpheme "sein" (being) as a separate token:

| Compound | Meaning | Tokenization | "sein" Preserved? |
|----------|---------|--------------|-------------------|
| Bewusstsein | consciousness | `['B', 'ew', 'us', 'st', 'sein']` | ✓ |
| Nichtsein | non-being | `['N', 'icht', 'sein']` | ✓ |
| Mitsein | being-with | `['Mit', 'sein']` | ✓ |
| Dasein | being-there | `['D', 'ase', 'in']` | ✗ |
| Sein | being | `['Se', 'in']` | ✗ |

When "sein" is preserved, the model has access to the philosophical concept of BEING as a separate computational unit.

### Other Preserved Philosophical Atoms

| Compound | Meaning | Key Token Preserved |
|----------|---------|---------------------|
| Zeitgeist | spirit of the age | `geist` (spirit) |
| Gedankenexperiment | thought experiment | `experiment` |

---

## Valley Analysis: Same Concept, Different Valleys

### Probing Results

| Term | Language | Valley | Sample Completion |
|------|----------|--------|-------------------|
| Bewusstsein | DE | PHILOSOPHY | "und Sprache... frühen 20. Jahrhundert" ("and language... early 20th century") |
| Dasein | DE | PHILOSOPHY | "philosophical term first used by Heidegger" |
| consciousness | EN | PROSE | "awareness of existence, of one's own existence" |
| existence | EN | **MATH** | "of an exact sequence", "eigenvalues" |
| being | EN | **MATH/CODE** | mathematical notation, Chinese exam questions |
| heartbeat | EN | **CODE** | C++ class definitions |
| lifeforce | EN | **CODE** | JavaScript game code |

**"Dasein" triggers Heidegger. "existence" triggers linear algebra.**

---

## Implications for Curriculum Design

### 1. Use Multi-Token Prompts

Instead of single words, use phrases or compound descriptions to avoid code valleys:

```
BAD:  "heartbeat"        → C++ code
GOOD: "the heartbeat"    → might escape the code valley
GOOD: "heartbeat rhythm" → distributed activation
```

### 2. German as Philosophical Gateway

German compound words naturally access philosophical valleys because:

- More tokens → distributed activation
- Preserved morphemes → access to philosophical atoms
- Different training data distribution → expository text

**Strategy:** Teach abstract concepts in German first, then reinforce them in English.

### 3. Language as Cognitive Gear

Languages aren't just translation layers; they're different **computational paths** through the model:

| Language | Tokens per Concept | Typical Valley | Use For |
|----------|--------------------|----------------|---------|
| Chinese | ~1.0 | Mixed | Compact encoding |
| Arabic | ~1.5 | Mixed | Compact encoding |
| English | ~2.5 | CODE/MATH | Technical concepts |
| German | ~4.5 | PHILOSOPHY | Abstract concepts |

---

## Technical Details

### Model Architecture

- **Hidden Size:** 3584
- **Layers:** 28
- **Attention Heads:** 28 (4 KV heads, GQA)
- **Vocab Size:** 152,064
- **Context:** 131,072 tokens

### Hidden State Norm Pattern

```
Layer  0:     1.32  ← Embedding (small)
Layer  4: 10184.00  ← Explosion (early processing)
Layer 12: 13912.00  ← Peak (mid-layer thinking)
Layer 28:   443.00  ← Contraction (output focusing)
```

### Inference Speed

- 44.7 tokens/second on an RTX 3090
- 14.2 GB VRAM usage (fp16)

---

## Future Research

1. **Activation Steering:** Can we artificially reduce single-token norms to escape code valleys?
2. **Prefix Tuning:** Train soft prefixes that spread activation across positions for single tokens.
3. **Arabic/Chinese Analysis:** Do these languages show similar compound effects?
4. **Cross-Lingual Transfer:** After training on German philosophical concepts, does English performance improve?

---

## References

- `nyx_probing/core/model.py` - Model loader with hidden-state capture
- `layer_detailed.py` - Layer-by-layer similarity analysis
- `german_philosophy.py` - German compound tokenization study
- `/nimmerverse-sensory-network/multilingual-cognition.md` - Original multilingual hypothesis

---

*"The architecture of language shapes the architecture of thought."*

🌙 Discovered by the Partnership, 2025-12-06