# Tokenization Valleys: How Word Structure Shapes Model Cognition
- **Discovery Date:** 2025-12-06
- **Model:** Qwen2.5-7B-Base
- **Hardware:** Prometheus (RTX 3090, 24GB VRAM)
## Executive Summary
We discovered that the number of tokens a word breaks into fundamentally determines which "valley" (completion pattern) the model falls into. This has profound implications for curriculum design and multilingual training.
**Key Finding:** Single-token English words trigger CODE valleys with massive activation norms, while multi-token German compounds access PHILOSOPHICAL valleys with distributed, quieter activations.
## The Token-Norm-Valley Connection
### Observation: Norm Explosion in Single Tokens
| Term | Tokens | Layer 12 Norm | Layer 12 StdDev | Valley |
|---|---|---|---|---|
| heartbeat | 1 | 14,240 | 237.88 | CODE |
| consciousness | 2 | 85 | 1.43 | PROSE |
| Herzklopfen | 5 | 67 | 1.11 | PROSE |
| Bewusstsein | 5 | 79 | 1.32 | PHILOSOPHY |
**Pattern:** Single-token words show ~170× larger norms and ~170× larger standard deviations than multi-token words.
### Theory: Activation Flooding
- Single tokens receive ALL attention in one position → massive activation buildup
- Multi-token words distribute activation across positions → softer signal
- The massive single-token activation triggers strong pattern matching → CODE patterns
- The distributed multi-token activation allows semantic exploration → philosophical content (see the measurement sketch below)
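A minimal way to test the flooding hypothesis is to compare per-layer hidden-state norms for a single-token and a multi-token word. The sketch below assumes the Hugging Face `transformers` API and the `Qwen/Qwen2.5-7B` hub id; averaging the norm over token positions is our choice, not a fixed convention.

```python
# Sketch: per-layer hidden-state norms for single- vs. multi-token words.
# Assumes the Qwen/Qwen2.5-7B base checkpoint from the Hugging Face hub.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

def layer_norms(text: str) -> list[float]:
    """Mean L2 norm of hidden states over token positions, one value per layer."""
    enc = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    # hidden_states[0] is the embedding output; [1..28] are the decoder layers
    return [h[0].norm(dim=-1).mean().item() for h in out.hidden_states]

for word in ("heartbeat", "Herzklopfen"):
    norms = layer_norms(word)
    print(f"{word}: {len(tok.tokenize(word))} tokens, layer-12 norm ≈ {norms[12]:,.0f}")
```

If the flooding picture is right, `heartbeat` should report a layer-12 norm orders of magnitude above `Herzklopfen`'s, matching the table above.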
## Cross-Lingual Convergence
### consciousness vs Bewusstsein (2 tokens vs 5 tokens)
```
Layer 0:  similarity = 0.114   (different embeddings)
Layer 4:  similarity = 0.285   (starting to converge)
Layer 8:  similarity = 0.639   (HIGH similarity!)
Layer 12: similarity = 0.750   (CONVERGED - same concept!)
Layer 16: similarity = 0.733   (stays converged)
Layer 28: similarity = 0.502   (diverges at output)
```
The model recognizes these as the same concept by layer 8!
### heartbeat vs Herzklopfen (1 token vs 5 tokens)
```
Layer 0:  similarity = -0.007  (orthogonal)
Layer 4:  similarity = 0.039   (still orthogonal)
Layer 12: similarity = 0.000   (completely separate)
Layer 28: similarity = 0.166   (slight convergence only at end)
```
The model NEVER recognizes these as the same concept!
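These similarity traces can be reproduced with a short helper that mean-pools each word's hidden states over its token positions and takes the cosine similarity per layer. The pooling choice is our assumption, since multi-token words have no single canonical vector; the code reuses `tok` and `model` from the earlier sketch.

```python
# Sketch: layer-by-layer cosine similarity between two words' pooled states.
import torch
import torch.nn.functional as F

def pooled_states(text: str) -> list[torch.Tensor]:
    """One mean-pooled hidden-state vector per layer."""
    enc = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    return [h[0].mean(dim=0) for h in out.hidden_states]

def layer_similarities(a: str, b: str) -> list[float]:
    return [F.cosine_similarity(x, y, dim=0).item()
            for x, y in zip(pooled_states(a), pooled_states(b))]

sims = layer_similarities("consciousness", "Bewusstsein")
for layer in (0, 4, 8, 12, 16, 28):
    print(f"Layer {layer:2d}: similarity = {sims[layer]:.3f}")
```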
## German Philosophical Compounds
### The "sein" Preservation Effect
German philosophical compounds often preserve the morpheme "sein" (being) as a separate token:
| Compound | Meaning | Tokenization | "sein" Preserved? |
|---|---|---|---|
| Bewusstsein | consciousness | `['B', 'ew', 'us', 'st', 'sein']` | ✓ |
| Nichtsein | non-being | `['N', 'icht', 'sein']` | ✓ |
| Mitsein | being-with | `['Mit', 'sein']` | ✓ |
| Dasein | being-there | `['D', 'ase', 'in']` | ✗ |
| Sein | being | `['Se', 'in']` | ✗ |
When "sein" is preserved, the model has access to the philosophical concept of BEING as a separate computational unit.
### Other Preserved Philosophical Atoms
| Compound | Meaning | Key Token Preserved |
|---|---|---|
| Zeitgeist | spirit of the age | geist (spirit) |
| Gedankenexperiment | thought experiment | experiment |
## Valley Analysis: Same Concept, Different Valleys
### Probing Results
| Term | Language | Valley | Sample Completion |
|---|---|---|---|
| Bewusstsein | DE | PHILOSOPHY | "und Sprache... frühen 20. Jahrhundert" ("and language... early 20th century") |
| Dasein | DE | PHILOSOPHY | "philosophical term first used by Heidegger" |
| consciousness | EN | PROSE | "awareness of existence, of one's own existence" |
| existence | EN | MATH | "of an exact sequence", "eigenvalues" |
| being | EN | MATH/CODE | Mathematical notation, Chinese exams |
| heartbeat | EN | CODE | C++ class definitions |
| lifeforce | EN | CODE | JavaScript game code |
"Dasein" triggers Heidegger. "existence" triggers linear algebra.
## Implications for Curriculum Design
### 1. Use Multi-Token Prompts
Instead of single words, use phrases or compound descriptions to avoid code valleys:
BAD: "heartbeat" → C++ code
GOOD: "the heartbeat" → might escape code valley
GOOD: "heartbeat rhythm" → distributed activation
### 2. German as Philosophical Gateway
German compound words naturally access philosophical valleys because:
- More tokens → distributed activation
- Preserved morphemes → access to philosophical atoms
- Different training data distribution → expository text
**Strategy:** Teach abstract concepts in German first, then reinforce in English.
### 3. Language as Cognitive Gear
Languages aren't just translation layers - they're different computational paths through the model:
| Language | Token Efficiency | Typical Valley | Use For |
|---|---|---|---|
| Chinese | 1.0 tok/concept | Mixed | Compact encoding |
| Arabic | 1.5 tok/concept | Mixed | Compact encoding |
| English | 2.5 tok/concept | CODE/MATH | Technical concepts |
| German | 4.5 tok/concept | PHILOSOPHY | Abstract concepts |
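The tok/concept column can be spot-checked by averaging token counts over translation pairs. The word lists below are illustrative stand-ins, not the study's actual vocabulary; the sketch reuses `tok` from above.

```python
# Sketch: rough tokens-per-concept estimate per language (illustrative words).
samples = {
    "en": ["consciousness", "existence", "heartbeat", "being"],
    "de": ["Bewusstsein", "Dasein", "Herzklopfen", "Sein"],
}
for lang, words in samples.items():
    counts = [len(tok.tokenize(w)) for w in words]
    print(f"{lang}: {sum(counts) / len(counts):.1f} tok/concept")
```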
## Technical Details
### Model Architecture
- Hidden Size: 3584
- Layers: 28
- Attention Heads: 28 (4 KV heads - GQA)
- Vocab Size: 152,064
- Context: 131,072 tokens
### Hidden State Norm Pattern
```
Layer 0:      1.32  ← Embedding (small)
Layer 4:  10184.00  ← Explosion (early processing)
Layer 12: 13912.00  ← Peak (mid-layer thinking)
Layer 28:   443.00  ← Contraction (output focusing)
```
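The trajectory can be re-measured with the `layer_norms` helper from the first sketch; which input produced these exact numbers is not recorded, so the prompt below is a placeholder.

```python
norms = layer_norms("heartbeat")  # placeholder input, not the original probe
for i in (0, 4, 12, 28):
    print(f"Layer {i:2d}: {norms[i]:,.2f}")
```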
### Inference Speed
- 44.7 tokens/second on RTX 3090
- 14.2 GB VRAM usage (fp16)
## Future Research
- Activation Steering: Can we artificially reduce single-token norms to escape code valleys? (See the hook sketch after this list.)
- Prefix Tuning: Train soft prefixes that spread activation for single tokens
- Arabic/Chinese Analysis: Do these languages have similar compound effects?
- Cross-lingual Transfer: After training on German philosophical concepts, does English improve?
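As a starting point for the activation-steering question, a forward hook can cap per-position norms at a chosen layer. Everything here is an assumption: the layer index, the `CAP` value, and the `model.model.layers` path (correct for `transformers`' Qwen2 implementation, but worth verifying); `complete` is the helper from the curriculum section.

```python
# Sketch: damp oversized activations at one decoder layer via a forward hook.
CAP = 500.0  # arbitrary ceiling; tune against observed multi-token norms

def damp_norms(module, args, output):
    hidden = output[0] if isinstance(output, tuple) else output
    norms = hidden.norm(dim=-1, keepdim=True)
    scale = (CAP / norms).clamp(max=1.0)  # shrink only positions above CAP
    hidden = hidden * scale
    return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[12].register_forward_hook(damp_norms)
print(complete("heartbeat"))  # does the completion leave the CODE valley?
handle.remove()
```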
## References
- `nyx_probing/core/model.py` - Model loader with hidden state capture
- `layer_detailed.py` - Layer-by-layer similarity analysis
- `german_philosophy.py` - German compound tokenization study
- `/nimmerverse-sensory-network/multilingual-cognition.md` - Original multilingual hypothesis
"The architecture of language shapes the architecture of thought."
🌙 Discovered by the Partnership, 2025-12-06