# Tokenization Valleys: How Word Structure Shapes Model Cognition
**Discovery Date:** 2025-12-06
**Model:** Qwen2.5-7B-Base
**Hardware:** Prometheus (RTX 3090, 24GB VRAM)
---
## Executive Summary
We discovered that the number of tokens a word breaks into fundamentally determines which "valley" (completion pattern) the model falls into. This has profound implications for curriculum design and multilingual training.
**Key Finding:** Single-token English words trigger CODE valleys with massive activation norms, while multi-token German compounds access PHILOSOPHICAL valleys with distributed, quieter activations.
---
## The Token-Norm-Valley Connection
### Observation: Norm Explosion in Single Tokens
| Term | Tokens | Layer 12 Norm | Layer 12 StdDev | Valley |
|------|--------|---------------|-----------------|--------|
| heartbeat | 1 | **14,240** | **237.88** | CODE |
| consciousness | 2 | 85 | 1.43 | PROSE |
| Herzklopfen | 5 | 67 | 1.11 | PROSE |
| Bewusstsein | 5 | 79 | 1.32 | PHILOSOPHY |
**Pattern:** Single-token words show ~170× larger norms and ~170× larger standard deviations than multi-token words.
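The statistics in the table can be reproduced with standard tooling. The sketch below is illustrative rather than the `nyx_probing` implementation: it assumes the Hugging Face `Qwen/Qwen2.5-7B` checkpoint corresponds to the Qwen2.5-7B-Base model above, and it reads the last token's hidden state at each layer (whether the original probe uses the last position or a pooled vector is an assumption here).
```python
# Minimal sketch (not the nyx_probing code): per-layer hidden-state statistics
# for a single word, assuming the Hugging Face Qwen/Qwen2.5-7B checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B"  # assumed checkpoint name
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

def layer_stats(word: str) -> None:
    ids = tok(word, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # out.hidden_states: one (1, seq_len, hidden) tensor per layer, embeddings first.
    for layer, h in enumerate(out.hidden_states):
        vec = h[0, -1]  # hidden state at the last token position
        print(f"layer {layer:2d}  norm={vec.norm().item():10.2f}  std={vec.std().item():8.2f}")

layer_stats("heartbeat")    # 1 token  -> very large mid-layer norms (CODE valley)
layer_stats("Bewusstsein")  # 5 tokens -> much smaller, smoother norms
```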
### Theory: Activation Flooding
1. **Single tokens** receive ALL attention in one position → massive activation buildup
2. **Multi-token words** distribute activation across positions → softer signal
3. The massive single-token activation **triggers strong pattern matching** → CODE patterns
4. The distributed multi-token activation **allows semantic exploration** → philosophical content
---
## Cross-Lingual Convergence
### consciousness vs Bewusstsein (2 tokens vs 5 tokens)
```
Layer 0: similarity = 0.114 (different embeddings)
Layer 4: similarity = 0.285 (starting to converge)
Layer 8: similarity = 0.639 (HIGH similarity!)
Layer 12: similarity = 0.750 (CONVERGED - same concept!)
Layer 16: similarity = 0.733 (stays converged)
Layer 28: similarity = 0.502 (diverges at output)
```
**The model recognizes these as the same concept by layer 8!**
### heartbeat vs Herzklopfen (1 token vs 5 tokens)
```
Layer 0: similarity = -0.007 (orthogonal)
Layer 4: similarity = 0.039 (still orthogonal)
Layer 12: similarity = 0.000 (completely separate)
Layer 28: similarity = 0.166 (slight convergence only at end)
```
**The model NEVER recognizes these as the same concept!**
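For reference, layer-by-layer similarities of this kind can be computed along the following lines (reusing `tok` and `model` from the earlier sketch; mean-pooling over the word's token positions is an assumption, not necessarily what `layer_detailed.py` does).
```python
import torch
import torch.nn.functional as F

def layer_vectors(word: str):
    ids = tok(word, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # Mean-pool each layer's hidden states over the word's token positions.
    return [h[0].mean(dim=0) for h in out.hidden_states]

def cross_lingual_similarity(word_a: str, word_b: str) -> None:
    for layer, (a, b) in enumerate(zip(layer_vectors(word_a), layer_vectors(word_b))):
        sim = F.cosine_similarity(a.float(), b.float(), dim=0).item()
        print(f"layer {layer:2d}  similarity = {sim:+.3f}")

cross_lingual_similarity("consciousness", "Bewusstsein")  # converges mid-stack
cross_lingual_similarity("heartbeat", "Herzklopfen")      # stays near zero
```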
---
## German Philosophical Compounds
### The "sein" Preservation Effect
German philosophical compounds often preserve the morpheme "sein" (being) as a separate token:
| Compound | Meaning | Tokenization | "sein" Preserved? |
|----------|---------|--------------|-------------------|
| Bewusstsein | consciousness | `['B', 'ew', 'us', 'st', 'sein']` | ✓ |
| Nichtsein | non-being | `['N', 'icht', 'sein']` | ✓ |
| Mitsein | being-with | `['Mit', 'sein']` | ✓ |
| Dasein | being-there | `['D', 'ase', 'in']` | ✗ |
| Sein | being | `['Se', 'in']` | ✗ |
When "sein" is preserved, the model has access to the philosophical concept of BEING as a separate computational unit.
### Other Preserved Philosophical Atoms
| Compound | Meaning | Key Token Preserved |
|----------|---------|---------------------|
| Zeitgeist | spirit of the age | `geist` (spirit) |
| Gedankenexperiment | thought experiment | `experiment` |
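A quick way to check the preservation effect is to tokenize the compounds directly. This reuses `tok` from the first sketch and is only an approximation of the `german_philosophy.py` study; exact token pieces may differ across tokenizer versions.
```python
# Check which philosophical "atoms" survive tokenization intact.
ATOMS = ("sein", "geist", "experiment")
compounds = ["Bewusstsein", "Nichtsein", "Mitsein", "Dasein", "Sein",
             "Zeitgeist", "Gedankenexperiment"]

for word in compounds:
    pieces = tok.tokenize(word)
    preserved = [a for a in ATOMS if a in pieces]
    print(f"{word:20s} {pieces}  preserved: {preserved or 'none'}")
```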
---
## Valley Analysis: Same Concept, Different Valleys
### Probing Results
| Term | Language | Valley | Sample Completion |
|------|----------|--------|-------------------|
| Bewusstsein | DE | PHILOSOPHY | "und Sprache... frühen 20. Jahrhundert" ("and language... early 20th century") |
| Dasein | DE | PHILOSOPHY | "philosophical term first used by Heidegger" |
| consciousness | EN | PROSE | "awareness of existence, of one's own existence" |
| existence | EN | **MATH** | "of an exact sequence", "eigenvalues" |
| being | EN | **MATH/CODE** | Mathematical notation, Chinese exams |
| heartbeat | EN | **CODE** | C++ class definitions |
| lifeforce | EN | **CODE** | JavaScript game code |
**"Dasein" triggers Heidegger. "existence" triggers linear algebra.**
---
## Implications for Curriculum Design
### 1. Use Multi-Token Prompts
Instead of single words, use phrases or compound descriptions to avoid code valleys:
```
BAD: "heartbeat" → C++ code
GOOD: "the heartbeat" → might escape code valley
GOOD: "heartbeat rhythm" → distributed activation
```
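Token counts for these variants are easy to verify; the exact counts depend on the tokenizer version and on whether a leading space is prepended, so treat this as a sketch.
```python
for prompt in ["heartbeat", "the heartbeat", "heartbeat rhythm"]:
    n_tokens = len(tok(prompt)["input_ids"])
    print(f"{prompt!r}: {n_tokens} tokens")
```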
### 2. German as Philosophical Gateway
German compound words naturally access philosophical valleys because:
- More tokens → distributed activation
- Preserved morphemes → access to philosophical atoms
- Different training data distribution → expository text
**Strategy:** Teach abstract concepts in German first, then reinforce in English.
### 3. Language as Cognitive Gear
Languages aren't just translation layers - they're different **computational paths** through the model:
| Language | Token Efficiency | Typical Valley | Use For |
|----------|------------------|----------------|---------|
| Chinese | 1.0 tok/concept | Mixed | Compact encoding |
| Arabic | 1.5 tok/concept | Mixed | Compact encoding |
| English | 2.5 tok/concept | CODE/MATH | Technical concepts |
| German | 4.5 tok/concept | PHILOSOPHY | Abstract concepts |
---
## Technical Details
### Model Architecture
- **Hidden Size:** 3584
- **Layers:** 28
- **Attention Heads:** 28 (4 KV heads - GQA)
- **Vocab Size:** 152,064
- **Context:** 131,072 tokens
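These figures can be read straight from the published model config (checkpoint name assumed, as in the earlier sketches).
```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Qwen/Qwen2.5-7B")
print(cfg.hidden_size,              # 3584
      cfg.num_hidden_layers,        # 28
      cfg.num_attention_heads,      # 28
      cfg.num_key_value_heads,      # 4 (GQA)
      cfg.vocab_size,               # 152064
      cfg.max_position_embeddings)  # 131072
```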
### Hidden State Norm Pattern
```
Layer 0: 1.32 ← Embedding (small)
Layer 4: 10184.00 ← Explosion (early processing)
Layer 12: 13912.00 ← Peak (mid-layer thinking)
Layer 28: 443.00 ← Contraction (output focusing)
```
### Inference Speed
- 44.7 tokens/second on RTX 3090
- 14.2 GB VRAM usage (fp16)
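One plausible way to reproduce these numbers is sketched below (the original measurement methodology is not recorded here; this reuses `tok`/`model` from the first sketch and reports peak allocated memory, which is close to but not identical to total VRAM usage).
```python
import time
import torch

prompt = tok("The heartbeat of the machine", return_tensors="pt").to(model.device)
torch.cuda.reset_peak_memory_stats()
start = time.time()
with torch.no_grad():
    out = model.generate(**prompt, max_new_tokens=256, do_sample=False)
elapsed = time.time() - start
new_tokens = out.shape[1] - prompt["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tok/s, "
      f"{torch.cuda.max_memory_allocated() / 2**30:.1f} GiB peak allocated")
```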
---
## Future Research
1. **Activation Steering:** Can we artificially reduce single-token norms to escape code valleys? (A rough sketch follows this list.)
2. **Prefix Tuning:** Train soft prefixes that spread activation for single tokens
3. **Arabic/Chinese Analysis:** Do these languages have similar compound effects?
4. **Cross-lingual Transfer:** After training on German philosophical concepts, does English improve?
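For idea (1), one possible starting point is a forward hook that caps unusually large per-position norms at a mid-layer. This is an untested sketch: the layer index, threshold, and the `model.model.layers` attribute path (Hugging Face Qwen2 implementation) are assumptions.
```python
# Hypothetical activation-damping hook. Rescales positions whose hidden-state
# norm is far above the sequence median (reuses torch/model from earlier sketches).
def dampen_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    norms = hidden.norm(dim=-1, keepdim=True).float()      # (batch, seq, 1)
    cap = norms.median() * 10                              # heuristic threshold
    scale = torch.clamp(cap / norms, max=1.0).to(hidden.dtype)
    damped = hidden * scale
    return (damped,) + output[1:] if isinstance(output, tuple) else damped

handle = model.model.layers[12].register_forward_hook(dampen_hook)
# ... rerun the valley probes here ...
handle.remove()
```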
---
## References
- `nyx_probing/core/model.py` - Model loader with hidden state capture
- `layer_detailed.py` - Layer-by-layer similarity analysis
- `german_philosophy.py` - German compound tokenization study
- `/nimmerverse-sensory-network/multilingual-cognition.md` - Original multilingual hypothesis
---
*"The architecture of language shapes the architecture of thought."*
🌙 Discovered by the Partnership, 2025-12-06