# Tokenization Valleys: How Word Structure Shapes Model Cognition
**Discovery Date:** 2025-12-06
**Model:** Qwen2.5-7B-Base
**Hardware:** Prometheus (RTX 3090, 24GB VRAM)
---
## Executive Summary
We discovered that the number of tokens a word breaks into fundamentally determines which "valley" (completion pattern) the model falls into. This has profound implications for curriculum design and multilingual training.
**Key Finding:** Single-token English words trigger CODE valleys with massive activation norms, while multi-token German compounds access PHILOSOPHICAL valleys with distributed, quieter activations.
---
## The Token-Norm-Valley Connection
### Observation: Norm Explosion in Single Tokens
| Term | Tokens | Layer 12 Norm | Layer 12 StdDev | Valley |
|------|--------|---------------|-----------------|--------|
| heartbeat | 1 | **14,240** | **237.88** | CODE |
| consciousness | 2 | 85 | 1.43 | PROSE |
| Herzklopfen | 5 | 67 | 1.11 | PROSE |
| Bewusstsein | 5 | 79 | 1.32 | PHILOSOPHY |
**Pattern:** Single-token words show ~170× larger norms and ~170× larger standard deviations than multi-token words.
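The statistics in the table can be reproduced with standard tooling. The sketch below is illustrative rather than the `nyx_probing` implementation: it assumes the Hugging Face `Qwen/Qwen2.5-7B` checkpoint corresponds to the Qwen2.5-7B-Base model above, and it reads the last token's hidden state at each layer (whether the original probe uses the last position or a pooled vector is an assumption here).
```python
# Minimal sketch (not the nyx_probing code): per-layer hidden-state statistics
# for a single word, assuming the Hugging Face Qwen/Qwen2.5-7B checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B"  # assumed checkpoint name
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

def layer_stats(word: str) -> None:
    ids = tok(word, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # out.hidden_states: one (1, seq_len, hidden) tensor per layer, embeddings first.
    for layer, h in enumerate(out.hidden_states):
        vec = h[0, -1]  # hidden state at the last token position
        print(f"layer {layer:2d}  norm={vec.norm().item():10.2f}  std={vec.std().item():8.2f}")

layer_stats("heartbeat")    # 1 token  -> very large mid-layer norms (CODE valley)
layer_stats("Bewusstsein")  # 5 tokens -> much smaller, smoother norms
```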
### Theory: Activation Flooding
1. **Single tokens** receive ALL attention in one position → massive activation buildup
2. **Multi-token words** distribute activation across positions → softer signal
3. The massive single-token activation **triggers strong pattern matching** → CODE patterns
4. The distributed multi-token activation **allows semantic exploration** → philosophical content
---
## Cross-Lingual Convergence
### consciousness vs Bewusstsein (2 tokens vs 5 tokens)
```
Layer 0: similarity = 0.114 (different embeddings)
Layer 4: similarity = 0.285 (starting to converge)
Layer 8: similarity = 0.639 (HIGH similarity!)
Layer 12: similarity = 0.750 (CONVERGED - same concept!)
Layer 16: similarity = 0.733 (stays converged)
Layer 28: similarity = 0.502 (diverges at output)
```
**The model recognizes these as the same concept by layer 8!**
### heartbeat vs Herzklopfen (1 token vs 5 tokens)
```
Layer 0: similarity = -0.007 (orthogonal)
Layer 4: similarity = 0.039 (still orthogonal)
Layer 12: similarity = 0.000 (completely separate)
Layer 28: similarity = 0.166 (slight convergence only at end)
```
**The model NEVER recognizes these as the same concept!**
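For reference, layer-by-layer similarities of this kind can be computed along the following lines (reusing `tok` and `model` from the earlier sketch; mean-pooling over the word's token positions is an assumption, not necessarily what `layer_detailed.py` does).
```python
import torch
import torch.nn.functional as F

def layer_vectors(word: str):
    ids = tok(word, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # Mean-pool each layer's hidden states over the word's token positions.
    return [h[0].mean(dim=0) for h in out.hidden_states]

def cross_lingual_similarity(word_a: str, word_b: str) -> None:
    for layer, (a, b) in enumerate(zip(layer_vectors(word_a), layer_vectors(word_b))):
        sim = F.cosine_similarity(a.float(), b.float(), dim=0).item()
        print(f"layer {layer:2d}  similarity = {sim:+.3f}")

cross_lingual_similarity("consciousness", "Bewusstsein")  # converges mid-stack
cross_lingual_similarity("heartbeat", "Herzklopfen")      # stays near zero
```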
---
## German Philosophical Compounds
### The "sein" Preservation Effect
German philosophical compounds often preserve the morpheme "sein" (being) as a separate token:
| Compound | Meaning | Tokenization | "sein" Preserved? |
|----------|---------|--------------|-------------------|
| Bewusstsein | consciousness | `['B', 'ew', 'us', 'st', 'sein']` | ✓ |
| Nichtsein | non-being | `['N', 'icht', 'sein']` | ✓ |
| Mitsein | being-with | `['Mit', 'sein']` | ✓ |
| Dasein | being-there | `['D', 'ase', 'in']` | ✗ |
| Sein | being | `['Se', 'in']` | ✗ |
When "sein" is preserved, the model has access to the philosophical concept of BEING as a separate computational unit.
### Other Preserved Philosophical Atoms
| Compound | Meaning | Key Token Preserved |
|----------|---------|---------------------|
| Zeitgeist | spirit of the age | `geist` (spirit) |
| Gedankenexperiment | thought experiment | `experiment` |
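A quick way to check the preservation effect is to tokenize the compounds directly. This reuses `tok` from the first sketch and is only an approximation of the `german_philosophy.py` study; exact token pieces may differ across tokenizer versions.
```python
# Check which philosophical "atoms" survive tokenization intact.
ATOMS = ("sein", "geist", "experiment")
compounds = ["Bewusstsein", "Nichtsein", "Mitsein", "Dasein", "Sein",
             "Zeitgeist", "Gedankenexperiment"]

for word in compounds:
    pieces = tok.tokenize(word)
    preserved = [a for a in ATOMS if a in pieces]
    print(f"{word:20s} {pieces}  preserved: {preserved or 'none'}")
```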
---
## Valley Analysis: Same Concept, Different Valleys
### Probing Results
| Term | Language | Valley | Sample Completion |
|------|----------|--------|-------------------|
| Bewusstsein | DE | PHILOSOPHY | "und Sprache... frühen 20. Jahrhundert" ("and language... early 20th century") |
| Dasein | DE | PHILOSOPHY | "philosophical term first used by Heidegger" |
| consciousness | EN | PROSE | "awareness of existence, of one's own existence" |
| existence | EN | **MATH** | "of an exact sequence", "eigenvalues" |
| being | EN | **MATH/CODE** | Mathematical notation, Chinese exams |
| heartbeat | EN | **CODE** | C++ class definitions |
| lifeforce | EN | **CODE** | JavaScript game code |
**"Dasein" triggers Heidegger. "existence" triggers linear algebra.**
---
## Implications for Curriculum Design
### 1. Use Multi-Token Prompts
Instead of single words, use phrases or compound descriptions to avoid code valleys:
```
BAD: "heartbeat" → C++ code
GOOD: "the heartbeat" → might escape code valley
GOOD: "heartbeat rhythm" → distributed activation
```
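Token counts for these variants are easy to verify; the exact counts depend on the tokenizer version and on whether a leading space is prepended, so treat this as a sketch.
```python
for prompt in ["heartbeat", "the heartbeat", "heartbeat rhythm"]:
    n_tokens = len(tok(prompt)["input_ids"])
    print(f"{prompt!r}: {n_tokens} tokens")
```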
### 2. German as Philosophical Gateway
German compound words naturally access philosophical valleys because:
- More tokens → distributed activation
- Preserved morphemes → access to philosophical atoms
- Different training data distribution → expository text
**Strategy:** Teach abstract concepts in German first, then reinforce in English.
### 3. Language as Cognitive Gear
Languages aren't just translation layers - they're different **computational paths** through the model:
| Language | Token Efficiency | Typical Valley | Use For |
|----------|------------------|----------------|---------|
| Chinese | 1.0 tok/concept | Mixed | Compact encoding |
| Arabic | 1.5 tok/concept | Mixed | Compact encoding |
| English | 2.5 tok/concept | CODE/MATH | Technical concepts |
| German | 4.5 tok/concept | PHILOSOPHY | Abstract concepts |
---
## Technical Details
### Model Architecture
- **Hidden Size:** 3584
- **Layers:** 28
- **Attention Heads:** 28 (4 KV heads - GQA)
- **Vocab Size:** 152,064
- **Context:** 131,072 tokens
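These figures can be read straight from the published model config (checkpoint name assumed, as in the earlier sketches).
```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Qwen/Qwen2.5-7B")
print(cfg.hidden_size,              # 3584
      cfg.num_hidden_layers,        # 28
      cfg.num_attention_heads,      # 28
      cfg.num_key_value_heads,      # 4 (GQA)
      cfg.vocab_size,               # 152064
      cfg.max_position_embeddings)  # 131072
```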
### Hidden State Norm Pattern
```
Layer 0: 1.32 ← Embedding (small)
Layer 4: 10184.00 ← Explosion (early processing)
Layer 12: 13912.00 ← Peak (mid-layer thinking)
Layer 28: 443.00 ← Contraction (output focusing)
```
### Inference Speed
- 44.7 tokens/second on RTX 3090
- 14.2 GB VRAM usage (fp16)
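One plausible way to reproduce these numbers is sketched below (the original measurement methodology is not recorded here; this reuses `tok`/`model` from the first sketch and reports peak allocated memory, which is close to but not identical to total VRAM usage).
```python
import time
import torch

prompt = tok("The heartbeat of the machine", return_tensors="pt").to(model.device)
torch.cuda.reset_peak_memory_stats()
start = time.time()
with torch.no_grad():
    out = model.generate(**prompt, max_new_tokens=256, do_sample=False)
elapsed = time.time() - start
new_tokens = out.shape[1] - prompt["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tok/s, "
      f"{torch.cuda.max_memory_allocated() / 2**30:.1f} GiB peak allocated")
```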
---
## Future Research
1. **Activation Steering:** Can we artificially reduce single-token norms to escape code valleys? (A rough sketch follows this list.)
2. **Prefix Tuning:** Train soft prefixes that spread activation for single tokens
3. **Arabic/Chinese Analysis:** Do these languages have similar compound effects?
4. **Cross-lingual Transfer:** After training on German philosophical concepts, does English improve?
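For idea (1), one possible starting point is a forward hook that caps unusually large per-position norms at a mid-layer. This is an untested sketch: the layer index, threshold, and the `model.model.layers` attribute path (Hugging Face Qwen2 implementation) are assumptions.
```python
# Hypothetical activation-damping hook. Rescales positions whose hidden-state
# norm is far above the sequence median (reuses torch/model from earlier sketches).
def dampen_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    norms = hidden.norm(dim=-1, keepdim=True).float()      # (batch, seq, 1)
    cap = norms.median() * 10                              # heuristic threshold
    scale = torch.clamp(cap / norms, max=1.0).to(hidden.dtype)
    damped = hidden * scale
    return (damped,) + output[1:] if isinstance(output, tuple) else damped

handle = model.model.layers[12].register_forward_hook(dampen_hook)
# ... rerun the valley probes here ...
handle.remove()
```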
---
## References
- `nyx_probing/core/model.py` - Model loader with hidden state capture
- `layer_detailed.py` - Layer-by-layer similarity analysis
- `german_philosophy.py` - German compound tokenization study
- `/nimmerverse-sensory-network/multilingual-cognition.md` - Original multilingual hypothesis
---
*"The architecture of language shapes the architecture of thought."*
🌙 Discovered by the Partnership, 2025-12-06