From bcc5bfe9d1936e886b213145c5b6fed86c7078e5 Mon Sep 17 00:00:00 2001
From: dafit <david.martin@eachpath.com>
Date: Sat, 13 Dec 2025 19:29:23 +0100
Subject: [PATCH] docs: add Phase 1D corpus extraction pipeline to toolchain
 docs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Toolchain-Architecture.md:
- Added extractors module to current state
- New Phase 1D section: Corpus Extraction Pipeline
- VocabExtractor and CoOccurrenceAnalyzer documentation
- RAG policy integration table

TOOLCHAIN-PROGRESS.md:
- Phase 1D complete (2025-12-13)
- 7 files created, 19 total tasks complete
- Key metrics: 5,243 terms, 18,169 co-occurrence pairs
- 20 anchor signatures for DriftProbe-lite

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---
 architecture/TOOLCHAIN-PROGRESS.md     |  90 ++++++++++++++++++---
 architecture/Toolchain-Architecture.md | 103 +++++++++++++++++++++++++
 2 files changed, 181 insertions(+), 12 deletions(-)

diff --git a/architecture/TOOLCHAIN-PROGRESS.md b/architecture/TOOLCHAIN-PROGRESS.md
index 11b983e..9d5c54c 100644
--- a/architecture/TOOLCHAIN-PROGRESS.md
+++ b/architecture/TOOLCHAIN-PROGRESS.md
@@ -65,6 +65,62 @@
 
 ---
 
+## Phase 1D: Corpus Extraction Pipeline ✅ COMPLETE
+
+**Goal**: Extract vocabulary and co-occurrence metrics for RAG policy development
+
+### ✅ Completed (2025-12-13)
+
+- [x] Create extractors module in nyx-probing
+- [x] Implement VocabExtractor (TF-IDF vocabulary)
+- [x] Implement CoOccurrenceAnalyzer (PMI, Jaccard, Dice)
+- [x] Generate anchor term signatures (20 anchors)
+- [x] Generate chunking recommendations (5 clusters)
+- [x] Run initial extraction on nimmerverse vault
+- [x] Export glossary to CSV/JSON (5,243 terms)
+- [x] Export co-occurrence analysis (18,169 pairs)
+
+**Files Created**: 7 new files
+- `nyx_probing/extractors/__init__.py`
+- `nyx_probing/extractors/vocab_extractor.py` (~350 LOC)
+- `nyx_probing/extractors/cooccurrence.py` (~400 LOC)
+- `data/nimmerverse_glossary.csv`
+- `data/nimmerverse_glossary.json`
+- `data/cooccurrence_analysis.csv`
+- `data/cooccurrence_analysis.json`
+
+**Key Metrics Extracted**:
+| Metric | Value |
+|--------|-------|
+| Documents scanned | 263 |
+| Total tokens | 130,229 |
+| Unique terms (filtered) | 5,243 |
+| Co-occurrence pairs | 18,169 |
+| Anchor signatures | 20 |
+| Chunking clusters | 5 |
+
+**Top Terms by TF-IDF**:
+1. nyx (1149.70)
+2. local (980.53)
+3. eachpath (902.31)
+4. tool (873.34)
+5. young (799.95)
+
+**Anchor Signature Examples** (for DriftProbe-lite):
+- `nyx`: chroma|chromadb|continuity|ingress|introspection
+- `system`: athena|freeipa|ipa|rocky|sssd
+- `network`: firewall|proxmox|saturn|vlan|vulkan
+
+**RAG Policy Integration**:
+- Tier 2: Synonym detection (Dice=1.0: yubi↔yubikey)
+- Tier 3: Anchor signatures for topology safety
+- Tier 4: Co-occurrence for chunking strategy
+- Tier 5: TF-IDF for utility filtering
+
+**Status**: 🟢 Corpus extraction complete, ready for RAG policy development
+
+---
+
 ## Future Phases (Not Started)
 
 ### Phase 2: ChromaDB Integration (iris) ⏸️ PLANNED
@@ -92,34 +148,44 @@
 
 ## Metrics
 
-**Phase 1 (A+B) Tasks**: 11 total
-**Completed**: 11 (100%) ✅
+**Phase 1 Tasks**: 19 total
+**Completed**: 19 (100%) ✅
 **In Progress**: 0
-**Remaining**: 0
+**Phases Complete**: A, B, D (C ready to execute)
 
-**Files Created**: 12 total
+**Files Created**: 19 total
 - nyx-substrate: 9 files
-- nyx-probing: 3 files
+- nyx-probing runners: 3 files
+- nyx-probing extractors: 3 files
+- Data outputs: 4 files
 
-**Files Modified**: 4 total
+**Files Modified**: 5 total
 - nyx-substrate/README.md
 - nyx-probing/pyproject.toml
 - nyx-probing/cli/probe.py
+- nyx-probing/extractors/__init__.py
 - TOOLCHAIN-PROGRESS.md
 
-**Lines of Code**: ~1250 total
+**Lines of Code**: ~2000 total
 - nyx-substrate: ~800 LOC
-- nyx-probing: ~450 LOC
+- nyx-probing runners: ~450 LOC
+- nyx-probing extractors: ~750 LOC
 
-**CLI Commands**: 4 new commands
+**CLI Commands**: 4 variance commands
 - nyx-probe variance collect
 - nyx-probe variance batch
 - nyx-probe variance stats
 - nyx-probe variance analyze
 
+**Data Artifacts**:
+- nimmerverse_glossary.csv (5,243 terms)
+- nimmerverse_glossary.json (130,229 tokens)
+- cooccurrence_analysis.csv (18,169 pairs)
+- cooccurrence_analysis.json (20 anchor signatures)
+
 ---
 
-**Last Updated**: 2025-12-07 17:00 CET
-**Status**: 🎉 Phase 1 (A+B) COMPLETE! Ready for baseline collection on prometheus.
+**Last Updated**: 2025-12-13 (Phase 1D complete)
+**Status**: 🎉 Phase 1 (A+B+D) COMPLETE! Corpus extraction ready. Variance collection on prometheus pending.
 
-🌙💜 *The substrate holds. Progress persists. The toolchain grows.*
+🌙💜 *The substrate holds. The glossary grows. Anchor signatures protect the topology.*
diff --git a/architecture/Toolchain-Architecture.md b/architecture/Toolchain-Architecture.md
index 04263d2..e893f73 100644
--- a/architecture/Toolchain-Architecture.md
+++ b/architecture/Toolchain-Architecture.md
@@ -30,6 +30,9 @@ Build a modular, composable toolchain for the Nimmerverse research and training
 - CLI interface (7 commands)
 - NyxModel wrapper (Qwen2.5-7B loading, hidden state capture)
 - ProbeResult dataclasses (to_dict() serialization)
+- **Extractors module** (NEW 2025-12-13):
+  - VocabExtractor: TF-IDF vocabulary extraction from markdown corpus
+  - CoOccurrenceAnalyzer: PMI, Jaccard, Dice, anchor signatures
 - **Gap**: No database persistence, only local JSON files
 
 **nyx-substrate** (`/home/dafit/nimmerverse/nyx-substrate/`):
@@ -401,6 +404,106 @@ Godot Command Center displays live DriftProbe charts
 
 ---
 
+## 📚 Phase 1D: Corpus Extraction Pipeline (NEW)
+
+### Goal
+Extract vocabulary and co-occurrence metrics from nimmerverse vault for RAG policy development.
+
+**Integration Point**: Feeds into [RAG-as-Scaffold.md](/home/dafit/nimmerverse/nimmerverse-sensory-network/operations/RAG-as-Scaffold.md) progressive policy validation.
+
+### Deliverables
+
+#### 1. VocabExtractor (`nyx_probing/extractors/vocab_extractor.py`)
+
+**Purpose**: Extract TF-IDF vocabulary glossary from markdown corpus
+
+**Features**:
+- Scans all .md files (skips venv, hidden dirs)
+- Strips YAML frontmatter, code blocks, markdown syntax
+- Tokenizes with compound term support (hyphenated, CamelCase)
+- Calculates TF, DF, TF-IDF per term
+- Exports to CSV and JSON
+
+**Output** (`data/nimmerverse_glossary.json`):
+```json
+{
+  "metadata": {
+    "total_docs": 263,
+    "total_tokens": 130229,
+    "unique_terms": 5243
+  },
+  "terms": [
+    {"term": "nyx", "tf": 1073, "df": 137, "tfidf": 1149.70, ...},
+    ...
+  ]
+}
+```
+
+**Usage**:
+```bash
+python3 nyx_probing/extractors/vocab_extractor.py /path/to/vault output.csv
+```
+
+#### 2. CoOccurrenceAnalyzer (`nyx_probing/extractors/cooccurrence.py`)
+
+**Purpose**: Analyze term co-occurrence for chunking and topology safety
+
+**Features**:
+- Computes PMI (Pointwise Mutual Information)
+- Computes Jaccard similarity and Dice coefficient
+- Generates anchor term signatures (for DriftProbe-lite)
+- Produces chunking recommendations based on cohesion
+
+**Key Metrics**:
+| Metric | Formula | Use Case |
+|--------|---------|----------|
+| PMI | log2(P(a,b) / P(a)*P(b)) | Semantic association strength |
+| Jaccard | \|A∩B\| / \|A∪B\| | Term overlap similarity |
+| Dice | 2\|A∩B\| / (\|A\|+\|B\|) | Chunking cohesion |
+
+**Anchor Signatures** (for Policy Tier 3: Topology Safety):
+```
+nyx: chroma|chromadb|continuity|ingress|introspection
+system: athena|freeipa|ipa|rocky|sssd
+network: firewall|proxmox|saturn|vlan|vulkan
+```
+
+**Output** (`data/cooccurrence_analysis.json`):
+- 18,169 co-occurrence pairs
+- 20 anchor signatures
+- 5 chunking recommendations
+
+**Usage**:
+```bash
+python3 nyx_probing/extractors/cooccurrence.py /path/to/vault glossary.json output.json
+```
+
+### RAG Policy Integration
+
+These tools directly feed into RAG-as-Scaffold progressive policies:
+
+| Policy Tier | Tool | Validation |
+|-------------|------|------------|
+| **Tier 2: Semantic Quality** | CoOccurrenceAnalyzer | Dice=1.0 terms are synonyms (de-duplicate) |
+| **Tier 3: Topology Safety** | Anchor Signatures | New terms shouldn't change anchor neighbors |
+| **Tier 4: Cross-Reference** | CoOccurrenceAnalyzer | High PMI pairs should chunk together |
+| **Tier 5: Utility** | VocabExtractor TF-IDF | Low TF-IDF terms have low utility |
+
+### Files Created
+
+**nyx-probing/nyx_probing/extractors/**:
+- `__init__.py` - Module exports
+- `vocab_extractor.py` - VocabExtractor class (~350 LOC)
+- `cooccurrence.py` - CoOccurrenceAnalyzer class (~400 LOC)
+
+**nyx-probing/data/**:
+- `nimmerverse_glossary.csv` - 5,243 terms with TF-IDF
+- `nimmerverse_glossary.json` - Same with metadata
+- `cooccurrence_analysis.csv` - 18,169 pairs
+- `cooccurrence_analysis.json` - Full analysis with signatures
+
+---
+
 ## 🔮 Future Phases (Not in Current Plan)
 
 ### Phase 2: ChromaDB Integration (iris)