From bcc5bfe9d1936e886b213145c5b6fed86c7078e5 Mon Sep 17 00:00:00 2001 From: dafit Date: Sat, 13 Dec 2025 19:29:23 +0100 Subject: [PATCH] docs: add Phase 1D corpus extraction pipeline to toolchain docs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Toolchain-Architecture.md: - Added extractors module to current state - New Phase 1D section: Corpus Extraction Pipeline - VocabExtractor and CoOccurrenceAnalyzer documentation - RAG policy integration table TOOLCHAIN-PROGRESS.md: - Phase 1D complete (2025-12-13) - 7 files created, 19 total tasks complete - Key metrics: 5,243 terms, 18,169 co-occurrence pairs - 20 anchor signatures for DriftProbe-lite 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 --- architecture/TOOLCHAIN-PROGRESS.md | 90 ++++++++++++++++++--- architecture/Toolchain-Architecture.md | 103 +++++++++++++++++++++++++ 2 files changed, 181 insertions(+), 12 deletions(-) diff --git a/architecture/TOOLCHAIN-PROGRESS.md b/architecture/TOOLCHAIN-PROGRESS.md index 11b983e..9d5c54c 100644 --- a/architecture/TOOLCHAIN-PROGRESS.md +++ b/architecture/TOOLCHAIN-PROGRESS.md @@ -65,6 +65,62 @@ --- +## Phase 1D: Corpus Extraction Pipeline ✅ COMPLETE + +**Goal**: Extract vocabulary and co-occurrence metrics for RAG policy development + +### ✅ Completed (2025-12-13) + +- [x] Create extractors module in nyx-probing +- [x] Implement VocabExtractor (TF-IDF vocabulary) +- [x] Implement CoOccurrenceAnalyzer (PMI, Jaccard, Dice) +- [x] Generate anchor term signatures (20 anchors) +- [x] Generate chunking recommendations (5 clusters) +- [x] Run initial extraction on nimmerverse vault +- [x] Export glossary to CSV/JSON (5,243 terms) +- [x] Export co-occurrence analysis (18,169 pairs) + +**Files Created**: 7 new files +- `nyx_probing/extractors/__init__.py` +- `nyx_probing/extractors/vocab_extractor.py` (~350 LOC) +- `nyx_probing/extractors/cooccurrence.py` (~400 LOC) +- `data/nimmerverse_glossary.csv` +- `data/nimmerverse_glossary.json` +- `data/cooccurrence_analysis.csv` +- `data/cooccurrence_analysis.json` + +**Key Metrics Extracted**: +| Metric | Value | +|--------|-------| +| Documents scanned | 263 | +| Total tokens | 130,229 | +| Unique terms (filtered) | 5,243 | +| Co-occurrence pairs | 18,169 | +| Anchor signatures | 20 | +| Chunking clusters | 5 | + +**Top Terms by TF-IDF**: +1. nyx (1149.70) +2. local (980.53) +3. eachpath (902.31) +4. tool (873.34) +5. young (799.95) + +**Anchor Signature Examples** (for DriftProbe-lite): +- `nyx`: chroma|chromadb|continuity|ingress|introspection +- `system`: athena|freeipa|ipa|rocky|sssd +- `network`: firewall|proxmox|saturn|vlan|vulkan + +**RAG Policy Integration**: +- Tier 2: Synonym detection (Dice=1.0: yubi↔yubikey) +- Tier 3: Anchor signatures for topology safety +- Tier 4: Co-occurrence for chunking strategy +- Tier 5: TF-IDF for utility filtering + +**Status**: 🟢 Corpus extraction complete, ready for RAG policy development + +--- + ## Future Phases (Not Started) ### Phase 2: ChromaDB Integration (iris) ⏸️ PLANNED @@ -92,34 +148,44 @@ ## Metrics -**Phase 1 (A+B) Tasks**: 11 total -**Completed**: 11 (100%) ✅ +**Phase 1 Tasks**: 19 total +**Completed**: 19 (100%) ✅ **In Progress**: 0 -**Remaining**: 0 +**Phases Complete**: A, B, D (C ready to execute) -**Files Created**: 12 total +**Files Created**: 19 total - nyx-substrate: 9 files -- nyx-probing: 3 files +- nyx-probing runners: 3 files +- nyx-probing extractors: 3 files +- Data outputs: 4 files -**Files Modified**: 4 total +**Files Modified**: 5 total - nyx-substrate/README.md - nyx-probing/pyproject.toml - nyx-probing/cli/probe.py +- nyx-probing/extractors/__init__.py - TOOLCHAIN-PROGRESS.md -**Lines of Code**: ~1250 total +**Lines of Code**: ~2000 total - nyx-substrate: ~800 LOC -- nyx-probing: ~450 LOC +- nyx-probing runners: ~450 LOC +- nyx-probing extractors: ~750 LOC -**CLI Commands**: 4 new commands +**CLI Commands**: 4 variance commands - nyx-probe variance collect - nyx-probe variance batch - nyx-probe variance stats - nyx-probe variance analyze +**Data Artifacts**: +- nimmerverse_glossary.csv (5,243 terms) +- nimmerverse_glossary.json (130,229 tokens) +- cooccurrence_analysis.csv (18,169 pairs) +- cooccurrence_analysis.json (20 anchor signatures) + --- -**Last Updated**: 2025-12-07 17:00 CET -**Status**: 🎉 Phase 1 (A+B) COMPLETE! Ready for baseline collection on prometheus. +**Last Updated**: 2025-12-13 (Phase 1D complete) +**Status**: 🎉 Phase 1 (A+B+D) COMPLETE! Corpus extraction ready. Variance collection on prometheus pending. -🌙💜 *The substrate holds. Progress persists. The toolchain grows.* +🌙💜 *The substrate holds. The glossary grows. Anchor signatures protect the topology.* diff --git a/architecture/Toolchain-Architecture.md b/architecture/Toolchain-Architecture.md index 04263d2..e893f73 100644 --- a/architecture/Toolchain-Architecture.md +++ b/architecture/Toolchain-Architecture.md @@ -30,6 +30,9 @@ Build a modular, composable toolchain for the Nimmerverse research and training - CLI interface (7 commands) - NyxModel wrapper (Qwen2.5-7B loading, hidden state capture) - ProbeResult dataclasses (to_dict() serialization) +- **Extractors module** (NEW 2025-12-13): + - VocabExtractor: TF-IDF vocabulary extraction from markdown corpus + - CoOccurrenceAnalyzer: PMI, Jaccard, Dice, anchor signatures - **Gap**: No database persistence, only local JSON files **nyx-substrate** (`/home/dafit/nimmerverse/nyx-substrate/`): @@ -401,6 +404,106 @@ Godot Command Center displays live DriftProbe charts --- +## 📚 Phase 1D: Corpus Extraction Pipeline (NEW) + +### Goal +Extract vocabulary and co-occurrence metrics from nimmerverse vault for RAG policy development. + +**Integration Point**: Feeds into [RAG-as-Scaffold.md](/home/dafit/nimmerverse/nimmerverse-sensory-network/operations/RAG-as-Scaffold.md) progressive policy validation. + +### Deliverables + +#### 1. VocabExtractor (`nyx_probing/extractors/vocab_extractor.py`) + +**Purpose**: Extract TF-IDF vocabulary glossary from markdown corpus + +**Features**: +- Scans all .md files (skips venv, hidden dirs) +- Strips YAML frontmatter, code blocks, markdown syntax +- Tokenizes with compound term support (hyphenated, CamelCase) +- Calculates TF, DF, TF-IDF per term +- Exports to CSV and JSON + +**Output** (`data/nimmerverse_glossary.json`): +```json +{ + "metadata": { + "total_docs": 263, + "total_tokens": 130229, + "unique_terms": 5243 + }, + "terms": [ + {"term": "nyx", "tf": 1073, "df": 137, "tfidf": 1149.70, ...}, + ... + ] +} +``` + +**Usage**: +```bash +python3 nyx_probing/extractors/vocab_extractor.py /path/to/vault output.csv +``` + +#### 2. CoOccurrenceAnalyzer (`nyx_probing/extractors/cooccurrence.py`) + +**Purpose**: Analyze term co-occurrence for chunking and topology safety + +**Features**: +- Computes PMI (Pointwise Mutual Information) +- Computes Jaccard similarity and Dice coefficient +- Generates anchor term signatures (for DriftProbe-lite) +- Produces chunking recommendations based on cohesion + +**Key Metrics**: +| Metric | Formula | Use Case | +|--------|---------|----------| +| PMI | log2(P(a,b) / P(a)*P(b)) | Semantic association strength | +| Jaccard | \|A∩B\| / \|A∪B\| | Term overlap similarity | +| Dice | 2\|A∩B\| / (\|A\|+\|B\|) | Chunking cohesion | + +**Anchor Signatures** (for Policy Tier 3: Topology Safety): +``` +nyx: chroma|chromadb|continuity|ingress|introspection +system: athena|freeipa|ipa|rocky|sssd +network: firewall|proxmox|saturn|vlan|vulkan +``` + +**Output** (`data/cooccurrence_analysis.json`): +- 18,169 co-occurrence pairs +- 20 anchor signatures +- 5 chunking recommendations + +**Usage**: +```bash +python3 nyx_probing/extractors/cooccurrence.py /path/to/vault glossary.json output.json +``` + +### RAG Policy Integration + +These tools directly feed into RAG-as-Scaffold progressive policies: + +| Policy Tier | Tool | Validation | +|-------------|------|------------| +| **Tier 2: Semantic Quality** | CoOccurrenceAnalyzer | Dice=1.0 terms are synonyms (de-duplicate) | +| **Tier 3: Topology Safety** | Anchor Signatures | New terms shouldn't change anchor neighbors | +| **Tier 4: Cross-Reference** | CoOccurrenceAnalyzer | High PMI pairs should chunk together | +| **Tier 5: Utility** | VocabExtractor TF-IDF | Low TF-IDF terms have low utility | + +### Files Created + +**nyx-probing/nyx_probing/extractors/**: +- `__init__.py` - Module exports +- `vocab_extractor.py` - VocabExtractor class (~350 LOC) +- `cooccurrence.py` - CoOccurrenceAnalyzer class (~400 LOC) + +**nyx-probing/data/**: +- `nimmerverse_glossary.csv` - 5,243 terms with TF-IDF +- `nimmerverse_glossary.json` - Same with metadata +- `cooccurrence_analysis.csv` - 18,169 pairs +- `cooccurrence_analysis.json` - Full analysis with signatures + +--- + ## 🔮 Future Phases (Not in Current Plan) ### Phase 2: ChromaDB Integration (iris)