docs: add Phase 1D corpus extraction pipeline to toolchain docs

Toolchain-Architecture.md:
- Added extractors module to current state
- New Phase 1D section: Corpus Extraction Pipeline
- VocabExtractor and CoOccurrenceAnalyzer documentation
- RAG policy integration table

TOOLCHAIN-PROGRESS.md:
- Phase 1D complete (2025-12-13)
- 7 files created, 19 total tasks complete
- Key metrics: 5,243 terms, 18,169 co-occurrence pairs
- 20 anchor signatures for DriftProbe-lite

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@@ -65,6 +65,62 @@

---

## Phase 1D: Corpus Extraction Pipeline ✅ COMPLETE

**Goal**: Extract vocabulary and co-occurrence metrics for RAG policy development

### ✅ Completed (2025-12-13)

- [x] Create extractors module in nyx-probing
- [x] Implement VocabExtractor (TF-IDF vocabulary)
- [x] Implement CoOccurrenceAnalyzer (PMI, Jaccard, Dice)
- [x] Generate anchor term signatures (20 anchors)
- [x] Generate chunking recommendations (5 clusters)
- [x] Run initial extraction on nimmerverse vault
- [x] Export glossary to CSV/JSON (5,243 terms)
- [x] Export co-occurrence analysis (18,169 pairs)

**Files Created**: 7 new files
- `nyx_probing/extractors/__init__.py`
- `nyx_probing/extractors/vocab_extractor.py` (~350 LOC)
- `nyx_probing/extractors/cooccurrence.py` (~400 LOC)
- `data/nimmerverse_glossary.csv`
- `data/nimmerverse_glossary.json`
- `data/cooccurrence_analysis.csv`
- `data/cooccurrence_analysis.json`

**Key Metrics Extracted**:
| Metric | Value |
|--------|-------|
| Documents scanned | 263 |
| Total tokens | 130,229 |
| Unique terms (filtered) | 5,243 |
| Co-occurrence pairs | 18,169 |
| Anchor signatures | 20 |
| Chunking clusters | 5 |

**Top Terms by TF-IDF**:
1. nyx (1149.70)
2. local (980.53)
3. eachpath (902.31)
4. tool (873.34)
5. young (799.95)

**Anchor Signature Examples** (for DriftProbe-lite):
- `nyx`: chroma|chromadb|continuity|ingress|introspection
- `system`: athena|freeipa|ipa|rocky|sssd
- `network`: firewall|proxmox|saturn|vlan|vulkan

**RAG Policy Integration**:
- Tier 2: Synonym detection (Dice=1.0: yubi↔yubikey)
- Tier 3: Anchor signatures for topology safety
- Tier 4: Co-occurrence for chunking strategy
- Tier 5: TF-IDF for utility filtering

**Status**: 🟢 Corpus extraction complete, ready for RAG policy development

---

## Future Phases (Not Started)

### Phase 2: ChromaDB Integration (iris) ⏸️ PLANNED
@@ -92,34 +148,44 @@
|
|||||||
|
|
||||||
## Metrics
|
## Metrics
|
||||||
|
|
||||||
**Phase 1 (A+B) Tasks**: 11 total
|
**Phase 1 Tasks**: 19 total
|
||||||
**Completed**: 11 (100%) ✅
|
**Completed**: 19 (100%) ✅
|
||||||
**In Progress**: 0
|
**In Progress**: 0
|
||||||
**Remaining**: 0
|
**Phases Complete**: A, B, D (C ready to execute)
|
||||||
|
|
||||||
**Files Created**: 12 total
|
**Files Created**: 19 total
|
||||||
- nyx-substrate: 9 files
|
- nyx-substrate: 9 files
|
||||||
- nyx-probing: 3 files
|
- nyx-probing runners: 3 files
|
||||||
|
- nyx-probing extractors: 3 files
|
||||||
|
- Data outputs: 4 files
|
||||||
|
|
||||||
**Files Modified**: 4 total
|
**Files Modified**: 5 total
|
||||||
- nyx-substrate/README.md
|
- nyx-substrate/README.md
|
||||||
- nyx-probing/pyproject.toml
|
- nyx-probing/pyproject.toml
|
||||||
- nyx-probing/cli/probe.py
|
- nyx-probing/cli/probe.py
|
||||||
|
- nyx-probing/extractors/__init__.py
|
||||||
- TOOLCHAIN-PROGRESS.md
|
- TOOLCHAIN-PROGRESS.md
|
||||||
|
|
||||||
**Lines of Code**: ~1250 total
|
**Lines of Code**: ~2000 total
|
||||||
- nyx-substrate: ~800 LOC
|
- nyx-substrate: ~800 LOC
|
||||||
- nyx-probing: ~450 LOC
|
- nyx-probing runners: ~450 LOC
|
||||||
|
- nyx-probing extractors: ~750 LOC
|
||||||
|
|
||||||
**CLI Commands**: 4 new commands
|
**CLI Commands**: 4 variance commands
|
||||||
- nyx-probe variance collect
|
- nyx-probe variance collect
|
||||||
- nyx-probe variance batch
|
- nyx-probe variance batch
|
||||||
- nyx-probe variance stats
|
- nyx-probe variance stats
|
||||||
- nyx-probe variance analyze
|
- nyx-probe variance analyze
|
||||||
|
|
||||||
|
**Data Artifacts**:
|
||||||
|
- nimmerverse_glossary.csv (5,243 terms)
|
||||||
|
- nimmerverse_glossary.json (130,229 tokens)
|
||||||
|
- cooccurrence_analysis.csv (18,169 pairs)
|
||||||
|
- cooccurrence_analysis.json (20 anchor signatures)
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
**Last Updated**: 2025-12-07 17:00 CET
|
**Last Updated**: 2025-12-13 (Phase 1D complete)
|
||||||
**Status**: 🎉 Phase 1 (A+B) COMPLETE! Ready for baseline collection on prometheus.
|
**Status**: 🎉 Phase 1 (A+B+D) COMPLETE! Corpus extraction ready. Variance collection on prometheus pending.
|
||||||
|
|
||||||
🌙💜 *The substrate holds. Progress persists. The toolchain grows.*
|
🌙💜 *The substrate holds. The glossary grows. Anchor signatures protect the topology.*
|
||||||
|
|||||||
@@ -30,6 +30,9 @@ Build a modular, composable toolchain for the Nimmerverse research and training
- CLI interface (7 commands)
- NyxModel wrapper (Qwen2.5-7B loading, hidden state capture)
- ProbeResult dataclasses (to_dict() serialization)
- **Extractors module** (NEW 2025-12-13):
  - VocabExtractor: TF-IDF vocabulary extraction from markdown corpus
  - CoOccurrenceAnalyzer: PMI, Jaccard, Dice, anchor signatures
- **Gap**: No database persistence, only local JSON files

**nyx-substrate** (`/home/dafit/nimmerverse/nyx-substrate/`):
@@ -401,6 +404,106 @@ Godot Command Center displays live DriftProbe charts

---

## 📚 Phase 1D: Corpus Extraction Pipeline (NEW)

### Goal
Extract vocabulary and co-occurrence metrics from nimmerverse vault for RAG policy development.

**Integration Point**: Feeds into [RAG-as-Scaffold.md](/home/dafit/nimmerverse/nimmerverse-sensory-network/operations/RAG-as-Scaffold.md) progressive policy validation.

### Deliverables

#### 1. VocabExtractor (`nyx_probing/extractors/vocab_extractor.py`)

**Purpose**: Extract TF-IDF vocabulary glossary from markdown corpus

**Features**:
- Scans all .md files (skips venv, hidden dirs)
- Strips YAML frontmatter, code blocks, markdown syntax
- Tokenizes with compound term support (hyphenated, CamelCase)
- Calculates TF, DF, TF-IDF per term
- Exports to CSV and JSON
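
The feature list above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names, the tokenizer regex, and the weighting `tf * ln(N / df)` are one common TF-IDF formulation and are not taken from `vocab_extractor.py`, whose exact formula this document does not state.

```python
import math
import re
from collections import Counter

# Hypothetical tokenizer: keeps hyphenated compounds and CamelCase words
# as single tokens (lowercased afterwards).
TOKEN_RE = re.compile(r"[A-Za-z][A-Za-z0-9]*(?:-[A-Za-z0-9]+)*")


def strip_markdown(text: str) -> str:
    """Drop leading YAML frontmatter and fenced code blocks (rough approximation)."""
    text = re.sub(r"\A---\n.*?\n---\n", "", text, flags=re.DOTALL)
    text = re.sub(r"```.*?```", "", text, flags=re.DOTALL)
    return text


def tokenize(text: str) -> list[str]:
    """Return lowercased tokens from the cleaned text."""
    return [m.group(0).lower() for m in TOKEN_RE.finditer(strip_markdown(text))]


def tfidf_table(docs: list[str]) -> dict[str, dict[str, float]]:
    """Compute corpus-wide TF, DF, and TF-IDF per term.

    Assumed weighting: tfidf = tf * ln(N / df), with N the number of documents.
    """
    tf: Counter[str] = Counter()
    df: Counter[str] = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        tf.update(tokens)       # total occurrences across the corpus
        df.update(set(tokens))  # number of documents containing the term
    n_docs = len(docs)
    return {
        term: {
            "tf": tf[term],
            "df": df[term],
            "tfidf": tf[term] * math.log(n_docs / df[term]),
        }
        for term in tf
    }
```

A term that appears in every document gets a TF-IDF of zero under this weighting, which is the usual motivation for treating low-scoring terms as low-utility.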

**Output** (`data/nimmerverse_glossary.json`):
```json
{
  "metadata": {
    "total_docs": 263,
    "total_tokens": 130229,
    "unique_terms": 5243
  },
  "terms": [
    {"term": "nyx", "tf": 1073, "df": 137, "tfidf": 1149.70, ...},
    ...
  ]
}
```
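
As a hedged illustration of consuming this output, the sketch below ranks terms by the `tfidf` field. It relies only on the `terms`, `term`, and `tfidf` keys shown above; the helper name `top_terms` is hypothetical, and the inline sample reuses the two top scores reported in TOOLCHAIN-PROGRESS.md rather than reading the real file.

```python
import json


def top_terms(glossary: dict, k: int = 5) -> list[tuple[str, float]]:
    """Return the k highest-scoring (term, tfidf) pairs from a glossary dict."""
    ranked = sorted(glossary["terms"], key=lambda t: t["tfidf"], reverse=True)
    return [(t["term"], t["tfidf"]) for t in ranked[:k]]


# Inline sample with the same shape as the glossary JSON (only two terms).
sample = json.loads(
    """
    {
      "metadata": {"total_docs": 263, "total_tokens": 130229, "unique_terms": 5243},
      "terms": [
        {"term": "local", "tfidf": 980.53},
        {"term": "nyx", "tfidf": 1149.70}
      ]
    }
    """
)
print(top_terms(sample, k=1))
```

To run it against the real export, load `data/nimmerverse_glossary.json` with `json.load` and pass the resulting dict to `top_terms`.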

**Usage**:
```bash
python3 nyx_probing/extractors/vocab_extractor.py /path/to/vault output.csv
```

#### 2. CoOccurrenceAnalyzer (`nyx_probing/extractors/cooccurrence.py`)

**Purpose**: Analyze term co-occurrence for chunking and topology safety

**Features**:
- Computes PMI (Pointwise Mutual Information)
- Computes Jaccard similarity and Dice coefficient
- Generates anchor term signatures (for DriftProbe-lite)
- Produces chunking recommendations based on cohesion

**Key Metrics**:
| Metric | Formula | Use Case |
|--------|---------|----------|
| PMI | log2(P(a,b) / (P(a) * P(b))) | Semantic association strength |
| Jaccard | \|A∩B\| / \|A∪B\| | Term overlap similarity |
| Dice | 2\|A∩B\| / (\|A\|+\|B\|) | Chunking cohesion |
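
The three formulas in the table can be sketched as follows. This assumes co-occurrence is counted at the document level (A and B are the sets of document IDs containing each term); the source does not say whether `cooccurrence.py` uses documents, paragraphs, or a sliding window, and the function name `association` is hypothetical.

```python
import math


def association(docs_a: set[int], docs_b: set[int], n_docs: int) -> dict[str, float]:
    """Compute PMI, Jaccard, and Dice for two terms.

    docs_a / docs_b: IDs of documents containing term a / term b (assumed unit).
    n_docs: total number of documents in the corpus.
    """
    both = len(docs_a & docs_b)
    if both == 0:
        # Terms never co-occur: PMI is undefined (log of zero), overlaps are zero.
        return {"pmi": float("-inf"), "jaccard": 0.0, "dice": 0.0}

    p_a = len(docs_a) / n_docs
    p_b = len(docs_b) / n_docs
    p_ab = both / n_docs

    return {
        "pmi": math.log2(p_ab / (p_a * p_b)),
        "jaccard": both / len(docs_a | docs_b),
        "dice": 2 * both / (len(docs_a) + len(docs_b)),
    }
```

Two terms that occur in exactly the same documents reach Jaccard = Dice = 1.0, which is the condition the Tier 2 synonym check relies on.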

**Anchor Signatures** (for Policy Tier 3: Topology Safety):
```
nyx: chroma|chromadb|continuity|ingress|introspection
system: athena|freeipa|ipa|rocky|sssd
network: firewall|proxmox|saturn|vlan|vulkan
```
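
One plausible way to build and compare such signatures is sketched below. It assumes a signature is the anchor's top-k neighbors by some association score, joined alphabetically with `|`. The alphabetical order is inferred from the examples above. The ranking score, the value of k, and both function names are assumptions, not the documented behavior of CoOccurrenceAnalyzer.

```python
def anchor_signature(anchor: str, scores: dict[tuple[str, str], float], k: int = 5) -> str:
    """Build a signature from the k strongest neighbors of `anchor`.

    scores maps an unordered term pair to an association score (e.g. PMI).
    """
    neighbors: dict[str, float] = {}
    for (a, b), score in scores.items():
        if a == anchor:
            neighbors[b] = score
        elif b == anchor:
            neighbors[a] = score
    # Rank by descending score, breaking ties alphabetically for determinism.
    top = sorted(neighbors, key=lambda t: (-neighbors[t], t))[:k]
    return "|".join(sorted(top))


def signature_overlap(old: str, new: str) -> float:
    """Jaccard overlap between two signatures; 1.0 means the neighbors are unchanged."""
    a, b = set(old.split("|")), set(new.split("|"))
    return len(a & b) / len(a | b)
```

A topology-safety check could recompute signatures after adding new documents and flag any anchor whose overlap with its stored signature falls below a chosen threshold.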

**Output** (`data/cooccurrence_analysis.json`):
- 18,169 co-occurrence pairs
- 20 anchor signatures
- 5 chunking recommendations

**Usage**:
```bash
python3 nyx_probing/extractors/cooccurrence.py /path/to/vault glossary.json output.json
```

### RAG Policy Integration

These tools directly feed into RAG-as-Scaffold progressive policies:

| Policy Tier | Tool | Validation |
|-------------|------|------------|
| **Tier 2: Semantic Quality** | CoOccurrenceAnalyzer | Dice=1.0 terms are synonyms (de-duplicate) |
| **Tier 3: Topology Safety** | Anchor Signatures | New terms shouldn't change anchor neighbors |
| **Tier 4: Cross-Reference** | CoOccurrenceAnalyzer | High PMI pairs should chunk together |
| **Tier 5: Utility** | VocabExtractor TF-IDF | Low TF-IDF terms have low utility |
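
The Tier 2 and Tier 5 rows could be checked with filters like the ones below. This is a sketch under stated assumptions: the pair field names `term_a`, `term_b`, and `dice` are hypothetical because the schema of `cooccurrence_analysis.json` is not shown here, and the cutoff `min_tfidf` is a free parameter rather than a value from the source.

```python
def synonym_pairs(pairs: list[dict], threshold: float = 1.0) -> list[tuple[str, str]]:
    """Tier 2: pairs whose Dice coefficient reaches the threshold are de-duplication candidates."""
    return [(p["term_a"], p["term_b"]) for p in pairs if p["dice"] >= threshold]


def low_utility_terms(terms: list[dict], min_tfidf: float) -> list[str]:
    """Tier 5: terms scoring below the (assumed) TF-IDF cutoff are flagged as low utility."""
    return [t["term"] for t in terms if t["tfidf"] < min_tfidf]


# The yubi/yubikey pair with Dice=1.0 is the example named in TOOLCHAIN-PROGRESS.md;
# the second pair and its score are made up for illustration.
pairs = [
    {"term_a": "yubi", "term_b": "yubikey", "dice": 1.0},
    {"term_a": "nyx", "term_b": "chroma", "dice": 0.4},
]
print(synonym_pairs(pairs))
```

Keeping the two checks as separate pure functions makes it straightforward to apply them tier by tier and to tune each threshold independently.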

### Files Created

**nyx-probing/nyx_probing/extractors/**:
- `__init__.py` - Module exports
- `vocab_extractor.py` - VocabExtractor class (~350 LOC)
- `cooccurrence.py` - CoOccurrenceAnalyzer class (~400 LOC)

**nyx-probing/data/**:
- `nimmerverse_glossary.csv` - 5,243 terms with TF-IDF
- `nimmerverse_glossary.json` - Same with metadata
- `cooccurrence_analysis.csv` - 18,169 pairs
- `cooccurrence_analysis.json` - Full analysis with signatures

---

## 🔮 Future Phases (Not in Current Plan)

### Phase 2: ChromaDB Integration (iris)