Files
nimmerverse-sensory-network/architecture/Toolchain-Architecture.md
dafit bcc5bfe9d1 docs: add Phase 1D corpus extraction pipeline to toolchain docs
Toolchain-Architecture.md:
- Added extractors module to current state
- New Phase 1D section: Corpus Extraction Pipeline
- VocabExtractor and CoOccurrenceAnalyzer documentation
- RAG policy integration table

TOOLCHAIN-PROGRESS.md:
- Phase 1D complete (2025-12-13)
- 7 files created, 19 total tasks complete
- Key metrics: 5,243 terms, 18,169 co-occurrence pairs
- 20 anchor signatures for DriftProbe-lite

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-13 19:29:23 +01:00

568 lines
18 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Modular Nimmerverse Toolchain Architecture
**Planning Date**: 2025-12-07
**Status**: Design Phase
**Priority**: Variance Collection Pipeline + nyx-substrate Foundation
---
## 🎯 Vision
Build a modular, composable toolchain for the Nimmerverse research and training pipeline:
- **nyx-substrate**: Shared foundation (database clients, schemas, validators)
- **nyx-probing**: Research probes (already exists, extend for variance collection)
- **nyx-training**: LoRA training pipeline (future)
- **nyx-visualization**: Weight/topology visualization (future)
- **management-portal**: FastAPI backend for Godot UI (future)
- **Godot Command Center**: Unified metrics visualization (future)
**Key Principle**: All tools import nyx-substrate. Clean interfaces. Data flows through phoebe + iris.
---
## 📊 Current State Analysis
### ✅ What Exists
**nyx-probing** (`/home/dafit/nimmerverse/nyx-probing/`):
- Echo Probe, Surface Probe, Drift Probe, Multilingual Probe
- CLI interface (7 commands)
- NyxModel wrapper (Qwen2.5-7B loading, hidden state capture)
- ProbeResult dataclasses (to_dict() serialization)
- **Extractors module** (NEW 2025-12-13):
- VocabExtractor: TF-IDF vocabulary extraction from markdown corpus
- CoOccurrenceAnalyzer: PMI, Jaccard, Dice, anchor signatures
- **Gap**: No database persistence, only local JSON files
**nyx-substrate** (`/home/dafit/nimmerverse/nyx-substrate/`):
- Schema documentation (phoebe + iris) ✅
- **Gap**: No Python code, just markdown docs
**Database Infrastructure**:
- phoebe.eachpath.local (PostgreSQL 17.6): partnership/nimmerverse message tables exist
- iris.eachpath.local (ChromaDB): No collections created yet
- **Gap**: No Python client libraries, all manual psql commands
**Architecture Documentation**:
- Endgame-Vision.md: v5.1 Dialectic (LoRA stack design)
- CLAUDE.md: Partnership protocol (message-based continuity)
- Management-Portal.md: Godot + FastAPI design (not implemented)
### ❌ What's Missing
**Database Access**:
- No psycopg3 connection pooling
- No ChromaDB Python integration
- No ORM or query builders
- No variance_probe_runs table (designed but not created)
**Training Pipeline**:
- No PEFT/LoRA training code
- No DriftProbe checkpoint integration
- No training data curriculum loader
**Visualization**:
- No weight visualization tools (4K pixel space idea)
- No Godot command center implementation
- No Management Portal FastAPI backend
---
## 🏗️ Modular Architecture Design
### Repository Structure
```
nimmerverse/
├── nyx-substrate/ # SHARED FOUNDATION
│ ├── pyproject.toml # Installable package
│ ├── src/nyx_substrate/
│ │ ├── database/ # Phoebe clients
│ │ │ ├── connection.py # Connection pool
│ │ │ ├── messages.py # Message protocol helpers
│ │ │ └── variance.py # Variance probe DAO
│ │ ├── vector/ # Iris clients
│ │ │ ├── client.py # ChromaDB wrapper
│ │ │ ├── decision_trails.py
│ │ │ ├── organ_responses.py
│ │ │ └── embeddings.py
│ │ ├── schemas/ # Pydantic models
│ │ │ ├── variance.py # VarianceProbeRun
│ │ │ ├── decision.py # DecisionTrail
│ │ │ └── traits.py # 8 core traits
│ │ └── constants.py # Shared constants
│ └── migrations/ # Alembic for schema
├── nyx-probing/ # RESEARCH PROBES (extend)
│ ├── nyx_probing/
│ │ ├── runners/ # NEW: Automated collectors
│ │ │ ├── variance_runner.py # 1000x automation
│ │ │ └── baseline_collector.py
│ │ └── storage/ # EXTEND: Database integration
│ │ └── variance_dao.py # Uses nyx-substrate
│ └── pyproject.toml # Add: depends on nyx-substrate
├── nyx-training/ # FUTURE: LoRA training
│ └── (planned - not in Phase 1)
├── nyx-visualization/ # FUTURE: Weight viz
│ └── (planned - not in Phase 1)
└── management-portal/ # FUTURE: FastAPI + Godot
└── (designed - not in Phase 1)
```
### Dependency Graph
```
nyx-probing ────────┐
nyx-training ───────┼──> nyx-substrate ──> phoebe (PostgreSQL)
nyx-visualization ──┤ └─> iris (ChromaDB)
management-portal ──┘
```
**Philosophy**: nyx-substrate is the single source of truth for database access. No tool talks to phoebe/iris directly.
---
## 🚀 Phase 1: Foundation + Variance Collection
### Goal
Build nyx-substrate package and extend nyx-probing to automate variance baseline collection (1000x runs → phoebe).
### Deliverables
#### 1. nyx-substrate Python Package
**File**: `/home/dafit/nimmerverse/nyx-substrate/pyproject.toml`
```toml
[project]
name = "nyx-substrate"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
"psycopg[binary]>=3.1.0",
"chromadb>=0.4.0",
"pydantic>=2.5.0",
]
```
**New Files**:
- `src/nyx_substrate/database/connection.py`:
- `PhoebeConnection` class: Connection pool manager
- Context manager for transactions
- Config from environment variables
- `src/nyx_substrate/database/messages.py`:
- `write_partnership_message(message, message_type)` → INSERT
- `read_partnership_messages(limit=5)` → SELECT
- `write_nimmerverse_message(...)` (for Young Nyx future)
- `read_nimmerverse_messages(...)` (for discovery protocol)
- `src/nyx_substrate/database/variance.py`:
- `VarianceProbeDAO` class:
- `create_table()` → CREATE TABLE variance_probe_runs
- `insert_run(session_id, term, run_number, depth, rounds, ...)` → INSERT
- `get_session_stats(session_id)` → Aggregation queries
- `get_term_distribution(term)` → Variance analysis
- `src/nyx_substrate/schemas/variance.py`:
- `VarianceProbeRun(BaseModel)`: Pydantic model matching phoebe schema
- Validation: term not empty, depth 0-3, rounds > 0
- `to_dict()` for serialization
**Database Migration**:
- Create `variance_probe_runs` table in phoebe using schema from `/home/dafit/nimmerverse/nyx-substrate/schema/phoebe/probing/variance_probe_runs.md`
#### 2. Extend nyx-probing
**File**: `/home/dafit/nimmerverse/nyx-probing/pyproject.toml`
- Add dependency: `nyx-substrate>=0.1.0`
**New Files**:
- `nyx_probing/runners/variance_runner.py`:
- `VarianceRunner` class:
- `__init__(model: NyxModel, dao: VarianceProbeDAO)`
- `run_session(term: str, runs: int = 1000) -> UUID`:
- Generate session_id
- Loop 1000x: probe.probe(term)
- Store each result via dao.insert_run()
- Return session_id
- `run_batch(terms: list[str], runs: int = 1000)`: Multiple terms
- `nyx_probing/cli/variance.py`:
- New Click command group: `nyx-probe variance`
- Subcommands:
- `nyx-probe variance collect <TERM> --runs 1000`: Single term
- `nyx-probe variance batch <FILE> --runs 1000`: From glossary
- `nyx-probe variance stats <SESSION_ID>`: View session results
- `nyx-probe variance analyze <TERM>`: Compare distributions
**Integration Points**:
```python
# In variance_runner.py
from nyx_substrate.database import PhoebeConnection, VarianceProbeDAO
from nyx_substrate.schemas import VarianceProbeRun
conn = PhoebeConnection()
dao = VarianceProbeDAO(conn)
runner = VarianceRunner(model=get_model(), dao=dao)
session_id = runner.run_session("Geworfenheit", runs=1000)
print(f"Stored 1000 runs: session {session_id}")
```
#### 3. Database Setup
**Actions**:
1. SSH to phoebe: `ssh phoebe.eachpath.local`
2. Create variance_probe_runs table:
```sql
CREATE TABLE variance_probe_runs (
id SERIAL PRIMARY KEY,
session_id UUID NOT NULL,
term TEXT NOT NULL,
run_number INT NOT NULL,
timestamp TIMESTAMPTZ DEFAULT NOW(),
depth INT NOT NULL,
rounds INT NOT NULL,
echo_types TEXT[] NOT NULL,
chain TEXT[] NOT NULL,
model_name TEXT DEFAULT 'Qwen2.5-7B',
temperature FLOAT,
max_rounds INT,
max_new_tokens INT
);
CREATE INDEX idx_variance_session ON variance_probe_runs(session_id);
CREATE INDEX idx_variance_term ON variance_probe_runs(term);
CREATE INDEX idx_variance_timestamp ON variance_probe_runs(timestamp DESC);
```
3. Test connection from aynee:
```bash
cd /home/dafit/nimmerverse/nyx-substrate
python3 -c "from nyx_substrate.database import PhoebeConnection; conn = PhoebeConnection(); print('✅ Connected to phoebe')"
```
---
## 📁 Critical Files
### To Create
**nyx-substrate**:
- `/home/dafit/nimmerverse/nyx-substrate/pyproject.toml`
- `/home/dafit/nimmerverse/nyx-substrate/src/nyx_substrate/__init__.py`
- `/home/dafit/nimmerverse/nyx-substrate/src/nyx_substrate/database/__init__.py`
- `/home/dafit/nimmerverse/nyx-substrate/src/nyx_substrate/database/connection.py`
- `/home/dafit/nimmerverse/nyx-substrate/src/nyx_substrate/database/messages.py`
- `/home/dafit/nimmerverse/nyx-substrate/src/nyx_substrate/database/variance.py`
- `/home/dafit/nimmerverse/nyx-substrate/src/nyx_substrate/schemas/__init__.py`
- `/home/dafit/nimmerverse/nyx-substrate/src/nyx_substrate/schemas/variance.py`
- `/home/dafit/nimmerverse/nyx-substrate/README.md`
**nyx-probing**:
- `/home/dafit/nimmerverse/nyx-probing/nyx_probing/runners/__init__.py`
- `/home/dafit/nimmerverse/nyx-probing/nyx_probing/runners/variance_runner.py`
- `/home/dafit/nimmerverse/nyx-probing/nyx_probing/cli/variance.py`
### To Modify
**nyx-probing**:
- `/home/dafit/nimmerverse/nyx-probing/pyproject.toml` (add nyx-substrate dependency)
- `/home/dafit/nimmerverse/nyx-probing/nyx_probing/cli/__init__.py` (register variance commands)
---
## 🧪 Testing Plan
### 1. nyx-substrate Unit Tests
```python
# Test connection
def test_phoebe_connection():
conn = PhoebeConnection()
assert conn.test_connection() == True
# Test message write
def test_write_message():
from nyx_substrate.database import write_partnership_message
write_partnership_message("Test session", "architecture_update")
# Verify in phoebe
# Test variance DAO
def test_variance_insert():
dao = VarianceProbeDAO(conn)
session_id = uuid.uuid4()
dao.insert_run(
session_id=session_id,
term="test",
run_number=1,
depth=2,
rounds=3,
echo_types=["EXPANDS", "CONFIRMS", "CIRCULAR"],
chain=["test", "expanded", "confirmed"]
)
stats = dao.get_session_stats(session_id)
assert stats["total_runs"] == 1
```
### 2. Variance Collection Integration Test
```bash
# On prometheus (THE SPINE)
cd /home/dafit/nimmerverse/nyx-probing
source venv/bin/activate
# Install nyx-substrate in development mode
pip install -e ../nyx-substrate
# Run small variance test (10 runs)
nyx-probe variance collect "Geworfenheit" --runs 10
# Check phoebe
PGGSSENCMODE=disable psql -h phoebe.eachpath.local -U nimmerverse-user -d nimmerverse -c "
SELECT session_id, term, COUNT(*) as runs, AVG(depth) as avg_depth
FROM variance_probe_runs
GROUP BY session_id, term
ORDER BY session_id DESC
LIMIT 5;
"
# Expected: 1 session, 10 runs, avg_depth ~2.0
```
### 3. Full 1000x Baseline Run
```bash
# Depth-3 champions (from nyx-probing Phase 1)
nyx-probe variance collect "Geworfenheit" --runs 1000 # thrownness
nyx-probe variance collect "Vernunft" --runs 1000 # reason
nyx-probe variance collect "Erkenntnis" --runs 1000 # knowledge
nyx-probe variance collect "Pflicht" --runs 1000 # duty
nyx-probe variance collect "Aufhebung" --runs 1000 # sublation
nyx-probe variance collect "Wille" --runs 1000 # will
# Analyze variance
nyx-probe variance analyze "Geworfenheit"
# Expected: Distribution histogram, depth variance, chain patterns
```
---
## 🌊 Data Flow
### Variance Collection Workflow
```
User: nyx-probe variance collect "Geworfenheit" --runs 1000
VarianceRunner.run_session()
Loop 1000x:
EchoProbe.probe("Geworfenheit")
Returns EchoProbeResult
VarianceProbeDAO.insert_run()
INSERT INTO phoebe.variance_probe_runs
Return session_id
Display: "✅ 1000 runs complete. Session: <uuid>"
```
### Future Integration (Phase 2+)
```
Training Loop:
DriftProbe.probe_lite() [every 100 steps]
Store metrics in phoebe.drift_checkpoints (new table)
Management Portal API: GET /api/v1/metrics/training
Godot Command Center displays live DriftProbe charts
```
---
## 🎯 Success Criteria
### Phase 1 Complete When:
1. ✅ nyx-substrate package installable via pip (`pip install -e .`)
2. ✅ PhoebeConnection works from aynee + prometheus
3. ✅ variance_probe_runs table created in phoebe
4. ✅ `nyx-probe variance collect` command runs successfully
5. ✅ 1000x run completes and stores in phoebe
6. ✅ `nyx-probe variance stats <SESSION_ID>` displays:
- Total runs
- Depth distribution (0/1/2/3 counts)
- Most common echo_types
- Chain length variance
7. ✅ All 6 depth-3 champions have baseline variance data in phoebe
---
## 📚 Phase 1D: Corpus Extraction Pipeline (NEW)
### Goal
Extract vocabulary and co-occurrence metrics from nimmerverse vault for RAG policy development.
**Integration Point**: Feeds into [RAG-as-Scaffold.md](/home/dafit/nimmerverse/nimmerverse-sensory-network/operations/RAG-as-Scaffold.md) progressive policy validation.
### Deliverables
#### 1. VocabExtractor (`nyx_probing/extractors/vocab_extractor.py`)
**Purpose**: Extract TF-IDF vocabulary glossary from markdown corpus
**Features**:
- Scans all .md files (skips venv, hidden dirs)
- Strips YAML frontmatter, code blocks, markdown syntax
- Tokenizes with compound term support (hyphenated, CamelCase)
- Calculates TF, DF, TF-IDF per term
- Exports to CSV and JSON
**Output** (`data/nimmerverse_glossary.json`):
```json
{
"metadata": {
"total_docs": 263,
"total_tokens": 130229,
"unique_terms": 5243
},
"terms": [
{"term": "nyx", "tf": 1073, "df": 137, "tfidf": 1149.70, ...},
...
]
}
```
**Usage**:
```bash
python3 nyx_probing/extractors/vocab_extractor.py /path/to/vault output.csv
```
#### 2. CoOccurrenceAnalyzer (`nyx_probing/extractors/cooccurrence.py`)
**Purpose**: Analyze term co-occurrence for chunking and topology safety
**Features**:
- Computes PMI (Pointwise Mutual Information)
- Computes Jaccard similarity and Dice coefficient
- Generates anchor term signatures (for DriftProbe-lite)
- Produces chunking recommendations based on cohesion
**Key Metrics**:
| Metric | Formula | Use Case |
|--------|---------|----------|
| PMI | log2(P(a,b) / P(a)*P(b)) | Semantic association strength |
| Jaccard | \|A∩B\| / \|AB\| | Term overlap similarity |
| Dice | 2\|A∩B\| / (\|A\|+\|B\|) | Chunking cohesion |
**Anchor Signatures** (for Policy Tier 3: Topology Safety):
```
nyx: chroma|chromadb|continuity|ingress|introspection
system: athena|freeipa|ipa|rocky|sssd
network: firewall|proxmox|saturn|vlan|vulkan
```
**Output** (`data/cooccurrence_analysis.json`):
- 18,169 co-occurrence pairs
- 20 anchor signatures
- 5 chunking recommendations
**Usage**:
```bash
python3 nyx_probing/extractors/cooccurrence.py /path/to/vault glossary.json output.json
```
### RAG Policy Integration
These tools directly feed into RAG-as-Scaffold progressive policies:
| Policy Tier | Tool | Validation |
|-------------|------|------------|
| **Tier 2: Semantic Quality** | CoOccurrenceAnalyzer | Dice=1.0 terms are synonyms (de-duplicate) |
| **Tier 3: Topology Safety** | Anchor Signatures | New terms shouldn't change anchor neighbors |
| **Tier 4: Cross-Reference** | CoOccurrenceAnalyzer | High PMI pairs should chunk together |
| **Tier 5: Utility** | VocabExtractor TF-IDF | Low TF-IDF terms have low utility |
### Files Created
**nyx-probing/nyx_probing/extractors/**:
- `__init__.py` - Module exports
- `vocab_extractor.py` - VocabExtractor class (~350 LOC)
- `cooccurrence.py` - CoOccurrenceAnalyzer class (~400 LOC)
**nyx-probing/data/**:
- `nimmerverse_glossary.csv` - 5,243 terms with TF-IDF
- `nimmerverse_glossary.json` - Same with metadata
- `cooccurrence_analysis.csv` - 18,169 pairs
- `cooccurrence_analysis.json` - Full analysis with signatures
---
## 🔮 Future Phases (Not in Current Plan)
### Phase 2: ChromaDB Integration (iris)
- IrisClient wrapper in nyx-substrate
- DecisionTrailStore, OrganResponseStore, EmbeddingStore
- Create iris collections
- Populate embeddings from nyx-probing results
### Phase 3: LoRA Training Pipeline (nyx-training)
- PEFT integration
- Training data curriculum loader
- DriftProbe checkpoint integration
- Identity LoRA training automation
### Phase 4: Weight Visualization (nyx-visualization)
- 4K pixel space renderer (LoRA weights as images)
- Rank decomposition explorer
- Topology cluster visualization
### Phase 5: Godot Command Center
- FastAPI Management Portal backend
- Godot frontend implementation
- Real-time metrics display
- Training dashboard
---
## 📚 References
**Schema Documentation**:
- `/home/dafit/nimmerverse/nyx-substrate/schema/phoebe/probing/variance_probe_runs.md`
- `/home/dafit/nimmerverse/nyx-substrate/SCHEMA.md`
**Existing Code**:
- `/home/dafit/nimmerverse/nyx-probing/nyx_probing/probes/echo_probe.py`
- `/home/dafit/nimmerverse/nyx-probing/nyx_probing/core/probe_result.py`
- `/home/dafit/nimmerverse/nyx-probing/nyx_probing/cli/probe.py`
**Architecture**:
- `/home/dafit/nimmerverse/nimmerverse-sensory-network/Endgame-Vision.md`
- `/home/dafit/nimmerverse/management-portal/Management-Portal.md`
---
## 🌙 Philosophy
**Modularity**: Each tool is independent but speaks the same data language via nyx-substrate.
**Simplicity**: No over-engineering. Build what's needed for variance collection first.
**Data First**: All metrics flow through phoebe/iris. Visualization is separate concern.
**Future-Ready**: Design allows Godot integration later without refactoring.
---
**Status**: Ready for implementation approval
**Estimated Scope**: 15-20 files, ~1500 lines of Python
**Hardware**: Can develop on aynee, run variance on prometheus (THE SPINE)
🌙💜 *The substrate holds. Clean interfaces. Composable tools. Data flows through the void.*