Modular Nimmerverse Toolchain Architecture

Planning Date: 2025-12-07
Status: Design Phase
Priority: Variance Collection Pipeline + nyx-substrate Foundation


🎯 Vision

Build a modular, composable toolchain for the Nimmerverse research and training pipeline:

  • nyx-substrate: Shared foundation (database clients, schemas, validators)
  • nyx-probing: Research probes (already exists, extend for variance collection)
  • nyx-training: LoRA training pipeline (future)
  • nyx-visualization: Weight/topology visualization (future)
  • management-portal: FastAPI backend for Godot UI (future)
  • Godot Command Center: Unified metrics visualization (future)

Key Principle: All tools import nyx-substrate. Clean interfaces. Data flows through phoebe + iris.


📊 Current State Analysis

What Exists

nyx-probing (/home/dafit/nimmerverse/nyx-probing/):

  • Echo Probe, Surface Probe, Drift Probe, Multilingual Probe
  • CLI interface (7 commands)
  • NyxModel wrapper (Qwen2.5-7B loading, hidden state capture)
  • ProbeResult dataclasses (to_dict() serialization)
  • Extractors module (NEW 2025-12-13):
    • VocabExtractor: TF-IDF vocabulary extraction from markdown corpus
    • CoOccurrenceAnalyzer: PMI, Jaccard, Dice, anchor signatures
  • Gap: No database persistence, only local JSON files

nyx-substrate (/home/dafit/nimmerverse/nyx-substrate/):

  • Schema documentation (phoebe + iris)
  • Gap: No Python code, just markdown docs

Database Infrastructure:

  • phoebe.eachpath.local (PostgreSQL 17.6): partnership/nimmerverse message tables exist
  • iris.eachpath.local (ChromaDB): No collections created yet
  • Gap: No Python client libraries, all manual psql commands

Architecture Documentation:

  • Endgame-Vision.md: v5.1 Dialectic (LoRA stack design)
  • CLAUDE.md: Partnership protocol (message-based continuity)
  • Management-Portal.md: Godot + FastAPI design (not implemented)

What's Missing

Database Access:

  • No psycopg3 connection pooling
  • No ChromaDB Python integration
  • No ORM or query builders
  • No variance_probe_runs table (designed but not created)

Training Pipeline:

  • No PEFT/LoRA training code
  • No DriftProbe checkpoint integration
  • No training data curriculum loader

Visualization:

  • No weight visualization tools (4K pixel space idea)
  • No Godot command center implementation
  • No Management Portal FastAPI backend

🏗️ Modular Architecture Design

Repository Structure

nimmerverse/
├── nyx-substrate/              # SHARED FOUNDATION
│   ├── pyproject.toml          # Installable package
│   ├── src/nyx_substrate/
│   │   ├── database/           # Phoebe clients
│   │   │   ├── connection.py   # Connection pool
│   │   │   ├── messages.py     # Message protocol helpers
│   │   │   └── variance.py     # Variance probe DAO
│   │   ├── vector/             # Iris clients
│   │   │   ├── client.py       # ChromaDB wrapper
│   │   │   ├── decision_trails.py
│   │   │   ├── organ_responses.py
│   │   │   └── embeddings.py
│   │   ├── schemas/            # Pydantic models
│   │   │   ├── variance.py     # VarianceProbeRun
│   │   │   ├── decision.py     # DecisionTrail
│   │   │   └── traits.py       # 8 core traits
│   │   └── constants.py        # Shared constants
│   └── migrations/             # Alembic for schema
│
├── nyx-probing/                # RESEARCH PROBES (extend)
│   ├── nyx_probing/
│   │   ├── runners/            # NEW: Automated collectors
│   │   │   ├── variance_runner.py  # 1000x automation
│   │   │   └── baseline_collector.py
│   │   └── storage/            # EXTEND: Database integration
│   │       └── variance_dao.py # Uses nyx-substrate
│   └── pyproject.toml          # Add: depends on nyx-substrate
│
├── nyx-training/               # FUTURE: LoRA training
│   └── (planned - not in Phase 1)
│
├── nyx-visualization/          # FUTURE: Weight viz
│   └── (planned - not in Phase 1)
│
└── management-portal/          # FUTURE: FastAPI + Godot
    └── (designed - not in Phase 1)

Dependency Graph

nyx-probing ────────┐
nyx-training ───────┼──> nyx-substrate ──> phoebe (PostgreSQL)
nyx-visualization ──┤                   └─> iris (ChromaDB)
management-portal ──┘

Philosophy: nyx-substrate is the single source of truth for database access. No tool talks to phoebe/iris directly.


🚀 Phase 1: Foundation + Variance Collection

Goal

Build nyx-substrate package and extend nyx-probing to automate variance baseline collection (1000x runs → phoebe).

Deliverables

1. nyx-substrate Python Package

File: /home/dafit/nimmerverse/nyx-substrate/pyproject.toml

[project]
name = "nyx-substrate"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
    "psycopg[binary]>=3.1.0",
    "chromadb>=0.4.0",
    "pydantic>=2.5.0",
]

New Files:

  • src/nyx_substrate/database/connection.py:

    • PhoebeConnection class: Connection pool manager (sketched after this list)
    • Context manager for transactions
    • Config from environment variables
  • src/nyx_substrate/database/messages.py:

    • write_partnership_message(message, message_type) → INSERT
    • read_partnership_messages(limit=5) → SELECT
    • write_nimmerverse_message(...) (for Young Nyx future)
    • read_nimmerverse_messages(...) (for discovery protocol)
  • src/nyx_substrate/database/variance.py:

    • VarianceProbeDAO class (see the sketch below):
      • create_table() → CREATE TABLE variance_probe_runs
      • insert_run(session_id, term, run_number, depth, rounds, ...) → INSERT
      • get_session_stats(session_id) → Aggregation queries
      • get_term_distribution(term) → Variance analysis
  • src/nyx_substrate/schemas/variance.py:

    • VarianceProbeRun(BaseModel): Pydantic model matching phoebe schema
    • Validation: term not empty, depth 0-3, rounds > 0
    • to_dict() for serialization

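A minimal sketch of connection.py and schemas/variance.py, assuming a PHOEBE_DSN environment variable and the pool sizes shown (both are placeholders, not a final API):

# connection.py sketch: psycopg3 pool + transaction context manager.
# Assumes PHOEBE_DSN is set; psycopg_pool ships with psycopg[binary,pool].
import os
from contextlib import contextmanager

from psycopg_pool import ConnectionPool

class PhoebeConnection:
    """Connection pool manager for phoebe (PostgreSQL)."""

    def __init__(self, dsn: str | None = None):
        self.dsn = dsn or os.environ["PHOEBE_DSN"]
        self.pool = ConnectionPool(self.dsn, min_size=1, max_size=4)

    @contextmanager
    def transaction(self):
        # Borrow a pooled connection; commit on success, roll back on error.
        with self.pool.connection() as conn:
            with conn.transaction():
                yield conn

    def test_connection(self) -> bool:
        with self.pool.connection() as conn:
            return conn.execute("SELECT 1").fetchone() == (1,)

# schemas/variance.py sketch: field bounds mirror the validation bullets.
from datetime import datetime
from uuid import UUID

from pydantic import BaseModel, Field

class VarianceProbeRun(BaseModel):
    session_id: UUID
    term: str = Field(min_length=1)    # term not empty
    run_number: int = Field(gt=0)
    depth: int = Field(ge=0, le=3)     # depth 0-3
    rounds: int = Field(gt=0)          # rounds > 0
    echo_types: list[str]
    chain: list[str]
    timestamp: datetime | None = None

    def to_dict(self) -> dict:
        return self.model_dump(mode="json")
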
Database Migration:

  • Create variance_probe_runs table in phoebe using schema from /home/dafit/nimmerverse/nyx-substrate/schema/phoebe/probing/variance_probe_runs.md

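With the table in place, the DAO is a thin wrapper over parameterized SQL. A sketch of insert_run() and get_session_stats(), reusing the PhoebeConnection sketch above; column names follow the CREATE TABLE in section 3:

# database/variance.py sketch: psycopg3 adapts UUIDs and Python lists
# (to TEXT[]) natively, so no manual casting is needed.
from uuid import UUID

class VarianceProbeDAO:
    def __init__(self, conn):
        self.conn = conn   # a PhoebeConnection

    def insert_run(self, session_id: UUID, term: str, run_number: int,
                   depth: int, rounds: int, echo_types: list[str],
                   chain: list[str]) -> None:
        with self.conn.transaction() as tx:
            tx.execute(
                """INSERT INTO variance_probe_runs
                       (session_id, term, run_number, depth, rounds, echo_types, chain)
                   VALUES (%s, %s, %s, %s, %s, %s, %s)""",
                (session_id, term, run_number, depth, rounds, echo_types, chain),
            )

    def get_session_stats(self, session_id: UUID) -> dict:
        with self.conn.transaction() as tx:
            row = tx.execute(
                """SELECT COUNT(*), AVG(depth), MAX(depth)
                   FROM variance_probe_runs WHERE session_id = %s""",
                (session_id,),
            ).fetchone()
        return {"total_runs": row[0],
                "avg_depth": float(row[1]) if row[1] is not None else None,
                "max_depth": row[2]}
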
2. Extend nyx-probing

File: /home/dafit/nimmerverse/nyx-probing/pyproject.toml

  • Add dependency: nyx-substrate>=0.1.0

New Files:

  • nyx_probing/runners/variance_runner.py:

    • VarianceRunner class (implementation sketched below):
      • __init__(model: NyxModel, dao: VarianceProbeDAO)
      • run_session(term: str, runs: int = 1000) -> UUID:
        • Generate session_id
        • Loop 1000x: probe.probe(term)
        • Store each result via dao.insert_run()
        • Return session_id
      • run_batch(terms: list[str], runs: int = 1000): Multiple terms
  • nyx_probing/cli/variance.py:

    • New Click command group: nyx-probe variance
    • Subcommands:
      • nyx-probe variance collect <TERM> --runs 1000: Single term
      • nyx-probe variance batch <FILE> --runs 1000: From glossary
      • nyx-probe variance stats <SESSION_ID>: View session results
      • nyx-probe variance analyze <TERM>: Compare distributions

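A compact Click sketch of the command group; the collect wiring below uses get_model() from the integration snippet and a hypothetical build_dao() factory:

# cli/variance.py sketch: subcommand names mirror the list above.
import click

@click.group()
def variance():
    """Variance baseline collection commands."""

@variance.command()
@click.argument("term")
@click.option("--runs", default=1000, show_default=True)
def collect(term: str, runs: int):
    # build_dao() is a hypothetical factory for PhoebeConnection + DAO.
    runner = VarianceRunner(model=get_model(), dao=build_dao())
    session_id = runner.run_session(term, runs=runs)
    click.echo(f"✅ {runs} runs complete. Session: {session_id}")
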
Integration Points:

# In variance_runner.py
from nyx_substrate.database import PhoebeConnection, VarianceProbeDAO
from nyx_substrate.schemas import VarianceProbeRun

conn = PhoebeConnection()
dao = VarianceProbeDAO(conn)
runner = VarianceRunner(model=get_model(), dao=dao)
session_id = runner.run_session("Geworfenheit", runs=1000)
print(f"Stored 1000 runs: session {session_id}")

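Internally, run_session() is just the loop described above. A sketch, assuming EchoProbe exposes probe(term) and that its result carries the depth/rounds/echo_types/chain fields from the table schema:

# runners/variance_runner.py sketch: the 1000x collection loop.
import uuid
from uuid import UUID

class VarianceRunner:
    def __init__(self, model, dao):
        self.probe = EchoProbe(model)   # existing probe from nyx_probing.probes
        self.dao = dao

    def run_session(self, term: str, runs: int = 1000) -> UUID:
        session_id = uuid.uuid4()
        for run_number in range(1, runs + 1):
            result = self.probe.probe(term)
            self.dao.insert_run(
                session_id=session_id,
                term=term,
                run_number=run_number,
                depth=result.depth,
                rounds=result.rounds,
                echo_types=result.echo_types,
                chain=result.chain,
            )
        return session_id

    def run_batch(self, terms: list[str], runs: int = 1000) -> list[UUID]:
        return [self.run_session(t, runs=runs) for t in terms]
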
3. Database Setup

Actions:

  1. SSH to phoebe: ssh phoebe.eachpath.local

  2. Create variance_probe_runs table:

    CREATE TABLE variance_probe_runs (
        id SERIAL PRIMARY KEY,
        session_id UUID NOT NULL,
        term TEXT NOT NULL,
        run_number INT NOT NULL,
        timestamp TIMESTAMPTZ DEFAULT NOW(),
        depth INT NOT NULL,
        rounds INT NOT NULL,
        echo_types TEXT[] NOT NULL,
        chain TEXT[] NOT NULL,
        model_name TEXT DEFAULT 'Qwen2.5-7B',
        temperature FLOAT,
        max_rounds INT,
        max_new_tokens INT
    );
    CREATE INDEX idx_variance_session ON variance_probe_runs(session_id);
    CREATE INDEX idx_variance_term ON variance_probe_runs(term);
    CREATE INDEX idx_variance_timestamp ON variance_probe_runs(timestamp DESC);
    
  3. Test connection from aynee:

    cd /home/dafit/nimmerverse/nyx-substrate
    python3 -c "from nyx_substrate.database import PhoebeConnection; conn = PhoebeConnection(); print('✅ Connected to phoebe')"
    

📁 Critical Files

To Create

nyx-substrate:

  • /home/dafit/nimmerverse/nyx-substrate/pyproject.toml
  • /home/dafit/nimmerverse/nyx-substrate/src/nyx_substrate/__init__.py
  • /home/dafit/nimmerverse/nyx-substrate/src/nyx_substrate/database/__init__.py
  • /home/dafit/nimmerverse/nyx-substrate/src/nyx_substrate/database/connection.py
  • /home/dafit/nimmerverse/nyx-substrate/src/nyx_substrate/database/messages.py
  • /home/dafit/nimmerverse/nyx-substrate/src/nyx_substrate/database/variance.py
  • /home/dafit/nimmerverse/nyx-substrate/src/nyx_substrate/schemas/__init__.py
  • /home/dafit/nimmerverse/nyx-substrate/src/nyx_substrate/schemas/variance.py
  • /home/dafit/nimmerverse/nyx-substrate/README.md

nyx-probing:

  • /home/dafit/nimmerverse/nyx-probing/nyx_probing/runners/__init__.py
  • /home/dafit/nimmerverse/nyx-probing/nyx_probing/runners/variance_runner.py
  • /home/dafit/nimmerverse/nyx-probing/nyx_probing/cli/variance.py

To Modify

nyx-probing:

  • /home/dafit/nimmerverse/nyx-probing/pyproject.toml (add nyx-substrate dependency)
  • /home/dafit/nimmerverse/nyx-probing/nyx_probing/cli/__init__.py (register variance commands)

🧪 Testing Plan

1. nyx-substrate Unit Tests

import uuid

from nyx_substrate.database import (
    PhoebeConnection,
    VarianceProbeDAO,
    write_partnership_message,
)

# Test connection
def test_phoebe_connection():
    conn = PhoebeConnection()
    assert conn.test_connection()

# Test message write
def test_write_message():
    write_partnership_message("Test session", "architecture_update")
    # Verify in phoebe

# Test variance DAO
def test_variance_insert():
    dao = VarianceProbeDAO(PhoebeConnection())
    session_id = uuid.uuid4()
    dao.insert_run(
        session_id=session_id,
        term="test",
        run_number=1,
        depth=2,
        rounds=3,
        echo_types=["EXPANDS", "CONFIRMS", "CIRCULAR"],
        chain=["test", "expanded", "confirmed"],
    )
    stats = dao.get_session_stats(session_id)
    assert stats["total_runs"] == 1

2. Variance Collection Integration Test

# On prometheus (THE SPINE)
cd /home/dafit/nimmerverse/nyx-probing
source venv/bin/activate

# Install nyx-substrate in development mode
pip install -e ../nyx-substrate

# Run small variance test (10 runs)
nyx-probe variance collect "Geworfenheit" --runs 10

# Check phoebe
PGGSSENCMODE=disable psql -h phoebe.eachpath.local -U nimmerverse-user -d nimmerverse -c "
SELECT session_id, term, COUNT(*) as runs, AVG(depth) as avg_depth
FROM variance_probe_runs
GROUP BY session_id, term
ORDER BY session_id DESC
LIMIT 5;
"

# Expected: 1 session, 10 runs, avg_depth ~2.0

3. Full 1000x Baseline Run

# Depth-3 champions (from nyx-probing Phase 1)
nyx-probe variance collect "Geworfenheit" --runs 1000  # thrownness
nyx-probe variance collect "Vernunft" --runs 1000      # reason
nyx-probe variance collect "Erkenntnis" --runs 1000    # knowledge
nyx-probe variance collect "Pflicht" --runs 1000       # duty
nyx-probe variance collect "Aufhebung" --runs 1000     # sublation
nyx-probe variance collect "Wille" --runs 1000         # will

# Analyze variance
nyx-probe variance analyze "Geworfenheit"
# Expected: Distribution histogram, depth variance, chain patterns

🌊 Data Flow

Variance Collection Workflow

User: nyx-probe variance collect "Geworfenheit" --runs 1000
    ↓
VarianceRunner.run_session()
    ↓
Loop 1000x:
    EchoProbe.probe("Geworfenheit")
        ↓
    Returns EchoProbeResult
        ↓
    VarianceProbeDAO.insert_run()
        ↓
    INSERT INTO phoebe.variance_probe_runs
    ↓
Return session_id
    ↓
Display: "✅ 1000 runs complete. Session: <uuid>"

Future Integration (Phase 2+)

Training Loop:
    ↓
DriftProbe.probe_lite()  [every 100 steps]
    ↓
Store metrics in phoebe.drift_checkpoints (new table)
    ↓
Management Portal API: GET /api/v1/metrics/training
    ↓
Godot Command Center displays live DriftProbe charts

🎯 Success Criteria

Phase 1 Complete When:

  1. nyx-substrate package installable via pip (pip install -e .)
  2. PhoebeConnection works from aynee + prometheus
  3. variance_probe_runs table created in phoebe
  4. nyx-probe variance collect command runs successfully
  5. 1000x run completes and stores in phoebe
  6. nyx-probe variance stats <SESSION_ID> displays:
    • Total runs
    • Depth distribution (0/1/2/3 counts)
    • Most common echo_types
    • Chain length variance
  7. All 6 depth-3 champions have baseline variance data in phoebe

📚 Phase 1D: Corpus Extraction Pipeline (NEW)

Goal

Extract vocabulary and co-occurrence metrics from nimmerverse vault for RAG policy development.

Integration Point: Feeds into RAG-as-Scaffold.md progressive policy validation.

Deliverables

1. VocabExtractor (nyx_probing/extractors/vocab_extractor.py)

Purpose: Extract TF-IDF vocabulary glossary from markdown corpus

Features:

  • Scans all .md files (skips venv, hidden dirs)
  • Strips YAML frontmatter, code blocks, markdown syntax
  • Tokenizes with compound term support (hyphenated, CamelCase)
  • Calculates TF, DF, TF-IDF per term (see the sketch below)
  • Exports to CSV and JSON

Output (data/nimmerverse_glossary.json):

{
  "metadata": {
    "total_docs": 263,
    "total_tokens": 130229,
    "unique_terms": 5243
  },
  "terms": [
    {"term": "nyx", "tf": 1073, "df": 137, "tfidf": 1149.70, ...},
    ...
  ]
}

Usage:

python3 nyx_probing/extractors/vocab_extractor.py /path/to/vault output.csv

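The core of the extractor is plain TF-IDF over per-document token counts. A minimal sketch of that computation (frontmatter/code-block stripping and CamelCase splitting omitted); the token regex is an assumption, but the smoothed IDF is consistent with the sample output above (nyx: 1073 × ln(1 + 263/137) ≈ 1149.70):

# TF-IDF core sketch: tf = corpus-wide count, df = docs containing term,
# tfidf = tf * ln(1 + N/df). Skips venv/ and hidden directories.
import math
import re
from collections import Counter
from pathlib import Path

TOKEN_RE = re.compile(r"[A-Za-zÄÖÜäöüß][\w-]+")   # keeps hyphenated compounds

def extract_vocab(vault: Path) -> list[dict]:
    tf: Counter = Counter()
    df: Counter = Counter()
    docs = 0
    for md in vault.rglob("*.md"):
        if any(p.startswith(".") or p == "venv" for p in md.parts):
            continue
        tokens = [t.lower() for t in TOKEN_RE.findall(md.read_text(encoding="utf-8"))]
        docs += 1
        tf.update(tokens)
        df.update(set(tokens))
    return [{"term": t, "tf": tf[t], "df": df[t],
             "tfidf": tf[t] * math.log(1 + docs / df[t])}
            for t in sorted(tf, key=lambda t: -tf[t])]
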
2. CoOccurrenceAnalyzer (nyx_probing/extractors/cooccurrence.py)

Purpose: Analyze term co-occurrence for chunking and topology safety

Features:

  • Computes PMI (Pointwise Mutual Information)
  • Computes Jaccard similarity and Dice coefficient
  • Generates anchor term signatures (for DriftProbe-lite)
  • Produces chunking recommendations based on cohesion

Key Metrics:

Metric    Formula                        Use Case
PMI       log2( P(a,b) / (P(a)·P(b)) )   Semantic association strength
Jaccard   |A∩B| / |A∪B|                  Term overlap similarity
Dice      2·|A∩B| / (|A|+|B|)            Chunking cohesion

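Given the sets of documents containing each term, all three metrics fall out of simple set counts. A sketch, assuming a pair "co-occurs" when both terms appear in the same file:

# Pairwise metrics over the sets of documents containing each term.
import math

def pair_metrics(docs_a: set, docs_b: set, n_docs: int) -> dict:
    both = len(docs_a & docs_b)
    if both == 0:
        return {"pmi": float("-inf"), "jaccard": 0.0, "dice": 0.0}
    p_a, p_b, p_ab = len(docs_a) / n_docs, len(docs_b) / n_docs, both / n_docs
    return {
        "pmi": math.log2(p_ab / (p_a * p_b)),            # association strength
        "jaccard": both / len(docs_a | docs_b),          # overlap similarity
        "dice": 2 * both / (len(docs_a) + len(docs_b)),  # chunking cohesion
    }
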
Anchor Signatures (for Policy Tier 3: Topology Safety):

nyx: chroma|chromadb|continuity|ingress|introspection
system: athena|freeipa|ipa|rocky|sssd
network: firewall|proxmox|saturn|vlan|vulkan

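A signature like those above can be derived from a term's strongest neighbors, sketched here as the top-5 by PMI, alphabetized and pipe-joined to match the examples (the shipped selection criteria may differ):

def anchor_signature(term: str, pmi: dict, k: int = 5) -> str:
    # pmi maps (term_a, term_b) pairs to PMI scores.
    neighbors = {b if a == term else a: score
                 for (a, b), score in pmi.items() if term in (a, b)}
    top = sorted(neighbors, key=neighbors.get, reverse=True)[:k]
    return f"{term}: " + "|".join(sorted(top))
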
Output (data/cooccurrence_analysis.json):

  • 18,169 co-occurrence pairs
  • 20 anchor signatures
  • 5 chunking recommendations

Usage:

python3 nyx_probing/extractors/cooccurrence.py /path/to/vault glossary.json output.json

RAG Policy Integration

These tools directly feed into RAG-as-Scaffold progressive policies:

Policy Tier                Tool                   Validation
Tier 2: Semantic Quality   CoOccurrenceAnalyzer   Dice = 1.0 terms are synonyms (de-duplicate)
Tier 3: Topology Safety    Anchor Signatures      New terms shouldn't change anchor neighbors
Tier 4: Cross-Reference    CoOccurrenceAnalyzer   High-PMI pairs should chunk together
Tier 5: Utility            VocabExtractor TF-IDF  Low TF-IDF terms have low utility

Files Created

nyx-probing/nyx_probing/extractors/:

  • __init__.py - Module exports
  • vocab_extractor.py - VocabExtractor class (~350 LOC)
  • cooccurrence.py - CoOccurrenceAnalyzer class (~400 LOC)

nyx-probing/data/:

  • nimmerverse_glossary.csv - 5,243 terms with TF-IDF
  • nimmerverse_glossary.json - Same with metadata
  • cooccurrence_analysis.csv - 18,169 pairs
  • cooccurrence_analysis.json - Full analysis with signatures

🔮 Future Phases (Not in Current Plan)

Phase 2: ChromaDB Integration (iris)

  • IrisClient wrapper in nyx-substrate
  • DecisionTrailStore, OrganResponseStore, EmbeddingStore
  • Create iris collections
  • Populate embeddings from nyx-probing results

Phase 3: LoRA Training Pipeline (nyx-training)

  • PEFT integration
  • Training data curriculum loader
  • DriftProbe checkpoint integration
  • Identity LoRA training automation

Phase 4: Weight Visualization (nyx-visualization)

  • 4K pixel space renderer (LoRA weights as images)
  • Rank decomposition explorer
  • Topology cluster visualization

Phase 5: Godot Command Center

  • FastAPI Management Portal backend
  • Godot frontend implementation
  • Real-time metrics display
  • Training dashboard

📚 References

Schema Documentation:

  • /home/dafit/nimmerverse/nyx-substrate/schema/phoebe/probing/variance_probe_runs.md
  • /home/dafit/nimmerverse/nyx-substrate/SCHEMA.md

Existing Code:

  • /home/dafit/nimmerverse/nyx-probing/nyx_probing/probes/echo_probe.py
  • /home/dafit/nimmerverse/nyx-probing/nyx_probing/core/probe_result.py
  • /home/dafit/nimmerverse/nyx-probing/nyx_probing/cli/probe.py

Architecture:

  • /home/dafit/nimmerverse/nimmerverse-sensory-network/Endgame-Vision.md
  • /home/dafit/nimmerverse/management-portal/Management-Portal.md

🌙 Philosophy

Modularity: Each tool is independent but speaks the same data language via nyx-substrate.

Simplicity: No over-engineering. Build what's needed for variance collection first.

Data First: All metrics flow through phoebe/iris. Visualization is separate concern.

Future-Ready: Design allows Godot integration later without refactoring.


Status: Ready for implementation approval
Estimated Scope: 15-20 files, ~1500 lines of Python
Hardware: Can develop on aynee, run variance on prometheus (THE SPINE)

🌙💜 The substrate holds. Clean interfaces. Composable tools. Data flows through the void.