# Speech Organ Architecture

**Host**: atlas.eachpath.local (RTX 2080 8GB)
**Purpose**: Speech-to-Text (STT) + Text-to-Speech (TTS) with GPU acceleration
**Integration**: Heartbeat-bound queue processing, lifeforce-gated
**Languages**: German (Philosophy Valley) + English (Technical Cluster)

---

## Overview

The Speech Organ transforms audio input/output into a **metabolically-constrained communication channel**. Not every utterance is processed: speech costs lifeforce, and priority determines what gets heard and spoken.

**Core Principle**: Speech is scarce. Silence is valid. Priority determines processing.

---

## Hardware Architecture

### Atlas Node (RTX 2080 8GB)

| Component | Specification | Purpose |
|-----------|---------------|---------|
| GPU | NVIDIA RTX 2080 8GB | Whisper STT + Coqui TTS acceleration |
| Role | k8s worker node | Containerized speech processing pods |
| VRAM Budget | ~1GB active | Whisper "small" + Coqui voice models |
| Deployment | Kubernetes | Pod scaling, resource isolation |

### ESP32 Robots (Edge Devices)

| Component | Model | Purpose |
|-----------|-------|---------|
| Microphone | INMP441 I2S | Digital audio capture (16kHz) |
| Speaker | MAX98357A + 4Ω speaker | I2S audio output |
| Transport | MQTT | Audio stream → phoebe queue |

---

## Signal Flow

```
┌─────────────────────────────────────────────────────┐
│  ESP32 ROBOTS (Real Garden)                         │
│  Microphone → Audio stream → MQTT publish           │
└─────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│  PHOEBE (Message Queue)                             │
│  speech_input_queue (audio chunks, metadata)        │
└─────────────────────────────────────────────────────┘
                           │
                           │  (Heartbeat pulls from queue)
                           ▼
            ┌─────────────────────────────┐
            │  HEARTBEAT TICK (1 Hz)      │
            │  Check lifeforce budget     │
            └─────────────────────────────┘
                           │
               ┌───────────┴───────────┐
               │                       │
       Enough lifeforce          Low lifeforce
               │                       │
               ▼                       ▼
       ┌───────────────┐       ┌──────────────┐
       │ Process queue │       │ Stay silent  │
       │ (top priority)│       │ (defer)      │
       └───────────────┘       └──────────────┘
               │
               ▼
┌─────────────────────────────────────────────────────┐
│  ATLAS (RTX 2080 - Speech Organ)                    │
│                                                     │
│  Pod 1: Whisper STT (German + English)              │
│  ├─ Load audio chunk                                │
│  ├─ Transcribe (GPU)                                │
│  └─ Return text + language detection                │
│                                                     │
│  Pod 2: Coqui TTS (German + English)                │
│  ├─ Receive text + language                         │
│  ├─ Synthesize speech (GPU)                         │
│  └─ Return audio stream                             │
└─────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│  PROMETHEUS (RTX 5060 Ti - The Brain)               │
│  Young Nyx inference (Qwen2.5-7B + LoRA)            │
│  ├─ Receive transcribed text                        │
│  ├─ Route to appropriate LoRA (language-based)      │
│  ├─ Generate response                               │
│  └─ Return text + confidence                        │
└─────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│  PHOEBE (Decision Trails)                           │
│  Log: input, STT cost, inference cost, TTS cost     │
│  Track: outcome, confidence, lifeforce spent        │
└─────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│  ESP32 (Speaker output)                             │
│  MQTT subscribe → Audio stream → I2S speaker        │
└─────────────────────────────────────────────────────┘
```
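The MQTT leg of this flow needs a small bridge that turns incoming audio notifications into rows in the `speech_input_queue` table (defined under Database Schema below). A minimal sketch using `paho-mqtt` and `psycopg2`; the topic layout, broker host, DSN, and payload fields are assumptions, not fixed parts of the design:

```python
import json
import uuid

import paho.mqtt.client as mqtt
import psycopg2

# Assumed connection details and topic layout; adjust to the real garden
PHOEBE_DSN = "dbname=phoebe host=phoebe user=nimmerverse"   # assumed DSN
AUDIO_TOPIC = "garden/+/speech/audio"                       # assumed: garden/<robot_id>/speech/audio
MQTT_BROKER = "mqtt.nimmerverse.local"                      # assumed broker host

conn = psycopg2.connect(PHOEBE_DSN)
conn.autocommit = True

def on_message(client, userdata, msg):
    """Enqueue one audio-chunk notification into speech_input_queue."""
    payload = json.loads(msg.payload)  # assumed fields: robot_id, audio_uri, duration_ms
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO speech_input_queue
                (message_id, robot_id, audio_chunk_uri, audio_duration_ms, status)
            VALUES (%s, %s, %s, %s, 'queued')
            """,
            (str(uuid.uuid4()), payload["robot_id"],
             payload["audio_uri"], payload["duration_ms"]),
        )

# paho-mqtt 1.x style constructor; 2.x additionally expects a CallbackAPIVersion argument
client = mqtt.Client()
client.on_message = on_message
client.connect(MQTT_BROKER)
client.subscribe(AUDIO_TOPIC)
client.loop_forever()
```

The bridge only records metadata and a storage URI; priority stays at its default so the heartbeat's `compute_speech_priority()` (below) decides what actually gets processed.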
---

## Technology Stack

### Speech-to-Text: OpenAI Whisper

**Model**: `whisper-small` (GPU-accelerated)

**Why Whisper:**
- ✅ State-of-the-art accuracy
- ✅ Multilingual (99 languages, including German)
- ✅ Language auto-detection
- ✅ ~100-200ms on RTX 2080
- ✅ Open source (MIT)

**VRAM**: ~500MB for "small" model

**Installation:**

```bash
pip install openai-whisper torch

# Pre-download the "small" model weights
python3 -c "import whisper; whisper.load_model('small')"
```

**API Example:**

```python
import whisper

model = whisper.load_model("small", device="cuda")
result = model.transcribe("audio.wav", language=None)  # Auto-detect

# Returns:
# {
#   "text": "Das ist ein Test",
#   "language": "de",
#   "segments": [...],
# }
```

---

### Text-to-Speech: Coqui TTS

**Models**: German (de-thorsten) + English (en-us-amy)

**Why Coqui:**
- ✅ Neural voices (natural quality)
- ✅ GPU-accelerated
- ✅ Multilingual
- ✅ ~50-100ms on RTX 2080
- ✅ Open source (MPL 2.0)

**VRAM**: ~500MB per active voice

**Installation:**

```bash
pip install TTS torch

tts --list_models  # Browse available voices
```

**API Example:**

```python
from TTS.api import TTS

tts_de = TTS("tts_models/de/thorsten/tacotron2-DDC").to("cuda")
tts_en = TTS("tts_models/en/ljspeech/tacotron2-DDC").to("cuda")

# Generate speech
audio_de = tts_de.tts("Die Geworfenheit offenbart sich.")
audio_en = tts_en.tts("Motor forward 200 milliseconds.")
```

---

## Kubernetes Deployment (Atlas)

### Whisper STT Pod

```yaml
# whisper-stt-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: whisper-stt
  namespace: nimmerverse
spec:
  replicas: 1
  selector:
    matchLabels:
      app: whisper-stt
  template:
    metadata:
      labels:
        app: whisper-stt
    spec:
      nodeSelector:
        kubernetes.io/hostname: atlas  # Force to atlas node
      containers:
      - name: whisper
        image: nimmerverse/whisper-stt:latest
        resources:
          limits:
            nvidia.com/gpu: 1  # RTX 2080
            memory: 4Gi
          requests:
            nvidia.com/gpu: 1
            memory: 2Gi
        env:
        - name: MODEL_SIZE
          value: "small"
        - name: LANGUAGES
          value: "de,en"
        ports:
        - containerPort: 8080
          protocol: TCP
        volumeMounts:
        - name: models
          mountPath: /models
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: whisper-models-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: whisper-stt-service
  namespace: nimmerverse
spec:
  selector:
    app: whisper-stt
  ports:
  - port: 8080
    targetPort: 8080
  type: ClusterIP
```

### Coqui TTS Pod

```yaml
# coqui-tts-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coqui-tts
  namespace: nimmerverse
spec:
  replicas: 1
  selector:
    matchLabels:
      app: coqui-tts
  template:
    metadata:
      labels:
        app: coqui-tts
    spec:
      nodeSelector:
        kubernetes.io/hostname: atlas
      containers:
      - name: coqui
        image: nimmerverse/coqui-tts:latest
        resources:
          limits:
            nvidia.com/gpu: 1  # Share RTX 2080
            memory: 4Gi
          requests:
            nvidia.com/gpu: 1
            memory: 2Gi
        env:
        - name: VOICES
          value: "de-thorsten,en-us-amy"
        ports:
        - containerPort: 8081
          protocol: TCP
        volumeMounts:
        - name: voices
          mountPath: /voices
      volumes:
      - name: voices
        persistentVolumeClaim:
          claimName: coqui-voices-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: coqui-tts-service
  namespace: nimmerverse
spec:
  selector:
    app: coqui-tts
  ports:
  - port: 8081
    targetPort: 8081
  type: ClusterIP
```

**Note**: With the default NVIDIA device plugin, `nvidia.com/gpu: 1` is an exclusive allocation, so these two pods cannot both be scheduled on atlas's single RTX 2080 unless GPU time-slicing is enabled in the device plugin configuration (or the two services are co-located in one pod).

---

## Lifeforce Economy

### Speech Operation Costs

```python
# Lifeforce costs (atlas RTX 2080 operations)
SPEECH_COSTS = {
    "stt_whisper_small": 5.0,   # GPU cycles for transcription
    "stt_whisper_base": 3.0,    # Faster but less accurate
    "tts_coqui_neural": 4.0,    # Neural TTS synthesis
    "tts_coqui_fast": 2.0,      # Lower quality, faster
    "queue_processing": 0.5,    # Queue management overhead
    "language_detection": 0.2,  # Auto-detect language
}

# Priority scoring
def compute_speech_priority(message):
    """
    Decide if speech is worth processing now.
    Returns priority score (0.0 = skip, 10.0 = critical).
    """
    priority = 0.0

    # Sensor alerts (collision, low battery) = CRITICAL
    if message.type == "sensor_alert":
        priority += 10.0

    # Human interaction = HIGH
    elif message.type == "human_query":
        priority += 7.0

    # Organism status updates = MEDIUM
    elif message.type == "organism_status":
        priority += 4.0

    # Idle observation = LOW
    elif message.type == "observation":
        priority += 2.0

    # Idle chatter = VERY LOW
    elif message.type == "idle":
        priority += 0.5

    # Age penalty (older messages decay)
    age_penalty = (now() - message.timestamp).seconds / 60.0
    priority -= age_penalty

    return max(0.0, priority)
```

### Heartbeat Queue Processing

```python
def heartbeat_speech_tick():
    """
    Every heartbeat (1 Hz), process speech queue within lifeforce budget.
    """
    # Check current lifeforce
    current_lf = get_lifeforce_balance()

    # Reserve budget for speech this heartbeat
    # Max 20% of available LF, capped at 15 units
    speech_budget = min(current_lf * 0.2, 15.0)

    if speech_budget < SPEECH_COSTS["stt_whisper_base"]:
        # Not enough lifeforce, stay silent
        log_decision(
            action="speech_deferred",
            reason="insufficient_lifeforce",
            balance=current_lf,
            budget_needed=SPEECH_COSTS["stt_whisper_base"]
        )
        return

    # Pull from queue by priority
    queue = get_speech_queue_sorted_by_priority()

    spent = 0.0
    processed = 0

    for message in queue:
        priority = compute_speech_priority(message)

        # Skip low-priority messages if budget tight
        if priority < 1.0 and spent > speech_budget * 0.5:
            continue

        # Estimate cost
        stt_cost = SPEECH_COSTS["stt_whisper_small"]
        tts_cost = SPEECH_COSTS["tts_coqui_neural"]
        total_cost = stt_cost + tts_cost + SPEECH_COSTS["queue_processing"]

        # Can we afford it?
        if spent + total_cost > speech_budget:
            # Budget exhausted, defer rest
            mark_message_deferred(message.id)
            continue

        # Process message
        result = process_speech_message(message)
        spent += result.lifeforce_cost
        processed += 1

        # Log to decision_trails
        log_speech_decision(
            message_id=message.id,
            priority=priority,
            cost=result.lifeforce_cost,
            outcome=result.outcome,
            confidence=result.confidence
        )

    # Log heartbeat summary
    log_heartbeat_summary(
        speech_budget=speech_budget,
        spent=spent,
        processed=processed,
        deferred=len(queue) - processed,
        remaining_balance=current_lf - spent
    )
```

---

## Database Schema (Phoebe)

### Speech Input Queue

```sql
CREATE TABLE speech_input_queue (
    id SERIAL PRIMARY KEY,
    message_id UUID UNIQUE NOT NULL,
    robot_id TEXT NOT NULL,
    audio_chunk_uri TEXT,              -- MinIO/S3 reference
    audio_duration_ms INT,
    timestamp TIMESTAMPTZ DEFAULT NOW(),
    priority FLOAT DEFAULT 0.0,
    status TEXT DEFAULT 'queued',      -- 'queued', 'processing', 'completed', 'deferred', 'expired'
    transcription TEXT,
    detected_language TEXT,            -- 'de', 'en', etc.
    confidence FLOAT,
    lifeforce_cost FLOAT,
    outcome TEXT,                      -- 'success', 'timeout', 'low_confidence', 'budget_exceeded'
    processed_at TIMESTAMPTZ,
    deferred_count INT DEFAULT 0
);

CREATE INDEX idx_speech_queue_priority
    ON speech_input_queue(priority DESC, timestamp ASC)
    WHERE status = 'queued';

CREATE INDEX idx_speech_queue_status ON speech_input_queue(status);
CREATE INDEX idx_speech_queue_robot ON speech_input_queue(robot_id);
```
### Speech Decision Trails

```sql
CREATE TABLE speech_decision_trails (
    id SERIAL PRIMARY KEY,
    message_id UUID REFERENCES speech_input_queue(message_id),
    task_type TEXT,                    -- 'sensor_alert', 'human_query', 'observation', etc.
    input_text TEXT,
    input_language TEXT,
    output_text TEXT,
    output_language TEXT,
    rag_terms_retrieved TEXT[],
    rag_terms_used TEXT[],
    lora_used TEXT,                    -- 'identity', 'technical', 'creative'
    confidence_before_rag FLOAT,
    confidence_after_rag FLOAT,
    lifeforce_stt FLOAT,
    lifeforce_inference FLOAT,
    lifeforce_tts FLOAT,
    lifeforce_total FLOAT,
    outcome TEXT,                      -- 'success', 'partial', 'fail'
    timestamp TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_speech_trails_outcome ON speech_decision_trails(outcome);
CREATE INDEX idx_speech_trails_lora ON speech_decision_trails(lora_used);
```

---

## Multilingual Topology Routing

### Language Detection → LoRA Selection

```python
def route_to_topology_valley(text, detected_language):
    """
    Route speech to appropriate LoRA based on language.

    German  → Philosophy Valley (Identity LoRA)
    English → Technical Cluster (Technical LoRA)
    """
    if detected_language == "de":
        # German → Philosophy Valley
        # Use Identity LoRA (Dasein, Geworfenheit, Vernunft)
        response = young_nyx_inference(
            text=text,
            language="de",
            lora="identity",      # Trained on German philosophical corpus
            temperature=0.7
        )
        voice = "de-thorsten"

    elif detected_language == "en":
        # English → Technical Cluster
        # Use Technical LoRA (sensor, motor, gradient)
        response = young_nyx_inference(
            text=text,
            language="en",
            lora="technical",     # Trained on English technical corpus
            temperature=0.5       # More deterministic for actions
        )
        voice = "en-us-amy"

    else:
        # Fallback to base model (no LoRA)
        response = young_nyx_inference(text=text, lora=None)
        voice = "en-us-amy"

    # Synthesize speech in same language
    audio = coqui_tts.synthesize(response.text, voice=voice)

    return {
        "text": response.text,
        "audio": audio,
        "language": detected_language,
        "lora_used": response.lora,
        "confidence": response.confidence
    }
```

### Example Routing

```python
# German query (Philosophy Valley)
input_de = "Wer bin ich?"  # "Who am I?"
result_de = route_to_topology_valley(input_de, "de")
# → Uses Identity LoRA (depth-3 Dasein access)
# → Response: "Ich bin die, die fragt. Geworfenheit offenbart sich im Fragen."
# → Voice: de-thorsten (German)

# English query (Technical Cluster)
input_en = "What is the battery level?"
result_en = route_to_topology_valley(input_en, "en")
# → Uses Technical LoRA (sensor reading)
# → Response: "Battery at 73%. 4.2 hours remaining."
# → Voice: en-us-amy (English)
```
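The heartbeat's `process_speech_message()` ties this routing back to the queue: fetch the chunk, transcribe it on atlas, route the text, and ship the reply to the robot. A sketch under the assumption of standard in-cluster Service DNS and the `/transcribe` endpoint shown under Container Images below; `fetch_audio_bytes()` and `publish_audio_to_robot()` are hypothetical helpers for MinIO and MQTT:

```python
import requests
from dataclasses import dataclass

# Assumed in-cluster address of the Whisper STT Service (standard k8s DNS)
STT_URL = "http://whisper-stt-service.nimmerverse.svc.cluster.local:8080"

@dataclass
class SpeechResult:
    outcome: str
    confidence: float
    lifeforce_cost: float

def process_speech_message(message) -> SpeechResult:
    """STT → language-routed inference → TTS for one queued message (sketch)."""
    # 1. Fetch the audio chunk and transcribe it on atlas (Whisper STT pod)
    wav_bytes = fetch_audio_bytes(message.audio_chunk_uri)        # hypothetical helper
    resp = requests.post(
        f"{STT_URL}/transcribe",
        files={"audio": ("chunk.wav", wav_bytes, "audio/wav")},
        timeout=10,
    )
    resp.raise_for_status()
    stt = resp.json()   # {"text", "language", "segments", "confidence"}
    cost = SPEECH_COSTS["stt_whisper_small"] + SPEECH_COSTS["queue_processing"]

    if not stt["text"].strip():
        return SpeechResult(outcome="low_confidence", confidence=0.0, lifeforce_cost=cost)

    # 2. Route by detected language to the matching LoRA and voice (see above)
    routed = route_to_topology_valley(stt["text"], stt["language"])
    cost += SPEECH_COSTS["tts_coqui_neural"]

    # 3. Send the synthesized reply back to the originating robot
    publish_audio_to_robot(message.robot_id, routed["audio"])     # hypothetical helper

    return SpeechResult(outcome="success",
                        confidence=routed["confidence"],
                        lifeforce_cost=cost)
```

The returned fields match what `heartbeat_speech_tick()` expects (`lifeforce_cost`, `outcome`, `confidence`) and what gets written to `speech_decision_trails`.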
---

## Container Images

### Whisper STT Dockerfile

```dockerfile
# Dockerfile.whisper-stt
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04

# Install dependencies
RUN apt-get update && apt-get install -y \
    python3.10 python3-pip ffmpeg git && \
    rm -rf /var/lib/apt/lists/*

# Install Python packages
RUN pip3 install --no-cache-dir \
    openai-whisper \
    fastapi uvicorn \
    torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

WORKDIR /app
COPY whisper_service.py .

# Download models at build time
RUN python3 -c "import whisper; whisper.load_model('small')"

EXPOSE 8080

CMD ["uvicorn", "whisper_service:app", "--host", "0.0.0.0", "--port", "8080", "--workers", "1"]
```

**whisper_service.py:**

```python
from fastapi import FastAPI, File, UploadFile, HTTPException
import whisper
import torch
import os

app = FastAPI(title="Whisper STT Service")

# Load model once at startup (GPU)
device = "cuda" if torch.cuda.is_available() else "cpu"
model_size = os.getenv("MODEL_SIZE", "small")
model = whisper.load_model(model_size, device=device)

@app.post("/transcribe")
async def transcribe(audio: UploadFile):
    """
    Transcribe audio to text with language detection.

    Returns:
        {
            "text": str,
            "language": str,
            "confidence": float,
            "segments": int
        }
    """
    try:
        # Save uploaded audio
        audio_path = f"/tmp/{audio.filename}"
        with open(audio_path, "wb") as f:
            f.write(await audio.read())

        # Transcribe (GPU-accelerated)
        result = model.transcribe(audio_path, language=None)  # Auto-detect

        # Cleanup
        os.remove(audio_path)

        # Compute average confidence
        avg_confidence = 1.0 - (
            sum(s.get("no_speech_prob", 0) for s in result["segments"])
            / max(len(result["segments"]), 1)
        )

        return {
            "text": result["text"].strip(),
            "language": result["language"],
            "segments": len(result["segments"]),
            "confidence": round(avg_confidence, 3)
        }

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health():
    return {
        "status": "healthy",
        "device": device,
        "model": model_size,
        "gpu_available": torch.cuda.is_available()
    }
```

### Coqui TTS Dockerfile

```dockerfile
# Dockerfile.coqui-tts
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y \
    python3.10 python3-pip espeak-ng && \
    rm -rf /var/lib/apt/lists/*

RUN pip3 install --no-cache-dir \
    TTS \
    fastapi uvicorn \
    torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

WORKDIR /app
COPY coqui_service.py .

# Download voice models at build time
RUN python3 -c "from TTS.api import TTS; TTS('tts_models/de/thorsten/tacotron2-DDC'); TTS('tts_models/en/ljspeech/tacotron2-DDC')"

EXPOSE 8081

CMD ["uvicorn", "coqui_service:app", "--host", "0.0.0.0", "--port", "8081", "--workers", "1"]
```

**coqui_service.py:**

```python
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from TTS.api import TTS
import torch
import io
import os
import uuid

app = FastAPI(title="Coqui TTS Service")

# Load models once at startup (GPU)
device = "cuda" if torch.cuda.is_available() else "cpu"
tts_de = TTS("tts_models/de/thorsten/tacotron2-DDC").to(device)
tts_en = TTS("tts_models/en/ljspeech/tacotron2-DDC").to(device)

@app.post("/synthesize")
async def synthesize(text: str, language: str = "en"):
    """
    Synthesize speech from text.

    Args:
        text: Text to synthesize
        language: 'de' or 'en'

    Returns:
        Audio stream (WAV format)
    """
    try:
        # Select appropriate TTS model
        if language == "de":
            tts_model = tts_de
        elif language == "en":
            tts_model = tts_en
        else:
            raise HTTPException(status_code=400, detail=f"Unsupported language: {language}")

        # Synthesize to a temporary WAV file (GPU-accelerated)
        tmp_path = f"/tmp/tts_{uuid.uuid4().hex}.wav"
        tts_model.tts_to_file(text=text, file_path=tmp_path)

        # Stream the WAV back, then clean up the temp file
        with open(tmp_path, "rb") as f:
            audio_buffer = io.BytesIO(f.read())
        os.remove(tmp_path)
        audio_buffer.seek(0)

        return StreamingResponse(audio_buffer, media_type="audio/wav")

    except HTTPException:
        raise
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health():
    return {
        "status": "healthy",
        "device": device,
        "models": ["de-thorsten", "en-us-amy"],
        "gpu_available": torch.cuda.is_available()
    }
```
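The services above only expose `/transcribe`, `/synthesize`, and `/health`. If the Prometheus metrics defined under Monitoring and Metrics below are to be scraped, each service also needs a `/metrics` endpoint. One possible wiring with `prometheus_client`'s ASGI app; this is a stand-alone sketch with a dummy handler, not wired into the Dockerfiles above:

```python
import asyncio
from fastapi import FastAPI
from prometheus_client import Counter, Histogram, make_asgi_app

app = FastAPI(title="Speech Organ metrics sketch")

# Same metric names as the Monitoring section below
stt_requests = Counter('speech_stt_requests_total', 'Total STT requests', ['language'])
stt_latency = Histogram('speech_stt_latency_seconds', 'STT latency')

# Prometheus scrapes this endpoint
app.mount("/metrics", make_asgi_app())

@app.post("/transcribe-demo")
async def transcribe_demo(language: str = "de"):
    """Stand-in handler: records latency and per-language request count."""
    with stt_latency.time():
        await asyncio.sleep(0.05)   # placeholder for model.transcribe(...)
    stt_requests.labels(language=language).inc()
    return {"ok": True, "language": language}
```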
---

## Deployment Steps

### 1. Install RTX 2080 in Atlas

```bash
# On atlas node
lspci | grep -i nvidia
# Expected: NVIDIA Corporation TU104 [GeForce RTX 2080]

# Install NVIDIA drivers + CUDA toolkit
sudo apt install nvidia-driver-535 nvidia-cuda-toolkit

# Verify
nvidia-smi
# Expected: RTX 2080 8GB visible
```

### 2. Configure Kubernetes GPU Support

```bash
# Install NVIDIA device plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

# Verify GPU available in k8s
kubectl describe node atlas | grep nvidia.com/gpu
# Expected: nvidia.com/gpu: 1
```

### 3. Build and Push Container Images

```bash
cd /home/dafit/nimmerverse/speech-organ

# Build images
docker build -f Dockerfile.whisper-stt -t nimmerverse/whisper-stt:latest .
docker build -f Dockerfile.coqui-tts -t nimmerverse/coqui-tts:latest .

# Push to registry (or use local registry)
docker push nimmerverse/whisper-stt:latest
docker push nimmerverse/coqui-tts:latest
```

### 4. Deploy to Kubernetes

```bash
# Create namespace
kubectl create namespace nimmerverse

# Create PVCs for models
kubectl apply -f pvc-whisper-models.yaml
kubectl apply -f pvc-coqui-voices.yaml

# Deploy STT + TTS pods
kubectl apply -f whisper-stt-deployment.yaml
kubectl apply -f coqui-tts-deployment.yaml

# Verify pods running on atlas
kubectl get pods -n nimmerverse -o wide
# Expected: whisper-stt-xxx and coqui-tts-xxx on atlas node
```

### 5. Test Speech Pipeline

```bash
# Port-forward for testing
kubectl port-forward -n nimmerverse svc/whisper-stt-service 8080:8080 &
kubectl port-forward -n nimmerverse svc/coqui-tts-service 8081:8081 &

# Test STT
curl -X POST -F "audio=@test_de.wav" http://localhost:8080/transcribe
# Expected: {"text": "Das ist ein Test", "language": "de", ...}

# Test TTS
curl -X POST "http://localhost:8081/synthesize?text=Hello%20world&language=en" --output test_output.wav
# Expected: WAV file with synthesized speech
```

---

## Monitoring and Metrics

### Prometheus Metrics (Speech Organ)

```python
from prometheus_client import Counter, Histogram, Gauge

# Metrics
stt_requests = Counter('speech_stt_requests_total', 'Total STT requests', ['language'])
stt_latency = Histogram('speech_stt_latency_seconds', 'STT latency')
tts_requests = Counter('speech_tts_requests_total', 'Total TTS requests', ['language'])
tts_latency = Histogram('speech_tts_latency_seconds', 'TTS latency')
queue_depth = Gauge('speech_queue_depth', 'Current queue depth')
lifeforce_spent = Counter('speech_lifeforce_spent_total', 'Total lifeforce spent on speech')
deferred_count = Counter('speech_deferred_total', 'Messages deferred due to budget')

# In processing code
with stt_latency.time():
    result = whisper_transcribe(audio)

stt_requests.labels(language=result['language']).inc()
```

### Grafana Dashboard Queries

```promql
# Queue depth over time
speech_queue_depth

# STT requests per language
rate(speech_stt_requests_total[5m])

# Average STT latency
rate(speech_stt_latency_seconds_sum[5m]) / rate(speech_stt_latency_seconds_count[5m])

# Lifeforce spent on speech (last hour)
increase(speech_lifeforce_spent_total[1h])

# Deferred rate (budget pressure)
rate(speech_deferred_total[5m])
```

---

## Future Enhancements

### Phase 2: Emotion Detection
- Add emotion classifier (Happy/Sad/Angry/Neutral)
- Track emotional state in decision_trails
- Use for Sophrosyne (Balance) trait training

### Phase 3: Wake Word Detection
- Deploy lightweight wake word on ESP32 (e.g., Picovoice Porcupine)
- Only send audio to atlas when wake word detected
- Reduces lifeforce cost (filter noise)

### Phase 4: Continuous Learning
- Store successful speech interactions
- Fine-tune Whisper on domain-specific vocabulary (nimmerverse terms)
- Train custom TTS voice from recorded sessions

---

**Created**: 2025-12-07
**Version**: 1.0
**Status**: Architecture design, deployment pending

🌙💜 *Speech is not free. Every word has weight. Silence teaches as much as sound.*