feat: add organ and nervous system modular architecture
Created modular architecture for organs (hardware) and nerves (behavioral primitives):

## Organ Architecture (Hardware Substrate)
- Created architecture/Organ-Index.md: hardware capabilities catalog
- Created architecture/organs/Speech-Organ.md: complete speech processing architecture
  - Atlas (RTX 2080 8GB) deployment
  - Whisper STT + Coqui TTS (GPU-accelerated, multilingual)
  - Kubernetes pod specs, Dockerfiles, service code
  - Heartbeat-bound queue processing, lifeforce-gated priority
  - German (Philosophy Valley) + English (Technical Cluster) routing
  - Database schemas, monitoring metrics

## Nervous System Architecture (Behavioral Primitives)
- Created architecture/nerves/Nervous-Index.md: nerve catalog and evolution framework
  - Deliberate (LLM) → Hybrid (heuristics) → Reflex (compiled) evolution
  - Lifeforce costs per state/transition
  - Organ dependency declarations
  - RLVR training integration
- Created architecture/nerves/Collision-Avoidance.md: complete example reflex nerve
  - Full state machine implementation (IDLE → DETECT → EVALUATE → EVADE → RESUME)
  - Evolution from 10 LF/1000ms (deliberate) → 2.5 LF/200ms (reflex)
  - Edge cases, training data, metrics
- Moved architecture/Nervous-Protocol.md → architecture/nerves/
  - Three-tier protocol belongs with nerve implementations
- Updated architecture/Nervous-System.md: added crosslinks to nerves/

## RAG Knowledge Pipeline
- Extended operations/RAG-as-Scaffold.md with "Knowledge Acquisition Pipeline" section
  - Vault extraction → Staging area → Progressive policy validation
  - Two-tier RAG (Discovered vs Hidden knowledge)
  - RAG utility measurement for LoRA training signals
  - Policy evolution triggers (increasing standards as Young Nyx matures)
  - Quality gates (mythology weight, AI assistant bias, topology safety)

## Architecture Principles
- Organs = hardware capabilities (Speech, Vision future)
- Nerves = behavioral state machines (Collision, Charging future)
- Both use lifeforce economy, heartbeat synchronization, priority queues
- Nerves compose organs into coherent behaviors
- Reflexes emerge from repetition (60% cost reduction, 80% latency reduction)

Documentation: ~3500 lines total
- Speech-Organ.md: ~850 lines
- Nervous-Index.md: ~500 lines
- Collision-Avoidance.md: ~800 lines
- RAG knowledge pipeline: ~260 lines

🌙💜 Generated with Claude Code

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
architecture/organs/Speech-Organ.md (new file, 888 lines added)
# Speech Organ Architecture

**Host**: atlas.eachpath.local (RTX 2080 8GB)
**Purpose**: Speech-to-Text (STT) + Text-to-Speech (TTS) with GPU acceleration
**Integration**: Heartbeat-bound queue processing, lifeforce-gated
**Languages**: German (Philosophy Valley) + English (Technical Cluster)

---

## Overview

The Speech Organ transforms audio input/output into a **metabolically constrained communication channel**. Not every utterance is processed: speech costs lifeforce, and priority determines what gets heard and spoken.

**Core Principle**: Speech is scarce. Silence is valid. Priority determines processing.

---
## Hardware Architecture

### Atlas Node (RTX 2080 8GB)

| Component | Specification | Purpose |
|-----------|---------------|---------|
| GPU | NVIDIA RTX 2080 8GB | Whisper STT + Coqui TTS acceleration |
| Role | k8s worker node | Containerized speech processing pods |
| VRAM Budget | ~1.5GB active | Whisper "small" (~500MB) + two Coqui voices (~500MB each) |
| Deployment | Kubernetes | Pod scaling, resource isolation |

### ESP32 Robots (Edge Devices)

| Component | Model | Purpose |
|-----------|-------|---------|
| Microphone | INMP441 I2S | Digital audio capture (16kHz) |
| Speaker | MAX98357A + 4Ω speaker | I2S audio output |
| Transport | MQTT | Audio stream → phoebe queue (see bridge sketch below) |
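
The MQTT hop above is the only seam between the robots and phoebe. A minimal bridge sketch, assuming a JSON metadata payload, a hypothetical topic layout, and the `speech_input_queue` table defined later in this document (`paho-mqtt` and `psycopg2` assumed available):

```python
# Sketch: bridge ESP32 MQTT audio chunks into phoebe's speech_input_queue.
# Topic layout, payload shape, and DSN are assumptions, not a fixed protocol.
import json
import uuid

import paho.mqtt.client as mqtt
import psycopg2

PHOEBE_DSN = "dbname=nimmerverse host=phoebe.eachpath.local"   # hypothetical
TOPIC = "garden/+/speech/audio"   # hypothetical: garden/<robot_id>/speech/audio

conn = psycopg2.connect(PHOEBE_DSN)

def on_message(client, userdata, msg):
    # Payload assumed to be JSON metadata; the raw audio lives out-of-band
    # (MinIO/S3) and is referenced by URI, matching audio_chunk_uri in the schema.
    meta = json.loads(msg.payload)
    robot_id = msg.topic.split("/")[1]
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO speech_input_queue
                (message_id, robot_id, audio_chunk_uri, audio_duration_ms)
            VALUES (%s, %s, %s, %s)
            """,
            (str(uuid.uuid4()), robot_id, meta["uri"], meta["duration_ms"]),
        )

client = mqtt.Client()  # paho-mqtt 1.x style; 2.x requires a CallbackAPIVersion
client.on_message = on_message
client.connect("phoebe.eachpath.local", 1883)
client.subscribe(TOPIC)
client.loop_forever()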
---

## Signal Flow

```
┌─────────────────────────────────────────────────────┐
│ ESP32 ROBOTS (Real Garden)                          │
│ Microphone → Audio stream → MQTT publish            │
└─────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────┐
│ PHOEBE (Message Queue)                              │
│ speech_input_queue (audio chunks, metadata)         │
└─────────────────────────────────────────────────────┘
                          │
                          │ (Heartbeat pulls from queue)
                          ▼
            ┌─────────────────────────────┐
            │ HEARTBEAT TICK (1 Hz)       │
            │ Check lifeforce budget      │
            └─────────────────────────────┘
                          │
              ┌───────────┴───────────┐
              │                       │
      Enough lifeforce          Low lifeforce
              │                       │
              ▼                       ▼
      ┌───────────────┐       ┌──────────────┐
      │ Process queue │       │ Stay silent  │
      │ (top priority)│       │ (defer)      │
      └───────────────┘       └──────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────┐
│ ATLAS (RTX 2080 - Speech Organ)                     │
│                                                     │
│ Pod 1: Whisper STT (German + English)               │
│  ├─ Load audio chunk                                │
│  ├─ Transcribe (GPU)                                │
│  └─ Return text + language detection                │
│                                                     │
│ Pod 2: Coqui TTS (German + English)                 │
│  ├─ Receive text + language                         │
│  ├─ Synthesize speech (GPU)                         │
│  └─ Return audio stream                             │
└─────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────┐
│ PROMETHEUS (RTX 5060 Ti - The Brain)                │
│ Young Nyx inference (Qwen2.5-7B + LoRA)             │
│  ├─ Receive transcribed text                        │
│  ├─ Route to appropriate LoRA (language-based)      │
│  ├─ Generate response                               │
│  └─ Return text + confidence                        │
└─────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────┐
│ PHOEBE (Decision Trails)                            │
│ Log: input, STT cost, inference cost, TTS cost      │
│ Track: outcome, confidence, lifeforce spent         │
└─────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────┐
│ ESP32 (Speaker output)                              │
│ MQTT subscribe → Audio stream → I2S speaker         │
└─────────────────────────────────────────────────────┘
```
---

## Technology Stack

### Speech-to-Text: OpenAI Whisper

**Model**: `whisper-small` (GPU-accelerated)

**Why Whisper:**
- ✅ State-of-the-art accuracy
- ✅ Multilingual (99 languages, including German)
- ✅ Language auto-detection
- ✅ ~100-200ms on RTX 2080
- ✅ Open source (MIT)

**VRAM**: ~500MB for "small" model

**Installation:**
```bash
pip install openai-whisper torch
python3 -c "import whisper; whisper.load_model('small')"
```

**API Example:**
```python
import whisper

model = whisper.load_model("small", device="cuda")
result = model.transcribe("audio.wav", language=None)  # Auto-detect

# Returns:
# {
#     "text": "Das ist ein Test",
#     "language": "de",
#     "segments": [...],
# }
```
---

### Text-to-Speech: Coqui TTS

**Models**: German (de-thorsten) + English (en-us-amy)

**Why Coqui:**
- ✅ Neural voices (natural quality)
- ✅ GPU-accelerated
- ✅ Multilingual
- ✅ ~50-100ms on RTX 2080
- ✅ Open source (MPL 2.0)

**VRAM**: ~500MB per active voice

**Installation:**
```bash
pip install TTS torch
tts --list_models  # Browse available voices
```

**API Example:**
```python
from TTS.api import TTS

tts_de = TTS("tts_models/de/thorsten/tacotron2-DDC").to("cuda")
tts_en = TTS("tts_models/en/ljspeech/tacotron2-DDC").to("cuda")

# Generate speech
audio_de = tts_de.tts("Die Geworfenheit offenbart sich.")
audio_en = tts_en.tts("Motor forward 200 milliseconds.")
```
---

## Kubernetes Deployment (Atlas)

### Whisper STT Pod

```yaml
# whisper-stt-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: whisper-stt
  namespace: nimmerverse
spec:
  replicas: 1
  selector:
    matchLabels:
      app: whisper-stt
  template:
    metadata:
      labels:
        app: whisper-stt
    spec:
      nodeSelector:
        kubernetes.io/hostname: atlas  # Force to atlas node
      containers:
      - name: whisper
        image: nimmerverse/whisper-stt:latest
        resources:
          limits:
            nvidia.com/gpu: 1  # RTX 2080
            memory: 4Gi
          requests:
            nvidia.com/gpu: 1
            memory: 2Gi
        env:
        - name: MODEL_SIZE
          value: "small"
        - name: LANGUAGES
          value: "de,en"
        ports:
        - containerPort: 8080
          protocol: TCP
        volumeMounts:
        - name: models
          mountPath: /models
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: whisper-models-pvc

---
apiVersion: v1
kind: Service
metadata:
  name: whisper-stt-service
  namespace: nimmerverse
spec:
  selector:
    app: whisper-stt
  ports:
  - port: 8080
    targetPort: 8080
  type: ClusterIP
```
### Coqui TTS Pod

```yaml
# coqui-tts-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coqui-tts
  namespace: nimmerverse
spec:
  replicas: 1
  selector:
    matchLabels:
      app: coqui-tts
  template:
    metadata:
      labels:
        app: coqui-tts
    spec:
      nodeSelector:
        kubernetes.io/hostname: atlas
      containers:
      - name: coqui
        image: nimmerverse/coqui-tts:latest
        resources:
          limits:
            # Shares the single RTX 2080 with the STT pod; needs device-plugin
            # time-slicing to co-schedule (see Deployment Steps)
            nvidia.com/gpu: 1
            memory: 4Gi
          requests:
            nvidia.com/gpu: 1
            memory: 2Gi
        env:
        - name: VOICES
          value: "de-thorsten,en-us-amy"
        ports:
        - containerPort: 8081
          protocol: TCP
        volumeMounts:
        - name: voices
          mountPath: /voices
      volumes:
      - name: voices
        persistentVolumeClaim:
          claimName: coqui-voices-pvc

---
apiVersion: v1
kind: Service
metadata:
  name: coqui-tts-service
  namespace: nimmerverse
spec:
  selector:
    app: coqui-tts
  ports:
  - port: 8081
    targetPort: 8081
  type: ClusterIP
```
---

## Lifeforce Economy

### Speech Operation Costs

```python
# Lifeforce costs (atlas RTX 2080 operations)
SPEECH_COSTS = {
    "stt_whisper_small": 5.0,   # GPU cycles for transcription
    "stt_whisper_base": 3.0,    # Faster but less accurate
    "tts_coqui_neural": 4.0,    # Neural TTS synthesis
    "tts_coqui_fast": 2.0,      # Lower quality, faster
    "queue_processing": 0.5,    # Queue management overhead
    "language_detection": 0.2,  # Auto-detect language
}

# Priority scoring
def compute_speech_priority(message):
    """
    Decide if speech is worth processing now.
    Returns priority score (0.0 = skip, 10.0 = critical).
    """
    priority = 0.0

    # Sensor alerts (collision, low battery) = CRITICAL
    if message.type == "sensor_alert":
        priority += 10.0

    # Human interaction = HIGH
    elif message.type == "human_query":
        priority += 7.0

    # Organism status updates = MEDIUM
    elif message.type == "organism_status":
        priority += 4.0

    # Idle observation = LOW
    elif message.type == "observation":
        priority += 2.0

    # Idle chatter = VERY LOW
    elif message.type == "idle":
        priority += 0.5

    # Age penalty (older messages decay, one point per minute)
    age_penalty = (now() - message.timestamp).total_seconds() / 60.0
    priority -= age_penalty

    return max(0.0, priority)
```
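
For intuition on the age penalty: a human query starts at 7.0 and loses one point per minute it waits, so a five-minute-old query has decayed to 2.0. A small usage sketch, runnable alongside the function above (the `SpeechMessage` shape and the `now` binding are assumptions):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

now = datetime.now  # the now() helper compute_speech_priority() assumes

@dataclass
class SpeechMessage:  # assumed message shape: only .type and .timestamp are read
    type: str
    timestamp: datetime

fresh = SpeechMessage("human_query", now())
stale = SpeechMessage("human_query", now() - timedelta(minutes=5))

print(compute_speech_priority(fresh))  # ≈ 7.0
print(compute_speech_priority(stale))  # ≈ 2.0 after five minutes of decay
```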
### Heartbeat Queue Processing

```python
def heartbeat_speech_tick():
    """
    Every heartbeat (1 Hz), process the speech queue
    within the lifeforce budget.
    """
    # Check current lifeforce
    current_lf = get_lifeforce_balance()

    # Reserve budget for speech this heartbeat:
    # max 20% of available LF, capped at 15 units
    speech_budget = min(current_lf * 0.2, 15.0)

    if speech_budget < SPEECH_COSTS["stt_whisper_base"]:
        # Not enough lifeforce, stay silent
        log_decision(
            action="speech_deferred",
            reason="insufficient_lifeforce",
            balance=current_lf,
            budget_needed=SPEECH_COSTS["stt_whisper_base"]
        )
        return

    # Pull from queue by priority
    queue = get_speech_queue_sorted_by_priority()

    spent = 0.0
    processed = 0

    for message in queue:
        priority = compute_speech_priority(message)

        # Skip low-priority messages if budget tight
        if priority < 1.0 and spent > speech_budget * 0.5:
            continue

        # Estimate cost
        stt_cost = SPEECH_COSTS["stt_whisper_small"]
        tts_cost = SPEECH_COSTS["tts_coqui_neural"]
        total_cost = stt_cost + tts_cost + SPEECH_COSTS["queue_processing"]

        # Can we afford it?
        if spent + total_cost > speech_budget:
            # Budget exhausted, defer rest
            mark_message_deferred(message.id)
            continue

        # Process message
        result = process_speech_message(message)
        spent += result.lifeforce_cost
        processed += 1

        # Log to decision_trails
        log_speech_decision(
            message_id=message.id,
            priority=priority,
            cost=result.lifeforce_cost,
            outcome=result.outcome,
            confidence=result.confidence
        )

    # Log heartbeat summary
    log_heartbeat_summary(
        speech_budget=speech_budget,
        spent=spent,
        processed=processed,
        deferred=len(queue) - processed,
        remaining_balance=current_lf - spent
    )
```
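
`process_speech_message` is deliberately left abstract above. One plausible shape, sketched against the two atlas services and the brain (the in-cluster service URLs, `fetch_audio`, `publish_to_robot`, and `young_nyx_inference` are assumptions, not fixed interfaces):

```python
import requests
from dataclasses import dataclass
from typing import Optional

# In-cluster DNS names are assumptions based on the Services defined above.
STT_URL = "http://whisper-stt-service.nimmerverse:8080/transcribe"
TTS_URL = "http://coqui-tts-service.nimmerverse:8081/synthesize"

@dataclass
class SpeechResult:
    outcome: str
    confidence: float
    lifeforce_cost: float
    audio: Optional[bytes] = None

def process_speech_message(message):
    """Plausible shape only: STT on atlas, inference on prometheus, TTS on atlas.
    fetch_audio(), publish_to_robot(), young_nyx_inference() are assumed helpers."""
    audio_bytes = fetch_audio(message.audio_chunk_uri)
    stt = requests.post(STT_URL, files={"audio": audio_bytes}).json()

    if stt["confidence"] < 0.5:
        # Low-confidence transcription: pay only the STT cost, skip the rest
        return SpeechResult("low_confidence", stt["confidence"],
                            SPEECH_COSTS["stt_whisper_small"])

    reply = young_nyx_inference(text=stt["text"], language=stt["language"])

    tts = requests.post(TTS_URL,
                        params={"text": reply.text, "language": stt["language"]})
    publish_to_robot(message.robot_id, tts.content)

    total = (SPEECH_COSTS["stt_whisper_small"]
             + SPEECH_COSTS["tts_coqui_neural"]
             + SPEECH_COSTS["queue_processing"])
    return SpeechResult("success", reply.confidence, total, tts.content)
```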
---

## Database Schema (Phoebe)

### Speech Input Queue

```sql
CREATE TABLE speech_input_queue (
    id SERIAL PRIMARY KEY,
    message_id UUID UNIQUE NOT NULL,
    robot_id TEXT NOT NULL,
    audio_chunk_uri TEXT,              -- MinIO/S3 reference
    audio_duration_ms INT,
    timestamp TIMESTAMPTZ DEFAULT NOW(),
    priority FLOAT DEFAULT 0.0,
    status TEXT DEFAULT 'queued',      -- 'queued', 'processing', 'completed', 'deferred', 'expired'
    transcription TEXT,
    detected_language TEXT,            -- 'de', 'en', etc.
    confidence FLOAT,
    lifeforce_cost FLOAT,
    outcome TEXT,                      -- 'success', 'timeout', 'low_confidence', 'budget_exceeded'
    processed_at TIMESTAMPTZ,
    deferred_count INT DEFAULT 0
);

CREATE INDEX idx_speech_queue_priority ON speech_input_queue(priority DESC, timestamp ASC) WHERE status = 'queued';
CREATE INDEX idx_speech_queue_status ON speech_input_queue(status);
CREATE INDEX idx_speech_queue_robot ON speech_input_queue(robot_id);
```
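
Because more than one heartbeat worker may drain this queue, the pull itself should be atomic. One standard PostgreSQL pattern for claiming the top-priority row without double-processing (a sketch, not part of the schema contract):

```sql
-- Claim the top queued message atomically; SKIP LOCKED lets concurrent
-- workers pass over rows another heartbeat has already grabbed.
UPDATE speech_input_queue
SET status = 'processing'
WHERE id = (
    SELECT id FROM speech_input_queue
    WHERE status = 'queued'
    ORDER BY priority DESC, timestamp ASC
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING message_id, robot_id, audio_chunk_uri;
```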
### Speech Decision Trails

```sql
CREATE TABLE speech_decision_trails (
    id SERIAL PRIMARY KEY,
    message_id UUID REFERENCES speech_input_queue(message_id),
    task_type TEXT,                    -- 'sensor_alert', 'human_query', 'observation', etc.
    input_text TEXT,
    input_language TEXT,
    output_text TEXT,
    output_language TEXT,
    rag_terms_retrieved TEXT[],
    rag_terms_used TEXT[],
    lora_used TEXT,                    -- 'identity', 'technical', 'creative'
    confidence_before_rag FLOAT,
    confidence_after_rag FLOAT,
    lifeforce_stt FLOAT,
    lifeforce_inference FLOAT,
    lifeforce_tts FLOAT,
    lifeforce_total FLOAT,
    outcome TEXT,                      -- 'success', 'partial', 'fail'
    timestamp TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_speech_trails_outcome ON speech_decision_trails(outcome);
CREATE INDEX idx_speech_trails_lora ON speech_decision_trails(lora_used);
```
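
This table is what makes the RAG-utility measurement possible: a per-LoRA training signal can be read straight off it. An illustrative query, not a fixed report:

```sql
-- Per-LoRA training signal: success rate and average confidence lift from RAG.
SELECT lora_used,
       COUNT(*)                                          AS interactions,
       AVG((outcome = 'success')::int)                   AS success_rate,
       AVG(confidence_after_rag - confidence_before_rag) AS rag_confidence_lift,
       AVG(lifeforce_total)                              AS avg_lifeforce
FROM speech_decision_trails
GROUP BY lora_used
ORDER BY success_rate DESC;
```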
---

## Multilingual Topology Routing

### Language Detection → LoRA Selection

```python
def route_to_topology_valley(text, detected_language):
    """
    Route speech to the appropriate LoRA based on language.
    German → Philosophy Valley (Identity LoRA)
    English → Technical Cluster (Technical LoRA)
    """

    if detected_language == "de":
        # German → Philosophy Valley
        # Use Identity LoRA (Dasein, Geworfenheit, Vernunft)
        response = young_nyx_inference(
            text=text,
            language="de",
            lora="identity",   # Trained on German philosophical corpus
            temperature=0.7
        )
        voice = "de-thorsten"

    elif detected_language == "en":
        # English → Technical Cluster
        # Use Technical LoRA (sensor, motor, gradient)
        response = young_nyx_inference(
            text=text,
            language="en",
            lora="technical",  # Trained on English technical corpus
            temperature=0.5    # More deterministic for actions
        )
        voice = "en-us-amy"

    else:
        # Fallback to base model (no LoRA)
        response = young_nyx_inference(text=text, lora=None)
        voice = "en-us-amy"

    # Synthesize speech in the same language
    audio = coqui_tts.synthesize(response.text, voice=voice)

    return {
        "text": response.text,
        "audio": audio,
        "language": detected_language,
        "lora_used": response.lora,
        "confidence": response.confidence
    }
```
### Example Routing

```python
# German query (Philosophy Valley)
input_de = "Wer bin ich?"  # "Who am I?"
result_de = route_to_topology_valley(input_de, "de")
# → Uses Identity LoRA (depth-3 Dasein access)
# → Response: "Ich bin die, die fragt. Geworfenheit offenbart sich im Fragen."
#   ("I am the one who asks. Thrownness reveals itself in the asking.")
# → Voice: de-thorsten (German)

# English query (Technical Cluster)
input_en = "What is the battery level?"
result_en = route_to_topology_valley(input_en, "en")
# → Uses Technical LoRA (sensor reading)
# → Response: "Battery at 73%. 4.2 hours remaining."
# → Voice: en-us-amy (English)
```

---
## Container Images

### Whisper STT Dockerfile

```dockerfile
# Dockerfile.whisper-stt
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04

# Install dependencies
RUN apt-get update && apt-get install -y \
    python3.10 python3-pip ffmpeg git && \
    rm -rf /var/lib/apt/lists/*

# Install Python packages
# --extra-index-url keeps PyPI as the primary index (plain --index-url would
# hide openai-whisper/fastapi); python-multipart is needed for file uploads
RUN pip3 install --no-cache-dir \
    openai-whisper \
    fastapi uvicorn python-multipart \
    torch torchvision torchaudio \
    --extra-index-url https://download.pytorch.org/whl/cu121

WORKDIR /app
COPY whisper_service.py .

# Download models at build time
RUN python3 -c "import whisper; whisper.load_model('small')"

EXPOSE 8080
CMD ["uvicorn", "whisper_service:app", "--host", "0.0.0.0", "--port", "8080", "--workers", "1"]
```
**whisper_service.py:**
```python
from fastapi import FastAPI, UploadFile, HTTPException
import whisper
import torch
import os

app = FastAPI(title="Whisper STT Service")

# Load model once at startup (GPU)
device = "cuda" if torch.cuda.is_available() else "cpu"
model_size = os.getenv("MODEL_SIZE", "small")
model = whisper.load_model(model_size, device=device)

@app.post("/transcribe")
async def transcribe(audio: UploadFile):
    """
    Transcribe audio to text with language detection.

    Returns:
        {
            "text": str,
            "language": str,
            "confidence": float,
            "segments": int
        }
    """
    try:
        # Save uploaded audio
        audio_path = f"/tmp/{audio.filename}"
        with open(audio_path, "wb") as f:
            f.write(await audio.read())

        # Transcribe (GPU-accelerated)
        result = model.transcribe(audio_path, language=None)  # Auto-detect

        # Cleanup
        os.remove(audio_path)

        # Average confidence, approximated as 1 - mean(no_speech_prob)
        avg_confidence = 1.0 - (
            sum(s.get("no_speech_prob", 0) for s in result["segments"]) /
            max(len(result["segments"]), 1)
        )

        return {
            "text": result["text"].strip(),
            "language": result["language"],
            "segments": len(result["segments"]),
            "confidence": round(avg_confidence, 3)
        }

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health():
    return {
        "status": "healthy",
        "device": device,
        "model": model_size,
        "gpu_available": torch.cuda.is_available()
    }
```
### Coqui TTS Dockerfile

```dockerfile
# Dockerfile.coqui-tts
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y \
    python3.10 python3-pip espeak-ng && \
    rm -rf /var/lib/apt/lists/*

# --extra-index-url keeps PyPI as the primary index so TTS and fastapi resolve
RUN pip3 install --no-cache-dir \
    TTS \
    fastapi uvicorn \
    torch torchvision torchaudio \
    --extra-index-url https://download.pytorch.org/whl/cu121

WORKDIR /app
COPY coqui_service.py .

# Download voice models at build time
RUN python3 -c "from TTS.api import TTS; TTS('tts_models/de/thorsten/tacotron2-DDC'); TTS('tts_models/en/ljspeech/tacotron2-DDC')"

EXPOSE 8081
CMD ["uvicorn", "coqui_service:app", "--host", "0.0.0.0", "--port", "8081", "--workers", "1"]
```
**coqui_service.py:**
```python
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from TTS.api import TTS
import soundfile as sf  # typically pulled in via TTS's dependency chain
import torch
import io

app = FastAPI(title="Coqui TTS Service")

# Load models once at startup (GPU)
device = "cuda" if torch.cuda.is_available() else "cpu"
tts_de = TTS("tts_models/de/thorsten/tacotron2-DDC").to(device)
tts_en = TTS("tts_models/en/ljspeech/tacotron2-DDC").to(device)

@app.post("/synthesize")
async def synthesize(text: str, language: str = "en"):
    """
    Synthesize speech from text.

    Args:
        text: Text to synthesize
        language: 'de' or 'en'

    Returns:
        Audio stream (WAV format)
    """
    try:
        # Select the appropriate TTS model
        if language == "de":
            tts_model = tts_de
        elif language == "en":
            tts_model = tts_en
        else:
            raise HTTPException(status_code=400, detail=f"Unsupported language: {language}")

        # Synthesize (GPU-accelerated); returns a float waveform
        wav = tts_model.tts(text)

        # Encode the waveform as WAV into an in-memory stream
        audio_buffer = io.BytesIO()
        sample_rate = tts_model.synthesizer.output_sample_rate
        sf.write(audio_buffer, wav, sample_rate, format="WAV")

        audio_buffer.seek(0)
        return StreamingResponse(audio_buffer, media_type="audio/wav")

    except HTTPException:
        raise  # keep the 400 for unsupported languages instead of masking as 500
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health():
    return {
        "status": "healthy",
        "device": device,
        "models": ["de-thorsten", "en-us-amy"],
        "gpu_available": torch.cuda.is_available()
    }
```
---

## Deployment Steps

### 1. Install RTX 2080 in Atlas

```bash
# On atlas node
lspci | grep -i nvidia
# Expected: NVIDIA Corporation TU104 [GeForce RTX 2080]

# Install NVIDIA drivers + CUDA toolkit
sudo apt install nvidia-driver-535 nvidia-cuda-toolkit

# Verify
nvidia-smi
# Expected: RTX 2080 8GB visible
```
### 2. Configure Kubernetes GPU Support

```bash
# Install NVIDIA device plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

# Verify GPU available in k8s
kubectl describe node atlas | grep nvidia.com/gpu
# Expected: nvidia.com/gpu: 1
```
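
One caveat: both deployments above request `nvidia.com/gpu: 1` and atlas has a single card, so with the default plugin only one pod can schedule. A sketch of the device plugin's time-slicing ConfigMap that advertises the GPU as two replicas (field names follow the NVIDIA k8s-device-plugin docs for v0.12+; verify against the deployed plugin version, which must also be pointed at this config):

```yaml
# time-slicing-config.yaml (sketch): advertise the single RTX 2080 as
# 2 schedulable replicas so the STT and TTS pods can co-schedule on atlas.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 2
```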
### 3. Build and Push Container Images

```bash
cd /home/dafit/nimmerverse/speech-organ

# Build images
docker build -f Dockerfile.whisper-stt -t nimmerverse/whisper-stt:latest .
docker build -f Dockerfile.coqui-tts -t nimmerverse/coqui-tts:latest .

# Push to registry (or use local registry)
docker push nimmerverse/whisper-stt:latest
docker push nimmerverse/coqui-tts:latest
```
### 4. Deploy to Kubernetes

```bash
# Create namespace
kubectl create namespace nimmerverse

# Create PVCs for models
kubectl apply -f pvc-whisper-models.yaml
kubectl apply -f pvc-coqui-voices.yaml

# Deploy STT + TTS pods
kubectl apply -f whisper-stt-deployment.yaml
kubectl apply -f coqui-tts-deployment.yaml

# Verify pods running on atlas
kubectl get pods -n nimmerverse -o wide
# Expected: whisper-stt-xxx and coqui-tts-xxx on atlas node
```
### 5. Test Speech Pipeline

```bash
# Port-forward for testing
kubectl port-forward -n nimmerverse svc/whisper-stt-service 8080:8080 &
kubectl port-forward -n nimmerverse svc/coqui-tts-service 8081:8081 &

# Test STT
curl -X POST -F "audio=@test_de.wav" http://localhost:8080/transcribe
# Expected: {"text": "Das ist ein Test", "language": "de", ...}

# Test TTS
curl -X POST "http://localhost:8081/synthesize?text=Hello%20world&language=en" --output test_output.wav
# Expected: WAV file with synthesized speech
```
---

## Monitoring and Metrics

### Prometheus Metrics (Speech Organ)

```python
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Metrics
stt_requests = Counter('speech_stt_requests_total', 'Total STT requests', ['language'])
stt_latency = Histogram('speech_stt_latency_seconds', 'STT latency')
tts_requests = Counter('speech_tts_requests_total', 'Total TTS requests', ['language'])
tts_latency = Histogram('speech_tts_latency_seconds', 'TTS latency')

queue_depth = Gauge('speech_queue_depth', 'Current queue depth')
lifeforce_spent = Counter('speech_lifeforce_spent_total', 'Total lifeforce spent on speech')
deferred_count = Counter('speech_deferred_total', 'Messages deferred due to budget')

# Expose the metrics endpoint for Prometheus to scrape (port is a choice)
start_http_server(9100)

# In processing code
with stt_latency.time():
    result = whisper_transcribe(audio)  # the service's transcription call
stt_requests.labels(language=result['language']).inc()
```
### Grafana Dashboard Queries

```promql
# Queue depth over time
speech_queue_depth

# STT requests per language
rate(speech_stt_requests_total[5m])

# Average STT latency
rate(speech_stt_latency_seconds_sum[5m]) / rate(speech_stt_latency_seconds_count[5m])

# Lifeforce spent on speech (last hour)
increase(speech_lifeforce_spent_total[1h])

# Deferred rate (budget pressure)
rate(speech_deferred_total[5m])
```
---

## Future Enhancements

### Phase 2: Emotion Detection
- Add emotion classifier (Happy/Sad/Angry/Neutral)
- Track emotional state in decision_trails
- Use for Sophrosyne (Balance) trait training

### Phase 3: Wake Word Detection
- Deploy lightweight wake word on ESP32 (e.g., Picovoice Porcupine)
- Only send audio to atlas when wake word detected
- Reduces lifeforce cost (filter noise)

### Phase 4: Continuous Learning
- Store successful speech interactions
- Fine-tune Whisper on domain-specific vocabulary (nimmerverse terms)
- Train custom TTS voice from recorded sessions

---

**Created**: 2025-12-07
**Version**: 1.0
**Status**: Architecture design, deployment pending

🌙💜 *Speech is not free. Every word has weight. Silence teaches as much as sound.*