Last updated: Aug 12, 2025, 01:09 PM UTC

Local LLM Integration Guide

Generated: 2025-01-05 UTC
Status: IMPLEMENTED
Purpose: Comprehensive guide for running local LLMs within Sasha Studio container
Applicable To: Enterprise deployments requiring data sovereignty and offline capabilities

Implementation Status

TinyLlama 1.1B Successfully Integrated (January 2025)

  • Ollama service running in both Docker and local environments
  • TinyLlama 1.1B with 4-bit quantization (637MB) as default
  • Automatic fallback when cloud providers unavailable
  • Streaming responses working with Node.js compatibility
  • Zero-cost operation for local queries

Overview

This guide provides step-by-step instructions for integrating local Large Language Models (LLMs) into the Sasha Studio single-container architecture. By running models locally, organizations can maintain complete data sovereignty, operate offline, and reduce API costs while leveraging the full power of AI.

Related Guides:

Key Benefits (Now Live in Production)

  • Data Sovereignty: All data stays within your infrastructure
  • Cost Efficiency: No per-token API charges - $0 for local queries
  • Offline Operation: Full functionality without internet
  • Fallback Protection: Automatic failover from cloud to local (see the sketch below)
  • Performance: 50-60 tokens/sec with TinyLlama 4-bit
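
The failover called out above can be sketched as a thin wrapper around the cloud call. This is an illustration only: callCloudProvider stands in for whichever cloud client is configured, and OllamaAdapter refers to the adapter shown later in this guide.

// failover-sketch.js - illustrative cloud-to-local failover (names are placeholders)
async function chatWithFallback(messages, callCloudProvider, ollamaAdapter) {
  try {
    // Prefer the configured cloud provider when it is reachable
    return await callCloudProvider(messages);
  } catch (error) {
    console.warn(`Cloud provider unavailable (${error.message}), falling back to TinyLlama`);
    // Re-route the same conversation to the local model via Ollama
    return ollamaAdapter.chat(messages, { model: 'tinyllama:latest' });
  }
}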

Technology Stack

Recommended Solution: Ollama

After evaluating multiple local LLM solutions, Ollama emerges as the optimal choice for Sasha Studio integration:

| Feature | Ollama | LocalAI | vLLM | llama.cpp |
|---|---|---|---|---|
| Ease of Setup | | | | |
| Model Library | | | | |
| API Compatibility | OpenAI-compatible | OpenAI-compatible | OpenAI-compatible | Custom |
| Resource Efficiency | | | | |
| Container Integration | | | | |
| Management UI | CLI + API | Web UI | CLI | CLI |

Why Ollama?

  1. Simple Management: One-command model downloads and updates
  2. Optimized Performance: Automatic GPU detection and optimization
  3. Wide Model Support: Llama 3, Mistral, Phi-3, Gemma, and more
  4. Easy Integration: REST API that works seamlessly with LLxprt CLI (see the example after this list)
  5. Production Ready: Battle-tested in enterprise environments
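
The REST API mentioned in point 4 can be exercised directly once Ollama is running on its default port; a minimal Node.js (18+) example against the /api/generate endpoint:

// Quick sanity check of the Ollama REST API
async function askOllama(prompt) {
  const res = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model: 'tinyllama:latest', prompt, stream: false })
  });
  const data = await res.json();
  return data.response;  // non-streaming responses carry the full text in `response`
}

askOllama('Summarize what Ollama does in one sentence.').then(console.log);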

Installation and Setup

Docker Implementation Design Notes

Architecture Decisions

  1. Single Container Strategy

    • All services (Sasha, Ollama, PostgreSQL, Redis, Nginx) in one container
    • Supervisord manages all processes
    • Simplifies deployment but increases container size (~2GB base + models)
    • Trade-off: Ease of deployment vs. microservices best practices
  2. Model Storage Strategy

    • Models stored in /models volume mount for persistence
    • TinyLlama (637MB) pre-loaded during container build
    • Additional models downloaded on-demand
    • Volume mount allows model sharing between container updates
  3. Environment Detection

    • Automatic detection of Docker vs. local environment
    • File checks: /.dockerenv and /proc/1/cgroup
    • Environment variable: DOCKER_ENV=true
    • Different startup sequences based on environment
  4. Service Dependencies

    PostgreSQL → Redis → Ollama → Sasha API → Nginx
    
    • Health checks ensure proper startup order (see the readiness sketch after this list)
    • Retry logic for service connections
    • Graceful fallback if Ollama unavailable
  5. Resource Allocation

    • Minimum: 2GB RAM (TinyLlama only)
    • Recommended: 8GB RAM (multiple models)
    • GPU: Optional but recommended for larger models
    • CPU: 4+ cores for concurrent request handling
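
A minimal sketch of the retry logic behind that startup order, assuming the health endpoints used elsewhere in this guide (Ollama's /api/tags and the Sasha API's /health on port 3002); the retry counts are arbitrary.

// readiness-sketch.js - wait for each dependency before starting the next service
async function waitFor(name, probe, { retries = 30, delayMs = 2000 } = {}) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    if (await probe().catch(() => false)) {
      console.log(`${name} is ready`);
      return;
    }
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }
  throw new Error(`${name} did not become ready in time`);
}

(async () => {
  await waitFor('Ollama', async () => (await fetch('http://localhost:11434/api/tags')).ok);
  await waitFor('Sasha API', async () => (await fetch('http://localhost:3002/health')).ok);
})();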

Production Considerations

  1. Security

    • Ollama runs on internal port 11434 (not exposed externally)
    • API gateway handles authentication before model access
    • Model selection restricted by user permissions
    • No direct access to Ollama admin endpoints
  2. Performance Optimization

    • Model preloading during container startup
    • Shared memory for inter-process communication
    • Connection pooling for database and Redis
    • Nginx caching for static responses
  3. Monitoring & Logging

    • Centralized logging to /logs volume
    • Prometheus metrics endpoint at /metrics (see the sketch after this list)
    • Health checks for each service
    • Resource usage tracking per model
  4. Upgrade Strategy

    • Blue-green deployment for zero downtime
    • Model versions pinned in configuration
    • Backward compatibility for API changes
    • Automated rollback on health check failures
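
One way to back the /metrics endpoint mentioned in the monitoring item above is the prom-client package; the sketch below assumes that dependency and an arbitrary port, and is not the existing implementation.

// metrics-sketch.js - Prometheus /metrics endpoint using prom-client (assumed dependency)
const express = require('express');
const client = require('prom-client');

const app = express();
const register = new client.Registry();
client.collectDefaultMetrics({ register });  // CPU, memory, event-loop metrics

// Example custom metric: increment with localRequests.inc() wherever the Ollama path is taken
const localRequests = new client.Counter({
  name: 'sasha_local_llm_requests_total',
  help: 'Requests answered by the local Ollama model',
  registers: [register]
});

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(9464);  // port chosen arbitrarily for this sketch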

Single Container Integration

# Dockerfile - Sasha Studio with Ollama
FROM ubuntu:22.04

# Install system dependencies
RUN apt-get update && apt-get install -y \
    curl \
    git \
    build-essential \
    nvidia-cuda-toolkit \
    nodejs \
    npm \
    postgresql-14 \
    redis-server \
    nginx \
    supervisor \
    && rm -rf /var/lib/apt/lists/*

# Install Ollama
RUN curl -fsSL https://ollama.ai/install.sh | sh

# Create directories
RUN mkdir -p /models /data /config /logs

# Copy Ollama model configuration
COPY config/ollama-models.txt /config/

# Copy application code
COPY . /app
WORKDIR /app

# Install Node dependencies
RUN npm ci --production

# Configure Supervisord to manage all services
COPY config/supervisord.conf /etc/supervisor/conf.d/

# Expose ports
EXPOSE 80 11434

# Health check
HEALTHCHECK --interval=30s --timeout=10s \
    CMD curl -f http://localhost/health && curl -f http://localhost:11434/api/tags || exit 1

# Start script
COPY scripts/start.sh /start.sh
RUN chmod +x /start.sh

CMD ["/start.sh"]

Supervisord Configuration

# supervisord.conf
[supervisord]
nodaemon=true
logfile=/logs/supervisord.log

[program:ollama]
command=/usr/local/bin/ollama serve
autostart=true
autorestart=true
stdout_logfile=/logs/ollama.stdout.log
stderr_logfile=/logs/ollama.stderr.log
environment=OLLAMA_MODELS="/models",OLLAMA_HOST="0.0.0.0:11434"

[program:sasha-api]
command=node /app/backend/server.js
autostart=true
autorestart=true
stdout_logfile=/logs/sasha-api.stdout.log
stderr_logfile=/logs/sasha-api.stderr.log
environment=NODE_ENV="production"

[program:nginx]
command=/usr/sbin/nginx -g "daemon off;"
autostart=true
autorestart=true
stdout_logfile=/logs/nginx.stdout.log
stderr_logfile=/logs/nginx.stderr.log

[program:postgresql]
command=/usr/lib/postgresql/14/bin/postgres -D /var/lib/postgresql/14/main -c config_file=/etc/postgresql/14/main/postgresql.conf
autostart=true
autorestart=true
user=postgres
stdout_logfile=/logs/postgresql.stdout.log
stderr_logfile=/logs/postgresql.stderr.log

[program:redis]
command=/usr/bin/redis-server /etc/redis/redis.conf
autostart=true
autorestart=true
stdout_logfile=/logs/redis.stdout.log
stderr_logfile=/logs/redis.stderr.log

Start Script

#!/bin/bash
# start.sh - Initialize and start all services

echo "πŸš€ Starting Sasha Studio with Local LLM Support..."

# Initialize database if needed
if [ ! -f /data/.initialized ]; then
    echo "πŸ“¦ Initializing database..."
    su - postgres -c "/usr/lib/postgresql/14/bin/initdb -D /var/lib/postgresql/14/main"
    su - postgres -c "/usr/lib/postgresql/14/bin/pg_ctl -D /var/lib/postgresql/14/main -l /logs/postgresql.log start"
    sleep 5
    su - postgres -c "createdb sasha"
    cd /app && npm run db:migrate
    touch /data/.initialized
fi

# Pre-download essential models
if [ ! -f /models/.models-initialized ]; then
    echo "πŸ“₯ Downloading essential models..."
    # TinyLlama is our primary fallback model (637MB, 4-bit quantized)
    ollama pull tinyllama:latest
    # Future: Add more models as needed
    # ollama pull llama3:8b
    # ollama pull mistral:7b
    touch /models/.models-initialized
fi

# Start supervisord
echo "βœ… Starting all services..."
exec /usr/bin/supervisord -c /etc/supervisor/conf.d/supervisord.conf

Model Management

Pre-configured Models

# ollama-models.yml - Model configuration
models:
  # Fast, general purpose
  - name: llama3:8b
    purpose: general
    memory: 8GB
    context: 8192
    
  # Balanced performance
  - name: mistral:7b
    purpose: general
    memory: 6GB
    context: 32768
    
  # Code specialized
  - name: codellama:13b
    purpose: code
    memory: 10GB
    context: 16384
    
  # Large, high quality
  - name: llama3:70b
    purpose: advanced
    memory: 40GB
    context: 8192
    gpu_required: true
    
  # Tiny, fast responses
  - name: phi3:mini
    purpose: quick
    memory: 2GB
    context: 4096
    
  # Ultra-lightweight, cost-effective ⭐ IMPLEMENTED
  - name: tinyllama:latest
    purpose: fallback & lightweight queries
    memory: 637MB (4-bit quantized)
    context: 2048
    status: "✅ Active in production"
    features:
      - Automatic fallback from cloud providers
      - Zero-cost local inference
      - 50-60 tokens/sec performance

Model Download Script

// model-manager.js - Automated model management
const { exec } = require('child_process');
const util = require('util');
const execAsync = util.promisify(exec);

class ModelManager {
  constructor() {
    this.requiredModels = [
      'tinyllama:latest'  // ✅ IMPLEMENTED - Primary fallback model
      // Future models to add:
      // 'llama3:8b',
      // 'mistral:7b',
      // 'phi3:mini'
    ];
  }
  
  async ensureModelsAvailable() {
    console.log('🔍 Checking local models...');
    
    // List current models
    const { stdout } = await execAsync('ollama list');
    const installedModels = stdout.split('\n')
      .slice(1) // Skip header
      .map(line => line.split(/\s+/)[0])
      .filter(Boolean);
    
    // Download missing models
    for (const model of this.requiredModels) {
      if (!installedModels.includes(model)) {
        console.log(`📥 Downloading ${model}...`);
        await this.downloadModel(model);
      } else {
        console.log(`✅ ${model} already available`);
      }
    }
  }
  
  async downloadModel(modelName) {
    try {
      // exec buffers output rather than streaming it, so raise maxBuffer for long pull logs
      await execAsync(`ollama pull ${modelName}`, { maxBuffer: 10 * 1024 * 1024 });
      console.log(`✅ Successfully downloaded ${modelName}`);
    } catch (error) {
      console.error(`❌ Failed to download ${modelName}:`, error);
      throw error;
    }
  }
  
  async getModelInfo(modelName) {
    const { stdout } = await execAsync(`ollama show ${modelName}`);
    return this.parseModelInfo(stdout);
  }
  
  parseModelInfo(output) {
    const info = {};
    const lines = output.split('\n');
    
    lines.forEach(line => {
      if (line.includes('Parameters:')) {
        info.parameters = line.split(':')[1].trim();
      }
      if (line.includes('Size:')) {
        info.size = line.split(':')[1].trim();
      }
      if (line.includes('Quantization:')) {
        info.quantization = line.split(':')[1].trim();
      }
    });
    
    return info;
  }
}

// Auto-download on startup
const manager = new ModelManager();
manager.ensureModelsAvailable().catch(console.error);

TinyLlama 1.1B Integration

Model Overview

TinyLlama is an ultra-efficient 1.1B parameter model that provides exceptional performance for its size:

  • Architecture: Llama 2 compatible (22 layers, 2048 embedding dimension)
  • Training: 3 trillion tokens over 90 days
  • Compatibility: Drop-in replacement for Llama-based applications
  • Performance: 50-80 tokens/sec on CPU (varies by quantization)
  • Use Cases: Quick responses, edge deployment, cost optimization

Quantization Options

Understanding Quantization

Quantization reduces model size by lowering the numerical precision of weights. Think of it like image compression - you trade some quality for significantly smaller file sizes:

  • Original: 32-bit or 16-bit floating point numbers
  • Quantized: 2-bit, 4-bit, or 8-bit integers
  • Impact: 50-85% size reduction with minimal quality loss
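
A rough size estimate follows directly from parameter count and bit width (size ≈ parameters × bits ÷ 8). The script below reproduces the ballpark figures; actual files differ somewhat because k-quants keep some tensors at higher precision and include metadata.

// Back-of-the-envelope model sizes for TinyLlama's 1.1B parameters
const params = 1.1e9;

for (const [name, bits] of [['fp16', 16], ['q8_0', 8], ['q4_k_m', 4], ['q2_k', 2]]) {
  const gigabytes = (params * bits) / 8 / 1024 ** 3;
  console.log(`${name}: ~${gigabytes.toFixed(2)} GB`);
}
// fp16 and q8_0 land close to the sizes listed below; the 4-bit and 2-bit files
// come out somewhat larger than this naive estimate for the reasons above.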

Available Quantization Profiles

// config/tinyllama-quantization.js
const quantizationProfiles = {
  'q4_k_m': {  // ⭐ RECOMMENDED DEFAULT
    name: '4-bit Quantized',
    size: '637 MB',
    memoryRequired: '1.2 GB',
    speed: '50-60 tokens/sec',
    qualityLoss: '~3%',
    description: 'Best balance of size, speed, and quality. Perfect for production.',
    modelTag: 'tinyllama:1.1b-q4_k_m',
    useCase: 'General purpose, production deployments'
  },
  
  'q2_k': {  // 🚀 ULTRA COMPACT
    name: '2-bit Quantized',
    size: '432 MB',
    memoryRequired: '800 MB',
    speed: '70-80 tokens/sec',
    qualityLoss: '~8%',
    description: 'Smallest possible size. Some quality degradation but extremely fast.',
    modelTag: 'tinyllama:1.1b-q2_k',
    useCase: 'Edge devices, IoT, Raspberry Pi, speed-critical applications'
  },
  
  'q8_0': {  // 💎 HIGH QUALITY
    name: '8-bit Quantized',
    size: '1.1 GB',
    memoryRequired: '2 GB',
    speed: '40-50 tokens/sec',
    qualityLoss: '~1%',
    description: 'Higher quality with moderate size increase.',
    modelTag: 'tinyllama:1.1b-q8_0',
    useCase: 'Quality-sensitive tasks, customer-facing applications'
  },
  
  'fp16': {  // 🎯 MAXIMUM PRECISION
    name: '16-bit Full Precision',
    size: '1.94 GB',
    memoryRequired: '3 GB',
    speed: '25-35 tokens/sec',
    qualityLoss: '0%',
    description: 'Original model quality. No quantization losses.',
    modelTag: 'tinyllama:1.1b-fp16',
    useCase: 'Development, testing, benchmarking, critical accuracy needs'
  }
};

Choosing the Right Quantization

| Scenario | Recommended | Reasoning |
|---|---|---|
| Production Server | q4_k_m (4-bit) | Best balance, handles 95% of use cases well |
| Raspberry Pi/Edge | q2_k (2-bit) | Fits in limited memory, still functional |
| Customer Support | q8_0 (8-bit) | Higher quality for user-facing responses |
| Development | fp16 (16-bit) | Baseline for quality comparison |
| High Traffic | q2_k or q4_k_m | Maximize throughput |
| Limited RAM (<1GB) | q2_k (2-bit) | Only option that fits |
| Quality Critical | fp16 or q8_0 | Minimize quality loss |

Installation and Configuration

Environment Variables

# .env configuration
# Quantization selection (q2_k, q4_k_m, q8_0, fp16)
TINYLLAMA_QUANTIZATION=q4_k_m  # Default: 4-bit balanced

# Enable automatic quantization selection based on available memory
TINYLLAMA_AUTO_QUANTIZATION=true

# Memory threshold for auto-selection (MB)
TINYLLAMA_MEMORY_THRESHOLD=1500

# Fallback if selected quantization unavailable
TINYLLAMA_FALLBACK_QUANTIZATION=q2_k

# Model routing preferences
LOCAL_MODEL_PRIORITY=balanced  # Options: speed, quality, balanced
TINYLLAMA_MAX_CONTEXT=2048
TINYLLAMA_DEFAULT_TEMPERATURE=0.7

# Performance tuning
TINYLLAMA_BATCH_SIZE=512
TINYLLAMA_THREADS=4  # CPU threads to use

Automatic Quantization Selection

// services/tinyllama-manager.js
// quantizationProfiles is defined in config/tinyllama-quantization.js above;
// this require assumes that file exports it.
const { quantizationProfiles } = require('../config/tinyllama-quantization');

class TinyLlamaManager {
  constructor() {
    this.quantization = process.env.TINYLLAMA_QUANTIZATION || 'q4_k_m';
    this.autoSelect = process.env.TINYLLAMA_AUTO_QUANTIZATION === 'true';
  }
  
  async selectOptimalQuantization() {
    if (!this.autoSelect) {
      return this.quantization;
    }
    
    const availableMemory = await this.getAvailableMemory();
    const priority = process.env.LOCAL_MODEL_PRIORITY || 'balanced';
    
    // Memory-based selection
    if (availableMemory < 1000) {
      console.log('📊 Low memory detected, using 2-bit quantization');
      return 'q2_k';
    }
    
    if (availableMemory < 1500) {
      console.log('📊 Moderate memory, using 4-bit quantization');
      return 'q4_k_m';
    }
    
    // Priority-based selection for sufficient memory
    if (priority === 'speed') {
      return 'q2_k';  // Fastest inference
    }
    
    if (priority === 'quality' && availableMemory > 2000) {
      return availableMemory > 3000 ? 'fp16' : 'q8_0';
    }
    
    // Default balanced approach
    return 'q4_k_m';
  }
  
  async downloadModel(quantization) {
    const profile = quantizationProfiles[quantization];
    if (!profile) {
      throw new Error(`Unknown quantization: ${quantization}`);
    }
    
    console.log(`📥 Downloading TinyLlama ${profile.name}...`);
    console.log(`   Size: ${profile.size}`);
    console.log(`   Quality loss: ${profile.qualityLoss}`);
    console.log(`   Use case: ${profile.useCase}`);
    
    const { exec } = require('child_process');
    const util = require('util');
    const execAsync = util.promisify(exec);
    
    try {
      await execAsync(`ollama pull ${profile.modelTag}`);
      console.log(`✅ Successfully downloaded ${profile.modelTag}`);
      return profile;
    } catch (error) {
      console.error(`❌ Failed to download: ${error.message}`);
      
      // Try fallback
      const fallback = process.env.TINYLLAMA_FALLBACK_QUANTIZATION;
      if (fallback && fallback !== quantization) {
        console.log(`🔄 Attempting fallback to ${fallback}...`);
        return this.downloadModel(fallback);
      }
      
      throw error;
    }
  }
  
  async getAvailableMemory() {
    const os = require('os');
    const freeMem = os.freemem() / (1024 * 1024); // Convert to MB
    const totalMem = os.totalmem() / (1024 * 1024);
    
    // Conservative estimate - leave headroom for system
    const available = Math.floor(freeMem * 0.7);
    
    console.log(`💾 Memory: ${available}MB available (${freeMem.toFixed(0)}MB free of ${totalMem.toFixed(0)}MB total)`);
    return available;
  }
}

Docker Deployment

# Dockerfile with configurable TinyLlama quantization
FROM ubuntu:22.04

# Build arguments for quantization selection
ARG TINYLLAMA_QUANTIZATION=q4_k_m
ARG PRELOAD_ALL_QUANTIZATIONS=false

# ... [existing setup code] ...

# Install and configure TinyLlama
# Note: Currently only tinyllama:latest is available (4-bit quantized)
# This section prepared for future when specific quantizations are available
RUN echo "πŸ“₯ Downloading TinyLlama (4-bit quantized, 637MB)..." && \
    ollama pull tinyllama:latest

# Environment configuration
ENV TINYLLAMA_QUANTIZATION=${TINYLLAMA_QUANTIZATION}
ENV TINYLLAMA_AUTO_QUANTIZATION=true
ENV OLLAMA_MODELS=/models

Performance Benchmarks

Speed Comparison (tokens/second)

| Quantization | CPU (4 cores) | CPU (8 cores) | Apple M1 | NVIDIA 3060 |
|---|---|---|---|---|
| q2_k | 70-80 | 120-140 | 150-180 | 200-250 |
| q4_k_m | 50-60 | 90-100 | 120-140 | 180-200 |
| q8_0 | 40-50 | 70-80 | 90-110 | 150-170 |
| fp16 | 25-35 | 45-55 | 60-75 | 100-120 |

Quality Metrics (MMLU Benchmark)

| Quantization | Accuracy | Coherence | Factuality |
|---|---|---|---|
| fp16 | 100% (baseline) | Excellent | Very Good |
| q8_0 | 99% | Excellent | Very Good |
| q4_k_m | 97% | Very Good | Good |
| q2_k | 92% | Good | Acceptable |

Integration with Sasha

// services/model-router.js
class SashaModelRouter {
  constructor() {
    this.tinyLlama = new TinyLlamaManager();
    this.initialized = false;
  }
  
  async initialize() {
    // Select and download optimal TinyLlama variant
    const quantization = await this.tinyLlama.selectOptimalQuantization();
    const profile = await this.tinyLlama.downloadModel(quantization);
    
    console.log(`🚀 TinyLlama ready: ${profile.name}`);
    console.log(`   Expected speed: ${profile.speed}`);
    console.log(`   Memory usage: ${profile.memoryRequired}`);
    
    this.currentProfile = profile;
    this.initialized = true;
  }
  
  async routeQuery(query, context) {
    const tokenCount = this.estimateTokens(query + context);
    
    // Route to TinyLlama for suitable queries
    if (this.shouldUseTinyLlama(query, tokenCount)) {
      return {
        provider: 'ollama',
        model: this.currentProfile.modelTag,
        reason: 'Local processing for privacy and speed'
      };
    }
    
    // Fallback to cloud models
    return {
      provider: 'openrouter',
      model: 'openai/gpt-4o-mini',
      reason: 'Complex query requiring advanced model'
    };
  }
  
  shouldUseTinyLlama(query, tokenCount) {
    // Use TinyLlama for:
    // 1. Short contexts (under 2k tokens)
    // 2. Simple Q&A
    // 3. Non-code queries
    // 4. Privacy-sensitive content
    
    if (tokenCount > 2000) return false;
    if (query.includes('code') || query.includes('debug')) return false;
    if (query.match(/complex|analyze|detailed/i)) return false;
    
    return true;
  }
  
  estimateTokens(text) {
    // Rough estimate: 1 token ≈ 4 characters
    return Math.ceil(text.length / 4);
  }
}

Use Case Examples

Current Implementation (Production Ready)

# Current deployment configuration
ENABLE_LOCAL_MODELS=true         # Enable TinyLlama fallback
PREFER_LOCAL_MODELS=false        # Use cloud by default, fallback to local
TINYLLAMA_QUANTIZATION=q4_k_m    # 4-bit quantization (637MB)
OLLAMA_HOST=http://localhost:11434

# This configuration provides:
# - Automatic fallback when cloud providers fail
# - Zero-cost operation for fallback queries
# - 50-60 tokens/sec performance
# - Minimal memory footprint (637MB)

Example 1: Customer Support Bot (Future)

# Optimize for quality and speed when more models available
TINYLLAMA_QUANTIZATION=q8_0  # Higher quality for customer-facing
LOCAL_MODEL_PRIORITY=quality
TINYLLAMA_DEFAULT_TEMPERATURE=0.5  # More consistent responses

Example 2: Internal Documentation Search

# Optimize for speed and cost
TINYLLAMA_QUANTIZATION=q4_k_m  # Balanced
LOCAL_MODEL_PRIORITY=speed
TINYLLAMA_DEFAULT_TEMPERATURE=0.3  # Factual responses

Example 3: Edge Device Deployment

# Optimize for minimal resources
TINYLLAMA_QUANTIZATION=q2_k  # Smallest size
TINYLLAMA_AUTO_QUANTIZATION=false  # Don't change
TINYLLAMA_THREADS=2  # Limited CPU

Cost Analysis

Cloud vs Local Comparison

| Model | Provider | Cost per 1M tokens | Speed | Privacy |
|---|---|---|---|---|
| GPT-4 | OpenAI | $30-60 | Fast | Cloud |
| Claude 3 | Anthropic | $15-75 | Fast | Cloud |
| GPT-3.5 | OpenAI | $0.50-2.00 | Very Fast | Cloud |
| TinyLlama (Local) | Self-hosted | $0.00 | Very Fast | Local |

ROI Calculation

For a typical Sasha deployment handling 100k queries/day:

  • Average query: 500 tokens (input + output)
  • Daily tokens: 50M tokens
  • Monthly tokens: 1.5B tokens

Cloud costs: $750-3,000/month at GPT-3.5 rates; at the GPT-4 rates above, the same volume would run roughly $45,000-90,000/month
TinyLlama costs: $0/month (after initial hardware)

Hardware investment:

  • Basic server (32GB RAM, 8 cores): $1,000-2,000
  • Break-even: 1-3 months
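
The arithmetic behind these figures, using the per-token rates from the comparison table above:

// Reproducing the monthly cost estimate
const queriesPerDay = 100000;
const tokensPerQuery = 500;

const tokensPerMonth = queriesPerDay * tokensPerQuery * 30;      // 1.5 billion tokens
const gpt35Low = (tokensPerMonth / 1e6) * 0.5;                   // ~$750 at $0.50 per 1M tokens
const gpt35High = (tokensPerMonth / 1e6) * 2.0;                  // ~$3,000 at $2.00 per 1M tokens
const localCost = 0;                                             // inference only; hardware amortized separately

console.log({ tokensPerMonth, gpt35Low, gpt35High, localCost });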

Security Considerations

Data Privacy Benefits

  1. Complete Data Isolation: No data leaves your infrastructure
  2. Compliance Ready: Keeping all processing on-premise supports GDPR, HIPAA, and SOC 2 compliance programs
  3. No API Key Management: Eliminate API key security risks
  4. Audit Trail: Complete control over logging and monitoring

Security Configuration

# Secure TinyLlama deployment
OLLAMA_HOST=127.0.0.1:11434  # Local only, no external access
OLLAMA_ORIGINS=http://localhost:3002  # Restrict CORS
TINYLLAMA_LOG_LEVEL=error  # Minimal logging
TINYLLAMA_SECURE_MODE=true  # Disable model downloads in production

Monitoring and Observability

// services/tinyllama-monitor.js
class TinyLlamaMonitor {
  constructor() {
    this.metrics = {
      requestCount: 0,
      totalTokens: 0,
      averageLatency: 0,
      quantizationUsage: {},
      errorRate: 0
    };
  }
  
  async collectMetrics() {
    return {
      health: await this.checkHealth(),
      performance: {
        tokensPerSecond: this.calculateThroughput(),
        p95Latency: this.getPercentileLatency(95),
        queueDepth: await this.getQueueDepth()
      },
      resource: {
        memoryUsage: await this.getMemoryUsage(),
        modelLoaded: await this.getLoadedModel(),
        cacheHitRate: this.getCacheStats()
      }
    };
  }
  
  async checkHealth() {
    try {
      const response = await fetch('http://localhost:11434/api/tags');
      return response.ok ? 'healthy' : 'degraded';
    } catch (error) {
      return 'unhealthy';
    }
  }
}

LLxprt CLI Integration

Configuration

// llxprt-config.js - Configure LLxprt for local models
const config = {
  providers: {
    ollama: {
      endpoint: 'http://localhost:11434',
      models: {
        'llama3:8b': {
          contextLength: 8192,
          costPer1kTokens: 0, // Free!
          capabilities: ['general', 'analysis', 'coding']
        },
        'mistral:7b': {
          contextLength: 32768,
          costPer1kTokens: 0,
          capabilities: ['general', 'long-context']
        },
        'codellama:13b': {
          contextLength: 16384,
          costPer1kTokens: 0,
          capabilities: ['coding', 'debugging']
        }
      }
    },
    anthropic: {
      // Fallback to cloud when needed
      apiKey: process.env.ANTHROPIC_API_KEY,
      models: ['claude-3-opus', 'claude-3-sonnet']
    }
  },
  
  routing: {
    // Route to local models by default
    defaultProvider: 'ollama',
    rules: [
      {
        condition: (task) => task.requiresInternet,
        provider: 'anthropic'
      },
      {
        condition: (task) => task.type === 'code',
        model: 'codellama:13b'
      },
      {
        condition: (task) => task.context > 8192,
        model: 'mistral:7b'
      }
    ]
  }
};

module.exports = config;

API Adapter

// ollama-adapter.js - Adapt Ollama to OpenAI format
class OllamaAdapter {
  constructor(baseUrl = 'http://localhost:11434') {
    this.baseUrl = baseUrl;
  }
  
  async chat(messages, options = {}) {
    const response = await fetch(`${this.baseUrl}/api/chat`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: options.model || 'llama3:8b',
        messages: this.convertMessages(messages),
        stream: options.stream || false,
        options: {
          temperature: options.temperature || 0.7,
          top_p: options.top_p || 0.9,
          num_predict: options.max_tokens || 2048
        }
      })
    });
    
    if (options.stream) {
      return this.handleStream(response);
    }
    
    const data = await response.json();
    return {
      choices: [{
        message: {
          role: 'assistant',
          content: data.message.content
        }
      }],
      usage: {
        prompt_tokens: data.prompt_eval_count || 0,
        completion_tokens: data.eval_count || 0
      }
    };
  }
  
  convertMessages(messages) {
    return messages.map(msg => ({
      role: msg.role,  // Ollama's chat API uses the same role names (system/user/assistant)
      content: msg.content
    }));
  }
  
  async *handleStream(response) {
    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      
      const chunk = decoder.decode(value);
      const lines = chunk.split('\n').filter(Boolean);
      
      for (const line of lines) {
        try {
          const data = JSON.parse(line);
          yield {
            choices: [{
              delta: {
                content: data.message?.content || ''
              }
            }]
          };
        } catch (e) {
          // Skip invalid JSON
        }
      }
    }
  }
}
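
A short usage sketch for the adapter above (Node 18+ for global fetch). The streaming call resolves to an async generator, so it can be consumed with for await:

// Usage sketch for OllamaAdapter
(async () => {
  const adapter = new OllamaAdapter();

  // Non-streaming call
  const reply = await adapter.chat([{ role: 'user', content: 'Hello!' }], { model: 'tinyllama:latest' });
  console.log(reply.choices[0].message.content);

  // Streaming call: chat() resolves to an async generator when stream is true
  const stream = await adapter.chat([{ role: 'user', content: 'Tell me a joke.' }], {
    model: 'tinyllama:latest',
    stream: true
  });
  for await (const chunk of stream) {
    process.stdout.write(chunk.choices[0].delta.content);
  }
})();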

Dynamic Model Selection

Smart Router Implementation

// model-router.js - Intelligent model selection
class LocalModelRouter {
  constructor() {
    this.modelCapabilities = {
      'llama3:8b': {
        strengths: ['general', 'balanced', 'fast'],
        maxContext: 8192,
        speed: 'fast',
        quality: 'good'
      },
      'mistral:7b': {
        strengths: ['long-context', 'analysis', 'reasoning'],
        maxContext: 32768,
        speed: 'medium',
        quality: 'good'
      },
      'codellama:13b': {
        strengths: ['coding', 'debugging', 'refactoring'],
        maxContext: 16384,
        speed: 'medium',
        quality: 'excellent-for-code'
      },
      'llama3:70b': {
        strengths: ['complex-reasoning', 'analysis', 'writing'],
        maxContext: 8192,
        speed: 'slow',
        quality: 'excellent',
        requiresGPU: true
      },
      'phi3:mini': {
        strengths: ['quick-responses', 'simple-tasks'],
        maxContext: 4096,
        speed: 'very-fast',
        quality: 'adequate'
      }
    };
  }
  
  async selectModel(task) {
    // Check available models
    const availableModels = await this.getAvailableModels();
    
    // Score each model for the task
    const scores = availableModels.map(model => ({
      model,
      score: this.scoreModel(model, task)
    }));
    
    // Sort by score and return best match
    scores.sort((a, b) => b.score - a.score);
    
    const selected = scores[0];
    console.log(`🎯 Selected ${selected.model} for task (score: ${selected.score})`);
    
    return selected.model;
  }
  
  scoreModel(modelName, task) {
    const model = this.modelCapabilities[modelName];
    if (!model) return 0;
    
    let score = 50; // Base score
    
    // Task type matching
    if (task.type === 'code' && model.strengths.includes('coding')) {
      score += 30;
    }
    if (task.type === 'analysis' && model.strengths.includes('analysis')) {
      score += 20;
    }
    
    // Context size requirements
    if (task.estimatedTokens > model.maxContext) {
      return 0; // Can't handle this task
    }
    if (task.estimatedTokens < model.maxContext * 0.5) {
      score += 10; // Efficient use of context
    }
    
    // Speed requirements
    if (task.priority === 'fast' && model.speed === 'very-fast') {
      score += 25;
    }
    if (task.priority === 'quality' && model.quality.includes('excellent')) {
      score += 25;
    }
    
    // Resource availability
    if (model.requiresGPU && !this.hasGPU()) {
      score -= 50;
    }
    
    return score;
  }
  
  async getAvailableModels() {
    const response = await fetch('http://localhost:11434/api/tags');
    const data = await response.json();
    return data.models.map(m => m.name);
  }
  
  hasGPU() {
    // Check if GPU is available
    try {
      const { execSync } = require('child_process');
      execSync('nvidia-smi');
      return true;
    } catch {
      return false;
    }
  }
}

Simplified Monitoring Dashboard

Lightweight Monitoring Solution

// monitoring-dashboard.js - Simple monitoring without heavy dependencies
const si = require('systeminformation');
const diskspace = require('diskspace');
const express = require('express');
const WebSocket = require('ws');

class SimplifiedMonitoring {
  constructor() {
    this.metrics = {
      memory: { used: 0, total: 0, percentage: 0 },
      disk: { used: 0, total: 0, percentage: 0 },
      cpu: { usage: 0, temperature: 0 },
      gpu: { memory: 0, utilization: 0 },
      models: { loaded: [], totalSize: 0 },
      requests: { total: 0, rate: 0 },
      versions: { ollama: '', sasha: '', models: {} },
      alerts: []
    };
    
    this.thresholds = {
      memory: 85,      // Alert at 85% memory usage
      disk: 90,        // Alert at 90% disk usage
      cpu: 80,         // Alert at 80% CPU usage
      gpu: 90          // Alert at 90% GPU usage
    };
  }
  
  async collectMetrics() {
    try {
      // Memory metrics
      const mem = await si.mem();
      this.metrics.memory = {
        used: Math.round(mem.used / 1024 / 1024 / 1024 * 10) / 10,
        total: Math.round(mem.total / 1024 / 1024 / 1024 * 10) / 10,
        percentage: Math.round((mem.used / mem.total) * 100)
      };
      
      // Disk metrics
      const disks = await si.fsSize();
      const mainDisk = disks.find(d => d.mount === '/') || disks[0];
      this.metrics.disk = {
        used: Math.round(mainDisk.used / 1024 / 1024 / 1024 * 10) / 10,
        total: Math.round(mainDisk.size / 1024 / 1024 / 1024 * 10) / 10,
        percentage: Math.round(mainDisk.use)
      };
      
      // CPU metrics
      const cpuData = await si.currentLoad();
      const cpuTemp = await si.cpuTemperature();
      this.metrics.cpu = {
        usage: Math.round(cpuData.currentLoad),
        temperature: cpuTemp.main || 0
      };
      
      // GPU metrics (if available)
      try {
        const gpu = await si.graphics();
        if (gpu.controllers && gpu.controllers[0]) {
          this.metrics.gpu = {
            memory: gpu.controllers[0].memoryUsed || 0,
            utilization: gpu.controllers[0].utilizationGpu || 0
          };
        }
      } catch (e) {
        // GPU monitoring not available
      }
      
      // Model information
      await this.updateModelInfo();
      
      // Version information
      await this.updateVersionInfo();
      
      // Check thresholds and generate alerts
      this.checkAlerts();
      
    } catch (error) {
      console.error('Error collecting metrics:', error);
    }
  }
  
  async updateModelInfo() {
    try {
      // Get loaded models from Ollama
      const response = await fetch('http://localhost:11434/api/tags');
      const data = await response.json();
      
      this.metrics.models.loaded = data.models.map(m => ({
        name: m.name,
        size: Math.round(m.size / 1024 / 1024 / 1024 * 10) / 10 // GB
      }));
      
      this.metrics.models.totalSize = this.metrics.models.loaded
        .reduce((sum, m) => sum + m.size, 0);
    } catch (e) {
      // Ollama not running or API error
    }
  }
  
  async updateVersionInfo() {
    try {
      // Get Ollama version
      const ollamaResp = await fetch('http://localhost:11434/api/version');
      const ollamaData = await ollamaResp.json();
      this.metrics.versions.ollama = ollamaData.version;
      
      // Get package versions
      const pkg = require('./package.json');
      this.metrics.versions.sasha = pkg.version;
      
      // Check for updates
      await this.checkForUpdates();
    } catch (e) {
      // Version check failed
    }
  }
  
  async checkForUpdates() {
    // Simple version checking - in production, check against npm/github
    const latestVersions = {
      ollama: '0.1.35',  // Would fetch from API
      sasha: '2.0.0'     // Would fetch from API
    };
    
    if (this.compareVersions(this.metrics.versions.ollama, latestVersions.ollama) < 0) {
      this.addAlert('info', `Ollama update available: ${latestVersions.ollama}`);
    }
    
    if (this.compareVersions(this.metrics.versions.sasha, latestVersions.sasha) < 0) {
      this.addAlert('info', `Sasha update available: ${latestVersions.sasha}`);
    }
  }
  
  checkAlerts() {
    this.metrics.alerts = [];
    
    // Memory alert
    if (this.metrics.memory.percentage > this.thresholds.memory) {
      this.addAlert('warning', `High memory usage: ${this.metrics.memory.percentage}%`);
    }
    
    // Disk alert
    if (this.metrics.disk.percentage > this.thresholds.disk) {
      this.addAlert('critical', `Low disk space: ${this.metrics.disk.percentage}% used`);
    }
    
    // CPU alert
    if (this.metrics.cpu.usage > this.thresholds.cpu) {
      this.addAlert('warning', `High CPU usage: ${this.metrics.cpu.usage}%`);
    }
    
    // Temperature alert
    if (this.metrics.cpu.temperature > 80) {
      this.addAlert('warning', `High CPU temperature: ${this.metrics.cpu.temperature}°C`);
    }
  }
  
  addAlert(level, message) {
    this.metrics.alerts.push({
      level,
      message,
      timestamp: new Date().toISOString()
    });
  }
  
  compareVersions(current, latest) {
    const cur = current.split('.').map(Number);
    const lat = latest.split('.').map(Number);
    
    for (let i = 0; i < 3; i++) {
      if (cur[i] < lat[i]) return -1;
      if (cur[i] > lat[i]) return 1;
    }
    return 0;
  }
}

// Exported so monitoring-server.js can require this class
module.exports = SimplifiedMonitoring;

Simple Web Dashboard

<!-- monitoring-dashboard.html -->
<!DOCTYPE html>
<html>
<head>
  <title>Sasha Studio - System Monitor</title>
  <style>
    body {
      font-family: -apple-system, system-ui, sans-serif;
      background: #1a1a1a;
      color: #fff;
      margin: 0;
      padding: 20px;
    }
    
    .dashboard {
      display: grid;
      grid-template-columns: repeat(auto-fit, minmax(300px, 1fr));
      gap: 20px;
      max-width: 1400px;
      margin: 0 auto;
    }
    
    .metric-card {
      background: #2a2a2a;
      border-radius: 12px;
      padding: 20px;
      box-shadow: 0 4px 6px rgba(0,0,0,0.3);
    }
    
    .metric-title {
      font-size: 14px;
      color: #888;
      margin-bottom: 10px;
      text-transform: uppercase;
      letter-spacing: 1px;
    }
    
    .metric-value {
      font-size: 36px;
      font-weight: 600;
      margin-bottom: 10px;
    }
    
    .metric-detail {
      font-size: 14px;
      color: #aaa;
    }
    
    .progress-bar {
      width: 100%;
      height: 8px;
      background: #444;
      border-radius: 4px;
      overflow: hidden;
      margin-top: 10px;
    }
    
    .progress-fill {
      height: 100%;
      background: #4CAF50;
      transition: width 0.3s ease;
    }
    
    .progress-fill.warning { background: #ff9800; }
    .progress-fill.critical { background: #f44336; }
    
    .alerts {
      grid-column: 1 / -1;
    }
    
    .alert {
      padding: 12px 16px;
      border-radius: 8px;
      margin-bottom: 10px;
      display: flex;
      align-items: center;
      gap: 10px;
    }
    
    .alert.info { background: #2196F3; }
    .alert.warning { background: #ff9800; }
    .alert.critical { background: #f44336; }
    
    .models-list {
      margin-top: 10px;
    }
    
    .model-item {
      display: flex;
      justify-content: space-between;
      padding: 8px 0;
      border-bottom: 1px solid #444;
    }
    
    .model-item:last-child {
      border-bottom: none;
    }
    
    @media (max-width: 768px) {
      .dashboard {
        grid-template-columns: 1fr;
      }
    }
  </style>
</head>
<body>
  <h1>🖥️ Sasha Studio System Monitor</h1>
  
  <div id="alerts" class="alerts"></div>
  
  <div class="dashboard">
    <!-- Memory Card -->
    <div class="metric-card">
      <div class="metric-title">Memory Usage</div>
      <div class="metric-value" id="memory-percentage">--</div>
      <div class="metric-detail" id="memory-detail">-- GB / -- GB</div>
      <div class="progress-bar">
        <div class="progress-fill" id="memory-progress"></div>
      </div>
    </div>
    
    <!-- Disk Card -->
    <div class="metric-card">
      <div class="metric-title">Disk Space</div>
      <div class="metric-value" id="disk-percentage">--</div>
      <div class="metric-detail" id="disk-detail">-- GB / -- GB</div>
      <div class="progress-bar">
        <div class="progress-fill" id="disk-progress"></div>
      </div>
    </div>
    
    <!-- CPU Card -->
    <div class="metric-card">
      <div class="metric-title">CPU Usage</div>
      <div class="metric-value" id="cpu-usage">--</div>
      <div class="metric-detail" id="cpu-temp">Temperature: --Β°C</div>
      <div class="progress-bar">
        <div class="progress-fill" id="cpu-progress"></div>
      </div>
    </div>
    
    <!-- GPU Card -->
    <div class="metric-card">
      <div class="metric-title">GPU Status</div>
      <div class="metric-value" id="gpu-usage">--</div>
      <div class="metric-detail" id="gpu-memory">Memory: -- GB</div>
      <div class="progress-bar">
        <div class="progress-fill" id="gpu-progress"></div>
      </div>
    </div>
    
    <!-- Models Card -->
    <div class="metric-card">
      <div class="metric-title">Loaded Models</div>
      <div class="metric-value" id="model-count">0</div>
      <div class="metric-detail" id="model-size">Total Size: 0 GB</div>
      <div class="models-list" id="models-list"></div>
    </div>
    
    <!-- Versions Card -->
    <div class="metric-card">
      <div class="metric-title">System Versions</div>
      <div class="metric-detail">
        <div>Ollama: <span id="ollama-version">--</span></div>
        <div>Sasha Studio: <span id="sasha-version">--</span></div>
        <div>Node.js: <span id="node-version">--</span></div>
      </div>
    </div>
  </div>
  
  <script>
    // WebSocket connection for real-time updates
    const ws = new WebSocket('ws://localhost:8001');
    
    ws.onmessage = (event) => {
      const metrics = JSON.parse(event.data);
      updateDashboard(metrics);
    };
    
    function updateDashboard(metrics) {
      // Update memory
      document.getElementById('memory-percentage').textContent = metrics.memory.percentage + '%';
      document.getElementById('memory-detail').textContent = 
        `${metrics.memory.used} GB / ${metrics.memory.total} GB`;
      updateProgress('memory-progress', metrics.memory.percentage);
      
      // Update disk
      document.getElementById('disk-percentage').textContent = metrics.disk.percentage + '%';
      document.getElementById('disk-detail').textContent = 
        `${metrics.disk.used} GB / ${metrics.disk.total} GB`;
      updateProgress('disk-progress', metrics.disk.percentage);
      
      // Update CPU
      document.getElementById('cpu-usage').textContent = metrics.cpu.usage + '%';
      document.getElementById('cpu-temp').textContent = 
        `Temperature: ${metrics.cpu.temperature}°C`;
      updateProgress('cpu-progress', metrics.cpu.usage);
      
      // Update GPU
      document.getElementById('gpu-usage').textContent = metrics.gpu.utilization + '%';
      document.getElementById('gpu-memory').textContent = 
        `Memory: ${(metrics.gpu.memory / 1024).toFixed(1)} GB`;
      updateProgress('gpu-progress', metrics.gpu.utilization);
      
      // Update models
      document.getElementById('model-count').textContent = metrics.models.loaded.length;
      document.getElementById('model-size').textContent = 
        `Total Size: ${metrics.models.totalSize.toFixed(1)} GB`;
      
      const modelsList = document.getElementById('models-list');
      modelsList.innerHTML = metrics.models.loaded
        .map(m => `
          <div class="model-item">
            <span>${m.name}</span>
            <span>${m.size} GB</span>
          </div>
        `).join('');
      
      // Update versions (process.version is not available in the browser, so the
      // Node.js version must be supplied by the server in the metrics payload)
      document.getElementById('ollama-version').textContent = metrics.versions.ollama || '--';
      document.getElementById('sasha-version').textContent = metrics.versions.sasha || '--';
      document.getElementById('node-version').textContent = metrics.versions.node || '--';
      
      // Update alerts
      const alertsContainer = document.getElementById('alerts');
      alertsContainer.innerHTML = metrics.alerts
        .map(a => `
          <div class="alert ${a.level}">
            <span>${a.level === 'critical' ? '🚨' : a.level === 'warning' ? '⚠️' : 'ℹ️'}</span>
            <span>${a.message}</span>
          </div>
        `).join('');
    }
    
    function updateProgress(elementId, percentage) {
      const element = document.getElementById(elementId);
      element.style.width = percentage + '%';
      
      // Update color based on threshold
      element.className = 'progress-fill';
      if (percentage > 90) {
        element.classList.add('critical');
      } else if (percentage > 75) {
        element.classList.add('warning');
      }
    }
  </script>
</body>
</html>

Monitoring Server

// monitoring-server.js - Lightweight monitoring server
const express = require('express');
const WebSocket = require('ws');
const SimplifiedMonitoring = require('./monitoring-dashboard');

const app = express();
const monitoring = new SimplifiedMonitoring();

// Serve dashboard
app.use(express.static('public'));

// API endpoints for metrics
app.get('/api/metrics', async (req, res) => {
  await monitoring.collectMetrics();
  res.json(monitoring.metrics);
});

// WebSocket server for real-time updates
const server = app.listen(8001, () => {
  console.log('Monitoring dashboard available at http://localhost:8001');
});

const wss = new WebSocket.Server({ server });

// Broadcast metrics every 5 seconds
setInterval(async () => {
  await monitoring.collectMetrics();
  const data = JSON.stringify(monitoring.metrics);
  
  wss.clients.forEach(client => {
    if (client.readyState === WebSocket.OPEN) {
      client.send(data);
    }
  });
}, 5000);

// Handle new connections
wss.on('connection', async (ws) => {
  // Send initial metrics
  await monitoring.collectMetrics();
  ws.send(JSON.stringify(monitoring.metrics));
});

Performance Optimization

Resource Management

# Model resource allocation
model_configs:
  llama3-8b:
    gpu_layers: 35
    cpu_threads: 8
    context_size: 4096
    batch_size: 512
  
  llama3-70b:
    gpu_layers: 80
    cpu_threads: 16
    context_size: 8192
    batch_size: 1024

Monitoring Integration with Sasha Studio

// sasha-monitoring-integration.js
class SashaMonitoringIntegration {
  constructor() {
    this.monitoring = new SimplifiedMonitoring();
    this.metricsHistory = [];
    this.maxHistorySize = 288; // 24 hours at 5-minute intervals
  }
  
  async integrateWithSashaAPI(app) {
    // Add monitoring endpoints to existing Sasha API
    app.get('/api/system/metrics', async (req, res) => {
      await this.monitoring.collectMetrics();
      res.json({
        current: this.monitoring.metrics,
        history: this.getMetricsHistory(req.query.period || '1h')
      });
    });
    
    // Health check endpoint
    app.get('/api/health', async (req, res) => {
      const health = await this.checkSystemHealth();
      res.status(health.healthy ? 200 : 503).json(health);
    });
    
    // Start periodic collection
    this.startMetricsCollection();
  }
  
  async checkSystemHealth() {
    await this.monitoring.collectMetrics();
    const metrics = this.monitoring.metrics;
    
    const checks = {
      memory: metrics.memory.percentage < 90,
      disk: metrics.disk.percentage < 95,
      cpu: metrics.cpu.usage < 90,
      ollama: await this.checkOllamaHealth(),
      database: await this.checkDatabaseHealth()
    };
    
    const healthy = Object.values(checks).every(v => v === true);
    
    return {
      healthy,
      checks,
      timestamp: new Date().toISOString()
    };
  }
  
  async checkOllamaHealth() {
    try {
      const response = await fetch('http://localhost:11434/api/tags');
      return response.ok;
    } catch (e) {
      return false;
    }
  }
  
  async checkDatabaseHealth() {
    // Check PostgreSQL connection
    try {
      const { Pool } = require('pg');
      const pool = new Pool();
      const result = await pool.query('SELECT 1');
      await pool.end();
      return true;
    } catch (e) {
      return false;
    }
  }
  
  startMetricsCollection() {
    // Collect metrics every 5 minutes
    setInterval(async () => {
      await this.monitoring.collectMetrics();
      
      // Store in history
      this.metricsHistory.push({
        timestamp: new Date().toISOString(),
        metrics: { ...this.monitoring.metrics }
      });
      
      // Trim history
      if (this.metricsHistory.length > this.maxHistorySize) {
        this.metricsHistory.shift();
      }
      
      // Check for critical alerts
      this.checkCriticalAlerts();
    }, 5 * 60 * 1000);
  }
  
  checkCriticalAlerts() {
    const criticalAlerts = this.monitoring.metrics.alerts
      .filter(a => a.level === 'critical');
    
    if (criticalAlerts.length > 0) {
      // In production, send notifications
      console.error('Critical alerts:', criticalAlerts);
      
      // Could integrate with:
      // - Email notifications
      // - Slack/Discord webhooks
      // - PagerDuty
      // - Custom notification service
    }
  }
  
  getMetricsHistory(period) {
    const now = Date.now();
    const periodMs = {
      '15m': 15 * 60 * 1000,
      '1h': 60 * 60 * 1000,
      '6h': 6 * 60 * 60 * 1000,
      '24h': 24 * 60 * 60 * 1000
    }[period] || 60 * 60 * 1000;
    
    return this.metricsHistory.filter(entry => {
      const entryTime = new Date(entry.timestamp).getTime();
      return now - entryTime <= periodMs;
    });
  }
}

Caching Strategy

// llm-cache.js - Response caching for efficiency
const crypto = require('crypto');
const Redis = require('redis');

class LLMCache {
  constructor() {
    this.redis = Redis.createClient({
      url: 'redis://localhost:6379'
    });
    this.ttl = 3600; // 1 hour default
  }
  
  generateKey(messages, model, temperature) {
    const content = JSON.stringify({ messages, model, temperature });
    return crypto.createHash('sha256').update(content).digest('hex');
  }
  
  async get(messages, model, temperature) {
    const key = this.generateKey(messages, model, temperature);
    const cached = await this.redis.get(key);
    
    if (cached) {
      console.log('🎯 Cache hit for query');
      return JSON.parse(cached);
    }
    
    return null;
  }
  
  async set(messages, model, temperature, response) {
    const key = this.generateKey(messages, model, temperature);
    await this.redis.setex(
      key, 
      this.ttl, 
      JSON.stringify(response)
    );
  }
  
  async invalidatePattern(pattern) {
    const keys = await this.redis.keys(pattern);
    if (keys.length > 0) {
      await this.redis.del(keys);
    }
  }
}
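
A usage sketch showing how the cache could sit in front of a model call, for example the OllamaAdapter from earlier; it assumes the Redis client has already been connected.

// Check the cache before calling the model, store the result afterwards
async function cachedChat(cache, adapter, messages, model, temperature) {
  const hit = await cache.get(messages, model, temperature);
  if (hit) return hit;

  const response = await adapter.chat(messages, { model, temperature });
  await cache.set(messages, model, temperature, response);
  return response;
}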

Single Container Architecture

Overview

For production deployments, Sasha runs in a single Docker container with Ollama embedded, simplifying deployment and management while maintaining all functionality.

Architecture Benefits

  1. Simplified Deployment: One container, one command
  2. No Networking Complexity: Ollama and Sasha communicate via localhost
  3. Unified Resource Management: Single container resource limits
  4. Easier Monitoring: One container to monitor
  5. Persistent Models: Models stored in Docker volume

Container Startup Sequence

graph TD
  A[Container Start] --> B[Start Ollama Service]
  B --> C[Wait for Ollama Ready]
  C --> D[Check/Download TinyLlama]
  D --> E[Start Sasha Server]
  E --> F[Initialize AI Services]
  F --> G[Ready for Requests]
  style A fill:#e3f2fd
  style G fill:#e8f5e9

Production Dockerfile

# Dockerfile - Single container with embedded Ollama
FROM ubuntu:22.04

# Install system dependencies
RUN apt-get update && apt-get install -y \
    curl \
    nodejs \
    npm \
    && rm -rf /var/lib/apt/lists/*

# Install Ollama
RUN curl -fsSL https://ollama.ai/install.sh | sh

# Set up application
WORKDIR /app
COPY package*.json ./
RUN npm ci --production
COPY . .

# Create directories
RUN mkdir -p /data /models /logs

# Pre-download TinyLlama (at build time for faster startup)
ARG TINYLLAMA_QUANTIZATION=q4_k_m
RUN ollama serve & \
    sleep 10 && \
    ollama pull tinyllama:1.1b-${TINYLLAMA_QUANTIZATION} && \
    pkill ollama

# Configure environment
ENV NODE_ENV=production \
    DOCKER_ENV=true \
    OLLAMA_MODELS=/models \
    OLLAMA_HOST=http://localhost:11434 \
    ENABLE_LOCAL_MODELS=true \
    TINYLLAMA_QUANTIZATION=q4_k_m

# Copy and setup entrypoint
COPY scripts/docker-entrypoint.sh /docker-entrypoint.sh
RUN chmod +x /docker-entrypoint.sh

# Expose port
EXPOSE 3002

# Health check for both services
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:3002/health && \
        curl -f http://localhost:11434/api/tags || exit 1

ENTRYPOINT ["/docker-entrypoint.sh"]

Docker Entrypoint Script

#!/bin/bash
# scripts/docker-entrypoint.sh

set -e

echo "πŸš€ Starting Sasha Chat with integrated TinyLlama..."
echo "πŸ“Š Environment:"
echo "   NODE_ENV: ${NODE_ENV}"
echo "   TINYLLAMA_QUANTIZATION: ${TINYLLAMA_QUANTIZATION}"
echo "   OLLAMA_MODELS: ${OLLAMA_MODELS}"

# Start Ollama in background
echo "πŸ”§ Starting Ollama service..."
ollama serve &
OLLAMA_PID=$!

# Function to check if Ollama is ready
check_ollama() {
    curl -s http://localhost:11434/api/tags > /dev/null 2>&1
}

# Wait for Ollama with timeout
echo "⏳ Waiting for Ollama to be ready..."
TIMEOUT=60
ELAPSED=0
while [ $ELAPSED -lt $TIMEOUT ]; do
    if check_ollama; then
        echo "βœ… Ollama is ready"
        break
    fi
    sleep 2
    ELAPSED=$((ELAPSED + 2))
    echo "   Waiting... ($ELAPSED/$TIMEOUT seconds)"
done

if [ $ELAPSED -ge $TIMEOUT ]; then
    echo "❌ Ollama failed to start within ${TIMEOUT} seconds"
    exit 1
fi

# Ensure TinyLlama is available
echo "πŸ” Checking for TinyLlama model..."
if ! ollama list | grep -q "tinyllama"; then
    echo "πŸ“₯ Downloading TinyLlama (this happens once)..."
    ollama pull tinyllama:1.1b-${TINYLLAMA_QUANTIZATION:-q4_k_m}
else
    echo "βœ… TinyLlama already available"
fi

echo "πŸ“Š Available models:"
ollama list

# Handle shutdown gracefully
trap 'echo "Shutting down..."; kill $OLLAMA_PID; exit 0' SIGTERM SIGINT

# Start Node.js application
echo "πŸš€ Starting Sasha Chat server..."
exec node server.js

Deployment Commands

Build Container

# Build with default 4-bit quantization
docker build -t sasha-chat:latest .

# Build with specific quantization
docker build \
  --build-arg TINYLLAMA_QUANTIZATION=q8_0 \
  -t sasha-chat:q8 .

Run Container

# Run with persistent storage
docker run -d \
  --name sasha-chat \
  -p 3002:3002 \
  -v sasha-models:/models \
  -v sasha-data:/data \
  --restart unless-stopped \
  sasha-chat:latest

# Run with custom configuration
docker run -d \
  --name sasha-chat \
  -p 3002:3002 \
  -v sasha-models:/models \
  -v sasha-data:/data \
  -e TINYLLAMA_QUANTIZATION=q2_k \
  -e TINYLLAMA_AUTO_QUANTIZATION=true \
  --memory="4g" \
  --cpus="2" \
  sasha-chat:latest

Docker Compose (Optional)

# docker-compose.yml
version: '3.8'

services:
  sasha:
    image: sasha-chat:latest
    container_name: sasha-chat
    ports:
      - "3002:3002"
    volumes:
      - sasha-models:/models
      - sasha-data:/data
      - ./logs:/logs
    environment:
      - TINYLLAMA_QUANTIZATION=${TINYLLAMA_QUANTIZATION:-q4_k_m}
      - OPENROUTER_API_KEY=${OPENROUTER_API_KEY}
      - NODE_ENV=production
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: '2'
        reservations:
          memory: 2G
          cpus: '1'

volumes:
  sasha-models:
    driver: local
  sasha-data:
    driver: local

Environment Detection in Code

// services/ollama-service.js
const fs = require('fs');

class OllamaService {
  constructor() {
    // Detect Docker environment
    this.isDocker = this.detectDocker();
    this.ollamaHost = 'http://localhost:11434';
    console.log(`🔧 Ollama Service (${this.isDocker ? 'Docker' : 'Local'} mode)`);
  }
  
  detectDocker() {
    // Multiple detection methods: env flag, Docker marker file, or "docker" in the cgroup info
    if (process.env.DOCKER_ENV === 'true' || fs.existsSync('/.dockerenv')) {
      return true;
    }
    try {
      // /proc/1/cgroup exists on every Linux host, so check its contents rather than its presence
      return fs.readFileSync('/proc/1/cgroup', 'utf8').includes('docker');
    } catch {
      return false;
    }
  }
  
  async initialize() {
    if (this.isDocker) {
      // In Docker, Ollama should already be running
      // Started by docker-entrypoint.sh
      console.log('📦 Running in Docker container');
    } else {
      // In local development, check if Ollama is running
      if (!await this.checkHealth()) {
        console.log('ℹ️  Ollama not running. Start with: ollama serve');
        console.log('   Or run: npm run setup:ollama');
      }
    }
    
    // Wait for service to be ready
    await this.waitForReady();
  }
}
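
The checkHealth() and waitForReady() helpers referenced above are not shown in the class; a minimal sketch of what they might look like, written here as standalone functions:

// Sketch of the health helpers used by OllamaService (actual implementations may differ)
async function checkHealth(ollamaHost = 'http://localhost:11434') {
  try {
    const response = await fetch(`${ollamaHost}/api/tags`);
    return response.ok;
  } catch {
    return false;
  }
}

async function waitForReady(ollamaHost, retries = 30, delayMs = 2000) {
  for (let attempt = 0; attempt < retries; attempt++) {
    if (await checkHealth(ollamaHost)) return true;
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }
  throw new Error('Ollama did not become ready in time');
}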

Local Development Setup

For local development, Ollama runs as a separate process on your machine:

# scripts/setup-local.sh
#!/bin/bash

echo "πŸš€ Setting up local development with Ollama..."

# Detect OS
OS="$(uname -s)"
case "${OS}" in
    Linux*)     INSTALL_CMD="curl -fsSL https://ollama.ai/install.sh | sh";;
    Darwin*)    INSTALL_CMD="brew install ollama || curl -fsSL https://ollama.ai/install.sh | sh";;
    *)          echo "Unsupported OS: ${OS}"; exit 1;;
esac

# Install Ollama if needed
if ! command -v ollama &> /dev/null; then
    echo "πŸ“¦ Installing Ollama..."
    eval $INSTALL_CMD
fi

# Start Ollama service
if ! pgrep -x "ollama" > /dev/null; then
    echo "πŸ”§ Starting Ollama service..."
    ollama serve &
    sleep 5
fi

# Pull TinyLlama
echo "πŸ“₯ Ensuring TinyLlama is available..."
ollama pull tinyllama:1.1b-q4_k_m

echo "βœ… Setup complete! You can now run: npm run dev"

Container Management

Viewing Logs

# View combined logs
docker logs sasha-chat

# Follow logs
docker logs -f sasha-chat

# Ollama is started by the entrypoint script rather than systemd, so its output appears
# in the combined container logs above; the supervisord-based image writes it to /logs:
docker exec sasha-chat tail -f /logs/ollama.stdout.log

Model Management

# List models in container
docker exec sasha-chat ollama list

# Pull additional model
docker exec sasha-chat ollama pull llama2:7b

# Remove unused model
docker exec sasha-chat ollama rm phi3:mini

Backup and Restore

# Backup models volume
docker run --rm \
  -v sasha-models:/models \
  -v $(pwd):/backup \
  alpine tar czf /backup/models-backup.tar.gz -C /models .

# Restore models volume
docker run --rm \
  -v sasha-models:/models \
  -v $(pwd):/backup \
  alpine tar xzf /backup/models-backup.tar.gz -C /models

Performance Tuning

Resource Limits

# docker-compose.yml with resource limits
deploy:
  resources:
    limits:
      memory: 4G  # Total for Sasha + Ollama + TinyLlama
      cpus: '2'
    reservations:
      memory: 2G  # Minimum required
      cpus: '1'

Quantization Auto-Selection

// Auto-select based on container resources
const os = require('os');

async function selectQuantization() {
  const totalMemory = os.totalmem() / (1024 * 1024 * 1024); // GB
  
  if (totalMemory < 2) {
    return 'q2_k';  // Ultra-light for minimal containers
  } else if (totalMemory < 4) {
    return 'q4_k_m';  // Balanced for standard containers
  } else {
    return 'q8_0';  // Quality for larger containers
  }
}

Production Deployment

Docker Compose Configuration

# docker-compose.yml - Complete local LLM setup
version: '3.8'

services:
  sasha-studio:
    build: .
    image: sasha/studio-local-llm:latest
    container_name: sasha-studio-local
    ports:
      - "80:80"        # Main web interface
      - "8001:8001"    # Monitoring dashboard
    volumes:
      # Persistent data
      - sasha-data:/data
      - sasha-models:/models
      - sasha-config:/config
      - sasha-logs:/logs
      
      # Development mounts (remove in production)
      - ./guides:/app/guides
      - ./custom-models:/custom-models
    
    environment:
      # LLM Configuration
      - ENABLE_LOCAL_MODELS=true
      - DEFAULT_MODEL=llama3:8b
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_MODELS=/models
      
      # Optional cloud fallback
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
      - OPENAI_API_KEY=${OPENAI_API_KEY:-}
      
      # Resource limits
      - OLLAMA_MAX_LOADED_MODELS=2
      - OLLAMA_MEMORY_LIMIT=32GB
      
      # Monitoring
      - ENABLE_MONITORING=true
      - MONITORING_PORT=8001
    
    # Resource constraints
    deploy:
      resources:
        limits:
          cpus: '8'
          memory: 64G
        reservations:
          cpus: '4'
          memory: 32G
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    
    restart: unless-stopped
    
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost/health"]
      interval: 30s
      timeout: 10s
      retries: 3

volumes:
  sasha-data:
  sasha-models:
  sasha-config:
  sasha-logs:

Production Checklist

  • Hardware Requirements Met

    • Minimum 32GB RAM (64GB recommended)
    • 500GB+ SSD storage for models
    • GPU with 24GB+ VRAM (a quantized 70B model needs roughly 40GB, so smaller cards will offload part of it to CPU)
    • 8+ CPU cores
  • Models Pre-downloaded

    • Base models (llama3:8b, mistral:7b)
    • Specialized models as needed
    • Model update schedule defined
  • Monitoring Configured

    • Dashboard accessible
    • Alert thresholds set
    • Notification channels configured
  • Backup Strategy

    • Model backups scheduled
    • Configuration backups
    • Data persistence verified
  • Security Hardened

    • Network isolation configured
    • Access controls implemented
    • Audit logging enabled

Success Metrics

  • Response Time: <2s for 8B models, <5s for 70B models
  • Throughput: 10+ concurrent requests
  • Availability: 99.9% uptime
  • Cost Savings: 80%+ reduction vs cloud APIs
  • Data Security: 100% on-premise processing

Additional Resources

Related Guides

External Resources


This guide provides a complete framework for integrating local LLMs into Sasha Studio, ensuring data sovereignty, cost efficiency, and high performance while maintaining the flexibility to leverage cloud models when needed.