Last updated: Aug 12, 2025, 01:18 PM UTC

Lessons Learnt - Sasha Project

This document captures key insights and learnings from the Sasha AI Knowledge Management System development.

Docker Workspace Path Resolution

Date: 2025-08-11

The Challenge

  1. Specialists and guides weren't loading in Sliplane deployment - UI only showed 2 default specialists instead of 8 (5 system + 3 user)
  2. HTML documentation wasn't displaying in Knowledge tab iframe - showed blank content

Root Cause Analysis

  1. Path Mismatch: Code expected files at /app/docs/ and /app/html-static/ but they were actually at /app/workspaces/workspace/docs/ and /app/workspaces/workspace/html-static/
  2. Hidden Directory: Private content was in .private/ (hidden) instead of private/ directory
  3. Missing Files: Specialists weren't being copied from image to workspace volume during container initialization

Investigation Process

  1. Backend readPersonas() worked perfectly locally (returned all 8 specialists)
  2. SSH into container revealed docs were in workspace path, not standard path
  3. .private directory existed but lacked specialists subdirectory
  4. Docker entrypoint script wasn't creating/copying specialist files

The Solution

1. Dynamic Path Resolution in content-reader.js

// Check workspace path first in Docker environments
if (process.env.USE_DOCKER_WORKSPACE === 'true' || process.env.RUNNING_IN_DOCKER === 'true') {
  const workspaceDocsPath = '/app/workspaces/workspace/docs';
  const standardDocsPath = process.env.DOCS_PATH || '/app/docs';
  // Use whichever exists, preferring the workspace path
  docsPath = fs.existsSync(workspaceDocsPath) ? workspaceDocsPath : standardDocsPath;
}

2. Handle Hidden Private Directory

// In Docker, check for .private first, fall back to private
privateDir = path.join(docsPath, '.private', 'specialists');
if (!fs.existsSync(privateDir)) privateDir = path.join(docsPath, 'private', 'specialists');

3. Enhanced Docker Entrypoint Script

# Create directory structure
mkdir -p "$WORKSPACES_PATH/workspace/docs/.private/specialists"

# Copy specialists from image to workspace on first run
if [ -d "/app/docs/private/specialists" ]; then
  cp -r /app/docs/private/specialists/* "$WORKSPACES_PATH/workspace/docs/.private/specialists/"
fi

4. Dynamic Path Resolution for HTML Static Files in server/index.js

// Determine correct html-static path based on environment
let htmlStaticPath;
if (process.env.USE_DOCKER_WORKSPACE === 'true' || process.env.RUNNING_IN_DOCKER === 'true') {
  const workspaceHtmlPath = '/app/workspaces/workspace/html-static';
  const standardHtmlPath = path.join(__dirname, '../../html-static');
  
  // Check which path exists
  if (fs.existsSync(workspaceHtmlPath)) {
    htmlStaticPath = workspaceHtmlPath;
  } else {
    htmlStaticPath = standardHtmlPath;
  }
} else {
  htmlStaticPath = path.join(__dirname, '../../html-static');
}

app.use('/api/docs-content', express.static(htmlStaticPath));

Key Learnings

  1. Always verify actual paths in production - SSH into containers to check real directory structure
  2. Workspace volumes need initialization - Content must be copied from image to persistent volumes
  3. Support multiple path configurations - Code should check multiple possible locations
  4. Hidden directories in Docker - Private content may be intentionally hidden with dot prefix
  5. Debug with actual environment - Local testing may not reveal Docker-specific path issues
  6. Apply same path fixes everywhere - If docs are in workspace path, html-static likely is too
  7. Add comprehensive logging - Path resolution logging helps quickly identify issues in production
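
As a concrete illustration of learnings 3 and 7, a minimal sketch of startup path logging. The candidate paths come from the fix above; the helper name and log format are illustrative:

// Log which candidate paths exist at startup so misconfigured deployments are obvious in the logs
const fs = require('fs');

function resolveExistingPath(candidates, label) {
  for (const candidate of candidates) {
    const exists = fs.existsSync(candidate);
    console.log(`[startup] ${label}: ${candidate} -> ${exists ? 'FOUND' : 'missing'}`);
    if (exists) return candidate;
  }
  console.warn(`[startup] ${label}: no candidate found, falling back to ${candidates[candidates.length - 1]}`);
  return candidates[candidates.length - 1];
}

const docsPath = resolveExistingPath(
  ['/app/workspaces/workspace/docs', process.env.DOCS_PATH || '/app/docs'],
  'docs'
);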

Best Practices for Docker Deployments

  • Add comprehensive path debugging on startup
  • Check multiple possible locations for critical files
  • Initialize workspace volumes with required content
  • Document the expected vs actual directory structure
  • Test with the exact deployment environment (Sliplane, etc.)

Semantic Versioning Implementation

Date: 2025-08-11

The Challenge

Implementing semantic versioning for Docker builds while maintaining simplicity for local development and providing CI/CD compatibility.

What Worked Well

  • Single Source of Truth: Using a VERSION file at project root eliminated version drift
  • Automatic Synchronization: The version script updates both VERSION and package.json automatically
  • Multiple Tag Strategy: Creating multiple Docker tags (1.0.0, 1.0, 1, latest) enables flexible deployment strategies
  • Build Metadata: Including git commit, branch, and timestamp in development builds aids debugging
  • UI Integration: Version displays in Settings > Version tab by reading from package.json

Key Implementation Details

Version Management Script

# Simple commands for all version operations
./scripts/version.sh patch    # 1.0.0 -> 1.0.1
./scripts/version.sh minor    # 1.0.0 -> 1.1.0
./scripts/version.sh major    # 1.0.0 -> 2.0.0

Docker Build Integration

  • Enhanced docker-build.sh automatically creates semantic version tags
  • Development builds get unique tags with timestamps: 1.0.0-dev.20240111.abc123
  • Production builds create full tag hierarchy: exact, major.minor, major, latest
  • Build info saved to .last-build.json for reference

Package.json Synchronization

# Automatic update in version.sh (GNU sed shown; macOS/BSD sed needs: sed -i '' ...)
sed -i "s/\"version\": \".*\"/\"version\": \"$NEW_VERSION\"/" package.json

Lessons Learned

  1. Keep It Simple: Local builds should remain simple - complexity belongs in CI/CD
  2. Automate Sync: Never rely on manual version synchronization between files
  3. Tag Strategically: Multiple Docker tags allow flexible rollback strategies
  4. Display Everywhere: Show version in UI, health endpoints, and Docker labels
  5. Git Integration: Optional git tagging in version script maintains release history
  6. Hybrid Approach: Support both local and GitHub Actions builds without conflict

Best Practices Discovered

  • Always reset version to stable after testing (e.g., back to 1.0.0)
  • Include version in health endpoint for runtime verification (see the sketch after this list)
  • Use .last-build.json to track what was built when
  • Development builds should indicate "dirty" git state
  • Branch-based tags help identify feature builds
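
A minimal Express sketch of the health-endpoint idea above, with the version read once from package.json (kept in sync by version.sh). The route path and response shape are illustrative, not the project's actual endpoint:

// health endpoint sketch - report the running version for runtime verification
const express = require('express');
const { version } = require('./package.json'); // single source of truth after version.sh sync

const app = express();

app.get('/api/health', (req, res) => {
  res.json({
    status: 'ok',
    version,                  // e.g. "1.0.0"
    uptime: process.uptime(), // seconds since the process started
  });
});

app.listen(3000, () => console.log(`Health endpoint ready (v${version})`));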

Technical Patterns

# Version file as single source
VERSION=$(cat VERSION)

# Multiple tag creation
docker build -t app:$VERSION -t app:latest -t app:$(echo $VERSION | cut -d. -f1-2) .

# Build metadata for debugging
BUILD_DATE=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
GIT_COMMIT=$(git rev-parse --short HEAD)

Future Improvements

  • Consider semantic-release for fully automated versioning
  • Add changelog generation from commit messages
  • Implement version constraints for dependencies
  • Add pre-commit hooks to verify version consistency

UI/UX Development

Navigation System Implementation

Date: 2025-01-05

What Worked Well

  • Reusable Navigation Component: Created a single navigation overlay system that could be easily replicated across all mockups with minimal changes
  • Slide-in Animation: The right-side slide-in menu pattern provided smooth, modern interactions
  • Active State Management: Clear visual indicators for current page helped users understand their location
  • Coming Soon Pattern: Using alerts for unfinished features set clear expectations while maintaining navigation structure

Key Learnings

  1. CSS Organization: Keeping navigation styles in a dedicated section made it easier to maintain consistency
  2. Escape Key Support: Adding keyboard navigation (ESC to close) significantly improved usability
  3. Mobile-First Responsive: Ensuring the navigation menu takes full width on mobile devices prevented layout issues
  4. Stop Propagation: Using event.stopPropagation() on the menu container prevented accidental closes when clicking inside

Technical Patterns

// Effective pattern for navigation toggle
function openNavMenu() {
    navOverlay.classList.add('active');
    document.body.style.overflow = 'hidden'; // Prevent background scrolling
}
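
The matching close path, including the Escape-key and stopPropagation behaviours described above; navOverlay follows the snippet above, while navMenu (the inner menu container) is an assumed element name:

function closeNavMenu() {
    navOverlay.classList.remove('active');
    document.body.style.overflow = ''; // Restore background scrolling
}

// ESC to close significantly improves usability
document.addEventListener('keydown', (e) => {
    if (e.key === 'Escape') closeNavMenu();
});

// Clicking the dimmed overlay closes; clicks inside the menu don't bubble up and close it
navOverlay.addEventListener('click', closeNavMenu);
navMenu.addEventListener('click', (e) => e.stopPropagation());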

Mockup Architecture

Date: 2025-01-05

What Worked Well

  • Phosphor-Style Icons: Emoji-backed phosphor-icon classes provided consistent, scalable icons without external dependencies
  • Status Badges: Visual indicators (New, Soon) helped communicate feature availability
  • Gradient Headers: Linear gradients created visual hierarchy and brand consistency

Challenges & Solutions

  • String Replacement in Large Files: When editing large HTML files, finding exact strings for replacement was challenging
    • Solution: Use more targeted searches and consider breaking large files into components
  • Cross-File Consistency: Maintaining consistent navigation across multiple mockup files
    • Solution: Create a standard navigation template that can be copied with minimal modifications

File Upload and Conversion System

Date: 2025-01-08

The Challenge

Implementing file upload with automatic document conversion in a React/Express application where:

  • Files need to be uploaded via multipart/form-data
  • Documents (PDF, Word, Excel) need to be converted to Markdown
  • Project paths are encoded with dashes but contain special characters like dots
  • File browser and upload must use consistent path resolution

Critical Issue: FormData and Content-Type Headers

Problem: The authenticatedFetch utility was setting Content-Type: 'application/json' for all requests, which broke multipart/form-data uploads.

Why it Failed:

  • Multer needs the browser to set Content-Type: multipart/form-data; boundary=----WebKitFormBoundary...
  • Our code was forcing Content-Type: application/json
  • Result: 400 Bad Request or 413 Payload Too Large errors

Solution:

// In authenticatedFetch
const isFormData = options.body instanceof FormData;
if (!isFormData) {
  defaultHeaders['Content-Type'] = 'application/json';
}
// Let browser set Content-Type with boundary for FormData
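
For context, a self-contained sketch of how an authenticatedFetch wrapper might apply this check. The token handling, storage key, and header names are assumptions; only the FormData branch reflects the actual fix:

async function authenticatedFetch(url, options = {}) {
  const token = localStorage.getItem('authToken'); // assumption: bearer token kept in localStorage
  const isFormData = options.body instanceof FormData;

  const headers = {
    ...(token ? { Authorization: `Bearer ${token}` } : {}),
    // Only set Content-Type for non-FormData bodies;
    // the browser adds multipart/form-data with its boundary itself
    ...(isFormData ? {} : { 'Content-Type': 'application/json' }),
    ...options.headers,
  };

  return fetch(url, { ...options, headers });
}

// Usage: uploads pass a FormData body and keep the browser-generated Content-Type
// const form = new FormData();
// form.append('file', fileInput.files[0]);
// await authenticatedFetch('/api/files/upload', { method: 'POST', body: form });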

Path Encoding/Decoding Issues

Problem: Project names are encoded by replacing path separators and other special characters with -, so dots cannot be recovered when decoding:

  • Original: /Users/lindsaysmith/Documents/lambda1.nosync/sasha
  • Encoded: -Users-lindsaysmith-Documents-lambda1-nosync-sasha (dot lost!)
  • Decoded: /Users/lindsaysmith/Documents/lambda1/nosync/sasha (wrong!)

Solution: Use extractProjectDirectory from Claude's JSONL files to get the actual path:

const { extractProjectDirectory } = await import('../projects.js');
workspacePath = await extractProjectDirectory(projectName);
// This reads the actual cwd from session files, preserving special characters

Middleware Ordering

Problem: Express body parsers were interfering with multer's multipart parsing.

Solution: Mount file upload routes BEFORE body parsers:

app.use('/api', filesRoutes);  // File routes with multer
app.use(express.json());       // JSON body parser comes after

Document Conversion API

Learning: The @knowcode/convert-to-markdown package uses:

  • converter.pdf.toMarkdown() not pdfToMarkdown()
  • converter.word.toMarkdown() for Word documents
  • converter.excel.toMarkdown() for Excel files
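
A hedged sketch of how these calls might be wired together. The import style, converter object, and extension handling are assumptions; only the pdf/word/excel toMarkdown() method names come from the notes above:

// Assumption: the package exposes a converter object like this; adjust to the real API
import converter from '@knowcode/convert-to-markdown';
import path from 'path';

async function convertToMarkdown(filePath) {
  const ext = path.extname(filePath).toLowerCase();
  try {
    if (ext === '.pdf') return await converter.pdf.toMarkdown(filePath);
    if (ext === '.docx' || ext === '.doc') return await converter.word.toMarkdown(filePath);
    if (ext === '.xlsx' || ext === '.xls') return await converter.excel.toMarkdown(filePath);
    return null; // unsupported type - caller falls back to raw text
  } catch (err) {
    // Pattern 4 below: handle conversion errors gracefully with fallback text
    console.error('Conversion failed, using fallback text:', err.message);
    return `> Conversion of ${path.basename(filePath)} failed.`;
  }
}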

Key Implementation Patterns

  1. Always check if body is FormData before setting Content-Type
  2. Use actual project paths from JSONL, not decoded names
  3. Mount multer routes before body parsers
  4. Handle conversion errors gracefully with fallback text

File System Browser Design

Security-First Approach

Date: 2025-01-05

Key Insights

  1. Visual Security Indicators: Color-coding storage types (local, remote, cloud) immediately communicates security context
  2. Permission Warnings: Modal confirmations for write access changes prevent accidental security vulnerabilities
  3. Read-Only by Default: Starting with restrictive permissions and requiring explicit user action for write access

UI Patterns That Worked

  • Storage Type Icons: Using distinct icons and colors for different storage types
    • Local: Green with hard drive icon
    • Remote: Blue with server icon
    • Cloud: Purple with cloud icon
  • Checkbox Confirmation: Requiring users to check "I understand the risks" before enabling write access
  • 15-Second Cooldown: Preventing hasty decisions by adding a delay before confirmation
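
A minimal sketch of the checkbox-plus-cooldown confirmation, assuming a checkbox and confirm button with the IDs shown (all names illustrative):

const riskCheckbox = document.getElementById('understandRisks');
const confirmBtn = document.getElementById('confirmWriteAccess');
const COOLDOWN_SECONDS = 15;
let cooldownTimer = null;

riskCheckbox.addEventListener('change', () => {
  clearInterval(cooldownTimer);
  confirmBtn.disabled = true;
  if (!riskCheckbox.checked) {
    confirmBtn.textContent = 'Enable write access';
    return;
  }
  // Count down before the confirm button becomes clickable
  let remaining = COOLDOWN_SECONDS;
  confirmBtn.textContent = `Enable write access (${remaining}s)`;
  cooldownTimer = setInterval(() => {
    remaining -= 1;
    if (remaining <= 0) {
      clearInterval(cooldownTimer);
      confirmBtn.disabled = false;
      confirmBtn.textContent = 'Enable write access';
    } else {
      confirmBtn.textContent = `Enable write access (${remaining}s)`;
    }
  }, 1000);
});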

Local LLM Administration

Dashboard Design Principles

Date: 2025-01-05

Successful Patterns

  1. Tabbed Interface: Organizing complex admin functions into logical tabs improved discoverability
  2. Real-Time Status: Live indicators for model health and resource usage
  3. Action Buttons: Clear, contextual actions (Start, Stop, Update) for each model
  4. Resource Visualization: Progress bars and charts made resource usage immediately understandable

Technical Implementation

  • Model Cards: Displaying each model as a card with status, specs, and actions
  • Configuration Sections: Grouping related settings (model configs, resource limits, security)
  • Alert System: Combining visual indicators with detailed log information

Development Workflow

Todo List Management

Date: 2025-01-05

Best Practices

  1. Granular Tasks: Breaking down complex features into specific, actionable items
  2. Real-Time Updates: Marking tasks as completed immediately after finishing
  3. Priority Levels: Using high/medium/low priorities to guide work order
  4. Status Tracking: Clear in_progress markers to show current focus

Panel System Architecture

Unified Panel Management Implementation

Date: 2025-08-05

Critical DOM Manipulation Lessons

The innerHTML Timing Problem
When implementing a unified panel system that restructures existing DOM elements, we encountered a critical timing issue:

// ❌ PROBLEMATIC PATTERN - Immediate querySelector after innerHTML replacement
panel.element.innerHTML = `<div class="new-structure">${existingContent}</div>`;
const closeBtn = panel.element.querySelector('#closeBtn'); // May return null!

Root Cause: The browser needs time to parse and construct new DOM elements after innerHTML assignment. Immediate querySelector operations may fail because elements aren't fully available yet.

What We Learned

  1. Event Handler Lifecycle: When you replace innerHTML, all existing event listeners on child elements are destroyed
  2. DOM Construction Timing: New elements created via innerHTML may not be immediately queryable
  3. Selector Strategy: Relying on a single selector strategy is fragile - multiple fallback approaches are essential

Solution Patterns That Work

Multi-Strategy Close Button Detection

attachCloseHandler(panel, id) {
    const strategies = [
        // Strategy 1: Specific IDs for known panels
        () => panel.element.querySelector(specificSelectors[id]),
        // Strategy 2: Generic patterns
        () => panel.element.querySelector('[id*="close"], .panel-close'),
        // Strategy 3: Event delegation fallback
        () => this.addEventDelegation(panel, id)
    ];
    
    const tryAttachHandler = () => {
        for (let strategy of strategies) {
            const result = strategy();
            if (result) return true;
        }
        return false;
    };
    
    // Try immediately, then retry after DOM update
    if (!tryAttachHandler()) {
        requestAnimationFrame(() => tryAttachHandler());
    }
}

Event Delegation as Ultimate Fallback

// When specific button detection fails, use event delegation
panel.element.addEventListener('click', (e) => {
    if (e.target.closest('[id*="close"]')) {
        this.closePanel(id);
    }
});

Key Insights

  1. Defensive Programming: Always have multiple strategies for finding DOM elements after restructuring
  2. DOM Timing: Use requestAnimationFrame() or setTimeout() when immediate element access fails
  3. Event Delegation: Provides reliable fallback when specific element detection fails
  4. Debugging: Console logging successful handler attachment helps diagnose issues

What This Prevented

  • Silent Failures: Close buttons appearing functional but not working
  • Inconsistent Behavior: Some panels working while others don't
  • User Frustration: Broken interactions in an otherwise polished interface

Future Applications

This pattern applies to any system that:

  • Dynamically restructures existing DOM elements
  • Needs to reattach event handlers after DOM manipulation
  • Implements unified behavior across heterogeneous existing components

Bottom Line: When building systems that reshape existing DOM, assume your first attempt to find elements will fail and build accordingly.

JavaScript Error Debugging in Complex HTML Files

Date: 2025-08-05

The Silent Failure Problem

Critical Issue: A single null reference error in early JavaScript code can silently prevent ALL subsequent JavaScript from executing, even in separate logical sections.

Scenario: Implementing a unified panel system, but close buttons weren't working and no console output appeared.

Root Cause Analysis Process

Step 1: No Console Output = Script Not Running

  • When NO console logs appear, the issue isn't logic - it's script execution failure
  • Don't debug individual features; debug whether JavaScript is running at all

Step 2: Error Location Strategy

// Add basic execution test at script start
console.log('🟢 JavaScript is running - First script tag loaded');

Step 3: The Actual Error

Uncaught TypeError: Cannot read properties of null (reading 'addEventListener')
at chat-interface:2073:23

Root Cause: Code assumed an element existed without checking:

// ❌ DANGEROUS - Will crash if element doesn't exist
const modelSelector = document.getElementById('modelSelector');
modelSelector.addEventListener('click', () => {...}); // Crashes if null

Solution Patterns

Defensive Element Access

const modelSelector = document.getElementById('modelSelector');
const modelDropdown = document.getElementById('modelDropdown');

if (modelSelector && modelDropdown) {
    // Only run if both elements exist
    modelSelector.addEventListener('click', () => {...});
} else {
    console.log('⚠️ Model selector elements not found - skipping functionality');
}

Variable Declaration Order Matters

// ❌ WRONG ORDER - Variables used before declaration
const panelManager = new PanelManager();
panelManager.registerPanel('panel', {
    onClose: () => chatMessages.scrollTop = 0 // chatMessages not declared yet!
});
const chatMessages = document.getElementById('chatMessages');

// ✅ CORRECT ORDER - Variables declared first
const chatMessages = document.getElementById('chatMessages');
const panelManager = new PanelManager();
panelManager.registerPanel('panel', {
    onClose: () => chatMessages.scrollTop = 0 // chatMessages exists
});

Debugging Methodology

  1. Execution Test: Add console.log() at script start to verify JavaScript runs
  2. Error Location: Use browser console to identify exact line and error type
  3. Null Checks: Add defensive checks for ALL getElementById calls
  4. Incremental Testing: Test each major section with console logs
  5. Variable Order: Ensure all variables are declared before use

Key Insights

  1. Single Point of Failure: One null reference can break an entire application
  2. Error Propagation: JavaScript errors don't stay contained to their logical sections
  3. Console Silence: No logs = script execution failure, not logic problems
  4. Element Assumptions: Never assume DOM elements exist - always check
  5. Order Dependencies: Variable declaration order affects runtime behavior

What This Prevented

  • Hours of Wrong Debugging: Would have spent time debugging panel logic instead of script execution
  • Feature-Specific Fixes: Would have tried to fix panels individually instead of the root cause
  • Silent Production Failures: This type of error could cause complete UI failure in production

Prevention Checklist

  • Add execution verification logs at script start
  • Null check ALL getElementById() calls
  • Declare variables before use in callbacks
  • Test JavaScript execution before debugging features
  • Use browser console to identify exact error locations

Lesson: Always verify JavaScript is executing before debugging application logic. One missing element check can silently break everything.

Icon System Consistency and Missing Definitions

Date: 2025-08-05

The Hidden UI Failure Problem

Critical Issue: Using icon class names in HTML without corresponding CSS definitions creates invisible UI elements that appear to work in development but fail silently in production.

Scenario: Navigation menus and UI elements showing blank spaces instead of icons, making the interface appear broken or incomplete.

Root Cause Analysis

The Icon Definition Gap

<!-- HTML uses the class -->
<span class="phosphor-icon chart-line"></span>

<!-- But CSS definition is missing -->
/* .phosphor-icon.chart-line::before { content: '📈'; } ← NOT DEFINED */

Result: The element exists in the DOM but displays nothing, creating invisible buttons and confusing UX.

What We Discovered

Massive Scale of the Problem:

  • account-settings.html: 16 missing icon definitions
  • activity-log.html: 13 missing icon definitions
  • Total: 29 missing icons across just 2 pages

Common Missing Icons:

  • Navigation: chat-circle, rocket-launch, book-open, folder-open
  • System: chart-line, package, cpu, puzzle-piece
  • UI: house, moon, device-mobile, gear
  • Controls: bars-three (hamburger menu)

Detection Methodology

Step 1: Audit Icon Usage vs Definitions

# Find all phosphor-icon classes used in HTML
grep -oE "phosphor-icon [a-z-]+" file.html

# Find all phosphor-icon definitions in CSS
grep -E "\.phosphor-icon\.[a-z-]+::before" file.html

# Compare lists to find missing definitions

Step 2: Visual Inspection Strategy

  • Look for blank spaces where icons should appear
  • Check navigation menus for missing visual elements
  • Test hover states on buttons that should have icons

Step 3: Systematic Validation

/* Audit pattern - ensure every used class has a definition */
.phosphor-icon.CLASSNAME::before { content: 'EMOJI'; font-size: inherit; }

Solution Patterns

Complete Icon System Audit

/* Navigation icons */
.phosphor-icon.chat-circle::before { content: '💬'; font-size: inherit; }
.phosphor-icon.rocket-launch::before { content: '🚀'; font-size: inherit; }
.phosphor-icon.book-open::before { content: '📖'; font-size: inherit; }

/* System icons */
.phosphor-icon.chart-line::before { content: '📈'; font-size: inherit; }
.phosphor-icon.package::before { content: '📦'; font-size: inherit; }
.phosphor-icon.cpu::before { content: '🖥️'; font-size: inherit; }

/* Control icons */
.phosphor-icon.bars-three::before { content: '☰'; font-size: inherit; }

Eliminate Direct Unicode Usage

<!-- ❌ INCONSISTENT - Direct unicode -->
<button><span>☰</span></button>

<!-- ✅ CONSISTENT - Phosphor icon system -->
<button><span class="phosphor-icon bars-three"></span></button>

Prevention Strategies

  1. Icon System Documentation: Maintain a complete list of available icons and their class names
  2. Development Checklist: Verify all icon classes have corresponding CSS definitions
  3. Visual Testing: Test all pages to ensure no blank icon spaces exist
  4. Automated Validation: Create scripts to detect unused classes or missing definitions (see the sketch after this list)
  5. Consistent Implementation: Never mix direct unicode with icon systems
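
A sketch of the automated check from item 4 above: compare the icon classes used in the markup against the ::before definitions in the same file. The regexes assume the inline-CSS convention shown under Solution Patterns below; the file name is illustrative:

// audit-icons.js - report phosphor-icon classes used without a matching CSS definition
const fs = require('fs');

const file = process.argv[2] || 'account-settings.html';
const html = fs.readFileSync(file, 'utf8');

// Classes used in markup: <span class="phosphor-icon chart-line">
const used = new Set(
  [...html.matchAll(/phosphor-icon\s+([a-z0-9-]+)/g)].map((m) => m[1])
);

// Classes defined in the (inline) CSS: .phosphor-icon.chart-line::before { ... }
const defined = new Set(
  [...html.matchAll(/\.phosphor-icon\.([a-z0-9-]+)::before/g)].map((m) => m[1])
);

const missing = [...used].filter((name) => !defined.has(name));
console.log(`Used: ${used.size}, defined: ${defined.size}, missing: ${missing.length}`);
missing.forEach((name) => console.log(`  MISSING: .phosphor-icon.${name}::before`));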

Key Insights

  1. Silent Failures: Missing icon definitions don't throw errors - they just show nothing
  2. Scale Impact: Small oversights compound across multiple pages
  3. User Experience: Blank icons make interfaces appear broken or unprofessional
  4. Maintenance Debt: Inconsistent icon systems create ongoing maintenance issues
  5. Design System Integrity: Complete icon coverage is essential for professional UI

What This Prevented

  • Professional Appearance Issues: Navigation menus with missing icons
  • User Confusion: Buttons that appear non-functional due to missing visual cues
  • Inconsistent Branding: Mixed unicode and icon system usage
  • Future Scalability Problems: Incomplete icon systems become harder to maintain

Implementation Checklist

  • Audit all pages for icon class usage vs CSS definitions
  • Create complete icon definition library for the design system
  • Replace all direct unicode characters with proper icon classes
  • Test visual appearance of all interactive elements
  • Document available icons and their proper class names
  • Establish icon system usage guidelines for future development

Critical Learning: Icon systems require complete coverage - partial implementations create invisible UI failures that silently degrade user experience. Every icon class used in HTML must have a corresponding CSS definition, and mixing unicode with icon systems creates maintenance nightmares.

Best Practice: Treat icon systems like any other dependency - incomplete implementations are broken implementations.

Documentation Standards

What Works

  1. No Metadata Headers: Keeping markdown documents clean without status/version headers (except for special cases)
  2. Image Organization: Storing images in _images/ directories relative to markdown files
  3. Descriptive Alt Tags: Ensuring all images have meaningful alt text for accessibility
  4. Color Samples: Showing visual samples when hex colors are specified

Implementation Insights

HTML Mockup Best Practices

  1. Inline Styles First: Starting with inline styles for rapid prototyping, then organizing into structured CSS
  2. Progressive Enhancement: Building core functionality first, then adding animations and polish
  3. Consistent Spacing: Using CSS variables for consistent spacing and sizing across components
  4. Hover States: Adding subtle hover effects to all interactive elements

Cross-Browser Compatibility

  • CSS Variables: Using custom properties for theming made dark mode preparation easier
  • Flexbox/Grid: Modern layout systems simplified responsive design
  • Transition Timing: Consistent timing functions created cohesive animations

Project Management

Communication Patterns

  1. Clear Status Updates: Regular progress updates with specific accomplishments
  2. Visual Examples: Including screenshots or detailed descriptions of UI changes
  3. Incremental Delivery: Completing and demonstrating features incrementally

File Organization

mockups/
├── index.html           # Central navigation hub
├── chat-interface.html  # Core user experience
├── *-admin.html        # Administrative interfaces
└── *.html              # Feature-specific mockups

🔮 Future Considerations

Scalability

  1. Component Library: Consider creating reusable components for common UI patterns
  2. Style Guide: Develop a comprehensive style guide for consistent design language
  3. Template System: Create templates for new mockup pages to ensure consistency

Performance

  1. Lazy Loading: For production, implement lazy loading for heavy dashboard components
  2. Code Splitting: Separate navigation code into its own module
  3. Icon Optimization: Consider using an icon font or SVG sprite for better performance

Accessibility

  1. ARIA Labels: Add proper ARIA labels to all interactive elements
  2. Keyboard Navigation: Ensure all features are keyboard accessible
  3. Screen Reader Testing: Validate mockups work well with screen readers

ReactMarkdown Code Block Styling Issues

Date: 2025-01-09

The Black Border Problem

Critical Issue: Code blocks displayed with harsh black borders in the UI, even after updating component styling, because ReactMarkdown was wrapping the custom code component in a <pre> tag with default browser/Tailwind Typography styling.

Symptoms:

  • Code blocks showing black borders despite custom gradient backgrounds
  • Browser inspection revealed: <pre><div class="custom-styled-code">...</div></pre>
  • Changes to component styling had no effect on the outer border

Root Cause Analysis

The Double-Wrapping Problem:

  1. ReactMarkdown automatically wraps code blocks in <pre> tags
  2. Tailwind Typography plugin (@tailwindcss/typography) applies default styles to .prose pre
  3. Browser default styles for <pre> tags include borders
  4. Our custom code component was wrapped inside, not replacing, the <pre> tag

Discovery Process:

<!-- What we expected -->
<div class="bg-gradient-to-br from-slate-50...">
  <code>...</code>
</div>

<!-- What we got -->
<pre> <!-- This added unwanted styling! -->
  <div class="bg-gradient-to-br from-slate-50...">
    <code>...</code>
  </div>
</pre>

Solution Implementation

Override the Pre Component in ReactMarkdown:

// In ReactMarkdown components prop
pre: ({children}) => {
  // Return just the children (our custom code component)
  // This prevents ReactMarkdown from wrapping in <pre>
  return <>{children}</>;
}

Add CSS Overrides for Safety:

/* Remove default pre styling from prose */
.prose pre {
  background-color: transparent !important;
  border: none !important;
  padding: 0 !important;
  margin: 0 !important;
}

/* Ensure no borders on any pre tags */
pre {
  border: none !important;
  background: transparent !important;
}

Key Insights

  1. Component Wrapping: ReactMarkdown components don't replace elements, they wrap them
  2. Tailwind Typography: The prose class applies opinionated styles that can conflict with custom designs
  3. Invalid Tailwind Classes: Using non-existent Tailwind classes (like slate-850) fails silently
  4. Dark Mode Detection: Ensure parent elements have the dark class for dark mode styles to apply (see the sketch after this list)
  5. Browser Cache: Hard refresh (Cmd+Shift+R) may be needed after CSS changes
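
For item 4, a minimal sketch of putting the dark class on the document root, assuming Tailwind's darkMode: 'class' strategy; the localStorage key is an assumption:

// Apply the `dark` class on <html> so dark: variants take effect
const prefersDark =
  localStorage.getItem('theme') === 'dark' || // assumption: theme persisted under this key
  (!localStorage.getItem('theme') &&
    window.matchMedia('(prefers-color-scheme: dark)').matches);

document.documentElement.classList.toggle('dark', prefersDark);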

Debugging Methodology

  1. Inspect Actual HTML: Use browser DevTools to see the real DOM structure
  2. Check Class Names: Verify Tailwind classes actually exist (the slate scale tops out at 950; there is no slate-850)
  3. Trace Parent Wrappers: Look for unexpected parent elements adding styles
  4. Test Component Isolation: Check if the component works outside of ReactMarkdown
  5. Verify Dark Mode Context: Ensure the dark class is on document.documentElement

Prevention Strategies

  1. Always Override Both Pre and Code: When customizing code blocks in ReactMarkdown
  2. Test with Browser Inspector: Don't just rely on component code
  3. Use Valid Tailwind Classes: Reference the Tailwind documentation for valid values
  4. Add Defensive CSS: Include fallback styles to override unwanted defaults
  5. Document Component Structure: Note when libraries wrap vs. replace elements

What This Prevented

  • Poor User Experience: Harsh black borders made the UI feel unpolished
  • Inconsistent Theming: Code blocks didn't match the overall design aesthetic
  • Light/Dark Mode Issues: Borders were especially jarring in light mode
  • Brand Consistency: The harsh styling conflicted with the soft, modern design

Technical Pattern for Future Use

// Complete ReactMarkdown code block customization pattern
<ReactMarkdown
  components={{
    // Override pre to prevent wrapper
    pre: ({children}) => <>{children}</>,
    // Custom code component with full styling control
    code: ({inline, className, children, ...props}) => {
      if (inline) {
        return <code className="custom-inline-code">{children}</code>;
      }
      return (
        <div className="custom-code-block">
          {/* Your fully controlled code block UI */}
        </div>
      );
    }
  }}
>
  {content}
</ReactMarkdown>

Lesson: When third-party libraries generate HTML, always check the actual DOM output, not just your component code. Default styles from libraries and browsers can override your carefully crafted designs in unexpected ways.

Chat Message State Management Issues

Date: 2025-08-09

The Disappearing User Messages Bug

Critical Issue: User messages would flash briefly then disappear from the chat history when session messages were loaded.

Symptoms:

  • User sends a message → appears briefly in chat
  • Session updates trigger → message disappears
  • Messages lost before being saved to session

Root Cause Analysis

The Problem Flow:

  1. User message added to chatMessages state
  2. Session messages loaded from API into sessionMessages
  3. useEffect watching sessionMessages triggers
  4. BUG: Completely overwrites chatMessages with only converted session messages
  5. New local messages that weren't saved yet are lost

The Faulty Code:

// ❌ WRONG - Overwrites everything
useEffect(() => {
  if (sessionMessages.length > 0) {
    setChatMessages(convertedMessages); // Loses local messages!
  }
}, [convertedMessages, sessionMessages]);

Solution Implementation

Preserve Local Messages:

// Merge session messages with newer local messages
setChatMessages(prev => {
  if (convertedMessages.length > 0) {
    const lastSessionTime = new Date(
      convertedMessages[convertedMessages.length - 1].timestamp
    ).getTime();
    
    // Keep messages newer than last session message
    const newLocalMessages = prev.filter(msg => {
      const msgTime = new Date(msg.timestamp).getTime();
      return msgTime > lastSessionTime && 
             !convertedMessages.some(cm => 
               cm.timestamp === msg.timestamp && 
               cm.content === msg.content
             );
    });
    
    return [...convertedMessages, ...newLocalMessages];
  }
  return convertedMessages;
});

Key Insights

  1. State Synchronization: When merging state from multiple sources, always consider what should be preserved
  2. Timestamp Ordering: Use timestamps to determine which messages are newer
  3. Duplicate Prevention: Check both timestamp and content to avoid duplicates
  4. Local-First: Preserve local changes until they're confirmed saved

Inline Code vs Code Block Rendering

Date: 2025-08-09

The Problem

Issue: Inline code (like CONFIG_DIR) was being rendered as full code blocks with borders, headers, and copy buttons instead of simple highlighted text within sentences.

Root Cause

react-markdown v10 Breaking Change: The inline parameter is no longer reliably passed to the code component, making the original detection logic fail:

// โŒ This check always failed in v10
if (inline) {
  return <InlineCode />;
}

Solution: Content-Based Detection

Smart Detection Logic:

code: ({node, inline, className, children, ...props}) => {
  // Analyze content to determine if it's inline
  const codeString = String(children).replace(/\n$/, '');
  const hasNewlines = codeString.includes('\n');
  const hasLanguageClass = className?.startsWith('language-');
  const isInlineCode = !hasNewlines && !hasLanguageClass;
  
  if (isInlineCode) {
    // Simple inline highlighting
    return <code className="px-1.5 py-0.5 bg-blue-50 ...">{children}</code>;
  }
  
  // Full code block UI
  return <CodeBlock>...</CodeBlock>;
}

Detection Rules

  1. Inline Code Characteristics:

    • No newlines in content
    • No language-* className
    • Usually short snippets
  2. Code Block Characteristics:

    • Contains newlines (multi-line)
    • Has language-* className
    • Typically longer code samples

Key Learnings

  1. Library Version Changes: Always check breaking changes when libraries update
  2. Fallback Detection: Don't rely on single parameters - use multiple signals
  3. Content Analysis: Sometimes analyzing the content itself is more reliable than metadata
  4. User Experience: Different content types need different UI treatments

Critical Learning: When third-party libraries change their API, implement robust detection that doesn't rely on single parameters. Use multiple signals and content analysis for reliable feature detection.

JSX Structure and Build Errors

Date: 2025-08-09

The "Unterminated Regular Expression" JSX Error

Critical Issue: JSX parsing errors can manifest as cryptic "Unterminated regular expression" errors when there are structural issues with React components, particularly with mismatched tags or improper nesting of conditional renders.

Symptoms:

  • Build error: ERROR: Unterminated regular expression at a closing div tag
  • Error points to innocent-looking JSX like </div>
  • The actual issue is elsewhere in the component structure

Root Cause Analysis

The Nested Conditional Problem:
When implementing expandable tool messages with conditional rendering, improper nesting of JSX elements within conditionals created invalid structures:

// ❌ PROBLEMATIC - Missing proper nesting
{expandedTools.has(message.toolId) && (
  <div className="expanded-content">
  {/* Another conditional started without proper closure */}
  {message.toolInput && (() => {

Why It Failed:

  1. Opening a div inside a conditional render
  2. Immediately starting another conditional without proper JSX structure
  3. Mismatched opening and closing tags across conditional boundaries
  4. Parser interpreting malformed JSX as regular expressions

Key Discovery Process

  1. Error Misleading: "Unterminated regular expression" doesn't mean regex - it means JSX parsing failed
  2. Count Tags: Systematically counted opening vs closing divs (found 56 opening, 54 closing)
  3. Trace Conditionals: Each conditional render must have properly balanced JSX
  4. Check Nesting: Ensure conditionals inside JSX elements are properly wrapped

Solution Patterns

Proper Conditional Nesting:

{expandedTools.has(message.toolId) && (
  <div className="expanded-content">
    {/* Properly nested content */}
    {message.toolInput && (() => {
      // Content here
    })()}
  </div>
)}

Validate Structure Before Complex Changes:

# Count opening and closing tags
sed -n '327,1121p' file.jsx | grep -c '<div'
sed -n '327,1121p' file.jsx | grep -c '</div>'

Debugging Methodology

  1. Build Error Location: Note the line number but don't trust it - the real issue is often earlier
  2. Count Tags: Use grep/sed to count opening and closing tags in the affected section
  3. Trace Ternaries: Map out the complete ternary operator chain structure
  4. Check Conditionals: Verify each conditional render has balanced JSX
  5. Revert and Rebuild: When structure is too broken, revert and carefully reapply changes

Prevention Strategies

  1. Small Incremental Changes: Test build after each structural change
  2. Comment Complex Structures: Add comments showing where conditionals open/close
  3. Use Fragments Properly: Use <>...</> when you need to wrap without adding DOM elements (see the sketch after this list)
  4. Validate After Edits: Run build immediately after complex JSX changes
  5. Keep Backup Points: Commit working versions before major structural changes
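
A small illustration of item 3: wrapping adjacent elements inside a conditional with a fragment so no extra DOM node is added. expandedTools, message.toolId and message.toolInput follow the example above; toolOutput and the class names are illustrative:

{expandedTools.has(message.toolId) && (
  <>
    {/* Fragment keeps the two siblings valid JSX without an extra wrapper div */}
    <div className="tool-input">{message.toolInput}</div>
    <div className="tool-output">{message.toolOutput}</div>
  </>
)}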

What We Learned

  1. Parser Confusion: Invalid JSX structure confuses the parser into thinking it's parsing JavaScript
  2. Error Messages Mislead: "Unterminated regular expression" is a symptom, not the cause
  3. Structure Over Content: Fix structural issues before implementing features
  4. Indentation Matters: Proper indentation helps spot nesting issues
  5. Tool Limitations: AI assistants can struggle with complex JSX structure debugging

Implementation Checklist for Complex JSX

  • Map out the complete conditional structure before coding
  • Test build after each conditional branch addition
  • Count opening and closing tags programmatically
  • Use proper indentation to visualize nesting
  • Add temporary console logs to verify conditional paths
  • Keep the previous working version easily accessible
  • Document the intended structure in comments

Critical Lesson: When you see "Unterminated regular expression" in a JSX file, immediately check for:

  1. Mismatched opening/closing tags
  2. Improper conditional render nesting
  3. Missing closing parentheses in ternary chains
  4. Adjacent JSX elements without wrappers

The error message is telling you the parser got confused, not that you have a regex problem.

Multi-Client Deployment Management System

Date: 2025-08-11

The Challenge: Inefficient Client-Specific Docker Images

Initial Problem: Originally building separate Docker images for each client, leading to:

  • Redundant builds for identical code
  • Storage waste on Docker Hub
  • Inconsistent versions across clients
  • Complex deployment pipeline

User Insight: "Why do we not use the same docker image for clients - shouldnt each image be exactly the same per version?"

This feedback highlighted a fundamental architecture flaw that needed immediate correction.

Solution: Shared Images with Environment Differentiation

Refactored Architecture:

linzoid/sasha-studio:1.0.2  <- Single shared image
    ├── sasha-main (env: COMPANY_NAME=Knowcode)
    ├── hirebest   (env: COMPANY_NAME=HireBest)
    └── acme-corp  (env: COMPANY_NAME=ACME Corp)

Key Implementation Changes:

  1. Removed client-specific Docker tags - eliminated tag_suffix from configurations
  2. Unified build process - one image serves all clients
  3. Environment-based differentiation - clients differ only via Sliplane environment variables (see the sketch after this list)
  4. Shared version management - all clients use same VERSION file
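
A minimal sketch of how the shared image can pick up client-specific settings at runtime. COMPANY_NAME comes from the diagram above and SESSION_SECRET/JWT_SECRET from the security section below; the module shape and default value are assumptions:

// client-config.js - same image for every client, behaviour driven by environment variables
const clientConfig = {
  companyName: process.env.COMPANY_NAME || 'Sasha Studio',
  // Secrets are injected per client by Sliplane, never baked into the image
  sessionSecret: process.env.SESSION_SECRET,
  jwtSecret: process.env.JWT_SECRET,
};

if (!clientConfig.sessionSecret || !clientConfig.jwtSecret) {
  throw new Error('SESSION_SECRET and JWT_SECRET must be provided per client');
}

module.exports = clientConfig;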

Security: Auto-Generated Cryptographic Secrets

Problem: Manual secret generation was error-prone and insecure.

Solution: Automated generation using OpenSSL:

# Each client gets unique 256-bit secrets
SESSION_SECRET=$(openssl rand -base64 32)
JWT_SECRET=$(openssl rand -base64 32)

Security Architecture:

  • Session Isolation: Each client has unique session secrets
  • JWT Security: Independent token verification per client
  • Breach Containment: Compromise of one client doesn't affect others
  • Zero Placeholders: Real secrets generated automatically

User Experience: Enhanced Deployment Instructions

Problem: Color escape sequences showing as text (\033[0;34m) instead of actual colors.

Root Cause: Missing -e flag in echo statements prevented interpretation of escape sequences.

Solution:

# ❌ Wrong - shows escape sequences as text
echo "\033[0;34mDeployment starting\033[0m"

# ✅ Correct - shows actual colors
echo -e "\033[0;34mDeployment starting\033[0m"

Enhanced Output Features:

  • Color-coded instructions with proper terminal formatting
  • Step-by-step Sliplane setup guide with exact button names
  • Copy-paste ready environment variables
  • Post-deployment verification checklists

Multi-Client Management CLI

Created comprehensive tooling:

./manage-clients.sh create client-name    # Auto-generates secrets
./deploy-client.sh client-name            # Shared image deployment
./show-setup.sh client-name               # Complete setup guide

Library Functions:

  • lib/common.sh: Secret generation, validation utilities
  • lib/docker.sh: Shared image operations
  • lib/sliplane.sh: Webhook deployment management

Key Technical Insights

  1. Shared Images Are Superior: Build once, deploy many times with environment differentiation
  2. Security Through Automation: Auto-generated secrets eliminate human error
  3. User Experience Matters: Proper terminal formatting significantly improves deployment experience
  4. Documentation Drives Adoption: Step-by-step instructions reduce deployment friction

What This Architecture Enables

Efficiency Gains:

  • 75% reduction in build time: One build instead of per-client builds
  • Reduced Docker Hub storage: Single image replicated vs multiple unique images
  • Guaranteed consistency: All clients run identical code with different config

Security Improvements:

  • Cryptographically unique secrets: 256-bit entropy per client
  • Client isolation: Sessions and tokens cannot cross client boundaries
  • Audit trail: Clear separation of client data and authentication

Operational Benefits:

  • Simple scaling: Add new clients without code changes
  • Version management: Single VERSION file controls all deployments
  • Troubleshooting: Consistent behavior across all client environments

Implementation Patterns for Future Use

Auto-Secret Generation:

generate_secret() {
    local length=${1:-32}
    openssl rand -base64 "$length" | tr -d '\n'
}

# Usage in client creation
SESSION_SECRET=$(generate_secret 32)
JWT_SECRET=$(generate_secret 32)

Color-Coded Terminal Output:

# Define colors once, use everywhere
RED='\033[0;31m'
GREEN='\033[0;32m'
BLUE='\033[0;34m'
NC='\033[0m'

# Always use -e with echo for colors
echo -e "${GREEN}✅ Success${NC}"
echo -e "${RED}❌ Error${NC}"

Shared Docker Image Pattern:

# Build once
docker build -t ${REPO}:${VERSION} .

# Deploy many times with different env
# Client 1: COMPANY_NAME=ClientA
# Client 2: COMPANY_NAME=ClientB
# Client 3: COMPANY_NAME=ClientC

Comprehensive Documentation Created

What This Prevented

  • Operational Inefficiency: Multiple redundant Docker builds
  • Security Vulnerabilities: Weak or placeholder secrets in production
  • User Frustration: Confusing deployment instructions with formatting issues
  • Scaling Problems: Architecture that wouldn't scale to many clients
  • Maintenance Overhead: Managing separate codebases per client

Critical Lessons Learned

  1. Listen to User Feedback: The "why separate images?" question revealed a fundamental flaw
  2. Security Should Be Automatic: Manual secret generation invites mistakes
  3. UI/UX Applies to CLI: Terminal formatting significantly impacts developer experience
  4. Architecture Decisions Compound: Shared images unlock numerous downstream benefits
  5. Document Everything: Comprehensive docs enable team scaling and knowledge transfer

Future Considerations

  • Secret Rotation: Implement automated secret rotation for high-security environments
  • Multi-Environment Support: Extend pattern to staging/production environment separation
  • Monitoring Integration: Add deployment status monitoring and alerting
  • Template System: Create client configuration templates for common scenarios

Bottom Line: The shift from client-specific images to shared images with environment differentiation represents a fundamental architectural improvement that enhances security, efficiency, and user experience while enabling seamless scaling to unlimited clients.

Docker Alpine Linux Child Process Spawning

Date: 2025-08-09

The ENOENT Spawn Error in Alpine Containers

Critical Issue: Claude CLI failed to spawn in Alpine Docker containers with Error: spawn /usr/local/bin/claude ENOENT despite the binary existing and being executable.

Symptoms:

  • spawn command failed with ENOENT errors
  • Binary existed and was executable when checked directly
  • Same code worked outside Docker
  • Multiple attempts with different paths all failed

Root Cause Analysis

The Alpine Linux Difference:

  1. musl libc vs glibc: Alpine uses musl libc instead of glibc
  2. Shell Differences: Alpine's /bin/sh is BusyBox, not bash
  3. Binary Compatibility: Node.js binaries compiled for glibc may not work properly with musl
  4. Spawn Behavior: child_process.spawn behaves differently in Alpine

Discovery Process:

// ❌ All these approaches failed in Alpine
spawn('claude', args)                    // ENOENT
spawn('/usr/local/bin/claude', args)    // ENOENT  
spawn('/usr/local/bin/node', ['/usr/local/bin/claude', ...args]) // ENOENT
spawn('sh', ['-c', 'claude ' + args])   // spawn /bin/sh ENOENT

Solution: Use execFile Instead of Spawn

The Working Solution:

import { spawn, execFile } from 'child_process';

if (process.env.RUNNING_IN_DOCKER === 'true') {
  // In Docker Alpine, use execFile which is more reliable than spawn
  console.log('🐳 Using execFile for Docker Alpine environment');
  
  claudeCommand = '/usr/local/bin/node';
  finalArgs = ['/usr/local/bin/claude', ...args];
  
  // execFile doesn't require a shell and works reliably in Alpine
  claudeProcess = execFile(claudeCommand, finalArgs, spawnOptions);
} else {
  // For non-Docker environments, use regular spawn
  claudeProcess = spawn('claude', args, spawnOptions);
}

Why execFile Works When spawn Fails

  1. No Shell Required: execFile directly executes the binary without shell interpretation
  2. Path Resolution: execFile handles path resolution differently than spawn
  3. Alpine Compatibility: Better compatibility with musl libc and BusyBox environment
  4. Error Handling: More predictable error behavior in minimal environments

Working Directory Path Issues

Secondary Problem: Relative paths like default/workspace caused failures.

Solution: Always use absolute paths in Docker:

let workingDir = cwd || process.cwd();

// If the working directory doesn't start with /, prepend /app/workspaces/
if (!workingDir.startsWith('/')) {
  if (process.env.RUNNING_IN_DOCKER === 'true') {
    workingDir = `/app/workspaces/${workingDir}`;
  } else {
    workingDir = path.resolve(workingDir);
  }
}

// Ensure the directory exists in Docker
if (process.env.RUNNING_IN_DOCKER === 'true') {
  await fs.mkdir(workingDir, { recursive: true });
}

API Key Persistence in Docker

Problem: API keys need to persist across container restarts.

Solution: Load from persistent volume on startup:

// Docker uses /app/config for persistent storage
const isDocker = process.env.RUNNING_IN_DOCKER === 'true';
const configDir = isDocker ? '/app/config' : path.join(__dirname, '..');

// Load .env from persistent volume
if (isDocker && fs.existsSync(path.join(configDir, '.env'))) {
  dotenv.config({ path: path.join(configDir, '.env') });
  console.log('🔑 ANTHROPIC_API_KEY loaded from .env');
}

Testing Methodology

Verification Script:

#!/bin/bash
# Test Claude CLI in Docker container

echo "1. Checking Claude CLI installation:"
docker compose exec -T sasha-studio-test which claude

echo "2. Testing Claude CLI version:"
docker compose exec -T sasha-studio-test /usr/local/bin/node /usr/local/bin/claude --version

echo "3. Testing execFile approach:"
docker compose exec -T sasha-studio-test /usr/local/bin/node -e "
const { execFile } = require('child_process');
execFile('/usr/local/bin/node', ['/usr/local/bin/claude', '--version'], (error, stdout) => {
  if (error) {
    console.error('Error:', error.message);
  } else {
    console.log('Success! Output:', stdout);
  }
});
"

Key Insights

  1. Alpine is Different: Never assume Linux behaviors are universal - Alpine's minimal nature creates unique challenges
  2. execFile > spawn: In containerized environments, execFile is often more reliable
  3. Absolute Paths: Always use absolute paths in Docker to avoid ambiguity
  4. Test in Target Environment: Always test Node.js child processes in the actual Docker container
  5. Persistent Configuration: Design for configuration persistence from the start

Prevention Strategies

  1. Choose Base Images Carefully: Consider using node:20 instead of node:20-alpine if compatibility is more important than size
  2. Test Child Processes Early: Test external binary execution immediately when setting up Docker
  3. Document Environment Differences: Note Alpine-specific behaviors in documentation
  4. Use execFile for Reliability: Default to execFile when spawning Node.js scripts in containers
  5. Implement Fallback Strategies: Have multiple approaches ready for process spawning

Alternative Solutions (Not Used)

  1. Switch from Alpine: Use node:20 base image (larger but more compatible)
  2. Install glibc: Add glibc compatibility layer to Alpine (complex)
  3. Use Docker exec: Execute commands via Docker API (requires Docker socket)
  4. HTTP API Wrapper: Wrap Claude CLI in an HTTP service (additional complexity)

What This Prevented

  • Production Failures: Claude CLI completely non-functional in Docker
  • User Frustration: Core functionality broken in containerized deployment
  • Deployment Blockers: Unable to ship Docker version
  • Support Burden: Cryptic ENOENT errors difficult to diagnose

Docker Configuration Best Practices

Dockerfile Optimizations:

# Install Claude CLI globally for all users
RUN npm install -g @anthropic-ai/claude-code@latest

# Ensure proper permissions for nodejs user
RUN mkdir -p /home/nodejs/.claude && \
    chown -R nodejs:nodejs /home/nodejs/.claude

# Use dumb-init to handle signals properly
ENTRYPOINT ["dumb-init", "--"]

docker-compose.yml Configuration:

volumes:
  - sasha-config:/app/config  # Persistent API key storage
  - sasha-data:/app/data      # Persistent database
environment:
  - RUNNING_IN_DOCKER=true
  - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}  # Optional env override

Critical Learnings

  1. ENOENT Doesn't Mean File Not Found: In Alpine, it often means execution failed due to library issues
  2. Shell Option Doesn't Help: Using shell: true with spawn just moves the problem to /bin/sh
  3. Cross-spawn Isn't Universal: Even cross-platform libraries can fail in Alpine
  4. Debug with Direct Execution: Test binaries directly in container before attempting to spawn
  5. Environment Variables Matter: Always verify PATH and other env vars in container

Bottom Line: When deploying Node.js applications that spawn child processes to Alpine Docker containers, use execFile instead of spawn, always use absolute paths, and test thoroughly in the actual container environment. The time saved by using Alpine's smaller image size can be quickly lost to debugging compatibility issues.

🐋 Docker Architecture Mismatch - ARM64 vs AMD64

Date: 2025-08-11

The Critical Platform Architecture Problem

Critical Issue: Docker images built on Apple Silicon Macs (ARM64/aarch64) fail to run on AMD64 servers with "exec format error", affecting both system binaries and native Node.js modules.

Symptoms:

  • exec /usr/bin/dumb-init: exec format error when container starts
  • Error loading shared library /app/node_modules/node-pty/build/Release/pty.node: Exec format error
  • Container exits immediately on Sliplane (AMD64 servers)
  • Same image works perfectly on Mac (ARM64)

Root Cause Analysis

The Architecture Contamination Chain:

  1. Mac builds create ARM64 binaries by default
  2. Multi-stage Docker builds copy node_modules between stages
  3. Native modules (like node-pty) contain platform-specific compiled code
  4. Copied ARM64 binaries fail on AMD64 runtime environment

Why It Worked Before, Then Broke:

  • Initial deployments may have been built on AMD64 CI/CD systems
  • Local Mac builds started being used for deployment
  • Native module dependencies were added or updated
  • The problem compounds with each native module added

The Failed Attempts

Attempt 1: Just build for AMD64

docker build --platform linux/amd64 ...

Result: Fixed dumb-init but node-pty still failed

Attempt 2: Rebuild native modules

RUN npm rebuild

Result: Rebuild happened in build stage (ARM64), not runtime stage

Attempt 3: Copy node_modules between stages

COPY --from=builder /app/node_modules ./node_modules

Result: Perpetuated architecture mismatch

The Solution: Fresh Dependencies in Runner Stage

Working Dockerfile Pattern:

# Stage 2: Build (can be ARM64 or AMD64)
FROM node:20-alpine AS builder
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY claudecodeui/ .
RUN npm run build
# Critical: Remove node_modules to prevent contamination
RUN rm -rf node_modules

# Stage 4: Production (MUST be AMD64)
FROM node:20-alpine AS runner
WORKDIR /app

# Copy built assets but NOT node_modules
COPY --from=builder /app/dist ./dist
COPY claudecodeui/package*.json ./

# Install production dependencies fresh for target architecture
RUN npm ci --production && \
    # Rebuild ensures native modules compile for THIS platform
    npm rebuild

The Complete Build Command

Correct Multi-Platform Build:

# Use buildx for explicit platform targeting
docker buildx build \
  --platform linux/amd64 \
  -f claudecodeui/Dockerfile.sliplane \
  -t linzoid/sasha-studio:$VERSION \
  -t linzoid/sasha-studio:latest \
  --push \
  .

Key Technical Insights

  1. Native Modules Are Platform-Specific: Modules with C++ bindings (node-pty, bcrypt, better-sqlite3) MUST be compiled for the target architecture

  2. Multi-Stage Builds Can Contaminate: Copying node_modules between stages carries architecture-specific binaries

  3. npm ci vs npm install: Use npm ci --production in runner stage for reproducible, production-only dependencies

  4. npm rebuild Is Essential: Always run after npm ci to ensure native modules match the platform

  5. Docker's Platform Flag: --platform linux/amd64 affects ALL stages, not just the final image

Architecture Detection

Verify Image Architecture:

# Check image architecture
docker image inspect linzoid/sasha-studio:latest | grep Architecture

# Inside container, verify platform
docker run --rm linzoid/sasha-studio:latest uname -m
# Should output: x86_64 (not aarch64)

Health Check Validation:

{
  "build": {
    "platform": "linux",
    "arch": "x64"  // Must be x64 for Sliplane
  }
}
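
One way to produce that JSON from inside the container is to use Node's own process values; a minimal check script (the file name is illustrative):

// check-arch.js - run with `node check-arch.js` inside the container
console.log(JSON.stringify({
  build: {
    platform: process.platform, // expected: 'linux'
    arch: process.arch,         // expected: 'x64' on Sliplane; 'arm64' means the wrong image was pushed
  },
}, null, 2));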

Prevention Strategies

  1. CI/CD Builds: Use GitHub Actions or other CI/CD that runs on AMD64
  2. Explicit Platform Targeting: Always specify --platform linux/amd64 for production builds
  3. Separate Dev/Prod Dockerfiles: Use different approaches for local dev vs production
  4. Architecture Testing: Add health endpoint that reports architecture
  5. Build Verification: Test image on AMD64 before deployment

Common Native Modules Affected

  • node-pty: Terminal emulation (C++ bindings)
  • bcrypt: Password hashing (C++ crypto)
  • better-sqlite3: SQLite database (C++ bindings)
  • sharp: Image processing (C++ bindings)
  • canvas: Canvas rendering (C++ bindings)
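
A startup smoke test that simply requires each native module catches a contaminated image before it reaches users: requiring a native module loads its compiled .node binary, so a wrong-architecture binary fails immediately. A minimal sketch (the script name and module list are assumptions based on the dependencies mentioned above):

// scripts/native-smoke-test.js (hypothetical)
// Each require() loads the module's compiled binary for this platform
const modules = ['node-pty', 'better-sqlite3'];

let failed = false;
for (const name of modules) {
  try {
    require(name);
    console.log(`OK: ${name} loaded on ${process.platform}/${process.arch}`);
  } catch (err) {
    console.error(`FAIL: ${name} did not load: ${err.message}`);
    failed = true;
  }
}
process.exit(failed ? 1 : 0);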

What This Prevented

  • Complete Deployment Failure: Services unable to start on production servers
  • Cryptic Error Messages: "exec format error" doesn't clearly indicate architecture issues
  • Time Wasted on Wrong Solutions: Could have spent days on database or authentication debugging
  • Platform Lock-in: Would have required Mac-only deployments

Critical Lessons Learned

  1. "It Works on My Machine" Is Architecture-Dependent: Mac (ARM64) โ‰  Linux servers (AMD64)
  2. Native Modules Require Special Handling: Can't just copy node_modules around
  3. Docker Build Context Matters: Building FOR a platform vs ON a platform
  4. Test on Target Architecture: Always validate on actual deployment platform
  5. Error Messages Can Mislead: "exec format error" sounds like permissions but is architecture

Debugging Methodology

  1. Check Container Logs: First error often reveals architecture mismatch
  2. Inspect Image Architecture: Verify image was built for correct platform
  3. Test Incrementally: Start with base image, add complexity gradually
  4. SSH into Container: Direct debugging reveals issues faster than logs
  5. Compare Working vs Broken: What changed between deployments?

Alternative Solutions (Not Recommended)

  1. Use Node Images Without Alpine: Larger but more compatible
  2. Pre-built node_modules: Ship pre-compiled binaries for each platform
  3. Avoid Native Modules: Use pure JavaScript alternatives (performance cost)
  4. Platform-Specific Images: Maintain separate ARM64 and AMD64 images

Bottom Line: When deploying Docker containers from Apple Silicon Macs to AMD64 servers, ALWAYS:

  1. Build with --platform linux/amd64
  2. Install production dependencies fresh in the runner stage
  3. Run npm rebuild after installing dependencies
  4. Never copy node_modules between different architecture stages

This architecture mismatch is a silent killer that only manifests in production. The solution is architectural discipline: build for your target platform, not your development platform.

Architecture Mismatch Resolution: Complete Solution

Date: 2025-08-11

The Final Solution Implementation

After the initial diagnosis and partial fixes, we achieved complete resolution by implementing a systematic approach to Docker architecture management.

Complete Resolution Steps:

  1. Full Docker Cleanup: Used docker system prune -a --volumes to remove all ARM64 artifacts and build cache that could contaminate new builds

  2. Enhanced Build Scripts: Updated both build.sh and docker-build.sh to consistently use docker buildx build --platform linux/amd64 --load

  3. Cross-Platform Build System: Configured Docker buildx properly for Mac M1/M2 → AMD64 cross-compilation

  4. Verification Pipeline: Added systematic verification at each step:

    # Verify local image architecture
    docker inspect linzoid/sasha-studio:latest | grep Architecture
    # Should show: "Architecture": "amd64"
    
    # Test Docker Hub push/pull
    docker push linzoid/sasha-studio:latest
    docker rmi linzoid/sasha-studio:latest  
    docker pull linzoid/sasha-studio:latest
    
    # Verify pulled image is AMD64
    docker inspect linzoid/sasha-studio:latest | grep Architecture
    
  5. Documentation Updates: Updated CHANGELOG.md to reflect the complete resolution from v1.0.7 (initial work) to v1.0.14 (complete solution)

What Made The Difference

The Missing Piece: While the Dockerfile and build commands were correct, Docker Hub was still serving the old ARM64 image under the latest tag. The solution required:

  1. Building the correct AMD64 image locally
  2. Explicitly pushing the specific version tag (1.0.14)
  3. Re-tagging and pushing latest to overwrite the cached ARM64 version
  4. Verifying the round-trip (remove local → pull from Hub → verify architecture)

Technical Pattern for Future Use

# Complete architecture fix workflow
docker system prune -a --volumes            # Clean contaminated cache
./build.sh --no-bump                        # Build AMD64 image
docker push linzoid/sasha-studio:1.0.14    # Push specific version
docker tag linzoid/sasha-studio:1.0.14 linzoid/sasha-studio:latest
docker push linzoid/sasha-studio:latest    # Overwrite cached latest
docker rmi linzoid/sasha-studio:latest     # Remove local latest
docker pull linzoid/sasha-studio:latest    # Test from Docker Hub
docker inspect linzoid/sasha-studio:latest | grep Architecture  # Verify AMD64

Critical Insights From Resolution

  1. Docker Hub Caching: Registry caches can persist wrong architecture images even when builds are correct
  2. Tag Strategy: Always push specific version tags first, then update latest
  3. End-to-End Verification: Must test the complete pull-from-registry workflow, not just local builds
  4. Docker System State: Previous builds can contaminate new builds through shared layers and cache
  5. Build vs Deploy Architecture: An image can be built correctly and deployment can still fail because of registry caching

What This Complete Resolution Enables

Immediate Deployment Success:

  • Sliplane deployments now work without "exec format error"
  • All native modules (node-pty, better-sqlite3) function correctly
  • Consistent behavior across development (Mac ARM64) and production (Linux AMD64)

Operational Confidence:

  • Verified build-to-deployment pipeline
  • Clear debugging methodology for future architecture issues
  • Reproducible cross-platform build process

Scaling Benefits:

  • New client deployments will work immediately
  • Team members with different Mac architectures can deploy successfully
  • CI/CD systems can be configured with confidence

Prevention Checklist for Future Projects

  • Always specify --platform linux/amd64 for production builds
  • Test complete workflow: build → push → pull → verify architecture
  • Update latest tag after pushing specific versions
  • Clean Docker system state between architecture changes
  • Document the complete verification process
  • Add architecture reporting to application health endpoints

The Impact

This resolution transformed a complete deployment failure (containers wouldn't start) into a fully functional multi-client deployment system. The architecture fix was the final piece enabling the entire shared Docker image strategy to work successfully in production.

Critical Learning: Architecture mismatches in Docker can manifest at multiple levels (local build, registry cache, deployment platform). Complete resolution requires addressing the entire pipeline, not just fixing the build process. Always verify the full round-trip workflow when dealing with cross-architecture builds.

Key Takeaways

  1. Consistency is Key: Maintaining consistent patterns across mockups improved user experience and development speed
  2. Security by Design: Building security considerations into the UI from the start prevented later complications
  3. Progressive Disclosure: Showing advanced features only when needed kept interfaces clean
  4. Real User Scenarios: Designing for actual use cases (like file system mounting) led to more practical solutions
  5. Documentation as Development: Creating comprehensive guides alongside development improved feature completeness
  6. Container Compatibility: Always test child process spawning in the target container environment
  7. Use execFile in Alpine: For reliable process execution in Alpine Linux, prefer execFile over spawn (see the sketch after this list)
  8. Architecture Awareness: Always build Docker images for the target platform architecture, not your development machine
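
For item 7, a minimal sketch of the execFile pattern (the uname call is only an illustration):

const { execFile } = require('child_process');

// execFile runs the binary directly, without spawning a shell; per the
// takeaway above, this proved more reliable than spawn inside Alpine
execFile('uname', ['-m'], (err, stdout) => {
  if (err) {
    console.error('uname failed:', err.message);
    return;
  }
  console.log('Container architecture:', stdout.trim()); // expect x86_64
});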

This document will be updated as the project evolves with new insights and learnings.