Lessons Learnt - Sasha Project
This document captures key insights and learnings from the Sasha AI Knowledge Management System development.
Docker Workspace Path Resolution
Date: 2025-08-11
The Challenge
- Specialists and guides weren't loading in Sliplane deployment - UI only showed 2 default specialists instead of 8 (5 system + 3 user)
- HTML documentation wasn't displaying in Knowledge tab iframe - showed blank content
Root Cause Analysis
- Path Mismatch: Code expected files at /app/docs/ and /app/html-static/, but they were actually at /app/workspaces/workspace/docs/ and /app/workspaces/workspace/html-static/
- Hidden Directory: Private content was in .private/ (hidden) instead of the private/ directory
- Missing Files: Specialists weren't being copied from the image to the workspace volume during container initialization
Investigation Process
- Backend readPersonas() worked perfectly locally (returned all 8 specialists)
- SSH into the container revealed docs were in the workspace path, not the standard path
- The .private directory existed but lacked a specialists subdirectory
- The Docker entrypoint script wasn't creating/copying specialist files
The Solution
1. Dynamic Path Resolution in content-reader.js
// Check workspace path first in Docker environments
if (process.env.USE_DOCKER_WORKSPACE === 'true' || process.env.RUNNING_IN_DOCKER === 'true') {
  const workspaceDocsPath = '/app/workspaces/workspace/docs';
  const standardDocsPath = process.env.DOCS_PATH || '/app/docs';
  // Use whichever exists, preferring the workspace path
  docsPath = fs.existsSync(workspaceDocsPath) ? workspaceDocsPath : standardDocsPath;
}
2. Handle Hidden Private Directory
// In Docker, check for .private first, fall back to private
privateDir = path.join(docsPath, '.private', 'specialists');
if (!fs.existsSync(privateDir)) privateDir = path.join(docsPath, 'private', 'specialists');
3. Enhanced Docker Entrypoint Script
# Create directory structure
mkdir -p "$WORKSPACES_PATH/workspace/docs/.private/specialists"
# Copy specialists from image to workspace on first run
if [ -d "/app/docs/private/specialists" ]; then
cp -r /app/docs/private/specialists/* "$WORKSPACES_PATH/workspace/docs/.private/specialists/"
fi
4. Dynamic Path Resolution for HTML Static Files in server/index.js
// Determine correct html-static path based on environment
let htmlStaticPath;
if (process.env.USE_DOCKER_WORKSPACE === 'true' || process.env.RUNNING_IN_DOCKER === 'true') {
const workspaceHtmlPath = '/app/workspaces/workspace/html-static';
const standardHtmlPath = path.join(__dirname, '../../html-static');
// Check which path exists
if (fs.existsSync(workspaceHtmlPath)) {
htmlStaticPath = workspaceHtmlPath;
} else {
htmlStaticPath = standardHtmlPath;
}
} else {
htmlStaticPath = path.join(__dirname, '../../html-static');
}
app.use('/api/docs-content', express.static(htmlStaticPath));
Key Learnings
- Always verify actual paths in production - SSH into containers to check real directory structure
- Workspace volumes need initialization - Content must be copied from image to persistent volumes
- Support multiple path configurations - Code should check multiple possible locations
- Hidden directories in Docker - Private content may be intentionally hidden with dot prefix
- Debug with actual environment - Local testing may not reveal Docker-specific path issues
- Apply same path fixes everywhere - If docs are in workspace path, html-static likely is too
- Add comprehensive logging - Path resolution logging helps quickly identify issues in production
Best Practices for Docker Deployments
- Add comprehensive path debugging on startup
- Check multiple possible locations for critical files
- Initialize workspace volumes with required content
- Document the expected vs actual directory structure
- Test with the exact deployment environment (Sliplane, etc.)
Semantic Versioning Implementation
Date: 2025-08-11
The Challenge
Implementing semantic versioning for Docker builds while maintaining simplicity for local development and providing CI/CD compatibility.
What Worked Well
- Single Source of Truth: Using a VERSION file at project root eliminated version drift
- Automatic Synchronization: The version script updates both VERSION and package.json automatically
- Multiple Tag Strategy: Creating multiple Docker tags (1.0.0, 1.0, 1, latest) enables flexible deployment strategies
- Build Metadata: Including git commit, branch, and timestamp in development builds aids debugging
- UI Integration: Version displays in Settings > Version tab by reading from package.json
Key Implementation Details
Version Management Script
# Simple commands for all version operations
./scripts/version.sh patch # 1.0.0 -> 1.0.1
./scripts/version.sh minor # 1.0.0 -> 1.1.0
./scripts/version.sh major # 1.0.0 -> 2.0.0
Docker Build Integration
- Enhanced docker-build.sh automatically creates semantic version tags
- Development builds get unique tags with timestamps: 1.0.0-dev.20240111.abc123
- Production builds create the full tag hierarchy: exact, major.minor, major, latest
- Build info saved to .last-build.json for reference
Package.json Synchronization
// Automatic update in version.sh
sed -i "s/\"version\": \".*\"/\"version\": \"$NEW_VERSION\"/" package.json
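One portability caveat: sed -i behaves differently between GNU and BSD/macOS sed. A minimal Node.js sketch of the same sync step (hypothetical; the project's actual version.sh uses sed as shown above):
// Hypothetical Node.js equivalent of the sed one-liner - avoids
// GNU vs BSD `sed -i` differences by rewriting package.json directly.
const fs = require('fs');
const version = fs.readFileSync('VERSION', 'utf8').trim();
const pkg = JSON.parse(fs.readFileSync('package.json', 'utf8'));
pkg.version = version;
// Note: this normalizes package.json formatting to 2-space indents
fs.writeFileSync('package.json', JSON.stringify(pkg, null, 2) + '\n');
console.log(`package.json synced to ${version}`);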
Lessons Learned
- Keep It Simple: Local builds should remain simple - complexity belongs in CI/CD
- Automate Sync: Never rely on manual version synchronization between files
- Tag Strategically: Multiple Docker tags allow flexible rollback strategies
- Display Everywhere: Show version in UI, health endpoints, and Docker labels
- Git Integration: Optional git tagging in version script maintains release history
- Hybrid Approach: Support both local and GitHub Actions builds without conflict
Best Practices Discovered
- Always reset version to stable after testing (e.g., back to 1.0.0)
- Include version in health endpoint for runtime verification
- Use .last-build.json to track what was built and when
- Development builds should indicate "dirty" git state
- Branch-based tags help identify feature builds
Technical Patterns
# Version file as single source
VERSION=$(cat VERSION)
# Multiple tag creation
docker build -t app:$VERSION -t app:latest -t app:$(echo $VERSION | cut -d. -f1-2)
# Build metadata for debugging
BUILD_DATE=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
GIT_COMMIT=$(git rev-parse --short HEAD)
Future Improvements
- Consider semantic-release for fully automated versioning
- Add changelog generation from commit messages
- Implement version constraints for dependencies
- Add pre-commit hooks to verify version consistency
UI/UX Development
Navigation System Implementation
Date: 2025-01-05
What Worked Well
- Reusable Navigation Component: Created a single navigation overlay system that could be easily replicated across all mockups with minimal changes
- Slide-in Animation: The right-side slide-in menu pattern provided smooth, modern interactions
- Active State Management: Clear visual indicators for current page helped users understand their location
- Coming Soon Pattern: Using alerts for unfinished features set clear expectations while maintaining navigation structure
Key Learnings
- CSS Organization: Keeping navigation styles in a dedicated section made it easier to maintain consistency
- Escape Key Support: Adding keyboard navigation (ESC to close) significantly improved usability
- Mobile-First Responsive: Ensuring the navigation menu takes full width on mobile devices prevented layout issues
- Stop Propagation: Using event.stopPropagation() on the menu container prevented accidental closes when clicking inside
Technical Patterns
// Effective pattern for navigation toggle
function openNavMenu() {
navOverlay.classList.add('active');
document.body.style.overflow = 'hidden'; // Prevent background scrolling
}
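The matching close path, plus the ESC handling noted under Key Learnings, as a minimal sketch (assumes the same navOverlay element as above):
function closeNavMenu() {
  navOverlay.classList.remove('active');
  document.body.style.overflow = ''; // Restore background scrolling
}
// ESC to close - the keyboard support noted above
document.addEventListener('keydown', (e) => {
  if (e.key === 'Escape') closeNavMenu();
});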
Mockup Architecture
Date: 2025-01-05
What Worked Well
- Phosphor Icons: Backing phosphor-icon classes with emoji glyphs provided consistent, scalable icons without external dependencies
- Status Badges: Visual indicators (New, Soon) helped communicate feature availability
- Gradient Headers: Linear gradients created visual hierarchy and brand consistency
Challenges & Solutions
- String Replacement in Large Files: When editing large HTML files, finding exact strings for replacement was challenging
- Solution: Use more targeted searches and consider breaking large files into components
- Cross-File Consistency: Maintaining consistent navigation across multiple mockup files
- Solution: Create a standard navigation template that can be copied with minimal modifications
File Upload and Conversion System
Date: 2025-01-08
The Challenge
Implementing file upload with automatic document conversion in a React/Express application where:
- Files need to be uploaded via multipart/form-data
- Documents (PDF, Word, Excel) need to be converted to Markdown
- Project paths are encoded with dashes but contain special characters like dots
- File browser and upload must use consistent path resolution
Critical Issue: FormData and Content-Type Headers
Problem: The authenticatedFetch utility was setting Content-Type: 'application/json' for all requests, which broke multipart/form-data uploads.
Why it Failed:
- Multer needs the browser to set Content-Type: multipart/form-data; boundary=----WebKitFormBoundary...
- Our code was forcing Content-Type: application/json
- Result: 400 Bad Request or 413 Payload Too Large errors
Solution:
// In authenticatedFetch
const isFormData = options.body instanceof FormData;
if (!isFormData) {
defaultHeaders['Content-Type'] = 'application/json';
}
// Let browser set Content-Type with boundary for FormData
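For context, a minimal sketch of how the check sits inside an authenticatedFetch wrapper (the token storage and header merging here are illustrative assumptions, not the actual implementation):
async function authenticatedFetch(url, options = {}) {
  // Assumption for illustration: auth token kept in localStorage
  const token = localStorage.getItem('authToken');
  const defaultHeaders = { 'Authorization': `Bearer ${token}` };
  // Only set Content-Type for non-FormData bodies - the browser must
  // add the multipart boundary itself for FormData
  if (!(options.body instanceof FormData)) {
    defaultHeaders['Content-Type'] = 'application/json';
  }
  return fetch(url, { ...options, headers: { ...defaultHeaders, ...options.headers } });
}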
Path Encoding/Decoding Issues
Problem: Project names are encoded by replacing / with -, but this loses dots:
- Original: /Users/lindsaysmith/Documents/lambda1.nosync/sasha
- Encoded: -Users-lindsaysmith-Documents-lambda1-nosync-sasha (dot lost!)
- Decoded: /Users/lindsaysmith/Documents/lambda1/nosync/sasha (wrong!)
Solution: Use extractProjectDirectory from Claude's JSONL files to get the actual path:
const { extractProjectDirectory } = await import('../projects.js');
workspacePath = await extractProjectDirectory(projectName);
// This reads the actual cwd from session files, preserving special characters
Middleware Ordering
Problem: Express body parsers were interfering with multer's multipart parsing.
Solution: Mount file upload routes BEFORE body parsers:
app.use('/api', filesRoutes); // File routes with multer
app.use(express.json()); // JSON body parser comes after
Document Conversion API
Learning: The @knowcode/convert-to-markdown package uses:
- converter.pdf.toMarkdown(), not pdfToMarkdown()
- converter.word.toMarkdown() for Word documents
- converter.excel.toMarkdown() for Excel files
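A hedged dispatch sketch built on those method names (how the converter instance is created and whether it accepts Buffers are assumptions, not verified API details):
// Sketch only: route uploads to the documented conversion methods.
async function convertToMarkdown(converter, buffer, ext) {
  switch (ext) {
    case '.pdf': return converter.pdf.toMarkdown(buffer);
    case '.doc':
    case '.docx': return converter.word.toMarkdown(buffer);
    case '.xls':
    case '.xlsx': return converter.excel.toMarkdown(buffer);
    default: return buffer.toString('utf8'); // graceful fallback text
  }
}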
Key Implementation Patterns
- Always check if body is FormData before setting Content-Type
- Use actual project paths from JSONL, not decoded names
- Mount multer routes before body parsers
- Handle conversion errors gracefully with fallback text
File System Browser Design
Security-First Approach
Date: 2025-01-05
Key Insights
- Visual Security Indicators: Color-coding storage types (local, remote, cloud) immediately communicates security context
- Permission Warnings: Modal confirmations for write access changes prevent accidental security vulnerabilities
- Read-Only by Default: Starting with restrictive permissions and requiring explicit user action for write access
UI Patterns That Worked
- Storage Type Icons: Using distinct icons and colors for different storage types
- Local: Green with hard drive icon
- Remote: Blue with server icon
- Cloud: Purple with cloud icon
- Checkbox Confirmation: Requiring users to check "I understand the risks" before enabling write access
- 15-Second Cooldown: Preventing hasty decisions by adding a delay before confirmation
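A minimal sketch of that cooldown pattern (element IDs are hypothetical, not the mockups' actual markup):
// Hypothetical IDs for illustration only
const confirmBtn = document.getElementById('confirmWriteAccess');
const riskCheck = document.getElementById('understandRisks');
let remaining = 15;
confirmBtn.disabled = true;
const timer = setInterval(() => {
  remaining--;
  confirmBtn.textContent = remaining > 0
    ? `Enable write access (${remaining}s)`
    : 'Enable write access';
  // Only enable once the cooldown has elapsed AND the risk box is checked
  if (remaining <= 0) {
    clearInterval(timer);
    confirmBtn.disabled = !riskCheck.checked;
  }
}, 1000);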
Local LLM Administration
Dashboard Design Principles
Date: 2025-01-05
Successful Patterns
- Tabbed Interface: Organizing complex admin functions into logical tabs improved discoverability
- Real-Time Status: Live indicators for model health and resource usage
- Action Buttons: Clear, contextual actions (Start, Stop, Update) for each model
- Resource Visualization: Progress bars and charts made resource usage immediately understandable
Technical Implementation
- Model Cards: Displaying each model as a card with status, specs, and actions
- Configuration Sections: Grouping related settings (model configs, resource limits, security)
- Alert System: Combining visual indicators with detailed log information
Development Workflow
Todo List Management
Date: 2025-01-05
Best Practices
- Granular Tasks: Breaking down complex features into specific, actionable items
- Real-Time Updates: Marking tasks as completed immediately after finishing
- Priority Levels: Using high/medium/low priorities to guide work order
- Status Tracking: Clear in_progress markers to show current focus
Panel System Architecture
Unified Panel Management Implementation
Date: 2025-08-05
Critical DOM Manipulation Lessons
The innerHTML Timing Problem
When implementing a unified panel system that restructures existing DOM elements, we encountered a critical timing issue:
// ❌ PROBLEMATIC PATTERN - Immediate querySelector after innerHTML replacement
panel.element.innerHTML = `<div class="new-structure">${existingContent}</div>`;
const closeBtn = panel.element.querySelector('#closeBtn'); // May return null!
Root Cause: The browser needs time to parse and construct new DOM elements after innerHTML assignment. Immediate querySelector operations may fail because elements aren't fully available yet.
What We Learned
- Event Handler Lifecycle: When you replace innerHTML, all existing event listeners on child elements are destroyed
- DOM Construction Timing: New elements created via innerHTML may not be immediately queryable
- Selector Strategy: Relying on a single selector strategy is fragile - multiple fallback approaches are essential
Solution Patterns That Work
Multi-Strategy Close Button Detection
attachCloseHandler(panel, id) {
  // Example mapping of known panel IDs to close-button selectors (illustrative)
  const specificSelectors = { settings: '#closeSettingsBtn', chat: '#closeChatBtn' };
  const strategies = [
    // Strategy 1: Specific IDs for known panels
    () => specificSelectors[id] && panel.element.querySelector(specificSelectors[id]),
    // Strategy 2: Generic patterns
    () => panel.element.querySelector('[id*="close"], .panel-close'),
    // Strategy 3: Event delegation fallback
    () => this.addEventDelegation(panel, id)
  ];
  const tryAttachHandler = () => {
    for (const strategy of strategies) {
      const result = strategy();
      if (result) return true;
    }
    return false;
  };
  // Try immediately, then retry after DOM update
  if (!tryAttachHandler()) {
    requestAnimationFrame(() => tryAttachHandler());
  }
}
Event Delegation as Ultimate Fallback
// When specific button detection fails, use event delegation
panel.element.addEventListener('click', (e) => {
if (e.target.closest('[id*="close"]')) {
this.closePanel(id);
}
});
Key Insights
- Defensive Programming: Always have multiple strategies for finding DOM elements after restructuring
- DOM Timing: Use requestAnimationFrame() or setTimeout() when immediate element access fails
- Event Delegation: Provides reliable fallback when specific element detection fails
- Debugging: Console logging successful handler attachment helps diagnose issues
What This Prevented
- Silent Failures: Close buttons appearing functional but not working
- Inconsistent Behavior: Some panels working while others don't
- User Frustration: Broken interactions in an otherwise polished interface
Future Applications
This pattern applies to any system that:
- Dynamically restructures existing DOM elements
- Needs to reattach event handlers after DOM manipulation
- Implements unified behavior across heterogeneous existing components
Bottom Line: When building systems that reshape existing DOM, assume your first attempt to find elements will fail and build accordingly.
JavaScript Error Debugging in Complex HTML Files
Date: 2025-08-05
The Silent Failure Problem
Critical Issue: A single null reference error in early JavaScript code can silently prevent ALL subsequent JavaScript from executing, even in separate logical sections.
Scenario: Implementing a unified panel system, but close buttons weren't working and no console output appeared.
Root Cause Analysis Process
Step 1: No Console Output = Script Not Running
- When NO console logs appear, the issue isn't logic - it's script execution failure
- Don't debug individual features; debug whether JavaScript is running at all
Step 2: Error Location Strategy
// Add basic execution test at script start
console.log('🟢 JavaScript is running - First script tag loaded');
Step 3: The Actual Error
Uncaught TypeError: Cannot read properties of null (reading 'addEventListener')
at chat-interface:2073:23
Root Cause: Code assumed an element existed without checking:
// ❌ DANGEROUS - Will crash if element doesn't exist
const modelSelector = document.getElementById('modelSelector');
modelSelector.addEventListener('click', () => {...}); // Crashes if null
Solution Patterns
Defensive Element Access
const modelSelector = document.getElementById('modelSelector');
const modelDropdown = document.getElementById('modelDropdown');
if (modelSelector && modelDropdown) {
// Only run if both elements exist
modelSelector.addEventListener('click', () => {...});
} else {
console.log('⚠️ Model selector elements not found - skipping functionality');
}
Variable Declaration Order Matters
// ❌ WRONG ORDER - Variables used before declaration
const panelManager = new PanelManager();
panelManager.registerPanel('panel', {
onClose: () => chatMessages.scrollTop = 0 // chatMessages not declared yet!
});
const chatMessages = document.getElementById('chatMessages');
// ✅ CORRECT ORDER - Variables declared first
const chatMessages = document.getElementById('chatMessages');
const panelManager = new PanelManager();
panelManager.registerPanel('panel', {
onClose: () => chatMessages.scrollTop = 0 // chatMessages exists
});
Debugging Methodology
- Execution Test: Add console.log() at script start to verify JavaScript runs
- Error Location: Use browser console to identify exact line and error type
- Null Checks: Add defensive checks for ALL getElementById calls
- Incremental Testing: Test each major section with console logs
- Variable Order: Ensure all variables are declared before use
Key Insights
- Single Point of Failure: One null reference can break an entire application
- Error Propagation: JavaScript errors don't stay contained to their logical sections
- Console Silence: No logs = script execution failure, not logic problems
- Element Assumptions: Never assume DOM elements exist - always check
- Order Dependencies: Variable declaration order affects runtime behavior
What This Prevented
- Hours of Wrong Debugging: Would have spent time debugging panel logic instead of script execution
- Feature-Specific Fixes: Would have tried to fix panels individually instead of the root cause
- Silent Production Failures: This type of error could cause complete UI failure in production
Prevention Checklist
- Add execution verification logs at script start
- Null check ALL getElementById() calls
- Declare variables before use in callbacks
- Test JavaScript execution before debugging features
- Use browser console to identify exact error locations
Lesson: Always verify JavaScript is executing before debugging application logic. One missing element check can silently break everything.
Icon System Consistency and Missing Definitions
Date: 2025-08-05
The Hidden UI Failure Problem
Critical Issue: Using icon class names in HTML without corresponding CSS definitions creates invisible UI elements that appear to work in development but fail silently in production.
Scenario: Navigation menus and UI elements showing blank spaces instead of icons, making the interface appear broken or incomplete.
Root Cause Analysis
The Icon Definition Gap
<!-- HTML uses the class -->
<span class="phosphor-icon chart-line"></span>
<!-- But CSS definition is missing -->
/* .phosphor-icon.chart-line::before { content: '📊'; } ❌ NOT DEFINED */
Result: The element exists in the DOM but displays nothing, creating invisible buttons and confusing UX.
What We Discovered
Massive Scale of the Problem:
- account-settings.html: 16 missing icon definitions
- activity-log.html: 13 missing icon definitions
- Total: 29 missing icons across just 2 pages
Common Missing Icons:
- Navigation: chat-circle, rocket-launch, book-open, folder-open
- System: chart-line, package, cpu, puzzle-piece
- UI: house, moon, device-mobile, gear
- Controls: bars-three (hamburger menu)
Detection Methodology
Step 1: Audit Icon Usage vs Definitions
# Find all phosphor-icon classes used in HTML (-E so + is a quantifier)
grep -oE "phosphor-icon [a-z-]+" file.html
# Find all phosphor-icon definitions in the embedded CSS
grep -oE "\.phosphor-icon\.[a-z-]+::before" file.html
# Compare lists to find missing definitions
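The same audit can be scripted; a hedged Node.js sketch that diffs used classes against defined ones (assumes the CSS is embedded in the same HTML file, as above):
const fs = require('fs');
const html = fs.readFileSync('file.html', 'utf8');
// Icon classes used in markup
const used = new Set([...html.matchAll(/phosphor-icon ([a-z-]+)/g)].map(m => m[1]));
// Icon classes defined in the embedded CSS
const defined = new Set([...html.matchAll(/\.phosphor-icon\.([a-z-]+)::before/g)].map(m => m[1]));
console.log('Missing definitions:', [...used].filter(name => !defined.has(name)));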
Step 2: Visual Inspection Strategy
- Look for blank spaces where icons should appear
- Check navigation menus for missing visual elements
- Test hover states on buttons that should have icons
Step 3: Systematic Validation
/* Audit pattern - ensure every used class has a definition */
.phosphor-icon.CLASSNAME::before { content: 'EMOJI'; font-size: inherit; }
Solution Patterns
Complete Icon System Audit
/* Navigation icons */
.phosphor-icon.chat-circle::before { content: '💬'; font-size: inherit; }
.phosphor-icon.rocket-launch::before { content: '🚀'; font-size: inherit; }
.phosphor-icon.book-open::before { content: '📖'; font-size: inherit; }
/* System icons */
.phosphor-icon.chart-line::before { content: '📊'; font-size: inherit; }
.phosphor-icon.package::before { content: '📦'; font-size: inherit; }
.phosphor-icon.cpu::before { content: '🖥️'; font-size: inherit; }
/* Control icons */
.phosphor-icon.bars-three::before { content: '☰'; font-size: inherit; }
Eliminate Direct Unicode Usage
<!-- ❌ INCONSISTENT - Direct unicode -->
<button><span>☰</span></button>
<!-- ✅ CONSISTENT - Phosphor icon system -->
<button><span class="phosphor-icon bars-three"></span></button>
Prevention Strategies
- Icon System Documentation: Maintain a complete list of available icons and their class names
- Development Checklist: Verify all icon classes have corresponding CSS definitions
- Visual Testing: Test all pages to ensure no blank icon spaces exist
- Automated Validation: Create scripts to detect unused classes or missing definitions
- Consistent Implementation: Never mix direct unicode with icon systems
Key Insights
- Silent Failures: Missing icon definitions don't throw errors - they just show nothing
- Scale Impact: Small oversights compound across multiple pages
- User Experience: Blank icons make interfaces appear broken or unprofessional
- Maintenance Debt: Inconsistent icon systems create ongoing maintenance issues
- Design System Integrity: Complete icon coverage is essential for professional UI
What This Prevented
- Professional Appearance Issues: Navigation menus with missing icons
- User Confusion: Buttons that appear non-functional due to missing visual cues
- Inconsistent Branding: Mixed unicode and icon system usage
- Future Scalability Problems: Incomplete icon systems become harder to maintain
Implementation Checklist
- Audit all pages for icon class usage vs CSS definitions
- Create complete icon definition library for the design system
- Replace all direct unicode characters with proper icon classes
- Test visual appearance of all interactive elements
- Document available icons and their proper class names
- Establish icon system usage guidelines for future development
Critical Learning: Icon systems require complete coverage - partial implementations create invisible UI failures that silently degrade user experience. Every icon class used in HTML must have a corresponding CSS definition, and mixing unicode with icon systems creates maintenance nightmares.
Best Practice: Treat icon systems like any other dependency - incomplete implementations are broken implementations.
Documentation Standards
What Works
- No Metadata Headers: Keeping markdown documents clean without status/version headers (except for special cases)
- Image Organization: Storing images in _images/ directories relative to markdown files
- Descriptive Alt Tags: Ensuring all images have meaningful alt text for accessibility
- Color Samples: Showing visual samples when hex colors are specified
Implementation Insights
HTML Mockup Best Practices
- Inline Styles First: Starting with inline styles for rapid prototyping, then organizing into structured CSS
- Progressive Enhancement: Building core functionality first, then adding animations and polish
- Consistent Spacing: Using CSS variables for consistent spacing and sizing across components
- Hover States: Adding subtle hover effects to all interactive elements
Cross-Browser Compatibility
- CSS Variables: Using custom properties for theming made dark mode preparation easier
- Flexbox/Grid: Modern layout systems simplified responsive design
- Transition Timing: Consistent timing functions created cohesive animations
Project Management
Communication Patterns
- Clear Status Updates: Regular progress updates with specific accomplishments
- Visual Examples: Including screenshots or detailed descriptions of UI changes
- Incremental Delivery: Completing and demonstrating features incrementally
File Organization
mockups/
├── index.html          # Central navigation hub
├── chat-interface.html # Core user experience
├── *-admin.html        # Administrative interfaces
└── *.html              # Feature-specific mockups
Future Considerations
Scalability
- Component Library: Consider creating reusable components for common UI patterns
- Style Guide: Develop a comprehensive style guide for consistent design language
- Template System: Create templates for new mockup pages to ensure consistency
Performance
- Lazy Loading: For production, implement lazy loading for heavy dashboard components
- Code Splitting: Separate navigation code into its own module
- Icon Optimization: Consider using an icon font or SVG sprite for better performance
Accessibility
- ARIA Labels: Add proper ARIA labels to all interactive elements
- Keyboard Navigation: Ensure all features are keyboard accessible
- Screen Reader Testing: Validate mockups work well with screen readers
ReactMarkdown Code Block Styling Issues
Date: 2025-01-09
The Black Border Problem
Critical Issue: Code blocks displayed with harsh black borders in the UI, even after updating component styling, because ReactMarkdown was wrapping the custom code component in a <pre> tag with default browser/Tailwind Typography styling.
Symptoms:
- Code blocks showing black borders despite custom gradient backgrounds
- Browser inspection revealed: <pre><div class="custom-styled-code">...</div></pre>
- Changes to component styling had no effect on the outer border
Root Cause Analysis
The Double-Wrapping Problem:
- ReactMarkdown automatically wraps code blocks in <pre> tags
- The Tailwind Typography plugin (@tailwindcss/typography) applies default styles to .prose pre
- Browser default styles for <pre> tags include borders
- Our custom code component was wrapped inside, not replacing, the <pre> tag
Discovery Process:
<!-- What we expected -->
<div class="bg-gradient-to-br from-slate-50...">
<code>...</code>
</div>
<!-- What we got -->
<pre> <!-- This added unwanted styling! -->
<div class="bg-gradient-to-br from-slate-50...">
<code>...</code>
</div>
</pre>
Solution Implementation
Override the Pre Component in ReactMarkdown:
// In ReactMarkdown components prop
pre: ({children}) => {
// Return just the children (our custom code component)
// This prevents ReactMarkdown from wrapping in <pre>
return <>{children}</>;
}
Add CSS Overrides for Safety:
/* Remove default pre styling from prose */
.prose pre {
background-color: transparent !important;
border: none !important;
padding: 0 !important;
margin: 0 !important;
}
/* Ensure no borders on any pre tags */
pre {
border: none !important;
background: transparent !important;
}
Key Insights
- Component Wrapping: ReactMarkdown components don't replace elements, they wrap them
- Tailwind Typography: The prose class applies opinionated styles that can conflict with custom designs
- Invalid Tailwind Classes: Using non-existent Tailwind classes (like slate-850) fails silently
- Dark Mode Detection: Ensure parent elements have the dark class for dark mode styles to apply
- Browser Cache: Hard refresh (Cmd+Shift+R) may be needed after CSS changes
Debugging Methodology
- Inspect Actual HTML: Use browser DevTools to see the real DOM structure
- Check Class Names: Verify Tailwind classes actually exist (max is 950, not 850)
- Trace Parent Wrappers: Look for unexpected parent elements adding styles
- Test Component Isolation: Check if the component works outside of ReactMarkdown
- Verify Dark Mode Context: Ensure the dark class is on document.documentElement
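A quick DevTools check for that context (a sketch; how the app actually toggles dark mode may differ):
// Verify the dark mode context from the browser console
console.log('dark mode on:', document.documentElement.classList.contains('dark'));
// Temporarily force it on while debugging styles
document.documentElement.classList.add('dark');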
Prevention Strategies
- Always Override Both Pre and Code: When customizing code blocks in ReactMarkdown
- Test with Browser Inspector: Don't just rely on component code
- Use Valid Tailwind Classes: Reference the Tailwind documentation for valid values
- Add Defensive CSS: Include fallback styles to override unwanted defaults
- Document Component Structure: Note when libraries wrap vs. replace elements
What This Prevented
- Poor User Experience: Harsh black borders made the UI feel unpolished
- Inconsistent Theming: Code blocks didn't match the overall design aesthetic
- Light/Dark Mode Issues: Borders were especially jarring in light mode
- Brand Consistency: The harsh styling conflicted with the soft, modern design
Technical Pattern for Future Use
// Complete ReactMarkdown code block customization pattern
<ReactMarkdown
components={{
// Override pre to prevent wrapper
pre: ({children}) => <>{children}</>,
// Custom code component with full styling control
code: ({inline, className, children, ...props}) => {
if (inline) {
return <code className="custom-inline-code">{children}</code>;
}
return (
<div className="custom-code-block">
{/* Your fully controlled code block UI */}
</div>
);
}
}}
>
{content}
</ReactMarkdown>
Lesson: When third-party libraries generate HTML, always check the actual DOM output, not just your component code. Default styles from libraries and browsers can override your carefully crafted designs in unexpected ways.
Chat Message State Management Issues
Date: 2025-08-09
The Disappearing User Messages Bug
Critical Issue: User messages would flash briefly then disappear from the chat history when session messages were loaded.
Symptoms:
- User sends a message โ appears briefly in chat
- Session updates trigger โ message disappears
- Messages lost before being saved to session
Root Cause Analysis
The Problem Flow:
- User message added to chatMessages state
- Session messages loaded from API into sessionMessages
- useEffect watching sessionMessages triggers
- BUG: Completely overwrites chatMessages with only converted session messages
- New local messages that weren't saved yet are lost
The Faulty Code:
// ❌ WRONG - Overwrites everything
useEffect(() => {
if (sessionMessages.length > 0) {
setChatMessages(convertedMessages); // Loses local messages!
}
}, [convertedMessages, sessionMessages]);
Solution Implementation
Preserve Local Messages:
// Merge session messages with newer local messages
setChatMessages(prev => {
if (convertedMessages.length > 0) {
const lastSessionTime = new Date(
convertedMessages[convertedMessages.length - 1].timestamp
).getTime();
// Keep messages newer than last session message
const newLocalMessages = prev.filter(msg => {
const msgTime = new Date(msg.timestamp).getTime();
return msgTime > lastSessionTime &&
!convertedMessages.some(cm =>
cm.timestamp === msg.timestamp &&
cm.content === msg.content
);
});
return [...convertedMessages, ...newLocalMessages];
}
return convertedMessages;
});
Key Insights
- State Synchronization: When merging state from multiple sources, always consider what should be preserved
- Timestamp Ordering: Use timestamps to determine which messages are newer
- Duplicate Prevention: Check both timestamp and content to avoid duplicates
- Local-First: Preserve local changes until they're confirmed saved
Inline Code vs Code Block Rendering
Date: 2025-08-09
The Problem
Issue: Inline code (like CONFIG_DIR) was being rendered as full code blocks with borders, headers, and copy buttons instead of simple highlighted text within sentences.
Root Cause
react-markdown v10 Breaking Change: The inline parameter is no longer reliably passed to the code component, making the original detection logic fail:
// โ This check always failed in v10
if (inline) {
return <InlineCode />;
}
Solution: Content-Based Detection
Smart Detection Logic:
code: ({node, inline, className, children, ...props}) => {
// Analyze content to determine if it's inline
const codeString = String(children).replace(/\n$/, '');
const hasNewlines = codeString.includes('\n');
const hasLanguageClass = className?.startsWith('language-');
const isInlineCode = !hasNewlines && !hasLanguageClass;
if (isInlineCode) {
// Simple inline highlighting
return <code className="px-1.5 py-0.5 bg-blue-50 ...">{children}</code>;
}
// Full code block UI
return <CodeBlock>...</CodeBlock>;
}
Detection Rules
Inline Code Characteristics:
- No newlines in content
- No language-* className
- Usually short snippets
Code Block Characteristics:
- Contains newlines (multi-line)
- Has language-* className
- Typically longer code samples
Key Learnings
- Library Version Changes: Always check breaking changes when libraries update
- Fallback Detection: Don't rely on single parameters - use multiple signals
- Content Analysis: Sometimes analyzing the content itself is more reliable than metadata
- User Experience: Different content types need different UI treatments
Critical Learning: When third-party libraries change their API, implement robust detection that doesn't rely on single parameters. Use multiple signals and content analysis for reliable feature detection.
JSX Structure and Build Errors
Date: 2025-08-09
The "Unterminated Regular Expression" JSX Error
Critical Issue: JSX parsing errors can manifest as cryptic "Unterminated regular expression" errors when there are structural issues with React components, particularly with mismatched tags or improper nesting of conditional renders.
Symptoms:
- Build error: ERROR: Unterminated regular expression at a closing div tag
- Error points to innocent-looking JSX like </div>
- The actual issue is elsewhere in the component structure
Root Cause Analysis
The Nested Conditional Problem:
When implementing expandable tool messages with conditional rendering, improper nesting of JSX elements within conditionals created invalid structures:
// ❌ PROBLEMATIC - Missing proper nesting
{expandedTools.has(message.toolId) && (
<div className="expanded-content">
{/* Another conditional started without proper closure */}
{message.toolInput && (() => {
Why It Failed:
- Opening a div inside a conditional render
- Immediately starting another conditional without proper JSX structure
- Mismatched opening and closing tags across conditional boundaries
- Parser interpreting malformed JSX as regular expressions
Key Discovery Process
- Error Misleading: "Unterminated regular expression" doesn't mean regex - it means JSX parsing failed
- Count Tags: Systematically counted opening vs closing divs (found 56 opening, 54 closing)
- Trace Conditionals: Each conditional render must have properly balanced JSX
- Check Nesting: Ensure conditionals inside JSX elements are properly wrapped
Solution Patterns
Proper Conditional Nesting:
{expandedTools.has(message.toolId) && (
<div className="expanded-content">
{/* Properly nested content */}
{message.toolInput && (() => {
// Content here
})()}
</div>
)}
Validate Structure Before Complex Changes:
# Count opening and closing tags (-o counts every match, not just matching lines)
sed -n '327,1121p' file.jsx | grep -o '<div' | wc -l
sed -n '327,1121p' file.jsx | grep -o '</div>' | wc -l
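The same check as a small Node.js sketch (counts every occurrence; self-closing <div /> tags would need extra handling):
const fs = require('fs');
const src = fs.readFileSync('file.jsx', 'utf8');
const opening = (src.match(/<div\b/g) || []).length;
const closing = (src.match(/<\/div>/g) || []).length;
console.log(`opening: ${opening}, closing: ${closing}, balanced: ${opening === closing}`);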
Debugging Methodology
- Build Error Location: Note the line number but don't trust it - the real issue is often earlier
- Count Tags: Use grep/sed to count opening and closing tags in the affected section
- Trace Ternaries: Map out the complete ternary operator chain structure
- Check Conditionals: Verify each conditional render has balanced JSX
- Revert and Rebuild: When structure is too broken, revert and carefully reapply changes
Prevention Strategies
- Small Incremental Changes: Test build after each structural change
- Comment Complex Structures: Add comments showing where conditionals open/close
- Use Fragments Properly: Use <>...</> when you need to wrap without adding DOM elements
- Validate After Edits: Run build immediately after complex JSX changes
- Keep Backup Points: Commit working versions before major structural changes
What We Learned
- Parser Confusion: Invalid JSX structure confuses the parser into thinking it's parsing JavaScript
- Error Messages Mislead: "Unterminated regular expression" is a symptom, not the cause
- Structure Over Content: Fix structural issues before implementing features
- Indentation Matters: Proper indentation helps spot nesting issues
- Tool Limitations: AI assistants can struggle with complex JSX structure debugging
Implementation Checklist for Complex JSX
- Map out the complete conditional structure before coding
- Test build after each conditional branch addition
- Count opening and closing tags programmatically
- Use proper indentation to visualize nesting
- Add temporary console logs to verify conditional paths
- Keep the previous working version easily accessible
- Document the intended structure in comments
Critical Lesson: When you see "Unterminated regular expression" in a JSX file, immediately check for:
- Mismatched opening/closing tags
- Improper conditional render nesting
- Missing closing parentheses in ternary chains
- Adjacent JSX elements without wrappers
The error message is telling you the parser got confused, not that you have a regex problem.
Multi-Client Deployment Management System
Date: 2025-08-11
The Challenge: Inefficient Client-Specific Docker Images
Initial Problem: Originally building separate Docker images for each client, leading to:
- Redundant builds for identical code
- Storage waste on Docker Hub
- Inconsistent versions across clients
- Complex deployment pipeline
User Insight: "Why do we not use the same docker image for clients - shouldnt each image be exactly the same per version?"
This feedback highlighted a fundamental architecture flaw that needed immediate correction.
Solution: Shared Images with Environment Differentiation
Refactored Architecture:
linzoid/sasha-studio:1.0.2 <- Single shared image
├── sasha-main (env: COMPANY_NAME=Knowcode)
├── hirebest (env: COMPANY_NAME=HireBest)
└── acme-corp (env: COMPANY_NAME=ACME Corp)
Key Implementation Changes:
- Removed client-specific Docker tags - eliminated tag_suffix from configurations
- Unified build process - one image serves all clients
- Environment-based differentiation - clients differ only via Sliplane environment variables
- Shared version management - all clients use same VERSION file
Security: Auto-Generated Cryptographic Secrets
Problem: Manual secret generation was error-prone and insecure.
Solution: Automated generation using OpenSSL:
# Each client gets unique 256-bit secrets
SESSION_SECRET=$(openssl rand -base64 32)
JWT_SECRET=$(openssl rand -base64 32)
Security Architecture:
- Session Isolation: Each client has unique session secrets
- JWT Security: Independent token verification per client
- Breach Containment: Compromise of one client doesn't affect others
- Zero Placeholders: Real secrets generated automatically
User Experience: Enhanced Deployment Instructions
Problem: Color escape sequences showing as text (\033[0;34m) instead of actual colors.
Root Cause: Missing -e flag in echo statements prevented interpretation of escape sequences.
Solution:
# ❌ Wrong - shows escape sequences as text
echo "\033[0;34mDeployment starting\033[0m"
# ✅ Correct - shows actual colors
echo -e "\033[0;34mDeployment starting\033[0m"
Enhanced Output Features:
- Color-coded instructions with proper terminal formatting
- Step-by-step Sliplane setup guide with exact button names
- Copy-paste ready environment variables
- Post-deployment verification checklists
Multi-Client Management CLI
Created comprehensive tooling:
./manage-clients.sh create client-name # Auto-generates secrets
./deploy-client.sh client-name # Shared image deployment
./show-setup.sh client-name # Complete setup guide
Library Functions:
- lib/common.sh: Secret generation, validation utilities
- lib/docker.sh: Shared image operations
- lib/sliplane.sh: Webhook deployment management
Key Technical Insights
- Shared Images Are Superior: Build once, deploy many times with environment differentiation
- Security Through Automation: Auto-generated secrets eliminate human error
- User Experience Matters: Proper terminal formatting significantly improves deployment experience
- Documentation Drives Adoption: Step-by-step instructions reduce deployment friction
What This Architecture Enables
Efficiency Gains:
- 75% reduction in build time: One build instead of per-client builds
- Reduced Docker Hub storage: Single image replicated vs multiple unique images
- Guaranteed consistency: All clients run identical code with different config
Security Improvements:
- Cryptographically unique secrets: 256-bit entropy per client
- Client isolation: Sessions and tokens cannot cross client boundaries
- Audit trail: Clear separation of client data and authentication
Operational Benefits:
- Simple scaling: Add new clients without code changes
- Version management: Single VERSION file controls all deployments
- Troubleshooting: Consistent behavior across all client environments
Implementation Patterns for Future Use
Auto-Secret Generation:
generate_secret() {
local length=${1:-32}
openssl rand -base64 "$length" | tr -d '\n'
}
# Usage in client creation
SESSION_SECRET=$(generate_secret 32)
JWT_SECRET=$(generate_secret 32)
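Where shelling out to OpenSSL is undesirable, Node's crypto module provides the same entropy (a sketch, not the project's actual lib/common.sh):
const crypto = require('crypto');
// 32 random bytes = 256 bits of entropy, base64-encoded like the OpenSSL call
const generateSecret = (bytes = 32) => crypto.randomBytes(bytes).toString('base64');
console.log('SESSION_SECRET=' + generateSecret());
console.log('JWT_SECRET=' + generateSecret());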
Color-Coded Terminal Output:
# Define colors once, use everywhere
RED='\033[0;31m'
GREEN='\033[0;32m'
BLUE='\033[0;34m'
NC='\033[0m'
# Always use -e with echo for colors
echo -e "${GREEN}✅ Success${NC}"
echo -e "${RED}❌ Error${NC}"
Shared Docker Image Pattern:
# Build once
docker build -t ${REPO}:${VERSION} .
# Deploy many times with different env
# Client 1: COMPANY_NAME=ClientA
# Client 2: COMPANY_NAME=ClientB
# Client 3: COMPANY_NAME=ClientC
Comprehensive Documentation Created
- Client Management README: Complete system overview
- Security Guide: How secrets work at runtime
- Documentation Index: Centralized doc status
What This Prevented
- Operational Inefficiency: Multiple redundant Docker builds
- Security Vulnerabilities: Weak or placeholder secrets in production
- User Frustration: Confusing deployment instructions with formatting issues
- Scaling Problems: Architecture that wouldn't scale to many clients
- Maintenance Overhead: Managing separate codebases per client
Critical Lessons Learned
- Listen to User Feedback: The "why separate images?" question revealed a fundamental flaw
- Security Should Be Automatic: Manual secret generation invites mistakes
- UI/UX Applies to CLI: Terminal formatting significantly impacts developer experience
- Architecture Decisions Compound: Shared images unlock numerous downstream benefits
- Document Everything: Comprehensive docs enable team scaling and knowledge transfer
Future Considerations
- Secret Rotation: Implement automated secret rotation for high-security environments
- Multi-Environment Support: Extend pattern to staging/production environment separation
- Monitoring Integration: Add deployment status monitoring and alerting
- Template System: Create client configuration templates for common scenarios
Bottom Line: The shift from client-specific images to shared images with environment differentiation represents a fundamental architectural improvement that enhances security, efficiency, and user experience while enabling seamless scaling to unlimited clients.
Docker Alpine Linux Child Process Spawning
Date: 2025-08-09
The ENOENT Spawn Error in Alpine Containers
Critical Issue: Claude CLI failed to spawn in Alpine Docker containers with Error: spawn /usr/local/bin/claude ENOENT despite the binary existing and being executable.
Symptoms:
spawncommand failed with ENOENT errors- Binary existed and was executable when checked directly
- Same code worked outside Docker
- Multiple attempts with different paths all failed
Root Cause Analysis
The Alpine Linux Difference:
- musl libc vs glibc: Alpine uses musl libc instead of glibc
- Shell Differences: Alpine's /bin/sh is BusyBox, not bash
- Binary Compatibility: Node.js binaries compiled for glibc may not work properly with musl
- Spawn Behavior: child_process.spawn behaves differently in Alpine
Discovery Process:
// ❌ All these approaches failed in Alpine
spawn('claude', args) // ENOENT
spawn('/usr/local/bin/claude', args) // ENOENT
spawn('/usr/local/bin/node', ['/usr/local/bin/claude', ...args]) // ENOENT
spawn('sh', ['-c', 'claude ' + args]) // spawn /bin/sh ENOENT
Solution: Use execFile Instead of Spawn
The Working Solution:
import { spawn, execFile } from 'child_process';
if (process.env.RUNNING_IN_DOCKER === 'true') {
// In Docker Alpine, use execFile which is more reliable than spawn
console.log('🐳 Using execFile for Docker Alpine environment');
claudeCommand = '/usr/local/bin/node';
finalArgs = ['/usr/local/bin/claude', ...args];
// execFile doesn't require a shell and works reliably in Alpine
claudeProcess = execFile(claudeCommand, finalArgs, spawnOptions);
} else {
// For non-Docker environments, use regular spawn
claudeProcess = spawn('claude', args, spawnOptions);
}
Why execFile Works When spawn Fails
- No Shell Required: execFile directly executes the binary without shell interpretation
- Path Resolution: execFile handles path resolution differently than spawn
- Alpine Compatibility: Better compatibility with musl libc and BusyBox environment
- Error Handling: More predictable error behavior in minimal environments
Working Directory Path Issues
Secondary Problem: Relative paths like default/workspace caused failures.
Solution: Always use absolute paths in Docker:
let workingDir = cwd || process.cwd();
// If the working directory doesn't start with /, prepend /app/workspaces/
if (!workingDir.startsWith('/')) {
if (process.env.RUNNING_IN_DOCKER === 'true') {
workingDir = `/app/workspaces/${workingDir}`;
} else {
workingDir = path.resolve(workingDir);
}
}
// Ensure the directory exists in Docker
if (process.env.RUNNING_IN_DOCKER === 'true') {
await fs.mkdir(workingDir, { recursive: true });
}
API Key Persistence in Docker
Problem: API keys need to persist across container restarts.
Solution: Load from persistent volume on startup:
// Docker uses /app/config for persistent storage
const isDocker = process.env.RUNNING_IN_DOCKER === 'true';
const configDir = isDocker ? '/app/config' : path.join(__dirname, '..');
// Load .env from persistent volume
if (isDocker && fs.existsSync(path.join(configDir, '.env'))) {
dotenv.config({ path: path.join(configDir, '.env') });
console.log('🔑 ANTHROPIC_API_KEY loaded from .env');
}
Testing Methodology
Verification Script:
#!/bin/bash
# Test Claude CLI in Docker container
echo "1. Checking Claude CLI installation:"
docker compose exec -T sasha-studio-test which claude
echo "2. Testing Claude CLI version:"
docker compose exec -T sasha-studio-test /usr/local/bin/node /usr/local/bin/claude --version
echo "3. Testing execFile approach:"
docker compose exec -T sasha-studio-test /usr/local/bin/node -e "
const { execFile } = require('child_process');
execFile('/usr/local/bin/node', ['/usr/local/bin/claude', '--version'], (error, stdout) => {
if (error) {
console.error('Error:', error.message);
} else {
console.log('Success! Output:', stdout);
}
});
"
Key Insights
- Alpine is Different: Never assume Linux behaviors are universal - Alpine's minimal nature creates unique challenges
- execFile > spawn: In containerized environments, execFile is often more reliable
- Absolute Paths: Always use absolute paths in Docker to avoid ambiguity
- Test in Target Environment: Always test Node.js child processes in the actual Docker container
- Persistent Configuration: Design for configuration persistence from the start
Prevention Strategies
- Choose Base Images Carefully: Consider using node:20 instead of node:20-alpine if compatibility is more important than size
- Test Child Processes Early: Test external binary execution immediately when setting up Docker
- Document Environment Differences: Note Alpine-specific behaviors in documentation
- Use execFile for Reliability: Default to execFile when spawning Node.js scripts in containers
- Implement Fallback Strategies: Have multiple approaches ready for process spawning
Alternative Solutions (Not Used)
- Switch from Alpine: Use the node:20 base image (larger but more compatible)
- Install glibc: Add a glibc compatibility layer to Alpine (complex)
- Use Docker exec: Execute commands via Docker API (requires Docker socket)
- HTTP API Wrapper: Wrap Claude CLI in an HTTP service (additional complexity)
What This Prevented
- Production Failures: Claude CLI completely non-functional in Docker
- User Frustration: Core functionality broken in containerized deployment
- Deployment Blockers: Unable to ship Docker version
- Support Burden: Cryptic ENOENT errors difficult to diagnose
Docker Configuration Best Practices
Dockerfile Optimizations:
# Install Claude CLI globally for all users
RUN npm install -g @anthropic-ai/claude-code@latest
# Ensure proper permissions for nodejs user
RUN mkdir -p /home/nodejs/.claude && \
chown -R nodejs:nodejs /home/nodejs/.claude
# Use dumb-init to handle signals properly
ENTRYPOINT ["dumb-init", "--"]
docker-compose.yml Configuration:
volumes:
- sasha-config:/app/config # Persistent API key storage
- sasha-data:/app/data # Persistent database
environment:
- RUNNING_IN_DOCKER=true
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-} # Optional env override
Critical Learnings
- ENOENT Doesn't Mean File Not Found: In Alpine, it often means execution failed due to library issues
- Shell Option Doesn't Help: Using shell: true with spawn just moves the problem to /bin/sh
- Cross-spawn Isn't Universal: Even cross-platform libraries can fail in Alpine
- Debug with Direct Execution: Test binaries directly in container before attempting to spawn
- Environment Variables Matter: Always verify PATH and other env vars in container
Bottom Line: When deploying Node.js applications that spawn child processes to Alpine Docker containers, use execFile instead of spawn, always use absolute paths, and test thoroughly in the actual container environment. The time saved by using Alpine's smaller image size can be quickly lost to debugging compatibility issues.
Docker Architecture Mismatch - ARM64 vs AMD64
Date: 2025-08-11
The Critical Platform Architecture Problem
Critical Issue: Docker images built on Apple Silicon Macs (ARM64/aarch64) fail to run on AMD64 servers with "exec format error", affecting both system binaries and native Node.js modules.
Symptoms:
- exec /usr/bin/dumb-init: exec format error when the container starts
- Error loading shared library /app/node_modules/node-pty/build/Release/pty.node: Exec format error
- Container exits immediately on Sliplane (AMD64 servers)
- Same image works perfectly on Mac (ARM64)
Root Cause Analysis
The Architecture Contamination Chain:
- Mac builds create ARM64 binaries by default
- Multi-stage Docker builds copy node_modules between stages
- Native modules (like node-pty) contain platform-specific compiled code
- Copied ARM64 binaries fail on AMD64 runtime environment
Why It Worked Before, Then Broke:
- Initial deployments may have been built on AMD64 CI/CD systems
- Local Mac builds started being used for deployment
- Native module dependencies were added or updated
- The problem compounds with each native module added
The Failed Attempts
Attempt 1: Just build for AMD64
docker build --platform linux/amd64 ...
Result: Fixed dumb-init but node-pty still failed
Attempt 2: Rebuild native modules
RUN npm rebuild
Result: Rebuild happened in build stage (ARM64), not runtime stage
Attempt 3: Copy node_modules between stages
COPY --from=builder /app/node_modules ./node_modules
Result: Perpetuated architecture mismatch
The Solution: Fresh Dependencies in Runner Stage
Working Dockerfile Pattern:
# Stage 2: Build (can be ARM64 or AMD64)
FROM node:20-alpine AS builder
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY claudecodeui/ .
RUN npm run build
# Critical: Remove node_modules to prevent contamination
RUN rm -rf node_modules
# Stage 4: Production (MUST be AMD64)
FROM node:20-alpine AS runner
WORKDIR /app
# Copy built assets but NOT node_modules
COPY --from=builder /app/dist ./dist
COPY claudecodeui/package*.json ./
# Install production dependencies fresh for target architecture
RUN npm ci --production && \
# Rebuild ensures native modules compile for THIS platform
npm rebuild
The Complete Build Command
Correct Multi-Platform Build:
# Use buildx for explicit platform targeting
docker buildx build \
--platform linux/amd64 \
-f claudecodeui/Dockerfile.sliplane \
-t linzoid/sasha-studio:$VERSION \
-t linzoid/sasha-studio:latest \
--push \
.
Key Technical Insights
- Native Modules Are Platform-Specific: Modules with C++ bindings (node-pty, bcrypt, better-sqlite3) MUST be compiled for the target architecture
- Multi-Stage Builds Can Contaminate: Copying node_modules between stages carries architecture-specific binaries
- npm ci vs npm install: Use npm ci --production in the runner stage for reproducible, production-only dependencies
- npm rebuild Is Essential: Always run after npm ci to ensure native modules match the platform
- Docker's Platform Flag: --platform linux/amd64 affects ALL stages, not just the final image
Architecture Detection
Verify Image Architecture:
# Check image architecture
docker image inspect linzoid/sasha-studio:latest | grep Architecture
# Inside container, verify platform
docker run --rm linzoid/sasha-studio:latest uname -m
# Should output: x86_64 (not aarch64)
Health Check Validation:
{
"build": {
"platform": "linux",
"arch": "x64" // Must be x64 for Sliplane
}
}
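A minimal sketch of such an architecture-reporting endpoint (assumes Express, as used elsewhere in the project; the route path and response shape are illustrative):
// Health endpoint that reports the runtime architecture
app.get('/api/health', (req, res) => {
  res.json({
    status: 'ok',
    build: {
      platform: process.platform, // e.g. 'linux'
      arch: process.arch,         // must be 'x64' on Sliplane, not 'arm64'
    },
  });
});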
Prevention Strategies
- CI/CD Builds: Use GitHub Actions or other CI/CD that runs on AMD64
- Explicit Platform Targeting: Always specify --platform linux/amd64 for production builds
- Separate Dev/Prod Dockerfiles: Use different approaches for local dev vs production
- Architecture Testing: Add health endpoint that reports architecture
- Build Verification: Test image on AMD64 before deployment
Common Native Modules Affected
- node-pty: Terminal emulation (C++ bindings)
- bcrypt: Password hashing (C++ crypto)
- better-sqlite3: SQLite database (C++ bindings)
- sharp: Image processing (C++ bindings)
- canvas: Canvas rendering (C++ bindings)
What This Prevented
- Complete Deployment Failure: Services unable to start on production servers
- Cryptic Error Messages: "exec format error" doesn't clearly indicate architecture issues
- Time Wasted on Wrong Solutions: Could have spent days on database or authentication debugging
- Platform Lock-in: Would have required Mac-only deployments
Critical Lessons Learned
- "It Works on My Machine" Is Architecture-Dependent: Mac (ARM64) โ Linux servers (AMD64)
- Native Modules Require Special Handling: Can't just copy node_modules around
- Docker Build Context Matters: Building FOR a platform vs ON a platform
- Test on Target Architecture: Always validate on actual deployment platform
- Error Messages Can Mislead: "exec format error" sounds like permissions but is architecture
Debugging Methodology
- Check Container Logs: First error often reveals architecture mismatch
- Inspect Image Architecture: Verify image was built for correct platform
- Test Incrementally: Start with base image, add complexity gradually
- SSH into Container: Direct debugging reveals issues faster than logs
- Compare Working vs Broken: What changed between deployments?
Alternative Solutions (Not Recommended)
- Use Node Images Without Alpine: Larger but more compatible
- Pre-built node_modules: Ship pre-compiled binaries for each platform
- Avoid Native Modules: Use pure JavaScript alternatives (performance cost)
- Platform-Specific Images: Maintain separate ARM64 and AMD64 images
Bottom Line: When deploying Docker containers from Apple Silicon Macs to AMD64 servers, ALWAYS:
- Build with --platform linux/amd64
- Install production dependencies fresh in the runner stage
- Run npm rebuild after installing dependencies
- Never copy node_modules between different architecture stages
This architecture mismatch is a silent killer that only manifests in production. The solution is architectural discipline: build for your target platform, not your development platform.
Architecture Mismatch Resolution: Complete Solution
Date: 2025-08-11
The Final Solution Implementation
After the initial diagnosis and partial fixes, we achieved complete resolution by implementing a systematic approach to Docker architecture management.
Complete Resolution Steps:
- Full Docker Cleanup: Used docker system prune -a --volumes to remove all ARM64 artifacts and build cache that could contaminate new builds
- Enhanced Build Scripts: Updated both build.sh and docker-build.sh to consistently use docker buildx build --platform linux/amd64 --load
- Cross-Platform Build System: Configured Docker buildx properly for Mac M1/M2 → AMD64 cross-compilation
- Verification Pipeline: Added systematic verification at each step:
# Verify local image architecture
docker inspect linzoid/sasha-studio:latest | grep Architecture
# Should show: "Architecture": "amd64"
# Test Docker Hub push/pull
docker push linzoid/sasha-studio:latest
docker rmi linzoid/sasha-studio:latest
docker pull linzoid/sasha-studio:latest
# Verify pulled image is AMD64
docker inspect linzoid/sasha-studio:latest | grep Architecture
- Documentation Updates: Updated CHANGELOG.md to reflect the complete resolution from v1.0.7 (initial work) to v1.0.14 (complete solution)
What Made The Difference
The Missing Piece: While the Dockerfile and build commands were correct, the issue was Docker Hub had cached the old ARM64 image with the latest tag. The solution required:
- Building the correct AMD64 image locally
- Explicitly pushing the specific version tag (1.0.14)
- Re-tagging and pushing latest to overwrite the cached ARM64 version
- Verifying the round-trip (remove local → pull from Hub → verify architecture)
Technical Pattern for Future Use
# Complete architecture fix workflow
docker system prune -a --volumes # Clean contaminated cache
./build.sh --no-bump # Build AMD64 image
docker push linzoid/sasha-studio:1.0.14 # Push specific version
docker tag linzoid/sasha-studio:1.0.14 linzoid/sasha-studio:latest
docker push linzoid/sasha-studio:latest # Overwrite cached latest
docker rmi linzoid/sasha-studio:latest # Remove local latest
docker pull linzoid/sasha-studio:latest # Test from Docker Hub
docker inspect linzoid/sasha-studio:latest | grep Architecture # Verify AMD64
Critical Insights From Resolution
- Docker Hub Caching: Registry caches can persist wrong architecture images even when builds are correct
- Tag Strategy: Always push specific version tags first, then update latest
- End-to-End Verification: Must test the complete pull-from-registry workflow, not just local builds
- Docker System State: Previous builds can contaminate new builds through shared layers and cache
- Build vs Deploy Architecture: The image is built correctly but deploy can still fail due to registry caching
What This Complete Resolution Enables
Immediate Deployment Success:
- Sliplane deployments now work without "exec format error"
- All native modules (node-pty, better-sqlite3) function correctly
- Consistent behavior across development (Mac ARM64) and production (Linux AMD64)
Operational Confidence:
- Verified build-to-deployment pipeline
- Clear debugging methodology for future architecture issues
- Reproducible cross-platform build process
Scaling Benefits:
- New client deployments will work immediately
- Team members with different Mac architectures can deploy successfully
- CI/CD systems can be configured with confidence
Prevention Checklist for Future Projects
- Always specify --platform linux/amd64 for production builds
- Test the complete workflow: build → push → pull → verify architecture
- Update the latest tag after pushing specific versions
- Clean Docker system state between architecture changes
- Document the complete verification process
- Add architecture reporting to application health endpoints
The Impact
This resolution transformed a complete deployment failure (containers wouldn't start) into a fully functional multi-client deployment system. The architecture fix was the final piece enabling the entire shared Docker image strategy to work successfully in production.
Critical Learning: Architecture mismatches in Docker can manifest at multiple levels (local build, registry cache, deployment platform). Complete resolution requires addressing the entire pipeline, not just fixing the build process. Always verify the full round-trip workflow when dealing with cross-architecture builds.
Key Takeaways
- Consistency is Key: Maintaining consistent patterns across mockups improved user experience and development speed
- Security by Design: Building security considerations into the UI from the start prevented later complications
- Progressive Disclosure: Showing advanced features only when needed kept interfaces clean
- Real User Scenarios: Designing for actual use cases (like file system mounting) led to more practical solutions
- Documentation as Development: Creating comprehensive guides alongside development improved feature completeness
- Container Compatibility: Always test child process spawning in the target container environment
- Use execFile in Alpine: For reliable process execution in Alpine Linux, prefer execFile over spawn
- Architecture Awareness: Always build Docker images for the target platform architecture, not your development machine
This document will be updated as the project evolves with new insights and learnings.