Lessons Learnt
A collection of important lessons from developing and deploying Sasha Studio.
2025-08-09: Production Deployment and Container Health Checks
1. Container Health Checks Must Be First Priority
Issue: Sliplane deployment failing because health check endpoints were hanging
Root Cause: Health endpoints placed AFTER middleware (CORS, body parsers) in Express routing
Problem: Middleware was blocking/delaying responses, causing health checks to time out
Solution:
- Move health endpoints to very beginning of middleware stack
- Add simple `/health` endpoint returning immediate 200 OK response
- Keep detailed `/api/health` for monitoring but ensure it's also before auth middleware
Lesson:
- Container orchestrators need instant health responses (< 1s)
- Health endpoints should never depend on authentication, database, or complex middleware
- Place health routes BEFORE any `app.use()` statements
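The ordering rule above can be sketched with plain Node-style handlers. This is a minimal illustration, not the project's actual Express setup: `withHealthFirst` is a hypothetical helper, and the request/response objects only need `method`, `url`, `statusCode`, and `end()`.

```javascript
// Wrap a middleware chain so /health is answered before any of it runs.
// `withHealthFirst` is a hypothetical helper, not part of Express.
function withHealthFirst(middlewares) {
  return function handle(req, res) {
    // Liveness probe: no auth, no database, no body parsing.
    if (req.method === 'GET' && req.url === '/health') {
      res.statusCode = 200;
      res.end('OK');
      return;
    }
    // Every other request walks the normal chain in order.
    let index = 0;
    const next = () => {
      const middleware = middlewares[index++];
      if (middleware) middleware(req, res, next);
    };
    next();
  };
}
```

In Express the same effect comes from registering `app.get('/health', ...)` before the first `app.use(...)` call, because Express runs handlers in registration order.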
2. Database Reset in Production Requires Proper Timing
Issue: RESET_DATABASE environment variable not working in Sliplane deployment
Root Cause: Database reset logic executed BEFORE tables were created
Problem: DELETE statements failed because tables didn't exist yet
Solution:
- Move RESET_DATABASE logic AFTER init.sql execution
- Add comprehensive logging to verify reset success
- Delete from ALL related tables, not just users table
- Verify final state (0 users, 0 profiles)
Lesson:
- Database initialization order matters critically
- Always verify reset operations with logging and final state checks
- Consider foreign key constraints when deleting data
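A minimal sketch of the init-then-reset-then-verify order described above. Several things here are assumptions for illustration: the `db` object with `exec(sql)` and `count(table)` methods, the table names `profiles` and `users` (chosen to match the "0 users, 0 profiles" check), and the convention that `RESET_DATABASE` is the string `'true'` when enabled.

```javascript
// Hypothetical table list: children first so foreign keys are not violated.
const RESET_ORDER = ['profiles', 'users'];

// `db` is any object exposing exec(sql) and count(table); both are assumed here.
function initDatabase(db, initSql, env = process.env) {
  db.exec(initSql); // 1. create tables before touching their rows

  if (env.RESET_DATABASE !== 'true') return { reset: false, remaining: {} };

  for (const table of RESET_ORDER) {
    db.exec(`DELETE FROM ${table}`); // 2. reset only after init has run
    console.log(`[reset] cleared ${table}`);
  }

  // 3. verify the final state instead of trusting the DELETEs
  const remaining = {};
  for (const table of RESET_ORDER) remaining[table] = db.count(table);
  const leftovers = Object.entries(remaining).filter(([, n]) => n !== 0);
  if (leftovers.length > 0) {
    throw new Error(`Reset incomplete: ${JSON.stringify(remaining)}`);
  }
  console.log('[reset] verified:', remaining);
  return { reset: true, remaining };
}
```

Throwing on leftover rows makes a failed reset visible in the deployment logs rather than leaving it silent.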
3. Container Image Caching Can Hide Deployments
Issue: Webhooks triggering successfully but old code still running
Root Cause: Container platform using cached "latest" tag instead of pulling new image
Problem: Multiple deployments appeared successful but no changes were visible
Solution:
- Use timestamp-based tags (20250809-192853) instead of "latest"
- Update deployment scripts to use specific tags
- Manually update image version in platform dashboard when needed
Lesson:
- "latest" tag is often cached aggressively by container platforms
- Always use specific version tags for production deployments
- Verify actual container image being used, not just webhook success
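One way to generate tags in the `20250809-192853` shape is sketched below. Using UTC is an assumption (the source does not say which timezone the tags use), and `imageRef` with its `name` argument is a hypothetical helper.

```javascript
// Build a sortable tag such as 20250809-192853 from a Date (UTC assumed).
function imageTag(date = new Date()) {
  const pad = (n) => String(n).padStart(2, '0');
  const day = `${date.getUTCFullYear()}${pad(date.getUTCMonth() + 1)}${pad(date.getUTCDate())}`;
  const time = `${pad(date.getUTCHours())}${pad(date.getUTCMinutes())}${pad(date.getUTCSeconds())}`;
  return `${day}-${time}`;
}

// Hypothetical helper: combine an image name with a fresh tag.
function imageRef(name, date = new Date()) {
  return `${name}:${imageTag(date)}`;
}
```

Because every build gets a distinct tag, a platform cannot serve a stale cached image under the same name.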
4. Production Debugging Requires Container Platform Expertise
Issue: Hard to diagnose why deployments weren't working
Breakthrough: Reading container platform logs showed exact error messages
Solution:
- Check platform-specific logs (Sliplane logs tab)
- Look for health check failure messages
- Monitor both build logs and runtime logs
- Use platform webhooks for automated deployments
Lesson:
- Each container platform (Sliplane, Heroku, Cloud Run) has unique behavior
- Platform logs are more reliable than webhook responses
- Understand platform-specific health check requirements
2025-08-09: Docker Deployment and CLI Integration
1. Always Verify CLI Flags Exist
Issue: Added --project workspace flag to Claude CLI for Docker deployments
Problem: The flag doesn't exist in Claude CLI, causing "unknown option '--project'" error
Lesson:
- Always check CLI documentation/help before adding new flags
- Test in the target environment immediately after making changes
- Don't assume flags from other tools or older versions exist
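A small guard along these lines can catch a missing flag before it reaches a deployment. This is a sketch only: `helpMentionsFlag` is a hypothetical helper that scans help text you have already captured, and it assumes the CLI lists its long options in that text.

```javascript
// Return true if `flag` appears as a whole option name in the help text.
// Hypothetical helper; it only inspects text, it does not run any CLI.
function helpMentionsFlag(helpText, flag) {
  // Escape regex metacharacters so the flag is matched literally.
  const escaped = flag.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // Require a separator before the flag and a separator or line end after it,
  // so "--pri" does not match "--print".
  const pattern = new RegExp(`(^|[\\s,])${escaped}(?=[\\s,=<\\[]|$)`, 'm');
  return pattern.test(helpText);
}
```

The help text would typically come from running the tool with its help option in the target environment, so the check reflects the version actually installed there.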
2. Git Commits Can Break Working Code
Issue: Sliplane deployment commit introduced breaking changes to previously working Docker setup
Problem: Multiple changes across files created cascading issues
Lesson:
- Review commits carefully, especially when they touch multiple files
- Test in all environments (Docker, local) after deployment-related changes
- Be cautious of "standardization" commits that change core functionality
3. Complex Workarounds Often Create More Problems
Issue: Added Nginx reverse proxy to work around Docker Desktop vpnkit issues
Problem:
- Broke WebSocket connections
- Added complexity without solving the core issue
- Made debugging harder with logs split across services
Lesson:
- Fix root causes, not symptoms
- If a platform has fundamental issues (vpnkit), use a different platform
- Simple solutions are usually better than complex workarounds
4. Docker Desktop for Mac Has Networking Limitations
Issue: File uploads failing with timeouts in Docker Desktop
Root Cause: vpnkit networking layer has bugs with HTTP keep-alive connections
Lesson:
- Be aware of platform-specific limitations
- Docker Desktop != Linux Docker
- Consider alternative container runtimes (Podman) or Linux VMs for development
5. Authentication Flows Need Clear Testing
Issue: User asked if they should be redirected to login when not authenticated
Reality: The feature was already implemented and working
Lesson:
- Document authentication flows clearly
- Provide testing instructions (e.g., "delete localStorage auth-token")
- Make authentication state visible in the UI
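The redirect behaviour can be documented with a sketch like the one below. It is illustrative, not the application's actual routing code: `resolveRoute` and the `/login` path are assumptions, while `auth-token` is the storage key named in the testing instruction above.

```javascript
// `storage` is anything with getItem(key), such as window.localStorage.
// The '/login' path is an assumption; 'auth-token' is the key named above.
function resolveRoute(storage, requestedPath) {
  const token = storage.getItem('auth-token');
  return token ? requestedPath : '/login';
}
```

To exercise the unauthenticated path by hand, run `localStorage.removeItem('auth-token')` in the browser console and reload the page.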
2025-08-08: State Management and React
1. React State Updates Can Be Tricky
Issue: User messages disappearing in chat interface
Root Cause: State being overwritten instead of merged
Lesson: Merge new items into the previous state instead of replacing it, especially when the state is an array
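The difference between overwriting and merging can be shown with two pure update functions. The names `overwriteMessages` and `appendMessage` are hypothetical, and the message shape is an assumption.

```javascript
// Overwrites: every message already in `prev` is lost.
const overwriteMessages = (prev, incoming) => [incoming];

// Merges: returns a new array that keeps the earlier messages.
const appendMessage = (prev, incoming) => [...prev, incoming];

// With React's useState, the merge form is passed as a functional update:
//   setMessages((prev) => appendMessage(prev, reply));
// where `setMessages` is a hypothetical state setter.
```

The functional update form matters because it receives the latest state, so a reply arriving right after a user message does not replace it.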
2. Markdown Rendering Requires Version-Specific Handling
Issue: Inline code showing as full code blocks
Root Cause: react-markdown v10 changed the props it passes to custom component renderers
Lesson: Test thoroughly when upgrading markdown libraries
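One possible heuristic for telling inline code from fenced blocks is sketched below. It assumes the custom code renderer receives only `className` and `children`, which may not match every library version; `looksInline` is a hypothetical helper, not part of react-markdown.

```javascript
// Heuristic: fenced blocks usually carry a language-* class or contain a newline.
// Anything else is treated as inline code.
function looksInline({ className, children }) {
  const classes = String(className || '');
  const text = Array.isArray(children) ? children.join('') : String(children ?? '');
  const hasLanguage = /\blanguage-[\w-]+/.test(classes);
  return !hasLanguage && !text.includes('\n');
}
```

A renderer can branch on this result to emit a plain `<code>` element for inline spans and a highlighted block otherwise. Re-check the heuristic after every library upgrade.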
General Development Principles
1. Test in Target Environment Early
Don't assume local development matches production/Docker behavior
2. Document Experiments
When trying workarounds (like Nginx proxy), document why and what was learned
3. Check Your Assumptions
- CLI flags might not exist
- Libraries might have changed behavior
- Platform differences matter (Mac vs Linux)
4. Simple > Complex
Avoid adding layers (proxies, middleware) to fix issues; address root causes instead
5. Version Control is Your Friend
Use git blame/log to understand how problematic code got introduced
Docker-Specific Lessons
1. Container Environments Need Special Handling
- Paths are different (`/app/workspaces` vs local paths)
- Permissions matter more
- Network behavior differs from host
2. Environment Variables Drive Behavior
- `RUNNING_IN_DOCKER` changes code paths
- Test with the same env vars as production
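A sketch of how one environment variable can switch paths between container and host. The `'true'` value convention and the `./workspaces` local path are assumptions; `/app/workspaces` is the container path mentioned earlier in this section.

```javascript
// Pick the workspace root based on where the code is running.
function workspaceRoot(env = process.env) {
  // Assumption: the flag is the string 'true' when set inside the container.
  const inDocker = env.RUNNING_IN_DOCKER === 'true';
  // './workspaces' is a placeholder for whatever local path the project uses.
  return inDocker ? '/app/workspaces' : './workspaces';
}
```

Taking `env` as a parameter lets tests exercise both branches without changing the real process environment.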
3. Volume Mounts Can Cause Issues
- Permission problems between host and container
- Path mapping confusion
- Sync issues with file watchers
Debugging Strategies That Worked
- Check git history first - Understanding when/why code was added helps fix it
- Read the actual CLI help - Don't assume flags exist
- Test incrementally - Remove one thing at a time (like Nginx)
- Check logs at every layer - Container logs, app logs, proxy logs
- Verify environment variables - They control critical code paths
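The last strategy can be automated with a small startup check. This is a sketch: `missingEnv` is a hypothetical helper, and which variables count as required is up to the deployment.

```javascript
// Return the names from `required` that are unset or empty in `env`.
function missingEnv(required, env = process.env) {
  return required.filter((name) => env[name] === undefined || env[name] === '');
}

// Example of failing fast at startup (the variable list is illustrative):
//   const missing = missingEnv(['RUNNING_IN_DOCKER']);
//   if (missing.length > 0) throw new Error(`Missing env vars: ${missing.join(', ')}`);
```

Logging the result at startup puts the environment state in the same container logs that the other strategies rely on.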