File Upload and Conversion Architecture
Purpose: Technical documentation of the Organization Setup file upload and document conversion system
Created: 2025-08-08
Status: Complete
Overview
The Organization Setup feature includes a document upload and conversion pipeline that transforms business documents (PDF, Word, Excel, plain text, and Markdown) into Markdown that an AI assistant can read. This lets Sasha learn from an organization's existing documentation.
System Architecture
┌─────────────────┐      ┌──────────────────┐      ┌─────────────────┐
│   Frontend UI   │─────►│   API Endpoint   │─────►│    Document     │
│  (React/Drag)   │      │ (Express/Multer) │      │    Processor    │
└─────────────────┘      └──────────────────┘      └─────────────────┘
                                  │                         │
                                  ▼                         ▼
                         ┌──────────────────┐      ┌─────────────────┐
                         │     Database     │      │    Workspace    │
                         │     (SQLite)     │      │     Manager     │
                         └──────────────────┘      └─────────────────┘
                                                            │
                                                            ▼
                                                   ┌─────────────────┐
                                                   │ Markdown Files  │
                                                   │  docs/local/*   │
                                                   └─────────────────┘
Component Details
1. Frontend Upload Interface
Location: src/components/OrganizationSetupScreen.jsx
Features:
- Drag-and-drop file upload zone
- Multiple file selection support
- File type validation (client-side)
- Upload progress indication
- File preview and removal
Supported File Types:
accept: {
  'application/pdf': ['.pdf'],
  'application/msword': ['.doc', '.docx'],
  'application/vnd.ms-excel': ['.xls', '.xlsx'],
  'application/vnd.ms-powerpoint': ['.ppt', '.pptx'],
  'text/plain': ['.txt'],
  'text/markdown': ['.md'],
  'image/*': ['.png', '.jpg', '.jpeg', '.gif']
}
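As a rough illustration of the client-side check, the accept map above can be flattened into an extension lookup. This is a sketch only: `isAcceptedFile` is a hypothetical helper, not a function taken from OrganizationSetupScreen.jsx, and it matches on the file extension alone rather than on the MIME type the browser reports.

```javascript
// Accept map copied from the Supported File Types listing.
const accept = {
  'application/pdf': ['.pdf'],
  'application/msword': ['.doc', '.docx'],
  'application/vnd.ms-excel': ['.xls', '.xlsx'],
  'application/vnd.ms-powerpoint': ['.ppt', '.pptx'],
  'text/plain': ['.txt'],
  'text/markdown': ['.md'],
  'image/*': ['.png', '.jpg', '.jpeg', '.gif']
};

// Flatten every extension list into one set for constant-time lookups.
const allowedExtensions = new Set(Object.values(accept).flat());

// Hypothetical helper: true when the filename ends in an allowed extension.
function isAcceptedFile(filename) {
  const dot = filename.lastIndexOf('.');
  if (dot <= 0) return false; // no extension, or a dotfile such as ".env"
  return allowedExtensions.has(filename.slice(dot).toLowerCase());
}
```

A client-side check like this only improves the user experience; the server-side file filter remains the authoritative validation.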
2. API Upload Endpoint
Location: server/routes/profile.js
Endpoint: POST /api/profile/organization/complete
Configuration:
// Multer configuration
import multer from 'multer';
import path from 'path';
import { v4 as uuidv4 } from 'uuid';

const upload = multer({
  storage: multer.diskStorage({
    destination: 'uploads/onboarding/',
    filename: (req, file, cb) => {
      // Timestamp plus UUID keeps names unique; only the extension comes from the client
      const uniqueName = `${Date.now()}-${uuidv4()}${path.extname(file.originalname)}`;
      cb(null, uniqueName);
    }
  }),
  limits: {
    fileSize: 50 * 1024 * 1024, // 50MB max per file
    files: 20 // Max 20 files at once
  },
  fileFilter: validateFileType
});
Request Format:
- Multipart form data
- Fields: documents[], companyName, website, industry
- Authorization: Bearer token required
Response Format:
{
  "success": true,
  "workspacePath": "/home/user/.claude/projects/default-workspace",
  "documentsProcessed": 3,
  "researchPrompt": "I'll now help you build a comprehensive knowledge base..."
}
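The request and response formats above could be exercised from a client roughly as follows. This is a sketch under stated assumptions: `buildOrganizationForm` and `completeOrganizationSetup` are hypothetical names, `baseUrl` and `token` are supplied by the caller, and files are sent under the field name `documents` as in this document's curl example (the field list writes it as `documents[]`, so the name may need adjusting to whatever the server registers with Multer). It relies on the `FormData`, `Blob`, and `fetch` globals available in browsers and in Node 18 or later.

```javascript
// Build the multipart body for POST /api/profile/organization/complete.
// Assumption: files go under the field name "documents".
function buildOrganizationForm(fields, files) {
  const form = new FormData();
  form.append('companyName', fields.companyName);
  form.append('website', fields.website);
  form.append('industry', fields.industry);
  for (const { name, blob } of files) {
    form.append('documents', blob, name);
  }
  return form;
}

// Hypothetical caller: sends the form with a Bearer token and parses the JSON reply.
async function completeOrganizationSetup(baseUrl, token, fields, files) {
  const response = await fetch(`${baseUrl}/api/profile/organization/complete`, {
    method: 'POST',
    // Content-Type is left unset so fetch can add the multipart boundary itself.
    headers: { Authorization: `Bearer ${token}` },
    body: buildOrganizationForm(fields, files)
  });
  if (!response.ok) {
    throw new Error(`Upload failed with status ${response.status}`);
  }
  return response.json(); // { success, workspacePath, documentsProcessed, researchPrompt }
}
```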
3. Document Processor Service
Location: server/services/document-processor.js
Core Function: processOrganizationDocuments(files, workspacePath)
Conversion Strategy:
// Conversion logic by file type
if (file.mimetype === 'text/plain' || file.mimetype === 'text/markdown') {
  // Direct read as UTF-8
  markdown = await fs.readFile(file.path, 'utf-8');
} else if (file.mimetype === 'application/pdf') {
  // Use PDF converter
  const result = await ConvertToMarkdown.pdfToMarkdown(file.path);
  markdown = result.markdown;
} else if (file.mimetype.includes('word')) {
  // Use Word converter (matches application/msword and the .docx "wordprocessingml" type)
  const result = await ConvertToMarkdown.wordToMarkdown(file.path);
  markdown = result.markdown;
} else if (file.mimetype.includes('excel') || file.mimetype.includes('spreadsheetml')) {
  // Use Excel converter (.xlsx reports a "spreadsheetml" type that does not contain "excel")
  const result = await ConvertToMarkdown.excelToMarkdown(file.path);
  markdown = result.markdown;
}
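The same routing can be read as a lookup from MIME type to converter name, which makes it easier to see which accepted upload types have no converter (PowerPoint files and images). This is a minimal sketch, not the code in document-processor.js: `pickConverter` and the `readText` label are hypothetical, while the other names are the converter functions this document lists. The sketch also routes the OOXML spreadsheet type used by .xlsx files, which does not contain the substring "excel".

```javascript
// Map a MIME type to the name of the converter that handles it.
// 'readText' is a hypothetical label for the direct UTF-8 read.
function pickConverter(mimetype) {
  if (mimetype === 'text/plain' || mimetype === 'text/markdown') return 'readText';
  if (mimetype === 'application/pdf') return 'pdfToMarkdown';
  if (mimetype.includes('word')) return 'wordToMarkdown';
  if (mimetype.includes('excel') || mimetype.includes('spreadsheetml')) return 'excelToMarkdown';
  return null; // no converter: the caller decides whether to skip or fall back
}
```

Returning `null` instead of throwing lets the caller log the unsupported file and continue with the rest of the upload.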
4. Workspace Manager
Location: server/services/workspace-manager.js
Function: setupOrganizationWorkspace(userId, organizationData)
Directory Structure Created:
~/.claude/projects/default-workspace/
├── CLAUDE.md                  # Organization context for Claude
└── docs/
    ├── local/                 # Converted user documents
    │   ├── README.md          # Index of uploaded documents
    │   ├── contracts/         # Legal documents
    │   ├── presentations/     # Pitch decks, slides
    │   ├── handbooks/         # Employee documentation
    │   ├── financial/         # Financial reports
    │   └── general/           # Uncategorized documents
    ├── organization/          # AI-generated knowledge docs
    └── guides/                # Research guides
        └── organization-research-guide.md
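Creating this layout might look roughly like the following. It is a sketch, not the body of `setupOrganizationWorkspace`: `createWorkspaceSkeleton` and `WORKSPACE_DIRS` are hypothetical names, and the example runs against a temporary directory rather than `~/.claude`.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Subdirectories from the tree above, relative to the workspace root.
const WORKSPACE_DIRS = [
  'docs/local/contracts',
  'docs/local/presentations',
  'docs/local/handbooks',
  'docs/local/financial',
  'docs/local/general',
  'docs/organization',
  'docs/guides'
];

// Create the directory skeleton; { recursive: true } creates missing parents
// and does not fail when a directory already exists.
function createWorkspaceSkeleton(workspacePath) {
  for (const dir of WORKSPACE_DIRS) {
    fs.mkdirSync(path.join(workspacePath, dir), { recursive: true });
  }
  return workspacePath;
}

// Example run against a throwaway directory.
const demoRoot = createWorkspaceSkeleton(
  fs.mkdtempSync(path.join(os.tmpdir(), 'workspace-'))
);
```

Because the call is idempotent, it can run on every setup request without first checking whether the workspace exists.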
5. Document Conversion Library
Package: @knowcode/convert-to-markdown v1.3.0
Available Converters:
- pdfToMarkdown(filepath): Extracts text from PDFs
- wordToMarkdown(filepath): Converts Word docs to Markdown
- excelToMarkdown(filepath): Converts spreadsheets to Markdown tables
- wordToHtml(filepath): Alternative HTML output
Usage Example:
import ConvertToMarkdown from '@knowcode/convert-to-markdown';
// Convert a Word document
const result = await ConvertToMarkdown.wordToMarkdown('document.docx');
const markdown = result.markdown;
// Convert a PDF
const pdfResult = await ConvertToMarkdown.pdfToMarkdown('report.pdf');
const pdfMarkdown = pdfResult.markdown;
File Processing Pipeline
Step 1: File Upload
- User drags files to upload zone
- Frontend validates file types
- Files sent as multipart/form-data
- Multer middleware processes upload
- Files saved to uploads/onboarding/
Step 2: Document Conversion
- Document processor reads uploaded files
- Determines conversion strategy by MIME type
- Applies appropriate converter
- Generates Markdown output
Step 3: Workspace Storage
- Creates category-based directory structure
- Saves converted Markdown files
- Preserves original filenames (with .md extension)
- Creates document index
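The filename and index steps could be sketched as below. Both helpers are hypothetical (`toMarkdownFilename`, `buildDocumentIndex`), and the index layout is an assumption, since this document does not show the format of the generated README.md.

```javascript
const path = require('node:path');

// Swap the original extension for ".md", for example "Q3 Report.pdf" to "Q3 Report.md".
function toMarkdownFilename(originalName) {
  const base = path.basename(originalName); // drop any directory components
  return `${path.basename(base, path.extname(base))}.md`;
}

// Build a simple Markdown index; entries are { originalName, category }.
function buildDocumentIndex(entries) {
  const lines = ['# Uploaded Documents', ''];
  for (const { originalName, category } of entries) {
    const target = `${category}/${toMarkdownFilename(originalName)}`;
    lines.push(`- [${originalName}](${encodeURI(target)})`);
  }
  return lines.join('\n');
}
```

One design point to watch: two uploads that differ only by extension, such as report.pdf and report.docx, map to the same report.md, so a real implementation would need a collision rule.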
Step 4: Knowledge Base Preparation
- Generates CLAUDE.md with organization context
- Creates research guide for Claude
- Builds research prompt with document references
- Returns prompt for auto-launch
Document Categorization
Documents are automatically categorized based on filename patterns:
| Category | Keywords | Directory |
|---|---|---|
| Pitch Deck | pitch, deck, presentation | docs/local/pitch_deck/ |
| Employee Handbook | handbook, onboard, employee | docs/local/employee_handbook/ |
| Product Docs | product, feature, spec | docs/local/product_docs/ |
| Contracts | contract, agreement, proposal | docs/local/contracts/ |
| Financial | financial, budget, revenue | docs/local/financial/ |
| Marketing | marketing, sales, brochure | docs/local/marketing/ |
| General | (others) | docs/local/general/ |
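A keyword matcher over this table might look like the sketch below. `CATEGORY_RULES` and `categorizeDocument` are hypothetical names, and checking the rules in table order with case-insensitive substring matching is an assumption; the document does not say how ties between categories are resolved.

```javascript
// Keyword rules from the table above, checked in table order.
const CATEGORY_RULES = [
  { dir: 'docs/local/pitch_deck/', keywords: ['pitch', 'deck', 'presentation'] },
  { dir: 'docs/local/employee_handbook/', keywords: ['handbook', 'onboard', 'employee'] },
  { dir: 'docs/local/product_docs/', keywords: ['product', 'feature', 'spec'] },
  { dir: 'docs/local/contracts/', keywords: ['contract', 'agreement', 'proposal'] },
  { dir: 'docs/local/financial/', keywords: ['financial', 'budget', 'revenue'] },
  { dir: 'docs/local/marketing/', keywords: ['marketing', 'sales', 'brochure'] }
];
const DEFAULT_DIR = 'docs/local/general/';

// Return the directory of the first rule with a keyword found in the filename.
function categorizeDocument(filename) {
  const name = filename.toLowerCase();
  const rule = CATEGORY_RULES.find(({ keywords }) =>
    keywords.some((keyword) => name.includes(keyword))
  );
  return rule ? rule.dir : DEFAULT_DIR;
}
```

With first-match ordering, a file named Sales-Deck.pptx lands in pitch_deck rather than marketing, because "deck" is tested before "sales".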
Database Schema
company_profiles Table
CREATE TABLE company_profiles (
  id INTEGER PRIMARY KEY,
  user_id INTEGER NOT NULL,
  company_name TEXT,
  company_url TEXT,
  industry TEXT,
  onboarding_completed BOOLEAN DEFAULT 0,
  onboarding_method TEXT DEFAULT 'documents',
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
  updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
onboarding_documents Table
CREATE TABLE onboarding_documents (
  id INTEGER PRIMARY KEY,
  user_id INTEGER NOT NULL,
  company_id INTEGER,
  filename TEXT NOT NULL,
  original_name TEXT NOT NULL,
  file_type TEXT NOT NULL,
  file_size INTEGER,
  document_category TEXT,
  processed BOOLEAN DEFAULT 0,
  upload_date DATETIME DEFAULT CURRENT_TIMESTAMP
);
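To show how an uploaded file might map onto onboarding_documents, here is a sketch that builds a row from a Multer file object. `toDocumentRow` and `INSERT_DOCUMENT_SQL` are hypothetical, and storing the MIME type in file_type is an assumption; the schema does not say whether that column holds a MIME type or an extension. The statement is left unbound so it can be used with whichever SQLite driver the server has.

```javascript
// Map a Multer file object onto the onboarding_documents columns.
// userId, companyId, and category come from the caller.
function toDocumentRow(file, { userId, companyId = null, category = null }) {
  return {
    user_id: userId,
    company_id: companyId,
    filename: file.filename,          // unique name on disk
    original_name: file.originalname, // name the user uploaded
    file_type: file.mimetype,         // assumption: MIME type is what gets stored
    file_size: file.size,
    document_category: category,
    processed: 0                      // set to 1 once conversion succeeds
  };
}

// Parameterized INSERT; placeholders keep user-supplied names out of the SQL text.
const INSERT_DOCUMENT_SQL = `
  INSERT INTO onboarding_documents
    (user_id, company_id, filename, original_name, file_type, file_size, document_category, processed)
  VALUES (?, ?, ?, ?, ?, ?, ?, ?)`;
```

Passing `Object.values(row)` as the bound parameters works here because the object's keys are declared in the same order as the column list.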
Security Considerations
File Upload Security
- Size Limits: 50MB per file, 20 files max
- Type Validation: Whitelist of allowed MIME types
- Filename Sanitization: UUID-based naming prevents path traversal
- Authentication: Bearer token required for all uploads
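Server-generated names remove most path traversal risk, but a defensive containment check is cheap to add wherever a path is built from request data. The helper below is a hypothetical sketch, not part of the documented system.

```javascript
const path = require('node:path');

// True when `candidate`, resolved against `baseDir`, stays strictly inside it.
function isInsideDirectory(baseDir, candidate) {
  const base = path.resolve(baseDir);
  const relative = path.relative(base, path.resolve(base, candidate));
  if (relative === '' || path.isAbsolute(relative)) return false;
  // A leading ".." segment means the target climbed out of baseDir.
  return relative !== '..' && !relative.startsWith(`..${path.sep}`);
}
```

The check compares resolved paths rather than scanning the input for "..", so it also rejects absolute paths and sequences that normalize to a location outside the upload directory.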
Data Privacy
- Local Storage: All documents stored in user's private workspace
- No External Processing: Conversion happens locally
- Isolated Workspaces: Each user has separate workspace
- No Cloud Upload: Documents never leave the instance
Input Validation
// Passed to Multer as fileFilter (see the upload configuration)
const validateFileType = (req, file, cb) => {
  const allowedTypes = [
    'application/pdf',
    'application/msword',
    'text/plain',
    'text/markdown',
    // ... other allowed types
  ];
  if (allowedTypes.includes(file.mimetype)) {
    cb(null, true);
  } else {
    cb(new Error(`File type ${file.mimetype} not supported`), false);
  }
};
Error Handling
Upload Errors
- File too large: Returns 413 error
- Invalid file type: Returns 400 error with message
- Upload failure: Returns 500 with error details
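One way to produce these status codes is an Express error-handling middleware placed after the upload route. The sketch below is an assumption about how the mapping could be done, not the handler in profile.js. `LIMIT_FILE_SIZE` is the code Multer sets on its error when a file exceeds `limits.fileSize`; `UNSUPPORTED_FILE_TYPE` is a hypothetical code that the file filter would have to attach to its own error, and the response body shape is likewise assumed.

```javascript
// Translate an upload error into the HTTP status listed above.
function statusForUploadError(err) {
  if (err && err.code === 'LIMIT_FILE_SIZE') return 413;       // set by Multer
  if (err && err.code === 'UNSUPPORTED_FILE_TYPE') return 400; // hypothetical filter code
  return 500;
}

// Express error-handling middleware (identified by its four arguments).
function uploadErrorHandler(err, req, res, next) {
  res
    .status(statusForUploadError(err))
    .json({ success: false, error: err.message });
}
```

Keeping the mapping in a separate pure function makes it testable without starting an HTTP server.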
Conversion Errors
- Unsupported format: Falls back to text extraction
- Corrupted file: Logs error, continues with other files
- Missing converter: Uses appropriate fallback
Recovery Strategies
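try {
  // Try the type-specific converter first
  const result = await ConvertToMarkdown.pdfToMarkdown(file.path);
  markdown = result.markdown;
} catch (error) {
  console.error(`Conversion failed for ${file.originalname}: ${error.message}`);
  // Fall back to a raw UTF-8 read, which only makes sense for plain-text files;
  // other types stay unconverted so the remaining files can still be processed
  if (file.mimetype === 'text/plain') {
    markdown = await fs.readFile(file.path, 'utf-8');
  }
}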
Performance Optimization
Concurrent Processing
- Multiple files processed in parallel
- Async/await for non-blocking operations
- Stream processing for large files
Caching Strategy
- Converted documents cached in workspace
- Document index for quick lookups
- Metadata stored in database
Resource Management
// Process files in batches to avoid memory issues
const BATCH_SIZE = 5;
for (let i = 0; i < files.length; i += BATCH_SIZE) {
  const batch = files.slice(i, i + BATCH_SIZE);
  await Promise.all(batch.map(processFile));
}
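`Promise.all` rejects as soon as any file in a batch fails, so it depends on `processFile` catching its own errors. If that is not guaranteed, a variant built on `Promise.allSettled` keeps the "log and continue with other files" behavior described under Conversion Errors. The sketch below is one possible shape; `processInBatches` and its return value are hypothetical.

```javascript
// Process files in fixed-size batches; a failed file is recorded, not fatal.
async function processInBatches(files, processFile, batchSize = 5) {
  const succeeded = [];
  const failed = [];
  for (let i = 0; i < files.length; i += batchSize) {
    const batch = files.slice(i, i + batchSize);
    // The async wrapper turns synchronous throws into rejections as well.
    const results = await Promise.allSettled(
      batch.map(async (file) => processFile(file))
    );
    results.forEach((result, index) => {
      if (result.status === 'fulfilled') {
        succeeded.push(result.value);
      } else {
        failed.push({ file: batch[index], reason: result.reason });
      }
    });
  }
  return { succeeded, failed };
}
```

The `failed` list gives the caller enough detail to report which documents were skipped, for example alongside the documentsProcessed count.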
Testing
Unit Tests
// Test file conversion
describe('Document Processor', () => {
  it('should convert text files to markdown', async () => {
    const file = {
      path: 'test.txt',
      mimetype: 'text/plain'
    };
    const result = await convertDocumentToMarkdown(file);
    expect(result).toBeDefined();
    expect(typeof result).toBe('string');
  });
});
Integration Tests
# Test upload endpoint
curl -X POST http://localhost:3005/api/profile/organization/complete \
  -H "Authorization: Bearer $TOKEN" \
  -F "documents=@test-doc.txt" \
  -F "companyName=Test Corp"
Manual Testing
- Upload various file types
- Verify conversion output
- Check workspace structure
- Validate research prompt generation
Troubleshooting
Common Issues
| Issue | Cause | Solution |
|---|---|---|
| Files not converting | Incorrect MIME type | Add MIME type to allowed list |
| Upload fails | File too large | Increase size limit or compress files |
| Conversion error | Unsupported format | Check converter compatibility |
| Empty output | Corrupted file | Validate file integrity |
| Permission denied | Workspace permissions | Check directory permissions |
Debug Commands
# Check uploaded files
ls -la uploads/onboarding/
# Verify workspace structure
tree ~/.claude/projects/default-workspace/
# View conversion logs
docker logs sasha-test | grep "convert"
# Test converter directly
node -e "const c = require('@knowcode/convert-to-markdown'); console.log(Object.keys(c));"
API Reference
Upload Organization Documents
POST /api/profile/organization/complete
Authorization: Bearer {token}
Content-Type: multipart/form-data
Parameters:
- documents[] (files): Array of files to upload
- companyName (string): Organization name
- website (string): Company website URL
- industry (string): Industry category
Response:
{
  "success": true,
  "workspacePath": "/path/to/workspace",
  "documentsProcessed": 3,
  "researchPrompt": "Research prompt for Claude..."
}
Skip Organization Setup
POST /api/profile/organization/skip
Authorization: Bearer {token}
Response:
{
  "success": true,
  "skipped": true,
  "workspacePath": "/path/to/workspace"
}
Future Enhancements
Planned Features
- OCR support for scanned documents
- PowerPoint conversion support
- Batch processing progress updates
- Document preview before upload
- Automatic duplicate detection
- Smart categorization using AI
- Version control for documents
- Document search and indexing
Performance Improvements
- Worker threads for conversion
- Redis queue for large batches
- CDN integration for static files
- Compression for stored documents
Security Enhancements
- Virus scanning for uploads
- Enhanced MIME type detection
- Rate limiting per user
- Audit logging for uploads
Conclusion
The file upload and conversion system transforms an organization's existing documentation into Markdown and organizes it in a structured workspace. Uploads are validated and size-limited, conversion runs locally, and files are processed in batches, so Sasha can draw on an organization's knowledge base as soon as setup completes.