Last updated: Aug 12, 2025, 01:09 PM UTC

File Upload and Conversion Architecture

Purpose: Technical documentation of the Organization Setup file upload and document conversion system
Created: 2025-08-08
Status: Complete

Overview

The Organization Setup feature includes a document upload and conversion pipeline that converts uploaded business documents (PDF, Word, Excel, and plain-text files) into Markdown stored in the user's workspace. This lets Sasha read and learn from an organization's existing documentation.

System Architecture

┌─────────────────┐       ┌──────────────────┐       ┌─────────────────┐
│   Frontend UI   │──────►│   API Endpoint   │──────►│   Document      │
│  (React/Drag)   │       │  (Express/Multer)│       │   Processor     │
└─────────────────┘       └──────────────────┘       └─────────────────┘
                                    │                         │
                                    ▼                         ▼
                          ┌──────────────────┐       ┌─────────────────┐
                          │    Database      │       │   Workspace     │
                          │   (SQLite)       │       │   Manager       │
                          └──────────────────┘       └─────────────────┘
                                                              │
                                                              ▼
                                                     ┌─────────────────┐
                                                     │  Markdown Files │
                                                     │  docs/local/*   │
                                                     └─────────────────┘

Component Details

1. Frontend Upload Interface

Location: src/components/OrganizationSetupScreen.jsx

Features:

  • Drag-and-drop file upload zone
  • Multiple file selection support
  • File type validation (client-side)
  • Upload progress indication
  • File preview and removal

Supported File Types:

accept: {
  'application/pdf': ['.pdf'],
  'application/msword': ['.doc', '.docx'],
  'application/vnd.ms-excel': ['.xls', '.xlsx'],
  'application/vnd.ms-powerpoint': ['.ppt', '.pptx'],
  'text/plain': ['.txt'],
  'text/markdown': ['.md'],
  'image/*': ['.png', '.jpg', '.jpeg', '.gif']
}
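As a rough illustration of the client-side check, the sketch below tests a file name against the extensions in the accept map. The helper names `isAcceptedFile` and `partitionFiles` are hypothetical and not taken from OrganizationSetupScreen.jsx; the real component may delegate this to its drag-and-drop library.

```javascript
// Accept map copied from the section above: MIME type -> allowed extensions.
const ACCEPT = {
  'application/pdf': ['.pdf'],
  'application/msword': ['.doc', '.docx'],
  'application/vnd.ms-excel': ['.xls', '.xlsx'],
  'application/vnd.ms-powerpoint': ['.ppt', '.pptx'],
  'text/plain': ['.txt'],
  'text/markdown': ['.md'],
  'image/*': ['.png', '.jpg', '.jpeg', '.gif']
};

// Hypothetical helper: true when the file name ends in a listed extension.
function isAcceptedFile(fileName, accept = ACCEPT) {
  const dot = fileName.lastIndexOf('.');
  if (dot <= 0) return false; // no extension, or a dotfile such as ".env"
  const ext = fileName.slice(dot).toLowerCase();
  return Object.values(accept).some((extensions) => extensions.includes(ext));
}

// Hypothetical helper: split a selection into accepted and rejected names.
function partitionFiles(fileNames) {
  const accepted = fileNames.filter((name) => isAcceptedFile(name));
  const rejected = fileNames.filter((name) => !isAcceptedFile(name));
  return { accepted, rejected };
}
```

An extension check like this is only a convenience for the user; the server-side MIME whitelist remains the actual gate.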

2. API Upload Endpoint

Location: server/routes/profile.js

Endpoint: POST /api/profile/organization/complete

Configuration:

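// Multer configuration
import multer from 'multer';
import path from 'path';
import { v4 as uuidv4 } from 'uuid';

const upload = multer({
  storage: multer.diskStorage({
    destination: 'uploads/onboarding/',
    filename: (req, file, cb) => {
      // Timestamp + UUID keeps stored names unique; only the original extension is reused
      const uniqueName = `${Date.now()}-${uuidv4()}${path.extname(file.originalname)}`;
      cb(null, uniqueName);
    }
  }),
  limits: {
    fileSize: 50 * 1024 * 1024,  // 50MB max per file
    files: 20                     // Max 20 files per request
  },
  fileFilter: validateFileType    // MIME whitelist, see Input Validation
});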

Request Format:

  • Multipart form data
  • Fields: documents[], companyName, website, industry
  • Authorization: Bearer token required
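A client could assemble such a request roughly as follows. This is a minimal sketch, assuming a runtime with the standard `FormData` and `Blob` globals (browsers, Node 18+); `buildUploadRequest` and the `{ name, content }` file shape are hypothetical, and the field name `documents` follows the curl example in the Testing section.

```javascript
// Hypothetical helper: assemble the multipart body and headers for the upload.
// `files` is an array of { name, content } pairs used only for this sketch.
function buildUploadRequest(files, fields, token) {
  const form = new FormData();
  for (const file of files) {
    // Field name "documents" follows the curl example in the Testing section.
    form.append('documents', new Blob([file.content]), file.name);
  }
  form.append('companyName', fields.companyName);
  form.append('website', fields.website);
  form.append('industry', fields.industry);

  return {
    method: 'POST',
    // No explicit Content-Type: fetch adds the multipart boundary itself.
    headers: { Authorization: `Bearer ${token}` },
    body: form
  };
}

// Usage (not executed here):
// await fetch('/api/profile/organization/complete', buildUploadRequest(files, fields, token));
```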

Response Format:

{
  "success": true,
  "workspacePath": "/home/user/.claude/projects/default-workspace",
  "documentsProcessed": 3,
  "researchPrompt": "I'll now help you build a comprehensive knowledge base..."
}

3. Document Processor Service

Location: server/services/document-processor.js

Core Function: processOrganizationDocuments(files, workspacePath)

Conversion Strategy:

// Conversion logic by file type (runs once per uploaded file).
// Assumes `fs` is the promise API, e.g. `import fs from 'fs/promises'`.
if (file.mimetype === 'text/plain' || file.mimetype === 'text/markdown') {
  // Direct read as UTF-8
  markdown = await fs.readFile(file.path, 'utf-8');
} else if (file.mimetype === 'application/pdf') {
  // Use PDF converter
  const result = await ConvertToMarkdown.pdfToMarkdown(file.path);
  markdown = result.markdown;
} else if (file.mimetype.includes('word')) {
  // Use Word converter (matches both "msword" and "wordprocessingml")
  const result = await ConvertToMarkdown.wordToMarkdown(file.path);
  markdown = result.markdown;
} else if (file.mimetype.includes('excel') || file.mimetype.includes('spreadsheetml')) {
  // Use Excel converter; .xlsx uploads report a "spreadsheetml" MIME type
  // that does not contain the substring "excel"
  const result = await ConvertToMarkdown.excelToMarkdown(file.path);
  markdown = result.markdown;
}
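The same branching can be expressed as an ordered rule table, which makes the matching order explicit and easy to test. The sketch below is illustrative only: `CONVERSION_RULES` and `pickStrategy` are hypothetical names, the strategy strings reuse the converter names listed in this document, and the `spreadsheetml` match for .xlsx files is an assumption added here.

```javascript
// Ordered rules: the first predicate that matches the MIME type wins.
// Strategy names reuse the converter names listed in this document.
const CONVERSION_RULES = [
  { test: (m) => m === 'text/plain' || m === 'text/markdown', strategy: 'readAsText' },
  { test: (m) => m === 'application/pdf', strategy: 'pdfToMarkdown' },
  { test: (m) => m.includes('word'), strategy: 'wordToMarkdown' },
  // Assumption: also match the OpenXML spreadsheet type, which lacks "excel".
  { test: (m) => m.includes('excel') || m.includes('spreadsheetml'), strategy: 'excelToMarkdown' }
];

// Hypothetical helper: returns a strategy name, or null when nothing matches.
function pickStrategy(mimetype) {
  const rule = CONVERSION_RULES.find((r) => r.test(mimetype));
  return rule ? rule.strategy : null;
}
```

A `null` result covers types such as images and PowerPoint files, which the upload form accepts but for which no converter is listed.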

4. Workspace Manager

Location: server/services/workspace-manager.js

Function: setupOrganizationWorkspace(userId, organizationData)

Directory Structure Created:

~/.claude/projects/default-workspace/
├── CLAUDE.md                         # Organization context for Claude
├── docs/
│   ├── local/                        # Converted user documents
│   │   ├── README.md                 # Index of uploaded documents
│   │   ├── contracts/                # Legal documents
│   │   ├── presentations/            # Pitch decks, slides
│   │   ├── handbooks/                # Employee documentation
│   │   ├── financial/                # Financial reports
│   │   └── general/                  # Uncategorized documents
│   ├── organization/                 # AI-generated knowledge docs
│   └── guides/                       # Research guides
│       └── organization-research-guide.md
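Creating this tree could look roughly like the sketch below, which uses Node's recursive `mkdirSync`. `createWorkspaceTree` is a hypothetical helper, not the actual body of setupOrganizationWorkspace, and the demo writes to a temporary directory rather than `~/.claude`.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Directories from the tree above, relative to the workspace root.
const WORKSPACE_DIRS = [
  'docs/local/contracts',
  'docs/local/presentations',
  'docs/local/handbooks',
  'docs/local/financial',
  'docs/local/general',
  'docs/organization',
  'docs/guides'
];

// Hypothetical helper: create the tree under `root` and return the created paths.
function createWorkspaceTree(root) {
  return WORKSPACE_DIRS.map((dir) => {
    const full = path.join(root, ...dir.split('/'));
    fs.mkdirSync(full, { recursive: true }); // no error if it already exists
    return full;
  });
}

// Demo against a throwaway directory rather than ~/.claude.
const demoRoot = fs.mkdtempSync(path.join(os.tmpdir(), 'workspace-'));
const created = createWorkspaceTree(demoRoot);
console.log(`${created.length} directories created under ${demoRoot}`);
```

Because `recursive: true` tolerates existing directories, the helper can run again on a later upload without special handling.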

5. Document Conversion Library

Package: @knowcode/convert-to-markdown v1.3.0

Available Converters:

  • pdfToMarkdown(filepath) - Extracts text from PDFs
  • wordToMarkdown(filepath) - Converts Word docs to Markdown
  • excelToMarkdown(filepath) - Converts spreadsheets to Markdown tables
  • wordToHtml(filepath) - Alternative HTML output

Usage Example:

import ConvertToMarkdown from '@knowcode/convert-to-markdown';

// Convert a Word document
const result = await ConvertToMarkdown.wordToMarkdown('document.docx');
const markdown = result.markdown;

// Convert a PDF
const pdfResult = await ConvertToMarkdown.pdfToMarkdown('report.pdf');
const pdfMarkdown = pdfResult.markdown;

File Processing Pipeline

Step 1: File Upload

  1. User drags files to upload zone
  2. Frontend validates file types
  3. Files sent as multipart/form-data
  4. Multer middleware processes upload
  5. Files saved to uploads/onboarding/

Step 2: Document Conversion

  1. Document processor reads uploaded files
  2. Determines conversion strategy by MIME type
  3. Applies appropriate converter
  4. Generates Markdown output

Step 3: Workspace Storage

  1. Creates category-based directory structure
  2. Saves converted Markdown files
  3. Preserves original filenames (with .md extension)
  4. Creates document index

Step 4: Knowledge Base Preparation

  1. Generates CLAUDE.md with organization context
  2. Creates research guide for Claude
  3. Builds research prompt with document references
  4. Returns prompt for auto-launch

Document Categorization

Documents are automatically categorized based on filename patterns:

| Category          | Keywords                      | Directory                     |
|-------------------|-------------------------------|-------------------------------|
| Pitch Deck        | pitch, deck, presentation     | docs/local/pitch_deck/        |
| Employee Handbook | handbook, onboard, employee   | docs/local/employee_handbook/ |
| Product Docs      | product, feature, spec        | docs/local/product_docs/      |
| Contracts         | contract, agreement, proposal | docs/local/contracts/         |
| Financial         | financial, budget, revenue    | docs/local/financial/         |
| Marketing         | marketing, sales, brochure    | docs/local/marketing/         |
| General           | (others)                      | docs/local/general/           |
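A keyword matcher along these lines would reproduce the table. It is a sketch under two assumptions not stated in the document: matching is a case-insensitive substring test, and when several categories match, the first row of the table wins. `CATEGORY_RULES` and `categorize` are hypothetical names.

```javascript
// Rules copied from the table above; order is an assumption (first match wins).
const CATEGORY_RULES = [
  { dir: 'pitch_deck', keywords: ['pitch', 'deck', 'presentation'] },
  { dir: 'employee_handbook', keywords: ['handbook', 'onboard', 'employee'] },
  { dir: 'product_docs', keywords: ['product', 'feature', 'spec'] },
  { dir: 'contracts', keywords: ['contract', 'agreement', 'proposal'] },
  { dir: 'financial', keywords: ['financial', 'budget', 'revenue'] },
  { dir: 'marketing', keywords: ['marketing', 'sales', 'brochure'] }
];

// Hypothetical helper: return the target directory for a file name.
function categorize(filename) {
  const lower = filename.toLowerCase();
  const rule = CATEGORY_RULES.find((r) => r.keywords.some((k) => lower.includes(k)));
  return `docs/local/${rule ? rule.dir : 'general'}/`;
}
```

Substring matching is simple but loose: under these assumptions `inspection.pdf` lands in product_docs because it contains "spec", and `Employee Agreement.docx` lands in employee_handbook rather than contracts.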

Database Schema

company_profiles Table

CREATE TABLE company_profiles (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    company_name TEXT,
    company_url TEXT,
    industry TEXT,
    onboarding_completed BOOLEAN DEFAULT 0,
    onboarding_method TEXT DEFAULT 'documents',
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

onboarding_documents Table

CREATE TABLE onboarding_documents (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    company_id INTEGER,
    filename TEXT NOT NULL,
    original_name TEXT NOT NULL,
    file_type TEXT NOT NULL,
    file_size INTEGER,
    document_category TEXT,
    processed BOOLEAN DEFAULT 0,
    upload_date DATETIME DEFAULT CURRENT_TIMESTAMP
);
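To show how an uploaded file might map onto the onboarding_documents columns, the sketch below builds a row and a parameterized INSERT statement. `toDocumentRow` and `buildInsert` are hypothetical helpers; `filename`, `originalname`, `mimetype`, and `size` are the standard fields Multer sets on each file object. The statement is only built, not executed, so no particular SQLite driver is assumed.

```javascript
// Hypothetical helper: map a Multer file object to onboarding_documents columns.
function toDocumentRow(file, userId, companyId, category) {
  return {
    user_id: userId,
    company_id: companyId ?? null,   // column is nullable in the schema
    filename: file.filename,         // unique stored name
    original_name: file.originalname,
    file_type: file.mimetype,
    file_size: file.size,
    document_category: category,
    processed: 0                     // flipped to 1 after conversion
  };
}

// Build a parameterized INSERT so values are never concatenated into SQL.
function buildInsert(row) {
  const columns = Object.keys(row);
  const placeholders = columns.map(() => '?').join(', ');
  return {
    sql: `INSERT INTO onboarding_documents (${columns.join(', ')}) VALUES (${placeholders})`,
    params: Object.values(row)
  };
}
```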

Security Considerations

File Upload Security

  • Size Limits: 50MB per file, 20 files max
  • Type Validation: Whitelist of allowed MIME types
  • Filename Sanitization: UUID-based naming prevents path traversal
  • Authentication: Bearer token required for all uploads

Data Privacy

  • Local Storage: All documents stored in user's private workspace
  • No External Processing: Conversion happens locally
  • Isolated Workspaces: Each user has separate workspace
  • No Cloud Upload: Documents never leave the instance

Input Validation

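// Multer fileFilter referenced as `validateFileType` in the upload configuration
function validateFileType(req, file, cb) {
  const allowedTypes = [
    'application/pdf',
    'application/msword',
    'text/plain',
    'text/markdown',
    // ... other allowed types
  ];

  if (allowedTypes.includes(file.mimetype)) {
    cb(null, true);   // accept the file
  } else {
    // Passing an Error rejects the file and fails the request
    cb(new Error(`File type ${file.mimetype} not supported`), false);
  }
}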

Error Handling

Upload Errors

  • File too large: Returns 413 error
  • Invalid file type: Returns 400 error with message
  • Upload failure: Returns 500 with error details
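One way to produce these status codes is an Express-style error middleware placed after the upload route. The sketch below is an assumption about how the mapping could be done, not the code in server/routes/profile.js: `statusForUploadError` and `uploadErrorHandler` are hypothetical, `LIMIT_FILE_SIZE` is the code Multer sets on its `MulterError` when a file exceeds `limits.fileSize`, and treating other Multer limit errors as 400 is a choice made here.

```javascript
// Hypothetical helper: choose an HTTP status for an error raised during upload.
function statusForUploadError(err) {
  if (err && err.name === 'MulterError') {
    // Multer sets code 'LIMIT_FILE_SIZE' when a file exceeds limits.fileSize.
    return err.code === 'LIMIT_FILE_SIZE' ? 413 : 400;
  }
  if (err && /not supported$/.test(err.message)) {
    return 400; // message produced by the file filter for rejected MIME types
  }
  return 500;
}

// Express-style error middleware built on the helper above.
function uploadErrorHandler(err, req, res, next) {
  res.status(statusForUploadError(err)).json({ success: false, error: err.message });
}
```

Matching on the error message is brittle; a custom error class thrown by the file filter would be a sturdier signal if the codebase allows it.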

Conversion Errors

  • Unsupported format: Falls back to text extraction
  • Corrupted file: Logs error, continues with other files
  • Missing converter: Uses appropriate fallback

Recovery Strategies

try {
  // Try specific converter
  const result = await ConvertToMarkdown.pdfToMarkdown(file.path);
  markdown = result.markdown;
} catch (error) {
  console.error(`Conversion failed: ${error.message}`);
  // Fallback to text extraction
  if (file.mimetype === 'text/plain') {
    markdown = await fs.readFile(file.path, 'utf-8');
  }
}

Performance Optimization

Concurrent Processing

  • Multiple files processed in parallel
  • Async/await for non-blocking operations
  • Stream processing for large files

Caching Strategy

  • Converted documents cached in workspace
  • Document index for quick lookups
  • Metadata stored in database

Resource Management

// Process files in batches to avoid memory issues
const BATCH_SIZE = 5;
for (let i = 0; i < files.length; i += BATCH_SIZE) {
  const batch = files.slice(i, i + BATCH_SIZE);
  await Promise.all(batch.map(processFile));
}

Testing

Unit Tests

// Test file conversion
describe('Document Processor', () => {
  it('should convert text files to markdown', async () => {
    const file = { 
      path: 'test.txt', 
      mimetype: 'text/plain' 
    };
    const result = await convertDocumentToMarkdown(file);
    expect(result).toBeDefined();
    expect(typeof result).toBe('string');
  });
});

Integration Tests

# Test upload endpoint
curl -X POST http://localhost:3005/api/profile/organization/complete \
  -H "Authorization: Bearer $TOKEN" \
  -F "documents=@test-doc.txt" \
  -F "companyName=Test Corp"

Manual Testing

  1. Upload various file types
  2. Verify conversion output
  3. Check workspace structure
  4. Validate research prompt generation

Troubleshooting

Common Issues

| Issue                | Cause                 | Solution                              |
|----------------------|-----------------------|---------------------------------------|
| Files not converting | Incorrect MIME type   | Add MIME type to allowed list         |
| Upload fails         | File too large        | Increase size limit or compress files |
| Conversion error     | Unsupported format    | Check converter compatibility         |
| Empty output         | Corrupted file        | Validate file integrity               |
| Permission denied    | Workspace permissions | Check directory permissions           |

Debug Commands

# Check uploaded files
ls -la uploads/onboarding/

# Verify workspace structure
tree ~/.claude/projects/default-workspace/

# View conversion logs
docker logs sasha-test | grep "convert"

# Test converter directly
node -e "const c = require('@knowcode/convert-to-markdown'); console.log(Object.keys(c));"

API Reference

Upload Organization Documents

POST /api/profile/organization/complete
Authorization: Bearer {token}
Content-Type: multipart/form-data

Parameters:
- documents[] (files): Array of files to upload
- companyName (string): Organization name
- website (string): Company website URL
- industry (string): Industry category

Response:
{
  "success": true,
  "workspacePath": "/path/to/workspace",
  "documentsProcessed": 3,
  "researchPrompt": "Research prompt for Claude..."
}

Skip Organization Setup

POST /api/profile/organization/skip
Authorization: Bearer {token}

Response:
{
  "success": true,
  "skipped": true,
  "workspacePath": "/path/to/workspace"
}

Future Enhancements

Planned Features

  • OCR support for scanned documents
  • PowerPoint conversion support
  • Batch processing progress updates
  • Document preview before upload
  • Automatic duplicate detection
  • Smart categorization using AI
  • Version control for documents
  • Document search and indexing

Performance Improvements

  • Worker threads for conversion
  • Redis queue for large batches
  • CDN integration for static files
  • Compression for stored documents

Security Enhancements

  • Virus scanning for uploads
  • Enhanced MIME type detection
  • Rate limiting per user
  • Audit logging for uploads

Conclusion

The file upload and conversion system turns an organization's existing documents into Markdown and files them in a structured workspace. Uploads are guarded by size limits, a MIME type whitelist, and token authentication, and conversion runs locally. Once setup completes, Sasha can read the converted files and start working from the organization's own knowledge base.