File Upload and Conversion Architecture

Purpose: Technical documentation of the Organization Setup file upload and document conversion system
Created: 2025-08-08
Status: Complete

Overview

The Organization Setup feature includes a document upload and conversion pipeline that transforms business documents such as PDF, Word, and Excel files into Markdown that an AI assistant can read. This lets Sasha learn from an organization's existing documentation.

System Architecture

┌─────────────────┐       ┌──────────────────┐       ┌─────────────────┐
│   Frontend UI   │──────►│   API Endpoint   │──────►│   Document      │
│  (React/Drag)   │       │  (Express/Multer)│       │   Processor     │
└─────────────────┘       └──────────────────┘       └─────────────────┘
                                    │                         │
                                    ▼                         ▼
                          ┌──────────────────┐       ┌─────────────────┐
                          │    Database      │       │   Workspace     │
                          │   (SQLite)       │       │   Manager       │
                          └──────────────────┘       └─────────────────┘
                                                              │
                                                              ▼
                                                    ┌─────────────────┐
                                                    │  Markdown Files │
                                                    │  docs/local/*   │
                                                    └─────────────────┘

Component Details

1. Frontend Upload Interface

Location: src/components/OrganizationSetupScreen.jsx

Features:

Drag-and-drop file upload zone
Multiple file selection support
File type validation (client-side)
Upload progress indication
File preview and removal

Supported File Types:

accept: {
  'application/pdf': ['.pdf'],
  'application/msword': ['.doc', '.docx'],
  'application/vnd.ms-excel': ['.xls', '.xlsx'],
  'application/vnd.ms-powerpoint': ['.ppt', '.pptx'],
  'text/plain': ['.txt'],
  'text/markdown': ['.md'],
  'image/*': ['.png', '.jpg', '.jpeg', '.gif']
}
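As a rough illustration of client-side validation, the extensions from the accept map above can be checked against a file name before upload. This is a minimal sketch: `isAcceptedFile` and `ACCEPTED_EXTENSIONS` are hypothetical names, not taken from OrganizationSetupScreen.jsx, and the check looks only at the extension.

```javascript
// Hypothetical helper, not part of the documented component.
// The extension list is copied from the accept map above.
const ACCEPTED_EXTENSIONS = new Set([
  '.pdf', '.doc', '.docx', '.xls', '.xlsx', '.ppt', '.pptx',
  '.txt', '.md', '.png', '.jpg', '.jpeg', '.gif'
]);

// Return true when the file name ends in one of the accepted extensions.
function isAcceptedFile(fileName) {
  const dot = fileName.lastIndexOf('.');
  if (dot <= 0) return false; // no extension, or a dotfile such as ".env"
  return ACCEPTED_EXTENSIONS.has(fileName.slice(dot).toLowerCase());
}
```

A check like this only improves feedback in the browser. The server-side file filter remains the actual gate.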

2. API Upload Endpoint

Location: server/routes/profile.js

Endpoint: POST /api/profile/organization/complete

Configuration:

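// Multer configuration
import multer from 'multer';
import path from 'path';
import { v4 as uuidv4 } from 'uuid';

const upload = multer({
  storage: multer.diskStorage({
    destination: 'uploads/onboarding/',
    filename: (req, file, cb) => {
      // Store under a generated name; only the original extension is kept
      const uniqueName = `${Date.now()}-${uuidv4()}${path.extname(file.originalname)}`;
      cb(null, uniqueName);
    }
  }),
  limits: {
    fileSize: 50 * 1024 * 1024,  // 50MB max per file
    files: 20                     // Max 20 files at once
  },
  fileFilter: validateFileType    // See "Input Validation" below
});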

Request Format:

Multipart form data
Fields: documents[], companyName, website, industry
Authorization: Bearer token required

Response Format:

{
  "success": true,
  "workspacePath": "/home/user/.claude/projects/default-workspace",
  "documentsProcessed": 3,
  "researchPrompt": "I'll now help you build a comprehensive knowledge base..."
}
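The request described above could be assembled on the client roughly as follows. This is a sketch under assumptions: `buildOrganizationRequest` is a hypothetical helper, and the file field is named `documents` as in the curl example in the Testing section (the field list above writes it as `documents[]`, so the exact name should be checked against the route).

```javascript
// Hypothetical client-side helper; not taken from the application source.
// Uses the FormData and Blob globals available in browsers and Node 18+.
function buildOrganizationRequest(token, fields, files) {
  const form = new FormData();
  form.append('companyName', fields.companyName);
  form.append('website', fields.website);
  form.append('industry', fields.industry);
  for (const file of files) {
    // Repeating the same field name sends the files as an array.
    form.append('documents', new Blob([file.content]), file.name);
  }
  return {
    method: 'POST',
    // Content-Type is left unset so fetch can add the multipart boundary.
    headers: { Authorization: `Bearer ${token}` },
    body: form
  };
}
```

The returned object can be passed as the second argument to `fetch('/api/profile/organization/complete', ...)`. The JSON response then carries the `workspacePath`, `documentsProcessed`, and `researchPrompt` fields shown above.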

3. Document Processor Service

Location: server/services/document-processor.js

Core Function: processOrganizationDocuments(files, workspacePath)

Conversion Strategy:

// Conversion logic by file type
if (file.mimetype === 'text/plain' || file.mimetype === 'text/markdown') {
  // Plain text and Markdown need no conversion: read as UTF-8
  markdown = await fs.readFile(file.path, 'utf-8');
} else if (file.mimetype === 'application/pdf') {
  // Use PDF converter
  const result = await ConvertToMarkdown.pdfToMarkdown(file.path);
  markdown = result.markdown;
} else if (file.mimetype.includes('word')) {
  // Use Word converter ('word' matches application/msword and the .docx wordprocessingml type)
  const result = await ConvertToMarkdown.wordToMarkdown(file.path);
  markdown = result.markdown;
} else if (file.mimetype.includes('excel') || file.mimetype.includes('spreadsheetml')) {
  // Use Excel converter ('excel' matches .xls; .xlsx files report a spreadsheetml MIME type)
  const result = await ConvertToMarkdown.excelToMarkdown(file.path);
  markdown = result.markdown;
}
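One way to make this branch easier to test is to separate choosing a converter from running it. The sketch below is hypothetical (`pickConverter` and the `readAsUtf8` label are illustrative names). It returns the converter names listed in this document and makes the unhandled case explicit. It also assumes that `.xlsx` uploads should go to the Excel converter, so it matches `spreadsheetml` in their MIME type.

```javascript
// Hypothetical lookup; not the actual document-processor.js code.
// Returns the name of the converter to use, or null when none applies.
function pickConverter(mimetype) {
  if (mimetype === 'text/plain' || mimetype === 'text/markdown') {
    return 'readAsUtf8'; // illustrative label for the direct UTF-8 read
  }
  if (mimetype === 'application/pdf') return 'pdfToMarkdown';
  if (mimetype.includes('word')) return 'wordToMarkdown';
  if (mimetype.includes('excel') || mimetype.includes('spreadsheetml')) {
    return 'excelToMarkdown';
  }
  return null; // e.g. images and PowerPoint: the caller decides how to handle these
}
```

Returning `null` instead of falling through silently lets the caller log or skip files that were accepted at upload but have no converter.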

4. Workspace Manager

Location: server/services/workspace-manager.js

Function: setupOrganizationWorkspace(userId, organizationData)

Directory Structure Created:

~/.claude/projects/default-workspace/
├── CLAUDE.md                          # Organization context for Claude
├── docs/
│   ├── local/                        # Converted user documents
│   │   ├── README.md                 # Index of uploaded documents
│   │   ├── contracts/                # Legal documents
│   │   ├── presentations/            # Pitch decks, slides
│   │   ├── handbooks/                # Employee documentation
│   │   ├── financial/                # Financial reports
│   │   └── general/                  # Uncategorized documents
│   ├── organization/                 # AI-generated knowledge docs
│   └── guides/                       # Research guides
│       └── organization-research-guide.md
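The directory skeleton above could be created with a few recursive `mkdir` calls. This is a minimal sketch rather than the actual `setupOrganizationWorkspace` implementation: `createWorkspaceSkeleton` is a hypothetical name, and it creates only the directories, not CLAUDE.md or the README index.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Sub-directories taken from the tree above, relative to the workspace root.
const WORKSPACE_DIRS = [
  'docs/local/contracts',
  'docs/local/presentations',
  'docs/local/handbooks',
  'docs/local/financial',
  'docs/local/general',
  'docs/organization',
  'docs/guides'
];

// Hypothetical helper: create the skeleton under workspaceRoot.
// { recursive: true } creates missing parents and ignores existing directories.
function createWorkspaceSkeleton(workspaceRoot) {
  return WORKSPACE_DIRS.map((dir) => {
    const fullPath = path.join(workspaceRoot, dir);
    fs.mkdirSync(fullPath, { recursive: true });
    return fullPath;
  });
}

// Example: build the skeleton in a throwaway temp directory.
const demoRoot = fs.mkdtempSync(path.join(os.tmpdir(), 'workspace-'));
const created = createWorkspaceSkeleton(demoRoot);
```

Because the calls are idempotent, running setup again for the same user leaves an existing workspace in place.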

5. Document Conversion Library

Package: @knowcode/convert-to-markdown v1.3.0

Available Converters:

pdfToMarkdown(filepath) - Extracts text from PDFs
wordToMarkdown(filepath) - Converts Word docs to Markdown
excelToMarkdown(filepath) - Converts spreadsheets to Markdown tables
wordToHtml(filepath) - Alternative HTML output

Usage Example:

import ConvertToMarkdown from '@knowcode/convert-to-markdown';

// Convert a Word document
const result = await ConvertToMarkdown.wordToMarkdown('document.docx');
const markdown = result.markdown;

// Convert a PDF
const pdfResult = await ConvertToMarkdown.pdfToMarkdown('report.pdf');
const pdfMarkdown = pdfResult.markdown;

File Processing Pipeline

Step 1: File Upload

User drags files to upload zone
Frontend validates file types
Files sent as multipart/form-data
Multer middleware processes upload
Files saved to uploads/onboarding/

Step 2: Document Conversion

Document processor reads uploaded files
Determines conversion strategy by MIME type
Applies appropriate converter
Generates Markdown output

Step 3: Workspace Storage

Creates category-based directory structure
Saves converted Markdown files
Preserves original filenames (with .md extension)
Creates document index

Step 4: Knowledge Base Preparation

Generates CLAUDE.md with organization context
Creates research guide for Claude
Builds research prompt with document references
Returns prompt for auto-launch

Document Categorization

Documents are automatically categorized based on filename patterns:

Category	Keywords	Directory
Pitch Deck	pitch, deck, presentation	`docs/local/pitch_deck/`
Employee Handbook	handbook, onboard, employee	`docs/local/employee_handbook/`
Product Docs	product, feature, spec	`docs/local/product_docs/`
Contracts	contract, agreement, proposal	`docs/local/contracts/`
Financial	financial, budget, revenue	`docs/local/financial/`
Marketing	marketing, sales, brochure	`docs/local/marketing/`
General	(others)	`docs/local/general/`
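The table above can be expressed as an ordered list of keyword rules. This is a sketch, not the production categorizer: the source does not say how ties are resolved or whether matching is case-sensitive, so first-match-wins ordering and case-insensitive substring matching are assumptions here, and `categorizeByFilename` is a hypothetical name.

```javascript
// Keyword rules transcribed from the table above; order decides ties (assumption).
const CATEGORY_RULES = [
  { directory: 'docs/local/pitch_deck/', keywords: ['pitch', 'deck', 'presentation'] },
  { directory: 'docs/local/employee_handbook/', keywords: ['handbook', 'onboard', 'employee'] },
  { directory: 'docs/local/product_docs/', keywords: ['product', 'feature', 'spec'] },
  { directory: 'docs/local/contracts/', keywords: ['contract', 'agreement', 'proposal'] },
  { directory: 'docs/local/financial/', keywords: ['financial', 'budget', 'revenue'] },
  { directory: 'docs/local/marketing/', keywords: ['marketing', 'sales', 'brochure'] }
];

// Return the target directory for a file name, falling back to general/.
function categorizeByFilename(fileName) {
  const name = fileName.toLowerCase();
  const rule = CATEGORY_RULES.find((r) => r.keywords.some((k) => name.includes(k)));
  return rule ? rule.directory : 'docs/local/general/';
}
```

Substring matching is deliberately loose, which also means a short keyword such as `spec` matches names like `inspection-notes.pdf`.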

Database Schema

company_profiles Table

CREATE TABLE company_profiles (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    company_name TEXT,
    company_url TEXT,
    industry TEXT,
    onboarding_completed BOOLEAN DEFAULT 0,
    onboarding_method TEXT DEFAULT 'documents',
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

onboarding_documents Table

CREATE TABLE onboarding_documents (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    company_id INTEGER,
    filename TEXT NOT NULL,
    original_name TEXT NOT NULL,
    file_type TEXT NOT NULL,
    file_size INTEGER,
    document_category TEXT,
    processed BOOLEAN DEFAULT 0,
    upload_date DATETIME DEFAULT CURRENT_TIMESTAMP
);
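To show how an uploaded file might populate this table, the sketch below maps a Multer file object to the `onboarding_documents` columns. The `filename`, `originalname`, `mimetype`, and `size` properties are standard on Multer disk-storage files. The mapping itself and the helper name `toOnboardingDocumentRow` are assumptions.

```javascript
// Hypothetical mapper; the real insert logic is not shown in this document.
function toOnboardingDocumentRow(file, userId, companyId, category) {
  return {
    user_id: userId,
    company_id: companyId ?? null,   // the column is nullable in the schema above
    filename: file.filename,         // generated name on disk
    original_name: file.originalname,
    file_type: file.mimetype,
    file_size: file.size,            // bytes
    document_category: category,
    processed: 0                     // set to 1 once conversion succeeds
  };
}
```

The resulting object can be bound to a parameterized INSERT statement, which keeps user-supplied file names out of the SQL text.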

Security Considerations

File Upload Security

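Size Limits: 50 MB per file, at most 20 files per request
Type Validation: Whitelist of allowed MIME types
Filename Sanitization: Files are stored under generated names (timestamp plus UUID); only the extension of the original name is kept, so a user-supplied name cannot steer the path on disk
Authentication: Bearer token required for all uploads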
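The naming scheme from the Multer configuration can be tried in isolation. This is a sketch, assuming `crypto.randomUUID()` as a stand-in for `uuidv4()` so that it runs without dependencies. `storedFileName` is a hypothetical name.

```javascript
const { randomUUID } = require('node:crypto');
const path = require('node:path');

// Build the on-disk name from a timestamp, a UUID, and the original extension.
function storedFileName(originalName) {
  // path.extname looks only at the last path segment, so directory parts
  // of a hostile name such as "../../etc/passwd.pdf" are discarded.
  const ext = path.extname(originalName);
  return `${Date.now()}-${randomUUID()}${ext}`;
}
```

The extension still comes from the client, so this protects the storage path but says nothing about the file's real content.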

Data Privacy

Local Storage: All documents stored in user's private workspace
No External Processing: Conversion happens locally
Isolated Workspaces: Each user has separate workspace
No Cloud Upload: Documents never leave the instance

Input Validation

fileFilter: (req, file, cb) => {
  const allowedTypes = [
    'application/pdf',
    'application/msword',
    'text/plain',
    'text/markdown',
    // ... other allowed types
  ];
  
  if (allowedTypes.includes(file.mimetype)) {
    cb(null, true);
  } else {
    cb(new Error(`File type ${file.mimetype} not supported`), false);
  }
}

Error Handling

Upload Errors

File too large: Returns 413 error
Invalid file type: Returns 400 error with message
Upload failure: Returns 500 with error details
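The status codes above could be derived from the error object in an Express error handler. The sketch below is hypothetical (`statusForUploadError` is not from the source). `LIMIT_FILE_SIZE` is the code Multer sets when the `fileSize` limit is exceeded, and the 400 branch keys off the message text produced by the file filter shown in this document, which is a brittle but simple assumption.

```javascript
// Hypothetical mapping from an upload error to an HTTP status code.
function statusForUploadError(err) {
  if (err && err.code === 'LIMIT_FILE_SIZE') return 413; // file too large
  if (err && /not supported/.test(err.message || '')) return 400; // rejected by the file filter
  return 500; // anything else is treated as an upload failure
}
```

A dedicated error class for rejected file types would be more robust than matching on the message, at the cost of changing the filter.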

Conversion Errors

Unsupported format: Falls back to text extraction
Corrupted file: Logs error, continues with other files
Missing converter: Uses appropriate fallback

Recovery Strategies

try {
  // Try the converter that matches the file type (PDF shown here)
  const result = await ConvertToMarkdown.pdfToMarkdown(file.path);
  markdown = result.markdown;
} catch (error) {
  // Log which file failed so it can be traced in the logs
  console.error(`Conversion failed for ${file.originalname}: ${error.message}`);
  // Fallback to text extraction
  if (file.mimetype === 'text/plain') {
    markdown = await fs.readFile(file.path, 'utf-8');
  }
}

Performance Optimization

Concurrent Processing

Multiple files processed in parallel
Async/await for non-blocking operations
Stream processing for large files

Caching Strategy

Converted documents cached in workspace
Document index for quick lookups
Metadata stored in database

Resource Management

// Process files in batches to avoid memory issues
const BATCH_SIZE = 5;
for (let i = 0; i < files.length; i += BATCH_SIZE) {
  const batch = files.slice(i, i + BATCH_SIZE);
  await Promise.all(batch.map(processFile));
}
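With `Promise.all`, one rejected conversion rejects the whole batch. If the intent is to continue with the remaining files, a variant built on `Promise.allSettled` is one option. This is a sketch: `processInBatches` is a hypothetical name, and `processFile` is assumed to be an async function that is passed in.

```javascript
// Hypothetical variant of the batching loop that collects failures
// instead of aborting on the first rejected promise.
async function processInBatches(files, processFile, batchSize = 5) {
  const succeeded = [];
  const failed = [];
  for (let i = 0; i < files.length; i += batchSize) {
    const batch = files.slice(i, i + batchSize);
    // allSettled resolves once every promise in the batch has finished.
    const results = await Promise.allSettled(batch.map((file) => processFile(file)));
    results.forEach((result, index) => {
      if (result.status === 'fulfilled') {
        succeeded.push(result.value);
      } else {
        failed.push({ file: batch[index], reason: result.reason });
      }
    });
  }
  return { succeeded, failed };
}
```

Batches still run one after another, so at most `batchSize` conversions are in flight at a time.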

Testing

Unit Tests

// Test file conversion
describe('Document Processor', () => {
  it('should convert text files to markdown', async () => {
    const file = { 
      path: 'test.txt', 
      mimetype: 'text/plain' 
    };
    const result = await convertDocumentToMarkdown(file);
    expect(result).toBeDefined();
    expect(typeof result).toBe('string');
  });
});

Integration Tests

# Test upload endpoint
curl -X POST http://localhost:3005/api/profile/organization/complete \
  -H "Authorization: Bearer $TOKEN" \
  -F "documents=@test-doc.txt" \
  -F "companyName=Test Corp"

Manual Testing

Upload various file types
Verify conversion output
Check workspace structure
Validate research prompt generation

Troubleshooting

Common Issues

Issue	Cause	Solution
Files not converting	Incorrect MIME type	Add MIME type to allowed list
Upload fails	File too large	Increase size limit or compress files
Conversion error	Unsupported format	Check converter compatibility
Empty output	Corrupted file	Validate file integrity
Permission denied	Workspace permissions	Check directory permissions

Debug Commands

# Check uploaded files
ls -la uploads/onboarding/

# Verify workspace structure
tree ~/.claude/projects/default-workspace/

# View conversion logs (console.error output goes to stderr, so merge it into stdout before filtering)
docker logs sasha-test 2>&1 | grep -i "conver"

# Test converter directly
node -e "const c = require('@knowcode/convert-to-markdown'); console.log(Object.keys(c));"

API Reference

Upload Organization Documents

POST /api/profile/organization/complete
Authorization: Bearer {token}
Content-Type: multipart/form-data

Parameters:
- documents[] (files): Array of files to upload
- companyName (string): Organization name
- website (string): Company website URL
- industry (string): Industry category

Response:
{
  "success": true,
  "workspacePath": "/path/to/workspace",
  "documentsProcessed": 3,
  "researchPrompt": "Research prompt for Claude..."
}

Skip Organization Setup

POST /api/profile/organization/skip
Authorization: Bearer {token}

Response:
{
  "success": true,
  "skipped": true,
  "workspacePath": "/path/to/workspace"
}

Future Enhancements

Planned Features

OCR support for scanned documents
PowerPoint conversion support
Batch processing progress updates
Document preview before upload
Automatic duplicate detection
Smart categorization using AI
Version control for documents
Document search and indexing

Performance Improvements

Worker threads for conversion
Redis queue for large batches
CDN integration for static files
Compression for stored documents

Security Enhancements

Virus scanning for uploads
Enhanced MIME type detection
Rate limiting per user
Audit logging for uploads

Conclusion

The file upload and conversion system converts an organization's existing documentation to Markdown and organizes it in a structured workspace. Uploads are size-limited, type-checked, and authenticated, and conversion runs locally. As a result, Sasha can start learning from an organization's knowledge base as soon as setup completes.