PII Scanning and Removal Guide

Status: Complete Implementation Guide
Version: 1.0
Purpose: Step-by-step procedures for identifying and removing PII from documents
Applicable To: Any document processing, AI preparation, or compliance project

Overview

This guide provides comprehensive procedures for identifying, scanning, and safely removing Personally Identifiable Information (PII) from documents before AI processing or public sharing. The approach emphasizes accuracy, compliance, and maintaining document utility while protecting privacy.

Related Guides:

Legal Compliance Research Guide - Regulatory requirements

Excel to Markdown Conversion Guide - Document processing patterns

Security Architecture Framework - Data protection principles

Key Benefits

Compliance: Meet GDPR, CCPA, HIPAA, and other privacy regulations
Risk Mitigation: Prevent data breaches and privacy violations
AI Readiness: Safely prepare documents for LLM processing
Trust Building: Demonstrate commitment to privacy protection

PII Identification Patterns

Common PII Types and Detection

// pii-patterns.js - Comprehensive PII detection patterns
const PII_PATTERNS = {
  // Personal Identifiers
  ssn: {
    pattern: /\b\d{3}-\d{2}-\d{4}\b|\b\d{9}\b/g,
    description: 'Social Security Number',
    sensitivity: 'critical',
    falsePositiveCheck: (match, context) => {
      // Avoid matching phone numbers or random 9-digit numbers
      return !context.includes('phone') && !context.includes('fax');
    }
  },
  
  // Financial Information
  creditCard: {
    pattern: /\b(?:\d[ -]*?){13,19}\b/g,
    description: 'Credit Card Number',
    sensitivity: 'critical',
    validate: (match) => {
      const cleaned = match.replace(/[ -]/g, '');
      return luhnCheck(cleaned) && cleaned.length >= 13;
    }
  },
  
  // Contact Information
  email: {
    pattern: /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b/g,
    description: 'Email Address',
    sensitivity: 'high',
    contextualCheck: (match, context) => {
      // Exclude generic/system emails
      const genericEmails = ['noreply@', 'info@', 'support@', 'admin@'];
      return !genericEmails.some(generic => match.toLowerCase().includes(generic));
    }
  },
  
  // Health Information
  medicalRecord: {
    pattern: /\b(MRN|Medical Record Number)[:\s]*[\w\d-]+\b/gi,
    description: 'Medical Record Number',
    sensitivity: 'critical',
    requiresContext: true
  },
  
  // Government IDs
  passport: {
    pattern: /\b[A-Z]{1,2}\d{6,9}\b/g,
    description: 'Passport Number',
    sensitivity: 'critical',
    contextualCheck: (match, context) => {
      const keywords = ['passport', 'travel', 'visa'];
      return keywords.some(keyword => 
        context.toLowerCase().includes(keyword)
      );
    }
  },
  
  // Biometric Data
  biometric: {
    keywords: ['fingerprint', 'retina scan', 'facial recognition', 'voiceprint'],
    description: 'Biometric Identifiers',
    sensitivity: 'critical',
    requiresManualReview: true
  }
};

// Luhn algorithm for credit card validation
function luhnCheck(cardNumber) {
  let sum = 0;
  let isEven = false;
  
  for (let i = cardNumber.length - 1; i >= 0; i--) {
    let digit = parseInt(cardNumber[i]);
    
    if (isEven) {
      digit *= 2;
      if (digit > 9) digit -= 9;
    }
    
    sum += digit;
    isEven = !isEven;
  }
  
  return sum % 10 === 0;
}

Context-Aware Detection

// context-detection.js - Smart PII detection with context
class ContextAwarePIIDetector {
  constructor() {
    this.contextWindow = 50; // Characters before/after match
    this.confidenceThreshold = 0.7;
  }
  
  detectWithContext(text, pattern, patternConfig) {
    const matches = [];
    let match;
    
    while ((match = pattern.exec(text)) !== null) {
      const startIdx = Math.max(0, match.index - this.contextWindow);
      const endIdx = Math.min(text.length, match.index + match[0].length + this.contextWindow);
      const context = text.substring(startIdx, endIdx);
      
      const confidence = this.calculateConfidence(match[0], context, patternConfig);
      
      if (confidence >= this.confidenceThreshold) {
        matches.push({
          value: match[0],
          index: match.index,
          length: match[0].length,
          type: patternConfig.description,
          sensitivity: patternConfig.sensitivity,
          confidence: confidence,
          context: context
        });
      }
    }
    
    return matches;
  }
  
  calculateConfidence(match, context, config) {
    let confidence = 0.5; // Base confidence
    
    // Pattern validation
    if (config.validate && config.validate(match)) {
      confidence += 0.3;
    }
    
    // Contextual validation
    if (config.contextualCheck && config.contextualCheck(match, context)) {
      confidence += 0.2;
    }
    
    // False positive check
    if (config.falsePositiveCheck && !config.falsePositiveCheck(match, context)) {
      confidence -= 0.4;
    }
    
    return Math.max(0, Math.min(1, confidence));
  }
}

Scanning Implementation

Multi-Format Document Scanner

// document-scanner.js - Comprehensive document PII scanner
const fs = require('fs');
const path = require('path');
const pdfParse = require('pdf-parse');
const mammoth = require('mammoth');
const xlsx = require('xlsx');

class PIIDocumentScanner {
  constructor(patterns = PII_PATTERNS) {
    this.patterns = patterns;
    this.detector = new ContextAwarePIIDetector();
    this.supportedFormats = ['.txt', '.pdf', '.docx', '.xlsx', '.csv', '.json'];
  }
  
  async scanDocument(filePath) {
    const ext = path.extname(filePath).toLowerCase();
    
    if (!this.supportedFormats.includes(ext)) {
      throw new Error(`Unsupported file format: ${ext}`);
    }
    
    const content = await this.extractContent(filePath, ext);
    const findings = await this.scanContent(content, filePath);
    
    return {
      file: filePath,
      format: ext,
      scanDate: new Date().toISOString(),
      findings: findings,
      summary: this.generateSummary(findings)
    };
  }
  
  async extractContent(filePath, format) {
    switch (format) {
      case '.txt':
        return fs.readFileSync(filePath, 'utf8');
        
      case '.pdf':
        const pdfBuffer = fs.readFileSync(filePath);
        const pdfData = await pdfParse(pdfBuffer);
        return pdfData.text;
        
      case '.docx':
        const docxBuffer = fs.readFileSync(filePath);
        const result = await mammoth.extractRawText({ buffer: docxBuffer });
        return result.value;
        
      case '.xlsx':
      case '.csv':
        const workbook = xlsx.readFile(filePath);
        let content = '';
        workbook.SheetNames.forEach(sheetName => {
          const sheet = workbook.Sheets[sheetName];
          content += xlsx.utils.sheet_to_txt(sheet) + '\n';
        });
        return content;
        
      case '.json':
        const jsonContent = fs.readFileSync(filePath, 'utf8');
        return JSON.stringify(JSON.parse(jsonContent), null, 2);
        
      default:
        throw new Error(`Unsupported format: ${format}`);
    }
  }
  
  async scanContent(content, filePath) {
    const findings = [];
    
    // Scan for each PII pattern
    for (const [patternName, patternConfig] of Object.entries(this.patterns)) {
      if (patternConfig.pattern) {
        const matches = this.detector.detectWithContext(
          content, 
          patternConfig.pattern, 
          patternConfig
        );
        
        findings.push(...matches.map(match => ({
          ...match,
          patternName,
          file: filePath
        })));
      }
      
      // Handle keyword-based detection
      if (patternConfig.keywords) {
        for (const keyword of patternConfig.keywords) {
          const keywordRegex = new RegExp(`\\b${keyword}\\b`, 'gi');
          let match;
          while ((match = keywordRegex.exec(content)) !== null) {
            findings.push({
              value: match[0],
              index: match.index,
              length: match[0].length,
              type: patternConfig.description,
              sensitivity: patternConfig.sensitivity,
              patternName,
              file: filePath,
              requiresManualReview: patternConfig.requiresManualReview
            });
          }
        }
      }
    }
    
    return findings;
  }
  
  generateSummary(findings) {
    const summary = {
      totalFindings: findings.length,
      bySensitivity: {},
      byType: {},
      requiresManualReview: 0
    };
    
    findings.forEach(finding => {
      // Count by sensitivity
      summary.bySensitivity[finding.sensitivity] = 
        (summary.bySensitivity[finding.sensitivity] || 0) + 1;
      
      // Count by type
      summary.byType[finding.type] = 
        (summary.byType[finding.type] || 0) + 1;
      
      // Count manual review requirements
      if (finding.requiresManualReview) {
        summary.requiresManualReview++;
      }
    });
    
    return summary;
  }
}

Batch Processing for Large Datasets

// batch-scanner.js - Efficient batch PII scanning
const { Worker } = require('worker_threads');
const os = require('os');

class BatchPIIScanner {
  constructor(options = {}) {
    this.concurrency = options.concurrency || os.cpus().length;
    this.scanner = new PIIDocumentScanner();
    this.progressCallback = options.onProgress || (() => {});
  }
  
  async scanDirectory(directoryPath, options = {}) {
    const files = await this.collectFiles(directoryPath, options);
    const results = [];
    const errors = [];
    
    // Process files in batches
    const batchSize = Math.ceil(files.length / this.concurrency);
    const batches = [];
    
    for (let i = 0; i < files.length; i += batchSize) {
      batches.push(files.slice(i, i + batchSize));
    }
    
    // Process batches in parallel
    const batchPromises = batches.map((batch, index) => 
      this.processBatch(batch, index, batches.length)
    );
    
    const batchResults = await Promise.all(batchPromises);
    
    // Combine results
    batchResults.forEach(batchResult => {
      results.push(...batchResult.results);
      errors.push(...batchResult.errors);
    });
    
    return {
      scannedFiles: files.length,
      successfulScans: results.length,
      errors: errors,
      results: results,
      summary: this.generateBatchSummary(results)
    };
  }
  
  async processBatch(files, batchIndex, totalBatches) {
    const results = [];
    const errors = [];
    
    for (let i = 0; i < files.length; i++) {
      try {
        const result = await this.scanner.scanDocument(files[i]);
        results.push(result);
        
        // Report progress
        const overallProgress = (
          (batchIndex * files.length + i + 1) / 
          (totalBatches * files.length)
        ) * 100;
        
        this.progressCallback({
          current: files[i],
          progress: overallProgress,
          findings: result.summary.totalFindings
        });
        
      } catch (error) {
        errors.push({
          file: files[i],
          error: error.message
        });
      }
    }
    
    return { results, errors };
  }
  
  async collectFiles(directoryPath, options) {
    const files = [];
    const extensions = options.extensions || ['.txt', '.pdf', '.docx', '.xlsx', '.csv'];
    
    const walkDir = async (dir) => {
      const entries = await fs.promises.readdir(dir, { withFileTypes: true });
      
      for (const entry of entries) {
        const fullPath = path.join(dir, entry.name);
        
        if (entry.isDirectory() && !options.excludeDirs?.includes(entry.name)) {
          await walkDir(fullPath);
        } else if (entry.isFile()) {
          const ext = path.extname(entry.name).toLowerCase();
          if (extensions.includes(ext)) {
            files.push(fullPath);
          }
        }
      }
    };
    
    await walkDir(directoryPath);
    return files;
  }
  
  generateBatchSummary(results) {
    const summary = {
      totalFiles: results.length,
      filesWithPII: 0,
      totalFindings: 0,
      findingsByType: {},
      findingsBySensitivity: {},
      topRiskFiles: []
    };
    
    results.forEach(result => {
      if (result.findings.length > 0) {
        summary.filesWithPII++;
        summary.totalFindings += result.findings.length;
        
        // Track high-risk files
        const criticalFindings = result.findings.filter(
          f => f.sensitivity === 'critical'
        ).length;
        
        if (criticalFindings > 0) {
          summary.topRiskFiles.push({
            file: result.file,
            criticalFindings: criticalFindings,
            totalFindings: result.findings.length
          });
        }
      }
      
      // Aggregate findings
      result.findings.forEach(finding => {
        summary.findingsByType[finding.type] = 
          (summary.findingsByType[finding.type] || 0) + 1;
        summary.findingsBySensitivity[finding.sensitivity] = 
          (summary.findingsBySensitivity[finding.sensitivity] || 0) + 1;
      });
    });
    
    // Sort top risk files
    summary.topRiskFiles.sort((a, b) => 
      b.criticalFindings - a.criticalFindings
    );
    
    return summary;
  }
}

Removal Strategies

Intelligent PII Removal

Important: Always maintain document backups before removal operations

// pii-remover.js - Safe PII removal with audit trails
class PIIRemover {
  constructor(options = {}) {
    this.replacementStrategies = {
      redact: (value, type) => '[REDACTED]',
      mask: (value, type) => this.maskValue(value, type),
      tokenize: (value, type) => this.tokenizeValue(value, type),
      generalize: (value, type) => this.generalizeValue(value, type)
    };
    
    this.strategy = options.strategy || 'redact';
    this.preserveFormat = options.preserveFormat || false;
    this.auditLog = [];
  }
  
  async removeFromDocument(filePath, scanResults, options = {}) {
    // Create backup
    const backupPath = await this.createBackup(filePath);
    
    try {
      const content = await this.readContent(filePath);
      const sanitized = this.removePII(content, scanResults.findings);
      
      // Write sanitized content
      await this.writeContent(filePath, sanitized.content);
      
      // Create audit report
      const auditReport = {
        originalFile: filePath,
        backupFile: backupPath,
        processDate: new Date().toISOString(),
        removalsPerformed: sanitized.removals,
        strategy: this.strategy,
        summary: this.generateRemovalSummary(sanitized.removals)
      };
      
      // Save audit log
      await this.saveAuditLog(filePath, auditReport);
      
      return auditReport;
      
    } catch (error) {
      // Restore from backup on error
      await this.restoreBackup(backupPath, filePath);
      throw error;
    }
  }
  
  removePII(content, findings) {
    // Sort findings by index (reverse order to maintain positions)
    const sortedFindings = [...findings].sort((a, b) => b.index - a.index);
    
    let sanitizedContent = content;
    const removals = [];
    
    sortedFindings.forEach(finding => {
      const replacement = this.replacementStrategies[this.strategy](
        finding.value, 
        finding.type
      );
      
      // Preserve format if requested
      const finalReplacement = this.preserveFormat 
        ? this.preserveOriginalFormat(finding.value, replacement)
        : replacement;
      
      // Replace in content
      sanitizedContent = 
        sanitizedContent.substring(0, finding.index) +
        finalReplacement +
        sanitizedContent.substring(finding.index + finding.length);
      
      // Log removal
      removals.push({
        type: finding.type,
        originalLength: finding.length,
        replacementLength: finalReplacement.length,
        sensitivity: finding.sensitivity,
        strategy: this.strategy
      });
    });
    
    return {
      content: sanitizedContent,
      removals: removals
    };
  }
  
  maskValue(value, type) {
    switch (type) {
      case 'Social Security Number':
        return 'XXX-XX-' + value.slice(-4);
        
      case 'Credit Card Number':
        const cleaned = value.replace(/\D/g, '');
        return '**** **** **** ' + cleaned.slice(-4);
        
      case 'Email Address':
        const [localPart, domain] = value.split('@');
        const maskedLocal = localPart[0] + '*'.repeat(localPart.length - 2) + localPart.slice(-1);
        return maskedLocal + '@' + domain;
        
      case 'Phone Number':
        const digits = value.replace(/\D/g, '');
        return '(***) ***-' + digits.slice(-4);
        
      default:
        return '*'.repeat(value.length);
    }
  }
  
  tokenizeValue(value, type) {
    // Generate consistent token for same value
    const crypto = require('crypto');
    const hash = crypto.createHash('sha256')
      .update(value + type)
      .digest('hex');
    
    return `[TOKEN:${type.toUpperCase().replace(/\s+/g, '_')}:${hash.substring(0, 8)}]`;
  }
  
  generalizeValue(value, type) {
    switch (type) {
      case 'Date of Birth':
        // Replace with year only
        const year = value.match(/\d{4}/);
        return year ? `[YEAR: ${year[0]}]` : '[DATE]';
        
      case 'Address':
        // Keep only city/state
        const cityState = value.match(/,\s*([^,]+),\s*([A-Z]{2})/);
        return cityState ? `[LOCATION: ${cityState[1]}, ${cityState[2]}]` : '[ADDRESS]';
        
      case 'Medical Record Number':
        return '[MEDICAL_ID]';
        
      default:
        return `[${type.toUpperCase().replace(/\s+/g, '_')}]`;
    }
  }
  
  preserveOriginalFormat(original, replacement) {
    // Maintain original spacing and special characters
    if (original.includes('-')) {
      const parts = original.split('-');
      const repParts = replacement.split(/[\s-]/);
      return repParts.join('-');
    }
    
    if (original.includes(' ')) {
      const spaceCount = (original.match(/\s/g) || []).length;
      return replacement + ' '.repeat(Math.max(0, spaceCount));
    }
    
    return replacement;
  }
  
  async createBackup(filePath) {
    const backupDir = path.join(path.dirname(filePath), '.pii-backups');
    await fs.promises.mkdir(backupDir, { recursive: true });
    
    const timestamp = new Date().toISOString().replace(/[:.]/g, '-');
    const backupName = `${path.basename(filePath)}.${timestamp}.backup`;
    const backupPath = path.join(backupDir, backupName);
    
    await fs.promises.copyFile(filePath, backupPath);
    return backupPath;
  }
  
  async saveAuditLog(filePath, auditReport) {
    const auditDir = path.join(path.dirname(filePath), '.pii-audit');
    await fs.promises.mkdir(auditDir, { recursive: true });
    
    const auditFile = path.join(
      auditDir, 
      `${path.basename(filePath)}.audit.json`
    );
    
    // Append to existing audit log
    let existingLog = [];
    try {
      const existing = await fs.promises.readFile(auditFile, 'utf8');
      existingLog = JSON.parse(existing);
    } catch (e) {
      // File doesn't exist yet
    }
    
    existingLog.push(auditReport);
    
    await fs.promises.writeFile(
      auditFile, 
      JSON.stringify(existingLog, null, 2)
    );
  }
}

Validation and Verification

Post-Removal Validation

// validation.js - Verify PII removal effectiveness
class PIIRemovalValidator {
  constructor() {
    this.scanner = new PIIDocumentScanner();
  }
  
  async validateRemoval(sanitizedFile, originalScanResults) {
    // Rescan sanitized file
    const rescanResults = await this.scanner.scanDocument(sanitizedFile);
    
    // Check for remaining PII
    const validationReport = {
      file: sanitizedFile,
      validationDate: new Date().toISOString(),
      originalFindings: originalScanResults.summary.totalFindings,
      remainingFindings: rescanResults.summary.totalFindings,
      removalEffectiveness: 0,
      remainingPII: rescanResults.findings,
      validationStatus: 'PENDING'
    };
    
    // Calculate effectiveness
    if (originalScanResults.summary.totalFindings > 0) {
      validationReport.removalEffectiveness = 
        ((originalScanResults.summary.totalFindings - rescanResults.summary.totalFindings) / 
         originalScanResults.summary.totalFindings) * 100;
    }
    
    // Determine validation status
    if (rescanResults.summary.totalFindings === 0) {
      validationReport.validationStatus = 'PASSED';
    } else if (rescanResults.summary.bySensitivity['critical'] > 0) {
      validationReport.validationStatus = 'FAILED_CRITICAL';
    } else if (validationReport.removalEffectiveness >= 95) {
      validationReport.validationStatus = 'PASSED_WITH_WARNINGS';
    } else {
      validationReport.validationStatus = 'FAILED';
    }
    
    // Generate recommendations
    validationReport.recommendations = this.generateRecommendations(
      rescanResults.findings
    );
    
    return validationReport;
  }
  
  generateRecommendations(remainingFindings) {
    const recommendations = [];
    
    // Group by type
    const findingsByType = {};
    remainingFindings.forEach(finding => {
      findingsByType[finding.type] = findingsByType[finding.type] || [];
      findingsByType[finding.type].push(finding);
    });
    
    // Generate type-specific recommendations
    Object.entries(findingsByType).forEach(([type, findings]) => {
      if (findings.some(f => f.confidence < 0.8)) {
        recommendations.push({
          type: type,
          action: 'MANUAL_REVIEW',
          reason: 'Low confidence detections require human verification'
        });
      }
      
      if (findings.some(f => f.requiresManualReview)) {
        recommendations.push({
          type: type,
          action: 'CONTEXT_REVIEW',
          reason: 'Context-sensitive PII requires manual assessment'
        });
      }
    });
    
    return recommendations;
  }
}

Compliance Reporting

// compliance-reporter.js - Generate compliance documentation
class ComplianceReporter {
  async generateReport(scanResults, removalResults, validationResults) {
    const report = {
      reportId: this.generateReportId(),
      generatedDate: new Date().toISOString(),
      executiveSummary: this.generateExecutiveSummary(
        scanResults, 
        removalResults, 
        validationResults
      ),
      detailedFindings: {
        scan: scanResults,
        removal: removalResults,
        validation: validationResults
      },
      complianceStatus: this.assessComplianceStatus(validationResults),
      recommendations: this.compileRecommendations(
        scanResults, 
        removalResults, 
        validationResults
      )
    };
    
    // Generate different report formats
    const reports = {
      json: report,
      html: this.generateHTMLReport(report),
      pdf: await this.generatePDFReport(report),
      csv: this.generateCSVSummary(report)
    };
    
    return reports;
  }
  
  generateExecutiveSummary(scan, removal, validation) {
    return {
      filesProcessed: scan.scannedFiles,
      piiInstancesFound: scan.summary.totalFindings,
      piiInstancesRemoved: scan.summary.totalFindings - validation.remainingFindings,
      removalEffectiveness: validation.removalEffectiveness + '%',
      complianceStatus: validation.validationStatus,
      criticalIssues: validation.remainingPII.filter(
        f => f.sensitivity === 'critical'
      ).length
    };
  }
  
  assessComplianceStatus(validation) {
    const status = {
      compliant: false,
      regulations: {
        GDPR: false,
        CCPA: false,
        HIPAA: false
      },
      issues: []
    };
    
    // GDPR compliance check
    if (validation.validationStatus === 'PASSED' || 
        validation.validationStatus === 'PASSED_WITH_WARNINGS') {
      status.regulations.GDPR = true;
    } else {
      status.issues.push('GDPR: Personal data not fully removed');
    }
    
    // HIPAA compliance check (if medical data detected)
    const hasMedicalData = validation.remainingPII.some(
      f => f.type.includes('Medical') || f.type.includes('Health')
    );
    
    if (!hasMedicalData) {
      status.regulations.HIPAA = true;
    } else {
      status.issues.push('HIPAA: Protected health information detected');
    }
    
    // Overall compliance
    status.compliant = Object.values(status.regulations).every(v => v === true);
    
    return status;
  }
}

Tool Integration

Open Source Tools Integration

// tool-integration.js - Integrate with popular PII tools
class PIIToolIntegration {
  constructor() {
    this.tools = {
      presidio: new PresidioAnalyzer(),     // Microsoft Presidio
      piilo: new PiiloScanner(),            // Piilo
      privatePanda: new PrivatePanda()      // Private Panda
    };
  }
  
  async enhancedScan(content, options = {}) {
    const results = {
      internal: [],
      external: {}
    };
    
    // Use internal scanner
    const internalScanner = new PIIDocumentScanner();
    results.internal = await internalScanner.scanContent(content);
    
    // Optionally use external tools
    if (options.usePresidio) {
      results.external.presidio = await this.tools.presidio.analyze(content);
    }
    
    if (options.usePiilo) {
      results.external.piilo = await this.tools.piilo.scan(content);
    }
    
    // Merge and deduplicate results
    return this.mergeResults(results);
  }
  
  mergeResults(multiSourceResults) {
    const merged = new Map();
    
    // Process internal results
    multiSourceResults.internal.forEach(finding => {
      const key = `${finding.index}-${finding.value}`;
      merged.set(key, {
        ...finding,
        sources: ['internal']
      });
    });
    
    // Process external results
    Object.entries(multiSourceResults.external).forEach(([tool, findings]) => {
      findings.forEach(finding => {
        const key = `${finding.start}-${finding.text}`;
        
        if (merged.has(key)) {
          merged.get(key).sources.push(tool);
          merged.get(key).confidence = Math.max(
            merged.get(key).confidence,
            finding.score || 0.5
          );
        } else {
          merged.set(key, {
            value: finding.text,
            index: finding.start,
            length: finding.end - finding.start,
            type: finding.entity_type,
            sensitivity: this.mapSensitivity(finding.entity_type),
            confidence: finding.score || 0.5,
            sources: [tool]
          });
        }
      });
    });
    
    return Array.from(merged.values());
  }
}

API Integration Pattern

// api-scanner.js - PII scanning as a service
const express = require('express');
const multer = require('multer');

class PIIScanningAPI {
  constructor() {
    this.app = express();
    this.scanner = new PIIDocumentScanner();
    this.remover = new PIIRemover();
    this.validator = new PIIRemovalValidator();
    
    this.setupRoutes();
  }
  
  setupRoutes() {
    const upload = multer({ dest: 'uploads/' });
    
    // Scan endpoint
    this.app.post('/api/scan', upload.single('document'), async (req, res) => {
      try {
        const scanResults = await this.scanner.scanDocument(req.file.path);
        
        res.json({
          success: true,
          results: scanResults,
          scanId: this.generateScanId()
        });
      } catch (error) {
        res.status(500).json({
          success: false,
          error: error.message
        });
      }
    });
    
    // Remove PII endpoint
    this.app.post('/api/remove', upload.single('document'), async (req, res) => {
      try {
        const { strategy = 'redact' } = req.body;
        
        // Scan first
        const scanResults = await this.scanner.scanDocument(req.file.path);
        
        // Remove PII
        const removalResults = await this.remover.removeFromDocument(
          req.file.path,
          scanResults,
          { strategy }
        );
        
        // Validate removal
        const validationResults = await this.validator.validateRemoval(
          req.file.path,
          scanResults
        );
        
        // Return sanitized file
        res.download(req.file.path, 'sanitized-' + req.file.originalname);
        
      } catch (error) {
        res.status(500).json({
          success: false,
          error: error.message
        });
      }
    });
    
    // Batch processing endpoint
    this.app.post('/api/batch-scan', async (req, res) => {
      const { directory, options } = req.body;
      
      const batchScanner = new BatchPIIScanner({
        onProgress: (progress) => {
          // Could use WebSocket to stream progress
          console.log(`Progress: ${progress.progress}%`);
        }
      });
      
      try {
        const results = await batchScanner.scanDirectory(directory, options);
        
        res.json({
          success: true,
          results: results,
          batchId: this.generateBatchId()
        });
      } catch (error) {
        res.status(500).json({
          success: false,
          error: error.message
        });
      }
    });
  }
}

Implementation Checklist

Pre-Scanning Preparation

Backup Strategy: Ensure all documents are backed up
Access Permissions: Verify read/write permissions
Compliance Requirements: Identify applicable regulations
Sensitivity Classification: Define PII sensitivity levels
Tool Selection: Choose appropriate scanning tools

Scanning Configuration

Pattern Definition: Configure PII detection patterns
False Positive Rules: Set up context validation
Confidence Thresholds: Adjust detection sensitivity
File Format Support: Verify all document types covered
Performance Tuning: Configure batch sizes and concurrency

Removal Verification

Removal Strategy: Select appropriate method (redact/mask/tokenize)
Audit Trail: Enable comprehensive logging
Validation Process: Set up post-removal scanning
Manual Review: Flag items requiring human verification
Rollback Capability: Test backup restoration

Compliance Documentation

Scan Reports: Generate detailed findings reports
Removal Certificates: Document PII removal actions
Validation Records: Maintain effectiveness metrics
Audit Logs: Preserve compliance trail
Regular Reviews: Schedule periodic rescans

Common Pitfalls and Solutions

False Positives

Problem: Detecting non-PII as PII (e.g., product codes as SSNs)
Solution: Implement context-aware validation and industry-specific rules

Performance Issues

Problem: Slow scanning of large document sets
Solution: Use parallel processing and optimize regex patterns

Format Preservation

Problem: Breaking document structure during removal
Solution: Use format-aware replacement strategies

Incomplete Removal

Problem: PII in images, headers, or metadata
Solution: Use comprehensive extraction methods and OCR for images

Additional Resources

Related Guides

Legal Compliance Research Guide - Regulatory requirements
Security Architecture Framework - Security principles
Testing Framework Guide - Testing PII removal
AI Standards Guide - AI data preparation

External Resources

Success Metrics

Detection Accuracy: >95% PII identification rate
False Positive Rate: <5% incorrect detections
Processing Speed: >100 documents/minute
Removal Effectiveness: 100% for critical PII
Compliance Rate: Pass all regulatory audits

This guide provides a comprehensive framework for identifying and removing PII from documents while maintaining compliance and document utility. Regular updates and pattern improvements ensure continued effectiveness against evolving privacy requirements.