BJSS OCR WORKFLOW - DETAILED STEP-BY-STEP PROCESS
=====================================================

This document provides a comprehensive overview of how the BJSS (Bank J. Safra Sarasin) OCR system works, from file upload to data extraction and display.

OVERVIEW
--------
The BJSS OCR system processes investment portfolio documents (POS - Point of Sale statements) to extract bond, equity, and cash position data. It supports both Singapore and Switzerland document formats.

ARCHITECTURE COMPONENTS
-----------------------
1. Frontend (app.html) - User interface for file upload and results display
2. Backend Server (server.js) - Express.js server handling file uploads and processing
3. OCR Processor (ocr-processor.js) - Handles PDF/image to text conversion
4. Data Extractor (data-extractor.js) - Routes to appropriate bank-specific extractors
5. BJSS Extractor (bjss-pos-extraction.js) - Specialized BJSS document processing
6. Configuration (bjss-pos-extraction-config.js) - Document format patterns and rules

DETAILED WORKFLOW
=================

STEP 1: FILE UPLOAD (Frontend)
------------------------------
Location: public/app.html (lines 331-381)

```javascript
async function handleUpload() {
    if (selectedFiles.length === 0) return;

    const formData = new FormData();
    selectedFiles.forEach(file => {
        formData.append('documents', file);
    });

    // Initialize progress tracking
    totalFiles = selectedFiles.length;
    processedFiles = 0;
    showProgress();
    updateProgress();

    // Clear previous results
    allResults = [];

    const response = await fetch('/api/upload-bulk', {
        method: 'POST',
        body: formData
    });
}
```

What happens:
- User selects files via drag-and-drop or file picker
- Files are validated (PDF, PNG, JPG, JPEG, max 50MB each, max 20 files)
- FormData is created with all selected files
- Progress tracking is initialized
- POST request sent to /api/upload-bulk endpoint

STEP 2: SERVER FILE RECEIPT (Backend)
-------------------------------------
Location: server.js (lines 89-222)

```javascript
app.post('/api/upload-bulk', upload.array('documents', 20), async (req, res) => {
    try {
        if (!req.files || req.files.length === 0) {
            return res.status(400).json({
                success: false,
                error: 'No files uploaded'
            });
        }

        const jobs = [];
        const batchId = Date.now().toString();

        for (const file of req.files) {
            // Save file to disk manually
            const uniqueSuffix = Date.now() + '-' + Math.round(Math.random() * 1E9);
            const extension = path.extname(file.originalname);
            const filename = file.fieldname + '-' + uniqueSuffix + extension;
            const filePath = path.join(uploadsDir, filename);

            // Write file to disk
            fs.writeFileSync(filePath, file.buffer);

            const jobData = {
                file_path: filePath,
                original_name: file.originalname,
                file_size: file.size,
                user_id: req.body.user_id || 'anonymous',
                job_id: Date.now().toString() + Math.random().toString(36).substr(2, 9),
                batch_id: batchId,
                timestamp: new Date().toISOString()
            };

            // Process file immediately (synchronous processing)
            // ... processing logic continues
        }
    }
});
```

What happens:
- Multer middleware processes uploaded files (stored in memory)
- Files are saved to disk in uploads/ directory with unique names
- Job data structure is created for each file
- Processing begins immediately (synchronous, not queued)
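The route above relies on a Multer instance (`upload`) configured for in-memory storage, since the handler writes `file.buffer` to disk itself. The exact Multer setup is not reproduced in this document; the following is a minimal sketch, assuming memory storage and the 50MB / PDF-PNG-JPG limits described in Step 1 are also enforced server-side. The names `upload` and `uploadsDir` match the handler above; everything else is illustrative.

```javascript
const multer = require('multer');
const path = require('path');
const fs = require('fs');

// Directory the route handler writes uploaded files into
const uploadsDir = path.join(__dirname, 'uploads');
if (!fs.existsSync(uploadsDir)) fs.mkdirSync(uploadsDir, { recursive: true });

// Keep uploads in memory; the route handler persists buffers to disk itself
const upload = multer({
    storage: multer.memoryStorage(),
    limits: { fileSize: 50 * 1024 * 1024 },   // 50MB per file
    fileFilter: (req, file, cb) => {
        const allowed = ['.pdf', '.png', '.jpg', '.jpeg'];
        if (allowed.includes(path.extname(file.originalname).toLowerCase())) {
            cb(null, true);                   // accept file
        } else {
            cb(new Error('Unsupported file type'));
        }
    }
});
```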
STEP 3: OCR PROCESSING
----------------------
Location: ocr-processor.js (lines 19-39)

```javascript
async processFile(filePath, originalName) {
    try {
        console.log(`🔍 Starting OCR processing for: ${originalName}`);
        const fileExtension = path.extname(filePath).toLowerCase();
        let ocrText = '';

        if (fileExtension === '.pdf') {
            ocrText = await this.processPDF(filePath, originalName);
        } else {
            ocrText = await this.processImage(filePath, originalName);
        }

        console.log(`✅ OCR completed for ${originalName}: ${ocrText.length} characters extracted`);
        return ocrText;
    } catch (error) {
        console.error(`❌ OCR processing failed for ${originalName}:`, error);
        throw new Error(`OCR processing failed: ${error.message}`);
    }
}
```

PDF Processing (lines 41-64):

```javascript
async processPDF(filePath, originalName) {
    try {
        console.log(`📄 Processing PDF: ${originalName}`);

        // First, try direct text extraction from PDF
        const pdfBuffer = fs.readFileSync(filePath);
        const data = await pdfParse(pdfBuffer);
        let ocrText = data.text.trim();

        console.log(`📊 PDF text extraction: ${ocrText.length} characters`);

        // If text extraction is limited, it's likely an image-based PDF
        if (ocrText.length < 100) {
            console.log(`⚠️ PDF appears to be image-based (${ocrText.length} chars) - converting to images for OCR`);
            ocrText = await this.processImageBasedPDF(filePath, originalName);
        }

        return ocrText;
    }
}
```

Image Processing (lines 119-137):

```javascript
async processImage(filePath, originalName) {
    try {
        console.log(`🖼️ Processing image: ${originalName}`);

        const { data: { text } } = await Tesseract.recognize(filePath, 'eng', {
            logger: m => {
                if (m.status === 'recognizing text') {
                    console.log(`📊 OCR Progress for ${originalName}: ${Math.round(m.progress * 100)}%`);
                }
            }
        });

        return text.trim();
    }
}
```

What happens:
- File extension determines processing method (PDF vs image)
- PDFs: direct text extraction is attempted first, with a fallback to image conversion if needed
- Images: direct Tesseract OCR processing
- Progress is logged and tracked
- Raw text is returned for further processing

STEP 4: BANK TYPE DETECTION
---------------------------
Location: ocr-processor.js (lines 141-241)

```javascript
detectBankType(text, filename = '') {
    const textLower = text.toLowerCase();
    const filenameLower = filename.toLowerCase();

    console.log(`🔍 Detecting bank type for: ${filename}`);
    console.log(`📄 Text length: ${text.length}`);

    // First try to detect from filename (order matters - more specific first)
    if (filenameLower.includes('bjss') ||
        filenameLower.includes('safra') ||
        filenameLower.includes('sarasin') ||
        filenameLower.includes('investment') ||
        filenameLower.includes('portfolio') ||
        filenameLower.includes('statement') ||
        filenameLower.includes('pos') ||
        filenameLower.includes('positions')) {
        console.log(`✅ Detected from filename: BJSS`);
        return 'bjss';
    }

    // Then try to detect from text content
    if (textLower.includes('investment report') ||
        textLower.includes('bank j. safra sarasin') ||
        textLower.includes('safra sarasin') ||
        textLower.includes('2. bonds') ||
        textLower.includes('3. equities') ||
        textLower.includes('2. equities') ||
        (textLower.includes('2. bonds') && textLower.includes('3. equities')) ||
        (textLower.includes('total bonds') && textLower.includes('total equities')) ||
        (textLower.includes('reference currency') && textLower.includes('client number'))) {
        console.log(`✅ Detected from text: BJSS`);
        return 'bjss';
    }

    // ... other bank detection logic

    console.log(`⚠️ No specific bank detected - defaulting to NOT FOUND`);
    return 'not_found';
}
```

What happens:
- Analyzes filename patterns first (more reliable)
- Then analyzes text content for bank-specific keywords
- BJSS detection looks for: "investment report", "safra sarasin", "2. bonds", "3. equities", etc.
- Returns a bank type string for routing to the appropriate extractor
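To make the hand-off between Steps 3 and 4 concrete, here is a minimal sketch of how the server side might drive the OCR processor for one uploaded file. It only uses the methods shown above; the `OCRProcessor` require/instantiation details are assumptions about the module's export shape, not confirmed by the excerpts.

```javascript
const OCRProcessor = require('./ocr-processor');   // assumed export; adjust to the actual module shape

async function runOcrAndDetect(filePath, originalName) {
    const processor = new OCRProcessor();

    // Step 3: PDF or image -> raw OCR text
    const ocrText = await processor.processFile(filePath, originalName);

    // Step 4: raw text + filename -> bank type string ('bjss', ..., or 'not_found')
    const bankType = processor.detectBankType(ocrText, originalName);

    return { ocrText, bankType };
}
```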
STEP 5: DATA EXTRACTION ROUTING
-------------------------------
Location: data-extractor.js (lines 25-104)

```javascript
async extractData(text, bankType, filename = '') {
    try {
        console.log(`🎯 Starting data extraction for: ${filename}`);
        console.log(`🏦 Bank type: ${bankType}`);
        console.log(`📄 Text length: ${text.length}`);

        if (!this.extractors[bankType]) {
            if (bankType === 'not_found') {
                console.log(`⚠️ Bank type not detected - returning not found result`);
                return {
                    success: false,
                    bank_type: 'not_found',
                    filename: filename,
                    error: 'Bank type not detected - document may not be supported',
                    text_length: text.length,
                    timestamp: new Date().toISOString()
                };
            }
            throw new Error(`Unsupported bank type: ${bankType}`);
        }

        const ExtractorClass = this.extractors[bankType];
        const extractor = new ExtractorClass();

        // Use the extractFromText method
        console.log(`🔍 About to call extractFromText for ${bankType}`);
        const result = extractor.extractFromText(text);
        console.log(`🔍 extractFromText returned:`, !!result);

        if (result) {
            console.log(`✅ Data extraction successful for ${bankType}`);
            return {
                success: true,
                bank_type: bankType,
                filename: filename,
                extracted_data: result,
                text_length: text.length,
                timestamp: new Date().toISOString()
            };
        }
    }
}
```

What happens:
- Routes to the appropriate extractor based on the detected bank type
- For BJSS: uses the BJSSPOSDocumentExtractor class
- Calls the extractFromText method on the extractor
- Returns a structured result with success status and extracted data
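The routing above depends on a `this.extractors` lookup table that maps bank type strings to extractor classes. Its construction is not shown in the excerpt; the following is a minimal sketch of what the constructor might look like, assuming only the BJSS mapping described in this document (the require path and class export name are assumptions).

```javascript
const BJSSPOSDocumentExtractor = require('./bjss-pos-extraction');   // assumed export name

class DataExtractor {
    constructor() {
        // Map detected bank type strings (Step 4) to extractor classes
        this.extractors = {
            bjss: BJSSPOSDocumentExtractor
            // other supported banks would be registered here
        };
    }

    // async extractData(text, bankType, filename) as shown above
}
```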
STEP 6: BJSS DOCUMENT PROCESSING
--------------------------------
Location: bjss-pos-extraction.js (lines 172-200)

```javascript
extractFromFile(filePath) {
    try {
        // Clear previous log file
        if (fs.existsSync(logFile)) {
            fs.unlinkSync(logFile);
        }

        const content = fs.readFileSync(filePath, 'utf8');
        const lines = content.split('\n');

        logToFile('🔍 BJSS POS Extraction Started');
        logToFile(`📄 File path: ${filePath}`);
        logToFile(`📊 Total lines: ${lines.length}`);

        // Detect document format first
        this.documentFormat = this.detectDocumentFormat(lines);
        logToFile(`📋 Detected document format: ${this.documentFormat}`);

        // Extract client information
        this.extractClientInfo(lines);

        // Initialize ISIN tracking to prevent cross-contamination
        this.usedISINs = new Set();

        // Extract positions using separate functions
        this.extractedData.cashPositions = this.extractCashPositions(lines);
        this.extractedData.bondPositions = this.extractBondPositions(lines);
        this.extractedData.equityPositions = this.extractEquityPositions(lines);
        this.extractedData.otherAssetPositions = [];

        // ... rest of processing
    }
}
```

Document Format Detection (lines 70-164):

```javascript
detectDocumentFormat(lines) {
    let singaporeScore = 0;
    let switzerlandScore = 0;

    // Check for bank identifiers
    for (const line of lines) {
        if (line.includes(this.config.documentFormats.singapore.identifier)) {
            singaporeScore += 3;   // High weight for bank identifier
        }
        if (line.includes(this.config.documentFormats.switzerland.identifier)) {
            switzerlandScore += 3; // High weight for bank identifier
        }
    }

    // Check for format-specific header fields
    for (const line of lines) {
        for (const field of this.config.documentFormats.singapore.headerFields) {
            if (line.includes(field)) {
                singaporeScore += 1;
            }
        }
        for (const field of this.config.documentFormats.switzerland.headerFields) {
            if (line.includes(field)) {
                switzerlandScore += 1;
            }
        }
    }

    // Analyze bond section structure density
    // Singapore format: 20-25 lines per bond (verbose)
    // Switzerland format: 9-12 lines per bond (compact)

    // Return detected format
    if (switzerlandScore > singaporeScore) {
        return 'switzerland';
    } else {
        return 'singapore';   // Default to Singapore if scores are equal
    }
}
```

What happens:
- Splits OCR text into lines for processing
- Detects document format (Singapore vs Switzerland) using a scoring system
- Extracts client information (name, number, portfolio, currency, dates)
- Processes three main sections: Cash, Bonds, and Equities
- Uses configuration-based patterns for each format
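The format detection above reads bank identifiers and header fields from the configuration module (see the CONFIGURATION section near the end of this document). The actual contents of bjss-pos-extraction-config.js are not reproduced here; the sketch below only shows the shape the code relies on, and the string values are illustrative placeholders rather than the real configuration.

```javascript
// bjss-pos-extraction-config.js - shape assumed by detectDocumentFormat();
// identifier and headerFields values below are hypothetical placeholders
module.exports = {
    documentFormats: {
        singapore: {
            identifier: 'Bank J. Safra Sarasin Ltd, Singapore Branch',   // hypothetical
            headerFields: ['Client Number', 'Portfolio Number', 'Reference Currency']
        },
        switzerland: {
            identifier: 'Bank J. Safra Sarasin Ltd',                     // hypothetical
            headerFields: ['Client number', 'Reference currency', 'Reference date']
        }
    }
};
```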
STEP 7: POSITION DATA EXTRACTION
--------------------------------
The BJSS extractor processes three types of positions:

A. Cash Positions Extraction:

```javascript
extractCashPositions(lines) {
    // Looks for the "1.2 Positions" section
    // Extracts: Description, Account, Currency, Nominal, Purchase Price,
    //           Valuation USD, Accrued Interest, Proportion
    // Handles both Singapore and Switzerland formats
}
```

B. Bond Positions Extraction:

```javascript
extractBondPositions(lines) {
    // Looks for the "2.2 Positions" section
    // Extracts: Currency, Quantity, Name, Country, Purchase Price, Purchase Price FX,
    //           Market Price, Market Price FX, Value, ISIN
    // Uses percentage patterns and company name detection
    // Prevents duplicate ISINs across sections
}
```

C. Equity Positions Extraction:

```javascript
extractEquityPositions(lines) {
    // Looks for the "3.2 Positions" section
    // Extracts: Currency, Quantity, Name, Country, Purchase Price, Purchase Price FX,
    //           Market Price, Market Price FX, Value, ISIN, DividendYield
    // Similar to bonds but includes dividend yield information
}
```
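The production functions are driven by the configuration patterns and handle both document formats. The following is only a simplified sketch of the shared section-scanning idea (locate the section header, read lines until the next numbered section, pull fields out with regular expressions). The section markers, regexes, and field handling here are illustrative, not the real extraction logic.

```javascript
// Simplified illustration of section-based scanning; not the production extractor
function extractSectionPositions(lines, sectionHeader, nextSectionPrefix) {
    const positions = [];
    let inSection = false;

    for (const rawLine of lines) {
        const line = rawLine.trim();

        if (line.startsWith(sectionHeader)) {        // e.g. '2.2 Positions'
            inSection = true;
            continue;
        }
        if (inSection && nextSectionPrefix && line.startsWith(nextSectionPrefix)) {
            break;                                   // stop at the next top-level section
        }
        if (!inSection) continue;

        // Illustrative field patterns: an ISIN plus a trailing value on the same line
        const isinMatch = line.match(/\b[A-Z]{2}[A-Z0-9]{9}\d\b/);
        const valueMatch = line.match(/([\d']+\.\d{2})\s*$/);

        if (isinMatch) {
            positions.push({
                ISIN: isinMatch[0],
                Value: valueMatch ? valueMatch[1] : ''
            });
        }
    }
    return positions;
}

// Illustrative call: extractSectionPositions(lines, '2.2 Positions', '3.') for the bond section
```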
STEP 8: RESULT STRUCTURE
------------------------
The extracted data is structured as:

```javascript
{
    success: true,
    bank_type: 'bjss',
    filename: 'document.pdf',
    extracted_data: {
        documentType: 'BJSS_POS',
        documentFormat: 'singapore' | 'switzerland',
        clientInfo: {
            clientName: 'Client Name',
            clientNumber: '12345',
            portfolioNumber: 'P67890',
            referenceCurrency: 'USD',
            referenceDate: '2024-01-01',
            creationDate: '2024-01-02'
        },
        bondPositions: [
            {
                Currency: 'USD',
                Quantity: '1000',
                Name: 'Company Bond',
                Country: 'US',
                PurchasePrice: '100.50',
                PurchasePriceFX: '1.00',
                MarketPrice: '101.25',
                MarketPriceFX: '1.00',
                Value: '1012500',
                ISIN: 'US1234567890'
            }
        ],
        equityPositions: [
            {
                Currency: 'USD',
                Quantity: '500',
                Name: 'Company Stock',
                Country: 'US',
                PurchasePrice: '50.00',
                PurchasePriceFX: '1.00',
                MarketPrice: '52.50',
                MarketPriceFX: '1.00',
                Value: '262500',
                ISIN: 'US9876543210',
                DividendYield: '2.5%'
            }
        ],
        cashPositions: [
            {
                Description: 'Current account USD',
                Account: '83.61738.0 4000',
                Currency: 'USD',
                Nominal: '100000',
                PurchasePrice: '1.00',
                ValuationUSD: '100000',
                AccruedInterest: '0',
                Proportion: '5.2%'
            }
        ]
    },
    text_length: 15420,
    timestamp: '2024-01-15T10:30:00.000Z'
}
```

STEP 9: FRONTEND DISPLAY
------------------------
Location: public/app.html (lines 497-570)

```javascript
function showResults(data) {
    // Check if this result already exists to avoid duplicates
    const existingIndex = allResults.findIndex(result => result.job_id === data.job_id);
    if (existingIndex === -1) {
        allResults.push(data);
    } else {
        allResults[existingIndex] = data;   // Update existing result
    }

    // Show results section alongside progress section
    resultsSection.classList.remove('hidden');

    // Update summary stats
    updateSummaryStats();

    // Create or update accordion item for this file
    createOrUpdateAccordionItem(data);
}
```

BJSS-Specific Display (lines 572-772):

```javascript
function populateBJSSPOSTables(accordionItem, extractedData) {
    // Add document format indicator
    const documentFormat = extractedData?.documentFormat || 'unknown';
    const formatDisplay = documentFormat === 'singapore' ? '🇸🇬 Singapore Format' :
                          documentFormat === 'switzerland' ? '🇨🇭 Switzerland Format' :
                          '📄 Unknown Format';

    // Table headers shared by bond, equity, and cash rows
    const headers = [
        'Type', 'Currency', 'Quantity', 'Name', 'Country',
        'PurchasePrice', 'PurchasePriceFX', 'MarketPrice', 'MarketPriceFX',
        'Value', 'ISIN'
    ];

    // Combine bond, equity, and cash positions into one table
    const allPositions = [];

    // Add bond positions
    if (extractedData.bondPositions && Array.isArray(extractedData.bondPositions)) {
        extractedData.bondPositions.forEach(bond => {
            allPositions.push({ ...bond, Type: 'Bond' });
        });
    }

    // Add equity positions
    if (extractedData.equityPositions && Array.isArray(extractedData.equityPositions)) {
        extractedData.equityPositions.forEach(equity => {
            allPositions.push({ ...equity, Type: 'Equity' });
        });
    }

    // Add cash positions (mapped to the existing table structure)
    if (extractedData.cashPositions && Array.isArray(extractedData.cashPositions)) {
        extractedData.cashPositions.forEach(cash => {
            allPositions.push({
                Type: 'Cash',
                Currency: cash.Currency || '',
                Quantity: cash.Nominal || '',
                Name: cash.Description || '',
                Country: '',
                PurchasePrice: cash.PurchasePrice || '',
                PurchasePriceFX: '',
                MarketPrice: '',
                MarketPriceFX: '',
                Value: cash.ValuationUSD || cash.Nominal || '',
                ISIN: ''
            });
        });
    }
}
```

What happens:
- Results are displayed in an accordion-style interface
- BJSS documents show a format indicator (Singapore/Switzerland)
- A combined table shows bond, equity, and cash positions
- Client information is displayed in a separate section
- Raw OCR text is available for review
- Edit, approve, and save-to-database actions are provided
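The PERFORMANCE CONSIDERATIONS section below notes that progress reaches the browser via Socket.IO, but the wiring between those events and showResults() is not shown in the excerpts. The sketch below is a guess at what that client-side wiring might look like; the 'file-processed' event name and payload are hypothetical, while showResults, processedFiles, totalFiles, and updateProgress are the frontend names used above.

```javascript
// Hypothetical client-side wiring; the actual event name/payload may differ
const socket = io();   // Socket.IO client served by the Express server

socket.on('file-processed', (data) => {
    // Advance the per-batch progress counters initialised in Step 1
    processedFiles = Math.min(processedFiles + 1, totalFiles);
    updateProgress();

    // Render (or update) the accordion entry for this file
    showResults(data);
});
```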
STEP 10: DATABASE INTEGRATION
-----------------------------
Location: public/app.html (lines 1485-1628)

```javascript
async function addToDatabase(jobId) {
    // Get the data for this job
    const result = allResults.find(r => r.job_id === jobId);
    if (!result) {
        showNotification('Data not found for this job', 'error');
        return;
    }

    // Initialize API client with external API configuration
    const apiClient = new APIClient();

    // NOTE: dataToUse is resolved earlier in the full function (not shown in this excerpt)
    // Detect product type and configure API accordingly
    const documentType = dataToUse?.documentType || result.result?.extracted_data?.documentType || 'Unknown';
    const bankType = result.bank_type || 'Unknown';

    // Determine if this is a Structure or Accumulator product
    const isStructureProduct = ['swissquote', 'barclays', 'efg'].includes(bankType.toLowerCase()) ||
                               documentType === 'STRUCTURE';

    // Configure API client
    apiClient.configure({
        baseURL: window.location.origin,    // Use local server
        apiKey: 'Xj1Wboo9Bd0Ru6qsRhYndwsb', // API key
        endpoints: {
            saveData: isStructureProduct ? '/api/proxy/structure' : '/api/proxy/accumulator'
        },
        headers: {
            'Content-Type': 'application/json'
        },
        timeout: 30000                      // 30-second timeout
    });

    // Prepare data for database
    const dataToSave = {
        job_id: jobId,
        filename: result.filename,
        document_type: dataToUse?.documentType || result.result?.extracted_data?.documentType || 'Unknown',
        bank_type: result.bank_type || 'Unknown',
        parent_data: dataToUse?.parent || result.result?.extracted_data?.parent || {},
        child_data: dataToUse?.child || result.result?.extracted_data?.child || [],
        metadata: {
            processing_time: result.processing_time,
            extraction_timestamp: new Date().toISOString(),
            user_id: 'anonymous'
        }
    };

    // Make API call to save data
    const response = isStructureProduct ?
        await apiClient.saveStructureToDatabase(dataToSave) :
        await apiClient.saveToDatabase(dataToSave);
}
```

What happens:
- The user can save extracted data to an external database
- BJSS data is treated as an accumulator product (not a structure product)
- Data is sent via proxy endpoints to handle CORS
- Success/failure feedback is provided to the user
- Duplicate detection and validation are handled

ERROR HANDLING
==============
1. File Upload Errors:
   - Invalid file types rejected
   - File size limits enforced
   - Maximum file count limits
2. OCR Processing Errors:
   - PDF parsing failures
   - Image conversion errors
   - Tesseract processing failures
3. Bank Detection Errors:
   - Unknown bank types return 'not_found'
   - Fallback to filename pattern matching
4. Data Extraction Errors:
   - Missing sections handled gracefully
   - Invalid data patterns logged
   - Partial extraction results returned
5. Display Errors:
   - Missing data fields shown as "Not Found"
   - Error states clearly indicated
   - User feedback for all operations

PERFORMANCE CONSIDERATIONS
==========================
1. File Processing:
   - Synchronous processing for immediate feedback
   - File cleanup after processing
   - Memory management for large files
2. OCR Optimization:
   - Direct PDF text extraction preferred
   - Image conversion only when necessary
   - Progress tracking for long operations
3. Data Extraction:
   - Configuration-based pattern matching
   - Efficient line-by-line processing
   - ISIN deduplication to prevent cross-contamination
4. Frontend Performance:
   - Real-time progress updates via Socket.IO
   - Accordion interface for multiple files
   - Efficient DOM updates

CONFIGURATION
=============
The BJSS extractor uses configuration files to define:
- Document format patterns (Singapore vs Switzerland)
- Section headers and boundaries
- Data extraction patterns
- Exclusion patterns
- Field mappings

This allows formats to be updated without code changes.

SECURITY CONSIDERATIONS
=======================
1. File Upload Security:
   - File type validation
   - Size limits
   - Temporary file cleanup
2. API Security:
   - CORS handling via proxy
   - API key management
   - Input validation
3. Data Privacy:
   - Temporary file storage
   - Automatic cleanup
   - No persistent storage of sensitive data

This comprehensive workflow ensures reliable extraction of BJSS portfolio data while maintaining security and performance standards.