BJSS OCR WORKFLOW - DETAILED STEP-BY-STEP PROCESS
=====================================================

This document provides a comprehensive overview of how the BJSS (Bank J. Safra Sarasin) OCR system works, from file upload to data extraction and display.

OVERVIEW
--------
The BJSS OCR system processes investment portfolio documents (POS - Point of Sale statements) to extract bond, equity, and cash position data. It supports both Singapore and Switzerland document formats.

ARCHITECTURE COMPONENTS
-----------------------
1. Frontend (app.html) - User interface for file upload and results display
2. Backend Server (server.js) - Express.js server handling file uploads and processing
3. OCR Processor (ocr-processor.js) - Handles PDF/image to text conversion
4. Data Extractor (data-extractor.js) - Routes to appropriate bank-specific extractors
5. BJSS Extractor (bjss-pos-extraction.js) - Specialized BJSS document processing
6. Configuration (bjss-pos-extraction-config.js) - Document format patterns and rules

DETAILED WORKFLOW
=================

STEP 1: FILE UPLOAD (Frontend)
------------------------------
Location: public/app.html (lines 331-381)

```javascript
async function handleUpload() {
    if (selectedFiles.length === 0) return;

    const formData = new FormData();
    selectedFiles.forEach(file => {
        formData.append('documents', file);
    });

    // Initialize progress tracking
    totalFiles = selectedFiles.length;
    processedFiles = 0;
    showProgress();
    updateProgress();

    // Clear previous results
    allResults = [];

    const response = await fetch('/api/upload-bulk', {
        method: 'POST',
        body: formData
    });
}
```

What happens:
- User selects files via drag-and-drop or file picker
- Files are validated (PDF, PNG, JPG, JPEG, max 50MB each, max 20 files)
- FormData is created with all selected files
- Progress tracking is initialized
- POST request sent to /api/upload-bulk endpoint

STEP 2: SERVER FILE RECEIPT (Backend)
-------------------------------------
Location: server.js (lines 89-222)

```javascript
app.post('/api/upload-bulk', upload.array('documents', 20), async (req, res) => {
    try {
        if (!req.files || req.files.length === 0) {
            return res.status(400).json({
                success: false,
                error: 'No files uploaded'
            });
        }

        const jobs = [];
        const batchId = Date.now().toString();

        for (const file of req.files) {
            // Save file to disk manually
            const uniqueSuffix = Date.now() + '-' + Math.round(Math.random() * 1E9);
            const extension = path.extname(file.originalname);
            const filename = file.fieldname + '-' + uniqueSuffix + extension;
            const filePath = path.join(uploadsDir, filename);

            // Write file to disk
            fs.writeFileSync(filePath, file.buffer);

            const jobData = {
                file_path: filePath,
                original_name: file.originalname,
                file_size: file.size,
                user_id: req.body.user_id || 'anonymous',
                job_id: Date.now().toString() + Math.random().toString(36).substr(2, 9),
                batch_id: batchId,
                timestamp: new Date().toISOString()
            };

            // Process file immediately (synchronous processing)
            // ... processing logic continues
        }
    }
});
```

What happens:
- Multer middleware processes uploaded files (stored in memory)
- Files are saved to disk in uploads/ directory with unique names
- Job data structure is created for each file
- Processing begins immediately (synchronous, not queued)
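The route above relies on a Multer instance (`upload`) configured for in-memory storage, since the handler writes `file.buffer` to disk itself. The exact Multer setup is not reproduced in this document; the following is a minimal sketch, assuming memory storage and the 50MB / PDF-PNG-JPG limits described in Step 1 are also enforced server-side. The names `upload` and `uploadsDir` match the handler above; everything else is illustrative.

```javascript
const multer = require('multer');
const path = require('path');
const fs = require('fs');

// Directory the route handler writes uploaded files into
const uploadsDir = path.join(__dirname, 'uploads');
if (!fs.existsSync(uploadsDir)) fs.mkdirSync(uploadsDir, { recursive: true });

// Keep uploads in memory; the route handler persists buffers to disk itself
const upload = multer({
    storage: multer.memoryStorage(),
    limits: { fileSize: 50 * 1024 * 1024 },   // 50MB per file
    fileFilter: (req, file, cb) => {
        const allowed = ['.pdf', '.png', '.jpg', '.jpeg'];
        if (allowed.includes(path.extname(file.originalname).toLowerCase())) {
            cb(null, true);                   // accept file
        } else {
            cb(new Error('Unsupported file type'));
        }
    }
});
```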
STEP 3: OCR PROCESSING
----------------------
Location: ocr-processor.js (lines 19-39)

```javascript
async processFile(filePath, originalName) {
    try {
        console.log(`🔍 Starting OCR processing for: ${originalName}`);
        const fileExtension = path.extname(filePath).toLowerCase();
        let ocrText = '';

        if (fileExtension === '.pdf') {
            ocrText = await this.processPDF(filePath, originalName);
        } else {
            ocrText = await this.processImage(filePath, originalName);
        }

        console.log(`✅ OCR completed for ${originalName}: ${ocrText.length} characters extracted`);
        return ocrText;
    } catch (error) {
        console.error(`❌ OCR processing failed for ${originalName}:`, error);
        throw new Error(`OCR processing failed: ${error.message}`);
    }
}
```

PDF Processing (lines 41-64):

```javascript
async processPDF(filePath, originalName) {
    try {
        console.log(`📄 Processing PDF: ${originalName}`);

        // First, try direct text extraction from PDF
        const pdfBuffer = fs.readFileSync(filePath);
        const data = await pdfParse(pdfBuffer);
        let ocrText = data.text.trim();

        console.log(`📊 PDF text extraction: ${ocrText.length} characters`);

        // If text extraction is limited, it's likely an image-based PDF
        if (ocrText.length < 100) {
            console.log(`⚠️ PDF appears to be image-based (${ocrText.length} chars) - converting to images for OCR`);
            ocrText = await this.processImageBasedPDF(filePath, originalName);
        }

        return ocrText;
    }
}
```

Image Processing (lines 119-137):

```javascript
async processImage(filePath, originalName) {
    try {
        console.log(`🖼️ Processing image: ${originalName}`);

        const { data: { text } } = await Tesseract.recognize(filePath, 'eng', {
            logger: m => {
                if (m.status === 'recognizing text') {
                    console.log(`📊 OCR Progress for ${originalName}: ${Math.round(m.progress * 100)}%`);
                }
            }
        });

        return text.trim();
    }
}
```

What happens:
- File extension determines processing method (PDF vs image)
- PDFs: direct text extraction is attempted first, with a fallback to image conversion if needed
- Images: direct Tesseract OCR processing
- Progress is logged and tracked
- Raw text is returned for further processing

STEP 4: BANK TYPE DETECTION
---------------------------
Location: ocr-processor.js (lines 141-241)

```javascript
detectBankType(text, filename = '') {
    const textLower = text.toLowerCase();
    const filenameLower = filename.toLowerCase();

    console.log(`🔍 Detecting bank type for: ${filename}`);
    console.log(`📄 Text length: ${text.length}`);

    // First try to detect from filename (order matters - more specific first)
    if (filenameLower.includes('bjss') ||
        filenameLower.includes('safra') ||
        filenameLower.includes('sarasin') ||
        filenameLower.includes('investment') ||
        filenameLower.includes('portfolio') ||
        filenameLower.includes('statement') ||
        filenameLower.includes('pos') ||
        filenameLower.includes('positions')) {
        console.log(`✅ Detected from filename: BJSS`);
        return 'bjss';
    }

    // Then try to detect from text content
    if (textLower.includes('investment report') ||
        textLower.includes('bank j. safra sarasin') ||
        textLower.includes('safra sarasin') ||
        textLower.includes('2. bonds') ||
        textLower.includes('3. equities') ||
        textLower.includes('2. equities') ||
        (textLower.includes('2. bonds') && textLower.includes('3. equities')) ||
        (textLower.includes('total bonds') && textLower.includes('total equities')) ||
        (textLower.includes('reference currency') && textLower.includes('client number'))) {
        console.log(`✅ Detected from text: BJSS`);
        return 'bjss';
    }

    // ... other bank detection logic

    console.log(`⚠️ No specific bank detected - defaulting to NOT FOUND`);
    return 'not_found';
}
```

What happens:
- Analyzes filename patterns first (more reliable)
- Then analyzes text content for bank-specific keywords
- BJSS detection looks for: "investment report", "safra sarasin", "2. bonds", "3. equities", etc.
- Returns a bank type string for routing to the appropriate extractor
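To make the hand-off between Steps 3 and 4 concrete, here is a minimal sketch of how the server side might drive the OCR processor for one uploaded file. It only uses the methods shown above; the `OCRProcessor` require/instantiation details are assumptions about the module's export shape, not confirmed by the excerpts.

```javascript
const OCRProcessor = require('./ocr-processor');   // assumed export; adjust to the actual module shape

async function runOcrAndDetect(filePath, originalName) {
    const processor = new OCRProcessor();

    // Step 3: PDF or image -> raw OCR text
    const ocrText = await processor.processFile(filePath, originalName);

    // Step 4: raw text + filename -> bank type string ('bjss', ..., or 'not_found')
    const bankType = processor.detectBankType(ocrText, originalName);

    return { ocrText, bankType };
}
```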
STEP 5: DATA EXTRACTION ROUTING
-------------------------------
Location: data-extractor.js (lines 25-104)

```javascript
async extractData(text, bankType, filename = '') {
    try {
        console.log(`🎯 Starting data extraction for: ${filename}`);
        console.log(`🏦 Bank type: ${bankType}`);
        console.log(`📄 Text length: ${text.length}`);

        if (!this.extractors[bankType]) {
            if (bankType === 'not_found') {
                console.log(`⚠️ Bank type not detected - returning not found result`);
                return {
                    success: false,
                    bank_type: 'not_found',
                    filename: filename,
                    error: 'Bank type not detected - document may not be supported',
                    text_length: text.length,
                    timestamp: new Date().toISOString()
                };
            }
            throw new Error(`Unsupported bank type: ${bankType}`);
        }

        const ExtractorClass = this.extractors[bankType];
        const extractor = new ExtractorClass();

        // Use the extractFromText method
        console.log(`🔍 About to call extractFromText for ${bankType}`);
        const result = extractor.extractFromText(text);
        console.log(`🔍 extractFromText returned:`, !!result);

        if (result) {
            console.log(`✅ Data extraction successful for ${bankType}`);
            return {
                success: true,
                bank_type: bankType,
                filename: filename,
                extracted_data: result,
                text_length: text.length,
                timestamp: new Date().toISOString()
            };
        }
    }
}
```

What happens:
- Routes to the appropriate extractor based on the detected bank type
- For BJSS: uses the BJSSPOSDocumentExtractor class
- Calls the extractFromText method on the extractor
- Returns a structured result with success status and extracted data
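The routing above depends on a `this.extractors` lookup table that maps bank type strings to extractor classes. Its construction is not shown in the excerpt; the following is a minimal sketch of what the constructor might look like, assuming only the BJSS mapping described in this document (the require path and class export name are assumptions).

```javascript
const BJSSPOSDocumentExtractor = require('./bjss-pos-extraction');   // assumed export name

class DataExtractor {
    constructor() {
        // Map detected bank type strings (Step 4) to extractor classes
        this.extractors = {
            bjss: BJSSPOSDocumentExtractor
            // other supported banks would be registered here
        };
    }

    // async extractData(text, bankType, filename) as shown above
}
```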
STEP 6: BJSS DOCUMENT PROCESSING
--------------------------------
Location: bjss-pos-extraction.js (lines 172-200)

```javascript
extractFromFile(filePath) {
    try {
        // Clear previous log file
        if (fs.existsSync(logFile)) {
            fs.unlinkSync(logFile);
        }

        const content = fs.readFileSync(filePath, 'utf8');
        const lines = content.split('\n');

        logToFile('🔍 BJSS POS Extraction Started');
        logToFile(`📄 File path: ${filePath}`);
        logToFile(`📊 Total lines: ${lines.length}`);

        // Detect document format first
        this.documentFormat = this.detectDocumentFormat(lines);
        logToFile(`📋 Detected document format: ${this.documentFormat}`);

        // Extract client information
        this.extractClientInfo(lines);

        // Initialize ISIN tracking to prevent cross-contamination
        this.usedISINs = new Set();

        // Extract positions using separate functions
        this.extractedData.cashPositions = this.extractCashPositions(lines);
        this.extractedData.bondPositions = this.extractBondPositions(lines);
        this.extractedData.equityPositions = this.extractEquityPositions(lines);
        this.extractedData.otherAssetPositions = [];

        // ... rest of processing
    }
}
```

Document Format Detection (lines 70-164):

```javascript
detectDocumentFormat(lines) {
    let singaporeScore = 0;
    let switzerlandScore = 0;

    // Check for bank identifiers
    for (const line of lines) {
        if (line.includes(this.config.documentFormats.singapore.identifier)) {
            singaporeScore += 3;   // High weight for bank identifier
        }
        if (line.includes(this.config.documentFormats.switzerland.identifier)) {
            switzerlandScore += 3; // High weight for bank identifier
        }
    }

    // Check for format-specific header fields
    for (const line of lines) {
        for (const field of this.config.documentFormats.singapore.headerFields) {
            if (line.includes(field)) {
                singaporeScore += 1;
            }
        }
        for (const field of this.config.documentFormats.switzerland.headerFields) {
            if (line.includes(field)) {
                switzerlandScore += 1;
            }
        }
    }

    // Analyze bond section structure density
    // Singapore format: 20-25 lines per bond (verbose)
    // Switzerland format: 9-12 lines per bond (compact)

    // Return detected format
    if (switzerlandScore > singaporeScore) {
        return 'switzerland';
    } else {
        return 'singapore';   // Default to Singapore if scores are equal
    }
}
```

What happens:
- Splits OCR text into lines for processing
- Detects document format (Singapore vs Switzerland) using a scoring system
- Extracts client information (name, number, portfolio, currency, dates)
- Processes three main sections: Cash, Bonds, and Equities
- Uses configuration-based patterns for each format
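The format detection above reads bank identifiers and header fields from the configuration module (see the CONFIGURATION section near the end of this document). The actual contents of bjss-pos-extraction-config.js are not reproduced here; the sketch below only shows the shape the code relies on, and the string values are illustrative placeholders rather than the real configuration.

```javascript
// bjss-pos-extraction-config.js - shape assumed by detectDocumentFormat();
// identifier and headerFields values below are hypothetical placeholders
module.exports = {
    documentFormats: {
        singapore: {
            identifier: 'Bank J. Safra Sarasin Ltd, Singapore Branch',   // hypothetical
            headerFields: ['Client Number', 'Portfolio Number', 'Reference Currency']
        },
        switzerland: {
            identifier: 'Bank J. Safra Sarasin Ltd',                     // hypothetical
            headerFields: ['Client number', 'Reference currency', 'Reference date']
        }
    }
};
```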
STEP 7: POSITION DATA EXTRACTION
--------------------------------
The BJSS extractor processes three types of positions:

A. Cash Positions Extraction:

```javascript
extractCashPositions(lines) {
    // Looks for the "1.2 Positions" section
    // Extracts: Description, Account, Currency, Nominal, Purchase Price,
    //           Valuation USD, Accrued Interest, Proportion
    // Handles both Singapore and Switzerland formats
}
```

B. Bond Positions Extraction:

```javascript
extractBondPositions(lines) {
    // Looks for the "2.2 Positions" section
    // Extracts: Currency, Quantity, Name, Country, Purchase Price, Purchase Price FX,
    //           Market Price, Market Price FX, Value, ISIN
    // Uses percentage patterns and company name detection
    // Prevents duplicate ISINs across sections
}
```

C. Equity Positions Extraction:

```javascript
extractEquityPositions(lines) {
    // Looks for the "3.2 Positions" section
    // Extracts: Currency, Quantity, Name, Country, Purchase Price, Purchase Price FX,
    //           Market Price, Market Price FX, Value, ISIN, DividendYield
    // Similar to bonds but includes dividend yield information
}
```
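The production functions are driven by the configuration patterns and handle both document formats. The following is only a simplified sketch of the shared section-scanning idea (locate the section header, read lines until the next numbered section, pull fields out with regular expressions). The section markers, regexes, and field handling here are illustrative, not the real extraction logic.

```javascript
// Simplified illustration of section-based scanning; not the production extractor
function extractSectionPositions(lines, sectionHeader, nextSectionPrefix) {
    const positions = [];
    let inSection = false;

    for (const rawLine of lines) {
        const line = rawLine.trim();

        if (line.startsWith(sectionHeader)) {        // e.g. '2.2 Positions'
            inSection = true;
            continue;
        }
        if (inSection && nextSectionPrefix && line.startsWith(nextSectionPrefix)) {
            break;                                   // stop at the next top-level section
        }
        if (!inSection) continue;

        // Illustrative field patterns: an ISIN plus a trailing value on the same line
        const isinMatch = line.match(/\b[A-Z]{2}[A-Z0-9]{9}\d\b/);
        const valueMatch = line.match(/([\d']+\.\d{2})\s*$/);

        if (isinMatch) {
            positions.push({
                ISIN: isinMatch[0],
                Value: valueMatch ? valueMatch[1] : ''
            });
        }
    }
    return positions;
}

// Illustrative call: extractSectionPositions(lines, '2.2 Positions', '3.') for the bond section
```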
STEP 8: RESULT STRUCTURE
------------------------
The extracted data is structured as:

```javascript
{
    success: true,
    bank_type: 'bjss',
    filename: 'document.pdf',
    extracted_data: {
        documentType: 'BJSS_POS',
        documentFormat: 'singapore' | 'switzerland',
        clientInfo: {
            clientName: 'Client Name',
            clientNumber: '12345',
            portfolioNumber: 'P67890',
            referenceCurrency: 'USD',
            referenceDate: '2024-01-01',
            creationDate: '2024-01-02'
        },
        bondPositions: [
            {
                Currency: 'USD',
                Quantity: '1000',
                Name: 'Company Bond',
                Country: 'US',
                PurchasePrice: '100.50',
                PurchasePriceFX: '1.00',
                MarketPrice: '101.25',
                MarketPriceFX: '1.00',
                Value: '1012500',
                ISIN: 'US1234567890'
            }
        ],
        equityPositions: [
            {
                Currency: 'USD',
                Quantity: '500',
                Name: 'Company Stock',
                Country: 'US',
                PurchasePrice: '50.00',
                PurchasePriceFX: '1.00',
                MarketPrice: '52.50',
                MarketPriceFX: '1.00',
                Value: '262500',
                ISIN: 'US9876543210',
                DividendYield: '2.5%'
            }
        ],
        cashPositions: [
            {
                Description: 'Current account USD',
                Account: '83.61738.0 4000',
                Currency: 'USD',
                Nominal: '100000',
                PurchasePrice: '1.00',
                ValuationUSD: '100000',
                AccruedInterest: '0',
                Proportion: '5.2%'
            }
        ]
    },
    text_length: 15420,
    timestamp: '2024-01-15T10:30:00.000Z'
}
```

STEP 9: FRONTEND DISPLAY
------------------------
Location: public/app.html (lines 497-570)

```javascript
function showResults(data) {
    // Check if this result already exists to avoid duplicates
    const existingIndex = allResults.findIndex(result => result.job_id === data.job_id);
    if (existingIndex === -1) {
        allResults.push(data);
    } else {
        allResults[existingIndex] = data;   // Update existing result
    }

    // Show results section alongside progress section
    resultsSection.classList.remove('hidden');

    // Update summary stats
    updateSummaryStats();

    // Create or update accordion item for this file
    createOrUpdateAccordionItem(data);
}
```

BJSS-Specific Display (lines 572-772):

```javascript
function populateBJSSPOSTables(accordionItem, extractedData) {
    // Add document format indicator
    const documentFormat = extractedData?.documentFormat || 'unknown';
    const formatDisplay = documentFormat === 'singapore' ? '🇸🇬 Singapore Format' :
                          documentFormat === 'switzerland' ? '🇨🇭 Switzerland Format' :
                          '📄 Unknown Format';

    // Table headers shared by bond, equity, and cash rows
    const headers = [
        'Type', 'Currency', 'Quantity', 'Name', 'Country',
        'PurchasePrice', 'PurchasePriceFX', 'MarketPrice', 'MarketPriceFX',
        'Value', 'ISIN'
    ];

    // Combine bond, equity, and cash positions into one table
    const allPositions = [];

    // Add bond positions
    if (extractedData.bondPositions && Array.isArray(extractedData.bondPositions)) {
        extractedData.bondPositions.forEach(bond => {
            allPositions.push({ ...bond, Type: 'Bond' });
        });
    }

    // Add equity positions
    if (extractedData.equityPositions && Array.isArray(extractedData.equityPositions)) {
        extractedData.equityPositions.forEach(equity => {
            allPositions.push({ ...equity, Type: 'Equity' });
        });
    }

    // Add cash positions (mapped to the existing table structure)
    if (extractedData.cashPositions && Array.isArray(extractedData.cashPositions)) {
        extractedData.cashPositions.forEach(cash => {
            allPositions.push({
                Type: 'Cash',
                Currency: cash.Currency || '',
                Quantity: cash.Nominal || '',
                Name: cash.Description || '',
                Country: '',
                PurchasePrice: cash.PurchasePrice || '',
                PurchasePriceFX: '',
                MarketPrice: '',
                MarketPriceFX: '',
                Value: cash.ValuationUSD || cash.Nominal || '',
                ISIN: ''
            });
        });
    }
}
```

What happens:
- Results are displayed in an accordion-style interface
- BJSS documents show a format indicator (Singapore/Switzerland)
- A combined table shows bond, equity, and cash positions
- Client information is displayed in a separate section
- Raw OCR text is available for review
- Edit, approve, and save-to-database actions are provided
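The PERFORMANCE CONSIDERATIONS section below notes that progress reaches the browser via Socket.IO, but the wiring between those events and showResults() is not shown in the excerpts. The sketch below is a guess at what that client-side wiring might look like; the 'file-processed' event name and payload are hypothetical, while showResults, processedFiles, totalFiles, and updateProgress are the frontend names used above.

```javascript
// Hypothetical client-side wiring; the actual event name/payload may differ
const socket = io();   // Socket.IO client served by the Express server

socket.on('file-processed', (data) => {
    // Advance the per-batch progress counters initialised in Step 1
    processedFiles = Math.min(processedFiles + 1, totalFiles);
    updateProgress();

    // Render (or update) the accordion entry for this file
    showResults(data);
});
```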
STEP 10: DATABASE INTEGRATION
-----------------------------
Location: public/app.html (lines 1485-1628)

```javascript
async function addToDatabase(jobId) {
    // Get the data for this job
    const result = allResults.find(r => r.job_id === jobId);
    if (!result) {
        showNotification('Data not found for this job', 'error');
        return;
    }

    // Initialize API client with external API configuration
    const apiClient = new APIClient();

    // NOTE: dataToUse is resolved earlier in the full function (not shown in this excerpt)
    // Detect product type and configure API accordingly
    const documentType = dataToUse?.documentType || result.result?.extracted_data?.documentType || 'Unknown';
    const bankType = result.bank_type || 'Unknown';

    // Determine if this is a Structure or Accumulator product
    const isStructureProduct = ['swissquote', 'barclays', 'efg'].includes(bankType.toLowerCase()) ||
                               documentType === 'STRUCTURE';

    // Configure API client
    apiClient.configure({
        baseURL: window.location.origin,    // Use local server
        apiKey: 'Xj1Wboo9Bd0Ru6qsRhYndwsb', // API key
        endpoints: {
            saveData: isStructureProduct ? '/api/proxy/structure' : '/api/proxy/accumulator'
        },
        headers: {
            'Content-Type': 'application/json'
        },
        timeout: 30000                      // 30-second timeout
    });

    // Prepare data for database
    const dataToSave = {
        job_id: jobId,
        filename: result.filename,
        document_type: dataToUse?.documentType || result.result?.extracted_data?.documentType || 'Unknown',
        bank_type: result.bank_type || 'Unknown',
        parent_data: dataToUse?.parent || result.result?.extracted_data?.parent || {},
        child_data: dataToUse?.child || result.result?.extracted_data?.child || [],
        metadata: {
            processing_time: result.processing_time,
            extraction_timestamp: new Date().toISOString(),
            user_id: 'anonymous'
        }
    };

    // Make API call to save data
    const response = isStructureProduct ?
        await apiClient.saveStructureToDatabase(dataToSave) :
        await apiClient.saveToDatabase(dataToSave);
}
```

What happens:
- The user can save extracted data to an external database
- BJSS data is treated as an accumulator product (not a structure product)
- Data is sent via proxy endpoints to handle CORS
- Success/failure feedback is provided to the user
- Duplicate detection and validation are handled

ERROR HANDLING
==============
1. File Upload Errors:
   - Invalid file types rejected
   - File size limits enforced
   - Maximum file count limits
2. OCR Processing Errors:
   - PDF parsing failures
   - Image conversion errors
   - Tesseract processing failures
3. Bank Detection Errors:
   - Unknown bank types return 'not_found'
   - Fallback to filename pattern matching
4. Data Extraction Errors:
   - Missing sections handled gracefully
   - Invalid data patterns logged
   - Partial extraction results returned
5. Display Errors:
   - Missing data fields shown as "Not Found"
   - Error states clearly indicated
   - User feedback for all operations

PERFORMANCE CONSIDERATIONS
==========================
1. File Processing:
   - Synchronous processing for immediate feedback
   - File cleanup after processing
   - Memory management for large files
2. OCR Optimization:
   - Direct PDF text extraction preferred
   - Image conversion only when necessary
   - Progress tracking for long operations
3. Data Extraction:
   - Configuration-based pattern matching
   - Efficient line-by-line processing
   - ISIN deduplication to prevent cross-contamination
4. Frontend Performance:
   - Real-time progress updates via Socket.IO
   - Accordion interface for multiple files
   - Efficient DOM updates

CONFIGURATION
=============
The BJSS extractor uses configuration files to define:
- Document format patterns (Singapore vs Switzerland)
- Section headers and boundaries
- Data extraction patterns
- Exclusion patterns
- Field mappings

This allows formats to be updated without code changes.

SECURITY CONSIDERATIONS
=======================
1. File Upload Security:
   - File type validation
   - Size limits
   - Temporary file cleanup
2. API Security:
   - CORS handling via proxy
   - API key management
   - Input validation
3. Data Privacy:
   - Temporary file storage
   - Automatic cleanup
   - No persistent storage of sensitive data

This comprehensive workflow ensures reliable extraction of BJSS portfolio data while maintaining security and performance standards.