System Design: Building a Payment Reconciliation Engine

Payments look simple from the product surface. A customer click-triggers a pay button, the user interface displays a success confirmation, and the order moves forward. Under the hood, however, the movement of money and the transition of system states do not occur in a single atomic database transaction.

A payment service provider (PSP) authorizes a transaction immediately, but the actual settlement of funds to your merchant bank account occurs days later. Webhooks confirming payouts arrive out of order, refunds partially fail, and chargebacks are initiated by cardholders weeks after the transaction. If your internal ledger states that you captured 100,000 dollars while your payment gateway reports deposits of only 98,000 dollars, you face a major financial discrepancy.

That gap is where payment reconciliation lives. Without a robust reconciliation engine, finance teams lose operational visibility, support agents cannot explain missing customer funds, settlement mismatches accumulate, and month-end closes degrade into emergency war rooms. If reconciliation is strong, companies can identify money leakages, trace unpaid gateway fees, verify refund processing correctness, and prove transaction audit compliance.

This guide designs a production payment reconciliation engine capable of processing 50,000,000 transaction events daily.

Requirements and System Goals

A reconciliation engine acts as a ledger-level audit and correctness system. It must ingest disparate, asynchronous feeds of data, normalize them into a single schema, and execute deterministic matching algorithms.

Functional Requirements

Multi-Source Ingestion: Ingest records from internal database ledgers, payment gateways (Stripe, Adyen, PayPal), bank statement files (BAI2, MT940), and dispute registries. The ingestion layer must support both push-based mechanisms (such as listening to Kafka event streams and webhook subscriptions) and pull-based mechanisms (such as scheduled cron jobs fetching files from remote SFTP directories or querying gateway REST endpoints).
Canonical Normalization: Standardize data types, timestamps, currencies, fee directions, and transaction references across all sources.
Priority-Ladder Matching: Match transactions using a multi-key ladder (e.g., Transaction ID, Payment reference, Order hash, and Amount-window).
Flexible Cardinality Matching: Support one-to-one (single card capture), one-to-many (split orders), and many-to-one (aggregated bank deposits containing thousands of individual transactions) matching.
Mismatches Classification: Automatically tag unmatched records with reasons (e.g., timing_delay, fee_mismatch, missing_internal_ledger).
Case Management Workflow: Generate cases for operational teams to manually investigate and resolve mismatches.
Durable Audit Trail: Record all manual and automated reconciliation actions.

Non-Functional Requirements

Data Lineage & Traceability: Maintain the link from normalized record back to the raw payload hash and source file ID.
Idempotency: Ensure that reprocessing a file or event never creates duplicate transactions or false cases.
Scalability: Process over 50,000,000 transaction events daily.
Correctness Over Latency: Prioritize accuracy and transaction safety over sub-second latency.
Multi-Tenant Separation: Keep configuration rules, tolerance boundaries, and data access isolated between different business divisions or merchant accounts.

API Interfaces and Service Contracts

We expose REST APIs for uploading batch files and querying cases. Internal ledger databases sync events via gRPC.

Upload Source Settlement File

Endpoint: POST /v1/reconciliation/files
Headers:
- Content-Type: multipart/form-data
Request Payload:
- file: (binary settlement CSV)
- sourceSystem: "stripe_settlement_us"
- fileDate: "2026-06-06"
Response Payload (HTTP 202 Accepted):

{
  "fileId": "file_8871-c002-4b21",
  "status": "PROCESSING",
  "rowCount": 12500,
  "sha256Hash": "f8a91b92cd40a993e1104e76839bb2f8a91b92cd40a993e1104e76839bb2a1"
}

Propose Manual Case Resolution

Endpoint: POST /v1/reconciliation/cases/case_9981-b21a-4c22/resolve
Request Payload:

{
  "resolutionType": "WRITE_OFF_FEE_DIFFERENCE",
  "notes": "Stripe fee rounding difference of $0.02 written off.",
  "ledgerJournalId": "jr_0192a-881c"
}

Response Payload (HTTP 200 OK):

{
  "caseId": "case_9981-b21a-4c22",
  "status": "RESOLVED",
  "resolvedBy": "usr_finance_mgr_12",
  "resolvedAt": "2026-06-07T10:25:00Z"
}

Internal gRPC Ingestion Contract

For real-time event streaming from ledger services:

syntax = "proto3";

package codesprintpro.reconciliation.v1;

service ReconciliationIngestService {
  rpc IngestLedgerEvent (LedgerEventRequest) returns (IngestResponse);
}

message LedgerEventRequest {
  string tenant_id = 1;
  string entry_id = 2;
  string transaction_ref = 3;
  string account_code = 4;
  string direction = 5;       // "DEBIT" or "CREDIT"
  int64 amount_cents = 6;
  string currency = 7;
  int64 posted_at_unix = 8;
}

message IngestResponse {
  string record_id = 1;
  bool is_duplicate = 2;
}

High-Level Design and Visualizations

Our reconciliation architecture uses a 5-stage pipeline, separating ingestion from matching and audit logging.

The engine divides operations into distinct logical boundaries. Raw files and real-time events enter the Ingestion Service, where they are cryptographically hashed and deduplicated. These payloads are stored in the Raw Event Store.

The Normalization Engine parses these raw events, transforming them into a standard canonical schema. The Deterministic Matcher Engine then pulls unmatched records from the canonical store, applying priority-ladder rules defined in the Rules Registry. Successfully matched items are marked reconciled, while anomalies flow directly to the Case Management Store for manual review.

Five-Stage Reconciliation Processing Pipeline

This pipeline maps the flow of data from ingestion to raw storage, normalization, matching, and case resolution.

flowchart TD
    subgraph DataSources [Sources]
        Ledger[Ledger Events Kafka]
        PSPFile[Stripe CSV via SFTP]
        BankFile[Bank BAI2 File via API]
    end

    subgraph Stage1_2 [1. Ingestion & 2. Raw Store]
        Ingester[Ingestion Service] -->|Check Hash & Deduplicate| Deduplicator[Deduplication Engine]
        Deduplicator -->|Save Raw Payload| RawDB[(Raw Event Store - PostgreSQL)]
    end
    
    DataSources --> Ingester

    subgraph Stage3_4 [3. Normalize & 4. Matcher]
        Parser[Normalization Engine] -->|Read Raw Payloads| RawDB
        Parser -->|Normalize to Common Schema| CanonDB[(Canonical Records Store)]
        
        Matcher[Deterministic Matcher Engine] -->|Scan Unmatched| CanonDB
        Matcher -->|Apply Priority-Ladder Rules| RuleEngine[Rules Registry]
    end

    subgraph Stage5 [5. Case Management & Reports]
        Matcher -->|Create Mismatches| CaseDB[(Cases PostgreSQL)]
        CaseDB -->|Alert Finance| Slack[Finance Slack / PagerDuty]
        CaseDB -->|Expose UI| Dashboard[Ops Dashboard / case management]
    end

Aggregated Bank Settlement Matching Bucket Flow

This diagram illustrates how the engine matches aggregated bank deposits containing thousands of individual transactions.

flowchart TD
    Bank[Bank BAI2 File] -->|Ingest net credit: $99,200.00| BankDeposit[Bank Deposit Bucket]
    
    subgraph BatchAggregation [Reconciliation Group Engine]
        PSP[Stripe Settlement File] -->|Inbound records| PSPEvents[10,000 Transactions]
        PSPEvents -->|Sum Gross: $100,000.00| SumGross[Sum Gross]
        PSPEvents -->|Sum Fees: -$750.00| SumFees[Sum Fees]
        PSPEvents -->|Sum Refunds: -$50.00| SumRefunds[Sum Refunds]
        
        SumGross & SumFees & SumRefunds --> CalcNet[Calculate Net: $99,200.00]
    end
    
    CalcNet --> Compare{Do Net Amounts Match?}
    BankDeposit --> Compare
    
    Compare -->|Yes| Match[Mark Batch and all 10,000 records MATCHED]
    Compare -->|No| Case[Open Aggregate Case: SETTLEMENT_AMOUNT_MISMATCH]

Low-Level Design and Schema Strategies

We use a PostgreSQL database to manage reconciliation records. We maintain separate tables for normalized records, raw events, cases, and files.

PostgreSQL Table DDLs

-- Normalized representation of all source records
CREATE TABLE reconciliation_records (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id VARCHAR(64) NOT NULL,
    source_system VARCHAR(64) NOT NULL,          -- 'internal_payment', 'ledger', 'stripe', 'bank_statement'
    source_type VARCHAR(32) NOT NULL,            -- 'EVENT', 'FILE_ROW', 'WEBHOOK'
    source_record_id VARCHAR(256) NOT NULL,      -- Primary key from source system
    source_file_id VARCHAR(256),                 -- Reference to source file registry
    record_type VARCHAR(64) NOT NULL,            -- 'PAYMENT', 'REFUND', 'FEE', 'DISPUTE'
    transaction_ref VARCHAR(256),                -- Internal payment ID or order reference
    external_ref VARCHAR(256),                   -- Gateway transaction reference
    currency CHAR(3) NOT NULL,
    gross_amount NUMERIC(18, 4) NOT NULL,
    fee_amount NUMERIC(18, 4) NOT NULL DEFAULT 0,
    net_amount NUMERIC(18, 4) NOT NULL,
    event_time TIMESTAMPTZ NOT NULL,
    settlement_date DATE,
    status VARCHAR(64),
    attributes JSONB NOT NULL DEFAULT '{}'::jsonb,
    ingestion_run_id UUID NOT NULL,
    normalized_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    CONSTRAINT uk_tenant_source_record UNIQUE (tenant_id, source_system, source_record_id)
);

CREATE INDEX idx_recon_records_refs 
ON reconciliation_records (tenant_id, transaction_ref, external_ref);

-- Store raw payloads to support re-parsing and audit lineage
CREATE TABLE reconciliation_raw_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id VARCHAR(64) NOT NULL,
    source_system VARCHAR(64) NOT NULL,
    source_record_id VARCHAR(256) NOT NULL,
    source_file_id VARCHAR(256),
    payload JSONB NOT NULL,
    payload_hash CHAR(64) NOT NULL,              -- SHA-256 hash to detect duplicates
    received_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    ingestion_run_id UUID NOT NULL,
    CONSTRAINT uk_tenant_raw_event UNIQUE (tenant_id, source_system, source_record_id)
);

-- Manage case status and assignments for manual reviews
CREATE TABLE reconciliation_cases (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id VARCHAR(64) NOT NULL,
    case_type VARCHAR(64) NOT NULL,              -- 'AMOUNT_MISMATCH', 'MISSING_LEDGER', 'GATEWAY_MISSING'
    status VARCHAR(32) NOT NULL,                 -- 'OPEN', 'UNDER_INVESTIGATION', 'RESOLVED', 'CLOSED'
    severity VARCHAR(32) NOT NULL,                -- 'HIGH', 'MEDIUM', 'LOW'
    reconciliation_key VARCHAR(256) NOT NULL,    -- Common ID mapping to the records (e.g., payment ID)
    summary VARCHAR(512) NOT NULL,
    details JSONB NOT NULL DEFAULT '{}'::jsonb,
    assignee VARCHAR(128),
    created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    resolved_at TIMESTAMPTZ,
    resolution_type VARCHAR(64),
    resolution_notes TEXT
);

CREATE INDEX idx_recon_cases_lookup 
ON reconciliation_cases (tenant_id, status, severity);

-- File ingestion registry to prevent duplicate uploads
CREATE TABLE source_files (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id VARCHAR(64) NOT NULL,
    source_system VARCHAR(64) NOT NULL,
    file_name VARCHAR(256) NOT NULL,
    file_hash CHAR(64) NOT NULL,
    file_date DATE,
    status VARCHAR(32) NOT NULL,                 -- 'PENDING', 'PROCESSING', 'COMPLETED', 'FAILED'
    row_count INT,
    received_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    processed_at TIMESTAMPTZ,
    CONSTRAINT uk_tenant_file_hash UNIQUE (tenant_id, source_system, file_hash)
);

Database Design and Index Optimization

idx_recon_records_refs: This composite index covers tenant_id, transaction_ref, and external_ref. Because the matcher continuously scans unmatched records looking for matching keys, this index ensures that the query planner performs rapid index lookups instead of slow, full-table sequential scans.
uk_tenant_source_record: A unique constraint that acts as the primary barrier against duplicate ingestion. It prevents the same transaction from being normalized twice, enforcing idempotency at the database engine level.
reconciliation_raw_events table: Storing raw data in JSONB ensures that if our parsing rules evolve, we can re-evaluate the original payload without losing data. The payload_hash column contains the SHA-256 hash of the incoming content, which is checked on ingestion to drop duplicates at the front gate.

Scaling and Operational Challenges

To design a reconciliation engine that scales to handle millions of transactions, we must evaluate write bandwidth and sharding.

Back-of-the-Envelope Capacity Estimations

Let us evaluate the system for an enterprise processing 50,000,000 transactions/day:

Sources per Transaction: Each payment generates 4 records (Internal Event, Ledger debit, Gateway webhook, Bank credit).
Daily Ingestion Volume: $$\text{Daily Ingestion} = 50,000,000 \times 4 = 200,000,000 \text{ records/day}$$
Network Write Ingestion Bandwidth: If records arrive evenly over a 24-hour period: $$\text{Average write throughput} = \frac{200,000,000 \text{ records}}{86,400 \text{ seconds}} \approx 2,315 \text{ records/second}$$ At peak traffic, this rate can increase to 10,000 records/second.
Daily Storage Footprint: If each canonical record payload is approximately 300 bytes: $$\text{Storage volume} = 200,000,000 \times 300\text{ bytes} = 60\text{ GB/day}$$ Including indexes and raw payloads (averaging 1 KB/record), the daily storage footprint reaches 260 GB/day.
Optimizing Database Queries: To prevent query performance from degrading as the database grows:
1. We partition the reconciliation_records and reconciliation_raw_events tables by date. Old partitions (older than 90 days) are archived to Amazon S3 in compressed Parquet format, leaving only the active reconciliation window in hot storage.
2. We index matching keys (such as transaction references and external references). This reduces index scan sizes, keeping lookup latency under 5ms.

Trade-offs and Architectural Alternatives

Designing a large-scale real-time reconciliation engine requires making important trade-offs regarding data ingestion latency and matching strictness.

Ingestion Timing: Real-Time Streaming vs. End-of-Day Batch Processing

Feature	Real-Time Streaming	End-of-Day Batch Processing
Outage Detection Speed	High; catches payment and webhook errors within minutes.	Low; errors are detected only after daily settlement files land.
System Complexity	High; requires handling out-of-order events and partial matches.	Low; files contain a complete, ordered record set.
Database Load	Evenly distributed over the day.	High spike during batch processing windows.

We choose a hybrid approach: We stream internal events and webhooks in real time to create provisional matches and flag immediate errors. The final settlement reconciliation runs as a daily batch job after bank and PSP settlement files land.

Matching Rules: Strict vs. Tolerant Matching

Strict Matching:
- Pros: Zero false positives; ensures absolute accuracy.
- Cons: High false-alarm rates due to timing delays, foreign exchange (FX) fluctuations, or rounding differences.
Tolerant Matching:
- Pros: Reduces operational alerts by accommodating rounding differences, fees, and timezone offsets.
- Cons: Risks matching unrelated transactions of the same amount.

Failure Scenarios and Resilience

Reconciliation engines exist to catch failures in other systems, so they must be resilient to external chaos.

1. Duplicate Ingestion Storms

If a payment gateway experiences network retries, it might send the same webhook dozens of times.

Mitigation: We calculate a SHA-256 hash of every incoming payload and check it against reconciliation_raw_events(payload_hash). If a match is found, we immediately discard the message at the ingestion gate.

2. Silent File Format Changes

A payment provider might change a CSV column layout (e.g., swapping "Transaction Date" to "Event Timestamp") without notice.

Mitigation: We execute schema-validation checks on every file before processing. The parser verifies column headers, data types, and row structures against a registered schema version. If a mismatch is detected, the file is quarantined, and a pager alert is triggered.

3. Late-Arriving Settlement Files

Bank settlement files can be delayed, causing matched internal records to appear as mismatches.

Mitigation: We implement a grace period. Unmatched internal records are not flagged as immediate cases; instead, they are held in a PENDING_SOURCE_DATA state for 48 hours to allow late files to arrive.

4. Foreign Exchange (FX) Fluctuation Differences

When an item is priced in USD, settled by the gateway in EUR, and deposited in GBP, conversion rates introduce minor differences due to timing. These variances can compound across millions of transactions, leading to significant imbalances in balance sheets.

Mitigation: The engine applies a multi-currency tolerance rule. It evaluates the conversion rate using historical rate feeds for that transaction day. If the mismatch is less than the calculated FX variance limit (e.g., less than 1% of the value), the system matches the transaction and routes the difference to an FX gain/loss write-off account. For differences exceeding this tolerance, the transaction is flagged as a high-severity mismatch case, triggering manual intervention by treasury analysts to check for currency arbitrage or gateway routing misconfigurations.

Staff Engineer Perspective

Applying reconciliation at scale reveals operational challenges that standard tutorials ignore.

[!WARNING] Handling Multi-Currency Rounding Errors Rounding errors are common in international payments. For example, converting EUR to USD can create fractional differences (e.g., $100.005 vs $100.01). The matching engine must support configurable tolerance ranges. If the difference is within the threshold (e.g. less than 0.05), the record is matched, and the difference is written off:
function withinTolerance(amountA: number, amountB: number, tolerance: number): boolean {
  return Math.abs(amountA - amountB) <= tolerance;
}

Verbal Script

Interviewer: "How would you design a payment reconciliation system to handle a scenario where a bank settles 10,000 transactions in a single aggregated deposit?"

Candidate: "I would use a batch aggregation matching model.

Instead of matching individual transactions to the bank deposit directly, the engine groups the corresponding PSP records (Stripe captures, fees, and refunds) into a settlement bucket.

The engine calculates the net total of the PSP settlement bucket: gross captures minus fees and refunds.

Finally, we match the net total of the PSP settlement bucket to the bank deposit amount. If they match, the engine marks the batch and all 10,000 underlying transactions as MATCHED. If they do not, we open an aggregate case for investigation."

Interviewer: "What happens if a bank file is uploaded twice by accident?"

Candidate: "We use cryptographic hash validation to prevent duplicate file processing.

When a file is uploaded, the ingestion gateway calculates its SHA-256 hash.

Before processing, we check the hash against the source_files registry. If the hash exists, the file is rejected as a duplicate, preventing duplicate transactions from creating reconciliation errors."

Interviewer: "How would you handle chargebacks and partial refunds that occur weeks after the initial transaction has already been reconciled?"

Candidate: "This is handled through Reconciliation Versioning and Revision Ledgers.

When a transaction is initially matched and verified, we seal its state with a version identifier. If a chargeback or a partial refund occurs later, it is treated as a new financial event, not a mutation of the original record. The new event is ingested as a distinct record linked to the original transaction reference.

The matcher runs a delta reconciliation job, creating a revision ledger entry. If a partial refund changes the final net total, the engine validates that the sum of all payouts and refunds matches the final bank settlements. If a mismatch arises (e.g., the PSP charged an extra dispute fee), the engine opens a case linked to the original case ID, ensuring a complete, audit-compliant historical timeline."