PDF Security Blog

Building a Document Fraud Detection Workflow for Fintech: Beyond KYC

HTPBE Team·10.06.2026·14 min read

This article is a snapshot — content was accurate as of June 2026 (code examples tested against the API as of April 2026). The product evolves actively; specific counts, examples, and detection rules may have changed since publication — see the changelog for the current state.

Your KYC platform confirmed the applicant is a real person. The liveness check passed. The ID matched a government database. The document they uploaded — a three-month bank statement — shows exactly the income you need to approve the loan.

What your KYC platform did not tell you is whether that bank statement was edited after the bank generated it.

This is the gap that structural forensics fills. KYC platforms like Persona, Onfido, Alloy, and Jumio are built to answer “is this a real person?” They are not built to examine the internal structure of the PDF files that person uploads. The two problems require different tools.

The Three-Layer Document Fraud Stack

Most fintech lending, BNPL, and neobank onboarding workflows have two of the three layers they need:

Layer 1: Identity (KYC platform such as Persona, Onfido, or Alloy). Checks that the person is real, the ID is genuine, and the face matches. Catches synthetic identity fraud and ID document forgeries. Does not examine the PDFs the verified person uploads.

Layer 2: Transaction data (open banking via Plaid or TrueLayer). Connects directly to the applicant’s bank account and pulls real transactions. Catches income inflation and account mismatch. Requires the applicant to consent to a bank connection and does not cover applicants who cannot or will not use open banking.

Layer 3: Document integrity (structural PDF forensics). Examines the internal file structure of uploaded documents and determines whether they were modified after the issuing institution generated them. Catches PDF tampering that passes identity checks, and covers applicants for whom open banking is not available.

Each layer catches a different attack. Consider a fraudster who uses a real identity with stolen credentials and so passes Layer 1. If they upload a bank statement from a different account than the one connected through Plaid, Layer 2 catches the mismatch, provided you compare the two. If they have also modified that statement, Layer 3 examines the file structure for evidence of the modification regardless of what Layer 2 sees.

The layers are complementary, not redundant. Layer 3 is the one most fintech operations teams are still missing — but it is a structural-evidence layer, not a fraud oracle. It raises the cost of clean fraud, gives reviewers a defensible reason to escalate, and feeds a clean signal into scoring models. Treating its output as proof-of-fraud rather than as a high-confidence anomaly is how it goes wrong in production.

What Document Fraud Looks Like at the PDF Layer

The most common bank statement and pay stub fraud patterns each leave a different structural trace.

Re-save via Excel or Word. An applicant downloads a genuine bank statement PDF, opens it in Microsoft Excel or Word (which decomposes it into editable content), adjusts transaction amounts or balances, and exports back to PDF. This overwrites the producer field — the metadata tag recording which software last saved the file. A bank-generated statement carries a producer string from the bank’s document platform. After re-export, it carries “Microsoft Excel” or “Microsoft: Print To PDF.” The creator field, which recorded the bank’s original software, remains. The mismatch is detectable without comparing against any original document.
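The producer/creator mismatch described above can be sketched as a small check. This is a minimal illustration, not the HTPBE? detection logic: the pattern list, the `PdfMetadata` shape, and the `ExampleBank Statement Engine` string are assumptions made for the example, and a real deployment would need its own curated list.

```typescript
interface PdfMetadata {
  producer: string; // software that last saved the file
  creator: string; // software that originally authored the content
}

// Assumption: an illustrative list built from the tool names mentioned in
// this article. It is not an exhaustive or official fingerprint set.
const CONSUMER_TOOL_PATTERNS: RegExp[] = [
  /microsoft.*(excel|word|print to pdf)/i,
];

function isConsumerTool(value: string): boolean {
  return CONSUMER_TOOL_PATTERNS.some((pattern) => pattern.test(value));
}

// True when the file was last saved by a consumer tool while the creator
// field still names some other (presumably institutional) software.
function hasProducerCreatorMismatch(meta: PdfMetadata): boolean {
  return isConsumerTool(meta.producer) && !isConsumerTool(meta.creator);
}

// Hypothetical metadata for a statement re-exported through Print To PDF.
const resaved: PdfMetadata = {
  producer: 'Microsoft: Print To PDF',
  creator: 'ExampleBank Statement Engine',
};
console.log(hasProducerCreatorMismatch(resaved)); // true
```

A mismatch of this kind is a reason to look closer, not a verdict by itself; the calibration section later in the article explains why.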

Incremental Acrobat edit. A fraudster opens the PDF in Adobe Acrobat, edits specific fields (a balance, a salary figure, an employer name), and saves. The PDF incremental update architecture appends the changes to the end of the file and adds a new cross-reference table. The original content remains in the file alongside the edit. The xref count goes from one to two. Combined with modification timestamps, this is consistent with one or more post-creation edit sessions — though some legitimate workflows (archival re-export, PDF/A normalisation, e-signature stamping) also produce incremental updates, so the count is a strong correlate of editing rather than a deterministic edit log.
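One rough way to see incremental updates without a PDF library is to count end-of-file markers in the raw bytes, since each appended revision normally ends with its own `%%EOF`. The sketch below is a heuristic for illustration only and is not how HTPBE? computes `xref_count`; the `asciiBytes` helper and the toy file contents are made up for the demo, and some legitimately produced files (linearized PDFs, for example) can carry more than one marker.

```typescript
const EOF_MARKER = [0x25, 0x25, 0x45, 0x4f, 0x46]; // ASCII bytes for "%%EOF"

// Counts non-overlapping "%%EOF" occurrences in a raw PDF byte array.
function countEofMarkers(bytes: Uint8Array): number {
  let count = 0;
  for (let i = 0; i + EOF_MARKER.length <= bytes.length; i++) {
    let match = true;
    for (let j = 0; j < EOF_MARKER.length; j++) {
      if (bytes[i + j] !== EOF_MARKER[j]) {
        match = false;
        break;
      }
    }
    if (match) {
      count++;
      i += EOF_MARKER.length - 1; // skip past the marker just counted
    }
  }
  return count;
}

// Hypothetical helper for the demo: encode an ASCII string as bytes.
function asciiBytes(text: string): Uint8Array {
  const out = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) out[i] = text.charCodeAt(i) & 0xff;
  return out;
}

// Toy stand-ins for a single-revision file and one with an appended update.
const singleRevision = asciiBytes('%PDF-1.7\nobjects\nxref\nstartxref\n123\n%%EOF\n');
const withUpdate = asciiBytes(
  '%PDF-1.7\nobjects\nxref\nstartxref\n123\n%%EOF\nnew objects\nxref\nstartxref\n456\n%%EOF\n'
);
console.log(countEofMarkers(singleRevision), countEofMarkers(withUpdate)); // 1 2
```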

Generator-tool fabrication. Some fraudsters skip the editing step and build documents from scratch using template tools. These produce structurally consistent files that may return inconclusive rather than modified — because there is no original structure to compare against. The verdict carries its own signal, described below.

Integration Architecture

There are three places in a lending or onboarding pipeline where PDF forensics fits naturally. Which you implement depends on whether you want synchronous blocking, async flagging, or model signal generation.

At Intake: Synchronous Block

Call the HTPBE? API immediately when an applicant uploads a document. If the verdict is modified, block the submission before it enters the review queue. The applicant sees a generic rejection message; your team sees the forensic report.

This is the highest-value integration point. Documents that are flagged at intake never reach underwriters, never consume manual review time, and never generate a loan file that has to be unwound later.

// Relies on the global fetch API (available in Node.js 18+), so no HTTP client import is needed.

const HTPBE_API_KEY = process.env.HTPBE_API_KEY!;
const BASE_URL = 'https://api.htpbe.tech/v1';

interface ForensicResult {
  id: string;
  status: 'intact' | 'modified' | 'inconclusive';
  modification_confidence: 'certain' | 'high' | 'none';
  modification_markers: string[];
  producer: string;
  creator: string;
  xref_count: number;
}

async function checkDocumentIntegrity(pdfUrl: string): Promise<ForensicResult> {
  const headers = {
    Authorization: `Bearer ${HTPBE_API_KEY}`,
    'Content-Type': 'application/json',
  };

  // Step 1: Submit for analysis
  const submitRes = await fetch(`${BASE_URL}/analyze`, {
    method: 'POST',
    headers,
    body: JSON.stringify({ url: pdfUrl }),
  });

  if (!submitRes.ok) {
    throw new Error(`Analysis submission failed: ${submitRes.status}`);
  }

  const { id } = await submitRes.json();

  // Step 2: Retrieve result
  const resultRes = await fetch(`${BASE_URL}/result/${id}`, { headers });

  if (!resultRes.ok) {
    throw new Error(`Result retrieval failed: ${resultRes.status}`);
  }

  return resultRes.json();
}

// Usage at document intake
async function handleDocumentUpload(
  pdfUrl: string,
  documentType: 'bank_statement' | 'pay_stub' | 'utility_bill'
): Promise<{ action: 'proceed' | 'block' | 'escalate'; checkId: string; reason?: string }> {
  const result = await checkDocumentIntegrity(pdfUrl);

  if (result.status === 'modified') {
    return {
      action: 'block',
      checkId: result.id,
      reason: `Modification detected: ${result.modification_markers.join(', ')}`,
    };
  }

  // inconclusive on institutional document types warrants escalation
  if (result.status === 'inconclusive' && documentType === 'bank_statement') {
    return {
      action: 'escalate',
      checkId: result.id,
      reason: `Document origin (${result.producer}) inconsistent with expected bank-issued PDF`,
    };
  }

  return { action: 'proceed', checkId: result.id };
}

The checkId is stored against the application record. If an approved loan later defaults and the document is revisited, the forensic report is retrievable from the same result endpoint used above (GET /result/{checkId}, relative to the API base URL), which gives the application a permanent audit trail.

At Underwriting: Async Batch Processing

If synchronous blocking at intake is not feasible, run forensic checks in a background job before the application reaches a human underwriter. Flag documents in the review queue so underwriters see the forensic verdict alongside the application. A modified flag means the underwriter reviews the document with full knowledge of what was detected; they are not making a credit decision blind.

At Model Training: Verdicts as Features

The HTPBE? response includes named markers (HTPBE_EDITING_TOOL_FINGERPRINT, HTPBE_MULTIPLE_REVISION_LAYERS, HTPBE_DATES_DISAGREE) and raw metadata fields (xref_count, creator, producer). These are structured features that can feed directly into fraud-scoring models. Applications where the producer field carries a consumer editing tool are statistically different from applications where it carries an institutional generator — independently of whether the verdict is modified.
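As a rough sketch of what "verdicts as features" could look like, the function below flattens one response into a numeric row. The marker names and response fields come from the text above; the feature names, the `ForensicSignals` shape, and the choice of encodings are assumptions for illustration, not a recommended schema.

```typescript
interface ForensicSignals {
  status: 'intact' | 'modified' | 'inconclusive';
  modification_markers: string[];
  xref_count: number;
}

// Marker names as listed in this article.
const KNOWN_MARKERS = [
  'HTPBE_EDITING_TOOL_FINGERPRINT',
  'HTPBE_MULTIPLE_REVISION_LAYERS',
  'HTPBE_DATES_DISAGREE',
] as const;

// Flattens one forensic response into numeric features for a scoring model.
function toFeatureRow(result: ForensicSignals): Record<string, number> {
  const row: Record<string, number> = {
    status_modified: result.status === 'modified' ? 1 : 0,
    status_inconclusive: result.status === 'inconclusive' ? 1 : 0,
    extra_revisions: Math.max(0, result.xref_count - 1),
    marker_count: result.modification_markers.length,
  };
  // One binary column per known marker, e.g. has_htpbe_dates_disagree.
  for (const marker of KNOWN_MARKERS) {
    const present = result.modification_markers.indexOf(marker) !== -1;
    row[`has_${marker.toLowerCase()}`] = present ? 1 : 0;
  }
  return row;
}

const exampleRow = toFeatureRow({
  status: 'modified',
  modification_markers: ['HTPBE_DATES_DISAGREE'],
  xref_count: 3,
});
console.log(exampleRow);
```

Keeping the raw markers as separate columns, rather than collapsing everything into the verdict, lets the model learn which combinations matter for a given document mix.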

Verdict Routing Table

Verdict	Context	Recommended Action
`intact`	Any document type	Proceed to underwriting
`modified`	Any document type	Route to manual review queue; do not approve automatically
`inconclusive`	Bank statement, pay stub (institutional doc expected)	Escalate; origin is inconsistent with issuing institution
`inconclusive`	Utility bill, government letter (consumer-produced OK)	Use alongside other signals; not a direct fraud indicator

inconclusive means the document was produced using consumer-grade software and HTPBE? cannot confirm post-creation integrity. It is not a failure — it is a meaningful signal about document origin. A bank statement returned inconclusive is a different risk signal than a utility bill returned inconclusive. The routing logic should reflect the expected origin of each document type.
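The routing table can be expressed as a small pure function, which keeps the policy testable and separate from the API call. The type names and route labels below are illustrative choices, not part of the HTPBE? API.

```typescript
type Verdict = 'intact' | 'modified' | 'inconclusive';
type DocumentType = 'bank_statement' | 'pay_stub' | 'utility_bill' | 'government_letter';
type Route = 'proceed' | 'manual_review' | 'escalate' | 'combine_with_other_signals';

// Document types where an institutional generator is the expected origin.
const INSTITUTIONAL_TYPES = new Set<DocumentType>(['bank_statement', 'pay_stub']);

// Mirrors the verdict routing table row by row.
function routeVerdict(verdict: Verdict, documentType: DocumentType): Route {
  switch (verdict) {
    case 'intact':
      return 'proceed';
    case 'modified':
      return 'manual_review'; // never approve automatically
    case 'inconclusive':
      return INSTITUTIONAL_TYPES.has(documentType)
        ? 'escalate'
        : 'combine_with_other_signals';
  }
}

console.log(routeVerdict('inconclusive', 'bank_statement')); // escalate
console.log(routeVerdict('inconclusive', 'utility_bill')); // combine_with_other_signals
```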

Where the cost case sits

Per-document forensic cost is a fraction of a percent of the typical underwritten loan amount. The real question for a lending or onboarding operation is not the per-check spend — it is whether the document-integrity layer pays back its calibration cost (the few weeks of false-positive tuning) over the lifetime of the loan book. A single avoided fraudulent origination on a five-figure loan typically dwarfs months of API spend at any sensible plan tier.

Before committing to an integration, run a sample batch from your last default cohort against the API. If the file-structure signal correlates with your post-hoc fraud labels, the math is straightforward. If it does not — or if the signal collapses into the calibration noise your team already accepts on KYC — you have learned that without committing further engineering time. The bank statement fraud in lending breakdown walks through the same sampling exercise against a specific document class.
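The sampling exercise comes down to comparing forensic flags with your own post-hoc fraud labels. A minimal sketch of that comparison follows; the `LabeledSample` shape and the sample data are invented for illustration, and what counts as "flagged" (for example, any modified verdict) is a decision left to the reader.

```typescript
interface LabeledSample {
  flagged: boolean; // forensic check raised a flag on the document
  fraud: boolean; // post-hoc label from your own default or fraud review
}

interface BacktestSummary {
  truePositives: number;
  falsePositives: number;
  falseNegatives: number;
  precision: number | null; // null when nothing was flagged
  recall: number | null; // null when the cohort has no labelled fraud
}

function backtest(samples: LabeledSample[]): BacktestSummary {
  let truePositives = 0;
  let falsePositives = 0;
  let falseNegatives = 0;
  for (const sample of samples) {
    if (sample.flagged && sample.fraud) truePositives++;
    else if (sample.flagged) falsePositives++;
    else if (sample.fraud) falseNegatives++;
  }
  const flagged = truePositives + falsePositives;
  const fraudulent = truePositives + falseNegatives;
  return {
    truePositives,
    falsePositives,
    falseNegatives,
    precision: flagged === 0 ? null : truePositives / flagged,
    recall: fraudulent === 0 ? null : truePositives / fraudulent,
  };
}

// Made-up cohort of four documents, purely to show the shape of the output.
const summary = backtest([
  { flagged: true, fraud: true },
  { flagged: true, fraud: false },
  { flagged: false, fraud: true },
  { flagged: false, fraud: false },
]);
console.log(summary.precision, summary.recall); // 0.5 0.5
```

Low precision on a real cohort points at the legitimate contamination sources discussed in the calibration section; low recall points at fraud that lives outside the file structure.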

False Positives and Calibration

Structural signals are noisy in messy enterprise environments. A producer of Microsoft Excel on a bank statement is a strong fraud correlate; an extra incremental update on a corporate tax form is not. Five common sources of legitimate metadata contamination:

DMS and ECM rewrites. Document management platforms (SharePoint, M-Files, OpenText) regularly re-serialise PDFs on ingest for indexing, OCR, or compliance tagging — overwriting producer and bumping xref_count.
Email gateway processing. Anti-malware scanners, link-rewriting, and PDF sanitisers (especially in regulated industries) re-emit files with new metadata. The applicant did nothing wrong; the gateway did.
Mobile capture and share pipelines. A statement screenshot-shared from a banking app through iOS Files or Android share-sheet is often re-encoded by the intermediate app — a different producer, with no malicious intent.
Archival and PDF/A normalisation. Corporate accountants and bookkeepers running their own document retention regularly re-export to PDF/A, which always rewrites the file structure.
E-signature platforms. DocuSign, Adobe Sign, and similar tools add their own structural layers (signatures, audit trails, certificate dictionaries) that legitimately produce xref_count > 1 and modified timestamps.

Calibration is not optional. A forensic layer that throws every modified verdict into the rejection bucket will burn through underwriter trust within weeks — high false-positive rates kill detection systems faster than missed fraud does. The routing logic in the table above is a starting template, not the finished product. In production:

Score on the combination of status, modification_confidence, and the specific marker ids — not on modified alone. A modified verdict with modification_confidence: high and a single benign marker is a different signal from a modified verdict with modification_confidence: certain and three corroborating markers.
Hold the verdict against the document type expectation. Bank statements from a top-10 retail bank have a known producer baseline; freelance invoices do not.
Default ambiguous cases to human review, not automatic rejection, until you have weeks of ground-truth labels from the actual reviewer queue. The cost of one wrongly rejected good applicant (chargeback, churn, complaint) commonly exceeds the cost of letting one fraud attempt reach a human reviewer who then catches it.

Treat the first six weeks of integration as calibration, not enforcement.
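To make "score on the combination" concrete, here is one possible shape for a review-priority score. Every number in it is a placeholder to be replaced during calibration, and the decision to treat a lone `HTPBE_MULTIPLE_REVISION_LAYERS` marker as benign is an assumption for the example, not an HTPBE? rule.

```typescript
type Confidence = 'certain' | 'high' | 'none';

interface VerdictSignals {
  status: 'intact' | 'modified' | 'inconclusive';
  modification_confidence: Confidence;
  modification_markers: string[];
}

// Assumption: markers your own labelled data shows to be benign on their own.
const BENIGN_ALONE = new Set<string>(['HTPBE_MULTIPLE_REVISION_LAYERS']);

// Placeholder weights on a 0-100 scale; tune against reviewer outcomes.
const BASE_POINTS: Record<Confidence, number> = { certain: 60, high: 40, none: 20 };
const POINTS_PER_CORROBORATING_MARKER = 15;
const INCONCLUSIVE_POINTS = 20;

function reviewPriority(verdict: VerdictSignals): number {
  if (verdict.status === 'intact') return 0;
  if (verdict.status === 'inconclusive') return INCONCLUSIVE_POINTS;
  const corroborating = verdict.modification_markers.filter(
    (marker) => !BENIGN_ALONE.has(marker)
  ).length;
  const score =
    BASE_POINTS[verdict.modification_confidence] +
    POINTS_PER_CORROBORATING_MARKER * corroborating;
  return Math.min(100, score);
}

const weakSignal = reviewPriority({
  status: 'modified',
  modification_confidence: 'high',
  modification_markers: ['HTPBE_MULTIPLE_REVISION_LAYERS'],
});
const strongSignal = reviewPriority({
  status: 'modified',
  modification_confidence: 'certain',
  modification_markers: [
    'HTPBE_EDITING_TOOL_FINGERPRINT',
    'HTPBE_DATES_DISAGREE',
    'HTPBE_MULTIPLE_REVISION_LAYERS',
  ],
});
console.log(weakSignal, strongSignal); // 40 90
```

Both inputs carry the same modified verdict, yet they land far apart in the queue, which is the point of scoring on more than the verdict.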

What This Does Not Catch

Structural forensics detects modifications to existing PDFs. Several scenarios sit outside that scope — some are pre-existing limitations, others are the active edge of the arms race.

Documents fabricated in the same software the issuing institution uses. If a fraudster generates a bank statement using cloned bank templates in the bank’s own PDF infrastructure (rare but possible for sophisticated actors), the file structure will be consistent with a legitimate document. The verdict will be intact. This attack requires institutional access and is not common in volume fraud.

Genuine consumer-originated documents. Some applicants legitimately produce documents using consumer tools — a freelancer who invoices from Word, a small landlord who writes a tenancy letter in Google Docs. These return inconclusive. The routing logic should account for document type and applicant context, not apply a blanket escalation rule to every inconclusive result.

Browser-native and headless renders. A statement rebuilt as HTML in a templating engine and rendered to PDF through Chromium’s built-in print (or a headless Puppeteer pipeline) produces a structurally clean, single-revision PDF with a generic Chromium or Skia/PDF producer. The verdict is typically inconclusive rather than modified: the file was created from scratch, so there is no modification to detect. The right operational response is to treat inconclusive on institutional document types as a strong signal in its own right, not to expect a modified verdict.

Synthetic statement generators. Online "fake bank statement" services and template kits produce files that mimic institutional output. Higher-end generators replicate not just the visual layout but also plausible producer strings — though the producer name almost never matches the actual document platform of the bank or payroll provider being impersonated, which is what gives them away. Lower-end generators leave consumer-tool fingerprints; higher-end ones require corroboration from Layer 1 (identity) and Layer 2 (account match) to catch reliably.

AI-assisted document recreation. LLM-generated layouts rendered into PDF through standard browser or design tooling produce structurally consistent files. Forensic markers may show nothing actionable; the lies live in the content (impossible totals, broken running balances, mismatched issuer details), which is a different layer altogether — closer to content validation, issuer lookups, and transaction reconciliation than to structural analysis.
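A broken running balance is the kind of content-level check that sits outside structural forensics. As a small illustration, the sketch below assumes the statement lines have already been extracted into integer cents; the `StatementLine` shape and the sample figures are hypothetical, and extraction itself is out of scope here.

```typescript
interface StatementLine {
  amountCents: number; // signed: credits positive, debits negative
  balanceCents: number; // running balance printed on the statement
}

// Returns the indexes of lines whose printed balance does not follow from
// the previous printed balance plus the line amount.
function findBrokenBalances(openingCents: number, lines: StatementLine[]): number[] {
  const broken: number[] = [];
  let previous = openingCents;
  lines.forEach((line, index) => {
    if (previous + line.amountCents !== line.balanceCents) broken.push(index);
    // Continue from the printed balance so one inconsistency is reported once.
    previous = line.balanceCents;
  });
  return broken;
}

// Made-up statement: the third line's balance was raised by 500.00.
const brokenLines = findBrokenBalances(100000, [
  { amountCents: 250000, balanceCents: 350000 },
  { amountCents: -40000, balanceCents: 310000 },
  { amountCents: -10000, balanceCents: 350000 },
  { amountCents: -5000, balanceCents: 345000 },
]);
console.log(brokenLines); // [ 2 ]
```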

Structural forensics is one layer in a stack, not a replacement for the other layers. Used correctly, it catches what KYC platforms and open banking connections miss — and surfaces a measured anomaly rate that calibrated routing logic, not a blanket rejection rule, decides how to act on.

Stack Recommendation

Persona or Onfido for identity. Plaid or TrueLayer for transaction data. HTPBE? for document integrity. That combination covers three independent fraud vectors — is the person real, does the account data match, and is the file structure of the document consistent with the issuing institution’s output. Each layer addresses what the others structurally cannot see; none of them is the answer on its own.

The API reference lists the full request/response contract and error codes. For where this layer sits relative to KYC platforms more broadly, see KYC vs. document forensics.

Frequently Asked Questions

Why isn’t KYC enough to catch document fraud?

KYC platforms (Persona, Onfido, Alloy, Jumio) confirm that the person submitting an application is real and that their ID is genuine. They are not built to examine the internal structure of the PDFs that person uploads. A verified real person with a real ID can upload an edited bank statement and pass every identity check on the way through. The structural integrity of the file is a different question that needs its own layer.

How does PDF integrity verification work without the original file?

The check compares the file against its own internal structure, not against an external original. Every PDF carries a producer field (the software that last saved it), one cross-reference (xref) section per save session, and CreationDate / ModDate timestamps. A statement emitted once by an institutional portal has a single xref section, a single institutional producer, and matching dates. A re-exported or edited file typically shows one or more of the opposite traits: additional xref sections, a consumer-tool producer, or a gap between the two dates. No external reference document is needed.
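The date comparison can be sketched with a small parser for PDF date strings of the form `D:YYYYMMDDHHmmSS` followed by `Z` or a `+HH'mm'` style offset. This is a simplified illustration under the assumption that both values were already read from the file's metadata; it is not the HTPBE? implementation, and the function names are made up.

```typescript
// Parses a PDF date string into epoch milliseconds, or null if malformed.
function parsePdfDate(raw: string): number | null {
  const pattern =
    /^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$/;
  const m = pattern.exec(raw.trim());
  if (!m) return null;
  const num = (part: string | undefined, fallback: number): number =>
    part === undefined ? fallback : Number(part);
  // Missing month and day default to 1; missing time parts default to 0.
  const asUtc = Date.UTC(
    num(m[1], 0), num(m[2], 1) - 1, num(m[3], 1),
    num(m[4], 0), num(m[5], 0), num(m[6], 0)
  );
  if (m[8] === undefined) return asUtc; // "Z" or no offset given
  const offsetMinutes = num(m[9], 0) * 60 + num(m[10], 0);
  const sign = m[8] === '+' ? 1 : -1;
  // Local time is UTC plus the offset, so subtract it to get UTC.
  return asUtc - sign * offsetMinutes * 60000;
}

// Seconds between CreationDate and ModDate, or null if either is unparseable.
function dateGapSeconds(creationDate: string, modDate: string): number | null {
  const created = parsePdfDate(creationDate);
  const modified = parsePdfDate(modDate);
  if (created === null || modified === null) return null;
  return (modified - created) / 1000;
}

console.log(dateGapSeconds('D:20260101120000Z', 'D:20260103120000Z')); // 172800
```

What counts as a suspicious gap depends on the issuer: some platforms stamp both dates within the same second, others do not, so any threshold belongs in calibration rather than in code.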

Where in the onboarding pipeline should the check run?

The highest-leverage point is intake — synchronously, before the document enters the review queue. A blocked submission costs the applicant a friendly rejection message; a released loan on a forged statement costs the principal. If synchronous blocking is not feasible, an async check that flags documents in the review queue before an underwriter opens them is the next-best shape. Running the check after approval is the least useful version — by then the credit decision has already been made on unverified inputs.

Will every modified verdict mean fraud?

No, and treating it that way will burn through underwriter trust quickly. Legitimate modified causes include e-signature stamping, DMS re-export, mailbox sanitisers, mobile-share re-encoding, and applicants compressing the file before upload. The right operational posture is to start every modified and every institutional-document inconclusive in human review, log the reviewer’s ground-truth outcome, and tighten auto-routing rules only after several weeks of labelled data. The verdict is a signal into review, not a credit or underwriting decision on its own.