OCR extracts the income figure from a fraudulent bank statement perfectly — it cannot detect that the file was fabricated in Word
OCR-based document analysis reads what is printed on the page. A fabricated bank statement created in Microsoft Word reads correctly — the numbers are right, the layout looks real, the extracted data passes cash-flow thresholds. What OCR cannot read is the file's structural layer: the producer field showing Word instead of Chase Online Banking, the single xref table with no incremental history, the modification timestamp gap. htpbe? reads that layer. Pair it with your existing extraction stack.
htpbe? analyzes the structural layer of the PDF file only — producer, xref, metadata, image streams, signature chain, balance arithmetic. We don't extract data, we don't read text content with OCR, we don't classify transactions or build cash-flow profiles. Ocrolus has those layers and a customer base; htpbe? is positioned for teams that want structural fraud detection as a focused primitive, separate from extraction and analytics.
One REST call, one deterministic verdict
Upload the PDF. The API returns INTACT, MODIFIED, or INCONCLUSIVE with named markers — in about three seconds.
How structural fraud survives OCR-based document analysis
Three real fraud mechanics we catch at the structural PDF layer.
Bank statement fabricated in Word — OCR reads it correctly
Applicant creates a bank statement layout in Microsoft Word using the bank's logo from the web, types in three months of fictitious deposits totalling $6,400/month, exports to PDF. OCR extracts the amounts correctly. Cash-flow analytics shows regular income. The producer field showing Microsoft Word — not the bank's issuance system — is invisible to every layer that reads the document as text.
Pay stub edited to raise gross pay — OCR reports the inflated number
Applicant downloads a real Gusto pay stub showing $3,200/month, opens it in an editor, raises the gross pay to $5,800, saves. OCR extracts $5,800 and cash-flow analysis accepts it. The xref chain shows a second cross-reference table appended after the original Gusto export — structural evidence of the edit that OCR has no mechanism to detect.
W-2 with wages changed after IRS e-file — OCR cannot see the timestamp gap
A real W-2 from TurboTax shows $42,000 in Box 1. The applicant opens it months later, changes the figure to $68,000, and saves. OCR extracts the new figure. The modification date is 4 months after the creation date on a document that should be a single-session export: a structural signal that only forensic analysis reads.
How htpbe? is positioned
Why OCR-based document analysis has a structural blind spot
Every OCR platform in the market reads what is in the document. None of them read whether the document itself is real.
The structural layer — producer signature, xref chain, modification history — exists in the binary file, not in the text.
OCR platforms (AWS Textract, Google Document AI, Azure Form Recognizer, Ocrolus) extract text and numbers from the document as rendered. They have no mechanism to inspect the binary file structure: the producer field that names the software that created the PDF, the xref chain that records every edit session, the modification timestamp that shows when the file was last saved. A bank statement fabricated in Word and a genuine Chase online export produce identical OCR output — but the structural layer is completely different. htpbe? reads that layer as a standalone API call that sits alongside whatever extraction stack you already use.
Five forensic layers, one deterministic verdict
Every PDF we receive passes through the same structural pipeline — no model training, no thresholds to tune.
Metadata analysis
Creation and modification timestamps, producer and creator fields, XMP metadata — the first layer exposes basic tampering.
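To make the timestamp check concrete, here is a minimal Python sketch that parses PDF date strings of the form `D:YYYYMMDDHHmmSS` with an optional timezone suffix and returns the gap between creation and modification. The function names are hypothetical, and treating a missing timezone as UTC is an assumption of this sketch, not a description of how htpbe? works internally.

```python
import re
from datetime import datetime, timedelta, timezone

# PDF date string: D:YYYYMMDDHHmmSS followed by Z or +HH'mm' / -HH'mm'.
# Every field after the year is optional.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_date(value: str) -> datetime:
    """Parse a PDF date string into a timezone-aware datetime."""
    m = _PDF_DATE.match(value.strip())
    if m is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    defaults = (0, 1, 1, 0, 0, 0)
    year, month, day, hour, minute, second = (
        int(g) if g else d for g, d in zip(m.groups()[:6], defaults)
    )
    tz = timezone.utc  # assumption: no timezone suffix is treated as UTC
    if m.group(8):
        offset = timedelta(hours=int(m.group(9)), minutes=int(m.group(10) or 0))
        tz = timezone(offset if m.group(8) == "+" else -offset)
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


def modification_gap(creation: str, modification: str) -> timedelta:
    """Return how long after creation the file was last saved."""
    return parse_pdf_date(modification) - parse_pdf_date(creation)
```

A single-session export would normally show a gap of seconds, so a gap of days or months is the kind of signal this layer describes. What threshold counts as suspicious depends on the document type.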
File structure
Xref tables, trailer chain, incremental updates. An edit saved as an incremental update after export leaves a structural fingerprint here.
Digital signatures
Signature chain integrity and post-signature modifications produce deterministic markers. Certainty-level signal.
Content integrity
Fonts, objects, embedded content, page assembly. Multi-session edits and inserted objects are visible at this layer.
Verdict with markers
Deterministic output: INTACT / MODIFIED / INCONCLUSIVE, with named markers for every finding — suitable for audit trail.
PDFs we analyze structurally for lenders and mortgage ops
Every type listed below is analyzed at the structural file layer — not the rendered image.
Detection capabilities
Deterministic structural signals. No probabilistic scores, no model training.
Producer signature analysis
Authentic bank statements come from banking systems, pay stubs from payroll engines, tax forms from accounting/tax software. When the producer field shows a desktop tool (Microsoft Word, Excel, LibreOffice) or a generator-tool fingerprint (Chrome Headless, wkhtmltopdf), htpbe? flags accordingly — no OCR needed to make this call.
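As an illustration of the idea, the sketch below sorts a producer string into coarse categories using only the tool names mentioned above. The category labels and the function name are hypothetical, real producer strings vary by version and platform, and a production list would need to be far longer.

```python
import re
from typing import Optional

# Tool names taken from the paragraph above; matching is case-insensitive
# and ignores punctuation such as the registered-trademark sign.
DESKTOP_TOOLS = ("microsoft word", "excel", "libreoffice")
GENERATOR_TOOLS = ("chrome headless", "wkhtmltopdf")


def classify_producer(producer: Optional[str]) -> str:
    """Return 'missing', 'desktop_tool', 'generator_tool', or 'other'."""
    if not producer or not producer.strip():
        return "missing"
    # Collapse every run of non-alphanumeric characters into one space.
    normalized = re.sub(r"[^a-z0-9]+", " ", producer.lower()).strip()
    if any(name in normalized for name in DESKTOP_TOOLS):
        return "desktop_tool"
    if any(name in normalized for name in GENERATOR_TOOLS):
        return "generator_tool"
    return "other"
```

An "other" result is not a clean bill of health. It only means the string did not match this short list, and the producer field itself can be rewritten, which is why it is one signal among several.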
Incremental update detection (xref chain)
An edit saved as an incremental update appends a new cross-reference section to the file, leaving a structural trace in the xref chain. htpbe? counts cross-reference tables and flags incremental updates, the structural fingerprint of post-issuance editing that OCR-based extraction cannot see.
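A rough way to see this signal yourself is to count the `startxref` ... `%%EOF` markers in the raw bytes, since each saved revision ends with one. The sketch below does only that. It is a heuristic with hypothetical function names, not the htpbe? implementation, and linearized ("fast web view") files can legitimately contain more than one marker, so a count above one calls for a closer look rather than a verdict.

```python
import re

# Each revision of a PDF ends with: startxref <byte offset> %%EOF
_STARTXREF = re.compile(rb"startxref\s+\d+\s+%%EOF")


def count_xref_sections(pdf_bytes: bytes) -> int:
    """Count end-of-revision markers found in the raw file bytes."""
    return len(_STARTXREF.findall(pdf_bytes))


def has_incremental_updates(pdf_bytes: bytes) -> bool:
    """True when more than one revision marker is present."""
    return count_xref_sections(pdf_bytes) > 1
```

Usage is a matter of reading the file in binary mode, for example `has_incremental_updates(open(path, "rb").read())`.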
Balance arithmetic verification
The running balance is verified row by row across bank statements (previous balance + transaction = new balance). Edited transactions break the chain unless every dependent balance was also adjusted. htpbe? reads the structural data without OCR.
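The arithmetic itself is simple enough to sketch. The example below assumes the statement rows are already available as (signed amount, printed balance) pairs, which is an assumption about input shape and says nothing about how htpbe? obtains them. It uses `Decimal` so that cents compare exactly.

```python
from decimal import Decimal
from typing import Iterable, List, Tuple


def find_balance_breaks(
    opening_balance: str, rows: Iterable[Tuple[str, str]]
) -> List[int]:
    """Return indexes of rows where previous balance + amount != printed balance.

    Each row is (signed_amount, printed_balance), given as decimal strings.
    """
    breaks = []
    previous = Decimal(opening_balance)
    for index, (amount, printed) in enumerate(rows):
        printed_balance = Decimal(printed)
        if previous + Decimal(amount) != printed_balance:
            breaks.append(index)
        # Continue from the printed balance so a single edited row
        # is reported once instead of flagging every later row.
        previous = printed_balance
    return breaks
```

For instance, raising one deposit from 6,400.00 to 8,400.00 without touching the printed balances makes that row, and only that row, fail the check.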
Digital signature chain validation
Tax forms, employer letters, and many institutional PDFs carry digital signature chains. htpbe? validates the signature chain and flags invalidated or removed signatures — orthogonal to whatever OCR sees in the text.
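One narrow piece of this can be shown without any cryptography. A signed PDF records a `/ByteRange` of four integers describing which bytes the signature covers, and when the last covered byte is not the end of the file, something was appended after signing. The sketch below checks only that. It does not validate the signature, the function name is hypothetical, and some legitimate workflows (for example, adding validation data later) also append bytes, so a positive result is a prompt for review rather than proof of tampering.

```python
import re
from typing import Optional

# /ByteRange [start1 length1 start2 length2]
_BYTE_RANGE = re.compile(
    rb"/ByteRange\s*\[\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*\]"
)


def bytes_after_last_signature(pdf_bytes: bytes) -> Optional[int]:
    """Return how many bytes lie beyond the furthest signed range.

    Returns None when the file contains no /ByteRange entry at all.
    """
    matches = _BYTE_RANGE.findall(pdf_bytes)
    if not matches:
        return None
    # The second range of each signature ends at start2 + length2.
    covered_end = max(int(start2) + int(length2) for _, _, start2, length2 in matches)
    return len(pdf_bytes) - covered_end
```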
Image-stream artefact detection
Lifted-and-pasted logos, signatures, and headers leave compression artefacts that differ from authentic embedded content. htpbe? reads the image-stream metadata directly — exposing paste operations OCR cannot see.
Cross-document fingerprint analysis
When multiple "different" employer letters or bank statements share font subset prefixes, image hashes, or producer signatures across an applicant pool, htpbe? surfaces the shared fingerprints — useful for catching synthetic-identity rings.
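The grouping step can be sketched as an inverted index: map each fingerprint to the documents that carry it, then keep the fingerprints seen in more than one document. The input shape and the fingerprint strings in the usage example are hypothetical. How fingerprints are extracted is outside this sketch.

```python
from collections import defaultdict
from typing import Dict, List, Set


def shared_fingerprints(documents: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each fingerprint seen in 2+ documents to the sorted document ids."""
    seen = defaultdict(set)
    for doc_id, fingerprints in documents.items():
        for fingerprint in fingerprints:
            seen[fingerprint].add(doc_id)
    return {
        fingerprint: sorted(doc_ids)
        for fingerprint, doc_ids in seen.items()
        if len(doc_ids) > 1
    }


if __name__ == "__main__":
    pool = {
        "applicant-a.pdf": {"font:ABCDEF", "image:9f2c"},
        "applicant-b.pdf": {"font:ABCDEF", "image:77aa"},
        "applicant-c.pdf": {"font:QWERTY"},
    }
    print(shared_fingerprints(pool))
```

Some overlap is expected and harmless, such as two genuine statements from the same bank sharing a logo hash, so the output is a list of candidates for review rather than a list of frauds.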
An Ocrolus alternative for structural PDF fraud — no OCR layer
Buyers can skip this section. Developers: the integration is two HTTP calls.
Step 1: submit the PDF
curl -X POST https://api.htpbe.tech/v1/analyze \
  -H "Authorization: Bearer $HTPBE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://your-storage/applicant-bank-statement.pdf"}'
Step 2: read the verdict (no extracted data, just integrity)
{
  "id": "o1c2r3o4-5l6u-7s8a-9z0l-a1b2c3d4e5f6",
  "status": "modified",
  "modification_confidence": "high",
  "modification_markers": [
    "Two cross-reference tables — incremental update",
    "Modification date 7 days after creation date",
    "PDF editor producer detected"
  ],
  "producer": "Adobe Acrobat Pro",
  "creator": "Chase Online Banking",
  "creation_date": 1707091200,
  "modification_date": 1707696000,
  "has_digital_signature": false,
  "xref_count": 2,
  "has_incremental_updates": true
}
The original came from Chase Online Banking, an institutional source. Seven days later it was opened in Adobe Acrobat Pro and re-saved, which added a second xref table. Verdict: modified at high confidence, without ever running OCR on the text. Pair this verdict with whatever extraction layer you already use.
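For teams calling the API from Python rather than curl, the request and response handling might look like the sketch below. It uses only the standard library and takes the endpoint, headers, and field names from the samples above. The helper names are hypothetical, and because how the verdict is delivered (in the same response or through a follow-up call) is not spelled out here, the sketch stops at building the request and summarizing a response body shaped like the sample.

```python
import json
import urllib.request

API_URL = "https://api.htpbe.tech/v1/analyze"  # endpoint from the curl sample


def build_request(pdf_url: str, api_key: str) -> urllib.request.Request:
    """Build the POST request shown in the curl sample."""
    body = json.dumps({"url": pdf_url}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def summarize(result: dict) -> dict:
    """Reduce a verdict body to the fields a reviewer needs first."""
    created = result.get("creation_date")
    modified = result.get("modification_date")
    gap_days = None
    if created is not None and modified is not None:
        # Both dates are Unix timestamps in seconds in the sample response.
        gap_days = (modified - created) // 86400
    return {
        "status": result.get("status"),
        "confidence": result.get("modification_confidence"),
        "gap_days": gap_days,
        "markers": list(result.get("modification_markers", [])),
    }
```

Sending the request is then `urllib.request.urlopen(build_request(url, key))`, with the body parsed by `json.load`. Applied to the sample response, `summarize` reports a gap of 7 days, which matches the second marker.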
Customer Stories
Teams that stopped document fraud
Compliance, finance, and risk teams use htpbe? to catch manipulated PDFs before they become costly mistakes.
Caught an invoice where the total had been changed by less than a thousand dollars. Without this I would have approved it without a second look.
Sarah M.
AP Manager
United States
We had three applicants in the same week with bank statements that looked completely fine. Two of them were flagged as modified. You simply cannot see this by reading the document — it is in the file structure.
Lars V.
Risk Analyst, Online Lending
Netherlands
Salary slips were coming with altered figures. We identified two problematic files before the placement was finalised.
Priya K.
HR Operations Lead
India
Since we started checking documents this way, we stopped two applications early in the process that would have been very difficult to reverse later.
Julien R.
Fraud Analyst, Fintech
France
Some applicants were sending PDFs that looked authentic but had been edited in ways not visible to the eye. We now ask for verified originals when something is flagged. Already saved us from a few bad decisions.
Marta S.
Compliance Coordinator
Spain
One invoice was caught because there was a mismatch between the document dates and structure. That particular case would have cost us significantly.
Tariq A.
Finance Manager
United Arab Emirates
Related solutions and guides
Mortgage
Mortgage operations vertical — document fraud detection for loan-origination workflows.
Fintech & Lending
Lender vertical positioning — fraud-ops angle for risk teams.
Bank Statement Fraud Detection
The primary document type in lending fraud — structural forensics deep dive.
Secure your workflow
Create your account — API key on signup, free test environment on every plan.
From $15/mo. No sales call. Cancel any time.