PDF Fraud Detection in Loan Origination

This article is a snapshot — content was accurate as of May 2026 (code examples tested against the API as of April 2026). The product evolves actively; specific counts, examples, and detection rules may have changed since publication — see the changelog for the current state.
A borrower submits a loan application through your LOS. Bank statements, a W-2, two pay stubs, a tax return. Everything looks right — the account numbers, the employer name, the income figure that places the borrower within your debt-to-income threshold.
Three weeks later, the loan closes. Eighteen months after that, it defaults. When you pull the origination file in a repurchase review, someone finally opens the PDF in the right tool and sees it: the producer field on the bank statement shows “iLovePDF”. The modification date is three days after the creation date. The balance figures were edited after the bank generated the file.
The fraud was in the PDF the entire time. Your LOS never checked.
PDF fraud detection in loan origination closes this gap. A structural forensics check at document intake examines the file itself — not just what it says, but whether its internal history is consistent with the system that claims to have generated it.
The document fraud surface in loan origination
Every loan application that requires income or asset verification is a document fraud surface. The documents involved follow a predictable pattern:
- Bank statements: downloaded from the borrower’s online banking portal as a PDF. Real statements are generated by institutional document systems: Chase, Wells Fargo, Bank of America, HSBC. Their producer signatures are consistent and identifiable. An edited statement carries the producer of whichever tool modified it last.
- W-2s: issued by employers and generated either by payroll software (ADP, Paychex, Gusto) or filed and downloaded through tax preparation platforms (TurboTax, H&R Block, IRS e-file). A W-2 whose claimed issuer is a national employer but whose producer is a consumer PDF editor should not exist in a normal workflow.
- Pay stubs: generated by payroll platforms (ADP Workforce Now, Paychex Flex, Gusto, Rippling). Each has a distinct producer signature. A pay stub that claims ADP origin but carries a different producer was modified between generation and submission.
- Tax returns: either IRS-issued transcripts or preparer-generated documents from TurboTax, H&R Block, or a CPA’s practice management software. The producer field narrows the expected origin significantly.
- Asset letters and employment verification letters: generated on institutional letterhead, typically by the issuer’s document management system or a staffing platform. These documents should show producer strings matching enterprise software, not consumer tools.
The modification pattern is the same across all of these: the borrower takes a legitimate PDF, opens it in an editor, changes a number — income, balance, employment date, contribution amount — and submits the modified file. The edit takes five minutes. Without structural forensics, detection is nearly impossible.
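As a rough illustration of the producer comparison described above, the sketch below flags a producer string that names one of the consumer tools mentioned in this article. The tool list and the function name `is_consumer_producer` are assumptions made for illustration and are not part of any vendor API. Real producer strings vary by version, so a production list would need ongoing maintenance.

```python
# Consumer tools named in this article; an illustrative, non-exhaustive list.
CONSUMER_TOOLS = (
    "ilovepdf",
    "smallpdf",
    "pdf24",
    "adobe acrobat",
    "microsoft word",
    "google docs",
    "libreoffice",
    "canva",
)


def is_consumer_producer(producer: str) -> bool:
    """Return True if the producer string names a known consumer tool."""
    normalized = producer.strip().lower()
    # Substring match tolerates version suffixes such as "Adobe Acrobat DC".
    return any(tool in normalized for tool in CONSUMER_TOOLS)
```

A substring match is deliberately loose: it tolerates version suffixes but will not recognize tools missing from the list.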
Why your LOS does not catch this
Encompass, Blend, Byte, and SimpleNexus are routing and workflow platforms. They intake documents, attach them to loan files, route them to the appropriate stage, and hold them for underwriter review. That is what they were built to do.
Document fraud detection in a mortgage LOS is not part of that design. These platforms do not inspect whether the PDF is structurally consistent with the system that allegedly generated it — that was never in their scope.
The OCR extraction layer that sits in front of many LOS platforms — Ocrolus, FormFree, Finicity — has a different role. These tools read the document’s content: the income figures, the account balances, the employment dates. They extract numbers from the page and compare them against stated income or bank transaction data.
Content extraction and structural forensics are different checks. OCR reads what the document says. Structural forensics reads whether the document’s internal history is consistent with the system that claims to have generated it. A skillfully edited bank statement can pass OCR extraction cleanly — the numbers on the page are internally consistent, the transactions add up, the balance matches the running total. The forgery is invisible at the content layer. It is only visible in the file structure.
Neither your LOS nor your OCR vendor closes this gap. Both process the document as presented. Neither examines whether the document was modified before it arrived.
What the structural signals actually look like
A real bank statement from Wells Fargo, Chase, or Bank of America carries a producer string generated by the bank’s document management system. The creation date reflects when the statement was generated. The modification date is absent or matches the creation date within seconds. The xref table has one entry — the original generation, nothing else.
An edited version of that statement shows a different picture. Here is what HTPBE? returns on a typical altered bank statement submitted in a mortgage application:
```json
{
  "id": "ck_4e2a1b9f-7c3d-4f8e-b2a1-5d0c9e3a7f2b",
  "status": "modified",
  "modification_confidence": "high",
  "modification_markers": ["PRODUCER_MISMATCH", "INCREMENTAL_UPDATES"],
  "creator": "Wells Fargo Document Services",
  "producer": "Smallpdf",
  "xref_count": 3,
  "has_digital_signature": false,
  "creation_date": 1743465600,
  "modification_date": 1743724800
}
```

creator: "Wells Fargo Document Services" alongside producer: "Smallpdf" is not a combination that occurs in any legitimate document workflow. Wells Fargo generates the statement; nothing in a normal mortgage process re-saves it in Smallpdf. The creation-to-modification gap of three days confirms the edit window. The three xref entries (the original generation plus two subsequent edit sessions) tell you how many times it was touched.
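The arithmetic behind that reading can be sketched in a few lines. This assumes the date fields are Unix timestamps in seconds and that each save adds one xref entry, which is how the response above is interpreted in this article. The helper name `summarize_edit_history` is hypothetical.

```python
from datetime import datetime, timezone


def summarize_edit_history(report: dict) -> dict:
    """Derive the edit window and edit-session count from a report dict."""
    created = datetime.fromtimestamp(report["creation_date"], tz=timezone.utc)
    modified = datetime.fromtimestamp(report["modification_date"], tz=timezone.utc)
    return {
        "gap_days": (modified - created).days,
        # Assumption: the first xref entry is the original generation,
        # and every further entry is one later save.
        "edit_sessions": max(report["xref_count"] - 1, 0),
    }


report = {
    "creation_date": 1743465600,
    "modification_date": 1743724800,
    "xref_count": 3,
}
print(summarize_edit_history(report))  # {'gap_days': 3, 'edit_sessions': 2}
```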
For a W-2 that should have originated from ADP Workforce Now and was instead edited in Adobe Acrobat:
```json
{
  "id": "ck_8f1b3d2e-5a4c-4e7f-c3b2-9e1d4a6b8c0f",
  "status": "modified",
  "modification_confidence": "high",
  "modification_markers": ["PRODUCER_MISMATCH", "DIFFERENT_DATES"],
  "creator": "ADP Workforce Now",
  "producer": "Adobe Acrobat DC",
  "xref_count": 2,
  "has_digital_signature": false,
  "creation_date": 1735689600,
  "modification_date": 1743120000
}
```

DIFFERENT_DATES with a 3-month gap between creation and modification on a W-2 is not a timing artifact. W-2s are generated in January and submitted to lenders in January or February of the same year. A modification date in late March on a document with a January creation date means someone opened and re-saved it well after initial generation.
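One way to turn that timing rule into a check is sketched below. It assumes the timestamps are Unix seconds in UTC and takes the end of February as the cutoff, following the January-to-February window described above. The function name and the cutoff are illustrative choices, not API behavior.

```python
from datetime import datetime, timezone


def w2_modified_outside_filing_window(creation_ts: int, modification_ts: int) -> bool:
    """Return True if the file was re-saved after February of its creation year."""
    created = datetime.fromtimestamp(creation_ts, tz=timezone.utc)
    modified = datetime.fromtimestamp(modification_ts, tz=timezone.utc)
    # A later calendar year, or any month past February, falls outside the window.
    return modified.year > created.year or modified.month > 2


# Values from the response above: created 2025-01-01, modified 2025-03-28 (UTC).
print(w2_modified_outside_filing_window(1735689600, 1743120000))  # True
```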
What inconclusive means in a mortgage context
inconclusive is the verdict returned when a document was created with consumer software — Microsoft Word, Google Docs, LibreOffice, Canva — that does not leave the structural markers present in institutionally-generated documents.
For mortgage documents, inconclusive on a document that claims institutional origin is itself a red flag.
A bank statement that returns inconclusive with producer: "Microsoft Word" was not generated by a bank. Banks do not produce statements in Microsoft Word. The document was built from scratch in consumer software, which is not the same as editing an existing statement but is equally disqualifying for underwriting purposes.
The routing rule for mortgage document intake:
- intact: structural signals consistent with claimed origin, proceed to underwriting queue
- modified: post-creation edit detected, route to pre-underwriting review with named markers
- inconclusive on a bank statement, W-2, pay stub, or tax return: document origin inconsistent with institutional claim, treat as modified and hold for review
inconclusive on a borrower-authored document — a personal letter of explanation, a gift letter — is expected and acceptable. The signal only matters when the document claims to come from an institutional source.
Integration at document intake
The check belongs at the moment the PDF arrives in your intake pipeline — before it is attached to the loan file, before it is queued for the underwriter, before OCR extraction runs. At that point, the document URL is available, the file is in storage, and the check takes under three seconds.
```python
import os

import httpx

HTPBE_API_KEY = os.environ["HTPBE_API_KEY"]
BASE_URL = "https://api.htpbe.tech/v1"
HEADERS = {"Authorization": f"Bearer {HTPBE_API_KEY}"}

# Document types that claim institutional origin in a mortgage application
INSTITUTIONAL_DOC_TYPES = {
    "bank_statement",
    "w2",
    "pay_stub",
    "tax_return",
    "asset_letter",
    "employment_letter",
}


def check_loan_document(pdf_url: str, doc_type: str) -> dict:
    """
    Run structural forensics on a loan application document.
    Returns a routing decision and audit metadata.

    doc_type: "bank_statement" | "w2" | "pay_stub" | "tax_return" |
              "asset_letter" | "employment_letter" | "personal_letter"
    """
    # Submit the document URL for analysis
    submit = httpx.post(
        f"{BASE_URL}/analyze",
        headers=HEADERS,
        json={"url": pdf_url},
        timeout=30,
    )
    submit.raise_for_status()
    check_id = submit.json()["id"]

    # Fetch the forensic result for this check
    result = httpx.get(
        f"{BASE_URL}/result/{check_id}",
        headers=HEADERS,
        timeout=30,
    )
    result.raise_for_status()
    data = result.json()

    status = data["status"]
    markers = data.get("modification_markers", [])
    producer = data.get("producer", "")
    creator = data.get("creator", "")

    # Modified: post-creation edit confirmed
    if status == "modified":
        return {
            "action": "hold",
            "queue": "pre_underwriting_review",
            "check_id": check_id,
            "reason": f"Structural modification detected: {', '.join(markers)}",
            "creator": creator,
            "producer": producer,
        }

    # Inconclusive on a document claiming institutional origin
    if status == "inconclusive" and doc_type in INSTITUTIONAL_DOC_TYPES:
        return {
            "action": "hold",
            "queue": "pre_underwriting_review",
            "check_id": check_id,
            "reason": (
                f"{doc_type} produced by consumer software ({producer}), "
                "inconsistent with institutional origin"
            ),
            "producer": producer,
        }

    # Intact, or inconclusive on a personal document
    return {
        "action": "proceed",
        "queue": "underwriting",
        "check_id": check_id,
    }
```

The check_id is stored against the document record in the loan file. If the loan is selected for a QC audit, repurchase review, or regulatory examination, the forensic report is retrievable at any point via GET /api/v1/result/{check_id}. The report includes the verdict, the named markers, the producer and creator strings, and the timestamp of the check. Together these form a complete audit trail attached to the document without requiring the original file.
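How the check ID is stored depends on your loan file system. As a minimal sketch, assuming a local SQLite table as a stand-in for the document record, persistence and later lookup might look like this. The table name `document_checks`, its columns, and the helper names are hypothetical.

```python
import sqlite3
import time
from typing import Optional


def init_store(conn: sqlite3.Connection) -> None:
    """Create the illustrative audit table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS document_checks ("
        "document_id TEXT PRIMARY KEY, check_id TEXT, action TEXT, "
        "reason TEXT, checked_at INTEGER)"
    )


def record_check(conn: sqlite3.Connection, document_id: str, decision: dict) -> None:
    """Persist the check ID and routing decision against a document record."""
    conn.execute(
        "INSERT OR REPLACE INTO document_checks VALUES (?, ?, ?, ?, ?)",
        (
            document_id,
            decision["check_id"],
            decision["action"],
            decision.get("reason", ""),
            int(time.time()),
        ),
    )
    conn.commit()


def lookup_check_id(conn: sqlite3.Connection, document_id: str) -> Optional[str]:
    """Return the stored check ID for a document, or None if never checked."""
    row = conn.execute(
        "SELECT check_id FROM document_checks WHERE document_id = ?",
        (document_id,),
    ).fetchone()
    return row[0] if row else None
```

The stored ID is all an auditor needs to request the full report later; the routing decision and timestamp are kept alongside it for convenience.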
For LOS platforms that process documents via webhook (Blend’s document event hooks, Encompass’s pipeline triggers), the check runs asynchronously on document receipt. The routing decision is applied before the document advances to the next workflow stage.
What the underwriter sees
The structural verdict surfaces alongside the document in the loan file. It is not a score or a probability — it is a named set of signals that the underwriter can read and act on.
For a held document, the underwriter sees:
- The verdict: modified
- The named markers: PRODUCER_MISMATCH, INCREMENTAL_UPDATES
- The creator: Chase Document Management
- The producer: iLovePDF
- The check ID linking to the full forensic report
This is not a black box. The underwriter can explain the hold, escalate it for borrower clarification, or reject the document based on a documented structural finding. That documented finding matters for the compliance angle.
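How these fields are rendered is up to the LOS. A minimal sketch of a plain-text summary is shown below, assuming the response field names from the examples earlier in this article. The function name `format_hold_summary` is hypothetical.

```python
def format_hold_summary(report: dict) -> str:
    """Render the fields an underwriter needs to explain a held document."""
    markers = ", ".join(report.get("modification_markers", [])) or "none"
    lines = [
        f"Verdict: {report['status']}",
        f"Markers: {markers}",
        f"Creator: {report.get('creator', 'unknown')}",
        f"Producer: {report.get('producer', 'unknown')}",
        f"Check ID: {report['id']}",
    ]
    return "\n".join(lines)
```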
Compliance: TRID, HMDA, and adverse action documentation
TRID carries record-retention obligations for the disclosures in the origination file. HMDA reporting includes the reasons for a denial, and adverse action notices under ECOA (Regulation B) must state specific reasons. When a lender rejects a document, or takes adverse action on an application that included fraudulent documents, the regulatory expectation is that the basis for that decision can be stated.
“The bank statement appeared altered” is a subjective finding. “The bank statement returned a modified verdict with markers PRODUCER_MISMATCH and INCREMENTAL_UPDATES — the document’s creator field shows Chase Document Management and the producer field shows iLovePDF, with a modification date three days after creation” is a documented, machine-generated structural finding.
Named structural markers from HTPBE? translate directly into adverse action documentation. The check ID links to a permanent, retrievable record of the finding. The audit trail exists from the moment the document was checked — without any manual step from the underwriter or compliance team.
What this does not catch
Structural forensics has a defined scope. Two patterns fall outside it:
Documents fabricated from scratch in the correct software. If a fraudster creates a bank statement using the same document system a real bank uses, or registers a business with a payroll provider and generates a real ADP pay stub with inflated figures, the structural signals will be consistent with a legitimate document. The content is false; the structure is clean. Forensic PDF analysis cannot detect this pattern. Income verification against source data (Plaid, The Work Number, IRS transcripts) is the appropriate control for this kind of fabricated-at-source fraud.
Encrypted or password-protected PDFs. A PDF with strong encryption cannot be analyzed for structural signals. The check returns inconclusive by necessity. For loan document intake, receiving an encrypted document from a borrower without prior arrangement is itself unusual and worth flagging.
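If the file bytes are available at intake, a cheap pre-check can flag likely encryption before the document is submitted. The sketch below relies on the fact that an encrypted PDF declares an /Encrypt entry in its trailer dictionary. It is a heuristic: the same byte sequence could appear elsewhere in an unencrypted file, so treat a match as a flag for review, not proof. The function name is hypothetical.

```python
def looks_encrypted(pdf_bytes: bytes) -> bool:
    """Heuristic: report whether the raw file contains an /Encrypt entry."""
    # Scans the whole file because the trailer dictionary is not always
    # within a fixed distance of the end of the file.
    return b"/Encrypt" in pdf_bytes
```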
For the fraud pattern that accounts for the majority of LOS document fraud — taking a legitimate PDF and editing it with available tools — structural forensics catches it consistently. That is because the tools available for editing PDFs (Adobe Acrobat, Smallpdf, iLovePDF, PDF24, Microsoft Word’s PDF export) all leave recoverable traces in the file structure.
Where to go from here
The check runs against any document URL your LOS has access to — a presigned S3 URL, a Cloudflare R2 link, a Blob URL. There is no file upload to a third-party service in the critical path.
Teams integrating into Encompass, Blend, or a custom LOS pipeline can start with the free web tool at htpbe.tech to check a sample of application documents from recent closed loans before committing to an API build. The results on a closed-loan sample frequently surface modifications that were missed at origination.
For teams ready to build, API access starts at $15/month with test keys available on all plans for integration testing before live documents are involved.
For the full mortgage use case — including pay stub fraud detection patterns, asset letter signals, and employment letter checks — see mortgage document fraud detection and /solutions/lending. For pay stub specifics, see fake pay stub detection.
Register for API access and run the first check in under ten minutes.