PDF Forensics Without the Original File: One-Sided Verification

Code examples verified against the API as of April 2026. If the API has changed since then, check the changelog.
The most common question we get from teams evaluating PDF verification tools: "Do we need to keep a copy of the original?"
The answer is no — and understanding why reveals something important about how PDF files work and why tampering is harder to hide than most people assume.
The comparison trap
The obvious approach to detecting whether a document was modified is comparison: keep the original, compare it to what you received, flag the differences. This is how tools like Draftable, Adobe Compare Documents, and diff utilities work.
The comparison approach has a structural problem: it requires you to have the original. In most fraud scenarios, you do not. You received an invoice, a diploma, a bank statement, or a contract from a counterparty. You have the document they sent you. You do not have what the document looked like before they sent it.
If an attacker modified the bank account number on an invoice before sending it to you, you have no original to compare against. You have one file. Comparison-based tools cannot help you.
Forensic analysis works differently. It does not compare the document against anything external. It reads what the document preserved about its own history.
What a PDF knows about itself
The PDF specification was designed around a model of incremental updates. When a PDF is edited and saved, the editing software does not rewrite the entire file — it appends the changes to the end and adds a new cross-reference table pointing to the updated objects. The original content remains in the file.
This design decision, intended for efficiency, has a forensic consequence: a modified PDF carries a structural record of that modification.
When a document is analyzed without the original, the analysis reads:
The cross-reference (xref) chain. How many edit sessions does the file contain? Each incremental update adds at least one xref section. A document created in one session has one xref. A document created and then edited has at least two. The chain length tells you how many times the file changed hands between creation and analysis.
The metadata record. The PDF Info dictionary stores CreationDate and ModDate as part of the file’s standard metadata. These fields record when the document was created and when it was last modified. When a document submitted as a fresh bank statement shows a modification date six months after the creation date, the document was modified — even without a copy of the bank statement from six months ago to compare against.
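These timestamps are stored as PDF date strings such as D:20251201090000+00'00'. A minimal sketch of the consistency check, under stated assumptions: it parses only the date-time prefix and ignores timezone offsets, and the one-minute grace window is illustrative, not part of any specification.

```python
import re
from datetime import datetime, timedelta

def parse_pdf_date(value: str) -> datetime:
    """Parse the date-time prefix of a PDF date string such as D:20251201090000."""
    m = re.match(r"D:(\d{4})(\d{2})(\d{2})(\d{2})?(\d{2})?(\d{2})?", value)
    if not m:
        raise ValueError(f"not a PDF date string: {value!r}")
    parts = [int(g) for g in m.groups() if g is not None]
    return datetime(*parts)

def timestamps_inconsistent(creation: str, modification: str) -> bool:
    """Flag a file whose ModDate is meaningfully later than its CreationDate."""
    delta = parse_pdf_date(modification) - parse_pdf_date(creation)
    # Many writers stamp both fields seconds apart at creation time,
    # so allow a small grace window before flagging.
    return delta > timedelta(minutes=1)
```

A statement created in December 2025 but modified in March 2026 trips the check; a file whose two timestamps differ by seconds does not.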
The authoring trail. The Creator field identifies the software used to create the document. The Producer field identifies the software that last processed it. When a document claims to be a corporate tax filing but was last processed by a free consumer PDF editor, the authoring trail contradicts the claimed origin.
The signature binding. Digital signatures in PDFs cryptographically bind a checksum to a specific byte range of the file. When content is appended after a signature, the signature remains valid for the content it covers — but the document now contains unsigned content. This pattern is detectable without the original: the signature’s byte range does not extend to the end of the file, and xref entries exist outside the signed range.
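Both the xref-chain and signature-binding reads reduce to byte-level scans of the file. A simplified sketch, with caveats: a production xref parser must also follow /Prev links and handle cross-reference streams, and real /ByteRange arrays are located via the signature dictionary rather than a bare pattern match.

```python
import re

def count_xref_sections(pdf_bytes: bytes) -> int:
    """Each save appends a 'startxref' pointer; more than one suggests incremental updates."""
    return len(re.findall(rb"startxref", pdf_bytes))

def signature_covers_tail(pdf_bytes: bytes) -> bool:
    """
    A signature's /ByteRange is [off1 len1 off2 len2]: two spans surrounding
    the signature contents. If off2 + len2 falls short of the file length,
    content was appended after signing.
    """
    m = re.search(rb"/ByteRange\s*\[\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*\]", pdf_bytes)
    if not m:
        return False  # no signature found
    off1, len1, off2, len2 = (int(g) for g in m.groups())
    return off2 + len2 >= len(pdf_bytes)
```

On a once-saved file, count_xref_sections returns 1; every subsequent edit session adds at least one more startxref marker.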
The limits of one-sided analysis
Honesty about what this approach cannot detect:
Content changes to unsigned documents. If an attacker edits a text field in an unsigned PDF and the editing software rewrites the file cleanly — removing evidence of prior xref structure, preserving or backdating timestamps — the structural record may not reveal the edit. This requires sophisticated tooling and deliberate counter-forensic effort. It is possible, but it is rare in practice: most fraud relies on off-the-shelf PDF editors that leave clear traces.
Fabricated documents created from scratch. If an attacker builds a document in the same software a legitimate issuer uses, sets plausible timestamps, and produces a structurally consistent file — one-sided analysis cannot distinguish it from a genuine document. Structural forensics alone cannot verify that the content is truthful, only that the file is structurally consistent with its claimed origin.
Encrypted documents. Strong encryption prevents reading the structural content of a PDF. Analysis returns inconclusive when the file cannot be read.
For the fraud patterns that account for the vast majority of real-world document fraud — editing existing documents with consumer software, modifying bank details on legitimate invoices, altering figures on real financial statements — one-sided forensic analysis catches them reliably.
Why most fraud does not evade one-sided analysis
Real-world invoice fraud and document tampering are not performed by people with forensic knowledge of the PDF specification. They are performed by attackers who:
- Download a legitimate PDF
- Open it in iLovePDF, PDF24, Adobe Acrobat Reader (free), or a similar tool
- Change the relevant text
- Save and send
Every step of this process leaves marks:
The consumer editor rewrites the Producer field to its own name. It creates a new xref entry. It updates the modification date. It may not preserve the original structure cleanly.
The resulting file shows all the signals that one-sided analysis reads: multiple xref tables, a Producer that contradicts the document’s claimed origin, a modification date inconsistent with the claimed issuance date.
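Turning those signals into a verdict is mostly rule-matching over the extracted fields. A hedged sketch of that logic (the marker names mirror the example API output in this article; the consumer-editor list is illustrative, not exhaustive):

```python
# Known consumer editors whose appearance in the Producer field contradicts
# a claimed institutional origin. Illustrative list, not exhaustive.
CONSUMER_EDITORS = {"smallpdf", "ilovepdf", "pdf24", "sejda"}

def derive_markers(creator: str, producer: str, xref_count: int) -> list[str]:
    """Derive tamper markers from extracted structural fields."""
    markers = []
    if creator and producer and producer.lower() != creator.lower():
        # A Producer naming a consumer editor contradicts the Creator's origin claim.
        if any(editor in producer.lower() for editor in CONSUMER_EDITORS):
            markers.append("PRODUCER_MISMATCH")
    if xref_count > 1:
        # More than one xref section means the file was saved more than once.
        markers.append("INCREMENTAL_UPDATES")
    return markers
```

Applied to a bank-issued statement resaved in a free editor, this yields both PRODUCER_MISMATCH and INCREMENTAL_UPDATES; a file whose Producer is consistent with its Creator and which has a single xref section yields none.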
A practical example: bank statement fraud
A lender requests three months of bank statements from a loan applicant. The applicant downloads genuine statements from their bank’s online portal — PDF files generated by the bank’s document system. They then edit the statements to inflate the account balances, using a free online PDF editor.
The lender has no original statements. They cannot compare the submitted documents against bank records. What they have is one PDF per month.
What those PDFs preserve:
- Creator: HSBC Document Service (the bank’s original authoring software)
- Producer: Smallpdf (the free editor used to modify the file)
- CreationDate: 2025-12-01 (when the bank generated the statement)
- ModDate: 2026-03-15 (when the applicant modified it before submission)
- xref_count: 3 (original creation + two editing sessions)
The forensic verdict: MODIFIED with markers PRODUCER_MISMATCH and INCREMENTAL_UPDATES.
The lender did not need the original bank statements. The modified file told them everything.
The API call
curl -X POST https://api.htpbe.tech/v1/analyze \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://your-storage.example.com/statements/statement-dec.pdf"}'
The response includes both the verdict and the specific signals:
{
  "status": "modified",
  "modification_confidence": "high",
  "modification_markers": ["PRODUCER_MISMATCH", "INCREMENTAL_UPDATES"],
  "creator": "HSBC Document Service",
  "producer": "Smallpdf",
  "creation_date": 1764547200,
  "modification_date": 1773532800,
  "xref_count": 3
}
No original file required. No comparison. One HTTP call.
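The same call from Python, using only the standard library. This sketch builds the request object but does not send it, so nothing leaves your machine until you pass it to urllib.request.urlopen with a real key:

```python
import json
import urllib.request

API_ENDPOINT = "https://api.htpbe.tech/v1/analyze"

def build_analyze_request(document_url: str, api_key: str) -> urllib.request.Request:
    """Build the POST request for one-sided analysis; send with urllib.request.urlopen."""
    payload = json.dumps({"url": document_url}).encode("utf-8")
    return urllib.request.Request(
        API_ENDPOINT,
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send: `result = json.load(urllib.request.urlopen(build_analyze_request(doc_url, api_key)))`.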
When to use one-sided analysis vs. comparison
Use one-sided forensic analysis when:
- You receive documents from third parties and do not have originals
- You process documents at scale and need automation
- You need to audit a backlog of documents you received in the past
- You are building this into an API workflow
Use comparison when:
- You have both the original and the received version
- You want to see exactly what changed (field by field, pixel by pixel)
- You are doing manual forensic review of a specific document with known original
These approaches are complementary, not competing. Forensic analysis tells you whether a document was modified. Comparison shows you what changed. In most fraud scenarios, you only have the modified version — comparison is not available to you.
Integrating into document intake
The pattern that works for any document intake workflow:
def process_received_document(doc_url: str, doc_type: str) -> dict:
    """
    Run one-sided forensic analysis on a received document.
    doc_type: "bank_statement" | "invoice" | "contract" | "certificate"
    """
    result = verify_pdf(doc_url)  # see developer guide for full client code
    if result["status"] == "intact":
        return {"action": "accept", "check_id": result["id"]}
    if result["status"] == "modified":
        return {
            "action": "reject",
            "check_id": result["id"],
            "markers": result["modification_markers"],
            "note": "Document shows forensic signs of post-creation modification",
        }
    # Neither intact nor modified: the verdict is inconclusive.
    # For documents claiming institutional origin, treat as suspicious.
    institutional_doc_types = {"bank_statement", "tax_certificate", "official_contract"}
    if doc_type in institutional_doc_types:
        return {
            "action": "review",
            "check_id": result["id"],
            "note": f"Document origin ({result['producer']}) inconsistent with {doc_type}",
        }
    return {"action": "accept", "check_id": result["id"]}
The check_id is stored alongside the document record. If a document is later disputed, the forensic report is retrievable via GET /api/v1/result/{check_id} — a permanent audit trail that does not require storing the original document.
What this means for compliance teams
For compliance workflows that receive third-party documents — KYC onboarding, loan applications, insurance claims, contract execution — one-sided verification changes the math on document fraud detection. The same principle applies in government procurement and benefits programs, where agencies receive permits, contractor credentials, and benefit application support documents from external parties without holding originals to compare against.
Previously: catch fraud only when the original is available for comparison, or rely on manual review that misses structural signals.
With forensic analysis: every received document is automatically checked against its own preserved history. The document does not need to be compared against an original. It just needs to have been modified.
This is not foolproof — sophisticated forgery from scratch evades it. But it reliably catches the category of fraud that accounts for the majority of real-world document manipulation: taking a legitimate document and editing it with available tools.