AI-Generated Fake Documents: What PDF Metadata Can Detect

HTPBE Team·10.03.2026·10 min read

This article is a snapshot — content was accurate as of March 2026 (code examples tested against the API as of March 2026). The product evolves actively; specific counts, examples, and detection rules may have changed since publication — see the changelog for the current state.

The cost of creating a visually convincing fake document collapsed in 2024 and 2025. What once required a skilled graphic designer, a few hundred dollars, and several hours now takes a browser, a free prompt, and about thirty seconds. Deloitte projects that AI-driven fraud will generate $40 billion in losses in the United States alone by 2027. Inscribe, a document intelligence company, reported that AI-based document fraud grew fivefold between April and December 2025. The visual inspection methods that once caught most forgeries have quietly stopped working. What comes next is not one single solution but a layered strategy — and understanding exactly what each layer can and cannot do is the starting point.

Why AI Made Visual Inspection Obsolete

Traditional document fraud required physical skill or expensive software. A forged bank statement meant editing a scan in Photoshop, managing fonts, matching backgrounds, and hoping the reviewer would not notice the compression artifacts. It was imperfect work, and careful reviewers caught a meaningful percentage of attempts.

Generative AI changed the economics completely. Modern models can produce a plausible-looking payslip, bank statement, or employment letter from a text description. Some tools are trained specifically on financial documents and can match typography, layout conventions, and even the subtle formatting quirks of specific institutions. The output is not a scan-and-paste job; it is a freshly generated document that has never been photographed or compressed. Compression artifacts, misaligned text, and inconsistent font rendering — the traditional visual tells — are absent.

The FBI’s Internet Crime Complaint Center (ic3.gov) has reported a consistent rise in business email compromise cases that now incorporate AI-generated supporting documents. Fraudsters submit fake bank statements to prove income, fabricated employment letters to support loan applications, and altered invoices to redirect payments. Compliance teams that built their entire detection process around visual review are exposed.

What PDF Metadata Still Reveals

A PDF is not just a visual representation of a document. It is a structured file with a history, and that history is written in metadata. Even when the visual layer of a document has been generated or manipulated with AI, the file structure itself often tells a different story.

Timestamp Inconsistencies

Most PDFs carry a creation date (CreationDate) and, if the file was modified after initial creation, a modification date (ModDate). Both fields are optional in the PDF format, and both are written by the software that created or last edited the document, not by the person who printed or saved it.

A common pattern in AI-generated fakes is a document that claims to be from 2022 or 2023 — a bank statement, a tax return, a university transcript — but whose CreationDate metadata was written by a tool that did not exist until 2025. The content claims one thing; the file structure proves another.

Even when fraudsters are careful about the visible date on the document, they rarely think to alter the embedded timestamp. When they do attempt to alter it, the result is often a different inconsistency: a ModDate that predates the CreationDate, or a timestamp that falls in an implausible range for the claimed document.
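The two timestamp checks described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the product's implementation: it assumes the raw CreationDate and ModDate strings have already been extracted with a PDF library of your choice, the names parse_pdf_date and timestamp_flags are hypothetical, and the 90-day window is an arbitrary placeholder. PDF dates take the form D:YYYYMMDDHHmmSSOHH'mm', where everything after the year is optional.

```python
import re
from datetime import date, datetime, timedelta, timezone

# D:YYYYMMDDHHmmSSOHH'mm' with every component after the year optional.
_PDF_DATE = re.compile(
    r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+-])(?:(\d{2})'?(\d{2})?'?)?)?"
)

def parse_pdf_date(raw):
    """Parse a PDF date string into an aware datetime, or None if unparseable."""
    m = _PDF_DATE.match(raw.strip()) if raw else None
    if not m:
        return None
    year, month, day, hour, minute, second = (
        int(v) if v else default
        for v, default in zip(m.groups()[:6], (0, 1, 1, 0, 0, 0))
    )
    sign, off_h, off_m = m.group(7), m.group(8), m.group(9)
    # Assumption: a missing offset is treated as UTC.
    offset = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
    if sign == "-":
        offset = -offset
    try:
        return datetime(year, month, day, hour, minute, second,
                        tzinfo=timezone(offset))
    except ValueError:  # e.g. month 13 or an out-of-range offset
        return None

def timestamp_flags(creation_raw, mod_raw, claimed_date, max_lag_days=90):
    """Return review flags; claimed_date is the date printed on the document."""
    flags = []
    created = parse_pdf_date(creation_raw)
    modified = parse_pdf_date(mod_raw)
    if created and modified and modified < created:
        flags.append("ModDate precedes CreationDate")
    if created:
        lag = (created.date() - claimed_date).days
        if lag < 0:
            flags.append("file created before the date printed on the document")
        elif lag > max_lag_days:  # placeholder threshold, tune per document type
            flags.append("file created long after the date printed on the document")
    return flags

flags = timestamp_flags("D:20250310091500+01'00'", "D:20250309080000Z",
                        date(2022, 6, 30))
print(flags)
```

A flag here is a reason to look closer, not a verdict: a statement re-downloaded years later can legitimately have a recent creation date.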

Consumer Software Origin

Institutional documents — payslips, bank statements, mortgage letters, academic transcripts — are generated by institutional software. A payslip produced by a large employer comes from an HR platform, an ERP system, or purpose-built payroll software. A bank statement comes from core banking infrastructure. These documents carry producer strings in their metadata that identify the generation software precisely.

When a “bank statement” shows a producer of iLovePDF, Smallpdf, Canva, or Microsoft Word, that is a meaningful signal. Those tools are consumer-grade editing applications. Legitimate banks do not use them to generate customer statements. The presence of consumer software in the producer field does not prove fraud — a user might have legitimately converted or annotated a document — but it means the document’s authenticity cannot be established from file structure alone. This is the inconclusive verdict: not modified, but created by software anyone can access.

The same logic applies to AI document generators. Many output PDFs whose producer strings identify them directly as AI-generation tools or as generic web-to-PDF converters. Others produce files with completely empty metadata fields. Both patterns are detectable.
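As a rough sketch of the producer check, the snippet below matches the Producer and Creator strings against the consumer tools named above. The function name producer_signal and its return labels are hypothetical, and the marker list contains only the four tools mentioned in this article; a real deployment would need a much longer, maintained list.

```python
# Consumer tools named in the article; lowercase for case-insensitive matching.
CONSUMER_TOOL_MARKERS = ("ilovepdf", "smallpdf", "canva", "microsoft word")

def producer_signal(producer, creator=""):
    """Classify Producer/Creator strings as 'empty', 'consumer', or 'unrecognized'."""
    combined = f"{producer or ''} {creator or ''}".strip().lower()
    # Some producer strings embed a registered-trademark sign, e.g. "Microsoft® Word".
    combined = combined.replace("®", "")
    if not combined:
        return "empty"
    if any(marker in combined for marker in CONSUMER_TOOL_MARKERS):
        return "consumer"
    # Not on the list: this does NOT prove the file came from institutional software.
    return "unrecognized"

print(producer_signal("Microsoft® Word 2016"))
print(producer_signal("", None))
```

The third label is deliberately "unrecognized" rather than "institutional": a string that is absent from a short list of consumer tools has not been shown to come from a bank or payroll system.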

Stripped or Empty Metadata

AI document generators frequently produce PDF files with no metadata at all: blank Author, blank Creator, blank Producer, no timestamps. Legitimate institutional documents generated by real systems almost never have empty metadata. A bank’s core banking platform writes its own identifier into every PDF it produces. An HR system embeds the software name and version.

Empty metadata is not proof of fraud, but it is a strong signal that the document did not originate from the claimed institutional source. Combined with content claims — “this is an official statement from a major bank” — an entirely blank metadata profile warrants escalation.
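A blank-profile check is simple to express. The sketch below assumes the document information has already been read into a plain dictionary keyed by field name without the leading slash that some PDF libraries keep; blank_fields and is_blank_profile are hypothetical names.

```python
PROFILE_FIELDS = ("Author", "Creator", "Producer", "CreationDate", "ModDate")

def blank_fields(info):
    """Return the profile fields that are missing or contain only whitespace."""
    return [f for f in PROFILE_FIELDS if not str(info.get(f) or "").strip()]

def is_blank_profile(info):
    """True when every profile field is missing or blank."""
    return len(blank_fields(info)) == len(PROFILE_FIELDS)

print(blank_fields({"Producer": "ExampleWriter 1.0", "CreationDate": "D:2024"}))
print(is_blank_profile({"Author": "  ", "Producer": None}))
```

Returning the list of blank fields, rather than only a yes/no answer, lets a reviewer see partially stripped profiles as well as fully blank ones.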

XRef Table Structure and Incremental Updates

PDFs use a cross-reference table (xref) to track the location of objects within the file. When a PDF is modified after creation, the modification is typically appended to the file as an incremental update, adding a new xref section. Multiple xref sections are therefore a strong structural indicator that a document was edited after it was initially created, with two caveats: linearized ("fast web view") files contain two sections by design, and applying a digital signature also appends an incremental update.

Template-based AI generation — where an attacker starts from a real document and overlays or replaces content — leaves exactly this trace. The original document’s xref structure is preserved, and the modifications appear as incremental additions. Even when the visual result looks clean, the file structure records that edits occurred.
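One rough way to see incremental updates is to count the startxref keywords in the raw file, since each xref section ends with one. The sketch below is a heuristic under stated assumptions, not a parser: it scans raw bytes (so a keyword inside a stream could be miscounted), it allows one extra section for linearized files, which legitimately carry two, and both function names are hypothetical.

```python
import re

def count_xref_sections(pdf_bytes):
    """Count 'startxref <offset>' markers, a rough proxy for xref sections."""
    return len(re.findall(rb"startxref\s+\d+", pdf_bytes))

def looks_incrementally_updated(pdf_bytes):
    """Heuristic: more xref sections than the file layout alone would explain."""
    sections = count_xref_sections(pdf_bytes)
    # Linearized files declare themselves in the first object of the file.
    linearized = b"/Linearized" in pdf_bytes[:1024]
    return sections > (2 if linearized else 1)

# Synthetic stand-ins for real files, just to exercise the functions.
original = b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\nxref\n0 1\ntrailer\n<<>>\nstartxref\n30\n%%EOF\n"
updated = original + b"2 0 obj\n<<>>\nendobj\nxref\n0 1\ntrailer\n<<>>\nstartxref\n90\n%%EOF\n"
print(looks_incrementally_updated(original), looks_incrementally_updated(updated))
```

On a real file, pdf_bytes would come from reading the file in binary mode. A production check would walk the trailer chain with a proper PDF parser instead of pattern matching.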

The Honest Limits of Metadata Analysis

Metadata analysis is a powerful and fast first layer. It is not a complete solution, and being transparent about its limits is more useful than overstating its capabilities.

What it cannot catch: A sophisticated attacker who generates a document from scratch using appropriate software can produce a file whose metadata is entirely consistent. If someone uses a real institutional PDF generation library, sets plausible timestamps, and produces a document with no incremental updates, the file structure will look legitimate. The intact verdict means the document’s structure is consistent — it does not certify the content is truthful.

What inconclusive means: When a document receives an inconclusive verdict, it means the file was not detectably modified, but it was created by consumer-grade software that anyone can access. This is not a failure of the analysis. It is accurate information: the document’s authenticity cannot be confirmed from its file structure. That finding should trigger additional checks, not a pass-through.

The verdict is structural, not content-based: A modified verdict indicates post-creation changes consistent with tampering patterns. It does not mean the document is definitely fraudulent, just as an intact verdict says nothing about whether the content is true. The analysis reflects structural signals; human judgment and further checks determine the final decision.

Metadata analysis is most valuable as a fast, cheap first filter that separates documents worth closer scrutiny from those that are structurally consistent. It is not a replacement for the full fraud detection chain.

A Practical Multi-Layer Defense

The practical implication of AI-enabled fraud is that no single detection method suffices. The question is how to structure a fraud-detection workflow that catches most fraud at acceptable cost.

Layer 1: PDF Metadata Check

A metadata check costs a fraction of a dollar per document and returns a result in seconds. It catches:

Documents modified after creation (the most common tampering pattern)
Documents created by consumer tools when institutional software is claimed
Empty or inconsistent metadata profiles
Structural evidence of template-based generation

On the Growth plan, this layer is cheap enough to apply to every document in a workflow. The goal is not to pass or fail documents definitively; it is to triage. Documents with intact verdicts from plausible producer software move forward. Documents flagged as modified or inconclusive escalate to the next layer.
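The triage rule in the paragraph above can be written as a small routing function. The three verdict names come from this article; the function name, the producer_plausible flag, and the return labels are hypothetical and do not describe the actual API response.

```python
def route(verdict, producer_plausible=True):
    """Map a structural verdict to the next workflow step."""
    if verdict == "intact" and producer_plausible:
        return "proceed"
    # Modified, inconclusive, an implausible producer, or any unknown value:
    # fail closed and send the document to the next layer.
    return "escalate"

for v in ("intact", "modified", "inconclusive"):
    print(v, "->", route(v))
```

Treating unknown verdict values as escalations is a deliberate choice here: a new or misspelled verdict should never let a document through by default.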

Layer 2: Full KYC or Manual Review

Full document verification through a KYC provider (identity document liveness checks, database cross-referencing, institutional confirmation) costs $0.50 to $5.00 per document depending on the provider and depth of check. Applied to every incoming document, this is prohibitively expensive for many workflows. Applied only to documents that are flagged as modified or inconclusive by metadata analysis, the cost becomes manageable.

The economics are straightforward. Take a batch of 1,000 submitted documents. If 80% pass metadata analysis and 20% are flagged, full KYC applies only to the flagged 200. At $2 per full check, that is $400 plus the cost of the metadata checks, compared with $2,000 for full KYC on all 1,000 documents. The metadata layer pays for itself many times over.
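The comparison can be worked through for a batch of 1,000 documents. In the sketch below the 20% flag rate and the $2 KYC price come from the paragraph above; the $0.10 metadata price is a placeholder assumption chosen only to make the arithmetic concrete, not a quoted price. Amounts are kept in integer cents to avoid floating-point rounding.

```python
def kyc_cost_comparison(documents, flagged_percent, kyc_cents, metadata_cents):
    """Return (cost of KYC on everything, cost of metadata-first triage) in cents."""
    kyc_on_all = documents * kyc_cents
    flagged = documents * flagged_percent // 100
    layered = documents * metadata_cents + flagged * kyc_cents
    return kyc_on_all, layered

# metadata_cents=10 is a placeholder assumption, not a published price.
kyc_on_all, layered = kyc_cost_comparison(
    documents=1000, flagged_percent=20, kyc_cents=200, metadata_cents=10
)
print(f"KYC on all: ${kyc_on_all / 100:,.2f}  metadata first: ${layered / 100:,.2f}")
```

With these inputs the full-KYC route costs $2,000 and the layered route costs $500. The gap narrows as the flag rate rises, so the flag rate observed in your own document stream is the number worth measuring.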

Layer 3: Direct Confirmation

For high-value decisions — large loans, significant contract signings, regulatory submissions — direct confirmation with the issuing institution remains the gold standard. This means contacting the bank, employer, or university through their publicly listed contact channels (not contact details from the submitted document itself) to confirm the document’s authenticity.

No automated system replaces this step for the highest-stakes decisions. Automated metadata and KYC checks reduce the volume of documents requiring this treatment; they do not eliminate the need for it.

What This Means for Your Workflow

For compliance officers and fraud teams, the practical action is to treat metadata analysis as a baseline requirement — the minimum viable first filter that should already be in place, not a sophisticated enhancement.

For HR teams processing employment verification documents, tax returns, and identity submissions, the inconclusive verdict deserves as much attention as modified. A resume attachment created in Canva is not inherently fraudulent, but a bank statement created in Canva warrants a direct call to the claimed bank. In healthcare credentialing, the stakes escalate further: a medical license that was edited in a consumer PDF tool before submission carries the same structural signals, but the consequences of missing it include patient safety exposure and regulatory sanctions.

For finance teams processing invoices and payment documentation, the producer field in PDF metadata is a faster and more reliable check than visual inspection for detecting modifications. An invoice from an accounting system that suddenly arrives as a file created in a generic PDF editor is a reason to stop and check, regardless of how professional the invoice looks.

The AI fraud wave is not slowing down. The tools that enable it are getting cheaper and more accessible. Building a fraud-detection workflow that starts with fast, cheap structural checks and escalates intelligently is the difference between catching fraud before payment and discovering it afterward.

Metadata analysis will not catch every AI-generated fake. Nothing will. But it catches a meaningful percentage of them at a cost that makes it unreasonable not to use — and it does so before a fraudulent document has caused any harm.