Binary PDF Modification Detection: How It Works and Where It Fails
The question sounds simple: has this PDF been edited after it was originally created? The answer, it turns out, requires understanding how PDF files store their history at the byte level.
This article explains the technical approach behind binary PDF modification detection — what signals we look for, why they are reliable, and critically, where the method breaks down. If you want a quick answer without the technical detail, HTPBE.tech is an online tool that tells you whether a PDF has been edited (yes/no) in seconds. If you want to understand how that answer is produced, read on.
How a PDF File Is Structured
Before discussing detection, you need to understand the PDF file format at a structural level.
A PDF file consists of four components:
- Header — declares the PDF version (e.g., `%PDF-1.7`)
- Body — a collection of numbered objects (pages, fonts, images, metadata)
- Cross-reference table (xref) — an index mapping object numbers to byte offsets
- Trailer — points to the xref table and the document root object
A minimal trailer looks like this:
trailer
<<
/Size 42
/Root 1 0 R
/Info 2 0 R
>>
startxref
9876
%%EOF
The /Info entry points to the document information dictionary — this is where metadata lives.
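As a small illustration, here is a sketch (not a full PDF parser) that extracts the `startxref` offset from raw bytes, using a sample trailer like the one above. A real tool would then seek to that offset and parse the xref table; this only shows the trailer-reading step.

```python
import re

# Sample bytes modeled on the minimal trailer above (not a complete PDF).
sample = (
    b"trailer\n<< /Size 42 /Root 1 0 R /Info 2 0 R >>\n"
    b"startxref\n9876\n%%EOF\n"
)

def find_startxref(data: bytes) -> int:
    """Return the byte offset of the most recent xref table."""
    # The last occurrence wins: it belongs to the newest revision's trailer.
    idx = data.rfind(b"startxref")
    if idx == -1:
        raise ValueError("no startxref keyword found")
    m = re.match(rb"startxref\s+(\d+)", data[idx:])
    if m is None:
        raise ValueError("malformed startxref section")
    return int(m.group(1))

offset = find_startxref(sample)  # 9876 for the sample above
```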
The Metadata Dictionary
The PDF Information Dictionary (/Info) contains fields that describe the document’s origin and history:
| Field | Meaning |
|---|---|
| `/CreationDate` | When the document was originally created |
| `/ModDate` | When the document was last modified |
| `/Creator` | Application that created the source document (e.g., Microsoft Word) |
| `/Producer` | PDF library or printer that generated the PDF (e.g., macOS Quartz PDFContext) |
| `/Author` | Document author |
| `/Title` | Document title |
Dates are encoded in a specific format:
D:20231015143022+03'00'
This means: 2023-10-15 at 14:30:22, UTC+3. The timezone offset is significant — we will come back to it.
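A minimal sketch of parsing this format in Python; `parse_pdf_date` is a hypothetical helper covering the common shapes of the string (offset, `Z`, missing fields), not the full grammar from the specification.

```python
import re
from datetime import datetime, timedelta, timezone

def parse_pdf_date(raw: str) -> datetime:
    """Parse a PDF date string such as D:20231015143022+03'00'."""
    m = re.match(
        r"D:(\d{4})(\d{2})(\d{2})(\d{2})?(\d{2})?(\d{2})?"
        r"([+\-Z])?(\d{2})?'?(\d{2})?",
        raw,
    )
    if m is None:
        raise ValueError(f"not a PDF date string: {raw!r}")
    y, mo, d = int(m.group(1)), int(m.group(2)), int(m.group(3))
    h = int(m.group(4) or 0)
    mi = int(m.group(5) or 0)
    s = int(m.group(6) or 0)
    sign = m.group(7)
    if sign in ("+", "-"):
        off = timedelta(hours=int(m.group(8) or 0), minutes=int(m.group(9) or 0))
        tz = timezone(off if sign == "+" else -off)
    else:
        tz = timezone.utc  # 'Z' or absent: treat as UTC
    return datetime(y, mo, d, h, mi, s, tzinfo=tz)

dt = parse_pdf_date("D:20231015143022+03'00'")
```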
The Creator/Producer distinction matters enormously. When someone writes a document in Microsoft Word and exports it to PDF:
`Creator = Microsoft Word`, `Producer = Microsoft Word` (or a Word-internal PDF engine)
When someone then opens that PDF in Adobe Acrobat and edits it:
`Creator` stays as `Microsoft Word` (preserved from the original); `Producer` changes to `Adobe Acrobat 23.0` (the tool that wrote the new version).
This mismatch is a strong signal of modification.
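A minimal sketch of this check, assuming an allow-list of known export pairings. The `KNOWN_EXPORT_PAIRS` table here is illustrative, not a real tool's knowledge base.

```python
# Illustrative allow-list: (Creator, Producer) pairs that a single export
# workflow would legitimately produce. A real knowledge base is far larger.
KNOWN_EXPORT_PAIRS = {
    ("Microsoft Word", "Microsoft Word"),
    ("Microsoft Word", "Microsoft: Print To PDF"),
}

def producer_mismatch(creator: str, producer: str) -> bool:
    """True when the Producer is not a known export partner of the Creator."""
    return (creator, producer) not in KNOWN_EXPORT_PAIRS

flag = producer_mismatch("Microsoft Word", "Adobe Acrobat 23.0")
```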
The Incremental Save Mechanism
This is the core of PDF modification detection, and it is what makes PDFs fundamentally different from most other file formats.
When you edit a PDF and save it, most PDF editors do not rewrite the entire file. Instead, they append changes to the end:
- The original file body is left intact
- New or modified objects are appended after `%%EOF`
- A new cross-reference table is appended, containing entries for the changed objects
- A new trailer is appended with a `/Prev` field pointing to the byte offset of the previous xref table
[original body]
%%EOF
[new objects]
xref
[new xref entries]
trailer
<<
/Size 55
/Root 1 0 R
/Prev 9876 ← points to original xref
>>
startxref
14523
%%EOF
This incremental update mechanism exists for good reasons: it enables fast partial saves, supports undo history, and is required for certain digital signature workflows. But it also leaves a clear paper trail.
A PDF with multiple %%EOF markers, multiple xref sections, or a trailer with /Prev has been saved more than once. This does not automatically mean fraud — legitimate workflows involve multiple saves. But it is a definitive indicator that the file was modified after initial creation.
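The simplest version of this check can be sketched by counting `%%EOF` markers in the raw bytes; the sample byte strings below are toy stand-ins, not real PDFs.

```python
def count_saves(data: bytes) -> int:
    """Count %%EOF markers: in a well-formed file, one per save operation."""
    return data.count(b"%%EOF")

# Toy byte strings modeling a single-save file and an incremental update.
original = b"%PDF-1.7\n[original body]\nstartxref\n9876\n%%EOF\n"
edited = original + b"[new objects]\ntrailer\n<< /Prev 9876 >>\nstartxref\n14523\n%%EOF\n"
```

One caveat: a linearized (web-optimized) PDF legitimately contains an extra xref section and `%%EOF`, so a robust tool walks the xref chain rather than naively counting markers.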
What Automated Detection Actually Examines
A detection tool like HTPBE examines several independent signals and combines them into a confidence score.
1. Timestamp Consistency
The simplest check: is ModDate later than CreationDate? If yes, the file was modified. But there are subtleties:
- `ModDate == CreationDate` does not mean the file was never edited. Some editors reset `ModDate` to match `CreationDate` to hide modifications.
- `ModDate` earlier than `CreationDate` is impossible under normal conditions and indicates tampering with metadata.
- Timezone changes between `CreationDate` and `ModDate` (e.g., creation in `+03'00'` but modification in `Z`) suggest the file was modified on a different machine or with a different tool.
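These timestamp checks can be sketched with a hypothetical helper operating on already-parsed, timezone-aware datetimes:

```python
from datetime import datetime, timedelta, timezone

def timestamp_signals(created: datetime, modified: datetime) -> list[str]:
    """Return the timestamp anomalies described above, if any."""
    signals = []
    if modified < created:
        signals.append("impossible: ModDate precedes CreationDate")
    elif modified > created:
        signals.append("file modified after creation")
    if created.utcoffset() != modified.utcoffset():
        signals.append("timezone changed between creation and modification")
    return signals

# Created at UTC+3, modified later on a machine writing UTC ('Z') dates.
created = datetime(2023, 10, 15, 14, 30, 22, tzinfo=timezone(timedelta(hours=3)))
modified = datetime(2023, 10, 16, 9, 0, 0, tzinfo=timezone.utc)
signals = timestamp_signals(created, modified)
```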
2. Cross-Reference Table Analysis
Counting xref sections reveals the number of save operations:
- 1 xref section: the file was saved exactly once — typical for freshly exported PDFs
- 2+ xref sections: the file was modified at least once after initial creation
- 3+ xref sections: multiple rounds of editing
The content of the xref updates also matters. If the updated objects include the /Info dictionary (metadata), it suggests metadata was modified after the fact.
3. Creator/Producer Mismatch Analysis
We maintain a knowledge base of known PDF creation tool combinations. Some mismatches are routine:
- `Creator: Word` + `Producer: Adobe PDF Printer` — normal, Word printed to PDF
- `Creator: Google Docs` + `Producer: Skia/PDF` — normal Google Docs export
Some mismatches are suspicious:
- `Creator: Microsoft Excel` + `Producer: iLovePDF` — the file was processed by an online editor
- `Creator: LibreOffice` + `Producer: Adobe Acrobat` — the file was opened and re-saved in Acrobat
- `Creator:` (empty) + `Producer: FPDF` — programmatically generated, may be a reconstructed document
The tool also looks for editors known to be used for document fraud: certain online PDF editors, specific versions of commercial tools with known history-stripping features.
4. Digital Signature Integrity
PDF supports cryptographic digital signatures (/Sig objects). When a signed PDF is modified:
- The signature covers specific byte ranges of the file
- If content outside those byte ranges is modified, the signature becomes invalid
- If the signature object itself is removed and the document re-saved, the absence of a previously-present signature is detectable via the revision history
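The coverage check can be sketched as follows. `/ByteRange` is the real PDF key (an array `[offset1, length1, offset2, length2]` naming two signed spans, with the gap between them holding the signature's `/Contents` hex string); the helper itself is a simplified illustration.

```python
def byte_range_covers_file(byte_range: list[int], file_size: int) -> bool:
    """True when the two signed spans in /ByteRange account for the whole
    file (apart from the gap holding the signature value). Bytes beyond the
    second span were appended after signing, i.e. an incremental update."""
    off1, len1, off2, len2 = byte_range
    return off1 == 0 and off1 + len1 <= off2 and off2 + len2 == file_size

# Signed spans 0..1000 and 1500..2000; gap 1000..1500 holds /Contents.
intact = byte_range_covers_file([0, 1000, 1500, 500], file_size=2000)
appended = byte_range_covers_file([0, 1000, 1500, 500], file_size=2200)
```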
5. Object Stream Anomalies
PDF 1.5+ supports compressed object streams. Legitimate PDF generators produce consistent, well-formed object streams. Anomalies we check for:
- Objects with inconsistent generation numbers
- Orphaned objects not reachable from the document root
- Objects that exist in the xref table but whose byte offsets point to different object types (indicates partial overwrite attempts)
Putting It Together: The Confidence Score
No single signal is conclusive. A document with ModDate > CreationDate might be completely legitimate. A document with matching timestamps might have been forged with a tool that resets them.
Detection tools combine multiple signals into a weighted confidence score:
| Signal | Weight | Rationale |
|---|---|---|
| Multiple xref sections | High | Hard to fake without specialized knowledge |
| Creator/Producer mismatch with known editor | Medium-High | Common in legitimate workflows too |
| Timestamp anomaly (impossible date) | High | Never legitimate |
| Timestamp anomaly (reset to match) | Medium | Detectable but requires suspicion |
| Known fraud-associated producer | Medium | Context-dependent |
| Digital signature removed | High | Strong indicator of post-signing modification |
A score above a threshold produces a “Modified” verdict. Below the threshold: “Original”. The threshold is calibrated to minimize false positives on legitimate documents.
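A minimal sketch of this scoring step. The weight values and threshold below are illustrative assumptions mirroring the table above, not the calibrated values of HTPBE or any other tool.

```python
# Illustrative weights, roughly ordered like the table above (assumptions).
WEIGHTS = {
    "multiple_xref_sections": 0.35,
    "creator_producer_mismatch": 0.20,
    "impossible_timestamp": 0.35,
    "timestamp_reset_to_match": 0.15,
    "fraud_associated_producer": 0.15,
    "signature_removed": 0.30,
}
THRESHOLD = 0.5  # hypothetical calibration point

def verdict(signals: set[str]) -> str:
    """Combine detected signals into a Modified/Original verdict."""
    score = sum(WEIGHTS[s] for s in signals)
    return "Modified" if score >= THRESHOLD else "Original"
```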
False Positives: When Unmodified PDFs Look Modified
This is where the method has real limitations.
Legitimate workflows that trigger modification signals:
- PDF/A conversion — archival format conversion modifies the file structure. `ModDate` will differ from `CreationDate` even if content is unchanged.
- Digital signing — adding a digital signature is a modification. A signed-after-creation document will show multiple xref sections even if the content was never altered.
- Optimization — tools like Ghostscript, qpdf, or Adobe Acrobat's "Reduce File Size" rewrite the entire file, resetting the xref structure but potentially preserving old timestamps.
- Printer workflows — some enterprise print systems re-process PDFs (adding watermarks, headers, or compliance stamps) as a routine step.
- Email clients — some email clients modify PDF attachments (adding metadata, stripping features) during send or receive.
- Merging or splitting — combining PDFs with tools like PDFtk or iLovePDF creates a new document with a new `Producer` and potentially inconsistent metadata from the source files.
In all these cases, the document content may be perfectly authentic, but the structural signals indicate modification. This is why results should always be interpreted in context.
False Negatives: When Modified PDFs Look Original
More dangerous is the opposite case: a forged document that passes detection.
Techniques that evade detection:
- Full reconstruction — printing to PDF (or using print → save as PDF) creates a completely new file with a fresh xref, consistent timestamps, and a single clean structure. All modification traces are lost. The resulting PDF looks freshly created, even if it was forged.
- Metadata scrubbing — tools like `exiftool` or custom scripts can overwrite all metadata fields, including timestamps and creator information, before the file is distributed. A sophisticated forger resets `CreationDate = ModDate` and sets `Producer` to match the expected original tool.
- Re-export from source format — if the forger has access to the source document (a Word file, for example), they can modify the source and re-export to PDF. The resulting PDF has no modification traces because it was never "modified" — it was created fresh from a modified source.
- PDF reconstruction tools — specialized tools can parse a modified PDF and reconstruct it as a clean single-revision file, stripping all incremental save history.
The fundamental limitation: binary detection works on file structure, not content. We can detect that a file’s structure shows signs of modification, but we cannot verify whether the visible text or numbers are correct. A forger skilled enough to reconstruct the PDF properly can evade structural detection entirely.
What This Means in Practice
Binary PDF modification detection is a fast, automated first-line check. It catches:
- Naive modifications made with standard PDF editors (the vast majority of document fraud cases)
- Metadata inconsistencies that reveal post-creation editing
- Structural anomalies that indicate multiple save operations
It does not catch:
- Sophisticated forgeries where the PDF was fully reconstructed
- Source-level modifications (edit the Word doc, re-export)
- Cases where the file was “legitimately” processed by a PDF workflow tool
For high-stakes verification (legal documents, financial instruments, regulated industries), automated detection should be combined with:
- Human review of the document content for semantic consistency
- Verification of the document’s origin through the issuing party
- Cryptographic verification if the original was digitally signed
- Chain-of-custody controls that prevent the question from arising
Technical Implementation Notes
For developers building similar tools, the key libraries:
- pdf-lib (JavaScript/Node.js) — parse PDF structure, read xref tables, access object streams
- PyMuPDF / fitz (Python) — comprehensive PDF analysis, supports most PDF versions
- pdfminer.six (Python) — lower-level access to PDF internals
- iTextSharp (.NET) — commercial-grade PDF analysis
The core algorithm:
- Parse the xref table(s) — count revisions
- Extract the `/Info` dictionary — compare timestamps, analyze Creator/Producer
- Walk the xref chain via `/Prev` pointers — identify which objects changed between revisions
- Check for `/Sig` objects — verify digital signature presence and coverage
- Score each signal — weight by reliability and combine
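To make the pipeline concrete, here is a toy end-to-end sketch working directly on raw bytes. It is deliberately naive (regex over bytes, no real object parsing); a production tool would use one of the libraries above. The `sample` bytes and the `analyze` helper are assumptions for illustration only.

```python
import re

def analyze(data: bytes) -> dict:
    """Toy structural analysis over raw PDF bytes (sketch, not a parser)."""
    revisions = data.count(b"%%EOF")          # save operations (naive count)
    has_prev = b"/Prev" in data               # incremental-update trailer key
    info = {}
    # Pull string-valued /Info fields like /Producer (Adobe Acrobat 23.0).
    for key in (b"/Creator", b"/Producer", b"/CreationDate", b"/ModDate"):
        m = re.search(re.escape(key) + rb"\s*\(([^)]*)\)", data)
        if m:
            info[key.decode()[1:]] = m.group(1).decode("latin-1")
    return {"revisions": revisions, "incremental_update": has_prev, "info": info}

# Toy file: one original revision plus one incremental update.
sample = (
    b"%PDF-1.7\n1 0 obj\n<< /Creator (Microsoft Word) "
    b"/Producer (Adobe Acrobat 23.0) >>\nendobj\nstartxref\n9\n%%EOF\n"
    b"2 0 obj\n<< >>\nendobj\ntrailer\n<< /Prev 9 >>\nstartxref\n99\n%%EOF\n"
)
report = analyze(sample)
```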
The implementation challenge is handling malformed PDFs. Real-world documents frequently deviate from the PDF specification. A robust tool must handle:
- Missing or corrupted xref tables (PDF readers use recovery mode)
- Linearized PDFs (a different structure used for web-optimized files)
- Encrypted PDFs (metadata may be accessible even if content is not)
- PDF 2.0 documents with new structural features
Conclusion
Binary PDF modification detection is a reliable heuristic, not a cryptographic proof. It answers the question “does this file’s structure suggest modification?” with high accuracy for everyday document fraud. Against a skilled adversary who understands PDF internals, structural detection alone is insufficient.
The honest answer to “has this PDF been edited?” is: we can tell you what the file’s structure reveals. For most fraud cases — edited invoices, tampered bank statements, modified contracts — that is enough.
HTPBE.tech implements the detection approach described in this article. The web tool is free, requires no registration, and processes PDFs in seconds. The API is available for integration into document verification workflows.