Binary PDF Modification Detection: How It Works and Where It Fails
The question sounds simple: has this PDF been edited after it was originally created? The answer, it turns out, requires understanding how PDF files store their history at the byte level.
This article explains the technical approach behind binary PDF modification detection — what signals we look for, why they are reliable, and critically, where the method breaks down. If you want a quick answer without the technical detail, HTPBE.tech is an online tool that tells you whether a PDF has been edited (yes/no) in seconds. If you want to understand how that answer is produced, read on.
How a PDF File Is Structured
Before discussing detection, you need to understand the PDF file format at a structural level.
A PDF file consists of four components:
- Header — declares the PDF version (e.g., `%PDF-1.7`)
- Body — a collection of numbered objects (pages, fonts, images, metadata)
- Cross-reference table (xref) — an index mapping object numbers to byte offsets
- Trailer — points to the xref table and the document root object
A minimal trailer looks like this:
trailer
<<
/Size 42
/Root 1 0 R
/Info 2 0 R
>>
startxref
9876
%%EOF
The /Info entry points to the document information dictionary — this is where metadata lives.
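As a small illustration, here is a sketch (not a full PDF parser) that extracts the `startxref` offset from raw bytes, using a sample trailer like the one above. A real tool would then seek to that offset and parse the xref table; this only shows the trailer-reading step.

```python
import re

# Sample bytes modeled on the minimal trailer above (not a complete PDF).
sample = (
    b"trailer\n<< /Size 42 /Root 1 0 R /Info 2 0 R >>\n"
    b"startxref\n9876\n%%EOF\n"
)

def find_startxref(data: bytes) -> int:
    """Return the byte offset of the most recent xref table."""
    # The last occurrence wins: it belongs to the newest revision's trailer.
    idx = data.rfind(b"startxref")
    if idx == -1:
        raise ValueError("no startxref keyword found")
    m = re.match(rb"startxref\s+(\d+)", data[idx:])
    if m is None:
        raise ValueError("malformed startxref section")
    return int(m.group(1))

offset = find_startxref(sample)  # 9876 for the sample above
```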
The Metadata Dictionary
The PDF Information Dictionary (/Info) contains fields that describe the document’s origin and history:
| Field | Meaning |
|---|---|
| `/CreationDate` | When the document was originally created |
| `/ModDate` | When the document was last modified |
| `/Creator` | Application that created the source document (e.g., Microsoft Word) |
| `/Producer` | PDF library or printer that generated the PDF (e.g., macOS Quartz PDFContext) |
| `/Author` | Document author |
| `/Title` | Document title |
Dates are encoded in a specific format:
D:20231015143022+03'00'
This means: 2023-10-15 at 14:30:22, UTC+3. The timezone offset is significant — we will come back to it.
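A minimal sketch of parsing this format in Python; `parse_pdf_date` is a hypothetical helper covering the common shapes of the string (offset, `Z`, missing fields), not the full grammar from the specification.

```python
import re
from datetime import datetime, timedelta, timezone

def parse_pdf_date(raw: str) -> datetime:
    """Parse a PDF date string such as D:20231015143022+03'00'."""
    m = re.match(
        r"D:(\d{4})(\d{2})(\d{2})(\d{2})?(\d{2})?(\d{2})?"
        r"([+\-Z])?(\d{2})?'?(\d{2})?",
        raw,
    )
    if m is None:
        raise ValueError(f"not a PDF date string: {raw!r}")
    y, mo, d = int(m.group(1)), int(m.group(2)), int(m.group(3))
    h = int(m.group(4) or 0)
    mi = int(m.group(5) or 0)
    s = int(m.group(6) or 0)
    sign = m.group(7)
    if sign in ("+", "-"):
        off = timedelta(hours=int(m.group(8) or 0), minutes=int(m.group(9) or 0))
        tz = timezone(off if sign == "+" else -off)
    else:
        tz = timezone.utc  # 'Z' or absent: treat as UTC
    return datetime(y, mo, d, h, mi, s, tzinfo=tz)

dt = parse_pdf_date("D:20231015143022+03'00'")
```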
The Creator/Producer distinction matters enormously. When someone writes a document in Microsoft Word and exports it to PDF:
`Creator = Microsoft Word`, `Producer = Microsoft Word` (or a Word-internal PDF engine)
When someone then opens that PDF in Adobe Acrobat and edits it:
`Creator` stays as `Microsoft Word` (preserved from the original); `Producer` changes to `Adobe Acrobat 23.0` (the tool that wrote the new version).
This mismatch is a strong signal of modification.
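A minimal sketch of this check, assuming an allow-list of known export pairings. The `KNOWN_EXPORT_PAIRS` table here is illustrative, not a real tool's knowledge base.

```python
# Illustrative allow-list: (Creator, Producer) pairs that a single export
# workflow would legitimately produce. A real knowledge base is far larger.
KNOWN_EXPORT_PAIRS = {
    ("Microsoft Word", "Microsoft Word"),
    ("Microsoft Word", "Microsoft: Print To PDF"),
}

def producer_mismatch(creator: str, producer: str) -> bool:
    """True when the Producer is not a known export partner of the Creator."""
    return (creator, producer) not in KNOWN_EXPORT_PAIRS

flag = producer_mismatch("Microsoft Word", "Adobe Acrobat 23.0")
```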
The Incremental Save Mechanism
This is the core of PDF modification detection, and it is what makes PDFs fundamentally different from most other file formats.
When you edit a PDF and save it, most PDF editors do not rewrite the entire file. Instead, they append changes to the end:
- The original file body is left intact
- New or modified objects are appended after `%%EOF`
- A new cross-reference table is appended, containing entries for the changed objects
- A new trailer is appended with a `/Prev` field pointing to the byte offset of the previous xref table
[original body]
%%EOF
[new objects]
xref
[new xref entries]
trailer
<<
/Size 55
/Root 1 0 R
/Prev 9876 ← points to original xref
>>
startxref
14523
%%EOF
This incremental update mechanism exists for good reasons: it enables fast partial saves, supports undo history, and is required for certain digital signature workflows. But it also leaves a clear paper trail.
A PDF with multiple %%EOF markers, multiple xref sections, or a trailer with /Prev has been saved more than once. This does not automatically mean fraud — legitimate workflows involve multiple saves. But it is a definitive indicator that the file was modified after initial creation.
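The simplest version of this check can be sketched by counting `%%EOF` markers in the raw bytes; the sample byte strings below are toy stand-ins, not real PDFs.

```python
def count_saves(data: bytes) -> int:
    """Count %%EOF markers: in a well-formed file, one per save operation."""
    return data.count(b"%%EOF")

# Toy byte strings modeling a single-save file and an incremental update.
original = b"%PDF-1.7\n[original body]\nstartxref\n9876\n%%EOF\n"
edited = original + b"[new objects]\ntrailer\n<< /Prev 9876 >>\nstartxref\n14523\n%%EOF\n"
```

One caveat: a linearized (web-optimized) PDF legitimately contains an extra xref section and `%%EOF`, so a robust tool walks the xref chain rather than naively counting markers.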
What Automated Detection Actually Examines
A detection tool like HTPBE examines several independent signals and combines them into a confidence score.
1. Timestamp Consistency
The simplest check: is ModDate later than CreationDate? If yes, the file was modified. But there are subtleties:
- `ModDate == CreationDate` does not mean the file was never edited. Some editors reset `ModDate` to match `CreationDate` to hide modifications.
- `ModDate` earlier than `CreationDate` is impossible under normal conditions and indicates tampering with metadata.
- Timezone changes between `CreationDate` and `ModDate` (e.g., creation in `+03'00'` but modification in `Z`) suggest the file was modified on a different machine or with a different tool.
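These timestamp checks can be sketched with a hypothetical helper operating on already-parsed, timezone-aware datetimes:

```python
from datetime import datetime, timedelta, timezone

def timestamp_signals(created: datetime, modified: datetime) -> list[str]:
    """Return the timestamp anomalies described above, if any."""
    signals = []
    if modified < created:
        signals.append("impossible: ModDate precedes CreationDate")
    elif modified > created:
        signals.append("file modified after creation")
    if created.utcoffset() != modified.utcoffset():
        signals.append("timezone changed between creation and modification")
    return signals

# Created at UTC+3, modified later on a machine writing UTC ('Z') dates.
created = datetime(2023, 10, 15, 14, 30, 22, tzinfo=timezone(timedelta(hours=3)))
modified = datetime(2023, 10, 16, 9, 0, 0, tzinfo=timezone.utc)
signals = timestamp_signals(created, modified)
```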
2. Cross-Reference Table Analysis
Counting xref sections reveals the number of save operations:
- 1 xref section: the file was saved exactly once — typical for freshly exported PDFs
- 2+ xref sections: the file was modified at least once after initial creation
- 3+ xref sections: multiple rounds of editing
The content of the xref updates also matters. If the updated objects include the /Info dictionary (metadata), it suggests metadata was modified after the fact.
3. Creator/Producer Mismatch Analysis
We maintain a knowledge base of known PDF creation tool combinations. Some mismatches are routine:
- `Creator: Word` + `Producer: Adobe PDF Printer` — normal, Word printed to PDF
- `Creator: Google Docs` + `Producer: Skia/PDF` — normal Google Docs export
Some mismatches are suspicious:
- `Creator: Microsoft Excel` + `Producer: iLovePDF` — the file was processed by an online editor
- `Creator: LibreOffice` + `Producer: Adobe Acrobat` — the file was opened and re-saved in Acrobat
- `Creator:` (empty) + `Producer: FPDF` — programmatically generated, may be a reconstructed document
The tool also looks for editors known to be used for document fraud: certain online PDF editors, specific versions of commercial tools with known history-stripping features.
4. Digital Signature Integrity
PDF supports cryptographic digital signatures (/Sig objects). When a signed PDF is modified:
- The signature covers specific byte ranges of the file
- If content outside those byte ranges is modified, the signature becomes invalid
- If the signature object itself is removed and the document re-saved, the absence of a previously-present signature is detectable via the revision history
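The coverage check can be sketched as follows. `/ByteRange` is the real PDF key (an array `[offset1, length1, offset2, length2]` naming two signed spans, with the gap between them holding the signature's `/Contents` hex string); the helper itself is a simplified illustration.

```python
def byte_range_covers_file(byte_range: list[int], file_size: int) -> bool:
    """True when the two signed spans in /ByteRange account for the whole
    file (apart from the gap holding the signature value). Bytes beyond the
    second span were appended after signing, i.e. an incremental update."""
    off1, len1, off2, len2 = byte_range
    return off1 == 0 and off1 + len1 <= off2 and off2 + len2 == file_size

# Signed spans 0..1000 and 1500..2000; gap 1000..1500 holds /Contents.
intact = byte_range_covers_file([0, 1000, 1500, 500], file_size=2000)
appended = byte_range_covers_file([0, 1000, 1500, 500], file_size=2200)
```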
5. Object Stream Anomalies
PDF 1.5+ supports compressed object streams. Legitimate PDF generators produce consistent, well-formed object streams. Anomalies we check for:
- Objects with inconsistent generation numbers
- Orphaned objects not reachable from the document root
- Objects that exist in the xref table but whose byte offsets point to different object types (indicates partial overwrite attempts)
Putting It Together: The Confidence Score
No single signal is conclusive. A document with ModDate > CreationDate might be completely legitimate. A document with matching timestamps might have been forged with a tool that resets them.
Detection tools combine multiple signals into a weighted confidence score:
| Signal | Weight | Rationale |
|---|---|---|
| Multiple xref sections | High | Hard to fake without specialized knowledge |
| Creator/Producer mismatch with known editor | Medium-High | Common in legitimate workflows too |
| Timestamp anomaly (impossible date) | High | Never legitimate |
| Timestamp anomaly (reset to match) | Medium | Detectable but requires suspicion |
| Known fraud-associated producer | Medium | Context-dependent |
| Digital signature removed | High | Strong indicator of post-signing modification |
A score above a threshold produces a “Modified” verdict. Below the threshold: “Original”. The threshold is calibrated to minimize false positives on legitimate documents.
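A minimal sketch of this scoring step. The weight values and threshold below are illustrative assumptions mirroring the table above, not the calibrated values of HTPBE or any other tool.

```python
# Illustrative weights, roughly ordered like the table above (assumptions).
WEIGHTS = {
    "multiple_xref_sections": 0.35,
    "creator_producer_mismatch": 0.20,
    "impossible_timestamp": 0.35,
    "timestamp_reset_to_match": 0.15,
    "fraud_associated_producer": 0.15,
    "signature_removed": 0.30,
}
THRESHOLD = 0.5  # hypothetical calibration point

def verdict(signals: set[str]) -> str:
    """Combine detected signals into a Modified/Original verdict."""
    score = sum(WEIGHTS[s] for s in signals)
    return "Modified" if score >= THRESHOLD else "Original"
```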
False Positives: When Unmodified PDFs Look Modified
This is where the method has real limitations.
Legitimate workflows that trigger modification signals:
- PDF/A conversion — archival format conversion modifies the file structure. `ModDate` will differ from `CreationDate` even if content is unchanged.
- Digital signing — adding a digital signature is a modification. A signed-after-creation document will show multiple xref sections even if the content was never altered.
- Optimization — tools like Ghostscript, qpdf, or Adobe Acrobat's "Reduce File Size" rewrite the entire file, resetting the xref structure but potentially preserving old timestamps.
- Printer workflows — some enterprise print systems re-process PDFs (adding watermarks, headers, or compliance stamps) as a routine step.
- Email clients — some email clients modify PDF attachments (adding metadata, stripping features) during send or receive.
- Merging or splitting — combining PDFs with tools like PDFtk or iLovePDF creates a new document with a new `Producer` and potentially inconsistent metadata from the source files.
In all these cases, the document content may be perfectly authentic, but the structural signals indicate modification. This is why results should always be interpreted in context.
False Negatives: When Modified PDFs Look Original
More dangerous is the opposite case: a forged document that passes detection.
Techniques that evade detection:
- Full reconstruction — printing to PDF (or using print → save as PDF) creates a completely new file with a fresh xref, consistent timestamps, and a single clean structure. All modification traces are lost. The resulting PDF looks freshly created, even if it was forged.
- Metadata scrubbing — tools like `exiftool` or custom scripts can overwrite all metadata fields, including timestamps and creator information, before the file is distributed. A sophisticated forger resets `CreationDate = ModDate` and sets `Producer` to match the expected original tool.
- Re-export from source format — if the forger has access to the source document (a Word file, for example), they can modify the source and re-export to PDF. The resulting PDF has no modification traces because it was never "modified" — it was created fresh from a modified source.
- PDF reconstruction tools — specialized tools can parse a modified PDF and reconstruct it as a clean single-revision file, stripping all incremental save history.
The fundamental limitation: binary detection works on file structure, not content. We can detect that a file’s structure shows signs of modification, but we cannot verify whether the visible text or numbers are correct. A forger skilled enough to reconstruct the PDF properly can evade structural detection entirely.
What This Means in Practice
Binary PDF modification detection is a fast, automated first-line check. It catches:
- Naive modifications made with standard PDF editors (the vast majority of document fraud cases)
- Metadata inconsistencies that reveal post-creation editing
- Structural anomalies that indicate multiple save operations
It does not catch:
- Sophisticated forgeries where the PDF was fully reconstructed
- Source-level modifications (edit the Word doc, re-export)
- Cases where the file was “legitimately” processed by a PDF workflow tool
For high-stakes verification (legal documents, financial instruments, regulated industries), automated detection should be combined with:
- Human review of the document content for semantic consistency
- Verification of the document’s origin through the issuing party
- Cryptographic verification if the original was digitally signed
- Chain-of-custody controls that prevent the question from arising
Technical Implementation Notes
For developers building similar tools, the key libraries:
- pdf-lib (JavaScript/Node.js) — parse PDF structure, read xref tables, access object streams
- PyMuPDF / fitz (Python) — comprehensive PDF analysis, supports most PDF versions
- pdfminer.six (Python) — lower-level access to PDF internals
- iTextSharp (.NET) — commercial-grade PDF analysis
The core algorithm:
- Parse the xref table(s) — count revisions
- Extract the `/Info` dictionary — compare timestamps, analyze Creator/Producer
- Walk the xref chain via `/Prev` pointers — identify which objects changed between revisions
- Check for `/Sig` objects — verify digital signature presence and coverage
- Score each signal — weight by reliability and combine
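To make the pipeline concrete, here is a toy end-to-end sketch working directly on raw bytes. It is deliberately naive (regex over bytes, no real object parsing); a production tool would use one of the libraries above. The `sample` bytes and the `analyze` helper are assumptions for illustration only.

```python
import re

def analyze(data: bytes) -> dict:
    """Toy structural analysis over raw PDF bytes (sketch, not a parser)."""
    revisions = data.count(b"%%EOF")          # save operations (naive count)
    has_prev = b"/Prev" in data               # incremental-update trailer key
    info = {}
    # Pull string-valued /Info fields like /Producer (Adobe Acrobat 23.0).
    for key in (b"/Creator", b"/Producer", b"/CreationDate", b"/ModDate"):
        m = re.search(re.escape(key) + rb"\s*\(([^)]*)\)", data)
        if m:
            info[key.decode()[1:]] = m.group(1).decode("latin-1")
    return {"revisions": revisions, "incremental_update": has_prev, "info": info}

# Toy file: one original revision plus one incremental update.
sample = (
    b"%PDF-1.7\n1 0 obj\n<< /Creator (Microsoft Word) "
    b"/Producer (Adobe Acrobat 23.0) >>\nendobj\nstartxref\n9\n%%EOF\n"
    b"2 0 obj\n<< >>\nendobj\ntrailer\n<< /Prev 9 >>\nstartxref\n99\n%%EOF\n"
)
report = analyze(sample)
```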
The implementation challenge is handling malformed PDFs. Real-world documents frequently deviate from the PDF specification. A robust tool must handle:
- Missing or corrupted xref tables (PDF readers use recovery mode)
- Linearized PDFs (a different structure used for web-optimized files)
- Encrypted PDFs (metadata may be accessible even if content is not)
- PDF 2.0 documents with new structural features
Conclusion
Binary PDF modification detection is a reliable heuristic, not a cryptographic proof. It answers the question “does this file’s structure suggest modification?” with high accuracy for everyday document fraud. Against a skilled adversary who understands PDF internals, structural detection alone is insufficient.
The honest answer to “has this PDF been edited?” is: we can tell you what the file’s structure reveals. For most fraud cases — edited invoices, tampered bank statements, modified contracts — that is enough.
HTPBE.tech implements the detection approach described in this article. The web tool is free, requires no registration, and processes PDFs in seconds. The API is available for integration into document verification workflows.