PDF Metadata Forensics: A Complete Field-by-Field Reference

Code examples verified against the API as of March 2026. If the API has changed since then, check the changelog.
Every PDF file carries two layers of information. The first is the visible content — the text, images, and layout a reader sees. The second is metadata: structured data describing the document itself. This second layer records when the document was created, which application produced it, whether it has been modified, and by what tools.
Forensic analysis of these fields can reconstruct a document’s history without examining its visible content at all. For document verification professionals, understanding each field — what it stores, what it reveals, and what makes a value suspicious — is the foundation of PDF authenticity assessment.
This reference covers every major metadata field used in PDF forensics, the structural element that cannot be cleared (the cross-reference table), and how these signals combine into an overall authenticity verdict.
Two Metadata Systems in One File
Modern PDF files contain two distinct metadata systems that coexist and sometimes conflict.
The DocInfo dictionary is the original system, introduced with PDF 1.0. It stores metadata as plain text key-value pairs inside the PDF’s trailer structure. Fields like CreationDate, ModDate, Creator, and Producer all live here. It is simple, universally supported, and very easy to edit with basic tools.
XMP metadata (Extensible Metadata Platform) is XML-based and was introduced in PDF 1.4. It stores a richer, more structured version of the same information, plus additional fields that the DocInfo dictionary cannot represent. XMP is embedded as a stream object within the PDF.
When both systems are present, they should agree. A document where the DocInfo dictionary says the creation date is 2020 but the XMP stream records 2024 has an internal inconsistency — a strong indicator that one of the two was modified manually after the original export.
HTPBE reads both systems and flags contradictions between them as a distinct finding.
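As a minimal illustration of this cross-check, the sketch below scans raw PDF bytes for the creation year recorded by each system and flags a disagreement. It is not HTPBE's implementation: a production parser must also handle string encodings, compressed object streams, and encrypted files.

```python
import re

def docinfo_creation_year(pdf_bytes: bytes):
    # DocInfo stores dates as literal strings: /CreationDate (D:YYYYMMDD...)
    m = re.search(rb"/CreationDate\s*\(D:(\d{4})", pdf_bytes)
    return int(m.group(1)) if m else None

def xmp_creation_year(pdf_bytes: bytes):
    # XMP stores dates as ISO 8601 inside an XML packet: <xmp:CreateDate>YYYY-...
    m = re.search(rb"<xmp:CreateDate>(\d{4})", pdf_bytes)
    return int(m.group(1)) if m else None

def metadata_contradiction(pdf_bytes: bytes) -> bool:
    # A contradiction requires both systems to be present and to disagree.
    d, x = docinfo_creation_year(pdf_bytes), xmp_creation_year(pdf_bytes)
    return d is not None and x is not None and d != x

sample = (b"... /CreationDate (D:20200101120000+00'00') ..."
          b"<xmp:CreateDate>2024-06-01</xmp:CreateDate> ...")
print(metadata_contradiction(sample))  # True: DocInfo says 2020, XMP says 2024
```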
Field: CreationDate
What it stores: The timestamp recorded when the PDF was first written to disk. In the DocInfo dictionary, this is stored in the format D:YYYYMMDDHHmmSSOHH'mm', where O is the UTC offset sign (+, -, or Z) — for example, D:20220315143022+02'00' represents 14:30:22 on 15 March 2022 in the UTC+2 timezone.
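The D: format converts to a standard timestamp with a short parser. This sketch assumes a full 14-digit timestamp with an explicit offset, which covers most real files but not every abbreviated variant the spec permits.

```python
from datetime import datetime, timedelta, timezone
import re

def parse_pdf_date(value: str) -> datetime:
    """Parse a DocInfo date like D:20220315143022+02'00' into an aware datetime."""
    m = re.match(r"D:(\d{14})([+\-Z])(?:(\d{2})'(\d{2})'?)?", value)
    if not m:
        raise ValueError(f"unrecognized PDF date: {value}")
    stamp, sign, oh, om = m.groups()
    if sign == "Z" or oh is None:
        tz = timezone.utc  # Z (or a missing offset) is treated as UTC
    else:
        offset = timedelta(hours=int(oh), minutes=int(om))
        tz = timezone(offset if sign == "+" else -offset)
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=tz)

dt = parse_pdf_date("D:20220315143022+02'00'")
print(dt.isoformat())  # 2022-03-15T14:30:22+02:00
```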
What it reveals: The earliest possible moment the document could have existed. This is the anchor point for all other temporal analysis.
Suspicious patterns:
- CreationDate is today's date on a document that is supposed to be years old. A bank statement from 2021 with CreationDate: 2026-03-27 was either re-exported from source data very recently or was fabricated.
- CreationDate is in the future. This happens when the PDF-generating machine has its clock set incorrectly, but it is also a byproduct of certain editing tools that do not normalize timestamps.
- CreationDate is significantly earlier than the software that allegedly created it. If Creator reports an application released in 2023 but CreationDate is 2019, one of those values is wrong.
Critical limitation: CreationDate is a plain-text field. Any text editor or command-line tool can change it to any value. It is not cryptographically protected. It should be treated as a claim, not a fact, and verified against other signals.
Field: ModDate
What it stores: The timestamp of the last modification to the PDF. Like CreationDate, it uses the same D: format. ModDate is updated by any operation that saves changes back into the file: editing content, adding annotations, applying a digital signature, running an optimizer, or converting with a new tool.
What it reveals: When the document was last processed by any application. A significant gap between CreationDate and ModDate on a document that was never supposed to be touched after creation is a core forensic signal.
The normal case — digital signatures: Applying a digital signature counts as a modification and updates ModDate to the signing timestamp. A payslip issued on the first of the month and signed by the employee for submission on the fifteenth will show CreationDate two weeks before ModDate. This is completely expected. HTPBE accounts for this by examining whether the modification pattern is consistent with legitimate signing behavior rather than content editing.
Suspicious patterns:
- ModDate is years after CreationDate on a document that is supposed to be a static official record (a diploma, a property title, a tax certificate). The gap itself is not proof of fraud, but it warrants investigation into what tools produced the modification.
- ModDate predates CreationDate. This is logically impossible and indicates deliberate tampering with one or both timestamps.
- ModDate matches the timestamp of a known online editing service, while CreationDate matches the original institution's software.
What HTPBE checks: The ModDate minus CreationDate gap is calculated and evaluated against document type expectations. The producer field at the time of modification is examined to determine whether the modifying tool is consistent with legitimate workflows (signing software, document management systems) or associated with document editing (online PDF editors, desktop PDF manipulation tools).
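A simplified version of this gap evaluation might look like the following. The document types and day thresholds here are illustrative assumptions, not HTPBE's actual configuration.

```python
from datetime import datetime, timezone

# Hypothetical per-document-type tolerances (in days) for the ModDate gap.
EXPECTED_MAX_GAP_DAYS = {
    "bank_statement": 0,   # static export: should never be modified
    "payslip": 45,         # may be signed by the recipient weeks later
    "diploma": 0,
}

def evaluate_mod_gap(creation: datetime, modification: datetime,
                     doc_type: str) -> str:
    if modification < creation:
        return "impossible: ModDate predates CreationDate"
    gap_days = (modification - creation).days
    limit = EXPECTED_MAX_GAP_DAYS.get(doc_type, 30)  # generic fallback
    if gap_days <= limit:
        return "within expectations"
    return f"gap of {gap_days} days exceeds expectation for {doc_type}"

created = datetime(2021, 3, 1, tzinfo=timezone.utc)
modified = datetime(2024, 3, 1, tzinfo=timezone.utc)
print(evaluate_mod_gap(created, modified, "bank_statement"))
```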
Field: Creator (also called CreatorTool)
What it stores: The name of the application that originally authored the document and exported it to PDF. This is distinct from whatever tool later processed or printed the file. For a Word document saved as PDF, Creator would typically be Microsoft Word. For a web-based report exported by an enterprise platform, it might be Oracle BI Publisher 12.2.1.4.0 or Temenos T24 Banking System.
ISO 32000 standard semantics: The ISO 32000 specification (developed under ISO with participation from the PDF Association) defines Creator as the name of the application that created the original document, and Producer as the name of the application that converted or last processed the file. Creator represents origin, not editing history. Producer is the field that should be updated to reflect processing tools.
Why Creator matters forensically: Institutional documents — bank statements, payslips, university transcripts, government certificates — are generated by specialized software with characteristic Creator strings. Knowing what software a given institution uses, and seeing a different application in the Creator field, is one of the highest-confidence fraud signals available.
Legitimate institutional creators:
- PeopleSoft 9.2 — HR and payroll systems at large enterprises and universities
- Oracle BI Publisher — financial reporting at banks and insurance companies
- Temenos T24 — core banking software used by hundreds of banks worldwide
- SAP NetWeaver — enterprise resource planning, common in large corporate payroll
- Registrar System v4.2 — academic record management
High-risk creators for documents claiming institutional origin:
- Canva — a graphic design tool, not a document management system
- Microsoft PowerPoint — presentation software
- Adobe Photoshop or GIMP — image editors
- LibreOffice Draw — vector drawing application
- Inkscape — SVG editor
A bank statement with Creator: Canva is not ambiguous. No bank uses Canva to generate account statements. This single field value is sufficient to classify the document as fabricated regardless of what any other field says.
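A sketch of this kind of creator classification, seeded with the example strings from the lists above. The substring matching and category names are illustrative, not HTPBE's actual rules.

```python
# Illustrative lists only; a production system would match versioned
# strings far more carefully.
HIGH_RISK_CREATORS = {"canva", "microsoft powerpoint", "adobe photoshop",
                      "gimp", "libreoffice draw", "inkscape"}
INSTITUTIONAL_CREATORS = {"peoplesoft", "oracle bi publisher", "temenos t24",
                          "sap netweaver"}

def classify_creator(creator: str) -> str:
    name = creator.lower()
    if any(risk in name for risk in HIGH_RISK_CREATORS):
        return "high_risk"       # design/image tool on an institutional claim
    if any(known in name for known in INSTITUTIONAL_CREATORS):
        return "institutional"   # matches known enterprise software
    return "unknown"             # needs contextual evaluation

print(classify_creator("Canva"))                        # high_risk
print(classify_creator("Oracle BI Publisher 12.2.1.4.0"))  # institutional
```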
Field: Producer
What it stores: The name of the application that most recently processed or converted the PDF. Where Creator represents origin, Producer represents the last tool in the processing chain. If a Word document is opened in Adobe Acrobat Pro and re-saved after adding a comment, Creator remains Microsoft Word and Producer becomes Adobe Acrobat Pro DC 2024.002.20991.
ISO 32000 standard semantics: Unlike Creator, which should remain stable, the Producer field should be updated by each tool that processes the file. This means Producer gives a running record of the last tool to touch the document.
Legitimate producer/creator combinations:
Some combinations are entirely normal and should not raise suspicion:
- Creator: wkhtmltopdf + Producer: Qt 4.8.7 — a common open-source web-to-PDF stack used by many fintech companies and SaaS platforms
- Creator: Microsoft Word + Producer: Microsoft® Word for Microsoft 365 — a Word document saved directly
- Creator: Chromium (or a full User-Agent string) + Producer: Skia/PDF m121 — Chrome printing to PDF, expected for web-generated documents
- Creator: PrimoPDF + Producer: Nitro PDF PrimoPDF — legitimate PDF printer driver
Applying a naive rule of “any producer/creator mismatch is suspicious” produces a false positive rate that makes the tool useless in practice. HTPBE’s whitelist of known-legitimate pairs was built from analysis of real institutional document samples and reduces this to a manageable signal.
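A minimal version of such a whitelist check, seeded with the pairs listed above. A real whitelist would be far larger and would match version strings and User-Agent forms more carefully; this is a sketch, not HTPBE's list.

```python
# Known-legitimate (creator_prefix, producer_prefix) pairs from the
# examples above, compared case-insensitively.
LEGITIMATE_PAIRS = [
    ("wkhtmltopdf", "qt"),
    ("microsoft word", "microsoft"),
    ("chromium", "skia/pdf"),
    ("primopdf", "nitro pdf"),
]

def pair_is_whitelisted(creator: str, producer: str) -> bool:
    c, p = creator.lower(), producer.lower()
    return any(c.startswith(cw) and p.startswith(pw)
               for cw, pw in LEGITIMATE_PAIRS)

print(pair_is_whitelisted("wkhtmltopdf 0.12.6", "Qt 4.8.7"))  # True
print(pair_is_whitelisted("Temenos T24", "iLovePDF"))         # False
```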
High-risk producers — online PDF editing services:
These applications are legitimate tools for personal use, but their presence as Producer on a document claiming to be an official record is a significant red flag:
- iLovePDF — online PDF editor and converter
- Smallpdf — browser-based PDF tools
- PDF24 — web and desktop PDF editor
- PDFescape — online PDF form filler and editor
- Sejda — online and desktop PDF editor
A payslip with Producer: iLovePDF was processed by an online PDF editing service. This is not how payroll software works. The document was either converted using an unofficial tool (which introduces questions about why) or was edited online and the editing service left its signature in the metadata.
Why lists alone are not enough: The set of legitimate producer values is large but bounded. The set of suspicious values is also bounded. The grey zone — tools that are sometimes legitimate and sometimes used for fraud — requires contextual evaluation. HTPBE therefore uses a weighted scoring approach rather than a binary classification.
XRef Tables (Cross-Reference Tables)
What they are: The cross-reference table is the PDF’s internal index. It maps every object in the file (pages, images, fonts, metadata streams, signatures) to its byte offset within the file. Without the xref table, a PDF reader cannot locate any object.
When a PDF is first created and saved, a single xref table is written. When the PDF is subsequently modified and saved again, PDF’s default behavior is incremental saving: rather than rewriting the entire file, a new section is appended to the end of the file, and a new xref table is appended that describes only the changed or added objects. The previous xref table remains intact.
This means xref_count — the number of xref sections in a file — directly corresponds to the number of save operations: one for the initial write, plus one for each subsequent incremental update.
xref_count: 1 — The document was written once. It has never been incrementally updated. For a document that should be a static export (a bank statement, a certificate), this is the expected value.
xref_count: 2 — The document was saved a second time. This is normal and expected if the document was digitally signed after creation. Signing software appends a signature object and a new xref table.
xref_count: 4 — The document has been through four save operations. Combined with a producer like iLovePDF, this pattern is highly suspicious.
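One cheap way to estimate xref_count is to count trailer keywords in the raw file. The sketch below counts startxref occurrences, which each save appends; this is only a first-order approximation, since linearized files and PDF 1.5 cross-reference streams need more careful handling.

```python
def count_xref_sections(pdf_bytes: bytes) -> int:
    # Each save (the original write plus every incremental update) ends in a
    # trailer containing the 'startxref' keyword, so counting occurrences
    # gives a rough save count for classic xref-table PDFs.
    return pdf_bytes.count(b"startxref")

single_save = b"%PDF-1.7 ... xref ... trailer ... startxref\n1234\n%%EOF"
two_saves = single_save + b"\n... new objects ... xref ... startxref\n5678\n%%EOF"
print(count_xref_sections(single_save))  # 1
print(count_xref_sections(two_saves))    # 2
```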
Why xref count is forensically special: Every other metadata field — CreationDate, ModDate, Creator, Producer — is a plain text value that can be overwritten with any content. The xref structure is different. To clear the incremental save history and make a multiply-edited document appear as if it was saved only once, you would need to completely rewrite the file from scratch, reserializing all objects. This is a fundamentally different operation from metadata editing, and it changes the file in ways that introduce other detectable artifacts. Most document forgers do not do this. They edit the visible content, save the file, and move on — leaving the xref count intact.
Nuances:
- Adobe Acrobat uses incremental saves for minor operations like adding bookmarks, annotations, or form field values. xref_count: 2 or even xref_count: 3 is not automatically suspicious.
- The combination of xref_count and Producer at each save point is what matters. Two saves where both producers are consistent with a sign-and-certify workflow is normal. Four saves where the final producer is an online editing tool is not.
- Some PDF generators (particularly those that optimize for file size) may perform a full linearization pass, which rewrites the file and resets the xref count. This is a legitimate operation but means xref_count: 1 after optimization does not prove a clean history — it only proves the file was linearized.
HTPBE reports xref_count directly in the API response, paired with the producer values found at each incremental update point.
Empty or Stripped Metadata
Absent metadata is itself a forensic signal, but it is ambiguous.
When empty metadata is suspicious: A corporate bank statement, payslip, or official certificate generated by enterprise software will always contain metadata. The software that generates these documents is configured to populate standard fields. If a document claiming to be from a major bank has no Creator, no Producer, and no CreationDate, someone deliberately removed that information. Tools like ExifTool (exiftool -all= document.pdf) can strip every DocInfo dictionary entry, and QPDF can rewrite the file wholesale, discarding its save history. Stripping metadata is a non-trivial extra step that someone would perform specifically because the original metadata was incriminating.
When empty metadata is normal: Developer-generated PDFs built with libraries like PDFKit (Node.js), ReportLab (Python), Puppeteer (headless Chrome), or FPDF often contain minimal or entirely absent metadata. These libraries do not populate metadata fields by default. A developer-generated invoice or automated report with zero metadata is not suspicious — it is the expected output of code that was not written to include metadata.
How HTPBE handles it: Rather than treating empty metadata as a pass or a fail, HTPBE takes into account signal availability. When metadata is sparse, the analysis has fewer signals to work with. A document with empty metadata and xref_count: 1 might receive a verdict of inconclusive rather than a definitive classification. This is more honest than a false positive.
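A sketch of signal-availability handling along these lines. The field choices, thresholds, and verdict labels are hypothetical, chosen to illustrate abstaining rather than guessing.

```python
from typing import Optional

def verdict_with_availability(creator: Optional[str], producer: Optional[str],
                              creation_date: Optional[str],
                              xref_count: int) -> str:
    # Count how many metadata signals are actually present.
    available = sum(v is not None for v in (creator, producer, creation_date))
    if available == 0 and xref_count == 1:
        # Stripped metadata or a minimal generator: abstain.
        return "inconclusive"
    if available < 2:
        return "low_confidence"
    return "sufficient_signals"

print(verdict_with_availability(None, None, None, 1))  # inconclusive
```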
Digital Signatures and Their Interaction with Metadata
A digital signature in a PDF protects the content of the document at the moment of signing. Any subsequent modification to the signed content will invalidate the signature. This sounds comprehensive, but the implementation details matter.
What signing does to metadata: When a signing application (Adobe Acrobat, DocuSign, a custom signing service) applies a signature, it appends a signature object to the PDF and writes a new xref section. ModDate is updated to the signing timestamp, and Producer is updated to the signing software. This is correct and expected behavior. A signed document should show this pattern.
The Incremental Saving Attack: Research published at pdf-insecurity.org documented a class of vulnerabilities in PDF signature validation. In some older or misconfigured PDF viewers, it is possible to append additional content to a signed PDF in a new incremental update. The signature validation reports the signature as valid — because the signed content has not changed — but the document as displayed to the user includes the appended content, which was never signed.
In practical terms: a signed document can have additional pages appended, additional form field values inserted, or visual overlays added in a new xref section, and a naive signature validation tool will still report the signature as “valid.”
What HTPBE detects: By examining the xref structure in relation to the position of signature objects, HTPBE can identify whether modifications were appended after the signature was applied. If the signature object appears in xref section 2 and additional content modifications appear in xref sections 3 and 4, the document has been altered after signing regardless of what a signature validation tool reports about the signature’s cryptographic validity.
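One concrete form of this check uses the signature's /ByteRange entry: in a freshly signed file the signed ranges extend to the end of the file, so any trailing bytes beyond them were appended after signing. The sketch below assumes a single, uncompressed /ByteRange and is not HTPBE's implementation.

```python
import re

def bytes_after_signed_range(pdf_bytes: bytes):
    """A signature's /ByteRange [a b c d] covers bytes [a, a+b) and [c, c+d),
    skipping only the signature value itself. In a freshly signed file,
    c + d equals the file length; any surplus was appended after signing."""
    m = re.search(rb"/ByteRange\s*\[\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*\]",
                  pdf_bytes)
    if not m:
        return None  # no signature found
    a, b, c, d = (int(g) for g in m.groups())
    return len(pdf_bytes) - (c + d)

# Synthetic 200-byte "file" whose signed ranges cover all 200 bytes.
signed = b"/ByteRange [0 100 150 50]".ljust(200, b" ")
print(bytes_after_signed_range(signed))             # 0: nothing after signing
print(bytes_after_signed_range(signed + b"EXTRA"))  # 5: appended after signing
```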
Practical Decision Reference
The following table summarizes the most common field combinations and their forensic interpretation.
| CreationDate | ModDate Gap | Creator | Producer | xref_count | Interpretation |
|---|---|---|---|---|---|
| Matches document era | None or days | Institutional software | Matching or signing tool | 1–2 | Normal — low risk |
| Today | None | Institutional software | Institutional software | 1 | Recently regenerated — investigate why |
| Today | N/A | Canva, Photoshop, PowerPoint | Any | Any | Fabricated — high risk |
| Years ago | Years | Institutional software | iLovePDF, Smallpdf, Sejda | >2 | Modified — high risk |
| Years ago | Weeks | Institutional software | Signing tool (DocuSign, Acrobat) | 2 | Signed by recipient — normal |
| Inconsistent with XMP | Any | Any | Any | Any | Internal contradiction — moderate risk |
| Empty | Empty | Empty | Empty | 1 | Stripped or minimal generator — low confidence |
| Future date | N/A | Any | Any | Any | Clock error or fabrication — investigate |
Why No Single Field Is Conclusive
Each field in this reference has a legitimate explanation for values that look suspicious in isolation. A future CreationDate might be a server clock configuration error. A Producer from an online editor might mean the recipient printed and re-uploaded the document for their own filing. A large xref count might be from legitimate document management workflows.
The forensic value comes from the combination. CreationDate from today, Creator set to Canva, and xref_count: 1 is not three independent coincidences — it is a coherent pattern of fabrication. ModDate three years after CreationDate, Producer set to iLovePDF, and xref_count: 4 is a coherent pattern of post-creation editing.
HTPBE combines all of these signals into a verdict (status) supported by the modification_markers array, which lists each specific signal that contributed to the result in plain language, so the analyst can evaluate the evidence rather than accepting a black-box verdict.
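A toy version of this combination step, mirroring the status and modification_markers shape described above. The weights, thresholds, and marker names are illustrative assumptions, not HTPBE's actual values.

```python
def combine_signals(findings: dict) -> dict:
    # Hypothetical weights per finding; present findings contribute to a
    # score, and each one is reported as a human-readable marker.
    weights = {
        "high_risk_creator": 0.9,
        "online_editor_producer": 0.6,
        "docinfo_xmp_mismatch": 0.5,
        "excessive_xref_count": 0.4,
        "moddate_before_creation": 0.9,
    }
    markers = [name for name, present in findings.items() if present]
    score = sum(weights.get(name, 0.0) for name in markers)
    status = ("fabricated" if score >= 0.9 else
              "modified" if score >= 0.5 else
              "clean")
    return {"status": status, "modification_markers": markers}

result = combine_signals({"docinfo_xmp_mismatch": True})
print(result["status"])  # modified
```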
Conclusion
PDF metadata forensics is not about finding a single smoking gun. It is about reading a document’s history from the fields that institutions, editing tools, and the PDF format itself leave behind. CreationDate and ModDate establish the temporal frame. Creator and Producer identify who made and who touched the document. The xref table records how many times it was saved. Each field is a claim; the combination of claims either tells a coherent story or reveals contradictions.
Understanding these fields individually is the prerequisite for interpreting them collectively. When the story they tell matches the document’s claimed origin, that is evidence of authenticity. When they contradict each other — or contradict what the document claims to be — that is evidence worth investigating.
See how HTPBE uses all of these fields together — run a free analysis on any PDF at htpbe.tech.