PDF Metadata Forensics: A Complete Field-by-Field Reference

Code examples verified against the API as of March 2026. If the API has changed since then, check the changelog.
Every PDF file carries two layers of information. The first is the visible content — the text, images, and layout a reader sees. The second is metadata: structured data describing the document itself. This second layer records when the document was created, which application produced it, whether it has been modified, and by what tools.
Forensic analysis of these fields can reconstruct a document’s history without examining its visible content at all. For document verification professionals, understanding each field — what it stores, what it reveals, and what makes a value suspicious — is the foundation of PDF authenticity assessment.
This reference covers every major metadata field used in PDF forensics, the structural element that cannot be cleared (the cross-reference table), and how these signals combine into an overall authenticity verdict.
Two Metadata Systems in One File
Modern PDF files contain two distinct metadata systems that coexist and sometimes conflict.
The DocInfo dictionary is the original system, introduced with PDF 1.0. It stores metadata as plain text key-value pairs inside the PDF’s trailer structure. Fields like CreationDate, ModDate, Creator, and Producer all live here. It is simple, universally supported, and very easy to edit with basic tools.
XMP metadata (Extensible Metadata Platform) is XML-based and was introduced in PDF 1.4. It stores a richer, more structured version of the same information, plus additional fields that the DocInfo dictionary cannot represent. XMP is embedded as a stream object within the PDF.
When both systems are present, they should agree. A document where the DocInfo dictionary says the creation date is 2020 but the XMP stream records 2024 has an internal inconsistency — a strong indicator that one of the two was modified manually after the original export.
HTPBE reads both systems and flags contradictions between them as a distinct finding.
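As a minimal illustration of this cross-check, the sketch below scans raw PDF bytes for the creation year recorded by each system and flags a disagreement. It is not HTPBE's implementation: a production parser must also handle string encodings, compressed object streams, and encrypted files.

```python
import re

def docinfo_creation_year(pdf_bytes: bytes):
    # DocInfo stores dates as literal strings: /CreationDate (D:YYYYMMDD...)
    m = re.search(rb"/CreationDate\s*\(D:(\d{4})", pdf_bytes)
    return int(m.group(1)) if m else None

def xmp_creation_year(pdf_bytes: bytes):
    # XMP stores dates as ISO 8601 inside an XML packet: <xmp:CreateDate>YYYY-...
    m = re.search(rb"<xmp:CreateDate>(\d{4})", pdf_bytes)
    return int(m.group(1)) if m else None

def metadata_contradiction(pdf_bytes: bytes) -> bool:
    # A contradiction requires both systems to be present and to disagree.
    d, x = docinfo_creation_year(pdf_bytes), xmp_creation_year(pdf_bytes)
    return d is not None and x is not None and d != x

sample = (b"... /CreationDate (D:20200101120000+00'00') ..."
          b"<xmp:CreateDate>2024-06-01</xmp:CreateDate> ...")
print(metadata_contradiction(sample))  # True: DocInfo says 2020, XMP says 2024
```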
Field: CreationDate
What it stores: The timestamp recorded when the PDF was first written to disk. In the DocInfo dictionary, this is stored in the format D:YYYYMMDDHHmmSSOHH'mm', where O is the UTC offset sign (+, -, or Z) — for example, D:20220315143022+02'00' represents 14:30:22 on 15 March 2022 in the UTC+2 timezone.
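The D: format converts to a standard timestamp with a short parser. This sketch assumes a full 14-digit timestamp with an explicit offset, which covers most real files but not every abbreviated variant the spec permits.

```python
from datetime import datetime, timedelta, timezone
import re

def parse_pdf_date(value: str) -> datetime:
    """Parse a DocInfo date like D:20220315143022+02'00' into an aware datetime."""
    m = re.match(r"D:(\d{14})([+\-Z])(?:(\d{2})'(\d{2})'?)?", value)
    if not m:
        raise ValueError(f"unrecognized PDF date: {value}")
    stamp, sign, oh, om = m.groups()
    if sign == "Z" or oh is None:
        tz = timezone.utc  # Z (or a missing offset) is treated as UTC
    else:
        offset = timedelta(hours=int(oh), minutes=int(om))
        tz = timezone(offset if sign == "+" else -offset)
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=tz)

dt = parse_pdf_date("D:20220315143022+02'00'")
print(dt.isoformat())  # 2022-03-15T14:30:22+02:00
```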
What it reveals: The earliest possible moment the document could have existed. This is the anchor point for all other temporal analysis.
Suspicious patterns:
- CreationDate is today's date on a document that is supposed to be years old. A bank statement from 2021 with CreationDate: 2026-03-27 was either re-exported from source data very recently or was fabricated.
- CreationDate is in the future. This happens when the PDF-generating machine has its clock set incorrectly, but it is also a byproduct of certain editing tools that do not normalize timestamps.
- CreationDate is significantly earlier than the software that allegedly created it. If Creator reports an application released in 2023 but CreationDate is 2019, one of those values is wrong.
Critical limitation: CreationDate is a plain-text field. Any text editor or command-line tool can change it to any value. It is not cryptographically protected. It should be treated as a claim, not a fact, and verified against other signals.
Field: ModDate
What it stores: The timestamp of the last modification to the PDF. Like CreationDate, it uses the same D: format. ModDate is updated by any operation that saves changes back into the file: editing content, adding annotations, applying a digital signature, running an optimizer, or converting with a new tool.
What it reveals: When the document was last processed by any application. A significant gap between CreationDate and ModDate on a document that was never supposed to be touched after creation is a core forensic signal.
The normal case — digital signatures: Applying a digital signature counts as a modification and updates ModDate to the signing timestamp. A payslip issued on the first of the month and signed by the employee for submission on the fifteenth will show CreationDate two weeks before ModDate. This is completely expected. HTPBE accounts for this by examining whether the modification pattern is consistent with legitimate signing behavior rather than content editing.
Suspicious patterns:
- ModDate is years after CreationDate on a document that is supposed to be a static official record (a diploma, a property title, a tax certificate). The gap itself is not proof of fraud, but it warrants investigation into what tools produced the modification.
- ModDate predates CreationDate. This is logically impossible and indicates deliberate tampering with one or both timestamps.
- ModDate matches the timestamp of a known online editing service, while CreationDate matches the original institution's software.
What HTPBE checks: The ModDate minus CreationDate gap is calculated and evaluated against document type expectations. The producer field at the time of modification is examined to determine whether the modifying tool is consistent with legitimate workflows (signing software, document management systems) or associated with document editing (online PDF editors, desktop PDF manipulation tools).
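A simplified version of this gap evaluation might look like the following. The document types and day thresholds here are illustrative assumptions, not HTPBE's actual configuration.

```python
from datetime import datetime, timezone

# Hypothetical per-document-type tolerances (in days) for the ModDate gap.
EXPECTED_MAX_GAP_DAYS = {
    "bank_statement": 0,   # static export: should never be modified
    "payslip": 45,         # may be signed by the recipient weeks later
    "diploma": 0,
}

def evaluate_mod_gap(creation: datetime, modification: datetime,
                     doc_type: str) -> str:
    if modification < creation:
        return "impossible: ModDate predates CreationDate"
    gap_days = (modification - creation).days
    limit = EXPECTED_MAX_GAP_DAYS.get(doc_type, 30)  # generic fallback
    if gap_days <= limit:
        return "within expectations"
    return f"gap of {gap_days} days exceeds expectation for {doc_type}"

created = datetime(2021, 3, 1, tzinfo=timezone.utc)
modified = datetime(2024, 3, 1, tzinfo=timezone.utc)
print(evaluate_mod_gap(created, modified, "bank_statement"))
```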
Field: Creator (also called CreatorTool)
What it stores: The name of the application that originally authored the document and exported it to PDF. This is distinct from whatever tool later processed or printed the file. For a Word document saved as PDF, Creator would typically be Microsoft Word. For a web-based report exported by an enterprise platform, it might be Oracle BI Publisher 12.2.1.4.0 or Temenos T24 Banking System.
ISO 32000 standard semantics: The ISO 32000 specification (developed under ISO with participation from the PDF Association) defines Creator as the name of the application that created the original document, and Producer as the name of the application that converted or last processed the file. Creator represents origin, not editing history. Producer is the field that should be updated to reflect processing tools.
Why Creator matters forensically: Institutional documents — bank statements, payslips, university transcripts, government certificates — are generated by specialized software with characteristic Creator strings. Knowing what software a given institution uses, and seeing a different application in the Creator field, is one of the highest-confidence fraud signals available.
Legitimate institutional creators:
- PeopleSoft 9.2 — HR and payroll systems at large enterprises and universities
- Oracle BI Publisher — financial reporting at banks and insurance companies
- Temenos T24 — core banking software used by hundreds of banks worldwide
- SAP NetWeaver — enterprise resource planning, common in large corporate payroll
- Registrar System v4.2 — academic record management
High-risk creators for documents claiming institutional origin:
- Canva — a graphic design tool, not a document management system
- Microsoft PowerPoint — presentation software
- Adobe Photoshop or GIMP — image editors
- LibreOffice Draw — vector drawing application
- Inkscape — SVG editor
A bank statement with Creator: Canva is not ambiguous. No bank uses Canva to generate account statements. This single field value is sufficient to classify the document as fabricated regardless of what any other field says.
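A sketch of this kind of creator classification, seeded with the example strings from the lists above. The substring matching and category names are illustrative, not HTPBE's actual rules.

```python
# Illustrative lists only; a production system would match versioned
# strings far more carefully.
HIGH_RISK_CREATORS = {"canva", "microsoft powerpoint", "adobe photoshop",
                      "gimp", "libreoffice draw", "inkscape"}
INSTITUTIONAL_CREATORS = {"peoplesoft", "oracle bi publisher", "temenos t24",
                          "sap netweaver"}

def classify_creator(creator: str) -> str:
    name = creator.lower()
    if any(risk in name for risk in HIGH_RISK_CREATORS):
        return "high_risk"       # design/image tool on an institutional claim
    if any(known in name for known in INSTITUTIONAL_CREATORS):
        return "institutional"   # matches known enterprise software
    return "unknown"             # needs contextual evaluation

print(classify_creator("Canva"))                        # high_risk
print(classify_creator("Oracle BI Publisher 12.2.1.4.0"))  # institutional
```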
Field: Producer
What it stores: The name of the application that most recently processed or converted the PDF. Where Creator represents origin, Producer represents the last tool in the processing chain. If a Word document is opened in Adobe Acrobat Pro and re-saved after adding a comment, Creator remains Microsoft Word and Producer becomes Adobe Acrobat Pro DC 2024.002.20991.
ISO 32000 standard semantics: Unlike Creator, which should remain stable, the Producer field should be updated by each tool that processes the file. This means Producer gives a running record of the last tool to touch the document.
Legitimate producer/creator combinations:
Some combinations are entirely normal and should not raise suspicion:
- Creator: wkhtmltopdf + Producer: Qt 4.8.7 — a common open-source web-to-PDF stack used by many fintech companies and SaaS platforms
- Creator: Microsoft Word + Producer: Microsoft® Word for Microsoft 365 — a Word document saved directly
- Creator: Chromium (or a full User-Agent string) + Producer: Skia/PDF m121 — Chrome printing to PDF, expected for web-generated documents
- Creator: PrimoPDF + Producer: Nitro PDF PrimoPDF — legitimate PDF printer driver
Applying a naive rule of “any producer/creator mismatch is suspicious” produces a false positive rate that makes the tool useless in practice. HTPBE’s whitelist of known-legitimate pairs was built from analysis of real institutional document samples and reduces this to a manageable signal.
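A minimal version of such a whitelist check, seeded with the pairs listed above. A real whitelist would be far larger and would match version strings and User-Agent forms more carefully; this is a sketch, not HTPBE's list.

```python
# Known-legitimate (creator_prefix, producer_prefix) pairs from the
# examples above, compared case-insensitively.
LEGITIMATE_PAIRS = [
    ("wkhtmltopdf", "qt"),
    ("microsoft word", "microsoft"),
    ("chromium", "skia/pdf"),
    ("primopdf", "nitro pdf"),
]

def pair_is_whitelisted(creator: str, producer: str) -> bool:
    c, p = creator.lower(), producer.lower()
    return any(c.startswith(cw) and p.startswith(pw)
               for cw, pw in LEGITIMATE_PAIRS)

print(pair_is_whitelisted("wkhtmltopdf 0.12.6", "Qt 4.8.7"))  # True
print(pair_is_whitelisted("Temenos T24", "iLovePDF"))         # False
```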
High-risk producers — online PDF editing services:
These applications are legitimate tools for personal use, but their presence as Producer on a document claiming to be an official record is a significant red flag:
- iLovePDF — online PDF editor and converter
- Smallpdf — browser-based PDF tools
- PDF24 — web and desktop PDF editor
- PDFescape — online PDF form filler and editor
- Sejda — online and desktop PDF editor
A payslip with Producer: iLovePDF was processed by an online PDF editing service. This is not how payroll software works. The document was either converted using an unofficial tool (which introduces questions about why) or was edited online and the editing service left its signature in the metadata.
Why lists alone are not enough: The set of legitimate producer values is large but bounded. The set of suspicious values is also bounded. The grey zone — tools that are sometimes legitimate and sometimes used for fraud — requires contextual evaluation. HTPBE therefore uses a weighted scoring approach rather than a binary classification.
XRef Tables (Cross-Reference Tables)
What they are: The cross-reference table is the PDF’s internal index. It maps every object in the file (pages, images, fonts, metadata streams, signatures) to its byte offset within the file. Without the xref table, a PDF reader cannot locate any object.
When a PDF is first created and saved, a single xref table is written. When the PDF is subsequently modified and saved again, PDF’s default behavior is incremental saving: rather than rewriting the entire file, a new section is appended to the end of the file, and a new xref table is appended that describes only the changed or added objects. The previous xref table remains intact.
This means xref_count — the number of xref sections in a file — directly corresponds to the number of save operations: one for the initial write, plus one for each subsequent incremental update.
xref_count: 1 — The document was written once. It has never been incrementally updated. For a document that should be a static export (a bank statement, a certificate), this is the expected value.
xref_count: 2 — The document was saved a second time. This is normal and expected if the document was digitally signed after creation. Signing software appends a signature object and a new xref table.
xref_count: 4 — The document has been through four save operations. Combined with a producer like iLovePDF, this pattern is highly suspicious.
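One cheap way to estimate xref_count is to count trailer keywords in the raw file. The sketch below counts startxref occurrences, which each save appends; this is only a first-order approximation, since linearized files and PDF 1.5 cross-reference streams need more careful handling.

```python
def count_xref_sections(pdf_bytes: bytes) -> int:
    # Each save (the original write plus every incremental update) ends in a
    # trailer containing the 'startxref' keyword, so counting occurrences
    # gives a rough save count for classic xref-table PDFs.
    return pdf_bytes.count(b"startxref")

single_save = b"%PDF-1.7 ... xref ... trailer ... startxref\n1234\n%%EOF"
two_saves = single_save + b"\n... new objects ... xref ... startxref\n5678\n%%EOF"
print(count_xref_sections(single_save))  # 1
print(count_xref_sections(two_saves))    # 2
```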
Why xref count is forensically special: Every other metadata field — CreationDate, ModDate, Creator, Producer — is a plain text value that can be overwritten with any content. The xref structure is different. To clear the incremental save history and make a multiply-edited document appear as if it was saved only once, you would need to completely rewrite the file from scratch, reserializing all objects. This is a fundamentally different operation from metadata editing, and it changes the file in ways that introduce other detectable artifacts. Most document forgers do not do this. They edit the visible content, save the file, and move on — leaving the xref count intact.
Nuances:
- Adobe Acrobat uses incremental saves for minor operations like adding bookmarks, annotations, or form field values. xref_count: 2 or even xref_count: 3 is not automatically suspicious.
- The combination of xref_count and Producer at each save point is what matters. Two saves where both producers are consistent with a sign-and-certify workflow is normal. Four saves where the final producer is an online editing tool is not.
- Some PDF generators (particularly those that optimize for file size) may perform a full linearization pass, which rewrites the file and resets the xref count. This is a legitimate operation but means xref_count: 1 after optimization does not prove a clean history — it only proves the file was linearized.
HTPBE reports xref_count directly in the API response, paired with the producer values found at each incremental update point.
Empty or Stripped Metadata
Absent metadata is itself a forensic signal, but it is ambiguous.
When empty metadata is suspicious: A corporate bank statement, payslip, or official certificate generated by enterprise software will always contain metadata. The software that generates these documents is configured to populate standard fields. If a document claiming to be from a major bank has no Creator, no Producer, and no CreationDate, someone deliberately removed that information. Tools like ExifTool (exiftool -all= document.pdf) can strip every DocInfo dictionary entry, and QPDF can rewrite the file wholesale, discarding its save history. Stripping metadata is a non-trivial extra step that someone would perform specifically because the original metadata was incriminating.
When empty metadata is normal: Developer-generated PDFs built with libraries like PDFKit (Node.js), ReportLab (Python), Puppeteer (headless Chrome), or FPDF often contain minimal or entirely absent metadata. These libraries do not populate metadata fields by default. A developer-generated invoice or automated report with zero metadata is not suspicious — it is the expected output of code that was not written to include metadata.
How HTPBE handles it: Rather than treating empty metadata as a pass or a fail, HTPBE takes into account signal availability. When metadata is sparse, the analysis has fewer signals to work with. A document with empty metadata and xref_count: 1 might receive a verdict of inconclusive rather than a definitive classification. This is more honest than a false positive.
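A sketch of signal-availability handling along these lines. The field choices, thresholds, and verdict labels are hypothetical, chosen to illustrate abstaining rather than guessing.

```python
from typing import Optional

def verdict_with_availability(creator: Optional[str], producer: Optional[str],
                              creation_date: Optional[str],
                              xref_count: int) -> str:
    # Count how many metadata signals are actually present.
    available = sum(v is not None for v in (creator, producer, creation_date))
    if available == 0 and xref_count == 1:
        # Stripped metadata or a minimal generator: abstain.
        return "inconclusive"
    if available < 2:
        return "low_confidence"
    return "sufficient_signals"

print(verdict_with_availability(None, None, None, 1))  # inconclusive
```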
Digital Signatures and Their Interaction with Metadata
A digital signature in a PDF protects the content of the document at the moment of signing. Any subsequent modification to the signed content will invalidate the signature. This sounds comprehensive, but the implementation details matter.
What signing does to metadata: When a signing application (Adobe Acrobat, DocuSign, a custom signing service) applies a signature, it appends a signature object to the PDF and writes a new xref section. ModDate is updated to the signing timestamp, and Producer is updated to the signing software. This is correct and expected behavior. A signed document should show this pattern.
The Incremental Saving Attack: Research published at pdf-insecurity.org documented a class of vulnerabilities in PDF signature validation. In some older or misconfigured PDF viewers, it is possible to append additional content to a signed PDF in a new incremental update. The signature validation reports the signature as valid — because the signed content has not changed — but the document as displayed to the user includes the appended content, which was never signed.
In practical terms: a signed document can have additional pages appended, additional form field values inserted, or visual overlays added in a new xref section, and a naive signature validation tool will still report the signature as “valid.”
What HTPBE detects: By examining the xref structure in relation to the position of signature objects, HTPBE can identify whether modifications were appended after the signature was applied. If the signature object appears in xref section 2 and additional content modifications appear in xref sections 3 and 4, the document has been altered after signing regardless of what a signature validation tool reports about the signature’s cryptographic validity.
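One concrete form of this check uses the signature's /ByteRange entry: in a freshly signed file the signed ranges extend to the end of the file, so any trailing bytes beyond them were appended after signing. The sketch below assumes a single, uncompressed /ByteRange and is not HTPBE's implementation.

```python
import re

def bytes_after_signed_range(pdf_bytes: bytes):
    """A signature's /ByteRange [a b c d] covers bytes [a, a+b) and [c, c+d),
    skipping only the signature value itself. In a freshly signed file,
    c + d equals the file length; any surplus was appended after signing."""
    m = re.search(rb"/ByteRange\s*\[\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*\]",
                  pdf_bytes)
    if not m:
        return None  # no signature found
    a, b, c, d = (int(g) for g in m.groups())
    return len(pdf_bytes) - (c + d)

# Synthetic 200-byte "file" whose signed ranges cover all 200 bytes.
signed = b"/ByteRange [0 100 150 50]".ljust(200, b" ")
print(bytes_after_signed_range(signed))             # 0: nothing after signing
print(bytes_after_signed_range(signed + b"EXTRA"))  # 5: appended after signing
```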
Practical Decision Reference
The following table summarizes the most common field combinations and their forensic interpretation.
| CreationDate | ModDate Gap | Creator | Producer | xref_count | Interpretation |
|---|---|---|---|---|---|
| Matches document era | None or days | Institutional software | Matching or signing tool | 1–2 | Normal — low risk |
| Today | None | Institutional software | Institutional software | 1 | Recently regenerated — investigate why |
| Today | N/A | Canva, Photoshop, PowerPoint | Any | Any | Fabricated — high risk |
| Years ago | Years | Institutional software | iLovePDF, Smallpdf, Sejda | >2 | Modified — high risk |
| Years ago | Weeks | Institutional software | Signing tool (DocuSign, Acrobat) | 2 | Signed by recipient — normal |
| Inconsistent with XMP | Any | Any | Any | Any | Internal contradiction — moderate risk |
| Empty | Empty | Empty | Empty | 1 | Stripped or minimal generator — low confidence |
| Future date | N/A | Any | Any | Any | Clock error or fabrication — investigate |
Why No Single Field Is Conclusive
Each field in this reference has a legitimate explanation for values that look suspicious in isolation. A future CreationDate might be a server clock configuration error. A Producer from an online editor might mean the recipient printed and re-uploaded the document for their own filing. A large xref count might be from legitimate document management workflows.
The forensic value comes from the combination. CreationDate from today, Creator set to Canva, and xref_count: 1 is not three independent coincidences — it is a coherent pattern of fabrication. ModDate three years after CreationDate, Producer set to iLovePDF, and xref_count: 4 is a coherent pattern of post-creation editing.
HTPBE combines all of these signals into a verdict (status) supported by the modification_markers array, which lists each specific signal that contributed to the result in plain language, so the analyst can evaluate the evidence rather than accepting a black-box verdict.
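A toy version of this combination step, mirroring the status and modification_markers shape described above. The weights, thresholds, and marker names are illustrative assumptions, not HTPBE's actual values.

```python
def combine_signals(findings: dict) -> dict:
    # Hypothetical weights per finding; present findings contribute to a
    # score, and each one is reported as a human-readable marker.
    weights = {
        "high_risk_creator": 0.9,
        "online_editor_producer": 0.6,
        "docinfo_xmp_mismatch": 0.5,
        "excessive_xref_count": 0.4,
        "moddate_before_creation": 0.9,
    }
    markers = [name for name, present in findings.items() if present]
    score = sum(weights.get(name, 0.0) for name in markers)
    status = ("fabricated" if score >= 0.9 else
              "modified" if score >= 0.5 else
              "clean")
    return {"status": status, "modification_markers": markers}

result = combine_signals({"docinfo_xmp_mismatch": True})
print(result["status"])  # modified
```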
Conclusion
PDF metadata forensics is not about finding a single smoking gun. It is about reading a document’s history from the fields that institutions, editing tools, and the PDF format itself leave behind. CreationDate and ModDate establish the temporal frame. Creator and Producer identify who made and who touched the document. The xref table records how many times it was saved. Each field is a claim; the combination of claims either tells a coherent story or reveals contradictions.
Understanding these fields individually is the prerequisite for interpreting them collectively. When the story they tell matches the document’s claimed origin, that is evidence of authenticity. When they contradict each other — or contradict what the document claims to be — that is evidence worth investigating.
See how HTPBE uses all of these fields together — run a free analysis on any PDF at htpbe.tech.