PDF Security Blog

PDF Integrity Report: May 2026

HTPBE Team · 01.06.2026 · 9 min read

This article is a snapshot — content was accurate as of June 2026. The product evolves actively; specific counts, examples, and detection rules may have changed since publication — see the changelog for the current state.

Every month we look at aggregate, anonymized data from checks processed by HTPBE? and write up what the structural signals tell us about the state of PDF tampering. No file contents, no personally identifiable information — only the structural and metadata patterns the algorithm uses to classify documents.

This report is about proportions and movement, not raw counts: what share of documents came back flagged, which signals fired more or less often than the month before, which origins shifted, and what the recurring tampering shapes looked like. Those are the numbers that mean something; an absolute file count for a single month is noise by comparison.

The Shape of the Verdicts

The flagged share climbed again — from just under half in March to roughly seven in ten in May. The bigger story is within the flagged set. For the first time, "certain" verdicts — where multiple unambiguous structural signals converge — overtook "high-confidence" ones as the single largest bucket.

Verdict	Direction vs. prior months
Certain modification	▲ now the largest single bucket
High-confidence modification	▲ slightly, but overtaken by "certain"
Not flagged	▼ shrinking share

Two forces are behind that. First, the traffic mix tilted hard toward the API this month, and API callers skew toward documents that are already modified — developers testing an integration with known-bad files, and forger-bridge traffic uploading a fake to see whether it gets caught. That population produces stacked, unambiguous evidence, which lands in the "certain" tier. Second, detection coverage widened (see the version count below), so files that once scraped a "high" now cross into "certain" because a newer check adds the converging second signal.

Read the flagged share as a statement about who is submitting documents, not as a population-wide fraud rate.
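
A toy calculation shows why the flagged share can move with the traffic mix alone. This is only a sketch: the channel names, traffic shares and per-channel flag rates below are invented for illustration and are not figures from this report.

```python
def blended_flag_share(mix):
    """Weighted average of per-channel flag rates.

    mix maps channel name -> (share of traffic, flag rate within that channel).
    """
    total_share = sum(share for share, _ in mix.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * rate for share, rate in mix.values())


# Hypothetical numbers: the per-channel flag rates stay fixed, only the mix moves.
earlier = {"web": (0.70, 0.40), "api": (0.30, 0.80)}
later = {"web": (0.30, 0.40), "api": (0.70, 0.80)}

print(round(blended_flag_share(earlier), 2))  # 0.52
print(round(blended_flag_share(later), 2))    # 0.68
```

In this made-up example nothing about either channel changed, yet the blended share rose by sixteen points because more traffic arrived through the channel that skews toward modified files.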

Signals That Moved

Among flagged documents, the evidence mix shifted in a consistent direction: the classical first-order signals held their lead, while the newer second-order signals kept gaining share.

Up month over month:

Generator-fingerprint contradictions — the declared producer says one thing, the binary structure says another. Growing fastest, because it catches files where the forger spoofed the producer string but left the structural fingerprint of the tool they actually used.
Multi-source page-template assembly — pages stamped from one source spliced onto pages from another. A new dedicated check this month moved this from "occasionally caught" to "routinely caught."
Producer-identity spoofing on re-distilled files — also a May-shipped check; immediately started firing on files laundered through a re-distill step.
Missing creation date — roughly a quarter of all files now arrive with no creation timestamp at all, a share that has crept up every month. A missing creation date strips out one of the cleaner forensic anchors, and its rise is itself a signal: someone is scrubbing it.

Flat or down:

Date-field inconsistencies — still the most common single finding by share, but no longer growing; the easy timestamp tells are increasingly being cleaned by forgers before submission.
Post-signature modification — down in share, mostly because signed documents were a thin slice of the month.

The trend we have flagged all year held: a forger who learned to scrub creation dates and avoid an incremental-update trail does not necessarily know to reconcile the structural fingerprint of the tool they rebuilt the file with against the producer string they spoofed. The second-order signals are where those cases get caught.

Incremental Updates: Almost Without Exception

The cleanest signal we track, and it got cleaner. Files carrying incremental updates were flagged in virtually every case this month — the highest rate in the four months we have published, continuing a monotonic climb. The average revision chain on those files sat around three appends.

The mechanism is unchanged: incremental updates let content be appended after the original write. Legitimate workflows produce them — signature application, annotation, form-fill — but on the population reaching the tool, those clean cases have shrunk to a rounding error. When an incremental update shows up on a document submitted for fraud detection, it is now almost synonymous with post-creation editing.
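
For readers who want to see what an incremental update looks like at the byte level, here is a minimal sketch. It assumes that each saved revision ends with its own `%%EOF` marker, which is how appended revisions are normally written. It is not the engine's check: linearized files legitimately carry two markers, and the marker can also appear inside embedded data, so treat the count as a hint. The function names are hypothetical.

```python
import re


def count_revisions(pdf_bytes: bytes) -> int:
    """Count %%EOF markers; each saved revision normally ends with one."""
    return len(re.findall(rb"%%EOF", pdf_bytes))


def appended_updates(pdf_bytes: bytes) -> int:
    """Revisions beyond the original write (0 for a single-revision file)."""
    return max(count_revisions(pdf_bytes) - 1, 0)


# Synthetic stand-in for a file with one revision appended after the original.
sample = (
    b"%PDF-1.7\n...original body...\nstartxref\n9\n%%EOF\n"
    b"...appended objects...\nstartxref\n120\n%%EOF\n"
)
print(appended_updates(sample))  # 1
```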

Representative Cases

These are composite, anonymized illustrations of the recurring shapes the engine resolved this month — not specific files. Each maps to the structural markers that actually drove the verdict.

The spreadsheet rebuild (verdict: certain). A "bank statement" arrives looking clean to the eye. Structurally, the producer field names a spreadsheet-export pipeline rather than the bank's core system, and the modification timestamp trails the creation timestamp by weeks. Two converging signals, producer mismatch plus a date gap, put the file in the "certain" tier. This is the single most common shape in the flagged set, month after month.
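
The date-gap half of this shape can be sketched in a few lines. The example parses PDF date strings of the form `D:YYYYMMDDHHmmSS` and measures how far the modification date trails the creation date. Two simplifications are assumed: the time-zone suffix is ignored, and the one-day threshold is an arbitrary illustrative value, not the engine's rule. All function names are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Year is required; every later field is optional, as in PDF date strings.
_PDF_DATE = re.compile(r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?")


def parse_pdf_date(value: str) -> datetime:
    """Parse a PDF date string, ignoring any time-zone suffix."""
    m = _PDF_DATE.match(value.strip())
    if not m:
        raise ValueError(f"not a PDF date: {value!r}")
    year, month, day, hour, minute, second = m.groups()
    return datetime(
        int(year), int(month or 1), int(day or 1),
        int(hour or 0), int(minute or 0), int(second or 0),
    )


def modification_gap(creation: str, modification: str) -> timedelta:
    """How long after creation the file claims to have been modified."""
    return parse_pdf_date(modification) - parse_pdf_date(creation)


def has_suspicious_gap(creation, modification, threshold=timedelta(days=1)):
    """True when the gap exceeds the (illustrative) threshold."""
    return modification_gap(creation, modification) > threshold


print(modification_gap("D:20260401090000Z", "D:20260422153000Z").days)  # 21
```

A gap on its own proves little, since plenty of legitimate files are re-saved later. In the case above it matters because it converges with the producer mismatch.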

The signature that held while the bytes moved (verdict: certain). A signed contract shows a green "signed by" badge in the viewer. Structurally, an incremental update was appended after the signature's byte range — a figure changed on page three, saved as a new revision the signature never covered. The signature stays technically valid; the document is not what was signed. This is the case digital-signature validation alone cannot catch, and structural analysis does.
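
A rough way to see this condition in raw bytes is to compare the end of the signed byte range with the file length. The sketch below assumes the usual `/ByteRange [a b c d]` layout, where the signed data ends at offset `c + d`, and it reads the array with a regular expression rather than a real PDF parser. Trailing bytes are not proof of tampering, because later signatures and validation data are also appended after signing. The function names are hypothetical.

```python
import re
from typing import Optional

_BYTE_RANGE = re.compile(rb"/ByteRange\s*\[\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*\]")


def signed_extent(pdf_bytes: bytes) -> Optional[int]:
    """End offset of the widest signed byte range, or None if unsigned."""
    ends = [
        int(m.group(3)) + int(m.group(4))
        for m in _BYTE_RANGE.finditer(pdf_bytes)
    ]
    return max(ends) if ends else None


def bytes_after_signature(pdf_bytes: bytes) -> int:
    """Number of trailing bytes that no signature covers (0 if unsigned)."""
    end = signed_extent(pdf_bytes)
    if end is None:
        return 0
    return max(len(pdf_bytes) - end, 0)


# Synthetic stand-in: a 50-byte "signed" file, then a revision appended to it.
signed = b"%PDF-1.7\n/ByteRange [0 10 20 30]\n".ljust(50, b" ")
tampered = signed + b"appended revision"
print(bytes_after_signature(signed), bytes_after_signature(tampered))  # 0 17
```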

The two-source invoice (verdict: certain). An invoice looks like one coherent document. Structurally, page one carries the font subsets and object fingerprint of an institutional generator, while page two was spliced in from a different source — a different font-subset prefix, a stamp-coverage discontinuity at the page boundary. Multi-source page-template assembly: the body is genuine, one page was swapped.
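
One ingredient of this shape, the font-subset prefix, is easy to illustrate. Subset fonts are named with a six-letter uppercase tag, a plus sign and the base font name. The sketch below assumes the per-page font names have already been extracted by a PDF library of your choice, and reports base fonts that appear under more than one tag. This is a simplified heuristic of our own for illustration, not the dedicated check described above: some legitimate generators embed a separate subset per page, so a conflict is only one input.

```python
import re
from collections import defaultdict

_SUBSET = re.compile(r"^([A-Z]{6})\+(.+)$")


def subset_conflicts(fonts_by_page):
    """Return base fonts that appear under more than one subset prefix.

    fonts_by_page maps page number -> iterable of font names as found in the
    page resources, e.g. "ABCDEF+Helvetica". Non-subset names are ignored.
    """
    seen = defaultdict(lambda: defaultdict(set))  # base -> prefix -> pages
    for page, names in fonts_by_page.items():
        for name in names:
            m = _SUBSET.match(name)
            if m:
                prefix, base = m.groups()
                seen[base][prefix].add(page)
    return {
        base: {prefix: sorted(pages) for prefix, pages in prefixes.items()}
        for base, prefixes in seen.items()
        if len(prefixes) > 1
    }


# Hypothetical input: page 2 carries Helvetica under a different subset tag.
fonts = {
    1: ["ABCDEF+Helvetica", "GHIJKL+Helvetica-Bold"],
    2: ["QWERTY+Helvetica"],
}
print(subset_conflicts(fonts))
```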

The borrowed identity (verdict: high → certain). A file declares "Adobe Acrobat" in its producer string but is missing the XMP toolkit marker and document-instance identifiers a genuine Acrobat save always writes, and its structural fingerprint matches a re-distill pipeline. Producer-identity spoofing — the May-shipped check that turns "claims to be Adobe" into a flag when the structure says otherwise.
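
A byte-level approximation of this contradiction might look like the following. It is a sketch under several assumptions: the producer is stored as an unencrypted literal string, the XMP packet is uncompressed, and the three byte markers chosen here stand in for "XMP toolkit marker and document-instance identifiers". The marker list and function names are our own illustrative choices, not the engine's rule set.

```python
import re

_PRODUCER = re.compile(rb"/Producer\s*\(([^)]*)\)")
# Assumed stand-ins for the toolkit marker and the document/instance IDs.
_XMP_MARKERS = (b"x:xmptk=", b"xmpMM:InstanceID", b"xmpMM:DocumentID")


def declared_producer(pdf_bytes):
    """Producer string from the first literal /Producer entry, if any."""
    m = _PRODUCER.search(pdf_bytes)
    return m.group(1).decode("latin-1") if m else None


def missing_xmp_markers(pdf_bytes):
    """Names of the expected XMP markers that are absent from the file."""
    return [mk.decode() for mk in _XMP_MARKERS if mk not in pdf_bytes]


def acrobat_claim_unsupported(pdf_bytes):
    """True when the file claims Acrobat but lacks the expected XMP markers."""
    producer = declared_producer(pdf_bytes) or ""
    return "acrobat" in producer.lower() and bool(missing_xmp_markers(pdf_bytes))


# Synthetic stand-ins, not real files.
spoofed = b"%PDF-1.7\n1 0 obj\n<< /Producer (Adobe Acrobat Pro) >>\nendobj\n"
with_xmp = spoofed + (
    b'<x:xmpmeta x:xmptk="Adobe XMP Core">'
    b"<xmpMM:DocumentID>uuid:1</xmpMM:DocumentID>"
    b"<xmpMM:InstanceID>uuid:2</xmpMM:InstanceID></x:xmpmeta>"
)
print(acrobat_claim_unsupported(spoofed), acrobat_claim_unsupported(with_xmp))  # True False
```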

Document Origin

The origin mix shifted: scanned documents rose sharply, to nearly a quarter of submissions — overtaking consumer-software exports — while institutional documents remained the plurality.

Origin classification	Direction
Institutional (server-side / enterprise generators)	plurality, steady
Scanned ("Cannot Verify")	▲ sharp rise, now ~a quarter
Consumer software ("Cannot Verify")	▼ slipped below scanned
Online editor / unknown / other	small shares

Scans and consumer-software exports fall into a "Cannot Verify" bucket where the structural layer deliberately returns a conservative inconclusive verdict rather than an intact-or-modified call, because forcing a binary verdict on those formats would produce wrong calls in both directions: clean files flagged and edited files cleared. The rise in scanned share is worth watching: re-scanning a tampered printout is a known way to launder edits out of the structural record, which is exactly why a scan can never earn an "intact" verdict here.

Digital Signatures

Signed documents remained a thin slice of the month, too small to quote a meaningful rate. The pattern that did appear is the one we keep reporting: a signature that is valid in the viewer does not guarantee the bytes were not altered, because incremental updates appended after signing fall outside the signed scope. Checking integrity at the structural layer, not the signature-validation layer, is what catches that — see "the signature that held while the bytes moved" above.

Algorithm Development

May was, by version count, the busiest month since launch — twenty-nine versions shipped, up from April's eighteen. The work split three ways, as always:

New detection categories — multi-source page-template assembly, producer-identity spoofing on re-distilled files, and refinements to drawing-operator and content-stream consistency checks. The first two show up directly in the "signals that moved" section above.
False-positive reductions — roughly half the releases narrowed heuristics misfiring on legitimate document classes: professional export pipelines, multi-tool re-export chains, certain office-suite and print-to-PDF outputs, signed-document workflows.
Clearer inconclusive verdicts — scans, consumer-software exports and HTML-to-PDF output continued to be routed into an explicit inconclusive verdict rather than forced into intact-or-modified.

Wider coverage is itself part of why the flagged share rose: a share of the documents now flagged would have come back intact under the early-May algorithm.

The Software Ecosystem

The recurring fingerprints held. Online manipulation services as intermediate steps — a service in the producer field with a different application in the creator field, the signature of a compress / merge / page-extract step between creation and submission. Design-tool origin — vector- and consumer-design applications appearing where a system-generated producer belongs, on documents that purport to be business records. Programmatic manipulation libraries — where the signal is no longer the spoofable producer string but the structural fingerprint the library leaves at the binary level. May's producer-identity-spoofing check was built for exactly that last category.

PDF Version Landscape

Concentration tightened: PDF 1.7 alone accounted for over half the sample, with 1.6, 1.4, 1.5 and 1.3 splitting most of the rest. PDF 2.0, despite nearly a decade of availability, stayed a rounding-error share.
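
To reproduce a version tally like this on your own files, reading the header comment is usually enough. The sketch below looks for `%PDF-x.y` near the start of the file. It assumes the header is authoritative, which is not always so, since a later `/Version` entry in the document catalog can override it. The function name is hypothetical and the sample bytes are synthetic.

```python
import re
from collections import Counter

_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def header_version(pdf_bytes):
    """Version declared in the file header, e.g. "1.7", or None if absent."""
    m = _HEADER.search(pdf_bytes[:1024])
    if not m:
        return None
    return f"{m.group(1).decode()}.{m.group(2).decode()}"


# Synthetic stand-ins for the first bytes of three files.
samples = [b"%PDF-1.7\n...", b"%PDF-1.4\n...", b"%PDF-1.7\n..."]
tally = Counter(header_version(s) for s in samples)
print(tally.most_common())  # [('1.7', 2), ('1.4', 1)]
```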

Summary

May 2026, in relative terms:

The flagged share climbed to roughly seven in ten, but read it as a traffic-mix effect (API-dominated, skewed toward already-modified files), not a population fraud rate.
"Certain" verdicts overtook "high-confidence" ones for the first time — converging, unambiguous evidence is becoming the norm in the flagged set.
Incremental-update files were flagged almost without exception — the cleanest single signal we track, climbing every month.
Second-order signals — generator-fingerprint contradictions, multi-source assembly, producer-identity spoofing — kept gaining share on the classical date and incremental-update tells.
Scanned documents rose sharply, to nearly a quarter of submissions; missing creation dates continued their slow climb.
Twenty-nine algorithm versions shipped, the most in any month so far.

Every pattern here comes from the same forensic engine teams run on their own intake stream through the PDF tamper detection API. If you want to run a single document through the same analysis by hand, the free checker does it in the browser.

This report covers checks processed by HTPBE? in May 2026. We analyze only file structure, never document content; web uploads may be retained in anonymized form to improve detection. All figures are aggregate and anonymized.