PDF Security Blog

PDF Integrity Report: June 2026

HTPBE Team·01.07.2026·11 min read

This article is a snapshot — content was accurate as of July 2026. The product evolves actively; specific counts, examples, and detection rules may have changed since publication — see the changelog for the current state.

Every month we look at aggregate, anonymized data from checks processed by HTPBE? and write up what the structural signals tell us about the state of PDF tampering. No file contents, no personally identifiable information — only the structural and metadata patterns the algorithm uses to classify documents.

This report is about proportions and movement, not raw counts. What share of documents came back flagged, which signals fired more or less often than the month before, which origins shifted, and what the recurring tampering shapes looked like. Those are the numbers that mean something; an absolute file count for a single month is noise by comparison.

The Shape of the Verdicts

Last month the flagged share climbed past seven in ten and, for the first time, "certain" verdicts overtook "high-confidence" ones. In June both of those moves reversed. The flagged share fell back to just under half — near where it sat in March — and "high-confidence" reclaimed the lead inside the flagged set.

Verdict	Direction vs. May
Not flagged	▲ back to just over half
High-confidence modification	▲ retook the largest flagged bucket
Certain modification	▼ slipped back below "high"

The reversal is almost entirely a traffic-mix effect, and it is the cleanest illustration we have yet published of why the flagged share is not a fraud rate. In May the traffic leaned hard toward the API, where callers skew toward files that are already modified — developers testing an integration with known-bad documents, and forger-bridge uploads probing whether a fake gets caught. That population stacks converging evidence, which lands in the "certain" tier and pushes the flagged share up.

In June the mix flipped: roughly four in five submissions came through the browser-based free checker rather than the API. Web traffic is a broader, messier population — curious first-time users, ordinary documents, genuine intake. A larger slice of it comes back clean, and the files that do flag tend to trip one strong signal rather than a stack of them, so they land in "high-confidence" rather than "certain."

Same engine, same thresholds, a very different headline number — driven by who showed up, not by any change in how documents are made. Read the flagged share every month as a statement about the submitting population, never as a population-wide fraud rate.

Signals That Moved

Under the shifting headline number, the evidence mix among flagged documents kept moving in the direction we have tracked all year: the classical first-order tells stayed common, while newer second-order checks kept widening what gets caught.

Up, or newly firing:

Generator-identity forgery — a document whose stated origin has been rewritten to disguise where it actually came from. A dedicated check shipped early in June and a late-June broadening extended it to files rebuilt after issue but still dressed up as untouched originals. It fires whenever the declared producer and the structural fingerprint tell two different stories.
Overlay and cover-and-replace edits — substitute values layered on top of, or concealing, otherwise-untouched original content to change what a page reads. New detection classes this month moved a long-standing hard case from "occasionally caught" toward routinely caught.
Edited-in-an-editor — a document reopened in an interactive PDF editor after creation, a value changed, and saved back. New coverage this month recognises that round-trip independently of which editing application did it.

Flat or down in share:

Date-field inconsistencies — still the most common single finding, but no longer growing; the easy timestamp tells are increasingly being cleaned before submission.
Missing creation date — eased back to roughly a fifth of all files, down from the near-quarter it reached in May. Still elevated, still worth watching, but the monotonic climb paused.
Post-signature modification — down in share, mostly because signed documents were again a thin slice of the month.

The year-long pattern held: a forger who has learned to scrub creation dates and avoid an incremental-update trail does not necessarily know to reconcile the structural fingerprint of the tool they rebuilt the file with against the origin they claim. The second-order checks are where those cases surface.

Incremental Updates: Still Almost Every Time

The cleanest signal we track stayed the cleanest. Files carrying incremental updates were flagged in the vast majority of cases — roughly seven in eight — easing only slightly from May's near-total rate. The average revision chain on those files sat around three appends.

The mechanism is unchanged: incremental updates let content be appended after the original write. Legitimate workflows produce them — signature application, annotation, form-fill — but on the population reaching the tool, those clean cases remain a small minority. When an incremental update shows up on a document submitted for tamper detection, it is still very close to synonymous with post-creation editing.

Representative Cases

These are composite, anonymized illustrations of the recurring shapes the engine resolved this month — not specific files. Each maps to the structural markers that actually drove the verdict.

The editor round-trip (verdict: certain). A "bank statement" looks clean to the eye. Structurally it was opened in an interactive PDF editor after its original creation, a figure was changed, and it was saved back — leaving the fingerprint of an editing pass over what claims to be an untouched issuer original. Edited-in-an-editor: the newer coverage this month recognises that round-trip no matter which editor did it.

The overlay patch (verdict: high → certain). A payslip reads correctly, cell by cell. Structurally, substitute values were layered on top of the original page content — the underlying figures are still there, quietly covered by the numbers the forger wanted shown. Overlay and cover-and-replace detection targets exactly this: the page you see is not the page underneath.

The borrowed generator identity (verdict: certain). A document declares an institutional producer in its origin fields, but the binary structure carries the fingerprint of a consumer tool that rebuilt it. The stated origin has been rewritten to disguise where the file actually came from. Generator-identity forgery — the check that treats "claims one origin, structurally is another" as a flag in its own right.

The render dressed as a scan (verdict: modified). A file arrives looking like a camera or scanner capture — one full-page image, no born-digital text. Structurally it was digitally rendered into that single image and then presented as a scan, a way a fabricated file is made to look like an innocent photograph of a paper original. A new June check separates that synthetic render from a genuine scan, including ordinary phone captures, which continue to pass into the not-certifiable ceiling rather than being flagged.

Document Origin

The origin mix partly unwound May's shift. Scanned documents fell back to roughly an eighth of submissions, slipping below consumer-software exports again, while institutional documents remained the plurality at a little over four in ten.

Origin classification	Direction
Institutional (server-side / enterprise generators)	plurality, ~four in ten
Consumer software ("Cannot Verify")	▲ back above scanned
Scanned ("Cannot Verify")	▼ eased to ~an eighth
Online editor / unknown / other	small shares

Scans and consumer-software exports fall into a "Cannot Verify" bucket where the structural layer deliberately returns a conservative inconclusive verdict rather than an intact-or-modified call — forcing a binary verdict on those formats would generate false positives in both directions. Several of June's releases sharpened that boundary in both directions: extending scanned-document recognition to more multifunction copier and companion-app output that had been misread as born-digital, while sparing genuine machine-issued bills that carry lightweight postal-mailing marks from being mistaken for a full-page scan. A scan can still never earn an "intact" verdict here — re-scanning a tampered printout is a known way to launder edits out of the structural record.

Digital Signatures

Signed documents were again a thin slice of the month — too small a base to quote a meaningful rate, so we keep it qualitative. The pattern that did appear is the one we report every month: a signature valid in the viewer does not guarantee the bytes were not altered, because incremental updates appended after signing fall outside the signed scope. Checking integrity at the structural layer, not the signature-validation layer, is what catches that.

Algorithm Development

June shipped sixteen versions — a steadier month than May's twenty-nine, and weighted toward broadening detection rather than the release-a-day pace of the previous month. The work split the usual three ways, with two firsts worth calling out.

New detection categories — generator-identity forgery, overlay and cover-and-replace edits, edited-in-an-editor round-trips, a synthetic-render-dressed-as-a-scan check, files rebuilt inside a graphics-design tool from a source held on the operator's own machine, and dangling internal references left behind by a rebuild. Several of these closed cases that had previously slipped through certified as originals.
The first deliberately-undisclosed signal. For the first time the catalogue carries a proprietary integrity check whose mechanism we do not describe — we acknowledge it exists and that it is strong, corroborated evidence when it fires, but we hold back how it works so a forger cannot read the description and engineer around it. Every other check stays described in plain outcome terms.
A retirement. We removed a standalone document-identifier consistency check: reviewed against a large corpus of genuine files, those identifier records were found to differ legitimately on a single clean render across many established generators, so on their own they produced false positives while adding nothing the structural checks did not already cover. Every modification it could ever evidence remains caught independently.
False-positive reductions — roughly half the releases narrowed heuristics misfiring on legitimate document classes: genuine scanner hardware output, single-pass institutional renders, print-rendered layouts, table-layout documents, machine-composed financial and retirement statements, and government forms with long author-to-issue gaps.

Wider coverage cuts against the falling flagged share, not with it: a share of the documents flagged in June would have passed under the early-June algorithm. The headline number fell anyway — which is exactly why the traffic-mix framing above matters.

The Software Ecosystem

The recurring fingerprints held. Online manipulation services as intermediate steps — a service in the producer field with a different application in the creator field, the signature of a compress / merge / page-extract step between creation and submission. Design-tool origin — vector- and consumer-design applications appearing where a system-generated producer belongs, on documents that purport to be business records; June added a specific check for files rebuilt inside a graphics tool from a locally-held source, a construction pattern no institution uses to issue its own statements. Programmatic manipulation libraries — where the signal is no longer the spoofable producer string but the structural fingerprint the library leaves at the binary level, which is where the generator-identity-forgery work is aimed.

PDF Version Landscape

Concentration loosened slightly. PDF 1.7 slipped just under half the sample, down from over half in May, with 1.4 taking a larger second share and 1.3, 1.6 and 1.5 splitting most of the rest. PDF 2.0, despite nearly a decade of availability, stayed a rounding-error share.

Summary

June 2026, in relative terms:

The flagged share fell back to just under half — and the reversal is the clearest example yet that this number tracks who submitted documents (a web-dominated month), not a population fraud rate.
"High-confidence" verdicts reclaimed the lead from "certain" as the traffic mix flipped from API-heavy to browser-heavy.
Incremental-update files were still flagged in the vast majority of cases — roughly seven in eight, the cleanest single signal we track.
Newer second-order checks — generator-identity forgery, overlay and cover-and-replace edits, edited-in-an-editor — kept widening coverage against the classical date and incremental-update tells.
Scanned share eased back to roughly an eighth; missing creation dates paused their climb at about a fifth.
Sixteen algorithm versions shipped, including the first deliberately-undisclosed proprietary signal and the retirement of an unreliable identifier check.

Every pattern here comes from the same forensic engine teams run on their own intake stream through the PDF tamper detection API. If you want to run a single document through the same analysis by hand, the free checker does it in the browser.

This report covers checks processed by HTPBE? in June 2026. File contents are not stored or analyzed; only structural metadata signals are retained. All figures are aggregate and anonymized.