
PDF Integrity Report: April 2026

HTPBE Team · 11 min read

This article is a snapshot — content was accurate as of May 2026. The product evolves actively; specific counts, examples, and detection rules may have changed since publication — see the changelog for the current state.

Every month we look at aggregate, anonymized data from checks processed through the HTPBE web interface and write up what the structural signals tell us about the state of PDF tampering. No file contents, no personally identifiable information — only the structural and metadata patterns our algorithm uses to classify documents.

A note before the numbers. April was an unusual month for the dataset. Alongside the organic stream of public submissions, we ran a large internal adversarial-testing batch — well over ten thousand synthetic and modified PDFs, generated and edited through every technique we know about, used to harden the algorithm against new tampering classes. Those test files are mixed into the structural counts below. So the absolute volume for April is dominated by our own training pipeline and is not a meaningful number to compare to March. What is meaningful is the shape of what we saw — the proportions, the shifts in which signals fire, and the categories of tampering that became detectable for the first time. Those are what this report covers.

For that reason we are stepping away from the daily-volume chart and the headline file-count figure that anchored the February and March reports. Both were useful when the sample was a clean, organic stream of public submissions. They become misleading once a structured testing batch is mixed in.


The Headline Pattern

The modification rate kept climbing. In February, around two-fifths of submissions were flagged; in March, just under half. April pushed past the halfway line: more than one in two documents that reached the algorithm, across organic submissions and adversarial test material combined, came back flagged.

Within those, the share carrying definitive structural evidence (the verdict we label "certain", reserved for cases where multiple unambiguous signals converge) held roughly stable as a proportion of all flagged files. The "high-confidence" share — strong but non-converging evidence — grew slightly. The qualitative direction is the same one we noted in March: the population reaching the tool is increasingly weighted toward documents where something is structurally wrong, not toward clean files being routinely checked.

Verdict                      | Share of all submissions
Not flagged                  | ~47%
High-confidence modification | ~24%
Certain modification         | ~29%

Eighteen Algorithm Versions in Thirty Days

April was the most active month for detection development since launch. Eighteen versions shipped between April 1 and April 30. Most of the work fell into three buckets:

New detection categories. Several whole classes of tampering became identifiable for the first time. We are deliberately vague about exactly what each new check looks for — the changelog notes describe behavior in general terms for the same reason. The new categories include impossible-timestamp detection, character-encoding manipulation in document text, residual structural traces left by document rebuilds, and a class of tampering that targets the consistency between different metadata layers within a file. There is also a new category covering documents marked as redacted where the underlying content remains structurally accessible. (The impossible-timestamp idea is simple enough to illustrate in code; see the sketch after these three buckets.)

False-positive reductions. Roughly half of the April releases were focused on cleaning up false positives on legitimate document classes — enterprise reporting frameworks, certain office-suite outputs, web-optimized files, multi-party signed documents, scanner-firmware variants, and HTML-to-PDF rendering pipelines. These are categories where the underlying tools produce structural quirks that look superficially like tampering signals. Each release narrows the heuristics so that legitimate output stops triggering them.

Clearer inconclusive verdicts. Documents that the structural layer cannot reason about with confidence — pure scans, consumer-software exports, browser-rendered HTML-to-PDF output — are now more consistently routed into an explicit "inconclusive" verdict instead of being forced into intact-or-modified. The April releases added a dedicated status reason for HTML-rendered documents and broadened the scanner-origin classification. The practical effect is that fewer files are returned with a misleading "intact" label simply because the format made structural reasoning impossible.
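
Here is that impossible-timestamp sketch. It uses the open-source pypdf library and is our own minimal illustration of the general idea, not the production check: it flags a document whose declared dates contradict each other or sit in the future.

```python
# A minimal sketch of an impossible-timestamp check using pypdf.
# Illustration of the general idea only, not HTPBE's production rule.
from datetime import datetime, timezone

from pypdf import PdfReader


def as_utc(dt: datetime) -> datetime:
    # Treat offset-less timestamps as UTC so comparisons are defined.
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)


def impossible_timestamps(path: str) -> list[str]:
    info = PdfReader(path).metadata
    if info is None:
        return []
    created = info.creation_date        # parsed from /CreationDate
    modified = info.modification_date   # parsed from /ModDate
    now = datetime.now(timezone.utc)
    findings = []
    if created and as_utc(created) > now:
        findings.append("creation date lies in the future")
    if modified and as_utc(modified) > now:
        findings.append("modification date lies in the future")
    if created and modified and as_utc(modified) < as_utc(created):
        findings.append("document was 'modified' before it was created")
    return findings
```

A hit from a check of this kind is hard to explain benignly: document generators do not write a modification date that precedes the creation date, but people editing metadata by hand sometimes do.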

The broader point: detection coverage in April is materially wider than in March. A non-trivial share of the documents now classified as modified would have come back as intact under the March algorithm. That alone accounts for some of the rise in flagged share.


Modification Signals: The Shape of the Evidence

Among flagged documents, no single signal carried the entire result. The classical signals — date-field disagreements, the presence of incremental updates, missing mandatory metadata fields — remained the most common findings by share. The newer signals appeared less often individually but covered tampering classes the classical signals miss.

The proportional pattern across flagged files in April:

Signal category                      | Share of flagged
Date-field inconsistencies           | leading single category
Incremental update structure         | second tier
Document-identity inconsistencies    | second tier
Design-tool assembly patterns        | second tier
Generator-fingerprint contradictions | second tier
Mandatory metadata removal           | third tier
Cross-metadata-stream disagreement   | third tier
Editor-tool content-stream traces    | third tier
Sub-1% incremental modifications     | smaller share
Character-encoding manipulation      | smaller share
Multi-session page assembly          | smaller share
Page-property inconsistencies        | smaller share
Post-signature modifications         | smaller share
Residual prior-template traces       | smaller share
Scan-replace patterns                | smaller share

Files routinely carry more than one signal at the same time, which is why this list is best read as a portrait of the evidence mix rather than a partition.

The trend worth flagging: the newer categories — encoding manipulation, residual templates, generator-fingerprint contradictions, multi-session assembly — are becoming a larger share of the flagged set month over month. A forger who has learned to scrub creation dates and avoid leaving an incremental update trail does not necessarily know to clean up the residual template hierarchy of the document they rebuilt from, or to avoid leaving a structural fingerprint that contradicts the producer string they spoofed. These are the second-order signals, and they are catching cases the first-order signals miss.


Incremental Updates: Almost Always Tampered

Files carrying incremental updates were overwhelmingly flagged in April — the modification share among them was higher than in March, which was itself higher than February. The trend has been monotonic across three months.

The mechanism has not changed: PDF incremental updates allow appending content after the original write. Legitimate workflows do produce them — signature application, annotation, form-fill. But on the population reaching the tool, those legitimate cases are increasingly the minority. When an incremental update appears on a document submitted for fraud detection, it is now far more often than not a sign of post-creation editing rather than a clean signature chain.
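
For readers who want to see the mechanism, a deliberately crude way to estimate how many write sessions a file has been through is to count the startxref markers in the raw bytes, since each incremental update appends its own cross-reference section and trailer. This is an illustrative heuristic of ours, not the production logic: linearized "fast web view" files legitimately carry an extra marker, and a real parser follows the /Prev chain in the trailers rather than counting bytes.

```python
# Crude revision estimate: each write session ends with its own
# cross-reference section, trailer, and "startxref" pointer, so a
# count above one suggests the file was written more than once.
# Linearized files legitimately add one extra hit; heuristic only.


def revision_count(path: str) -> int:
    with open(path, "rb") as fh:
        data = fh.read()
    return data.count(b"startxref")
```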

The average revision-chain length on flagged files ticked up slightly compared to March.


Document Origin

The origin classifier's output split as follows in April:

Origin classification                                      | Share
Institutional (server-side generators, enterprise systems) | majority
Consumer software ("Cannot Verify")                        | second-largest
Scanned ("Cannot Verify")                                  | meaningful share
Unknown                                                    | small share
Online editor ("Cannot Verify")                            | small share

Documents falling into the "Cannot Verify" buckets — scans, consumer-software exports, online HTML-to-PDF tools — receive a deliberately conservative inconclusive verdict rather than an intact-or-modified call. The rationale is unchanged: those formats produce structural patterns that overlap with legitimate editing artifacts, and forcing a binary verdict on them would generate too many false positives in either direction.

The April releases also reaffirmed and broadened that conservative stance. HTML-rendered documents now return inconclusive with an explicit machine-readable reason; the classifier treats more scanner-firmware variants as scans; and certain office-suite quirks that previously triggered modification signals are now correctly recognized as legitimate library behavior.
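
To make the routing concrete, here is a hypothetical sketch of that policy. The names and categories are ours for illustration and do not reflect HTPBE's internal interfaces.

```python
# Hypothetical routing sketch; names and categories are illustrative,
# not HTPBE's internal API.
from enum import Enum, auto


class Origin(Enum):
    INSTITUTIONAL = auto()
    CONSUMER_SOFTWARE = auto()
    SCANNED = auto()
    ONLINE_EDITOR = auto()
    HTML_RENDERED = auto()
    UNKNOWN = auto()


# Origins whose structural patterns overlap with legitimate editing
# artifacts; forcing intact-or-modified on these would over-flag.
CANNOT_VERIFY = {
    Origin.CONSUMER_SOFTWARE,
    Origin.SCANNED,
    Origin.ONLINE_EDITOR,
    Origin.HTML_RENDERED,
}


def route_verdict(origin: Origin, structural_verdict: str) -> tuple[str, str | None]:
    """Return (verdict, machine-readable status reason)."""
    if origin in CANNOT_VERIFY:
        return "inconclusive", f"origin_{origin.name.lower()}"
    return structural_verdict, None
```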


Digital Signatures: Still Not a Guarantee

Among submitted files carrying embedded digital signatures, a meaningful share showed evidence of post-signature modification, and a smaller share had had a signature removed entirely. The post-signature-modification share fell relative to March, but the underlying observation has not changed: digital signatures are not, on their own, an integrity guarantee.

The mechanism is the structural one we have written about before. A PDF signature covers exactly the bytes it covered when it was applied. Incremental updates appended after signing fall outside the signed scope, and the signature can remain technically valid in the viewer chrome — green checkmark, "signed by", date — even though new content has been spliced in. Checking integrity at the structural layer, rather than at the signature-validation layer, is what catches these.
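
Stated that way, the check is mechanical. Below is a hedged sketch that uses a regex simplification instead of a real parser of the signature dictionaries: find every /ByteRange array and ask whether the best-covered signature reaches the end of the file. Multi-signature workflows legitimately leave earlier signatures covering only a prefix of the file, which is why only the gap after the last covered byte is interesting.

```python
# Hedged sketch: does any signature's /ByteRange reach the end of the
# file? Regex simplification for illustration; real tooling parses the
# signature dictionaries instead of scanning raw bytes.
import re

BYTE_RANGE = re.compile(rb"/ByteRange\s*\[\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*\]")


def bytes_after_last_signature(path: str) -> int:
    with open(path, "rb") as fh:
        data = fh.read()
    coverage = 0
    for match in BYTE_RANGE.finditer(data):
        _, _, start2, len2 = (int(g) for g in match.groups())
        # The signed scope ends at start2 + len2; track the furthest.
        coverage = max(coverage, start2 + len2)
    if coverage == 0:
        return -1  # no parseable /ByteRange found
    return len(data) - coverage  # 0: fully covered; >0: appended bytes
```

A return of zero means the signed scope ends exactly at the end of the file; a positive value means bytes exist that no signature vouches for. Whether those bytes are a benign form-fill or a splice is precisely the question the structural analysis then resolves.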

What the April data adds: the share of altered-after-signing files among signed submissions is consistently in the same order of magnitude across months. It is not a tail-risk class of forgery — it is a routine one.


The Software Ecosystem

The producer and creator distributions in April shifted further toward server-side and enterprise generators than in previous months. Several recurring ecosystem patterns are worth noting in proportional terms.

Online manipulation services as intermediate steps. A larger share of flagged documents in April than in March showed an online PDF manipulation service in the producer field while a different application appeared in the creator field. That combination indicates the document went through a compress / merge / page-extract / edit step somewhere between original creation and submission. Whether that step was benign or fraudulent is what the structural analysis resolves. (A sketch of this producer/creator screen follows the third pattern below.)

Design-tool origin. Vector-design and consumer-design applications continued to appear in the creator field for documents that purport to be business records — invoices, certificates, contracts. Design tools are powerful enough to reproduce convincing layouts and trivially easy to edit; their presence in the creator field is one of the cleaner indicators that a document was hand-built rather than generated by the system that ought to have produced it.

Programmatic manipulation libraries. Paid PDF-manipulation libraries (the kind used legitimately for merging, watermarking, and signing, and also used in volume for page-import and template-assembly forgeries) continued to account for a notable share of flagged files. The detection signal here is no longer the producer string itself — those are easy to spoof — but the structural fingerprint the library leaves at the binary level. The contradiction between declared producer and structural fingerprint was one of the more frequently firing signals in April.
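
The first two patterns above are cheap to screen for at the metadata level. In the promised sketch, the substring lists are hypothetical examples of ours rather than HTPBE's rule set, and since producer strings are easy to spoof (the point of the third pattern), this is a screening signal, never a verdict on its own.

```python
# Illustrative screen for the producer/creator split described above.
# The substring lists are hypothetical examples, not HTPBE's rule set.
from pypdf import PdfReader

# Producer strings suggesting an online manipulation service sat
# between original creation and submission (examples only).
ONLINE_SERVICE_HINTS = ("ilovepdf", "smallpdf", "sejda")
# Design tools that rarely generate routine business records natively.
DESIGN_TOOL_HINTS = ("illustrator", "canva", "photoshop", "inkscape")


def ecosystem_flags(path: str) -> list[str]:
    info = PdfReader(path).metadata
    flags: list[str] = []
    if info is None:
        return flags
    producer = (info.producer or "").lower()
    creator = (info.creator or "").lower()
    if (any(h in producer for h in ONLINE_SERVICE_HINTS)
            and creator
            and not any(h in creator for h in ONLINE_SERVICE_HINTS)):
        flags.append("online service in /Producer, different app in /Creator")
    if any(h in creator for h in DESIGN_TOOL_HINTS):
        flags.append("design tool in /Creator on a purported business record")
    return flags
```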


PDF Version Landscape

The version distribution was effectively unchanged from March: PDF 1.7 and 1.4 together accounted for roughly two-thirds of the sample, with 1.5, 1.6, and 1.3 splitting most of the remainder. PDF 2.0, despite nearly a decade of availability, remained a rounding-error share.


Embedded Content

A small but non-trivial share of files contained embedded JavaScript, and a similar share contained embedded file attachments — binary payloads carried inside the PDF wrapper. Both shares were higher than in March in proportional terms. Neither is in itself proof of tampering, but both are recognized compliance and threat-vector concerns in business-document contexts (contracts, invoices, statements). For workflows that touch documents from external counterparties, both warrant scrutiny independent of modification verdict.
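
A first-pass screen for both kinds of embedded content can be done in the spirit of the well-known pdfid tool: count the name tokens associated with active or embedded payloads. One caveat is baked into the sketch below: PDF names can be hex-escaped (for example /J#61vaScript) to dodge exactly this kind of scan, so a zero count proves nothing.

```python
# pdfid-style triage: count name tokens tied to active or embedded
# content. Zero counts do not prove absence (names can be hex-escaped).
import re

TOKENS = (b"/JavaScript", b"/JS", b"/EmbeddedFile", b"/OpenAction")


def embedded_content_counts(path: str) -> dict[str, int]:
    with open(path, "rb") as fh:
        data = fh.read()
    # The lookahead keeps a short token from matching inside a longer
    # name (e.g. /EmbeddedFile inside /EmbeddedFiles).
    return {
        tok.decode(): len(re.findall(re.escape(tok) + rb"(?![A-Za-z])", data))
        for tok in TOKENS
    }
```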


Document Profile

Average size and average page count were broadly in line with the prior months. Metadata completeness remained around three-quarters of fields populated on the average file. The share of documents with creation dates removed entirely ticked up slightly compared to March — a continuation of a trend we have been tracking, since a missing creation date removes one of the cleaner forensic anchors in the file.
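
Both measures in this paragraph are straightforward to reproduce on your own files. Here is a sketch using pypdf; the key list is the standard document-information dictionary from the PDF specification, while the completeness framing is ours.

```python
# Sketch: metadata completeness over the standard Info-dictionary keys,
# plus the specific "creation date removed entirely" condition.
from pypdf import PdfReader

# The document information dictionary keys defined by the PDF spec.
INFO_KEYS = ("/Title", "/Author", "/Subject", "/Keywords",
             "/Creator", "/Producer", "/CreationDate", "/ModDate")


def metadata_completeness(path: str) -> float:
    """Fraction of standard Info keys that are populated."""
    info = PdfReader(path).metadata
    if info is None:
        return 0.0
    return sum(1 for key in INFO_KEYS if info.get(key)) / len(INFO_KEYS)


def creation_date_removed(path: str) -> bool:
    """True when the file carries no /CreationDate at all."""
    info = PdfReader(path).metadata
    return info is None or info.get("/CreationDate") is None
```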


Summary

April 2026 in proportional terms:

  • The flagged-document share rose past one in two — continuing the climb from prior months.
  • Eighteen algorithm versions shipped, broadening detection coverage materially. Some part of the rise in flagged share is attributable to that wider coverage, not just to a change in the population.
  • New tampering categories — character-encoding manipulation, residual prior-template traces, generator-fingerprint contradictions, incomplete redaction, impossible timestamps — are becoming a meaningful share of flagged findings in their own right.
  • Files carrying incremental updates were overwhelmingly flagged as modified — the rate has gone up every month for three months running.
  • Digital signatures continued to be modified after signing in a non-trivial share of signed submissions.
  • The same online manipulation services and design tools that featured in the March writeup continued to appear, in proportional terms, as recurring fingerprints in the flagged set.

For absolute volume readers: April included a large internal adversarial-testing batch used to train the algorithm, so the raw counts for the month are not comparable to February or March. The proportions and trend directions above are what the data supports.

May will be a cleaner month for like-for-like comparison.


This report covers checks submitted to HTPBE in April 2026. File contents are not stored or analyzed; only structural metadata signals are retained. All figures are aggregate and anonymized.

