AI-Generated Document Detection — Catch Generator-Tool PDFs
AI-generated receipts, payslips, and bank statements are passing visual review — your existing tools check only the text, not how the file was made. Fraud-ops, AP, claims, and HR teams started seeing AI-generated PDFs at scale in 2024. The documents look right. The text passes OCR. Content classifiers are inconsistent. What's missing is a check on the file's structural layer: where did this PDF actually come from? AI tools render through standard toolchains (Chrome Headless, Puppeteer, wkhtmltopdf, ReportLab) that leave recognisable producer fingerprints — fingerprints that institutional billing systems and payroll engines never produce. We don't classify AI-written text; we check for the rendering-toolchain fingerprint the institutional source would have left. Read the honest scope below.
htpbe? analyzes the structural layer of the PDF file — the producer/creator metadata, the xref chain, the digital signature state, font subsets, image streams. We do NOT run an AI content classifier on the text inside the PDF. We do NOT decide whether words were 'written by AI'. What we DO is detect when a PDF lacks the institutional-issuer fingerprint real documents carry — which catches the high-volume, technically unsophisticated AI-rendered PDFs that pass visual review today.
We will NOT catch: an AI-generated PDF that has been printed, scanned, and re-saved through a real institutional workflow (the AI fingerprints are gone); a sophisticated AI tool that successfully spoofs a real institutional producer string in metadata; AI-generated text that a human pasted into Word and exported (we cannot tell that text is AI-written from file structure alone). For those scenarios, defence-in-depth means pairing htpbe? with content classifiers and manual review.
One REST call, one deterministic verdict
Upload the PDF. The API returns INTACT, MODIFIED, or INCONCLUSIVE with named markers — in about three seconds.
How AI-rendered PDFs typically look at the file layer
Three real fraud mechanics and how each looks at the structural PDF layer.
AI tool renders a 'receipt' or 'invoice' through a headless browser
A user prompts an AI tool to 'generate a hotel folio for Marriott Times Square'. The tool outputs an HTML render and exports through Chrome Headless or Puppeteer. The producer field is the headless browser; there is no Marriott PMS metadata in the file. Real Marriott folios carry the PMS producer signature.
AI tool exports through a PDF library
AI assistants use libraries like wkhtmltopdf, ReportLab, jsPDF, or PDFKit to produce PDFs. These leave recognisable producer strings — distinct from any payroll, banking, or government issuer. Single-session, no incremental update, no institutional metadata.
AI-generated text pasted into Word and exported
Honest answer: we typically cannot distinguish this from a human-typed Word document. Both produce a Microsoft Word producer signature, single-session export. The verdict is INCONCLUSIVE — same as any Word-authored document. Whether INCONCLUSIVE is a fraud signal depends on document context (a Word 'W-2' is suspicious; a Word reference letter from a small employer might be legitimate).
Why your existing checks miss this
Content classifiers see the text. They don't see how the PDF was rendered.
And content classifiers tuned for AI text don't transfer cleanly to PDFs.
OCR and rule-based document platforms extract data — they cannot tell whether the underlying PDF was issued by a real merchant or rendered by an AI tool. AI text classifiers (GPTZero and similar) are inconsistent on PDF documents because the structural layer carries different signals than free-text. htpbe? inspects the file structure — producer, metadata, xref, image streams — and reports what it sees. Pair us with a content classifier for full coverage: classifiers handle the language layer, we handle the file layer.
Five forensic layers, one deterministic verdict
Every PDF we receive passes through the same structural pipeline — no model training, no thresholds to tune.
Metadata analysis
Creation and modification timestamps, producer and creator fields, XMP metadata — the first layer exposes basic tampering.
File structure
Xref tables, trailer chain, incremental updates. Any edit after export leaves a structural fingerprint here.
Digital signatures
Signature chain integrity and post-signature modifications produce deterministic markers. These are certainty-level signals, not probabilistic scores.
Content integrity
Fonts, objects, embedded content, page assembly. Multi-session edits and inserted objects are visible at this layer.
Verdict with markers
Deterministic output: INTACT / MODIFIED / INCONCLUSIVE, with named markers for every finding — suitable for audit trail.
AI-rendered PDFs we typically flag (via producer/metadata)
Every type listed below is analyzed at the structural file layer — not the rendered image.
Detection capabilities
Deterministic structural signals. No probabilistic scores, no model training.
Producer signature reveals the rendering toolchain
AI-generated PDFs typically render through a headless browser (Chrome Headless, Puppeteer, Playwright) or a PDF library (wkhtmltopdf, ReportLab, PDFKit, jsPDF). These leave producer strings that are distinct from authentic issuer producers (payroll engines, EHR billing, banking portals, government systems). We surface the producer field; you interpret it against the document type.
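A minimal sketch of how an integrating application might classify the producer field against known rendering toolchains (the function name and the toolchain list below are illustrative, not an official htpbe? list):

```python
# Sketch: classify the "producer" field returned in the API response.
# The substring list is a starting point -- extend it from your own cases.
RENDER_TOOLCHAINS = (
    "chrome", "chromium", "headless", "puppeteer", "playwright",
    "wkhtmltopdf", "reportlab", "pdfkit", "jspdf",
)

def is_render_toolchain(producer):
    """True when the producer string matches a known rendering toolchain."""
    if not producer:
        return False
    p = producer.lower()
    return any(tool in p for tool in RENDER_TOOLCHAINS)

print(is_render_toolchain("Puppeteer (Chrome 124.0)"))   # True
print(is_render_toolchain("ADP Payroll Engine 11.2"))    # False (hypothetical issuer string)
```

A match is not a verdict on its own — interpret it against the document type, as the paragraph above describes.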
INCONCLUSIVE verdict is the typical signal
AI-rendered PDFs almost never trigger MODIFIED — there is no edit trail to confirm. They trigger INCONCLUSIVE: 'this PDF does not have institutional-issuer fingerprints'. In context (a 'receipt' that should come from a real POS, a 'bank statement' that should come from a banking system), INCONCLUSIVE is a strong fraud-positive signal.
Single-session creation pattern
AI-generated PDFs are produced in one shot — CreationDate equals ModDate, single xref table, no incremental update history. Real institutional production systems often carry richer history with incremental updates.
AI-edited regions in real PDFs trigger MODIFIED
When an AI-generated region is pasted into a real PDF, the file shows an incremental update trail and the verdict becomes MODIFIED. The detection works regardless of whether the inserted content was AI-made or human-typed — we detect the EDIT, not the AI origin.
Image-stream artefacts in pasted AI logos and headers
AI tools that paste merchant logos or letterhead images leave compression artefacts that differ from authentic embedded headers. Image-stream metadata exposes the difference where the AI tool reused stock images.
Font subset divergence across pages
Multi-page AI-rendered documents often show font subset prefix shifts between pages — a fingerprint of multi-call generation rather than single-session institutional export.
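Subset fonts embed a six-letter prefix ('ABCDEF+Helvetica'). A sketch of the prefix-shift check, assuming you have already extracted the base font names per page by some means:

```python
# Sketch: spot font-subset prefix shifts across pages. A single-session
# export tends to reuse one prefix per face; multi-call generation shifts it.
def subset_prefix_shifts(fonts_per_page):
    seen = {}  # base face -> set of subset prefixes observed
    for page_fonts in fonts_per_page:
        for name in page_fonts:
            if "+" in name:
                prefix, face = name.split("+", 1)
                seen.setdefault(face, set()).add(prefix)
    # Faces that appear under more than one prefix are the fingerprint.
    return {face for face, prefixes in seen.items() if len(prefixes) > 1}

pages = [["ABCDEF+Helvetica"], ["GHIJKL+Helvetica"], ["ABCDEF+Courier"]]
print(subset_prefix_shifts(pages))  # {'Helvetica'}
```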
Two HTTP calls — and you read the producer field yourself
Buyers can skip this section. Developers: the integration is two HTTP calls.
Step 1 — submit the PDF
curl -X POST https://api.htpbe.tech/v1/analyze \
-H "Authorization: Bearer $HTPBE_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://your-storage/suspicious-receipt.pdf"}'

Step 2 — read the verdict and the producer field
{
"id": "a1i2g3e4-5n6e7r-8a9t-0e0d-z1z2z3z4z5z6",
"status": "inconclusive",
"modification_confidence": "none",
"modification_markers": [
"Headless-browser producer detected (Puppeteer)",
"Single-session creation — no institutional metadata",
"No incremental update trail"
],
"producer": "Puppeteer (Chrome 124.0)",
"creator": "Puppeteer (Chrome 124.0)",
"creation_date": 1707350400,
"modification_date": 1707350400,
"has_digital_signature": false,
"xref_count": 1,
"has_incremental_updates": false
}

htpbe? returns inconclusive — there is no edit trail (so MODIFIED isn't justified), and the file lacks institutional metadata (so INTACT isn't either). The producer field shows Puppeteer — a headless-browser rendering toolchain commonly used by AI tools. For a receipt that should have come from a hotel PMS or POS system, this combination is a strong fraud signal. Your application reads the producer field and applies the rule: 'receipts from this issuer must have producer X; everything else is suspect'. We don't make that judgement for you — we surface the data.
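A sketch of that application-side rule (the allowlist entries are hypothetical placeholders — fill them with producer strings you have observed from each real issuer):

```python
# Sketch: map an htpbe? response to an ops decision per document type.
# Allowlist values below are hypothetical examples, not real issuer strings.
EXPECTED_PRODUCERS = {
    "hotel_folio": ("opera pms",),
    "bank_statement": ("crystal reports",),
}

def review(doc_type, resp):
    """'receipts from this issuer must have producer X; else suspect'."""
    if resp["status"] == "modified":
        return "reject"
    producer = (resp.get("producer") or "").lower()
    expected = EXPECTED_PRODUCERS.get(doc_type, ())
    if any(e in producer for e in expected):
        return "accept"
    return "manual_review"  # e.g. INCONCLUSIVE with a Puppeteer producer

print(review("hotel_folio", {"status": "inconclusive",
                             "producer": "Puppeteer (Chrome 124.0)"}))
# -> manual_review
```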
Customer Stories
Teams that stopped document fraud
Compliance, finance, and risk teams use htpbe? to catch manipulated PDFs before they become costly mistakes.
Caught an invoice where the total had been changed by less than a thousand dollars. Without this I would have approved it without a second look.
Sarah M.
AP Manager
United States
We had three applicants in the same week with bank statements that looked completely fine. Two of them were flagged as modified. You simply cannot see this by reading the document — it is in the file structure.
Lars V.
Risk Analyst, Online Lending
Netherlands
Salary slips were coming with altered figures. We identified two problematic files before the placement was finalised.
Priya K.
HR Operations Lead
India
Since we started checking documents this way, we stopped two applications early in the process that would have been very difficult to reverse later.
Julien R.
Fraud Analyst, Fintech
France
Some applicants were sending PDFs that looked authentic but had been edited in ways not visible to the eye. We now ask for verified originals when something is flagged. Already saved us from a few bad decisions.
Marta S.
Compliance Coordinator
Spain
One invoice was caught because there was a mismatch between the document dates and structure. That particular case would have cost us significantly.
Tariq A.
Finance Manager
United Arab Emirates
Related solutions and guides
Fake Receipt Detection
Receipts are the highest-volume AI-rendered PDF category — focused treatment of receipts specifically.
Medical Bill Tamper Detection
Same claims cluster — forensics for tampered medical bill PDFs in insurance and expense workflows.
Invoice Fraud
AI-rendered vendor invoices entering AP pipelines — fraud-ops angle.
Insurance Claims
AI-rendered supporting documents in property and travel claims — claims-ops angle.
Secure your workflow
Create your account — API key on signup, free test environment on every plan.
From $15/mo. No sales call. Cancel any time.