
Detect AI-Generated PDFs: What Works and What Does Not

HTPBE Team · 10 min read

This article is a snapshot — content was accurate as of May 2026 (code examples tested against the API as of April 2026). The product evolves actively; specific counts, examples, and detection rules may have changed since publication — see the changelog for the current state.

Accounts payable teams are receiving receipts generated by ChatGPT plugins. HR platforms are seeing payslips rendered by Python scripts. Insurance claims contain repair estimates that no shop ever issued. The documents look correct. The logos match. The numbers are plausible.

The question is: what can actually be detected, and what cannot?

The honest answer requires separating two things that are often confused under the phrase “AI-generated document detection.”

Two distinct problems called "AI-generated document detection"

When people ask how to detect an AI-generated document, they usually mean one of two distinct things:

Content classification asks: was the text in this document written by an AI language model? This is what tools like GPTZero and Turnitin’s AI detector do. They analyze writing style, token probability distributions, and linguistic patterns to estimate whether a human or a model produced the text.

Structural forensics asks: was this PDF file generated by a real institutional system, or did it come from a headless browser, a PDF library, or a consumer tool? This is what HTPBE does. It reads the binary structure of the file — producer metadata, xref patterns, font embedding, object numbering — and checks whether those patterns match how legitimate institutional software generates documents.

These are not the same problem. A document can contain AI-written text and still come from a real corporate system. A document can contain entirely human-written text and still have been rendered by Puppeteer an hour ago. The structural check and the content check answer different questions.

HTPBE does structural forensics. It does not classify text. This article explains what that distinction means in practice, what the structural approach reliably catches, and where its limits are.

What structural forensics detects

When an AI tool generates a PDF, it must render that PDF using some software. The rendering layer almost always leaves a producer fingerprint.

The most common rendering paths for AI-generated documents in fraud scenarios:

Headless browsers (Chrome Headless, Puppeteer, Playwright) are used when a fraudster builds an HTML template — often copied from a legitimate document they scanned or photographed — and renders it to PDF using a browser. Chrome Headless has a characteristic producer string: Chromium, Chrome, or a Puppeteer-generated variant that typically includes the Chrome version. These strings are recognizable and are cross-referenced against known institutional producers.

Python and Node.js PDF libraries (ReportLab, PDFKit, jsPDF, fpdf2, WeasyPrint) are used when someone generates a document programmatically — either directly or as part of an AI tool’s export pipeline. ReportLab’s producer string is ReportLab PDF Library. PDFKit’s is PDFKit. jsPDF writes jsPDF. None of these strings appear in documents genuinely issued by banks, payroll processors, or insurance carriers.

wkhtmltopdf is an older HTML-to-PDF tool that remains common in automated document generation pipelines. Its producer string is wkhtmltopdf.

Online “AI document generators” that export to PDF typically use one of the tools above internally. The producer field reflects the underlying renderer, not the AI layer on top.

When HTPBE analyzes a submitted PDF, it compares the Producer field against a database of known institutional generators — the software that real banks, payroll platforms, accounting systems, and government agencies use to produce documents. A mismatch between the claimed document type and the actual rendering software is a strong fraud signal.

A payslip generated by ReportLab does not look like a payslip generated by Sage Payroll or ADP Workforce Now at the structural level. Both may look identical visually. The binary layer tells a different story.
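
Reading the field takes a few lines. Here is a minimal sketch using the open-source pypdf library (pypdf is a choice made here for illustration; HTPBE's analysis reads the binary structure directly rather than relying on the metadata dictionary alone):

import sys
from pypdf import PdfReader

# Producer substrings typical of consumer / headless renderers (illustrative, not exhaustive)
CONSUMER_TOOLS = ("chromium", "chrome", "puppeteer", "playwright",
                  "reportlab", "pdfkit", "jspdf", "fpdf", "wkhtmltopdf")

reader = PdfReader(sys.argv[1])
info = reader.metadata  # the document information dictionary; may be None
producer = (info.producer or "") if info else ""
print(f"Producer: {producer!r}")

if any(tool in producer.lower() for tool in CONSUMER_TOOLS):
    print("Consumer or headless-browser renderer, not an institutional producer")

Reading the string is the easy part. The value is in the comparison database of known institutional producers, which is the part the API supplies.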

What an AI-generated PDF looks like in a forensic response

Below is a real API response for a payslip submitted to a lending platform. The file was generated by a Puppeteer-based AI document tool and submitted as proof of income.

{
  "status": "inconclusive",
  "modification_confidence": "none",
  "modification_markers": [],
  "creator": null,
  "producer": "Chromium (Chrome 124.0)",
  "origin_type": "consumer",
  "creation_date": null,
  "modification_date": null,
  "xref_count": 1
}

The verdict is inconclusive, not modified. There is no evidence this file was edited after creation — because it was never edited. It was created in its current form, in a single render pass, by a headless browser.

The producer field is Chromium (Chrome 124.0). A payslip from a real employer does not come from a headless Chrome instance. The origin_type is consumer. creation_date is null because Puppeteer does not set it by default.

This is the correct interpretation of inconclusive in an AI fraud context: the document shows no modification markers because it was never a real document that was later modified. It was fabricated from nothing. The absence of institutional metadata is itself the signal.

What INCONCLUSIVE means when you expected an institutional document

inconclusive from HTPBE means: this document was created by consumer or non-institutional software, and we cannot determine whether it was modified after creation because there is no institutional baseline to compare against.

For user-generated documents — cover letters, personal statements, forms the applicant completed themselves — inconclusive is expected and is not a fraud signal. A person who writes their cover letter in Google Docs and exports it to PDF will produce an inconclusive result. That is correct behavior.

For documents that claim institutional origin, inconclusive is a strong fraud signal. The reasoning:

  • A bank statement from HSBC is generated by HSBC’s document management system. It does not come from Puppeteer.
  • A payslip from a company using ADP, Xero, or Gusto is generated by that platform’s PDF renderer. It does not come from jsPDF.
  • A tax certificate from a government agency is generated by that agency’s systems. It does not come from wkhtmltopdf.

If your workflow receives documents that claim to be bank statements, payslips, or official certificates, and those documents return inconclusive with a consumer or headless-browser producer, do not accept them. The document’s own metadata contradicts its claimed origin.
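
In code, that policy is a short routing function: submit the file, retrieve the verdict, then branch on the verdict plus the claimed document type.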

import os
import httpx

API_KEY = os.environ["HTPBE_API_KEY"]
BASE_URL = "https://api.htpbe.tech/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

INSTITUTIONAL_DOC_TYPES = {"bank_statement", "payslip", "tax_certificate", "insurance_policy"}

CONSUMER_PRODUCERS = {
    "chromium", "chrome", "puppeteer", "playwright",
    "reportlab", "pdfkit", "jspdf", "fpdf", "wkhtmltopdf",
    "weasyprint",
}


def verify_document(pdf_url: str, doc_type: str) -> dict:
    # Step 1: submit
    r = httpx.post(f"{BASE_URL}/analyze", headers=HEADERS, json={"url": pdf_url}, timeout=30)
    r.raise_for_status()
    check_id = r.json()["id"]

    # Step 2: retrieve
    r2 = httpx.get(f"{BASE_URL}/result/{check_id}", headers=HEADERS, timeout=30)
    r2.raise_for_status()
    result = r2.json()

    # Route based on verdict + document type
    if result["status"] == "modified":
        return {"action": "reject", "reason": "post_creation_modification", "check_id": check_id}

    if result["status"] == "inconclusive" and doc_type in INSTITUTIONAL_DOC_TYPES:
        producer = (result.get("producer") or "").lower()
        is_consumer_origin = any(tool in producer for tool in CONSUMER_PRODUCERS)
        reason = "ai_or_consumer_origin" if is_consumer_origin else "missing_institutional_metadata"
        return {"action": "reject", "reason": reason, "check_id": check_id}

    return {"action": "accept", "check_id": check_id}

What HTPBE cannot detect

Being clear about the limits of this approach matters. Overstating what structural forensics catches creates false confidence.

Printed and re-scanned AI documents. If someone generates a PDF with an AI tool, prints it, and scans it back to PDF, the structural fingerprints are gone. The scanner produces a new PDF — with its own producer and its own structure — containing image pages. The analysis will return inconclusive (scanned origin), which is technically correct but loses the AI-rendering signal. This is a known limitation and requires a different layer: image quality analysis, font rendering artifact detection, or manual review.

Sophisticated producer spoofing. The Producer field is a plain text string. A determined attacker who knows the detection approach can hardcode a string like Adobe PDF Library 15.0 or Oracle PDF Renderer into their fake document generator. This would defeat producer-based detection. Countering it requires checking multiple structural signals together — object numbering patterns, font embedding methods, XMP metadata consistency — rather than relying on the producer string alone. HTPBE runs multiple analysis layers, but a sophisticated attacker who specifically targets the detection system can evade individual signals.
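
To illustrate why combined signals are harder to spoof than one, here is a rough sketch that scans the raw bytes of a file for several independent markers. This is a simplification for illustration, not HTPBE's actual analysis layers:

from pathlib import Path

def structural_signals(pdf_path: str) -> dict:
    """Collect independent structural markers from the raw bytes."""
    data = Path(pdf_path).read_bytes()
    return {
        # The producer string alone is trivially spoofable
        "claims_adobe_producer": b"Adobe" in data,
        # Each incremental update appends another startxref keyword;
        # single-pass renders from PDF libraries usually have exactly one
        "xref_sections": data.count(b"startxref"),
        # Institutional generators almost always write an XMP packet;
        # many consumer libraries omit it unless explicitly configured
        "has_xmp_metadata": b"<x:xmpmeta" in data,
        # /FontFile, /FontFile2, /FontFile3 indicate embedded font programs
        "has_embedded_fonts": b"/FontFile" in data,
    }

A file whose producer claims Adobe PDF Library but which carries no XMP packet and a single-pass xref layout is internally inconsistent. The inconsistency across signals, not any one field, is what resists spoofing.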

AI text pasted into Word then exported to PDF. If someone uses an AI to write text, pastes it into Microsoft Word, and exports to PDF, the resulting file looks like any Word-to-PDF export. The origin is consumer (Word), which is inconclusive but not alarming on its own for documents expected to come from Word. This case requires content-layer analysis.

Documents generated by the same software as legitimate issuers. If a fraudster gains access to Sage Payroll, generates a payslip for a fake employee, and exports it, the structural signals will look legitimate. The file came from the right software. Detecting this requires checking the content with the issuer — structural forensics alone cannot distinguish a real Sage payslip from a fraudulent one generated on a compromised Sage account.

The recommended stack for AI document fraud

No single layer catches everything. The approach that covers the most ground combines:

Structural forensics (HTPBE) handles the file layer: modified documents, consumer-origin documents submitted as institutional, headless-browser renders, and PDF-library-generated fakes. This runs first — it is fast, cost-effective, and catches the majority of operational fraud. See the AI-generated document detection page for a complete breakdown of what the file layer covers.

Content classification (GPTZero, Originality.ai, or a fine-tuned classifier for your document type) handles the text layer: detecting AI-written prose in documents where the writing itself is the fraud signal — reference letters, employment checks, academic submissions.

Issuer verification handles the ground-truth layer: contacting the bank, payroll provider, or issuing authority to confirm the document was actually issued. This is costly at scale but appropriate for high-value decisions.

The practical sequence for a lending or HR platform processing document submissions:

  1. Run structural forensics first. Immediately reject modified results and inconclusive results with a consumer producer for institutional document types. This eliminates the majority of fraudulent submissions without manual effort.
  2. For documents that pass the structural check, run a content classifier if the document type warrants it (reference letters, personal statements, professional certifications with written attestations).
  3. For high-value decisions that clear both automated checks, spot-check with issuer verification. (The full sequence is sketched below.)
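
As a sketch, the three layers might be orchestrated like this, reusing verify_document from earlier. classify_text and confirm_with_issuer are placeholders for whatever content classifier and issuer channel you use; they are not part of the HTPBE API:

TEXT_SENSITIVE_TYPES = {"reference_letter", "personal_statement", "attestation"}
HIGH_VALUE_THRESHOLD = 50_000  # e.g. loan amount; tune to your risk appetite

def triage(pdf_url: str, doc_type: str, value: int) -> dict:
    # Layer 1: structural forensics (fast, cheap, runs on everything)
    decision = verify_document(pdf_url, doc_type)
    if decision["action"] == "reject":
        return decision

    # Layer 2: content classification, only where the prose is the fraud signal
    if doc_type in TEXT_SENSITIVE_TYPES and classify_text(pdf_url) == "ai_written":
        return {"action": "review", "reason": "ai_written_text"}

    # Layer 3: issuer verification, reserved for high-value decisions
    if value >= HIGH_VALUE_THRESHOLD and not confirm_with_issuer(pdf_url, doc_type):
        return {"action": "reject", "reason": "issuer_denied"}

    return {"action": "accept"}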

Who needs this

Accounts payable teams processing invoice and receipt submissions from vendors or employees: the primary AI fraud vector is fabricated receipts and invoices generated by AI tools. Structural forensics catches headless-browser and PDF-library renders before they enter the approval queue.

HR platforms and background check providers: AI-generated reference letters, diploma supplements, and employment verification documents are increasingly common. Producer-field analysis alone is not sufficient here (the text also matters), but it catches the lowest-effort fabrications — documents rendered by the wrong software for their claimed origin.

Insurance claims operations: repair estimates, medical bills, and supporting documentation submitted by claimants are a high-fraud category. AI tools reduce the effort required to fabricate a plausible-looking estimate. Structural forensics identifies documents that did not come from the claimed issuer’s systems.

Lending and fintech compliance teams: bank statements and payslips are the most-targeted document types. The structural check is a necessary first layer before any income or asset verification workflow. See the PDF authenticity API documentation and pricing.

