PDF Security Blog

Detect PDF Tampering Programmatically: Developer Guide

HTPBE Team·07.04.2026·10 min read

This article is a snapshot — content was accurate as of April 2026 (code examples tested against the API as of April 2026). The product evolves actively; specific counts, examples, and detection rules may have changed since publication — see the changelog for the current state.

PDF fraud is a backend problem. It happens before your business logic runs — at the moment your application accepts a document it has no reason to distrust. By the time a reviewer opens a bank statement, invoice, or diploma, the data from that document may have already influenced an automated decision.

This guide is about catching that before it reaches your business layer. We will look at what signals inside a PDF reveal tampering, and how to integrate forensic PDF tamper detection into a backend application using the HTPBE? API. For language-specific walkthroughs with complete production-ready code, see the Node.js integration guide or the Python integration guide. If you would rather check a single file by hand first — no code — the free PDF tamper detection checker runs the same analysis in the browser.

What makes PDF tampering detectable

PDF is not a flat format. When a PDF is created or edited, it leaves a structural record of what happened and when. Most editors do not erase these records — they append to the file. A forensic analysis does not need the original document to detect changes; it reads what the file itself preserved.

The four main signal categories:

Metadata timestamps. Most PDFs record a creation date and a modification date in the Info dictionary and, optionally, in an XMP metadata stream. When these dates are inconsistent (modification date before creation date, creation date in the future, or the two timestamps differing by months), it indicates the metadata was modified after the fact. A 15-second tolerance separates legitimate timestamp imprecision from deliberate manipulation.

Incremental update structure. PDF supports appending changes to a file without rewriting it. Each append adds a new cross-reference (xref) table to the file. A document with five xref tables has been modified four times after initial creation. The xref count alone does not prove tampering, but in combination with other signals it is a strong indicator.

Digital signature integrity. Many institutional documents — contracts, financial statements, certified reports — carry digital signatures. A signed PDF can still be modified by appending content after the signature. The signature remains cryptographically valid for the content it covers, but the document now contains unsigned content the signer never approved. This pattern — modifications after an existing signature — is one of the highest-confidence tampering indicators.

Producer and creator inconsistency. The Producer field identifies the software that generated the file. When a document claims to come from a document management system but the Producer field names a consumer PDF editor, something changed hands between creation and submission. Known-tool databases allow distinguishing institutional generators (Adobe Acrobat Server, Microsoft Office, DocuSign) from editing tools that are rarely used to create original documents.
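As a rough illustration of the metadata timestamp category above, the rules can be written in a few lines. This is a minimal sketch, not HTPBE?'s actual implementation: the function name `timestamp_flags`, the flag strings, and the use of Unix epoch seconds are assumptions made for the example. Only the 15-second tolerance comes from the text.

```python
from datetime import datetime, timezone
from typing import List, Optional

TOLERANCE_SECONDS = 15  # tolerance quoted in the article


def timestamp_flags(created: int, modified: int, now: Optional[int] = None) -> List[str]:
    """Return flags for implausible creation/modification timestamps.

    All arguments are Unix epoch seconds (an assumption of this sketch).
    """
    if now is None:
        now = int(datetime.now(timezone.utc).timestamp())

    flags = []
    # A modification date earlier than the creation date is an impossible sequence.
    if modified < created - TOLERANCE_SECONDS:
        flags.append("modified_before_created")
    # A creation date later than the current time cannot be genuine.
    if created > now + TOLERANCE_SECONDS:
        flags.append("created_in_future")
    return flags
```

An empty list means neither rule fired. It does not mean the document is clean, since the other signal categories still apply.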

Taken alone, most of these signals are not conclusive. A forensic analysis aggregates them into a verdict: INTACT, MODIFIED, or INCONCLUSIVE.
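The two structural signals described above (incremental updates and content appended after a signature) can be approximated with a byte-level scan. The sketch below is a heuristic for building intuition, not the analysis the API performs, and both function names are hypothetical. It assumes each saved revision ends with a `startxref` keyword and that a signature's `/ByteRange` array has the form `[offset1 length1 offset2 length2]`. Linearized files typically contain an extra `startxref`, and legitimate additions after signing (such as a second signature or validation data) also leave trailing bytes, so treat the output as a hint only.

```python
import re
from typing import Optional

# Matches a signature dictionary entry such as: /ByteRange [0 840 960 240]
BYTE_RANGE = re.compile(rb"/ByteRange\s*\[\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*\]")


def count_revisions(pdf_bytes: bytes) -> int:
    """Approximate the number of saved revisions by counting startxref keywords."""
    return pdf_bytes.count(b"startxref")


def bytes_after_last_signature(pdf_bytes: bytes) -> Optional[int]:
    """Return how many bytes lie beyond the region covered by any signature.

    Returns None when no /ByteRange entry is found (the file looks unsigned).
    A positive value means something was appended after the last signature.
    """
    # The signed region ends at offset2 + length2 of each ByteRange.
    covered_ends = [
        int(m.group(3)) + int(m.group(4)) for m in BYTE_RANGE.finditer(pdf_bytes)
    ]
    if not covered_ends:
        return None
    return len(pdf_bytes) - max(covered_ends)
```

Taking the maximum over all `/ByteRange` entries means a document signed twice is measured against its most recent signature rather than its first.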

The API flow

The PDF tamper detection API uses a two-step flow. First, you submit a PDF URL and receive a check ID. Second, you retrieve the result using that ID. Both steps use standard HTTP with Bearer token authentication.

The document must be reachable by URL without additional authentication: either a direct public URL to the file or a presigned URL from your storage provider (S3, Google Cloud Storage, Cloudflare R2, Vercel Blob). HTPBE? downloads the PDF server-side, so the URL only needs to remain valid long enough for that download.

cURL

# Step 1 — submit the PDF URL for analysis
curl -X POST https://api.htpbe.tech/v1/analyze \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://your-bucket.s3.amazonaws.com/documents/contract.pdf"}'

Response:

{"id": "ck_9f4a2e1b-3d7c-4a8e-b1f2-9e0d3c5a7b8f"}

# Step 2 — retrieve the verdict
curl https://api.htpbe.tech/v1/result/ck_9f4a2e1b-3d7c-4a8e-b1f2-9e0d3c5a7b8f \
  -H "Authorization: Bearer YOUR_API_KEY"

Response (abbreviated):

{
  "id": "ck_9f4a2e1b-3d7c-4a8e-b1f2-9e0d3c5a7b8f",
  "status": "modified",
  "modification_confidence": "high",
  "modification_markers": ["HTPBE_MULTIPLE_REVISION_LAYERS", "HTPBE_EDITING_TOOL_FINGERPRINT"],
  "xref_count": 4,
  "has_digital_signature": false,
  "creator": "Microsoft Word",
  "producer": "iLovePDF",
  "creation_date": 1704067200,
  "modification_date": 1709251200
}

The modification_markers array tells you exactly which signals triggered the verdict — not just that the document is suspect, but why.

Python

import os
import httpx  # pip install httpx

API_KEY = os.environ["HTPBE_API_KEY"]
BASE_URL = "https://api.htpbe.tech/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}


def verify_pdf(pdf_url: str) -> dict:
    """Submit a PDF URL and return the full analysis result."""
    # Step 1: submit
    submit_response = httpx.post(
        f"{BASE_URL}/analyze",
        headers=HEADERS,
        json={"url": pdf_url},
        timeout=30,
    )
    submit_response.raise_for_status()
    check_id = submit_response.json()["id"]

    # Step 2: retrieve result
    result_response = httpx.get(
        f"{BASE_URL}/result/{check_id}",
        headers=HEADERS,
        timeout=30,
    )
    result_response.raise_for_status()
    return result_response.json()


def route_document(pdf_url: str) -> str:
    """Return an action based on the PDF forensic verdict."""
    result = verify_pdf(pdf_url)
    status = result["status"]

    if status == "intact":
        return "accept"
    elif status == "modified":
        markers = result.get("modification_markers", [])
        print(f"Tampering detected: {', '.join(markers)}")
        return "reject"
    else:  # inconclusive
        # Consumer software origin — route to manual review
        return "manual_review"

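The examples use httpx, but requests works the same way for these calls: replace httpx.post with requests.post and httpx.get with requests.get. The exception classes differ between the two libraries, so error handling written against httpx (such as httpx.HTTPStatusError) needs the requests equivalents.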
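Once a result dictionary is in hand, it is often useful to condense it into a single log line. The sketch below does that for the abbreviated response shown in the cURL section. The helper name `summarize` is hypothetical, and the code assumes that `creation_date` and `modification_date` are Unix epoch seconds in UTC, which the sample values suggest but the response itself does not state.

```python
from datetime import datetime, timezone

# The abbreviated response from the cURL example, as a Python dict.
SAMPLE = {
    "status": "modified",
    "modification_confidence": "high",
    "modification_markers": [
        "HTPBE_MULTIPLE_REVISION_LAYERS",
        "HTPBE_EDITING_TOOL_FINGERPRINT",
    ],
    "creation_date": 1704067200,
    "modification_date": 1709251200,
}


def _iso_date(epoch_seconds):
    """Format epoch seconds as a UTC calendar date, or 'unknown' when missing."""
    if epoch_seconds is None:
        return "unknown"
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).date().isoformat()


def summarize(result: dict) -> str:
    """Build a one-line, human-readable summary of an analysis result."""
    markers = ", ".join(result.get("modification_markers", [])) or "none"
    confidence = result.get("modification_confidence") or "n/a"
    return (
        f"{result['status']} ({confidence}): {markers}; "
        f"created {_iso_date(result.get('creation_date'))}, "
        f"modified {_iso_date(result.get('modification_date'))}"
    )


print(summarize(SAMPLE))
# ends with: created 2024-01-01, modified 2024-03-01
```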

Handling errors in Python

import httpx

def verify_pdf_safe(pdf_url: str) -> dict | None:
    try:
        return verify_pdf(pdf_url)
    except httpx.HTTPStatusError as e:
        status = e.response.status_code
        if status == 401:
            raise RuntimeError("Invalid HTPBE API key") from e
        if status == 402:
            raise RuntimeError("HTPBE subscription required") from e
        if status == 422:
            # URL did not return a valid PDF
            return None
        raise
    except httpx.TimeoutException:
        # Handle timeout — retry or queue for later
        return None
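The "retry or queue for later" comment above can be made concrete with a small backoff helper. This is one possible sketch, with a hypothetical name and defaults. It retries only on the exception types you pass in, so a permanent failure such as an invalid PDF is not retried. With the code above you would pass `httpx.TimeoutException`; the built-in `TimeoutError` is used in the usage note to keep the example dependency-free.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying with exponential backoff on the given exceptions.

    The last failure is re-raised once all attempts are used up.
    The sleep parameter exists so tests can avoid real waiting.
    """
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise
            # Wait 1x, 2x, 4x, ... the base delay between attempts.
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

A call site might look like `with_retries(lambda: verify_pdf(url), retry_on=(TimeoutError,))`. If every attempt fails, the caller sees the exception and can queue the document for later.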

Node.js / TypeScript

const API_KEY = process.env.HTPBE_API_KEY!;
const BASE_URL = 'https://api.htpbe.tech/v1';

interface HTPBEResult {
  id: string;
  status: 'intact' | 'modified' | 'inconclusive';
  modification_confidence: 'certain' | 'high' | 'none' | null;
  modification_markers: string[];
  xref_count: number;
  has_digital_signature: boolean;
  modifications_after_signature: boolean;
  signature_removed: boolean;
  creator: string | null;
  producer: string | null;
  creation_date: number | null;
  modification_date: number | null;
}

async function verifyPdf(pdfUrl: string): Promise<HTPBEResult> {
  // Step 1: submit
  const submitRes = await fetch(`${BASE_URL}/analyze`, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ url: pdfUrl }),
  });

  if (!submitRes.ok) {
    const body = await submitRes.json().catch(() => ({}));
    throw new Error(`HTPBE submit failed ${submitRes.status}: ${JSON.stringify(body)}`);
  }

  const { id } = await submitRes.json() as { id: string };

  // Step 2: retrieve result
  const resultRes = await fetch(`${BASE_URL}/result/${id}`, {
    headers: { Authorization: `Bearer ${API_KEY}` },
  });

  if (!resultRes.ok) {
    throw new Error(`HTPBE result fetch failed ${resultRes.status}`);
  }

  return resultRes.json() as Promise<HTPBEResult>;
}

// Production usage example
async function handleDocumentSubmission(pdfUrl: string): Promise<'accept' | 'reject' | 'review'> {
  const result = await verifyPdf(pdfUrl);

  switch (result.status) {
    case 'intact':
      return 'accept';

    case 'modified':
      console.log('Tampering markers:', result.modification_markers);
      if (result.modifications_after_signature) {
        console.log('Document was modified after digital signing');
      }
      return 'reject';

    case 'inconclusive':
      // Consumer software origin — may be legitimate, route to human
      return 'review';
  }
}

Reading the result: what each field means

The fields that matter most for routing decisions:

| Field | Type | What it tells you |
| --- | --- | --- |
| `status` | `intact` / `modified` / `inconclusive` | The primary verdict |
| `modification_confidence` | `certain` / `high` / `none` | How confident the verdict is |
| `modification_markers` | `string[]` | Which specific signals triggered the verdict |
| `modifications_after_signature` | `boolean` | Content was added after a valid digital signature |
| `signature_removed` | `boolean` | A digital signature was stripped from the document |
| `xref_count` | `number` | Number of cross-reference tables in the file (the initial save plus one per incremental update) |
| `creator` | `string` or `null` | Software that created the document |
| `producer` | `string` or `null` | Software that last processed the document |

Three markers produce a confidence level of "certain" rather than "high":

HTPBE_POST_SIGNATURE_EDIT: content was added after a digital signature was applied; cryptographically verifiable
HTPBE_SIGNATURE_REMOVED: the signature slot exists but the signature is gone
HTPBE_DATES_DISAGREE: creation and modification dates differ by more than 15 seconds in an impossible sequence

Everything else produces "high" confidence, not "certain". For workflows where false positives are costly (legal proceedings, for example), certain markers warrant automatic rejection while high markers might warrant manual review.
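For a workflow where false positives are costly, that policy could be sketched as follows. The function name and the returned action strings are illustrative choices. Only the `status` and `modification_confidence` values come from the API response described above.

```python
def route_by_confidence(result: dict) -> str:
    """Map a verdict to an action, rejecting automatically only on 'certain'."""
    status = result.get("status")
    if status == "intact":
        return "accept"
    if status == "modified":
        # 'certain' markers justify automatic rejection; 'high' goes to a human.
        if result.get("modification_confidence") == "certain":
            return "reject"
        return "manual_review"
    # 'inconclusive' and any unexpected status: never auto-accept.
    return "manual_review"
```

Sending unknown statuses to manual review keeps the function safe if the set of verdicts ever grows.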

Routing the `inconclusive` verdict

inconclusive does not mean the document is untrustworthy. It means the document was created with consumer software — Microsoft Word, Google Docs, LibreOffice, Canva — and lacks the structural patterns of institutionally-generated documents.

The right routing depends on what you are processing:

Documents that claim institutional origin (bank statements, tax certificates, court filings, insurance policies): inconclusive should be treated the same as modified. A bank statement claiming to come from a financial institution should not be produced by Google Docs.

User-generated documents (forms, applications, letters, CVs): inconclusive is expected and acceptable. Route to normal processing.

function shouldReject(result: HTPBEResult, claimsInstitutionalOrigin: boolean): boolean {
  if (result.status === 'modified') return true;
  if (result.status === 'inconclusive' && claimsInstitutionalOrigin) return true;
  return false;
}
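In a Python pipeline, the same rule can be driven by a document-type lookup instead of a boolean flag. The type names in `INSTITUTIONAL_TYPES` are hypothetical labels chosen for this sketch, based on the examples listed above. Substitute whatever classification your intake step already produces.

```python
# Hypothetical document-type labels for documents that claim institutional origin.
INSTITUTIONAL_TYPES = {
    "bank_statement",
    "tax_certificate",
    "court_filing",
    "insurance_policy",
}


def should_reject(status: str, document_type: str) -> bool:
    """Reject modified files, and inconclusive files that claim institutional origin."""
    if status == "modified":
        return True
    return status == "inconclusive" and document_type in INSTITUTIONAL_TYPES
```

For example, `should_reject("inconclusive", "bank_statement")` is true, while `should_reject("inconclusive", "cv")` is false.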

Testing without real documents

All HTPBE? plans include a test API key. Test keys accept mock URLs that return predictable responses — similar to Stripe test cards. Use these in your test suite to cover every verdict branch without consuming production quota.

# Clean document — returns status: intact
https://api.htpbe.tech/v1/test/clean.pdf

# Tampered document — returns status: modified
https://api.htpbe.tech/v1/test/modified-high.pdf

# Consumer software origin — returns status: inconclusive
https://api.htpbe.tech/v1/test/inconclusive.pdf

# Signature removed — returns status: modified, signature_removed: true
https://api.htpbe.tech/v1/test/signature-removed.pdf

# Modified after signing — returns modifications_after_signature: true
https://api.htpbe.tech/v1/test/modified-medium.pdf

Keep your test key in a .env.test file and never let it touch production flows.
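One way to use the mock URLs in a test suite is a table of expected fields per URL, compared against whatever your client returns. The sketch below encodes only the expectations listed above. The helper name `mismatches` is hypothetical, and actually fetching the results (for example with `verify_pdf()` and a test key) is left to the caller.

```python
# Expected fields for each mock URL, taken from the list above.
MOCK_EXPECTATIONS = {
    "https://api.htpbe.tech/v1/test/clean.pdf": {"status": "intact"},
    "https://api.htpbe.tech/v1/test/modified-high.pdf": {"status": "modified"},
    "https://api.htpbe.tech/v1/test/inconclusive.pdf": {"status": "inconclusive"},
    "https://api.htpbe.tech/v1/test/signature-removed.pdf": {
        "status": "modified",
        "signature_removed": True,
    },
    "https://api.htpbe.tech/v1/test/modified-medium.pdf": {
        "modifications_after_signature": True,
    },
}


def mismatches(result: dict, expected: dict) -> dict:
    """Return {field: (expected, actual)} for every field that differs."""
    return {
        field: (want, result.get(field))
        for field, want in expected.items()
        if result.get(field) != want
    }


# In a real test run (requires a test API key and network access):
#   for url, expected in MOCK_EXPECTATIONS.items():
#       assert mismatches(verify_pdf(url), expected) == {}
```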

What this does not detect

Forensic metadata analysis has limits in two scenarios:

Documents created fraudulently from scratch. If someone fabricates a bank statement using the same software a real bank uses, generates plausible timestamps, and produces a structurally consistent PDF — the file may pass analysis. Forensic analysis catches editing of existing documents and lazy fabrication. Sophisticated forgery from scratch, using professional tools, may require additional signals (visual content analysis, issuer fraud detection).

Encrypted documents. Strongly encrypted PDFs cannot be analyzed for structural signals. The analysis will flag this as inconclusive by necessity.
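If your pipeline has the file bytes available, a quick local check can separate "inconclusive because encrypted" from other inconclusive results. The sketch below looks for an `/Encrypt` entry, which is where the PDF format records encryption settings. It is a byte-level heuristic with a hypothetical function name, not part of the API, and it does not measure how strong the encryption is.

```python
import re

# /Encrypt followed by an indirect reference (e.g. "7 0 R") or an inline dictionary.
ENCRYPT_ENTRY = re.compile(rb"/Encrypt\s*(?:\d+\s+\d+\s+R|<<)")


def looks_encrypted(pdf_bytes: bytes) -> bool:
    """Return True when the file appears to declare an encryption dictionary."""
    return ENCRYPT_ENTRY.search(pdf_bytes) is not None
```

A file flagged here can be routed to a path that asks the submitter for an unencrypted copy instead of being treated as suspicious.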

For most operational workflows (invoice processing, loan applications, recruitment document checks), forensic metadata analysis covers the common case: fraud attempted with off-the-shelf PDF editors rather than sophisticated fabrication tools.

Frequently Asked Questions

Do I need the original PDF to detect tampering through the API?

No. The analysis reads the structural record the file carries with it — timestamps, cross-reference history, signature state, producer and creator fingerprints. You submit only the document under review; there is no “known good” baseline to upload. That is what makes it practical to run on every inbound document at submission time.

How do I detect PDF tampering in Python?

Submit the PDF URL to POST /v1/analyze, take the returned check ID, then read the verdict from GET /v1/result/{id}. The verify_pdf() and route_document() helpers in the Python section above are the complete pattern — call them from your ingestion handler and branch on result["status"]. For a fuller production walkthrough see the Python integration guide.

What is the difference between the modified and inconclusive verdicts?

modified means the file carries positive evidence of post-creation change — a triggered marker, listed in modification_markers. inconclusive means the file lacks the structural fingerprints of an institutional generator (typically consumer software output) and there is nothing to confirm or deny. Route them differently depending on whether the document claims institutional origin — the shouldReject() helper in the routing section above encodes exactly that rule.

Can the API confirm a document was changed after it was digitally signed?

Yes, and it is one of the highest-confidence outcomes: modifications_after_signature: true with modification_confidence: "certain". Because that signal is cryptographically grounded rather than heuristic, it is a safe candidate for automatic rejection in most workflows.

How do I test the integration without real documents?

Use the test key and mock URLs (the testing section above). Each mock URL returns a fixed verdict, so your test suite can cover the intact / modified / inconclusive branches deterministically without spending production quota — the same way you would use Stripe test cards. Start integrating against the API with a test key before you provision a live one.
