Detect PDF Tampering Programmatically: Developer Guide

Code examples verified against the API as of April 2026. If the API has changed since then, check the changelog.
PDF fraud is a backend problem. It happens before your business logic runs — at the moment your application accepts a document it has no reason to distrust. By the time a reviewer opens a bank statement, invoice, or diploma, the data from that document may have already influenced an automated decision.
This guide is about catching that before it reaches your business layer. We will look at what signals inside a PDF reveal tampering, and how to integrate forensic PDF verification into a backend application using the HTPBE API. For language-specific walkthroughs with complete production-ready code, see the Node.js integration guide or the Python integration guide.
What makes PDF tampering detectable
PDF is not a flat format. When a PDF is created or edited, it leaves a structural record of what happened and when. Most editors do not erase these records — they append to the file. A forensic analysis does not need the original document to detect changes; it reads what the file itself preserved.
The four main signal categories:
Metadata timestamps. Most PDFs record a creation date and a modification date in the Info dictionary and, optionally, in the XMP metadata stream. Both entries are optional in the format, so a missing date is not suspicious by itself. When the dates are inconsistent (a modification date earlier than the creation date, a creation date in the future, or the two timestamps differing by months), the file or its metadata was likely changed after the fact. A 15-second tolerance separates legitimate timestamp imprecision from deliberate manipulation.
Incremental update structure. PDF supports appending changes to a file without rewriting it. Each append adds a new cross-reference (xref) section. A document with five xref sections has therefore been updated four times after initial creation. Legitimate operations such as signing or form filling also append updates, so the xref count alone does not prove tampering, but in combination with other signals it is a strong indicator.
Digital signature integrity. Many institutional documents — contracts, financial statements, certified reports — carry digital signatures. A signed PDF can still be modified by appending content after the signature. The signature remains cryptographically valid for the content it covers, but the document now contains unsigned content the signer never approved. This pattern — modifications after an existing signature — is one of the highest-confidence tampering indicators.
Producer and creator inconsistency. The Producer field identifies the software that generated the file. When a document claims to come from a document management system but the Producer field names a consumer PDF editor, something changed hands between creation and submission. Known-tool databases make it possible to distinguish institutional generators (Adobe Acrobat Server, Microsoft Office, DocuSign) from editing tools that are rarely used to create original documents.
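To make the two structural signals above concrete, here is a minimal sketch that inspects raw PDF bytes. It is not how HTPBE implements its analysis; the function names are hypothetical, and both checks are byte-level heuristics. Counting `startxref` keywords approximates the number of saves (linearized files can contain an extra one), and comparing each signature's `/ByteRange` with the file length shows whether bytes exist beyond the signed region.

```python
import re
from typing import Optional

# A signature's /ByteRange is [offset1 length1 offset2 length2]; the signed
# region ends at offset2 + length2.
BYTE_RANGE = re.compile(rb"/ByteRange\s*\[\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*\]")


def count_xref_sections(data: bytes) -> int:
    """Heuristic: each save normally writes one `startxref` keyword."""
    return data.count(b"startxref")


def bytes_after_last_signature(data: bytes) -> Optional[int]:
    """Return how many bytes follow the furthest signed range.

    Returns None when no /ByteRange array is found (the file is unsigned,
    or the signature dictionary is stored in a form this regex cannot see).
    """
    ends = [int(m.group(3)) + int(m.group(4)) for m in BYTE_RANGE.finditer(data)]
    if not ends:
        return None
    return len(data) - max(ends)
```

A result greater than zero from `bytes_after_last_signature` means content was appended after the last signature was applied. That content is worth flagging, but this sketch does not validate the signature itself.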
None of these signals is conclusive in isolation. A forensic analysis aggregates them into a verdict: INTACT, MODIFIED, or INCONCLUSIVE.
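As an illustration of collecting several weak signals before drawing a conclusion, the sketch below turns already-extracted metadata into a list of flags. It is a toy example under stated assumptions, not the HTPBE scoring logic: the function name, the flag names, and the two-entry editor list are hypothetical, and timestamps are assumed to be Unix epoch seconds.

```python
import time
from typing import Optional

TOLERANCE_SECONDS = 15  # the tolerance quoted in the text

# Illustrative assumption only; a real known-tool database is far larger.
CONSUMER_EDITORS = ("ilovepdf", "smallpdf")


def metadata_flags(
    creation: Optional[int],
    modification: Optional[int],
    producer: Optional[str],
    now: Optional[float] = None,
) -> list[str]:
    """Return the names of metadata anomalies found, in a fixed order."""
    now = time.time() if now is None else now
    flags = []
    # A file cannot be modified before it was created.
    if creation is not None and modification is not None:
        if modification < creation - TOLERANCE_SECONDS:
            flags.append("modified_before_created")
    # A creation date in the future points to an edited or wrong clock value.
    if creation is not None and creation > now + TOLERANCE_SECONDS:
        flags.append("created_in_future")
    # Producer names a tool that usually edits documents rather than creating them.
    if producer and any(name in producer.lower() for name in CONSUMER_EDITORS):
        flags.append("consumer_editor_producer")
    return flags
```

An empty list means none of these particular checks fired. It does not mean the document is intact.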
The API flow
HTPBE uses a two-step flow. First, you submit a PDF URL and receive a check ID. Second, you retrieve the result using that ID. Both steps use standard HTTP with Bearer token authentication.
The document must be publicly accessible — either a direct URL to the file, or a presigned URL from your storage provider (S3, Google Cloud Storage, Cloudflare R2, Vercel Blob). HTPBE downloads the PDF server-side, so the URL only needs to be temporarily accessible.
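If your documents live in a private S3 bucket, a short-lived presigned URL is one way to meet this requirement. The sketch below uses boto3's `generate_presigned_url`; the helper name, the 300-second lifetime, and the optional `s3_client` parameter (there so a stub can be injected in tests) are assumptions for illustration.

```python
from typing import Any, Optional


def presigned_pdf_url(
    bucket: str,
    key: str,
    expires_in: int = 300,
    s3_client: Optional[Any] = None,
) -> str:
    """Return a temporary GET URL for a private S3 object."""
    if s3_client is None:
        # Imported lazily so the helper can be tested without boto3 installed.
        import boto3  # pip install boto3

        s3_client = boto3.client("s3")
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,  # lifetime in seconds
    )
```

Choose a lifetime long enough for the download to start but short enough that a leaked URL is of little use.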
cURL
```bash
# Step 1: submit the PDF URL for analysis
curl -X POST https://api.htpbe.tech/v1/analyze \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://your-bucket.s3.amazonaws.com/documents/contract.pdf"}'
```

Response:

```json
{"id": "ck_9f4a2e1b-3d7c-4a8e-b1f2-9e0d3c5a7b8f"}
```

```bash
# Step 2: retrieve the verdict
curl https://api.htpbe.tech/v1/result/ck_9f4a2e1b-3d7c-4a8e-b1f2-9e0d3c5a7b8f \
  -H "Authorization: Bearer YOUR_API_KEY"
```

Response (abbreviated):

```json
{
  "id": "ck_9f4a2e1b-3d7c-4a8e-b1f2-9e0d3c5a7b8f",
  "status": "modified",
  "modification_confidence": "high",
  "modification_markers": ["INCREMENTAL_UPDATES", "PRODUCER_MISMATCH"],
  "xref_count": 4,
  "has_digital_signature": false,
  "creator": "Microsoft Word",
  "producer": "iLovePDF",
  "creation_date": 1704067200,
  "modification_date": 1709251200
}
```
The modification_markers array tells you exactly which signals triggered the verdict — not just that the document is suspect, but why.
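For logs or reviewer notes, it can help to condense a result into one readable line. The sketch below does that for the fields shown in the sample response. The `summarize` helper is hypothetical, and it assumes `creation_date` and `modification_date` are Unix epoch seconds, which matches the sample values but should be confirmed against the API reference.

```python
from datetime import datetime, timezone


def summarize(result: dict) -> str:
    """Build a one-line, human-readable summary of an analysis result."""
    parts = [f"status={result['status']}"]
    markers = result.get("modification_markers") or []
    if markers:
        parts.append("markers=" + ",".join(markers))
    created = result.get("creation_date")
    modified = result.get("modification_date")
    if created is not None and modified is not None:
        # Assumption: both values are Unix epoch seconds in UTC.
        c = datetime.fromtimestamp(created, tz=timezone.utc)
        m = datetime.fromtimestamp(modified, tz=timezone.utc)
        parts.append(f"created={c:%Y-%m-%d} modified={m:%Y-%m-%d} gap_days={(m - c).days}")
    return " ".join(parts)
```

Under that assumption, the sample response above describes a file created on 2024-01-01 and modified 60 days later.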
Python
```python
import os

import httpx  # pip install httpx

API_KEY = os.environ["HTPBE_API_KEY"]
BASE_URL = "https://api.htpbe.tech/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}


def verify_pdf(pdf_url: str) -> dict:
    """Submit a PDF URL and return the full analysis result."""
    # Step 1: submit
    submit_response = httpx.post(
        f"{BASE_URL}/analyze",
        headers=HEADERS,
        json={"url": pdf_url},
        timeout=30,
    )
    submit_response.raise_for_status()
    check_id = submit_response.json()["id"]

    # Step 2: retrieve result
    result_response = httpx.get(
        f"{BASE_URL}/result/{check_id}",
        headers=HEADERS,
        timeout=30,
    )
    result_response.raise_for_status()
    return result_response.json()


def route_document(pdf_url: str) -> str:
    """Return an action based on the PDF forensic verdict."""
    result = verify_pdf(pdf_url)
    status = result["status"]

    if status == "intact":
        return "accept"
    elif status == "modified":
        markers = result.get("modification_markers", [])
        print(f"Tampering detected: {', '.join(markers)}")
        return "reject"
    else:  # inconclusive
        # Consumer software origin: route to manual review
        return "manual_review"
```
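The httpx library is used here because it has a cleaner API than requests for JSON workflows, but requests works the same way for these calls: replace httpx.post with requests.post and httpx.get with requests.get. If you switch, also switch the exception classes in your error handling, because requests raises requests.HTTPError and requests.Timeout rather than httpx.HTTPStatusError and httpx.TimeoutException.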
Handling errors in Python
```python
import httpx


def verify_pdf_safe(pdf_url: str) -> dict | None:
    try:
        return verify_pdf(pdf_url)
    except httpx.HTTPStatusError as e:
        status = e.response.status_code
        if status == 401:
            raise RuntimeError("Invalid HTPBE API key") from e
        if status == 402:
            raise RuntimeError("HTPBE subscription required") from e
        if status == 422:
            # URL did not return a valid PDF
            return None
        raise
    except httpx.TimeoutException:
        # Handle timeout: retry or queue for later
        return None
```
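One way to act on the "retry" option for timeouts is a small wrapper with exponential backoff. The sketch below is generic Python, not part of any HTPBE SDK; `with_retries` and its parameters are hypothetical names. It retries only the exception types you pass in, so non-retryable failures such as a 401 still surface immediately.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on the given exceptions with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise ValueError("attempts must be at least 1")
```

With the functions from this guide, usage would look like `with_retries(lambda: verify_pdf(url), retry_on=(httpx.TimeoutException,))`. Note that a retry resubmits the document, which may count against your quota.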
Node.js / TypeScript
```typescript
const API_KEY = process.env.HTPBE_API_KEY!;
const BASE_URL = 'https://api.htpbe.tech/v1';

interface HTPBEResult {
  id: string;
  status: 'intact' | 'modified' | 'inconclusive';
  modification_confidence: 'certain' | 'high' | 'none' | null;
  modification_markers: string[];
  xref_count: number;
  has_digital_signature: boolean;
  modifications_after_signature: boolean;
  signature_removed: boolean;
  creator: string | null;
  producer: string | null;
  creation_date: number | null;
  modification_date: number | null;
}

async function verifyPdf(pdfUrl: string): Promise<HTPBEResult> {
  // Step 1: submit
  const submitRes = await fetch(`${BASE_URL}/analyze`, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ url: pdfUrl }),
  });

  if (!submitRes.ok) {
    const body = await submitRes.json().catch(() => ({}));
    throw new Error(`HTPBE submit failed ${submitRes.status}: ${JSON.stringify(body)}`);
  }

  const { id } = await submitRes.json() as { id: string };

  // Step 2: retrieve result
  const resultRes = await fetch(`${BASE_URL}/result/${id}`, {
    headers: { Authorization: `Bearer ${API_KEY}` },
  });

  if (!resultRes.ok) {
    throw new Error(`HTPBE result fetch failed ${resultRes.status}`);
  }

  return resultRes.json() as Promise<HTPBEResult>;
}

// Production usage example
async function handleDocumentSubmission(pdfUrl: string): Promise<'accept' | 'reject' | 'review'> {
  const result = await verifyPdf(pdfUrl);

  switch (result.status) {
    case 'intact':
      return 'accept';
    case 'modified':
      console.log('Tampering markers:', result.modification_markers);
      if (result.modifications_after_signature) {
        console.log('Document was modified after digital signing');
      }
      return 'reject';
    case 'inconclusive':
      // Consumer software origin: may be legitimate, route to human
      return 'review';
  }
}
```
Reading the result: what each field means
The fields that matter most for routing decisions:
| Field | Type | What it tells you |
|---|---|---|
| status | intact / modified / inconclusive | The primary verdict |
| modification_confidence | certain / high / none | How confident the verdict is |
| modification_markers | string[] | Which specific signals triggered the verdict |
| modifications_after_signature | boolean | Content added after a valid digital signature |
| signature_removed | boolean | A digital signature was stripped from the document |
| xref_count | number | Number of xref sections in the file: the initial save plus one per incremental update |
| creator | string or null | Software that created the document |
| producer | string or null | Software that last processed the document |
Three markers produce "certain" confidence rather than "high":

- MODIFICATIONS_AFTER_SIGNATURE: cryptographically verifiable
- SIGNATURE_REMOVED: the signature slot exists but the signature is gone
- DIFFERENT_DATES: creation and modification dates differ by more than 15 seconds in an impossible sequence
Everything else produces "high" confidence, not "certain". For workflows where false positives are costly (legal proceedings, for example), certain markers warrant automatic rejection while high markers might warrant manual review.
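The confidence-based policy described above can be sketched as a small routing function. This is one possible policy under the stated assumption that false positives are costly, not a recommendation from the API. The function name and the returned action strings are hypothetical.

```python
def route_by_confidence(result: dict) -> str:
    """Reject only on 'certain' evidence; send everything doubtful to a human."""
    status = result.get("status")
    if status == "intact":
        return "accept"
    if status == "modified":
        if result.get("modification_confidence") == "certain":
            return "reject"
        # "high" confidence: strong indication, but leave the call to a reviewer
        return "manual_review"
    # inconclusive, or any status this code does not recognize
    return "manual_review"
```

Sending unknown statuses to manual review keeps the function safe if the API ever adds a verdict this code has not seen.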
Routing the inconclusive verdict
inconclusive does not mean the document is untrustworthy. It typically means the document was created with consumer software (Microsoft Word, Google Docs, LibreOffice, Canva) and lacks the structural patterns of institutionally generated documents. Encrypted files that cannot be analyzed also receive this verdict.
The right routing depends on what you are processing:
Documents that claim institutional origin (bank statements, tax certificates, court filings, insurance policies): inconclusive should be treated the same as modified. A bank statement claiming to come from a financial institution should not be produced by Google Docs.
User-generated documents (forms, applications, letters, CVs): inconclusive is expected and acceptable. Route to normal processing.
```typescript
function shouldReject(result: HTPBEResult, claimsInstitutionalOrigin: boolean): boolean {
  if (result.status === 'modified') return true;
  if (result.status === 'inconclusive' && claimsInstitutionalOrigin) return true;
  return false;
}
```
Testing without real documents
All HTPBE plans include a test API key. Test keys accept mock URLs that return predictable responses — similar to Stripe test cards. Use these in your test suite to cover every verdict branch without consuming production quota.
# Clean document — returns status: intact
https://api.htpbe.tech/v1/test/clean.pdf
# Tampered document — returns status: modified
https://api.htpbe.tech/v1/test/modified-high.pdf
# Consumer software origin — returns status: inconclusive
https://api.htpbe.tech/v1/test/inconclusive.pdf
# Signature removed — returns status: modified, signature_removed: true
https://api.htpbe.tech/v1/test/signature-removed.pdf
# Modified after signing — returns modifications_after_signature: true
https://api.htpbe.tech/v1/test/modified-medium.pdf
Keep your test key in a .env.test file and never let it touch production flows.
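A test suite can encode the mock URLs and the fields the list above says they return, then run any verification function against them. The sketch below assumes a `verify` callable that takes a URL and returns the result as a dict, such as the `verify_pdf` function from the Python section. `EXPECTED` and `find_mismatches` are hypothetical names.

```python
from typing import Callable

TEST_BASE = "https://api.htpbe.tech/v1/test"

# Expected fields per mock URL, copied from the list above.
EXPECTED = {
    f"{TEST_BASE}/clean.pdf": {"status": "intact"},
    f"{TEST_BASE}/modified-high.pdf": {"status": "modified"},
    f"{TEST_BASE}/inconclusive.pdf": {"status": "inconclusive"},
    f"{TEST_BASE}/signature-removed.pdf": {"status": "modified", "signature_removed": True},
    f"{TEST_BASE}/modified-medium.pdf": {"modifications_after_signature": True},
}


def find_mismatches(verify: Callable[[str], dict]) -> list[str]:
    """Call `verify` on every mock URL and describe each unexpected field."""
    problems = []
    for url, expected in EXPECTED.items():
        result = verify(url)
        for field, want in expected.items():
            got = result.get(field)
            if got != want:
                problems.append(f"{url}: {field}={got!r}, expected {want!r}")
    return problems
```

In a pytest suite, a single `assert find_mismatches(verify_pdf) == []` would cover every verdict branch, provided the test key is loaded in the environment.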
What this does not detect
Two scenarios where forensic metadata analysis has limits:
Documents created fraudulently from scratch. If someone fabricates a bank statement using the same software a real bank uses, generates plausible timestamps, and produces a structurally consistent PDF — the file may pass analysis. Forensic analysis catches editing of existing documents and lazy fabrication. Sophisticated forgery from scratch, using professional tools, may require additional signals (visual content analysis, issuer verification).
Encrypted documents. Strongly encrypted PDFs cannot be analyzed for structural signals. The analysis will flag this as inconclusive by necessity.
For most operational workflows (invoice processing, loan applications, recruitment document checks), forensic metadata analysis targets the most common form of attempted fraud: edits made with off-the-shelf PDF editors rather than sophisticated fabrication tools.