PDF Verification API for Python: Complete Integration Guide

Code examples verified against the API as of April 2026. If the API has changed since then, check the changelog.
Every PDF your application accepts is a trust decision. An invoice triggers a payment. A bank statement determines a credit limit. A diploma qualifies a candidate. If the document was modified after creation, the decision it drives is based on false data — and your application made that decision automatically. For the business case, see why PDF verification matters.
This guide walks through integrating the HTPBE PDF verification API into a Python application: from the first requests call to a production-ready client class with retry logic, typed responses, Django/Flask webhook patterns, and batch processing over cloud storage. All code is complete and working. For Node.js, see the Node.js integration guide. For a language-agnostic overview of what the API detects, see How to Detect PDF Tampering Programmatically.
Prerequisites
- Python 3.10+
- requests library (pip install requests)
- An HTPBE API key (sign up → Dashboard → copy key)
Quick Start: Two API Calls
The HTPBE API uses a two-step flow. First, POST /analyze submits a publicly accessible PDF URL and returns a check ID. Then, GET /result/{id} retrieves the full verdict. Both steps use Bearer token authentication. (For a visual walkthrough of the 5-layer analysis that runs behind these endpoints, see the How It Works page.)
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.htpbe.tech/v1"
HEADERS = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}

# Step 1: Submit the PDF URL for analysis
response = requests.post(
    f"{BASE_URL}/analyze",
    headers=HEADERS,
    json={"url": "https://api.htpbe.tech/v1/test/clean.pdf"},
)
response.raise_for_status()
check_id = response.json()["id"]

# Step 2: Retrieve the full verdict
result = requests.get(
    f"{BASE_URL}/result/{check_id}",
    headers=HEADERS,
).json()

print(result["status"])  # "intact", "modified", or "inconclusive"
print(result["modification_markers"])  # e.g. ["INCREMENTAL_UPDATES", "PRODUCER_MISMATCH"]
print(result["producer"])  # e.g. "Adobe PDF Library 15.0"
print(result["creator"])  # e.g. "Microsoft Word"
The URL https://api.htpbe.tech/v1/test/clean.pdf is a test mock — it returns a predictable intact response without consuming quota. We cover all test URLs in the testing section below.
Production Client Class
A production integration needs retry logic for transient failures, timeouts to prevent stalled connections, and typed responses so your IDE and type checker can assist you. Here is a complete client using dataclasses and requests.Session:
from dataclasses import dataclass
from typing import Optional
import time

import requests


@dataclass
class HTPBEResult:
    """Typed representation of the GET /result/{id} response."""

    id: str
    filename: str
    status: str  # "intact" | "modified" | "inconclusive"
    modification_confidence: Optional[str]  # "certain" | "high" | "none"
    modification_markers: list[str]
    creator: Optional[str]
    producer: Optional[str]
    creation_date: Optional[int]  # Unix timestamp
    modification_date: Optional[int]  # Unix timestamp
    xref_count: int
    has_incremental_updates: bool
    update_chain_length: int
    has_digital_signature: bool
    signature_removed: bool
    modifications_after_signature: bool
    file_size: int
    page_count: int
    pdf_version: Optional[str]

    @classmethod
    def from_dict(cls, data: dict) -> "HTPBEResult":
        return cls(
            id=data["id"],
            filename=data["filename"],
            status=data["status"],
            modification_confidence=data.get("modification_confidence"),
            modification_markers=data.get("modification_markers", []),
            creator=data.get("creator"),
            producer=data.get("producer"),
            creation_date=data.get("creation_date"),
            modification_date=data.get("modification_date"),
            xref_count=data.get("xref_count", 0),
            has_incremental_updates=data.get("has_incremental_updates", False),
            update_chain_length=data.get("update_chain_length", 0),
            has_digital_signature=data.get("has_digital_signature", False),
            signature_removed=data.get("signature_removed", False),
            modifications_after_signature=data.get(
                "modifications_after_signature", False
            ),
            file_size=data.get("file_size", 0),
            page_count=data.get("page_count", 0),
            pdf_version=data.get("pdf_version"),
        )


class HTPBEError(Exception):
    """Raised when the HTPBE API returns a non-2xx response."""

    def __init__(self, message: str, status_code: int, code: str):
        super().__init__(message)
        self.status_code = status_code
        self.code = code


class HTPBEClient:
    """Production-ready HTPBE API client with retry and timeout."""

    def __init__(
        self,
        api_key: str,
        base_url: str = "https://api.htpbe.tech/v1",
        timeout: float = 30.0,
    ):
        if not api_key:
            raise ValueError("HTPBE API key is required")
        self._session = requests.Session()
        self._session.headers.update(
            {
                "Authorization": f"Bearer {api_key}",
                "Content-Type": "application/json",
            }
        )
        self._base_url = base_url
        self._timeout = timeout

    def verify(self, pdf_url: str, retries: int = 3) -> HTPBEResult:
        """Submit a PDF URL for analysis and return the typed result.

        retries is the total number of attempts, not the number of re-tries.
        Retries only on 5xx errors with exponential backoff.
        Client errors (4xx) raise immediately.
        """
        if retries < 1:
            raise ValueError("retries must be at least 1")
        last_error: Optional[HTPBEError] = None
        for attempt in range(retries):
            if attempt > 0:
                # Sleep before each retry: 1s, then 2s with the default retries=3
                time.sleep(2 ** (attempt - 1))
            try:
                check_id = self._submit(pdf_url)
                return self._get_result(check_id)
            except HTPBEError as exc:
                if exc.status_code < 500:
                    raise  # Client error: do not retry
                last_error = exc
        raise last_error  # type: ignore[misc]

    def _submit(self, pdf_url: str) -> str:
        resp = self._session.post(
            f"{self._base_url}/analyze",
            json={"url": pdf_url},
            timeout=self._timeout,
        )
        self._raise_for_error(resp)
        return resp.json()["id"]

    def _get_result(self, check_id: str) -> HTPBEResult:
        resp = self._session.get(
            f"{self._base_url}/result/{check_id}",
            timeout=self._timeout,
        )
        self._raise_for_error(resp)
        return HTPBEResult.from_dict(resp.json())

    @staticmethod
    def _raise_for_error(resp: requests.Response) -> None:
        if resp.ok:
            return
        try:
            body = resp.json()
        except ValueError:
            body = {"error": resp.text, "code": "UNKNOWN"}
        error_map = {
            400: ("BAD_REQUEST", f"Bad request: {body.get('error', '')}"),
            401: ("UNAUTHORIZED", "Invalid API key. Check your HTPBE_API_KEY."),
            402: ("SUBSCRIPTION_REQUIRED", "Subscription required. See htpbe.tech/pricing."),
            403: ("FORBIDDEN", "Access forbidden."),
            404: ("NOT_FOUND", "Check not found."),
            413: ("FILE_TOO_LARGE", "PDF exceeds the 10 MB size limit."),
            422: ("INVALID_PDF", "The URL did not return a valid PDF file."),
        }
        # Unmapped statuses (including 5xx) fall back to the response body
        code, message = error_map.get(
            resp.status_code,
            (body.get("code", "UNKNOWN"), body.get("error", f"HTTP {resp.status_code}")),
        )
        raise HTPBEError(message, resp.status_code, code)
The retry logic only applies to 5xx responses. A 401 means your key is wrong — retrying will not help. A 422 means the URL did not return a valid PDF. Only server-side transient failures warrant a retry.
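As a rough illustration of the rule above, the sketch below separates the two decisions the client makes: whether a status code is worth retrying, and how long to wait before each retry. The helper names backoff_delays and is_retryable_status are hypothetical; they are not part of the HTPBE API or of the client class. Note also that verify() as written catches only HTPBEError, so connection errors and timeouts raised by requests reach the caller without a retry.

```python
def is_retryable_status(status_code: int) -> bool:
    """Only server-side (5xx) responses are treated as transient."""
    return 500 <= status_code < 600


def backoff_delays(retries: int, base: float = 1.0) -> list[float]:
    """Seconds slept before each retry, given `retries` total attempts.

    The first attempt runs immediately, so there is one delay fewer
    than there are attempts.
    """
    return [base * 2 ** i for i in range(max(retries - 1, 0))]


if __name__ == "__main__":
    print(backoff_delays(3))         # delays for the default retries=3
    print(is_retryable_status(503))  # server error: retry
    print(is_retryable_status(401))  # bad API key: do not retry
```

With the default retries=3, the client makes at most three attempts and sleeps twice, for one second and then two.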
Handling All Three Verdicts
The API returns exactly one of three statuses. Each maps to a different action in your application:
def handle_document(client: HTPBEClient, pdf_url: str) -> str:
    result = client.verify(pdf_url)

    if result.status == "intact":
        # Document has not been modified since creation.
        # Safe to process automatically.
        return "accept"

    if result.status == "modified":
        # Post-creation changes detected.
        # modification_markers tells you exactly what was found:
        # e.g. ["INCREMENTAL_UPDATES", "PRODUCER_MISMATCH", "DIFFERENT_DATES"]
        print(f"Rejected: {result.modification_markers}")
        print(f"Confidence: {result.modification_confidence}")
        return "reject"

    # status == "inconclusive"
    # The document was created with consumer software (Word, Excel,
    # LibreOffice, Canva) rather than institutional document management
    # software. The API cannot verify its integrity because consumer
    # tools produce metadata patterns that overlap with editing tools.
    # This is NOT a failure: it is a specific, actionable finding.
    print(f"Consumer origin detected: {result.producer}")
    return "manual_review"
The inconclusive verdict requires special attention. It does not mean the check failed or that something went wrong. It means the document was created with consumer software — Microsoft Word, Excel, LibreOffice, Canva — and those tools produce metadata patterns that are indistinguishable from documents that have been opened and re-saved. For documents that claim institutional origin (bank statements, diplomas, contracts from enterprise systems), inconclusive is itself a warning signal: why does a document from a bank look like it was created in Word? Route it to human review. For a deeper explanation, see what “inconclusive” really means.
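One way to encode that distinction is to make the claimed origin of the document an explicit input to the routing decision. The sketch below is an assumption about how an application might do this, not something the API provides: route_verdict, the claims_institutional_origin flag, and the returned action strings are all hypothetical names.

```python
def route_verdict(status: str, claims_institutional_origin: bool) -> str:
    """Map an API status plus document context to an application action."""
    if status == "intact":
        return "accept"
    if status == "modified":
        return "reject"
    if status == "inconclusive":
        # A bank statement or diploma that looks like it came out of Word
        # is suspicious; a cover letter that does so is unremarkable.
        return "manual_review" if claims_institutional_origin else "accept"
    # Fail loudly on anything unexpected instead of accepting by default.
    raise ValueError(f"Unknown status: {status!r}")


if __name__ == "__main__":
    print(route_verdict("inconclusive", claims_institutional_origin=True))
    print(route_verdict("inconclusive", claims_institutional_origin=False))
```

Whether a user-generated document with an inconclusive verdict should really be accepted automatically is a policy choice for your application; sending it to review as well is the more conservative default.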
Django Integration: Document Upload Webhook
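A common pattern in Django applications: a user uploads a PDF, your view stores it in S3 or another cloud provider, then verifies it before any processing runs. The view below runs the check inline within the request; upload_to_s3, flag_for_fraud_review, queue_for_manual_review, and process_invoice are placeholders for your own code.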
import logging
import os

from django.http import JsonResponse
from django.views.decorators.http import require_POST
from django.views.decorators.csrf import csrf_exempt

# Initialize once at module level so the TCP connection is reused
htpbe = HTPBEClient(api_key=os.environ["HTPBE_API_KEY"])


@csrf_exempt
@require_POST
def upload_invoice(request):
    """Handle invoice upload, verify PDF, route by verdict."""
    uploaded_file = request.FILES.get("invoice")
    if not uploaded_file:
        return JsonResponse({"error": "No file provided"}, status=400)

    # 1. Upload to your storage and get a public URL
    #    (replace with your actual storage logic)
    pdf_url = upload_to_s3(uploaded_file)

    # 2. Verify with HTPBE
    try:
        result = htpbe.verify(pdf_url)
    except HTPBEError as exc:
        if exc.code == "INVALID_PDF":
            return JsonResponse({"error": "Not a valid PDF"}, status=422)
        if exc.code == "FILE_TOO_LARGE":
            return JsonResponse({"error": "PDF must be under 10 MB"}, status=413)
        # Log unexpected errors, return a generic message
        logging.exception("HTPBE verification failed")
        return JsonResponse(
            {"error": "Verification temporarily unavailable"}, status=503
        )

    # 3. Route based on verdict
    if result.status == "modified":
        flag_for_fraud_review(
            pdf_url=pdf_url,
            markers=result.modification_markers,
            confidence=result.modification_confidence,
        )
        return JsonResponse(
            {"status": "flagged", "reason": result.modification_markers},
            status=422,
        )

    if result.status == "inconclusive":
        queue_for_manual_review(pdf_url=pdf_url, result=result)
        return JsonResponse({"status": "pending_review"}, status=202)

    # intact: proceed with normal processing
    process_invoice(pdf_url=pdf_url, metadata=result)
    return JsonResponse({"status": "accepted", "check_id": result.id})
The same pattern works in Flask — replace the Django decorators with Flask route decorators and request.files instead of request.FILES.
Flask Equivalent
import os

from flask import Flask, request, jsonify

app = Flask(__name__)
htpbe = HTPBEClient(api_key=os.environ["HTPBE_API_KEY"])


@app.route("/api/documents", methods=["POST"])
def upload_document():
    uploaded_file = request.files.get("document")
    if not uploaded_file:
        return jsonify(error="No file provided"), 400

    pdf_url = upload_to_s3(uploaded_file)

    try:
        result = htpbe.verify(pdf_url)
    except HTPBEError as exc:
        if exc.status_code < 500:
            return jsonify(error=str(exc)), exc.status_code
        return jsonify(error="Verification unavailable"), 503

    if result.status == "modified":
        return jsonify(
            status="flagged",
            markers=result.modification_markers,
        ), 422

    if result.status == "inconclusive":
        return jsonify(status="pending_review"), 202

    return jsonify(status="accepted", check_id=result.id)
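Both views upload the file before the API has a chance to reject it with 413 or 422. A cheap local check on the uploaded bytes can catch the obvious cases first. The sketch below is an illustration under assumptions: preflight_error and MAX_PDF_BYTES are hypothetical names, the 10 MB limit is read here as 10 × 1024 × 1024 bytes (the guide does not say which definition of MB the API uses), and the check only looks for the %PDF- signature at the very start of the data, which is stricter than some PDF readers.

```python
from typing import Optional

# Assumption: "10 MB" interpreted as 10 MiB; adjust if the API counts differently.
MAX_PDF_BYTES = 10 * 1024 * 1024


def preflight_error(data: bytes) -> Optional[str]:
    """Return an error code for obviously unusable uploads, else None.

    The codes mirror the client's error_map so callers can reuse the
    same handling for local and remote rejections.
    """
    if len(data) > MAX_PDF_BYTES:
        return "FILE_TOO_LARGE"
    if not data.startswith(b"%PDF-"):
        return "INVALID_PDF"
    return None


if __name__ == "__main__":
    print(preflight_error(b"%PDF-1.7\n..."))   # passes the local check
    print(preflight_error(b"<html></html>"))   # not a PDF
```

In either view this could run on the bytes read from the uploaded file before the storage upload; if you read the file object to do so, rewind it before passing it on. Passing this check does not mean the API will accept the file, only that the two cheapest failure modes are ruled out.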
Batch Processing: S3 or GCS Bucket
For backfill scenarios — verifying a bucket of existing documents — iterate over the objects and collect results. This example uses boto3 for AWS S3, but the pattern applies to any storage provider that gives you presigned URLs:
import csv
import os

import boto3
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version="s3v4"))
htpbe = HTPBEClient(api_key=os.environ["HTPBE_API_KEY"])


def batch_verify_bucket(bucket: str, prefix: str = "") -> list[dict]:
    """Verify all PDFs in an S3 bucket prefix. Returns a summary list."""
    paginator = s3.get_paginator("list_objects_v2")
    results = []

    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if not key.lower().endswith(".pdf"):
                continue

            # Generate a presigned URL (valid for 5 minutes)
            presigned_url = s3.generate_presigned_url(
                "get_object",
                Params={"Bucket": bucket, "Key": key},
                ExpiresIn=300,
            )

            try:
                result = htpbe.verify(presigned_url)
                results.append({
                    "key": key,
                    "status": result.status,
                    "confidence": result.modification_confidence,
                    "markers": ", ".join(result.modification_markers),
                    "producer": result.producer,
                    "creator": result.creator,
                })
            except HTPBEError as exc:
                # Record the failure and keep going with the next object
                results.append({
                    "key": key,
                    "status": "error",
                    "confidence": None,
                    "markers": f"{exc.code}: {exc}",
                    "producer": None,
                    "creator": None,
                })

    return results


def export_to_csv(results: list[dict], output_path: str) -> None:
    """Write batch results to a CSV file for review."""
    if not results:
        return
    with open(output_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=results[0].keys())
        writer.writeheader()
        writer.writerows(results)


# Usage:
# results = batch_verify_bucket("my-documents-bucket", prefix="invoices/2026/")
# export_to_csv(results, "verification_report.csv")
For large buckets, add concurrent.futures.ThreadPoolExecutor to parallelize requests. The API handles concurrent calls — just stay within your plan’s rate limits.
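A minimal sketch of that parallelization, assuming the work per document can be expressed as a function from URL to status string. The name verify_concurrently and the max_workers default of 4 are illustrative choices; the right worker count depends on rate limits that this guide does not specify.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def verify_concurrently(
    verify: Callable[[str], str],
    urls: list[str],
    max_workers: int = 4,
) -> dict[str, str]:
    """Run verify(url) for every URL on a thread pool.

    Returns a mapping of url -> outcome. A failure for one URL is
    recorded as an "error: ..." string and does not abort the batch.
    """
    outcomes: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(verify, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                outcomes[url] = future.result()
            except Exception as exc:
                outcomes[url] = f"error: {exc}"
    return outcomes
```

With the client from this guide, the callable could be something like lambda url: htpbe.verify(url).status. Whether one requests.Session can safely be shared across threads is not covered here; creating one client per worker is the conservative option.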
Test Mode: Develop Without Live Documents
All HTPBE plans — including the free tier — include a test API key. Test keys use mock URLs that return deterministic responses without consuming quota. Use them for integration testing and CI pipelines.
Test URL format: https://api.htpbe.tech/v1/test/{filename}.pdf
| Filename | status | What it simulates |
|---|---|---|
| clean.pdf | intact | Clean document, institutional origin |
| clean-no-dates.pdf | intact | No date metadata, unknown origin |
| modified-low.pdf | modified | Single incremental update |
| modified-medium.pdf | modified | Consumer software origin, multiple updates |
| modified-high.pdf | modified | Different producer, 5-link update chain |
| modified-critical.pdf | modified | Signature removed, JavaScript, Excel origin |
| signature-removed.pdf | modified | Digital signature stripped |
| signature-valid.pdf | intact | Valid digital signature present |
| inconclusive.pdf | inconclusive | Excel origin, no structural modifications |
| dates-mismatch.pdf | modified | ModDate 14 days after CreationDate |
Use these in your test suite with pytest:
import os

import pytest

TEST_BASE = "https://api.htpbe.tech/v1/test"
client = HTPBEClient(api_key=os.environ["HTPBE_TEST_API_KEY"])


def test_intact_document():
    result = client.verify(f"{TEST_BASE}/clean.pdf")
    assert result.status == "intact"
    assert result.modification_markers == []


def test_modified_document():
    result = client.verify(f"{TEST_BASE}/modified-high.pdf")
    assert result.status == "modified"
    assert result.modification_confidence in ("certain", "high")
    assert len(result.modification_markers) > 0


def test_inconclusive_consumer_origin():
    result = client.verify(f"{TEST_BASE}/inconclusive.pdf")
    assert result.status == "inconclusive"
    # inconclusive is NOT an error: it means consumer software origin


def test_signature_removed():
    result = client.verify(f"{TEST_BASE}/signature-removed.pdf")
    assert result.status == "modified"
    assert result.signature_removed is True
Keep your test API key in a .env.test file and your production key in .env. Never commit either to version control.
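One way to keep the two keys from crossing over is to resolve the key in a single place, based on an explicit environment switch. The sketch below assumes a hypothetical APP_ENV variable and a hypothetical load_api_key helper; only the HTPBE_API_KEY and HTPBE_TEST_API_KEY variable names come from the examples in this guide.

```python
import os
from typing import Mapping, Optional


def load_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the test or production key based on APP_ENV (hypothetical)."""
    env = os.environ if env is None else env
    # Assumption: APP_ENV=test selects the test key; anything else is production.
    mode = env.get("APP_ENV", "production")
    var = "HTPBE_TEST_API_KEY" if mode == "test" else "HTPBE_API_KEY"
    key = env.get(var, "")
    if not key:
        # Fail at startup instead of falling back to the other key.
        raise RuntimeError(f"{var} is not set")
    return key
```

The important design choice is that a missing key raises instead of silently falling back, so a test key cannot end up serving production traffic by accident.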
What the API Does Not Detect
Understanding the limits is important when building a system that makes automated decisions:
- Content-level changes within a single save. If someone creates a PDF in Word, types a fraudulent number, and exports once, the document has never been modified post-creation. It is structurally intact. The fraud happened at authorship, not at the file level.
- Documents recreated from scratch. If an attacker recreates a document from scratch in the same software the original used and matches the metadata fields, the structural signals will not distinguish it from the original. This attack requires significant effort and is rare, but it is possible.
- Encrypted or password-protected PDFs. The API cannot analyze a PDF it cannot parse. If the file requires a password to open, the analysis will return an error.
- Scanned documents with no digital metadata. A photo of a document converted to PDF contains minimal structural metadata. These typically return inconclusive because the originating software is a scanner driver or image converter, not an institutional system.
For more on what metadata fields the API reads and why, see the PDF metadata fields reference.
These limits are why HTPBE works best as one layer in a verification workflow — not the only layer. Combine structural PDF analysis with domain-specific checks (amount validation, vendor database lookups, sender authentication) for a layered defense. See PDF Fraud Prevention Best Practices.
Choosing a Plan
Plans start at $15/month for 30 checks (Starter). The right tier depends on your document volume:
- Starter ($15/mo, 30 checks): Low-volume workflows — a handful of documents per week. Good for initial production deployment.
- Growth ($149/mo, 350 checks): The recommended starting point for applications with real users. Handles roughly 12 documents per day.
- Pro ($499/mo, 1,500 checks): For platforms processing 40–50 documents per day. Per-check cost drops to $0.33.
- Enterprise (custom): Volume pricing for 1,500+ checks per month.
Requests beyond your monthly quota succeed automatically — there is no hard cutoff. Overage is billed at the end of the billing cycle.
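The per-check and per-day figures quoted for each plan follow from simple division, which the sketch below reproduces so you can plug in your own volume. The PLANS table copies the prices and quotas listed above; the 30-day month and the function names are assumptions made for illustration.

```python
# (monthly price in USD, included checks), copied from the plan list above
PLANS = {
    "Starter": (15, 30),
    "Growth": (149, 350),
    "Pro": (499, 1500),
}


def per_check_cost(plan: str) -> float:
    """Monthly price divided by included checks, rounded to cents."""
    price, checks = PLANS[plan]
    return round(price / checks, 2)


def checks_per_day(plan: str, days_in_month: int = 30) -> float:
    """Average daily volume the quota covers, assuming a 30-day month."""
    _, checks = PLANS[plan]
    return checks / days_in_month


if __name__ == "__main__":
    for name in PLANS:
        print(name, per_check_cost(name), round(checks_per_day(name), 1))
```

Overage pricing is not modelled here because the rate is not stated in this guide.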
Decisions Before You Ship
The integration surface is small: one POST, one GET, three verdicts. The decisions that matter are on your side:
- inconclusive routing: for documents that claim institutional origin (bank statements, diplomas), treat inconclusive the same as modified and route to human review. For user-generated documents (cover letters, personal forms), it may be acceptable to proceed.
- Where in your pipeline: verify at intake, before any data extraction or business logic runs. Verifying after processing does not prevent harm.
- Test key separation: keep test and production keys in separate environment files. Test keys return synthetic data and must not be used in production flows.
The full API reference covers all response fields, error codes, and rate limits. To get started, sign up for a free account, copy your test key from the dashboard, and run the quick-start example above.