PDF Verification API for Python: Complete Integration Guide

HTPBE Team · 12 min read
Code examples verified against the API as of April 2026. If the API has changed since then, check the changelog.

Every PDF your application accepts is a trust decision. An invoice triggers a payment. A bank statement determines a credit limit. A diploma qualifies a candidate. If the document was modified after creation, the decision it drives is based on false data — and your application made that decision automatically. For the business case, see why PDF verification matters.

This guide walks through integrating the HTPBE PDF verification API into a Python application: from the first requests call to a production-ready client class with retry logic, typed responses, Django/Flask webhook patterns, and batch processing over cloud storage. All code is complete and working. For Node.js, see the Node.js integration guide. For a language-agnostic overview of what the API detects, see How to Detect PDF Tampering Programmatically.

Prerequisites

  • Python 3.10+
  • requests library (pip install requests)
  • An HTPBE API key (sign up → Dashboard → copy key)

Quick Start: Two API Calls

The HTPBE API uses a two-step flow. First, POST /analyze submits a publicly accessible PDF URL and returns a check ID. Then, GET /result/{id} retrieves the full verdict. Both steps use Bearer token authentication. (For a visual walkthrough of the 5-layer analysis that runs behind these endpoints, see the How It Works page.)

import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.htpbe.tech/v1"
HEADERS = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}

# Step 1: Submit the PDF URL for analysis
response = requests.post(
    f"{BASE_URL}/analyze",
    headers=HEADERS,
    json={"url": "https://api.htpbe.tech/v1/test/clean.pdf"},
)
response.raise_for_status()
check_id = response.json()["id"]

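# Step 2: Retrieve the full verdict
result_response = requests.get(
    f"{BASE_URL}/result/{check_id}",
    headers=HEADERS,
)
result_response.raise_for_status()  # fail loudly instead of parsing an error body
result = result_response.json()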

print(result["status"])              # "intact", "modified", or "inconclusive"
print(result["modification_markers"])  # e.g. ["INCREMENTAL_UPDATES", "PRODUCER_MISMATCH"]
print(result["producer"])             # e.g. "Adobe PDF Library 15.0"
print(result["creator"])              # e.g. "Microsoft Word"

The URL https://api.htpbe.tech/v1/test/clean.pdf is a test mock — it returns a predictable intact response without consuming quota. We cover all test URLs in the testing section below.

Production Client Class

A production integration needs retry logic for transient failures, timeouts to prevent stalled connections, and typed responses so your IDE and type checker can assist you. Here is a complete client using dataclasses and requests.Session:

from dataclasses import dataclass
from typing import Optional
import time
import requests


@dataclass
class HTPBEResult:
    """Typed representation of the GET /result/{id} response."""

    id: str
    filename: str
    status: str  # "intact" | "modified" | "inconclusive"
    modification_confidence: Optional[str]  # "certain" | "high" | "none"
    modification_markers: list[str]

    creator: Optional[str]
    producer: Optional[str]
    creation_date: Optional[int]  # Unix timestamp
    modification_date: Optional[int]  # Unix timestamp

    xref_count: int
    has_incremental_updates: bool
    update_chain_length: int
    has_digital_signature: bool
    signature_removed: bool
    modifications_after_signature: bool

    file_size: int
    page_count: int
    pdf_version: Optional[str]

    @classmethod
    def from_dict(cls, data: dict) -> "HTPBEResult":
        return cls(
            id=data["id"],
            filename=data["filename"],
            status=data["status"],
            modification_confidence=data.get("modification_confidence"),
            modification_markers=data.get("modification_markers", []),
            creator=data.get("creator"),
            producer=data.get("producer"),
            creation_date=data.get("creation_date"),
            modification_date=data.get("modification_date"),
            xref_count=data.get("xref_count", 0),
            has_incremental_updates=data.get("has_incremental_updates", False),
            update_chain_length=data.get("update_chain_length", 0),
            has_digital_signature=data.get("has_digital_signature", False),
            signature_removed=data.get("signature_removed", False),
            modifications_after_signature=data.get(
                "modifications_after_signature", False
            ),
            file_size=data.get("file_size", 0),
            page_count=data.get("page_count", 0),
            pdf_version=data.get("pdf_version"),
        )


class HTPBEError(Exception):
    """Raised when the HTPBE API returns a non-2xx response."""

    def __init__(self, message: str, status_code: int, code: str):
        super().__init__(message)
        self.status_code = status_code
        self.code = code


class HTPBEClient:
    """Production-ready HTPBE API client with retry and timeout."""

    def __init__(
        self,
        api_key: str,
        base_url: str = "https://api.htpbe.tech/v1",
        timeout: float = 30.0,
    ):
        if not api_key:
            raise ValueError("HTPBE API key is required")

        self._session = requests.Session()
        self._session.headers.update(
            {
                "Authorization": f"Bearer {api_key}",
                "Content-Type": "application/json",
            }
        )
        self._base_url = base_url
        self._timeout = timeout

    def verify(self, pdf_url: str, retries: int = 3) -> HTPBEResult:
        """Submit a PDF URL for analysis and return the typed result.

        Retries only on 5xx errors with exponential backoff.
        Client errors (4xx) raise immediately.
        """
        last_error: Optional[HTPBEError] = None

        for attempt in range(retries):
            if attempt > 0:
                time.sleep(2 ** (attempt - 1))  # 1s, then 2s with the default retries=3

            try:
                check_id = self._submit(pdf_url)
                return self._get_result(check_id)
            except HTPBEError as exc:
                if exc.status_code < 500:
                    raise  # Client error — do not retry
                last_error = exc

        raise last_error  # type: ignore[misc]

    def _submit(self, pdf_url: str) -> str:
        resp = self._session.post(
            f"{self._base_url}/analyze",
            json={"url": pdf_url},
            timeout=self._timeout,
        )
        self._raise_for_error(resp)
        return resp.json()["id"]

    def _get_result(self, check_id: str) -> HTPBEResult:
        resp = self._session.get(
            f"{self._base_url}/result/{check_id}",
            timeout=self._timeout,
        )
        self._raise_for_error(resp)
        return HTPBEResult.from_dict(resp.json())

    @staticmethod
    def _raise_for_error(resp: requests.Response) -> None:
        if resp.ok:
            return

        try:
            body = resp.json()
        except ValueError:
            body = {"error": resp.text, "code": "UNKNOWN"}

        error_map = {
            400: ("BAD_REQUEST", f"Bad request: {body.get('error', '')}"),
            401: ("UNAUTHORIZED", "Invalid API key. Check your HTPBE_API_KEY."),
            402: ("SUBSCRIPTION_REQUIRED", "Subscription required. See htpbe.tech/pricing."),
            403: ("FORBIDDEN", "Access forbidden."),
            404: ("NOT_FOUND", "Check not found."),
            413: ("FILE_TOO_LARGE", "PDF exceeds the 10 MB size limit."),
            422: ("INVALID_PDF", "The URL did not return a valid PDF file."),
        }

        code, message = error_map.get(
            resp.status_code,
            (body.get("code", "UNKNOWN"), body.get("error", f"HTTP {resp.status_code}")),
        )
        raise HTPBEError(message, resp.status_code, code)

The retry logic only applies to 5xx responses. A 401 means your key is wrong — retrying will not help. A 422 means the URL did not return a valid PDF. Only server-side transient failures warrant a retry.
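
Because verify() catches only HTPBEError, an exception raised by requests itself (a dropped connection, a timeout) is not retried and reaches the caller. If you want those retried as well, one option is a small wrapper around the call. The sketch below is an illustration, not part of the HTPBE client: retry_transient and its parameters are hypothetical names, and the exception types to retry are passed in by the caller so the helper stays independent of any HTTP library.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_transient(
    call: Callable[[], T],
    transient: Tuple[Type[BaseException], ...],
    retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying when it raises one of the `transient` exceptions.

    Waits 1s, 2s, 4s, ... between attempts (no wait before the first one).
    Any other exception propagates immediately.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")

    for attempt in range(retries):
        if attempt > 0:
            sleep(2 ** (attempt - 1))
        try:
            return call()
        except transient:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last network error
    raise AssertionError("unreachable")


# Hypothetical usage with the client defined above:
# result = retry_transient(
#     lambda: client.verify(pdf_url),
#     transient=(requests.ConnectionError, requests.Timeout),
# )
```

Note that verify() already retries 5xx responses internally, so wrapping it multiplies the worst-case number of requests. Keep both retry counts small.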

Handling All Three Verdicts

The API returns exactly one of three statuses. Each maps to a different action in your application:

def handle_document(client: HTPBEClient, pdf_url: str) -> str:
    result = client.verify(pdf_url)

    if result.status == "intact":
        # Document has not been modified since creation.
        # Safe to process automatically.
        return "accept"

    if result.status == "modified":
        # Post-creation changes detected.
        # modification_markers tells you exactly what was found:
        # e.g. ["INCREMENTAL_UPDATES", "PRODUCER_MISMATCH", "DIFFERENT_DATES"]
        print(f"Rejected: {result.modification_markers}")
        print(f"Confidence: {result.modification_confidence}")
        return "reject"

    # status == "inconclusive"
    # The document was created with consumer software (Word, Excel,
    # LibreOffice, Canva) rather than institutional document management
    # software. The API cannot verify its integrity because consumer
    # tools produce metadata patterns that overlap with editing tools.
    # This is NOT a failure — it is a specific, actionable finding.
    print(f"Consumer origin detected: {result.producer}")
    return "manual_review"

The inconclusive verdict requires special attention. It does not mean the check failed or that something went wrong. It means the document was created with consumer software — Microsoft Word, Excel, LibreOffice, Canva — and those tools produce metadata patterns that are indistinguishable from documents that have been opened and re-saved. For documents that claim institutional origin (bank statements, diplomas, contracts from enterprise systems), inconclusive is itself a warning signal: why does a document from a bank look like it was created in Word? Route it to human review. For a deeper explanation, see what “inconclusive” really means.

Django Integration: Document Upload Webhook

A common pattern in Django applications: a user uploads a PDF, your view stores it in S3 or another cloud provider, then verifies it before any processing runs. The example below runs the check synchronously inside the request.

import logging
import os

from django.http import JsonResponse
from django.views.decorators.http import require_POST
from django.views.decorators.csrf import csrf_exempt

# HTPBEClient and HTPBEError are the classes defined in the previous section.
# upload_to_s3, flag_for_fraud_review, queue_for_manual_review and
# process_invoice are placeholders for your own application code.

logger = logging.getLogger(__name__)

# Initialize once at module level so the session reuses TCP connections
htpbe = HTPBEClient(api_key=os.environ["HTPBE_API_KEY"])


@csrf_exempt
@require_POST
def upload_invoice(request):
    """Handle invoice upload, verify PDF, route by verdict."""
    uploaded_file = request.FILES.get("invoice")
    if not uploaded_file:
        return JsonResponse({"error": "No file provided"}, status=400)

    # 1. Upload to your storage and get a public URL
    #    (replace with your actual storage logic)
    pdf_url = upload_to_s3(uploaded_file)

    # 2. Verify with HTPBE
    try:
        result = htpbe.verify(pdf_url)
    except HTPBEError as exc:
        if exc.code == "INVALID_PDF":
            return JsonResponse({"error": "Not a valid PDF"}, status=422)
        if exc.code == "FILE_TOO_LARGE":
            return JsonResponse({"error": "PDF must be under 10 MB"}, status=413)
        # Log unexpected errors, return a generic message
        logger.exception("HTPBE verification failed")
        return JsonResponse(
            {"error": "Verification temporarily unavailable"}, status=503
        )

    # 3. Route based on verdict
    if result.status == "modified":
        flag_for_fraud_review(
            pdf_url=pdf_url,
            markers=result.modification_markers,
            confidence=result.modification_confidence,
        )
        return JsonResponse(
            {"status": "flagged", "reason": result.modification_markers},
            status=422,
        )

    if result.status == "inconclusive":
        queue_for_manual_review(pdf_url=pdf_url, result=result)
        return JsonResponse({"status": "pending_review"}, status=202)

    # intact — proceed with normal processing
    process_invoice(pdf_url=pdf_url, metadata=result)
    return JsonResponse({"status": "accepted", "check_id": result.id})

The same pattern works in Flask — replace the Django decorators with Flask route decorators and request.files instead of request.FILES.

Flask Equivalent

import os
from flask import Flask, request, jsonify

app = Flask(__name__)
htpbe = HTPBEClient(api_key=os.environ["HTPBE_API_KEY"])


@app.route("/api/documents", methods=["POST"])
def upload_document():
    uploaded_file = request.files.get("document")
    if not uploaded_file:
        return jsonify(error="No file provided"), 400

    pdf_url = upload_to_s3(uploaded_file)

    try:
        result = htpbe.verify(pdf_url)
    except HTPBEError as exc:
        if exc.status_code < 500:
            return jsonify(error=str(exc)), exc.status_code
        return jsonify(error="Verification unavailable"), 503

    if result.status == "modified":
        return jsonify(
            status="flagged",
            markers=result.modification_markers,
        ), 422

    if result.status == "inconclusive":
        return jsonify(status="pending_review"), 202

    return jsonify(status="accepted", check_id=result.id)

Batch Processing: S3 or GCS Bucket

For backfill scenarios — verifying a bucket of existing documents — iterate over the objects and collect results. This example uses boto3 for AWS S3, but the pattern applies to any storage provider that gives you presigned URLs:

import csv
import os

import boto3
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version="s3v4"))
htpbe = HTPBEClient(api_key=os.environ["HTPBE_API_KEY"])


def batch_verify_bucket(bucket: str, prefix: str = "") -> list[dict]:
    """Verify all PDFs in an S3 bucket prefix. Returns a summary list."""
    paginator = s3.get_paginator("list_objects_v2")
    results = []

    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if not key.lower().endswith(".pdf"):
                continue

            # Generate a presigned URL (valid for 5 minutes)
            presigned_url = s3.generate_presigned_url(
                "get_object",
                Params={"Bucket": bucket, "Key": key},
                ExpiresIn=300,
            )

            try:
                result = htpbe.verify(presigned_url)
                results.append({
                    "key": key,
                    "status": result.status,
                    "confidence": result.modification_confidence,
                    "markers": ", ".join(result.modification_markers),
                    "producer": result.producer,
                    "creator": result.creator,
                })
            except HTPBEError as exc:
                results.append({
                    "key": key,
                    "status": "error",
                    "confidence": None,
                    "markers": f"{exc.code}: {exc}",
                    "producer": None,
                    "creator": None,
                })

    return results


def export_to_csv(results: list[dict], output_path: str) -> None:
    """Write batch results to a CSV file for review."""
    if not results:
        return
    with open(output_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=results[0].keys())
        writer.writeheader()
        writer.writerows(results)


# Usage:
# results = batch_verify_bucket("my-documents-bucket", prefix="invoices/2026/")
# export_to_csv(results, "verification_report.csv")

For large buckets, add concurrent.futures.ThreadPoolExecutor to parallelize requests. The API handles concurrent calls — just stay within your plan’s rate limits.
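
A minimal sketch of that parallel variant, using only the standard library. The function name verify_concurrently and the max_workers default are illustrative assumptions; the verify argument stands in for something like lambda url: htpbe.verify(url).status. Pick max_workers with your plan's rate limits in mind.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def verify_concurrently(
    verify: Callable[[str], str],
    urls: list[str],
    max_workers: int = 4,
) -> dict[str, str]:
    """Call verify(url) for each URL in a thread pool.

    Returns {url: status}. A failed call is recorded as "error: <message>"
    so that one bad document does not abort the whole batch.
    """
    outcomes: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the URL it was submitted for.
        futures = {pool.submit(verify, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                outcomes[url] = future.result()
            except Exception as exc:
                outcomes[url] = f"error: {exc}"
    return outcomes


# Hypothetical usage with the client defined earlier:
# statuses = verify_concurrently(lambda u: htpbe.verify(u).status, presigned_urls)
```

The requests documentation does not promise that a Session is safe to share across threads, so creating one HTPBEClient per worker thread may be the more cautious choice.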

Test Mode: Develop Without Live Documents

All HTPBE plans — including the free tier — include a test API key. Test keys use mock URLs that return deterministic responses without consuming quota. Use them for integration testing and CI pipelines.

Test URL format: https://api.htpbe.tech/v1/test/{filename}.pdf

Filename                status          What it simulates
clean.pdf               intact          Clean document, institutional origin
clean-no-dates.pdf      intact          No date metadata, unknown origin
modified-low.pdf        modified        Single incremental update
modified-medium.pdf     modified        Consumer software origin, multiple updates
modified-high.pdf       modified        Different producer, 5-link update chain
modified-critical.pdf   modified        Signature removed, JavaScript, Excel origin
signature-removed.pdf   modified        Digital signature stripped
signature-valid.pdf     intact          Valid digital signature present
inconclusive.pdf        inconclusive    Excel origin, no structural modifications
dates-mismatch.pdf      modified        ModDate 14 days after CreationDate
Use these in your test suite with pytest:

import os
import pytest

TEST_BASE = "https://api.htpbe.tech/v1/test"
client = HTPBEClient(api_key=os.environ["HTPBE_TEST_API_KEY"])


def test_intact_document():
    result = client.verify(f"{TEST_BASE}/clean.pdf")
    assert result.status == "intact"
    assert result.modification_markers == []


def test_modified_document():
    result = client.verify(f"{TEST_BASE}/modified-high.pdf")
    assert result.status == "modified"
    assert result.modification_confidence in ("certain", "high")
    assert len(result.modification_markers) > 0


def test_inconclusive_consumer_origin():
    result = client.verify(f"{TEST_BASE}/inconclusive.pdf")
    assert result.status == "inconclusive"
    # inconclusive is NOT an error — it means consumer software origin


def test_signature_removed():
    result = client.verify(f"{TEST_BASE}/signature-removed.pdf")
    assert result.status == "modified"
    assert result.signature_removed is True
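
To cover every mock file without writing one test per row, the table above can be expressed as data. This is a sketch under assumptions: EXPECTED_STATUS is copied from the table, and find_mismatches is a hypothetical helper that takes any callable returning a status string, for example lambda url: client.verify(url).status.

```python
from typing import Callable

TEST_BASE = "https://api.htpbe.tech/v1/test"

# Expected status per mock file, copied from the table above.
EXPECTED_STATUS = {
    "clean.pdf": "intact",
    "clean-no-dates.pdf": "intact",
    "modified-low.pdf": "modified",
    "modified-medium.pdf": "modified",
    "modified-high.pdf": "modified",
    "modified-critical.pdf": "modified",
    "signature-removed.pdf": "modified",
    "signature-valid.pdf": "intact",
    "inconclusive.pdf": "inconclusive",
    "dates-mismatch.pdf": "modified",
}


def find_mismatches(get_status: Callable[[str], str]) -> dict[str, tuple[str, str]]:
    """Return {filename: (expected, actual)} for every mock file that disagrees."""
    mismatches = {}
    for filename, expected in EXPECTED_STATUS.items():
        actual = get_status(f"{TEST_BASE}/{filename}")
        if actual != expected:
            mismatches[filename] = (expected, actual)
    return mismatches


# Hypothetical usage inside a test:
# assert find_mismatches(lambda url: client.verify(url).status) == {}
```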

Keep your test API key in a .env.test file and your production key in .env. Never commit either to version control.
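
One way to keep the two keys from being mixed up is to select the key by mode in a single place and fail fast when it is missing. The sketch below assumes the variable names used in this guide (HTPBE_API_KEY and HTPBE_TEST_API_KEY) are already exported into the process environment, for example by your shell or a dotenv loader; Python does not read .env files on its own. The load_api_key helper is a hypothetical name.

```python
import os
from typing import Mapping


def load_api_key(test_mode: bool, env: Mapping[str, str] = os.environ) -> str:
    """Return the test or production key, raising if it is missing or blank."""
    name = "HTPBE_TEST_API_KEY" if test_mode else "HTPBE_API_KEY"
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set")
    return key


# Hypothetical usage:
# client = HTPBEClient(api_key=load_api_key(test_mode=True))
```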

What the API Does Not Detect

Understanding the limits is important when building a system that makes automated decisions:

  • Content-level changes within a single save. If someone creates a PDF in Word, types a fraudulent number, and exports once, the document has never been modified post-creation. It is structurally intact. The fraud happened at authorship, not at the file level.
  • Documents recreated from scratch. If an attacker recreates a document from scratch in the same software the original used and matches the metadata fields, the structural signals will not distinguish it from the original. This attack requires significant effort and is rare, but it is possible.
  • Encrypted or password-protected PDFs. The API cannot analyze a PDF it cannot parse. If the file requires a password to open, the analysis will return an error.
  • Scanned documents with no digital metadata. A photo of a document converted to PDF contains minimal structural metadata. These typically return inconclusive because the originating software is a scanner driver or image converter, not an institutional system. For more on what metadata fields the API reads and why, see the PDF metadata fields reference.

These limits are why HTPBE works best as one layer in a verification workflow — not the only layer. Combine structural PDF analysis with domain-specific checks (amount validation, vendor database lookups, sender authentication) for a layered defense. See PDF Fraud Prevention Best Practices.

Choosing a Plan

Plans start at $15/month for 30 checks (Starter). The right tier depends on your document volume:

  • Starter ($15/mo, 30 checks): Low-volume workflows — a handful of documents per week. Good for initial production deployment.
  • Growth ($149/mo, 350 checks): The recommended starting point for applications with real users. Handles roughly 12 documents per day.
  • Pro ($499/mo, 1,500 checks): For platforms processing 40–50 documents per day. Per-check cost drops to $0.33.
  • Enterprise (custom): Volume pricing for 1,500+ checks per month.
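
The per-check and per-day figures follow from simple division. The snippet below reproduces them from the prices listed above, assuming a 30-day month; plan_summary is an illustrative helper, not part of any SDK.

```python
# Plan name -> (monthly price in USD, included checks), from the list above.
PLANS = {
    "Starter": (15, 30),
    "Growth": (149, 350),
    "Pro": (499, 1500),
}


def plan_summary(
    price_usd: float, checks: int, days_per_month: int = 30
) -> tuple[float, float]:
    """Return (cost per check in USD, checks per day), both rounded."""
    return round(price_usd / checks, 2), round(checks / days_per_month, 1)


for name, (price, checks) in PLANS.items():
    per_check, per_day = plan_summary(price, checks)
    print(f"{name}: ${per_check:.2f} per check, about {per_day} checks/day")
```

Under that assumption, Starter works out to $0.50 per check, Growth to about $0.43 per check and 11.7 checks per day, and Pro to about $0.33 per check and 50 checks per day.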

Requests beyond your monthly quota succeed automatically — there is no hard cutoff. Overage is billed at the end of the billing cycle.

Decisions Before You Ship

The integration surface is small: one POST, one GET, three verdicts. The decisions that matter are on your side:

  • inconclusive routing — for documents that claim institutional origin (bank statements, diplomas), treat inconclusive the same as modified and route to human review. For user-generated documents (cover letters, personal forms), it may be acceptable to proceed.
  • Where in your pipeline — verify at intake, before any data extraction or business logic runs. Verifying after processing does not prevent harm.
  • Test key separation — keep test and production keys in separate environment files. Test keys return synthetic data and must not be used in production flows.
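
The inconclusive routing decision can be written down as a small policy function. This is a sketch of one possible policy, not a recommendation from the API itself: route_verdict, the claims_institutional_origin flag, and the action labels "accept" and "human_review" are all hypothetical names for your own application to define.

```python
def route_verdict(status: str, claims_institutional_origin: bool) -> str:
    """Map an API status to an action: "accept" or "human_review"."""
    if status == "intact":
        return "accept"
    if status == "modified":
        return "human_review"
    if status == "inconclusive":
        # Stricter for documents that should come from an institutional
        # system (bank statements, diplomas); lenient for user-generated ones.
        return "human_review" if claims_institutional_origin else "accept"
    raise ValueError(f"unexpected status: {status!r}")
```

Raising on an unknown status keeps a future API change from being silently treated as an accept.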

The full API reference covers all response fields, error codes, and rate limits. To get started, sign up for a free account, copy your test key from the dashboard, and run the quick-start example above.
