Large PDFs in Node.js Without Unbounded Buffering

This article is a snapshot — content was accurate as of June 2026 (code examples tested against the API as of May 2026). The product evolves actively; specific counts, examples, and detection rules may have changed since publication — see the changelog for the current state.
The first time you run a PDF verification pipeline at production volume on a single Node.js worker, you hit the same three failures in roughly the same order: V8 heap exhaustion under concurrent load, a Vercel 413 on anything above 4.5 MB, and p99 latency that climbs as large synchronous Buffer.concat calls block the event loop. None of these are bugs in your code. They are the predictable cost of the “FormData → full buffer → analyze” pattern that almost every PDF tutorial starts with.
This guide is for backend engineers integrating PDF tamper detection into a Node service that has to handle real upload volume. The honest framing up front: you cannot make pdf-lib itself stream — the PDF format needs the xref table at the end of the file, and the parser needs the whole thing resident. What you can do is control the three things that actually cause production failures: how many bytes you ever pull into the worker, where the bytes live during the slow upload phase, and how many parses run in parallel. We will walk through why the naive pattern dies, the two-step upload pattern HTPBE? uses internally to decouple upload throughput from analysis throughput, a clean Node 20+ implementation with backpressure and a hard byte cap, and the concurrency story across Vercel, Fly.io, Coolify, and a self-hosted Fastify worker. (For the conceptual API tour first, see the Node.js integration guide; for the queue/fan-out angle, see batch PDF verification with queues.)
Why “buffer the whole file” dies at scale
The pattern most quick-start tutorials show looks like this:
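import express from 'express';
import multer from 'multer';
import { analyzePdf } from './analyzePdf'; // hypothetical: any full-buffer PDF parser
const app = express();
const upload = multer(); // no storage option: Multer keeps the whole file in memory
app.post('/upload', upload.single('pdf'), async (req, res) => {
const buffer = req.file!.buffer; // entire file in RAM
const result = await analyzePdf(buffer);
res.json(result);
});
This works on your laptop with one user. It fails in production for three independent reasons.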
V8 heap pressure compounds with concurrency. A 9 MB PDF is not 9 MB resident in Node — Multer holds the raw buffer, your downstream parser holds a parallel Uint8Array, and any intermediate transform (re-encoding, base64, hashing) typically allocates another copy. Thirty concurrent uploads on a 512 MB worker can plausibly cross the old-space ceiling once you count the per-request V8 overhead and the GC headroom Node tries to keep free. In our planning we assume roughly 3–5x the raw PDF size in transient allocation per in-flight request; you should measure on your own hardware before locking in a worker size.
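The arithmetic is worth writing down, because the same numbers drive the concurrency limit later. Here is a minimal sketch that uses this article's planning assumption of 3–5x transient allocation per in-flight request. The helper names are hypothetical, and the multiplier is an estimate, not a measurement:

```typescript
const MB = 1024 * 1024;

/** Estimated peak transient bytes with `concurrency` requests of `avgPdfBytes` in flight. */
export function estimatePeakBytes(
  concurrency: number,
  avgPdfBytes: number,
  multiplier = 3, // planning assumption: 3–5x the raw PDF size per request
): number {
  return concurrency * avgPdfBytes * multiplier;
}

/** Largest concurrency whose estimated peak stays within `budgetBytes`. */
export function maxSafeConcurrency(
  budgetBytes: number,
  avgPdfBytes: number,
  multiplier = 3,
): number {
  const perRequest = avgPdfBytes * multiplier;
  return Math.max(0, Math.floor(budgetBytes / perRequest));
}

// Thirty 9 MB uploads at the low end of the range: 810 MB, far past a 512 MB worker.
console.log(`${estimatePeakBytes(30, 9 * MB) / MB} MB`);
// Same worker, keeping ~25% of 512 MB as headroom (an assumption): 14 at 3x.
console.log(maxSafeConcurrency(0.75 * 512 * MB, 9 * MB));
```

At the 5x end of the range, the same budget allows only 8 concurrent requests. That is why measuring your own multiplier matters more than the formula.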
Vercel serverless function bodies are capped at 4.5 MB. That cap is the request body limit on the function entry, not a Node limit. Once a customer uploads a 6 MB bank statement, you get a 413 you cannot retry around. The standard fix is to keep the file out of the function body entirely — upload it to object storage from the browser, then send the URL to your function. The same pattern, wired through Next.js App Router, is in PDF tamper detection in Next.js API routes.
Buffer.concat on large stream chunks blocks the event loop. When you read a 9 MB stream into a list of chunks and concatenate them at the end, the concat is one synchronous allocation plus a full copy, and no other request on that worker makes progress while it runs. On a busy worker, that allocation also competes with every other request’s GC pass. The visible symptom is tail-latency drift: p50 stays fine, while p99 climbs as load increases.
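When the response declares its size, you can avoid both the retained chunk list and the end-of-stream concat. Allocate the destination once, up front, and copy each chunk in as it arrives. This is a minimal sketch, not what the capped fetch later in this guide does. It assumes the declared size has already been checked against your cap, and `collectIntoPreallocated` is a hypothetical helper. The whole file is still resident at the end. What changes is that the one large allocation happens before the download starts, and the copying is spread across chunk arrivals:

```typescript
/**
 * Copies an async stream of chunks into one buffer allocated up front.
 * Each chunk becomes garbage as soon as it is copied, so no chunk list
 * is retained and there is no large copy at the end of the stream.
 */
export async function collectIntoPreallocated(
  chunks: AsyncIterable<Uint8Array>,
  declaredBytes: number,
): Promise<Uint8Array> {
  const out = new Uint8Array(declaredBytes);
  let offset = 0;
  for await (const chunk of chunks) {
    if (offset + chunk.byteLength > declaredBytes) {
      // The server sent more than it declared: treat it as hostile and stop reading.
      throw new Error(`received more than the declared ${declaredBytes} bytes`);
    }
    out.set(chunk, offset);
    offset += chunk.byteLength;
  }
  // A short body returns a view over only the bytes that actually arrived.
  return offset === declaredBytes ? out : out.subarray(0, offset);
}
```

Inside a capped fetch, you would call it as `collectIntoPreallocated(res.body as unknown as AsyncIterable<Uint8Array>, declared)` after validating `declared`. Without a trustworthy Content-Length, you are back to collecting chunks.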
The takeaway is not “avoid PDFs” — it is “separate the bytes from the analysis.” The upload path and the parse path want different resource budgets, and forcing them through one function is what creates the failure mode.
The two-step pattern: client → R2 → analyzer
HTPBE?’s own service uses a pattern worth replicating in any Node app that has to accept user-uploaded PDFs at scale, on Vercel or elsewhere:
- The browser asks the server for a presigned PUT URL to object storage (S3, R2, GCS).
- The browser PUTs the file directly to storage. The Node server never sees the bytes during upload.
- The browser sends the resulting public URL to the analyze endpoint.
- The analyze worker pulls the bytes from storage with backpressure and a hard cap, parses, and makes the verdict available for polling.
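The browser side of steps 1–3 takes only three requests. This is a minimal sketch with two assumptions. First, your server exposes a hypothetical `/api/upload-url` endpoint that returns `{ uploadUrl, publicUrl }` for a presigned PUT; this article does not define it. Second, it posts to a `/verify` route like the one shown later. The `fetchImpl` parameter exists only so the flow can be exercised without a network:

```typescript
interface UploadTarget {
  uploadUrl: string; // presigned PUT URL to S3 / R2 / GCS
  publicUrl: string; // URL the analyzer will fetch later
}

type FetchLike = (input: string, init?: RequestInit) => Promise<Response>;

export async function uploadThenVerify(
  file: Blob,
  fetchImpl: FetchLike = fetch,
): Promise<unknown> {
  // 1. Ask our own server for a presigned PUT URL (hypothetical endpoint).
  const targetRes = await fetchImpl('/api/upload-url', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ contentType: 'application/pdf', size: file.size }),
  });
  if (!targetRes.ok) throw new Error(`upload-url failed: ${targetRes.status}`);
  const { uploadUrl, publicUrl } = (await targetRes.json()) as UploadTarget;

  // 2. PUT the bytes straight to object storage; the Node server never sees them.
  const putRes = await fetchImpl(uploadUrl, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/pdf' },
    body: file,
  });
  if (!putRes.ok) throw new Error(`storage PUT failed: ${putRes.status}`);

  // 3. Hand only the URL to the analyze endpoint.
  const verifyRes = await fetchImpl('/verify', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url: publicUrl }),
  });
  if (!verifyRes.ok) throw new Error(`verify failed: ${verifyRes.status}`);
  return verifyRes.json();
}
```

Because the PUT goes straight to storage, the 4.5 MB function-body limit never applies to the PDF. The only bodies your functions see are two small JSON payloads.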
This decouples two very different workloads. Upload throughput is bound by the user’s network and the object store’s ingress — both of which scale far better than a single Node process. Analysis throughput is bound by the parser’s CPU and memory profile — which you can size independently and run on a different machine if you want. The Node worker is no longer in the path of the bytes during the slow phase.
For the HTPBE? API specifically, this is what the POST /api/v1/analyze endpoint expects: a public URL to the PDF, not the file itself. The submission returns a check_id synchronously; your service then polls GET /api/v1/result/{id} for the verdict once analysis completes. Your role is to host the bytes somewhere reachable (your own R2 bucket, S3, a signed CDN URL) and hand the URL over.
Implementing the analyze worker in Node 20+
Here is the analyze side: the part of your service that takes a URL, fetches the bytes with backpressure, and enforces a byte cap before anything is handed to HTPBE? for a verdict. It uses only Node 20 built-ins. The pdf-lib constraint above shapes the design. Most local PDF parsers (including the one HTPBE? uses server-side) still require a full resident buffer, so the goal is to bound that buffer, not to avoid it.
// src/lib/fetchPdfWithCap.ts
const MAX_PDF_BYTES = 10 * 1024 * 1024; // 10 MB hard cap
const FETCH_TIMEOUT_MS = 30_000;
export class PdfTooLargeError extends Error {}
export class PdfFetchTimeoutError extends Error {}
/**
* Pulls bytes from a URL with backpressure and a hard byte cap.
* Returns a Uint8Array no larger than MAX_PDF_BYTES, or throws.
*/
export async function fetchPdfWithCap(url: string): Promise<Uint8Array> {
const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), FETCH_TIMEOUT_MS);
try {
const res = await fetch(url, { signal: controller.signal });
if (!res.ok) {
throw new Error(`fetch failed: ${res.status} ${res.statusText}`);
}
if (!res.body) {
throw new Error('response has no body');
}
// Optimistic short-circuit: if Content-Length is present and over cap,
// abort before reading any bytes.
const declared = Number(res.headers.get('content-length'));
if (Number.isFinite(declared) && declared > MAX_PDF_BYTES) {
controller.abort();
throw new PdfTooLargeError(`content-length ${declared} exceeds cap ${MAX_PDF_BYTES}`);
}
const chunks: Uint8Array[] = [];
let total = 0;
// res.body is a web ReadableStream; Node 20+ accepts it for await directly.
for await (const chunk of res.body as unknown as AsyncIterable<Uint8Array>) {
total += chunk.byteLength;
if (total > MAX_PDF_BYTES) {
controller.abort();
throw new PdfTooLargeError(`streamed ${total} bytes exceeded cap ${MAX_PDF_BYTES}`);
}
chunks.push(chunk);
}
// One concat at the end, on a bounded total. This is the unavoidable
// copy required by pdf-lib's full-buffer API; the cap above keeps it safe.
const out = new Uint8Array(total);
let offset = 0;
for (const c of chunks) {
out.set(c, offset);
offset += c.byteLength;
}
return out;
} catch (err) {
if ((err as Error).name === 'AbortError') {
throw new PdfFetchTimeoutError(`fetch aborted after ${FETCH_TIMEOUT_MS}ms`);
}
throw err;
} finally {
clearTimeout(timeout);
}
}
A few things worth highlighting in that code.
The Content-Length short-circuit. If the server honestly declares an oversized body, we abort before reading a single byte of it. Servers can also lie, omit the header, use chunked encoding, or report the compressed size of a gzip-encoded body. That is why the in-loop byte counter, which sees the decoded bytes, is the actual enforcement.
The in-loop cap is the real defence. A 5 GB hostile URL never becomes resident in your worker. At most it can put the cap plus one network chunk into memory, because we abort the moment the running total crosses the threshold. The AbortController propagates the cancel through fetch down to the underlying socket.
The final concat is bounded. We do still need a single Uint8Array at the end. pdf-lib’s PDFDocument.load() requires a fully resident buffer — the PDF format’s xref table is at the end of the file and references arbitrary offsets, so there is no honest way to parse it as a true stream. The streaming win here is on the network path: backpressured download and a hard upper bound, rather than a parser that consumes constant memory. That is still a meaningful improvement over await res.arrayBuffer() with no cap.
The timeout is on the whole fetch, not per-chunk. A slowloris-style server that drips one byte every five seconds will hit the 30-second wall before it can fill your buffer.
Handing the URL to HTPBE?
To have HTPBE? verify a file, you can either skip the local pull entirely (the cheapest path: pass the URL straight through) or run your own capped fetch and checks first and then forward. Either way, what you submit is the URL, not the bytes. The forwarding call is straightforward:
// src/lib/analyzeWithHtpbe.ts
const HTPBE_ENDPOINT = 'https://api.htpbe.tech/v1/analyze';
export interface HtpbeAnalyzeResponse {
id: string;
}
export async function submitForAnalysis(
pdfUrl: string,
apiKey: string
): Promise<HtpbeAnalyzeResponse> {
const res = await fetch(HTPBE_ENDPOINT, {
method: 'POST',
headers: {
Authorization: `Bearer ${apiKey}`,
'Content-Type': 'application/json',
},
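body: JSON.stringify({ url: pdfUrl }),
// Overall 30 s deadline so a stalled connection cannot hang the worker.
signal: AbortSignal.timeout(30_000),
});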
if (!res.ok) {
const text = await res.text();
throw new Error(`HTPBE submit failed: ${res.status} ${text}`);
}
return res.json() as Promise<HtpbeAnalyzeResponse>;
}
You then poll GET /api/v1/result/{id} for the verdict. The response carries a verdict field (intact, modified, or inconclusive) and a modification_markers array of public IDs from our forensic checks catalog (for example, HTPBE_POST_SIGNATURE_EDIT or HTPBE_DATES_DISAGREE). There is no risk score; the verdict plus the markers is the contract.
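A polling loop for that result can stay small. This is a minimal sketch with three assumptions. The API origin is passed in as `baseUrl`, because the exact host belongs in your config. The result endpoint accepts the same bearer token as the submission. A response without a `verdict` field means the analysis is still running. Confirm how pending results are actually signalled in the API reference before relying on that last rule:

```typescript
export type Verdict = 'intact' | 'modified' | 'inconclusive';

export interface HtpbeResult {
  verdict?: Verdict; // assumption: absent while analysis is still running
  modification_markers?: string[]; // e.g. "HTPBE_POST_SIGNATURE_EDIT"
}

type FetchLike = (input: string, init?: RequestInit) => Promise<Response>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

export async function pollForVerdict(
  baseUrl: string, // assumption: API origin taken from your config
  checkId: string,
  apiKey: string,
  { intervalMs = 1_000, maxIntervalMs = 8_000, deadlineMs = 60_000, fetchImpl = fetch as FetchLike } = {},
): Promise<HtpbeResult & { verdict: Verdict }> {
  const started = Date.now();
  let delay = intervalMs;
  while (Date.now() - started < deadlineMs) {
    const res = await fetchImpl(`${baseUrl}/api/v1/result/${encodeURIComponent(checkId)}`, {
      headers: { Authorization: `Bearer ${apiKey}` },
    });
    if (!res.ok) throw new Error(`result fetch failed: ${res.status}`);
    const body = (await res.json()) as HtpbeResult;
    if (body.verdict) return body as HtpbeResult & { verdict: Verdict };
    await sleep(delay);
    delay = Math.min(delay * 2, maxIntervalMs); // exponential backoff, capped
  }
  throw new Error(`no verdict for ${checkId} within ${deadlineMs}ms`);
}
```

The capped backoff keeps polling cheap for complex PDFs that take several seconds. The overall deadline ensures that a stuck check surfaces as an error rather than a hung request.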
Express and Fastify wiring
The Express version of the analyze handler is the boring case. The client sends a small JSON body containing a URL rather than a multipart upload, so express.json() with a tight limit is all the body parsing you need. There is no Multer in this path.
// Express
import express from 'express';
import { fetchPdfWithCap, PdfTooLargeError } from './lib/fetchPdfWithCap';
import { submitForAnalysis } from './lib/analyzeWithHtpbe';
const app = express();
app.use(express.json({ limit: '1mb' })); // we accept a URL, not a file
app.post('/verify', async (req, res, next) => {
try {
const { url } = req.body as { url?: string };
if (!url) return res.status(400).json({ error: 'url required' });
// Optional: pre-fetch with cap to fail fast on huge files
await fetchPdfWithCap(url);
const verdict = await submitForAnalysis(url, process.env.HTPBE_API_KEY!);
res.json(verdict);
} catch (err) {
if (err instanceof PdfTooLargeError) {
return res.status(413).json({ error: 'pdf exceeds 10MB cap' });
}
next(err);
}
});
Fastify is structurally the same but tends to carry less per-request overhead at high RPS. Its JSON-schema body validation also rejects a missing or malformed URL before your handler runs.
// Fastify
import Fastify from 'fastify';
import { fetchPdfWithCap, PdfTooLargeError } from './lib/fetchPdfWithCap';
import { submitForAnalysis } from './lib/analyzeWithHtpbe';
const app = Fastify({ logger: true, bodyLimit: 1 * 1024 * 1024 });
app.post<{ Body: { url: string } }>(
'/verify',
{
schema: {
body: {
type: 'object',
required: ['url'],
properties: { url: { type: 'string', format: 'uri' } },
},
},
},
async (req, reply) => {
try {
await fetchPdfWithCap(req.body.url);
return submitForAnalysis(req.body.url, process.env.HTPBE_API_KEY!);
} catch (err) {
if (err instanceof PdfTooLargeError) {
return reply.code(413).send({ error: 'pdf exceeds 10MB cap' });
}
throw err;
}
}
);
app.listen({ port: 3000 });
In both cases, the byte cap and the AbortController logic live inside fetchPdfWithCap, not in the framework. That is deliberate: the safety properties travel with the function, not with the route handler.
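Both handlers translate only the size error, so a fetch timeout falls through to a generic 500. One way to keep the status mapping in one place, so that Express and Fastify answer identically, is a small pure function. This is a sketch under stated assumptions. The two error classes are re-declared here only so the snippet runs standalone; in your app, import them from fetchPdfWithCap.ts. Returning 504 for a fetch timeout is one reasonable choice, not something the HTPBE? API prescribes:

```typescript
// Stand-ins for the classes exported by src/lib/fetchPdfWithCap.ts.
class PdfTooLargeError extends Error {}
class PdfFetchTimeoutError extends Error {}

export interface HttpError {
  status: number;
  body: { error: string };
}

/** Maps a thrown error to the HTTP response both frameworks should send. */
export function toHttpError(err: unknown): HttpError {
  if (err instanceof PdfTooLargeError) {
    return { status: 413, body: { error: 'pdf exceeds 10MB cap' } };
  }
  if (err instanceof PdfFetchTimeoutError) {
    // The upstream file host was too slow; this is not the client's fault.
    return { status: 504, body: { error: 'timed out fetching pdf' } };
  }
  // Anything unexpected: log it and return a generic error without internals.
  return { status: 500, body: { error: 'internal error' } };
}
```

In Express, the catch block becomes `const { status, body } = toHttpError(err); res.status(status).json(body);`. In Fastify, it becomes `reply.code(status).send(body)`.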
Concurrency control — the part that actually matters
Backpressure on a single request is not enough. The thing that crashes Node workers is N requests in flight at once each holding 10 MB. You need an explicit concurrency cap, and it has to live above the parser, not inside it.
A 20-line semaphore is usually sufficient.
// src/lib/limit.ts
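export function createLimiter(max: number) {
let active = 0;
const queue: Array<() => void> = [];
return async function limit<T>(fn: () => Promise<T>): Promise<T> {
if (active >= max) {
// Wait until a finishing task hands its slot directly to us.
await new Promise<void>((resolve) => queue.push(resolve));
} else {
active++;
}
try {
return await fn();
} finally {
const next = queue.shift();
if (next) {
// Transfer the slot instead of releasing it, so a caller arriving
// before the waiter resumes cannot slip in and exceed `max`.
next();
} else {
active--;
}
}
};
}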
// Usage at module scope:
const analyzeLimit = createLimiter(4); // tune per worker
// In the handler:
await analyzeLimit(() => fetchPdfWithCap(url));
p-limit does the same thing if you prefer a dependency. The number you put there matters more than the implementation. Some rough guidance for picking it:
- Vercel Functions (1024 MB tier): 2–4 concurrent analyses per function instance, depending on average file size. Vercel autoscales horizontally, so per-instance concurrency should be conservative.
- Fly.io / Render / Railway (shared-CPU 1 GB instances): 3–6 concurrent, with --max-old-space-size=768 to keep V8 from optimistically growing past your container limit.
- Coolify / self-hosted Docker on a 2 GB instance: 6–10 concurrent if PDFs average around 1 MB; halve that if you regularly see 5 MB+ files. Set a Docker memory limit so an OOM kills the worker cleanly instead of the whole host (we learned that one the hard way).
- A dedicated Fastify analyzer worker behind a queue: pick the number empirically. Push load through it, watch process.memoryUsage().heapUsed and rss, and back off when rss plateaus above 80% of the container limit.
All of these are planning numbers, not benchmarks. Measure your own workload before locking them in.
For higher throughput than a single worker can give you, the right pattern is not a bigger semaphore — it is a queue with N workers. We cover that pattern in batch PDF verification with queues.
Production checklist
A short list of things worth wiring up before you ship this to real traffic.
Memory observability. Log process.memoryUsage() (or, better, push it to a metrics sink) every 30 seconds. A worker whose rss climbs monotonically across an hour is leaking. Take a heap snapshot with node --inspect and chrome://inspect before it OOMs — once it dies, you lose the evidence.
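A sampler for this takes only a few lines. This is a minimal sketch in which `sink` is a placeholder for whatever metrics client or logger you use. Log `arrayBuffers` alongside `heapUsed`, because PDF bytes held in Uint8Arrays live in ArrayBuffer backing stores, which Node reports under `external` and `arrayBuffers` rather than `heapUsed`:

```typescript
export interface MemorySample {
  at: number; // epoch milliseconds
  rss: number;
  heapUsed: number;
  heapTotal: number;
  external: number;
  arrayBuffers: number;
}

/** Samples process memory every `intervalMs` and passes it to `sink`. Returns a stop function. */
export function startMemorySampler(
  sink: (sample: MemorySample) => void,
  intervalMs = 30_000,
): () => void {
  const timer = setInterval(() => {
    const { rss, heapUsed, heapTotal, external, arrayBuffers } = process.memoryUsage();
    sink({ at: Date.now(), rss, heapUsed, heapTotal, external, arrayBuffers });
  }, intervalMs);
  timer.unref(); // metrics alone should never keep the process alive
  return () => clearInterval(timer);
}
```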
Node flags. Set --max-old-space-size to about 75% of your container memory. If your container has 1 GB, set it to 768. This forces V8 to GC earlier and gives you a useful crash signature (a clean V8 OOM with a stack trace) instead of a container kill from the orchestrator.
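If the same image runs on containers of different sizes, you can derive the number instead of hard-coding it. This is a minimal sketch that assumes cgroup v2, where the container limit appears in `/sys/fs/cgroup/memory.max` as a byte count or the literal `max` (no limit). The 75% fraction is the rule of thumb above. The flag must be set before the worker's V8 starts. Run the computation in a small launcher or entrypoint, then pass the result via the command line or NODE_OPTIONS. Inside the worker, `v8.getHeapStatistics().heap_size_limit` lets you confirm the flag took effect. It includes the young generation as well, so it reads somewhat above the flag value:

```typescript
import { readFileSync } from 'node:fs';
import { getHeapStatistics } from 'node:v8';

const MiB = 1024 * 1024;

/** Parses cgroup v2 memory.max contents and returns `fraction` of it in MiB, or null if unlimited. */
export function oldSpaceSizeMiB(memoryMaxContents: string, fraction = 0.75): number | null {
  const raw = memoryMaxContents.trim();
  if (raw === 'max') return null; // no container limit: leave V8's default alone
  const bytes = Number(raw);
  if (!Number.isFinite(bytes) || bytes <= 0) return null;
  return Math.floor((bytes / MiB) * fraction);
}

/** Reads the cgroup v2 limit; returns null when the file is missing (cgroup v1, macOS, Windows). */
export function oldSpaceSizeFromCgroup(path = '/sys/fs/cgroup/memory.max'): number | null {
  try {
    return oldSpaceSizeMiB(readFileSync(path, 'utf8'));
  } catch {
    return null;
  }
}

/** The heap limit V8 is actually running with, in MiB. */
export function currentHeapLimitMiB(): number {
  return Math.round(getHeapStatistics().heap_size_limit / MiB);
}
```

For a 1 GiB container, `oldSpaceSizeMiB('1073741824')` gives 768, which matches the example above.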
Sentry around the analyze call. Wrap each HTPBE? call in a Sentry span with the file URL as a tag (not the bytes), the verdict as a tag, and the latency as a measurement. When something regresses six weeks from now, the “p99 latency for modified verdicts” query is the one you will want.
Logs you will thank yourself for. At a minimum: request ID, file URL, declared size, observed size, HTPBE? check ID, verdict, modification markers. The check ID is the join key to our system — if you ever need to ask us about a specific result, that ID is what we need.
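One way to make that list hard to forget is to give it a type and emit it as a single JSON line per verification. In this sketch, the field names are illustrative, not a schema HTPBE? defines. `latencyMs` is the same number you would record as the Sentry measurement:

```typescript
export interface VerificationLog {
  requestId: string;
  fileUrl: string; // the URL only, never the bytes
  declaredBytes: number | null; // Content-Length, if the server sent one
  observedBytes: number | null; // what the capped reader actually counted
  checkId: string | null; // join key when asking HTPBE? about a result
  verdict: 'intact' | 'modified' | 'inconclusive' | null;
  modificationMarkers: string[];
  latencyMs: number;
}

/** Serialises one record as a single JSON line for stdout or a log shipper. */
export function formatVerificationLog(record: VerificationLog): string {
  return JSON.stringify({ event: 'pdf_verification', ...record });
}
```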
Honest timeouts. The HTPBE? /analyze endpoint can take a few seconds on a complex PDF. Do not set your client timeout to 2 seconds; 30 is reasonable and gives the algorithm room to finish on edge cases. Set the connect timeout separately from the response timeout if your HTTP client supports it.
A cap above the cap. Even with a 10 MB per-file cap, a single worker can be wedged by 4 concurrent 10 MB downloads from a slow CDN. Set a per-worker total inflight-bytes budget if your workload trends large.
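A per-worker byte budget is a weighted version of the semaphore above. Each request reserves its expected size before downloading and releases it afterwards. This is a minimal sketch that assumes you reserve the declared Content-Length when it is present and the full 10 MB cap when it is not. `createByteBudget` is a hypothetical helper, not a library API:

```typescript
export function createByteBudget(maxBytes: number) {
  let inUse = 0;
  const waiters: Array<{ bytes: number; resolve: () => void }> = [];

  function drain(): void {
    // Admit waiters strictly in arrival order while they fit; reserve before resolving.
    while (waiters.length > 0 && inUse + waiters[0].bytes <= maxBytes) {
      const w = waiters.shift()!;
      inUse += w.bytes;
      w.resolve();
    }
  }

  return {
    /** Runs `fn` once `bytes` of budget is free, then returns the budget. */
    async withBytes<T>(bytes: number, fn: () => Promise<T>): Promise<T> {
      if (bytes > maxBytes) throw new RangeError(`request for ${bytes} exceeds budget ${maxBytes}`);
      if (waiters.length === 0 && inUse + bytes <= maxBytes) {
        inUse += bytes;
      } else {
        // Queue behind earlier waiters so large requests are not starved by small ones.
        await new Promise<void>((resolve) => waiters.push({ bytes, resolve }));
      }
      try {
        return await fn();
      } finally {
        inUse -= bytes;
        drain();
      }
    },
    get inUse(): number {
      return inUse;
    },
  };
}
```

Used together with createLimiter, this bounds both how many downloads run and how many bytes they hold at once.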
Where this leaves you
The pattern in this article gives a single Node worker meaningfully more headroom than the naive multipart-upload approach — not by making the parser itself cheaper, but by closing each of the three control points we opened with: bytes never sit in the function during upload, the in-loop cap puts a hard ceiling on what gets pulled into memory, and an explicit semaphore caps how many parses run in parallel. Memory, size, concurrency — all three bounded. That is what changes the shape of your latency curve under load.
The reason any of this matters at all is that the document arriving at your service is — in a meaningful percentage of cases — not what it claims to be. A PDF that has been opened in an editor and re-saved between the bank that issued it and the user that uploaded it carries structural fingerprints of that re-save: divergent dates, incremental update chains, mismatched producer strings, post-signature edits. Catching those structural signals before your business logic trusts the document is what HTPBE? does. The API reference has the full schema; the pricing page has the per-check economics that make this realistic in a high-throughput pipeline.