PDF xref Table Forensics: Detect Edits From File Structure

This article is a snapshot — content was accurate as of May 2026 (code examples tested against the API as of April 2026). The product evolves actively; specific counts, examples, and detection rules may have changed since publication — see the changelog for the current state.
Most PDF fraud detection focuses on what you can read: timestamps, producer strings, creator fields. That metadata is useful, but it is also the easiest thing to forge. A fraudster can overwrite every metadata field in under a minute.
The xref table is different. It is not a metadata field you can find and overwrite with a text editor. It is the PDF’s internal index — the structure the reader uses to locate every object in the file. Because standard editors save changes as incremental updates — appending a new xref rather than rewriting the existing one, as the PDF specification defines — every save operation leaves a structural mark that cannot be cleanly removed without rebuilding the file from scratch.
This article covers what the xref table is at the byte level, why it is the primary forensic signal for incremental modifications, what the edit trail looks like in practice, and how HTPBE reads the xref chain to produce intact, modified, and inconclusive verdicts. If you want a broader overview of how PDF tamper detection works across all five analysis layers, that is covered separately.
What the PDF cross-reference table actually is
A PDF file is not a sequential document. It is a collection of numbered objects — pages, fonts, images, the info dictionary, annotation arrays — written at arbitrary byte positions in the file. The cross-reference table (xref) is the index that maps each object number to its byte offset.
A minimal xref section looks like this:
xref
0 6
0000000000 65535 f
0000000009 00000 n
0000000058 00000 n
0000000115 00000 n
0000000266 00000 n
0000000397 00000 n
Each 20-byte entry records three values: the byte offset (0000000058), the generation number (00000), and a flag (n for in-use, f for free). When a PDF reader opens the file, it goes directly to the xref, reads the index, and uses the offsets to locate objects without scanning the entire file from the beginning.
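The format is rigid enough to parse mechanically. Here is a minimal sketch of parsing one classic xref entry line; parseXrefEntry is a hypothetical helper written for this article, not part of any HTPBE SDK:
interface XrefEntry {
  offset: number;      // 10-digit byte offset into the file
  generation: number;  // 5-digit generation number
  inUse: boolean;      // 'n' = in use, 'f' = free
}

function parseXrefEntry(line: string): XrefEntry {
  // Classic entry format: "nnnnnnnnnn ggggg n", padded to 20 bytes with its EOL
  const match = /^(\d{10}) (\d{5}) ([nf])/.exec(line);
  if (!match) throw new Error(`Not a classic xref entry: ${line}`);
  return {
    offset: parseInt(match[1], 10),
    generation: parseInt(match[2], 10),
    inUse: match[3] === 'n',
  };
}

// parseXrefEntry('0000000058 00000 n') → { offset: 58, generation: 0, inUse: true }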
After the xref section comes the trailer:
trailer
<<
/Size 6
/Root 1 0 R
/Info 2 0 R
>>
startxref
431
%%EOF
The startxref offset tells the reader where in the file the xref begins. The /Root and /Info entries in the trailer point to the document catalog and the metadata dictionary.
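A reader bootstraps from the end of the file. As a minimal sketch (assuming Node.js and a file small enough to read whole; findStartXref is a hypothetical helper, not HTPBE code), locating the startxref offset looks like this:
import { readFileSync } from 'node:fs';

function findStartXref(path: string): number {
  const bytes = readFileSync(path);
  // startxref sits near the end of the file; scanning the last 2 KB is enough in practice
  const tail = bytes.subarray(Math.max(0, bytes.length - 2048)).toString('latin1');
  const match = /startxref\s+(\d+)\s+%%EOF\s*$/.exec(tail);
  if (!match) throw new Error('No startxref found in the file tail');
  return parseInt(match[1], 10); // byte offset of the most recent xref section
}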
Modern PDF 1.5+ files often use compressed cross-reference streams instead of the traditional plaintext xref table. The format differs — object offsets are packed into a binary stream inside a numbered PDF object — but the forensic logic is identical: each save session appends a new xref stream with its own /Prev pointer to the previous one.
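One way to tell the two forms apart is to look at what the startxref offset points to: a classic section begins with the keyword xref, while a cross-reference stream begins with an object header such as 12 0 obj. A sketch under the same assumptions as above (xrefKind is hypothetical):
import { readFileSync } from 'node:fs';

function xrefKind(path: string, startXrefOffset: number): 'table' | 'stream' {
  const head = readFileSync(path)
    .subarray(startXrefOffset, startXrefOffset + 32)
    .toString('latin1');
  // "xref" keyword → classic table; "N G obj" header → cross-reference stream
  return head.trimStart().startsWith('xref') ? 'table' : 'stream';
}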
A freshly exported PDF has exactly one xref section (or xref stream) and one %%EOF. That is the baseline.
Why incremental updates leave an unavoidable structural trail
The PDF specification defines a mechanism called the incremental update. When a PDF is edited and saved with an incremental update — the default behavior of most PDF editors — the existing file body is not rewritten. Changes are appended to the end:
- New or modified objects are written after the existing %%EOF
- A new xref section is written, containing entries only for the changed objects
- A new trailer is written with a /Prev pointer to the previous xref’s byte offset
- A new %%EOF marker closes the update
The resulting file looks like this:
[original body — objects 1–5]
%%EOF
[new object 2 — modified Info dict]
[new object 6 — added annotation]
xref
0 1
0000000000 65535 f
2 1
0000000512 00000 n
6 1
0000000688 00000 n
trailer
<<
/Size 7
/Root 1 0 R
/Info 2 0 R
/Prev 431 ← points to first xref
>>
startxref
730
%%EOF
The reader starts from the last %%EOF, follows startxref backward, reads the newest xref, and follows /Prev to reconstruct the full object table. Later revisions override earlier ones for the same object number — that’s how edits work.
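A sketch of that override rule, assuming the xref sections have already been parsed into per-session maps (the XrefSection shape is this article’s, not a library type): walking newest-to-oldest and keeping only the first offset seen per object number reproduces the view a reader has of the file.
interface XrefSection {
  entries: Map<number, number>; // object number → byte offset in this session
  prev: number | null;          // /Prev offset, or null for the original xref
}

function resolveObjectTable(sections: XrefSection[]): Map<number, number> {
  // sections[0] is the newest xref section, sections[sections.length - 1] the original
  const table = new Map<number, number>();
  for (const section of sections) {
    for (const [objNum, offset] of section.entries) {
      if (!table.has(objNum)) table.set(objNum, offset); // the newest revision wins
    }
  }
  return table;
}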
For forensic purposes, the /Prev chain is a directed linked list of every save operation the file has ever undergone. Each node in the chain is an xref section. Counting the nodes gives you the number of save sessions.
A file with one xref was saved exactly once — almost always at export time from the originating application. A file with three xref sections was saved three times: once when created, and twice more afterward. That is a PDF revision history embedded in the structure. Whether it is malicious depends on context, but it is structurally irrefutable.
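Counting those nodes does not require a full PDF parser. A minimal sketch, assuming classic plaintext xref tables, a file that fits in memory, and a trailer that appears within the first few kilobytes of each section (countSaveSessions is hypothetical, not HTPBE code):
import { readFileSync } from 'node:fs';

function countSaveSessions(path: string): number {
  const pdf = readFileSync(path).toString('latin1');

  // Start at the newest xref (the last startxref) and follow /Prev backward
  const tail = /startxref\s+(\d+)\s+%%EOF\s*$/.exec(pdf);
  if (!tail) throw new Error('No startxref found');

  let offset: number | null = parseInt(tail[1], 10);
  let sessions = 0;
  const seen = new Set<number>(); // guard against malformed circular /Prev chains

  while (offset !== null && !seen.has(offset)) {
    seen.add(offset);
    sessions += 1;
    // Look for the /Prev key in the trailer that follows this xref section
    const chunk = pdf.slice(offset, offset + 8192);
    const prev = /\/Prev\s+(\d+)/.exec(chunk);
    offset = prev ? parseInt(prev[1], 10) : null;
  }
  return sessions; // 1 = saved once at export; 3 = created, then edited twice
}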
What the cross-reference table tampering trail looks like at the byte level
Here are the key differences between an untouched PDF and a modified one, visible at the binary level.
A clean, single-session PDF:
%PDF-1.7
[objects]
xref
0 12
[12 entries]
trailer
<< /Size 12 /Root 1 0 R /Info 2 0 R >>
startxref
4821
%%EOF
xref_count: 1. No /Prev. One %%EOF. This is what an unmodified export looks like.
A PDF modified with a standard editor:
%PDF-1.7
[original objects 1–11]
xref
0 12
[12 entries — original]
trailer
<< /Size 12 /Root 1 0 R /Info 2 0 R >>
startxref
4821
%%EOF
[modified Info dict at new offset]
[modified page content object at new offset]
xref
2 1
0000009104 00000 n
5 1
0000009388 00000 n
trailer
<< /Size 12 /Root 1 0 R /Info 2 0 R /Prev 4821 >>
startxref
9512
%%EOF
xref_count: 2. One /Prev pointer. Two %%EOF markers. The /Info object (object 2) was rewritten — metadata was changed in the edit session. The page content object (object 5) was also rewritten — text was altered.
A PDF modified twice:
Three xref sections, two /Prev pointers, three %%EOF markers. xref_count: 3. Each session is a separate link in the chain. If object 2 (the Info dict) changed in the second session but not the third, you know when the metadata modification happened relative to the content modification.
The xref chain is self-timestamping in the sense that the order of modifications is preserved in the structure. You cannot reorder the sessions without rebuilding the file. Duplicate object IDs across sessions confirm which objects were overwritten during each edit pass.
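A crude but useful first pass is simply counting %%EOF markers in the raw bytes — each appended session adds one. The helper below is a sketch, not HTPBE code, and a marker count is only a heuristic; the /Prev walk is the authoritative session count.
function countEofMarkers(pdfBytes: Buffer): number {
  const text = pdfBytes.toString('latin1');
  return (text.match(/%%EOF/g) ?? []).length; // 1 = single-session baseline
}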
How HTPBE reads the xref chain to detect modified PDFs
HTPBE walks the xref chain from the last %%EOF backward through every /Prev pointer, counting sections and identifying which objects changed in each session. This structural pass is one part of the full forensic detection pipeline.
The raw structural data is returned in the API response alongside the verdict:
{
"id": "ck_7b3e1a09-...",
"status": "modified",
"modification_confidence": "high",
"modification_markers": ["INCREMENTAL_UPDATES", "DIFFERENT_DATES"],
"xref_count": 3,
"has_incremental_updates": true,
"update_chain_length": 2,
"has_digital_signature": false,
"creator": "Microsoft Word",
"producer": "iLovePDF",
"creation_date": 1704067200,
"modification_date": 1709251200
}
xref_count is the total number of xref sections. update_chain_length is xref_count - 1 — the number of post-creation edit sessions. has_incremental_updates is true when update_chain_length > 0.
The verdict is not produced by xref analysis alone. HTPBE aggregates signals across 35 forensic checks — metadata timestamps, xref structure, digital signature integrity, producer/creator consistency, and object-level anomalies — into a single verdict. INCREMENTAL_UPDATES appearing in modification_markers means the xref chain was the triggering signal for that verdict.
The three verdicts in structural terms
intact: One xref section. No /Prev pointer. Timestamps consistent. Producer matches the document’s claimed origin tool. No anomalous object rewrites in a single session.
modified: Multiple xref sections, or a single xref section with timestamp anomalies, or a signature covering less than the full file body, or a producer string that names a known editing tool in a context where it should not appear. The modification_markers field names exactly which signals fired. See how the five detection layers combine to produce a verdict for the full scoring logic.
inconclusive: The document was created with consumer software — Microsoft Word, Google Docs, LibreOffice, a print-to-PDF driver. These tools do not produce the structural signatures of institutional document generators, so xref analysis is not meaningful: a fraudster can produce a fake bank statement in Word with a single clean xref, the same as a legitimate one. inconclusive is not a failure; it is a precise signal that the document’s origin bypasses structural forensics. The guide to what inconclusive means explains this verdict class in detail.
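In an intake workflow those three verdicts typically map to three actions. The routing below is illustrative only — the action names are this example’s, not part of the HTPBE API:
type Verdict = 'intact' | 'modified' | 'inconclusive';

function routeDocument(status: Verdict, markers: string[]): string {
  switch (status) {
    case 'intact':
      return 'auto-accept';        // single xref, consistent metadata
    case 'modified':
      // INCREMENTAL_UPDATES means the xref chain itself triggered the verdict
      return markers.includes('INCREMENTAL_UPDATES') ? 'reject-and-flag' : 'manual-review';
    case 'inconclusive':
      return 'manual-review';      // consumer-software origin: structure proves nothing either way
  }
}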
Querying xref data from the PDF forensics API
Submit a PDF URL for analysis and retrieve the structural fields:
# Step 1 — submit for analysis
curl -X POST https://api.htpbe.tech/v1/analyze \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://your-storage.example.com/documents/statement.pdf"}'
# Response
# {"id": "ck_7b3e1a09-3d7c-4a8e-b1f2-9e0d3c5a7b8f"}
# Step 2 — retrieve result
curl https://api.htpbe.tech/v1/result/ck_7b3e1a09-3d7c-4a8e-b1f2-9e0d3c5a7b8f \
-H "Authorization: Bearer YOUR_API_KEY"
In TypeScript, reading the xref fields directly:
interface HTPBEResult {
id: string;
status: 'intact' | 'modified' | 'inconclusive';
modification_confidence: 'certain' | 'high' | 'none' | null;
modification_markers: string[];
xref_count: number;
has_incremental_updates: boolean;
update_chain_length: number;
creator: string | null;
producer: string | null;
creation_date: number | null;
modification_date: number | null;
}
async function analyzeXrefChain(pdfUrl: string): Promise<void> {
const submitRes = await fetch('https://api.htpbe.tech/v1/analyze', {
method: 'POST',
headers: {
Authorization: `Bearer ${process.env.HTPBE_API_KEY}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({ url: pdfUrl }),
});
const { id } = await submitRes.json() as { id: string };
const resultRes = await fetch(`https://api.htpbe.tech/v1/result/${id}`, {
headers: { Authorization: `Bearer ${process.env.HTPBE_API_KEY}` },
});
const result = await resultRes.json() as HTPBEResult;
console.log(`Verdict: ${result.status}`);
console.log(`xref sections: ${result.xref_count}`);
console.log(`Edit sessions after creation: ${result.update_chain_length}`);
console.log(`Incremental updates: ${result.has_incremental_updates}`);
if (result.has_incremental_updates) {
console.log(`Modification markers: ${result.modification_markers.join(', ')}`);
}
}
For a Python integration example see the PDF tamper detection Python tutorial. The xref fields are useful beyond routing the status verdict. A document review system might log xref_count and update_chain_length to an audit trail even for intact documents, for later investigation if the document is disputed.
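A sketch of such an audit record, reusing the HTPBEResult interface above (the AuditRecord shape is this example’s, not an HTPBE type):
interface AuditRecord {
  documentId: string;
  verdict: string;
  xrefCount: number;
  editSessionsAfterCreation: number;
  markers: string[];
  loggedAt: string;
}

function toAuditRecord(result: HTPBEResult): AuditRecord {
  return {
    documentId: result.id,
    verdict: result.status,
    xrefCount: result.xref_count,
    editSessionsAfterCreation: result.update_chain_length,
    markers: result.modification_markers,
    loggedAt: new Date().toISOString(),
  };
}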
The limits of xref forensics
xref chain analysis does not catch every modification. Two scenarios defeat it entirely.
Full file reconstruction. Tools like Ghostscript, qpdf, and print-to-PDF workflows flatten the entire file into a fresh single-revision PDF. The output has one xref, no /Prev pointers, and a clean structure — because it is a genuinely new file. If someone edits a PDF and then runs it through Ghostscript or prints it to PDF before submitting it, the xref chain is gone. The modification happened; the structural evidence did not survive.
Editing the source, not the PDF. If a fraudster has the original Word document and changes a salary figure there before exporting to PDF, no PDF editing ever occurred. The resulting PDF has a single xref, consistent timestamps, and a Producer that legitimately says “Microsoft Word.” Structural analysis has nothing to work with. This is precisely why inconclusive exists as a verdict class — and why consumer-software-origin documents are treated differently from institutional ones in high-stakes workflows.
These are real gaps. HTPBE documents them honestly because workflows built on a false sense of completeness fail in production. The xref chain catches the majority of real-world document fraud, which uses standard PDF editors rather than sophisticated reconstruction pipelines. Against a skilled adversary who understands PDF internals, additional signals — content analysis, issuer fraud detection, document provenance controls — are necessary. The PDF forensics without the original file article covers which signals remain available when there is no reference copy to compare against.
What xref forensics catches in practice
The gap between “what it misses in theory” and “what it misses in practice” is significant.
Real document fraud in lending, HR, insurance, and legal workflows is overwhelmingly performed with off-the-shelf tools: iLovePDF, Adobe Acrobat, PDF-XChange Editor, Foxit. These tools all produce incremental updates by default. A fraudster who edits a bank statement balance in iLovePDF and submits it is not going to run the result through Ghostscript afterward — they do not know what Ghostscript is.
In these cases, the xref chain is unambiguous: xref_count: 2, update_chain_length: 1, has_incremental_updates: true, modification_markers: ["INCREMENTAL_UPDATES"]. The file was opened, edited, and saved with a consumer PDF editor. That is the fact the structure records.
For developers building document intake pipelines, the xref chain is the most reliable structural signal to expose in an audit log. It does not require interpretation. xref_count: 1 means the file has lived one life. xref_count: 3 means it has had three.
Integrating PDF xref forensics into a document intake pipeline
If you are building a document intake workflow — loan applications, payslip checks, insurance claims, contract review — xref forensics belongs in the pipeline at the point of ingestion. Not as the final word on a document’s authenticity, but as a fast, structural first pass that catches the common cases and surfaces anomalies for human review.
The HTPBE PDF forensics API exposes xref_count, has_incremental_updates, update_chain_length, and the full modification_markers array in every analysis response. Pricing starts at $15/month for 30 checks. You can start testing with a free test API key against synthetic documents before touching production data — no sales call, no approval queue.