
The Anatomy of a Modified PDF: What Our Algorithm Detects

HTPBE Team · 10 min read

When you upload a PDF to HTPBE, our algorithm performs a comprehensive forensic analysis to determine if the document has been modified. This is not a simple check — it is a multi-layered examination that analyzes dozens of technical indicators simultaneously.

This article explains how our algorithm works, what it examines, and why combining several methods gives more reliable results than manual inspection or any single check.

Whether you are a security professional evaluating our technology, a potential API customer considering integration, or a curious user wanting to understand what happens behind the scenes, this technical deep-dive will show you how we detect PDF modifications.

Transparency Builds Trust

Many PDF verification tools operate as "black boxes" — you upload a file and get a result, but you do not know how the analysis works. We believe transparency is essential for building trust, especially when verification results affect important decisions.

This article explains:

  • The five layers of analysis our algorithm performs
  • What each layer reveals about PDF modifications
  • How we calculate confidence scores
  • Why 100% confidence differs from high confidence
  • How we handle false positives
  • Real examples of what our algorithm detects

As discussed in Forensic Focus forums, understanding forensic analysis methods helps users make informed decisions about verification results.

The 5 Layers of HTPBE Analysis

Our algorithm examines PDFs through five distinct but complementary layers. Each layer provides different types of evidence, and together they create a comprehensive picture of document integrity.

Layer 1: Metadata Analysis

Metadata is the "DNA" of a PDF — it reveals how the document was created and processed.

What we examine:

  • Creation and modification dates
  • Creator and producer applications
  • PDF version information
  • Document properties (title, author, subject)
  • Technical metadata fields

What it reveals:

  • When the document was created vs modified
  • Which applications processed the document
  • Whether metadata shows suspicious patterns
  • If dates align with expected document history

Detection capabilities:

  • Date mismatches (old document, recent modification)
  • Unexpected creator/producer applications
  • Metadata inconsistencies
  • Suspicious application fingerprints

Limitations:

  • Metadata can be manipulated
  • Some tools set dates incorrectly
  • Legitimate modifications change metadata

As Zeltser explains, metadata analysis provides valuable clues but requires careful interpretation.
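
To make the date-mismatch check concrete, here is a minimal sketch (not HTPBE's production code) that parses the PDF date strings stored in CreationDate and ModDate and flags a suspicious gap. The function names, the flag names, and the one-day threshold are assumptions for illustration, and a date without a timezone is treated as UTC.

```python
import re
from datetime import datetime, timedelta, timezone

# PDF date strings look like D:YYYYMMDDHHmmSS+HH'mm' (fields after the year are optional).
_PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+\-])(?:(\d{2})'?(\d{2})?'?)?)?"
)


def parse_pdf_date(value: str) -> datetime:
    """Parse a PDF date string into a timezone-aware datetime."""
    m = _PDF_DATE.match(value.strip())
    if not m:
        raise ValueError(f"not a PDF date: {value!r}")
    # Missing fields fall back to the defaults the PDF date format defines.
    year, month, day, hour, minute, second = (
        int(g) if g else default
        for g, default in zip(m.groups()[:6], (0, 1, 1, 0, 0, 0))
    )
    sign, tz_h, tz_m = m.group(7), m.group(8), m.group(9)
    # Assumption: no timezone (or "Z") is treated as UTC.
    offset = timedelta(hours=int(tz_h or 0), minutes=int(tz_m or 0))
    if sign == "-":
        offset = -offset
    return datetime(year, month, day, hour, minute, second, tzinfo=timezone(offset))


def date_gap_days(creation: str, modification: str) -> float:
    """Days from CreationDate to ModDate (negative if ModDate is earlier)."""
    delta = parse_pdf_date(modification) - parse_pdf_date(creation)
    return delta.total_seconds() / 86400


def metadata_flags(creation: str, modification: str, max_gap_days: float = 1.0) -> list[str]:
    """Return hypothetical flag names for suspicious date relationships."""
    gap = date_gap_days(creation, modification)
    if gap < 0:
        return ["mod_before_creation"]
    if gap > max_gap_days:
        return ["modified_long_after_creation"]
    return []
```

A flag from this check is only a clue: as the limitations above note, dates can be edited, and legitimate re-saves also move ModDate.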

Layer 2: Digital Signature Verification

Digital signatures provide cryptographic proof of document integrity.

What we examine:

  • Signature presence and validity
  • Certificate chain validation
  • Signature scope (what content is signed)
  • Timestamp validation
  • Signature wrapping detection

What it reveals:

  • Whether document was modified after signing
  • If signature is cryptographically valid
  • What portion of document is protected
  • When document was signed

Detection capabilities:

  • Post-signing modifications (invalidates signature)
  • Signature wrapping attacks
  • Certificate validity issues
  • Timestamp anomalies

Limitations:

  • Only detects modifications after signing
  • Pre-signing modifications not detected
  • Requires valid certificate chain
  • Some attacks can bypass signature checks

As Security Online notes, signature verification is powerful but not infallible.
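
Signature scope can be illustrated with the /ByteRange array, which lists two (offset, length) pairs covering every byte of the file except the signature value itself. The sketch below is an illustration under simplifying assumptions, not our full verifier: it only checks whether each signed range starts at byte 0 and reaches the end of the file, and it performs no cryptographic validation. The function name and result keys are hypothetical.

```python
import re

# Matches "/ByteRange [start1 len1 start2 len2]" in the raw file bytes.
_BYTE_RANGE = re.compile(rb"/ByteRange\s*\[\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*\]")


def signature_coverage(pdf_bytes: bytes) -> list[dict]:
    """For each /ByteRange found, report how much of the file it covers."""
    results = []
    for m in _BYTE_RANGE.finditer(pdf_bytes):
        start1, len1, start2, len2 = (int(g) for g in m.groups())
        signed_end = start2 + len2
        results.append({
            "byte_range": (start1, len1, start2, len2),
            "starts_at_zero": start1 == 0,
            # True only when nothing precedes or follows the signed ranges.
            "covers_whole_file": start1 == 0 and signed_end == len(pdf_bytes),
            # Bytes appended after the signed content, e.g. by a later update.
            "unsigned_trailing_bytes": max(len(pdf_bytes) - signed_end, 0),
        })
    return results
```

Trailing unsigned bytes are not proof of tampering on their own. A second signature or an added timestamp also appends data, so this result is best read together with the other layers.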

Layer 3: Cross-Reference Table Analysis

PDFs use cross-reference tables to locate objects within the file. Analyzing these tables reveals structural modifications.

What we examine:

  • Cross-reference table integrity
  • Object reference consistency
  • Table structure anomalies
  • Incremental update markers
  • Object deletion patterns

What it reveals:

  • Structural modifications to PDF
  • Objects added or removed
  • Incremental update history
  • File structure integrity

Detection capabilities:

  • Structural tampering
  • Object manipulation
  • Incremental updates
  • Cross-reference corruption

Limitations:

  • Requires technical PDF knowledge
  • Some modifications preserve structure
  • Legitimate edits create similar patterns

As MailXaminer explains, cross-reference analysis provides deep insights into PDF modification history.
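
One basic integrity check at this layer is whether each startxref offset actually points at a cross-reference section. The following is a rough sketch with hypothetical names, assuming the whole file is in memory. It classifies each target as a classic table, a probable cross-reference stream (anything that begins with an object header), or invalid.

```python
import re

# "startxref", a byte offset, then the end-of-file marker.
_STARTXREF = re.compile(rb"startxref\s+(\d+)\s+%%EOF")
# An indirect object header such as "12 0 obj", which is how an xref stream begins.
_OBJ_HEADER = re.compile(rb"\d+\s+\d+\s+obj")


def check_startxref_offsets(pdf_bytes: bytes) -> list[dict]:
    """Classify what every startxref offset in the file points at."""
    findings = []
    for m in _STARTXREF.finditer(pdf_bytes):
        offset = int(m.group(1))
        target = pdf_bytes[offset:offset + 32]
        if target.startswith(b"xref"):
            kind = "table"      # classic cross-reference table
        elif _OBJ_HEADER.match(target):
            kind = "stream"     # probably a cross-reference stream object
        else:
            kind = "invalid"    # offset does not land on a cross-reference section
        findings.append({"offset": offset, "kind": kind})
    return findings
```

An "invalid" result needs context before it counts as evidence. Linearized files, for example, often carry a first startxref value of 0, which this simple check would report as invalid.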

Layer 4: Incremental Update Detection

PDFs can be modified using incremental updates, which add changes without rewriting the entire file.

What we examine:

  • Incremental update markers
  • Revision history layers
  • Update sequence numbers
  • Object version tracking
  • Update timestamps

What it reveals:

  • How many times document was modified
  • When incremental updates occurred
  • What was changed in each update
  • Whether updates are suspicious

Detection capabilities:

  • Multiple modification sessions
  • Incremental update patterns
  • Revision layer analysis
  • Update sequence anomalies

Limitations:

  • Legitimate edits create updates
  • Some editors rewrite files (no updates)
  • Requires parsing PDF structure

As HackTricks notes, incremental update analysis is essential for detecting sophisticated modifications.
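
Because each incremental update appends a new section ending in its own %%EOF marker, counting those markers gives a first estimate of how many times a file was saved. The sketch below is a heuristic with hypothetical function names, not a full parser: it slices the file at each marker so that earlier revisions can be examined separately.

```python
import re

# An end-of-file marker plus the line ending that normally follows it.
_EOF = re.compile(rb"%%EOF[ \t]*(?:\r\n|\r|\n)?")


def split_revisions(pdf_bytes: bytes) -> list[bytes]:
    """Return the file as it looked after each %%EOF marker, oldest first."""
    return [pdf_bytes[:m.end()] for m in _EOF.finditer(pdf_bytes)]


def count_incremental_updates(pdf_bytes: bytes) -> int:
    """Number of sections appended after the first %%EOF marker."""
    return max(len(split_revisions(pdf_bytes)) - 1, 0)
```

The count is an upper bound rather than an exact figure. Linearized files contain two %%EOF markers without any update, and the byte sequence can also occur inside stream data.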

Layer 5: Producer/Creator Fingerprinting

Different PDF creation and editing tools leave distinctive "fingerprints" in the files they produce.

What we examine:

  • Producer application signatures
  • Creator tool patterns
  • Tool-specific metadata
  • Application version information
  • Processing history

What it reveals:

  • Which tools created and modified the PDF
  • Whether tools match expected workflow
  • If editing tools were used unexpectedly
  • Processing chain history

Detection capabilities:

  • Unexpected editing tools
  • Tool mismatch patterns
  • Processing history anomalies
  • Application fingerprint identification

Limitations:

  • Tools can be spoofed
  • Some tools leave minimal traces
  • Legitimate workflows use multiple tools

Research from arXiv shows that producer fingerprinting can identify editing tools with high accuracy.
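
As a simplified illustration of tool fingerprinting, the sketch below pulls every /Producer and /Creator string out of the raw file and compares them with a watch list. Everything here is an assumption for the example: the watch list entries are made-up tool names, the flag names are hypothetical, and the regular expression only handles plain literal strings in uncompressed, unencrypted objects (no hex strings, UTF-16 values, or XMP metadata).

```python
import re

# Hypothetical watch list for this example only; these are not real product names
# and not the rules our algorithm uses.
EDITING_TOOL_HINTS = ("examplepdfeditor", "samplepagetweaker")

# A literal string value after /Producer or /Creator, allowing escaped characters.
_INFO_STRING = re.compile(rb"/(Producer|Creator)\s*\(((?:\\.|[^\\()])*)\)")


def extract_tools(pdf_bytes: bytes) -> dict[str, list[str]]:
    """Collect the distinct Producer and Creator values seen anywhere in the file."""
    tools: dict[str, list[str]] = {"Producer": [], "Creator": []}
    for m in _INFO_STRING.finditer(pdf_bytes):
        key = m.group(1).decode("ascii")
        value = m.group(2).decode("latin-1")
        if value not in tools[key]:
            tools[key].append(value)
    return tools


def fingerprint_flags(pdf_bytes: bytes) -> list[str]:
    """Flag watch-listed tools and files that name more than one tool per field."""
    flags = []
    for key, values in extract_tools(pdf_bytes).items():
        for value in values:
            if any(hint in value.lower() for hint in EDITING_TOOL_HINTS):
                flags.append(f"editing_tool:{value}")
        # Several values for one field suggest an earlier revision used another tool.
        if len(values) > 1:
            flags.append(f"multiple_{key.lower()}_values")
    return flags
```

Scanning the whole file rather than only the latest Info dictionary is deliberate: values left behind in older revisions are what expose a change of tool between saves.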

What Each Layer Reveals

Understanding what each layer detects helps you interpret verification results, especially when layers agree or conflict.

Combined Evidence

When multiple layers detect modifications:

  • Strong evidence: Multiple indicators agree
  • High confidence: Consistent signals across layers
  • Definitive finding: Cryptographic proof (signatures)

When layers conflict:

  • Investigation needed: Conflicting signals require review
  • Context matters: Consider document type and use case
  • False positive possible: Legitimate modifications may trigger indicators

Layer Interactions

Layers work together:

  • Metadata + Fingerprinting: Reveals editing tool usage
  • Signatures + Structure: Detects signature wrapping
  • Updates + Metadata: Shows modification timeline
  • All layers: Comprehensive modification picture

How We Calculate Confidence Scores

Our algorithm combines evidence from all layers to produce confidence scores:

Confidence Levels

100% Confidence (Definitive Finding):

  • Cryptographic proof of modification
  • Digital signature invalidated by changes
  • Cannot be a false positive
  • Definitive evidence of tampering

High Confidence:

  • Strong structural evidence
  • Multiple indicators agree
  • Consistent signals across layers
  • Very reliable but not cryptographic proof

Standard Detection:

  • Some indicators present
  • Requires context interpretation
  • May have legitimate explanations
  • Useful for investigation

Low Confidence:

  • Weak or conflicting signals
  • May be false positive
  • Requires manual review
  • Context-dependent

Scoring Algorithm

Our scoring considers:

  • Indicator strength: How strong is each signal?
  • Indicator agreement: Do signals agree or conflict?
  • Cryptographic proof: Is there cryptographic evidence?
  • Context factors: Document type, expected modifications
  • False positive risk: Likelihood of legitimate explanation

As tools from GitHub PDF analysis collections show, combining multiple indicators improves accuracy.
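
The factors above can be combined in many ways. The sketch below shows one possibility, a noisy-OR combination, purely to illustrate the idea; it is not our actual scoring formula. The Indicator fields, the cap at 99 for non-cryptographic evidence, and the 80 and 50 cutoffs are all assumptions, chosen so that figures like 85% and 60% fall into the "high" and "standard" bands.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Indicator:
    layer: str
    strength: float                  # 0.0 (weak) to 1.0 (strong)
    cryptographic: bool = False      # True for an invalidated digital signature
    legit_explanation: float = 0.0   # estimated chance of a benign cause, 0.0 to 1.0


def confidence(indicators: list[Indicator]) -> tuple[int, str]:
    """Combine indicators into a (percent, label) pair using a noisy-OR rule."""
    # Cryptographic evidence short-circuits everything else.
    if any(i.cryptographic for i in indicators):
        return 100, "definitive"
    # Probability that every indicator is a miss, treating them as independent.
    miss = 1.0
    for i in indicators:
        miss *= 1.0 - i.strength * (1.0 - i.legit_explanation)
    # Non-cryptographic evidence is capped below 100 (assumed cap).
    score = min(round((1.0 - miss) * 100), 99)
    if score >= 80:
        label = "high"
    elif score >= 50:
        label = "standard"
    else:
        label = "low"
    return score, label
```

With this rule, agreeing indicators reinforce each other, while a likely legitimate explanation discounts a signal before it is combined. Treating indicators as independent is a simplification, since signals from different layers often share a cause.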

Why 100% Confidence Is Different from High Confidence

Understanding confidence levels helps you interpret results correctly:

100% Confidence: Cryptographic Proof

What it means:

  • Cryptographic evidence of modification
  • Digital signature invalidated
  • Cannot be disputed
  • Definitive finding

When it occurs:

  • Digitally signed PDF modified after signing
  • Signature hash no longer matches content
  • Cryptographic proof of tampering

What to do:

  • Treat as definitive evidence
  • Investigate modification source
  • Do not proceed without explanation
  • Document finding for records

High Confidence: Strong Structural Evidence

What it means:

  • Strong evidence of modification
  • Multiple indicators agree
  • Very reliable but not cryptographic
  • May have rare false positives

When it occurs:

  • Multiple modification indicators
  • Structural anomalies detected
  • Metadata inconsistencies
  • Incremental update patterns

What to do:

  • Investigate further
  • Request explanation from sender
  • Verify through independent channels
  • Consider context and document type

Key Difference

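  • 100% confidence: Cryptographic proof that the signed content changed after signing; this is a mathematical check, not a probabilistic judgment
  • High confidence: Strong evidence from several agreeing indicators; very reliable, but rare false positives remain possible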
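
The reason a signature mismatch is definitive comes down to hashing: the digest is computed over the signed byte ranges, so changing any signed byte produces a different digest. Here is a minimal sketch of that hashing step alone, assuming SHA-256 and a hypothetical helper name. A real verifier uses the digest algorithm named in the signature and also validates the signature value and certificate chain.

```python
import hashlib


def signed_digest(pdf_bytes: bytes, byte_range: tuple[int, int, int, int]) -> str:
    """SHA-256 over the two signed segments described by a /ByteRange array."""
    start1, len1, start2, len2 = byte_range
    h = hashlib.sha256()
    # First segment: from the start of the file up to the signature value.
    h.update(pdf_bytes[start1:start1 + len1])
    # Second segment: from just after the signature value to the end of the signed data.
    h.update(pdf_bytes[start2:start2 + len2])
    return h.hexdigest()
```

Changing a byte inside either segment changes the result, while bytes in the gap between the segments (where the signature value is stored) do not affect it. Structural and metadata indicators have no equivalent mathematical guarantee, which is why they top out at high confidence.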

False Positives: When Legitimate PDFs Look Suspicious

Not all modifications indicate fraud. Some legitimate actions trigger modification indicators:

Legitimate Modifications

Linearized PDFs:

  • Optimized for web viewing
  • Creates structural changes
  • May trigger modification indicators
  • Normal optimization process

PDF/A Conversion:

  • Archival format conversion
  • Modifies file structure
  • Changes metadata
  • Legitimate archival process

Re-saving:

  • Re-saving without content changes
  • Updates modification date
  • May create incremental updates
  • Normal workflow action

Form Field Filling:

  • Completing PDF forms
  • Modifies document structure
  • Updates form field values
  • Intended document use

Document Merging:

  • Combining multiple PDFs
  • Creates new structure
  • Modifies metadata
  • Legitimate document creation

How We Handle False Positives

Our algorithm considers:

  • Document type: Different types have different expected modifications
  • Modification patterns: Legitimate edits have recognizable patterns
  • Context factors: Expected workflows vs suspicious activity
  • Confidence scoring: Lower confidence for ambiguous cases

Reducing False Positives

  • Context awareness: Understanding document purpose
  • Pattern recognition: Identifying legitimate modification patterns
  • Confidence scoring: Distinguishing strong vs weak signals
  • Manual review: Flagging ambiguous cases for review
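
Pattern recognition for legitimate modifications can start with simple markers. The sketch below is illustrative only, with hypothetical function and label names. It looks for three markers in the raw bytes: a linearization dictionary near the start of the file, a PDF/A identifier in the XMP metadata, and an interactive form dictionary. It then steps a non-definitive confidence label down one level when a benign explanation is present.

```python
def benign_explanations(pdf_bytes: bytes) -> list[str]:
    """List markers that offer a legitimate explanation for structural changes."""
    found = []
    # The linearization dictionary sits near the start of a linearized file.
    if b"/Linearized" in pdf_bytes[:1024]:
        found.append("linearized")
    # PDF/A files identify themselves in XMP metadata with a pdfaid:part entry.
    if b"pdfaid:part" in pdf_bytes:
        found.append("pdfa_conversion")
    # /AcroForm in the catalog means the document has interactive form fields.
    if b"/AcroForm" in pdf_bytes:
        found.append("form_fields")
    return found


def adjust_label(label: str, explanations: list[str]) -> str:
    """Step a non-definitive label down one level when a benign explanation exists."""
    order = ["low", "standard", "high"]
    # "definitive" (cryptographic proof) is never downgraded.
    if label not in order or not explanations:
        return label
    return order[max(order.index(label) - 1, 0)]
```

Byte-level matching misses markers stored in compressed object streams, so a production check would parse the document structure first. A marker also only suggests a benign cause: a form can be filled in legitimately and then altered, which is why ambiguous cases still go to manual review.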

Real Examples: What Our Algorithm Detects

Example 1: Invoice Bank Account Change

Scenario: Invoice PDF with modified bank account details

What algorithm detected:

  • Layer 1: Recent modification date (invoice should be original)
  • Layer 2: No signature (invoices often unsigned)
  • Layer 3: Structural changes in payment section
  • Layer 4: Incremental update after creation
  • Layer 5: Editing tool fingerprint detected

Confidence: High (85%)

Result: Flagged as modified (bank account change detected)

Example 2: Contract Modified After Signing

Scenario: Contract PDF modified after digital signature

What algorithm detected:

  • Layer 1: Modification date after signature timestamp
  • Layer 2: Invalid digital signature (cryptographic proof)
  • Layer 3: Structural changes detected
  • Layer 4: Incremental update after signing
  • Layer 5: Editing tool used after signing

Confidence: 100% (definitive)

Result: Cryptographic proof of post-signing modification

Example 3: Legitimate Form Completion

Scenario: PDF form filled out by user

What algorithm detected:

  • Layer 1: Modification date updated
  • Layer 2: No signature (forms often unsigned)
  • Layer 3: Form field modifications (expected)
  • Layer 4: Single update (form filling)
  • Layer 5: Form tool fingerprint (legitimate)

Confidence: Standard (60%)

Result: Legitimate modification (form completion pattern)

Why Automated Analysis Beats Manual Inspection

Automated analysis provides advantages manual inspection cannot match:

Speed

  • Automated: Seconds to analyze
  • Manual: Minutes to hours
  • Scalability: Process thousands of documents

Accuracy

  • Automated: Consistent analysis
  • Manual: Human error and bias
  • Reliability: Same input, same output

Comprehensiveness

  • Automated: Checks all layers simultaneously
  • Manual: Focuses on visible indicators
  • Depth: Analyzes technical details humans miss

Objectivity

  • Automated: No bias or interpretation
  • Manual: Subjective assessment
  • Consistency: Standardized evaluation

Accessibility

  • Automated: No technical knowledge required
  • Manual: Requires PDF expertise
  • Usability: Accessible to all users

Conclusion

HTPBE's algorithm uses five layers of analysis to detect PDF modifications:

  1. Metadata analysis: Reveals creation and modification history
  2. Digital signature verification: Provides cryptographic proof
  3. Cross-reference table analysis: Detects structural changes
  4. Incremental update detection: Identifies modification sessions
  5. Producer fingerprinting: Identifies editing tools

Together, these layers provide comprehensive modification detection with confidence scores that help you make informed decisions. Understanding how the algorithm works helps you interpret results correctly and trust the verification process.

See our algorithm in action — Analyze your PDF free at htpbe.tech
