The Anatomy of a Modified PDF: What Our Algorithm Detects
When you upload a PDF to HTPBE, our algorithm runs a forensic analysis to determine whether the document has been modified. Rather than relying on a single check, it examines dozens of technical indicators across several layers at once.
Transparency builds trust. This article explains exactly how our algorithm works, what it examines, and why this approach provides more reliable results than manual inspection or single-method verification.
Whether you are a security professional evaluating our technology, a potential API customer considering integration, or a curious user wanting to understand what happens behind the scenes, this technical deep-dive will show you how we detect PDF modifications.
Transparency Builds Trust
Many PDF verification tools operate as "black boxes" — you upload a file and get a result, but you do not know how the analysis works. We believe transparency is essential for building trust, especially when verification results affect important decisions.
This article explains:
- The five layers of analysis our algorithm performs
- What each layer reveals about PDF modifications
- How we calculate confidence scores
- Why 100% confidence differs from high confidence
- How we handle false positives
- Real examples of what our algorithm detects
As discussed in Forensic Focus forums, understanding forensic analysis methods helps users make informed decisions about verification results.
The 5 Layers of HTPBE Analysis
Our algorithm examines PDFs through five distinct but complementary layers. Each layer provides different types of evidence, and together they create a comprehensive picture of document integrity.
Layer 1: Metadata Analysis
Metadata is the "DNA" of a PDF — it reveals how the document was created and processed.
What we examine:
- Creation and modification dates
- Creator and producer applications
- PDF version information
- Document properties (title, author, subject)
- Technical metadata fields
What it reveals:
- When the document was created vs modified
- Which applications processed the document
- Whether metadata shows suspicious patterns
- If dates align with expected document history
Detection capabilities:
- Date mismatches (old document, recent modification)
- Unexpected creator/producer applications
- Metadata inconsistencies
- Suspicious application fingerprints
Limitations:
- Metadata can be manipulated
- Some tools set dates incorrectly
- Legitimate modifications change metadata
As Zeltser explains, metadata analysis provides valuable clues but requires careful interpretation.
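To make the date comparison concrete, here is a minimal Python sketch, not HTPBE's production code. It assumes the Info dictionary is stored unencrypted and uncompressed with simple literal strings, so a plain byte search can find /CreationDate and /ModDate. The function names (parse_pdf_date, find_info_dates, date_gap_days) are hypothetical, and treating a date with no timezone as UTC is an assumption of this sketch.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

# PDF date strings look like D:YYYYMMDDHHmmSS+HH'mm' (everything after YYYY is optional).
_PDF_DATE = re.compile(
    rb"D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    rb"(?:([Zz+-])(?:(\d{2})'?(\d{2})?'?)?)?"
)


def parse_pdf_date(raw: bytes) -> Optional[datetime]:
    """Parse a PDF date string into a timezone-aware datetime, or return None."""
    m = _PDF_DATE.search(raw)
    if m is None:
        return None
    defaults = (1, 1, 1, 0, 0, 0)  # missing month/day default to 1, time parts to 0
    y, mo, d, h, mi, s = (
        int(g) if g else dflt for g, dflt in zip(m.groups()[:6], defaults)
    )
    sign, tz_h, tz_m = m.group(7), m.group(8), m.group(9)
    # Assumption: a date without timezone information is treated as UTC.
    offset = timedelta(hours=int(tz_h or 0), minutes=int(tz_m or 0))
    if sign == b"-":
        offset = -offset
    return datetime(y, mo, d, h, mi, s, tzinfo=timezone(offset))


def find_info_dates(pdf_bytes: bytes) -> Tuple[Optional[datetime], Optional[datetime]]:
    """Return (creation, modification) using the last occurrence of each key."""
    def last(key: bytes) -> Optional[datetime]:
        hits = re.findall(rb"/" + key + rb"\s*\(([^)]*)\)", pdf_bytes)
        return parse_pdf_date(hits[-1]) if hits else None

    return last(b"CreationDate"), last(b"ModDate")


def date_gap_days(pdf_bytes: bytes) -> Optional[float]:
    """Days between creation and modification, or None if either date is missing."""
    created, modified = find_info_dates(pdf_bytes)
    if created is None or modified is None:
        return None
    return (modified - created).total_seconds() / 86400
```

A large gap is only a clue, in line with the limitations above: the dates can be edited, and metadata stored in XMP packets or compressed object streams is invisible to this byte-level search.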
Layer 2: Digital Signature Verification
Digital signatures provide cryptographic proof of document integrity.
What we examine:
- Signature presence and validity
- Certificate chain validation
- Signature scope (what content is signed)
- Timestamp validation
- Signature wrapping detection
What it reveals:
- Whether document was modified after signing
- If signature is cryptographically valid
- What portion of document is protected
- When document was signed
Detection capabilities:
- Post-signing modifications (invalidates signature)
- Signature wrapping attacks
- Certificate validity issues
- Timestamp anomalies
Limitations:
- Detects only modifications made after signing
- Changes made before signing are invisible to this layer
- Requires valid certificate chain
- Some attacks can bypass signature checks
As Security Online notes, signature verification is powerful but not infallible.
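The "signature scope" check can be illustrated with a short Python sketch. A PDF signature dictionary carries a /ByteRange array of (offset, length) pairs naming the bytes the signature covers. The code below only reports that scope; it does not validate the signature cryptographically or check certificates. The function name signature_coverage and the returned keys are hypothetical, and the sketch assumes the /ByteRange array appears as plain text in the file.

```python
import re
from typing import Dict, List

# Matches the common two-range form: /ByteRange [start1 len1 start2 len2]
_BYTE_RANGE = re.compile(
    rb"/ByteRange\s*\[\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*\]"
)


def signature_coverage(pdf_bytes: bytes) -> List[Dict[str, object]]:
    """Report, for each /ByteRange found, how much of the file it covers."""
    results = []
    for m in _BYTE_RANGE.finditer(pdf_bytes):
        start1, len1, start2, len2 = (int(g) for g in m.groups())
        signed_end = start2 + len2  # first byte after the signed region
        results.append({
            "byte_range": (start1, len1, start2, len2),
            "signed_end": signed_end,
            # Bytes after the signed region were appended after signing.
            "trailing_bytes": len(pdf_bytes) - signed_end,
            "covers_whole_file": start1 == 0 and signed_end == len(pdf_bytes),
        })
    return results
```

Trailing bytes are not proof of tampering on their own: a second signature or a document timestamp is also appended after the first signed range. They do tell a reviewer exactly which part of the file the signature does not protect.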
Layer 3: Cross-Reference Table Analysis
PDFs use cross-reference tables to locate objects within the file. Analyzing these tables reveals structural modifications.
What we examine:
- Cross-reference table integrity
- Object reference consistency
- Table structure anomalies
- Incremental update markers
- Object deletion patterns
What it reveals:
- Structural modifications to PDF
- Objects added or removed
- Incremental update history
- File structure integrity
Detection capabilities:
- Structural tampering
- Object manipulation
- Incremental updates
- Cross-reference corruption
Limitations:
- Requires technical PDF knowledge
- Some modifications preserve structure
- Legitimate edits create similar patterns
As MailXaminer explains, cross-reference analysis provides deep insights into PDF modification history.
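One basic integrity check from this layer can be sketched in Python. Each startxref keyword is followed by the byte offset of a cross-reference section; that offset should land either on the keyword xref (a classic table) or on an object header (a cross-reference stream). The function name check_startxref_offsets and the labels it returns are hypothetical, and this is an illustration of the idea rather than our actual parser.

```python
import re
from typing import List, Tuple

_STARTXREF = re.compile(rb"startxref\s+(\d+)\s+%%EOF")
_OBJ_HEADER = re.compile(rb"\d+\s+\d+\s+obj")


def check_startxref_offsets(pdf_bytes: bytes) -> List[Tuple[int, str]]:
    """For every startxref entry, classify what its offset actually points at."""
    findings = []
    for m in _STARTXREF.finditer(pdf_bytes):
        offset = int(m.group(1))
        target = pdf_bytes[offset:offset + 32]
        if target.startswith(b"xref"):
            kind = "table"      # classic cross-reference table
        elif _OBJ_HEADER.match(target):
            kind = "stream"     # object header, possibly a cross-reference stream
        else:
            kind = "invalid"    # offset does not point at a cross-reference section
        findings.append((offset, kind))
    return findings
```

An "invalid" entry means the offsets no longer match the file contents, which can happen after careless byte-level editing but also after file corruption or repair by a viewer, so it is a prompt for closer inspection rather than a verdict.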
Layer 4: Incremental Update Detection
PDFs can be modified using incremental updates, which add changes without rewriting the entire file.
What we examine:
- Incremental update markers
- Revision history layers
- Update sequence numbers
- Object version tracking
- Update timestamps
What it reveals:
- How many times document was modified
- When incremental updates occurred
- What was changed in each update
- Whether updates are suspicious
Detection capabilities:
- Multiple modification sessions
- Incremental update patterns
- Revision layer analysis
- Update sequence anomalies
Limitations:
- Legitimate edits create updates
- Some editors rewrite the whole file on save, leaving no incremental updates to find
- Requires parsing PDF structure
As HackTricks notes, incremental update analysis is essential for detecting sophisticated modifications.
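Because each incremental update appends a new section that ends with its own %%EOF marker, counting those markers gives a rough revision count. The Python sketch below shows the idea; the function names revision_ends and revision_summary are hypothetical, and the approach is deliberately naive.

```python
from typing import Dict, List


def revision_ends(pdf_bytes: bytes) -> List[int]:
    """Return the byte offset just past each %%EOF marker in the file."""
    marker = b"%%EOF"
    ends = []
    pos = pdf_bytes.find(marker)
    while pos != -1:
        ends.append(pos + len(marker))
        pos = pdf_bytes.find(marker, pos + 1)
    return ends


def revision_summary(pdf_bytes: bytes) -> Dict[str, object]:
    """Summarize how many revisions the file appears to contain and their sizes."""
    ends = revision_ends(pdf_bytes)
    starts = [0] + ends[:-1]
    sizes = [end - start for start, end in zip(starts, ends)]
    return {
        "revisions": len(ends),
        "incremental_updates": max(len(ends) - 1, 0),
        "revision_sizes": sizes,  # bytes added by each revision
    }
```

Two caveats apply. A linearized PDF normally contains two %%EOF markers without ever having been updated, and the marker can also occur by chance inside binary stream data, so a real parser has to confirm each candidate against the cross-reference structure.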
Layer 5: Producer/Creator Fingerprinting
Different PDF creation and editing tools leave distinctive "fingerprints" in the files they produce.
What we examine:
- Producer application signatures
- Creator tool patterns
- Tool-specific metadata
- Application version information
- Processing history
What it reveals:
- Which tools created and modified the PDF
- Whether tools match expected workflow
- If editing tools were used unexpectedly
- Processing chain history
Detection capabilities:
- Unexpected editing tools
- Tool mismatch patterns
- Processing history anomalies
- Application fingerprint identification
Limitations:
- Tools can be spoofed
- Some tools leave minimal traces
- Legitimate workflows use multiple tools
Research published on arXiv shows that producer fingerprinting can identify editing tools with high accuracy.
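A simplified version of the fingerprinting idea can be sketched in Python by collecting every /Producer and /Creator value in the file and comparing them. Everything here is illustrative: the function names are hypothetical, the EDITOR_HINTS watchlist is a made-up example rather than the list we use, and the regular expression only handles simple literal strings without nested or escaped parentheses.

```python
import re
from typing import Dict, List, Sequence

# Hypothetical watchlist for illustration only. Which tools count as
# "unexpected" depends entirely on the document's normal workflow.
EDITOR_HINTS = ("acrobat", "foxit", "pdf-xchange", "ilovepdf", "sejda")


def extract_field(pdf_bytes: bytes, key: str) -> List[str]:
    """Return every literal-string value stored under /<key>, in file order."""
    hits = re.findall(rb"/" + key.encode() + rb"\s*\(([^)]*)\)", pdf_bytes)
    return [h.decode("latin-1") for h in hits]


def fingerprint(pdf_bytes: bytes, watchlist: Sequence[str] = EDITOR_HINTS) -> Dict[str, object]:
    """Summarize which tools claim to have created or processed the file."""
    producers = extract_field(pdf_bytes, "Producer")
    creators = extract_field(pdf_bytes, "Creator")
    seen = producers + creators
    hits = sorted(
        {hint for value in seen for hint in watchlist if hint in value.lower()}
    )
    return {
        "producers": producers,
        "creators": creators,
        # More than one distinct producer suggests a later tool re-saved the file.
        "producer_changed": len(set(producers)) > 1,
        "watchlist_hits": hits,
    }
```

A producer change across revisions shows that a second tool touched the file, which matches the "tool mismatch" indicator above. As the limitations note, these strings are trivial to spoof, so the result is supporting evidence only.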
What Each Layer Reveals
Understanding what each layer detects helps interpret verification results:
Combined Evidence
When multiple layers detect modifications:
- Strong evidence: Multiple indicators agree
- High confidence: Consistent signals across layers
- Definitive finding: Cryptographic proof (signatures)
When layers conflict:
- Investigation needed: Conflicting signals require review
- Context matters: Consider document type and use case
- False positive possible: Legitimate modifications may trigger indicators
Layer Interactions
Layers work together:
- Metadata + Fingerprinting: Reveals editing tool usage
- Signatures + Structure: Detects signature wrapping
- Updates + Metadata: Shows modification timeline
- All layers: Comprehensive modification picture
How We Calculate Confidence Scores
Our algorithm combines evidence from all layers to produce confidence scores:
Confidence Levels
100% Confidence (Definitive Finding):
- Cryptographic proof of modification
- Digital signature invalidated by changes
- Cannot be a false positive
- Definitive evidence of tampering
High Confidence:
- Strong structural evidence
- Multiple indicators agree
- Consistent signals across layers
- Very reliable but not cryptographic proof
Standard Detection:
- Some indicators present
- Requires context interpretation
- May have legitimate explanations
- Useful for investigation
Low Confidence:
- Weak or conflicting signals
- May be false positive
- Requires manual review
- Context-dependent
Scoring Algorithm
Our scoring considers:
- Indicator strength: How strong is each signal?
- Indicator agreement: Do signals agree or conflict?
- Cryptographic proof: Is there cryptographic evidence?
- Context factors: Document type, expected modifications
- False positive risk: Likelihood of legitimate explanation
As the tools gathered in GitHub PDF analysis collections show, combining multiple indicators improves accuracy.
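The way these factors could combine is easiest to see in code. The Python sketch below is an illustration under stated assumptions, not HTPBE's actual formula: the Indicator class, the noisy-OR combination, the legit_discount parameter, and the 80/50 thresholds are all hypothetical. The one rule taken directly from the levels above is that 100% is reserved for cryptographic proof.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Indicator:
    layer: str                   # which analysis layer raised the signal
    strength: float              # 0.0 (no evidence) to 1.0 (very strong)
    cryptographic: bool = False  # True only for an invalidated signature


def score(indicators: Sequence[Indicator], legit_discount: float = 0.0) -> Tuple[int, str]:
    """Combine indicators into (percent, level). Thresholds are illustrative."""
    # Cryptographic proof short-circuits everything else.
    if any(i.cryptographic for i in indicators):
        return 100, "definitive"

    # Noisy-OR: agreeing signals reinforce each other, none dominates alone.
    remaining = 1.0
    for i in indicators:
        remaining *= 1.0 - i.strength
    combined = 1.0 - remaining

    # Reduce the score when the pattern matches a known legitimate workflow.
    combined *= 1.0 - legit_discount

    # Structural evidence is capped below 100, which is kept for cryptographic proof.
    percent = min(round(combined * 100), 99)
    if percent >= 80:
        level = "high"
    elif percent >= 50:
        level = "standard"
    else:
        level = "low"
    return percent, level
```

For example, three agreeing signals with strengths 0.5, 0.5, and 0.4 combine to 85 ("high"), while a single 0.6 signal discounted by half for a recognized legitimate pattern drops to 30 ("low").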
Why 100% Confidence Is Different from High Confidence
Understanding confidence levels helps you interpret results correctly:
100% Confidence: Cryptographic Proof
What it means:
- Cryptographic evidence of modification
- Digital signature invalidated
- Cannot be disputed
- Definitive finding
When it occurs:
- Digitally signed PDF modified after signing
- Signature hash no longer matches content
- Cryptographic proof of tampering
What to do:
- Treat as definitive evidence
- Investigate modification source
- Do not proceed without explanation
- Document finding for records
High Confidence: Strong Structural Evidence
What it means:
- Strong evidence of modification
- Multiple indicators agree
- Very reliable but not cryptographic
- May have rare false positives
When it occurs:
- Multiple modification indicators
- Structural anomalies detected
- Metadata inconsistencies
- Incremental update patterns
What to do:
- Investigate further
- Request explanation from sender
- Verify through independent channels
- Consider context and document type
Key Difference
- 100% confidence: Cryptographic proof that the signed content changed after signing
- High confidence: Strong structural evidence that is very reliable but can, in rare cases, have a legitimate explanation
False Positives: When Legitimate PDFs Look Suspicious
Not all modifications indicate fraud. Some legitimate actions trigger modification indicators:
Legitimate Modifications
Linearized PDFs:
- Optimized for web viewing
- Creates structural changes
- May trigger modification indicators
- Normal optimization process
PDF/A Conversion:
- Archival format conversion
- Modifies file structure
- Changes metadata
- Legitimate archival process
Re-saving:
- Re-saving without content changes
- Updates modification date
- May create incremental updates
- Normal workflow action
Form Field Filling:
- Completing PDF forms
- Modifies document structure
- Updates form field values
- Intended document use
Document Merging:
- Combining multiple PDFs
- Creates new structure
- Modifies metadata
- Legitimate document creation
How We Handle False Positives
Our algorithm considers:
- Document type: Different types have different expected modifications
- Modification patterns: Legitimate edits have recognizable patterns
- Context factors: Expected workflows vs suspicious activity
- Confidence scoring: Lower confidence for ambiguous cases
Reducing False Positives
- Context awareness: Understanding document purpose
- Pattern recognition: Identifying legitimate modification patterns
- Confidence scoring: Distinguishing strong vs weak signals
- Manual review: Flagging ambiguous cases for review
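As one example of recognizing a legitimate pattern, linearization can be detected directly. A linearized PDF starts with a linearization parameter dictionary containing a /Linearized key, and the PDF specification requires that dictionary to sit within the first 1024 bytes of the file. The Python sketch below checks for it; the function name is_linearized is hypothetical, and how the result should lower a score is left open.

```python
import re

# An object header followed by a dictionary that contains /Linearized <version>.
_LINEARIZED = re.compile(
    rb"\d+\s+\d+\s+obj\s*<<[^>]*?/Linearized\s+[\d.]+", re.DOTALL
)


def is_linearized(pdf_bytes: bytes, window: int = 1024) -> bool:
    """Return True if a linearization dictionary appears in the first `window` bytes."""
    return _LINEARIZED.search(pdf_bytes[:window]) is not None
```

Knowing that a file is linearized explains structural features, such as a second end-of-file marker, that would otherwise look like a modification. If a linearized file is later changed by an incremental update, the dictionary is still present but no longer describes the file accurately, so this check should be combined with the update analysis rather than used alone.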
Real Examples: What Our Algorithm Detects
Example 1: Invoice Bank Account Change
Scenario: Invoice PDF with modified bank account details
What algorithm detected:
- Layer 1: Recent modification date (invoice should be original)
- Layer 2: No signature (invoices often unsigned)
- Layer 3: Structural changes in payment section
- Layer 4: Incremental update after creation
- Layer 5: Editing tool fingerprint detected
Confidence: High (85%)
Result: Flagged as modified; bank account change detected
Example 2: Contract Modified After Signing
Scenario: Contract PDF modified after digital signature
What algorithm detected:
- Layer 1: Modification date after signature timestamp
- Layer 2: Invalid digital signature (cryptographic proof)
- Layer 3: Structural changes detected
- Layer 4: Incremental update after signing
- Layer 5: Editing tool used after signing
Confidence: 100% (definitive)
Result: Cryptographic proof of post-signing modification
Example 3: Legitimate Form Completion
Scenario: PDF form filled out by user
What algorithm detected:
- Layer 1: Modification date updated
- Layer 2: No signature (forms often unsigned)
- Layer 3: Form field modifications (expected)
- Layer 4: Single update (form filling)
- Layer 5: Form tool fingerprint (legitimate)
Confidence: Standard (60%)
Result: Legitimate modification; form completion pattern
Why Automated Analysis Beats Manual Inspection
Automated analysis provides advantages manual inspection cannot match:
Speed
- Automated: Seconds to analyze
- Manual: Minutes to hours
- Scalability: Process thousands of documents
Accuracy
- Automated: Consistent analysis
- Manual: Human error and bias
- Reliability: Same input, same output
Comprehensiveness
- Automated: Checks all layers simultaneously
- Manual: Focuses on visible indicators
- Depth: Analyzes technical details humans miss
Objectivity
- Automated: No bias or interpretation
- Manual: Subjective assessment
- Consistency: Standardized evaluation
Accessibility
- Automated: No technical knowledge required
- Manual: Requires PDF expertise
- Usability: Accessible to all users
Conclusion
HTPBE's algorithm uses five layers of analysis to detect PDF modifications:
- Metadata analysis: Reveals creation and modification history
- Digital signature verification: Provides cryptographic proof
- Cross-reference table analysis: Detects structural changes
- Incremental update detection: Identifies modification sessions
- Producer fingerprinting: Identifies editing tools
Together, these layers provide comprehensive modification detection with confidence scores that help you make informed decisions. Understanding how the algorithm works helps you interpret results correctly and trust the verification process.
See our algorithm in action — Analyze your PDF free at htpbe.tech