# PDF Classification & Unsupported Document Detection
## Problem
The pipeline currently processes any PDF without checking whether it's a document type we can handle well. Forms, scanned-only documents, encrypted PDFs, and other edge cases either produce poor output silently or fail deep in the pipeline after expensive LLM processing. We need an early detection mechanism that classifies the PDF and either rejects unsupported types with a clear error or attaches warnings for limited-support types.
## Design Decision: Where to Introduce Classification
### Option A: Before Docling (pypdfium2 pre-flight) - **Recommended for hard blockers**
Run a lightweight `pypdfium2` check on raw PDF bytes *before* Docling extraction. This is the right place for cheap, structural metadata that doesn't require any content analysis:
- **Form detection**: `pypdfium2.PdfDocument.get_formtype()` returns 0 (none), 1 (AcroForm), 2 (full XFA), or 3 (XFA foreground, i.e. hybrid AcroForm+XFA)
- **Encryption**: pypdfium2 will fail to open password-protected PDFs
- **PDF version**: `get_version()` for compatibility checks
- **Producer/Creator metadata**: `get_metadata_dict()` reveals the tool that generated the PDF (scanner software, Word, LaTeX, etc.)
- **Tagged PDF**: `is_tagged()` tells us whether accessibility tagging already exists
- **Page count**: sanity check before committing to extraction
- **Page dimensions**: detect unusual sizes (labels, envelopes, oversized posters)
**Why here:** These checks cost ~5ms on any PDF size. No reason to run a full Docling extraction just to reject a form PDF.
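As a rough illustration of the checks listed above, the sketch below gathers the structural facts in one pass. `PreflightFacts`, `format_pdf_version`, and `collect_preflight_facts` are hypothetical names, not part of the existing codebase, and treating every open failure as "encrypted" is a simplifying assumption (a corrupt file would be reported the same way).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PreflightFacts:
    """Hypothetical container for facts gathered before Docling runs."""
    page_count: int = 0
    pdf_version: Optional[str] = None
    producer: Optional[str] = None
    creator: Optional[str] = None
    is_tagged: bool = False
    form_type: int = 0
    page_sizes: list = field(default_factory=list)  # (width, height) in points
    is_encrypted: bool = False


def format_pdf_version(raw: Optional[int]) -> Optional[str]:
    """Turn pypdfium2's integer version (e.g. 17) into '1.7'."""
    if raw is None:
        return None
    return f"{raw // 10}.{raw % 10}"


def collect_preflight_facts(pdf_bytes: bytes) -> PreflightFacts:
    # Imported lazily so this module still loads where pypdfium2 is missing.
    import pypdfium2 as pdfium

    try:
        pdf = pdfium.PdfDocument(pdf_bytes)
    except pdfium.PdfiumError:
        # Assumption: every open failure is reported as "encrypted".
        return PreflightFacts(is_encrypted=True)

    try:
        meta = pdf.get_metadata_dict()
        return PreflightFacts(
            page_count=len(pdf),
            pdf_version=format_pdf_version(pdf.get_version()),
            producer=meta.get("Producer") or None,
            creator=meta.get("Creator") or None,
            is_tagged=pdf.is_tagged(),
            form_type=pdf.get_formtype(),
            page_sizes=[pdf.get_page_size(i) for i in range(len(pdf))],
        )
    finally:
        pdf.close()
```

Keeping the gathering step separate from the decision logic means the blocker and warning rules can be unit-tested on plain values without fixture PDFs.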
### Option B: After Docling extraction - **Recommended for content-based signals**
Some classifications can only be determined from Docling's output:
- **Scanned/image-only**: `chars_per_page < 50` (already computed, could be refined)
- **Text density anomalies**: near-empty pages, extremely sparse content
- **Figure-heavy documents**: `figure_count / total_pages` ratio
- **Layout type distribution**: already computed via `_detect_page_columns()`
**Why here:** These need the actual extraction to have run. But they execute before any LLM calls, so they're still cheap.
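The content-based signals above reduce to a few averages over per-page counts. The following is a minimal sketch, assuming the extraction output can be summarized as a character count and a figure count per page; `PageStats` and `content_signals` are hypothetical names, and the default threshold of 50 mirrors the existing `chars_per_page < 50` check.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page summary derived from the extraction output."""
    char_count: int
    figure_count: int


def content_signals(pages: list, scanned_threshold: int = 50) -> dict:
    """Compute text density and figure ratio across all pages."""
    total = len(pages)
    if total == 0:
        # An empty document has no meaningful averages.
        return {
            "chars_per_page": 0.0,
            "figures_per_page": 0.0,
            "likely_scanned": False,
            "sparse_pages": [],
        }
    chars_per_page = sum(p.char_count for p in pages) / total
    figures_per_page = sum(p.figure_count for p in pages) / total
    # Zero-based indexes of pages that fall under the threshold on their own.
    sparse_pages = [i for i, p in enumerate(pages) if p.char_count < scanned_threshold]
    return {
        "chars_per_page": chars_per_page,
        "figures_per_page": figures_per_page,
        "likely_scanned": chars_per_page < scanned_threshold,
        "sparse_pages": sparse_pages,
    }
```

Reporting sparse pages individually, in addition to the document-wide average, helps with mixed documents where only a few pages are scans.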
### Option C: During Structure Analysis (Phase 1) - **Not recommended as the primary check**
Phase 1 already detects `is_scanned`, `has_tables`, `has_equations`, etc. per-page via Claude vision. But:
- This costs LLM tokens (~$0.01-0.05 per page)
- If we're going to reject the document, we've already wasted time and money
- Phase 1 is the right place for *refinements* (e.g., "this page is a scan even though text was extractable") but not the right place for *gating*
### Recommendation: Two-layer approach
```
PDF bytes
  │
  ├─ Layer 1: Pre-flight (pypdfium2) ───> hard blockers -> reject/error
  │
  ├─ Docling extraction (v0)
  │
  ├─ Layer 2: Post-extraction analysis ──> soft warnings -> attach to result
  │
  └─ Phase 1+ continues...
```
## PDF Types to Detect
### Hard Blockers (return error, don't process)
| Type | Detection Method | Layer | Rationale |
|------|-----------------|-------|-----------|
| **AcroForm PDF** | `form_type == 1` AND widget annotation count > 0 | Pre-flight | Forms have interactive fields, checkboxes, and dropdowns; Docling extracts none of this, so the output is meaningless. Note: many PDFs declare AcroForm without having real fields, so we count actual widget annotations (subtype 20) to avoid false positives. |
| **XFA Form PDF** | `form_type in (2, 3)` AND widget annotation count > 0 | Pre-flight | XML-based dynamic forms (full XFA or hybrid AcroForm+XFA), completely opaque to layout extraction |
| **Password-protected** | pypdfium2 fails to open the file (exception caught) | Pre-flight | Can't extract anything |
| **Empty PDF** | 0 pages after extraction | Post-extraction | Nothing to process |
| **Oversized PDF** | Page count > configurable limit (e.g., 200) | Pre-flight | Resource protection; course materials shouldn't be 200+ pages |
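The table above can be read as a small rule set over already-extracted values. The sketch below is one possible encoding, treating form type 1 as AcroForm and 2 or 3 as XFA; the `Finding` class, the `hard_blockers` function, and the finding codes are hypothetical. The empty-document rule is included for completeness, although the table assigns it to the post-extraction layer.

```python
from dataclasses import dataclass

# PDFium form type values
FORMTYPE_ACRO_FORM = 1
FORMTYPE_XFA_FULL = 2
FORMTYPE_XFA_FOREGROUND = 3


@dataclass
class Finding:
    """Hypothetical minimal finding: a machine code plus a message."""
    code: str
    message: str


def hard_blockers(form_type: int, widget_count: int, page_count: int,
                  is_encrypted: bool, max_pages: int = 200) -> list:
    """Return every hard-blocker finding that applies to the document."""
    if is_encrypted:
        # Nothing else can be inspected if the file could not be opened.
        return [Finding("encrypted", "PDF is password-protected.")]

    findings = []
    # A declared form type only counts when real widget annotations exist.
    if widget_count > 0:
        if form_type == FORMTYPE_ACRO_FORM:
            findings.append(Finding("acroform_detected", "PDF is an AcroForm."))
        elif form_type in (FORMTYPE_XFA_FULL, FORMTYPE_XFA_FOREGROUND):
            findings.append(Finding("xfa_detected", "PDF is an XFA form."))

    if page_count == 0:
        findings.append(Finding("empty_document", "PDF has no pages."))
    elif page_count > max_pages:
        findings.append(Finding(
            "page_limit_exceeded",
            f"PDF has {page_count} pages; the limit is {max_pages}.",
        ))
    return findings
```

Returning all applicable findings, not just the first, lets the error message tell the user everything that is wrong in one round trip.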
### Soft Warnings (process with warning attached to result)
| Type | Detection Method | Layer | Warning Message |
|------|-----------------|-------|-----------------|
| **Scanned/image-only** | `chars_per_page < 50` (existing) + producer metadata contains scan-related keywords | Both | "Document appears to be a scan. OCR is not enabled; text extraction may be incomplete." |
| **Presentation slides** | Producer = PowerPoint/Keynote/Impress + low text density + high figure ratio | Post-extraction | "Document appears to be a slide deck. Layout-heavy content may not convert cleanly to linear markdown." |
| **Spreadsheet-as-PDF** | Producer = Excel/Calc + mostly tables + landscape orientation | Both | "Document appears to be an exported spreadsheet. Complex table structures may lose formatting." |
| **Very large pages** | Page dimensions significantly exceed standard sizes (A4/Letter) | Pre-flight | "Document contains oversized pages (e.g., posters). Content may be truncated or poorly laid out." |
| **Mixed orientation** | Some pages portrait, some landscape | Post-extraction | "Document contains mixed page orientations. Landscape pages may have layout issues." |
| **Nearly empty** | `chars_per_page < 200` but not zero | Post-extraction | "Document has very little extractable text per page." |
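Two of the pre-flight heuristics in this table can be sketched with plain string and size comparisons. The keyword list and the 1.5x size factor below are assumptions chosen for illustration (the table does not define "scan-related keywords" or "significantly exceed"), and both function names are hypothetical.

```python
from typing import Optional

# Assumption: illustrative keywords only; a real list needs tuning on sample PDFs.
SCAN_KEYWORDS = ("scan", "twain", "copier")

# Largest standard sides in PDF points: Letter short side, A4 long side.
STANDARD_SHORT_SIDE = 612.0
STANDARD_LONG_SIDE = 842.0


def producer_suggests_scan(producer: Optional[str]) -> bool:
    """True if the producer metadata contains a scan-related keyword."""
    if not producer:
        return False
    lowered = producer.lower()
    return any(keyword in lowered for keyword in SCAN_KEYWORDS)


def oversized_pages(page_sizes: list, factor: float = 1.5) -> list:
    """Return zero-based indexes of pages well beyond A4/Letter size."""
    result = []
    for index, (width, height) in enumerate(page_sizes):
        # Compare short and long sides so landscape pages are not penalized.
        short_side, long_side = sorted((width, height))
        if (short_side > STANDARD_SHORT_SIDE * factor
                or long_side > STANDARD_LONG_SIDE * factor):
            result.append(index)
    return result
```

Producer metadata is easy to strip or spoof, so a keyword match is best treated as supporting evidence next to the text-density check, not as a verdict on its own.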
## Data Model
### New model: `PdfClassification`
```python
# src/services/pdf_classification.py
from enum import Enum

from pydantic import BaseModel, Field


class PdfDocumentType(str, Enum):
    """High-level document type classification."""
    STANDARD = "standard"          # Normal text document
    FORM = "form"                  # AcroForm or XFA
    SCANNED = "scanned"            # Image-only, needs OCR
    PRESENTATION = "presentation"  # Slides
    SPREADSHEET = "spreadsheet"    # Table-heavy export
    UNKNOWN = "unknown"


class ClassificationSeverity(str, Enum):
    """Severity of a classification finding."""
    ERROR = "error"      # Hard block: don't process
    WARNING = "warning"  # Process with degraded confidence


class ClassificationFinding(BaseModel):
    """A single finding from PDF classification."""
    code: str  # Machine-readable: "form_detected", "scanned_document", etc.
    severity: ClassificationSeverity
    message: str  # Human-readable explanation
    details: dict = Field(default_factory=dict)  # e.g., form_type=2, producer="Adobe Scan"


# Defined before PdfClassification, which references it in a field annotation.
class PdfMetadata(BaseModel):
    """Raw PDF metadata extracted during classification."""
    page_count: int
    pdf_version: str | None = None
    producer: str | None = None
    creator: str | None = None
    title: str | None = None
    is_tagged: bool = False
    form_type: int = 0  # 0=none, 1=AcroForm, 2=full XFA, 3=XFA foreground (hybrid)
    page_dimensions: list[tuple[float, float]] = Field(default_factory=list)  # (width, height) per page
    is_encrypted: bool = False


class PdfClassification(BaseModel):
    """Complete classification result for a PDF."""
    document_type: PdfDocumentType
    findings: list[ClassificationFinding] = Field(default_factory=list)
    metadata: PdfMetadata

    @property
    def has_errors(self) -> bool:
        return any(f.severity == ClassificationSeverity.ERROR for f in self.findings)

    @property
    def has_warnings(self) -> bool:
        return any(f.severity == ClassificationSeverity.WARNING for f in self.findings)
```
### Surface warnings in `PipelineViewerResult`
```python
# Add to PipelineViewerResult
class PipelineViewerResult(BaseModel):
    ...
    classification: PdfClassification | None = None    # NEW
    warnings: list[str] = Field(default_factory=list)  # NEW: human-readable
```
### Surface in API response
```python
# Add to job state / API response schemas
class ProcessingResponse(JobStatusBase):
    ...
    warnings: list[str] = []  # NEW: propagated from classification
```
## Implementation Plan
### Phase 1: Pre-flight Classification Service
**New file:** `src/services/pdf_classifier.py`
```
classify_pdf(file_content: bytes) -> PdfClassification
```
1. Open PDF with pypdfium2 (on failure, record an encrypted finding)
2. Extract metadata: version, producer, creator, tagged, form_type
3. Get page count and dimensions
4. Run hard-blocker checks (forms, encrypted, oversized)
5. Run soft-warning heuristics (producer-based scan detection, unusual dimensions)
6. Return `PdfClassification`
**Estimated cost:** <10ms per PDF, zero external dependencies beyond pypdfium2 (already installed via Docling).
### Phase 2: Post-Extraction Enrichment
After Docling runs in `_step_docling()`, enrich the classification with content-based signals:
1. Text density analysis (refine `is_likely_scanned` into the classification)
2. Figure-to-page ratio
3. Layout distribution (all presentation? all single-column?)
4. Mixed orientation detection from page images
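Two of these enrichment signals can be sketched without any Docling-specific types. The helper names below are hypothetical, the layout labels are placeholders (the real values come from `_detect_page_columns()`), and the sketch assumes each page is available as a `(width, height)` pair, whether in points or in image pixels.

```python
from collections import Counter


def page_orientation(width: float, height: float) -> str:
    """Classify one page; square pages are counted as portrait."""
    return "landscape" if width > height else "portrait"


def has_mixed_orientation(page_sizes: list) -> bool:
    """True if the document contains both portrait and landscape pages."""
    orientations = {page_orientation(w, h) for w, h in page_sizes}
    return len(orientations) > 1


def dominant_layout(labels: list) -> tuple:
    """Return the most common layout label and its share of all pages."""
    if not labels:
        return (None, 0.0)
    label, count = Counter(labels).most_common(1)[0]
    return (label, count / len(labels))
```

A share close to 1.0 for a single layout label answers the "all presentation? all single-column?" question directly, while a lower share points to a mixed document.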
### Phase 3: Integration Points
#### A. In `PipelineViewerService.process()`
```python
# Requires `import time` at module level.
async def process(self, file_content, filename, ...) -> PipelineViewerResult:
    result = PipelineViewerResult(filename=filename, total_pages=0)

    # NEW: Pre-flight classification, timed so the step can report elapsed_ms
    started = time.perf_counter()
    classification = classify_pdf(file_content)
    classification_elapsed = (time.perf_counter() - started) * 1000
    result.classification = classification

    if classification.has_errors:
        # Attach error step and return early
        result.steps.append(StepResult(
            name="classification",
            display_name="PDF Classification",
            version_after="v0",
            elapsed_ms=classification_elapsed,
            error="; ".join(
                f.message for f in classification.findings
                if f.severity == ClassificationSeverity.ERROR
            ),
        ))
        return result

    result.warnings = [
        f.message for f in classification.findings
        if f.severity == ClassificationSeverity.WARNING
    ]

    # Continue with Docling extraction...
    await self._step_docling(result, file_content, ...)

    # NEW: Post-extraction enrichment (should also add any new warning messages to result.warnings)
    enrich_classification(result.classification, result)

    # ... rest of pipeline
```
#### B. In `DocumentProcessingService.process_document()`
```python
# After pipeline returns, check for hard errors
if result.classification and result.classification.has_errors:
    error_msg = "; ".join(
        f.message for f in result.classification.findings
        if f.severity == ClassificationSeverity.ERROR
    )
    await self._update_job_state(job_id, status="failed", error=error_msg)
    return result

# Store warnings in job state for API consumers
if result.warnings:
    await self._update_job_state(job_id, warnings=json.dumps(result.warnings))
```
#### C. In API response
Warnings are propagated from job state to the API response so the frontend can display them.
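Because the warnings are written to job state as a JSON string, the API layer has to decode them before returning a list. A minimal sketch of that round trip, assuming job state behaves like a flat dictionary of strings; `store_warnings` and `load_warnings` are hypothetical helpers:

```python
import json


def store_warnings(job_state: dict, warnings: list) -> None:
    """Serialize warnings into job state; skip the write when there are none."""
    if warnings:
        job_state["warnings"] = json.dumps(warnings)


def load_warnings(job_state: dict) -> list:
    """Decode warnings for the API response, tolerating missing or bad data."""
    raw = job_state.get("warnings")
    if not raw:
        return []
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a ValueError subclass.
        return []
    if not isinstance(parsed, list):
        return []
    return [str(item) for item in parsed]
```

Falling back to an empty list keeps older jobs, which have no `warnings` key, compatible with the new response field.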
### Phase 4: Configuration
Add configurable thresholds to `src/config.py`:
```python
# PDF Classification
PDF_MAX_PAGES: int = 200
PDF_MIN_CHARS_PER_PAGE_SCANNED: int = 50
PDF_BLOCK_FORMS: bool = True
PDF_BLOCK_ENCRYPTED: bool = True
```
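If these thresholds need to be overridable per deployment, one option is to read them from environment variables with the values above as defaults. This is only a sketch: how `src/config.py` actually loads settings is not covered here, and `ClassificationSettings`, `load_settings`, and the two parsing helpers are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _env_int(env: Mapping, name: str, default: int) -> int:
    """Parse an integer setting, falling back to the default on bad input."""
    raw = env.get(name)
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        return default


def _env_bool(env: Mapping, name: str, default: bool) -> bool:
    """Parse a boolean setting from common truthy spellings."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")


@dataclass(frozen=True)
class ClassificationSettings:
    max_pages: int = 200
    min_chars_per_page_scanned: int = 50
    block_forms: bool = True
    block_encrypted: bool = True


def load_settings(env: Optional[Mapping] = None) -> ClassificationSettings:
    """Build settings from a mapping; defaults to the process environment."""
    source = os.environ if env is None else env
    return ClassificationSettings(
        max_pages=_env_int(source, "PDF_MAX_PAGES", 200),
        min_chars_per_page_scanned=_env_int(source, "PDF_MIN_CHARS_PER_PAGE_SCANNED", 50),
        block_forms=_env_bool(source, "PDF_BLOCK_FORMS", True),
        block_encrypted=_env_bool(source, "PDF_BLOCK_ENCRYPTED", True),
    )
```

Passing the mapping explicitly keeps the loader testable without mutating the real process environment.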
### Phase 5: Tests
- **Unit tests** for `pdf_classifier.py`: test each detection path with crafted/fixture PDFs
- **Unit tests** for enrichment logic: mock Docling output, verify classification is updated
- **Integration test**: submit a form PDF via API, verify error response
- **Integration test**: submit a scanned PDF, verify warning in response
## File Changes Summary
| File | Change |
|------|--------|
| `src/services/pdf_classifier.py` | **NEW**: classification service |
| `src/services/pipeline_viewer_models.py` | Add `PdfClassification`, `PdfMetadata`, `ClassificationFinding` models |
| `src/services/pipeline_viewer.py` | Call classifier pre-flight + post-extraction enrichment |
| `src/services/document_processing_service.py` | Handle classification errors, propagate warnings |
| `src/config.py` | Add classification thresholds |
| `src/shared/constants/` | Classification codes as constants |
| `src/api/schemas.py` | Add `warnings` to response schemas |
| `tests/unit/services/test_pdf_classifier.py` | **NEW**: unit tests |
| `tests/integration/test_classification.py` | **NEW**: integration tests |
## Open Questions
1. **Should classification block at the PII stage too?** We could run pre-flight classification during PII scanning (which already loads the PDF) to fail even faster. This would avoid the S3 round-trip for unsupported documents.
2. **Should warnings affect confidence scores?** A "presentation" classification could automatically lower the final confidence score, signaling that the output needs more review.
3. **Should we add a new job status?** e.g., `"unsupported"` as a terminal status distinct from `"failed"`, so the frontend can show a more specific message than generic failure.
4. **Scan detection + OCR:** Currently `do_ocr=False`. If we detect a scanned document, should we re-run Docling with `do_ocr=True` as a fallback, or just warn? (This would significantly increase processing time.)