---
title: Supported Document Types
date: 2026-04-16
author: Equalify Tech Team
description: What Reflow produces, which document types convert well, and which are outside the current scope.
---

# Supported document types

Equalify Reflow is designed for **course materials** — syllabi, academic papers, policy documents, and presentations. This page is a quick lookup for what's in scope, what the pipeline produces, and where quality drops off. For the judgement side — how to evaluate a specific conversion — see [interpret the output](../how-to/interpret-the-output.md).

## Size limits

| Limit | Value | Behaviour when exceeded |
|---|---|---|
| File size | 100 MB | API rejects with `413 Payload Too Large` at submission |
| Page count | 50 pages | Job moves to `failed` status with error: `PDF has N pages, which exceeds the maximum of 50. Please split into smaller documents.` |

Cost is roughly linear in page count, so documents close to the 50-page ceiling are also the most expensive to convert. Plan for about $0.08–0.10 per page on a Haiku-tier run.
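As a rough illustration, the limits and the per-page planning figure above can be checked on the client before submitting. This is a minimal sketch, not part of Reflow: the function names `preflight` and `estimate_cost_usd` are hypothetical, and it assumes "100 MB" means 100 × 1024 × 1024 bytes and that you already know the file size and page count.

```python
# Hypothetical client-side helpers; none of these names come from the Reflow API.
MAX_FILE_BYTES = 100 * 1024 * 1024  # assumption: "100 MB" read as 100 MiB
MAX_PAGES = 50                      # page limit from the table above
COST_PER_PAGE_USD = (0.08, 0.10)    # Haiku-tier planning range from this page


def preflight(file_bytes: int, page_count: int) -> list[str]:
    """Return the limit violations a document would be expected to hit, if any."""
    problems = []
    if file_bytes > MAX_FILE_BYTES:
        problems.append("file exceeds 100 MB: expect 413 Payload Too Large at submission")
    if page_count > MAX_PAGES:
        problems.append(f"PDF has {page_count} pages: expect the job to move to failed")
    return problems


def estimate_cost_usd(page_count: int) -> tuple[float, float]:
    """Return a (low, high) planning estimate, assuming cost is linear in pages."""
    low, high = COST_PER_PAGE_USD
    return (round(page_count * low, 2), round(page_count * high, 2))


if __name__ == "__main__":
    print(preflight(file_bytes=12_000_000, page_count=48))
    print(estimate_cost_usd(48))
```

The estimate is only a planning range; actual cost also depends on document complexity.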

## Quality by document type

| Document type | Typical quality | Common issues |
|---|---|---|
| Syllabi and course materials | High | Occasional heading-level disagreements |
| Policy documents | High | Complex nested numbering schemes |
| Letters and memos | High | Letterhead content may be over-described |
| Academic chapters | Medium | Footnote ordering, reading order in multi-column layouts |
| Presentations (slides) | Medium | Slide boundaries, text embedded in images |
| Infographics and posters | Lower | Spatial relationships lost when linearised |
| Brochures with complex layouts | Lower | Multi-column reading order confusion |

When the pipeline detects a document type it handles poorly, it attaches `warnings` to the job. These appear in both the API response and the viewer.
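A client might surface those warnings along the following lines. This is a sketch under stated assumptions: only the `warnings` field name comes from this page, while the response shape, the `"completed"` status value, and the sample warning text are hypothetical placeholders.

```python
def collect_warnings(job: dict) -> list[str]:
    """Return the warnings attached to a job response, or an empty list."""
    # Assumption: `warnings` is a list; it may be missing or null on clean jobs.
    warnings = job.get("warnings") or []
    # Assumption: entries may be strings or objects, so stringify defensively.
    return [str(w) for w in warnings]


if __name__ == "__main__":
    # Hypothetical payload, for illustration only.
    job = {"status": "completed", "warnings": ["multi-column layout detected"]}
    for message in collect_warnings(job):
        print(f"warning: {message}")
```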

## Known limitations

The following are outside current scope and will produce lower-quality output:

- **Scanned multi-column academic chapters** — reading order across columns is unreliable for scanned content
- **Heavy infographics** — spatial relationships (flow diagrams, org charts) flatten into linear text
- **Mathematical equations** — complex LaTeX formulas are not fully supported
- **Bilingual scanned documents** — OCR quality degrades with mixed-language scanned content
- **Very long documents** — while the technical limit is 50 pages, quality and cost scale with complexity; documents over ~40 pages may benefit from being split along natural section boundaries
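For the last item, one way to plan a split along section boundaries is to group consecutive sections greedily until the next one would push a part past the target length. The sketch below is illustrative only: `plan_splits` is a hypothetical helper, the section start pages must come from you (for example, from the table of contents), and the 40-page default mirrors the guidance above rather than any enforced limit.

```python
def plan_splits(
    section_starts: list[int], total_pages: int, max_pages: int = 40
) -> list[tuple[int, int]]:
    """Group consecutive sections into (first_page, last_page) parts.

    Pages are 1-indexed and inclusive. A single section longer than
    max_pages is kept whole, so that part will exceed the target.
    """
    if total_pages < 1:
        raise ValueError("total_pages must be at least 1")
    # Page 1 always starts a section; ignore out-of-range start pages.
    starts = sorted({1} | {s for s in section_starts if 1 <= s <= total_pages})
    ends = [s - 1 for s in starts[1:]] + [total_pages]

    parts = []
    part_start, part_end = starts[0], None
    for start, end in zip(starts, ends):
        # Close the current part if adding this section would exceed the target.
        if part_end is not None and end - part_start + 1 > max_pages:
            parts.append((part_start, part_end))
            part_start = start
        part_end = end
    parts.append((part_start, part_end))
    return parts


if __name__ == "__main__":
    # A 70-page document with sections starting on pages 1, 15, 30 and 55.
    print(plan_splits([1, 15, 30, 55], total_pages=70))
```

Each returned range can then be exported as its own PDF with the tool of your choice and submitted as a separate job.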

## What the output contains

Every completed job produces:

- **`result.md`** — a single markdown file with the full document content (semantic headings H1–H6, alt text on images, accessible tables with header rows, reconstructed lists, inline hyperlinks, logical reading order)
- **Figures** — individual image files (PNG) for each extracted figure/chart/diagram, each tied to a `figure_id` referenced from the markdown
- **Change ledger** — a JSON record of every edit the pipeline made, with before/after text and a one-sentence reason per edit
- **Bundle** — optional ZIP of the above, downloadable from the `/bundle` endpoint

Decorative images (logos, spacers) are identified and left with empty alt text, following WCAG best practices. Informational images get descriptive alt text generated by the image sub-agent.
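To spot-check how images were classified in a given `result.md`, a short script can separate images with empty alt text from those with descriptions. This sketch assumes the output uses standard Markdown image syntax (`![alt](src)`); the helper name `classify_images` and the sample file paths are hypothetical, and the simple pattern does not handle alt text that itself contains a closing bracket.

```python
import re

# Matches ![alt](src) and ![alt](src "title"); the alt text may be empty.
IMAGE_RE = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<src>[^)\s]+)[^)]*\)")


def classify_images(markdown: str) -> dict[str, list[str]]:
    """Split image sources into decorative (empty alt) and informational."""
    result: dict[str, list[str]] = {"decorative": [], "informational": []}
    for match in IMAGE_RE.finditer(markdown):
        key = "informational" if match.group("alt").strip() else "decorative"
        result[key].append(match.group("src"))
    return result


if __name__ == "__main__":
    # Hypothetical excerpt; real paths and figure names will differ.
    sample = (
        "![](figures/logo.png)\n"
        "![Bar chart of enrolment by year](figures/chart.png)\n"
    )
    print(classify_images(sample))
```

An image listed as decorative that actually carries information is worth reviewing by hand.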