---
title: How Equalify Reflow Works
date: 2026-03-23
author: Equalify Tech Team
description: The seven-stage pipeline that converts PDFs into accessible, semantic markdown.
---
# How Equalify Reflow Works
## The Thesis
Documents are written in two languages at once. There's the text – the words on the page. And there's the **visual language** – the conventions of size, weight, position, proximity, and spacing that tell you what those words *mean* structurally. Biggest text, centered, top of page? Title. Small italic text under an image? Caption. Indented block with a bullet? List item. This is a language that sighted people understand fluently without much thought.
Multimodal AI models – models that process both images and text – have an understanding of **visual language** and the coding knowledge to express it as **semantic structure**. This makes translating from a visual layout to accessible HTML a natural fit for what these models already do.
But having a model that *can* translate isn't the same as having a system that *does* translate reliably. A bilingual dictionary contains real knowledge, but it doesn't make you a translator. Translation requires architecture: knowing what you're translating from, what you're translating to, what counts as correct, and how to verify the output. That's what Equalify Reflow is.
## Why Markdown
Instead of trying to "fix" PDFs – a format designed for print fidelity, where accessibility is bolted on after the fact – we extract the content and rebuild it in a format that is natively accessible.
That format is **Markdown**:
- **Democratic by design** – plain text, no proprietary tooling, owned by no one
- **Human-readable without rendering** – open it in any text editor and understand the structure
- **Semantically rich** – headings, lists, tables, links all have explicit structural meaning
- **Maps directly to HTML** – renders losslessly into accessible HTML with proper heading hierarchy, table headers, alt text, and landmark regions
- **Lingua franca** – readable by humans, AI models, and computer programs alike
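That direct mapping is easy to see in miniature. Here's a toy sketch – plain Python, not the renderer Reflow actually uses – of how ATX heading syntax translates straight into HTML heading tags:

```python
import re

def headings_to_html(markdown: str) -> str:
    """Convert ATX headings (# .. ######) to <h1>..<h6> tags.

    A deliberately tiny illustration of markdown's one-to-one
    structural mapping; real renderers handle the full spec.
    """
    out = []
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)$", line)
        if m:
            level = len(m.group(1))
            out.append(f"<h{level}>{m.group(2)}</h{level}>")
        else:
            out.append(line)
    return "\n".join(out)

print(headings_to_html("# Title\n## Section\nBody text."))
# <h1>Title</h1>
# <h2>Section</h2>
# Body text.
```

The heading depth is carried explicitly in the syntax, so no visual inference is needed at render time – that inference happens once, upstream, in the pipeline.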
## The Pipeline
Equalify Reflow converts PDFs through a seven-stage pipeline. Each stage builds on the previous one, producing versioned markdown snapshots so you can see exactly what changed at each step.
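The overall shape – stages applied in order, with a named snapshot kept after each one – can be sketched like this. The stage functions below are stand-ins, not Reflow's actual API:

```python
from typing import Callable

# A stage takes markdown in and returns edited markdown.
Stage = Callable[[str], str]

def run_pipeline(markdown: str, stages: list[tuple[str, Stage]]) -> list[tuple[str, str]]:
    """Run each named stage in order, snapshotting after every one."""
    snapshots = [("initial", markdown)]
    for name, stage in stages:
        markdown = stage(markdown)
        snapshots.append((name, markdown))
    return snapshots

# Stand-in transforms just to show the versioning behavior.
snaps = run_pipeline("raw text", [
    ("extract", lambda md: md.upper()),
    ("cleanup", lambda md: md.strip()),
])
print([name for name, _ in snaps])  # ['initial', 'extract', 'cleanup']
```

Because every intermediate version is retained, a reviewer can diff any two snapshots to see exactly what a given stage changed.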
### Stage 1: Initial Extraction
[IBM Docling](https://github.com/docling-project/docling) handles the first pass. It uses smaller, efficient models and whatever structural data already exists inside the PDF to produce an initial markdown version. This handles mechanical parsing – text blocks, tables, images, reading order – without burning expensive LLM calls on mechanical work. This gets you roughly 70% of the way there.
If the document is scanned (image-only), Docling applies OCR to extract the text before proceeding.
### Stage 2: Structure Analysis
Before the AI processes a page, the system needs to understand what it's looking at. This stage analyzes the PDF's visual presentation alongside the semantic data from Docling to classify the document: is this a poster, an academic paper, a syllabus, a flyer?
The document type matters because it **dynamically adjusts the prompt** given to the model at every subsequent stage. A two-column academic paper needs different handling than a single-page event poster.
This stage also produces a structural map of the document: an outline of headings and sections, page-level attributes (layout type, content flags like images, tables, equations), footnote locations, and any elements that need special attention. All of this context is carried forward through the pipeline as a **dossier** that informs every downstream decision.
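A minimal sketch of what that dossier might hold – the field names here are illustrative assumptions, not Reflow's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class PageAttributes:
    """Per-page facts gathered during structure analysis (illustrative fields)."""
    layout: str                     # e.g. "single-column", "two-column"
    has_tables: bool = False
    has_images: bool = False
    has_equations: bool = False

@dataclass
class Dossier:
    """Context carried forward to every downstream stage (hypothetical shape)."""
    document_type: str                              # e.g. "academic-paper", "poster"
    outline: list[str] = field(default_factory=list)
    pages: list[PageAttributes] = field(default_factory=list)
    footnote_pages: list[int] = field(default_factory=list)

dossier = Dossier(
    document_type="academic-paper",
    outline=["Abstract", "1 Introduction"],
    pages=[PageAttributes(layout="two-column", has_tables=True)],
)
```

Whatever the real structure looks like, the key property is that it is built once and then read by every later stage, so prompt adjustments stay consistent across the whole run.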
### Stage 3: Heading Reconciliation
Headings come first because a valid heading hierarchy is the backbone of document accessibility. Get that right and everything else has a skeleton to hang on.
The agent infers heading levels from visual signals – font size, weight, position, spacing – and reconciles them into a consistent hierarchy across the entire document.
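As a simplified stand-in for that multi-signal inference, here is a sketch that ranks headings by font size alone (largest size becomes level 1, next largest level 2, and so on):

```python
def reconcile_headings(headings: list[tuple[str, float]]) -> list[tuple[int, str]]:
    """Assign heading levels by ranking distinct font sizes, largest first.

    A toy single-signal version of the reconciliation step; the real
    agent also weighs weight, position, and spacing.
    """
    sizes = sorted({size for _, size in headings}, reverse=True)
    level_of = {size: rank + 1 for rank, size in enumerate(sizes)}
    return [(level_of[size], text) for text, size in headings]

print(reconcile_headings([
    ("How Reflow Works", 24.0),
    ("The Thesis", 18.0),
    ("Why Markdown", 18.0),
]))
# [(1, 'How Reflow Works'), (2, 'The Thesis'), (2, 'Why Markdown')]
```

Ranking distinct sizes (rather than using raw point values) guarantees the output levels are contiguous – you can't end up with an h1 followed directly by an h4.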
### Stage 4: Page Content Correction
This is where the core translation happens. Each page is given to a multimodal LLM as both an **image** and its **current markdown interpretation**. The model's job is to edit the markdown to make it match what the visual page communicates.
The model works through **tool calls** – each edit includes a reasoning explanation, giving insight into how the model is interpreting the document. This reasoning trail is recorded in a **change ledger** for auditability.
Some tool calls spawn **specialist sub-agents** for tasks that need focused expertise:
- **Alt Text Agent** – image description, chart summarization, decorative vs. informational labeling
- **Table Agent** – cell relationships, header associations, complex table structures
- **List Agent** – nested list reconstruction, continuation across visual breaks
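The dispatch itself is simple to picture: elements flagged by the page agent get routed to the matching specialist. The agent names below mirror the list above, but the handler functions and their signatures are assumptions for illustration:

```python
# Hypothetical specialist handlers; the real sub-agents are LLM-backed.
def describe_image(element: dict) -> str:
    return f"alt text for {element['id']}"

def fix_table(element: dict) -> str:
    return f"rebuilt table {element['id']}"

def rebuild_list(element: dict) -> str:
    return f"reconstructed list {element['id']}"

SPECIALISTS = {
    "image": describe_image,
    "table": fix_table,
    "list": rebuild_list,
}

def dispatch(element: dict) -> str:
    """Route a flagged element to its specialist sub-agent."""
    return SPECIALISTS[element["kind"]](element)

print(dispatch({"kind": "table", "id": "p3-t1"}))  # rebuilt table p3-t1
```

Keeping each specialist narrowly scoped means its prompt can be dense with one kind of expertise instead of trying to cover everything at once.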
### Stage 5: Code Block Detection
If the document contains code snippets, this stage identifies them and wraps them in proper fenced code blocks with language annotations. Academic papers, technical documentation, and course materials frequently contain inline code that needs to be distinguished from prose.
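A crude heuristic version of this detection – the real stage relies on the model's judgment, but this shows the shape of the transformation:

```python
import re

def looks_like_code(line: str) -> bool:
    """Toy heuristic: heavy indentation or code-ish trailing punctuation."""
    return bool(re.match(r"^\s{4}", line)) or line.rstrip().endswith((";", "{", "}"))

def fence_code_runs(lines: list[str], language: str = "python") -> list[str]:
    """Wrap maximal runs of code-looking lines in a fenced block."""
    fence = "`" * 3
    out: list[str] = []
    run: list[str] = []

    def flush() -> None:
        if run:
            out.append(fence + language)
            out.extend(run)
            out.append(fence)
            run.clear()

    for line in lines:
        if looks_like_code(line):
            run.append(line)
        else:
            flush()
            out.append(line)
    flush()  # a document can end mid-run
    return out

print(fence_code_runs(["Intro:", "    x = 1", "    print(x)", "Done."]))
```

The language annotation matters for accessibility too: screen readers and syntax highlighters both key off the fence's language tag.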
### Stage 6: Cross-Page Assembly
This stage brings all the individual pages together and removes page boundaries – pages are a print metaphor, and on screens they're an obstacle. The agent examines the seams between pages and fixes:
- Sentences split mid-word across a page break
- Tables or lists interrupted by a page boundary
- Footnotes relocated to their logical position
- Paragraphs that continue from one page to the next
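One of those seam repairs – a word hyphenated across a page break – can be sketched in a few lines. This is a single rule for illustration, not the full agent:

```python
import re

def join_pages(pages: list[str]) -> str:
    """Stitch per-page markdown together, rejoining words hyphenated
    across a page break; other pages are separated by a blank line."""
    doc = pages[0]
    for page in pages[1:]:
        if re.search(r"\w-\s*$", doc):       # hyphenated word at page end
            doc = re.sub(r"-\s*$", "", doc) + page.lstrip()
        else:
            doc = doc + "\n\n" + page
    return doc

print(join_pages(["The pipeline repairs docu-", "ments split across pages."]))
# The pipeline repairs documents split across pages.
```

Tables, lists, and footnotes need more context than a regex can carry, which is why the real stage hands those seams to the model.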
### Stage 7: Final Cleanup
This stage normalizes whitespace, applies consistent formatting, and removes any remaining artifacts from the extraction process. The result is a single, reflowable document that adapts to any viewport, any device, any rendering context – accessible by construction.
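One plausible slice of that cleanup, sketched with the standard library – stripping trailing spaces and collapsing runs of blank lines:

```python
import re

def normalize(markdown: str) -> str:
    """Strip trailing whitespace and collapse blank-line runs to one.

    An illustrative fragment of final cleanup, not the full stage.
    """
    lines = [line.rstrip() for line in markdown.splitlines()]
    text = "\n".join(lines)
    text = re.sub(r"\n{3,}", "\n\n", text)   # at most one blank line in a row
    return text.strip() + "\n"

print(repr(normalize("# Title   \n\n\n\nBody.  \n")))
# '# Title\n\nBody.\n'
```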
## PII Protection
Before any AI processing occurs, every document is scanned for personally identifiable information using [Microsoft Presidio](https://microsoft.github.io/presidio/). If PII is detected – names, emails, phone numbers, SSNs – the document is held for human review before proceeding. The system is designed for course materials only, not student records.
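The gate logic is easy to picture. Here is a toy stand-in using two regex spot-checks – Presidio's real analyzers are far more thorough; this only shows the hold-for-review decision:

```python
import re

# Toy patterns for two PII kinds; NOT a substitute for Presidio.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def pii_gate(text: str) -> tuple[bool, list[str]]:
    """Return (held_for_review, detected_kinds)."""
    found = [kind for kind, pat in PII_PATTERNS.items() if pat.search(text)]
    return (bool(found), found)

print(pii_gate("Contact jane@example.edu about the syllabus."))
# (True, ['email'])
```

The important design point is the fail-closed default: any detection pauses the pipeline for a human rather than letting the document proceed to AI processing.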
## The Change Ledger
Every edit made by the pipeline is recorded with:
- **What changed** – the before and after text
- **Why** – the model's reasoning for the edit
- **Where** – the page and target element
This ledger is available for human review, creating a transparent audit trail. In `human` review mode, an administrator can inspect every change before the document is finalized.
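A minimal sketch of one ledger entry – the field names here are illustrative, not Reflow's actual schema:

```python
from dataclasses import dataclass

@dataclass
class LedgerEntry:
    """One recorded edit: what changed, why, and where."""
    page: int            # where
    element: str
    before: str          # what changed
    after: str
    reasoning: str       # why

entry = LedgerEntry(
    page=3,
    element="heading",
    before="**Methods**",
    after="## Methods",
    reasoning="Bold-only line matches the visual style of other level-2 headings.",
)
print(f"p{entry.page} {entry.element}: {entry.before!r} -> {entry.after!r}")
```

Because every entry carries its own reasoning, a reviewer can accept or reject edits individually instead of vetting the whole document at once.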
## Tech Stack
- **[FastAPI](https://fastapi.tiangolo.com/)** – Python async web framework
- **[IBM Docling](https://github.com/docling-project/docling)** – PDF extraction and OCR
- **[Claude](https://www.anthropic.com/claude) via [AWS Bedrock](https://aws.amazon.com/bedrock/)** – multimodal AI processing
- **[PydanticAI](https://ai.pydantic.dev/)** – agent framework with tool-call architecture
- **[Microsoft Presidio](https://microsoft.github.io/presidio/)** – PII detection
- **[Redis](https://redis.io/)** – job queuing, state management, and event streaming
- **[S3](https://aws.amazon.com/s3/)** – document storage with circuit breakers
- **[Docker](https://www.docker.com/)** – containerized development and deployment
- **[Terraform](https://www.terraform.io/)** – AWS infrastructure as code