---
title: Understanding the Output
date: 2026-03-23
author: Equalify Tech Team
description: What the accessible version of your document includes, how to evaluate quality, and what to look for in review.
---

# Understanding the Output

When Equalify Reflow processes a PDF, it produces an accessible markdown document and a set of extracted images. This guide explains what the output includes, how to evaluate its quality, and what the system's current limitations are.

## What You Get

### Accessible Markdown

The primary output is a single markdown file containing the full document content with:

- **Semantic headings** — a properly nested heading hierarchy (H1 through H6) that reflects the document's logical structure, not just its visual appearance
- **Alt text for images** — descriptive text for informational images, charts, and diagrams. Decorative images (logos, background graphics) are marked as decorative
- **Accessible tables** — tables with header rows identified, preserving the relationship between headers and data cells
- **Reconstructed lists** — bulleted and numbered lists with proper nesting, even when the original PDF layout fragmented them across columns or pages
- **Hyperlinks** — URLs detected in the text are converted to clickable links
- **Clean reading order** — content flows in a logical sequence, free from the column-jumping and page-break artifacts of the PDF layout

### Extracted Figures

Images, charts, diagrams, and photos are extracted from the PDF and saved as separate files. Each figure includes:

- A unique identifier linking it to its position in the markdown
- The page number where it appeared
- The original caption, if one existed in the PDF

During the Translation stage, a specialist sub-agent generates alt text for each figure and embeds it directly in the markdown (e.g., `![Description of chart](figures/figure-1.png)`). Decorative images like logos are identified and left with empty alt text, following WCAG best practices.

### The Change Ledger

Every edit the pipeline makes is recorded in a change ledger. Each entry includes:

- **Action** — what type of change was made (add, modify, delete)
- **Target** — what was changed (heading, paragraph, table, figure, etc.)
- **Before / After** — the exact text before and after the edit
- **Reasoning** — why the AI made this change

The ledger is available through the API (`GET /api/v1/documents/{job_id}/ledger`) and in the pipeline viewer's Changes panel.
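
A minimal sketch of reading the ledger over the API, in Python. The endpoint path comes from this guide. The base URL, the lack of authentication, the JSON-array response shape, and the lowercase `action` key are assumptions made for illustration, so check them against your own deployment before relying on this.

```python
import json
from collections import Counter
from urllib.request import urlopen

# Hypothetical base URL; substitute the address of your own deployment.
BASE_URL = "https://reflow.example.edu"


def ledger_url(job_id: str, base_url: str = BASE_URL) -> str:
    """Build the ledger endpoint URL for a job (path taken from this guide)."""
    return f"{base_url.rstrip('/')}/api/v1/documents/{job_id}/ledger"


def fetch_ledger(job_id: str, base_url: str = BASE_URL) -> list[dict]:
    """Fetch ledger entries.

    Assumption: the response body is a JSON array of entry objects and the
    endpoint needs no authentication header.
    """
    with urlopen(ledger_url(job_id, base_url)) as response:
        return json.load(response)


def count_by_action(entries: list[dict]) -> Counter:
    """Count entries per action type (add, modify, delete).

    Assumption: each entry stores its action under the key "action".
    """
    return Counter(entry.get("action", "unknown") for entry in entries)
```

Calling `count_by_action(fetch_ledger(job_id))` gives a quick sense of how heavily a document was edited, for example many `delete` entries on a short document may be worth a closer look in the Changes panel.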

## Evaluating Quality

### What to Check

When reviewing a converted document, focus on these areas:

**Structure**
- Does the heading hierarchy make sense? H1 should be the document title, H2s should be major sections, and so on
- Are sections in the correct order?
- Do lists have the right nesting?
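
The heading checks above can be partly automated. The following Python sketch is not part of Equalify Reflow; `heading_issues` is a hypothetical helper that assumes the output uses ATX-style headings (`#`, `##`, ...) and flags skipped levels and a missing or repeated H1. It cannot judge whether a heading reflects the document's logical structure, so manual review is still needed.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE_RE = re.compile(r"^\s*(```|~~~)")


def heading_issues(markdown: str) -> list[str]:
    """Report skipped heading levels and a missing or repeated H1."""
    issues = []
    previous_level = 0
    h1_count = 0
    in_fence = False
    for line_number, line in enumerate(markdown.splitlines(), start=1):
        # Ignore lines inside fenced code blocks, where "#" is not a heading.
        if FENCE_RE.match(line):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING_RE.match(line)
        if not match:
            continue
        level = len(match.group(1))
        title = match.group(2)
        if level == 1:
            h1_count += 1
        # Going deeper by more than one level (H2 -> H4) skips a level.
        if previous_level and level > previous_level + 1:
            issues.append(
                f"line {line_number}: H{previous_level} jumps to H{level} ({title!r})"
            )
        previous_level = level
    if h1_count != 1:
        issues.append(f"expected exactly one H1, found {h1_count}")
    return issues
```

An empty list means the nesting is mechanically consistent; it does not mean the headings are the right ones.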

**Content Accuracy**
- Is the text faithful to the original? Look for OCR errors (character substitutions, missing words)
- Are numbers, dates, and proper nouns correct?
- Are footnotes in the right place and correctly numbered?

**Tables**
- Do tables have header rows identified?
- Are cells aligned with the correct headers?
- For complex tables (merged cells, multi-level headers), verify the structure manually
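
A rough structural check for the first two questions, sketched in Python. It assumes tables are written as pipe tables whose rows start with `|`, which may not match every output, and `table_issues` is a hypothetical helper rather than an Equalify Reflow feature. It only verifies that a header delimiter row exists and that every row has as many cells as the header; it cannot tell whether a value sits under the correct header.

```python
import re

DELIMITER_CELL_RE = re.compile(r"^:?-+:?$")


def split_row(line: str) -> list[str]:
    """Split a pipe-table row into trimmed cells (escaped pipes are not handled)."""
    text = line.strip().removeprefix("|").removesuffix("|")
    return [cell.strip() for cell in text.split("|")]


def is_delimiter_row(line: str) -> bool:
    """True for rows such as |---|:--:|, which mark the row above as a header."""
    return all(DELIMITER_CELL_RE.match(cell) for cell in split_row(line))


def table_issues(markdown: str) -> list[str]:
    """Report tables without a header delimiter row and rows with a wrong cell count."""
    issues = []
    lines = markdown.splitlines()
    index = 0
    while index < len(lines):
        if not lines[index].lstrip().startswith("|"):
            index += 1
            continue
        # Collect the consecutive lines that make up one table.
        start = index
        while index < len(lines) and lines[index].lstrip().startswith("|"):
            index += 1
        block = lines[start:index]
        header_width = len(split_row(block[0]))
        if len(block) < 2 or not is_delimiter_row(block[1]):
            issues.append(f"line {start + 1}: table has no header delimiter row")
            continue
        for offset, row in enumerate(block[1:], start=1):
            width = len(split_row(row))
            if width != header_width:
                issues.append(
                    f"line {start + offset + 1}: {width} cells, header has {header_width}"
                )
    return issues
```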

**Images**
- Do informational images have descriptive alt text?
- Is the alt text accurate — does it convey the same information as the image?
- Are decorative images (logos, dividers) left with empty alt text rather than described?
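
To review alt text side by side with the figures, it can help to list every image reference in the markdown. The Python sketch below assumes images use the inline `![alt](path)` form shown earlier in this guide; `split_by_alt` is a hypothetical helper name. It separates images with alt text from those with empty alt text, but whether a description is accurate still requires a human looking at the image.

```python
import re

# Matches ![alt](path) and ![alt](path "title"); reference-style images are not handled.
IMAGE_RE = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<path>[^)\s]+)[^)]*\)")


def split_by_alt(markdown: str) -> dict[str, list]:
    """Group image references by whether they carry alt text."""
    described = []   # (path, alt text) pairs to compare against the image
    decorative = []  # paths with empty alt text; confirm they are truly decorative
    for match in IMAGE_RE.finditer(markdown):
        alt = match.group("alt").strip()
        path = match.group("path")
        if alt:
            described.append((path, alt))
        else:
            decorative.append(path)
    return {"described": described, "decorative": decorative}
```

Any informational chart or diagram that shows up under `decorative` is worth flagging, since a screen reader will skip it entirely.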

**Formatting**
- Are hyperlinks clickable (not just plain text URLs)?
- Is emphasis (bold, italic) preserved where it carries meaning?
- Are code blocks properly identified and formatted?
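
The hyperlink check can be approximated with a short Python sketch. `bare_urls` is a hypothetical helper that reports `http(s)` URLs appearing outside `[text](url)` links and `<url>` autolinks. Some markdown renderers link bare URLs automatically, so treat the results as candidates to inspect, not confirmed defects. URLs inside code blocks are also reported, which is a known simplification.

```python
import re

LINK_RE = re.compile(r"!?\[[^\]]*\]\([^)]*\)")      # [text](url) and ![alt](url)
AUTOLINK_RE = re.compile(r"<https?://[^>\s]+>")      # <https://example.org>
URL_RE = re.compile(r"https?://[^\s<>)\]]+")


def bare_urls(markdown: str) -> list[tuple[int, str]]:
    """Return (line number, URL) pairs for URLs not wrapped in link syntax."""
    found = []
    for line_number, line in enumerate(markdown.splitlines(), start=1):
        # Remove properly linked URLs first so only plain-text ones remain.
        remainder = AUTOLINK_RE.sub("", LINK_RE.sub("", line))
        for match in URL_RE.finditer(remainder):
            # Trailing punctuation usually belongs to the sentence, not the URL.
            found.append((line_number, match.group().rstrip(".,;:")))
    return found
```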

### Quality by Document Type

Some document types convert better than others:

| Document Type | Typical Quality | Common Issues |
|---------------|----------------|---------------|
| Syllabi and course materials | High | Occasional heading level disagreements |
| Policy documents | High | Complex nested numbering schemes |
| Letters and memos | High | Letterhead content may be over-described |
| Academic chapters | Medium | Footnote ordering, reading order in multi-column layouts |
| Presentations (slides) | Medium | Slide boundaries, text embedded in images |
| Infographics and posters | Lower | Spatial relationships lost when linearized |
| Brochures with complex layouts | Lower | Multi-column reading order confusion |

## Known Limitations

The system is designed for **course materials**: syllabi, academic papers, policy documents, presentations, and similar content. The following document types and content are outside the current scope and may produce lower-quality results:

- **Scanned multi-column academic chapters**: reading order detection across columns is unreliable for scanned documents
- **Heavy infographics**: spatial relationships (flow diagrams, organizational charts) are lost when linearized to text
- **Mathematical equations**: complex LaTeX formulas are not fully supported
- **Bilingual scanned documents**: OCR quality degrades significantly with mixed-language scanned content
- **Documents over 40 pages**: the system processes them, but cost rises and quality can drop as length and complexity grow

When the pipeline detects a document type it handles poorly, it emits **warnings**, which appear in both the API response and the viewer interface.

## Providing Feedback

If you find an issue in a converted document, see the [Providing Feedback](providing-feedback.md) guide for how to report issues and suggest corrections through the WordPress plugin.