AI-powered pipeline that transforms academic PDF content into semantic markup that can be represented more inclusively
https://github.com/EqualifyEverything/equalify-reflow.git
Equalify Reflow is an open source project that frees content trapped inside PDFs. It takes a PDF and produces semantic, reflowable markdown: content that works on any screen size, with any assistive technology, and with AI tools. Originally built with the University of Illinois Chicago (UIC) to make course materials work better for everyone, the project is now available for any organisation that needs accessible document conversion.
A server-side pipeline that converts PDFs into semantic markdown using document extraction (IBM Docling) and AI text correction (Claude Haiku via AWS Bedrock). Upload a PDF, get back structured markdown with proper headings, alt text on images, accessible tables, and extracted figures.
PDF uploaded via API
|
v
PII scan (Microsoft Presidio)
Pass: queue for processing
Fail: await instructor approval
|
v
Versioned pipeline (5 phases):
1. Extraction -- PDF → markdown + page images (Docling, with OCR fallback for scanned PDFs)
2. Analysis -- AI classifies pages and identifies headings, footnotes, code blocks
3. Headings -- AI reconciles heading hierarchy and normalises levels
4. Translation -- AI fixes per-page content and tags code block languages
5. Assembly -- Cross-page boundary fixes and final cleanup
|
v
Semantic markdown + extracted figures stored in S3
|
v
Results available via API or pipeline viewer UI
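The flow above can be sketched in Python as a PII gate followed by five ordered phases. This is a minimal illustration under stated assumptions, not the project's code: `route_after_pii_scan`, `run_phases`, the status strings, and the handler mapping are hypothetical names, and the real phases call Docling and an AI model rather than plain functions.

```python
from typing import Callable

# Phase names in the order listed above; the string values are illustrative.
PHASES = ["extraction", "analysis", "headings", "translation", "assembly"]


def route_after_pii_scan(pii_found: bool) -> str:
    """Return the next step for a document after the PII scan."""
    # Pass: queue for processing. Fail: wait for instructor approval.
    return "await_approval" if pii_found else "queued"


def run_phases(document: dict, handlers: dict[str, Callable[[dict], dict]]) -> dict:
    """Run every phase in order and record which phases produced a version."""
    versions = []
    for name in PHASES:
        # Each hypothetical handler takes the document and returns an updated copy.
        document = handlers[name](document)
        versions.append(name)
    return {**document, "versions": versions}
```

Recording one entry per phase mirrors the idea of a versioned pipeline, where the output of each step can be reviewed or compared against the previous one.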
| Feature | Status |
|---|---|
| Versioned processing pipeline -- Docling extraction, AI structure analysis, page corrections, boundary fixes | Complete |
| REST API -- Submit documents, poll status, stream events (SSE), retrieve results | Complete |
| PII detection -- Microsoft Presidio scans all documents before AI processing | Complete |
| Approval workflow -- Token-based PII approval with configurable timeouts | Complete |
| S3 storage -- Upload/download with circuit breakers and retry logic | Complete |
| Redis job management -- Job state, queuing, rate limiting, event bus | Complete |
| Authentication -- API key auth, protected Swagger docs | Complete |
| Monitoring -- Prometheus metrics, Grafana dashboards, Jaeger tracing | Complete |
| Pipeline viewer -- React UI for upload, step-by-step review, version diff comparison | Complete |
| Testing -- Three-tier suite (unit, integration, E2E) with coverage reports in CI | Complete |
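The S3 storage row mentions circuit breakers and retry logic. As a rough illustration of how the two fit together, here is a generic sketch; it is not the project's implementation, and `CircuitBreaker`, `CircuitOpenError`, `call_with_retry`, and all thresholds and delays are hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        # While open, reject calls until the cool-down has elapsed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit is open")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit again
        return result


def call_with_retry(breaker: CircuitBreaker, func: Callable[[], T],
                    attempts: int = 3, base_delay: float = 0.1,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry with exponential backoff, but never retry against an open circuit."""
    for attempt in range(attempts):
        try:
            return breaker.call(func)
        except CircuitOpenError:
            raise  # the dependency is known to be down; fail fast
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("attempts must be at least 1")
```

The retry loop handles short, transient failures, while the breaker stops further calls once failures pile up, so a struggling storage backend is not flooded with retries.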
# 1. Copy the env template and add your API key
cp .env.example .env
# Edit .env and set ANTHROPIC_API_KEY=sk-ant-...
# 2. Start all services
make dev
# 3. Verify
curl http://localhost:8080/health
# 4. View API docs (publicly accessible)
open http://localhost:8080/docs
The API runs at http://localhost:8080 with hot reload enabled. Edit code in src/ and changes reload automatically inside the container.
The pipeline's AI backend auto-detects from your environment: if ANTHROPIC_API_KEY is set, it uses Anthropic direct; otherwise it falls back to AWS Bedrock. Set AI_PROVIDER=anthropic or AI_PROVIDER=bedrock to force a specific backend. See .env.example for the full list of settings.
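The precedence described above can be expressed as a small function. This is a sketch of the documented behaviour, not the project's actual code: `resolve_ai_provider` is a hypothetical name, and both the case-insensitive matching and the `ValueError` for unknown values are assumptions.

```python
import os
from typing import Mapping, Optional

VALID_PROVIDERS = {"anthropic", "bedrock"}


def resolve_ai_provider(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the AI backend: explicit AI_PROVIDER first, then auto-detection."""
    env = os.environ if env is None else env

    # An explicit AI_PROVIDER always wins over auto-detection.
    forced = env.get("AI_PROVIDER", "").strip().lower()
    if forced:
        if forced not in VALID_PROVIDERS:
            # Assumption: unknown values are rejected rather than ignored.
            raise ValueError(f"Unsupported AI_PROVIDER: {forced!r}")
        return forced

    # Auto-detect: an Anthropic key selects Anthropic direct,
    # otherwise fall back to AWS Bedrock.
    return "anthropic" if env.get("ANTHROPIC_API_KEY") else "bedrock"
```

For example, `resolve_ai_provider({"ANTHROPIC_API_KEY": "sk-ant-...", "AI_PROVIDER": "bedrock"})` selects Bedrock, because the explicit setting overrides the detected key.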
make dev # Start development environment
make down # Stop all services
make logs-api # View API logs
make test-fast # Unit tests (<30s) -- run before commits
make test-integration # Integration tests (<2min) -- run before PRs
make shell # Access container bash
make health # Verify infrastructure
Run make help for all commands.
The FastAPI app exposes interactive API documentation at /docs (Swagger UI) and /redoc (ReDoc).
API endpoints are served under /api/v1/. See the Swagger UI for the complete list, request/response shapes, and authentication requirements.
src/
├── main.py # FastAPI app entry point
├── config.py # Settings from environment variables
├── dependencies.py # Dependency injection
├── api/ # REST endpoints (documents, approval, pipeline, health)
├── agents/ # AI prompt modules (structure, boundary, footnote)
├── services/ # Business logic (20 services)
│   ├── pipeline_viewer.py # Core versioned processing pipeline
│   ├── document_processing_service.py # Pipeline orchestration + S3/Redis
│   ├── storage_service.py # S3 with circuit breakers
│   ├── job_service.py # Redis job state (Lua scripts)
│   ├── queue_service.py # Redis queues
│   └── pii_service.py # Presidio PII detection
├── workers/ # Background tasks (PII scan, timeout checks)
├── middleware/ # Auth, logging, rate limiting, metrics, CORS
├── shared/ # Constants and shared utilities
└── utils/ # Retry logic, circuit breakers, tokens
clients/viewer/ # React pipeline viewer (Vite + TypeScript + Tailwind)
tests/ # Unit, integration, and E2E tests
infrastructure/ # Prometheus, Grafana, Floci configs
docs/ # Architecture, guides
Backend: Python 3.11+, FastAPI, PydanticAI, IBM Docling, Microsoft Presidio. AI model backend is pluggable: currently AWS Bedrock (Claude Haiku); Anthropic direct and other providers planned.
Infrastructure (local dev): Docker, Redis, AWS S3 (via Floci, a lightweight MIT-licensed emulator), Prometheus + Grafana
Monitoring: Prometheus, Grafana, Jaeger (OpenTelemetry)
Pipeline Viewer: React 18, TypeScript, Vite, ShadCN/Radix, Tailwind CSS
Testing: pytest, pytest-asyncio, pytest-xdist, pytest-cov, testcontainers
| Service | Port | Purpose |
|---|---|---|
| API Gateway | http://localhost:8080 | Main application, API docs, pipeline viewer |
| Redis | localhost:6379 | Job state, queues, rate limiting |
| Floci | localhost:4566 | S3 + CloudWatch emulation (replaces LocalStack) |
| Prometheus | http://localhost:9090 | Metrics collection |
| Grafana | http://localhost:3001 | Metrics dashboards (admin/admin) |
| Jaeger | http://localhost:16686 | Distributed tracing |
Docs live under docs/, grouped by what you're trying to do:
| Mode | What's there |
|---|---|
| Tutorials | Guided walkthroughs: convert your first PDF, add your first agent |
| How-to | Task recipes: set up dev environment, run tests, iterate on a prompt, add a new agent, add S3 operations, test rate limits, debug a CI failure, add features |
| Reference | Authoritative lookups: pipeline phases, model tiers, authentication, rate limits, CI workflows |
| Explanation | Design rationale: architecture, authentication design, rate limiting, S3 resilience, testing strategy |
CLAUDE.md
Interactive API docs: http://localhost:8080/docs locally, or https://reflow.equalify.uic.edu/docs for the deployed instance