EqualifyEverything / equalify-reflow

AI-powered pipeline that transforms academic PDF content into semantic markup that can be represented more inclusively

๐Ÿ“ .claude
๐Ÿ“ .github
๐Ÿ“ clients
๐Ÿ“ config
๐Ÿ“ docs
๐Ÿ“ infrastructure
๐Ÿ“ project-docs
๐Ÿ“ scripts
๐Ÿ“ src
๐Ÿ“ tests
๐Ÿ“„ .env.example
๐Ÿ“„ .gitignore
๐Ÿ“„ AGENTS.md
๐Ÿ“„ CLAUDE.md
๐Ÿ“„ CODE_OF_CONDUCT.md
๐Ÿ“„ CONTRIBUTING.md
๐Ÿ“„ docker-compose.yml
๐Ÿ“„ Dockerfile
๐Ÿ“„ LICENSE
๐Ÿ“„ Makefile
๐Ÿ“„ pyproject.toml
๐Ÿ“„ pytest.ini
๐Ÿ“„ README.md
๐Ÿ“„ SECURITY.md
๐Ÿ“„ tailwind.config.js
๐Ÿ“„ uv.lock
๐Ÿ“„ README.md

Equalify Reflow

Licence: AGPL-3.0-or-later

Equalify Reflow is an open source project that frees content trapped inside PDFs. It takes a PDF and produces semantic, reflowable markdown: content that works on any screen size, with any assistive technology, and with AI tools. Originally built with the University of Illinois Chicago (UIC) to make course materials work better for everyone, the project is now available for any organisation that needs accessible document conversion.

What It Is

A server-side pipeline that converts PDFs into semantic markdown using document extraction (IBM Docling) and AI text correction (Claude Haiku, via the Anthropic API or AWS Bedrock). Upload a PDF, get back structured markdown with proper headings, alt text on images, accessible tables, and extracted figures.

What It Is Not

  • Not a client-side tool or browser extension
  • Not a PDF viewer or annotation tool
  • Not a general-purpose document editor
  • Not a real-time converter -- processing takes ~5 minutes per document depending on length and complexity

How It Works

PDF uploaded via API
         |
         v
PII scan (Microsoft Presidio)
  Pass: queue for processing
  Fail: await instructor approval
         |
         v
Versioned pipeline (5 phases):
  1. Extraction  -- PDF → markdown + page images (Docling, with OCR fallback for scanned PDFs)
  2. Analysis    -- AI classifies pages and identifies headings, footnotes, code blocks
  3. Headings    -- AI reconciles heading hierarchy and normalises levels
  4. Translation -- AI fixes per-page content and tags code block languages
  5. Assembly    -- Cross-page boundary fixes and final cleanup
         |
         v
Semantic markdown + extracted figures stored in S3
         |
         v
Results available via API or pipeline viewer UI
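
The flow above can be read as a gate followed by an ordered list of phases, each producing a new version of the document. The sketch below is illustrative only: `Version`, `route_after_pii_scan`, `run_pipeline`, and the state strings are hypothetical names that do not come from the project's source, and the phase bodies are stand-ins for the real Docling and AI calls (the real extraction phase starts from a PDF, not a string).

```python
from dataclasses import dataclass
from typing import Callable

# Phase names taken from the diagram; everything else here is an assumption.
PHASE_ORDER = ["extraction", "analysis", "headings", "translation", "assembly"]


@dataclass(frozen=True)
class Version:
    phase: str     # phase that produced this snapshot
    markdown: str  # document state after that phase


def route_after_pii_scan(pii_findings: list[str]) -> str:
    """Mirror the gate in the diagram: clean documents are queued, others wait."""
    return "queued" if not pii_findings else "awaiting_approval"


def run_pipeline(source: str, phases: dict[str, Callable[[str], str]]) -> list[Version]:
    """Apply each phase in order and keep every intermediate version."""
    versions: list[Version] = []
    current = source
    for name in PHASE_ORDER:
        current = phases[name](current)
        versions.append(Version(phase=name, markdown=current))
    return versions


# Stand-in phases so the sketch runs without Docling or an AI backend.
demo_phases = {
    name: (lambda text, n=name: f"{text}\n<!-- {n} done -->")
    for name in PHASE_ORDER
}
history = run_pipeline("# Draft", demo_phases)
print([v.phase for v in history])
```

Keeping one snapshot per phase is what makes a step-by-step review or a diff between versions possible, which is presumably why the pipeline is described as versioned.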

What's Implemented

| Feature | Status |
| --- | --- |
| Versioned processing pipeline -- Docling extraction, AI structure analysis, page corrections, boundary fixes | Complete |
| REST API -- Submit documents, poll status, stream events (SSE), retrieve results | Complete |
| PII detection -- Microsoft Presidio scans all documents before AI processing | Complete |
| Approval workflow -- Token-based PII approval with configurable timeouts | Complete |
| S3 storage -- Upload/download with circuit breakers and retry logic | Complete |
| Redis job management -- Job state, queuing, rate limiting, event bus | Complete |
| Authentication -- API key auth, protected Swagger docs | Complete |
| Monitoring -- Prometheus metrics, Grafana dashboards, Jaeger tracing | Complete |
| Pipeline viewer -- React UI for upload, step-by-step review, version diff comparison | Complete |
| Testing -- Three-tier suite (unit, integration, E2E) with coverage reports in CI | Complete |
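
For readers unfamiliar with the "circuit breakers and retry logic" mentioned for S3 storage, here is a minimal, generic sketch of the two patterns working together. It is not the project's implementation: `CircuitBreaker`, `CircuitOpenError`, `with_retry`, and all thresholds and delays are assumptions chosen for illustration.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: allow one trial call (half-open state).
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # any success closes the circuit again
        return result


def with_retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry func with exponential backoff and re-raise the last error."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # do not keep hammering an open circuit
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

A caller would typically wrap each storage operation as `with_retry(lambda: breaker.call(upload, key, data))`, where `upload` stands for whatever function performs the S3 request. Retries absorb brief glitches, while the breaker stops further attempts once the backend looks unavailable.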

Quick Start

Prerequisites

  • Docker (v20.10+)
  • Docker Compose (v2.0+)
  • An Anthropic API key (get one at console.anthropic.com/settings/keys). Alternatively, bring your own AWS credentials with Bedrock access.

Get Running

# 1. Copy the env template and add your API key
cp .env.example .env
# Edit .env and set ANTHROPIC_API_KEY=sk-ant-...

# 2. Start all services
make dev

# 3. Verify
curl http://localhost:8080/health

# 4. View API docs (publicly accessible)
open http://localhost:8080/docs   # macOS; on Linux use xdg-open

The API runs at http://localhost:8080 with hot reload enabled. Edit code in src/ and changes reload automatically inside the container.

The pipeline's AI backend auto-detects from your environment: if ANTHROPIC_API_KEY is set, it uses Anthropic direct; otherwise it falls back to AWS Bedrock. Set AI_PROVIDER=anthropic or AI_PROVIDER=bedrock to force a specific backend. See .env.example for the full list of settings.
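
The selection rule described above can be sketched in a few lines. This is an illustration of the documented behaviour, not the project's code: the function name `resolve_ai_provider` is hypothetical, and raising an error for an unrecognised `AI_PROVIDER` value is an assumption, since the text does not say how invalid values are handled.

```python
import os
from typing import Mapping, Optional

SUPPORTED_PROVIDERS = {"anthropic", "bedrock"}


def resolve_ai_provider(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "anthropic" or "bedrock" based on environment settings."""
    env = os.environ if env is None else env

    # An explicit AI_PROVIDER always wins over auto-detection.
    forced = env.get("AI_PROVIDER", "").strip().lower()
    if forced:
        if forced not in SUPPORTED_PROVIDERS:
            # Assumed behaviour: reject unknown values instead of guessing.
            raise ValueError(f"Unsupported AI_PROVIDER: {forced!r}")
        return forced

    # Auto-detect: a non-empty Anthropic key selects Anthropic direct,
    # otherwise fall back to AWS Bedrock.
    return "anthropic" if env.get("ANTHROPIC_API_KEY") else "bedrock"


print(resolve_ai_provider({"ANTHROPIC_API_KEY": "sk-ant-example"}))
```

Passing the environment as a mapping keeps the rule easy to unit test without touching the real process environment.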

Essential Commands

make dev               # Start development environment
make down              # Stop all services
make logs-api          # View API logs
make test-fast         # Unit tests (<30s) -- run before commits
make test-integration  # Integration tests (<2min) -- run before PRs
make shell             # Access container bash
make health            # Verify infrastructure

Run make help for all commands.

API Documentation

The FastAPI app exposes interactive API documentation at /docs (Swagger UI) and /redoc (ReDoc).

  • Local development: http://localhost:8080/docs (public, no auth)
  • OpenAPI schema: http://localhost:8080/openapi.json (public)
  • Endpoints overview: all application endpoints are prefixed with /api/v1/. See the Swagger UI for the complete list, request/response shapes, and authentication requirements.
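
Because the OpenAPI schema is served publicly in local development, the list of application endpoints can also be read programmatically. The sketch below assumes only the standard OpenAPI layout (a top-level `paths` object keyed by path, then by HTTP method); the helper names `list_api_operations` and `fetch_schema` are made up for this example.

```python
import json
from urllib.request import urlopen

HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def list_api_operations(schema: dict, prefix: str = "/api/v1/") -> list[tuple[str, str]]:
    """Return (METHOD, path) pairs for every operation under the given prefix."""
    operations = []
    for path, item in sorted(schema.get("paths", {}).items()):
        if not path.startswith(prefix):
            continue
        for key in sorted(item):
            # Path items may also hold non-method keys such as "parameters".
            if key in HTTP_METHODS:
                operations.append((key.upper(), path))
    return operations


def fetch_schema(base_url: str = "http://localhost:8080") -> dict:
    """Download the OpenAPI document from a running instance."""
    with urlopen(f"{base_url}/openapi.json", timeout=10) as response:
        return json.load(response)
```

With the development stack running, `list_api_operations(fetch_schema())` returns the operations that the Swagger UI displays under `/api/v1/`.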

Project Structure

src/
├── main.py                 # FastAPI app entry point
├── config.py               # Settings from environment variables
├── dependencies.py         # Dependency injection
├── api/                    # REST endpoints (documents, approval, pipeline, health)
├── agents/                 # AI prompt modules (structure, boundary, footnote)
├── services/               # Business logic (20 services)
│   ├── pipeline_viewer.py             # Core versioned processing pipeline
│   ├── document_processing_service.py # Pipeline orchestration + S3/Redis
│   ├── storage_service.py             # S3 with circuit breakers
│   ├── job_service.py                 # Redis job state (Lua scripts)
│   ├── queue_service.py               # Redis queues
│   └── pii_service.py                 # Presidio PII detection
├── workers/                # Background tasks (PII scan, timeout checks)
├── middleware/             # Auth, logging, rate limiting, metrics, CORS
├── shared/                 # Constants and shared utilities
└── utils/                  # Retry logic, circuit breakers, tokens

clients/viewer/             # React pipeline viewer (Vite + TypeScript + Tailwind)
tests/                      # Unit, integration, and E2E tests
infrastructure/             # Prometheus, Grafana, Floci configs
docs/                       # Architecture, guides

Technology Stack

Backend: Python 3.11+, FastAPI, PydanticAI, IBM Docling, Microsoft Presidio. The AI model backend is pluggable: Anthropic direct or AWS Bedrock (Claude Haiku) today, with other providers planned.

Infrastructure (local dev): Docker, Redis, AWS S3 (via Floci, a lightweight MIT-licensed emulator), Prometheus + Grafana

Monitoring: Prometheus, Grafana, Jaeger (OpenTelemetry)

Pipeline Viewer: React 18, TypeScript, Vite, ShadCN/Radix, Tailwind CSS

Testing: pytest, pytest-asyncio, pytest-xdist, pytest-cov, testcontainers

Services (Development)

| Service | Address | Purpose |
| --- | --- | --- |
| API Gateway | http://localhost:8080 | Main application, API docs, pipeline viewer |
| Redis | localhost:6379 | Job state, queues, rate limiting |
| Floci | localhost:4566 | S3 + CloudWatch emulation (replaces LocalStack) |
| Prometheus | http://localhost:9090 | Metrics collection |
| Grafana | http://localhost:3001 | Metrics dashboards (admin/admin) |
| Jaeger | http://localhost:16686 | Distributed tracing |
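
As a quick way to confirm that the development services are listening on the ports listed above, a small TCP check can help. This is an illustrative sketch, not a replacement for `make health`: the service keys and helper names are invented here, and an open port only shows that something is accepting connections, not that the service is healthy.

```python
import socket

# Hosts and ports copied from the services table; the keys are informal labels.
LOCAL_SERVICES = {
    "api": ("localhost", 8080),
    "redis": ("localhost", 6379),
    "floci": ("localhost", 4566),
    "prometheus": ("localhost", 9090),
    "grafana": ("localhost", 3001),
    "jaeger": ("localhost", 16686),
}


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services=LOCAL_SERVICES) -> dict[str, bool]:
    """Map each service label to whether its port accepts connections."""
    return {name: is_listening(host, port) for name, (host, port) in services.items()}
```

Calling `check_services()` after `make dev` returns a dictionary of booleans, which makes it easy to spot a container that failed to start.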

Documentation

Docs live under docs/, grouped by what you're trying to do:

| Mode | What's there |
| --- | --- |
| Tutorials | Guided walkthroughs: convert your first PDF, add your first agent |
| How-to | Task recipes: set up dev environment, run tests, iterate on a prompt, add a new agent, add S3 operations, test rate limits, debug a CI failure, add features |
| Reference | Authoritative lookups: pipeline phases, model tiers, authentication, rate limits, CI workflows |
| Explanation | Design rationale: architecture, authentication design, rate limiting, S3 resilience, testing strategy |

Also:

  • CONTRIBUTING.md: how to submit a change
  • AGENTS.md: orientation file for AI and human agents working on the repo (symlinked as CLAUDE.md)
  • API reference: runtime Swagger at http://localhost:8080/docs locally, or https://reflow.equalify.uic.edu/docs for the deployed instance