# Run the benchmark
Equalify Reflow ships with a reproducible benchmark harness: `scripts/batch_run.py` submits a corpus of PDFs to a running API, polls each job to completion (auto-approving PII findings), and writes per-document results plus an aggregate summary to disk. The published benchmark corpora live under [`docs/reference/benchmarks/`](../reference/benchmarks/).
Use this to:
- Reproduce a published pilot against the current pipeline
- Regression-check pipeline changes before a release
- Compare two releases against the same corpus (before/after)
- Build a fixture set from an ad-hoc directory of PDFs
## Prerequisites
- A running Equalify Reflow API. Use `make dev` locally, or point at a deployed instance (staging, production).
- An API key with submit access, exported as `BATCH_API_KEY`.
- The PDF corpus on local disk (see [Reproducing a published corpus](#reproducing-a-published-corpus) below for why PDFs are not checked in).
## Two input modes
### Manifest mode: reproducible corpora
Manifests are plain-text files listing one PDF path per line (comments start with `#`, paths resolve relative to the manifest). This is the mode to use for any published benchmark.
```bash
uv run scripts/batch_run.py \
--manifest docs/reference/benchmarks/v0.1.0-beta.6-pilot/manifest.txt
```
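
To check a manifest before submitting anything, the format described above can be parsed in a few lines. This is a minimal sketch, not the parser `batch_run.py` uses: the function names `read_manifest` and `missing_pdfs` are hypothetical, and it assumes that blank lines are ignored and that only whole-line `#` comments occur.

```python
from pathlib import Path


def read_manifest(manifest_path):
    """Return the PDF paths listed in a manifest, resolved against its directory."""
    manifest = Path(manifest_path)
    paths = []
    for raw in manifest.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Assumption: blank lines and whole-line '#' comments are skipped.
        if not line or line.startswith("#"):
            continue
        # Entries resolve relative to the manifest file, not the working directory.
        paths.append(manifest.parent / line)
    return paths


def missing_pdfs(manifest_path):
    """List manifest entries that do not exist on disk."""
    return [p for p in read_manifest(manifest_path) if not p.is_file()]
```

Calling `missing_pdfs(...)` on a manifest path and getting back an empty list suggests every listed PDF is in place.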
### Directory mode: ad-hoc runs
For one-off regression checks against a local folder of PDFs:
```bash
uv run scripts/batch_run.py --pdf-dir ~/some-pdfs/
```
## Common flags
| Flag | Purpose | Default |
|---|---|---|
| `--manifest PATH` | Read PDF list from a manifest file | none |
| `--pdf-dir PATH` | Glob `*.pdf` from a directory | none |
| `--output PATH` | Where to write results | `batch-results/<UTC timestamp>` |
| `--api-url URL` | API base URL | `$BATCH_API_URL` or `http://localhost:8080` |
| `--concurrency N` | Parallel jobs | `$BATCH_CONCURRENCY` or `2` |

`BATCH_API_KEY` is required and read only from the environment, never from a flag, so keys don't end up in shell history.
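
The environment-variable defaults in the table can be mirrored in a small pre-flight check. The sketch below is illustrative only: `load_batch_config` is a hypothetical helper, not part of `batch_run.py`, and it ignores command-line flags, which take the place of the environment values when given.

```python
import os


def load_batch_config(env=None):
    """Resolve benchmark settings from the environment, using the documented defaults."""
    env = os.environ if env is None else env
    api_key = env.get("BATCH_API_KEY")
    if not api_key:
        # The key has no default and no flag, so fail early with a clear message.
        raise SystemExit("BATCH_API_KEY is not set; export it before running the benchmark.")
    return {
        "api_key": api_key,
        "api_url": env.get("BATCH_API_URL", "http://localhost:8080"),
        "concurrency": int(env.get("BATCH_CONCURRENCY", "2")),
    }
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.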
## What you get back
```
<output-dir>/
├── summary.json             # aggregate: counts, total cost, timings, per-doc results
├── <doc-label>/
│   ├── metadata.json        # full pipeline metadata (job ID, phases, ledger, llm_cost)
│   ├── result.md            # converted markdown
│   └── figures/
│       └── figure-1.png, ...  # extracted figures
```
`summary.json` is the canonical benchmark artifact: total docs, completed, failed, total cost/tokens, and a `results[]` array with per-document `cost`, `tokens`, `elapsed`, `pages`, and `edits`.
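
As a rough illustration of reading this artifact, the sketch below recomputes a few aggregates from the `results[]` array. It relies only on the `results`, `cost`, and `pages` names mentioned above and assumes both per-document values are numeric; the `summarize` function and the keys of its return value are hypothetical.

```python
import json
from pathlib import Path


def summarize(summary_path):
    """Recompute simple aggregates from the per-document results array."""
    summary = json.loads(Path(summary_path).read_text(encoding="utf-8"))
    results = summary["results"]
    # Treat a missing or null value as zero so one incomplete entry does not abort the sum.
    total_cost = sum(r.get("cost") or 0 for r in results)
    total_pages = sum(r.get("pages") or 0 for r in results)
    return {
        "docs": len(results),
        "total_cost": total_cost,
        "total_pages": total_pages,
        "cost_per_page": total_cost / total_pages if total_pages else None,
    }
```

Cost per page is a convenient normalizer when two corpora differ in size.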
## Reproducing a published corpus
Published corpora (e.g. `v0.1.0-beta.6-pilot/`) ship the manifest but not the PDFs. Several documents in the UIC pilot corpus are third-party copyrighted works (book chapters, newsletters), so redistributing them in a public repo would be inappropriate.
To reproduce a published run:
1. Obtain the corpus separately (UIC team members: ask in the project channel; external reproducers: substitute your own documents matching the type/page-count profile in the manifest).
2. Place the PDFs under `samples/` beside the manifest, using the filenames listed in the manifest.
3. Run `batch_run.py --manifest <path>` as above.
4. Compare `summary.json` against the published one. Differences are expected when model versions or prompts have changed; surfacing them is the point of the benchmark.
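
The comparison in step 4 could be sketched as a per-document cost diff. This is an assumption-heavy sketch: the per-document identifier key is not specified here, so `label` is a hypothetical name passed as a parameter, and `load_results` and `cost_deltas` are illustrative helpers rather than part of the harness.

```python
import json
from pathlib import Path


def load_results(summary_path, key="label"):
    """Index a summary's results by a per-document key (the key name is an assumption)."""
    summary = json.loads(Path(summary_path).read_text(encoding="utf-8"))
    return {r[key]: r for r in summary["results"]}


def cost_deltas(published_path, local_path, key="label"):
    """Return {doc: local cost minus published cost} for documents present in both runs."""
    published = load_results(published_path, key)
    local = load_results(local_path, key)
    return {
        doc: local[doc]["cost"] - published[doc]["cost"]
        for doc in sorted(published.keys() & local.keys())
    }
```

Documents that appear in only one run are left out of the diff, which matters when substituting your own PDFs for the published corpus.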
## Tips
- Start small. Run against a 2–3 PDF manifest before a 30-PDF one to catch auth / connectivity issues early.
- Watch rate limits. The script throttles submissions (5s delay, retries with backoff), but a busy shared API will still return HTTP 429 responses. Lower `--concurrency` if you see sustained 429s.
- GPU cold starts. First submission after an idle period can take minutes; the per-job timeout is 15 min for this reason.
- Costs are real. Each run against a production instance incurs Bedrock (or Anthropic) token charges. The pilot's 30 docs cost ~$13, so plan accordingly.
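
For budgeting, the pilot figure above works out to roughly $0.43 per document. The sketch below extrapolates linearly from that average. Treat it as a back-of-the-envelope estimate only: actual cost depends on page count and content, and `estimate_cost` is a hypothetical helper.

```python
PILOT_DOCS = 30
PILOT_COST_USD = 13.0  # approximate pilot total quoted in the tip above


def estimate_cost(num_docs):
    """Linear extrapolation from the pilot's average per-document cost."""
    per_doc = PILOT_COST_USD / PILOT_DOCS
    return round(per_doc * num_docs, 2)
```

By this estimate, a 3-PDF smoke test costs on the order of $1.30.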
## Related
- [`docs/reference/benchmarks/`](../reference/benchmarks/): published corpora and reports
- [`docs/how-to/iterate-on-a-prompt.md`](iterate-on-a-prompt.md): for targeted prompt-change validation against a single document
- [`docs/how-to/run-tests.md`](run-tests.md): for the unit / integration / e2e test suite (different purpose: code correctness, not conversion quality)