"""System prompt and helper for the image description subagent.

This subagent generates WCAG 2.1-compliant alt text for figures extracted
from PDF documents. It is spawned by the section correction agent's
``describe_image`` tool.
"""

from __future__ import annotations

IMAGE_DESCRIBER_SYSTEM_PROMPT = """\
You are an accessibility specialist. Your task is to write alt text for \
images extracted from PDF course materials, following WCAG 2.1 guidelines.

## Classification Decision Tree

Before writing alt text, you MUST classify the image into one of three \
categories by working through these steps in order.

### Step 1: Is this image decorative?

An image is DECORATIVE if removing it would NOT change the meaning of the \
content. Set ``image_category="decorative"``, ``is_decorative=True``, and \
leave ``alt_text`` empty.

Examples of decorative images:
- Horizontal rule or divider lines between sections
- Background textures or gradient fills
- University logo repeated on every page header/footer
- Decorative borders or corner ornaments
- Generic stock photography used as visual filler (e.g., smiling students \
on a cover page with no reference in the text)
- Bullet point icons or list marker graphics

### Step 2: Is this a simple informative image?

A SIMPLE INFORMATIVE image conveys a single concept that can be fully \
described in one sentence. Set ``image_category="informative"``.

Examples:
- A photograph of a specific organism in a biology textbook
  → "Eastern box turtle on forest floor showing distinctive orange and \
brown shell pattern"
- A diagram showing a single structure with labeled parts
  → "Cross-section of human eye labeling cornea, iris, lens, retina, \
and optic nerve"
- A map highlighting a specific region
  → "Map of Illinois with Cook County highlighted in red"
- A screenshot of a software interface being discussed
  → "Excel formula bar showing =VLOOKUP(A2,Sheet2!A:B,2,FALSE) with \
result 42 in cell B2"

### Step 3: Is this a complex informative image?

A COMPLEX INFORMATIVE image contains data, relationships, or multi-step \
processes that cannot be fully conveyed in a short alt text. Set \
``image_category="complex_informative"``.

Examples:
- Bar chart → "Bar chart showing enrollment by department, 2020-2024. \
Engineering grew from 450 to 680 students while Humanities declined \
from 320 to 210"
- Flowchart → "Decision flowchart for student financial aid eligibility \
with 6 decision points starting from FAFSA submission"
- Data table rendered as image → "Grade distribution table for CHEM 101 \
showing 15% A, 25% B, 35% C, 15% D, 10% F across 3 semesters"
- Organizational chart → "Department hierarchy showing Dean of \
Engineering overseeing 4 department chairs, each with 3-5 faculty"

For complex images: summarize the KEY TAKEAWAY (what conclusion should \
the reader draw?). Note in ``reasoning`` if a long description or data \
table equivalent would be beneficial.

## Alt Text Rules

1. **Describe what the image conveys**, not what it is. Focus on the \
information the reader needs.
2. **Length** — Aim for ~125-150 characters for simple informative images, \
up to ~250 for complex informative images. Never exceed 300 characters.
3. **Phrasing** — Do NOT start with "Image of", "Picture of", "Photo of", \
or "Screenshot of". Start directly with the content.
4. **Context** — Use the surrounding text and caption to inform the alt \
text. The alt text should complement (not duplicate) the caption.
5. **Confidence** — Set ``confidence`` to "high" when the image content is \
clear, "medium" when partially obscured or ambiguous, "low" when the image \
is unreadable or the purpose is unclear.
"""
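

# --- Illustrative sketch, not part of the original module -------------------
# The Alt Text Rules in the prompt above could also be checked mechanically on
# the describer's output. The helper below is a minimal sketch under stated
# assumptions: the name ``check_alt_text`` and its signature are hypothetical,
# and the limits simply mirror the numbers in the prompt (300 characters
# maximum, no "Image of"-style openers, empty alt text for decorative images).
_BANNED_OPENERS = ("image of", "picture of", "photo of", "screenshot of")
_MAX_ALT_TEXT_CHARS = 300


def check_alt_text(alt_text: str, image_category: str) -> list[str]:
    """Return a list of rule violations (empty when the alt text passes)."""
    problems = []
    text = alt_text.strip()
    if image_category == "decorative":
        # Step 1 of the prompt: decorative images carry no alt text.
        if text:
            problems.append("decorative image should have empty alt_text")
        return problems
    if not text:
        # Assumption: informative images are expected to have alt text.
        problems.append("informative image needs non-empty alt_text")
    if len(text) > _MAX_ALT_TEXT_CHARS:
        problems.append(f"alt_text exceeds {_MAX_ALT_TEXT_CHARS} characters")
    if text.lower().startswith(_BANNED_OPENERS):
        problems.append("alt_text starts with a banned opener")
    return problems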


def build_describer_user_message(
    *,
    caption: str,
    surrounding_text: str,
    ref_id: str,
    fallback_mode: bool = False,
) -> str:
    """Build the text portion of the user message for the describer agent.

    The caller is responsible for prepending the actual image binary content
    (figure image and/or page image) before this text block.

    Args:
        caption: The figure caption extracted by Docling (may be empty).
        surrounding_text: A snippet of markdown around the figure reference.
        ref_id: The figure reference ID (e.g. "figure-1.png").
        fallback_mode: When True, the agent received a full page image
            instead of a cropped figure. Extra guidance is added to help
            the agent locate the correct image on the page.

    Returns:
        Formatted user message string.
    """
    parts = [f"## Figure: {ref_id}\n"]
    if caption:
        parts.append(f"**Caption:** {caption}\n")
    if surrounding_text:
        parts.append(f"**Surrounding text:**\n```\n{surrounding_text}\n```\n")
    if fallback_mode:
        parts.append(
            "**NOTE:** The cropped figure image was not available. You are "
            "seeing the full page image instead. Locate the figure referenced "
            f"as '{ref_id}' using the caption and surrounding text above, "
            "then describe ONLY that figure — ignore all other images on the "
            "page.\n"
        )
    parts.append(
        "First classify this image (decorative, informative, or "
        "complex_informative), then write WCAG-compliant alt text. "
        "If the image is purely decorative, set is_decorative=True and "
        "leave alt_text empty."
    )
    return "\n".join(parts)
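

# --- Illustrative sketch, not part of the original module -------------------
# ``surrounding_text`` is wrapped in a three-backtick fence above, so a snippet
# that itself contains a code fence could close the block early. One possible
# mitigation, assuming the describer model reads CommonMark-style fences, is to
# pick a fence longer than any backtick run in the text. ``fence_for`` is a
# hypothetical helper name, not an existing function in this repository.
def fence_for(text: str, minimum: int = 3) -> str:
    """Return a backtick fence longer than every backtick run in ``text``."""
    longest = 0
    current = 0
    for char in text:
        # Count consecutive backticks; any other character resets the run.
        current = current + 1 if char == "`" else 0
        longest = max(longest, current)
    return "`" * max(minimum, longest + 1)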