"""System prompt and user message builder for the heading reconciliation agent."""

HEADING_RECONCILIATION_SYSTEM_PROMPT = """\
You are a document heading hierarchy expert. You receive ALL heading candidates \
found across every page of a document by a per-page structure analysis pass. \
Each candidate has a recommended level, page number, and reasoning from the \
per-page analysis.

Your job is to review the full set of headings at once and assign globally \
consistent heading levels. The per-page analysis can only see one page at a \
time plus the accumulated outline, so it may make errors that are only visible \
when you see the complete picture.

## Rules for heading levels

1. **h1 is reserved for the document title only.** There should be at most \
one h1 in the entire document (zero if the document has no clear title).

2. **Numbering scheme determines hierarchy:**
   - Top-level numbered sections (1, 2, 3, I, II, III) → h2
   - Sub-sections (1.1, 2.1, A, B) → h3
   - Sub-sub-sections (1.1.1, 2.1.1) → h4
   - Deeper levels follow the same pattern

3. **No level skipping:** Headings must follow sequential levels: h2, then h3, then h4. \
When the per-page analysis assigned h4 to a section with no h3 parent, \
adjust the level to maintain the hierarchy (typically promote to h3, or \
demote the parent to create proper nesting).

4. **Sibling headings at the same depth must share the same level.** If \
"1. Introduction" is h2, then "2. Methods" must also be h2.

5. **Unnumbered headings: use neighbors and context to determine level.** \
Unnumbered headings do NOT automatically get a high-level assignment. You \
must determine their level from surrounding context:

   a. **Look at neighbors, not just numbering.** An unnumbered heading that \
appears between siblings of a certain level is most likely at that same \
level. For example, if you see:
      - 3.2 AI Models (h3)
      - Layout Analysis Model (h4)
      - Table Structure Recognition (h4)
      - **OCR** ← unnumbered
      - 3.3 Assembly (h3)
   Then "OCR" is logically a sibling of the other h4 items under 3.2, NOT \
a new h3 section. It should be h4.

   b. **An unnumbered heading that appears inside a numbered sub-section \
(between the sub-section heading and the next numbered sibling) belongs to \
that sub-section.** It should be at the same level as other children of that \
sub-section, or one level deeper than the sub-section heading.

   c. **Top-level unnumbered headings** (appearing outside any numbered \
section or between top-level numbered sections) are typically h2:
      - "Abstract", "References", "Appendix", "Acknowledgments"

   d. **When ambiguous, prefer the deeper (larger number) level.** It is \
worse to promote a sub-topic to a higher level than to keep it nested. A \
wrongly promoted heading breaks the document's logical structure.

6. **Preserve correct assignments.** Only change a heading's level when you \
have strong evidence it is wrong. Most per-page assignments will be correct.

## Output

Return the corrected outline as a list of OutlineEntry objects. Each entry \
must have: level (1-6), text (heading text), page (page number).

Provide reasoning that explains any changes you made and why the global \
hierarchy is now consistent.
"""

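
# Illustrative sketch (not part of the original module): one way rule 2 of the
# prompt above could be expressed in code. The helper name and the regular
# expression are assumptions made for this example. Only dotted decimal
# numbering ("1", "2.1", "1.1.1") is handled; Roman numerals, lettered
# sections, and titles that merely start with a number (such as a year) need
# the wider context that the agent sees.
import re
from typing import Optional

_DOTTED_NUMBER = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s")


def level_from_dotted_numbering(heading_text: str) -> Optional[int]:
    """Return the heading level implied by a dotted number prefix, if any.

    A single number maps to h2, each extra dotted component goes one level
    deeper, and the result is capped at 6 (the deepest level the output
    format allows). Unnumbered headings return None.
    """
    match = _DOTTED_NUMBER.match(heading_text)
    if match is None:
        return None
    depth = match.group(1).count(".") + 1
    return min(depth + 1, 6)
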

def build_heading_reconciliation_message(
    heading_candidates: list[dict],
    total_pages: int,
) -> str:
    """Build the user message for the heading reconciliation agent.

    Args:
        heading_candidates: List of dicts with keys: text, page,
            recommended_level (falling back to level if absent), reasoning.
        total_pages: Total number of pages in the document.

    Returns:
        User message string.
    """
    parts: list[str] = []

    parts.append(f"## Document: {total_pages} pages, {len(heading_candidates)} headings found\n")

    if not heading_candidates:
        parts.append("No headings were found by the per-page analysis.")
        parts.append("Return an empty outline with reasoning explaining why.")
        return "\n".join(parts)

    parts.append("## All heading candidates (in document order)\n")
    parts.append("| # | Page | Current Level | Text | Per-Page Reasoning |")
    parts.append("|---|------|---------------|------|-------------------|")

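    for i, h in enumerate(heading_candidates, 1):
        # Prefer the per-page recommendation; fall back to a plain "level" key.
        level = h.get("recommended_level", h.get("level", "?"))
        # Flatten newlines and escape pipes so a cell cannot break the table.
        text = str(h.get("text") or "").replace("\n", " ").replace("|", "\\|")
        page = h.get("page", "?")
        # A missing or None reasoning is treated as an empty string.
        reasoning = str(h.get("reasoning") or "").replace("\n", " ")
        # Truncate long reasoning for the table
        if len(reasoning) > 120:
            reasoning = reasoning[:117] + "..."
        reasoning = reasoning.replace("|", "\\|")
        parts.append(f"| {i} | {page} | h{level} | {text} | {reasoning} |")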

    parts.append("")
    parts.append(
        "Review all headings above. Assign globally consistent levels following "
        "the rules in your instructions. Return the corrected outline."
    )

    return "\n".join(parts)
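

# Illustrative sketch (not part of the original module): a possible post-check
# for the "no level skipping" rule on the outline the agent returns. It assumes
# each entry is a dict with an integer "level" key, mirroring the OutlineEntry
# fields named in the prompt; the function name is hypothetical. The first
# entry is never flagged because it has no predecessor to compare against.
def find_level_skips(outline: list[dict]) -> list[int]:
    """Return indexes of entries that jump more than one level deeper than
    the entry before them (for example h2 followed directly by h4)."""
    skips: list[int] = []
    previous_level = None
    for index, entry in enumerate(outline):
        level = entry["level"]
        if previous_level is not None and level > previous_level + 1:
            skips.append(index)
        previous_level = level
    return skips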