Evaluation & Metrics

RT-SEG represents segmentations as character offsets: a list of (start, end) spans that partition a trace string. Evaluation therefore reduces to comparing boundary positions (segment ends) and/or segment spans.

This page documents the metrics implemented in RT-SEG’s evaluation utilities, how to interpret them, and how to report results.


Terminology (offsets → boundaries)

Given offsets:

  • [(s_0, e_0), (s_1, e_1), ..., (s_k, e_k)] with s_0 = 0 and e_k = len(trace)

The boundary indices are the end offsets of all but the last segment:

  • B = { e_0, e_1, ..., e_{k-1} }

Most boundary metrics operate on these boundary index sets.
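
As a concrete sketch (plain Python, independent of any RT-SEG API):

offsets = [(0, 17), (17, 33), (33, 51)]   # toy spans partitioning a 51-char trace

# Boundary indices: the end offsets of all but the last segment.
boundaries = {end for (_, end) in offsets[:-1]}
print(boundaries)  # {17, 33}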


Primary metric used in experiments: Boundary Similarity (B-score)

RT-SEG’s main “agreement-style” score is a lenient boundary similarity that treats near-miss boundaries as matches.

Definition (as implemented)

Let:

  • B1 = boundary indices of segmentation 1
  • B2 = boundary indices of segmentation 2
  • window = jitter tolerance in characters (± window)

Directional match count:

  • A boundary b ∈ B1 is a match if ∃ b' ∈ B2 such that |b - b'| ≤ window.

Compute match counts in both directions:

  • m1 = matches(B1 → B2)
  • m2 = matches(B2 → B1)

Convert to precision/recall style:

  • p = m1 / |B1|
  • r = m2 / |B2|

Final score is the harmonic mean:

  • B = 2pr / (p + r) (or 0 if p + r = 0)
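
Putting the pieces together, a minimal sketch of this definition in plain Python (the handling of empty boundary sets here is an assumption, not documented RT-SEG behavior):

def boundary_similarity(b1, b2, window):
    # A boundary matches if any counterpart lies within +/- window chars.
    def matches(src, tgt):
        return sum(1 for b in src if any(abs(b - t) <= window for t in tgt))
    if not b1 or not b2:
        return 1.0 if not b1 and not b2 else 0.0
    p = matches(b1, b2) / len(b1)
    r = matches(b2, b1) / len(b2)
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

print(boundary_similarity({34, 62}, {34}, window=10))  # p=0.5, r=1.0 -> ~0.667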

Interpretation

  • Range: 0.0 … 1.0 (higher is better)
  • What it rewards: boundaries that land “close enough” to the gold boundary (within window chars)
  • What it penalizes: missing boundaries, or boundaries far away (outside the jitter window)

Choosing window

window is in characters, not tokens or sentences.

  • Small window (e.g. 3–5): strict; good when offsets are stable and annotation is precise.
  • Larger window (e.g. 10–30): tolerant; good when offsets drift due to formatting, tokenization artifacts, or sentence splitting ambiguity.

Gold-based metric groups (single-trace)

RT-SEG’s evaluation entry point groups metrics into categories. The most useful gold-based groups are:

1) Classical segmentation metrics (character-label based)

These treat segmentation as a sequence labeling problem over characters.

  • P_k (lower is better)
    The probability that two positions a distance k apart are incorrectly judged to be in the same segment (or in different segments).
  • WindowDiff (lower is better)
    Penalizes positions where the two segmentations disagree on the number of boundaries inside a sliding window of width k.

Notes:

  • k is chosen automatically as ~ half the average segment length (computed from gold labels).
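
For reference, a from-scratch sketch of P_k over boundary index sets (illustrative only; RT-SEG's internal implementation may differ in edge handling):

def p_k(gold_bounds, pred_bounds, n, k):
    # Positions i and i+k lie in the same segment iff no boundary index b
    # satisfies i < b <= i+k (boundaries are segment end offsets).
    def same_segment(bounds, i, j):
        return not any(i < b <= j for b in bounds)
    disagreements = sum(
        same_segment(gold_bounds, i, i + k) != same_segment(pred_bounds, i, i + k)
        for i in range(n - k)
    )
    return disagreements / (n - k)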

2) Boundary accuracy (strict vs tolerant)

  • Boundary_F1 (higher is better)
    Exact-match boundary precision/recall/F1 using boundary index sets.
  • Soft_Boundary_F1 (higher is better)
    Distance-weighted boundary F1 using exponential decay with tolerance parameter sigma:

    exp(-distance / sigma)

  • Boundary_Displacement (lower is better)
    Mean absolute distance from each gold boundary to the nearest predicted boundary.

Parameters:

  • sigma (characters): larger = more forgiving for near-misses.
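
The sketch below shows one way to compute the soft F1 and the displacement from these definitions; matching each boundary to its nearest counterpart is an assumption here, not a statement about RT-SEG's exact implementation:

import math

def soft_boundary_f1(gold, pred, sigma):
    # Each boundary contributes exp(-d / sigma), where d is the distance
    # to its nearest counterpart.
    def soft_matches(src, tgt):
        return sum(math.exp(-min(abs(b - t) for t in tgt) / sigma) for b in src)
    if not gold or not pred:
        return 1.0 if not gold and not pred else 0.0
    p = soft_matches(pred, gold) / len(pred)
    r = soft_matches(gold, pred) / len(gold)
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

def boundary_displacement(gold, pred):
    # Mean absolute distance from each gold boundary to the nearest
    # prediction (assumes both boundary sets are non-empty).
    return sum(min(abs(g - q) for q in pred) for g in gold) / len(gold)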

3) Segment structure overlap

These compare segment spans, not only boundaries.

  • Mean_IoU (higher is better)
    For each gold segment, find the best matching predicted segment and compute Intersection-over-Union, then average.
  • Mean_Dice (higher is better)
    Similar but using Dice coefficient.
  • Segmentation_Bias
    (num_pred_segments - num_gold_segments) / max(1, num_gold_segments)

Interpretation:

  • Bias > 0 → over-segmentation (too many segments)
  • Bias < 0 → under-segmentation (too few segments)
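
A minimal sketch of these span-overlap computations (illustrative; RT-SEG's matching strategy may differ):

def span_iou(a, b):
    # Intersection-over-union of two (start, end) character spans.
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(gold_segments, pred_segments):
    # Best-matching predicted segment per gold segment, then average.
    return sum(max(span_iou(g, p) for p in pred_segments)
               for g in gold_segments) / len(gold_segments)

def segmentation_bias(gold_segments, pred_segments):
    return (len(pred_segments) - len(gold_segments)) / max(1, len(gold_segments))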

Gold-free diagnostics (pairwise agreement)

When you don’t have a gold segmentation, RT-SEG supports pairwise comparisons between methods/annotators:

  • Boundary_Similarity (the B-score above) with window
  • Boundary_Density_JSD
    Jensen–Shannon divergence between boundary-position histograms across the trace (lower = more similar distribution).

This is useful for:

  • annotator agreement studies
  • comparing two engines without claiming either is “gold”
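
A sketch of the density comparison described above (the bin count here is an assumption; RT-SEG may histogram positions differently):

import math

def boundary_density_jsd(b1, b2, trace_len, bins=10):
    # Histogram boundary positions into equal-width bins, normalize,
    # then compute the Jensen-Shannon divergence (base-2, range 0..1).
    def hist(bounds):
        counts = [0] * bins
        for b in bounds:
            counts[min(b * bins // trace_len, bins - 1)] += 1
        total = sum(counts) or 1
        return [c / total for c in counts]
    p, q = hist(b1), hist(b2)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)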

“Optimistic” boundary cover (lenient diagnostic)

Boundary_Cover measures how well one segmentation’s boundaries are “covered” by another’s within a given slack.

  • slack (characters) is the tolerance.
  • It is optimistic in the sense that it does not directly penalize extra boundaries in the covering segmentation.

This is best used as a diagnostic, not the headline metric.
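
A plain-Python sketch (the directionality and empty-set handling here are assumptions):

def boundary_cover(covered, covering, slack):
    # Fraction of `covered` boundaries with a `covering` boundary within
    # +/- slack chars; extra boundaries in `covering` are never penalized.
    if not covered:
        return 1.0
    hits = sum(1 for b in covered if any(abs(b - c) <= slack for c in covering))
    return hits / len(covered)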


Minimal example: evaluate one method vs gold (single trace)

from rt_seg import evaluate_segmentations

trace = "Step 1: Get data. Data is [1, 2]. Step 2: Sum data. Sum is 3. Step 3: Square it. Result is 9."

segmentations = {
    "Gold": [(0, 31), (31, 59), (59, 84)],
    "MethodA": [(0, 31), (31, 84)],
}

tables = evaluate_segmentations(
    trace=trace,
    segmentations=segmentations,
    gold_key="Gold",
    sigma=5.0,     # soft boundary tolerance (chars)
    window=10,     # boundary similarity jitter (chars)
    slack=10,      # boundary cover jitter (chars)
)

for group, df in tables.items():
    print(f"\n=== {group} ===")
    print(df)

What you get:

  • a dict of DataFrames keyed by metric group (e.g. classical, boundary_accuracy, segment_structure, agreement, optimistic)

Minimal example: evaluate across a dataset (aggregate)

Use this when you have multiple traces and want means/stds per method:

from rt_seg import evaluate_aggregate_segmentations, aggregated_results_to_json

traces = [
    "Trace one ...",
    "Trace two ...",
]

segmentations_per_trace = [
    {
        "Gold": [(0, 10), (10, 20)],
        "MethodA": [(0, 12), (12, 20)],
    },
    {
        "Gold": [(0, 5), (5, 15)],
        "MethodA": [(0, 6), (6, 15)],
    },
]

agg = evaluate_aggregate_segmentations(
    traces=traces,
    segmentations=segmentations_per_trace,
    gold_key="Gold",
    sigma=5.0,
    window=10,
    slack=10,
)

# Optional: JSON-serializable structure for saving
payload = aggregated_results_to_json(agg)
print(payload.keys())  # linear_metrics, pairwise_agreement_metrics, per_method_agreement_metrics

Reporting results

1) Always report metric hyperparameters

Because window, sigma, and slack directly control tolerance, include them in every table caption or JSON header:

  • window (Boundary Similarity jitter, chars)
  • sigma (Soft boundary decay scale, chars)
  • slack (Boundary cover jitter, chars)

2) Report a compact “headline + diagnostics” set

A practical default bundle:

Headline (pick 1–2):

  • Boundary_Similarity (with specified window)
  • Soft_Boundary_F1 (with specified sigma)

Diagnostics:

  • Boundary_Displacement (lower is better)
  • Segmentation_Bias (sign tells over/under segmentation)

3) Include runtime separately

If you track per-trace processing time (e.g. ptime), report:

  • mean runtime per trace
  • optionally p50/p95 if distributions are heavy-tailed (LLM engines often are)
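
For example, with the standard library (ptimes is a hypothetical list of per-trace seconds):

import statistics

ptimes = [0.8, 1.1, 0.9, 4.2, 1.0]
cuts = statistics.quantiles(ptimes, n=100)  # 99 percentile cut points
print(f"mean={statistics.mean(ptimes):.2f}s  "
      f"p50={cuts[49]:.2f}s  p95={cuts[94]:.2f}s")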

4) Save results as JSON

Store:

  • method identifier (engine(s) + aligner + base unit)
  • metric means/stds
  • evaluation hyperparameters (window, sigma, slack)
  • dataset identifier / split name

This makes runs reproducible and comparable across machines.
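
A minimal sketch of such a record (the field names are illustrative, not a schema RT-SEG prescribes; agg is the aggregate result from the example above):

import json

record = {
    "method": "engine+aligner+base-unit identifier",
    "dataset": "dataset-name/split",
    "eval_hyperparams": {"window": 10, "sigma": 5.0, "slack": 10},
    "results": aggregated_results_to_json(agg),
}

with open("results.json", "w") as f:
    json.dump(record, f, indent=2)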

