RTLLMSegUnitBased (LLM segmentation over sentence/clause indices)
Idea
RTLLMSegUnitBased is a downstream LLM engine that performs segmentation over precomputed base units (sentences or clauses), rather than over raw character offsets.
The trace is first decomposed into base units using SegBase.get_base_offsets(...). The LLM then receives a JSON-encoded list/dict of these units for a local chunk and is asked to return segments as lists of unit indices. These index-level segments are finally converted back into character offsets.
Compared to RTLLMOffsetBased, this approach:
- constrains the model to choose boundaries only at base-unit boundaries (often improving robustness), and
- avoids brittle character-level offset prediction.
Method (high-level)
- Base segmentation Compute base offsets:
offsets = SegBase.get_base_offsets(trace, seg_base_unit=...)and extract base strings:strace = [trace[s:e] for (s,e) in offsets]
-
Chunk the base-unit list Process base units in chunks of size
chunk_size(interpreted here as number of base units, not characters). -
Prompt the LLM with base units Each chunk is encoded as JSON, mapping local indices to text:
```json {“0”: “…”, “1”: “…”, “2”: “…”}
The engine calls the LLM with:
- a
system_prompt - a user message containing
prompt + base_chunk_input
The model is expected to return a JSON list of segments expressed as lists of indices, e.g.:
[[0, 1, 2], [3, 4], [5, 6, 7]]
-
Robust parsing and retries The engine attempts to parse the substring between the first
[and last], thenjson.loads(...). If parsing fails, it retries up tomax_retry. -
Stitch segments into global unit indices Local segment indices are shifted by the current chunk offset
i:seg_global = [s + i for s in seg_local]
-
Advance to next chunk The engine advances
iusing the last returned segment:check_seg = [s + i for s in local_segments[-1]]i = min(check_seg)
To avoid duplicating context near chunk boundaries, it may drop the last predicted segment (
del all_segments[-1]) and re-run with adjusted chunk sizing. -
Convert unit-index segments to character offsets For each global unit-index segment
seg = [j_1, ..., j_k]:- left boundary =
offsets[j_1][0] - right boundary =
offsets[j_k][1]
Finally, offsets are “corrected” to ensure non-overlapping spans by snapping each segment’s end to the next segment’s start.
- left boundary =
Output:
corrected_final_offsets: list of character spanslabels: currently"UNK"for all segments
Models used
This engine uses instruction-tuned causal LMs via:
AutoModelForCausalLMAutoTokenizer
Supported model identifiers in the code:
Qwen/Qwen2.5-7B-Instruct-1MQwen/Qwen2.5-7B-Instructmistralai/Mixtral-8x7B-Instruct-v0.1
Implementation notes:
- Prompts are formatted using
tokenizer.apply_chat_template(...). - Generation uses
max_new_tokens=8000. - Uses
device_map="auto"andtorch_dtype="auto".
As with other downstream prompting engines, segmentation is highly prompt- and model-dependent; report both for reproducibility.
Key parameters
-
seg_base_unit: Literal["sent", "clause"]Base unit used for the index space. -
chunk_size: intNumber of base units included in each LLM call. Larger chunk sizes provide more context but increase generation length and JSON complexity. -
system_prompt: strShould enforce strict JSON output and define the expected index-segmentation format. -
prompt: strPrefix prepended to the JSON chunk input. In the current implementation_trace_passis called withprompt=""(see Notes). -
max_retry: int(default:30) Maximum retries when output parsing fails.
Usage
from rt_seg import RTSeg
from rt_seg import RTLLMSegUnitBased
trace = "..."
system_prompt = (
"You are a segmentation assistant. "
"Input is a JSON dict mapping indices to text spans. "
"Return only JSON: a list of segments, each segment is a list of integer indices, "
"covering the chunk in order."
)
segmentor = RTSeg(engines=RTLLMSegUnitBased, seg_base_unit="sent")
offsets, labels = segmentor(
trace,
seg_base_unit="sent",
chunk_size=25,
system_prompt=system_prompt,
prompt="", # see note below
model_name="Qwen/Qwen2.5-7B-Instruct",
max_retry=30,
)
segments = [trace[s:e] for s, e in offsets]