# RTLLMOffsetBased (Direct offset proposal via prompted LLM)

## Idea
RTLLMOffsetBased is a downstream LLM engine that asks an instruction-tuned model to output explicit character offsets for segmentation boundaries.
Instead of operating over a base unit (sentences/clauses) and merging labels, this engine treats segmentation as a direct structured prediction problem: the model returns a list of (start, end) spans (relative to a chunk), which are then stitched into a full-trace segmentation.
This approach is useful when you want:
- segmentation that is not constrained to sentence/clause units, and
- direct compatibility with the library’s canonical output representation (character offsets).
## Method (high-level)

1. **Chunk the trace.** The trace is processed in sliding chunks of size `chunk_size` to stay within model context limits.
2. **Prompt the LLM for offsets.** For each chunk, the engine calls the LLM with:
   - a `system_prompt`,
   - a user message containing `prompt + chunk_text`.

   The model is expected to output a JSON list containing spans, e.g.:

   ```json
   [[0, 120], [120, 260], [260, 415]]
   ```

   Offsets are interpreted as character indices within the chunk.
3. **Robust parsing.** The engine attempts to parse the model output by:
   - stripping the output,
   - extracting the substring between the first `[` and the last `]`,
   - loading JSON via `json.loads`.

   If the model returns a dict, its values are used as segments. If a single `[start, end]` pair is returned, it is wrapped into a list.
4. **Stitch segments into global offsets.** Each local `(a, b)` span is converted into a global offset `(i + a, i + b)`, where `i` is the chunk's starting position in the full trace.
5. **Progress via last end offset.** After each chunk, the engine advances `i` to the end of the last predicted segment: `i = all_segments[-1][1]`.
6. **Finish.** If any remainder of the trace is left, a final segment `(i, len(trace))` is appended.
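The robust-parsing step above can be sketched as follows. This is a minimal illustration of the described strategy, not the library's actual code, and the function name `parse_offsets` is ours:

```python
import json

def parse_offsets(raw: str) -> list[list[int]]:
    """Parse a model response into a list of [start, end] spans.

    Mirrors the strategy described above: strip the output, take the
    substring between the first '[' and the last ']', then json.loads it.
    """
    text = raw.strip()
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end == -1:
        raise ValueError("no JSON list found in model output")
    data = json.loads(text[start : end + 1])

    if isinstance(data, dict):
        # Dict response: use the values as segments.
        data = list(data.values())
    if len(data) == 2 and all(isinstance(x, int) for x in data):
        # A single [start, end] pair: wrap it into a list of spans.
        data = [data]
    return data
```

For example, `parse_offsets("Sure! [[0, 120], [120, 260]]")` recovers the two spans even though the model added prose around the JSON.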
Output:

- `offsets`: list of global `(start, end)` character spans
- `labels`: currently `"UNK"` for each segment
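Putting the steps together, the chunk-and-stitch loop can be sketched like this. It is a simplified illustration with a pluggable model call (`call_llm` is a placeholder we introduce, not a library function):

```python
def segment_trace(trace: str, chunk_size: int, call_llm) -> tuple[list, list]:
    """Sketch of the chunk/stitch loop described above.

    call_llm(chunk_text) must return local [start, end] spans
    (character indices within the chunk).
    """
    all_segments = []
    i = 0
    while i < len(trace):
        chunk = trace[i : i + chunk_size]
        local_spans = call_llm(chunk)
        if not local_spans:
            break
        # Convert local (a, b) spans to global offsets (i + a, i + b).
        all_segments.extend((i + a, i + b) for a, b in local_spans)
        # Advance to the end of the last predicted segment.
        i = all_segments[-1][1]
    # Append any remainder as a final segment.
    if i < len(trace):
        all_segments.append((i, len(trace)))
    labels = ["UNK"] * len(all_segments)
    return all_segments, labels
```

Note that advancing `i` to the last predicted end means the next chunk re-reads any tail the model did not segment, so unsegmented text is never silently dropped.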
## Models used

This engine uses instruction-tuned causal LMs via:

- `AutoModelForCausalLM`
- `AutoTokenizer`

Supported model identifiers in the code:

- `Qwen/Qwen2.5-7B-Instruct-1M`
- `Qwen/Qwen2.5-7B-Instruct`
- `mistralai/Mixtral-8x7B-Instruct-v0.1`
Implementation notes:

- Prompts are formatted using `tokenizer.apply_chat_template(...)`.
- Generation uses `max_new_tokens=8000` for long structured outputs.
- Inference uses `device_map="auto"` and `torch_dtype="auto"`.
Segmentation behavior is highly prompt-dependent; for reproducibility, report both the model name and the prompt templates used in experiments.
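The prompt formatting follows the standard Hugging Face chat-message convention. A minimal sketch of how a chunk becomes chat messages; the actual generation call is shown only as comments because it requires downloading a model, and `build_messages` is our own illustrative helper:

```python
def build_messages(system_prompt: str, prompt: str, chunk_text: str) -> list[dict]:
    """Build the chat messages described above: a system message plus
    a user message containing prompt + chunk_text."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt + chunk_text},
    ]

# With a real tokenizer/model (requires `transformers` and a model download):
# from transformers import AutoModelForCausalLM, AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
# model = AutoModelForCausalLM.from_pretrained(
#     "Qwen/Qwen2.5-7B-Instruct", device_map="auto", torch_dtype="auto"
# )
# text = tokenizer.apply_chat_template(
#     build_messages(system_prompt, "", chunk), tokenize=False,
#     add_generation_prompt=True,
# )
# out = model.generate(**tokenizer(text, return_tensors="pt"),
#                      max_new_tokens=8000)
```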
## Key parameters

- `chunk_size: int`: chunk length in characters. Controls granularity and the context available to the model.
- `prompt: str`: prefix prompt inserted before the chunk text. In the current implementation, this parameter is passed into `_segment`, but `_trace_pass` is called with an empty prompt (`""`). If you intend to use `prompt`, ensure it is forwarded correctly (see Notes).
- `system_prompt: str`: system instruction. Should enforce JSON-only output and specify the offset conventions.
- `max_retries_per_chunk: int` (default: `10`): present in the signature but not currently used in the implementation.
- `margin: int` (default: `200`): present in the signature but not currently used in the implementation (often used for overlap/carryover).
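To make the two windowing parameters concrete, here is a minimal sketch of fixed-size chunking, assuming `margin` would be used as inter-chunk overlap (the docs note it is currently unused); `iter_chunks` is our own illustration, not library code:

```python
def iter_chunks(trace: str, chunk_size: int, margin: int = 0):
    """Yield (start, chunk_text) windows of at most chunk_size characters.

    With margin > 0, each window steps forward by chunk_size - margin,
    so consecutive chunks share `margin` characters of context.
    """
    step = max(1, chunk_size - margin)
    for start in range(0, len(trace), step):
        yield start, trace[start : start + chunk_size]
```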
## Usage

```python
from rt_seg import RTSeg
from rt_seg import RTLLMOffsetBased

trace = "..."

system_prompt = (
    "You are a segmentation assistant. "
    "Return only JSON: a list of [start, end] character offsets for coherent segments."
)

segmentor = RTSeg(engines=RTLLMOffsetBased)

offsets, labels = segmentor(
    trace,
    chunk_size=4000,
    system_prompt=system_prompt,
    prompt="",  # see note below
    model_name="Qwen/Qwen2.5-7B-Instruct",
)

segments = [trace[s:e] for s, e in offsets]
```
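Because the offsets come from a generative model, it is worth validating them before slicing the trace. A small sanity check one might run on the returned offsets (a hypothetical helper, not part of the library):

```python
def check_offsets(offsets, trace_len: int) -> None:
    """Assert that spans are in-bounds, ordered, and non-overlapping."""
    prev_end = 0
    for start, end in offsets:
        assert 0 <= start < end <= trace_len, f"out-of-bounds span {(start, end)}"
        assert start >= prev_end, f"overlapping span {(start, end)}"
        prev_end = end
```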