Time-travel debugging: fork and fix any failing span

TracePilot AI lets you go back to the exact moment your agent failed, change the input, and see a new result — without redeploying your code or re-running the entire agent from scratch. This is called time-travel debugging, and it works through the Fork & Rerun feature in the TracePilot dashboard.

Why traditional debugging falls short

When an AI agent fails, the usual approach is to add logging, redeploy, reproduce the failure, and hope you guessed the right inputs. For multi-step agents that make dozens of LLM calls, this cycle can take hours — and you may still miss the exact context that caused the bad output. TracePilot takes a different approach: because every span captures the exact input, output, and context at each step, you can jump directly to the failing step and experiment with fixes before touching your code.

Traditional debugging

Add logs → redeploy → reproduce → wait → repeat. Hours per cycle.

Time-travel debugging

Open dashboard → find failing span → fork → edit → rerun. Seconds per cycle.

How Fork & Rerun works

Every span in a trace captures its full input at the time of execution. Fork & Rerun uses that captured input as a starting point. When you fork a span, TracePilot replays that single operation with whatever input you provide — isolated from the rest of the trace. This means you can test a prompt fix, a different tool argument, or a corrected message history without modifying your agent code or running the full execution pipeline.
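The idea can be sketched in a few lines. This is a conceptual model only, assuming a span is just a captured input plus an output: the `CapturedSpan` type, the `forkAndRerun` function, and the `-fork` ID suffix are hypothetical names for illustration, not part of the TracePilot SDK.

```typescript
// Conceptual model only: these names are illustrative, not TracePilot's API.
interface CapturedSpan<I, O> {
  spanId: string;
  input: I;  // exact input recorded at execution time
  output: O; // what that run produced
}

// Replays one operation with an edited copy of the captured input.
// The original span is left untouched and no other step of the trace runs.
async function forkAndRerun<I, O>(
  span: CapturedSpan<I, O>,
  edit: (input: I) => I,
  operation: (input: I) => Promise<O>
): Promise<CapturedSpan<I, O>> {
  // Deep-copy via JSON so edits cannot leak into the original span.
  // This assumes the input is JSON-serializable, as a message array is.
  const copy = JSON.parse(JSON.stringify(span.input)) as I;
  const editedInput = edit(copy);
  const output = await operation(editedInput);
  return { spanId: `${span.spanId}-fork`, input: editedInput, output };
}
```

The point of the design is isolation: the fork receives a copy of the recorded input, so the original span stays available for comparison after the rerun.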

Building a debuggable agent

To get the most from time-travel debugging, your agent needs to link spans into a tree using parentSpanId and stepOrder. This is what lets the dashboard surface the exact failing step and give you precise control over what to fork.

import { TracePilot } from 'tracepilot-sdk';
import OpenAI from 'openai';

const tp = new TracePilot('tp_live_YOUR_KEY');
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function researchAgent(query: string) {
  await tp.startTrace('research-agent');

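  // Typed explicitly so TypeScript does not widen `role` to a plain string
  const messages: OpenAI.Chat.ChatCompletionMessageParam[] = [{ role: 'user', content: query }];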

  // Step 1 — initial reasoning (root span)
  const { result: plan, spanId: planSpanId } = await tp.wrapOpenAI(
    () => openai.chat.completions.create({ model: 'gpt-4o', messages }),
    messages
  );

  // Step 2 — tool call, child of step 1
  const { result: searchResult, spanId: searchSpanId } = await tp.wrapToolCall(
    'web-search',
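    // webSearch is your own search function (not shown here).
    // message.content can be null, so fall back to the original query.
    () => webSearch(plan.choices[0].message.content ?? query),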
    planSpanId, // parent span
    2           // step order
  );

  // Step 3 — final synthesis, child of step 2
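  const followUp = [
    ...messages,
    plan.choices[0].message,
    // Step 1 requested no tool_calls, so the API would reject a 'tool' role
    // message here; pass the search results back as a user message instead.
    { role: 'user' as const, content: `Search results:\n${JSON.stringify(searchResult)}` }
  ];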

  const { result: answer } = await tp.wrapOpenAI(
    () => openai.chat.completions.create({ model: 'gpt-4o', messages: followUp }),
    followUp,
    searchSpanId, // parent span
    3             // step order
  );

  return answer.choices[0].message.content;
}

The parentSpanId links each span to its parent. The stepOrder numbers the steps. Together, they give the dashboard a complete picture of the execution and let you navigate to any step by its position in the tree.

Always set stepOrder in sequence (1, 2, 3…) to keep the execution tree readable in the dashboard. If two spans share the same parent, use distinct step numbers to make them easy to distinguish.
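To see why both fields matter, here is a minimal sketch of how a flat list of spans could be assembled into an ordered tree. It is an illustration, not the dashboard's implementation: only `parentSpanId` and `stepOrder` come from the text above, while `SpanRecord`, `buildSpanTree`, and `renderTree` are hypothetical names.

```typescript
// Hypothetical span shape; only parentSpanId and stepOrder are taken from the docs.
interface SpanRecord {
  spanId: string;
  name: string;
  parentSpanId?: string;
  stepOrder: number;
}

interface SpanNode extends SpanRecord {
  children: SpanNode[];
}

// Groups spans under their parents and sorts siblings by stepOrder.
// Spans with no parent, or with an unknown parent, become roots.
function buildSpanTree(spans: SpanRecord[]): SpanNode[] {
  const nodes = new Map<string, SpanNode>();
  for (const span of spans) {
    nodes.set(span.spanId, { ...span, children: [] });
  }
  const roots: SpanNode[] = [];
  nodes.forEach((node) => {
    const parent = node.parentSpanId ? nodes.get(node.parentSpanId) : undefined;
    (parent ? parent.children : roots).push(node);
  });
  const sortByStep = (list: SpanNode[]): void => {
    list.sort((a, b) => a.stepOrder - b.stepOrder);
    list.forEach((n) => sortByStep(n.children));
  };
  sortByStep(roots);
  return roots;
}

// Renders the tree as indented lines, one per span.
function renderTree(nodes: SpanNode[], depth = 0): string[] {
  const lines: string[] = [];
  for (const node of nodes) {
    lines.push(`${'  '.repeat(depth)}${node.stepOrder}. ${node.name}`);
    lines.push(...renderTree(node.children, depth + 1));
  }
  return lines;
}
```

Applied to the three spans of the research agent, in whatever order they arrive, `renderTree(buildSpanTree(spans))` yields `1. plan`, then `2. web-search` indented one level, then `3. synthesis` indented two levels. Without `parentSpanId` the nesting is lost, and without `stepOrder` siblings have no stable order.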

Debugging a failing step

Suppose the agent above returns a poor answer at step 3. The web search returned relevant results, but the synthesis prompt was too vague. Here is the Fork & Rerun workflow:

Open the dashboard

Go to tracepilotai.com/dashboard and find the trace for the failing run. Traces are listed by agent name and timestamp.

Find the failing span

Expand the trace tree. Locate step 3 — the synthesis span. If the call threw an error, TracePilot marks it automatically. If the output was wrong but not an error, click the span to inspect its input and output.

Fork & Rerun

Click Fork & Rerun on the span. The dashboard opens an editor pre-filled with the exact messages that were sent to the model at that step.

Edit the input

Modify the messages — tighten the system prompt, correct the tool output, or add context that was missing. You’re editing the input that TracePilot captured from the live run.

See the new output

Click Run. TracePilot executes that single span with your edited input and shows you the new result immediately. No redeployment. No waiting.

Once you’ve confirmed the fix produces the right output, update your agent code to match and redeploy at your own pace.
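As an illustration of carrying a confirmed fix back into code, the sketch below moves the tightened prompt into a small helper that builds the step 3 input. The helper name `buildSynthesisMessages`, the `ChatMessage` type, and the prompt wording are all hypothetical; use whatever input you validated in the dashboard.

```typescript
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string };

// Hypothetical prompt text: replace it with the wording you confirmed in Fork & Rerun.
const SYNTHESIS_PROMPT =
  'Answer the question using only the search results provided. ' +
  'Cite the result you relied on for each claim.';

// Builds the synthesis input: system prompt first, then the conversation so far,
// then the search results as the final message.
function buildSynthesisMessages(history: ChatMessage[], searchResult: unknown): ChatMessage[] {
  return [
    { role: 'system', content: SYNTHESIS_PROMPT },
    ...history,
    { role: 'user', content: `Search results:\n${JSON.stringify(searchResult)}` },
  ];
}
```

In the research agent above, the returned array would take the place of `followUp` in step 3. Keeping the prompt in one place also makes the next fork easier to compare against the code.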

Fork & Rerun executes a real LLM call using your API key. Token usage from a forked run is tracked as a separate span in the dashboard, and it counts against your OpenAI usage like any other call.

What gets captured per span

Every span that TracePilot records contains enough context to reproduce the call exactly:

The full message array (for LLM spans) or arguments (for tool spans)
The model name and any parameters passed to the completion
The response, including all choices
Token counts and latency
Any error thrown during execution

This completeness is what makes forking reliable. You’re not reconstructing state — you’re replaying it.
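One way to picture the captured data is as a typed record. The shape below mirrors the list above for an LLM span, but the field names are assumptions made for illustration, not TracePilot's actual schema; `toReplayRequest` is likewise a hypothetical helper showing how a replay could be assembled from the captured fields.

```typescript
type ChatMessage = { role: string; content: string };

// Hypothetical record shape; field names are illustrative, not TracePilot's schema.
interface CapturedLlmSpan {
  messages: ChatMessage[];              // full message array
  model: string;                        // model name
  params: { [name: string]: unknown };  // other parameters passed to the completion
  choices: ChatMessage[];               // the response, including all choices
  promptTokens: number;
  completionTokens: number;
  latencyMs: number;
  error?: string;                       // set only if the call threw
}

type ReplayRequest = { model: string; messages: ChatMessage[]; [name: string]: unknown };

// Builds the request for a replay: same model and parameters as the original call,
// with either the captured messages or an edited set.
function toReplayRequest(
  span: CapturedLlmSpan,
  editedMessages: ChatMessage[] = span.messages
): ReplayRequest {
  return { ...span.params, model: span.model, messages: editedMessages };
}
```

Because the model and parameters travel with the messages, an edited replay changes only what you chose to edit.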
