Replaying LLM Sessions for Iterative AI Agent Improvement

Experimenting with prompts in isolation limits your understanding. To truly grasp how a prompt change impacts an entire session, you need to apply changes directly to real user interactions. Replaying LLM sessions with Helicone unlocks this capability, providing insights unattainable through isolated testing.

Why is this powerful?

Authentic Context: By leveraging actual production data, you see how changes affect real user experiences.
Unveiling Hidden Effects: Discover unintended consequences that only emerge over full sessions.
Accelerated Iteration: Automate testing with real inputs, streamlining your optimization process.

Helicone lets you replay any complex session, a capability no other platform offers. Because the platform is flexible, more mature product teams often build bespoke solutions on top of Helicone to store, aggregate, and analyze their AI workflows, improving performance with real user data without reinventing the wheel.

In this guide, we’ll demonstrate how to leverage Helicone to replay LLM sessions. You’ll learn how to set up an initial session, query session data, and replay sessions with modifications. We’ll also share tips on customizing this approach for your unique needs.

Overview of the Replay Process With Helicone

The process of replaying LLM sessions with Helicone involves three main steps:

Setting Up the Initial Session: Instrument your LLM calls to include Helicone session metadata so that they can be tracked and logged.
Querying Helicone for Session Data: Use Helicone’s API to retrieve the logs of past sessions that you want to replay.
Replaying the Session with Modifications: Programmatically modify the retrieved session data as needed and send requests to the LLM to observe the effects.

Let’s explore each of these steps in detail by following an example.

Example: AI Debate Application

We’ll walk through an example of a debate session between a user and an assistant. Between each argument, an impartial assistant scores the argument from 1 to 10.

Step 1: Setting Up the Initial Session

Before you can replay sessions, you need to log them properly in Helicone. By adding only 3 headers to your LLM API requests, you can tag and group them into sessions.

Instrumenting Your LLM Calls

Configure the OpenAI API client with Helicone

Set up the session headers, configure the OpenAI API client with Helicone, and include the necessary headers when making a request.

const { OpenAI } = require("openai");
const { randomUUID } = require("crypto");

// Generate unique session identifiers
const sessionId = randomUUID();
const sessionName = "AI Debate";
const sessionPath = "/debate/climate-change";

// Initialize OpenAI client with Helicone baseURL and auth header
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "https://oai.helicone.ai/v1",
  defaultHeaders: {
    "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
  },
});

// Include Helicone session headers when making requests
// (`conversation` is the message array defined in the next step)
const response = await openai.chat.completions.create(
  {
    model: "gpt-4o-mini",
    messages: conversation,
  },
  {
    headers: {
      "Helicone-Session-Id": sessionId,
      "Helicone-Session-Name": sessionName,
      "Helicone-Session-Path": sessionPath,
    },
  }
);

Initialize the conversation with the assistant

const topic = "The impact of climate change on global economies";

const conversation = [
  {
    role: "system",
    content:
      "You're a debating professional. You're engaging in a structured debate with the user. Each of you will present arguments for or against the topic. Keep responses concise and to the point.",
  },
  {
    role: "assistant",
    content: `Welcome to our debate! Today's topic is: "${topic}". I will argue in favor, and you will argue against. Please present your opening argument.`,
  },
];

Loop through the debate turns

let turn = 1;

while (turn <= MAX_TURNS) {
  // Get user's argument
  const userArgument = await promptUser("Your argument: ");
  conversation.push({ role: "user", content: userArgument });

  // Score the user's argument
  await evaluateArgument(
    userArgument,
    "Your Argument",
    sessionId,
    sessionName,
    sessionPath
  );

  // Assistant responds with a counter-argument
  const assistantResponse = await generateAssistantResponse(
    conversation,
    sessionId,
    sessionName,
    sessionPath
  );
  conversation.push(assistantResponse);

  // Score the assistant's argument
  await evaluateArgument(
    assistantResponse.content,
    "Assistant's Argument",
    sessionId,
    sessionName,
    sessionPath
  );

  turn++;
}

Note: promptUser, evaluateArgument, and generateAssistantResponse are helper functions that handle user input, argument evaluation, and generating assistant responses, respectively. MAX_TURNS is a constant you define to cap the number of debate rounds.
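
The loop above relies on helpers whose bodies are not shown in this guide. Here is a minimal sketch of how two of them might look. The factory `createDebateHelpers`, the `sessionHeaders` helper, and the judge's system prompt are assumptions made for illustration; the prompt IDs match the ones used later in Step 3.

```javascript
// Sketch only: the helper bodies below are assumptions, not the guide's actual code.

// Build the Helicone headers that tag one request as part of a session.
function sessionHeaders(sessionId, sessionName, sessionPath, promptId) {
  return {
    "Helicone-Session-Id": sessionId,
    "Helicone-Session-Name": sessionName,
    "Helicone-Session-Path": sessionPath,
    "Helicone-Prompt-Id": promptId,
  };
}

// `client` is an OpenAI client configured with the Helicone baseURL.
function createDebateHelpers(client, model = "gpt-4o-mini") {
  async function generateAssistantResponse(conversation, sessionId, sessionName, sessionPath) {
    const completion = await client.chat.completions.create(
      { model, messages: conversation },
      { headers: sessionHeaders(sessionId, sessionName, sessionPath, "assistant-argument") }
    );
    // Return a plain message object that can be pushed onto the conversation.
    return { role: "assistant", content: completion.choices[0].message.content };
  }

  async function evaluateArgument(argument, label, sessionId, sessionName, sessionPath) {
    const completion = await client.chat.completions.create(
      {
        model,
        messages: [
          // Hypothetical judge prompt; adjust the wording to your use case.
          { role: "system", content: "You are an impartial judge. Score the argument from 1 to 10 and explain briefly." },
          { role: "user", content: argument },
        ],
      },
      { headers: sessionHeaders(sessionId, sessionName, sessionPath, "argument-evaluation") }
    );
    const feedback = completion.choices[0].message.content;
    console.log(`${label} evaluation: ${feedback}`);
    return feedback;
  }

  return { generateAssistantResponse, evaluateArgument };
}
```

Calling `createDebateHelpers(openai)` once returns both functions with the same argument order the loop uses, so the client does not need to be passed on every turn.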

After setting up and running your session through Helicone, you can view it in Helicone:

Read more about how to implement Helicone sessions here.

Step 2: Querying the Session Data from Helicone

Use Helicone’s request query API to retrieve every request that was logged under the session ID you want to replay.

const response = await fetch("https://api.helicone.ai/v1/request/query", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${HELICONE_API_KEY}`,
  },
  body: JSON.stringify({
    filter: {
      properties: {
        "Helicone-Session-Id": {
          equals: SESSION_ID_TO_REPLAY,
        },
      },
    },
  }),
});
const data = await response.json();
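
The query above does not check whether the call succeeded. Here is a minimal sketch that wraps the same request with basic error handling. The function name `fetchSessionRequests` and the injectable `fetchImpl` parameter are assumptions added for illustration; the endpoint, filter shape, and `data` field are the ones this guide already uses.

```javascript
// Sketch: the same query as above, wrapped with basic error handling.
// `fetchImpl` is injectable so the function can be exercised without network access.
async function fetchSessionRequests(sessionId, apiKey, fetchImpl = fetch) {
  const response = await fetchImpl("https://api.helicone.ai/v1/request/query", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      filter: {
        properties: {
          "Helicone-Session-Id": { equals: sessionId },
        },
      },
    }),
  });

  if (!response.ok) {
    throw new Error(`Helicone query failed with status ${response.status}`);
  }

  const payload = await response.json();
  // Step 3 reads the request list from `data.data`, so return that list directly.
  const requests = Array.isArray(payload.data) ? payload.data : [];
  if (requests.length === 0) {
    // An empty result usually means a wrong session ID, so fail loudly.
    throw new Error(`No requests found for session ${sessionId}`);
  }
  return requests;
}
```

Failing early on an empty result avoids replaying nothing and mistaking it for a successful run.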

Step 3: Processing and Modifying the Session Data

Now that you have the session data, you’ll need to process it.

Parse and sort the requests

Sorting session data can be complex because each use case is unique and may require custom logic. For our debate session example, we simply sort the requests by their created_at timestamp.

const requests = data.data.map((request) => ({
  created_at: request.request_created_at,
  session: request.request_properties["Helicone-Session-Id"],
  signed_body_url: request.signed_body_url,
  request_path: request.request_path,
  path: request.request_properties["Helicone-Session-Path"],
  prompt_id: request.request_properties["Helicone-Prompt-Id"],
  body: request.body,
}));
requests.sort((a, b) => new Date(a.created_at) - new Date(b.created_at));
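
As one example of the custom logic mentioned above, two requests can share the same timestamp, which leaves their relative order undefined. The sketch below breaks ties on the session path. The tie-breaking rule and the names `compareRequests` and `sortForReplay` are assumptions chosen for illustration; pick whatever rule matches your own session structure.

```javascript
// Sketch: order by timestamp first, then by session path when timestamps tie.
function compareRequests(a, b) {
  const timeDiff = new Date(a.created_at).getTime() - new Date(b.created_at).getTime();
  if (timeDiff !== 0) return timeDiff;
  // Hypothetical tie-breaker: alphabetical order of the session path.
  return String(a.path ?? "").localeCompare(String(b.path ?? ""));
}

// Sort a copy so the original array from the API response stays untouched.
function sortForReplay(requests) {
  return [...requests].sort(compareRequests);
}
```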

Modify the request bodies as needed

For example, we can adjust the system prompts to change the assistant's argument or the argument evaluation response.

function modifyRequestBody(request) {
  if (request.prompt_id === "argument-evaluation") {
    const systemMessage = request.body.messages.find(
      (msg) => msg.role === "system"
    );
    if (systemMessage) {
      systemMessage.content += " Keep the feedback short and concise.";
    }
  } else if (request.prompt_id === "assistant-argument") {
    const systemMessage = request.body.messages.find(
      (msg) => msg.role === "system"
    );
    if (systemMessage) {
      systemMessage.content +=
        " Take the persona of a genius in this field when responding.";
    }
  }
  return request;
}
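
Note that modifyRequestBody changes the request object in place, so the original data is lost after the call. If you want to keep the original around for comparison, a non-mutating variant could look like the sketch below. The lookup table `SYSTEM_PROMPT_SUFFIXES` and the function name are assumptions for illustration, and `structuredClone` requires Node.js 17 or later.

```javascript
// Hypothetical lookup table: prompt ID -> text appended to the system prompt.
const SYSTEM_PROMPT_SUFFIXES = new Map([
  ["argument-evaluation", " Keep the feedback short and concise."],
  ["assistant-argument", " Take the persona of a genius in this field when responding."],
]);

// Return a modified deep copy, leaving the original request untouched.
function withModifiedSystemPrompt(request, suffixes = SYSTEM_PROMPT_SUFFIXES) {
  const suffix = suffixes.get(request.prompt_id);
  if (!suffix) return request; // no change configured for this prompt ID

  const copy = structuredClone(request);
  const systemMessage = copy.body?.messages?.find((msg) => msg.role === "system");
  if (systemMessage) {
    systemMessage.content += suffix;
  }
  return copy;
}
```

Keeping the modifications in a table also makes it easy to try a different set of changes per replay by passing another Map as the second argument.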

Replay the modified session

// Create a new session for the replay
const replaySessionId = randomUUID();
for (const request of requests) {
  const modifiedRequest = modifyRequestBody(request);

  // Reuse session metadata from the original request
  await handleChatCompletion(modifiedRequest);
}

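async function handleChatCompletion(modifiedRequest) {
  const { body, path, prompt_id, request_path } = modifiedRequest;

  // Send the modified request to the LLM
  const response = await fetch(request_path, {
    method: "POST",
    headers: {
      // Required so the API parses the JSON body
      "Content-Type": "application/json",
      Authorization: `Bearer ${OPENAI_API_KEY}`,
      "Helicone-Auth": `Bearer ${HELICONE_API_KEY}`,
      // Log under the new replay session ID, reusing the original name, path, and prompt ID
      "Helicone-Session-Id": replaySessionId,
      "Helicone-Session-Name": sessionName,
      "Helicone-Session-Path": path,
      "Helicone-Prompt-Id": prompt_id,
    },
    body: JSON.stringify(body),
  });

  // Surface failed requests instead of silently ignoring them
  if (!response.ok) {
    throw new Error(`Replay request failed with status ${response.status}`);
  }
  return response.json();
}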

Note: The handleChatCompletion function sends each modified request to the LLM. Because it reuses the session name, session path, prompt ID, and request path from the original requests but sets a new session ID, Helicone logs the replay as a separate session with the same name and structure as the original. This makes it easier to compare the two sessions and analyze the effects of your modifications.
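
The replay loop above stops at the first failed request. Here is a sketch of a variant that records the outcome of every request instead. The function name `replaySession` and the shape of the result objects are assumptions for illustration; `modify` and `send` stand in for modifyRequestBody and handleChatCompletion from this guide.

```javascript
// Sketch: replay requests one at a time and record the outcome of each.
async function replaySession(requests, modify, send) {
  const results = [];
  for (const request of requests) {
    const modified = modify(request);
    const entry = { path: modified.path, prompt_id: modified.prompt_id };
    try {
      // Await each call so replayed requests are sent in the original order.
      const output = await send(modified);
      results.push({ ...entry, ok: true, output });
    } catch (error) {
      // Record the failure and continue, so one bad request does not hide the rest.
      results.push({ ...entry, ok: false, error: error.message });
    }
  }
  return results;
}
```

A call such as `await replaySession(requests, modifyRequestBody, handleChatCompletion)` then returns one entry per request, which you can scan for failures before opening the replay in Helicone.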

After running the replay, you can view it in Helicone:

With the replayed session now visible in Helicone, you can observe how the modifications impact the AI’s responses throughout the session. This visualization shows how changes in prompts or configurations affect subsequent interactions.

Optional: Evaluations & Prompt Versioning

Evaluations

To assess the impact of your changes quantitatively, use Helicone’s evaluation features to assign scores to both the original and replayed sessions. Comparing these scores helps you understand the effects of your modifications and refine your prompts more effectively.
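
Once you have scores for both sessions, the comparison itself can be simple. The sketch below assumes you have already collected the scores as plain arrays of numbers, for example the 1 to 10 ratings from the debate's impartial assistant. How you store and retrieve those scores is not shown here, and the function names are hypothetical.

```javascript
// Average of a list of numeric scores; NaN when the list is empty.
function mean(values) {
  if (values.length === 0) return NaN;
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}

// Compare average scores of the original and replayed sessions.
// A positive delta means the replay scored higher on average.
function compareSessionScores(originalScores, replayScores) {
  const original = mean(originalScores);
  const replay = mean(replayScores);
  return { original, replay, delta: replay - original };
}
```

For instance, `compareSessionScores([6, 7, 8], [7, 8, 9])` returns `{ original: 7, replay: 8, delta: 1 }`.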

Prompt Versioning

Helicone’s prompt versioning feature allows you to manage and compare different versions of your prompts effectively. By maintaining multiple versions, you can test various combinations within your sessions to identify which prompts yield the best results. To retrieve specific prompt versions, use Helicone’s Prompt API.

Here’s how you can iterate over prompt versions to test different compositions in your sessions:

Define a list of prompt versions for each prompt.
For each combination in the cartesian product of prompt versions:
    Generate a new session ID.
    For each prompt in the combination:
        Fetch the prompt content using Helicone's Prompt API.
    Run the session using the retrieved prompts.
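
The steps above can be sketched in code as follows. Because this guide does not show the Prompt API call, `fetchPrompt` and `runSession` are hypothetical functions that you supply: the first returns the prompt content for a given prompt ID and version, and the second runs one session with the retrieved prompts.

```javascript
const { randomUUID } = require("crypto");

// Build every combination of versions, one version per prompt ID.
// Input:  { promptA: [1, 2], promptB: [1, 2, 3] }
// Output: six objects such as { promptA: 1, promptB: 1 }
function cartesianProduct(versionsByPrompt) {
  return Object.keys(versionsByPrompt).reduce(
    (combos, promptId) =>
      combos.flatMap((combo) =>
        versionsByPrompt[promptId].map((version) => ({ ...combo, [promptId]: version }))
      ),
    [{}]
  );
}

// Run one session per combination, each under a fresh session ID.
// `fetchPrompt` and `runSession` are hypothetical callbacks supplied by the caller.
async function runAllCombinations(versionsByPrompt, fetchPrompt, runSession) {
  const sessionIds = [];
  for (const combo of cartesianProduct(versionsByPrompt)) {
    const sessionId = randomUUID();
    const prompts = {};
    for (const [promptId, version] of Object.entries(combo)) {
      prompts[promptId] = await fetchPrompt(promptId, version);
    }
    await runSession(sessionId, prompts);
    sessionIds.push(sessionId);
  }
  return sessionIds;
}
```

The number of sessions grows multiplicatively with the number of versions per prompt, so it is worth keeping each version list short.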

Alternatively, as described above, you can manually modify the prompts after retrieving the session, and Helicone will automatically log the new prompt versions for you. In this approach, you create the prompt versions first.

Conclusion

By replaying and modifying LLM sessions with Helicone, you gain deeper insights into how changes affect the entire workflow. This method provides context-rich, real-world data that leads to more effective optimizations and a comprehensive understanding of your AI’s behavior.

Helicone Plug 🧊

Helicone is the ultimate open-source LLM observability platform built for developers to monitor, debug, and improve their LLM applications. We’re fully open source and built to handle high-usage use cases seamlessly. With over 1.8 billion requests and 1.6 trillion tokens logged, it’s proven at scale. Check out our open-source GitHub repository here.

Trusted by Industry Leaders: Companies like Sunrun, Michelin, QAWolf, Slate Magazine, and many more rely on Helicone to enhance their AI models and workflows.

You can view the full code used in this guide on GitHub.

Time: 15 minute read

Created: September 26, 2024

Author: Cole Gottdank