How do I mock the OpenAI API for testing?

Start MockServer, then create an expectation with an httpLlmResponse action: set provider to OPENAI, specify the model, and describe the completion text. Point your application at http://localhost:1080/v1/chat/completions instead of api.openai.com. MockServer returns a byte-for-byte correct OpenAI response including id, created, and object fields that the OpenAI SDK expects.

Can I test an AI agent without paying for LLM API calls?

Yes. MockServer returns deterministic, instant LLM responses with no network calls to the real provider. Create an httpLlmResponse expectation with your chosen provider and completion text, point your agent at MockServer instead of the real API endpoint, and every call costs nothing and returns the same result every run — ideal for CI pipelines.

Does MockServer support streaming (SSE) LLM responses?

Yes. Set streaming to true on the completion object. MockServer splits the text into token-sized chunks and emits them as Server-Sent Events in the provider's native streaming format (SSE for OpenAI, Anthropic, and Gemini; NDJSON for Ollama). Use the streamingPhysics field to control time-to-first-token, tokens-per-second rate, and jitter for realistic timing.

Which LLM providers can MockServer mock?

MockServer supports OPENAI (Chat Completions), OPENAI_RESPONSES (Responses API), ANTHROPIC, GEMINI, BEDROCK (Converse API with binary event-stream framing), AZURE_OPENAI, OLLAMA, COHERE (rerank), and VOYAGE (rerank). Each provider value produces the wire format that the corresponding SDK expects, including correct headers and JSON structure.

How do I mock tool/function calls from an LLM?

Add a toolCalls array to the completion object with each tool call's id, name, and arguments. Set stopReason to tool_use (Anthropic) or tool_calls (OpenAI). For OpenAI-compatible mocks you can also set toolChoice to required to force the mocked finish_reason to tool_calls, matching how the real API behaves when a caller forces a tool invocation.

How do I test multi-turn LLM agent conversations with MockServer?

Use conversationPredicates combined with scenario state. Each expectation specifies predicates such as turnIndex (which assistant turn this is), latestMessageContains (substring match on the last message), or containsToolResultFor (matches when a tool result is present). Set times.remainingTimes to 1 on each turn so the expectations are consumed in order.

Mock LLM APIs (OpenAI, Anthropic, Gemini)

When you test an application that calls an LLM, you face two problems: the real API is slow, costs money, and returns different results every run. MockServer solves both: tell it which provider your application targets and what the completion should contain, and it returns a byte-for-byte correct response — including the full streaming SSE framing — without touching the network.

In practice: point your application at MockServer instead of api.openai.com (or the equivalent endpoint for your provider), create one expectation, and your test gets a deterministic, instant, free LLM response every time.

# Start MockServer
docker run -d --rm -p 1080:1080 mockserver/mockserver

# Create an OpenAI-compatible completion mock
curl -s -X PUT http://localhost:1080/mockserver/expectation \
  -H "Content-Type: application/json" \
  -d '{
    "httpRequest":  { "method": "POST", "path": "/v1/chat/completions" },
    "httpLlmResponse": {
      "provider": "OPENAI",
      "model": "gpt-4o",
      "completion": {
        "text": "MockServer is a free, open-source HTTP mock server.",
        "stopReason": "stop",
        "usage": { "inputTokens": 12, "outputTokens": 9 }
      }
    }
  }'

# Your application can now call http://localhost:1080/v1/chat/completions
# and receive a correctly formatted OpenAI response.

MockServer can mock LLM API responses from any major provider using the httpLlmResponse action. You describe the completion in a provider-neutral format and MockServer encodes it into the correct wire format for the target provider — headers, JSON structure, streaming framing, and all. This lets you test AI-powered applications deterministically, without calling a real LLM.

Basic completion mock
Supported providers
Streaming responses
Tool calls
Embeddings
Rerank
Multimodal request recognition (images and audio)
Multi-turn conversations
Session isolation
Cost budget
Moderation, content filter & refusal
Chaos / fault injection
Realtime voice APIs (OpenAI Realtime, Gemini Live)
Streaming — client examples (all languages)

Basic Completion Mock

Create an expectation with an httpLlmResponse action. The provider tells MockServer which wire format to produce; the completion describes what to return.

The Java client has a typed fluent API (LlmMockBuilder + HttpLlmResponse). All other clients send the expectation as raw JSON using their generic expectation method (mockAnyResponse in JavaScript; a direct PUT to /mockserver/expectation in other languages).

import static org.mockserver.client.LlmMockBuilder.llmMock;
import static org.mockserver.model.Completion.completion;
import static org.mockserver.model.Provider.OPENAI;
import static org.mockserver.model.Usage.usage;

// OpenAI — POST /v1/chat/completions
llmMock("/v1/chat/completions")
    .withProvider(OPENAI)
    .withModel("gpt-4o")
    .respondingWith(
        completion()
            .withText("MockServer is an open-source HTTP mock server.")
            .withStopReason("stop")
            .withUsage(usage().withInputTokens(12).withOutputTokens(8))
    )
    .applyTo(mockServerClient);

var mockServerClient = require('mockserver-client').mockServerClient;

// OpenAI — POST /v1/chat/completions
mockServerClient("localhost", 1080).mockAnyResponse({
    "httpRequest": {
        "method": "POST",
        "path": "/v1/chat/completions"
    },
    "httpLlmResponse": {
        "provider": "OPENAI",
        "model": "gpt-4o",
        "completion": {
            "text": "MockServer is an open-source HTTP mock server.",
            "stopReason": "stop",
            "usage": { "inputTokens": 12, "outputTokens": 8 }
        }
    }
}).then(
    function () { console.log("expectation created"); },
    function (error) { console.log(error); }
);

import requests

# OpenAI — POST /v1/chat/completions
requests.put(
    "http://localhost:1080/mockserver/expectation",
    json={
        "httpRequest": {
            "method": "POST",
            "path": "/v1/chat/completions"
        },
        "httpLlmResponse": {
            "provider": "OPENAI",
            "model": "gpt-4o",
            "completion": {
                "text": "MockServer is an open-source HTTP mock server.",
                "stopReason": "stop",
                "usage": {"inputTokens": 12, "outputTokens": 8}
            }
        }
    }
)

require 'net/http'
require 'json'

# OpenAI — POST /v1/chat/completions
uri = URI('http://localhost:1080/mockserver/expectation')
http = Net::HTTP.new(uri.host, uri.port)
req = Net::HTTP::Put.new(uri.path, 'Content-Type' => 'application/json')
req.body = JSON.generate({
  'httpRequest' => { 'method' => 'POST', 'path' => '/v1/chat/completions' },
  'httpLlmResponse' => {
    'provider' => 'OPENAI',
    'model' => 'gpt-4o',
    'completion' => {
      'text' => 'MockServer is an open-source HTTP mock server.',
      'stopReason' => 'stop',
      'usage' => { 'inputTokens' => 12, 'outputTokens' => 8 }
    }
  }
})
http.request(req)

package main

import (
    "bytes"
    "encoding/json"
    "net/http"
)

// OpenAI — POST /v1/chat/completions
func createLlmExpectation() {
    body, _ := json.Marshal(map[string]interface{}{
        "httpRequest": map[string]interface{}{
            "method": "POST",
            "path":   "/v1/chat/completions",
        },
        "httpLlmResponse": map[string]interface{}{
            "provider": "OPENAI",
            "model":    "gpt-4o",
            "completion": map[string]interface{}{
                "text":       "MockServer is an open-source HTTP mock server.",
                "stopReason": "stop",
                "usage":      map[string]int{"inputTokens": 12, "outputTokens": 8},
            },
        },
    })
    req, _ := http.NewRequest(http.MethodPut,
        "http://localhost:1080/mockserver/expectation", bytes.NewReader(body))
    req.Header.Set("Content-Type", "application/json")
    http.DefaultClient.Do(req)
}

using System.Net.Http;
using System.Text;
using System.Text.Json;

// OpenAI — POST /v1/chat/completions
var expectation = new
{
    httpRequest = new { method = "POST", path = "/v1/chat/completions" },
    httpLlmResponse = new
    {
        provider = "OPENAI",
        model = "gpt-4o",
        completion = new
        {
            text = "MockServer is an open-source HTTP mock server.",
            stopReason = "stop",
            usage = new { inputTokens = 12, outputTokens = 8 }
        }
    }
};
using var client = new HttpClient();
var json = JsonSerializer.Serialize(expectation);
await client.PutAsync(
    "http://localhost:1080/mockserver/expectation",
    new StringContent(json, Encoding.UTF8, "application/json"));

use serde_json::json;

// OpenAI — POST /v1/chat/completions
let client = reqwest::blocking::Client::new();
client.put("http://localhost:1080/mockserver/expectation")
    .json(&json!({
        "httpRequest": {
            "method": "POST",
            "path": "/v1/chat/completions"
        },
        "httpLlmResponse": {
            "provider": "OPENAI",
            "model": "gpt-4o",
            "completion": {
                "text": "MockServer is an open-source HTTP mock server.",
                "stopReason": "stop",
                "usage": { "inputTokens": 12, "outputTokens": 8 }
            }
        }
    }))
    .send()
    .unwrap();

<?php
// OpenAI — POST /v1/chat/completions
$expectation = [
    'httpRequest' => ['method' => 'POST', 'path' => '/v1/chat/completions'],
    'httpLlmResponse' => [
        'provider' => 'OPENAI',
        'model' => 'gpt-4o',
        'completion' => [
            'text' => 'MockServer is an open-source HTTP mock server.',
            'stopReason' => 'stop',
            'usage' => ['inputTokens' => 12, 'outputTokens' => 8],
        ],
    ],
];
$ch = curl_init('http://localhost:1080/mockserver/expectation');
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'PUT');
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($expectation));
curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type: application/json']);
curl_exec($ch);
curl_close($ch);

# OpenAI — POST /v1/chat/completions
curl -v -X PUT "http://localhost:1080/mockserver/expectation" -d '{
  "httpRequest": {
    "method": "POST",
    "path": "/v1/chat/completions"
  },
  "httpLlmResponse": {
    "provider": "OPENAI",
    "model": "gpt-4o",
    "completion": {
      "text": "MockServer is an open-source HTTP mock server.",
      "stopReason": "stop",
      "usage": { "inputTokens": 12, "outputTokens": 8 }
    }
  }
}'

# Anthropic — POST /v1/messages
curl -v -X PUT "http://localhost:1080/mockserver/expectation" -d '{
  "httpRequest": {
    "method": "POST",
    "path": "/v1/messages"
  },
  "httpLlmResponse": {
    "provider": "ANTHROPIC",
    "model": "claude-sonnet-4-20250514",
    "completion": {
      "text": "MockServer is an open-source HTTP mock server.",
      "stopReason": "end_turn",
      "usage": { "inputTokens": 12, "outputTokens": 8 }
    }
  }
}'

When a request matches POST /v1/chat/completions, MockServer returns an OpenAI-formatted JSON response with the specified text, model, stop reason, and token usage — including the correct id, created, and object fields that the OpenAI SDK expects.

Supported Providers

Each provider value produces the correct API response format for that provider's SDK:

Provider	Typical API path	Notes
OPENAI	/v1/chat/completions	Chat Completions API format
OPENAI_RESPONSES	/v1/responses	OpenAI Responses API format
ANTHROPIC	/v1/messages	Anthropic Messages API format
GEMINI	/v1beta/models/{model}:generateContent	Google Gemini format
BEDROCK	/model/{model}/converse	AWS Bedrock Converse API; streaming uses application/vnd.amazon.eventstream binary framing
AZURE_OPENAI	/openai/deployments/{deployment}/chat/completions	Azure-hosted OpenAI format (delegates to OpenAI codec)
OLLAMA	/api/chat	Ollama local model API format
COHERE	/v1/rerank	Cohere rerank API format (rerank only — see Rerank)
VOYAGE	/v1/rerank	Voyage AI rerank API format (rerank only — see Rerank)

Streaming Responses

Set streaming to true on the completion to return a stream instead of a single JSON response. MockServer splits the text into token-sized chunks and sends them as streaming chunks in the provider's native streaming format (SSE for most providers; NDJSON for Ollama).

By default the text is split into fine, subword-sized chunks that closely match how a real provider streams tokens (so a streamed response emits more, smaller deltas). Set subwordStreaming to false in streamingPhysics if you prefer the coarser whole-word chunking (one chunk per word). The timing controls below behave identically in both modes — subword chunking simply produces more deltas, which makes the overall stream take slightly longer at the same tokensPerSecond rate.

Use streamingPhysics to control timing — useful for testing loading indicators, timeouts, and backpressure handling:

{
  "httpRequest": {
    "method": "POST",
    "path": "/v1/chat/completions"
  },
  "httpLlmResponse": {
    "provider": "OPENAI",
    "model": "gpt-4o",
    "completion": {
      "text": "This response is streamed token by token.",
      "streaming": true,
      "streamingPhysics": {
        "timeToFirstToken": {
          "timeUnit": "MILLISECONDS",
          "value": 200
        },
        "tokensPerSecond": 50,
        "jitter": 0.1,
        "seed": 42
      },
      "usage": {
        "inputTokens": 10,
        "outputTokens": 8
      }
    }
  }
}

Field	Description
timeToFirstToken	Delay before the first SSE event is sent
tokensPerSecond	Base token emission rate (1 – 10000)
jitter	Fractional uniform deviation from the base rate (0.0 – 1.0)
seed	PRNG seed for reproducible inter-token timing
subwordStreaming	Whether to stream fine subword-sized deltas (true, the default) or coarser whole-word deltas (false). Subword deltas are closer to a real provider's per-token stream

Tool Calls

To mock an LLM response that invokes tools (function calling), add toolCalls to the completion:

{
  "httpRequest": {
    "method": "POST",
    "path": "/v1/chat/completions"
  },
  "httpLlmResponse": {
    "provider": "OPENAI",
    "model": "gpt-4o",
    "completion": {
      "toolCalls": [
        {
          "id": "call_abc123",
          "name": "get_weather",
          "arguments": "{\"location\": \"London\"}"
        }
      ],
      "stopReason": "tool_use"
    }
  }
}

For OpenAI-compatible mocks you can also set toolChoice on the completion to model the request's tool_choice directive. It accepts auto, none, required, or a named tool. When toolChoice is required and a tool call is configured, the mocked response's finish_reason is forced to tool_calls (for both non-streaming and streaming responses), matching how OpenAI behaves when a caller forces a tool call. Omitting toolChoice leaves the finish reason behaviour unchanged.

{
  "httpLlmResponse": {
    "provider": "OPENAI",
    "model": "gpt-4o",
    "completion": {
      "toolChoice": "required",
      "toolCalls": [
        { "name": "get_weather", "arguments": "{\"location\": \"London\"}" }
      ]
    }
  }
}

Embeddings

Mock an embeddings endpoint by setting embedding instead of completion. MockServer returns a provider-correct embeddings response with a vector of the requested size. Set deterministicFromInput to true so the same input text always produces the same (L2-normalised) vector — ideal for reproducible tests of retrieval / vector-search code.

When deterministicFromInput is true, the vector is also semantically plausible: it is derived from the words and character n-grams in the input text, so texts that share vocabulary get a higher cosine similarity and unrelated texts are near-orthogonal. This means your retrieval / RAG code can rank related documents above unrelated ones offline — for example "the cat sat on the mat" scores far higher against "a cat sits on a mat" than against "quarterly financial report" — with no real embedding model and deterministic results for the same input, seed, and dimensions.

{
  "httpRequest": {
    "method": "POST",
    "path": "/v1/embeddings"
  },
  "httpLlmResponse": {
    "provider": "OPENAI",
    "model": "text-embedding-3-small",
    "embedding": {
      "dimensions": 1536,
      "deterministicFromInput": true,
      "seed": 42
    }
  }
}

Provider	Response shape	Default dimensions
OPENAI / AZURE_OPENAI	{"object":"list","data":[{"embedding":[...]}]}	1536
GEMINI	{"embedding":{"values":[...]}}	768
OLLAMA	{"embeddings":[[...]]} (the /api/embed shape)	768
BEDROCK (Titan)	{"embedding":[...],"inputTextTokenCount":N}	1024
BEDROCK (Cohere)	{"embeddings":[[...]]} (when model starts with cohere)	1024

dimensions overrides the vector length; seed makes the deterministic vector reproducible across runs. ANTHROPIC and OPENAI_RESPONSES have no embeddings endpoint and return an error if used for an embedding mock.

Rerank

Mock a rerank endpoint by setting rerank with the COHERE or VOYAGE provider. MockServer reads the candidate documents from the request body's documents array and returns one result per document — each with its original index and a relevance_score — sorted from most to least relevant. Set deterministicFromInput to true so the same documents always score the same way. Use topN to return only the top few results.

{
  "httpRequest": {
    "method": "POST",
    "path": "/v1/rerank"
  },
  "httpLlmResponse": {
    "provider": "COHERE",
    "rerank": {
      "topN": 3,
      "deterministicFromInput": true,
      "seed": 42
    }
  }
}

Each provider returns its own response envelope around the same per-document scores ({"index": N, "relevance_score": F}), sorted by descending relevance:

Provider	Response shape
COHERE	{"results": [{"index": 1, "relevance_score": 0.9123}, ...]}
VOYAGE	{"object": "list", "data": [{"index": 1, "relevance_score": 0.9123}, ...], "usage": {"total_tokens": 0}}

Documents may be plain strings or objects with a text field (Cohere's structured form) — both are read from the request's documents array.

Multimodal Request Recognition (Images and Audio)

When matching against requests, MockServer recognises image and audio content parts in the decoded conversation, so conversation predicates can assert that a message contained an image or an audio clip (and, where the provider declares it, the media type or audio format). For the OpenAI-compatible format this covers image_url and input_audio content parts. This is request-side recognition only — MockServer notes the presence (and how many) but does not store the media bytes or generate image/audio responses.

Multi-Turn Conversations

For agent testing, you often need to script a sequence of LLM responses that depend on what the agent sent. MockServer supports this through conversation predicates combined with scenario state to create multi-turn conversation flows.

Each turn uses conversationPredicates to match against the conversation history in the request body, and scenario state to track which turn the conversation is on:

[
  {
    "httpRequest": {
      "method": "POST",
      "path": "/v1/chat/completions"
    },
    "httpLlmResponse": {
      "provider": "OPENAI",
      "model": "gpt-4o",
      "completion": {
        "text": "I'll look up the weather for you.",
        "toolCalls": [
          {
            "id": "call_1",
            "name": "get_weather",
            "arguments": "{\"location\": \"London\"}"
          }
        ],
        "stopReason": "tool_use"
      },
      "conversationPredicates": {
        "turnIndex": 0,
        "latestMessageContains": "weather"
      }
    },
    "times": { "remainingTimes": 1 }
  },
  {
    "httpRequest": {
      "method": "POST",
      "path": "/v1/chat/completions"
    },
    "httpLlmResponse": {
      "provider": "OPENAI",
      "model": "gpt-4o",
      "completion": {
        "text": "The weather in London is 18C and sunny.",
        "stopReason": "stop"
      },
      "conversationPredicates": {
        "containsToolResultFor": "get_weather"
      }
    },
    "times": { "remainingTimes": 1 }
  }
]

Conversation predicates

Predicates match against the parsed conversation in the request body (decoded using the provider's message format):

Predicate	Description
turnIndex	Match when the assistant turn count equals this value (0-based)
latestMessageContains	Match when the last message contains this substring
latestMessageMatches	Match when the last message matches this regex pattern
latestMessageRole	Match when the last message has this role (e.g. user, tool)
containsToolResultFor	Match when the conversation contains a tool result for this tool name
semanticMatchAgainst	Opt-in, exploratory: the expected meaning the latest message should express, judged by a runtime LLM. Off by default — ignored unless mockserver.llmSemanticMatchingEnabled is set and a backend resolves. Non-deterministic; for exploration only, never for CI assertions.

Prompt normalisation

Agent prompts are dynamically assembled, so exact-byte matching can be brittle. Add a normalization object to the predicates to apply deterministic transforms before matching:

"conversationPredicates": {
  "latestMessageContains": "search for weather",
  "normalization": {
    "collapseWhitespace": true,
    "lowercase": true,
    "sortJsonKeys": true
  }
}

Session Isolation

When multiple agents or test threads share a MockServer instance, each conversation needs its own independent state. Use an isolation source to extract a session key from each request (a header, query parameter, or cookie) so conversation state is tracked per session.

Session isolation is configured via the Java client’s conversation builder (isolateBy(IsolationSource.header("x-session-id"))) or the create_llm_conversation MCP tool’s isolateBy parameter. The isolation source is encoded into the expectation’s scenario name — it is not a separate field in the raw expectation JSON. Common isolation sources:

Header: extract the session key from a request header (e.g. x-session-id)
Query parameter: extract from a URL query parameter
Cookie: extract from a cookie value

Example using the create_llm_conversation MCP tool:

{
  "provider": "OPENAI",
  "path": "/v1/chat/completions",
  "isolateBy": {
    "source": "header",
    "name": "x-session-id"
  },
  "turns": [
    {
      "match": { "turnIndex": 0 },
      "response": { "text": "Hello!", "stopReason": "stop" }
    }
  ]
}

Each unique isolation key value creates an independent conversation state machine, so multiple agents can run conversations in parallel without interfering with each other.

Realtime Voice APIs (OpenAI Realtime, Gemini Live)

The realtime voice APIs — the OpenAI Realtime API and the Google Gemini Live API — are not ordinary HTTP calls: the client opens a WebSocket and exchanges a stream of JSON events. MockServer can mock both so you can test an agent or app that uses them completely offline, with no real API, no cost, and no audio hardware. The mock replies to the client’s control messages with the provider-correct event stream for one scripted spoken turn (transcript deltas, audio-byte deltas, and a final usage/done event).

Point your realtime SDK at MockServer’s WebSocket URL (e.g. ws://localhost:1080/v1/realtime) instead of the real provider. Audio bytes in the mocked stream are short silence placeholders — the aim is to reproduce the event protocol faithfully (so your event-handling code is exercised), not to synthesise real speech. The same scripted turn answers every request on the connection.

Available in the Java client only. Realtime voice mocking is a Java-client convenience (RealtimeMockBuilder): there is no dedicated realtime JSON field. The builder expands the short script below into a full provider-correct httpWebSocketResponse expectation (session events, streamed transcript/audio deltas, and a final usage event) using the server-side realtime codecs. Reproducing that by hand in another client would mean writing out every WebSocket frame, so the other client libraries do not offer this shortcut.

import static org.mockserver.client.RealtimeMockBuilder.openAiRealtime;
import static org.mockserver.llm.realtime.RealtimeTurn.realtimeTurn;

// Mock a spoken response over ws://localhost:1080/v1/realtime
openAiRealtime()
    .withModel("gpt-realtime")
    .respondingWith(
        realtimeTurn("The capital of France is Paris.")
            .withInputTokens(20)
            .withOutputTokens(7)
    )
    .applyTo(mockServerClient);

// On connect the mock pushes session.created; it acknowledges session.update and
// conversation.item.create, and answers each response.create with the full event
// lifecycle ending in response.done (with usage).

import static org.mockserver.client.RealtimeMockBuilder.geminiLive;

// Mock a Gemini Live BidiGenerateContent session
geminiLive()
    .respondingWith("Bonjour le monde")
    .applyTo(mockServerClient);

// The mock answers the client's setup message with setupComplete, then answers each
// clientContent turn with a streamed serverContent chunk sequence, generationComplete,
// and a final turnComplete carrying usageMetadata.

Use .withModality(RealtimeModality.TEXT) for a text-only response, .withTokensPerSecond(...) / .withTimeToFirstToken(...) to control streaming timing, and .respondingWith(realtimeTurn(...)) to set an explicit transcript, synthetic audio bytes, or token usage. Protocol corners that are intentionally not mocked yet (server voice-activity-detection events, tool/function calls, input-audio transcription) are listed in the developer documentation.

Cost Budget

When using MockServer as a proxy in front of a real LLM provider, the llmCostBudgetUsd property sets a cumulative cost ceiling. Once the estimated cost of all forwarded LLM completions exceeds the budget, further LLM forwards are blocked with a 429 response. This is a safety net for CI pipelines or development environments where runaway agents could generate unexpected charges.

Property	Env var	Default	Description
mockserver.llmCostBudgetUsd	MOCKSERVER_LLM_COST_BUDGET_USD	-1.0 (disabled)	Cumulative cost budget in USD. Set to a positive value to enable; negative or unset means no limit.

The budget is enforced on all forward paths (matched forward actions, proxy-pass, and reverse-proxy routes) and resets on PUT /mockserver/reset. Cost estimation uses an internal pricing table and is approximate — treat it as a safety guard, not an invoice.

Moderation, Content Filter & Refusal

Production agents must handle provider content-filter blocks and refusals. These opt-in fields let you trigger them deterministically. All default to off / not-flagged, so existing mocks are unaffected.

OpenAI Moderations endpoint — set moderation (instead of completion) and a request returns OpenAI's /v1/moderations shape. Categories listed in flaggedCategories are marked flagged; an empty list is a not-flagged verdict.

{
  "httpRequest": {
    "method": "POST",
    "path": "/v1/moderations"
  },
  "httpLlmResponse": {
    "provider": "OPENAI",
    "moderation": {
      "flaggedCategories": [ "hate", "violence" ]
    }
  }
}

Azure content-filter annotations — set contentFilter with per-category severities (hate, sexual, violence, selfHarm: one of safe, low, medium, high) on an AZURE_OPENAI completion. MockServer adds the content_filter_results (per choice) and prompt_filter_results (top-level) annotations agents read to detect filtering (filtered is true at medium/high).

{
  "httpRequest": {
    "method": "POST",
    "path": "/openai/deployments/gpt-4o/chat/completions"
  },
  "httpLlmResponse": {
    "provider": "AZURE_OPENAI",
    "model": "gpt-4o",
    "completion": { "text": "Filtered response" },
    "contentFilter": { "hate": "high", "violence": "safe" }
  }
}

Refusals & content-filter blocks — an Anthropic refusal is just a completion with "stopReason": "refusal". To inject a block probabilistically across providers, use the contentFilterBlockProbability chaos field (see below), which emits the provider-correct shape.

Chaos / Fault Injection

Add a chaos object to the httpLlmResponse to test how your application handles LLM failures:

{
  "httpRequest": {
    "method": "POST",
    "path": "/v1/chat/completions"
  },
  "httpLlmResponse": {
    "provider": "OPENAI",
    "model": "gpt-4o",
    "completion": {
      "text": "Normal response text",
      "streaming": true
    },
    "chaos": {
      "errorStatus": 503,
      "errorProbability": 0.3,
      "truncateMode": "MID_STREAM",
      "truncateAtFraction": 0.5,
      "malformedSse": true
    }
  }
}

Field	Description
errorStatus	Return this HTTP status as a provider error (e.g. 503, 429). With no errorProbability, always fires.
retryAfter	Value for the Retry-After header on an injected error (e.g. "30"). Useful alongside errorStatus: 429 to test retry-after handling.
errorProbability	Probability (0.0 – 1.0) that the error fires on each request
truncateMode	NONE (default) or MID_STREAM. Must be set to MID_STREAM for truncateAtFraction to take effect.
truncateAtFraction	For streaming responses, cut the stream short at this fraction of SSE events (0.0 – 1.0, default 0.5). Only applies when truncateMode is MID_STREAM.
malformedSse	Inject a deliberately broken-JSON SSE chunk
quotaName / quotaLimit / quotaWindowMillis	Deterministic fixed-window rate limit
quotaErrorStatus	HTTP status returned when the quota is exceeded (default 429). Must be between 100 and 599.
contentFilterBlockProbability	Probability (0.0 – 1.0) of emitting a provider-correct content-filter block (see Moderation, Content Filter & Refusal): OpenAI → 400 content_filter; Azure → 400 with a filtered innererror; Anthropic → 200 refusal; Gemini → 200 SAFETY. Seeded-deterministic and takes priority over errorStatus.
seed	PRNG seed for reproducible probabilistic faults

See Chaos Testing & Fault Injection for more about chaos profiles.

Provider-Correct Error Bodies

LLM chaos error injection emits the body shape the real provider returns — not a generic error body. This lets your client SDK's retry and backoff logic be tested against realistic error responses:

Provider	HTTP Status (OVERLOAD)	Body shape
Anthropic	529	`{"type":"error","error":{"type":"overloaded_error","message":"..."}}`
OpenAI	503	`{"error":{"message":"...","type":"server_error","code":"server_error"}}`
Gemini	503	Google API status envelope
Ollama	503	Plain `{"error":"..."}` message

An optional errorKind field (OVERLOAD, RATE_LIMIT, or SERVER_ERROR) on the LLM chaos profile lets you declare the error intent. MockServer emits the active provider's natural HTTP status and its provider-correct body without you having to pick the status manually. An explicit errorStatus still overrides the code while keeping the provider-correct body. A missing or unrecognised errorKind falls back to the generic body.

Cached and Reasoning Token Usage

org.mockserver.model.Usage has three optional fields in addition to the baseline inputTokens / outputTokens pair. These are back-compatible and optional: omitting them leaves behaviour unchanged.

Field (Java / JSON)	Description	Provider mapping
`cachedInputTokens`	Input tokens served from a prompt cache at a reduced rate. A subset of `inputTokens` (not additive).	OpenAI: `usage.prompt_tokens_details.cached_tokens` Anthropic: `usage.cache_read_input_tokens` Gemini: `usageMetadata.cachedContentTokenCount`
`cacheCreationTokens`	Input tokens written to a prompt cache at a premium rate.	Anthropic: `usage.cache_creation_input_tokens`
`reasoningTokens`	Reasoning/thinking tokens billed as output but not part of the visible completion. A subset of `outputTokens` (not additive).	OpenAI: `usage.completion_tokens_details.reasoning_tokens` Gemini: `usageMetadata.thoughtsTokenCount`

Set these fields on the usage object in your expectation JSON to simulate provider responses from models that use prompt caching or chain-of-thought reasoning:

import static org.mockserver.client.LlmMockBuilder.llmMock;
import static org.mockserver.model.Completion.completion;
import static org.mockserver.model.Provider.ANTHROPIC;
import static org.mockserver.model.Usage.usage;

llmMock("/v1/messages")
    .withProvider(ANTHROPIC)
    .withModel("claude-sonnet-4-20250514")
    .respondingWith(
        completion()
            .withText("Paris is the capital of France.")
            .withStopReason("end_turn")
            .withUsage(
                usage()
                    .withInputTokens(150)
                    .withOutputTokens(10)
                    .withCachedInputTokens(120)  // subset of inputTokens
                    .withCacheCreationTokens(0)
                    .withReasoningTokens(0)
            )
    )
    .applyTo(mockServerClient);

var mockServerClient = require('mockserver-client').mockServerClient;

mockServerClient("localhost", 1080).mockAnyResponse({
    "httpRequest": {
        "method": "POST",
        "path": "/v1/messages"
    },
    "httpLlmResponse": {
        "provider": "ANTHROPIC",
        "model": "claude-sonnet-4-20250514",
        "completion": {
            "text": "Paris is the capital of France.",
            "stopReason": "end_turn",
            "usage": {
                "inputTokens": 150,
                "outputTokens": 10,
                "cachedInputTokens": 120,
                "cacheCreationTokens": 0,
                "reasoningTokens": 0
            }
        }
    }
}).then(
    function () { console.log("expectation created"); },
    function (error) { console.log(error); }
);

import requests

requests.put(
    "http://localhost:1080/mockserver/expectation",
    json={
        "httpRequest": {
            "method": "POST",
            "path": "/v1/messages"
        },
        "httpLlmResponse": {
            "provider": "ANTHROPIC",
            "model": "claude-sonnet-4-20250514",
            "completion": {
                "text": "Paris is the capital of France.",
                "stopReason": "end_turn",
                "usage": {
                    "inputTokens": 150,
                    "outputTokens": 10,
                    "cachedInputTokens": 120,
                    "cacheCreationTokens": 0,
                    "reasoningTokens": 0
                }
            }
        }
    }
)

require 'net/http'
require 'json'

uri = URI('http://localhost:1080/mockserver/expectation')
http = Net::HTTP.new(uri.host, uri.port)
req = Net::HTTP::Put.new(uri.path, 'Content-Type' => 'application/json')
req.body = JSON.generate({
  'httpRequest' => { 'method' => 'POST', 'path' => '/v1/messages' },
  'httpLlmResponse' => {
    'provider' => 'ANTHROPIC',
    'model' => 'claude-sonnet-4-20250514',
    'completion' => {
      'text' => 'Paris is the capital of France.',
      'stopReason' => 'end_turn',
      'usage' => {
        'inputTokens' => 150,
        'outputTokens' => 10,
        'cachedInputTokens' => 120,
        'cacheCreationTokens' => 0,
        'reasoningTokens' => 0
      }
    }
  }
})
http.request(req)

package main

import (
    "bytes"
    "encoding/json"
    "net/http"
)

func createCachedUsageExpectation() {
    body, _ := json.Marshal(map[string]interface{}{
        "httpRequest": map[string]interface{}{
            "method": "POST",
            "path":   "/v1/messages",
        },
        "httpLlmResponse": map[string]interface{}{
            "provider": "ANTHROPIC",
            "model":    "claude-sonnet-4-20250514",
            "completion": map[string]interface{}{
                "text":       "Paris is the capital of France.",
                "stopReason": "end_turn",
                "usage": map[string]int{
                    "inputTokens":         150,
                    "outputTokens":        10,
                    "cachedInputTokens":   120,
                    "cacheCreationTokens": 0,
                    "reasoningTokens":     0,
                },
            },
        },
    })
    req, _ := http.NewRequest(http.MethodPut,
        "http://localhost:1080/mockserver/expectation", bytes.NewReader(body))
    req.Header.Set("Content-Type", "application/json")
    http.DefaultClient.Do(req)
}

using System.Net.Http;
using System.Text;
using System.Text.Json;

var expectation = new
{
    httpRequest = new { method = "POST", path = "/v1/messages" },
    httpLlmResponse = new
    {
        provider = "ANTHROPIC",
        model = "claude-sonnet-4-20250514",
        completion = new
        {
            text = "Paris is the capital of France.",
            stopReason = "end_turn",
            usage = new
            {
                inputTokens = 150,
                outputTokens = 10,
                cachedInputTokens = 120,
                cacheCreationTokens = 0,
                reasoningTokens = 0
            }
        }
    }
};
using var client = new HttpClient();
var json = JsonSerializer.Serialize(expectation);
await client.PutAsync(
    "http://localhost:1080/mockserver/expectation",
    new StringContent(json, Encoding.UTF8, "application/json"));

use serde_json::json;

let client = reqwest::blocking::Client::new();
client.put("http://localhost:1080/mockserver/expectation")
    .json(&json!({
        "httpRequest": {
            "method": "POST",
            "path": "/v1/messages"
        },
        "httpLlmResponse": {
            "provider": "ANTHROPIC",
            "model": "claude-sonnet-4-20250514",
            "completion": {
                "text": "Paris is the capital of France.",
                "stopReason": "end_turn",
                "usage": {
                    "inputTokens": 150,
                    "outputTokens": 10,
                    "cachedInputTokens": 120,
                    "cacheCreationTokens": 0,
                    "reasoningTokens": 0
                }
            }
        }
    }))
    .send()
    .unwrap();

<?php
$expectation = [
    'httpRequest' => ['method' => 'POST', 'path' => '/v1/messages'],
    'httpLlmResponse' => [
        'provider' => 'ANTHROPIC',
        'model' => 'claude-sonnet-4-20250514',
        'completion' => [
            'text' => 'Paris is the capital of France.',
            'stopReason' => 'end_turn',
            'usage' => [
                'inputTokens' => 150,
                'outputTokens' => 10,
                'cachedInputTokens' => 120,
                'cacheCreationTokens' => 0,
                'reasoningTokens' => 0,
            ],
        ],
    ],
];
$ch = curl_init('http://localhost:1080/mockserver/expectation');
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'PUT');
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($expectation));
curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type: application/json']);
curl_exec($ch);
curl_close($ch);

curl -v -X PUT "http://localhost:1080/mockserver/expectation" -d '{
  "httpRequest": {
    "method": "POST",
    "path": "/v1/messages"
  },
  "httpLlmResponse": {
    "provider": "ANTHROPIC",
    "model": "claude-sonnet-4-20250514",
    "completion": {
      "text": "Paris is the capital of France.",
      "stopReason": "end_turn",
      "usage": {
        "inputTokens": 150,
        "outputTokens": 10,
        "cachedInputTokens": 120,
        "cacheCreationTokens": 0,
        "reasoningTokens": 0
      }
    }
  }
}'

When MockServer proxies traffic to a real LLM provider, these fields are decoded from the provider's usage shape and stored. The GenAI telemetry spans report them as mockserver.gen_ai.usage.cached_input_tokens, mockserver.gen_ai.usage.cache_creation_tokens, and mockserver.gen_ai.usage.reasoning_tokens.

Streaming — Client Examples

These examples show how to create a streaming LLM expectation with realistic timing physics. The Java client uses the typed fluent API; all other clients send the expectation as raw JSON.

import static java.util.concurrent.TimeUnit.MILLISECONDS;
import static org.mockserver.client.Llm.jitter;
import static org.mockserver.client.Llm.timeToFirstToken;
import static org.mockserver.client.Llm.tokensPerSecond;
import static org.mockserver.client.LlmMockBuilder.llmMock;
import static org.mockserver.model.Completion.completion;
import static org.mockserver.model.Provider.OPENAI;

llmMock("/v1/chat/completions")
    .withProvider(OPENAI)
    .withModel("gpt-4o")
    .respondingWith(
        completion()
            .withText("This response is streamed token by token.")
            .streaming()
            .withStreamingPhysics(
                timeToFirstToken(200, MILLISECONDS),
                tokensPerSecond(50),
                jitter(0.1))
    )
    .applyTo(mockServerClient);

var mockServerClient = require('mockserver-client').mockServerClient;

mockServerClient("localhost", 1080).mockAnyResponse({
    "httpRequest": {
        "method": "POST",
        "path": "/v1/chat/completions"
    },
    "httpLlmResponse": {
        "provider": "OPENAI",
        "model": "gpt-4o",
        "completion": {
            "text": "This response is streamed token by token.",
            "streaming": true,
            "streamingPhysics": {
                "timeToFirstToken": { "timeUnit": "MILLISECONDS", "value": 200 },
                "tokensPerSecond": 50,
                "jitter": 0.1,
                "seed": 42
            }
        }
    }
}).then(
    function () { console.log("expectation created"); },
    function (error) { console.log(error); }
);

import requests

requests.put(
    "http://localhost:1080/mockserver/expectation",
    json={
        "httpRequest": {
            "method": "POST",
            "path": "/v1/chat/completions"
        },
        "httpLlmResponse": {
            "provider": "OPENAI",
            "model": "gpt-4o",
            "completion": {
                "text": "This response is streamed token by token.",
                "streaming": True,
                "streamingPhysics": {
                    "timeToFirstToken": {"timeUnit": "MILLISECONDS", "value": 200},
                    "tokensPerSecond": 50,
                    "jitter": 0.1,
                    "seed": 42
                }
            }
        }
    }
)

require 'net/http'
require 'json'

uri = URI('http://localhost:1080/mockserver/expectation')
http = Net::HTTP.new(uri.host, uri.port)
req = Net::HTTP::Put.new(uri.path, 'Content-Type' => 'application/json')
req.body = JSON.generate({
  'httpRequest' => { 'method' => 'POST', 'path' => '/v1/chat/completions' },
  'httpLlmResponse' => {
    'provider' => 'OPENAI',
    'model' => 'gpt-4o',
    'completion' => {
      'text' => 'This response is streamed token by token.',
      'streaming' => true,
      'streamingPhysics' => {
        'timeToFirstToken' => { 'timeUnit' => 'MILLISECONDS', 'value' => 200 },
        'tokensPerSecond' => 50,
        'jitter' => 0.1,
        'seed' => 42
      }
    }
  }
})
http.request(req)

package main

import (
    "bytes"
    "encoding/json"
    "net/http"
)

func createStreamingLlmExpectation() {
    body, _ := json.Marshal(map[string]interface{}{
        "httpRequest": map[string]interface{}{
            "method": "POST",
            "path":   "/v1/chat/completions",
        },
        "httpLlmResponse": map[string]interface{}{
            "provider": "OPENAI",
            "model":    "gpt-4o",
            "completion": map[string]interface{}{
                "text":      "This response is streamed token by token.",
                "streaming": true,
                "streamingPhysics": map[string]interface{}{
                    "timeToFirstToken": map[string]interface{}{
                        "timeUnit": "MILLISECONDS", "value": 200,
                    },
                    "tokensPerSecond": 50,
                    "jitter":          0.1,
                    "seed":            42,
                },
            },
        },
    })
    req, _ := http.NewRequest(http.MethodPut,
        "http://localhost:1080/mockserver/expectation", bytes.NewReader(body))
    req.Header.Set("Content-Type", "application/json")
    http.DefaultClient.Do(req)
}

using System.Net.Http;
using System.Text;
using System.Text.Json;

var expectation = new
{
    httpRequest = new { method = "POST", path = "/v1/chat/completions" },
    httpLlmResponse = new
    {
        provider = "OPENAI",
        model = "gpt-4o",
        completion = new
        {
            text = "This response is streamed token by token.",
            streaming = true,
            streamingPhysics = new
            {
                timeToFirstToken = new { timeUnit = "MILLISECONDS", value = 200 },
                tokensPerSecond = 50,
                jitter = 0.1,
                seed = 42
            }
        }
    }
};
using var client = new HttpClient();
var json = JsonSerializer.Serialize(expectation);
await client.PutAsync(
    "http://localhost:1080/mockserver/expectation",
    new StringContent(json, Encoding.UTF8, "application/json"));

use serde_json::json;

let client = reqwest::blocking::Client::new();
client.put("http://localhost:1080/mockserver/expectation")
    .json(&json!({
        "httpRequest": {
            "method": "POST",
            "path": "/v1/chat/completions"
        },
        "httpLlmResponse": {
            "provider": "OPENAI",
            "model": "gpt-4o",
            "completion": {
                "text": "This response is streamed token by token.",
                "streaming": true,
                "streamingPhysics": {
                    "timeToFirstToken": { "timeUnit": "MILLISECONDS", "value": 200 },
                    "tokensPerSecond": 50,
                    "jitter": 0.1,
                    "seed": 42
                }
            }
        }
    }))
    .send()
    .unwrap();

<?php
$expectation = [
    'httpRequest' => ['method' => 'POST', 'path' => '/v1/chat/completions'],
    'httpLlmResponse' => [
        'provider' => 'OPENAI',
        'model' => 'gpt-4o',
        'completion' => [
            'text' => 'This response is streamed token by token.',
            'streaming' => true,
            'streamingPhysics' => [
                'timeToFirstToken' => ['timeUnit' => 'MILLISECONDS', 'value' => 200],
                'tokensPerSecond' => 50,
                'jitter' => 0.1,
                'seed' => 42,
            ],
        ],
    ],
];
$ch = curl_init('http://localhost:1080/mockserver/expectation');
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'PUT');
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($expectation));
curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type: application/json']);
curl_exec($ch);
curl_close($ch);

curl -v -X PUT "http://localhost:1080/mockserver/expectation" -d '{
  "httpRequest": {
    "method": "POST",
    "path": "/v1/chat/completions"
  },
  "httpLlmResponse": {
    "provider": "OPENAI",
    "model": "gpt-4o",
    "completion": {
      "text": "This response is streamed token by token.",
      "streaming": true,
      "streamingPhysics": {
        "timeToFirstToken": { "timeUnit": "MILLISECONDS", "value": 200 },
        "tokensPerSecond": 50,
        "jitter": 0.1,
        "seed": 42
      }
    }
  }
}'

Java Client API (Simple Completion)

The simplest LLM expectation returns a single completion for a request. The Java client uses the typed when(...).respond(llmResponse()...) fluent API; all other clients send the same expectation as raw JSON:

import static org.mockserver.model.HttpRequest.request;
import static org.mockserver.model.HttpLlmResponse.llmResponse;
import static org.mockserver.model.Completion.completion;
import org.mockserver.model.Provider;

mockServerClient
    .when(
        request()
            .withMethod("POST")
            .withPath("/v1/chat/completions")
    )
    .respond(
        llmResponse()
            .withProvider(Provider.OPENAI)
            .withModel("gpt-4o")
            .withCompletion(
                completion()
                    .withText("Hello from MockServer!")
                    .withStopReason("stop")
            )
    );

var mockServerClient = require('mockserver-client').mockServerClient;

mockServerClient("localhost", 1080).mockAnyResponse({
    "httpRequest": {
        "method": "POST",
        "path": "/v1/chat/completions"
    },
    "httpLlmResponse": {
        "provider": "OPENAI",
        "model": "gpt-4o",
        "completion": {
            "text": "Hello from MockServer!",
            "stopReason": "stop"
        }
    }
}).then(
    function () { console.log("expectation created"); },
    function (error) { console.log(error); }
);

import requests

requests.put(
    "http://localhost:1080/mockserver/expectation",
    json={
        "httpRequest": {
            "method": "POST",
            "path": "/v1/chat/completions"
        },
        "httpLlmResponse": {
            "provider": "OPENAI",
            "model": "gpt-4o",
            "completion": {
                "text": "Hello from MockServer!",
                "stopReason": "stop"
            }
        }
    }
)

require 'net/http'
require 'json'

uri = URI('http://localhost:1080/mockserver/expectation')
http = Net::HTTP.new(uri.host, uri.port)
req = Net::HTTP::Put.new(uri.path, 'Content-Type' => 'application/json')
req.body = JSON.generate({
  'httpRequest' => { 'method' => 'POST', 'path' => '/v1/chat/completions' },
  'httpLlmResponse' => {
    'provider' => 'OPENAI',
    'model' => 'gpt-4o',
    'completion' => {
      'text' => 'Hello from MockServer!',
      'stopReason' => 'stop'
    }
  }
})
http.request(req)

package main

import (
    "bytes"
    "encoding/json"
    "net/http"
)

func createLlmExpectation() {
    body, _ := json.Marshal(map[string]interface{}{
        "httpRequest": map[string]interface{}{
            "method": "POST",
            "path":   "/v1/chat/completions",
        },
        "httpLlmResponse": map[string]interface{}{
            "provider": "OPENAI",
            "model":    "gpt-4o",
            "completion": map[string]interface{}{
                "text":       "Hello from MockServer!",
                "stopReason": "stop",
            },
        },
    })
    req, _ := http.NewRequest(http.MethodPut,
        "http://localhost:1080/mockserver/expectation", bytes.NewReader(body))
    req.Header.Set("Content-Type", "application/json")
    http.DefaultClient.Do(req)
}

using System.Net.Http;
using System.Text;
using System.Text.Json;

var expectation = new
{
    httpRequest = new { method = "POST", path = "/v1/chat/completions" },
    httpLlmResponse = new
    {
        provider = "OPENAI",
        model = "gpt-4o",
        completion = new
        {
            text = "Hello from MockServer!",
            stopReason = "stop"
        }
    }
};
using var client = new HttpClient();
var json = JsonSerializer.Serialize(expectation);
await client.PutAsync(
    "http://localhost:1080/mockserver/expectation",
    new StringContent(json, Encoding.UTF8, "application/json"));

use serde_json::json;

let client = reqwest::blocking::Client::new();
client.put("http://localhost:1080/mockserver/expectation")
    .json(&json!({
        "httpRequest": {
            "method": "POST",
            "path": "/v1/chat/completions"
        },
        "httpLlmResponse": {
            "provider": "OPENAI",
            "model": "gpt-4o",
            "completion": {
                "text": "Hello from MockServer!",
                "stopReason": "stop"
            }
        }
    }))
    .send()
    .unwrap();

<?php
$expectation = [
    'httpRequest' => ['method' => 'POST', 'path' => '/v1/chat/completions'],
    'httpLlmResponse' => [
        'provider' => 'OPENAI',
        'model' => 'gpt-4o',
        'completion' => [
            'text' => 'Hello from MockServer!',
            'stopReason' => 'stop',
        ],
    ],
];
$ch = curl_init('http://localhost:1080/mockserver/expectation');
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'PUT');
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($expectation));
curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type: application/json']);
curl_exec($ch);
curl_close($ch);

curl -v -X PUT "http://localhost:1080/mockserver/expectation" -d '{
  "httpRequest": {
    "method": "POST",
    "path": "/v1/chat/completions"
  },
  "httpLlmResponse": {
    "provider": "OPENAI",
    "model": "gpt-4o",
    "completion": {
      "text": "Hello from MockServer!",
      "stopReason": "stop"
    }
  }
}'

Creating Expectations — full request matcher and action reference
Chaos Testing & Fault Injection — HTTP and LLM chaos profiles
MCP Tools Reference — mock_llm_completion and create_llm_conversation tools for AI agents
Inspect AI Agent Traffic — proxy and inspect real LLM API traffic
Observability — LLM token/cost Prometheus metrics and OpenTelemetry GenAI spans
LLM Cost Optimisation — export a one-click optimisation brief from captured LLM traffic to find ways to cut inference cost