Performance Testing MCP Apps, ChatGPT Apps, and Claude Connectors
Performance testing MCP App tool call latency and resource rendering speed.
MCP Apps have a performance profile that’s different from normal web apps. Your tool handler runs through the MCP protocol, your UI renders inside a host iframe, and the user is already waiting for an AI response before your code even runs. Every millisecond you add to tool call latency is a millisecond the user stares at a spinner. Performance testing helps you find and fix slowdowns before your users notice them.
TL;DR: Measure four things: tool call latency (with the mcp fixture), cold start time (first call after server start), resource bundle size (build output), and rendering speed (with the inspector fixture). Set thresholds, run them in CI, and fail the build when a regression slips in.
Why Performance Matters More in MCP Apps
When someone uses a normal web app, they expect a page load. When someone uses an MCP App inside ChatGPT or Claude, they’re in the middle of a conversation. The AI calls your tool, waits for a response, and renders your resource component in an iframe. The user is already waiting for the AI to think, so any extra delay from your tool handler or slow rendering compounds on top of that.
Here’s what makes MCP App performance tricky:
- Tool call latency is invisible to you unless you measure it. There’s no browser network tab showing your handler’s response time. The AI host calls your server, waits, and eventually renders the result. If your handler takes 3 seconds, the user just sees the AI “thinking” longer.
- Resource components load inside an iframe, which means a separate document load with its own CSS, JavaScript, and rendering pipeline. A bloated bundle that would be fine on a standalone page becomes noticeable when it’s one of several iframes the host is managing.
- Cold starts hit MCP Apps hard because serverless platforms (Lambda, Cloudflare Workers, Fly.io Machines) may spin down your server between tool calls. The first call after an idle period pays the full cold start penalty.
- Hosts impose timeouts. If your tool takes too long, the host may cancel the call. ChatGPT and Claude both have timeout limits that vary by plan and context.
Benchmarking Tool Call Latency
The mcp fixture from sunpeak/test gives you protocol-level access to your tool handlers (see integration testing for the full setup). You can wrap calls with timing to create performance benchmarks:
import { test, expect } from 'sunpeak/test';
test('search-products responds within 500ms', async ({ mcp }) => {
const start = performance.now();
const result = await mcp.callTool('search-products', {
query: 'wireless headphones',
});
const elapsed = performance.now() - start;
expect(result.isError).toBeFalsy();
expect(elapsed).toBeLessThan(500);
});
This measures the full round-trip through the MCP protocol: JSON-RPC serialization, your handler logic, response serialization, and the return path. It’s the same latency a real host would experience.
For tools that call external APIs, the API call usually dominates the timing. Separate internal from external latency by testing with mocked dependencies too:
import { test, expect, vi } from 'sunpeak/test';
// Mock the external API to isolate handler performance
vi.mock('../../src/lib/api-client', () => ({
searchProducts: vi.fn().mockResolvedValue([
{ id: '1', name: 'Headphones', price: 79.99 },
]),
}));
test('search handler logic runs under 50ms (excluding API)', async ({ mcp }) => {
const start = performance.now();
const result = await mcp.callTool('search-products', {
query: 'headphones',
});
const elapsed = performance.now() - start;
expect(result.isError).toBeFalsy();
expect(elapsed).toBeLessThan(50);
});
If this test suddenly takes 200ms, you know the slowdown is in your handler logic (data transformation, validation, serialization), not the external API.
Testing Multiple Tools
Benchmark all your tools in one test file to get a full latency profile:
import { test, expect } from 'sunpeak/test';
const LATENCY_BUDGETS: Record<string, number> = {
'search-products': 500,
'product-detail': 300,
'create-order': 1000,
'list-categories': 200,
};
for (const [toolName, budget] of Object.entries(LATENCY_BUDGETS)) {
test(`${toolName} responds within ${budget}ms`, async ({ mcp }) => {
const start = performance.now();
const result = await mcp.callTool(toolName, getTestInput(toolName));
const elapsed = performance.now() - start;
expect(result.isError).toBeFalsy();
expect(elapsed).toBeLessThan(budget);
});
}
function getTestInput(toolName: string): Record<string, unknown> {
const inputs: Record<string, Record<string, unknown>> = {
'search-products': { query: 'test' },
'product-detail': { productId: 'test-1' },
'create-order': { productId: 'test-1', quantity: 1 },
'list-categories': {},
};
return inputs[toolName] ?? {};
}
This gives you a per-tool latency budget. Write-heavy tools like create-order get a higher budget than read-only lookups. When a tool exceeds its budget, the test fails and tells you which tool got slower.
Measuring Cold Start Time
Cold start time is the latency on the very first tool call after your server boots. It includes module loading, dependency initialization, database connection setup, and any startup logic. On serverless platforms, users hit this delay after every idle period.
The mcp fixture starts a fresh server for each test file by default, so the first test in a file naturally measures cold start:
import { test, expect } from 'sunpeak/test';
test('first tool call (cold start) completes within 2s', async ({ mcp }) => {
const start = performance.now();
const result = await mcp.callTool('list-categories', {});
const coldStartTime = performance.now() - start;
expect(result.isError).toBeFalsy();
expect(coldStartTime).toBeLessThan(2000);
// Second call should be much faster (warm)
const warmStart = performance.now();
await mcp.callTool('list-categories', {});
const warmTime = performance.now() - warmStart;
expect(warmTime).toBeLessThan(500);
console.log(`Cold start: ${coldStartTime.toFixed(0)}ms`);
console.log(`Warm call: ${warmTime.toFixed(0)}ms`);
});
If cold start time creeps up, check for heavy imports at the top of your tool handler files. Lazy-loading large dependencies (database clients, SDK initializations) can cut cold start time significantly:
// Slow: loads the entire SDK on every cold start
import { BigSDK } from 'heavy-analytics-sdk';

// Faster: loads the SDK only when the tool is called.
// A type-only import is erased at build time, so it adds no startup cost.
import type { BigSDK } from 'heavy-analytics-sdk';

let sdk: BigSDK | null = null;

async function getSDK() {
  if (!sdk) {
    const { BigSDK } = await import('heavy-analytics-sdk');
    sdk = new BigSDK();
  }
  return sdk;
}
Checking Resource Bundle Size
Your resource component bundles load inside the host iframe. A 500KB bundle that loads in 100ms on a fast connection takes noticeably longer on mobile, and it loads on top of the JavaScript the host is already running. Smaller bundles mean faster time-to-interactive for your MCP App UI.
Add a build size check to your test suite:
// tests/perf/bundle-size.test.ts
import { test, expect } from 'vitest';
import { readdir } from 'fs/promises';
import { readFileSync } from 'fs';
import { join, dirname } from 'path';
import { fileURLToPath } from 'url';
import { gzipSync } from 'zlib';

const MAX_BUNDLE_SIZE_KB = 100; // gzipped
// __dirname is not available in ESM test files, so derive the path from import.meta.url
const BUILD_DIR = join(dirname(fileURLToPath(import.meta.url)), '../../dist/assets');
test('resource bundles are under size limit', async () => {
const files = await readdir(BUILD_DIR);
const jsFiles = files.filter(f => f.endsWith('.js'));
for (const file of jsFiles) {
const filePath = join(BUILD_DIR, file);
const raw = readFileSync(filePath);
const gzipped = gzipSync(raw);
const sizeKB = gzipped.length / 1024;
expect(
sizeKB,
`${file} is ${sizeKB.toFixed(1)}KB gzipped (limit: ${MAX_BUNDLE_SIZE_KB}KB)`
).toBeLessThan(MAX_BUNDLE_SIZE_KB);
}
});
Run this after pnpm build in your CI pipeline. If someone adds a heavy charting library or accidentally imports a large dependency, the build fails with a clear message about which file exceeded the limit.
Common bundle size wins for MCP Apps:
- Use the host’s CSS variables for theming instead of bundling a CSS-in-JS library (see MCP App styling)
- Import only what you need from utility libraries (import { format } from 'date-fns' instead of import * as dateFns from 'date-fns')
- Split resource components so each tool only loads the code it needs
- Use React.lazy() for sections of your component that aren't visible on initial render (see the sketch after this list)
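If React.lazy() is new to you, here's a minimal sketch of deferring a heavy section of a resource component. The SalesChart name and its import path are placeholders for whatever chunk your UI can load later:

import { lazy, Suspense } from 'react';

// Hypothetical heavy subcomponent (e.g. one that pulls in a charting library).
// ./SalesChart must expose a default export for lazy() to work.
const SalesChart = lazy(() => import('./SalesChart'));

export function Dashboard({ revenue }: { revenue: number }) {
  return (
    <div>
      {/* Lightweight summary renders immediately */}
      <p>Revenue: {revenue}</p>
      {/* The chart's bundle is only fetched when this subtree renders */}
      <Suspense fallback={<p>Loading chart…</p>}>
        <SalesChart />
      </Suspense>
    </div>
  );
}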
Measuring Rendering Speed
Tool call latency is only half the story. After the host receives your tool’s response, it loads your resource component in an iframe. The time from iframe load to visible UI is your rendering speed.
Use the inspector fixture to measure this in a real browser:
import { test, expect } from 'sunpeak/test';
test('dashboard renders within 1s of data arrival', async ({ inspector }) => {
await inspector.simulateToolCall('show-dashboard', {
quarter: 'Q1',
year: 2026,
});
// Time how long until the main content is visible
const start = performance.now();
await inspector.page.waitForSelector('[data-testid="dashboard-content"]', {
timeout: 1000,
});
const renderTime = performance.now() - start;
expect(renderTime).toBeLessThan(1000);
});
For more detailed metrics, use the browser’s Performance API through the inspector:
test('dashboard first contentful paint is under 500ms', async ({ inspector }) => {
await inspector.simulateToolCall('show-dashboard', {
quarter: 'Q1',
year: 2026,
});
// Wait for the component to finish rendering
await inspector.page.waitForSelector('[data-testid="dashboard-content"]');
// Get paint timing from the iframe
const fcp = await inspector.page.evaluate(() => {
const entries = performance.getEntriesByType('paint');
const fcpEntry = entries.find(e => e.name === 'first-contentful-paint');
return fcpEntry?.startTime ?? -1;
});
expect(fcp).toBeGreaterThan(0);
expect(fcp).toBeLessThan(500);
});
Testing with Large Data Sets
A dashboard that renders 10 items quickly might choke on 1,000. Test with realistic data volumes:
test('product list renders 500 items without lag', async ({ inspector }) => {
const products = Array.from({ length: 500 }, (_, i) => ({
id: `product-${i}`,
name: `Product ${i}`,
price: Math.random() * 100,
category: ['Electronics', 'Books', 'Clothing'][i % 3],
}));
await inspector.simulateToolCall('search-products', {
query: 'all',
_mockOutput: { results: products, total: 500 },
});
const start = performance.now();
await inspector.page.waitForSelector('[data-testid="product-list"]');
const renderTime = performance.now() - start;
// Even with 500 items, rendering should stay under 2s
expect(renderTime).toBeLessThan(2000);
});
If this test fails, consider virtualizing long lists (only rendering visible rows) or paginating the results in your tool handler.
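If you go the pagination route, here's a hedged sketch of capping each tool response and returning a cursor that the model or your component can pass back. The fetchAllProducts and PAGE_SIZE names are placeholders, not part of any SDK:

const PAGE_SIZE = 50;

interface Product {
  id: string;
  name: string;
  price: number;
}

// Placeholder for however your handler loads data today
declare function fetchAllProducts(query: string): Promise<Product[]>;

export async function searchProductsPage(query: string, cursor = 0) {
  const all = await fetchAllProducts(query);
  const page = all.slice(cursor, cursor + PAGE_SIZE);
  return {
    results: page,
    total: all.length,
    // Pass this back in a follow-up call to fetch the next page
    nextCursor: cursor + PAGE_SIZE < all.length ? cursor + PAGE_SIZE : null,
  };
}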
Running Performance Tests in CI
Add performance tests to your CI/CD pipeline so regressions get caught on every pull request:
# .github/workflows/test.yml
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: pnpm
      - run: pnpm install
      - run: pnpm build
      - run: pnpm test:unit # includes bundle size checks
      - run: pnpm test:e2e  # includes latency benchmarks
A few practical tips for performance tests in CI:
Expect variance. CI runners have variable performance. A tool call that takes 100ms on your laptop might take 300ms on a shared GitHub Actions runner. Set your thresholds with enough headroom, or compare against a baseline from the same CI environment rather than using absolute numbers.
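Here's a hedged sketch of the baseline approach, assuming a baseline.json recorded from a previous run on the same runner class is checked in under tests/perf/. The file path and the 1.5x tolerance are arbitrary choices, not framework features:

import { test, expect } from 'sunpeak/test';
import { readFile } from 'fs/promises';

const TOLERANCE = 1.5; // allow 50% headroom for runner noise

test('search-products stays within 1.5x of its CI baseline', async ({ mcp }) => {
  const baselines = JSON.parse(
    await readFile('tests/perf/baseline.json', 'utf8')
  ) as Record<string, number>;

  const start = performance.now();
  const result = await mcp.callTool('search-products', { query: 'test' });
  const elapsed = performance.now() - start;

  expect(result.isError).toBeFalsy();
  expect(elapsed).toBeLessThan(baselines['search-products'] * TOLERANCE);
});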
Track trends, not just thresholds. A test that passes at 490ms today (under a 500ms limit) is telling you something. Log performance numbers and review them periodically, even when tests pass.
Separate performance tests from correctness tests. Performance tests are slower and more flaky than unit tests. Put them in a separate CI job or test directory so a flaky timing test doesn’t block a bug fix from merging.
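With Vitest, one way to do that is a separate config that only picks up the perf directory. The file name and retry setting below are just one possible arrangement:

// vitest.config.perf.ts — run with: vitest --config vitest.config.perf.ts
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    include: ['tests/perf/**/*.test.ts'],
    // Timing assertions are noisier than unit tests; retry once before failing
    retry: 1,
  },
});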
What to Optimize First
If your MCP App feels slow and you’re not sure where to start, measure all four metrics and fix the biggest one first:
- Tool call latency is usually the biggest contributor to perceived slowness because it directly extends the AI’s “thinking” time. Start by profiling your handler: is it the database query, the external API call, or the data transformation? Cache what you can, batch what you can’t, and return partial results when full results are expensive (a small caching sketch follows this list).
- Cold start time matters most on serverless platforms and for tools that get called infrequently. If your cold start is over 2 seconds, lazy-load heavy dependencies and move initialization out of the critical path.
- Bundle size matters for mobile users and for MCP Apps that render complex UIs. If your gzipped bundle is over 100KB, check your imports for accidentally bundled libraries and split your code by tool.
- Rendering speed matters for data-heavy resource components. If your component takes over a second to render, virtualize long lists, defer non-visible sections, and avoid re-renders from unnecessary state changes.
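As a starting point for the caching suggestion above, here's a minimal in-memory TTL cache sketch. It assumes a single long-lived server process (on serverless you'd reach for an external cache instead), and every name in it is a placeholder:

const TTL_MS = 60_000;
const cache = new Map<string, { value: unknown; expires: number }>();

export async function cached<T>(key: string, load: () => Promise<T>): Promise<T> {
  const hit = cache.get(key);
  if (hit && hit.expires > Date.now()) {
    return hit.value as T; // serve the cached value without hitting the backend
  }
  const value = await load();
  cache.set(key, { value, expires: Date.now() + TTL_MS });
  return value;
}

// Usage inside a tool handler (db.listCategories is hypothetical):
// const categories = await cached('categories', () => db.listCategories());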
Most MCP Apps should focus on tool call latency first because that’s where the most time gets spent and where optimizations have the biggest impact on user experience.
Get Started
Performance testing for MCP Apps uses the same fixtures and test runner you already have. If you’re using sunpeak, add latency benchmarks with the mcp fixture and rendering benchmarks with the inspector fixture. If you have an existing MCP server, run npx sunpeak test init to scaffold the test infrastructure.
# Initialize testing for your MCP App
npx sunpeak test init
# Run all tests including performance benchmarks
pnpm test:e2e
Check out the testing framework documentation for the full API reference, or read the complete testing guide for how performance tests fit into a full testing strategy.
Further Reading
- Complete guide to testing ChatGPT Apps and MCP Apps
- Integration testing MCP Apps with the mcp fixture
- MCP App CI/CD with GitHub Actions
- MCP App error handling - loading, error, and cancelled states
- Mocking and stubbing in MCP App tests
- Visual regression testing for MCP Apps
- How to deploy an MCP App
- MCP App framework
- ChatGPT App framework
- Claude Connector framework
- Testing framework
Frequently Asked Questions
What is performance testing for MCP Apps?
Performance testing for MCP Apps measures how fast your tool handlers respond, how quickly your resource components render inside host iframes, and how large your client bundle is. Unlike traditional web app performance testing, MCP App performance has unique constraints: your UI runs inside ChatGPT or Claude iframes, your tool handlers go through the MCP protocol layer, and cold starts affect the first tool call after deployment. Performance tests catch regressions before they reach production.
How do I measure MCP App tool call latency?
Use the mcp fixture from sunpeak/test. Wrap mcp.callTool() with performance.now() to measure round-trip time through the MCP protocol. Set a threshold (for example, 500ms) and fail the test if the tool exceeds it. Run these tests in CI to catch regressions from new code, heavier database queries, or added middleware.
What is a good response time for an MCP App tool handler?
Most MCP App tool handlers should respond in under 500ms for a good user experience. Users notice delays over 1 second. If your tool calls external APIs, the API latency is usually the bottleneck. Cache frequently accessed data, batch related API calls, and return partial results when full results would be slow.
How do I test MCP App cold start performance?
Cold start time is the delay on the first tool call after your server starts. Measure it by timing the first mcp.callTool() in a fresh test run. Serverless platforms like AWS Lambda and Cloudflare Workers add cold start latency from container initialization. Track this metric separately from warm response times because it affects real users who trigger your tool for the first time after a deployment or idle period.
How do I measure MCP App bundle size?
Run your production build and check the output size of your resource component bundles. Use a build script that fails if any bundle exceeds a size limit. Large bundles load slowly inside host iframes, especially on mobile. Keep resource component bundles under 100KB gzipped as a starting target and optimize from there.
Can I run performance tests for MCP Apps in CI/CD?
Yes. Add performance tests alongside your existing unit and integration tests. Use the mcp fixture for tool call latency benchmarks and a build size check script for bundle size limits. Store baseline metrics and compare against them on each pull request. GitHub Actions, GitLab CI, and other CI platforms all support this workflow.
How do I performance test rendering speed of MCP App resource components?
Use the inspector fixture from sunpeak/test to load your resource component in a real browser, then measure time-to-interactive or first contentful paint using the Performance API. For simpler checks, measure how long it takes for key elements to appear after the component mounts. This catches rendering regressions from added complexity, larger data sets, or unoptimized re-renders.
What performance metrics matter most for MCP Apps?
The four metrics that matter most are tool call latency (how fast your handler responds), cold start time (first call after deployment), resource bundle size (download time inside the iframe), and rendering speed (time from data arrival to visible UI). Tool call latency and cold start time affect how long the user waits for the AI to show a result. Bundle size and rendering speed affect how fast the result appears once data arrives.