E2E Testing MCP Apps, ChatGPT Apps, and Claude Connectors (April 2026)

Abe Wheeler
E2E testing MCP Apps across ChatGPT and Claude hosts with the inspector fixture.

Unit tests tell you your component renders correctly in happy-dom. Integration tests tell you your tool handler returns the right data through the MCP protocol. Neither tells you what your MCP App actually looks like when it renders inside ChatGPT or Claude.

End-to-end tests fill that gap. They render your full MCP App in a real browser, inside a simulated host runtime, with the actual iframe sandboxing, CSS variables, and display mode constraints your users will see. If a layout breaks only in fullscreen mode on Claude, an E2E test catches it. A unit test won’t.

TL;DR: Use the inspector fixture from sunpeak/test to render tools in simulated ChatGPT and Claude runtimes with inspector.renderTool(). Assert against the rendered UI with Playwright’s locator API via result.app(). Tests run against both hosts automatically, cover all display modes and themes, and work locally and in CI/CD with no paid accounts. Run with pnpm test:e2e.

What E2E Tests Cover That Other Tests Don’t

MCP Apps have a rendering pipeline that’s longer than most web apps. Your tool handler returns structuredContent, the host runtime serializes it through the MCP protocol, the host embeds your resource component in a sandboxed iframe, and the component reads data through useToolData() inside that iframe. Each step can introduce bugs.
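
For orientation, here is a minimal sketch of the component end of that pipeline. The import path, generic signature, and loading behavior of useToolData are assumptions; check your framework version for the exact API:

// Sketch of a resource component consuming pipeline data inside the iframe.
// Assumes useToolData is exported from 'sunpeak' and returns the tool's
// structuredContent, or undefined while data is still arriving.
import { useToolData } from 'sunpeak';

type DashboardData = { quarter: string; revenue: number };

export function Dashboard() {
  const data = useToolData<DashboardData>();
  if (!data) return <p>Loading…</p>;

  return (
    <div data-testid="dashboard-root">
      <h2>{data.quarter}</h2>
      <p>Revenue: {data.revenue.toLocaleString()}</p>
    </div>
  );
}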

E2E tests exercise the entire pipeline in a real browser. They catch:

  • Iframe rendering bugs where your component works in happy-dom but breaks in a real sandboxed iframe
  • CSS variable resolution issues where host theme variables don’t cascade into your component the way you expected
  • Display mode layout problems where your component overflows in inline mode or leaves empty space in fullscreen
  • Cross-host differences where your component looks right in ChatGPT but breaks in Claude because of different padding, fonts, or color tokens
  • SafeArea clipping where content gets cut off by host chrome in certain display modes

Unit tests mock the hooks and never render in a real iframe. Integration tests verify the protocol but never render UI. Snapshot tests compare HTML structure but don’t catch visual bugs. E2E tests are where you confirm your app works the way users will actually experience it.

The inspector Fixture

The inspector fixture from sunpeak/test is the core of E2E testing for MCP Apps. It handles starting the dev server, navigating the inspector, selecting the host, and traversing the double-iframe structure so you can write assertions against your rendered component.

import { test, expect } from 'sunpeak/test';

test('dashboard renders revenue chart', async ({ inspector }) => {
  const result = await inspector.renderTool('show-dashboard', {
    quarter: 'Q1',
    year: 2026,
  });
  const app = result.app();

  await expect(app.locator('h2:has-text("Q1 2026")')).toBeVisible();
  await expect(app.locator('.revenue-chart')).toBeVisible();
});

inspector.renderTool() takes three arguments:

  1. Tool name that matches your tool file in src/tools/
  2. Tool input (optional) with the arguments to pass. If omitted, the test uses data from a matching simulation file
  3. Options (optional) for displayMode, theme, prodResources, and timeout

The method returns a result object with:

  • result.app() returns a Playwright FrameLocator scoped to your resource component inside the host iframe. This is what you write assertions against.
  • result.structuredContent contains the raw data your component received.
  • result.isError is true when the tool call failed.
  • result.screenshot() captures a screenshot for visual regression tests (see the sketch below).
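
Here is a short sketch that exercises the non-locator members. The renderTool call and result fields come from the API above; the tool name, input, and expected data are placeholders borrowed from the dashboard examples:

import { test, expect } from 'sunpeak/test';

test('result object exposes data alongside the UI', async ({ inspector }) => {
  const result = await inspector.renderTool('show-dashboard', {
    quarter: 'Q1',
    year: 2026,
  });

  // Raw structuredContent as delivered through the protocol
  expect(result.isError).toBeFalsy();
  expect(result.structuredContent).toMatchObject({ quarter: 'Q1 2026' });

  // Capture the rendered iframe for a visual regression baseline
  await result.screenshot();
});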

Writing Your First E2E Test

E2E tests live in tests/e2e/ with the .spec.ts extension. Here’s a complete test file for a weather MCP App:

// tests/e2e/weather.spec.ts
import { test, expect } from 'sunpeak/test';

test('shows current weather for a city', async ({ inspector }) => {
  const result = await inspector.renderTool('get-weather', {
    city: 'Portland',
    units: 'fahrenheit',
  });
  const app = result.app();

  await expect(app.locator('text=Portland')).toBeVisible();
  await expect(app.locator('.temperature')).toBeVisible();
  await expect(app.locator('.conditions')).toBeVisible();
});

test('shows error for unknown city', async ({ inspector }) => {
  const result = await inspector.renderTool('get-weather', {
    city: 'Faketown',
    units: 'celsius',
  });

  expect(result.isError).toBeTruthy();
});

test('renders forecast grid in fullscreen', async ({ inspector }) => {
  const result = await inspector.renderTool('get-weather', {
    city: 'Portland',
    units: 'fahrenheit',
  }, { displayMode: 'fullscreen' });
  const app = result.app();

  const forecastItems = app.locator('.forecast-day');
  await expect(forecastItems).toHaveCount(7);
});

Run these tests with:

pnpm test:e2e

Each test runs twice by default, once against a simulated ChatGPT runtime and once against Claude. The test report shows which host each run targeted, so you can tell immediately if a failure is host-specific.

Testing Across Hosts

The defineConfig() from sunpeak/test/config automatically creates Playwright projects for ChatGPT and Claude. You don’t write host-selection logic in your tests. Every test runs against both hosts with separate results.

// playwright.config.ts
import { defineConfig } from 'sunpeak/test/config';
export default defineConfig();

When a test fails on one host but passes on the other, the test name includes the host:

✓ [chatgpt] shows current weather for a city (1.2s)
✗ [claude] shows current weather for a city (1.4s)
   Error: Locator "text=Portland" not visible

This tells you the bug is Claude-specific. Maybe the Claude host renders with different padding that pushes your content outside the visible area, or a Claude theme variable maps to a color that makes your text invisible against the background.

If you need to skip a test on a specific host, check inspector.host:

test('pip mode shows compact layout', async ({ inspector }) => {
  test.skip(inspector.host === 'claude', 'Claude does not support pip mode yet');

  const result = await inspector.renderTool('get-weather', {
    city: 'Portland',
  }, { displayMode: 'pip' });
  const app = result.app();

  await expect(app.locator('.compact-view')).toBeVisible();
});

Testing Display Modes

ChatGPT Apps render in three display modes: inline (embedded in the chat), picture-in-picture (floating window), and fullscreen (modal overlay). Each mode gives your component different dimensions and different host chrome, so your layout needs to adapt.

Test all three by passing displayMode to renderTool():

import { test, expect } from 'sunpeak/test';

test('inline mode shows summary view', async ({ inspector }) => {
  const result = await inspector.renderTool('show-dashboard', undefined, {
    displayMode: 'inline',
  });
  const app = result.app();

  await expect(app.locator('.summary-card')).toBeVisible();
  await expect(app.locator('.detailed-charts')).not.toBeVisible();
});

test('fullscreen mode shows detailed charts', async ({ inspector }) => {
  const result = await inspector.renderTool('show-dashboard', undefined, {
    displayMode: 'fullscreen',
  });
  const app = result.app();

  await expect(app.locator('.summary-card')).toBeVisible();
  await expect(app.locator('.detailed-charts')).toBeVisible();
});

test('pip mode shows minimal controls', async ({ inspector }) => {
  const result = await inspector.renderTool('show-dashboard', undefined, {
    displayMode: 'pip',
  });
  const app = result.app();

  await expect(app.locator('.pip-controls')).toBeVisible();
  await expect(app.locator('.summary-card')).not.toBeVisible();
});

For thorough coverage, you can loop over display modes:

const displayModes = ['inline', 'pip', 'fullscreen'] as const;

for (const displayMode of displayModes) {
  test(`core content visible in ${displayMode} mode`, async ({ inspector }) => {
    const result = await inspector.renderTool('show-dashboard', undefined, {
      displayMode,
    });
    const app = result.app();

    await expect(app.locator('[data-testid="dashboard-root"]')).toBeVisible();
  });
}

This generates three test cases per host, six total. Each verifies that your root element renders in the given mode. For mode-specific layouts, write separate tests that check the elements unique to each mode.

Testing Themes

Both ChatGPT and Claude support light and dark themes, and your component should look correct in both. The theme option controls which theme the host renders:

test('dark theme renders with correct background', async ({ inspector }) => {
  const result = await inspector.renderTool('show-dashboard', undefined, {
    theme: 'dark',
  });
  const app = result.app();

  const root = app.locator('[data-testid="dashboard-root"]');
  await expect(root).toHaveCSS('background-color', 'rgb(32, 33, 35)');
});

test('light theme renders with correct background', async ({ inspector }) => {
  const result = await inspector.renderTool('show-dashboard', undefined, {
    theme: 'light',
  });
  const app = result.app();

  const root = app.locator('[data-testid="dashboard-root"]');
  await expect(root).toHaveCSS('background-color', 'rgb(255, 255, 255)');
});

If your component uses host CSS variables for theming (which it should), the theme test verifies that the variables resolve correctly in each theme. If you’ve hardcoded a color somewhere, this test will catch the mismatch.
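
As a point of reference, a component that leans on host variables might look like the sketch below. The variable names here are hypothetical stand-ins; use whatever theme tokens your target hosts actually expose:

import type { ReactNode } from 'react';

// Hypothetical host tokens with fallbacks; substitute the variables
// documented for the hosts you target.
export function Card({ children }: { children: ReactNode }) {
  return (
    <div
      style={{
        background: 'var(--host-surface, #ffffff)',
        color: 'var(--host-text, #000000)',
      }}
    >
      {children}
    </div>
  );
}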

For full cross-product coverage of themes, display modes, and hosts:

const themes = ['light', 'dark'] as const;
const displayModes = ['inline', 'pip', 'fullscreen'] as const;

for (const theme of themes) {
  for (const displayMode of displayModes) {
    test(`renders in ${theme} / ${displayMode}`, async ({ inspector }) => {
      const result = await inspector.renderTool('show-dashboard', undefined, {
        theme,
        displayMode,
      });
      const app = result.app();
      await expect(app.locator('[data-testid="dashboard-root"]')).toBeVisible();
    });
  }
}

That’s 12 test cases (2 hosts x 2 themes x 3 display modes) from a few lines of code. Each runs against the local inspector in seconds with no external dependencies.

Using Simulation Files

Simulation files define deterministic tool call states so your E2E tests don’t depend on a live backend. They live in tests/simulations/ as JSON files:

{
  "tool": "show-dashboard",
  "userMessage": "Show me the Q1 dashboard",
  "toolInput": {
    "arguments": { "quarter": "Q1", "year": 2026 }
  },
  "toolResult": {
    "content": [{ "type": "text", "text": "Dashboard loaded" }],
    "structuredContent": {
      "quarter": "Q1 2026",
      "revenue": 142500,
      "orders": 1203,
      "topProducts": [
        { "name": "Wireless Headphones", "unitsSold": 412, "revenue": 32960 },
        { "name": "USB-C Hub", "unitsSold": 287, "revenue": 14350 }
      ]
    }
  }
}

When you call inspector.renderTool('show-dashboard') without passing explicit input, the fixture loads the matching simulation file and uses its toolResult.structuredContent as the data your component receives via useToolData(). The userMessage appears in the simulated chat interface above your component.

This means your E2E tests are fully deterministic. The same data renders the same UI on every run, regardless of backend state, API availability, or time of day. That’s what you want for CI/CD.
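
As a concrete illustration, the sketch below relies entirely on the dashboard-q1 simulation above: no input is passed, so the fixture supplies the simulated data. The text assertion assumes your component renders the quarter label verbatim:

import { test, expect } from 'sunpeak/test';

test('renders from the simulation file alone', async ({ inspector }) => {
  // No input: the fixture loads the matching simulation for show-dashboard
  const result = await inspector.renderTool('show-dashboard');
  const app = result.app();

  // Data comes straight from toolResult.structuredContent in the JSON file
  expect(result.structuredContent).toMatchObject({ revenue: 142500 });
  await expect(app.locator('text=Q1 2026')).toBeVisible();
});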

You can create multiple simulation files for the same tool to test different states:

tests/simulations/
  dashboard-q1.json          # normal state
  dashboard-empty.json       # no data
  dashboard-high-volume.json # stress test with many products

Then target a specific simulation by passing the matching input:

test('handles empty dashboard', async ({ inspector }) => {
  const result = await inspector.renderTool('show-dashboard', {
    quarter: 'Q4',
    year: 2025,
  });
  const app = result.app();

  await expect(app.locator('text=No data available')).toBeVisible();
});

Simulation files pair well with mocking patterns for more complex scenarios where your tool needs to call server tools or external APIs during the test.

Testing User Interactions

Interactive MCP Apps use useAppState to sync state between the resource component and the host. E2E tests can exercise these interactions with the Playwright locator API:

test('tab navigation updates displayed content', async ({ inspector }) => {
  const result = await inspector.renderTool('show-dashboard', undefined, {
    displayMode: 'fullscreen',
  });
  const app = result.app();

  // Verify default tab
  await expect(app.locator('text=Revenue Overview')).toBeVisible();

  // Click the orders tab
  await app.locator('button:has-text("Orders")').click();

  // Verify content changed
  await expect(app.locator('text=Revenue Overview')).not.toBeVisible();
  await expect(app.locator('text=Order History')).toBeVisible();
});

test('search filters product list', async ({ inspector }) => {
  const result = await inspector.renderTool('show-dashboard', undefined, {
    displayMode: 'fullscreen',
  });
  const app = result.app();

  // Type in search field
  await app.locator('input[placeholder="Search products"]').fill('headphones');

  // Verify filtered results
  await expect(app.locator('text=Wireless Headphones')).toBeVisible();
  await expect(app.locator('text=USB-C Hub')).not.toBeVisible();
});

Interactions in E2E tests exercise the real event handlers, state updates, and re-renders inside the iframe. This catches bugs that unit tests miss because unit tests mock useAppState with a plain function, while E2E tests run the actual host-managed state synchronization.

Testing Loading, Error, and Cancelled States

Your MCP App needs to handle loading, error, and cancelled states gracefully. E2E tests can verify these by using simulation files that represent each state, or by testing against tools that produce errors:

test('shows loading indicator before data arrives', async ({ inspector }) => {
  const result = await inspector.renderTool('show-dashboard');
  const app = result.app();

  // The component should handle the data arrival and not show loading
  // after renderTool resolves
  await expect(app.locator('[data-testid="dashboard-root"]')).toBeVisible();
  await expect(app.locator('text=Loading')).not.toBeVisible();
});

test('shows error message for invalid input', async ({ inspector }) => {
  const result = await inspector.renderTool('show-dashboard', {
    quarter: 'invalid',
    year: -1,
  });

  expect(result.isError).toBeTruthy();
});

For testing error UI rendering, create a simulation file with error data:

{
  "tool": "show-dashboard",
  "userMessage": "Show me the dashboard",
  "toolResult": {
    "content": [{ "type": "text", "text": "Failed to load dashboard data" }],
    "isError": true
  }
}
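
With that simulation in place, a test can assert that the error UI actually renders. This is a sketch: it assumes the fixture resolves this simulation for a no-input call (with several simulations per tool, target one by passing matching input as shown earlier), and the locator should match whatever your component displays for error results:

import { test, expect } from 'sunpeak/test';

test('renders error UI from the error simulation', async ({ inspector }) => {
  const result = await inspector.renderTool('show-dashboard');
  const app = result.app();

  expect(result.isError).toBeTruthy();
  await expect(app.locator('text=Failed to load dashboard data')).toBeVisible();
});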

Testing Production Builds

By default, E2E tests run against the dev server with HMR. To test the production bundle (closer to what your users see), pass prodResources: true:

test('production build renders correctly', async ({ inspector }) => {
  const result = await inspector.renderTool('show-dashboard', undefined, {
    displayMode: 'fullscreen',
    prodResources: true,
  });
  const app = result.app();

  await expect(app.locator('[data-testid="dashboard-root"]')).toBeVisible();
});

This catches build-time issues like missing imports, tree-shaking that removes code your component needs, or CSS that loads in a different order in production.

Debugging Failing E2E Tests

When an E2E test fails, Playwright provides several debugging tools:

Visual debugger opens an interactive UI where you step through each test action and inspect the DOM at each point:

pnpm test:e2e -- --ui

Debug mode pauses execution at the first failure so you can inspect the browser:

pnpm test:e2e -- --debug

Traces record a full timeline of actions, network requests, and DOM snapshots. Enable them in your config or run:

pnpm test:e2e -- --trace on

Then open the trace with pnpm exec playwright show-trace test-results/trace.zip.
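
To keep traces on for every run instead of passing the flag each time, enable them in the config. This sketch assumes defineConfig() forwards standard Playwright options, which may not hold in every version:

// playwright.config.ts
import { defineConfig } from 'sunpeak/test/config';

export default defineConfig({
  // Standard Playwright option; assumes the wrapper passes `use` through
  use: { trace: 'on-first-retry' },
});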

Screenshots on failure are saved automatically to test-results/. When a locator assertion times out, the screenshot shows you exactly what the iframe contained at the moment of failure.

For host-specific failures, the test name tells you which host broke. If [claude] dashboard renders revenue chart fails but [chatgpt] dashboard renders revenue chart passes, you know to focus on Claude-specific rendering. Open the inspector manually at localhost:3000?host=claude to reproduce the issue in the browser.

Organizing E2E Tests

For a small MCP App with one or two tools, a single tests/e2e/app.spec.ts file works fine. For larger apps, organize by resource:

tests/
  e2e/
    dashboard.spec.ts     # show-dashboard tool
    search.spec.ts        # search-products tool
    product.spec.ts       # product-detail tool
    interactions.spec.ts  # cross-tool user flows
  simulations/
    dashboard-q1.json
    dashboard-empty.json
    search-headphones.json
    product-detail.json
  unit/
    dashboard.test.tsx
    search.test.ts

Name test files to match the tool they test. Group interaction tests that span multiple tools into their own file. Keep simulation files in tests/simulations/ where the framework auto-discovers them.

Run all E2E tests:

pnpm test:e2e

Run a single test file:

pnpm test:e2e tests/e2e/dashboard.spec.ts

Run a single test by name:

pnpm test:e2e -g "dashboard renders revenue chart"

Where E2E Tests Fit in the Testing Pyramid

E2E tests are slower than unit tests and integration tests, but they catch bugs those faster tests miss. Here’s how the layers work together for MCP Apps:

  • Unit tests (pnpm test:unit): Test component rendering and handler logic in isolation. Fast, run in milliseconds. Catch most logic bugs.
  • Integration tests (pnpm test:e2e with mcp fixture): Test tool handlers through the real MCP protocol. Catch contract mismatches and protocol bugs. Run in seconds.
  • E2E tests (pnpm test:e2e with inspector fixture): Test the full rendered app in a real browser against simulated hosts. Catch iframe, CSS, display mode, and cross-host bugs. Run in seconds.
  • Visual regression tests (pnpm test:visual): Screenshot comparison on top of E2E tests. Catch pixel-level regressions. Run in seconds.

Write many unit tests, a handful of integration tests per tool, and E2E tests for each tool’s primary rendering states and display modes. You don’t need E2E tests for every edge case; that’s what unit tests are for. Use E2E tests where the rendering environment matters: layout, theme, display mode, and host-specific behavior.

Run everything together with:

pnpm test

This runs unit tests and E2E tests. Add pnpm test:visual to your CI pipeline if you want screenshot comparison on every push.

Get Started

sunpeak projects come with E2E testing pre-configured. The tests/e2e/ directory, Playwright config, and example tests are scaffolded when you create a new project:

npx sunpeak@latest create my-app
cd my-app
pnpm test:e2e

For existing MCP servers that aren’t built with sunpeak, you can add the testing framework separately:

npx sunpeak test init

This creates the tests/ directory structure, playwright.config.ts, and example E2E tests. Point the config at your MCP server and start writing tests with the inspector fixture. It works with any MCP server, regardless of language or framework.

Check the testing framework documentation for the full API reference, or read the complete testing guide for how E2E tests fit into a full MCP App testing strategy.

Frequently Asked Questions

How do I E2E test an MCP App?

Use the inspector fixture from sunpeak/test. Call inspector.renderTool("tool-name", input, options) to render your tool in a simulated ChatGPT or Claude runtime. The method returns a result object with an app() frame locator for asserting against the rendered UI. Tests run with pnpm test:e2e using Playwright. No paid accounts or AI credits needed.

What is the difference between E2E tests and unit tests for MCP Apps?

Unit tests mock sunpeak hooks and render components in happy-dom without a browser. They run in milliseconds but miss iframe rendering, protocol issues, and cross-host bugs. E2E tests render your full MCP App in a real browser inside simulated ChatGPT and Claude runtimes, exercising the complete stack from tool call to rendered UI. E2E tests catch host-specific rendering bugs, display mode issues, and theme problems that unit tests cannot see.

Do I need a ChatGPT or Claude subscription to run E2E tests?

No. E2E tests run against the sunpeak inspector, which ships simulated ChatGPT and Claude runtimes locally. No paid subscriptions, API keys, or AI credits are required. Tests run the same way on your machine and in CI/CD with zero external dependencies.

How do E2E tests run against both ChatGPT and Claude automatically?

The defineConfig() from sunpeak/test/config creates separate Playwright projects for each host. Every test runs once per host automatically. When a test fails on Claude but passes on ChatGPT, the test report shows which host had the problem. You do not need to loop over hosts in your test code.

How do I test different display modes in E2E tests for MCP Apps?

Pass the displayMode option to inspector.renderTool(). For example: inspector.renderTool("show-dashboard", undefined, { displayMode: "fullscreen" }). Test inline, pip, and fullscreen modes. Each mode has different iframe dimensions and host chrome, so your layout needs to work in all three.

What are simulation files in MCP App E2E testing?

Simulation files are JSON files in tests/simulations/ that define deterministic tool call states. They specify the tool name, mock input, mock output (structuredContent), and a user message. When you call inspector.renderTool() without explicit input, the simulation file provides the data. This gives you reproducible UI states without calling a real backend.

How do I test user interactions in MCP App E2E tests?

Use the Playwright locator API on the frame returned by result.app(). For example: app.locator("button:has-text(\"Submit\")").click() to click a button, then await expect(app.locator("text=Success")).toBeVisible() to verify the result. For apps with useAppState, interactions trigger state updates that you can assert against in the rendered UI.

How do I debug failing MCP App E2E tests?

Run pnpm test:e2e -- --ui to open the Playwright visual debugger. You can step through tests, inspect the DOM at each step, and see screenshots. For individual tests, run pnpm test:e2e -- --debug to pause on the first failure. Playwright also saves screenshots automatically on failure to test-results/.