MCP App Testing Framework

Attach your local or remote MCP server to run E2E tests, visual regression tests, and multi-model evals, or run live browser tests against the real ChatGPT and Claude.

Add it to any MCP server in any language. No paid accounts, no AI credits.

npx sunpeak test init --server URL

What is the sunpeak testing framework?


MCP Apps run inside AI hosts like ChatGPT and Claude, not in a browser you control. Every code change means deploying, opening the host, triggering the tool, and checking the result manually. The sunpeak testing framework replicates those host runtimes locally so you can test automatically.

The sunpeak testing framework is open source (MIT) and provides automated testing for MCP Apps, ChatGPT Apps, and Claude Connectors. It works with any MCP server in any language.

E2E Tests

Playwright tests against simulated ChatGPT and Claude hosts

Visual Regression

Screenshot comparison against saved baselines

Multi-Model Evals

Test tool calling across GPT-4o, Claude, Gemini, and other LLMs

Live Host Tests

Playwright tests against real ChatGPT and Claude

Manual Testing

Inspect MCP Apps in the local sunpeak inspector

Test CLI

| Command | What it runs | Runtime |
| --- | --- | --- |
| pnpm test | E2E tests | Playwright + inspector |
| pnpm test:visual | E2E + visual regression | Playwright + inspector + screenshots |
| pnpm test:live | Live tests against real ChatGPT | Playwright + real host |
| pnpm test:eval | Evals against multiple LLM models | Vitest + Vercel AI SDK |
| npx sunpeak test init | Scaffold test infrastructure | Adds Playwright config, tests, and evals |

How It Works

1. Scaffold Tests

Run npx sunpeak test init in your project. It detects your project type (JS/TS, Python, Go, Rust) and creates Playwright config, test files, and eval scaffolding. For non-JS projects, it creates a self-contained tests/sunpeak/ directory.

2. Define Simulations

Create JSON fixtures in tests/simulations/ that define tool input, tool result, and server tool mocks. Each simulation is a reproducible state your resource can render. The inspector loads them automatically.
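For illustration, a fixture for the show-albums tool used in the test examples on this page might look like the following. The field names are assumptions based on the description above (tool input, tool result, server tool mocks), not the framework's confirmed schema:

```json
{
  "tool": "show-albums",
  "input": {},
  "result": {
    "content": [{ "type": "text", "text": "Found 2 albums" }],
    "structuredContent": { "albums": ["Summer Slice", "Winter Waves"] }
  },
  "serverToolMocks": {
    "get-album-art": { "result": { "content": [] } }
  }
}
```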

3. Write Tests

Import test and expect from 'sunpeak/test'. Use the inspector fixture to render tools, set themes and display modes, and assert against the rendered resource with Playwright locators and MCP-specific matchers.

4. Run in CI/CD

Add pnpm test to your pipeline. It starts the dev server, runs E2E tests against both ChatGPT and Claude runtimes, and shuts down when complete. No accounts, keys, or credits on your CI runners.

import { test, expect } from 'sunpeak/test';

test('albums render in light mode', async ({ inspector }) => {
  const result = await inspector.renderTool('show-albums', undefined, { theme: 'light' });
  const app = result.app();
  await expect(app.locator('button:has-text("Summer Slice")')).toBeVisible();
});

test('albums render in fullscreen', async ({ inspector }) => {
  const result = await inspector.renderTool('show-albums', undefined, { displayMode: 'fullscreen' });
  const app = result.app();
  await expect(app.locator('button:has-text("Summer Slice")')).toBeVisible();
  // Compare against saved baseline (only runs with --visual flag)
  await result.screenshot('albums-fullscreen');
});

What You Can Test

  • Multi-Host Rendering

    Tests run against both ChatGPT and Claude runtimes automatically via Playwright projects. One test file covers both hosts.

  • Themes & Display Modes

    Test light/dark themes and inline/fullscreen/pip display modes. Use setTheme() and setDisplayMode() or pass options to callTool().

  • Visual Regression

    Capture screenshots with result.screenshot() and compare against baselines. Configure thresholds and max diff pixel ratios in defineConfig().

  • Backend Tool Mocking

    Simulation files can mock callServerTool responses with simple or conditional matching. Test interactive flows without a real backend.

  • MCP-Specific Assertions

    Custom matchers: toHaveTextContent(), toHaveStructuredContent(), toBeError() alongside standard Playwright locators.

  • Multi-Model Evals

    Send prompts to GPT-4o, Claude, Gemini, and other models. Assert each model calls the right tools with the right arguments. Each eval runs N times per model to measure reliability.

  • Any MCP Server, Any Language

    Use npx sunpeak test init with any MCP server. Configure the server via HTTP URL or startup command. Python, Go, TypeScript, Rust, anything.
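The MCP-specific matchers above are essentially structural checks on tool results. A minimal sketch in plain TypeScript of what a toHaveStructuredContent-style comparison might do, assuming tool results follow the MCP shape with a content array and an optional structuredContent object (hasStructuredContent and the sample data are illustrative, not framework code):

```typescript
// Minimal MCP-style tool result shape (per the MCP spec's content /
// structuredContent / isError fields).
interface ToolResult {
  content: { type: 'text'; text: string }[];
  structuredContent?: Record<string, unknown>;
  isError?: boolean;
}

// Check that every expected key deep-equals the corresponding key in the
// result's structuredContent (JSON stringify as a simple deep comparison).
function hasStructuredContent(
  result: ToolResult,
  expected: Record<string, unknown>
): boolean {
  const actual = result.structuredContent ?? {};
  return Object.entries(expected).every(
    ([key, value]) => JSON.stringify(actual[key]) === JSON.stringify(value)
  );
}

const result: ToolResult = {
  content: [{ type: 'text', text: 'Found 2 albums' }],
  structuredContent: { albums: ['Summer Slice', 'Winter Waves'] },
};

console.log(hasStructuredContent(result, { albums: ['Summer Slice', 'Winter Waves'] })); // true
```

A real matcher would additionally report a readable diff on failure, which is what makes it nicer than a raw equality assertion in test output.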

Who It's For

MCP App Developers

Stop manually refreshing ChatGPT and Claude after every code change. Write tests once, run them against both hosts automatically. Catch regressions before they ship.

MCP Server Authors

Test MCP servers written in any language. Run npx sunpeak test init --server URL to add test infrastructure to Python, Go, TypeScript, or Rust servers.

Coding Agents

Agents like Claude Code, Codex, and Cursor can run pnpm test to validate MCP Apps without manual testing in a real host. Automated testing in the agent loop.

Getting Started

Add sunpeak testing to any MCP project:

npx sunpeak test init --server URL

Then run pnpm test to execute E2E tests. See the testing documentation for the full guide.


Frequently Asked Questions

Do I need a sunpeak project to use the testing framework?

No. Run "npx sunpeak test init" in any JavaScript, TypeScript, Python, Go, or Rust project. It scaffolds Playwright config and a starter test file. For non-JS projects, it creates a self-contained tests/sunpeak/ directory with everything included.

What test runners does sunpeak use?

E2E tests use Playwright against the sunpeak inspector (replicated ChatGPT and Claude runtimes). Live tests use Playwright against real ChatGPT. You write standard Playwright assertions plus MCP-specific matchers like toHaveTextContent and toHaveStructuredContent.

How do simulation files work?

Simulation files are JSON fixtures in tests/simulations/ that define a tool call scenario: tool input, tool result, and optional server tool mocks. The inspector loads them to render your MCP App in a specific state. Each simulation is a reproducible test scenario you can assert against.

Can I test across ChatGPT and Claude automatically?

Yes. The sunpeak test runner uses Playwright projects to run each test against both ChatGPT and Claude host runtimes automatically. One test file, both hosts. Configure which hosts to test in defineConfig().
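For illustration, host selection and the visual-regression thresholds mentioned elsewhere on this page might be configured together in one place. The option names below are assumptions based on the features described here, not confirmed sunpeak API:

```typescript
import { defineConfig } from 'sunpeak/test';

export default defineConfig({
  // Which replicated host runtimes each test runs against (hypothetical keys).
  hosts: ['chatgpt', 'claude'],
  // Visual regression tuning: comparison threshold and max diff pixel ratio.
  visual: {
    threshold: 0.2,
    maxDiffPixelRatio: 0.01,
  },
});
```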

What is visual regression testing?

Run "pnpm test:visual" to capture screenshots of your MCP App and compare them against saved baselines. If the UI changes unexpectedly, the test fails with a diff image. Run "pnpm test:visual --update" to update baselines after intentional changes.

How do live tests differ from E2E tests?

E2E tests run against the local inspector with simulation fixtures. They are fast, deterministic, and free. Live tests run against real ChatGPT using Playwright. sunpeak handles auth, message sending, and iframe access. You only write assertions against the rendered app.

Does sunpeak testing work in CI/CD?

Yes. Add "pnpm test" to your CI pipeline. It starts the dev server automatically, runs E2E and visual regression tests, and shuts down when complete. No paid host accounts, API keys, or AI credits needed on CI runners.

What are evals in sunpeak?

Evals test whether different LLMs call your tools correctly. They connect to your MCP server, discover tools via MCP protocol, send prompts to multiple models (GPT-4o, Claude, Gemini, etc.), and assert that each model calls the right tools with the right arguments. Each eval runs N times per model to measure reliability. Run them with "pnpm test:eval".
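The run-N-times reliability measurement can be sketched in a few lines of TypeScript. passRate and the simulated flaky model below are illustrative stand-ins, not the framework's API (the real runner uses Vitest and the Vercel AI SDK):

```typescript
// One eval run: resolves true if the model called the right tool
// with the right arguments, false otherwise.
type EvalRun = () => Promise<boolean>;

// Run the eval n times and return the fraction of passing runs.
async function passRate(run: EvalRun, n: number): Promise<number> {
  let passed = 0;
  for (let i = 0; i < n; i++) {
    if (await run()) passed++;
  }
  return passed / n;
}

// Simulated model that calls the right tool 4 out of every 5 attempts.
let call = 0;
const flaky: EvalRun = async () => ++call % 5 !== 0;

passRate(flaky, 10).then((rate) => console.log(rate)); // 0.8
```

Reporting a rate rather than a single pass/fail is the point: tool calling is stochastic, so a model that succeeds 8 of 10 times is meaningfully different from one that succeeds 10 of 10.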

Is sunpeak testing free?

Yes. sunpeak is MIT licensed and open source. The testing framework, inspector, CLI, and all tooling are free to use. Evals require API keys for the LLM providers you want to test against.

Open Source & MIT Licensed

sunpeak is free to use, modify, and distribute.

Want to inspect MCP Apps interactively? See the Inspector page. Building MCP Apps? See the MCP App Framework page.