MCP App Testing Framework
Attach your local or remote MCP server to run E2E tests, visual regression tests, and multi-model evals, or run live browser tests against the real ChatGPT and Claude.
Add it to any MCP server in any language. No paid accounts, no AI credits.
npx sunpeak test init --server URL

What is the sunpeak testing framework?
MCP Apps run inside AI hosts like ChatGPT and Claude, not in a browser you control. Every code change means deploying, opening the host, triggering the tool, and checking the result manually. The sunpeak testing framework replicates those host runtimes locally so you can test automatically.
The sunpeak testing framework is an open-source (MIT) framework that provides automated testing for MCP Apps, ChatGPT Apps, and Claude Connectors. It works with any MCP server in any language.
E2E Tests
Playwright tests against simulated ChatGPT and Claude hosts
Visual Regression
Screenshot comparison against saved baselines
Multi-Model Evals
Test tool calling across GPT-4o, Claude, Gemini, and other LLMs
Live Host Tests
Playwright tests against real ChatGPT and Claude
Manual Testing
Inspect MCP Apps in the local sunpeak inspector
Test CLI
| Command | What it runs | Runtime |
|---|---|---|
| pnpm test | E2E tests | Playwright + inspector |
| pnpm test:visual | E2E + visual regression | Playwright + inspector + screenshots |
| pnpm test:live | Live tests against real ChatGPT | Playwright + real host |
| pnpm test:eval | Evals against multiple LLM models | Vitest + Vercel AI SDK |
| npx sunpeak test init | Scaffold test infrastructure | Adds Playwright config, tests, and evals |
How It Works
Scaffold Tests
Run npx sunpeak test init in your project. It detects your project type (JS/TS, Python, Go, Rust) and creates
Playwright config, test files, and eval scaffolding. For non-JS projects, it creates a
self-contained tests/sunpeak/ directory.
Define Simulations
Create JSON fixtures in tests/simulations/ that define tool input, tool result, and server tool mocks. Each simulation is a reproducible
state your resource can render. The inspector loads them automatically.
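As a sketch, a simulation fixture might look like the following (the field names here are assumptions for illustration, not a confirmed schema; check the examples that `npx sunpeak test init` generates in tests/simulations/ for the exact shape):

```json
{
  "tool": "show-albums",
  "input": {},
  "result": {
    "structuredContent": {
      "albums": [{ "id": "1", "title": "Summer Slice" }]
    }
  },
  "serverToolMocks": {
    "get-album-details": {
      "result": { "structuredContent": { "id": "1", "tracks": 12 } }
    }
  }
}
```

Because the fixture fully specifies tool input, tool result, and mocked backend responses, the same rendered state can be reproduced on every test run.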
Write Tests
Import `{ test, expect }` from `sunpeak/test`. Use the `inspector` fixture to call tools, set themes and display modes, and assert against the rendered resource with Playwright locators and MCP-specific matchers.
Run in CI/CD
Add pnpm test to your pipeline. It starts the dev server, runs E2E tests against both ChatGPT and
Claude runtimes, and shuts down when complete. No accounts, keys, or credits on your CI
runners.
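For CI, a minimal GitHub Actions job could look like the sketch below (the workflow layout and Node setup steps are assumptions; any CI system that can run pnpm works the same way):

```yaml
name: mcp-app-tests
on: [push, pull_request]
jobs:
  e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      - run: npm install -g pnpm
      - run: pnpm install
      # Install the browser Playwright drives against the inspector
      - run: pnpm exec playwright install --with-deps chromium
      # Starts the dev server, runs E2E tests for both hosts, shuts down
      - run: pnpm test
```

No host credentials or AI provider keys are needed for this job, since E2E tests run against the local inspector.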
```ts
import { test, expect } from 'sunpeak/test';

test('albums render in light mode', async ({ inspector }) => {
  const result = await inspector.renderTool('show-albums', undefined, { theme: 'light' });
  const app = result.app();
  await expect(app.locator('button:has-text("Summer Slice")')).toBeVisible();
});

test('albums render in fullscreen', async ({ inspector }) => {
  const result = await inspector.renderTool('show-albums', undefined, { displayMode: 'fullscreen' });
  const app = result.app();
  await expect(app.locator('button:has-text("Summer Slice")')).toBeVisible();

  // Compare against saved baseline (only runs with --visual flag)
  await result.screenshot('albums-fullscreen');
});
```

What You Can Test
- Multi-Host Rendering
  Tests run against both ChatGPT and Claude runtimes automatically via Playwright projects. One test file covers both hosts.
- Themes & Display Modes
  Test light/dark themes and inline/fullscreen/pip display modes. Use `setTheme()` and `setDisplayMode()`, or pass options to `callTool()`.
- Visual Regression
  Capture screenshots with `result.screenshot()` and compare against baselines. Configure thresholds and max diff pixel ratios in `defineConfig()`.
- Backend Tool Mocking
  Simulation files can mock `callServerTool` responses with simple or conditional matching. Test interactive flows without a real backend.
- MCP-Specific Assertions
  Custom matchers: `toHaveTextContent()`, `toHaveStructuredContent()`, and `toBeError()`, alongside standard Playwright locators.
- Multi-Model Evals
  Send prompts to GPT-4o, Claude, Gemini, and other models. Assert each model calls the right tools with the right arguments. Each eval runs N times per model to measure reliability.
- Any MCP Server, Any Language
  Use `npx sunpeak test init` with any MCP server. Configure the server via HTTP URL or startup command. Python, Go, TypeScript, Rust, anything.
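Several of the capabilities above are configured in one place. A hedged sketch of what that config might look like, assuming a `defineConfig()` export from `sunpeak/test` with `hosts` and `visual` options (the option names here are illustrative, not confirmed API):

```ts
import { defineConfig } from 'sunpeak/test';

export default defineConfig({
  // Which simulated host runtimes every test runs against
  hosts: ['chatgpt', 'claude'],
  // Visual regression tolerances (illustrative option names)
  visual: {
    threshold: 0.2,          // per-pixel color distance tolerance
    maxDiffPixelRatio: 0.01, // fail if more than 1% of pixels differ
  },
});
```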
Who It's For
MCP App Developers
Stop manually refreshing ChatGPT and Claude after every code change. Write tests once, run them against both hosts automatically. Catch regressions before they ship.
MCP Server Authors
Test MCP servers written in any language. Run npx sunpeak test init --server URL to add test infrastructure to Python, Go, TypeScript, or Rust servers.
Coding Agents
Agents like Claude Code, Codex, and Cursor can run pnpm test to validate MCP Apps without manual testing in a real host. Automated testing in the agent
loop.
Getting Started
Add sunpeak testing to any MCP project:
npx sunpeak test init --server URL
Then run pnpm test to execute E2E tests. See the testing documentation for the full guide.
Frequently Asked Questions
Do I need a sunpeak project to use the testing framework?
No. Run "npx sunpeak test init" in any JavaScript, TypeScript, Python, Go, or Rust project. It scaffolds Playwright config and a starter test file. For non-JS projects, it creates a self-contained tests/sunpeak/ directory with everything included.
What test runners does sunpeak use?
E2E tests use Playwright against the sunpeak inspector (replicated ChatGPT and Claude runtimes). Live tests use Playwright against real ChatGPT. You write standard Playwright assertions plus MCP-specific matchers like toHaveTextContent and toHaveStructuredContent.
How do simulation files work?
Simulation files are JSON fixtures in tests/simulations/ that define a tool call scenario: tool input, tool result, and optional server tool mocks. The inspector loads them to render your MCP App in a specific state. Each simulation is a reproducible test scenario you can assert against.
Can I test across ChatGPT and Claude automatically?
Yes. The sunpeak test runner uses Playwright projects to run each test against both ChatGPT and Claude host runtimes automatically. One test file, both hosts. Configure which hosts to test in defineConfig().
What is visual regression testing?
Run "pnpm test:visual" to capture screenshots of your MCP App and compare them against saved baselines. If the UI changes unexpectedly, the test fails with a diff image. Run "pnpm test:visual --update" to update baselines after intentional changes.
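To make the pass/fail rule concrete, here is a small, self-contained sketch of how a max-diff-pixel-ratio check works in principle (an illustration of the idea only, not sunpeak's or Playwright's actual implementation):

```typescript
type Pixel = [number, number, number]; // RGB

// Compare two same-size images pixel by pixel and return the
// fraction of pixels that differ from the baseline.
function diffPixelRatio(baseline: Pixel[], actual: Pixel[]): number {
  if (baseline.length !== actual.length) {
    throw new Error('baseline/actual size mismatch');
  }
  let differing = 0;
  for (let i = 0; i < baseline.length; i++) {
    const [r1, g1, b1] = baseline[i];
    const [r2, g2, b2] = actual[i];
    if (r1 !== r2 || g1 !== g2 || b1 !== b2) differing++;
  }
  return differing / baseline.length;
}

// Two 4-pixel "images" where exactly one pixel changed: ratio = 0.25
const baseline: Pixel[] = [[0, 0, 0], [255, 255, 255], [10, 20, 30], [40, 50, 60]];
const actual: Pixel[]   = [[0, 0, 0], [255, 255, 255], [10, 20, 31], [40, 50, 60]];

const ratio = diffPixelRatio(baseline, actual);
console.log(ratio); // 0.25

// With a max diff pixel ratio of 0.1, this comparison would fail
const maxDiffPixelRatio = 0.1;
console.log(ratio <= maxDiffPixelRatio ? 'pass' : 'fail'); // fail
```

Real comparisons typically also apply a per-pixel color tolerance before counting a pixel as "differing", which is what a threshold setting controls.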
How do live tests differ from E2E tests?
E2E tests run against the local inspector with simulation fixtures. They are fast, deterministic, and free. Live tests run against real ChatGPT using Playwright. sunpeak handles auth, message sending, and iframe access. You only write assertions against the rendered app.
Does sunpeak testing work in CI/CD?
Yes. Add "pnpm test" to your CI pipeline. It starts the dev server automatically, runs E2E and visual regression tests, and shuts down when complete. No paid host accounts, API keys, or AI credits needed on CI runners.
What are evals in sunpeak?
Evals test whether different LLMs call your tools correctly. They connect to your MCP server, discover tools via MCP protocol, send prompts to multiple models (GPT-4o, Claude, Gemini, etc.), and assert that each model calls the right tools with the right arguments. Each eval runs N times per model to measure reliability. Run them with "pnpm test:eval".
Is sunpeak testing free?
Yes. sunpeak is MIT licensed and open source. The testing framework, inspector, CLI, and all tooling are free to use. Evals require API keys for the LLM providers you want to test against.
Open Source & MIT Licensed
sunpeak is free to use, modify, and distribute.
Want to inspect MCP Apps interactively? See the Inspector page. Building MCP Apps? See the MCP App Framework page.