MCP App CI/CD: Run Your Tests in GitHub Actions (April 2026)

Abe Wheeler
GitHub Actions running MCP App tests against the sunpeak inspector, no paid host accounts needed.

Here’s the GitHub Actions workflow I’d put in every MCP App project. It runs unit and end-to-end tests on every push, against both the ChatGPT and Claude inspector hosts, with no paid AI subscriptions required on your runners.

TL;DR: Add pnpm test to your GitHub Actions workflow. It runs both unit and e2e tests in a single command. Use pnpm test:visual to add visual regression tests, pnpm test:unit / pnpm test:e2e to run them separately, or pnpm test:eval for multi-model evals that test tool calling across GPT-4o, Claude, Gemini, and other LLMs. sunpeak’s testing framework handles the rest: no external dependencies, no accounts, no credits.

Why CI/CD Matters for MCP Apps

MCP Apps have more moving pieces than a typical web app. Your resource components run inside an iframe, sandboxed by the host. Tool handlers run on a Node.js server. Data flows through the MCP protocol. Display mode transitions, theme switching, and cross-host differences all affect rendering.

Manual testing in a real ChatGPT or Claude session catches some of this. But it’s slow, costs credits, and doesn’t scale. Every team member needs a paid account. Every code change requires a manual 4-click refresh cycle. And you can’t test both hosts in one click.

CI/CD with sunpeak eliminates all of that. The inspector runs both hosts locally, so your GitHub Actions runners can test the full ChatGPT and Claude rendering paths without any external dependencies.

What You’re Working With

A sunpeak MCP App project includes these test commands out of the box:

pnpm test           # Unit + E2E tests
pnpm test:unit      # Unit tests only (Vitest)
pnpm test:e2e       # E2E tests only (Playwright)
pnpm test:visual    # E2E + visual regression tests
pnpm test:eval      # Evals against multiple LLM models
pnpm test:live      # Live tests against real hosts

pnpm test runs both unit and e2e tests. Unit tests run your resource components in happy-dom, fast, no browser, no server. End-to-end tests run the full app in Chromium against the local inspector using the inspector fixture from sunpeak/test. Visual regression tests add screenshot comparison on top of e2e. All are CI-ready as scaffolded.
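
For reference, a unit test is plain Vitest. Here’s a minimal sketch; formatCount is a hypothetical helper I made up for illustration, not something the scaffold ships:

// tests/unit/format.test.ts
import { expect, it } from 'vitest';

// Hypothetical helper: format a raw count the way the dashboard displays it.
const formatCount = (n: number) => n.toLocaleString('en-US');

it('formats visit counts with thousands separators', () => {
  expect(formatCount(4218)).toBe('4,218');
});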

The Playwright config is a one-liner:

// playwright.config.ts
import { defineConfig } from 'sunpeak/test/config';
export default defineConfig();

This handles dev server startup, port allocation, and multi-host project setup. You don’t need to configure a web server step or set up port forwarding.

The Test Workflow

Put this in .github/workflows/test.yml:

name: Test

on:
  push:
    branches: ['**']

jobs:
  test:
    name: Unit + E2E Tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5

      - uses: pnpm/action-setup@v4
        with:
          version: 10

      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: pnpm

      - name: Install dependencies
        run: pnpm install --frozen-lockfile

      - name: Install Playwright browsers
        run: pnpm exec playwright install --with-deps chromium

      - name: Run tests
        run: pnpm test

pnpm test runs both unit and e2e tests. Unit tests run first with Vitest, then e2e tests run with Playwright against the inspector.

The key step is playwright install --with-deps chromium. GitHub Actions runners don’t ship Playwright browsers. This installs Chromium plus its system dependencies (fonts, libssl, etc.) before your tests run.

You do not need to start the inspector separately. The defineConfig() from sunpeak/test/config includes a webServer block that starts the dev server before the test suite and shuts it down after. The port is dynamically allocated to avoid collisions. On CI (GitHub Actions always sets process.env.CI to 'true'), it starts a fresh server; locally, it reuses an existing one if the port is already in use.
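
For intuition, here’s roughly what defineConfig() is doing for you under the hood. This is an illustrative sketch, not the generated config; the command, port, and project names are my assumptions:

// Hand-written equivalent of sunpeak's defineConfig() (illustrative only)
import { defineConfig } from '@playwright/test';

export default defineConfig({
  webServer: {
    command: 'pnpm dev',                  // start the dev server + inspector
    port: 3000,                           // sunpeak allocates this dynamically
    reuseExistingServer: !process.env.CI, // fresh server on CI, reuse locally
  },
  // One Playwright project per host, so every test runs against both
  projects: [{ name: 'chatgpt' }, { name: 'claude' }],
});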

Adding Visual Regression Tests

To catch unintended UI changes, run pnpm test:visual:

      - name: Run tests with visual regression
        run: pnpm test:visual

This runs e2e tests and compares screenshots against baseline images stored in your repo. The first run generates baselines. Subsequent runs fail if any pixels differ beyond the threshold. Visual tests run against both ChatGPT and Claude hosts, so you catch host-specific rendering regressions automatically.
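
If you want an explicit screenshot assertion in your own specs, Playwright’s toHaveScreenshot works on a locator. A sketch, assuming app() exposes a frame you can drill into the same way as in the e2e example below:

// tests/e2e/dashboard.visual.spec.ts
import { test, expect } from 'sunpeak/test';

test('dashboard visual baseline', async ({ inspector }) => {
  const result = await inspector.renderTool('get_dashboard', undefined, { displayMode: 'inline' });
  // Compared against a committed baseline image; the first run writes it.
  await expect(result.app().locator('body')).toHaveScreenshot('dashboard-inline.png');
});

To refresh baselines after an intentional UI change, Playwright’s --update-snapshots flag applies, assuming test:visual forwards extra arguments: pnpm test:visual --update-snapshots.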

You can also run unit tests alongside visual tests by running both pnpm test:unit and pnpm test:visual.

Testing Both Hosts in CI

sunpeak’s testing framework runs each test against both ChatGPT and Claude hosts automatically. The defineConfig() from sunpeak/test/config sets up Playwright projects for each host. You don’t need to loop over hosts in your test code:

// tests/e2e/dashboard.spec.ts
import { test, expect } from 'sunpeak/test';

test('dashboard renders', async ({ inspector }) => {
  const result = await inspector.renderTool('get_dashboard', undefined, { displayMode: 'inline' });
  const app = result.app();
  await expect(app.locator('text=4,218')).toBeVisible();
});

This test runs twice in CI, once for ChatGPT and once for Claude. Both run in the same Playwright process against the same inspector server. No extra services, no paid accounts.
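
When an expectation legitimately differs between hosts, you can branch on Playwright’s project name via testInfo rather than duplicating the test. A sketch; the 'claude' project name and the header element are my assumptions:

// tests/e2e/host-specific.spec.ts
import { test, expect } from 'sunpeak/test';

test('header renders on both hosts', async ({ inspector }, testInfo) => {
  const result = await inspector.renderTool('get_dashboard', undefined, { displayMode: 'inline' });
  const app = result.app();
  await expect(app.locator('header')).toBeVisible();
  // Project names come from defineConfig(); 'claude' is an assumption.
  if (testInfo.project.name.includes('claude')) {
    // Claude-specific expectations go here
  }
});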

Caching Dependencies

On a typical MCP App project, pnpm install takes 30-60 seconds on a cold runner. actions/setup-node’s built-in pnpm cache (the cache: pnpm line in the workflow above) brings that down to a few seconds on warm runs.

For Playwright browsers, caching is more involved. Add this to your e2e workflow if you want to avoid re-downloading Chromium on every run:

- name: Cache Playwright browsers
  uses: actions/cache@v4
  id: playwright-cache
  with:
    path: ~/.cache/ms-playwright
    key: playwright-chromium-${{ runner.os }}-${{ hashFiles('**/pnpm-lock.yaml') }}

- name: Install Playwright browsers
  if: steps.playwright-cache.outputs.cache-hit != 'true'
  run: pnpm exec playwright install --with-deps chromium

This installs browsers only when your lockfile changes (which is when Playwright might have been updated). Cache hits are common because lockfile changes are infrequent relative to code changes.
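
One caveat: the cache holds the browser binaries under ~/.cache/ms-playwright, but not the system packages that --with-deps installs, and GitHub-hosted runners start from a clean image every run. To be safe, install the system dependencies even on a cache hit:

- name: Install Playwright system deps
  if: steps.playwright-cache.outputs.cache-hit == 'true'
  run: pnpm exec playwright install-deps chromium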

Viewing Test Results

Playwright generates an HTML report artifact. Upload it so you can inspect failures without re-running:

- name: Upload Playwright report
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: playwright-report
    path: playwright-report/
    retention-days: 7

The if: always() ensures the artifact uploads even when tests fail, which is exactly when you need it.

For unit tests, Vitest’s verbose reporter writes directly to the Actions log. No artifact needed.
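
If you’d also like failures annotated inline in the Actions UI, Vitest ships a github-actions reporter. A sketch to merge into whatever Vitest config your scaffold already has:

// vitest.config.ts
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    // Adds GitHub annotations on CI; plain output locally.
    reporters: process.env.CI ? ['default', 'github-actions'] : ['default'],
  },
});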

Organizing Simulations for CI

Simulation files in tests/simulations/ are the data layer for your e2e tests. Each one defines a tool invocation with mock input and output:

{
  "tool": "get_dashboard",
  "userMessage": "Show me this week's analytics",
  "toolInput": { "timeRange": "7d" },
  "toolResult": {
    "structuredContent": {
      "visits": 4218,
      "conversions": 83,
      "bounceRate": 0.41
    }
  }
}

One simulation file per meaningful UI state: happy path, empty state, error state, and each display mode with a distinct layout. These files run identically in CI and locally. There’s no remote data, no API calls, no rate limits.
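
As one more illustration, an error-state simulation might look like this; the exact toolResult shape follows the MCP tool result format, so check your scaffold’s examples for the precise schema:

{
  "tool": "get_dashboard",
  "userMessage": "Show me this week's analytics",
  "toolInput": { "timeRange": "7d" },
  "toolResult": {
    "isError": true,
    "content": [{ "type": "text", "text": "Analytics service unavailable" }]
  }
}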

Because simulations cover the full state space of your UI, a green CI run against the inspector gives you the same confidence as a green run against real ChatGPT or Claude. The hosts implement the same MCP App standard. The runtime behavior is identical.

The Full Picture

One workflow file, a few dozen lines of YAML, and every push (and with it, every pull request) gets:

  • Unit tests validating component logic
  • End-to-end tests against the ChatGPT host
  • End-to-end tests against the Claude host
  • A downloadable HTML report for any failure

For tool-calling confidence, add evals with pnpm test:eval. Evals send prompts to GPT-4o, Claude, Gemini, and other models, then assert that each model calls the right tools with the right arguments. Each eval case runs multiple times per model, so you get statistical pass/fail rates (e.g. “8/10 passed on GPT-4o”). Evals require API keys set as GitHub Actions secrets, so keep them in a separate job that only runs on main or release branches:

  eval:
    name: Multi-Model Evals
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v5
      - uses: pnpm/action-setup@v4
        with:
          version: 10
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: pnpm
      - run: pnpm install --frozen-lockfile
      - name: Run evals
        run: pnpm test:eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          GOOGLE_GENERATIVE_AI_API_KEY: ${{ secrets.GOOGLE_GENERATIVE_AI_API_KEY }}

No paid subscriptions on your CI runners. No credits burned per test. No manual clicking. The only cost is the GitHub Actions compute minutes, and because the inspector runs entirely in-process, tests are fast enough that a full suite typically completes in under two minutes. Evals add API costs for the LLM providers you test against.

That’s the CI setup I’d ship with every MCP App project from day one.

Get Started

npx sunpeak new

Frequently Asked Questions

Do I need a ChatGPT or Claude account to run MCP App tests in CI/CD?

No. sunpeak's inspector replicates both the ChatGPT and Claude runtimes locally. Your GitHub Actions runner starts the inspector, runs tests against it, and shuts it down. Zero paid accounts, zero AI credits burned. The only dependencies are Node.js and the packages in your lockfile.

How do I run MCP App end-to-end tests in GitHub Actions?

Install dependencies with pnpm install, install Playwright browsers with pnpm exec playwright install --with-deps chromium, then run pnpm test. The testing framework automatically starts the sunpeak dev server before your tests and shuts it down after.

What is the difference between unit tests and e2e tests for MCP Apps in CI/CD?

Unit tests (pnpm test:unit, powered by Vitest) run your resource components in happy-dom without a real browser or server. They are fast and good for component logic. E2E tests (pnpm test:e2e, powered by Playwright with the inspector fixture) run your full MCP App in a real browser against the sunpeak inspector. Running pnpm test without flags runs both. Add pnpm test:visual for visual regression screenshot tests. All belong in CI/CD.

How do I cache pnpm dependencies in GitHub Actions for MCP App projects?

The simplest approach is actions/setup-node’s built-in pnpm cache: pass cache: pnpm to the setup-node action, as shown in the workflow above. It caches the pnpm store keyed on your pnpm-lock.yaml hash and restores it before pnpm install, so packages aren’t re-downloaded on every run. On a typical MCP App project this cuts install time from 30-60 seconds to a few seconds.

Can I run MCP App tests against both ChatGPT and Claude hosts in CI/CD?

Yes. pnpm test runs against both hosts automatically via Playwright projects. You do not need to loop over hosts manually. The defineConfig() from sunpeak/test/config sets up both host projects, and each test runs once per host with no extra configuration on the CI side.

How do I run Playwright in headless mode in GitHub Actions for MCP App tests?

Playwright runs headless by default. You do not need to configure anything extra. Install Playwright browsers with pnpm exec playwright install --with-deps chromium before running pnpm test, and the tests will run headless on the GitHub Actions runner without a display server.