Deployment & Observability for Micro-Frontends

Independent deployment is the headline benefit of micro-frontends, and also the source of their hardest operational failures. The moment a remote ships on its own schedule, the assumptions baked into monolithic release and monitoring tooling quietly break. This guide sits under Core Micro-Frontend Architecture Tradeoffs and covers the four deployment and observability disciplines that keep autonomous remotes from turning into an undebuggable distributed system. Distributed tracing across micro-frontends keeps a single user action correlated as it crosses host and remote boundaries. Error boundary telemetry for remote apps makes a crashing remote report itself instead of silently dragging the page down. CDN cache invalidation for federated remotes ensures a fresh deploy is actually served. Progressive rollout with feature flags for remotes means a bad version reaches one percent of users instead of all of them.

What Breaks When Remotes Deploy Independently

In a monolith, a deploy is atomic. One artifact ships, one version runs, one source map resolves every stack trace, and one cache-busting hash invalidates everything at once. Federation discards all four guarantees simultaneously.

Consider the failure mode in concrete terms. A user clicks “Checkout” in the host shell. The host renders a checkout remote, which calls a payments remote, which reads a token managed by an auth remote. The payment fails. Your monitoring shows a 500 from one backend, an unhandled rejection in the browser console, and nothing connecting the two — because each remote was built by a different team, deployed at a different time, and reports to its own dashboard with its own request IDs.

The pain compounds across four axes:

Solving these is not optional polish. It is the cost of admission for independent deployment, and it is governed by how you treat versioning. The remote contract you ship — covered in versioning strategies for remote apps — determines whether a rollback is a one-line manifest revert or a coordinated firefight.

[Figure: Observability and cache flow across host and remotes]
One root span in the host flows as W3C traceparent into each remote; all spans and errors land in a shared collector under one trace id, while the CDN serves immutable, versioned entry files.

Key Objectives

A deployment and observability strategy for federated remotes should deliver:

CI/CD Pipeline Shape for Independent Remote Deploys

The pipeline’s job is to publish a new, immutably named remoteEntry.js to the CDN and then atomically flip a manifest pointer that the host resolves at runtime. The host never rebuilds. This decoupling requires the host to resolve remote URLs at runtime (for example, a dynamic remote whose URL comes from the manifest) instead of hard-coding them in the Module Federation remotes map at build time. Publishing new bytes under a new URL and repointing the manifest is then the entire deploy.

Here is an annotated GitHub Actions pipeline for a single remote. Note where tracing, cache headers, flagging, and rollback hooks attach.

name: deploy-remote-checkout
on:
  push:
    branches: [main]

concurrency:
  group: deploy-checkout       # serialize deploys of THIS remote only
  cancel-in-progress: false    # never cancel a deploy mid-flight

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm

      - run: npm ci

      # Inject the build version into the bundle so spans/errors can tag it.
      - name: Set version
        run: echo "REMOTE_VERSION=${GITHUB_SHA::8}" >> "$GITHUB_ENV"

      # Output is remoteEntry.[contenthash].js — content-addressed, immutable.
      - name: Build
        run: npm run build
        env:
          REMOTE_VERSION: ${{ env.REMOTE_VERSION }}
          PUBLIC_PATH: https://cdn.example.com/checkout/${{ env.REMOTE_VERSION }}/

      # Contract test: assert exposed module surface still matches the host's expectation
      # before anything reaches the CDN. A breaking change fails the deploy here.
      - name: Verify remote contract
        run: npm run test:contract

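      # Credentials for the S3 steps below; secret names and region are placeholders.
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      # Upload hashed assets with a long, immutable cache; never the manifest yet.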
      - name: Upload assets to CDN
        run: |
          aws s3 sync dist/ s3://mfe-cdn/checkout/${REMOTE_VERSION}/ \
            --cache-control "public,max-age=31536000,immutable" \
            --exclude "manifest.json"

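      # Smoke test the freshly uploaded entry in isolation (loads, exposes modules).
      # The entry file name is content-hashed, so read it from dist/ instead of guessing.
      - name: Smoke test deployed entry
        run: |
          ENTRY=$(cd dist && ls remoteEntry.*.js)
          npm run test:smoke -- --url "https://cdn.example.com/checkout/${REMOTE_VERSION}/${ENTRY}"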

      # Register the version behind a flag — 0% traffic until rollout begins.
      - name: Register version with flag service
        run: |
          curl -fsS -X POST "$FLAG_API/remotes/checkout/versions" \
            -H "Authorization: Bearer $FLAG_TOKEN" \
            -H "Content-Type: application/json" \
            -d "{\"version\":\"${REMOTE_VERSION}\",\"rollout\":0}"
        env:
          FLAG_API: ${{ secrets.FLAG_API }}
          FLAG_TOKEN: ${{ secrets.FLAG_TOKEN }}

      # Flip the pointer LAST, with a short TTL so a rollback propagates fast.
      - name: Publish manifest pointer
        run: |
          echo "{\"checkout\":\"${REMOTE_VERSION}\"}" > manifest.json
          aws s3 cp manifest.json s3://mfe-cdn/checkout/manifest.json \
            --cache-control "public,max-age=30,must-revalidate"

The two cache policies are the crux. Hashed assets get max-age=31536000,immutable because their content can never change under a given URL. The small manifest.json that maps a remote name to its current version gets a short TTL (max-age=30) so that flipping or reverting the pointer reaches users in seconds. This separation — covered in depth in CDN cache invalidation for federated remotes — gives you both aggressive caching and fast rollback.

Rollback is a Pointer Revert

Because the host resolves remotes through the manifest at runtime, a rollback is one operation: rewrite manifest.json to the previous version string. The old immutable assets are still on the CDN, so the previous version is instantly serviceable. No rebuild, no re-upload — just flip the pointer back and wait out the 30-second TTL.
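The revert target can be computed from a record of earlier publishes. A minimal sketch, assuming a hypothetical append-only history of version strings (oldest first) kept next to the manifest; rollbackManifest only computes the new pointer, and uploading it would reuse the pipeline's short-TTL aws s3 cp step.

```typescript
// Manifest shape used by the pipeline: remote name -> version string.
type Manifest = Record<string, string>;

// Compute the manifest that reverts `remote` to the version published just
// before the one it currently points at. `history` is a hypothetical
// append-only list of versions in publish order (oldest first).
export function rollbackManifest(
  manifest: Manifest,
  remote: string,
  history: readonly string[],
): Manifest {
  const current = manifest[remote];
  const idx = history.lastIndexOf(current);
  if (idx === -1) {
    throw new Error(`current version ${current} of ${remote} is not in history`);
  }
  if (idx === 0) {
    throw new Error(`no earlier version of ${remote} to roll back to`);
  }
  // Only the pointer changes; the older immutable assets are still on the CDN.
  return { ...manifest, [remote]: history[idx - 1] };
}
```

For example, rollbackManifest({ checkout: 'a1b2c3d4' }, 'checkout', ['9f8e7d6c', 'a1b2c3d4']) returns { checkout: '9f8e7d6c' }. Calling it again on the result walks one more version back.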

Correlating Traces Across Host and Remotes

The unit of correlation is the W3C traceparent, a single header format every remote can agree on regardless of framework. The host starts a root span when a user journey begins. Each remote continues that trace for its in-page operations and propagates the context onto its outbound fetch calls, so the backend trace stitches to the frontend one.
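For reference, the header itself is small. Below is a minimal parser for the version-00 traceparent format from the W3C Trace Context specification: four dash-separated lowercase-hex fields (version, 32-char trace-id, 16-char parent-id, 2-char trace-flags). It is illustrative only, since the W3C propagator registered through OpenTelemetry already does this, and the sketch ignores tracestate and future header versions.

```typescript
// Parsed form of a W3C Trace Context traceparent header (version 00).
export interface TraceParent {
  traceId: string;  // 32 lowercase hex chars, shared by every span in the trace
  parentId: string; // 16 lowercase hex chars, the calling span's id
  sampled: boolean; // bit 0x01 of trace-flags
}

const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null for malformed headers and for the all-zero ids the spec forbids.
export function parseTraceParent(header: string): TraceParent | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (!m) return null;
  const [, traceId, parentId, flags] = m;
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Serialize a context back to a header; only the sampled flag bit is kept.
export function formatTraceParent(tp: TraceParent): string {
  return `00-${tp.traceId}-${tp.parentId}-${tp.sampled ? '01' : '00'}`;
}
```

Every remote that forwards this header unchanged in its trace-id field keeps its spans in the same trace, whatever framework or vendor it uses.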

The mechanism is a shared tracing context exposed by the host and consumed by remotes. Because all remotes share a single tracer instance (the same singleton discipline you use for avoiding bundle duplication of React), spans nest correctly instead of starting orphaned traces.

// host/tracing.ts: initialized once and exposed to every remote as the federated
// module 'host/tracing'. Assumes the host registers an OpenTelemetry web SDK at
// startup (tracer provider, async context manager, W3C propagator); without one,
// these API calls are no-ops.
import {
  context,
  trace,
  propagation,
  SpanStatusCode,
  type Span,
} from '@opentelemetry/api';

// Injected at build time, e.g. with webpack's DefinePlugin.
declare const __HOST_VERSION__: string;

const tracer = trace.getTracer('mfe-host', __HOST_VERSION__);

// Remotes call this to wrap any operation in a child span of the active trace.
export function withRemoteSpan<T>(
  remoteName: string,
  remoteVersion: string,
  fn: (span: Span) => Promise<T>,
): Promise<T> {
  const span = tracer.startSpan(`remote:${remoteName}`, {
    attributes: {
      'mfe.remote.name': remoteName,
      'mfe.remote.version': remoteVersion, // the 8-char GITHUB_SHA from the pipeline
    },
  });
  // Run fn inside this span's context so nested spans + fetches inherit it.
  return context.with(trace.setSpan(context.active(), span), async () => {
    try {
      return await fn(span);
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR }); // mark the span failed, not just annotated
      throw err;
    } finally {
      span.end();
    }
  });
}

// A fetch wrapper that injects W3C traceparent into outbound requests so the
// backend trace joins the same trace id as the frontend span.
export async function tracedFetch(input: RequestInfo, init: RequestInit = {}) {
  const headers = new Headers(init.headers);
  propagation.inject(context.active(), headers, {
    set: (carrier, key, value) => (carrier as Headers).set(key, value),
  });
  return fetch(input, { ...init, headers });
}

A remote consumes this without importing its own tracer:

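// checkout-remote/submit.ts
import { withRemoteSpan, tracedFetch } from 'host/tracing'; // federated module exposed by the host

// Injected at build time from REMOTE_VERSION (the 8-char SHA set in the pipeline).
declare const __REMOTE_VERSION__: string;

// Minimal payload shape for this example.
interface OrderPayload { items: unknown[] }

export function submitOrder(payload: OrderPayload) {
  return withRemoteSpan('checkout', __REMOTE_VERSION__, async (span) => {
    span.setAttribute('order.items', payload.items.length);
    const res = await tracedFetch('/api/orders', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(payload),
    });
    span.setAttribute('http.status_code', res.status);
    // Throw on non-2xx so withRemoteSpan records the failure on the span.
    if (!res.ok) throw new Error(`POST /api/orders failed: ${res.status}`);
    return res.json();
  });
}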

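Now the checkout span, any spans nested inside it (an auth remote's token lookup, for instance, if it also uses withRemoteSpan), and the backend POST /api/orders span (provided the backend extracts traceparent) all carry the same trace id. The full pattern, including sampling, context propagation across lazy-loaded remotes, and avoiding double instrumentation, is detailed in distributed tracing across micro-frontends.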

Error Boundary Telemetry

A traced happy path is half the story. The other half is making failures loud and attributable. Two failure classes matter: a remote that fails to load (network 404, chunk-load error, integrity mismatch) and a remote that throws after mounting.

Wrap every remote mount in an error boundary that reports to the same collector, tagged with the remote’s name and version so the right team is paged.

// host/RemoteBoundary.tsx
import { Component, type ErrorInfo, type ReactNode } from 'react';
import { trace } from '@opentelemetry/api';

interface Props { remote: string; version: string; children: ReactNode; }
interface State { failed: boolean; }

export class RemoteBoundary extends Component<Props, State> {
  state: State = { failed: false };

  static getDerivedStateFromError(): State {
    return { failed: true };
  }

  componentDidCatch(error: Error, info: ErrorInfo) {
    // May be undefined if no span is active when React invokes this hook.
    const span = trace.getActiveSpan();
    span?.recordException(error);
    // Attribute the failure: which remote, which version, where in the tree.
    navigator.sendBeacon('/telemetry/errors', JSON.stringify({
      kind: 'remote_runtime_error',
      remote: this.props.remote,
      version: this.props.version,
      message: error.message,
      stack: error.stack,
      componentStack: info.componentStack,
      traceId: span?.spanContext().traceId,
    }));
  }

  render() {
    if (this.state.failed) {
      return <div role="alert">This section is temporarily unavailable.</div>;
    }
    return this.props.children;
  }
}

The boundary degrades gracefully — the host keeps working while one region shows a fallback — and it converts a silent blank space into a structured event carrying the trace id. Tying error volume to a specific version is also what makes automated rollback possible: a spike in remote_runtime_error for version a1b2c3d4 is the rollback signal. The deeper treatment, including load-failure telemetry and source-map symbolication per remote, is in error boundary telemetry for remote apps.
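The boundary above tags every failure as remote_runtime_error, but the two failure classes usually route differently: a load failure tends to implicate the CDN or the deploy, a runtime error the remote's code. A minimal classifier sketch, assuming webpack 5 Module Federation, which sets error.name to ChunkLoadError for a failed chunk and ScriptExternalLoadError for a remote entry script that fails to load (verify against your bundler version). The remote_load_error kind is a hypothetical addition, and other bundlers surface load failures differently.

```typescript
// Event kinds beaconed to /telemetry/errors. 'remote_load_error' is an
// assumed addition alongside the boundary's 'remote_runtime_error'.
export type RemoteFailureKind = 'remote_load_error' | 'remote_runtime_error';

// Error names webpack 5 assigns when a chunk or a federated remote entry
// fails to load (e.g. a 404 or an integrity mismatch). Adjust for your bundler.
const LOAD_ERROR_NAMES = new Set(['ChunkLoadError', 'ScriptExternalLoadError']);

export function classifyRemoteFailure(error: unknown): RemoteFailureKind {
  if (error instanceof Error && LOAD_ERROR_NAMES.has(error.name)) {
    return 'remote_load_error';
  }
  // Anything else, including non-Error throws, is treated as a runtime failure.
  return 'remote_runtime_error';
}
```

In componentDidCatch, kind: classifyRemoteFailure(error) would replace the hard-coded string, so dashboards can separate "the remote never arrived" from "the remote arrived and crashed".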

CDN Cache Invalidation & Versioned remoteEntry

The cache strategy follows directly from the pipeline. Two rules prevent the “fix deployed, stale bytes still served” failure mode:

  1. Content-address everything except the pointer. Every asset — including remoteEntry.[contenthash].js — lives under a version-scoped path and is served immutable. Its bytes never change, so it can cache forever and rollback is just re-pointing.
  2. Keep the manifest pointer tiny and short-lived. The only mutable object is manifest.json, served with a 30-second TTL. That is your invalidation surface — flip it, wait the TTL, done. You almost never issue a CDN purge.
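The two rules reduce to a single decision per uploaded object. A minimal sketch, assuming the S3 key layout from the pipeline (remote/version/... for assets, remote/manifest.json for the pointer) and 8-character hex versions from ${GITHUB_SHA::8}; cacheControlFor is a hypothetical helper that a pre-upload check could use to reject an unversioned remoteEntry.js before it is cached at a stable URL.

```typescript
// Cache-Control values from the pipeline's two upload steps.
const IMMUTABLE = 'public,max-age=31536000,immutable';
const POINTER = 'public,max-age=30,must-revalidate';

// Pick the policy for an object key such as "checkout/a1b2c3d4/remoteEntry.3f9a1c.js"
// or "checkout/manifest.json". The 8-hex-char version segment matches ${GITHUB_SHA::8}.
export function cacheControlFor(key: string): string {
  if (/^[^/]+\/manifest\.json$/.test(key)) return POINTER; // rule 2: the only mutable object
  if (/^[^/]+\/[0-9a-f]{8}\//.test(key)) return IMMUTABLE; // rule 1: version-scoped, never changes
  // Anything outside a version-scoped path breaks rule 1: refuse rather than
  // silently caching it forever at a stable URL.
  throw new Error(`unversioned asset key: ${key}`);
}
```

Failing loudly on the third case is deliberate: the classic mistake described next is exactly an asset that matches neither rule.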

The classic mistake is caching remoteEntry.js itself with a long TTL at a stable URL. A deploy then overwrites the bytes behind that URL, but edge caches and browsers keep serving the old copy until their TTL expires, and different edge nodes serve different versions for hours. Versioned paths plus an immutable policy rule that out, because a new deploy never reuses an old URL. Edge cases, such as surrogate keys, multi-CDN consistency, and import-map variants, are covered in CDN cache invalidation for federated remotes.

Feature-Flagged Progressive Rollout & Automated Rollback

Even with perfect tracing, telemetry, and caching, flipping 100% of traffic to a new remote is a gamble. Progressive rollout converts a binary deploy into a dial. The pipeline registered the new version at rollout: 0; rollout is the act of turning that dial up while watching telemetry.

The host resolves which version to load per user from the flag service, falling back to the last-known-good version if the flag service is unreachable:

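// host/resolveRemoteVersion.ts
const CDN_BASE = 'https://cdn.example.com'; // where the pipeline publishes manifests

export async function resolveRemoteVersion(remote: string, userId: string): Promise<string> {
  try {
    const res = await fetch(
      `/flags/remotes/${encodeURIComponent(remote)}?u=${encodeURIComponent(userId)}`,
      // Time-box the flag call (1.5 s is an arbitrary budget) so a hung service can't block render.
      { cache: 'no-store', signal: AbortSignal.timeout(1500) },
    );
    if (!res.ok) throw new Error(`flag service returned ${res.status}`);
    const { version } = await res.json(); // service buckets the user by rollout %
    return version;
  } catch {
    // Flag service down or slow: serve this remote's stable manifest pointer.
    const manifest = await fetch(`${CDN_BASE}/${remote}/manifest.json`).then((r) => r.json());
    return manifest[remote];
  }
}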
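The resolver delegates bucketing to the flag service. One common way such a service could bucket users, sketched here as an assumption rather than any particular vendor's algorithm, is a stable hash of remote name plus user id taken modulo 100 and compared against the rollout percentage. bucketOf and pickVersion are hypothetical names, and FNV-1a is used only because it is short and dependency-free.

```typescript
// 32-bit FNV-1a over UTF-16 code units: stable, fast, dependency-free.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash;
}

// Map a user to a stable bucket in [0, 100) for one remote. The same user
// always lands in the same bucket, so raising the rollout only adds users.
export function bucketOf(remote: string, userId: string): number {
  return fnv1a(`${remote}:${userId}`) % 100;
}

export function pickVersion(
  remote: string,
  userId: string,
  rolloutPercent: number, // 0..100, the flag's current rollout
  candidate: string,
  stable: string,
): string {
  return bucketOf(remote, userId) < rolloutPercent ? candidate : stable;
}
```

Mixing the remote name into the hash keeps buckets independent across remotes, so the same one percent of users is not the canary cohort for every remote at once. At rollout 0 nobody gets the candidate; at 100 everybody does.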

Automated rollback closes the loop. Wire an alerting rule against the telemetry you already emit (error rate for the candidate version, or a drop in a conversion metric derived from its spans) and have it call the flag API to set rollout back to 0. Because version selection is runtime and the old immutable assets are still on the CDN, the rollback takes effect on the next page load with no rebuild. The bucketing strategy, canary cohorts, and metric thresholds are detailed in progressive rollout with feature flags for remotes.
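The alert rule itself can stay simple. A minimal sketch of the decision, assuming the collector already aggregates per-version session and error counts (the VersionStats and RollbackPolicy shapes and their thresholds are hypothetical); when shouldRollBack returns true, the caller would call the flag API to set the candidate's rollout back to 0.

```typescript
// Per-version counters the collector could aggregate from version-tagged
// events (mfe.remote.version on spans, version on error beacons).
export interface VersionStats {
  sessions: number; // page loads that mounted this version
  errors: number;   // remote error events attributed to this version
}

// Hypothetical thresholds; tune per remote.
export interface RollbackPolicy {
  minSessions: number;   // don't judge a candidate on too little traffic
  maxRatio: number;      // candidate error rate may be at most this multiple of baseline
  absoluteFloor: number; // ignore candidate error rates at or below this value
}

export function shouldRollBack(
  candidate: VersionStats,
  baseline: VersionStats,
  policy: RollbackPolicy,
): boolean {
  if (candidate.sessions < policy.minSessions) return false; // not enough data yet
  const candRate = candidate.errors / candidate.sessions;
  const baseRate = baseline.sessions > 0 ? baseline.errors / baseline.sessions : 0;
  if (candRate <= policy.absoluteFloor) return false; // too small to act on
  return candRate > baseRate * policy.maxRatio;
}
```

The minimum-sample guard matters most at low rollout percentages: a handful of errors in the first few dozen sessions is noise, not a signal.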

Testing & Validation

Validate the observability story the same way you validate features — in the pipeline, before production:
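As one example of such a check, a pipeline test could mount a deliberately throwing remote inside RemoteBoundary, capture the beaconed payload, and assert that it is attributable. A minimal sketch of the assertion half, assuming the payload fields from the boundary above and 8-character versions from the pipeline; telemetryProblems is a hypothetical helper, and the capture harness is left out.

```typescript
// Shape of the payload RemoteBoundary beacons to /telemetry/errors.
export interface RemoteErrorEvent {
  kind: string;
  remote: string;
  version: string;
  message: string;
  traceId?: string;
}

// Lists what would stop this event from paging the right team or joining
// its trace; an empty array means the event is fully attributable.
export function telemetryProblems(e: RemoteErrorEvent): string[] {
  const problems: string[] = [];
  if (!e.remote) problems.push('missing remote name');
  if (!/^[0-9a-f]{8}$/.test(e.version)) {
    problems.push(`version "${e.version}" is not an 8-char build SHA`);
  }
  if (e.traceId === undefined) {
    problems.push('missing traceId: cannot join the trace');
  } else if (!/^[0-9a-f]{32}$/.test(e.traceId) || /^0+$/.test(e.traceId)) {
    problems.push('malformed or all-zero traceId'); // all zeros is the invalid span context
  }
  return problems;
}
```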

Common Pitfalls

| Issue | Root cause & resolution |
|---|---|
| Errors from one remote can’t be tied to a user journey | Each remote starts its own trace. Share a single tracer through the federation share scope and propagate W3C traceparent so spans nest under one trace id. |
| A deployed fix isn’t served; users still hit the bug | remoteEntry.js cached long-lived at a stable URL. Content-address assets under version paths with immutable; keep only a tiny manifest pointer short-lived. |
| Remote load failure shows a blank region, no alert | No load-failure telemetry. Wrap mounts in an error boundary that beacons a structured, version-tagged event and renders a fallback. |
| Rolling back means a rebuild and re-deploy | Remote version baked into the host build. Resolve remote versions at runtime via the manifest/flag service so rollback is a pointer revert. |
| New version breaks 100% of users instantly | No graduated exposure. Register versions at 0% behind a flag and dial up while watching version-tagged telemetry; automate rollback on threshold breach. |
| Can’t tell which remote version caused a regression | Spans and errors lack a version attribute. Inject the build SHA at build time and tag every span and error with mfe.remote.version. |

FAQ

Do all remotes need to use the same observability vendor?

No. They need to agree on a propagation format — W3C traceparent and tracestate — and report to a shared collector. The collector can fan out to different backends. What cannot vary is the trace context format, because that is the contract that lets spans from different teams join one trace.

How do I keep distributed tracing from bloating every remote’s bundle?

Expose a single tracing module from the host, and share @opentelemetry/api as a singleton in the share scope, exactly as you would React. Remotes import the host’s tracing module instead of bundling their own OpenTelemetry setup, so instrumentation adds near-zero per-remote weight and avoids duplicate, conflicting tracers.

What is the safest order of operations in a remote deploy?

Build with a content-hashed entry, run contract and smoke tests, upload immutable assets to a version-scoped path, register the version at 0% rollout, and only then flip the short-TTL manifest pointer. The pointer flip is last and cheap to revert, so the riskiest step is also the most reversible one.

How fast can a rollback take effect?

For a pointer revert, roughly the manifest’s TTL: about 30 seconds with the pipeline’s max-age=30. A flag change (rollout back to 0) takes effect on the next page load, because the host fetches flags with cache: 'no-store'. Either way, version resolution happens at runtime and the previous immutable assets remain on the CDN, so the revert is served with no rebuild and no CDN purge.