Temporal SDK Metrics: Prometheus Exporter vs OTLP for Multi-Worker Deployments

TL;DR: If you run multiple Temporal workers on the same host (PM2, systemd, bare metal), the Prometheus exporter causes port conflicts (EADDRINUSE). OTLP export lets all workers push metrics to a single collector via gRPC, eliminating port management entirely. The OTLP collector receives on :4317 and serves Prometheus HTTP on :9464 — one scrape target regardless of worker count.

If you're running Temporal workers in production, you've probably enabled SDK metrics to monitor activity latency, workflow completion rates, and worker utilization. The Temporal SDK automatically exports these metrics — no manual instrumentation needed. But the moment you scale beyond a single worker on a single machine, you hit an interesting question: how do you collect metrics from multiple workers without them fighting over the same port?

I've been through this with the simple-temporal project, and there are two viable approaches. This post walks through both, why I eventually chose OTLP, and the exact configuration that makes it work.

Quick Background: Temporal SDK Metrics

Before we get into the architectures, here's how Temporal SDK metrics work at a high level.

When you create a Temporal worker, the SDK automatically records metrics — activity execution latency, workflow completions, worker task slot utilization, process resource usage. You don't instrument your workflow or activity code. You just configure where those metrics go.

In Node.js, the setup looks like this:

import { Runtime } from "@temporalio/worker";
 
Runtime.install({
  telemetryOptions: {
    metrics: {
      // The question is: what goes here?
    },
  },
});

The SDK supports two metric export mechanisms:

Prometheus exporter — serves an HTTP endpoint that Prometheus scrapes
OTLP exporter — pushes metrics via gRPC to an OpenTelemetry Collector

Both export the same metrics. The difference is how they export them, and that difference matters a lot when you run multiple workers.

Approach 1: Prometheus Exporter (Pull-Based)

The Prometheus exporter is the simpler of the two. The SDK starts an HTTP server inside the worker process, listens on a port, and serves metrics in Prometheus text format whenever a scraper hits the /metrics endpoint.

Single Worker Setup

Runtime.install({
  telemetryOptions: {
    metrics: {
      prometheus: {
        bindAddress: "0.0.0.0:9464",
      },
    },
  },
});

Prometheus scrapes http://worker-host:9464/metrics every few seconds, and the metrics land in your dashboard. Simple, clean, works great.

The Multi-Worker Problem

The moment you run a second worker on the same machine, this breaks:

Error: listen EADDRINUSE: address already in use :::9464

Each worker tries to bind port 9464 for its own metrics server. Only one can have it. You now have to decide: how do I collect metrics from all my workers without them stepping on each other?

Workaround: Unique Ports Per Worker

One solution is to assign each worker its own port.

const BASE_PORT = 9464;
const WORKER_INSTANCE = parseInt(process.env.WORKER_INSTANCE || "0", 10);
const METRICS_PORT = BASE_PORT + WORKER_INSTANCE;
 
Runtime.install({
  telemetryOptions: {
    metrics: {
      prometheus: {
        bindAddress: `0.0.0.0:${METRICS_PORT}`,
      },
    },
  },
});

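Each worker gets a deterministic port, derived from a WORKER_INSTANCE index that starts at 0:

Worker 1 (WORKER_INSTANCE=0) → :9464
Worker 2 (WORKER_INSTANCE=1) → :9465
Worker 3 (WORKER_INSTANCE=2) → :9466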

Then you list all those ports in your Prometheus scrape config:

scrape_configs:
  - job_name: "temporal-workers"
    static_configs:
      - targets:
          - "localhost:9464"
          - "localhost:9465"
          - "localhost:9466"

This works, but it's not great. Every time you add or remove a worker, you update the scrape config. The Prometheus config becomes tightly coupled to your process management. And you're maintaining N HTTP servers (one per worker) just to serve metrics.
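To make that coupling concrete, here is a minimal sketch that derives the scrape target list from a worker count. The helper name `buildScrapeTargets` and its parameters are hypothetical and not part of the project; the port arithmetic simply mirrors the BASE_PORT + WORKER_INSTANCE scheme above.

```javascript
// Hypothetical helper: list the Prometheus scrape targets for N workers
// that each bind basePort + instance index on the same host.
function buildScrapeTargets(workerCount, basePort = 9464, host = "localhost") {
  if (!Number.isInteger(workerCount) || workerCount < 0) {
    throw new RangeError("workerCount must be a non-negative integer");
  }
  // Instance indexes start at 0, matching the WORKER_INSTANCE default.
  return Array.from({ length: workerCount }, (_, i) => `${host}:${basePort + i}`);
}

console.log(buildScrapeTargets(3));
// [ 'localhost:9464', 'localhost:9465', 'localhost:9466' ]
```

Every change to the worker count changes this list, which is exactly the part of the Prometheus config you would have to regenerate and reload.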

Approach 2: OTLP Export (Push-Based)

OpenTelemetry Protocol (OTLP) takes a different approach. Instead of each worker hosting its own HTTP server and waiting to be scraped, each worker pushes metrics to a shared collector via gRPC. The collector aggregates everything and serves a single Prometheus endpoint.

How It Works

Worker 1 ──gRPC──┐
Worker 2 ──gRPC──┤  OTLP Collector  ──HTTP:9464──> Prometheus
Worker 3 ──gRPC──┘       │
                         └──> Any OTLP metrics backend (Datadog, Grafana Cloud, etc.)

All workers push to the same collector endpoint (localhost:4317). The collector handles aggregation, and Prometheus scrapes just one target — the collector itself.

Worker Configuration

The worker side is simpler than the Prometheus approach because there's no port management:

// temporal/telemetryOptions.js
import { Runtime, makeTelemetryFilterString } from "@temporalio/worker";
 
function initializeRuntime() {
  const otlpEndpoint = process.env.OTEL_EXPORTER_OTLP_ENDPOINT || "localhost:4317";
 
  Runtime.install({
    telemetryOptions: {
      logging: {
        forward: {},
        filter: makeTelemetryFilterString({ core: "INFO", other: "INFO" }),
      },
      metrics: {
        otel: {
          url: `grpc://${otlpEndpoint}`,
          temporality: "cumulative",
          metricsExportInterval: 1000, // push every 1 second
        },
      },
    },
  });
}

Environment variables per worker:

# Worker 1
OTEL_EXPORTER_OTLP_ENDPOINT=localhost:4317
OTEL_SERVICE_NAME=payment-worker-1
 
# Worker 2
OTEL_EXPORTER_OTLP_ENDPOINT=localhost:4317
OTEL_SERVICE_NAME=notification-worker-1
 
# Worker 3
OTEL_EXPORTER_OTLP_ENDPOINT=localhost:4317
OTEL_SERVICE_NAME=core-worker-1

No port conflicts. Every worker pushes to the same gRPC endpoint. The OTEL_SERVICE_NAME identifies which worker the metrics came from — that's how you distinguish them in your dashboards.

The OTLP Collector

The collector is the bridge. It receives OTLP metrics via gRPC (port 4317), optionally processes them (batching, filtering), and exposes them on a Prometheus-format HTTP endpoint (port 9464).

Here's the actual configuration I run:

# otel-collector/config.yaml
extensions:
  health_check:
    endpoint: "0.0.0.0:13133"
 
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"
 
processors:
  batch:
    timeout: 10s
    send_batch_size: 8192
 
exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"
 
service:
  extensions: [health_check]
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

What this does:

receivers.otlp listens for OTLP metrics on gRPC (:4317) and HTTP (:4318)
processors.batch batches metrics every 10 seconds or 8192 data points — whichever comes first
exporters.prometheus serves the aggregated metrics as Prometheus text format on :9464
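The "whichever comes first" rule of the batch processor can be sketched as a small predicate. This is an illustrative model under the settings shown above, not the collector's actual implementation; the function name `shouldFlush` and its option names are hypothetical.

```javascript
// Illustrative model of the batch processor's flush rule (not collector code).
function shouldFlush(bufferedItems, msSinceLastFlush, opts = {}) {
  const { timeoutMs = 10_000, sendBatchSize = 8192 } = opts;
  if (bufferedItems === 0) return false; // nothing buffered, nothing to send
  return bufferedItems >= sendBatchSize || msSinceLastFlush >= timeoutMs;
}

console.log(shouldFlush(100, 2_000));  // false: small batch, timeout not reached
console.log(shouldFlush(100, 10_000)); // true: timeout reached
console.log(shouldFlush(8192, 500));   // true: size threshold reached
```

With a handful of workers pushing once per second, the size threshold is rarely hit, so the 10s timeout is what usually triggers a flush.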

And the Docker Compose deployment:

# docker-compose.yml
services:
  otel-collector:
    container_name: temporal-otel-collector
    image: otel/opentelemetry-collector-contrib:0.104.0
    ports:
      - "4317:4317"   # OTLP gRPC receiver
      - "4318:4318"   # OTLP HTTP receiver
      - "9464:9464"   # Prometheus exporter
      - "13133:13133" # Health check
    volumes:
      - ./otel-collector/config.yaml:/etc/otel-collector-config.yaml:ro
    command: ["--config=/etc/otel-collector-config.yaml"]
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 2G
        reservations:
          memory: 512M

Prometheus then scrapes a single target — the collector:

scrape_configs:
  - job_name: "temporal-workers"
    scrape_interval: 15s
    static_configs:
      - targets: ["otel-collector:9464"]

One scrape target, regardless of how many workers you run. Add or remove workers without touching Prometheus configuration.

Architecture Comparison

Here's how the two approaches stack up:

Aspect	Prometheus Exporter	OTLP Export
Transport	HTTP pull (scraper requests)	gRPC push (worker sends)
Ports per worker	1 unique port each	None — pushes to shared collector
Scrape targets	N (one per worker)	1 (the collector)
Config coupling	Tight — Prometheus knows every worker	Loose — Prometheus only knows collector
Scalability	Manage port ranges	Add workers, no config changes
Collector Pipeline Features	Limited	Auth, batching, retries, routing
Collector needed	No	Yes
Complexity	Lower per-worker, higher at scale	Higher upfront, lower at scale
Resource usage	N HTTP servers (one per worker)	1 collector + N gRPC clients

When to use Prometheus exporter:

Single worker per host
Docker/Kubernetes where each worker is its own container (no port conflicts)
Quick prototyping or small setups
You want zero infrastructure beyond Prometheus itself

When to use OTLP export:

Multiple workers per host (PM2, systemd, bare metal)
You run workers outside Docker but monitoring inside Docker
Enterprise environments requiring authentication
You want to decouple worker scaling from monitoring configuration
You're already using OpenTelemetry for other signals (traces, logs)

What About Kubernetes?

If each Temporal worker runs in its own Kubernetes pod, the Prometheus exporter works fine: every pod gets its own network namespace, so there's no port conflict. Prometheus can discover each pod through Kubernetes service discovery (for example, via pod annotations) and scrape :9464 per pod without issues.

The port conflict problem mainly appears when multiple workers share the same host or container. This happens with:

PM2 process management on bare metal
systemd running multiple worker services
Docker Compose with multiple worker replicas that use host networking or publish the same host port
Development machines running several workers locally

In all these cases, OTLP removes the coupling between worker count and monitoring configuration.

Real-World Setup: PM2 + OTLP

Here's the complete setup from my simple-temporal project. The workers run via PM2 on the host machine, the OTLP collector runs in Docker, and Prometheus scrapes the collector.

Directory Structure

├── docker-compose.yml          # Prometheus + Grafana + OTLP collector
├── otel-collector/
│   └── config.yaml             # OTLP receiver → Prometheus exporter
├── temporal/
│   ├── telemetryOptions.js     # Shared OTLP initialization
│   └── worker.js               # Worker entry point
└── ecosystem.config.js         # PM2 process management

Worker Initialization

// temporal/telemetryOptions.js
import { Runtime, makeTelemetryFilterString } from "@temporalio/worker";
 
const DEFAULTS = {
  METRICS_EXPORT_INTERVAL_MS: 1000,
  LOG_LEVEL_CORE: "INFO",
  LOG_LEVEL_OTHER: "INFO",
  TEMPORALITY: "cumulative",
};
 
function parseOtlpHeaders() {
  const headersEnv = process.env.OTEL_EXPORTER_OTLP_HEADERS;
  if (!headersEnv) return {};
  try {
    return JSON.parse(headersEnv);
  } catch (error) {
    console.error(`Failed to parse OTLP headers: ${error.message}`);
    return {};
  }
}
 
function normalizeEndpoint(endpoint) {
  let cleanEndpoint = endpoint.replace(/^(https?|grpc|grpcs):\/\//, "");
  if (!cleanEndpoint.includes(":")) {
    cleanEndpoint = `${cleanEndpoint}:4317`;
  }
  return cleanEndpoint;
}
 
function initializeRuntime() {
  const otlpEndpoint = process.env.OTEL_EXPORTER_OTLP_ENDPOINT;
 
  const telemetryConfig = {
    logging: {
      forward: {},
      filter: makeTelemetryFilterString({
        core: process.env.TEMPORAL_LOG_LEVEL_CORE || DEFAULTS.LOG_LEVEL_CORE,
        other: process.env.TEMPORAL_LOG_LEVEL_OTHER || DEFAULTS.LOG_LEVEL_OTHER,
      }),
    },
  };
 
  if (otlpEndpoint) {
    const normalizedEndpoint = normalizeEndpoint(otlpEndpoint);
    telemetryConfig.metrics = {
      otel: {
        url: `grpc://${normalizedEndpoint}`,
        headers: parseOtlpHeaders(),
        temporality: DEFAULTS.TEMPORALITY,
        metricsExportInterval: DEFAULTS.METRICS_EXPORT_INTERVAL_MS,
      },
    };
 
    console.log(`[Telemetry] OTLP metrics → grpc://${normalizedEndpoint}`);
  } else {
    console.warn("[Telemetry] No OTLP endpoint configured — metrics disabled");
  }
 
  Runtime.install({ telemetryOptions: telemetryConfig });
}
 
export default initializeRuntime;
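One caveat about parseOtlpHeaders above: it expects JSON, whereas the OpenTelemetry specification defines OTEL_EXPORTER_OTLP_HEADERS as comma-separated key=value pairs. If you share that variable with other OpenTelemetry tooling, a parser that accepts both forms may be safer. The following is a minimal sketch; `parseHeaderString` is a hypothetical name, and it does not URL-decode values.

```javascript
// Hypothetical variant of parseOtlpHeaders that accepts either a JSON object
// or the comma-separated form ("k1=v1,k2=v2").
function parseHeaderString(raw) {
  if (!raw || !raw.trim()) return {};
  const text = raw.trim();

  if (text.startsWith("{")) {
    try {
      const parsed = JSON.parse(text);
      const isPlainObject = parsed && typeof parsed === "object" && !Array.isArray(parsed);
      return isPlainObject ? parsed : {};
    } catch {
      return {}; // malformed JSON: fall back to no headers
    }
  }

  const headers = {};
  for (const pair of text.split(",")) {
    const idx = pair.indexOf("="); // split on the first "=" only
    if (idx <= 0) continue;        // skip entries without a key or without "="
    const key = pair.slice(0, idx).trim();
    const value = pair.slice(idx + 1).trim();
    if (key) headers[key] = value;
  }
  return headers;
}

console.log(parseHeaderString("api-key=abc123,x-tenant=team-a"));
console.log(parseHeaderString('{"api-key":"abc123"}'));
```

Inside initializeRuntime, this could replace the call to parseOtlpHeaders by passing in process.env.OTEL_EXPORTER_OTLP_HEADERS.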

PM2 Ecosystem

// ecosystem.config.js
module.exports = {
  apps: [
    {
      name: "payment-worker",
      script: "./temporal/worker.js",
      env: {
        OTEL_EXPORTER_OTLP_ENDPOINT: "localhost:4317",
        OTEL_SERVICE_NAME: "payment-worker",
        WORKER_ID: "payment-worker-1",
      },
    },
    {
      name: "notification-worker",
      script: "./temporal/worker.js",
      env: {
        OTEL_EXPORTER_OTLP_ENDPOINT: "localhost:4317",
        OTEL_SERVICE_NAME: "notification-worker",
        WORKER_ID: "notification-worker-1",
      },
    },
  ],
};
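Because every app entry shares the same collector endpoint and differs only in its names, the list can also be generated. This is a sketch rather than the project's actual config: `makeWorkerApps` is a hypothetical helper, and the naming pattern is simply copied from the entries above.

```javascript
// Hypothetical generator for PM2 app entries. Every worker pushes to the
// same collector endpoint; only the identifying names differ.
function makeWorkerApps(names, endpoint = "localhost:4317") {
  return names.map((name) => ({
    name: `${name}-worker`,
    script: "./temporal/worker.js",
    env: {
      OTEL_EXPORTER_OTLP_ENDPOINT: endpoint,
      OTEL_SERVICE_NAME: `${name}-worker`,
      WORKER_ID: `${name}-worker-1`,
    },
  }));
}

const exampleApps = makeWorkerApps(["payment", "notification", "core"]);
console.log(exampleApps.map((app) => app.name));
```

In ecosystem.config.js, the result would be assigned to the apps key of module.exports. Adding a worker then means adding one name, with no port to allocate and no Prometheus change.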

Startup Guard Script

Before the collector starts, I run a quick port check to avoid conflicts:

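#!/bin/bash
# scripts/start-otel.sh
# Abort before starting the collector if any required port is already bound.

check_port() {
  local port=$1
  local name=$2
  # ss -tuln lists listening TCP/UDP sockets without resolving names.
  if ss -tuln | grep -q ":${port} "; then
    echo "ERROR: Port ${port} (${name}) is already in use" >&2
    return 1
  fi
  echo "✓ Port ${port} (${name}) is available"
  return 0
}

# Exit on the first conflict; without this the script would fall through
# to docker compose even after a failed check.
check_port 4317 "OTLP gRPC" || exit 1
check_port 4318 "OTLP HTTP" || exit 1
check_port 9464 "Prometheus metrics" || exit 1
check_port 13133 "Health check" || exit 1

docker compose up -d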

Data Flow End-to-End

PM2 starts the workers
        │
        ▼
Each Worker: Runtime.install({ otel: { url: "grpc://localhost:4317" } })
        │
        ▼
SDK opens gRPC connection to OTLP Collector (:4317)
        │
        ▼
Worker executes workflow/activity → SDK records metric
        │
        ▼
Every 1 second → SDK pushes metrics batch via gRPC
        │
        ▼
OTLP Collector receives metrics (the batch processor buffers them)
        │
        ▼
Every 10 seconds / 8192 items → batch processor flushes to Prometheus exporter
        │
        ▼
OTLP Collector serves aggregated metrics on :9464/metrics
        │
        ▼
Prometheus scrapes otel-collector:9464 every 15 seconds
        │
        ▼
Grafana queries Prometheus, visualizes on dashboard
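The intervals in this flow add up. As a rough worked example, assuming each stage can delay a data point by up to one full interval and ignoring network and processing time, the worst-case delay before a recorded metric is visible in Prometheus is the sum of the three. The function name `worstCaseVisibilityMs` is hypothetical.

```javascript
// Rough upper bound on the time from "SDK records metric" to "Prometheus has it".
// Each stage can hold a data point for up to one full interval.
function worstCaseVisibilityMs({ exportIntervalMs, batchTimeoutMs, scrapeIntervalMs }) {
  return exportIntervalMs + batchTimeoutMs + scrapeIntervalMs;
}

const delayMs = worstCaseVisibilityMs({
  exportIntervalMs: 1_000,  // metricsExportInterval in the worker
  batchTimeoutMs: 10_000,   // batch processor timeout in the collector
  scrapeIntervalMs: 15_000, // scrape_interval in Prometheus
});
console.log(`${delayMs / 1000}s`); // 26s
```

If dashboards need to be fresher than that, the batch timeout and the scrape interval are the two settings with the most room to shrink.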

When the Collector Goes Down

One concern with OTLP is dependency on the collector. If the collector is unavailable, do workers stop working?

No. The Temporal SDK handles this gracefully. If the gRPC connection to the collector fails, the worker continues executing workflows and activities. Metrics simply aren't exported until the collector comes back. The SDK logs the connection failure but doesn't crash or block.

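That said, assume that data points generated while the collector is unavailable are lost unless you explicitly configure buffering and retries. The Temporal SDK does not maintain a local buffer of unsent metrics, so if the gRPC connection fails, the export for that interval is dropped. With cumulative temporality, as configured here, counters and histograms keep accumulating inside the worker, so the first successful export after recovery should carry the running totals. What you lose is the resolution during the gap, and the totals as well if the worker restarts before the collector returns. Prometheus scraping behaves similarly after a missed scrape: the next scrape reads the current cumulative values.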
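How much is actually lost depends on the temporality setting. The following simplified model (not SDK code, and assuming the worker process itself does not restart during the outage) compares what a collector ends up seeing for a single counter under cumulative and delta temporality when two export intervals fail. The function name `lastValueSeenByCollector` is hypothetical.

```javascript
// Simplified model of one counter exported once per interval.
// increments[i] = events recorded in interval i
// up[i]         = whether the collector was reachable in interval i
function lastValueSeenByCollector(increments, up, temporality) {
  let total = 0; // running total held by the worker-side counter
  let seen = 0;  // value the collector has after the last successful export
  increments.forEach((n, i) => {
    total += n;
    if (!up[i]) return; // export failed and is not retried later
    if (temporality === "cumulative") {
      seen = total; // each export carries the running total
    } else {
      seen += n;    // delta: each export carries only this interval's change
    }
  });
  return seen;
}

const increments = [5, 3, 4, 2];
const up = [true, false, false, true];
console.log(lastValueSeenByCollector(increments, up, "cumulative")); // 14
console.log(lastValueSeenByCollector(increments, up, "delta"));      // 7
```

In this model, cumulative temporality recovers the full count of 14 after the outage, while delta permanently misses the 7 events recorded during it. Either way, the per-interval detail inside the gap is gone.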

For production, run the collector behind the startup guard script, and monitor its availability separately, for example by probing the health_check extension exposed on :13133.

Key Takeaways

Prometheus exporter is perfect until multiple workers share a host. Single workers and Kubernetes pods work fine; the port conflict only appears when several workers on one machine compete for the same port.
OTLP removes the coupling between workers and monitoring — one collector receives from all workers, one Prometheus scrape target, no config changes when workers scale.
The OTLP collector is the bridge — gRPC in (:4317), Prometheus HTTP out (:9464). It's a single Docker container with a straightforward config file.
Metrics loss during collector downtime is expected — the SDK doesn't locally buffer unsent metrics. Monitor the collector's health separately and run it with proper guard scripts.
Service names replace ports — with Prometheus, you distinguish workers by port (:9464, :9465). With OTLP, you use OTEL_SERVICE_NAME — a meaningful identifier that survives restarts and re-deploys.
Standardizing on OpenTelemetry opens doors: once OTLP is flowing to the collector, sending the same metrics to other backends (Datadog, Grafana Cloud) requires only a collector config change, not a code change.

I landed on OTLP because my workers run on bare metal via PM2, and managing port ranges for 10+ workers was a maintenance headache I didn't want. The OTLP collector is one Docker container, one config file, and it just works.

simple-temporal on GitHub — the project this post is based on
Temporal SDK Monitoring Docs — official Temporal documentation
OpenTelemetry Collector — OTLP collector reference