Temporal SDK Metrics: Prometheus Exporter vs OTLP for Multi-Worker Deployments
How to choose between Prometheus exporter and OTLP for Temporal SDK metrics, especially when running multiple worker instances on the same host. A practical guide with real configuration.
TL;DR: If you run multiple Temporal workers on the same host (PM2, systemd, bare metal), the Prometheus exporter causes port conflicts (EADDRINUSE). OTLP export lets all workers push metrics to a single collector via gRPC, which removes per-worker port management. The OTLP collector receives on :4317 and serves Prometheus HTTP on :9464, so there is one scrape target regardless of worker count.
If you're running Temporal workers in production, you've probably enabled SDK metrics to monitor activity latency, workflow completion rates, and worker utilization. The Temporal SDK records these metrics automatically, with no manual instrumentation needed. But the moment you scale beyond a single worker on a single machine, you hit a practical question: how do you collect metrics from multiple workers without them fighting over the same port?
I've been through this with the simple-temporal project, and there are two viable approaches. This post walks through both, why I eventually chose OTLP, and the exact configuration that makes it work.
Quick Background: Temporal SDK Metrics
Before we get into the architectures, here's how Temporal SDK metrics work at a high level.
When you create a Temporal worker, the SDK automatically records metrics — activity execution latency, workflow completions, worker task slot utilization, process resource usage. You don't instrument your workflow or activity code. You just configure where those metrics go.
In Node.js, the setup looks like this:
import { Runtime } from "@temporalio/worker";

Runtime.install({
  telemetryOptions: {
    metrics: {
      // The question is: what goes here?
    },
  },
});

The SDK supports two metric export mechanisms:
- Prometheus exporter — serves an HTTP endpoint that Prometheus scrapes
- OTLP exporter — pushes metrics via gRPC to an OpenTelemetry Collector
Both export the same metrics. The difference is how they export them, and that difference matters a lot when you run multiple workers.
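To make the "what goes here?" question concrete, here is a minimal sketch of a helper that returns one of the two `metrics` shapes used later in this post. The function name `buildMetricsOptions`, the `mode` argument, and the default addresses are my own illustrative choices, not part of the SDK; only the `prometheus.bindAddress` and `otel.url` keys come from the configurations shown in this post.

```javascript
// Returns the `metrics` section for Runtime.install({ telemetryOptions }).
// `mode` and both default addresses are hypothetical conveniences for this sketch.
function buildMetricsOptions(
  mode,
  { bindAddress = "0.0.0.0:9464", collectorUrl = "grpc://localhost:4317" } = {}
) {
  switch (mode) {
    case "prometheus":
      // Pull-based: the worker process serves an HTTP endpoint on bindAddress.
      return { prometheus: { bindAddress } };
    case "otlp":
      // Push-based: the worker sends metrics to a collector at collectorUrl.
      return { otel: { url: collectorUrl } };
    default:
      throw new Error(`Unknown metrics mode: ${mode}`);
  }
}

// Possible usage inside a worker entry point:
// Runtime.install({ telemetryOptions: { metrics: buildMetricsOptions("otlp") } });
```

Everything else in the worker stays the same; only this one object decides whether the worker opens a port or pushes to a collector.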
Approach 1: Prometheus Exporter (Pull-Based)
The Prometheus exporter is the simpler of the two. The SDK starts an HTTP server inside the worker process, listens on a port, and serves metrics in Prometheus text format whenever a scraper hits the /metrics endpoint.
Single Worker Setup
Runtime.install({
  telemetryOptions: {
    metrics: {
      prometheus: {
        bindAddress: "0.0.0.0:9464",
      },
    },
  },
});

Prometheus scrapes http://worker-host:9464/metrics every few seconds, and the metrics land in your dashboard. Simple, clean, works great.
The Multi-Worker Problem
The moment you run a second worker on the same machine, this breaks:
Error: listen EADDRINUSE: address already in use :::9464

Each worker tries to bind port 9464 for its own metrics server. Only one can have it. You now have to decide: how do I collect metrics from all my workers without them stepping on each other?
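To see the conflict in isolation, here is a small sketch that uses two plain Node.js TCP servers as stand-ins for two workers' metrics endpoints. It does not touch the Temporal SDK, and the function name `bindSamePortTwice` is made up for this illustration. The dynamic `import()` is only there so the snippet runs in both ESM and CommonJS files.

```javascript
// Binds one port twice and reports what happens to the second attempt.
async function bindSamePortTwice() {
  const net = await import("node:net");

  // Resolves with the server once it is listening, rejects on bind errors.
  const listen = (port) =>
    new Promise((resolve, reject) => {
      const server = net.createServer();
      server.once("error", reject);
      server.listen(port, "127.0.0.1", () => resolve(server));
    });

  // Port 0 asks the OS for any free port, so the demo does not depend on
  // 9464 being available on the machine.
  const first = await listen(0);
  const { port } = first.address();

  try {
    const second = await listen(port); // the second "worker" wants the same port
    second.close();
    return "no conflict";
  } catch (error) {
    return error.code; // the error code reported for the failed bind
  } finally {
    first.close();
  }
}

bindSamePortTwice().then((result) => console.log(`second bind: ${result}`));
```

The first listener keeps the port for as long as its process lives, which is why restarting the second worker never fixes the problem on its own.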
Workaround: Unique Ports Per Worker
One solution is to assign each worker its own port.
const BASE_PORT = 9464;
// Zero-based instance number; defaults to 0 when WORKER_INSTANCE is unset
const WORKER_INSTANCE = parseInt(process.env.WORKER_INSTANCE || "0", 10);
const METRICS_PORT = BASE_PORT + WORKER_INSTANCE;

Runtime.install({
  telemetryOptions: {
    metrics: {
      prometheus: {
        bindAddress: `0.0.0.0:${METRICS_PORT}`,
      },
    },
  },
});

Each worker gets a deterministic port:
- Worker 1 (WORKER_INSTANCE=0) → :9464
- Worker 2 (WORKER_INSTANCE=1) → :9465
- Worker 3 (WORKER_INSTANCE=2) → :9466
Then you list all those ports in your Prometheus scrape config:
scrape_configs:
  - job_name: "temporal-workers"
    static_configs:
      - targets:
          - "localhost:9464"
          - "localhost:9465"
          - "localhost:9466"

This works, but it's not great. Every time you add or remove a worker, you update the scrape config. The Prometheus config becomes tightly coupled to your process management. And you're maintaining N HTTP servers (one per worker) just to serve metrics.
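The coupling is easier to see when the port arithmetic and the scrape target list are derived from the same rule. The sketch below is a hedged illustration: `metricsPortFor` and `scrapeTargets` are hypothetical helper names, and the validation is something I would add rather than something the snippet above does.

```javascript
const BASE_PORT = 9464;

// Maps a zero-based worker instance number to its metrics port.
// Rejects non-numeric or negative input instead of producing a NaN port.
function metricsPortFor(instance, basePort = BASE_PORT) {
  const n = Number.parseInt(String(instance ?? "0"), 10);
  if (!Number.isInteger(n) || n < 0) {
    throw new Error(`Invalid WORKER_INSTANCE: ${instance}`);
  }
  const port = basePort + n;
  if (port > 65535) {
    throw new Error(`Metrics port ${port} is out of range`);
  }
  return port;
}

// Builds the `targets` list that the Prometheus scrape config must contain
// for `workerCount` workers on a single host.
function scrapeTargets(workerCount, host = "localhost") {
  return Array.from(
    { length: workerCount },
    (_, i) => `${host}:${metricsPortFor(i)}`
  );
}

console.log(scrapeTargets(3));
// → [ 'localhost:9464', 'localhost:9465', 'localhost:9466' ]
```

Both functions have to agree with whatever starts the workers. Change the worker count in one place and the Prometheus config has to change with it.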
Approach 2: OTLP Export (Push-Based)
OpenTelemetry Protocol (OTLP) takes a different approach. Instead of each worker hosting its own HTTP server and waiting to be scraped, each worker pushes metrics to a shared collector via gRPC. The collector aggregates everything and serves a single Prometheus endpoint.
How It Works
Worker 1 ──gRPC──┐
Worker 2 ──gRPC──┤ OTLP Collector ──HTTP:9464──> Prometheus
Worker 3 ──gRPC──┘       │
                         └──> Any OTLP backend (Jaeger, DataDog, etc.)

All workers push to the same collector endpoint (localhost:4317). The collector handles aggregation, and Prometheus scrapes just one target: the collector itself.
Worker Configuration
The worker side is simpler than the Prometheus approach because there's no port management:
// temporal/telemetryOptions.js
import { Runtime, makeTelemetryFilterString } from "@temporalio/worker";

function initializeRuntime() {
  const otlpEndpoint = process.env.OTEL_EXPORTER_OTLP_ENDPOINT || "localhost:4317";

  Runtime.install({
    telemetryOptions: {
      logging: {
        forward: {},
        filter: makeTelemetryFilterString({ core: "INFO", other: "INFO" }),
      },
      metrics: {
        otel: {
          url: `grpc://${otlpEndpoint}`,
          temporality: "cumulative",
          metricsExportInterval: 1000, // push every 1 second
        },
      },
    },
  });
}

Environment variables per worker:
# Worker 1
OTEL_EXPORTER_OTLP_ENDPOINT=localhost:4317
OTEL_SERVICE_NAME=payment-worker-1

# Worker 2
OTEL_EXPORTER_OTLP_ENDPOINT=localhost:4317
OTEL_SERVICE_NAME=notification-worker-1

# Worker 3
OTEL_EXPORTER_OTLP_ENDPOINT=localhost:4317
OTEL_SERVICE_NAME=core-worker-1

No port conflicts. Every worker pushes to the same gRPC endpoint. The OTEL_SERVICE_NAME identifies which worker the metrics came from, and that's how you distinguish them in your dashboards.
The OTLP Collector
The collector is the bridge. It receives OTLP metrics via gRPC (port 4317), optionally processes them (batching, filtering), and exports them as Prometheus HTTP (port 9464).
Here's the actual configuration I run:
# otel-collector/config.yaml
extensions:
  health_check:
    endpoint: "0.0.0.0:13133"

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

processors:
  batch:
    timeout: 10s
    send_batch_size: 8192

exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"

service:
  extensions: [health_check]
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

What this does:
- receivers.otlp listens for OTLP metrics on gRPC (:4317) and HTTP (:4318)
- processors.batch batches metrics every 10 seconds or 8192 data points, whichever comes first
- exporters.prometheus serves the aggregated metrics in Prometheus text format on :9464
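The "whichever comes first" rule for the batch processor can be written down as a tiny decision function. This is a simplified model for intuition only, not the collector's implementation; `flushReason` and the `BATCH` object are names I made up, with values copied from the config above.

```javascript
// Values mirror processors.batch in otel-collector/config.yaml.
const BATCH = { timeoutMs: 10_000, sendBatchSize: 8192 };

// Returns why a flush would happen ("size" or "timeout"),
// or null if the batch would keep buffering.
function flushReason({ bufferedPoints, msSinceLastFlush }, batch = BATCH) {
  // A full batch is sent immediately, regardless of the timer.
  if (bufferedPoints >= batch.sendBatchSize) return "size";
  // Otherwise a non-empty batch is sent once the timeout elapses.
  if (bufferedPoints > 0 && msSinceLastFlush >= batch.timeoutMs) return "timeout";
  return null;
}

console.log(flushReason({ bufferedPoints: 8192, msSinceLastFlush: 200 }));   // "size"
console.log(flushReason({ bufferedPoints: 300, msSinceLastFlush: 10_000 })); // "timeout"
console.log(flushReason({ bufferedPoints: 300, msSinceLastFlush: 2_000 }));  // null
```

With a handful of workers pushing once per second, the timeout branch is the one that usually fires. The size limit mostly matters as a cap during bursts.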
And the Docker Compose deployment:
# docker-compose.yml
services:
  otel-collector:
    container_name: temporal-otel-collector
    image: otel/opentelemetry-collector-contrib:0.104.0
    ports:
      - "4317:4317"   # OTLP gRPC receiver
      - "4318:4318"   # OTLP HTTP receiver
      - "9464:9464"   # Prometheus exporter
      - "13133:13133" # Health check
    volumes:
      - ./otel-collector/config.yaml:/etc/otel-collector-config.yaml:ro
    command: ["--config=/etc/otel-collector-config.yaml"]
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 2G
        reservations:
          memory: 512M

Prometheus then scrapes a single target, the collector:
scrape_configs:
  - job_name: "temporal-workers"
    scrape_interval: 15s
    static_configs:
      - targets: ["otel-collector:9464"]

One scrape target, regardless of how many workers you run. Add or remove workers without touching Prometheus configuration.
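One side effect of chaining three timers (SDK export, collector batching, Prometheus scrape) is that a value can take a while to show up in a query. The sketch below adds up the intervals from the configs in this post to get a rough upper bound. It assumes each stage just missed the previous tick and ignores network and processing time, so treat the result as an estimate; `worstCaseDelayMs` and `INTERVALS` are illustrative names.

```javascript
// Intervals taken from the configuration shown in this post, in milliseconds.
const INTERVALS = {
  sdkExportMs: 1_000,     // metricsExportInterval in the worker
  batchTimeoutMs: 10_000, // processors.batch.timeout in the collector
  scrapeMs: 15_000,       // scrape_interval in Prometheus
};

// Rough upper bound on the time between the SDK recording a value
// and Prometheus storing it, assuming every stage waits a full interval.
function worstCaseDelayMs({ sdkExportMs, batchTimeoutMs, scrapeMs }) {
  return sdkExportMs + batchTimeoutMs + scrapeMs;
}

console.log(`${worstCaseDelayMs(INTERVALS) / 1000}s`); // "26s"
```

If dashboards feel laggy, the batch timeout and the scrape interval are the two knobs that dominate this sum.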
Architecture Comparison
Here's how the two approaches stack up:
| Aspect | Prometheus Exporter | OTLP Export |
|---|---|---|
| Transport | HTTP pull (scraper requests) | gRPC push (worker sends) |
| Ports per worker | 1 unique port each | None — pushes to shared collector |
| Scrape targets | N (one per worker) | 1 (the collector) |
| Config coupling | Tight — Prometheus knows every worker | Loose — Prometheus only knows collector |
| Scalability | Manage port ranges | Add workers, no config changes |
| Pipeline features | Limited | Auth, batching, retries, routing |
| Collector needed | No | Yes |
| Complexity | Lower per-worker, higher at scale | Higher upfront, lower at scale |
| Resource usage | N HTTP servers (one per worker) | 1 collector + N gRPC clients |
When to use Prometheus exporter:
- Single worker per host
- Docker/Kubernetes where each worker is its own container (no port conflicts)
- Quick prototyping or small setups
- You want zero infrastructure beyond Prometheus itself
When to use OTLP export:
- Multiple workers per host (PM2, systemd, bare metal)
- You run workers outside Docker but monitoring inside Docker
- Enterprise environments requiring authentication
- You want to decouple worker scaling from monitoring configuration
- You're already using OpenTelemetry for other signals (traces, logs)
What About Kubernetes?
If each Temporal worker runs in its own Kubernetes pod, the Prometheus exporter works fine — every pod gets its own network namespace, so there's no port conflict. Prometheus can discover each pod via service annotations and scrape :9464 per pod without issues.
The port conflict problem mainly appears when multiple workers share the same host or container. This happens with:
- PM2 process management on bare metal
- systemd running multiple worker services
- Docker Compose with multiple worker replicas that use host networking or publish the same host port
- Development machines running several workers locally
In all these cases, OTLP removes the coupling between worker count and monitoring configuration.
Real-World Setup: PM2 + OTLP
Here's the complete setup from my simple-temporal project. The workers run via PM2 on the host machine, the OTLP collector runs in Docker, and Prometheus scrapes the collector.
Directory Structure
├── docker-compose.yml # Prometheus + Grafana + OTLP collector
├── otel-collector/
│ └── config.yaml # OTLP receiver → Prometheus exporter
├── temporal/
│ ├── telemetryOptions.js # Shared OTLP initialization
│ └── worker.js # Worker entry point
└── ecosystem.config.js # PM2 process management

Worker Initialization
// temporal/telemetryOptions.js
import { Runtime, makeTelemetryFilterString } from "@temporalio/worker";

const DEFAULTS = {
  METRICS_EXPORT_INTERVAL_MS: 1000,
  LOG_LEVEL_CORE: "INFO",
  LOG_LEVEL_OTHER: "INFO",
  TEMPORALITY: "cumulative",
};

// Reads optional exporter headers. This helper expects
// OTEL_EXPORTER_OTLP_HEADERS to contain a JSON object string.
function parseOtlpHeaders() {
  const headersEnv = process.env.OTEL_EXPORTER_OTLP_HEADERS;
  if (!headersEnv) return {};
  try {
    return JSON.parse(headersEnv);
  } catch (error) {
    console.error(`Failed to parse OTLP headers: ${error.message}`);
    return {};
  }
}

// Strips any scheme and appends the default gRPC port when none is given.
function normalizeEndpoint(endpoint) {
  let cleanEndpoint = endpoint.replace(/^(https?|grpc|grpcs):\/\//, "");
  if (!cleanEndpoint.includes(":")) {
    cleanEndpoint = `${cleanEndpoint}:4317`;
  }
  return cleanEndpoint;
}

function initializeRuntime() {
  const otlpEndpoint = process.env.OTEL_EXPORTER_OTLP_ENDPOINT;

  const telemetryConfig = {
    logging: {
      forward: {},
      filter: makeTelemetryFilterString({
        core: process.env.TEMPORAL_LOG_LEVEL_CORE || DEFAULTS.LOG_LEVEL_CORE,
        other: process.env.TEMPORAL_LOG_LEVEL_OTHER || DEFAULTS.LOG_LEVEL_OTHER,
      }),
    },
  };

  // Metrics are only enabled when an endpoint is configured.
  if (otlpEndpoint) {
    const normalizedEndpoint = normalizeEndpoint(otlpEndpoint);
    telemetryConfig.metrics = {
      otel: {
        url: `grpc://${normalizedEndpoint}`,
        headers: parseOtlpHeaders(),
        temporality: DEFAULTS.TEMPORALITY,
        metricsExportInterval: DEFAULTS.METRICS_EXPORT_INTERVAL_MS,
      },
    };
    console.log(`[Telemetry] OTLP metrics → grpc://${normalizedEndpoint}`);
  } else {
    console.warn("[Telemetry] No OTLP endpoint configured — metrics disabled");
  }

  Runtime.install({ telemetryOptions: telemetryConfig });
}

export default initializeRuntime;

PM2 Ecosystem
// ecosystem.config.js
module.exports = {
  apps: [
    {
      name: "payment-worker",
      script: "./temporal/worker.js",
      env: {
        OTEL_EXPORTER_OTLP_ENDPOINT: "localhost:4317",
        OTEL_SERVICE_NAME: "payment-worker",
        WORKER_ID: "payment-worker-1",
      },
    },
    {
      name: "notification-worker",
      script: "./temporal/worker.js",
      env: {
        OTEL_EXPORTER_OTLP_ENDPOINT: "localhost:4317",
        OTEL_SERVICE_NAME: "notification-worker",
        WORKER_ID: "notification-worker-1",
      },
    },
  ],
};

Startup Guard Script
Before the collector starts, I run a quick port check to avoid conflicts:
#!/bin/bash
# scripts/start-otel.sh

check_port() {
  local port=$1
  local name=$2
  if ss -tuln | grep -q ":${port} "; then
    echo "ERROR: Port ${port} (${name}) is already in use"
    return 1
  fi
  echo "✓ Port ${port} (${name}) is available"
  return 0
}

# Abort before starting the collector if any required port is taken
check_port 4317 "OTLP gRPC" || exit 1
check_port 4318 "OTLP HTTP" || exit 1
check_port 9464 "Prometheus metrics" || exit 1
check_port 13133 "Health check" || exit 1

docker compose up -d

Data Flow End-to-End
PM2 starts 3 workers
│
▼
Each Worker: Runtime.install({ otel: { url: "grpc://localhost:4317" } })
│
▼
SDK opens gRPC connection to OTLP Collector (:4317)
│
▼
Worker executes workflow/activity → SDK records metric
│
▼
Every 1 second → SDK pushes metrics batch via gRPC
│
▼
OTLP Collector receives metrics (the batch processor buffers them)
│
▼
Every 10 seconds / 8192 items → batch processor flushes to Prometheus exporter
│
▼
OTLP Collector serves aggregated metrics on :9464/metrics
│
▼
Prometheus scrapes otel-collector:9464 every 15 seconds
│
▼
Grafana queries Prometheus, visualizes on dashboard

When the Collector Goes Down
One concern with OTLP is dependency on the collector. If the collector is unavailable, do workers stop working?
No. The Temporal SDK handles this gracefully. If the gRPC connection to the collector fails, the worker continues executing workflows and activities. Metrics simply aren't exported until the collector comes back. The SDK logs the connection failure but doesn't crash or block.
That said, metrics generated while the collector is unavailable should generally be considered lost unless buffering and retry mechanisms are explicitly configured. The Temporal SDK does not maintain a local buffer of unsent metrics — if the gRPC connection fails, the data for that export interval is dropped. This is different from Prometheus scraping, where the scraper retries on the next interval and eventually gets the data.
For production, run the collector with a startup guard script and health checks, and monitor its availability separately.
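As one way to monitor availability separately, here is a hedged sketch that polls the collector's health_check extension on port 13133 (the endpoint configured earlier). The function name `isCollectorHealthy`, the timeout value, and the injectable `fetchImpl` parameter are my own choices for illustration. It relies on the global `fetch` and `AbortSignal.timeout`, so it assumes Node.js 18 or newer.

```javascript
// Returns true when the collector's health endpoint answers with a 2xx status.
// `fetchImpl` can be swapped out so the function is testable without a
// running collector; by default it uses the global fetch.
async function isCollectorHealthy(
  url = "http://localhost:13133/",
  { timeoutMs = 2000, fetchImpl = fetch } = {}
) {
  try {
    const response = await fetchImpl(url, {
      signal: AbortSignal.timeout(timeoutMs), // give up if the collector hangs
    });
    return response.ok;
  } catch {
    // Connection refused, DNS failure, or timeout all count as unhealthy.
    return false;
  }
}

// Possible usage: log a warning from a cron job or a sidecar script.
isCollectorHealthy().then((healthy) => {
  if (!healthy) console.warn("[Telemetry] OTLP collector health check failed");
});
```

Because workers keep running when the collector is down, a check like this (or an equivalent alert in your monitoring stack) is the only signal that metrics have stopped flowing.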
Key Takeaways
- Prometheus exporter is perfect until workers share a host: single workers and one-worker-per-pod Kubernetes deployments work fine. The port conflict only appears when multiple workers run on the same host.
- OTLP removes the coupling between workers and monitoring: one collector receives from all workers, Prometheus has one scrape target, and nothing changes in its config when workers scale.
- The OTLP collector is the bridge: gRPC in (:4317), Prometheus HTTP out (:9464). It's a single Docker container with a straightforward config file.
- Metrics loss during collector downtime is expected: the SDK doesn't locally buffer unsent metrics. Monitor the collector's health separately and run it with proper guard scripts.
- Service names replace ports: with Prometheus, you distinguish workers by port (:9464, :9465). With OTLP, you use OTEL_SERVICE_NAME, a meaningful identifier that survives restarts and re-deploys.
- Standardizing on OpenTelemetry opens doors: once OTLP is flowing to the collector, sending the same metrics to alternative backends (Datadog, Grafana Cloud, and others) requires only a config change, not a code change.
I landed on OTLP because my workers run on bare metal via PM2, and managing port ranges for 10+ workers was a maintenance headache I didn't want. The OTLP collector is one Docker container, one config file, and it just works.
Related
- simple-temporal on GitHub — the project this post is based on
- Temporal SDK Monitoring Docs — official Temporal documentation
- OpenTelemetry Collector — OTLP collector reference