Photon builds infrastructure for deploying AI agents into messaging platforms people already use.

Spectrum is Photon's developer platform for connecting AI agents to messaging channels such as iMessage, WhatsApp, Telegram, Slack, Discord, and Instagram.

Who is Spectrum built for?

Spectrum is built for developers and teams building conversational AI agents that need to communicate through existing messaging platforms.

Jul 3, 2026

How We Rebuilt Our Shared iMessage Routing to Handle 10M+ Messages a Day

Ryan

Tom

Preface

Photon's free and pro users share iMessage phone numbers, so every message routes through a central proxy that resolves ownership at runtime — outbound is straightforward, inbound is not.

At 10M+ messages per day across thousands of concurrent users, the legacy fan-out architecture was re-deriving event ownership on every delivery, coordinating through Redis, and running on a runtime (Bun) that made the whole system effectively unobservable in production.
We migrated the runtime from Bun to Node to fix silent gRPC connection failures that were killing long-lived streams without any error logs, and to unlock end-to-end distributed tracing. We then replaced the per-request fan-out with a durable Postgres-backed event log that resolves ownership once at ingest, and deleted ~5,200 lines of legacy coordination code.
The binding cache that gates inbound delivery was silently dropping messages when upstream services returned transient errors — a subtle bug that looked identical to "no binding exists."
The result: a system that is simpler, fully observable end-to-end, and durable by default. Business users on dedicated numbers were never affected — this was purely about making the shared-number path reliable at scale.

The routing problem behind our reliability complaints

If you've used Photon's free or pro tier in the last few months, there's a good chance you hit connection drops, delayed messages, or outright missing replies. We heard you — in support tickets, on social media, in Discord. The complaints were real, and they all traced back to the same place.

Photon gives every business customer a dedicated iMessage number with its own backend. That's the easy path — one number, one customer, one system. No routing ambiguity.

Free and pro users are different. They share iMessage phone numbers across a pool, and every message — inbound or outbound — passes through a central proxy service that figures out who it belongs to. That routing layer was the source of every stability issue.

Outbound is simple. You have the project ID and the target phone number, so you can deterministically look up which Photon number to send from. One lookup, one send.

Inbound is where things get hard.

An inbound message arrives with two pieces of information: the Photon phone number it was sent to, and the sender's phone number. From that pair, you have to resolve which project owns that conversation, find the active stream the client is listening on, and deliver the event. Every inbound message requires this resolution, and it has to happen fast — we're processing over ten million messages a day on the shared-number service alone.
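To make the resolution step concrete, here is a minimal in-memory sketch. The type names, the `OwnershipIndex` class, and the string key format are assumptions for illustration; the post does not show Photon's actual data model.

```typescript
// Hypothetical shapes; the real types are not shown in this post.
interface InboundMessage {
  photonNumber: string; // shared Photon number the message was sent to
  senderNumber: string; // phone number of the person who sent it
  body: string;
}

type ProjectId = string;

// Conversation ownership keyed by the (photonNumber, senderNumber) pair.
class OwnershipIndex {
  private owners = new Map<string, ProjectId>();

  private key(photonNumber: string, senderNumber: string): string {
    return `${photonNumber}|${senderNumber}`;
  }

  bind(photonNumber: string, senderNumber: string, project: ProjectId): void {
    this.owners.set(this.key(photonNumber, senderNumber), project);
  }

  // Returns the owning project, or undefined when no conversation is bound.
  resolve(msg: InboundMessage): ProjectId | undefined {
    return this.owners.get(this.key(msg.photonNumber, msg.senderNumber));
  }
}
```

Because the Photon number is shared, the number alone never identifies the owner; only the pair does.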

At that scale, the routing layer is the product. If resolution is slow, messages feel laggy. If it's wrong, messages vanish. If the system can't tell you what's happening inside it, you're flying blind when things break.

All three of those things were happening.

How the legacy system worked (and why it stopped being enough)

The original architecture was a fan-out model. Each proxy replica opened live gRPC streams to every iMessage relay instance in the fleet, received raw events from all of them, and resolved ownership per event, per stream, per replica.

This worked when the fleet was small and the user count was modest. But it has a structural problem: it re-derives event ownership on every delivery. Catch-up (replaying missed events when a client reconnects), auto-reply detection ("is anyone actually receiving this conversation?"), and subscribe (streaming new events in real time) all depend on walking live relay state instead of reading from a single source of truth.

Coordination lived in Redis. Auto-reply claims, presence heartbeats, global catch-up caps — all Redis-backed, all stateful in ways that made failure modes subtle and hard to reproduce.

And the whole thing ran on Bun.

The OOM that scaling couldn't fix

The first crisis was the service falling over entirely.

Each proxy pod was ballooning to 2–5 GB of heap and eventually crashing — an OOM loop we were masking by running ~32 replicas. Scaling bought time, but it also multiplied the problem: every replica ran the background event listener against all 20 iMessage relay instances, so more pods meant more aggregate fan-out, more connections, and more memory pressure.

A heap snapshot on a relatively small pod (73 subscriptions, 287 MB) told the story: 2.2 million SlimPromiseReaction objects in a linked list, each retaining a delivered event — message content, timestamps, the full payload. The reactions were growing without bound.

The root cause was a classic Promise.race leak in the stream merger. The proxy merges ~20 backend gRPC streams per client subscription using Promise.race. Each iteration re-.then()s every pending promise. The 19 losers keep their promise object, so a quiet relay instance's next() accumulates one reaction — holding the full event — for every event delivered by any instance. Forever. The withHeartbeat wrapper had the same bug: racing a reused pendingNext re-.then()d it on every tick.

The fix was a leak-free merge primitive. Each source next() gets exactly one reaction that pushes onto a ready queue and wakes the consumer via a single recreated promise. No re-.then() of pending promises, no unbounded reaction chains. Per-source overhead became constant.
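A merge along these lines can be sketched as follows. This is not the production code: the name `mergeLeakFree`, the `Settled` type, and the choice to pull a source again only after its value has been consumed are assumptions made for the example.

```typescript
type Settled<T> =
  | { index: number; result: IteratorResult<T> }
  | { index: number; error: unknown };

// Merges async iterators without re-attaching reactions to pending promises.
async function* mergeLeakFree<T>(sources: AsyncIterator<T>[]): AsyncGenerator<T> {
  const ready: Settled<T>[] = [];
  let wake: (() => void) | null = null;
  let active = sources.length;

  // Each next() call receives exactly one reaction, attached here and nowhere else.
  const pull = (index: number): void => {
    sources[index].next().then(
      (result) => { ready.push({ index, result }); wake?.(); },
      (error: unknown) => { ready.push({ index, error }); wake?.(); },
    );
  };

  sources.forEach((_, i) => pull(i));

  try {
    while (active > 0) {
      // The wake promise is recreated per wait, so nothing accumulates on it.
      while (ready.length === 0) {
        await new Promise<void>((resolve) => { wake = resolve; });
      }
      wake = null;

      const item = ready.shift()!;
      if ("error" in item) throw item.error;
      if (item.result.done) { active--; continue; }

      yield item.result.value;
      pull(item.index); // request the next value only from the source just consumed
    }
  } finally {
    // Best-effort cleanup of the underlying iterators.
    for (const s of sources) s.return?.().catch(() => {});
  }
}
```

A quiet source holds one pending promise with one reaction no matter how many events the other sources deliver, which is the constant per-source overhead described above.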

After the fix, per-pod steady-state heap dropped from gigabytes back to ~90 MB. We scaled the replica count back down — the 32-pod fleet had been compensating for the leak, not for actual load.

With the service no longer crashing, we could turn to the deeper problems.

The observability gap: why we couldn't debug production

Bun was fast to develop on. But in production, it was a black box — in more ways than one.

The most immediate problem was gRPC. Our proxy holds long-lived gRPC streams to every iMessage relay instance in the fleet. Under Bun, these connections would silently die — no error, no log, no event. The stream just stopped delivering. We'd see symptoms (missing messages, stale state) but nothing in our logs to explain them. Moving to Node resolved this immediately. Node's gRPC stack (grpc-js) handles long-lived connections reliably and surfaces errors when they occur. The silent connection failures vanished overnight.

The deeper problem was observability. OpenTelemetry's instrumentation for HTTP and gRPC depends on hooking into Node's undici and diagnostics_channel internals. Bun doesn't expose these. That meant our calls into upstream services — eligibility checks, opt-in flows, the cloud API whose Postgres pool was the known root cause of our p99 latency spikes (8 seconds) — were completely untraced.

We knew the database was slow. We couldn't prove where or why from the proxy's perspective. Every outbound fetch was an opaque blob of time inside resolveInbound or resolveOutbound. No client spans, no trace propagation, no way to correlate a slow inbound delivery with the specific upstream query that caused it.

This wasn't a minor inconvenience. It meant every production incident was a guessing game. "Is it the proxy? Is it the cloud service? Is it the database? Is it a specific replica?" We had monitoring, but it was monitoring the wrong layer.

Step one: switch the runtime

We migrated from Bun to Node as the runtime, keeping Bun as the package manager and test runner.

The core motivation was unlocking @opentelemetry/instrumentation-undici — Node's fetch is built on undici, and OTel can hook into it via diagnostics_channel to auto-instrument every outbound HTTP call. On Bun, this is a no-op.

The migration itself was surgical. The repo had no build step — Bun ran src/index.ts directly. Node can't do that with TypeScript enums in the generated protobuf files, so we added tsx (esbuild-backed transpilation) as a runtime loader. Telemetry preloads — previously handled by bunfig.toml — moved to Node's --import flags. The Dockerfile switched the runtime stage from oven/bun to node:22-bookworm-slim while keeping the dependency install stage on Bun.

Tests stayed on bun test. The 79-test suite passed unchanged.

The payoff was immediate. With Node running, we registered undici auto-instrumentation and set a W3CTraceContextPropagator so traceparent headers flowed into upstream services. For the first time, a single trace could show: proxy receives inbound event → resolves binding via HTTP call → upstream service hits Postgres → query takes 4 seconds → that's your latency.

One trace. End to end. The 8-second p99 was no longer a rumor — it was a specific query on a specific service, visible from the proxy's own spans.

The silent monitoring failure

Switching to Node exposed a second problem we didn't know we had.

The OTLP exporter was using HTTP transport. On Node's undici stack, keep-alive sockets can half-close when the collector restarts or the memory_limiter refuses a batch. When that happens, every subsequent export attempt reusing that socket silently fails. No error, no retry, no log. The exporter just stops working.

We discovered that two out of three production replicas — including the busiest pod — were completely invisible to our monitoring. They were running, serving traffic, processing messages. But no traces, no logs, no metrics were reaching the collector. If an incident concentrated on one of those dark pods, we'd never know.

The fix was switching the OTLP exporter from HTTP to gRPC. The grpc-js library owns its channel lifecycle and transparently reconnects after a broken connection. A blip can't permanently wedge the export pipeline.

We also hardened the shutdown path — the telemetry flush was previously unbounded, so if the collector was unreachable at SIGTERM, the process would hang. We added a 5-second race with Promise.allSettled so shutdown is always bounded.
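The bounded-flush pattern looks roughly like this. The function name `boundedFlush` and the idea of passing flushers as plain callbacks are assumptions for the sketch; in a real service they would wrap whatever shutdown or flush methods the telemetry providers expose.

```typescript
// Resolves to "flushed" if every flusher settles in time, otherwise "timed_out".
async function boundedFlush(
  flushers: Array<() => Promise<void>>,
  timeoutMs = 5_000,
): Promise<"flushed" | "timed_out"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timed_out">((resolve) => {
    timer = setTimeout(() => resolve("timed_out"), timeoutMs);
  });

  // allSettled never rejects, so one failing exporter cannot abort the others.
  // The async wrapper turns a synchronous throw into a rejection as well.
  const flushed = Promise.allSettled(flushers.map(async (f) => f())).then(
    () => "flushed" as const,
  );

  try {
    return await Promise.race([flushed, timeout]);
  } finally {
    clearTimeout(timer); // do not keep the process alive after an early finish
  }
}
```

A SIGTERM handler could await `boundedFlush([...])` and then exit regardless of the outcome, so an unreachable collector delays shutdown by at most the timeout.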

The binding cache bug that silently dropped messages

While building the new inbound pipeline, we found a subtle, high-impact bug in the binding cache.

The binding cache answers a simple question: "Is this sender eligible to receive messages on this Photon number?" It calls an upstream service, caches the result, and uses it for the duration of the TTL.

The problem: the cache treated every negative response the same way. A definitive "this sender is not provisioned" (HTTP 404/422) and a transient "the upstream service is down" (HTTP 5xx, network timeout) both got cached as false. For up to the full cache TTL, every inbound message for that sender would be silently dropped — not because the sender wasn't eligible, but because the cache remembered a transient failure as a permanent answer.

The inverse was also broken. Definitive negatives (a sender that genuinely isn't provisioned) weren't cached long enough, so a backlog of messages from an un-provisioned sender would re-hit the upstream service on every retry — 148,000 calls per day in one incident, creating a poison loop that degraded performance for everyone.

The fix was a three-state binding result: resolved (positive — cache normally), none with a reason (definitive negative — cache at full TTL), and TransientBindingError (never cached, always retried). Simple in concept, but it required threading the distinction through every caller in the resolution path.
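As an illustration of the three states, here is a small cache sketch. `TransientBindingError`, `resolved`, and `none` come from the description above; the `BindingCache` class, the `Lookup` signature, and the field names are assumptions.

```typescript
type BindingResult =
  | { kind: "resolved"; projectId: string } // positive answer
  | { kind: "none"; reason: string };       // definitive negative (e.g. 404/422)

// Thrown for 5xx responses and timeouts: "I don't know", never a cacheable answer.
class TransientBindingError extends Error {}

// Hypothetical upstream call; it must throw TransientBindingError on transient failures.
type Lookup = (key: string) => Promise<BindingResult>;

class BindingCache {
  private entries = new Map<string, { value: BindingResult; expiresAt: number }>();
  private lookup: Lookup;
  private ttlMs: number;

  constructor(lookup: Lookup, ttlMs: number) {
    this.lookup = lookup;
    this.ttlMs = ttlMs;
  }

  async get(key: string): Promise<BindingResult> {
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > Date.now()) return hit.value;

    // If lookup throws, control leaves here before anything is written,
    // so a transient failure is never remembered as a "no".
    const value = await this.lookup(key);

    // Both definitive outcomes are cached for the full TTL.
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
    return value;
  }
}
```

Callers then have to handle three outcomes explicitly: deliver on `resolved`, drop on `none`, and retry on `TransientBindingError`.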

The real fix: a durable inbound event log

The binding cache fix stopped the bleeding, but the fan-out architecture was still fundamentally fragile. Every delivery re-derived ownership. Every reconnection re-walked live relay state. Every "is anyone listening?" question depended on presence heartbeats and Redis coordination.

We replaced it with a durable, replayable Postgres event log.

The core insight: resolve ownership once, at ingest time, and persist the result. Everything downstream — catch-up, subscribe, auto-reply, timeout detection — reads from the log instead of re-deriving from live state.
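The idea can be modeled in a few lines with an in-memory array standing in for the Postgres table. Everything here (`InboundLog`, `RawEvent`, `LoggedEvent`, `readAfter`) is a hypothetical stand-in, not the real schema or API.

```typescript
interface RawEvent {
  photonNumber: string;
  senderNumber: string;
  payload: string;
}

// A logged row carries the ownership decision made at ingest time.
interface LoggedEvent extends RawEvent {
  seq: number;
  projectId: string;
}

class InboundLog {
  private rows: LoggedEvent[] = [];
  private nextSeq = 1;
  private resolveOwner: (e: RawEvent) => string;

  constructor(resolveOwner: (e: RawEvent) => string) {
    this.resolveOwner = resolveOwner;
  }

  // Ownership is resolved exactly once per event, here.
  ingest(event: RawEvent): LoggedEvent {
    const row: LoggedEvent = {
      ...event,
      seq: this.nextSeq++,
      projectId: this.resolveOwner(event),
    };
    this.rows.push(row);
    return row;
  }

  // Readers page through a project's slice by sequence number; no resolution happens.
  readAfter(projectId: string, cursor: number, limit: number): LoggedEvent[] {
    return this.rows
      .filter((r) => r.projectId === projectId && r.seq > cursor)
      .slice(0, limit);
  }
}
```

However many times a client reconnects and re-reads, the resolver runs once per event, and every reader sees the same answer.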

How the new pipeline works

Ingest. A lease-gated writer per iMessage relay instance multiplexes catch-up and subscribe streams into a sequence-ordered feed, resolves ownership inline (with retry and circuit-breaking for transient failures), and batch-writes events into a Postgres inbound_events table. Watermark heartbeats track progress; lag metrics make stalls visible.

Catch-up. When a client reconnects and asks for missed events, the system reads directly from the log — a fenced, ordered page-scan of that project's event slice. No more replaying the relay. Gap-fencing (using xmin snapshots, PgBouncer-safe) ensures reads don't return uncommitted rows. Bidirectional cursor translation lets legacy clients with old-format cursors migrate transparently.

Subscribe. A shared per-pod tail poller fans resolved rows out to bounded per-subscriber queues. Stalled subscribers are shed with RESOURCE_EXHAUSTED after a grace period — backpressure is explicit, not silent. Live streams start at the fenced project head, so there's no gap between catch-up and subscribe.
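A bounded queue with a shedding grace period could look like the sketch below. The class name, the three return values, and the rule that a rejected item is simply offered again later are assumptions; the post only states that stalled subscribers are shed with RESOURCE_EXHAUSTED after a grace period.

```typescript
type OfferResult = "queued" | "retry" | "shed";

class SubscriberQueue<T> {
  private items: T[] = [];
  private fullSince: number | null = null;
  private capacity: number;
  private graceMs: number;

  constructor(capacity: number, graceMs: number) {
    this.capacity = capacity;
    this.graceMs = graceMs;
  }

  // Called by the poller. `now` is passed in so the logic is easy to test.
  offer(item: T, now: number): OfferResult {
    if (this.items.length < this.capacity) {
      this.items.push(item);
      this.fullSince = null;
      return "queued";
    }
    if (this.fullSince === null) this.fullSince = now;
    // Full for the whole grace period: the caller should close the stream,
    // which per the post is reported as RESOURCE_EXHAUSTED.
    if (now - this.fullSince >= this.graceMs) return "shed";
    return "retry"; // still within grace; the item can be offered again later
  }

  // Called by the subscriber's stream writer.
  take(): T | undefined {
    const item = this.items.shift();
    if (this.items.length < this.capacity) this.fullSince = null;
    return item;
  }
}
```

Because rows live in the log, a shed subscriber loses nothing: it can reconnect and catch up from its last cursor.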

Auto-reply and timeout detection. The sweeper — a singleton redrive loop — handles pending-row retries, retention deletes, and delivery-timeout scanning. Instead of relying on presence heartbeats to answer "is anyone receiving this?", it looks at delivery_cursors.updated_at. If the cursor is stale, the subscriber isn't consuming, and the system can trigger auto-reply. The auto-reply claim store moved from Redis to Postgres, eliminating one more piece of external coordination state.
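The staleness check reduces to a timestamp comparison, sketched below. `delivery_cursors.updated_at` is from the post; the TypeScript shapes, the treatment of a missing cursor as stale, and the in-memory claim set are assumptions. In Postgres, a claim would presumably rely on a uniqueness constraint rather than a `Set`.

```typescript
interface DeliveryCursor {
  projectId: string;
  updatedAt: Date; // mirrors delivery_cursors.updated_at
}

// A subscriber counts as stale when its cursor has not moved within timeoutMs.
// Assumption: a missing cursor means nobody has ever consumed, so it is stale too.
function isSubscriberStale(
  cursor: DeliveryCursor | undefined,
  now: Date,
  timeoutMs: number,
): boolean {
  if (!cursor) return true;
  return now.getTime() - cursor.updatedAt.getTime() > timeoutMs;
}

// In-memory stand-in for the auto-reply claim store.
class AutoReplyClaims {
  private claimed = new Set<string>();

  // Returns true only for the first caller per conversation key.
  tryClaim(conversationKey: string): boolean {
    if (this.claimed.has(conversationKey)) return false;
    this.claimed.add(conversationKey);
    return true;
  }
}
```

The sweeper would trigger an auto-reply only when the cursor is stale and the claim succeeds, so a conversation gets at most one auto-reply.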

We migrated ingest, catch-up, subscribe, auto-reply, and timeout detection onto the log incrementally — validating each layer in staging before enabling it in production — and once everything was stable, cut over fully: the pipeline runs unconditionally, and the entire legacy fan-out/merge/Redis-coordination subsystem is deleted.

The final result: 1,907 lines added, 5,207 deleted, 42 files changed. The system got smaller and more capable at the same time.

What we gave up

Nothing is free. Here's what the new architecture costs:

Write amplification. Every inbound event is written to Postgres. At our volume, that's meaningful — tens of millions of rows per day, with retention deletes running on a schedule. The previous system held no persistent state for event delivery, so storage cost was zero.

Postgres as a critical dependency. The log makes Postgres load-bearing for inbound delivery, not just for configuration and metadata. We mitigated this with an explicit read-replica pool (DATABASE_REPLICA_URL) and by making the catch-up path replica-aware — but a Postgres outage now directly impacts message delivery.

Complexity in the ingest path. Lease-gated writers, circuit breakers, watermark heartbeats, gap-fencing — these are non-trivial components. The operational surface moved from "many replicas independently fan out" to "a coordinated ingest pipeline with lease arbitration." The former was simple and unreliable; the latter is complex and durable.

We judged these tradeoffs worthwhile. Persistent state that you can query, replay, and monitor beats ephemeral state that you can only hope was delivered correctly.

Lessons

Your runtime is a tracing decision. We chose Bun for developer velocity. We didn't realize that choice made production unobservable until we'd scaled past the point where "add more logging" was sufficient. If your system processes millions of events a day, the runtime's compatibility with your observability stack isn't a nice-to-have — it's load-bearing infrastructure.

Cache semantics are failure semantics. The binding cache bug was invisible precisely because it looked like correct behavior — the cache returned a value, the system acted on it, and the message was dropped. The only signal was absence: messages that should have arrived but didn't. Three-state results (yes / definitely no / I don't know) are more work to implement, but they're the only way to cache safely in a distributed system where upstream failures are routine.

Resolve once, persist, read many. The fan-out model's fundamental mistake was re-deriving ownership on every delivery. That's fine when you have ten users. At ten million messages a day, every derivation is a chance for inconsistency, a latency tax, and an operational blind spot. Persisting the resolution at ingest time turned every downstream consumer into a simple log reader.

Try it yourself

These changes are live now. If you tried Photon's free tier before and hit reliability issues — connection drops, missing messages, delayed replies — that's the system we just rebuilt.

The shared iMessage service is faster, more stable, and fully observable. We're confident enough in it to invite you back.

Get started for free at photon.codes
Need a dedicated number? Our business plans give you your own iMessage number with a separate backend and zero shared routing. Check out our pricing.
Questions or feedback? Reach us on Discord or at [email protected].