Durable SMS Surface Design for Production AI Systems

2026-02-20 · 6 min read

How to build an SMS surface that survives retries, late events, and long-running turn orchestration without losing context.

sms, architecture, reliability

Start with delivery guarantees, not UX polish

SMS feels simple, but production behavior is dominated by retries, delayed callbacks, and webhook race conditions. If the reliability model is weak, customer-facing flows degrade under load: replies get duplicated, dropped, or sent out of order.

A durable surface starts by guaranteeing that inbound payloads are validated, normalized, and written to storage before orchestration begins.
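The validate, normalize, persist sequence can be sketched as follows. This is a minimal illustration and not a specific provider's API: the payload field names (`message_id`, `from`, `body`), the `inbound_sms` table, and the functions `normalize`, `open_store`, and `accept_webhook` are all assumptions made for the example. The phone check is a coarse length test, not full E.164 validation, and it assumes the sender number already carries a country code.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class InboundSms:
    provider_message_id: str
    from_number: str
    body: str


def normalize(payload: dict) -> InboundSms:
    """Validate a raw webhook payload and return a normalized record."""
    for field in ("message_id", "from", "body"):  # hypothetical field names
        value = payload.get(field)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty field: {field}")
    # Keep digits only, then apply a coarse length check (not full E.164).
    digits = "".join(ch for ch in payload["from"] if ch.isdigit())
    if not 8 <= len(digits) <= 15:
        raise ValueError("phone number has an implausible length")
    return InboundSms(
        provider_message_id=payload["message_id"].strip(),
        from_number="+" + digits,
        body=payload["body"].strip(),
    )


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the store and create the (hypothetical) inbound table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS inbound_sms ("
        " provider_message_id TEXT PRIMARY KEY,"
        " from_number TEXT NOT NULL,"
        " body TEXT NOT NULL)"
    )
    return conn


def accept_webhook(conn: sqlite3.Connection, payload: dict) -> bool:
    """Persist the message; return True only the first time it is seen."""
    msg = normalize(payload)
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO inbound_sms VALUES (?, ?, ?)",
            (msg.provider_message_id, msg.from_number, msg.body),
        )
    return cur.rowcount == 1
```

In this sketch, orchestration would start only when `accept_webhook` returns `True`. A retried webhook with the same provider message ID returns `False`, so the handler can acknowledge it without starting a second turn.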

Map identity and context deterministically

Incoming phone numbers should resolve to stable identity records, then map to experience context with explicit rules. Avoid implicit defaults that hide routing mistakes.

When mapping is explicit, conversation continuity and downstream policy evaluation become predictable across retries and replays.
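One way to make the mapping explicit is to fail loudly when no rule matches instead of falling back to a default. The sketch below assumes two hypothetical lookup tables (`IDENTITIES` and `EXPERIENCES`) held in memory, and a rule keyed on the identity plus the number the user texted. A real system would back these with a database and its own rule shape.

```python
from dataclasses import dataclass


class RoutingError(LookupError):
    """Raised when no explicit rule matches; there is no silent fallback."""


@dataclass(frozen=True)
class Context:
    identity_id: str
    experience: str


# Hypothetical in-memory tables standing in for durable storage.
IDENTITIES = {"+14155550100": "user-42"}
EXPERIENCES = {("user-42", "+18005550199"): "support"}


def resolve_context(from_number: str, to_number: str) -> Context:
    """Map sender and destination numbers to a context, or raise."""
    identity_id = IDENTITIES.get(from_number)
    if identity_id is None:
        raise RoutingError(f"no identity for sender {from_number}")
    experience = EXPERIENCES.get((identity_id, to_number))
    if experience is None:
        raise RoutingError(f"no experience rule for {identity_id} on {to_number}")
    return Context(identity_id, experience)
```

Because `resolve_context` depends only on its inputs and the stored rules, a retried or replayed message resolves to the same `Context`, and a missing rule surfaces as a `RoutingError` rather than a misrouted conversation.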

Keep async work observable and replay-safe

Async response handling should be backed by durable outbox events and worker processing with idempotency keys. This enables safe retries without duplicate user-facing sends.
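A minimal outbox sketch, assuming a SQLite table named `outbox` and a caller-supplied `send` function (both hypothetical): the idempotency key is the primary key, so enqueueing the same reply twice is a no-op, and a row is marked sent only after `send` returns.

```python
import sqlite3


def open_outbox(path: str = ":memory:") -> sqlite3.Connection:
    """Open the store and create the (hypothetical) outbox table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS outbox ("
        " idempotency_key TEXT PRIMARY KEY,"
        " to_number TEXT NOT NULL,"
        " body TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn


def enqueue(conn, idempotency_key: str, to_number: str, body: str) -> bool:
    """Queue an outbound message; return False if the key already exists."""
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO outbox (idempotency_key, to_number, body)"
            " VALUES (?, ?, ?)",
            (idempotency_key, to_number, body),
        )
    return cur.rowcount == 1


def drain(conn, send) -> int:
    """Send every pending row in insertion order; return the number sent."""
    rows = conn.execute(
        "SELECT idempotency_key, to_number, body FROM outbox"
        " WHERE status = 'pending' ORDER BY rowid"
    ).fetchall()
    sent = 0
    for key, to_number, body in rows:
        # If send raises, the row stays pending and is retried on the next drain.
        send(key, to_number, body)
        with conn:
            conn.execute(
                "UPDATE outbox SET status = 'sent' WHERE idempotency_key = ?",
                (key,),
            )
        sent += 1
    return sent
```

A worker that crashes after `send` succeeds but before the status update will send the row again on the next drain. The sketch passes the key to `send` so it can be forwarded to a provider that deduplicates on it, which is an assumption about the provider and not something the outbox alone guarantees.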

Operationally, you need visibility into each step: webhook accepted, turn processed, outbound message queued, provider delivery status received.
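The four steps above can be tracked with one structured event per step, keyed by message ID. The sketch below is illustrative only: the `Step` names mirror the list above, while the `Timeline` class and its JSON event shape are assumptions, and a production system would ship these events to its own logging or tracing backend.

```python
import json
import time
from enum import Enum


class Step(str, Enum):
    WEBHOOK_ACCEPTED = "webhook_accepted"
    TURN_PROCESSED = "turn_processed"
    OUTBOUND_QUEUED = "outbound_queued"
    DELIVERY_STATUS_RECEIVED = "delivery_status_received"


class Timeline:
    """Records lifecycle steps per message and reports which are missing."""

    def __init__(self, emit=print):
        self._seen = {}  # message_id -> set of recorded steps
        self._emit = emit  # sink for JSON lines; defaults to stdout

    def record(self, message_id: str, step: Step, **fields) -> None:
        """Remember the step and emit one JSON line describing it."""
        self._seen.setdefault(message_id, set()).add(step)
        event = {
            "ts": time.time(),
            "message_id": message_id,
            "step": step.value,
            **fields,
        }
        self._emit(json.dumps(event, sort_keys=True))

    def missing_steps(self, message_id: str) -> list:
        """Return the steps not yet recorded, in lifecycle order."""
        seen = self._seen.get(message_id, set())
        return [step for step in Step if step not in seen]
```

Calling `missing_steps` for a message that recorded only the first two steps returns `OUTBOUND_QUEUED` and `DELIVERY_STATUS_RECEIVED`, which shows where that message stalled.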