Workers crash. The send job fails halfway through a 4-message reply. The retry runs from the top - and the user gets messages 1 and 2 again, then 3 and 4 for the first time. Now they have a duplicate. You need three things to make this robust: stable client GUIDs, a resume cursor, and a place to record what failed.Documentation Index
Fetch the complete documentation index at: https://docs.photon.codes/docs/llms.txt
Use this file to discover all available pages before exploring further.
Stable client GUIDs
Every message you enqueue gets a deterministic identifier - aclientGuid - assigned at enqueue time, not at send time. The transport uses it for deduplication: if it sees the same clientGuid twice, it discards the second copy.
clientGuid must be stable is subtle: a worker that crashes after the transport ack’d but before the worker recorded it will retry the same message. Without a stable GUID, the transport sees a “new” message and delivers a duplicate.
startIndex resume cursor
GUID-based dedup handles the “transport already saw this” case. But you also want the worker itself to skip messages it knows it already sent - both for performance and to avoid even attempting the dedup roundtrip. Persist astartIndex on the job. After each successful send, atomically bump it. On retry, resume from there:
If the crash happens between ack and persist, the retry will resend msg[1] - but the transport sees the same clientGuid B and discards it. Both layers of defense are doing work.
Per-resource memory scope
A single agent talks to many users. Each one needs their own working memory, conversation history, and observational notes. Mixing them up is catastrophic - the agent telling one user about another user’s plans is the kind of bug you see once and never forget. Scope every memory operation byresourceId = senderAddress. The thread ID is per-chat (chat-${chatId}), but memory is per-person:
resourceId), but conversation history is per-thread (different threadId). That’s usually what you want - the agent remembers who you are across all chats but treats each thread as its own conversation.
If you shard memory across databases or tenants, each shard needs to follow the same convention. Test multi-user concurrency hard - race conditions in working-memory updates corrupt state silently and you won’t notice until two users compare notes.
Job failure audit log
When a job fails, you want to know which job, when, with what payload, and why - without grepping through rotating logs. Ajob_failures table is a small amount of code that pays back disproportionately. Every error path calls recordJobFailure(queueName, jobId, payload, error) and inserts a row. Now you can ask:
- “Which jobs failed in the last hour?”
- “Are all failures coming from one chat?” (a corrupt working-memory state)
- “Are all failures from one queue stage?” (a transport outage vs. an LLM bug)
- “What was the payload that triggered this?” (reproducer)
- Add a retention policy. Otherwise the table grows forever. Delete entries older than 30 days.
- Make
recordJobFailureitself fail-safe. Wrap it in a try/catch with a log fallback - you don’t want a failed-failure-record to take down the worker. - Be careful with payload size. If your jobs carry images or large blobs, the audit table balloons. Either truncate the payload or store a pointer to it.