> ## Documentation Index
> Fetch the complete documentation index at: https://docs.photon.codes/docs/llms.txt
> Use this file to discover all available pages before exploring further.

# Recovery and state

> Idempotent retries, per-resource memory scope, and a durable failure audit log.

Workers crash. The send job fails halfway through a 4-message reply. The retry runs from the top - and the user gets messages 1 and 2 again, then 3 and 4 for the first time. Now they have a duplicate.

You need three things to make this robust: stable client GUIDs, a resume cursor, and a place to record what failed.

## Stable client GUIDs

Every message you enqueue gets a deterministic identifier - a `clientGuid` - assigned at enqueue time, not at send time. The transport uses it for deduplication: if it sees the same `clientGuid` twice, it discards the second copy.

```ts theme={null}
const messages = reply.map((text, index) => ({
  text,
  clientGuid: `${jobId}-${index}`, // stable across retries
}));
```

Most modern messaging transports support stable client-side IDs for dedup, and Spectrum surfaces them through the provider interface - check what your provider exposes before assuming you need to build dedup yourself.

The reason `clientGuid` must be stable is subtle: a worker that crashes after the transport ack'd but before the worker recorded it will retry the same message. Without a stable GUID, the transport sees a "new" message and delivers a duplicate.

## startIndex resume cursor

GUID-based dedup handles the "transport already saw this" case. But you also want the worker itself to skip messages it knows it already sent - both for performance and to avoid even attempting the dedup roundtrip.

Persist a `startIndex` on the job. After each successful send, atomically bump it. On retry, resume from there:

```mermaid theme={null}
sequenceDiagram
  participant W as Worker
  participant Tx as Transport

  W->>W: assign clientGuids [A, B, C, D]
  W->>Tx: send msg[0] (guid A)
  Tx-->>W: ack
  W->>W: startIndex = 1 (persisted)
  W->>Tx: send msg[1] (guid B)
  Tx-->>W: ack
  W->>W: startIndex = 2 (persisted)
  W-x W: ⚡ crash
  Note over W,Tx: retry - load startIndex=2
  W->>Tx: send msg[2] (guid C)
  W->>Tx: send msg[3] (guid D)
```

If the crash happens *between* ack and persist, the retry will resend `msg[1]` - but the transport sees the same `clientGuid B` and discards it. Both layers of defense are doing work.

## Per-resource memory scope

A single agent talks to many users. Each one needs their own working memory, conversation history, and observational notes. Mixing them up is catastrophic - the agent telling one user about another user's plans is the kind of bug you see once and never forget.

Scope every memory operation by `resourceId = senderAddress`. The thread ID is per-chat (`chat-${chatId}`), but memory is per-person:

```ts theme={null}
await memory.getWorkingMemory({
  resourceId: senderAddress,
  threadId: `chat-${chatId}`,
});
```

The same person messaging from a group chat versus a 1:1 sees the same working memory (same `resourceId`), but conversation history is per-thread (different `threadId`). That's usually what you want - the agent remembers *who you are* across all chats but treats each thread as its own conversation.

```mermaid theme={null}
flowchart TD
  Sender[senderAddress: alice@example.com]
  Sender --> WM[Working memory:<br/>per-resource]
  Sender --> Notes[Observational notes:<br/>per-resource]

  Chat1[chat-1<br/>1:1 with alice]
  Chat2[chat-2<br/>group with alice + bob]

  Chat1 --> H1[Message history<br/>per-thread]
  Chat2 --> H2[Message history<br/>per-thread]

  WM -.shared across.- Chat1
  WM -.shared across.- Chat2
```

If you shard memory across databases or tenants, each shard needs to follow the same convention. Test multi-user concurrency hard - race conditions in working-memory updates corrupt state silently and you won't notice until two users compare notes.

## Job failure audit log

When a job fails, you want to know which job, when, with what payload, and why - without grepping through rotating logs.

A `job_failures` table is a small amount of code that pays back disproportionately. Every error path calls `recordJobFailure(queueName, jobId, payload, error)` and inserts a row. Now you can ask:

* "Which jobs failed in the last hour?"
* "Are all failures coming from one chat?" (a corrupt working-memory state)
* "Are all failures from one queue stage?" (a transport outage vs. an LLM bug)
* "What was the payload that triggered this?" (reproducer)

Couple of operational notes:

* **Add a retention policy.** Otherwise the table grows forever. Delete entries older than 30 days.
* **Make `recordJobFailure` itself fail-safe.** Wrap it in a try/catch with a log fallback - you don't want a failed-failure-record to take down the worker.
* **Be careful with payload size.** If your jobs carry images or large blobs, the audit table balloons. Either truncate the payload or store a pointer to it.

## Putting it together

The crash-recovery story:

```mermaid theme={null}
flowchart TD
  Job[Job starts] --> Try{Run handler}
  Try -->|success| Done[Done]
  Try -->|crash| Retry[Queue retries with same jobId]
  Try -->|error| Audit[recordJobFailure]
  Audit --> Retry
  Retry --> Resume[Load startIndex]
  Resume --> Send[Send from startIndex onward]
  Send --> Tx{Transport check}
  Tx -->|new clientGuid| Deliver[Delivered]
  Tx -->|seen clientGuid| Skip[Discarded as dup]
  Deliver --> Bump[Persist startIndex]
  Skip --> Bump
  Bump --> Done
```

Three independent layers - queue retry, resume cursor, transport dedup - and any one of them is enough to prevent a duplicate in most failure modes. Together they survive almost everything short of the database itself going down.
