The Design Space of Refresh Token Rotation: A Mechanism Account of Reuse Detection, Grace Windows, and Client Persistence Order

Migrating mobile from browser session cookies to first-party Bearer tokens sounds like just "adding a JWT." But the real difficulty of Bearer tokens was never issuance — it's the refresh, specifically the rotate-on-use refresh token. This deceptively simple mechanism is a whole design space: should the token be stored with a static hash or a slow KDF, on what invariant does reuse detection hold, how do you distinguish a lost-response retry from genuine token theft, and why is the client's write order part of the security contract.

This post is not about "how to issue a JWT." It breaks refresh token rotation down into one clean mechanism account: which core invariant it depends on, where that invariant gets quietly broken at the boundaries, and how each kind of breakage brings back either "phantom logouts" or security holes. The worked example throughout is one cross-layer auth migration on an environmental-monitoring platform (stack: .NET 10, JWT HS256, PostgreSQL, Expo / React Native). It is the sequel to the cookie-chunking-re-emission post: that one installed a temporary stabilizer that brought cookie re-emission down to once every 30 minutes, and this one is the root-cause cure that gets mobile off session cookies entirely.

The migration constraint: additive, not replace-in-place

The first design decision in any auth migration is "how the old and new coexist." The constraint here is additive:

Cookie and Bearer coexist as two schemes;
The web side is byte-for-byte unchanged — still Cookie + CSRF, with not a line of apps/web touched;
Old cookie-based mobile builds keep working, and no user is forced to re-login during the rolling release.

The server registers Bearer as a second scheme, letting the authorization policy accept either:

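// Bearer as a second scheme: the default policy accepts either Cookie or Bearer.
options.DefaultPolicy = new AuthorizationPolicyBuilder(
        CookieAuthenticationDefaults.AuthenticationScheme,
        JwtBearerDefaults.AuthenticationScheme)
    .RequireAuthenticatedUser()
    .Build(); // DefaultPolicy expects an AuthorizationPolicy, not the builder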

The database migration is purely additive: only CreateTable("refresh_tokens") plus indexes, no ALTER of any existing table, and Down() only DropTable, fully reversible. Additive isn't being conservative; it's a design principle that lowers migration risk: the new mechanism is laid down in parallel while the old one is left intact, the rollback cost approaches zero, and there's no "must switch simultaneously" cliff during the gradual rollout.

The token model: two kinds of token, two starkly opposite trade-offs

Bearer auth has two kinds of token, and their design trade-offs are almost exactly opposite; conflating them is where all subsequent confusion begins.

Access JWT: HS256 + kid header, 30-minute TTL, carries only the minimal sub / email claims. It is stateless and self-validating — the server can verify the signature without a database hit. The cost is that it cannot be revoked instantly, so its TTL must be short.

Refresh Token: an opaque ≥256-bit CSPRNG random string, stored with a static SHA-256 hex hash, rotated on each use, with a 14-day sliding idle window and no absolute cap. It is stateful and revocable — every use means a database lookup, a rotation, and the option to invalidate it.

The first trade-off worth recording here is storage: the refresh token is stored with a static hash, not a slow KDF (such as bcrypt). The reason is that it is itself a high-entropy CSPRNG string — a slow KDF exists to resist brute-force enumeration of low-entropy passwords, while brute-forcing a 256-bit random string is physically infeasible, so a slow KDF here merely adds latency to every refresh with zero security benefit. The entropy of the secret determines which storage to use — a more precise rule than "always bcrypt passwords."
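To put a rough number on "physically infeasible": the estimate below assumes an attacker who can test 10^12 candidate strings per second, a rate picked purely for illustration. On average, half of the 2^256 space has to be searched before a hit.

```latex
\[
\frac{2^{255}\ \text{guesses}}{10^{12}\ \text{guesses/s}}
\approx 5.8 \times 10^{64}\ \text{s}
\approx 1.8 \times 10^{57}\ \text{years}
\]
```

Slowing each guess by a constant factor, which is all a slow KDF does, does not change the conclusion at this scale.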

The second trade-off is the access token's 30-minute TTL — it is deliberately aligned with the cookie's ValidationInterval=30min revocation SLA, so the whole system has only one revocation story and you don't maintain two latency models.

And the authorization logic is entirely unchanged: Bearer's sub → NameIdentifier, email → Email, and identity flows through the existing CurrentUserMiddleware → PermissionResolver for per-request DB resolution as before. Never read role/permission from a JWT claim — this is the same invariant as in the cookie post. Its design significance is: the token carries identity only, and authorization always comes from per-request live DB resolution, so the token path needs zero authorization rework, and account disablement / permission changes take effect instantly by nature, free of any token-TTL latency.

The signing key is supplied through a secret or environment variable, and the value committed in appsettings.json must be empty. If the key is missing in production, startup fails rather than the service running with an empty signing key.

Trap one: kid collision in the JWT signing-provider cache

The first trap is production-relevant, not just a test artifact. Microsoft.IdentityModel's default CryptoProviderFactory caches signing providers by SecurityKey.KeyId (i.e. kid). If two keys share the same kid but have different bytes (for example multiple hosts in one process, or a key rotated under the same kid), the cache returns a provider built for another key's bytes, and validation fails with an inexplicable IDX10503 signature validation error.

The correct way is to opt both the signing key and the validation key out of the per-key provider cache:

var key = new SymmetricSecurityKey(bytes)
{
    KeyId = kid,
    CryptoProviderFactory = new CryptoProviderFactory { CacheSignatureProviders = false }
};

The key gotcha: CacheSignatureProviders = false must be set on both the signing side (TokenService) and the validation side (Program.cs) — set it on only one side and you still collide. Building an HMAC-SHA256 provider is cheap, so the cost of turning off the cache is negligible. This trap reveals a more general lesson: when a cache's key is a "logical identifier" (kid) rather than a "content identifier" (a byte hash), an identifier collision makes the cache return the wrong object — a trap shared by every system that caches by a logical key.
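The collision can be reproduced in miniature without any JWT library. The sketch below is a toy model, not Microsoft.IdentityModel's actual cache: `makeProviderCache`, `Provider`, and the string stand-ins for key bytes are hypothetical names invented for illustration.

```typescript
// Toy model of a provider cache keyed by a logical id (kid). All names are hypothetical.
type Provider = { kid: string; keyBytes: string };

function makeProviderCache(enabled: boolean) {
  const byKid = new Map<string, Provider>();
  return {
    get(kid: string, keyBytes: string): Provider {
      if (!enabled) return { kid, keyBytes };   // opt-out: always build a fresh provider
      const hit = byKid.get(kid);
      if (hit) return hit;                      // the bug: keyBytes is never compared
      const built: Provider = { kid, keyBytes };
      byKid.set(kid, built);
      return built;
    },
  };
}

const cachedFactory = makeProviderCache(true);
cachedFactory.get("k1", "bytes-of-host-A");
const wrongProvider = cachedFactory.get("k1", "bytes-of-host-B");   // same kid, different bytes

const uncachedFactory = makeProviderCache(false);
uncachedFactory.get("k1", "bytes-of-host-A");
const rightProvider = uncachedFactory.get("k1", "bytes-of-host-B");
```

With the cache on, the second lookup hands back the provider built for host A's bytes. With it off, each caller gets a provider for the bytes it actually passed.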

Trap two: the "grace window" of refresh token rotation is a security-sensitive zone

This is the segment of the whole migration most worth going through in detail — it was a real security defect caught in code review, not a theoretical hypothesis. To understand it, you first have to establish the core invariant of rotate-on-use.

The core invariant: exactly one active head per family

The security of rotate-on-use rests on one invariant: a single token chain (family) has exactly one active head at any moment. Each refresh consumes the current head and chains a unique successor. On this basis reuse detection holds: if a consumed token is presented again, it means either a legitimate client is replaying or a thief is wielding a stolen old token — and in either case the correct response is to revoke the entire family, because "a consumed token reappearing" means the invariant has been broken. This is exactly the fundamental reason rotate-on-use is safer than a "long-lived, non-rotating refresh token": it can detect that a token has been cloned.
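A minimal in-memory sketch of strict rotate-on-use, before any grace window is added, makes the invariant concrete. The `Row` shape and the `mint`, `rotate`, and `activeHeads` helpers are hypothetical and only model the behavior described above; they are not the platform's implementation.

```typescript
// In-memory model of strict rotate-on-use (no grace window yet). Hypothetical names throughout.
type Row = { id: number; familyId: string; consumed: boolean; revoked: boolean };

const rows: Row[] = [];
let nextId = 1;

function mint(familyId: string): Row {
  const row: Row = { id: nextId++, familyId, consumed: false, revoked: false };
  rows.push(row);
  return row;
}

// Active heads of a family: rows that are neither consumed nor revoked.
function activeHeads(familyId: string): Row[] {
  return rows.filter(r => r.familyId === familyId && !r.consumed && !r.revoked);
}

function rotate(id: number): Row | "reuse-detected" | "invalid" {
  const row = rows.find(r => r.id === id);
  if (!row || row.revoked) return "invalid";
  if (row.consumed) {
    // A consumed token reappeared: the invariant is broken, so revoke the whole family.
    for (const r of rows) if (r.familyId === row.familyId) r.revoked = true;
    return "reuse-detected";
  }
  row.consumed = true;          // consume the current head...
  return mint(row.familyId);    // ...and chain exactly one successor
}

const tokenA = mint("family-1");
const tokenB = rotate(tokenA.id);                            // normal refresh: A -> B
const headsAfterRotate = activeHeads("family-1").length;
const replayResult = rotate(tokenA.id);                      // A is presented again
const headsAfterReplay = activeHeads("family-1").length;
```

After the normal refresh the family has one active head. After the replay of A it has none, which is the clean re-login outcome.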

Why a grace window is needed

But this invariant fights with the real network. The initial implementation conservatively rejected any consumed refresh token, and on the flaky network inside a moving vehicle that brought "phantom logouts" back: the refresh response is lost, the client retries with a token the server has already consumed → naively this looks like a reuse attack → logout. The driver gets kicked on the way to work. So a grace window is needed (RefreshReuseGraceSeconds, default 30 seconds) to tolerate lost-response retries within the window. The difficulty is that the grace window must tolerate only legitimate lost-response retries and must not open a door for genuine token theft. This is precisely the narrowest seam in the design space.

The wrong grace implementation: fork a sibling, leave the old head alive

A tempting but wrong fix is to derive a new sibling C from the consumed predecessor A while letting the original successor B stay alive — so that a slow response (not a lost response) also works:

// Lost-response retry: the client replays predecessor A (it never got successor B).
// The DB only stores SHA-256(B); B's raw value cannot be returned. The tempting "fix":
// derive a new sibling C from the consumed A, and let B stay alive.
var c = Mint(familyId: a.FamilyId, previous: a.Id);   // A is still consumed
// B is left untouched. → the family now has two active heads (B and C)

Why this is a security defect (found in review, not theoretical):

Dangling token: B stays valid until the 14-day idle expiry, yet no client holds it.
The family's reuse detection is permanently disabled: B and C are parallel active branches, and the invariant "exactly one active token per family" is permanently broken — a stolen B will never trip detection, because there were already two legitimate active heads, so the signal "a consumed token reappears" loses its meaning.
Unbounded minting: the grace branch changes no state (A stays consumed-within-grace and B stays the active successor), so the grace predicate keeps passing. Replaying A repeatedly within the grace window mints unlimited valid siblings, none of which trip detection.

In one sentence: the instant you fork a sibling, the core invariant is permanently broken, and reuse detection — the entire security value of rotate-on-use — drops to zero.

The correct implementation: rotation with slack (RFC 9700 / Auth0-Okta pattern)

The correct approach is to never fork but to rotate the family's head forward by one: atomically consume the lost head B, and chain a single new successor C from B.

// Atomically consume the lost head B via a conditional update, and chain a single new
// successor C from B. This preserves "exactly one active token per family." A legitimate
// lost-response retry still gets a usable pair; a stolen B now hits consumed-replay →
// family revocation; replaying A after C appears fails the depth-1 guard → ReuseDetected.
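// Concurrency-safe: the loser of a race re-enters RotateAsync, landing on the same idempotent path.
// C is built before the consume so its Id can be recorded on B. This assumes Mint only
// constructs the row; C should be persisted only once the conditional consume has succeeded.
var c = Mint(familyId: headB.FamilyId, previous: headB.Id);
if (!await TryConsumeAsync(headB, replacedBy: c.Id, ct)) return await RotateAsync(raw, ct);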

The key gotcha: with statically-hashed rotation tokens, on retry you can never hand back a previously issued raw value — the DB has only its hash. So the only safe move is to rotate the family's head forward by one (consume the lost head, chain a successor) and never fork the family with a parallel sibling. This constraint comes directly from the earlier trade-off "static hash storage": because the hash is stored and the raw value is irreversible, "re-emitting an old token" is physically impossible, so the only remaining exit is rotating forward. Two seemingly unrelated design decisions interlock here into the same causal chain.

The grace predicate and reuse detection: translating the invariant into a decision

Grace triggers only when every one of a set of strict predicates holds: the presented token is consumed-only (not revoked, not expired); ReplacedByTokenId is set; ConsumedAt is set; now ≤ ConsumedAt + RefreshReuseGraceSeconds; and the successor row still exists, is active, and has successor.ConsumedAt null (strict depth-1: the direct successor must still be the active head). A hit goes through "rotation with slack."

Conversely, reuse detection triggers a whole-family revocation. It fires when a consumed token is presented again and is not the immediate predecessor within the grace window (that is, depth > 1, or past the grace window, or the family is already revoked): revoke all rows sharing the FamilyId, and every subsequent refresh on that family fails → one clean re-login. This is exactly how the invariant "exactly one active head per family" translates into an executable decision: within depth-1 it's a legitimate retry, beyond depth-1 it's an attack signal.
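The two rules can be read as one pure decision function. The sketch below is an illustration under assumptions: the `TokenRow` shape, the epoch-second timestamps, and the choice to simply reject a revoked or expired token as "invalid" are mine, while the field names mirror the columns mentioned in the text.

```typescript
// Pure decision function for a presented refresh token. The TokenRow shape is an assumption.
type TokenRow = {
  consumedAt: number | null;          // epoch seconds
  revokedAt: number | null;
  expiresAt: number;
  replacedByTokenId: string | null;
};

type Decision = "rotate" | "grace" | "reuse-detected" | "invalid";

function decide(
  presented: TokenRow,
  successor: TokenRow | null,         // row referenced by replacedByTokenId, if it still exists
  now: number,
  graceSeconds: number,
): Decision {
  // Assumption of this sketch: revoked or expired tokens are rejected outright.
  if (presented.revokedAt !== null || now >= presented.expiresAt) return "invalid";
  if (presented.consumedAt === null) return "rotate";   // normal path: this is the active head

  const withinGrace = now <= presented.consumedAt + graceSeconds;
  const successorIsActiveHead =
    presented.replacedByTokenId !== null &&
    successor !== null &&
    successor.consumedAt === null &&
    successor.revokedAt === null &&
    now < successor.expiresAt;

  // Strict depth-1: only the immediate predecessor of the live head gets slack.
  return withinGrace && successorIsActiveHead ? "grace" : "reuse-detected";
}

const rowA: TokenRow = { consumedAt: 100, revokedAt: null, expiresAt: 10_000, replacedByTokenId: "B" };
const rowB: TokenRow = { consumedAt: null, revokedAt: null, expiresAt: 10_000, replacedByTokenId: null };

const lostResponseRetry = decide(rowA, rowB, 110, 30);                       // inside the window, B still head
const lateReplay = decide(rowA, rowB, 131, 30);                              // past the 30-second window
const depthTwoReplay = decide(rowA, { ...rowB, consumedAt: 105 }, 110, 30);  // B already consumed
```

Only the first call takes the "rotation with slack" path; the other two are the whole-family revocation signal.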

The concurrency guard: seal the read-modify-write window with an atomic conditional update

TryConsumeAsync is a conditional ExecuteUpdateAsync: UPDATE … SET consumed_at, replaced_by_token_id WHERE id=@id AND consumed_at IS NULL AND revoked_at IS NULL, atomic, with no read-modify-write window. Of two racing rotations exactly one gets affected==1; the loser re-enters RotateAsync and lands on the idempotent grace path (no fork, no double mint). Only the relational path goes this way; the InMemory fallback is single-threaded test code, while Postgres always goes the atomic path. This reveals: the concurrency safety of rotation cannot rely on application-layer locking; "consume" must be a single conditional atomic write — the condition itself (consumed_at IS NULL) is the optimistic lock.
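The semantics of the conditional write can be modeled in a few lines. This is an in-memory stand-in with hypothetical names (`HeadRow`, `tryConsume`); in Postgres the atomicity comes from the single UPDATE statement, which a single-threaded sketch can only imitate.

```typescript
// In-memory stand-in for the conditional UPDATE: the check and the write are one step,
// so a caller that saw the row as active earlier cannot consume it a second time.
type HeadRow = { id: string; consumedAt: number | null; revokedAt: number | null; replacedBy: string | null };

function tryConsume(table: Map<string, HeadRow>, id: string, replacedBy: string, now: number): boolean {
  const row = table.get(id);
  // Mirrors: WHERE id = @id AND consumed_at IS NULL AND revoked_at IS NULL
  if (!row || row.consumedAt !== null || row.revokedAt !== null) return false;   // affected == 0
  row.consumedAt = now;
  row.replacedBy = replacedBy;
  return true;                                                                    // affected == 1
}

const tokenTable = new Map<string, HeadRow>([
  ["B", { id: "B", consumedAt: null, revokedAt: null, replacedBy: null }],
]);

// Two racing rotations both believed B was active; only the first conditional write lands.
const firstWins = tryConsume(tokenTable, "B", "C1", 1000);
const secondWins = tryConsume(tokenTable, "B", "C2", 1001);
```

The loser sees `false` and has written nothing, so B still points at exactly one successor.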

Trap three: the client's "persistence order" is an invisible contract

Even if the server is correct, if the client gets the order backwards, it reintroduces the whole-family revocation → logout bug from the client side.

Wrong: resolve before persisting the rotated token

// Single-flight refresh, but the rotated-refresh write is fire-and-forget (or sits after resolve).
const r = await fetch('/auth/token/refresh', { method: 'POST', body: oldRefresh })
const pair = await r.json()
setAccessInMemory(pair.access)
void SecureStore.setItemAsync('refresh', pair.refresh) // not awaited
return { ok: true } // resolves too early
// → the next /auth/token/refresh replays the consumed token. Grace tolerates the first time;
//   but under any retry/cold-start jitter it recurs → whole-family revocation → forced logout.

The essence of the problem: a waiting request — or the next refresh — proceeds while SecureStore still holds the old (consumed) refresh token, so the next refresh replays a consumed token.

Correct: the rotation write happens-before the shared promise resolves

// performRefresh awaits the new refresh's persistence write before its promise resolves.
// Because single-flight wraps it in a shared promise that the interceptor awaits, this
// guarantees no waiting request resumes and no next refresh can start before the rotated
// refresh lands in SecureStore.
const pair = await (await fetch('/auth/token/refresh', { method: 'POST', body: oldRefresh })).json()
setAccessInMemory(pair.access)
await SecureStore.setItemAsync('refresh', pair.refresh) // happens-before
return { ok: true }

The key gotcha: for a rotate-on-use refresh token, the client's persistence write of the new refresh must complete before any code path that can trigger the next refresh. "Persisted the token" is necessary but not sufficient — the real contract is order (write-before-resolve, enforced by that one shared single-flight promise). This is the client mirror of the server invariant: the server guarantees "one active head per family," and the client must guarantee "what it holds is always that active head" — and the only way to guarantee the latter is to make "write the new head" happen-before "any operation that might use the head."

The rest of the client solution

Beyond the server's rotation/reuse detection and the client's persistence order, the client has a few more supporting invariants:

Token storage split. Refresh → Expo SecureStore (AFTER_FIRST_UNLOCK_THIS_DEVICE_ONLY, no iCloud sync). Access → in-module memory only, never persisted, never logged. clearTokens() both deletes the SecureStore entry (it lives in the iOS Keychain, which survives app uninstall, so it must be wiped explicitly) and nulls the in-memory access token.
Single-flight 401 → refresh → retry once. The guard is a shared in-flight Promise, not a boolean flag. N concurrent 401s → exactly one POST /auth/token/refresh, and all waiters resolve from the same promise; the original request retries at most once, and a still-401 means one clean logout. 401→refresh is an if, never a loop.
Failure → exactly one clean logout. Any failed refresh → the logic that owns the flight runs clearTokens() + onSessionEnded() exactly once, even with N concurrent 401s.
Startup = boot refresh. AuthContext.restore() loads the stored refresh → refreshes → on success a single me() populates the immutable user/permission context, on failure a clean unauthenticated state. This replaces the old cookie me() race.
Mobile removal (mobile only): react-native-nitro-cookies and its peers, cookieJar.ts, csrf.ts — all used only on the auth path. apps/web's Cookie+CSRF is untouched.
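The single-flight guard from the list above can be sketched as a shared in-flight promise. `performRefresh` here is a hypothetical stand-in that only counts invocations, and `refreshOnce` is an illustrative name rather than the app's actual function.

```typescript
// Single-flight guard built on a shared in-flight promise (not a boolean flag).
let refreshCalls = 0;
async function performRefresh(): Promise<boolean> {
  refreshCalls += 1;            // stand-in for POST /auth/token/refresh plus the persistence write
  return true;
}

let inFlight: Promise<boolean> | null = null;

function refreshOnce(): Promise<boolean> {
  if (inFlight) return inFlight;              // join the flight that is already running
  const flight = performRefresh().finally(() => {
    inFlight = null;                          // a 401 after settlement starts a new flight
  });
  inFlight = flight;
  return flight;
}

// Three "concurrent 401s" arriving before the first refresh settles.
const waiter1 = refreshOnce();
const waiter2 = refreshOnce();
const waiter3 = refreshOnce();
```

All three waiters hold the same promise object, so they resume together and only after everything inside `performRefresh` has finished. A boolean flag could stop the duplicate request but would give the waiters nothing to await.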

Trap four: treating a transient failure as a rejection

The last piece that closes out this bug class is recognizing that "the refresh failed" is actually two different events. Naive hardening treats any non-ok refresh as session invalidation:

// Naive retry/timeout hardening: any non-ok refresh → logout.
const r = await fetchWithTimeout('/auth/token/refresh', { method: 'POST', body: refresh })
if (!r.ok) {
  await clearTokens()
  onSessionEnded()
  return
}
// A request timeout / abort / 5xx / 429 / offline (bad in-vehicle network) now gets read as
// "session invalid" → the driver is logged out on the way to work. Reintroduces the phantom-logout bug from the retry layer.

The correct approach is strict, exhaustive "transient vs rejected" classification:

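// Only a definite server rejection ends the session. Connection-shaped failures retry within a bounded budget.
// The Outcome shape is an assumed, illustrative input type, not the app's actual one.
type Outcome = { noStoredRefresh?: boolean; error?: unknown; status?: number; pair?: { access?: string; refresh?: string } }
function classify(o: Outcome): 'ok' | 'transient' | 'rejected' {
  // rejected up front: there is no stored refresh token to present
  if (o.noStoredRefresh) return 'rejected'
  // transient (retry, never log out): transport error, abort/timeout, HTTP >=500, HTTP 429
  if (o.error !== undefined || o.status === undefined || o.status >= 500 || o.status === 429) return 'transient'
  // ok: 2xx with a valid rotated pair
  if (o.status >= 200 && o.status < 300 && o.pair?.access && o.pair?.refresh) return 'ok'
  // rejected (log out exactly once): 401/400, other non-2xx, 2xx but token body missing/invalid
  return 'rejected'
}
// Tear down only when reason === "rejected"; transient-exhausted → tokens untouched.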

The key gotcha: logout must be triggered only by a definite server rejection. Every connection-shaped outcome is transient: the tokens are kept even after the retries are exhausted, and the user is never logged out. The concrete hardening parameters: each fetch races an 8-second AbortController timeout, and there are at most 3 attempts with full-jitter exponential backoff under a hard ~20-second total backoff budget. The retry loop runs inside that one single-flight promise, so N concurrent 401s still produce only one serial sequence of fetches. Telemetry is a pure local in-memory sink (no network, no PII, no token values), in which transport:"bearer" is the adoption marker: it must be observed in production before the mobile cookie path is retired.
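The backoff schedule can be sketched as a pure function. The attempt count and the ~20-second budget come from the parameters above; the 500 ms base, the 4-second cap, and the reading of the budget as the sum of the waits are assumptions made for illustration.

```typescript
// Full-jitter exponential backoff: each wait is a random value in [0, min(cap, base * 2^attempt)).
const MAX_ATTEMPTS = 3;
const BUDGET_MS = 20_000;

function fullJitterDelay(attempt: number, baseMs: number, capMs: number, rand: () => number): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(rand() * ceiling);
}

// Plans the waits between attempts and stops early if the budget would be exceeded.
function planDelays(baseMs: number, capMs: number, rand: () => number): number[] {
  const delays: number[] = [];
  let spent = 0;
  for (let attempt = 0; attempt < MAX_ATTEMPTS - 1; attempt++) {
    const delay = fullJitterDelay(attempt, baseMs, capMs, rand);
    if (spent + delay > BUDGET_MS) break;
    delays.push(delay);
    spent += delay;
  }
  return delays;
}

// rand is injected so the schedule can be pinned in tests; production code would pass Math.random.
const longestWaits = planDelays(500, 4_000, () => 0.999999);   // [499, 999]
const shortestWaits = planDelays(500, 4_000, () => 0);         // [0, 0]
```

Three attempts mean at most two waits. Full jitter spreads retries from many devices across the whole interval instead of having them all fire at the same exponential step.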

Behind this classification is a transferable principle: before a destructive operation like "revoking an observable session," you must separate "an explicit rejection" from "the connection didn't say." Treating the latter as the former is letting network jitter exercise a revocation authority that should belong to the server alone.

The transferable layer

Set aside the specific .NET and React Native APIs, and the real transferable insight from this migration is fourfold:

First, a security mechanism's value is parasitic on one invariant, and breaking the invariant nullifies the mechanism. All of rotate-on-use's security value comes from "exactly one active head per family" — the moment you fork a sibling, reuse detection drops to zero. When designing a security mechanism, first find the invariant it depends on, then ensure every code path (especially the grace/retry added "for the experience") leaves it unbroken.

Second, seemingly unrelated design decisions interlock at the boundary. "Store refresh with a static hash" (a storage decision) and "rotation can only go forward, never fork" (a security decision) are actually the same causal chain: because the hash is stored and the raw value is irreversible, re-emitting an old token is impossible, so you can only rotate forward. Evaluating each decision in isolation misses this coupling.

Third, the invariant of distributed state must be mirrored on every side. The server guarantees "one active head," and the client must therefore guarantee "what it holds is always that active head" — via the write-before-resolve order contract. Get it right on one side and wrong on the order on the other, and the bug comes right back from the other side.

Fourth, before a destructive operation, distinguish "rejection" from "unknown." Irreversible actions like logout, delete, and revoke should be triggered only by a definite signal; treating "the connection didn't say" as "an explicit rejection" mistakes the network's jitter for an authoritative verdict.