
perf(scheduler): strip id_token and refresh_token from scheduler cache#1429

Open
Fumeng24 wants to merge 11 commits into Wei-Shaw:main from Fumeng24:optimize/strip-scheduler-cache-credentials

Conversation

Fumeng24 commented Apr 1, 2026

Summary

The scheduler cache (sched:acc:{id}) stores full Account JSON blobs in Redis, including large OAuth tokens (id_token, refresh_token). In multi-instance deployments with limited inter-server bandwidth (e.g. 30 Mbps), this causes excessive Redis traffic:

  • Each account entry is ~20KB due to JWT tokens
  • With 1000 accounts, a full rebuild writes ~20MB to Redis
  • Observed ~19.5 Mbps of Redis upload traffic from scheduler cache writes alone

The gateway request path only needs access_token and api_key. The id_token and refresh_token are consumed exclusively by background token-refresh services that read directly from the database.

Changes

  • Add marshalAccountForCache() helper that strips id_token and refresh_token before serialization, using a shallow copy to avoid mutating the caller's data
  • Apply to SetAccount, SetSnapshot, and UpdateLastUsed write paths
  • Add unit tests for the stripping logic (nil account, empty credentials, strip verification, mutation safety)

Impact

  • ~60-70% reduction in per-account Redis payload size
  • Significantly reduces cross-server Redis bandwidth in multi-instance deployments
  • No impact on gateway functionality — access_token and api_key are preserved
  • No impact on token refresh — refresh services read credentials from PostgreSQL

Test plan

  • Unit tests for marshalAccountForCache (nil, empty, strip, no-mutation)
  • Verify gateway requests still work (access_token/api_key preserved in cache)
  • Verify token refresh still works (reads from DB, not cache)
  • Monitor Redis bandwidth reduction with iftop -f "port 6379"

Fumeng24 added 11 commits April 2, 2026 01:08
The scheduler cache stores full Account JSON blobs including OAuth tokens
in Redis. In multi-instance deployments with limited inter-server
bandwidth, this causes excessive Redis traffic — each account entry is
~20KB due to large JWT tokens (id_token, refresh_token), and with 1000
accounts a full rebuild writes ~20MB to Redis.

The gateway request path only needs access_token and api_key; id_token
and refresh_token are consumed exclusively by background token-refresh
services that read directly from the database.

This change:
- Adds marshalAccountForCache() that strips heavy credential fields
  before serialization, using a shallow copy to avoid mutating the caller
- Applies it to SetAccount, SetSnapshot, and UpdateLastUsed
- Adds unit tests for the stripping logic

Expected impact: ~60-70% reduction in per-account Redis payload size,
significantly reducing cross-server Redis bandwidth in multi-instance
deployments.

Extends the cache bandwidth optimization to also strip:
- AccountGroups[].Group: full Group objects (only GroupID needed for routing)
- AccountGroups[].Account: back-reference to parent (circular/redundant)
- Groups: duplicate of AccountGroups[].Group (ops-only, not used in gateway)

The gateway request path (isAccountInGroup) only reads
AccountGroup.GroupID. The Groups slice is only consumed by ops
monitoring services which are disabled on non-primary instances.

With 1000 accounts sharing the same groups, this eliminates massive
duplication — each ~40-field Group object was previously embedded twice
per account (in AccountGroups[].Group and Groups[]).

Adds unit tests for Group stripping and mutation safety.
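The Group stripping described in this commit could be sketched as below. The struct shapes are assumptions standing in for the real ent models, which carry far more fields:

```go
package main

// Trimmed stand-ins for the real models; field names are assumptions.
type Group struct {
	ID   int64
	Name string
	// ... the real Group has ~40 fields
}

type AccountGroup struct {
	GroupID int64
	Group   *Group   // embedded copy: routing only needs GroupID
	Account *Account // back-reference to parent: circular/redundant
}

type Account struct {
	ID            int64
	AccountGroups []AccountGroup
	Groups        []*Group // duplicate of AccountGroups[].Group
}

// stripGroupsForCache clears the embedded Group objects and
// back-references, keeping only the GroupID that isAccountInGroup reads
// on the request path. It copies the slice so the caller is untouched.
func stripGroupsForCache(a *Account) *Account {
	if a == nil {
		return nil
	}
	out := *a
	out.Groups = nil
	out.AccountGroups = make([]AccountGroup, len(a.AccountGroups))
	for i, ag := range a.AccountGroups {
		ag.Group = nil   // drop embedded Group object
		ag.Account = nil // drop circular back-reference
		out.AccountGroups[i] = ag
	}
	return &out
}
```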

…1357)

Cherry-picked from upstream PR Wei-Shaw#1357.
Adds cancel_stream_on_client_disconnect config option to stop draining
upstream responses when client disconnects, saving bandwidth.

Cherry-picked from upstream PR Wei-Shaw#1382.
Adds race-aware recovery for invalid_grant and per-account mutex
to prevent concurrent refresh causing false errors.

Cherry-picked from upstream PR Wei-Shaw#1391.
Upgrades 429/529 from passthrough accounts into failover-capable
errors instead of passing directly to client.

Cherry-picked from upstream PR Wei-Shaw#1358.
Prevents stale codex usage snapshots from re-rate-limiting
pool mode API key accounts after manual reset.

Cherry-picked from upstream PR Wei-Shaw#1399.
Adds DB_MAX_OPEN_CONNS etc. env bindings with conservative defaults.

…o_system_logs_test

Use strings.Contains from stdlib instead of custom helper that
conflicts with the same-named function in ops_repo_system_logs_test.go.

- max_open_conns: 20 → 50
- max_idle_conns: 5 → 25 (50% ratio per project docs)
- conn_max_lifetime_minutes: 10 → 30

Port of upstream PR Wei-Shaw#734 with all 6 review issues fixed:
- Fix #1: AuthService calls ReferralService.ProcessReferralRegistration()
- Fix #2: GetOrCreateProfile re-fetches on user_id unique conflict
- Fix #3: Migration uses ON DELETE CASCADE for all foreign keys
- Fix #4: Single migration file (082_create_referral_tables.sql)
- Fix #5: grantRewardsInTx is unexported
- Fix #6: Validates non-negative reward values

Features:
- User referral codes (8-char, lazy-loaded)
- Dual rewards (inviter + invitee, configurable in admin settings)
- Admin referral stats endpoint
- Self-referral prevention, duplicate protection via DB constraints
- Raw SQL repository (no ent dependency for referral tables)

Backend only - frontend will be added in a follow-up commit.
