What to check
- request rate limit per time window (for example, requests per minute)
- concurrency limits
- whether different models or capabilities have separate quotas
- whether free, test, and production environments have different limits
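
Once the limits are known, it can help to mirror them on your own side so requests are rejected locally before the platform throttles them. The following is a minimal sketch of a sliding-window counter keyed per model. The class name `WindowLimiter` and its parameters are hypothetical, and the numbers you pass in should come from the quotas you confirmed above. It covers request rate only, not concurrency.

```python
import time
from collections import defaultdict, deque


class WindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each key.

    Hypothetical helper: the key could be a model or capability name,
    so that each one gets its own quota.
    """

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.stamps = defaultdict(deque)  # key -> timestamps of accepted requests

    def try_acquire(self, key):
        now = self.clock()
        q = self.stamps[key]
        # Drop timestamps that have aged out of the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False  # over quota for this key; caller should wait or queue
        q.append(now)
        return True
```

A caller would check `limiter.try_acquire("some-model")` before each upstream request and queue or delay the request when it returns `False`.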
Engineering guidance
- centralize retry, backoff, and circuit breaking on the server side instead of duplicating them in each client
- use caching or queueing for high-frequency flows
- distinguish spikes in business traffic from spikes in model invocations, and monitor them separately
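
The first point above can be sketched as a single server-side wrapper that every model call goes through. This is an illustration under assumptions, not a definitive implementation: `RateLimitError`, `CircuitBreaker`, and `call_with_retry` are hypothetical names, the thresholds and delays are placeholders, and the backoff uses exponential delays with full jitter.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error raised when the upstream API throttles a request."""


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected locally."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial call through (half-open).
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def call_with_retry(fn, breaker, max_attempts=4, base_delay=0.5, max_delay=8.0,
                    sleep=time.sleep, rng=random.random):
    """Call fn(), retrying throttled calls with capped exponential backoff."""
    for attempt in range(max_attempts):
        if not breaker.allow():
            raise CircuitOpenError("upstream is cooling down")
        try:
            result = fn()
        except RateLimitError:
            breaker.record_failure()
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random fraction of the capped exponential delay.
            sleep(min(max_delay, base_delay * 2 ** attempt) * rng())
        else:
            breaker.record_success()
            return result
```

Keeping this logic in one place means clients never retry on their own, so a throttled upstream sees one controlled retry schedule instead of many overlapping ones.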
Suggested debugging order
- confirm whether you hit a platform-level throttle
- confirm whether the specific model or capability has its own rate cap
- inspect whether the client is retrying or resubmitting unexpectedly
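
The order above can be sketched as a small triage helper. Everything platform-specific here is an assumption: the `x-ratelimit-scope` header is hypothetical (your provider may expose the throttle scope differently, or only in the error body), HTTP 429 is the standard "Too Many Requests" status but some platforms signal throttling another way, and `recent_request_keys` stands for whatever request fingerprint or idempotency key your logs record.

```python
from collections import Counter


def triage(status, headers, recent_request_keys):
    """Return findings in the same order as the checklist above."""
    if status != 429:
        return ["status is not 429, so this is probably not a throttle"]

    findings = []
    # Hypothetical header naming which quota rejected the call.
    scope = headers.get("x-ratelimit-scope", "unknown")
    if scope == "platform":
        findings.append("platform-level throttle")
    elif scope == "model":
        findings.append("model- or capability-level rate cap")
    else:
        findings.append("throttle scope unknown; check the provider's quota docs")

    # Repeated keys suggest the client is retrying or resubmitting.
    duplicates = {k: n for k, n in Counter(recent_request_keys).items() if n > 1}
    if duplicates:
        findings.append(f"duplicate submissions: {sorted(duplicates)}")
    return findings
```

For example, calling `triage(429, {"x-ratelimit-scope": "model"}, ["a", "b", "a"])` reports a model-level cap and flags key `a` as submitted more than once.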
