Configuring Token Quotas
Introduction
Envoy AI Gateway can rate-limit by token usage rather than request count, and track a separate budget for each caller based on identity. This prevents a single consumer or a runaway agent from exhausting a shared model budget, and it lets you offer per-user, per-department, and per-tier token quotas on one gateway.
A token quota combines three pieces:
- AIGatewayRoute.llmRequestCosts extracts token counts from each LLM response into Envoy dynamic metadata.
- A global rate limit backend (Redis) accumulates the cost. Because the token cost is known only after the response, it cannot be tracked by the per-pod local limiter.
- A BackendTrafficPolicy of type Global defines the budget and the identity key.
Use Cases
- Give each user a monthly token budget on an expensive model, with a higher budget for a premium tier.
- Cap the total tokens a department can consume across all its applications.
- Protect a shared model from a single misbehaving automation account.
Prerequisites
- Envoy AI Gateway is installed, with an AIGatewayRoute routing to your model backends. Confirm the relevant CRDs are present:
- Caller identity is propagated as request headers, for example x-user-id. See Authenticating Consumers. Without an identity header the budget below collapses to a single counter shared by all callers, so this step is what makes the quota per consumer.
- A Redis instance is available. Use a managed instance from Cache Service for Redis as the default provider, then record its access address (host:port) and credentials. Verify reachability from the cluster before going further:
- Note the Gateway name and namespace; both are needed later to restart the right data-plane proxy:
Create the Gateway and AIGatewayRoute in a dedicated namespace (for example maas-system), not in the Envoy Gateway control-plane namespace envoy-gateway-system. A gateway placed in the control-plane namespace may not have the AI Gateway request-processing filter and SecurityPolicy applied to its listener, which silently breaks routing and policy enforcement. See Envoy AI Gateway.
Steps
Enable the global rate limit backend
The local rate limiter cannot accumulate response-derived token cost, so the gateway must use the Global Rate Limit service backed by Redis. Create a Redis instance from Cache Service for Redis and copy its access address from the instance detail page. Set that address in the envoy-gateway-config ConfigMap (namespace envoy-gateway-system), under data."envoy-gateway.yaml":
Add the rateLimit block under the top-level config (keep any existing keys such as gateway: and provider: intact):
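A minimal sketch of the resulting ConfigMap, assuming the standard EnvoyGateway configuration schema. The provider and gateway values shown here only stand in for whatever your installation already contains, and <redis-host>:<redis-port> is a placeholder for the address you recorded.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: envoy-gateway-config
  namespace: envoy-gateway-system
data:
  envoy-gateway.yaml: |
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: EnvoyGateway
    # Existing keys (illustrative values): leave yours exactly as they are.
    provider:
      type: Kubernetes
    gateway:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    # New block: back the Global Rate Limit service with Redis.
    rateLimit:
      backend:
        type: Redis
        redis:
          url: <redis-host>:<redis-port>
```

Only the rateLimit block is new; if your existing configuration carries additional keys, keep them alongside it.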
- <redis-host>:<redis-port>: the access address copied from the Redis instance detail page.
- For a password-protected Redis, also create an Opaque Secret with keys redis-username and redis-password, then reference it via rateLimit.backend.redis.auth.passwordRef. See the Envoy Gateway rate-limit docs for the full schema.
The Envoy Gateway control plane reads this bootstrap configuration only at startup and does not hot-reload it, so restart its Deployment to apply the change, then confirm the dedicated envoy-ratelimit Deployment is healthy:
A single Redis instance per Envoy Gateway is sufficient. Any reachable Redis also works, but a managed instance is recommended for availability and backup.
If the Gateway was already running before the rate limit backend was enabled, also restart its data plane so the proxy picks up the rate limit service:
Capture token usage on the route
Add llmRequestCosts to the AIGatewayRoute so the gateway writes token counts into Envoy dynamic metadata (the per-request scratch space filters use to talk to each other) under the namespace io.envoy.ai_gateway. The rate-limit filter reads from this namespace in the next step.
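As a sketch, the addition to the route might look like the fragment below, assuming the aigateway.envoyproxy.io/v1alpha1 API version. The route name and the three metadata key names are hypothetical, and the existing parentRefs and rules are omitted because they stay unchanged.

```yaml
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: my-route              # hypothetical route name
  namespace: maas-system      # example namespace from the prerequisites
spec:
  # parentRefs and rules: keep what your route already defines.
  llmRequestCosts:
    - metadataKey: llm_input_token    # hypothetical key: prompt tokens
      type: InputToken
    - metadataKey: llm_output_token   # hypothetical key: completion tokens
      type: OutputToken
    - metadataKey: llm_total_token    # hypothetical key: prompt + completion
      type: TotalToken
```

Declaring several keys is optional; a single TotalToken entry is enough if the budget only counts total tokens.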
- metadataKey: the key under io.envoy.ai_gateway where the count is written. Pick any name; the BackendTrafficPolicy below must reference the same string.
- type: InputToken counts the prompt, OutputToken counts the completion, TotalToken is the sum. Use CEL for a custom formula, for example to charge output tokens 3× because they are slower:
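One way to express the 3× weighting, offered as a sketch: it assumes the CEL environment exposes input_tokens and output_tokens as unsigned integers, and the key name llm_weighted_token is hypothetical. Check the AIGatewayRoute API reference for the variables your version supports.

```yaml
llmRequestCosts:
  - metadataKey: llm_weighted_token   # hypothetical key name
    type: CEL
    # Prompt tokens count once; completion tokens count three times.
    cel: "input_tokens + output_tokens * uint(3)"
```

With this formula, a call using 1000 prompt tokens and 500 completion tokens is charged 1000 + 3 × 500 = 2500 units against the budget.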
Apply and confirm the route is still accepted (the new field should not break translation):
Define the token budget by identity
Attach a BackendTrafficPolicy with a Global rate limit. Set the request cost to 0 and the response cost to the captured token metadata, so that only tokens count against the limit. Use clientSelectors to scope the budget per identity and per model.
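A minimal sketch of such a policy, assuming the gateway.envoyproxy.io/v1alpha1 API version. The policy name, Gateway name, model value, and the llm_total_token key are hypothetical, and x-ai-eg-model is assumed to be the header that carries the requested model.

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: per-user-token-quota        # hypothetical policy name
  namespace: maas-system            # same namespace as the Gateway
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: my-gateway              # hypothetical Gateway name
  rateLimit:
    type: Global
    global:
      rules:
        - clientSelectors:
            - headers:
                - name: x-user-id
                  type: Distinct         # one counter per header value
                - name: x-ai-eg-model    # assumed model header
                  type: Exact
                  value: gpt-4o-mini     # hypothetical model name
          limit:
            requests: 200000             # token budget per window
            unit: Hour
          cost:
            request:
              from: Number
              number: 0                  # the request itself costs nothing
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_total_token     # must match a metadataKey on the route
```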
- x-user-id with type: Distinct gives each caller an independent counter, which produces a per-user quota. Use x-user-group to aggregate a department against one budget, or match a specific group such as premium with type: Exact for tiered limits.
- limit.requests is interpreted as a token budget here, because the cost is sourced from token metadata. With 200000 tokens/hour and a typical chat call costing roughly 1.5–2k tokens, expect ~100–130 calls/hour per caller before throttling kicks in.
- cost.request.number: 0 means a request that fails to reach the upstream (e.g. malformed body) consumes no quota. Set it to 1 if you want pre-flight throttling on call count as well.
- cost.response.metadata.key must match a metadataKey declared on the route.
Verification
Drive a short burst with a valid identity token, then a request from a different identity, and confirm only the first identity is throttled:
A run that exhausts alice's quota will print something like alice #1 -> 200 … alice #5 -> 429 … bob -> 200. To inspect the counter directly in Redis, scan for keys containing the identity value. Envoy Gateway names each counter <gateway-namespace>/<gateway-name>/<listener>_<route>_..._<x-user-id-value>_..._<window-timestamp>, so the user identity is the simplest filter:
If Redis is unreachable, Envoy fails open by default: requests pass through unmetered until Redis recovers. Watch the rate-limit pod's logs (kubectl logs deploy/envoy-ratelimit -n envoy-gateway-system) and alert on its Ready condition so silent quota loss does not go unnoticed.
Learn More
Next Steps
Configure Metering Token Usage to report consumption per tenant and feed chargeback.