Rate Limits
Rate limits provide control over how frequently users and applications can call Workplace AI's APIs. They help protect the application from accidental overload and excessive automated usage while maintaining a fair experience for all users.
This capability is implemented in application middleware using the Token Bucket algorithm and behaves consistently across all deployment models. More information about this type of rate limiting, and rate limiting in general, can be found here.
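To make the Token Bucket algorithm concrete, here is a minimal sketch in Python. The class and method names are illustrative, not the product's actual implementation: a bucket starts full (allowing an initial burst), tokens are replenished each period, and a request is rejected when no token is available.

```python
import time

class TokenBucket:
    """Minimal token bucket sketch. A bucket holds up to `limit` tokens;
    `tokens_per_period` tokens are added every `period_seconds` seconds.
    Each request consumes one token; with no tokens left, it is rejected."""

    def __init__(self, limit, tokens_per_period, period_seconds):
        self.limit = limit
        self.tokens_per_period = tokens_per_period
        self.period = period_seconds
        self.tokens = limit          # buckets start full, allowing an initial burst
        self.last_refill = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        elapsed_periods = int((now - self.last_refill) / self.period)
        if elapsed_periods > 0:
            self.tokens = min(self.limit,
                              self.tokens + elapsed_periods * self.tokens_per_period)
            self.last_refill += elapsed_periods * self.period

    def try_acquire(self):
        """Return True if a token was available (request allowed)."""
        self._refill()
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False
```

With a token limit of 3, the first three calls in a burst succeed and the fourth is rejected until the next replenishment.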
These controls do not replace a dedicated gateway or edge protection (DoS/DDoS). They operate after a request has already reached the application.
Configuration
Rate Limiting is optional and is disabled by default.
It can be enabled and configured for the following endpoints:
Search - User requests
Search - Programmatic requests made via an API key
Control Hub - User requests
Data Science API
OData API
For instance, this allows you to rate limit Data Science and OData requests without impacting the interactive user experience.
For each endpoint, you can choose to enable or disable the rate limiter. If enabled, you can configure:
Token limit: This defines the maximum number of requests a user or application can issue in a quick burst.
Replenishment period: How often new tokens are added to each bucket.
Tokens per period: The number of tokens added in each replenishment period.
Queue limit: The number of requests to queue when no tokens are available; queued requests wait until a token becomes available. Requests issued beyond this capacity fail as described below.
For programmatic access, it is recommended to set the Queue limit to 0 so the calling application can take immediate action.
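A configuration for two endpoints might look like the sketch below. The key names and values are hypothetical and only illustrate how the four settings relate; the exact schema and endpoint names depend on your deployment.

```jsonc
{
  "Search-UserRequests": {
    "Enabled": true,
    "TokenLimit": 100,              // up to 100 requests in a burst
    "ReplenishmentPeriodSeconds": 10,
    "TokensPerPeriod": 20,          // sustained rate: 20 requests per 10 s
    "QueueLimit": 5                 // interactive traffic may queue briefly
  },
  "ODataAPI": {
    "Enabled": true,
    "TokenLimit": 50,
    "ReplenishmentPeriodSeconds": 60,
    "TokensPerPeriod": 50,
    "QueueLimit": 0                 // programmatic: fail fast with 429
  }
}
```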

Impact
If the rate limiter is triggered, the triggering request and any subsequent requests will fail with a 429 "Too Many Requests" status code until more tokens become available.

The Retry-After header will be set to the replenishment period.
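A well-behaved client can honour this header rather than retrying immediately. The sketch below assumes a `send_request` callable returning an object with `status_code` and `headers` attributes (as the popular `requests` library provides); the function name and defaults are illustrative.

```python
import time

def call_with_retry(send_request, max_attempts=3, sleep=time.sleep):
    """Call `send_request()` and, on a 429, wait for the Retry-After
    interval (the replenishment period) before trying again.

    `send_request` must return an object with `.status_code` and `.headers`.
    """
    for attempt in range(max_attempts):
        response = send_request()
        if response.status_code != 429:
            return response
        # Retry-After is set to the replenishment period, in seconds.
        delay = int(response.headers.get("Retry-After", "1"))
        if attempt < max_attempts - 1:
            sleep(delay)
    return response
```

The `sleep` parameter is injectable so the backoff behaviour can be tested without real delays.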
Logging
When the rate limiter is triggered, the following warning will be logged:
Rate Limiter rejected request for {partition} to {http-method} {http-path}
partition is either the username (authenticated) or the originating hostname (unauthenticated) that initiated the request
http-method is GET/POST/etc.
http-path is the endpoint. Query string parameters and POST bodies are not logged.
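If you aggregate these warnings for monitoring, the fields can be recovered from the template above. This parser is a sketch, assuming partitions and paths contain no spaces (usernames and hostnames normally do not); the function name is hypothetical.

```python
import re

# Matches the warning template:
#   Rate Limiter rejected request for {partition} to {http-method} {http-path}
LOG_PATTERN = re.compile(
    r"Rate Limiter rejected request for (?P<partition>\S+) "
    r"to (?P<method>\S+) (?P<path>\S+)"
)

def parse_rate_limit_warning(line):
    """Return a dict with partition, method and path,
    or None if the line is not a rate-limiter warning."""
    match = LOG_PATTERN.search(line)
    return match.groupdict() if match else None
```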
Other considerations
The rate limits are local to each API instance and held only in memory. This means that:
Restarting an endpoint (IIS reset, restarting the Application Pool, restarting the API processes, etc.) will reset the rate limiters.
In environments with load balancers, each instance of an endpoint will have its own limiter for each user - for consistency, sticky sessions are needed at the load balancer / application gateway.
When using a load balancer, the source header must be set correctly; otherwise all unauthenticated requests will be attributed to the load balancer rather than the originating machine.
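The effect of per-instance limiters can be quantified with a small sketch. Without sticky sessions, a user's requests are spread across instances, each holding its own bucket, so the worst-case burst a single user can achieve grows with the instance count; the numbers below are illustrative.

```python
def worst_case_burst(token_limit, instances, sticky_sessions):
    """Worst-case burst a single user can issue before being throttled.

    With sticky sessions every request from a user hits one instance,
    so a single bucket applies. Without them, the user can drain the
    bucket on every instance.
    """
    return token_limit if sticky_sessions else token_limit * instances

# e.g. a token limit of 100 across 4 instances
print(worst_case_burst(100, 4, sticky_sessions=True))   # one shared view: 100
print(worst_case_burst(100, 4, sticky_sessions=False))  # one bucket each: 400
```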