Rate limits
Overview
What is rate limiting?
A rate limit is a restriction that an API imposes on the number of times a user or client can access the server within a specified period of time.
Why do we have rate limits?
Rate limits are a common practice for APIs, and they are set for several different reasons:
- They help prevent abuse or misuse of the API. For example, a malicious actor could flood the API with requests in an attempt to overload it or cause a service outage. By setting rate limits, OpenAI can prevent this kind of activity.
- Rate limiting helps ensure everyone has fair access to the API. If one person or organization makes too many requests, it can slow down everyone else's use of the API. By regulating the number of requests a single user can make, OpenAI ensures that the maximum number of people have access to the API without experiencing slowdowns.
- Rate limiting helps OpenAI manage the overall load on its infrastructure. If requests to the API increase dramatically, it can stress the server and cause performance issues. By setting rate limits, OpenAI can help maintain a smooth and consistent experience for all users.
Please read this document in its entirety to better understand how OpenAI's rate limit system works. We provide code samples and solutions for common problems; please follow this guide before filling out the Rate Limit Increase Request form. The last section details how to complete that form.
What are the rate limits for our API?
We enforce rate limits at the organization level, not the user level, based on the specific endpoint used and the type of account you have. Rate limits are measured in two ways: RPM (requests per minute) and TPM (tokens per minute). The table below highlights the default rate limits for our API, but these can be increased based on your use case by filling out the Rate Limit Increase Request form.
Account type | Text & Embedding | Chat | Edit | Image | Audio |
---|---|---|---|---|---|
Free trial users | 3 RPM, 150,000 TPM | 3 RPM, 40,000 TPM | 3 RPM, 150,000 TPM | 5 images/min | 3 RPM |
Pay-as-you-go users (first 48 hours) | 60 RPM, 250,000 TPM | 60 RPM, 60,000 TPM | 20 RPM, 150,000 TPM | 50 images/min | 50 RPM |
Pay-as-you-go users (after 48 hours) | 3,500 RPM, 350,000 TPM | 3,500 RPM, 90,000 TPM | 20 RPM, 150,000 TPM | 50 images/min | 50 RPM |
For gpt-3.5-turbo-16k, the TPM limits for pay-as-you-go users are 2x the Chat values listed above: 120,000 TPM in the first 48 hours and 180,000 TPM thereafter.
TPM (tokens per minute) units vary by model:
Model type | 1 TPM is equivalent to |
---|---|
davinci | 1 token/minute |
curie | 25 tokens/minute |
babbage | 100 tokens/minute |
ada | 200 tokens/minute |
In practical terms, this means you can send roughly 200x more tokens per minute to an ada model than to a davinci model.
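As a sketch, the conversion table above can be treated as a simple multiplier lookup (the function and dictionary names here are illustrative, not part of the API):

```python
# Tokens/minute granted per 1 TPM unit, per the table above (illustrative helper).
TPM_MULTIPLIER = {"davinci": 1, "curie": 25, "babbage": 100, "ada": 200}

def effective_tokens_per_minute(tpm_limit: int, model_family: str) -> int:
    """Convert a TPM limit into actual tokens/minute for a model family."""
    return tpm_limit * TPM_MULTIPLIER[model_family]

print(effective_tokens_per_minute(1000, "ada"))      # 200000
print(effective_tokens_per_minute(1000, "davinci"))  # 1000
```

This matches the 200x ratio between ada and davinci noted above.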
Note that a rate limit can be triggered by either metric, whichever threshold is reached first. For example, you might send 20 requests totaling only 100 tokens to the Codex endpoint and still fill your limit, even though you did not send 40k tokens within those 20 requests.
How does rate limiting work?
If your rate limit is 60 requests per minute and 150k davinci tokens per minute, you will be limited by whichever cap is reached first. For example, if your max requests/minute is 60, you should be able to send 1 request per second. If you're sending 1 request every 800 ms, then after hitting the rate limit you only need to sleep the program for 200 ms before sending another request; otherwise subsequent requests will fail. With the default of 3,500 requests/min, a client can effectively send 1 request every 17 ms, or 0.017 seconds.
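The spacing math above can be turned into a simple client-side pacer. This is a minimal sketch under stated assumptions: `paced_calls` and `make_request` are hypothetical names standing in for any API call, not part of the OpenAI library.

```python
import time

def paced_calls(rpm_limit: int, make_request, num_requests: int):
    """Space out calls so they never exceed rpm_limit requests per minute."""
    min_interval = 60.0 / rpm_limit  # 60 RPM -> 1 s; 3,500 RPM -> ~0.017 s
    results = []
    last_sent = float("-inf")  # first request goes out immediately
    for _ in range(num_requests):
        wait = min_interval - (time.monotonic() - last_sent)
        if wait > 0:
            time.sleep(wait)  # sleep just long enough to stay under the limit
        last_sent = time.monotonic()
        results.append(make_request())
    return results
```

Pacing like this only prevents you from exceeding your own requests-per-minute cap; it does not protect against the tokens-per-minute cap, which depends on request sizes.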
What if I get a rate limiting error?
A rate limit error looks like this: `Rate limit reached for default-text-davinci-002 in organization org-{id} on requests per min. Limit: 20.000000 / min. Current: 24.000000 / min.` This means you have made too many requests in a short period of time, and the API will refuse to fulfill further requests until the specified time has elapsed.
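For quick diagnostics, the limit and the observed rate can be extracted from a message in this format. A hedged sketch: the regex below assumes exactly the wording shown above, which may differ for other endpoints.

```python
import re

error_msg = (
    "Rate limit reached for default-text-davinci-002 in organization "
    "org-{id} on requests per min. Limit: 20.000000 / min. "
    "Current: 24.000000 / min."
)

# Pull the advertised limit and the observed rate out of the message text.
match = re.search(r"Limit: ([\d.]+) / min\. Current: ([\d.]+) / min", error_msg)
limit, current = float(match.group(1)), float(match.group(2))
print(f"Over the limit by {current - limit:.0f} requests/min")  # Over the limit by 4 requests/min
```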
Rate limits and max_tokens
Each of the models we provide has a fixed maximum number of tokens that can be passed as input for processing. This upper bound cannot be increased. For example, if you use text-ada-001, you can send at most 2,048 tokens per request to this model.
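One practical consequence: with a long prompt you may need to shrink max_tokens so that prompt plus completion stays within the model's window. A rough sketch, using a crude ~4-characters-per-token heuristic instead of the real tokenizer (the function name and heuristic are assumptions for illustration only):

```python
MODEL_TOKEN_LIMIT = 2048  # e.g. text-ada-001

def clamp_max_tokens(prompt: str, requested_max_tokens: int,
                     token_limit: int = MODEL_TOKEN_LIMIT) -> int:
    """Shrink max_tokens so approximate prompt tokens + completion fit the limit."""
    approx_prompt_tokens = max(1, len(prompt) // 4)  # crude heuristic, not exact
    available = token_limit - approx_prompt_tokens
    return max(0, min(requested_max_tokens, available))

print(clamp_max_tokens("a" * 4000, 2000))  # ~1000 prompt tokens leave room for 1048
```

In production, count tokens with the model's actual tokenizer rather than a character heuristic.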
Error mitigation
What steps can I take to mitigate such problems?
The OpenAI Cookbook has a Python notebook detailing how to avoid rate limit errors.
Use caution when offering programmatic access, batch processing capabilities, and automated social media posting - consider enabling these features only for trusted customers.
To prevent automated and high-volume abuse, set a usage cap for individual users within a specified time frame (daily, weekly, or monthly). For users who exceed the cap, consider implementing a hard block or a manual review process.
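A per-user cap like this can be sketched as a small in-memory counter. The class name and cap value below are hypothetical; a production service would persist counts and handle period resets.

```python
from collections import defaultdict

class UsageTracker:
    """Track requests per (user, period) and block users over a hard cap."""

    def __init__(self, cap: int = 1000):  # hypothetical per-period cap
        self.cap = cap
        self.counts = defaultdict(int)

    def allow(self, user_id: str, period: str) -> bool:
        """Return True and count the request, or False if the user is capped."""
        key = (user_id, period)
        if self.counts[key] >= self.cap:
            return False  # over the cap: hard-block or route to manual review
        self.counts[key] += 1
        return True

tracker = UsageTracker(cap=2)
print(tracker.allow("alice", "2023-06-01"))  # True
print(tracker.allow("alice", "2023-06-01"))  # True
print(tracker.allow("alice", "2023-06-01"))  # False
```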
Retry with exponential backoff
A simple way to avoid errors from calling the API too frequently is to wait randomly and retry. Concretely: when the API returns a "429 Too Many Requests" status code, pause execution, compute a waiting time t based on the current number of retries, then call the API again; if the 429 status code is returned again, pause for the (now longer) t seconds and repeat, until the data is successfully obtained or a retry cap is reached.
Exponential backoff means performing a short sleep and retrying unsuccessful requests when the first rate limit error is hit. If the request is still unsuccessful, increase the sleep length and repeat the process. This will continue until the request succeeds or until the maximum number of retries is reached. This method has many advantages:
- Automatic retries mean you can recover from rate limit errors without crashing or losing data
- Exponential backoff means the first retries happen quickly, while you still benefit from longer delays if the first few retries fail
- Adding random jitter to the delays helps keep simultaneous clients from all retrying at the same time
Note that unsuccessful requests still count toward your per-minute limit, so continuously resending a request will not work. Below are some example solutions in Python that use exponential backoff.
Example 1: Using the Tenacity library
Tenacity is an Apache 2.0-licensed general-purpose retry library, written in Python, designed to simplify the task of adding retry behavior to almost anything. To add exponential backoff to your requests, use the tenacity.retry decorator. The following example adds random exponential backoff to requests using the tenacity.wait_random_exponential function.
```python
import openai
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
)  # for exponential backoff

@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def completion_with_backoff(**kwargs):
    return openai.Completion.create(**kwargs)

completion_with_backoff(model="text-davinci-003", prompt="Once upon a time,")
```
Note that the Tenacity library is a third-party tool, and OpenAI makes no guarantees about its reliability or security.
Example 2: Using the backoff library
Another Python library that provides function decorators for backoff and retry is backoff:
```python
import backoff
import openai

@backoff.on_exception(backoff.expo, openai.error.RateLimitError)
def completions_with_backoff(**kwargs):
    return openai.Completion.create(**kwargs)

completions_with_backoff(model="text-davinci-003", prompt="Once upon a time,")
```
Example 3: Implement exponential backoff
If you don't want to use a third-party library, you can implement your own backoff logic following this example:
```python
# imports
import random
import time

import openai

# define a retry decorator
def retry_with_exponential_backoff(
    func,
    initial_delay: float = 1,
    exponential_base: float = 2,
    jitter: bool = True,
    max_retries: int = 10,
    errors: tuple = (openai.error.RateLimitError,),
):
    """Retry a function with exponential backoff."""

    def wrapper(*args, **kwargs):
        # Initialize variables
        num_retries = 0
        delay = initial_delay

        # Loop until a successful response, max_retries is hit, or an exception is raised
        while True:
            try:
                return func(*args, **kwargs)

            # Retry on specific errors
            except errors as e:
                # Increment retries
                num_retries += 1

                # Check if max retries has been reached
                if num_retries > max_retries:
                    raise Exception(
                        f"Maximum number of retries ({max_retries}) exceeded."
                    )

                # Increment the delay
                delay *= exponential_base * (1 + jitter * random.random())

                # Sleep for the delay
                time.sleep(delay)

            # Raise exceptions for any errors not specified
            except Exception as e:
                raise e

    return wrapper

@retry_with_exponential_backoff
def completions_with_backoff(**kwargs):
    return openai.Completion.create(**kwargs)
```
Again, OpenAI makes no guarantees about the security or efficiency of this solution, but it can be a good starting point for your own.
Batching requests
The OpenAI API has separate limits for requests and tokens per minute.
If you hit the request-per-minute limit, but have capacity available in terms of tokens per minute, you can batch multiple tasks into each request to increase throughput. This will allow you to handle more tokens, especially for our smaller models.
Sending a batch of prompts works exactly the same as a normal API call, except you pass a list of strings to the prompt parameter instead of a single string.
Example without batching:
```python
import openai  # for making OpenAI API requests

num_stories = 10
prompt = "Once upon a time,"

# serial example, with one story completion per request
for _ in range(num_stories):
    response = openai.Completion.create(
        model="curie",
        prompt=prompt,
        max_tokens=20,
    )

    # print story
    print(prompt + response.choices[0].text)
```
Example with batching:
```python
import openai  # for making OpenAI API requests

num_stories = 10
prompts = ["Once upon a time,"] * num_stories

# batched example, with 10 story completions per request
response = openai.Completion.create(
    model="curie",
    prompt=prompts,
    max_tokens=20,
)

# match completions to prompts by index
stories = [""] * len(prompts)
for choice in response.choices:
    stories[choice.index] = prompts[choice.index] + choice.text

# print stories
for story in stories:
    print(story)
```
Warning: the response object may not return completions in the same order as the prompts, so always use the index field to match responses back to prompts.
Requesting a rate limit increase
When should I consider applying for a rate limit increase?
Our default rate limits help maximize stability and prevent abuse of our API. We raise limits to enable high-traffic applications, so the best time to request a rate limit increase is when you have the traffic data needed to support a strong case for the increase. Substantial rate limit increase requests without supporting data are unlikely to be granted. If you're preparing for a product launch, obtain this data through a phased rollout over 10 days.
Keep in mind that rate limit increases can sometimes take 7-10 days, so it's wise to plan ahead and submit early if you have data showing that your current growth will hit your rate limit.
Will my rate limit increase request be denied?
A common reason requests are rejected is a lack of the data needed to justify the increase. Numerical examples are provided below to show how best to support a rate limit increase request. We make a best effort to grant all requests that comply with our safety policies and include supporting data. We're committed to enabling developers to scale and succeed with our APIs.
I've implemented exponential backoff for my text/code API, but I'm still getting errors. How do I increase my rate limit?
Currently, we do not support increasing rate limits on free trial accounts, nor on certain endpoints such as the edits endpoint. We also do not increase rate limits for ChatGPT, but you can join the ChatGPT Pro waitlist.
We know how frustrating it can be to be constrained by rate limits, and we would love to raise the defaults for everyone. However, due to shared capacity constraints, rate limit increases can only be granted to paying customers whose Rate Limit Increase Request form has been reviewed and approved. To help us assess what you actually need, please provide statistics on your current usage, or projections based on historical user activity, in the "Share evidence of need" section of the form. If this information is not available, we recommend a phased release approach: release the service to an initial set of users at the current rate limit, collect usage data over 10 business days, and then submit a formal rate limit increase request based on that data for review and approval.
If you submit your request and it is approved, we will notify you within 7-10 business days. Here are some examples of how to fill out the form:
DALL-E API example
Model | Estimated tokens/minute | Estimated requests/minute | Number of users | Evidence of need | 1-hour maximum throughput cost |
---|---|---|---|---|---|
DALL-E API | N/A | 50 | 1,000 | Our app is currently in production, and based on past traffic we make about 10 requests per minute. | $60 |
DALL-E API | N/A | 150 | 10,000 | As our app became more and more popular in the App Store, we started running into rate limits. Can we get three times the default limit of 50 images per minute? If more are required, we will submit a new form. Thanks! | $180 |
Language model example
Model | Estimated tokens/minute | Estimated requests/minute | Number of users | Evidence of need | 1-hour maximum throughput cost |
---|---|---|---|---|---|
text-davinci-003 | 325,000 | 4,000 | 50 | We will be releasing to an initial set of alpha testers and will need a higher limit to accommodate their initial usage. We provide a link here to our Google Drive showing analytics and API usage. | $390 |
text-davinci-002 | 750,000 | 10,000 | 10,000 | Our app is getting a lot of attention, we have 50,000 people on the waiting list. We want to roll out to groups of 1000 people per day until we reach 50,000 users. Please see this link for our current token/minute traffic over the past 30 days. This is for 500 users, and based on their usage we think 750,000 tokens/minute and 10,000 requests/minute would be a good starting point. | $900 |
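The "1-hour maximum throughput cost" column can be sanity-checked with simple arithmetic, assuming roughly $0.02 per 1,000 tokens for davinci-family models (pricing at the time of writing; verify against the current pricing page):

```python
def max_hourly_cost(tokens_per_minute: int, usd_per_1k_tokens: float = 0.02) -> float:
    """Worst-case hourly cost if the full token limit is consumed every minute."""
    return tokens_per_minute * 60 * usd_per_1k_tokens / 1000

print(round(max_hourly_cost(325_000)))  # 390  (text-davinci-003 row)
print(round(max_hourly_cost(750_000)))  # 900  (text-davinci-002 row)
```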
Code model example
Model | Estimated tokens/minute | Estimated requests/minute | Number of users | Evidence of need | 1-hour maximum throughput cost |
---|---|---|---|---|---|
code-davinci-002 | 150,000 | 1,000 | 15 | We are a group of researchers working on a dissertation. We estimate that higher rate limits on code-davinci-002 will be required to complete the research by the end of this month. These estimates are based on the following calculations [...] | Codex models are currently in a free beta, so we may not be able to provide immediate increases for these models. |
Note that these examples are general use-case scenarios only, and actual usage will vary based on specific implementations and use cases.