OpenAI API rate limits (official documentation)

rate limit

overview

What is rate limiting?

A rate limit is a restriction that an API imposes on the number of times a user or client can access the server within a specified period of time.

Why do we have rate limits?

Rate limits are a common practice for APIs, and they are set for several different reasons:

  • They help protect against abuse or misuse of the API. For example, a malicious actor could flood the API with requests in an attempt to overload it or cause a service outage. By setting rate limits, OpenAI can prevent this kind of activity.
  • Rate limiting helps ensure everyone has fair access to the API. If one person or organization makes too many requests, it can slow down everyone else's use of the API. By regulating the number of requests a single user can make, OpenAI ensures that the maximum number of people have access to the API without experiencing slowdowns.
  • Rate limiting helps OpenAI manage the overall load on its infrastructure. If requests to the API increase dramatically, it can stress the server and cause performance issues. By setting rate limits, OpenAI can help maintain a smooth and consistent experience for all users.

Please read this document in its entirety to better understand how OpenAI's rate limit system works. We provide code examples and solutions for handling common problems. Please follow this guide before filling out the Rate Limit Increase Request form; the last section explains how to complete that form.

What are the rate limits for our API?

We enforce rate limits at the organization level, rather than the user level, based on the specific endpoint used and the type of account you have. Rate limits are measured in two ways: RPM (requests per minute) and TPM (tokens per minute). The table below highlights the default rate limits for our API, but these can be increased based on your use case by filling out the Rate Limit Increase Request form.

Account type | Text & Embedding | Chat | Edit | Image | Audio
Free trial users | 3 RPM, 150,000 TPM | 3 RPM, 40,000 TPM | 3 RPM, 150,000 TPM | 5 images/min | 3 RPM
Pay-as-you-go users (first 48 hours) | 60 RPM, 250,000 TPM | 60 RPM, 60,000 TPM | 20 RPM, 150,000 TPM | 50 images/min | 50 RPM
Pay-as-you-go users (after 48 hours) | 3,500 RPM, 350,000 TPM | 3,500 RPM, 90,000 TPM | 20 RPM, 150,000 TPM | 50 images/min | 50 RPM

For gpt-3.5-turbo-16k, the pay-as-you-go TPM limits are 2x the chat values listed above: 120,000 TPM during the first 48 hours and 180,000 TPM after.

TPM (tokens per minute) units vary by model:

Type | 1 TPM equals
davinci | 1 token/minute
curie | 25 tokens/minute
babbage | 100 tokens/minute
ada | 200 tokens/minute

In practical terms, this means you can send roughly 200x more tokens per minute to the ada model than to the davinci model.

Note that the rate limit can be hit by either measure, whichever occurs first. For example, you might send 20 requests with only 100 tokens in total to the Codex endpoint and fill your request limit, even though you did not send 40k tokens in those 20 requests.

How does rate limiting work?

If your rate limit is 60 requests per minute and 150k davinci tokens per minute, you will be limited by whichever is reached first. For example, if your maximum is 60 requests/minute, you can send roughly 1 request per second. If you send 1 request every 800 ms, then once you hit the rate limit you only need to make your program sleep 200 ms before sending the next request; otherwise subsequent requests will fail. With the default of 3,000 requests/minute, a client can effectively send 1 request every 20 ms, or every 0.02 seconds.
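
As an illustration of that arithmetic, here is a minimal pacing sketch (not part of the original document); make_request stands in for whatever API call you are making, and the 60 requests/minute figure is just the example above.

import time

def paced_calls(make_request, prompts, requests_per_minute=60):
    """Call make_request once per prompt, spacing calls to stay under a requests-per-minute limit."""
    min_interval = 60.0 / requests_per_minute  # e.g. 1 second between requests at 60 RPM
    last_call = 0.0
    results = []
    for prompt in prompts:
        wait = min_interval - (time.monotonic() - last_call)
        if wait > 0:
            time.sleep(wait)  # sleep just long enough to respect the interval
        last_call = time.monotonic()
        results.append(make_request(prompt))
    return results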

What if I get a rate limiting error?

The rate limit error looks like this: Rate limit reached for default-text-davinci-002 in organization org-{id} on requests per min. Limit: 20.000000 / min. Current: 24.000000 / min. This means you have made too many requests in a short period of time, and the API will refuse to fulfill further requests until the specified amount of time has passed.
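
For illustration, here is a minimal sketch (not from the original document) of catching this error with the v0.x openai Python library, which raises openai.error.RateLimitError; the model and prompt are just placeholders.

import openai

try:
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt="Once upon a time,",
        max_tokens=20,
    )
except openai.error.RateLimitError as e:
    # The exception message contains the limit and your current rate, as shown above
    print(f"Hit the rate limit: {e}")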

Rate limits and max_tokens

Each model we offer has a limited number of tokens that can be passed in as input when making a request. This maximum cannot be increased. For example, if you use text-ada-001, you can send at most 2,048 tokens per request to this model.
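
As a rough, purely illustrative sketch (not from the original document), you could cap max_tokens so that the prompt plus the completion stays within a model's context window; the 4-characters-per-token estimate is only a heuristic, and the 2,048-token window is the text-ada-001 figure from the paragraph above.

def clamp_max_tokens(prompt: str, context_window: int = 2048, desired_completion: int = 256) -> int:
    """Roughly cap the completion size so prompt + completion fits within the context window."""
    estimated_prompt_tokens = len(prompt) // 4  # crude heuristic: ~4 characters per English token
    available = context_window - estimated_prompt_tokens
    return max(1, min(desired_completion, available))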

error handling

What steps can I take to mitigate such problems?

The OpenAI Cookbook has a Python notebook detailing how to avoid rate limit errors.

Use caution when offering programmatic access, batch processing capabilities, and automated social media posting - consider enabling these features only for trusted customers.

To prevent automation and high-volume abuse, set an upper bound on individual user usage within a specified time frame (day, week, or month). For users outside of this upper bound, consider implementing hard rules or a manual review process.
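
A minimal sketch of such a per-user cap (purely illustrative; the 24-hour window and the 500-request cap are assumptions, not OpenAI recommendations) might look like this:

import time
from collections import defaultdict

WINDOW_SECONDS = 24 * 60 * 60      # assumed rolling window: one day
MAX_REQUESTS_PER_WINDOW = 500      # assumed per-user cap; tune to your application

_user_requests = defaultdict(list)  # user_id -> timestamps of recent requests

def allow_request(user_id: str) -> bool:
    """Return True if the user is under their cap, recording the request if allowed."""
    now = time.time()
    recent = [t for t in _user_requests[user_id] if now - t < WINDOW_SECONDS]
    if len(recent) >= MAX_REQUESTS_PER_WINDOW:
        _user_requests[user_id] = recent
        return False  # over the cap: block or route to manual review
    recent.append(now)
    _user_requests[user_id] = recent
    return True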

Retry with exponential backoff

A simple way to avoid frequent rate limit errors is to wait a randomized amount of time and retry the call. Concretely: when the API returns a "429 Too Many Requests" status code, pause execution, calculate a waiting time t based on the current number of retries, and then call the API again; if a 429 is returned again, increase t, sleep for t seconds, and repeat until the request succeeds or a retry limit is reached.

Exponential backoff means performing a short sleep when the first rate limit error is hit and then retrying the unsuccessful request. If the request is still unsuccessful, the sleep length is increased and the process repeated. This continues until the request succeeds or a maximum number of retries is reached. This approach has several advantages:

  • Automatic retries mean you can recover from rate limit errors without crashing or losing data.
  • Exponential backoff means the first retries can be attempted quickly, while still benefiting from longer delays if the first few retries fail.
  • Adding random jitter to the delay helps prevent all retries from hitting at the same time.

Note that unsuccessful requests count toward your per-minute limit, so continuously resending requests won't work. Here are some example solutions in Python using exponential backoff.

Example 1: Using the Tenacity library

Tenacity is an Apache 2.0-licensed general-purpose retry library, written in Python, designed to simplify the task of adding retry behavior to almost anything. To add exponential backoff to your requests, use the tenacity.retry decorator. The following example adds random exponential backoff to requests using the tenacity.wait_random_exponential function.

import openai
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
)  # for exponential backoff

@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def completion_with_backoff(**kwargs):
    return openai.Completion.create(**kwargs)

completion_with_backoff(model="text-davinci-003", prompt="Once upon a time,")
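
If you want to retry only on rate limit errors rather than on every exception, tenacity also provides a retry_if_exception_type predicate; the variant below is a sketch along those lines and is not taken from the original document.

from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_random_exponential

@retry(
    retry=retry_if_exception_type(openai.error.RateLimitError),  # only retry 429 rate limit errors
    wait=wait_random_exponential(min=1, max=60),
    stop=stop_after_attempt(6),
)
def completion_with_backoff_on_429(**kwargs):
    return openai.Completion.create(**kwargs)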

Note that the Tenacity library is a third-party tool, and OpenAI makes no guarantees about its reliability or security.

Example 2: Using the backoff library

Another Python library that provides backoff and retry function modifiers is backoff:

import backoff
import openai
@backoff.on_exception(backoff.expo, openai.error.RateLimitError)
def completions_with_backoff(**kwargs):
    return openai.Completion.create(**kwargs)

completions_with_backoff(model="text-davinci-003", prompt="Once upon a time,")
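
If you want to bound the retries, backoff.on_exception also accepts max_tries and max_time keyword arguments; the values below are only examples.

@backoff.on_exception(backoff.expo, openai.error.RateLimitError, max_tries=6, max_time=60)
def completions_with_bounded_backoff(**kwargs):
    return openai.Completion.create(**kwargs)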

Example 3: Implement exponential backoff

If you don't want to use a third-party library, you can implement your own backoff logic following this example:

# imports
import random
import time

import openai

# define a retry decorator
def retry_with_exponential_backoff(
    func,
    initial_delay: float = 1,
    exponential_base: float = 2,
    jitter: bool = True,
    max_retries: int = 10,
    errors: tuple = (openai.error.RateLimitError,),
):
    """Retry a function with exponential backoff."""

    def wrapper(*args, **kwargs):
        # Initialize variables
        num_retries = 0
        delay = initial_delay

        # Loop until a successful response or max_retries is hit or an exception is raised
        while True:
            try:
                return func(*args, **kwargs)

            # Retry on specific errors
            except errors as e:
                # Increment retries
                num_retries += 1

                # Check if max retries has been reached
                if num_retries > max_retries:
                    raise Exception(
                        f"Maximum number of retries ({max_retries}) exceeded."
                    )

                # Increment the delay
                delay *= exponential_base * (1 + jitter * random.random())

                # Sleep for the delay
                time.sleep(delay)

            # Raise exceptions for any errors not specified
            except Exception as e:
                raise e

    return wrapper

@retry_with_exponential_backoff
def completions_with_backoff(**kwargs):
    return openai.Completion.create(**kwargs)

Again, OpenAI makes no guarantees about the security or efficiency of this solution, but it can be a good starting point for your own.

Batching requests

The OpenAI API has separate limits for requests and tokens per minute.

If you hit the request-per-minute limit, but have capacity available in terms of tokens per minute, you can batch multiple tasks into each request to increase throughput. This will allow you to handle more tokens, especially for our smaller models.

Sending a batch of prompts works exactly the same as a normal API call, except that you pass a list of strings to the prompt parameter instead of a single string.

Example without batching:

import openai

num_stories = 10
prompt = "Once upon a time,"

# serial example, with one story completion per request
for _ in range(num_stories):
    response = openai.Completion.create(
        model="curie",
        prompt=prompt,
        max_tokens=20,
    )
    # print story
    print(prompt + response.choices[0].text)

Example with batching:

import openai  # for making OpenAI API requests


num_stories = 10
prompts = ["Once upon a time,"] * num_stories

# batched example, with 10 story completions per request
response = openai.Completion.create(
    model="curie",
    prompt=prompts,
    max_tokens=20,
)

# match completions to prompts by index
stories = [""] * len(prompts)
for choice in response.choices:
    stories[choice.index] = prompts[choice.index] + choice.text

# print stories
for story in stories:
    print(story)

Warning: the response object may not return completions in the order of the prompts, so always use the index field to match responses to prompts.

Requesting a rate limit increase

When should I consider applying for a rate limit increase?

Our default rate limits help maximize stability and prevent abuse of our API. We increase limits to enable high-traffic applications, so the best time to request a rate limit increase is when you have the traffic data needed to support a strong case for the increase. Requests for substantial rate limit increases without supporting data are unlikely to be granted. If you're preparing for a product launch, gather that data through a phased rollout over 10 days.

Keep in mind that rate limit increases can sometimes take 7-10 days, so it is wise to plan ahead and submit early if there is data to support that you will hit your rate limit at your current growth rate.

Will my rate limit increase request be denied?

A common reason for rejection is a lack of the data needed to justify the increase. The numerical examples below show how best to support a rate limit increase request; we make a best effort to grant all requests that comply with our safety policies and include supporting data. We're committed to enabling developers to scale and succeed with our APIs.

I've implemented exponential backoff for my text/code API calls, but I'm still getting errors. How do I increase my rate limit?

Currently, we do not support increasing limits on our free beta endpoints, such as the edits endpoint. We also do not increase ChatGPT rate limits, but you can join the waitlist for ChatGPT Pro.

We know how frustrating it can be to be constrained by rate limits, and we would love to raise the defaults for everyone. However, due to shared capacity constraints, we can only approve rate limit increases for paying customers who have demonstrated need through the Rate Limit Increase Request form. To help us assess what you really need, please provide statistics about your current usage, or projections based on historical user activity, in the "Share evidence of need" section of the form. If this information is not available, we recommend a phased release approach: first release the service to a small group of users at the current rate limits, collect usage data over 10 business days, and then submit a formal rate limit increase request based on that data for review and approval.

If your request is approved, we will notify you within 7-10 business days. Here are some examples of how to fill out the form:

DALL-E API example

Model | Estimated tokens/minute | Estimated requests/minute | Number of users | Evidence of need | 1-hour max throughput cost
DALL-E API | N/A | 50 | 1,000 | Our app is currently in production, and based on past traffic we make about 10 requests per minute. | $60
DALL-E API | N/A | 150 | 10,000 | Our app is becoming more and more popular in the App Store and we are starting to run into rate limits. Can we get three times the default limit of 50 images per minute? If more is needed, we will submit a new form. Thanks! | $180

Language model example

Model | Estimated tokens/minute | Estimated requests/minute | Number of users | Evidence of need | 1-hour max throughput cost
text-davinci-003 | 325,000 | 4,000 | 50 | We will be releasing to an initial set of alpha testers and need a higher limit to accommodate their initial usage. We provide a link here to our Google Drive showing analytics and API usage. | $390
text-davinci-002 | 750,000 | 10,000 | 10,000 | Our app is getting a lot of attention; we have 50,000 people on the waitlist. We want to roll out to groups of 1,000 people per day until we reach 50,000 users. Please see this link for our current token/minute traffic over the past 30 days. This is for 500 users, and based on their usage we think 750,000 tokens/minute and 10,000 requests/minute would be a good starting point. | $900

Code model example

Model | Estimated tokens/minute | Estimated requests/minute | Number of users | Evidence of need | 1-hour max throughput cost
code-davinci-002 | 150,000 | 1,000 | 15 | We are a group of researchers working on a dissertation. We estimate that we will need a higher rate limit on code-davinci-002 to complete the research by the end of this month. These estimates are based on the following calculations [...] | N/A

Note: Codex models are currently in free beta, so we may not be able to provide immediate increases for these models.

Note that these examples are general use-case scenarios only, and actual usage will vary based on specific implementations and use cases.
