What are tokens, and how are tokens counted in ChatGPT?

What are tokens?

Tokens can be thought of as fragments of words. Before the API processes a prompt, the input is broken into tokens. These tokens are not split exactly at word boundaries; a token can include a space or be only part of a word. Here are some rules of thumb for understanding token length:

1 token ~= 4 characters in English

1 token ~= ¾ word

100 tokens ~= 75 words

or

1-2 sentences ~= 30 tokens

1 paragraph ~= 100 tokens

1,500 words ~= 2048 tokens
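These rules of thumb can be turned into a rough estimator; this is only a sketch, since real token counts come from the model's actual tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters per token rule of thumb."""
    return max(1, round(len(text) / 4))


def estimate_tokens_by_words(text: str) -> int:
    """Alternative estimate using the ~3/4 word per token rule (100 tokens ~= 75 words)."""
    words = len(text.split())
    return max(1, round(words / 0.75))
```

Both functions give only ballpark figures; they will diverge from true counts for code, non-English text, or unusual punctuation.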

To get more context on how tokens stack up, consider the following example:

  • Wayne Gretzky's famous quote "You miss 100% of the shots you don't take" contains 11 tokens.

How words are split into tokens also depends on the language. For example, 'Cómo estás' ('How are you' in Spanish) contains 5 tokens (for 10 characters). A higher token-to-character ratio can make the API more expensive to use for languages other than English.

  • My name's pinyin plus a space, the word "wechat", and my WeChat ID ("liyuechun wechat liyc1215") contains 13 tokens.


  • The three-character Chinese name "Li Yuechun" contains 8 tokens

  • The three-character Chinese name "Fu Jinliang" contains 6 tokens

If you want to explore tokenization further, you can use our interactive Tokenizer tool, which lets you count tokens and see how text is divided into them. Alternatively, to tokenize programmatically, you can use Tiktoken, a fast BPE tokenizer designed for OpenAI models. You can also explore other libraries, such as Python's transformers package or Node.js's gpt-3-encoder package.

Token restrictions

Depending on the model used, requests can use up to 4,097 tokens shared between prompt and completion. If your prompt is 4,000 tokens, your completion can be at most 97 tokens.
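The arithmetic is simply a shared budget; a sketch, noting that the 4,097 figure applies only to certain models and varies by model:

```python
def max_completion_tokens(prompt_tokens: int, context_limit: int = 4097) -> int:
    """Tokens left for the completion after the prompt takes its share."""
    remaining = context_limit - prompt_tokens
    if remaining <= 0:
        raise ValueError("Prompt already exceeds the model's context limit")
    return remaining
```

For example, `max_completion_tokens(4000)` returns 97, matching the example above.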

This limit is currently a technical limitation, but there are often innovative ways to work around it, such as condensing your prompt or breaking the text into smaller parts.
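One of those workarounds, breaking text into smaller parts, can be sketched as a naive character-based splitter; real applications often split on sentence or paragraph boundaries instead:

```python
def chunk_text(text: str, max_chars: int = 4000 * 4) -> list:
    """Split text into pieces small enough to fit a token budget,
    using the ~4 characters/token rule of thumb (4,000 tokens ~= 16,000 chars)."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

Each chunk can then be sent as its own request, and the partial results combined afterwards.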

Token pricing

The API offers several model types at different price points. Each model has a range of capabilities, with davinci being the most powerful and ada the fastest. Requests are priced differently for these different models. You can find details about token pricing here.

Explore tokens

The API processes words based on the context in the corpus data. GPT-3 takes a prompt, converts the input into a sequence of tokens, processes the prompt, and converts the predicted tokens back into the words we see in the response.

What may appear to us to be the same two words may generate different tokens based on their structure in the text. Consider how the API generates a token value for the word 'red' based on context in the text:

In the first example above, the token "2266" for 'red' includes a leading space.

The token "2297" for 'Red' with leading spaces and starting with a capital letter is different from the token "2266" for 'red' starting with a lowercase letter.

When 'Red' is at the beginning of a sentence, the generated token does not contain a leading space. The token "7738" differs from the previous two examples.

Observe:

The more likely/frequent a token is, the lower the token ID assigned to it:

  • The token generated for the period in all 3 sentences is the same ("13"). This is because, contextually, the use of periods in the corpus data is quite similar.

  • Depending on the position of 'red' in the sentence, the generated token will be different:

    • Lower case in the middle of a sentence: ' red' - (token: "2266")

    • Capitalization in the middle of a sentence: 'Red' - (token: "2297")

    • Capitalization at the beginning of sentences: 'Red' - (token: "7738")

Now that we know tokens can include space characters, it is helpful to remember that prompts ending with a space character may result in lower-quality output. This is because the tokenizer normally attaches the space to the start of the following word's token, so a trailing space in the prompt forces an unusual tokenization.

Using the logit_bias parameter

A bias can be set for specific tokens via the logit_bias parameter to modify the likelihood of those tokens appearing in the completion. For example, suppose we are building an AI baking assistant that needs to be sensitive to users' egg allergies.

When we run the API with the prompt 'The ingredients for banana bread are', the response will contain 'eggs' as the second ingredient with a probability of 26.8%.

Note: To view completion probabilities in the Playground, select Full Spectrum from the Show Probabilities drop-down menu.

Since our AI baking assistant is sensitive to egg allergy issues, we can use our knowledge of tokens to set a bias in the logit_bias parameter to prevent the model from generating responses that contain any 'egg' variants.

First, use the tokenizer tool to determine which tokens we need to bias.

Tokens:

  • Singular form, with a leading space: ' egg' - "5935"

  • Plural form, with a leading space: ' eggs' - "9653"

  • Subword token generated inside 'Egg' or 'Eggs' - 'gg': "1130"

The logit_bias parameter accepts bias values from -100 to +100. Extreme values result in a ban (-100) or exclusive selection (+100) of the relevant token.
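Putting this together, the request might be built like the following sketch. The token IDs are those identified with the tokenizer tool above; the model name is an illustrative choice, and the legacy Completions endpoint accepts `logit_bias` as a map from token ID to bias value:

```python
# Ban every 'egg' variant by assigning the maximum negative bias (-100)
# to the token IDs found with the tokenizer tool.
egg_token_ids = ["5935", "9653", "1130"]   # ' egg', ' eggs', subword 'gg'
logit_bias = {token_id: -100 for token_id in egg_token_ids}

request = {
    "model": "text-davinci-003",           # illustrative model choice
    "prompt": "The ingredients for banana bread are",
    "logit_bias": logit_bias,
}
print(request["logit_bias"])
```

The same dictionary would be passed unchanged as the `logit_bias` argument of the completion call.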

Adding the logit bias to the request modifies the likelihood that 'egg' (and its variants) is included in the response to our banana bread prompt. The prompt above now generates a response that does not contain any eggs!

While we can't guarantee it will produce the best egg-free banana bread recipe, the AI Baking Assistant meets the needs of users with egg allergies.

Summary

  1. Conversing in English is the most cost-effective; other languages, including Chinese, consume more tokens.
  2. On average, about four English characters equal one token.
  3. For Chinese, one character averages about two tokens.
  4. Yesterday I used GPT to write 7 college entrance examination essays, totaling 10,397 words and 21,008 tokens.

Calculation: with the GPT-3.5 API, the total input and output came to 21,008 tokens. GPT-3.5 is priced at $0.002 per 1,000 tokens, so at an exchange rate of about 7 RMB per dollar, the 7 essays cost 21008/1000 * (0.002 * 7) = 0.294112 yuan. GPT-4 is roughly 60 times the price of GPT-3.5; with GPT-4, this session would have cost about 0.294112 * 60 = 17.64672 yuan.
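That arithmetic can be checked with a few lines; the price and the 7 RMB/USD exchange rate are the figures used in this article, not current ones:

```python
def chat_cost_rmb(total_tokens: int,
                  usd_per_1k_tokens: float = 0.002,
                  rmb_per_usd: float = 7.0) -> float:
    """Cost of a conversation in RMB at a flat per-1,000-token price."""
    return total_tokens / 1000 * usd_per_1k_tokens * rmb_per_usd

gpt35 = chat_cost_rmb(21008)   # the 7 essays on GPT-3.5
gpt4 = gpt35 * 60              # the article's rough 60x multiplier
print(gpt35, gpt4)
```

Swapping in a different price or exchange rate only requires changing the default arguments.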

Final summary: at current prices, GPT-3.5 is still very affordable. As computing power becomes more plentiful, I believe GPT-4 will not be expensive either.

Original link: https://blog.yredu.xyz/archives/5119

Origin: blog.csdn.net/liyuechun520/article/details/131127925