How to achieve the typewriter effect of ChatGPT


The typewriter effect

When building a ChatGPT agent recently, I found that although my setup could connect to the API and return data, the result was displayed all at once, unlike the official ChatGPT demo, which achieves a typewriter effect where the words come out one at a time.

So I studied how ChatGPT's typewriter effect works, so that I can borrow from it if I want to achieve a similar effect in the future.

A pure front-end typewriter effect

When I first searched for the typewriter effect, most of the results were pure front-end solutions. The principle is simple: JavaScript writes the content to the screen on a timer. Here is ChatGPT's own answer:

The typewriter effect can be achieved by the front end through the following steps:

  1. Embed the text content in an HTML element such as a div or span.
  2. Hide the element with CSS (for example, display: none;).
  3. Use JavaScript to get the element and display its characters one by one.
  4. Use a timer (such as setInterval()) to control the interval between characters, so they appear one at a time.
  5. When all characters have been displayed, stop the timer to avoid unnecessary overhead.

Here is a simple sample code:

HTML:

<div id="typewriter">Hello World!</div>

CSS:

#typewriter {
  display: none;
}

JS:

const element = document.getElementById('typewriter');

// Save the full text, then clear the element and make it visible
const fullText = element.textContent;
element.textContent = '';
element.style.display = 'inline';

let i = 0;
const interval = setInterval(() => {
  // Reveal one more character, with a trailing underscore as a cursor
  element.textContent = fullText.slice(0, ++i) + '_';

  if (i >= fullText.length) {
    clearInterval(interval);
    // Remove the cursor once the full text is shown
    element.textContent = fullText;
  }
}, 100);

This code displays the text in the element with id typewriter one character at a time, with 100 milliseconds between characters. Once the full text has been displayed, the trailing underscore cursor is removed.

Streaming output

Later, after capturing packets and reading the official OpenAI documentation, I found that things were not that simple. ChatGPT's typewriter effect is not a pure front-end animation applied after the backend returns everything at once; instead, the backend continuously streams content to the frontend.

In the OpenAI API documentation, the chat completions interface has a stream parameter; setting it to true enables streaming output.
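A minimal request body with streaming enabled might look like the following (a sketch based on the OpenAI chat completions API; the model and message values are only illustrative):

{
  "model": "gpt-3.5-turbo",
  "stream": true,
  "messages": [
    { "role": "user", "content": "Hello" }
  ]
}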

The response is delivered using a protocol specification called "event_stream_format", which is essentially the Server-Sent Events (text/event-stream) format.

event_stream_format (ESF for short) is a protocol specification based on HTTP/1.1 for implementing server push events. It defines a data format that can send events to clients as a text stream. The design goal of ESF is to provide a simple and effective way of real-time communication, and to support many platforms and programming languages.

ESF data consists of multiple lines of text separated by \n (LF). Each event can consist of the following three parts:

  • Event type (event)
  • Data (data)
  • Identifier (id)

For example:

event: message 
data: Hello, world! 
id: 123

This example represents an event named message, carrying the message content Hello, world!, with an optional identifier of 123.

ESF also supports the following two special line types:

  • Comment (comment): a line beginning with a colon is treated purely as a comment.
  • Retry (retry): specifies the interval, in milliseconds, that the client should wait before reconnecting.

For example:

: This is a comment

retry: 10000

event: update
data: {"status": "OK"}

The ESF protocol also supports the Last-Event-ID header, which allows clients to reconnect after a disconnection and resume where they left off. When the client connects, it can pass the latest event ID to the server through this header, so that the server can continue to send events according to the ID.
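For illustration, a reconnecting client could carry that header in its request like this (a hypothetical request; the /events path is made up for the example):

GET /events HTTP/1.1
Accept: text/event-stream
Last-Event-ID: 123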

ESF is a simple, lightweight protocol suitable for scenarios that require real-time data exchange and multi-party communication. Because it uses the standard HTTP/1.1 protocol, it can be easily implemented on the existing Web infrastructure.

Packet capture shows that the response looks like this:
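A representative chunk, reconstructed here from the ChatCompletionChunk structure used later in this article (the id, timestamp, and content values are illustrative, not the actual captured values):

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1680000000,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":"Hello"},"index":0,"finish_reason":null}]}

data: [DONE]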

You can see that each line is data: followed by a JSON object, and each piece of streamed content is carried in the delta field.

There are several important headers in the HTTP response:
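Reconstructed here for illustration (rather than copied verbatim from the capture), the key response headers look roughly like this:

Content-Type: text/event-stream
Transfer-Encoding: chunked
Connection: keep-alive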

Among them, keep-alive keeps the connection between the client and the server open for reuse, which everyone should be familiar with. The other two headers are explained below.

In fact, what OpenAI returns here is text/event-stream, a MIME type used to push real-time events in web applications. Its content is plain text, and each event consists of one or more fields separated by newlines (\n). This MIME type is typically used for one-way communication from the server to the client, for example when the server pushes news updates or stock quotes to the client.

In the packet captured from the open source project chatgpt-web that I use, the request is wrapped in a Node.js layer, which returns application/octet-stream instead (I am not sure of the motivation for this). application/octet-stream is a MIME type usually used to indicate that a resource is binary content, that is, an unknown binary data stream. Clients typically do not apply any special processing to this type and can download or save it as needed.

Transfer-Encoding: chunked is an HTTP transfer encoding that indicates the message body is sent as a series of chunks. Each chunk consists of a length field in hexadecimal digits, followed by CRLF (carriage return and line feed), then the actual data content, and ends with another CRLF; a final chunk of length zero marks the end of the body.

Using chunked encoding allows the server to be more flexible when sending data of unknown size, while also avoiding some restrictions that limit the size of the entire response body. When the receiver has received all the chunks, it combines them, decompresses them (if necessary), and forms the original response body.

In short, Transfer-Encoding: chunked allows the server to dynamically generate the message body when sending an HTTP response without having to determine its size in advance, thereby improving communication efficiency and flexibility.
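As a minimal illustration (not an actual OpenAI response), a chunked body carrying the text Hello, world is encoded on the wire like this, where \r\n marks the CRLF at the end of each line:

5\r\n
Hello\r\n
7\r\n
, world\r\n
0\r\n
\r\n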

Implementation of the server

Acting as a ChatGPT proxy

If you write a Golang HTTP service as a proxy for ChatGPT, you only need to loop over each line of the result returned by the API and forward each line to the front end as an event. The core code is as follows:

// Set the Content-Type header to text/event-stream
w.Header().Set("Content-Type", "text/event-stream")  
// Set cache-control headers to disable caching
w.Header().Set("Cache-Control", "no-cache")  
w.Header().Set("Connection", "keep-alive")  
w.Header().Set("Keep-Alive", "timeout=5")  
// Read the response body line by line and send each line to the client as an event
scanner := bufio.NewScanner(resp.Body)  
for scanner.Scan() {  
   eventData := scanner.Text()  
   if eventData == "" {  
      continue  
   }  
   fmt.Fprintf(w, "%s\n\n", eventData)  
   flusher, ok := w.(http.Flusher)  
   if ok {  
      flusher.Flush()  
   } else {  
      log.Println("Flushing not supported")  
   }  
}

Acting as the server ourselves

Here we imitate OpenAI's data structure, act as the server ourselves, and return streaming output:

// Assumed imports for this example; uuid refers to github.com/google/uuid
import (
   "encoding/json"
   "fmt"
   "log"
   "net/http"
   "time"

   "github.com/google/uuid"
)

const Text = `  
proxy_cache:通过这个模块,Nginx 可以缓存代理服务器从后端服务器请求到的响应数据。当下一个客户端请求相同的资源时,Nginx 可以直接从缓存中返回响应,而不必去请求后端服务器。这大大降低了代理服务器的负载,同时也能提高客户端访问速度。需要注意的是,使用 proxy_cache 模块时需要谨慎配置缓存策略,避免出现缓存不一致或者过期的情况。  
  
proxy_buffering:通过这个模块,Nginx 可以将后端服务器响应数据缓冲起来,并在完整的响应数据到达之后再将其发送给客户端。这种方式可以减少代理服务器和客户端之间的网络连接数,提高并发处理能力,同时也可以防止后端服务器过早关闭连接,导致客户端无法接收到完整的响应数据。  
  
综上所述, proxy_cache 和 proxy_buffering 都可以通过缓存技术提高代理服务器性能和安全性,但需要注意合理的配置和使用,以避免潜在的缓存不一致或者过期等问题。同时, proxy_buffering 还可以通过缓冲响应数据来提高代理服务器的并发处理能力,从而更好地服务于客户端。  
`  
  
type ChatCompletionChunk struct {  
   ID      string `json:"id"`  
   Object  string `json:"object"`  
   Created int64  `json:"created"`  
   Model   string `json:"model"`  
   Choices []struct {  
      Delta struct {  
         Content string `json:"content"`  
      } `json:"delta"`  
      Index        int     `json:"index"`  
      FinishReason *string `json:"finish_reason"`  
   } `json:"choices"`  
}  
  
func handleSelfRequest(w http.ResponseWriter, r *http.Request) {  
   // Set the Content-Type header to text/event-stream
   w.Header().Set("Content-Type", "text/event-stream")  
   // Set cache-control headers to disable caching
   w.Header().Set("Cache-Control", "no-cache")  
   w.Header().Set("Connection", "keep-alive")  
   w.Header().Set("Keep-Alive", "timeout=5")  
   w.Header().Set("Transfer-Encoding", "chunked")  
   // Generate a uuid for this completion
   uid := uuid.NewString()  
   created := time.Now().Unix()  
  
   for i, v := range Text {  
      eventData := fmt.Sprintf("%c", v)  
      if eventData == "" {  
         continue  
      }  
      var finishReason *string  
      if i == len(Text)-1 {  
         temp := "stop"  
         finishReason = &temp  
      }  
      chunk := ChatCompletionChunk{  
         ID:      uid,  
         Object:  "chat.completion.chunk",  
         Created: created,  
         Model:   "gpt-3.5-turbo-0301",  
         Choices: []struct {  
            Delta struct {  
               Content string `json:"content"`  
            } `json:"delta"`  
            Index        int     `json:"index"`  
            FinishReason *string `json:"finish_reason"`  
         }{  
            {
               Delta: struct {
                  Content string `json:"content"`  
               }{  
                  Content: eventData,  
               },  
               Index:        0,  
               FinishReason: finishReason,  
            },  
         },  
      }  
  
      fmt.Println("输出:" + eventData)  
      marshal, err := json.Marshal(chunk)  
      if err != nil {  
         return  
      }  
  
      fmt.Fprintf(w, "data: %v\n\n", string(marshal))  
      flusher, ok := w.(http.Flusher)  
      if ok {  
         flusher.Flush()  
      } else {  
         log.Println("Flushing not supported")  
      }  
      if i == len(Text)-1 {  
         fmt.Fprintf(w, "data: [DONE]")  
         flusher, ok := w.(http.Flusher)  
         if ok {  
            flusher.Flush()  
         } else {  
            log.Println("Flushing not supported")  
         }  
      }
      time.Sleep(100 * time.Millisecond)
   }  
}

The core is to write one event at a time in the form data: xx\n\n, and to finish with data: [DONE].

Implementation of the front end

The front-end code refers to the implementation of https://github.com/Chanzhaoyu/chatgpt-web.

The core here is axios's onDownloadProgress hook: whenever the stream produces new output, take the chunk content and update the front-end display.

await fetchChatAPIProcess<Chat.ConversationResponse>({  
  prompt: message,  
  options,  
  signal: controller.signal,  
  onDownloadProgress: ({ event }) => {  
    const xhr = event.target  
    const { responseText } = xhr  
    // Always process the final line  
    const lastIndex = responseText.lastIndexOf('\n')  
    let chunk = responseText  
    if (lastIndex !== -1)  
      chunk = responseText.substring(lastIndex)  
    try {  
      const data = JSON.parse(chunk)  
      updateChat(  
        +uuid,  
        dataSources.value.length - 1,  
        {  
          dateTime: new Date().toLocaleString(),  
          text: lastText + data.text ?? '',  
          inversion: false,  
          error: false,  
          loading: false,  
          conversationOptions: { conversationId: data.conversationId, parentMessageId: data.id },  
          requestOptions: { prompt: message, options: { ...options } },  
        },  
      )  
  
      if (openLongReply && data.detail.choices[0].finish_reason === 'length') {  
        options.parentMessageId = data.id  
        lastText = data.text  
        message = ''  
        return fetchChatAPIOnce()  
      }  
  
      scrollToBottom()  
    }  
    catch (error) {  
    //  
    }  
  },  
})

In the underlying request code, the corresponding headers and parameters are set, the streamed data is monitored, and the onProgress callback is invoked.

const responseP = new Promise((resolve, reject) => {  
  const url = this._apiReverseProxyUrl;  
  const headers = {  
    ...this._headers,  
    Authorization: `Bearer ${this._accessToken}`,  
    Accept: "text/event-stream",  
    "Content-Type": "application/json"  
  };  
  if (this._debug) {  
    console.log("POST", url, { body, headers });  
  }  
  fetchSSE(  
    url,  
    {  
      method: "POST",  
      headers,  
      body: JSON.stringify(body),  
      signal: abortSignal,  
      onMessage: (data) => {  
        var _a, _b, _c;  
        if (data === "[DONE]") {  
          return resolve(result);  
        }  
        try {  
          const convoResponseEvent = JSON.parse(data);  
          if (convoResponseEvent.conversation_id) {  
            result.conversationId = convoResponseEvent.conversation_id;  
          }  
          if ((_a = convoResponseEvent.message) == null ? void 0 : _a.id) {  
            result.id = convoResponseEvent.message.id;  
          }  
          const message = convoResponseEvent.message;  
          if (message) {  
            let text2 = (_c = (_b = message == null ? void 0 : message.content) == null ? void 0 : _b.parts) == null ? void 0 : _c[0];  
            if (text2) {  
              result.text = text2;  
              if (onProgress) {  
                onProgress(result);  
              }  
            }  
          }  
        } catch (err) {  
        }  
      }  
    },  
    this._fetch  
  ).catch((err) => {  
    const errMessageL = err.toString().toLowerCase();  
    if (result.text && (errMessageL === "error: typeerror: terminated" || errMessageL === "typeerror: terminated")) {  
      return resolve(result);  
    } else {  
      return reject(err);  
    }  
  });  
});

Nginx configuration

During the setup I also ran into another pitfall. Because there is a layer of nginx proxy in the middle, and nginx buffers proxied responses by default, the streaming output was buffered by nginx and the front end ended up receiving the data all at once. gzip may also have an effect.

Here you can configure nginx to turn off gzip, caching, and response buffering.

gzip off;

location / {
    proxy_set_header   Host             $host;
    proxy_set_header   X-Real-IP        $remote_addr;
    proxy_set_header   X-Forwarded-For  $proxy_add_x_forwarded_for;
    proxy_cache off;
    proxy_cache_bypass $http_pragma;
    proxy_cache_revalidate on;
    proxy_http_version 1.1;
    proxy_buffering off;
    proxy_pass http://xxx.com:1234;
}

proxy_cache and proxy_buffering are two important proxy modules of Nginx. They can significantly improve proxy server performance and security.

  • proxy_cache: Through this module, Nginx can cache the response data requested by the proxy server from the backend server. When the next client requests the same resource, Nginx can return the response directly from the cache without having to request the backend server. This greatly reduces the load on the proxy server and also improves the client access speed. It should be noted that when using the proxy_cache module, it is necessary to configure the cache policy carefully to avoid cache inconsistency or expiration.
  • proxy_buffering: Through this module, Nginx can buffer the response data of the backend server and send it to the client after the complete response data arrives. This method can reduce the number of network connections between the proxy server and the client, improve concurrent processing capabilities, and also prevent the backend server from closing the connection prematurely, causing the client to fail to receive complete response data.

In my tests, configuring proxy_cache off alone makes no difference; streaming output only takes effect once proxy_buffering is turned off.


Origin blog.csdn.net/weixin_42232156/article/details/129932930