ChatGPT and Elasticsearch: APM instrumentation, performance, and cost analysis

Author: Luca Wintergerst

In a previous blog post , we built a small Python application that queries Elasticsearch using a combination of vector search and BM25 to help find the most relevant results in a proprietary dataset. The top results are then passed to OpenAI, which answers the question for us.

In this blog, we will instrument that Python application, analyze its performance, and look at the cost of running it. Using the data collected from the application, we will also show how you can improve the way you integrate a large language model (LLM) into your application. As an added bonus, we'll try to answer the question: why does ChatGPT print its output word by word?

Instrumenting applications using Elastic APM

If you've had a chance to try our sample application, you may have noticed that the results from the search interface don't load as quickly as you'd expect.

The question is whether this latency comes from our two-step approach of running the Elasticsearch query first, from OpenAI, or from a combination of the two.

Using Elastic APM, we can easily instrument the application for a closer look. All we need for basic instrumentation is the following (we'll show the full example at the end of the blog post and in the GitHub repository):

import elasticapm
# the APM Agent is initialized
apmClient = elasticapm.Client(service_name="elasticdocs-gpt-v2-streaming")

# the default instrumentation is applied
# this will instrument the most common libraries, as well as outgoing http requests
elasticapm.instrument()

Since our sample application uses Streamlit, we also need to start at least one transaction and eventually end it again. Additionally, we can provide information about the transaction results to APM so that we can properly track failures.

# start the APM transaction
apmClient.begin_transaction("user-query")

(...)



elasticapm.set_transaction_outcome("success")

# or "failure" for unsuccessful transactions
# elasticapm.set_transaction_outcome("failure")

# end the APM transaction
apmClient.end_transaction("user-query")

That's it: this is already enough to fully instrument our application with APM. That said, we'll do a little extra work here to capture some more interesting data.

As a first step, we add the user's query as a label on the APM transaction. This way we can see what users are searching for, analyze popular queries, and reproduce errors.

elasticapm.label(query=query)

In our asynchronous method that talks to OpenAI, we also add some more instrumentation so that we can better visualize the tokens we receive and collect additional statistics.

async with elasticapm.async_capture_span('openaiChatCompletion', span_type='openai'):
    async for chunk in await openai.ChatCompletion.acreate(engine=engine, messages=[{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": truncated_prompt}],stream=True,):
        content = chunk["choices"][0].get("delta", {}).get("content")
        # since we have the stream=True option, we can get the output as it comes in
        # one iteration is one token
        # we start a new span here for each token. These spans will be aggregated
        # into a compressed span automatically
        with elasticapm.capture_span("token", leaf=True, span_type="http"):
            if content is not None:
                # concatenate the output to the previous one, so we have the full response at the end
                output += content
                # with every token we get, we update the element
                element.markdown(output)

Finally, toward the end of the request, we also add the token count and the approximate cost to the APM transaction. This will allow us to later visualize these metrics and correlate them with application performance.

If you are not using streaming, the OpenAI response contains a total_tokens field, which is the sum of the tokens in the context you sent and in the response you received. If you use the stream=True option, it is your responsibility to count the tokens, or at least approximate them. A common suggestion is to use (len(prompt) + len(response)) / 4 for English text, but code snippets in particular can deviate from this approximation. If you need a more accurate number, you can use a library such as tiktoken to count the tokens, as sketched below.
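As a rough sketch (not code from the sample application), both approaches could look like this; engine, truncated_prompt, and output are the variables used elsewhere in this post:

import openai
import tiktoken

# non-streaming: the API response already reports the combined token usage
response = openai.ChatCompletion.create(
    engine=engine,  # engine / truncated_prompt / output come from the sample app
    messages=[{"role": "user", "content": truncated_prompt}],
)
total_tokens = response["usage"]["total_tokens"]

# streaming: count the tokens yourself, e.g. with tiktoken
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
total_tokens = len(encoding.encode(truncated_prompt)) + len(encoding.encode(output))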

# add the number of tokens as a metadata label
elasticapm.label(openai_tokens = st.session_state['openai_current_tokens'])
# add the approximate cost as a metadata label
# currently the cost is $0.002 / 1000 tokens
elasticapm.label(openai_cost = st.session_state['openai_current_tokens'] / 1000 * 0.002)

Checking APM Data - Which is Slower, Elasticsearch or OpenAI?

After instrumenting the application, a quick look at the "Dependencies" view gives us a better idea of what's going on. It looks like our requests to Elasticsearch return within 125 milliseconds on average, while OpenAI takes 8,500 milliseconds to complete a request. (This screenshot was taken on a version of the application that does not use streaming. If you use streaming, the default instrumentation only measures the initial POST request in the dependency response time, not the time required to stream the full response.)

If you've used ChatGPT yourself, you might wonder why the UI prints each word individually instead of returning the full response right away.

Surprisingly, this is not a trick to entice you to pay for the premium version; it is a limitation of how the model runs inference. In short, to compute the next token, the model also needs to take the previously generated tokens into account, so there is not much room for parallelization. Since each token is produced individually, it can be sent to the client while the computation for the next token is still running. The toy sketch below illustrates the idea.
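Here is a toy, self-contained sketch of an autoregressive decoding loop; it is not OpenAI's actual implementation, and predict_next_token is only a stand-in for the expensive model forward pass. Every step depends on the tokens produced so far, but each finished token can be pushed to the client immediately.

import random

def predict_next_token(previous_tokens):
    # stand-in for the model: in reality this is the expensive forward pass
    vocabulary = ["hello", "world", "!", "<eos>"]
    random.seed(len(previous_tokens))
    return random.choice(vocabulary)

def generate(prompt_tokens, max_new_tokens=20):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next_token(tokens)  # needs all previously generated tokens
        if next_token == "<eos>":
            break
        tokens.append(next_token)
        yield next_token                         # can be streamed while the next step runs

for token in generate(["why", "is", "the", "sky", "blue"]):
    print(token, end=" ", flush=True)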

To improve the user experience, it can be helpful to use streaming methods when using the ChatCompletion feature. This way the user can start using the first results while the full response is being generated. You can see this behavior in the GIF below. Even though all three responses are still loading, the user can scroll down and inspect what's already there.

As mentioned before, we added more custom instrumentation than the bare minimum. This gives us detailed information about where our time is spent. Let's take a look at a full trace to see what such a stream actually looks like.

Our application is configured to get the top three hits from Elasticsearch and then run one ChatCompletion request against OpenAI for each hit, in parallel (a sketch of how this can be wired up follows below).
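A minimal sketch of that fan-out, assuming a hypothetical build_prompt() helper and the achat_gpt() coroutine shown later in this post; the real application wires this into Streamlit elements in a similar way:

import asyncio

async def answer_top_hits(hits, elements):
    # one achat_gpt() coroutine per Elasticsearch hit, all started concurrently
    tasks = [
        achat_gpt(build_prompt(hit), hit, index, element)  # build_prompt is hypothetical
        for index, (hit, element) in enumerate(zip(hits, elements))
    ]
    # gather() runs the OpenAI requests in parallel and waits for all of them to finish
    await asyncio.gather(*tasks)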

As we can see in the screenshot, it takes about 15 seconds to load a single result. We can also see that OpenAI requests which return larger responses take longer to complete. But this is just one request. Does this behavior hold for all requests? Is there a clear correlation between response time and the number of tokens that supports our earlier claim?

Analyzing costs and response times

In addition to the prebuilt Elastic APM views, we can also use custom dashboards and create visualizations based on the APM data. Two interesting charts show the relationship between the number of tokens in the response and the duration of the request.

We can see that the more tokens are returned (x-axis in the first chart), the longer the duration (y-axis in the first chart). In the second chart, we can also see that regardless of the total number of tokens returned (x-axis), the duration per 100 tokens returned stays almost constant at around four seconds.

If you want to improve the responsiveness of an application that uses an OpenAI model, it's a good idea to tell the model to keep its responses brief, for example as sketched below.
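One low-effort way to do that (a sketch, not taken from the sample application) is to put the length constraint into the system message; truncated_prompt is the variable used elsewhere in this post:

messages = [
    # asking for short answers reduces both the token count and the response time
    {"role": "system", "content": "You are a helpful assistant. Answer in at most three sentences."},
    {"role": "user", "content": truncated_prompt},
]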

In addition to this, we can also track our total spend and average cost per page load, among other statistics.

For our sample application, a single search costs about 1.1 cents. That doesn't sound like much, but it is probably not something you would add to a public-facing search interface as it is today. For internal company data and an occasionally used search interface, this cost is negligible.
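As a back-of-the-envelope check (the per-search token count below is inferred from the numbers above, not measured):

# at $0.002 per 1,000 tokens, about 1.1 cents per search implies roughly
# 5,500 tokens across the parallel completions (an inference, not a measurement)
tokens_per_search = 5_500
cost_per_search = tokens_per_search / 1000 * 0.002  # 0.011 USD, i.e. about 1.1 cents
print(f"${cost_per_search:.3f} per search")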

In our testing, we also encountered frequent errors when using the OpenAI API in Azure, which ultimately led us to add a retry loop with exponential backoff to the sample application. We can also catch these errors using Elastic APM.

# retry the OpenAI request up to five times, backing off between attempts
while tries < 5:
    try:
        print("request to openai for task number: " + str(index) + " attempt: " + str(tries))
        async with elasticapm.async_capture_span('openaiChatCompletion', span_type='openai'):
            async for chunk in await openai.ChatCompletion.acreate(engine=engine, messages=[{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": truncated_prompt}],stream=True,):
                content = chunk["choices"][0].get("delta", {}).get("content")
                counter += 1
                with elasticapm.capture_span("token", leaf=True, span_type="http"):
                    if content is not None:
                        output += content
                        element.markdown(output)
        # the request succeeded, so leave the retry loop
        break
    except Exception as e:
        client = elasticapm.get_client()
        # capture the exception using Elastic APM and send it to the apm server
        client.capture_exception()
        tries += 1
        # simple quadratic backoff before the next attempt
        time.sleep(tries * tries / 2)
        if tries == 5:
            element.error("Error: " + str(e))
        else:
            print("retrying...")

Any captured errors will then be visible in the waterfall chart as part of the span in which the failure occurred.

Additionally, Elastic APM provides an overview of all errors. In the screenshot below, you can see the RateLimitError and APIConnectionError we encountered occasionally. Using our crude exponential retry mechanism we can mitigate most of these problems.

Correlating latency and failed transactions

With all the metadata captured out of the box by the Elastic APM agent, plus the custom labels we added, we can easily analyze whether there are correlations between performance and metadata such as the service version, the user's query, and so on.

As shown below, there is a small correlation between the query "How can I mount an index on a frozen node?" and slower response times.

A similar analysis can be performed on transactions that returned errors. In this example, two queries for "How do I create an ingest pipeline" fail more often than the other queries, which makes them stand out in the correlation analysis.

As promised earlier, here is the full instrumentation example from the sample application. First, we initialize the APM agent and, when the user clicks the "Search" button, start a transaction and label it with the user's query:

import elasticapm
# the APM Agent is initialized
apmClient = elasticapm.Client(service_name="elasticdocs-gpt-v2-streaming")

# the default instrumentation is applied
# this will instrument the most common libraries, as well as outgoing http requests
elasticapm.instrument()

# if a user clicks the "Search" button in the UI
if submit_button:
    # start the APM transaction
    apmClient.begin_transaction("user-query")
    # add custom labels to the transaction, so we can see the user's question in the APM UI
    elasticapm.label(query=query)

The asynchronous helper that talks to OpenAI then wraps the whole completion in one span and every streamed token in a small child span:

async def achat_gpt(prompt, result, index, element, model="gpt-3.5-turbo", max_tokens=1024, max_context_tokens=4000, safety_margin=1000):
    output = ""
    # we create one overall span here to track the total process of doing the completion
    async with elasticapm.async_capture_span('openaiChatCompletion', span_type='openai'):
        async for chunk in await openai.ChatCompletion.acreate(engine=engine, messages=[{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": truncated_prompt}],stream=True,):
            content = chunk["choices"][0].get("delta", {}).get("content")
            # since we have the stream=True option, we can get the output as it comes in
            # one iteration is one token, so we create one small span for each
            with elasticapm.capture_span("token", leaf=True, span_type="http"):
                if content is not None:
                    # concatenate the output to the previous one, so we have the full response at the end
                    output += content
                    # with every token we get, we update the element
                    element.markdown(output)

In this blog, we instrumented a Python application that uses OpenAI and analyzed its performance. We looked at response latency and failed transactions, and we evaluated the cost of running the application. We hope you found this guide useful!

Learn more about the possibilities of Elasticsearch and AI.

In this blog post, we may have used third-party generative AI tools, which are owned and operated by their respective owners. Elastic has no control over third-party tools, and we assume no responsibility for their content, operation, or use or for any loss or damage that may arise from your use of such tools. Use caution when using artificial intelligence tools to handle personal, sensitive or confidential information. Any data you submit may be used for artificial intelligence training or other purposes. There is no guarantee that the information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tool before using it.

The costs mentioned in this article are based on current OpenAI API pricing and how often we call it when loading the sample application.

Elastic, Elasticsearch and related marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.

Original article: ChatGPT and Elasticsearch: APM instrumentation, performance, and cost analysis — Elastic Search Labs
