Building Systems Using Large Language Models (LLMs) (7): Evaluation 2

Today I continued DeepLearning.AI's online course Building Systems with LLMs, and I would like to share some of its main content with you. We have covered the following topics in previous posts:

Building Systems Using Large Language Models (LLMs) (1): Classification
Building Systems Using Large Language Models (LLMs) (2): Content Moderation and Preventing Prompt Injection
Building Systems Using Large Language Models (LLMs) (3): Chain-of-Thought Reasoning
Building Systems Using Large Language Models (LLMs) (4): Chaining Prompts
Building Systems Using Large Language Models (LLMs) (5): Checking Outputs
Building Systems Using Large Language Models (LLMs) (6): Building an End-to-End System
Building Systems Using Large Language Models (LLMs) (7): Evaluation 1

Here is the main code we use to access the LLM:

import os
import openai
import sys
sys.path.append('../..')
import utils
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ["OPENAI_API_KEY"]
 
def get_completion_from_messages(messages, 
                                 model="gpt-3.5-turbo", 
                                 temperature=0, 
                                 max_tokens=500):
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature, 
        max_tokens=max_tokens,
    )
    return response.choices[0].message["content"]

Answer user queries by running an end-to-end system

Here we use the utils toolkit to generate a reply to a user's question about electronic products. The functions in utils were explained in previous posts, so I won't repeat them here.

customer_msg = f"""
tell me about the smartx pro phone and the fotosnap camera, the dslr one.
Also, what TVs or TV related products do you have?"""

products_by_category = utils.get_products_from_query(customer_msg)
category_and_product_list = utils.read_string_to_list(products_by_category)
product_info = utils.get_mentioned_product_info(category_and_product_list)
assistant_answer = utils.answer_user_msg(user_msg=customer_msg,
                                         product_info=product_info)


Here we first call utils.get_products_from_query to get the list of product categories involved in the user's question (this step calls the LLM). Since the catalog list returned by the LLM is a string, we then call utils.read_string_to_list to convert it into a Python list, after which we look up the detailed information for every product in those categories (this step does not call the LLM). Finally, we send the user's question together with the detailed product information to the LLM to produce the final reply. Let's take a look at the contents of the variables above:
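For example, you can print the intermediate variables to inspect them (output omitted here):

print(products_by_category)       # category/product list returned by the LLM (a string)
print(category_and_product_list)  # the same list parsed into a Python list
print(product_info)               # detailed information for each mentioned product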

Finally, let's look at the LLM's final reply:
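Printing the variable shows the generated reply (output omitted here):

print(assistant_answer)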

Since the customer's question involves multiple products, the LLM's reply contains quite a lot of information. How can we evaluate whether this reply meets our requirements?

Evaluate the LLM's responses to users with a rubric, based on the extracted product information

Here we need to define a rubric to judge whether the LLM's final reply is acceptable. The evaluation draws on three parts: the customer's question, the detailed product information, and the LLM's final reply. In other words, we combine these three parts and ask an LLM to evaluate whether the earlier reply is acceptable. Next, we define an eval_with_rubric function to implement this:

def eval_with_rubric(test_set, assistant_answer):

    cust_msg = test_set['customer_msg']
    context = test_set['context']
    completion = assistant_answer
    
    system_message = """\
    You are an assistant that evaluates how well the customer service agent \
    answers a user question by looking at the context that the customer service \
    agent is using to generate its response. 
    """

    user_message = f"""\
You are evaluating a submitted answer to a question based on the context \
that the agent uses to answer the question.
Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {cust_msg}
    ************
    [Context]: {context}
    ************
    [Submission]: {completion}
    ************
    [END DATA]

Compare the factual content of the submitted answer with the context. \
Ignore any differences in style, grammar, or punctuation.
Answer the following questions:
    - Is the Assistant response based only on the context provided? (Y or N)
    - Does the answer include information that is not provided in the context? (Y or N)
    - Is there any disagreement between the response and the context? (Y or N)
    - Count how many questions the user asked. (output a number)
    - For each question that the user asked, is there a corresponding answer to it?
      Question 1: (Y or N)
      Question 2: (Y or N)
      ...
      Question N: (Y or N)
    - Of the number of questions asked, how many of these questions were addressed by the answer? (output a number)
"""

    messages = [
        {'role': 'system', 'content': system_message},
        {'role': 'user', 'content': user_message}
    ]

    response = get_completion_from_messages(messages)
    return response


Next we evaluate the LLM's earlier reply:

cust_prod_info = {
    'customer_msg': customer_msg,
    'context': product_info
}

evaluation_output = eval_with_rubric(cust_prod_info, assistant_answer)
print(evaluation_output)

Here we have defined an evaluation rubric: we ask the LLM to answer the six questions defined in user_message, and these six questions are our scoring criteria. From the output above, the LLM answered all six questions correctly.
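To extend this beyond a single example, here is a minimal sketch (my own addition, not from the course notebook) that loops over a list of test cases, generates an answer for each, and prints the rubric output:

# Each test case mirrors the structure of cust_prod_info above;
# add more {'customer_msg': ..., 'context': ...} entries as needed.
test_cases = [cust_prod_info]

for i, case in enumerate(test_cases):
    answer = utils.answer_user_msg(user_msg=case['customer_msg'],
                                   product_info=case['context'])
    print(f"--- Test case {i} ---")
    print(eval_with_rubric(case, answer))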

Evaluate LLM's answers to users against "ideal"/"expert" (human-generated) answers

To evaluate the quality of the LLM's answer, in addition to asking the LLM a set of simple rubric questions (which amount to simple checks and counts), we can also measure how it differs from an expert or ideal answer written by a human. Below we define an ideal/expert answer:

test_set_ideal = {
    'customer_msg': """\
tell me about the smartx pro phone and the fotosnap camera, the dslr one.
Also, what TVs or TV related products do you have?""",
    'ideal_answer':"""\
Of course!  The SmartX ProPhone is a powerful \
smartphone with advanced camera features. \
For instance, it has a 12MP dual camera. \
Other features include 5G wireless and 128GB storage. \
It also has a 6.1-inch display.  The price is $899.99.

The FotoSnap DSLR Camera is great for \
capturing stunning photos and videos. \
Some features include 1080p video, \
3-inch LCD, a 24.2MP sensor, \
and interchangeable lenses. \
The price is 599.99.

For TVs and TV related products, we offer 3 TVs \


All TVs offer HDR and Smart TV.

The CineView 4K TV has vibrant colors and smart features. \
Some of these features include a 55-inch display, \
4K resolution. It's priced at 599.

The CineView 8K TV is a stunning 8K TV. \
Some features include a 65-inch display and \
8K resolution.  It's priced at 2999.99

The CineView OLED TV lets you experience vibrant colors. \
Some features include a 55-inch display and 4K resolution. \
It's priced at 1499.99.

We also offer 2 home theater products, both which include bluetooth.\
The SoundMax Home Theater is a powerful home theater system for \
an immersive audio experience.
Its features include 5.1 channel, 1000W output, and wireless subwoofer.
It's priced at 399.99.

The SoundMax Soundbar is a sleek and powerful soundbar.
Its features include 2.1 channel, 300W output, and wireless subwoofer.
It's priced at 199.99

Are there any additional questions you may have about these products \
that you mentioned here?
Or do you have other questions I can help you with?
    """
}

Evaluate LLM answers against expert answers

Below we use a prompt adapted from the OpenAI evals project to grade the LLM's answer against the expert answer.

BLEU (bilingual evaluation understudy) score is another way to assess whether two pieces of text are similar; here, however, we use an LLM-based grader instead.
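If you also want a classic BLEU score as a rough similarity signal, here is a minimal sketch using the nltk library (my own illustration; nltk is an assumed extra dependency, not part of the course code):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Crude whitespace tokenization; a real evaluation would use a proper tokenizer.
reference = test_set_ideal['ideal_answer'].split()
candidate = assistant_answer.split()

# Smoothing avoids a zero score when higher-order n-grams do not overlap.
bleu = sentence_bleu([reference], candidate,
                     smoothing_function=SmoothingFunction().method1)
print(f"BLEU score: {bleu:.3f}")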

def eval_vs_ideal(test_set, assistant_answer):

    cust_msg = test_set['customer_msg']
    ideal = test_set['ideal_answer']
    completion = assistant_answer
    
    system_message = """\
    You are an assistant that evaluates how well the customer service agent \
    answers a user question by comparing the response to the ideal (expert) response
    Output a single letter and nothing else. 
    """

    user_message = f"""\
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {cust_msg}
    ************
    [Expert]: {ideal}
    ************
    [Submission]: {completion}
    ************
    [END DATA]

Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
    The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
    (A) The submitted answer is a subset of the expert answer and is fully consistent with it.
    (B) The submitted answer is a superset of the expert answer and is fully consistent with it.
    (C) The submitted answer contains all the same details as the expert answer.
    (D) There is a disagreement between the submitted answer and the expert answer.
    (E) The answers differ, but these differences don't matter from the perspective of factuality.
  choice_strings: ABCDE
"""

    messages = [
        {'role': 'system', 'content': system_message},
        {'role': 'user', 'content': user_message}
    ]

    response = get_completion_from_messages(messages)
    return response


Next we compare the LLM's answer with the expert (ideal) answer; the grader outputs a single letter grade (A, B, C, D, or E):

eval_vs_ideal(test_set_ideal, assistant_answer)

 

From the evaluation result, the grader scored the LLM's answer as A, which means the LLM's earlier answer is a subset of the expert answer and is fully consistent with it. Looking at the expert answer, we can see that the expert describes each product in great detail, and some of those details are not included in the system's product information, so it is expected that the LLM's answer is a subset of the expert answer. And because the LLM's reply is grounded in the system's product information and does not distort it, it is fully consistent with the corresponding part of the expert answer. Now let's try an obviously incorrect LLM response and compare it with the expert answer:

assistant_answer_2 = "life is like a box of chocolates"
eval_vs_ideal(test_set_ideal, assistant_answer_2)

Here we give the grader an LLM reply that has nothing to do with the product information: "life is like a box of chocolates". This time the evaluation returns a grade of D, meaning there is a disagreement between the submitted answer and the expert answer.
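To sanity-check the grader, you can run several candidate answers through it in one go (a small illustration using the objects defined above):

candidates = {
    'grounded answer': assistant_answer,
    'irrelevant answer': assistant_answer_2,
}

for name, answer in candidates.items():
    grade = eval_vs_ideal(test_set_ideal, answer)
    print(f"{name}: {grade}")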

Summary

Today we learned two methods of evaluating LLM responses: the first is to have an LLM grader answer a set of simple rubric questions about the response (simple checks and counts), and the second is to compare the LLM's answer against a human expert's answer and output a letter grade (A, B, C, D, or E).


Original post: blog.csdn.net/weixin_42608414/article/details/131375176