Building Systems Using Large Language Models (LLMs) (7): Evaluation (Part 1)

Today I studied DeepLearning.AI's online course Building Systems with LLMs, and I would like to share some of its main content with you. In previous posts we covered the following topics:

  1. Building Systems Using Large Language Models (LLMs) (1): Classification
  2. Building Systems Using Large Language Models (LLMs) (2): Content Moderation and Preventing Prompt Injection
  3. Building Systems Using Large Language Models (LLMs) (3): Chain-of-Thought Reasoning
  4. Building Systems Using Large Language Models (LLMs) (4): Chaining Prompts
  5. Building Systems Using Large Language Models (LLMs) (5): Checking Outputs
  6. Building Systems Using Large Language Models (LLMs) (6): Building an End-to-End System

In the previous post we used a large language model (LLM) to build an end-to-end system. Once development is complete, we need to evaluate the system: design test cases, run them, and check whether the results meet the requirements. The system under test here is the end-to-end system developed in the previous post. Below is the main code we use to call the LLM. I have added the backoff package, which retries the API call with exponential backoff when a rate-limit error occurs; the cause of that error was explained in the previous post, so I won't repeat it here:

import openai   # note: this post uses the pre-1.0 openai SDK (openai.ChatCompletion)
import backoff  # retries the API call with exponential backoff on rate-limit errors

# your OpenAI API key
openai.api_key = 'YOUR-OPENAI-API-KEY'

@backoff.on_exception(backoff.expo, openai.error.RateLimitError) 
def get_completion_from_messages(messages, 
                                 model="gpt-3.5-turbo", 
                                 temperature=0, 
                                 max_tokens=500):
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature, 
        max_tokens=max_tokens,
    )
    return response.choices[0].message["content"]
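
As a quick sanity check, the wrapper can be called like this (a minimal sketch; the exact reply will vary by model):

messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': 'Say hello in one word.'},
]
print(get_completion_from_messages(messages))  # e.g. "Hello"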

Get related products and categories

In the previous post we developed an end-to-end system whose main function is to act as a personalized customer-service bot that answers users' questions about electronic products. When a user asks a question, the system first needs to check whether the electronic products mentioned in the question appear in the system's current product list. The following code retrieves that full product list:

import utils  # the helper module that ships with the course notebook

products_and_category = utils.get_products_and_category()
products_and_category

Let's look at all the categories in the current product list:

# all categories
list(products_and_category.keys())

Here we see that the catalog contains 6 major categories, each with several specific products. The catalog is stored in a Python dictionary whose keys are the categories and whose values are the lists of specific products in each category.
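
For reference, the dictionary looks roughly like this (a partial sketch assembled from product names that appear later in this post; the real catalog returned by utils.get_products_and_category() covers all six categories):

products_and_category_sketch = {
    'Computers and Laptops': [
        'TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible',
        'TechPro Desktop', 'BlueWave Chromebook'],
    'Smartphones and Accessories': [
        'SmartX ProPhone', 'SmartX MiniPhone', 'MobiTech PowerCase',
        'MobiTech Wireless Charger', 'SmartX EarBuds'],
    # ...plus 'Televisions and Home Theater Systems',
    # 'Gaming Consoles and Accessories', 'Audio Equipment',
    # and 'Cameras and Camcorders'
}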

Find related product and category names (version 1)

Next we define a function that looks up the names of related products and categories. The function takes a customer's question and returns a catalog listing of the products and categories that are relevant to the question:

def find_category_and_product_v1(user_input,products_and_category):

    delimiter = "####"
    system_message = f"""
    You will be provided with customer service queries. \
    The customer service query will be delimited with {delimiter} characters.
    Output a python list of json objects, where each object has the following format:
        'category': <one of Computers and Laptops, Smartphones and Accessories, Televisions and Home Theater Systems, \
    Gaming Consoles and Accessories, Audio Equipment, Cameras and Camcorders>,
    AND
        'products': <a list of products that must be found in the allowed products below>


    Where the categories and products must be found in the customer service query.
    If a product is mentioned, it must be associated with the correct category in the allowed products list below.
    If no products or categories are found, output an empty list.
    

    List out all products that are relevant to the customer service query based on how closely it relates
    to the product name and product category.
    Do not assume, from the name of the product, any features or attributes such as relative quality or price.

    The allowed products are provided in JSON format.
    The keys of each item represent the category.
    The values of each item is a list of products that are within that category.
    Allowed products: {products_and_category}
    

    """
    
    few_shot_user_1 = """I want the most expensive computer."""
    few_shot_assistant_1 = """ 
    [{'category': 'Computers and Laptops', \
'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]
    """
    
    messages =  [  
    {'role':'system', 'content': system_message},    
    {'role':'user', 'content': f"{delimiter}{few_shot_user_1}{delimiter}"},  
    {'role':'assistant', 'content': few_shot_assistant_1 },
    {'role':'user', 'content': f"{delimiter}{user_input}{delimiter}"},  
    ] 
    return get_completion_from_messages(messages)


In addition to system_message, we also provide a pair of few-shot examples, few_shot_user_1 and few_shot_assistant_1. Their purpose is to show the LLM, in context, how to output query results in the specified format.

Evaluate find_category_and_product_v1

After defining find_category_and_product_v1, let's design a few user questions to evaluate how well it actually works.

customer_msg_0 = f"""Which TV can I buy if I'm on a budget?"""

products_by_category_0 = find_category_and_product_v1(customer_msg_0,
                                                      products_and_category)
print(products_by_category_0)

customer_msg_1 = f"""I need a charger for my smartphone"""

products_by_category_1 = find_category_and_product_v1(customer_msg_1,
                                                      products_and_category)
products_by_category_1

customer_msg_2 = f"""What computers do you have?"""
products_by_category_2 = find_category_and_product_v1(customer_msg_2,
                                                      products_and_category)
products_by_category_2

customer_msg_3 = f"""
tell me about the smartx pro phone and the fotosnap camera, the dslr one.
Also, what TVs do you have?"""

products_by_category_3 = find_category_and_product_v1(customer_msg_3,
                                                      products_and_category)
print(products_by_category_3)

For these four user questions, the LLM outputs the query results in the specified format. Note that the results are returned as plain strings, not as Python objects.

More complex test cases

Next, let's design a more complex test case and see whether find_category_and_product_v1 can still produce output in the required format.

customer_msg_4 = f"""
tell me about the CineView TV, the 8K one, Gamesphere console, the X one.
I'm on a budget, what computers do you have?"""

products_by_category_4 = find_category_and_product_v1(customer_msg_4,
                                                      products_and_category)
print(products_by_category_4)

 

From the output above, we can see that when the customer's question gets a little more complicated, the LLM's output contains not only the information in the specified format but also some extra explanatory text. We don't want that text: our system needs to convert the output into a Python list, and if the output is mixed with free text, the format conversion fails.
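
To make the failure concrete, here is a minimal sketch (the response string is a hypothetical example of the mixed output described above):

import json

# hypothetical mixed output: explanatory text followed by the JSON
response = """Sure! Here are the relevant products:
[{"category": "Computers and Laptops", "products": ["TechPro Desktop"]}]"""

try:
    json.loads(response.replace("'", '"'))
except json.JSONDecodeError as e:
    print(f"format conversion fails: {e}")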

Modify the prompt to handle more complex test cases (version 2)

Next we modify system_message so that it can handle complex user questions like the one above. The main goal is to prevent the LLM from outputting unnecessary text.

def find_category_and_product_v2(user_input,products_and_category):
    """
    Added: Do not output any additional text that is not in JSON format.
    Added a second example (for few-shot prompting) where user asks for 
    the cheapest computer. In both few-shot examples, the shown response 
    is the full list of products in JSON only.
    """
    delimiter = "####"
    system_message = f"""
    You will be provided with customer service queries. \
    The customer service query will be delimited with {delimiter} characters.
    Output a python list of json objects, where each object has the following format:
        'category': <one of Computers and Laptops, Smartphones and Accessories, Televisions and Home Theater Systems, \
    Gaming Consoles and Accessories, Audio Equipment, Cameras and Camcorders>,
    AND
        'products': <a list of products that must be found in the allowed products below>
    Do not output any additional text that is not in JSON format.
    Do not write any explanatory text after outputting the requested JSON.


    Where the categories and products must be found in the customer service query.
    If a product is mentioned, it must be associated with the correct category in the allowed products list below.
    If no products or categories are found, output an empty list.
    

    List out all products that are relevant to the customer service query based on how closely it relates
    to the product name and product category.
    Do not assume, from the name of the product, any features or attributes such as relative quality or price.

    The allowed products are provided in JSON format.
    The keys of each item represent the category.
    The values of each item is a list of products that are within that category.
    Allowed products: {products_and_category}
    

    """
    
    few_shot_user_1 = """I want the most expensive computer. What do you recommend?"""
    few_shot_assistant_1 = """ 
    [{'category': 'Computers and Laptops', \
'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]
    """
    
    few_shot_user_2 = """I want the cheapest computer. What do you recommend?"""
    few_shot_assistant_2 = """ 
    [{'category': 'Computers and Laptops', \
'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]
    """
    
    messages =  [  
    {'role':'system', 'content': system_message},    
    {'role':'user', 'content': f"{delimiter}{few_shot_user_1}{delimiter}"},  
    {'role':'assistant', 'content': few_shot_assistant_1 },
    {'role':'user', 'content': f"{delimiter}{few_shot_user_2}{delimiter}"},  
    {'role':'assistant', 'content': few_shot_assistant_2 },
    {'role':'user', 'content': f"{delimiter}{user_input}{delimiter}"},  
    ] 
    return get_completion_from_messages(messages)

Here we added two instructions to system_message: "Do not output any additional text that is not in JSON format." and "Do not write any explanatory text after outputting the requested JSON."

In addition, we modified the original few-shot example (few_shot_user_1 and few_shot_assistant_1) and added a second pair, few_shot_user_2 and few_shot_assistant_2, giving the LLM two example exchanges to learn the output format from.

Evaluate the effect of the modified prompt on complex questions

customer_msg_4 = f"""
tell me about the CineView TV, the 8K one, Gamesphere console, the X one.
I'm on a budget, what computers do you have?"""

products_by_category_4 = find_category_and_product_v2(customer_msg_4,
                                                      products_and_category)
print(products_by_category_4)

From the results above, this time the LLM output the relevant products and categories in the correct format, without any extra text attached, which shows that the modified prompt works.

Regression testing: verify that the model still works on previous test cases

We need to rerun the earlier test cases to make sure the modified prompt still handles them correctly. Here we pick two of them:

customer_msg_0 = f"""Which TV can I buy if I'm on a budget?"""

products_by_category_0 = find_category_and_product_v2(customer_msg_0,
                                                      products_and_category)
print(products_by_category_0)

customer_msg_3 = f"""
tell me about the smartx pro phone and the fotosnap camera, the dslr one.
Also, what TVs do you have?"""

products_by_category_3 = find_category_and_product_v2(customer_msg_3,
                                                      products_and_category)
print(products_by_category_3)

We reran two of the earlier test cases, and judging from the outputs they fully meet the requirements.

Build test suites for automated testing

Next, we create a set of test cases for automated testing. Each test case consists of two parts: the customer's question (customer_msg) and the ideal answer (ideal_answer). Our goal is to evaluate whether the LLM's reply matches the ideal answer, and to score the LLM accordingly.

msg_ideal_pairs_set = [
    
    # eg 0
    {'customer_msg':"""Which TV can I buy if I'm on a budget?""",
     'ideal_answer':{
        'Televisions and Home Theater Systems':set(
            ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']
        )}
    },

    # eg 1
    {'customer_msg':"""I need a charger for my smartphone""",
     'ideal_answer':{
        'Smartphones and Accessories':set(
            ['MobiTech PowerCase', 'MobiTech Wireless Charger', 'SmartX EarBuds']
        )}
    },
    # eg 2
    {'customer_msg':f"""What computers do you have?""",
     'ideal_answer':{
           'Computers and Laptops':set(
               ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook'
               ])
                }
    },

    # eg 3
    {'customer_msg':f"""tell me about the smartx pro phone and \
    the fotosnap camera, the dslr one.\
    Also, what TVs do you have?""",
     'ideal_answer':{
        'Smartphones and Accessories':set(
            ['SmartX ProPhone']),
        'Cameras and Camcorders':set(
            ['FotoSnap DSLR Camera']),
        'Televisions and Home Theater Systems':set(
            ['CineView 4K TV', 'SoundMax Home Theater','CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV'])
        }
    }, 
    
    # eg 4
    {'customer_msg':"""tell me about the CineView TV, the 8K one, Gamesphere console, the X one.
I'm on a budget, what computers do you have?""",
     'ideal_answer':{
        'Televisions and Home Theater Systems':set(
            ['CineView 8K TV']),
        'Gaming Consoles and Accessories':set(
            ['GameSphere X']),
        'Computers and Laptops':set(
            ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook'])
        }
    },
    
    # eg 5
    {'customer_msg':f"""What smartphones do you have?""",
     'ideal_answer':{
           'Smartphones and Accessories':set(
               ['SmartX ProPhone', 'MobiTech PowerCase', 'SmartX MiniPhone', 'MobiTech Wireless Charger', 'SmartX EarBuds'
               ])
                    }
    },
    # eg 6
    {'customer_msg':f"""I'm on a budget.  Can you recommend some smartphones to me?""",
     'ideal_answer':{
        'Smartphones and Accessories':set(
            ['SmartX EarBuds', 'SmartX MiniPhone', 'MobiTech PowerCase', 'SmartX ProPhone', 'MobiTech Wireless Charger']
        )}
    },

    # eg 7 # this will output a subset of the ideal answer
    {'customer_msg':f"""What Gaming consoles would be good for my friend who is into racing games?""",
     'ideal_answer':{
        'Gaming Consoles and Accessories':set([
            'GameSphere X',
            'ProGamer Controller',
            'GameSphere Y',
            'ProGamer Racing Wheel',
            'GameSphere VR Headset'
     ])}
    },
    # eg 8
    {'customer_msg':f"""What could be a good present for my videographer friend?""",
     'ideal_answer': {
        'Cameras and Camcorders':set([
        'FotoSnap DSLR Camera', 'ActionCam 4K', 'FotoSnap Mirrorless Camera', 'ZoomMaster Camcorder', 'FotoSnap Instant Camera'
        ])}
    },
    
    # eg 9
    {'customer_msg':f"""I would like a hot tub time machine.""",
     'ideal_answer': []
    }
    
]

Spot-check the automated test set

print(f'Customer message: {msg_ideal_pairs_set[7]["customer_msg"]}')
print(f'Ideal answer: {msg_ideal_pairs_set[7]["ideal_answer"]}')

Evaluate test cases by comparing against the ideal answers

Here we define an evaluation function that compares the LLM's response with the ideal answer from the test set and computes a score.

import json
def eval_response_with_ideal(response,
                              ideal,
                              debug=False):
    
    if debug:
        print("response")
        print(response)
    
    # json.loads() expects double quotes, not single quotes
    json_like_str = response.replace("'",'"')
    
    # parse into a list of dictionaries
    l_of_d = json.loads(json_like_str)
    
    # special case when response is empty list
    if l_of_d == [] and ideal == []:
        return 1
    
    # if only one of response and ideal is empty, there's a mismatch
    elif l_of_d == [] or ideal == []:
        return 0
    
    correct = 0    
    
    if debug:
        print("l_of_d is")
        print(l_of_d)
    for d in l_of_d:

        cat = d.get('category')
        prod_l = d.get('products')
        if cat and prod_l:
            # convert list to set for comparison
            prod_set = set(prod_l)
            # get ideal set of products
            ideal_cat = ideal.get(cat)
            if ideal_cat:
                prod_set_ideal = set(ideal.get(cat))
            else:
                if debug:
                    print(f"did not find category {cat} in ideal")
                    print(f"ideal: {ideal}")
                continue
                
            if debug:
                print("prod_set\n",prod_set)
                print()
                print("prod_set_ideal\n",prod_set_ideal)

            if prod_set == prod_set_ideal:
                if debug:
                    print("correct")
                correct +=1
            else:
                print("incorrect")
                print(f"prod_set: {prod_set}")
                print(f"prod_set_ideal: {prod_set_ideal}")
                if prod_set <= prod_set_ideal:
                    print("response is a subset of the ideal answer")
                elif prod_set >= prod_set_ideal:
                    print("response is a superset of the ideal answer")

    # count correct over total number of items in list
    pc_correct = correct / len(l_of_d)
        
    return pc_correct
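
One caveat: blindly replacing single quotes with double quotes breaks if a value ever contains an apostrophe. Since the model is prompted to emit a Python-style list (and the few-shot answers use single quotes), ast.literal_eval is a sturdier alternative (a sketch of my own, not part of the course code):

import ast

def parse_response(response):
    # parse a Python-literal string like "[{'category': ..., 'products': [...]}]"
    # without the quote-replacement trick; raises ValueError/SyntaxError on bad input
    return ast.literal_eval(response.strip())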

Next, we let the LLM answer one of the questions from the automated test set, then compare its output with the ideal answer and compute a score.

response = find_category_and_product_v2(msg_ideal_pairs_set[7]["customer_msg"],
                                         products_and_category)
print(f'Response: {response}')

eval_response_with_ideal(response,
                              msg_ideal_pairs_set[7]["ideal_answer"])

From the output, the response returned by the LLM contains fewer products than the ideal answer, so the response is only a subset of the ideal result: the answer is incomplete, and the exact-set comparison in eval_response_with_ideal scores this example 0.
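
If you want the scorer to give partial credit instead of all-or-nothing, one variation (my own, not from the course) is to score each category by how much of the ideal product set the response recovered:

def partial_credit(prod_set, prod_set_ideal):
    # fraction of the ideal products that the response recovered
    if not prod_set_ideal:
        return 0.0
    return len(prod_set & prod_set_ideal) / len(prod_set_ideal)

# for the subset case above, this yields a score between 0 and 1
# instead of the strict 0 returned by exact-set comparison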

Evaluate all test cases and calculate the proportion of correct responses

Next, we evaluate all the test cases and calculate the fraction of correct responses.

# Note, this will not work if any of the api calls time out
score_accum = 0
for i, pair in enumerate(msg_ideal_pairs_set):
    print(f"example {i}")

    customer_msg = pair['customer_msg']
    ideal = pair['ideal_answer']

    # print("Customer message",customer_msg)
    # print("ideal:",ideal)
    response = find_category_and_product_v2(customer_msg,
                                                      products_and_category)

    # print("products_by_category",products_by_category)
    score = eval_response_with_ideal(response,ideal,debug=False)
    print(f"{i}: {score}")
    score_accum += score


n_examples = len(msg_ideal_pairs_set)
fraction_correct = score_accum / n_examples
print(f"Fraction correct out of {n_examples}: {fraction_correct}")

We ran all the test cases; from the returned results, only example 7 was scored as incorrect, for the subset reason discussed above. That gives 9 examples scoring 1.0 and 1 example scoring 0.0, so the final fraction correct is (9 × 1.0 + 0.0) / 10 = 0.9.

Summary

Today we learned how to evaluate the correctness of an LLM's replies by creating test cases and checking the replies against them. When a reply does not meet the requirements, we modify the prompt: we add instructions to system_message that constrain the output, and we add a small number of few-shot examples. This steers the LLM toward outputting results in the correct format through prompting alone, without fine-tuning the model itself. Finally, we built an automated test set and scored the LLM's responses against the ideal answers.
