Building Systems with LLM (Part 7): Evaluation, Part 2

Today I continued DeepLearning.AI's online course Building Systems with LLM, and I would like to share its main content with you. In earlier posts we covered the following topics:

Building Systems with LLM (Part 1): Classification
Building Systems with LLM (Part 2): Moderation and Preventing Prompt Injection
Building Systems with LLM (Part 3): Chain-of-Thought Reasoning
Building Systems with LLM (Part 4): Chaining Prompts
Building Systems with LLM (Part 5): Checking Outputs
Building Systems with LLM (Part 6): Building an End-to-End System
Building Systems with LLM (Part 7): Evaluation, Part 1

Below is the main code we use to access the LLM:

import os
import openai
import sys
sys.path.append('../..')
import utils
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ["OPENAI_API_KEY"]
 
def get_completion_from_messages(messages, 
                                 model="gpt-3.5-turbo", 
                                 temperature=0, 
                                 max_tokens=500):
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature, 
        max_tokens=max_tokens,
    )
    return response.choices[0].message["content"]
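
As a quick sanity check, the helper can be called like this (the example messages below are my own illustration, not part of the course code):

messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': 'Say hello in one short sentence.'}
]
print(get_completion_from_messages(messages))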

Running the end-to-end system to answer a user query

Here we use a helper module, utils, to answer a series of user questions about electronics products. The functions in utils were explained in earlier posts of this series, so I won't repeat that here.

customer_msg = f"""
tell me about the smartx pro phone and the fotosnap camera, the dslr one.
Also, what TVs or TV related products do you have?"""

products_by_category = utils.get_products_from_query(customer_msg)
category_and_product_list = utils.read_string_to_list(products_by_category)
product_info = utils.get_mentioned_product_info(category_and_product_list)
assistant_answer = utils.answer_user_msg(user_msg=customer_msg,
                                         product_info=product_info)

The customer_msg above asks about the SmartX ProPhone, the FotoSnap DSLR camera, and any TVs or TV-related products in the catalog.

Here we first call utils.get_products_from_query to get the list of product categories and products mentioned in the user's question (this step calls the LLM). Because the LLM returns this list as a string, we then call utils.read_string_to_list to convert the string into a Python list. Next, utils.get_mentioned_product_info looks up the detailed information for every product on that list (this step does not call the LLM). Finally, we send the user's question together with the detailed product information to the LLM, which generates the final reply. Let's look at what each of these variables contains:
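
A simple way to do that is to print each intermediate variable (these print calls are my own addition; the original post showed the outputs as screenshots):

print(products_by_category)        # raw string returned by the LLM
print(category_and_product_list)   # the string parsed into a Python list
print(product_info)                # detailed info for each mentioned product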

Finally, let's look at the LLM's final reply:
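
In code this is simply printing the value returned by utils.answer_user_msg (output omitted here):

print(assistant_answer)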

Because the customer's question covers several products, the LLM's reply contains quite a lot of information. How can we evaluate whether this reply meets our requirements?

Evaluating the LLM's answer to the user with a rubric, based on the extracted product information

Here we need to define a rubric for judging whether the LLM's final reply is acceptable. The evaluation is based on three pieces of content: the customer's question, the detailed product information, and the LLM's final reply. In other words, we combine these three parts and then ask the LLM to evaluate its own earlier reply. Below we define an eval_with_rubric function to do this:

def eval_with_rubric(test_set, assistant_answer):

    cust_msg = test_set['customer_msg']
    context = test_set['context']
    completion = assistant_answer
    
    system_message = """\
    You are an assistant that evaluates how well the customer service agent \
    answers a user question by looking at the context that the customer service \
    agent is using to generate its response. 
    """

    user_message = f"""\
You are evaluating a submitted answer to a question based on the context \
that the agent uses to answer the question.
Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {cust_msg}
    ************
    [Context]: {context}
    ************
    [Submission]: {completion}
    ************
    [END DATA]

Compare the factual content of the submitted answer with the context. \
Ignore any differences in style, grammar, or punctuation.
Answer the following questions:
    - Is the Assistant response based only on the context provided? (Y or N)
    - Does the answer include information that is not provided in the context? (Y or N)
    - Is there any disagreement between the response and the context? (Y or N)
    - Count how many questions the user asked. (output a number)
    - For each question that the user asked, is there a corresponding answer to it?
      Question 1: (Y or N)
      Question 2: (Y or N)
      ...
      Question N: (Y or N)
    - Of the number of questions asked, how many of these questions were addressed by the answer? (output a number)
"""

    messages = [
        {'role': 'system', 'content': system_message},
        {'role': 'user', 'content': user_message}
    ]

    response = get_completion_from_messages(messages)
    return response

In this prompt, the system_message tells the model to act as an evaluator of the customer service agent's answer, and the user_message packages the customer's question, the context (the product information), and the submitted answer together with the rubric questions.

Next, let's evaluate the LLM's earlier reply:

cust_prod_info = {
    'customer_msg': customer_msg,
    'context': product_info
}

evaluation_output = eval_with_rubric(cust_prod_info, assistant_answer)
print(evaluation_output)

Here we have defined an evaluation rubric: we ask the LLM to answer the six questions listed in user_message, and those six questions are our scoring criteria. From the output above, the LLM answered all six questions correctly.
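
If you want to post-process this free-form rubric output programmatically, a minimal sketch (my own hypothetical helper, which assumes the evaluator ends each answer line with Y or N as requested) could count the answers:

def summarize_rubric_output(evaluation_output):
    # Hypothetical helper, not part of the course code: roughly count the
    # Y/N answers, assuming each answer appears at the end of its line.
    counts = {'Y': 0, 'N': 0}
    for line in evaluation_output.splitlines():
        answer = line.strip().rstrip('.').upper()
        if answer.endswith(' Y') or answer == 'Y':
            counts['Y'] += 1
        elif answer.endswith(' N') or answer == 'N':
            counts['N'] += 1
    return counts

print(summarize_rubric_output(evaluation_output))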

Evaluating the LLM's answer to the user against an "ideal"/"expert" (human-written) answer

To assess the quality of the LLM's reply, besides having the LLM answer some simple questions (which may amount to little more than simple bookkeeping), we also want to measure how the reply differs from an expert, or ideal, answer written by a human. Below we define such an ideal/expert answer:

test_set_ideal = {
    'customer_msg': """\
tell me about the smartx pro phone and the fotosnap camera, the dslr one.
Also, what TVs or TV related products do you have?""",
    'ideal_answer':"""\
Of course!  The SmartX ProPhone is a powerful \
smartphone with advanced camera features. \
For instance, it has a 12MP dual camera. \
Other features include 5G wireless and 128GB storage. \
It also has a 6.1-inch display.  The price is $899.99.

The FotoSnap DSLR Camera is great for \
capturing stunning photos and videos. \
Some features include 1080p video, \
3-inch LCD, a 24.2MP sensor, \
and interchangeable lenses. \
The price is 599.99.

For TVs and TV related products, we offer 3 TVs \


All TVs offer HDR and Smart TV.

The CineView 4K TV has vibrant colors and smart features. \
Some of these features include a 55-inch display, \
'4K resolution. It's priced at 599.

The CineView 8K TV is a stunning 8K TV. \
Some features include a 65-inch display and \
8K resolution.  It's priced at 2999.99

The CineView OLED TV lets you experience vibrant colors. \
Some features include a 55-inch display and 4K resolution. \
It's priced at 1499.99.

We also offer 2 home theater products, both which include bluetooth.\
The SoundMax Home Theater is a powerful home theater system for \
an immmersive audio experience.
Its features include 5.1 channel, 1000W output, and wireless subwoofer.
It's priced at 399.99.

The SoundMax Soundbar is a sleek and powerful soundbar.
It's features include 2.1 channel, 300W output, and wireless subwoofer.
It's priced at 199.99

Are there any questions additional you may have about these products \
that you mentioned here?
Or may do you have other questions I can help you with?
    """
}

Evaluating the LLM's answer against the expert answer

Below we use a prompt taken from the OpenAI evals project to grade the LLM's answer against the expert answer.

A BLEU (bilingual evaluation understudy) score is another way to evaluate whether two pieces of text are similar; here, however, we use the prompt-based comparison rather than BLEU.
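
For completeness, here is roughly what a BLEU comparison could look like. This sketch is my own illustration rather than part of the course code, and it assumes the nltk package is installed (pip install nltk):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_vs_ideal(ideal_answer, assistant_answer):
    # Whitespace tokenization is crude; real use would tokenize more carefully.
    reference = [ideal_answer.lower().split()]   # BLEU expects a list of reference token lists
    candidate = assistant_answer.lower().split()
    # Smoothing avoids a zero score when some higher-order n-grams have no overlap.
    return sentence_bleu(reference, candidate,
                         smoothing_function=SmoothingFunction().method1)

print(bleu_vs_ideal(test_set_ideal['ideal_answer'], assistant_answer))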

def eval_vs_ideal(test_set, assistant_answer):

    cust_msg = test_set['customer_msg']
    ideal = test_set['ideal_answer']
    completion = assistant_answer
    
    system_message = """\
    You are an assistant that evaluates how well the customer service agent \
    answers a user question by comparing the response to the ideal (expert) response
    Output a single letter and nothing else. 
    """

    user_message = f"""\
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {cust_msg}
    ************
    [Expert]: {ideal}
    ************
    [Submission]: {completion}
    ************
    [END DATA]

Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
    The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
    (A) The submitted answer is a subset of the expert answer and is fully consistent with it.
    (B) The submitted answer is a superset of the expert answer and is fully consistent with it.
    (C) The submitted answer contains all the same details as the expert answer.
    (D) There is a disagreement between the submitted answer and the expert answer.
    (E) The answers differ, but these differences don't matter from the perspective of factuality.
  choice_strings: ABCDE
"""

    messages = [
        {'role': 'system', 'content': system_message},
        {'role': 'user', 'content': user_message}
    ]

    response = get_completion_from_messages(messages)
    return response

The prompt above asks the model to compare the factual content of the submitted answer with the expert answer, ignoring style, grammar, and punctuation, and to output a single letter from A to E.

Next, we compare the LLM's answer with the expert (ideal) answer; the result is a single grade (A, B, C, D, or E):

eval_vs_ideal(test_set_ideal, assistant_answer)

 

From the evaluation result, the model graded its own earlier answer as A, meaning that answer is a subset of the expert answer and fully consistent with it. Looking at the expert answer, we can see that it describes every product in great detail, and some of those details are not included in the system's product information, so it is reasonable that the LLM's answer is a subset of the expert answer. And because the LLM's reply is based on the system's product information and does not distort it, the reply is fully consistent with the corresponding parts of the expert answer. Next, let's try an obviously incorrect LLM reply and compare it with the expert answer again:

assistant_answer_2 = "life is like a box of chocolates"
eval_vs_ideal(test_set_ideal, assistant_answer_2)

Here we supplied an LLM reply that has nothing to do with the product information: "life is like a box of chocolates". This time the evaluation returns a grade of D, meaning there is a disagreement between the submitted answer and the expert answer.

Summary

Today we looked at two ways to evaluate an LLM's replies. The first is to design some simple rubric questions and have the LLM answer them; the second is to compare the LLM's answer against a human expert's answer, evaluate the differences between the two, and output a grade (A, B, C, D, or E).


Reposted from blog.csdn.net/weixin_42608414/article/details/131375176