Chapter 4: Introduction to RASA training data

1. Introduction

Generally speaking, chatbots hold conversations with people. Speaking is the easy part for a bot: a few manually written rules and templates are enough to produce replies. Understanding human intent is the hard part, because language is so varied: polysemous words, puns, long and short sentences, and so on, and Chinese is especially rich in such subtleties. The bot therefore needs a lot of data that simulates how people ask questions, so that it can learn the characteristics of each intent, the ways people phrase their questions, and how people respond to other people's questions. In Rasa, this material is called training data.

Rasa uses the YAML format as a unified and extensible way to manage all training data, including NLU data, stories data and rules. The training data can be split into any number of YAML files, each of which can contain any combination of NLU data, stories, and rules. The training data parser uses the topmost key to determine the training data type.

Still looking for the Markdown data format? It was removed in Rasa 3.0, but the documentation for Markdown NLU data and Markdown stories can still be found. If you still have training data in Markdown format, the recommended approach is to use Rasa 2.x to convert it from Markdown to YAML. The migration guide explains how to do this.

1.1 High-Level Structure

Each file can contain one or more keys with the corresponding training data. A file can contain multiple keys, but each key can appear only once in a single file. The available keys are:

        version
        nlu
        stories
        rules

Developers should specify the version key in all YAML training data files. If a file has no version key, Rasa assumes that it uses the latest training data format supported by the installed Rasa version. Training data files whose version is higher than the one supported by the installed Rasa version are skipped. Currently, the latest training data format specification for Rasa 3.x is 3.1.

1.2 Example

The following example shows NLU data, stories, and rules all in one file:

version: "3.1"

nlu:
- intent: greet
  examples: |
    - Hey
    - Hi
    - hey there [Sara](name)

- intent: faq/language
  examples: |
    - What language do you speak?
    - Do you only handle english?

stories:
- story: greet and faq
  steps:
  - intent: greet
  - action: utter_greet
  - intent: faq
  - action: utter_faq

rules:
- rule: Greet user
  steps:
  - intent: greet
  - action: utter_greet
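
The key rules from section 1.1 (only known top-level keys, each key at most once per file, a version key in every file) can be checked mechanically. The following is a minimal sketch that scans for unindented keys with a regular expression; it is not a YAML parser and not how Rasa loads files, and the function names are hypothetical.

```python
import re

# Top-level keys listed in this chapter for training data files.
ALLOWED_KEYS = {"version", "nlu", "stories", "rules"}


def top_level_keys(text):
    """Return the unindented mapping keys of a YAML document, in file order."""
    keys = []
    for line in text.splitlines():
        # A top-level key starts in column 0 and is followed by a colon.
        match = re.match(r"([A-Za-z_][\w-]*):", line)
        if match:
            keys.append(match.group(1))
    return keys


def check_training_file(text):
    """Return a list of human-readable problems (empty if none were found)."""
    keys = top_level_keys(text)
    problems = []
    if "version" not in keys:
        problems.append("missing key: version")
    for key in sorted(set(keys)):
        if key not in ALLOWED_KEYS:
            problems.append(f"unknown key: {key}")
        if keys.count(key) > 1:
            problems.append(f"duplicate key: {key}")
    return problems


sample = """version: "3.1"

nlu:
- intent: greet
  examples: |
    - Hey
    - Hi

rules:
- rule: Greet user
  steps:
  - intent: greet
  - action: utter_greet
"""

print(check_training_file(sample))
print(check_training_file("nlu:\n- intent: a\nnlu:\n- intent: b\n"))
```

The first call reports no problems; the second reports the missing version key and the repeated nlu key.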

To specify test stories, put them into a separate file and prefix the file name with test_:

stories:
- story: greet and ask language
  steps:
  - user: |
      hey
    intent: greet
  - action: utter_greet
  - user: |
      what language do you speak
    intent: faq/language
  - action: utter_faq

2. NLU training data

NLU training data consists of example user utterances categorized by intent. Training examples can also include entities, which are structured pieces of information that can be extracted from a user's message. You can also add extra information to the training data, such as regular expressions and lookup tables, to help the model identify intents and entities correctly. NLU training data is defined under the nlu key. Items that can be added under this key include:

    Training examples grouped by user intent, optionally with annotated entities

nlu:
- intent: check_balance
  examples: |
    - What's my [credit](account) balance?
    - What's the balance on my [credit card account]{"entity":"account","value":"credit"}

    Synonyms

nlu:
- synonym: credit
  examples: |
    - credit card account
    - credit account

    Regular expressions

nlu:
- regex: account_number
  examples: |
    - \d{10,12}

    Lookup tables

nlu:
- lookup: banks
  examples: |
    - JPMC
    - Comerica
    - Bank of America

2.1 Training Examples

Training examples are grouped by intent and listed under the examples key. Typically, developers list one example per line, like so:

nlu:
- intent: greet
  examples: |
    - hey
    - hi
    - whats up

However, if the developer has a custom NLU component and needs metadata for the samples, the extended format can also be used:

nlu:
- intent: greet
  examples:
  - text: |
      hi
    metadata:
      sentiment: neutral
  - text: |
      hey there!

The metadata key can contain arbitrary key-value data that is tied to an example and accessible to the components in the NLU pipeline. In the example above, a custom component in the pipeline could use the sentiment metadata for sentiment analysis.

Developers can also specify this metadata at the intent level:

nlu:
- intent: greet
  metadata:
    sentiment: neutral
  examples:
  - text: |
      hi
  - text: |
      hey there!

In this case, the content of the metadata key is passed to every example of the intent.

If you want to specify retrieval intents, the NLU examples will look like this:

nlu:
- intent: chitchat/ask_name
  examples: |
    - What is your name?
    - May I know your name?
    - What do people call you?
    - Do you have a name for yourself?

- intent: chitchat/ask_weather
  examples: |
    - What's the weather like today?
    - Does it look sunny outside today?
    - Oh, do you mind checking the weather for me please?
    - I like sunny days in Berlin.

All retrieval intents have a suffix that identifies a particular response key for the assistant. In the example above, ask_name and ask_weather are the suffixes. The suffix is separated from the retrieval intent name by a / delimiter.

2.2 Entities

Entities are structured information that can be extracted from user messages, and are labeled with entity names in the training data. In addition to entity names, developers can label entities with synonyms, roles, or groups.

In the training data, examples of labeled entities are as follows:

nlu:
- intent: check_balance
  examples: |
    - how much do I have on my [savings](account) account
    - how much money is in my [checking]{"entity": "account"} account
    - What's the balance on my [credit card account]{"entity":"account","value":"credit"}

The full syntax for labeling entities is as follows:

[<entity-text>]{"entity": "<entity name>", "role": "<role name>", "group": "<group name>", "value": "<entity synonym>"}
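
To make the annotation syntax concrete, here is a minimal sketch of a parser for it, written with the Python standard library only. It is an illustration under simplifying assumptions (no nested brackets or braces), not Rasa's own parser, and the names ANNOTATION and parse_example are hypothetical.

```python
import json
import re

# Matches [text](entity) as well as [text]{"entity": "...", ...}.
ANNOTATION = re.compile(
    r"\[(?P<text>[^\]]+)\]"
    r"(?:\((?P<name>[^)]+)\)|(?P<attrs>\{[^}]*\}))"
)


def parse_example(example):
    """Split one annotated example into plain text and a list of entities."""
    plain = ""
    entities = []
    cursor = 0
    for match in ANNOTATION.finditer(example):
        # Copy the unannotated text that precedes this annotation.
        plain += example[cursor:match.start()]
        text = match.group("text")
        if match.group("name"):
            attrs = {"entity": match.group("name")}
        else:
            attrs = json.loads(match.group("attrs"))
        entity = {
            "start": len(plain),
            "end": len(plain) + len(text),
            "entity": attrs["entity"],
            # Without an explicit "value", the annotated text itself is used.
            "value": attrs.get("value", text),
        }
        for optional in ("role", "group"):
            if optional in attrs:
                entity[optional] = attrs[optional]
        entities.append(entity)
        plain += text
        cursor = match.end()
    plain += example[cursor:]
    return plain, entities


text, entities = parse_example(
    'What\'s the balance on my [credit card account]{"entity":"account","value":"credit"}'
)
print(text)
print(entities)
```

The sketch returns the message without markup plus one dictionary per entity, with character offsets into the plain text.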

The role, group, and value keywords are optional in this notation. The value field refers to synonyms. To understand what the role and group labels are for, see the Rasa documentation on entity roles and groups.

2.3 Synonyms

Synonyms map extracted entities to a value other than the literal text that was extracted. Developers can define synonyms using the following format:

nlu:
- synonym: credit
  examples: |
    - credit card account
    - credit account

Developers can also define synonyms in-line in the training examples by specifying the value of the entity:

nlu:
- intent: check_balance
  examples: |
    - how much do I have on my [credit card account]{"entity": "account", "value": "credit"}
    - how much do I owe on my [credit account]{"entity": "account", "value": "credit"}
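
Conceptually, synonym mapping is a dictionary lookup applied to entities after extraction. The sketch below illustrates that idea with the two synonyms defined above; the function name apply_synonyms is hypothetical and the case-insensitive lookup is an assumption of this sketch, so Rasa's own synonym component may behave differently in detail.

```python
# Extracted text -> canonical value, mirroring the "synonym: credit" entry above.
SYNONYMS = {
    "credit card account": "credit",
    "credit account": "credit",
}


def apply_synonyms(entities, synonyms):
    """Return copies of the entities with synonym values replaced."""
    normalized = []
    for entity in entities:
        value = str(entity["value"])
        # Assumption: lookup ignores case; unknown values pass through unchanged.
        mapped = synonyms.get(value.lower(), entity["value"])
        normalized.append({**entity, "value": mapped})
    return normalized


extracted = [{"entity": "account", "value": "Credit Card Account"}]
print(apply_synonyms(extracted, SYNONYMS))
```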

For more about synonyms, see the NLU Training Data page.

2.4 Regular Expressions

Developers can use regular expressions to improve intent classification and entity extraction. They are consumed by the RegexFeaturizer and RegexEntityExtractor components.

The format for defining a regular expression is as follows:

nlu:
- regex: account_number
  examples: |
    - \d{10,12}
- intent: inform
  examples: |
    - my account number is [1234567891](account_number)
    - This is my account number [1234567891](account_number)

Here, account_number is the name of the regular expression. When it is used as a feature for the RegexFeaturizer, the name of the regular expression does not matter. When it is used with the RegexEntityExtractor, the name of the regular expression should match the name of the entity that the bot should extract.
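
The link between the regular expression name and the entity name can be illustrated with plain Python. This is a rough sketch of the idea behind regex-based entity extraction, not the RegexEntityExtractor implementation; REGEXES and extract_with_regexes are hypothetical names.

```python
import re

# Name -> pattern, mirroring the "regex: account_number" entry above.
REGEXES = {"account_number": r"\d{10,12}"}


def extract_with_regexes(text, regexes):
    """Return one entity per regex match, named after the regex."""
    entities = []
    for name, pattern in regexes.items():
        for match in re.finditer(pattern, text):
            entities.append(
                {
                    "entity": name,
                    "value": match.group(0),
                    "start": match.start(),
                    "end": match.end(),
                }
            )
    return entities


message = "my account number is 1234567891"
print(extract_with_regexes(message, REGEXES))
```

Because the entity is named after the pattern, renaming the regex would rename the extracted entity, which is why the two names have to agree.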

For more about regular expressions, see the NLU Training Data page.

2.5 Lookup Tables

Lookup tables are lists of words used to generate case-insensitive regular expression patterns. They can be used in the same ways as regular expressions, in combination with the RegexFeaturizer and RegexEntityExtractor components in the pipeline. A lookup table helps extract entities that have a known set of possible values. Keep lookup tables as specific as possible. For example, to extract country names, you can add a lookup table of all the countries in the world. In effect, a lookup table acts as a custom word list (dictionary) for the assistant.

nlu:
- lookup: banks
  examples: |
    - JPMC
    - Comerica
    - Bank of America
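
As a rough illustration of how a list of words becomes a case-insensitive pattern, the sketch below builds one regular expression from the banks table. The exact pattern Rasa generates may differ; lookup_to_pattern is a hypothetical helper, and the word-boundary anchors assume entries that start and end with letters or digits.

```python
import re


def lookup_to_pattern(entries):
    """Compile a case-insensitive pattern that matches any lookup entry."""
    # Longer entries first, so multi-word entries win over shorter overlaps.
    alternatives = sorted((re.escape(entry) for entry in entries), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


banks = ["JPMC", "Comerica", "Bank of America"]
pattern = lookup_to_pattern(banks)

message = "I bank with comerica and Bank Of America"
print([match.group(0) for match in pattern.finditer(message)])
```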

3. Conversation Training Data

Both stories and rules represent dialog flows between the user and the assistant, and both are used to train the dialog management model. Stories are used to train a machine learning model to recognize patterns in conversations and to generalize to unseen conversation paths. Rules describe parts of conversations that should always follow the same path, and are used to train the RulePolicy.

3.1 Stories

Stories usually consist of the following parts:

    story: The name of the story. The name is arbitrary and not used for training; developers can use it as a human-readable reference to the story for statistics and illustration.
    metadata: Arbitrary and optional; not used for training. You can use it to store relevant information about the story, such as the author.
    steps: The list of user messages and actions that make up the story.

A sample is as follows:

stories:
- story: Greet the user
  metadata:
    author: Somebody
    key: value
  steps:
  # list of steps
  - intent: greet
  - action: utter_greet

Each step can be one of the following:

    A user message, represented by an intent and entities.
    An or statement, which contains two or more user messages.
    A bot action.
    A form.
    A slot_was_set event.
    A checkpoint, which connects the story to another story.

User Messages

All user messages are specified with the intent key and an optional entities key.

When writing stories, developers don't have to deal with the specifics of the messages users send. Instead, developers can leverage the output of the NLU pipeline, which uses a combination of intents and entities to refer to all possible messages that a user could send with the same meaning.

User messages follow the following format:

stories:
- story: user message structure
  steps:
    - intent: intent_name  # Required
      entities:  # Optional
      - entity_name: entity_value
    - action: action_name

For example, in the sentence "I want to check my credit balance", "credit" is an entity. Including entities in stories matters too, because the policies learn to predict the next action from the combination of intent and entities (developers can change this behavior with the use_entities attribute).

Actions

All actions performed by the Bot are specified using the action field, followed by the action name. When writing a story, there are generally two types of actions:

1. Responses. These start with utter_ and send a specific message to the user. For example:

stories:
- story: story with a response
  steps:
  - intent: greet
  - action: utter_greet

2. Custom actions. These start with action_, run arbitrary code, and send any number of messages (or none). For example:

stories:
- story: story with a custom action
  steps:
  - intent: feedback
  - action: action_store_feedback

Forms

Forms are a specific type of custom action that contains the logic to loop over a set of required slots and ask the user for this information. The developer defines a form in the forms section of the domain. Once a form is defined, its happy path should be specified as a rule. Interruptions of forms and other "unhappy paths" should be included in stories so that the model can generalize to unseen conversation sequences. As a step in a story, a form takes the following format:

stories:
- story: story with a form
  steps:
  - intent: find_restaurant
  - action: restaurant_form                # Activate the form
  - active_loop: restaurant_form           # This form is currently active
  - active_loop: null                      # Form complete, no form is active
  - action: utter_restaurant_found

The action step activates the form and starts looping over the required slots. The active_loop: restaurant_form step indicates that there is currently an active form. Much like a slot_was_set step, an active_loop step does not set a form to active; it indicates that the form should already be active. Likewise, the active_loop: null step indicates that no form should be active before the subsequent steps are taken.

A form can be interrupted and remain active; in this case, the interruption should come after the action: <form to activate> step and be followed by the active_loop: <active form> step. An interrupted form might look like this:

stories:
- story: interrupted food
  steps:
    - intent: request_restaurant
    - action: restaurant_form
    - intent: chitchat
    - action: utter_chitchat
    - active_loop: restaurant_form
    - active_loop: null
    - action: utter_slots_values

Slots

Slot events are specified under the slot_was_set key, with a slot name and an optional slot value. Slots act as the bot's memory. They are set either by the default action action_extract_slots, according to the slot mappings specified in the domain, or by custom actions. Stories reference them in slot_was_set steps. For example:

stories:
- story: story with a slot
  steps:
  - intent: celebrate_bot
  - slot_was_set:
    - feedback_value: positive
  - action: utter_yay

This means the story requires the current value of the feedback_value slot to be positive for the conversation to continue as specified.

Whether a slot's value needs to be included depends on the slot type and on whether the value can or should influence the dialog. If the value doesn't matter, as is the case for text slots, you can list just the slot's name:

stories:
- story: story with a slot
  steps:
  - intent: greet
  - slot_was_set:
    - name
  - action: utter_greet_user_by_name

Checkpoints

Checkpoints are specified by the checkpoint field, either at the beginning or end of the story.

Checkpoints are how you connect stories together. If a checkpoint is the last step in a story, then during training that story is connected to every other story that starts with a checkpoint of the same name. Here is an example of a story that ends with a checkpoint, and one that starts with the same checkpoint:

stories:
- story: story_with_a_checkpoint_1
  steps:
  - intent: greet
  - action: utter_greet
  - checkpoint: greet_checkpoint

- story: story_with_a_checkpoint_2
  steps:
  - checkpoint: greet_checkpoint
  - intent: book_flight
  - action: action_book_flight
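
The effect of connecting stories through checkpoints can be sketched as stitching step lists together. The code below is a simplified illustration, assuming stories are already loaded as lists of step dictionaries; it ignores slot conditions and does not guard against cyclic checkpoints, and the function names are hypothetical rather than part of Rasa.

```python
def expand(steps, stories):
    """Follow a trailing checkpoint into every story that starts with it."""
    if not steps or "checkpoint" not in steps[-1]:
        return [steps]
    target = steps[-1]["checkpoint"]
    paths = []
    for continuation in stories.values():
        if continuation and continuation[0].get("checkpoint") == target:
            # Drop both checkpoint steps and join the remaining steps.
            for tail in expand(continuation[1:], stories):
                paths.append(steps[:-1] + tail)
    return paths


def full_paths(stories):
    """Expand every story that does not itself start with a checkpoint."""
    paths = []
    for steps in stories.values():
        if steps and "checkpoint" not in steps[0]:
            paths.extend(expand(steps, stories))
    return paths


STORIES = {
    "story_with_a_checkpoint_1": [
        {"intent": "greet"},
        {"action": "utter_greet"},
        {"checkpoint": "greet_checkpoint"},
    ],
    "story_with_a_checkpoint_2": [
        {"checkpoint": "greet_checkpoint"},
        {"intent": "book_flight"},
        {"action": "action_book_flight"},
    ],
}

for path in full_paths(STORIES):
    print(path)
```

For the two stories above, the result is a single path: greet, utter_greet, book_flight, action_book_flight.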

Checkpoints at the beginning of a story can also be made conditional on slots being set, for example:

stories:
- story: story_with_a_conditional_checkpoint
  steps:
  - checkpoint: greet_checkpoint
    # This checkpoint should only apply if slots are set to the specified value
    slot_was_set:
    - context_scenario: holiday
    - holiday_name: thanksgiving
  - intent: greet
  - action: utter_greet_thanksgiving

Checkpoints can help simplify your training data and reduce redundancy in it, but don't overuse them. Using lots of checkpoints quickly makes stories hard to follow. It makes sense to use them when a sequence of steps is repeated often in different stories, but stories without checkpoints are easier to read and write.

3.2 Rules

Rules are listed under the rules key and look similar to stories. A rule also has a steps key, which contains the same kind of list of steps as a story. Rules can additionally contain the conversation_start and condition keys, which specify the conditions under which the rule applies.

A rule with a condition looks like this:

rules:
- rule: Only say `hey` when the user provided a name
  condition:
  - slot_was_set:
    - user_provided_name: true
  steps:
  - intent: greet
  - action: utter_greet
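
What a condition expresses can be sketched as a check against the current slot values. This is only an illustration of the idea, assuming the rule has been loaded into a dictionary; it is not how the RulePolicy is implemented. The helper condition_holds is hypothetical, and treating a name-only entry as "the slot holds some value" is an assumption of this sketch.

```python
def condition_holds(condition, slots):
    """Return True if every slot_was_set requirement matches the slot values."""
    for item in condition:
        for requirement in item.get("slot_was_set", []):
            if isinstance(requirement, str):
                # Name only: assume the slot just has to hold some value.
                if slots.get(requirement) is None:
                    return False
            else:
                for name, value in requirement.items():
                    if slots.get(name) != value:
                        return False
    return True


rule = {
    "rule": "Only say `hey` when the user provided a name",
    "condition": [{"slot_was_set": [{"user_provided_name": True}]}],
    "steps": [{"intent": "greet"}, {"action": "utter_greet"}],
}

print(condition_holds(rule["condition"], {"user_provided_name": True}))
print(condition_holds(rule["condition"], {"user_provided_name": False}))
```

The first call prints True and the second prints False, so in this sketch the rule would apply only in the first situation.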

For more about rules, see the Rules page.

4. Test Stories

Test stories check whether user messages are classified correctly and whether the right actions are predicted. Test stories use the same format as stories, except that user message steps can include a user key that specifies the actual text of the message, with entity annotations. Here is an example of a test story:

stories:
- story: A basic end-to-end test
  steps:
  - user: |
     hey
    intent: greet
  - action: utter_ask_howcanhelp
  - user: |
     show me [chinese]{"entity": "cuisine"} restaurants
    intent: inform
  - action: utter_ask_location
  - user: |
     in [Paris]{"entity": "location"}
    intent: inform
  - action: utter_ask_price

Developers can test by executing the following command:

rasa test

For more about testing, see Testing Your Assistant.

5. End-to-End Training

With end-to-end training, developers do not have to deal with the specific intents that the NLU pipeline extracts from messages. Instead, they can put the text of the user message directly in the story by using the user key.

These end-to-end user messages follow the following format:

stories:
- story: user message structure
  steps:
    - user: the actual text of the user message
    - action: action_name

Additionally, developers can add entity labels that can be extracted by the TED Policy. The syntax for entity labels is the same as in the NLU training data. For example, the following story contains the user utterance "I can always go for sushi". Using the NLU training data syntax [sushi](cuisine), developers can mark sushi as an entity of type cuisine.

stories:
- story: story with entities
  steps:
  - user: I can always go for [sushi](cuisine)
  - action: utter_suggest_cuisine

Likewise, developers can place bot utterances directly in stories by using the bot field followed by the text the developer wants the bot to speak.

A story with only a bot utterance might look like this:

stories:
- story: story with an end-to-end response
  steps:
  - intent: greet
    entities:
    - name: Ivan
  - bot: Hello, a person with a name!

Developers can also have a hybrid end-to-end story:

stories:
- story: full end-to-end story
  steps:
  - intent: greet
    entities:
    - name: Ivan
  - bot: Hello, a person with a name!
  - intent: search_restaurant
  - action: utter_suggest_cuisine
  - user: I can always go for [sushi](cuisine)
  - bot: Personally, I prefer pizza, but sure let's search sushi restaurants
  - action: utter_suggest_cuisine
  - user: Have a beautiful day!
  - action: utter_goodbye

Rasa end-to-end training is fully integrated with the standard Rasa approach. This means developers can mix stories where some steps are defined by actions or intents, while others are defined directly by user messages or bot responses.

6. References

Training Data Format

Introduction to Rasa Open Source & Rasa Pro

(13) RASA training data
 

Origin blog.csdn.net/sinat_37574187/article/details/131785525