Using custom example selector in langchain

Introduction

In the previous article, we mentioned that when interacting with a large language model, we can provide it with some specific example content so that it can work out, from those examples, the kind of answers we expect. In langchain, this convenient mechanism is called FewShotPromptTemplate.

If there are only a few examples, that is not a problem: we can simply send all of them to the large language model.

But if there are many examples, sending all of that content on every call will drain our wallets. After all, third-party large language models charge by the token.

So what can we do? Is there a cost-effective and efficient way to get the job done?

The answer is to use the example selector.

Using and customizing the example selector

Let's recall that when using FewShotPromptTemplate, we can pass in either a fixed list of examples or an example_selector.

from langchain.prompts import FewShotPromptTemplate

prompt = FewShotPromptTemplate(
    example_selector=example_selector,
    example_prompt=example_prompt,
    suffix="Question: {input}",
    input_variables=["input"]
)

Here we pass in an example_selector, so what exactly is an example_selector?

Judging from the name, its main job is to pick the required examples out of the examples we provide and hand only those to the large language model, thereby reducing the number of tokens per call.

langchain provides implementations of example_selector out of the box. Let's first take a look at the definition of its base class:

from abc import ABC, abstractmethod
from typing import Any, Dict, List


class BaseExampleSelector(ABC):
    """Interface for selecting examples to include in prompts."""

    @abstractmethod
    def add_example(self, example: Dict[str, str]) -> Any:
        """Add new example to store for a key."""

    @abstractmethod
    def select_examples(self, input_variables: Dict[str, str]) -> List[dict]:
        """Select which examples to use based on the inputs."""

You can see that BaseExampleSelector inherits from ABC and defines two abstract methods that need to be implemented.

One method is called add_example; its purpose is to add a new example to the selector.

The other is select_examples; its main purpose is to pick, based on the input, which of the stored examples to use.

So what is ABC?

ABC is of course the ABC you already know, but here it carries an extra meaning: its full name is Abstract Base Class, and it is used to create abstract base classes in Python programs.

It provides decorators such as @abstractmethod and @abstractproperty to declare which members a concrete subclass must implement.

Therefore, if we want to customize an ExampleSelector, we only need to inherit from BaseExampleSelector and implement these two abstract methods, as in the sketch below.
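For example, here is a minimal sketch of a custom selector that ignores the input and simply returns up to two random examples. The class name and the selection rule are made up purely for illustration, and the import path may differ slightly depending on your langchain version:

import random
from typing import Dict, List

from langchain.prompts.example_selector.base import BaseExampleSelector


class RandomExampleSelector(BaseExampleSelector):
    """A hypothetical selector that returns up to two random examples."""

    def __init__(self, examples: List[Dict[str, str]]):
        self.examples = examples

    def add_example(self, example: Dict[str, str]) -> None:
        # Store the new example alongside the existing ones.
        self.examples.append(example)

    def select_examples(self, input_variables: Dict[str, str]) -> List[dict]:
        # Ignore the input and pick up to two examples at random.
        return random.sample(self.examples, k=min(2, len(self.examples)))

Such a selector can then be passed to FewShotPromptTemplate through the example_selector parameter, just like the built-in ones.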

ExampleSelector implementation in langchain

In addition to custom implementations, langchain has provided us with several commonly used ExampleSelector implementations. Let’s take a look.

LengthBasedExampleSelector

LengthBasedExampleSelector is a selector that selects based on the length of the example.

Let’s take a look at its specific implementation:

    def add_example(self, example: Dict[str, str]) -> None:
        """Add new example to list."""
        self.examples.append(example)
        string_example = self.example_prompt.format(**example)
        self.example_text_lengths.append(self.get_text_length(string_example))

The logic of add_example is to first append the example to the examples list.

Then example_prompt is used to format the example into its final text.

Finally, the length of that formatted text is appended to the example_text_lengths list.

    def select_examples(self, input_variables: Dict[str, str]) -> List[dict]:
        """Select which examples to use based on the input lengths."""
        inputs = " ".join(input_variables.values())
        remaining_length = self.max_length - self.get_text_length(inputs)
        i = 0
        examples = []
        while remaining_length > 0 and i < len(self.examples):
            new_length = remaining_length - self.example_text_lengths[i]
            if new_length < 0:
                break
            else:
                examples.append(self.examples[i])
                remaining_length = new_length
            i += 1
        return examples

The select_examples method subtracts the length of the input text from max_length to get a remaining budget, then walks through the examples, subtracting each example's text length from that budget, and stops as soon as the next example no longer fits. What is returned are the examples that fit within max_length.

The main purpose of this selector is to keep the prompt from exhausting the context window, because most large language models limit the length of the input.

If that limit is exceeded, unexpected results may occur.

This selector is very simple to use. Here are specific examples:

from langchain.prompts import FewShotPromptTemplate, PromptTemplate
from langchain.prompts.example_selector import LengthBasedExampleSelector

examples = [
    {"input": "happy", "output": "sad"},
    {"input": "tall", "output": "short"},
    {"input": "energetic", "output": "lethargic"},
    {"input": "sunny", "output": "gloomy"},
    {"input": "windy", "output": "calm"},
]

example_prompt = PromptTemplate(
    input_variables=["input", "output"],
    template="Input: {input}\nOutput: {output}",
)
example_selector = LengthBasedExampleSelector(
    examples=examples, 
    example_prompt=example_prompt, 
    max_length=25,
)
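To see the effect of the length limit, we can plug this selector into a FewShotPromptTemplate and format it with inputs of different lengths. This is only a rough usage sketch; the exact number of examples kept depends on how the selector measures text length:

dynamic_prompt = FewShotPromptTemplate(
    example_selector=example_selector,
    example_prompt=example_prompt,
    prefix="Give the antonym of every input",
    suffix="Input: {input}\nOutput:",
    input_variables=["input"],
)

# A short input leaves enough budget for all five examples.
print(dynamic_prompt.format(input="big"))

# A long input uses up most of max_length, so fewer examples are included.
long_input = "big and huge and massive and large and gigantic and tall and much bigger than everything else"
print(dynamic_prompt.format(input=long_input))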

SemanticSimilarityExampleSelector and MaxMarginalRelevanceExampleSelector

These two selectors search for examples based on similarity.

Among them, MaxMarginalRelevanceExampleSelector is a subclass of SemanticSimilarityExampleSelector that adds some algorithmic optimizations, so we introduce the two of them together here.

These two selectors are different from the selectors introduced before, because they rely on a vector database.

What is the vector database for? Its main purpose is to convert inputs into vectors and store them, which makes it easy to compute the similarity between inputs.

Let’s take a look at their add_example method first:

    def add_example(self, example: Dict[str, str]) -> str:
        """Add new example to vectorstore."""
        if self.input_keys:
            string_example = " ".join(
                sorted_values({key: example[key] for key in self.input_keys})
            )
        else:
            string_example = " ".join(sorted_values(example))
        ids = self.vectorstore.add_texts([string_example], metadatas=[example])
        return ids[0]

This method first builds a text string for the example: if input_keys is set, it joins the example's values for those keys, sorted by key; otherwise it joins all of the example's values, sorted by key. It then stores that string in the vector database by calling the vectorstore's add_texts, with the full example attached as metadata.
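In other words, the text that ends up in the vector store is just the example's values joined together, sorted by key. Roughly speaking (a simplified sketch, not the library's exact code), it behaves like this:

def to_vectorstore_text(example: dict, input_keys=None) -> str:
    # Keep only the configured input keys, if any were specified.
    if input_keys:
        example = {key: example[key] for key in input_keys}
    # Join the values, sorted by their keys, into a single string.
    return " ".join(str(example[key]) for key in sorted(example))

# {"input": "happy", "output": "sad"} is stored as the text "happy sad",
# while the full example dict is kept as metadata.
print(to_vectorstore_text({"input": "happy", "output": "sad"}))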

The add_example of these two selectors is the same. Only the select_examples method is different.

Among them, SemanticSimilarityExampleSelector calls the similarity_search method of vectorstore to implement similarity search.

MaxMarginalRelevanceExampleSelector calls the max_marginal_relevance_search method of vectorstore to implement the search.

The search algorithms of the two are different.

Because they use a vector database, the way they are constructed is a little different from the other selectors:

examples = [
    {"input": "happy", "output": "sad"},
    {"input": "tall", "output": "short"},
    {"input": "energetic", "output": "lethargic"},
    {"input": "sunny", "output": "gloomy"},
    {"input": "windy", "output": "calm"},
]

from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts.example_selector import SemanticSimilarityExampleSelector
from langchain.vectorstores import Chroma

example_selector = SemanticSimilarityExampleSelector.from_examples(
    examples,
    # the embeddings to use
    OpenAIEmbeddings(),
    # the vector store class
    Chroma,
    # the number of examples to return
    k=1
)
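With the selector built, we can ask it for the examples most similar to a new input. The MMR variant is constructed in the same way; in the sketch below, FAISS is used as the vector store and fetch_k controls how many candidates are fetched before the diversity-aware re-ranking. The concrete parameter values here are purely illustrative:

from langchain.prompts.example_selector import MaxMarginalRelevanceExampleSelector
from langchain.vectorstores import FAISS

# Pick the single example most similar to the new input.
print(example_selector.select_examples({"input": "worried"}))
# "worried" is an emotion, so the "happy"/"sad" example is the likely match.

mmr_selector = MaxMarginalRelevanceExampleSelector.from_examples(
    examples,
    OpenAIEmbeddings(),
    FAISS,
    k=2,
    fetch_k=10,
)
print(mmr_selector.select_examples({"input": "worried"}))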

NGramOverlapExampleSelector

The last one to introduce is NGramOverlapExampleSelector. This selector uses an ngram overlap matrix to select similar inputs.

The specific implementation algorithms and principles will not be introduced here. Anyone who is interested can explore on their own.

This selector also does not require the use of a vector database.

It works like this:

from langchain.prompts.example_selector.ngram_overlap import NGramOverlapExampleSelector

example_selector = NGramOverlapExampleSelector(
    examples=examples,
    example_prompt=example_prompt,
    threshold=-1.0,
)

There is one parameter here that we have not seen before, called threshold.

For negative thresholds: the selector sorts the examples by ngram overlap score and excludes none of them.

For thresholds greater than 1.0: the selector excludes all examples and returns an empty list.

For a threshold equal to 0.0: the selector sorts the examples by ngram overlap score and excludes those that have no ngram overlap with the input.
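As a rough sketch of how the threshold behaves, we can change it on the selector built above and compare the selection results. The input string here is arbitrary, and the exact scores depend on the ngram computation:

# Negative threshold: keep every example, sorted by ngram overlap score.
example_selector.threshold = -1.0
print(example_selector.select_examples({"input": "energetic and sunny"}))

# Threshold 0.0: drop examples that share no ngrams with the input at all.
example_selector.threshold = 0.0
print(example_selector.select_examples({"input": "energetic and sunny"}))

# Threshold above 1.0: every example is excluded and an empty list comes back.
example_selector.threshold = 1.0 + 1e-9
print(example_selector.select_examples({"input": "energetic and sunny"}))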

Summary

With these selectors, we can pick out specific examples from the ones we provide and send only the selected results to the large language model.

This effectively reduces wasted tokens.

Origin blog.csdn.net/superfjj/article/details/132165353