Andrew Ng "LangChain Chat with Your Data" course notes

Course address: https://learn.deeplearning.ai/langchain-chat-with-your-data/lesson/1/introduction

1. Introduction

LLMs such as ChatGPT can answer many kinds of questions, but on their own they only know what they were trained on, not other content such as personal data or real-time information from the Internet. It would be very useful if individual users could use an LLM to converse with their own documents and have it answer their questions. The LangChain framework makes this possible.

LangChain is an open-source development framework for building LLM applications. It consists of many modular components and end-to-end prompt templates, and it ships with many rich use cases that combine these components for convenience.

LangChain's document loaders can pull in data from a wide variety of data sources (the course illustrates this with a flow chart).
This course mainly covers the following topics:

  • One of the preprocessing steps, splitting the documents into semantically meaningful chunks: a seemingly simple step with many details to consider.
  • Semantic search: retrieving the information relevant to a user's question.
  • Handling edge cases where retrieval fails.
  • Using the retrieved documents to answer the user's question.
  • Memory: the key to a chatbot that can hold a conversation about your data.

2. Document Loading

The first step in building a chatbot that talks to your data is loading the documents into a usable format.

LangChain's document loaders convert data from different sources into a standardized format. For example, PDF, HTML, and JSON data from different websites and databases can all be converted into a standard Document object that an LLM can consume; it contains the content plus associated metadata.
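For reference, here is a minimal sketch of such a document object (assuming the langchain.schema module of the LangChain version used in this course):

from langchain.schema import Document

# A Document pairs the raw text with metadata describing where it came from.
doc = Document(
    page_content="MachineLearning-Lecture01 Instructor (Andrew Ng): Okay. Good morning. ...",
    metadata={"source": "docs/cs229_lectures/MachineLearning-Lecture01.pdf", "page": 0},
)
print(doc.metadata["source"])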
LangChain includes 80+ different document loaders.

2.1 Retrieval Augmented Generation(RAG)

In Retrieval Augmented Generation (RAG), the LLM retrieves contextual documents from an external dataset as part of its execution. This is useful when we want to ask questions about specific documents (e.g. PDFs, videos, etc.).
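As a preview of where the course is going, here is a minimal sketch of that flow; the individual pieces (the vector store, the retriever, the chain) are built step by step in Sections 4 to 6, so treat this only as an outline:

from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# vectordb is a vector store over the document chunks (built in Section 4.3).
retriever = vectordb.as_retriever()   # fetches the chunks relevant to a question
llm = ChatOpenAI(temperature=0)       # the model that writes the final answer
qa = RetrievalQA.from_chain_type(llm, retriever=retriever)
print(qa({"query": "What did they say about MATLAB?"})["result"])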


2.2 Load PDFs

#! pip install langchain
#! pip install pypdf 
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

Here is an example of loading a PDF document using LangChain:

from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()

In the loaded data structure, each page is a Document object. A Document contains text (page_content) and metadata.

page = pages[0]
print(page.page_content[0:500])
MachineLearning-Lecture01  
Instructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine 
learning class. So what I wanna do today is ju st spend a little time going over the logistics 
of the class, and then we'll start to  talk a bit about machine learning.  
By way of introduction, my name's  Andrew Ng and I'll be instru ctor for this class. And so 
I personally work in machine learning, and I' ve worked on it for about 15 years now, and 
I actually think that machine learning i

Metadata consists of two parts, source and page, representing the source and page number information respectively:

page.metadata
{'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}

2.3 Load YouTube

The following is an example of loading the content of a YouTube video through LangChain. It uses OpenAI's Whisper model to transcribe the audio into text.

from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader
# ! pip install yt_dlp
# ! pip install pydub

The parameters are simple: url specifies the video link, and save_dir specifies the directory where the downloaded audio is saved.

url="https://www.youtube.com/watch?v=jGwO_UgTS7I"
save_dir="docs/youtube/"
loader = GenericLoader(
    YoutubeAudioLoader([url],save_dir),
    OpenAIWhisperParser()
)
docs = loader.load()

This process can be time consuming as the video is downloaded.

[youtube] Extracting URL: https://www.youtube.com/watch?v=jGwO_UgTS7I
[youtube] jGwO_UgTS7I: Downloading webpage
[youtube] jGwO_UgTS7I: Downloading android player API JSON
[info] jGwO_UgTS7I: Downloading 1 format(s): 140
[download] docs/youtube//Stanford CS229: Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a has already been downloaded
[download] 100% of   69.71MiB
[ExtractAudio] Not converting audio docs/youtube//Stanford CS229: Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a; file is already in target format m4a
Transcribing part 1!
docs[0].page_content[0:500]

By extension, this LangChain capability can extract and summarize the vast amount of high-quality educational material on the Internet, converting it into text that is easy to read and store and turning it into a personalized knowledge base for learning.

2.4 Load URLs

Here is an example of using a web content loader:

from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/37signals-is-you.md")
docs = loader.load()
print(docs[0].page_content[:500])

2.5 Load Notion

Below is an example of using the Notion directory loader. To follow the example:

1) Duplicate the page into your own Notion space and export it as Markdown/CSV.

2) Unzip the export and save the resulting Markdown files in a folder such as docs/Notion_DB.


from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()
print(docs[0].page_content[0:200])
# Blendle's Employee Handbook

This is a living document with everything we've learned working with people while running a startup. And, of course, we continue to learn. Therefore it's a document that
docs[0].metadata
{'source': "docs/Notion_DB/Blendle's Employee Handbook e367aa77e225482c849111687e114a56.md"}

3. Document Splitting

Once the data is in a standard Document format, it can be split into chunks, because document question answering only needs the few parts of the document that are relevant to the question.

This step happens before storing into the vector database, and it is both important and surprisingly involved: there are many details that, if handled poorly, will cause problems in later steps.

The most common naive approach is to split purely by string length. Its drawback is that a complete sentence can end up split across different chunks; the semantics are lost, and the user's question cannot be answered because no single chunk contains the complete information.
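A quick illustration of the problem in plain Python (not LangChain): slicing on a fixed character count cuts straight through a sentence, so neither chunk carries the complete statement.

text = "The course covers supervised learning. It also covers unsupervised learning."
chunk_size = 40
# Naive fixed-length splitting: take 40 characters at a time, ignoring sentence boundaries.
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
print(chunks)
# ['The course covers supervised learning. I', 't also covers unsupervised learning.']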

3.1 Splitter Flow

In LangChain, all text splitters work on the same principle: split according to a chunk size and an overlap between adjacent chunks, where:

  • chunk_size refers to the chunk size. It can be measured in different ways by passing in a length function; usually it is measured in characters or in tokens.
  • chunk_overlap refers to the overlap between two adjacent chunks.

The reason for the overlap is that keeping part of the end of the previous chunk at the start of the next chunk helps preserve context across chunk boundaries.

3.2 Character Splitter

LangChain currently includes several different types of text splitters. The main differences between them are how chunks are separated and what characters a chunk consists of. They also differ in how chunk length is measured, for example by character count or by token count; some even use small auxiliary models to determine where sentences end.

Two text splitters are described below: the character text splitter and the recursive character text splitter.

The recursive splitter's separators parameter takes a list, which means it will first try to split on double newlines (splitting the document into paragraphs), then on single newlines (splitting paragraphs into sentences), then on spaces (splitting sentences into words), and finally character by character.

from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
chunk_size =26
chunk_overlap = 4
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
text1 = 'abcdefghijklmnopqrstuvwxyz'
r_splitter.split_text(text1)
['abcdefghijklmnopqrstuvwxyz']
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'
r_splitter.split_text(text2)
['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
r_splitter.split_text(text3)
['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
c_splitter.split_text(text3)
['a b c d e f g h i j k l m n o p q r s t u v w x y z']
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator = ' '
)
c_splitter.split_text(text3)
['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

RecursiveCharacterTextSplitter is better suited for general text splitting.

some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""
len(some_text)
496
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator = ' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0, 
    separators=["\n\n", "\n", " ", ""]
)
c_splitter.split_text(some_text)
['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,',
 'have a space.and words are separated by space.']
r_splitter.split_text(some_text)
["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", "\. ", " ", ""]
)
r_splitter.split_text(some_text)
["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related",
 '. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns',
 '. Carriage returns are the "backslash n" you see embedded in this string',
 '. Sentences have a period at the end, but also, have a space.and words are separated by space.']
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", "(?<=\. )", " ", ""]
)
r_splitter.split_text(some_text)
["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related.",
 'For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns.',
 'Carriage returns are the "backslash n" you see embedded in this string.',
 'Sentences have a period at the end, but also, have a space.and words are separated by space.']
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)
docs = text_splitter.split_documents(pages)
len(docs)
77
len(pages)
22
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/Notion_DB")
notion_db = loader.load()
docs = text_splitter.split_documents(notion_db)
len(notion_db)
52
len(docs)
353

3.3 Token Splitter

The reason for a token-based splitter is that the LLM's context window limit is counted in tokens. It is therefore important to know where the token boundaries fall; splitting text by tokens lets us chunk the text from the LLM's point of view and usually gives better results.
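To see how character counts and token counts differ, here is a small sketch (it assumes the tiktoken package, which implements the tokenizers used by OpenAI models):

import tiktoken

# cl100k_base is the encoding used by the gpt-3.5-turbo / gpt-4 model family.
enc = tiktoken.get_encoding("cl100k_base")
text = "foo bar bazzyfoo"
tokens = enc.encode(text)
print(f"{len(text)} characters, {len(tokens)} tokens")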

from langchain.text_splitter import TokenTextSplitter
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)
text1 = "foo bar bazzyfoo"
text_splitter.split_text(text1)
['foo', ' bar', ' b', 'az', 'zy', 'foo']
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
docs = text_splitter.split_documents(pages)
docs[0]
Document(page_content='MachineLearning-Lecture01  \n', metadata={'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0})
pages[0].metadata
{'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}

3.4 Markdown Splitter

In practice, we may want to attach more metadata to each chunk, such as positional information: where the chunk sits in the document or relative to other content. This information often provides useful extra context when answering questions about the document.

MarkdownHeaderTextSplitter does exactly this. It splits a Markdown file on its headings and subheadings and then adds those headings to each chunk's metadata.

It also lets you customize which heading levels to split on. The configuration is a list of tuples, where the first element is the heading separator and the second is the name under which that heading is stored in the metadata.

from langchain.document_loaders import NotionDirectoryLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter
markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n 
## Chapter 2\n\n \
Hi this is Molly"""
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits[0]
Document(page_content='Hi this is Jim  \nHi this is Joe', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'})
md_header_splits[1]
Document(page_content='Hi this is Lance', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'})
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()
txt = ' '.join([d.page_content for d in docs])
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(txt)
md_header_splits[0]

Document(page_content="This is a living document with everything we've learned working with people while running a startup. And, of course, we continue to learn. Therefore it's a document that will continue to change.  \n**Everything related to working at Blendle and the people of Blendle, made public.**  \nThese are the lessons from three years of working with the people of Blendle. It contains everything from [how our leaders lead](https://www.notion.so/ecfb7e647136468a9a0a32f1771a8f52?pvs=21) to [how we increase salaries](https://www.notion.so/Salary-Review-e11b6161c6d34f5c9568bb3e83ed96b6?pvs=21), from [how we hire](https://www.notion.so/Hiring-451bbcfe8d9b49438c0633326bb7af0a?pvs=21) and [fire](https://www.notion.so/Firing-5567687a2000496b8412e53cd58eed9d?pvs=21) to [how we think people should give each other feedback](https://www.notion.so/Our-Feedback-Process-eb64f1de796b4350aeab3bc068e3801f?pvs=21) — and much more.  \nWe've made this document public because we want to learn from you. We're very much interested in your feedback (including weeding out typo's and Dunglish ;)). Email us at [email protected]. If you're starting your own company or if you're curious as to how we do things at Blendle, we hope that our employee handbook inspires you.  \nIf you want to work at Blendle you can check our [job ads here](https://blendle.homerun.co/). If you want to be kept in the loop about Blendle, you can sign up for [our behind the scenes newsletter](https://blendle.homerun.co/yes-keep-me-posted/tr/apply?token=8092d4128c306003d97dd3821bad06f2).", metadata={
    'Header 1': "Blendle's Employee Handbook"})

4. Vector Stores and Embeddings

After the documents have been split into small, semantically meaningful chunks, they can be indexed. Then, when answering a question, the most relevant chunks can easily be retrieved from the index as context, and the LLM generates the answer from them.

4.1 Embedding

An embedding is a vector of real numbers (each component roughly between -1 and 1). The more semantically similar two texts are, the closer their embedding vectors are in the vector space.
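As a minimal sketch of how "closeness" is measured (for unit-length embeddings such as OpenAI's, cosine similarity reduces to the dot product used later in this section):

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors; 1.0 means identical direction.
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))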


4.2 Vector Store Workflow

The workflow: each document chunk is converted into an embedding and stored in the vector store. At query time the question is embedded as well, the most similar vectors are looked up, and the corresponding chunks are returned as context for the LLM.

4.3 Usage Examples

LangChain integrates 30+ different vector stores. The examples in this section use Chroma, a lightweight vector database that can run in memory and is very easy to use.

from langchain.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Duplicate documents on purpose - messy data
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture03.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)
splits = text_splitter.split_documents(docs)
len(splits)
209
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()
sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)
import numpy as np

# 1536
len(embedding1)

# 0.22333593666553497
np.max(embedding1)

# -0.6714226007461548
np.min(embedding1)
np.dot(embedding1, embedding2)
0.9631923847517209
np.dot(embedding1, embedding3)
0.7711111095422276
np.dot(embedding2, embedding3)
0.7596334120325523

Vector storage:

# ! pip install chromadb
!rm -rf ./docs/chroma  # remove old database files if any
from langchain.vectorstores import Chroma

persist_directory = 'docs/chroma/'

vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

print(vectordb._collection.count())
209

Similarity search:

question = "is there an email i can ask for help"

docs = vectordb.similarity_search(question,k=3)
# persist the vector database to disk
vectordb.persist()

4.4 Edge Cases

One edge case shown here is that the same document content is passed to the language model twice; the second, duplicate chunk adds no value, and it would be better to replace it with a different chunk so the LLM gets more context.

Another edge case: suppose we ask a question such as "what did they say about regression in the third lecture?"

Intuitively, the model should return only regression-related content from the third lecture and nothing outside that range. In practice, however, the search results include everything related to the word "regression", without restricting the results to the third lecture.

If semantic search is based purely on embeddings, then embedding the whole question emphasizes the word "regression" but fails to capture the structured constraint "third lecture".

5. Retrieval

This section covers several retrieval techniques that go beyond plain similarity search and address the edge cases discussed in Section 4.4.

5.1 Maximum marginal relevance(MMR)

The intuition behind Maximum Marginal Relevance (MMR) search is that if you always select the document in the embedding space that is most similar to the query, you are likely to miss a variety of different information.

The following example searches for "mushroom". A pure similarity search returns only the information most similar to "mushroom" and misses facts such as "some mushrooms are poisonous". MMR solves this because it selects a diverse set of documents.

MMR works as follows: when a query is issued, an initial set of documents is fetched purely by similarity, controlled by the fetch_k parameter. MMR then filters these documents according to both semantic similarity and diversity, and the final k documents are returned to the user.

5.2 SelfQuery

SelfQuery uses a language model to split the question into two parts: a metadata filter and a search term. Most vector databases support metadata filtering, so results can easily be filtered on metadata such as "year=1980" in the course's example.

5.3 Compression

Compression is useful for extracting only the most relevant parts of the retrieved passages. For example, a query may hit several passages in a document; an LLM-based compressor can condense them into shorter summaries, which are then fed to the final LLM as context to generate the answer. The cost is extra LLM calls, which adds latency and expense, but it lets the final answer focus on the most important content.

5.4 Usage Examples

#!pip install lark
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
persist_directory = 'docs/chroma/'

embedding = OpenAIEmbeddings()
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

# 209
print(vectordb._collection.count())
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

smalldb = Chroma.from_texts(texts, embedding=embedding)

question = "Tell me about all-white mushrooms with large fruiting bodies"

smalldb.similarity_search(question, k=2)
[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.', metadata={}),
 Document(page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).', metadata={})]
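For comparison, MMR on the same small database shows the two-stage fetch_k/k behavior described in Section 5.1 (a minimal sketch; the exact documents returned depend on the embeddings):

# First fetch the 3 most similar texts, then let MMR keep 2 that are relevant but diverse,
# which gives the "poisonous" text a chance to appear alongside the most similar one.
smalldb.max_marginal_relevance_search(question, k=2, fetch_k=3)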

Similarity search:

question = "what did they say about matlab?"
docs_ss = vectordb.similarity_search(question,k=3)
docs_ss[0].page_content[:100]
'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people '
docs_ss[1].page_content[:100]
'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people '

As can be seen, when the data contains duplicates (dirty data), similarity search does not deduplicate; it simply returns the duplicates.

Increase the diversity of retrieval through MMR:

docs_mmr = vectordb.max_marginal_relevance_search(question,k=3)
docs_mmr[0].page_content[:100]
'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people '
docs_mmr[1].page_content[:100]
'algorithm then? So what’s different? How come  I was making all that noise earlier about \nleast squa'

MMR ensures diversity: even when the data contains duplicates, the results it returns are effectively deduplicated.

Specificity can be addressed with the vector database's metadata filtering:

question = "what did they say about regression in the third lecture?"

docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf"}
)

for d in docs:
    print(d.metadata)
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 0}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 14}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 4}

Metadata filter conditions are inferred from the query itself:

from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `docs/cs229_lectures/MachineLearning-Lecture01.pdf`, `docs/cs229_lectures/MachineLearning-Lecture02.pdf`, or `docs/cs229_lectures/MachineLearning-Lecture03.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]

document_content_description = "Lecture notes"
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

question = "what did they say about regression in the third lecture?"
docs = retriever.get_relevant_documents(question)
query='regression' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='source', value='docs/cs229_lectures/MachineLearning-Lecture03.pdf') limit=None
for d in docs:
    print(d.metadata)
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 14}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 0}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 10}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 10}

The two approaches above (manually supplying a metadata filter, and letting the model infer the filter automatically) solve the problem from Section 4.4 where pure similarity search cannot focus on structured information.

Example use of the compression technique:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))

# Wrap our vectorstore
llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)
Document 1:

"MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms."
----------------------------------------------------------------------------------------------------
Document 2:

"MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms."
----------------------------------------------------------------------------------------------------
Document 3:

"And the student said, "Oh, it was the MATLAB." So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard, and we'll actually have a short MATLAB tutorial in one of the discussion sections for those of you that don't know it."
----------------------------------------------------------------------------------------------------
Document 4:

"And the student said, "Oh, it was the MATLAB." So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard, and we'll actually have a short MATLAB tutorial in one of the discussion sections for those of you that don't know it."

Combining compression with MMR-based retrieval:

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type = "mmr")
)

question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)
Document 1:

"MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms."
----------------------------------------------------------------------------------------------------
Document 2:

"And the student said, "Oh, it was the MATLAB." So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard, and we'll actually have a short MATLAB tutorial in one of the discussion sections for those of you that don't know it."

5.5 Other Retrievals

Two other retrievers are introduced below: an SVM-based retriever and a TF-IDF retriever.

from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter


# Load PDF
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
all_page_text=[p.page_content for p in pages]
joined_page_text=" ".join(all_page_text)


# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1500,chunk_overlap = 150)
splits = text_splitter.split_text(joined_page_text)


# Retrieve
svm_retriever = SVMRetriever.from_texts(splits,embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)
question = "What are major topics for this class?"
docs_svm=svm_retriever.get_relevant_documents(question)
docs_svm[0]
Document(page_content="let me just check what questions you have righ t now. So if there are no questions, I'll just \nclose with two reminders, which are after class today or as you start to talk with other \npeople in this class, I just encourage you again to start to form project partners, to try to \nfind project partners to do your project with. And also, this is a good time to start forming \nstudy groups, so either talk to your friends  or post in the newsgroup, but we just \nencourage you to try to star t to do both of those today, okay? Form study groups, and try \nto find two other project partners.  \nSo thank you. I'm looking forward to teaching this class, and I'll see you in a couple of \ndays.   [End of Audio]  \nDuration: 69 minutes", metadata={
    })
question = "what did they say about matlab?"
docs_tfidf=tfidf_retriever.get_relevant_documents(question)
docs_tfidf[0]
Document(page_content="Saxena and Min Sun here did, wh ich is given an image like this, right? This is actually a \npicture taken of the Stanford campus. You can apply that sort of cl ustering algorithm and \ngroup the picture into regions. Let me actually blow that up so that you can see it more \nclearly. Okay. So in the middle, you see the lines sort of groupi ng the image together, \ngrouping the image into [inaudible] regions.  \nAnd what Ashutosh and Min did was they then  applied the learning algorithm to say can \nwe take this clustering and us e it to build a 3D model of the world? And so using the \nclustering, they then had a lear ning algorithm try to learn what the 3D structure of the \nworld looks like so that they could come up with a 3D model that you can sort of fly \nthrough, okay? Although many people used to th ink it's not possible to take a single \nimage and build a 3D model, but using a lear ning algorithm and that sort of clustering \nalgorithm is the first step. They were able to.  \nI'll just show you one more example. I like this  because it's a picture of Stanford with our \nbeautiful Stanford campus. So again, taking th e same sort of clustering algorithms, taking \nthe same sort of unsupervised learning algor ithm, you can group the pixels into different \nregions. And using that as a pre-processing step, they eventually built this sort of 3D model of Stanford campus in a single picture.  You can sort of walk  into the ceiling, look", metadata={
    })

6. Question Answering


After document retrieval is complete, the retrieval results can be fed into the final LLM to answer questions. This section describes several methods for accomplishing this task.

6.1 RetrievalQA Chain

When the retrieved results are very large, they may exceed the LLM's context window. In that case the results need to be condensed first. There are three common approaches, introduced in the earlier LangChain course:

  • Map_reduce: each retrieved chunk is summarized by the LLM individually, and the shorter summaries are then passed together as context to a final LLM call that answers the question. Its advantage is that it can handle any number of documents; its disadvantage is that when the relevant context is spread across different chunks, the answer may suffer.
  • Refine: works recursively. The first chunk is summarized by the LLM, that summary is combined with the second chunk and summarized again, and so on until a final summary is produced, which is then used to answer the question.
  • Map_rerank: similar to map_reduce, but each chunk's answer is given a score; the answers are ranked and the highest-scoring one is returned. (A sketch of this variant appears at the end of Section 6.2.)

6.2 Usage Examples


from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
persist_directory = 'docs/chroma/'
embedding = OpenAIEmbeddings()
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)


question = "What are major topics for this class?"
docs = vectordb.similarity_search(question,k=3)
len(docs)
from langchain.chat_models import ChatOpenAI
llm_name = "gpt-3.5-turbo"  # the chat model used in this course
llm = ChatOpenAI(model_name=llm_name, temperature=0)
from langchain.chains import RetrievalQA


qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever()
)
result = qa_chain({"query": question})
result["result"]
'The major topic for this class is machine learning. Additionally, the class may cover statistics and algebra as refreshers in the discussion sections. Later in the quarter, the discussion sections will also cover extensions for the material taught in the main lectures.'
from langchain.prompts import PromptTemplate

# Build prompt
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer. 
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)


# Run chain
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)


question = "Is probability a class topic?"
result = qa_chain({"query": question})
result["result"]
'Yes, probability is assumed to be a prerequisite for this class. The instructor assumes familiarity with basic probability and statistics, and will go over some of the prerequisites in the discussion sections as a refresher course. Thanks for asking!'
qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="map_reduce"
)

result = qa_chain_mr({"query": question})
result["result"]
'There is no clear answer to this question based on the given portion of the document. The document mentions familiarity with basic probability and statistics as a prerequisite for the class, and there is a brief mention of probability in the text, but it is not clear if it is a main topic of the class. The instructor mentions using a probabilistic interpretation to derive a learning algorithm, but does not go into further detail about probability as a topic.'
qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="refine"
)

result = qa_chain_mr({"query": question})
result["result"]
"The main topic of the class is machine learning, but the course assumes that students are familiar with basic probability and statistics, including random variables, expectation, variance, and basic linear algebra. The instructor will provide a refresher course on these topics in some of the discussion sections. Later in the quarter, the discussion sections will also cover extensions for the material taught in the main lectures. Machine learning is a vast field, and there are a few extensions that the instructor wants to teach but didn't have time to cover in the main lectures. The class will not be very programming-intensive, but some programming will be done in MATLAB or Octave."
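For completeness, the map_rerank variant described in Section 6.1 follows the same pattern; only chain_type changes (a sketch):

qa_chain_mrr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="map_rerank"  # answer each chunk separately, score it, return the top answer
)
result = qa_chain_mrr({"query": question})
result["result"]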

The answers returned by these three chain types differ, and their quality varies considerably: some quote or closely paraphrase the relevant passages, while others only partially answer the question. A separate limitation is that all of these chains are stateless; they do not remember previous questions or answers. The next section describes how to give the model a memory.


7. Chat

7.1 ConversationalRetrievalChain

ConversationalRetrievalChain builds on the retrieval QA chain by adding chat history: the history and the new question are first condensed into a standalone question, which is then used for retrieval before the answer is generated.

7.2 Usage Examples



from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
persist_directory = 'docs/chroma/'
embedding = OpenAIEmbeddings()
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)
question = "What are major topics for this class?"
docs = vectordb.similarity_search(question,k=3)
len(docs)
3
from langchain.chat_models import ChatOpenAI
llm_name = "gpt-3.5-turbo"  # the chat model used in this course
llm = ChatOpenAI(model_name=llm_name, temperature=0)
llm.predict("Hello world!")
'Hello there! How can I assist you today?'
# Build prompt
from langchain.prompts import PromptTemplate
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer. 
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate(input_variables=["context", "question"],template=template,)

# Run chain
from langchain.chains import RetrievalQA
question = "Is probability a class topic?"
qa_chain = RetrievalQA.from_chain_type(llm,
                                       retriever=vectordb.as_retriever(),
                                       return_source_documents=True,
                                       chain_type_kwargs={"prompt": QA_CHAIN_PROMPT})


result = qa_chain({"query": question})
result["result"]
'Yes, probability is assumed to be a prerequisite for this class. The instructor assumes familiarity with basic probability and statistics, and will go over some of the prerequisites in the discussion sections as a refresher course. Thanks for asking!'

Memory

from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

ConversationalRetrievalChain

from langchain.chains import ConversationalRetrievalChain
retriever=vectordb.as_retriever()
qa = ConversationalRetrievalChain.from_llm(
    llm,
    retriever=retriever,
    memory=memory
)
question = "Is probability a class topic?"
result = qa({"question": question})
result['answer']
'Yes, probability is a topic that will be assumed to be familiar to students in this class. The instructor assumes that students have familiarity with basic probability and statistics, and that most undergraduate statistics classes will be more than enough.'
question = "why are those prerequesites needed?"
result = qa({"question": question})
result['answer']
'The reason for requiring familiarity with basic probability and statistics as prerequisites for this class is that the class assumes that students already know what random variables are, what expectation is, what a variance or a random variable is. The class will not be very programming intensive, but will involve some programming in either MATLAB or Octave.'
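To see what the memory actually holds after these two turns, a small sketch (load_memory_variables returns the stored messages):

# With return_messages=True this is a list of HumanMessage / AIMessage objects
# covering both question-answer exchanges above.
memory.load_memory_variables({})["chat_history"]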

7.3 Create a chatbot that works on your documents

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.vectorstores import DocArrayInMemorySearch
from langchain.document_loaders import TextLoader
from langchain.chains import RetrievalQA,  ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader

The chatbot code has been updated a bit since filming. The GUI appearance also varies depending on the platform it is running on.

def load_db(file, chain_type, k):
    # load documents
    loader = PyPDFLoader(file)
    documents = loader.load()
    # split documents
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
    docs = text_splitter.split_documents(documents)
    # define embedding
    embeddings = OpenAIEmbeddings()
    # create vector database from data
    db = DocArrayInMemorySearch.from_documents(docs, embeddings)
    # define retriever
    retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": k})
    # create a chatbot chain. Memory is managed externally.
    qa = ConversationalRetrievalChain.from_llm(
        llm=ChatOpenAI(model_name=llm_name, temperature=0), 
        chain_type=chain_type, 
        retriever=retriever, 
        return_source_documents=True,
        return_generated_question=True,
    )
    return qa 

import panel as pn
import param

class cbfs(param.Parameterized):
    chat_history = param.List([])
    answer = param.String("")
    db_query  = param.String("")
    db_response = param.List([])
    
    def __init__(self,  **params):
        super(cbfs, self).__init__( **params)
        self.panels = []
        self.loaded_file = "docs/cs229_lectures/MachineLearning-Lecture01.pdf"
        self.qa = load_db(self.loaded_file,"stuff", 4)
    
    def call_load_db(self, count):
        if count == 0 or file_input.value is None:  # init or no file specified :
            return pn.pane.Markdown(f"Loaded File: {self.loaded_file}")
        else:
            file_input.save("temp.pdf")  # local copy
            self.loaded_file = file_input.filename
            button_load.button_style="outline"
            self.qa = load_db("temp.pdf", "stuff", 4)
            button_load.button_style="solid"
        self.clr_history()
        return pn.pane.Markdown(f"Loaded File: {self.loaded_file}")

    def convchain(self, query):
        if not query:
            return pn.WidgetBox(pn.Row('User:', pn.pane.Markdown("", width=600)), scroll=True)
        result = self.qa({"question": query, "chat_history": self.chat_history})
        self.chat_history.extend([(query, result["answer"])])
        self.db_query = result["generated_question"]
        self.db_response = result["source_documents"]
        self.answer = result['answer'] 
        self.panels.extend([
            pn.Row('User:', pn.pane.Markdown(query, width=600)),
            pn.Row('ChatBot:', pn.pane.Markdown(self.answer, width=600, style={'background-color': '#F6F6F6'}))
        ])
        inp.value = ''  #clears loading indicator when cleared
        return pn.WidgetBox(*self.panels,scroll=True)

    @param.depends('db_query')
    def get_lquest(self):
        if not self.db_query:
            return pn.Column(
                pn.Row(pn.pane.Markdown(f"Last question to DB:", styles={'background-color': '#F6F6F6'})),
                pn.Row(pn.pane.Str("no DB accesses so far"))
            )
        return pn.Column(
            pn.Row(pn.pane.Markdown(f"DB query:", styles={'background-color': '#F6F6F6'})),
            pn.pane.Str(self.db_query)
        )

    @param.depends('db_response', )
    def get_sources(self):
        if not self.db_response:
            return 
        rlist=[pn.Row(pn.pane.Markdown(f"Result of DB lookup:", styles={'background-color': '#F6F6F6'}))]
        for doc in self.db_response:
            rlist.append(pn.Row(pn.pane.Str(doc)))
        return pn.WidgetBox(*rlist, width=600, scroll=True)

    @param.depends('convchain', 'clr_history') 
    def get_chats(self):
        if not self.chat_history:
            return pn.WidgetBox(pn.Row(pn.pane.Str("No History Yet")), width=600, scroll=True)
        rlist=[pn.Row(pn.pane.Markdown(f"Current Chat History variable", styles={'background-color': '#F6F6F6'}))]
        for exchange in self.chat_history:
            rlist.append(pn.Row(pn.pane.Str(exchange)))
        return pn.WidgetBox(*rlist, width=600, scroll=True)

    def clr_history(self,count=0):
        self.chat_history = []
        return 

Create a chatbot

cb = cbfs()

file_input = pn.widgets.FileInput(accept='.pdf')
button_load = pn.widgets.Button(name="Load DB", button_type='primary')
button_clearhistory = pn.widgets.Button(name="Clear History", button_type='warning')
button_clearhistory.on_click(cb.clr_history)
inp = pn.widgets.TextInput( placeholder='Enter text here…')

bound_button_load = pn.bind(cb.call_load_db, button_load.param.clicks)
conversation = pn.bind(cb.convchain, inp) 

jpg_pane = pn.pane.Image( './img/convchain.jpg')

tab1 = pn.Column(
    pn.Row(inp),
    pn.layout.Divider(),
    pn.panel(conversation,  loading_indicator=True, height=300),
    pn.layout.Divider(),
)
tab2= pn.Column(
    pn.panel(cb.get_lquest),
    pn.layout.Divider(),
    pn.panel(cb.get_sources ),
)
tab3= pn.Column(
    pn.panel(cb.get_chats),
    pn.layout.Divider(),
)
tab4=pn.Column(
    pn.Row( file_input, button_load, bound_button_load),
    pn.Row( button_clearhistory, pn.pane.Markdown("Clears chat history. Can use to start a new topic" )),
    pn.layout.Divider(),
    pn.Row(jpg_pane.clone(width=400))
)
dashboard = pn.Column(
    pn.Row(pn.pane.Markdown('# ChatWithYourData_Bot')),
    pn.Tabs(('Conversation', tab1), ('Database', tab2), ('Chat History', tab3),('Configure', tab4))
)
dashboard


8. Conclusion

This course walked through the full pipeline for chatting with your own data: document loading, splitting, vector stores and embeddings, retrieval, question answering, and finally a chatbot with conversational memory.


Origin: blog.csdn.net/weixin_39653948/article/details/131874862