Intensive reading of "Using GPT to Interpret PDF"

ChatPDF has become very popular recently. After uploading a PDF file, you can interrogate it through question and answer: ask it to summarize the core ideas, answer specific questions, or make judgment calls about the opinions in the text.

Several trendy technologies are used behind it. Fortunately, the video ChatGPT for YOUR OWN PDF files with LangChain explains the principle behind it. I found it very exciting, so I took notes and added some thinking of my own, hoping it helps everyone.

Summary of technical ideas

Since GPT is very powerful, as long as you send it the content of the PDF article, it can answer any question you have about the article. -- End of article.

Wait, then why mention LangChain and a vector database at all? Because the content of a PDF article is too long: sending it directly to GPT easily exceeds the token limit, and even if unlimited tokens were allowed, a single question might cost 10-100 US dollars, which is unacceptable.

So here comes the black magic. The picture below is taken from the video ChatGPT for YOUR OWN PDF files with LangChain:
[Figure: pipeline diagram from the video]
Let’s interpret it step by step:

  1. Find a library to extract the text content from the PDF.
  2. Split that text into N smaller chunks, and use the OpenAI API to vectorize each chunk.
  3. When the user asks a question, vectorize the question too, and use a mathematical function to compute its similarity with the vectorized PDF content.
  4. Send the most similar chunks to OpenAI and let it summarize them and answer your question.

Implementation steps for using GPT to interpret a PDF
I will walk through each step from the video and add my own understanding.

Log in to Colab

You can run Python on your local computer and execute the steps one by one, or you can log in to Colab, Google's Python notebook platform. It provides a very convenient Python environment where you can execute code cell by cell and save your work, which makes it well suited for this kind of research.

As long as you have a Google account, you can use Colab.

Install dependencies

To run a bunch of GPT-related functions, you need to install a few packages. Although under the hood they all just keep sending HTTP requests to the OpenAI API, the encapsulation does make the code more semantic:

!pip install langchain
!pip install openai
!pip install PyPDF2
!pip install faiss-cpu
!pip install tiktoken
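As an aside, to see what "sending HTTP requests" means here without the wrappers, below is a minimal sketch of calling the embeddings endpoint directly with requests (it assumes the OPENAI_API_KEY environment variable, which we define in a later step):

import os
import requests

# POST to the OpenAI embeddings endpoint; the wrappers do essentially this
resp = requests.post(
    "https://api.openai.com/v1/embeddings",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={"model": "text-embedding-ada-002", "input": "this is a paragraph of text"},
)
print(len(resp.json()["data"][0]["embedding"]))  # ada-002 returns 1536 dimensions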

Among them, the tiktoken package is not mentioned in the tutorial; when I executed some of the code I was prompted that this package was missing, so you might as well install it in advance. Next, import some functions that will be used later:

from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import ElasticVectorSearch, Pinecone, Weaviate, FAISS

Define the OpenAI API key

To call the OpenAI API, you need to apply for an API key first. Once you have it, define it in the following way:

import os
os.environ["OPENAI_API_KEY"] = "***"

By default, both langchain and openai look up the key in the Python environment's os.environ, so after defining it here you can call the services directly.
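If you would rather not touch environment variables, the wrappers can also take the key explicitly; a minimal sketch (the openai_api_key parameter name is what these langchain classes accepted at the time of writing):

from langchain.llms import OpenAI
from langchain.embeddings.openai import OpenAIEmbeddings

# pass the key directly instead of relying on os.environ
llm = OpenAI(openai_api_key="***")
embeddings = OpenAIEmbeddings(openai_api_key="***")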

If you do not have an OpenAI account yet, look for a step-by-step registration tutorial. (Unfortunately, access is blocked in China. To get first-hand access to new technology, you need to find a VPN yourself, and possibly even pay for a foreign phone number service to receive the verification code. The process is rough, but I tested it myself and it works.)

Read PDF content

To make it easy to read the PDF on the Colab platform, first upload the PDF to your Google Drive, Google's personal cloud service, which integrates with Colab and file storage (PS: Microsoft's equivalent is called OneDrive; in theory you can use any tech giant's service).

After uploading, run the following code in Colab. An authorization page will pop up; once you authorize it, you can access the resources under your Drive path:

from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/"
reader = PdfReader('/content/gdrive/My Drive/2023_GPT4All_Technical_Report.pdf')

We read the 2023_GPT4All_Technical_Report.pdf report, the technical report for GPT4All, a model that claims to run locally while benchmarking itself against GPT-4.

Extract the PDF text and split it into small chunks

First execute the following code to read the PDF text content:

raw_text = ''
for i, page in enumerate(reader.pages):
  text = page.extract_text()
  if text:  # some pages may yield no extractable text; skip them
    raw_text += text
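A quick sanity check that the extraction worked (the counts are just whatever your PDF yields):

print(f"{len(reader.pages)} pages, {len(raw_text)} characters")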

Next, prepare to call the OpenAI API to vectorize the text. Because the number of tokens per call is limited, we need to split the large text into several smaller chunks:

text_splitter = CharacterTextSplitter(
    separator = "\n",
    chunk_size = 1000,
    chunk_overlap = 200,
    length_function = len,
)
texts = text_splitter.split_text(raw_text)

Here chunk_size = 1000 means each chunk holds 1000 characters, and chunk_overlap = 200 means each chunk repeats the last 200 characters of the previous one. The overlap stitches adjacent chunks together, so that a similarity search can surface as many relevant chunks as possible and recover more surrounding context.
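You can see the overlap for yourself with a quick check (note that CharacterTextSplitter cuts on the separator, so the repeated region is approximate rather than exactly 200 characters):

print(len(texts))       # number of chunks produced
print(texts[0][-200:])  # the tail of the first chunk...
print(texts[1][:200])   # ...largely reappears at the head of the second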

Vectorization is here!

The most important step: use the OpenAI API to vectorize the previously split text chunks:

embeddings = OpenAIEmbeddings()
docsearch = FAISS.from_texts(texts, embeddings)

It's that simple. docsearch is a wrapped object; in this step, the OpenAI API has been called several times behind the scenes to convert each chunk of text into a very long vector.

Text vectorization is a deep-water area of its own; there are good introductory videos on it. Simply put, it converts a piece of text into a series of numbers representing an N-dimensional vector. Once text is mapped into continuous numbers, you can do math on it: compute similarity between texts, and even add and subtract words (like Beijing - China + USA = Washington).

In short, after this step we have a local mapping from each chunk of text to its vector. For example, the vector corresponding to "this is a paragraph of text" might be [-0.231, 0.423, -0.2347831, …].
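To make "use a mathematical function to calculate similarity" concrete, here is a minimal sketch computing cosine similarity between two embedded sentences (the exact metric a FAISS index uses depends on how it was built; this is only to illustrate the idea):

import numpy as np

# embed two sentences with the same embeddings object as above
v1 = np.array(embeddings.embed_query("this is a paragraph of text"))
v2 = np.array(embeddings.embed_query("here is a chunk of text"))

# cosine similarity: closer to 1 means more semantically similar
cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(len(v1), cos)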

Use a chain to build the question-answering service

The next step is to wire up the complete flow. Initialize a QA chain that represents a question-and-answer session with the GPT model:

from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
chain = load_qa_chain(OpenAI(), chain_type="stuff")
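A note on chain_type="stuff": it simply "stuffs" all the matched chunks into a single prompt. If the matched text were too long for one call, langchain also ships other chain types; a sketch of one alternative, per the langchain docs:

# "map_reduce" answers over each chunk separately, then combines the results;
# more API calls, but it handles more text than fits in one prompt
chain = load_qa_chain(OpenAI(), chain_type="map_reduce")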

Then you can ask it questions about the PDF:

query = "who are the main author of the article?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)
#  The main authors of the article are Yuvanesh Anand, Zach Nussbaum, Brandon Duderstadt, Benjamin Schmidt, and Andriy Mulyar.

Of course, you can also ask questions in Chinese; OpenAI handles the translation for you:

query = "训练 GPT4ALL 的成本是多少?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)
#  根据文章,大约四天的工作,800美元的GPU成本(包括几次失败的训练)和500美元的OpenAI API开销。我们发布的模型gpt4all-lora大约在Lambda Labs DGX A100 8x 80GB上需要八个小时的训练,总成本约为100美元。

What happens during QA?

According to my understanding, when you ask "who are the main author of the article?", the following steps take place.

Step 1: Call the OpenAI API to vectorize the question, turning it into a vector.

Step 2: Use a mathematical function to match it against the local vector store, and find the text chunks with the highest similarity (the PDF chunks we split earlier).

Step 3: Send these most relevant chunks to the OpenAI API and let it summarize them for us.
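Under that understanding, the three steps can be hand-rolled without load_qa_chain; a minimal sketch (the prompt wording is my own guess, and the chain's actual prompt may differ):

query = "who are the main author of the article?"
docs = docsearch.similarity_search(query, k=4)       # steps 1 + 2: embed the question, match vectors
context = "\n\n".join(d.page_content for d in docs)  # step 3: stuff the matched chunks into a prompt
prompt = f"Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
print(OpenAI()(prompt))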

It is not known whether the third step involves multi-step answering via langchain. Next time I plan to capture packets and inspect the traffic between this program and the OpenAI API, so as to unlock the remaining secrets.

Of course, if a question can only be answered by summarizing the entire PDF, this vector-matching approach will not work well, because it only ever sends the most relevant text fragments along with the question. Then again, since the secret of the third step is still unsolved, it is quite possible that when the fragments are insufficient, GPT-4 asks for more similar fragments, repeating until it believes it can answer (a chilling thought, the more you dwell on it).

Summary

The technical idea behind interpreting PDFs can be applied to many other problems, such as web search:

Web search is a typical scenario of finding key information in an ocean of knowledge and interpreting it. As long as all web page content is vectorized and stored in some vector database, a GPT search engine can be built. The steps are:

  1. Segment the query keywords and vectorize them.
  2. Run a vector match in the database, and extract the contents of the highest-scoring web pages.
  3. Feed these contents to GPT, and let it summarize the knowledge inside and answer the user's question.

Vectorization can also solve fuzzy matching in any scenario. For example, my own memo app stores accounts and passwords for many platforms, but one day I searched for my ChatGPT password and could not find it; it turned out I had written the keyword as OpenAPI. Vectorization solves this: it can find entries in the memo even when the keywords do not literally match.
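A toy sketch of that memo search, reusing FAISS from above (the entries are made up, and the ranking is what I would expect from semantic similarity rather than a guarantee):

notes = [
    "OpenAPI account: foo@example.com / hunter2",
    "Bank card PIN: 1234",
    "Wi-Fi password: hello-world",
]
memo_search = FAISS.from_texts(notes, OpenAIEmbeddings())

# no literal keyword overlap, yet the OpenAPI entry should rank first
print(memo_search.similarity_search("ChatGPT password", k=1)[0].page_content)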

Combining vectorized search with GPT's ability to reason and summarize, what a super AI assistant can do will far exceed our imagination.

I leave you with a question to think about: combining the two capabilities of vectorization and GPT, what other usage scenarios can you come up with?

Reprinted from: blog.csdn.net/weixin_42814075/article/details/130323193