NLP (69) Intelligent Document Q&A Assistant Upgrade

This article takes the author's previously developed large-model intelligent document Q&A project a step further, adding support for multiple document types and URL links as well as access to multiple large models, making it more convenient and efficient to use.

Project Introduction

In the article NLP (61): Using the Baichuan-13B-Chat Model to Build an Intelligent Document Q&A Assistant, the author introduced in detail how to use the Baichuan-13B-Chat model to build an intelligent document question and answer assistant.

Generally, the flow chart of using a large model to implement the document question and answer function is as follows:

LangChain document Q&A process
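In code, the typical flow can be sketched roughly as follows: load the document, split it into chunks, embed and index the chunks, retrieve the chunks most relevant to the question, and then ask the large model to answer based on them. The sketch below is only an illustration, using FAISS as an in-memory vector store, an example embedding model, and a placeholder model call; this project's actual retrieval is based on ES and Milvus, and the file path and model names here are assumptions.

# A generic LangChain-style document Q&A pipeline (illustrative sketch, not this project's exact code)
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# 1. Load the document and split it into chunks
documents = TextLoader("files/gdp.txt", encoding='utf-8').load()
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(documents)

# 2. Embed the chunks and build a vector index (FAISS here; any sentence-embedding model can be used)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = FAISS.from_documents(chunks, embeddings)

# 3. Retrieve the chunks most relevant to the question and build a prompt
question = "What does the document say about GDP?"
related_docs = vector_store.similarity_search(question, k=3)
context = "\n".join(doc.page_content for doc in related_docs)
prompt = f"Answer the question based on the context.\n\nContext:\n{context}\n\nQuestion: {question}"

# 4. Send the prompt to the large model and return its answer
# answer = llm(prompt)  # placeholder: any chat LLM API can be called here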

This time, the author takes the previous project one step further and adds support for the following features:

  • Supports multiple format documents (including txt, pdf, docx) and URL links
  • Q&A visualization page
  • Questions and answers can be traced and highlighted.
  • Single/multiple model calls
  • Comparison of model effects

These features are described as follows:

  1. Parsing of the supported document formats is provided by LangChain; URL link parsing is also provided by LangChain, via selenium and unstructured, so JavaScript-rendered pages can be handled. However, web page parsing (or crawling) is a complex and arduous task, and this project cannot cover the parsing of every web page.
  2. The visual question and answer page is implemented with the Gradio module.
  3. It supports single-model or multi-model calls and provides question-and-answer traceability. It also supports comparing the answers of different models; this idea comes from OpenCompass.

In terms of engineering development, the added features are as follows:

  • Rich usage documentation
  • A configuration file
  • Logging
  • User-dictionary support for the ES tokenizer
  • A configurable threshold for Milvus preliminary screening (see the sketch after this list)
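As a concrete illustration of the last two items, below is a minimal, hypothetical Python sketch showing how such configuration values might be applied, with a similarity threshold used to filter Milvus recall results during preliminary screening. The keys, paths, and scores are illustrative only and are not taken from the project's actual configuration file.

# Hypothetical configuration values (illustrative only; the project's actual config keys may differ)
CONFIG = {
    "milvus": {"score_threshold": 0.75, "top_k": 5},
    "es": {"user_dict_path": "dict/user_dict.txt"},  # user dictionary for the ES tokenizer
}


def filter_by_threshold(hits, threshold=CONFIG["milvus"]["score_threshold"]):
    # Keep only recall candidates whose similarity score passes the configured threshold.
    return [(text, score) for text, score in hits if score >= threshold]


if __name__ == '__main__':
    # Example: raw Milvus recall results as (text, similarity score) pairs
    recalled = [("passage A", 0.91), ("passage B", 0.62), ("passage C", 0.80)]
    print(filter_by_threshold(recalled))  # [('passage A', 0.91), ('passage C', 0.80)]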

This project has been open-sourced on GitHub. For the code implementation, please refer to document_qa_with_llm; the code details will not be explained here.

Supported document formats

This project originally supported only the txt format; it now supports documents in multiple formats (including txt, pdf, and docx) as well as URL links. This is thanks to the document loading module in the LangChain framework, which makes loading documents in various formats more unified, simple, and efficient.

The file parsing script in this project is as follows:

# -*- coding: utf-8 -*-
from langchain.document_loaders import TextLoader, PyPDFLoader, Docx2txtLoader, SeleniumURLLoader

from utils.logger import logger


class FileParser(object):
    def __init__(self, file_path):
        self.file_path = file_path

    def txt_loader(self):
        documents = TextLoader(self.file_path, encoding='utf-8').load()
        return documents

    def pdf_loader(self):
        loader = PyPDFLoader(self.file_path)
        documents = loader.load_and_split()
        return documents

    def docx_loader(self):
        loader = Docx2txtLoader(self.file_path)
        documents = loader.load()
        return documents

    def url_loader(self):
        loader = SeleniumURLLoader(urls=[self.file_path])
        documents = loader.load()
        return documents

    def parse(self):
        logger.info(f'parse file: {self.file_path}')
        if self.file_path.endswith(".txt"):
            return self.txt_loader()
        elif self.file_path.endswith(".pdf"):
            return self.pdf_loader()
        elif self.file_path.endswith(".docx"):
            return self.docx_loader()
        elif "http" in self.file_path:
            return self.url_loader()
        else:
            logger.error("unsupported document type!")
            return []


if __name__ == '__main__':
    txt_file_path = "/Users/admin/PycharmProjects/document_qa_with_llm/files/gdp.txt"
    content = FileParser(txt_file_path).parse()
    print(content)

    pdf_file_path = "/Users/admin/PycharmProjects/document_qa_with_llm/files/oppo_n3_flip.pdf"
    content = FileParser(pdf_file_path).parse()
    print(content)

    docx_file_path = "/Users/admin/PycharmProjects/document_qa_with_llm/files/haicaihua.docx"
    content = FileParser(docx_file_path).parse()
    print(content)

    url = "https://gaokao.xdf.cn/202303/12967078.html"
    url = "https://www.hntv.tv/50rd/article/1/1700396378818207745?v=1.0"
    content = FileParser(url).parse()
    print(content)

Q&A test

The document upload page is as follows. It supports document upload in multiple formats and URL parsing (depending on page parsing capabilities). The page is relatively rough.

File upload page

The uploaded files will be placed in the files folder. Sample documents can be referenced in the files folder in the Github project.

  • txt file

Let's take files/dengyue.txt as an example; the questions and answers are as follows:

Do you know Griffin's position?
Griffin's title is NASA Administrator.

What did Griffin say in his speech?
According to document knowledge, Griffin said the following in his speech: He believes that if the Chinese are willing, they can achieve a manned moon landing project in 2020. In addition, Academician Ye Peijian once expressed his opinion that 2025 is more appropriate. However, according to the 50-year long-term plan compiled by the Chinese Academy of Sciences, 2030 is the time when China will realize the manned moon landing project.

  • pdf file

Let's take files/oppo_n3_flip.pdf as an example; the questions and answers are as follows:

What is the name of OPPO’s latest folding screen phone?
OPPO's latest foldable phone is the OPPO Find N3 Flip.

Has Tencent released a self-developed large model? When was it released?
Yes, Tencent has released a self-developed large language model called "Hunyuan Large Model". It was officially unveiled at the 2023 Tencent Global Digital Ecosystem Conference. The specific time is September 7th.

  • docx file

Let's take files/haicaihua.docx as an example; the questions and answers are as follows:

What requirements does sea cauliflower have for the growing environment?
Sea cauliflower has extremely high requirements on the growth environment and can only grow in clean water with high transparency. It is known as the "touchstone" of water quality.

  • URL link

Let's take https://gaokao.xdf.cn/202303/12967078.html as an example; the questions and answers are as follows:

How many students will University of Electronic Science and Technology of China enroll in 2022?
The total enrollment plan of the University of Electronic Science and Technology of China in 2022 is 5,030 students, of which the "University of Electronic Science and Technology of China" will enroll more than 3,300 students nationwide, and the "University of Electronic Science and Technology of China (Shahe Campus)" will enroll more than 1,700 students from some provinces.

The official website of University of Electronic Science and Technology of China?
The official website of UESTC is: http://www.zs.uestc.edu.cn/

Visual Q&A

In addition to the previous API calls, this project also supports visual question answering. This function is implemented with the Gradio module and supports Q&A directly on the page, as well as multi-model calls. The supported large models are as follows:

  • Baichuan-13B-Chat: A model released by Baichuan Intelligence, now updated to Baichuan2
  • LLAMA-2-Chinese-13b-Chat: A Chinese conversation model fine-tuned on the LLAMA 2 model
  • internlm-chat-7b: Scholar (InternLM) dialogue model released by Shanghai Artificial Intelligence Laboratory

These are all Chinese large models. In theory, which models can be supported is determined by FastChat and by the model and number of GPUs deployed; this project only considers the three above.

This page supports multi-model or single-model Q&A. In multi-model Q&A, you can compare the answers of different models under the same prompt, as a way of evaluating the models.

Single model question and answer

Multi-model Q&A
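As a rough illustration of how such a page can be wired up, the sketch below sends the same document-grounded prompt to several models through FastChat's OpenAI-compatible chat API and displays the answers side by side in a Gradio interface. The endpoint address, model names, and prompt format are assumptions for illustration and are not taken from the project's actual code.

# Illustrative multi-model Q&A page: Gradio front end + FastChat's OpenAI-compatible API (assumed setup)
import requests
import gradio as gr

FASTCHAT_API = "http://localhost:8000/v1/chat/completions"  # assumed FastChat endpoint
MODELS = ["Baichuan-13B-Chat", "LLAMA-2-Chinese-13b-Chat", "internlm-chat-7b"]


def ask_model(model_name, prompt):
    # Call one model served by FastChat via its OpenAI-compatible chat endpoint.
    payload = {"model": model_name, "messages": [{"role": "user", "content": prompt}]}
    resp = requests.post(FASTCHAT_API, json=payload, timeout=120)
    return resp.json()["choices"][0]["message"]["content"]


def multi_model_answer(question, context):
    # Build a document-grounded prompt and query every model for comparison.
    prompt = f"Answer the question based on the context.\n\nContext:\n{context}\n\nQuestion: {question}"
    return [ask_model(m, prompt) for m in MODELS]


demo = gr.Interface(
    fn=multi_model_answer,
    inputs=[gr.Textbox(label="Question"), gr.Textbox(label="Retrieved context", lines=8)],
    outputs=[gr.Textbox(label=m) for m in MODELS],
    title="Multi-model document Q&A (illustrative)",
)

if __name__ == '__main__':
    demo.launch()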

At the same time, this page also supports Q&A traceability, which tracks the reference text and the data source used to produce each answer.

Q&A traceability

Text highlighting in question and answer tracing

Since the table component in Gradio does not support in-cell text highlighting, we use Gradio's highlighted-text control to highlight the referenced text in the Q&A traceability view, making it easier to confirm where the answer content appears in the original text and helping to avoid large-model hallucination.

The text highlighting algorithm in question and answer tracing is as follows:

  1. Find the list of quoted texts used for the Q&A, as produced by ES and Milvus
  2. Split the quoted text into a list of segments
  3. Find the segment most similar to the answer, using the Jaccard coefficient as the similarity measure
  4. Highlight the parts of that segment that overlap with the answer (see the sketch below)

Text highlighting in question and answer tracing
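A minimal Python sketch of steps 2-4 is shown below: the quoted text is split into sentence-like segments, the segment with the highest character-level Jaccard similarity to the answer is selected, and the characters it shares with the answer are marked. The sentence-splitting rule and the highlight markup are simplifications for illustration; the project's actual implementation may differ.

# Illustrative Jaccard-based highlighting for Q&A traceability (not the project's exact code)
import re


def jaccard(a: str, b: str) -> float:
    # Character-level Jaccard coefficient between two strings.
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def highlight_reference(quoted_text: str, answer: str):
    # 2. Split the quoted text into sentence-like segments (simplified rule).
    segments = [s for s in re.split(r"[。！？.!?\n]", quoted_text) if s.strip()]
    # 3. Pick the segment most similar to the answer by Jaccard coefficient.
    best = max(segments, key=lambda s: jaccard(s, answer))
    # 4. Mark the characters of the best segment that also appear in the answer.
    answer_chars = set(answer)
    highlighted = "".join(f"[{ch}]" if ch in answer_chars else ch for ch in best)
    return best, highlighted


if __name__ == '__main__':
    quoted = "Griffin is the NASA Administrator. He gave a speech about a manned lunar landing."
    answer = "Griffin's title is NASA Administrator."
    print(highlight_reference(quoted, answer))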

Summary

Building on the previously open-sourced version, this project adds richer functionality, including parsing of multiple document formats and URLs, a Q&A visualization page, single/multi-model calls, and multi-model effect comparison.

This project has been open-sourced on GitHub; the code implementation can be found in document_qa_with_llm.


Welcome to follow my official account, NLP Fantasy Journey, where original technical articles are pushed first.

Welcome to follow my Knowledge Planet, "Natural Language Processing Fantasy Journey"; the author is working hard to build his own technical community.

Reference link

[1] Large model intelligent document question and answer project: https://github.com/percent4/document_qa_with_llm
[2] OpenCompass: https://opencompass.org.cn/
[3] document_qa_with_llm: https://github.com/percent4/document_qa_with_llm
[4] Document loading module: https://python.langchain.com/docs/integrations/document_loaders/
[5] FastChat: https://github.com/lm-sys/FastChat
[6] document_qa_with_llm: https://github.com/percent4/document_qa_with_llm
