Build a cloud-based, customized PDF knowledge-base AI chatbot with GPT-4 and LangChain

Reference:

GitHub - mayooear/gpt4-pdf-chatbot-langchain: GPT4 & LangChain Chatbot for large PDF docs

1. Summary:

Build a ChatGPT-style chatbot for multiple large PDF files using the GPT-4 API.

The technology stack includes LangChain, Pinecone, TypeScript, OpenAI, and Next.js. LangChain is a framework that makes it easier to build scalable LLM (large language model) applications and chatbots. Pinecone is a vector database that stores embeddings of the PDF text so that similar documents can be retrieved later.

2. Preparation work:

OpenAI API key (GPT-3.5 or GPT-4), obtained from the OpenAI dashboard

Pinecone API key, environment, and index name, obtained from the Pinecone dashboard

Indexes on the Pinecone Starter (free) plan are deleted after 7 days of inactivity. To prevent this, send any API request to Pinecone before the 7 days elapse to reset the inactivity timer; that way you can continue to use the index for free.
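To make the timing concrete, here is a small helper (hypothetical, not part of the repo) that decides whether a keep-alive request is due, pinging a day early to leave a safety margin before the 7-day deletion window:

```typescript
// Ping with a day of margin: if 6 or more days have passed since the last
// request to the index, it's time to send a keep-alive request.
const SIX_DAYS_MS = 6 * 24 * 60 * 60 * 1000;

function keepAliveDue(lastRequest: Date, now: Date = new Date()): boolean {
  return now.getTime() - lastRequest.getTime() >= SIX_DAYS_MS;
}
```

Any lightweight request (for example, an index-stats query from the Pinecone client) counts as activity and resets the timer.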

3. Clone or download the project gpt4-pdf-chatbot-langchain

git clone https://github.com/mayooear/gpt4-pdf-chatbot-langchain.git

4. Install dependency packages

Use npm to install yarn. If you don't have npm, refer to an installation guide, for example:

npm/Node.js introduction and quick installation - Linux CentOS (Entropy-Go's blog, CSDN)

npm install yarn -g

Then use yarn to install the dependency packages.

Enter the project root directory and run:

yarn install

After a successful installation, you will see the node_modules directory:

gpt4-pdf-chatbot-langchain-main$ ls -a
.           declarations  .eslintrc.json  node_modules        .prettierrc  styles               utils           yarn.lock
..          docs          .gitignore      package.json        public       tailwind.config.cjs  venv
components  .env          .idea           pages               README.md    tsconfig.json        visual-guide
config      .env.example  next.config.js  postcss.config.cjs  scripts      types                yarn-error.log

5. Environment configuration

Copy .env.example to a .env configuration file:

OPENAI_API_KEY=sk-xxx

# Update these with your pinecone details from your dashboard.
# PINECONE_INDEX_NAME is in the indexes tab under "index name" in blue
# PINECONE_ENVIRONMENT is in indexes tab under "Environment". Example: "us-east1-gcp"
PINECONE_API_KEY=xxx
PINECONE_ENVIRONMENT=us-west1-gcp-free
PINECONE_INDEX_NAME=xxx

config/pinecone.ts modification

In the config folder (config/pinecone.ts), set PINECONE_NAME_SPACE to the namespace under which the embeddings should be stored when you run npm run ingest. The same namespace is used later for querying and retrieval.
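For reference, config/pinecone.ts looks roughly like the sketch below (the shape mirrors the repo at the time of writing, so verify against your clone; the 'pdf-test' namespace value is just an example to replace with your own):

```typescript
// config/pinecone.ts
if (!process.env.PINECONE_INDEX_NAME) {
  throw new Error('Missing Pinecone index name in .env file');
}

const PINECONE_INDEX_NAME = process.env.PINECONE_INDEX_NAME ?? '';

// Namespace under which `npm run ingest` stores the embeddings,
// and which is used again at query time. Change 'pdf-test' to your own.
const PINECONE_NAME_SPACE = 'pdf-test';

export { PINECONE_INDEX_NAME, PINECONE_NAME_SPACE };
```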

Modify the chatbot's prompt and OpenAI model

Change QA_PROMPT in utils/makechain.ts for your own use case.

If you have access to the GPT-4 API, change modelName in new OpenAI to gpt-4. Verify outside of this repo that you actually have GPT-4 API access; otherwise the application will not work.

import { OpenAI } from 'langchain/llms/openai';
import { PineconeStore } from 'langchain/vectorstores/pinecone';
import { ConversationalRetrievalQAChain } from 'langchain/chains';

const CONDENSE_PROMPT = `Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question.

Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:`;

const QA_PROMPT = `You are a helpful AI assistant. Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say you don't know. DO NOT try to make up an answer.
If the question is not related to the context, politely respond that you are tuned to only answer questions that are related to the context.

{context}

Question: {question}
Helpful answer in markdown:`;

export const makeChain = (vectorstore: PineconeStore) => {
  const model = new OpenAI({
    temperature: 0, // increase temperature to get more creative answers
    modelName: 'gpt-3.5-turbo', //change this to gpt-4 if you have access
  });

  const chain = ConversationalRetrievalQAChain.fromLLM(
    model,
    vectorstore.asRetriever(),
    {
      qaTemplate: QA_PROMPT,
      questionGeneratorTemplate: CONDENSE_PROMPT,
      returnSourceDocuments: true, //The number of source documents returned is 4 by default
    },
  );
  return chain;
};
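Inside the chain, the {chat_history} placeholder in CONDENSE_PROMPT is filled from the conversation so far; LangChain handles this internally. A hypothetical helper, purely to illustrate the kind of flat text that ends up in the prompt:

```typescript
// Each turn is a [question, answer] pair from the earlier conversation.
type Turn = [question: string, answer: string];

// Illustrative only: render the turns into the flat text that fills
// the {chat_history} slot of the condense prompt.
function formatChatHistory(history: Turn[]): string {
  return history
    .map(([q, a]) => `Human: ${q}\nAssistant: ${a}`)
    .join('\n');
}
```

The standalone question produced by CONDENSE_PROMPT is what actually gets embedded and sent to Pinecone for retrieval, which is why follow-up questions still find the right documents.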

6. Add PDF documents as knowledge base

Because data will be sent to OpenAI and Pinecone, carefully consider data privacy and security before uploading documents.

Upload one or more PDF documents to the docs directory

Run the ingest command:

npm run ingest
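Under the hood, the ingest script loads each PDF, splits the text into overlapping chunks (the repo uses roughly a 1000-character chunk size with 200 characters of overlap via LangChain's RecursiveCharacterTextSplitter), embeds each chunk, and upserts the vectors into Pinecone. A deliberately simplified character-based splitter, just to show the chunk/overlap idea:

```typescript
// Naive fixed-size splitter: each chunk overlaps the previous one by
// `chunkOverlap` characters so context isn't lost at chunk boundaries.
function splitIntoChunks(text: string, chunkSize = 1000, chunkOverlap = 200): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached
    start += chunkSize - chunkOverlap;           // step forward, keeping overlap
  }
  return chunks;
}
```

The overlap matters at query time: a sentence that straddles a chunk boundary still appears whole in at least one chunk, so its embedding can match the question.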

Check on the Pinecone dashboard that the upload succeeded.

7. Run the knowledge base chatbot

Once you've verified that the embeddings and content have been added to your Pinecone index, run npm run dev to start the local development environment, then enter a question in the chat interface to start a conversation.

Run the command:

npm run dev

8. FAQ / Troubleshooting

https://github.com/mayooear/gpt4-pdf-chatbot-langchain#troubleshooting

In general, keep an eye out in the issues and discussions section of this repo for solutions.

General errors

  • Make sure you're running the latest Node version. Run node -v
  • Try a different PDF or convert your PDF to text first. It's possible your PDF is corrupted, scanned, or requires OCR to convert to text.
  • Console.log the env variables and make sure they are exposed.
  • Make sure you're using the same versions of LangChain and Pinecone as this repo.
  • Check that you've created an .env file that contains your valid (and working) API keys, environment and index name.
  • If you change modelName in OpenAI, make sure you have access to the api for the appropriate model.
  • Make sure you have enough OpenAI credits and a valid card on your billing account.
  • Check that you don't have multiple OPENAI_API_KEY values in your global environment. If you do, the system environment variable will override the project's local .env file.
  • Try to hard code your API keys into the process.env variables if there are still issues.
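For the env-variable checks above, a small hypothetical helper that reports which required keys are unset or blank (pass process.env in the real app):

```typescript
// Return the names of required variables that are unset or blank.
function missingEnvVars(
  env: Record<string, string | undefined>,
  required: string[],
): string[] {
  return required.filter((name) => !env[name] || env[name]!.trim() === '');
}
```

Logging the result of missingEnvVars(process.env, ['OPENAI_API_KEY', 'PINECONE_API_KEY', 'PINECONE_ENVIRONMENT', 'PINECONE_INDEX_NAME']) at startup makes misconfigured environments obvious without printing the secret values themselves.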

Pinecone errors

  • Make sure your pinecone dashboard environment and index matches the one in the pinecone.ts and .env files.
  • Check that you've set the vector dimensions to 1536.
  • Make sure your pinecone namespace is in lowercase.
  • Pinecone indexes on the Starter (free) plan are deleted after 7 days of inactivity. To prevent this, send an API request to Pinecone before the 7 days elapse to reset the inactivity timer.
  • Retry from scratch with a new Pinecone project, index, and cloned repo.

Origin blog.csdn.net/holyvslin/article/details/132403863