reference:
GitHub - mayooear/gpt4-pdf-chatbot-langchain: GPT4 & LangChain Chatbot for large PDF docs
1. Summary:
Build a ChatGPT-style chatbot for multiple large PDF files using the new GPT-4 API.
The technology stack includes LangChain, Pinecone, TypeScript, OpenAI, and Next.js. LangChain is a framework that makes it easier to build scalable AI/LLM (large language model) applications and chatbots. Pinecone is a vector store that holds the embeddings of the PDF text so that similar passages can be retrieved later.
2. Preparation work:
OpenAI API key (GPT-3.5 or GPT-4): openai
Pinecone API key / environment / index: pinecone
Pinecone indexes on the Starter (free) plan are deleted after 7 days of inactivity. To prevent this, send an API request to Pinecone before the 7 days elapse to reset the counter; that way you can keep using the free plan.
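Any authenticated read against your project should count as "activity". A minimal keep-alive sketch is below; the controller endpoint shown is the legacy Pinecone REST API for listing indexes, and the function name is my own, so verify both against the current Pinecone docs before relying on this.

```typescript
// Hypothetical keep-alive ping for a Starter-plan index. Listing the project's
// indexes is a lightweight authenticated request; the endpoint below is the
// legacy Pinecone controller API and may differ in newer Pinecone versions.
async function pineconeKeepAlive(apiKey: string, environment: string): Promise<number> {
  const res = await (globalThis as any).fetch(
    `https://controller.${environment}.pinecone.io/databases`,
    { headers: { 'Api-Key': apiKey } },
  );
  return res.status; // 200 means the request (and thus the "activity") succeeded
}
```

Running this on a schedule (e.g. a weekly cron job) keeps the counter from ever reaching 7 days.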
3. Clone or download the project gpt4-pdf-chatbot-langchain
git clone https://github.com/mayooear/gpt4-pdf-chatbot-langchain.git
4. Install dependency packages
Use npm to install yarn. If you don't have npm, please refer to the installation guide:
npm/Node.js introduction and quick installation (Linux CentOS), Entropy-Go's blog, CSDN
npm install yarn -g
Then use yarn to install the dependencies.
Enter the project root directory and execute the command
yarn install
After successful installation, you can see the node_modules directory
gpt4-pdf-chatbot-langchain-main$ ls -a
. declarations .eslintrc.json node_modules .prettierrc styles utils yarn.lock
.. docs .gitignore package.json public tailwind.config.cjs venv
components .env .idea pages README.md tsconfig.json visual-guide
config .env.example next.config.js postcss.config.cjs scripts types yarn-error.log
5. Environment configuration
Copy .env.example to a new .env configuration file:
OPENAI_API_KEY=sk-xxx
# Update these with your pinecone details from your dashboard.
# PINECONE_INDEX_NAME is in the indexes tab under "index name" in blue
# PINECONE_ENVIRONMENT is in indexes tab under "Environment". Example: "us-east1-gcp"
PINECONE_API_KEY=xxx
PINECONE_ENVIRONMENT=us-west1-gcp-free
PINECONE_INDEX_NAME=xxx
Modify config/pinecone.ts
In the config folder, set PINECONE_NAME_SPACE to the namespace in which you want to store your embeddings when you run npm run ingest. The same namespace is used later for querying and retrieval.
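For reference, a sketch of what config/pinecone.ts looks like (the variable names match the repo; the 'pdf-test' namespace value is just an example — use your own lowercase namespace):

```typescript
// Sketch of config/pinecone.ts, assuming the repo's variable names.
if (!process.env.PINECONE_INDEX_NAME) {
  throw new Error('Missing Pinecone index name in .env file');
}

const PINECONE_INDEX_NAME = process.env.PINECONE_INDEX_NAME ?? '';

// Namespace used both by `npm run ingest` and by later retrieval queries.
const PINECONE_NAME_SPACE = 'pdf-test';

export { PINECONE_INDEX_NAME, PINECONE_NAME_SPACE };
```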
Modify the chatbot's prompt and OpenAI model
Change QA_PROMPT in utils/makechain.ts for your own use case.
If you have access to the GPT-4 API, change modelName in new OpenAI to gpt-4. Verify outside of this repo that you actually have GPT-4 API access, otherwise the application will not work.
import { OpenAI } from 'langchain/llms/openai';
import { PineconeStore } from 'langchain/vectorstores/pinecone';
import { ConversationalRetrievalQAChain } from 'langchain/chains';

const CONDENSE_PROMPT = `Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question.

Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:`;

const QA_PROMPT = `You are a helpful AI assistant. Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say you don't know. DO NOT try to make up an answer.
If the question is not related to the context, politely respond that you are tuned to only answer questions that are related to the context.

{context}

Question: {question}
Helpful answer in markdown:`;

export const makeChain = (vectorstore: PineconeStore) => {
  const model = new OpenAI({
    temperature: 0, // increase temperature to get more creative answers
    modelName: 'gpt-3.5-turbo', // change this to gpt-4 if you have access
  });

  const chain = ConversationalRetrievalQAChain.fromLLM(
    model,
    vectorstore.asRetriever(),
    {
      qaTemplate: QA_PROMPT,
      questionGeneratorTemplate: CONDENSE_PROMPT,
      returnSourceDocuments: true, // 4 source documents are returned by default
    },
  );
  return chain;
};
6. Add PDF documents as knowledge base
Because the documents' contents will be sent to OpenAI and Pinecone, consider data privacy and security carefully before uploading.
Upload one or more PDF documents to the docs directory
Execute upload command
npm run ingest
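Under the hood, the ingest step splits each PDF's text into overlapping chunks before embedding and upserting them to Pinecone (at the time of writing the repo uses LangChain's RecursiveCharacterTextSplitter with a chunk size of 1000). A dependency-free sketch of the chunking idea, with illustrative size/overlap values:

```typescript
// Simplified illustration only: LangChain's real splitter also breaks on
// natural separators (paragraphs, sentences) rather than at fixed offsets.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached
  }
  return chunks;
}

// e.g. a 2,500-character document yields 3 overlapping chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from either side, which improves answer quality at retrieval time.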
Check whether the upload is successful on Pinecone
7. Run the knowledge base chatbot
Once you've verified that the embeddings and content have been successfully added to your Pinecone index, run npm run dev to start the local development environment, then enter a question in the chat interface to start a conversation.
Execute the command:
npm run dev
8. FAQ / Troubleshooting
https://github.com/mayooear/gpt4-pdf-chatbot-langchain#troubleshooting
In general, keep an eye on the issues and discussions sections of this repo for solutions.
General errors
- Make sure you're running the latest Node version. Run node -v.
- Try a different PDF or convert your PDF to text first. It's possible your PDF is corrupted, scanned, or requires OCR to convert to text.
- console.log the env variables and make sure they are exposed.
- Make sure you're using the same versions of LangChain and Pinecone as this repo.
- Check that you've created an .env file that contains your valid (and working) API keys, environment and index name.
- If you change modelName in OpenAI, make sure you have access to the API for that model.
- Make sure you have enough OpenAI credits and a valid card on your billing account.
- Check that you don't have multiple OPENAI keys in your global environment. If you do, the project's local env file will be overridden by the system env variable.
- Try hard-coding your API keys into the process.env variables if there are still issues.
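To make the env-variable check above concrete, here is a small helper you could run with ts-node; the variable names come from .env.example, and the helper itself is not part of the repo:

```typescript
// Returns the required variables that are missing or empty.
// Names taken from .env.example; the function is an illustrative helper.
const REQUIRED_VARS = [
  'OPENAI_API_KEY',
  'PINECONE_API_KEY',
  'PINECONE_ENVIRONMENT',
  'PINECONE_INDEX_NAME',
];

function missingEnvVars(env: Record<string, string | undefined>): string[] {
  return REQUIRED_VARS.filter((name) => !env[name]);
}

// usage: console.log(missingEnvVars(process.env)); // [] when everything is set
```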
Pinecone errors
- Make sure your Pinecone dashboard environment and index match the ones in the pinecone.ts and .env files.
- Check that you've set the vector dimensions to 1536.
- Make sure your Pinecone namespace is in lowercase.
- Pinecone indexes of users on the Starter(free) plan are deleted after 7 days of inactivity. To prevent this, send an API request to Pinecone to reset the counter before 7 days.
- Retry from scratch with a new Pinecone project, index, and cloned repo.