[LangChain Series] 4. Vector Stores

LangChain No.4

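Table of Contents

Text embedding models
- Introduction to text embedding models
- Using a text embedding model
Using a vector store
- Creating an index from texts
- Creating an index from a file
Types of vector stores
How to choose a vector store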

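One of the most common ways to store and search unstructured data is to embed it and store the resulting embedding vectors. At query time, the unstructured query is embedded as well, and the stored vectors that are "most similar" to the embedded query are retrieved. A vector store takes care of storing the embedded data and performing the vector search.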
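What "most similar" means depends on the distance measure the store uses. A common choice is cosine similarity, which compares the direction of two vectors. The following is a minimal sketch of that measure only; the function name cosineSimilarity is hypothetical and is not taken from LangChain.

```typescript
// Cosine similarity between two equal-length vectors.
// Returns a value in [-1, 1]; values closer to 1 mean "more similar".
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as "not similar".
  if (normA === 0 || normB === 0) {
    return 0;
  }
  return dot / Math.sqrt(normA * normB);
}
```

Vectors that point in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.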

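A key step in using a vector store is turning text into vectors and storing them in the database. This is usually done with an embedding model, so before working with a vector store you first need to be familiar with text embedding models.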

Text embedding models

Introduction to text embedding models

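Many providers offer text embedding models, for example OpenAI, Cohere, and Hugging Face. LangChain defines an abstract Embeddings class for interacting with them. An embedding is a vector representation of a piece of text. Having this numeric representation means we can reason about text in a vector space: operations that need to find text fragments, such as semantic search, become a search for nearby vectors in that space, which is very useful for finding similar documents.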

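Embeddings is the base class LangChain defines for text embedding models. It exposes two methods, embedQuery and embedDocuments. The former embeds a query and takes a single text as input; the latter embeds documents and takes multiple texts as input. They are kept as two separate methods because some embedding providers use different methods for documents and for queries.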
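To make the shape of these two methods concrete, here is a toy sketch that needs no API key. It is not a real embedding model: ToyEmbeddings and letterCounts are hypothetical names, and the "embedding" is simply a 26-dimensional count of the letters a to z. Only the method names and their input/output shapes follow the description above.

```typescript
// Toy "embedding": count how often each letter a-z occurs in the text.
function letterCounts(text: string): number[] {
  const counts: number[] = new Array<number>(26).fill(0);
  const lower = text.toLowerCase();
  for (let i = 0; i < lower.length; i++) {
    const idx = lower.charCodeAt(i) - 97; // 'a' has char code 97
    if (idx >= 0 && idx < 26) {
      counts[idx] += 1;
    }
  }
  return counts;
}

// Hypothetical class with the same two method shapes as described above.
class ToyEmbeddings {
  // One text in, one vector out.
  async embedQuery(text: string): Promise<number[]> {
    return letterCounts(text);
  }

  // Many texts in, one vector per text out (same order as the input).
  async embedDocuments(texts: string[]): Promise<number[][]> {
    return texts.map((t) => letterCounts(t));
  }
}
```

A real provider would replace letterCounts with a call to its model, but the calling code would still receive number[] from embedQuery and number[][] from embedDocuments.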

Using a text embedding model

The following example shows how to use OpenAI's embedding model. First, import the OpenAIEmbeddings class from langchain/embeddings/openai:

import { OpenAIEmbeddings } from "langchain/embeddings/openai";

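Next, create an instance of the embedding model with the OpenAIEmbeddings class. The code below uses the embedQuery method to create an embedding of the simple text Hello world. As you can see, the result is a vector of numbers.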

/* Create instance */
const embeddings = new OpenAIEmbeddings();

/* Embed queries */
const res = await embeddings.embedQuery("Hello world");

/*
[
   -0.004845875,   0.004899438,  -0.016358767,  -0.024475135, -0.017341806,
    0.012571548,  -0.019156644,   0.009036391,  -0.010227379, -0.026945334,
    0.022861943,   0.010321903,  -0.023479493, -0.0066544134,  0.007977734,
   0.0026371893,   0.025206111,  -0.012048521,   0.012943339,  0.013094575,
   -0.010580265,  -0.003509951,   0.004070787,   0.008639394, -0.020631202,
  -0.0019203906,   0.012161949,  -0.019194454,   0.030373365, -0.031028723,
   0.0036170771,  -0.007813894, -0.0060778237,  -0.017820721, 0.0048647798,
   -0.015640393,   0.001373733,  -0.015552171,   0.019534737, -0.016169721,
    0.007316074,   0.008273906,   0.011418369,   -0.01390117, -0.033347685,
    0.011248227,  0.0042503807,  -0.012792102, -0.0014595914,  0.028356876,
    0.025407761, 0.00076445413,  -0.016308354,   0.017455231, -0.016396577,
    0.008557475,   -0.03312083,   0.031104341,   0.032389853,  -0.02132437,
    0.003324056,  0.0055610985, -0.0078012915,   0.006090427, 0.0062038545,
      0.0169133,  0.0036391325,  0.0076815626,  -0.018841568,  0.026037913,
    0.024550753,  0.0055264398, -0.0015824712, -0.0047765584,  0.018425668,
   0.0030656934, -0.0113742575, -0.0020322427,   0.005069579, 0.0022701253,
    0.036095154,  -0.027449455,  -0.008475555,   0.015388331,  0.018917186,
   0.0018999106,  -0.003349262,   0.020895867,  -0.014480911, -0.025042271,
    0.012546342,   0.013850759,  0.0069253794,   0.008588983, -0.015199285,
  -0.0029585673,  -0.008759124,   0.016749462,   0.004111747,  -0.04804285,
  ... 1436 more items
]
*/

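You can also try the embedDocuments method and inspect the resulting document embeddings. The code below passes ["Hello world", "Bye bye"] to it to generate simple document embeddings. You can also use a document loader to load a complete document and see how different documents are converted.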

/* Embed documents */
const documentRes = await embeddings.embedDocuments(["Hello world", "Bye bye"]);

In short, a text embedding model produces a vector representation of text, which makes tasks such as document retrieval easier to implement.

Using a vector store

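This section demonstrates MemoryVectorStore, which keeps the embeddings it creates in memory and performs an exact, linear search for the most similar embeddings.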
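To illustrate what an exact, linear search involves, here is a minimal sketch. It is not MemoryVectorStore's actual implementation: StoredVector, cosine, and linearSearch are hypothetical names, and cosine similarity is an assumed choice of measure. The idea is simply to score every stored vector against the query and keep the k best.

```typescript
// One stored entry: the original text plus its embedding vector.
interface StoredVector {
  content: string;
  embedding: number[];
}

// Cosine similarity (assumed measure for this sketch).
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Exact linear search: compare the query with every entry, sort by score
// in descending order, and return the top k as [content, score] pairs.
function linearSearch(
  query: number[],
  store: StoredVector[],
  k: number
): Array<[string, number]> {
  return store
    .map((item): [string, number] => [item.content, cosine(query, item.embedding)])
    .sort((x, y) => y[1] - x[1])
    .slice(0, k);
}
```

Because every entry is compared, the result is exact, but the cost grows linearly with the number of stored vectors. That trade-off is acceptable for small in-memory collections and is the reason larger deployments tend to use an indexed vector database instead.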

Creating an index from texts

First, import the vector store and the OpenAIEmbeddings class:

import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";

Then create a vector store instance with the .fromTexts method and run a similarity search with the .similaritySearch method.

const vectorStore = await MemoryVectorStore.fromTexts(
  ["Hello world", "Bye bye", "hello nice world"],
  [{ id: 2 }, { id: 1 }, { id: 3 }],
  new OpenAIEmbeddings()
);

const resultOne = await vectorStore.similaritySearch("hello world", 1);
console.log(resultOne);

The result is as follows:

/*
  [
    Document {
      pageContent: "Hello world",
      metadata: { id: 2 }
    }
  ]
*/

Creating an index from a file

Besides the vector store and the OpenAIEmbeddings class, creating an index from a file also requires the TextLoader file loader. First, import the required classes:

import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";
import { TextLoader } from "langchain/document_loaders/fs/text";

Next, read the file contents with the loader, load the documents into the vector store with the fromDocuments method, and finally search with the similaritySearch method.

// Create docs with a loader
const loader = new TextLoader("src/document_loaders/example_data/example.txt");
const docs = await loader.load();

// Load the docs into the vector store
const vectorStore = await MemoryVectorStore.fromDocuments(
  docs,
  new OpenAIEmbeddings()
);

// Search for the most similar document
const resultOne = await vectorStore.similaritySearch("hello world", 1);

console.log(resultOne);

/*
  [
    Document {
      pageContent: "Hello world",
      metadata: { id: 2 }
    }
  ]
*/

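You can create a vector store from documents, from a list of texts and their associated metadata, or from an existing index. LangChain defines two methods for creating a vector store, fromTexts and fromDocuments, which take the following parameters (from the langchain source):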

abstract class BaseVectorStore implements VectorStore {
  static fromTexts(
    texts: string[],
    metadatas: object[] | object,
    embeddings: Embeddings,
    dbConfig: Record<string, any>
  ): Promise<VectorStore>;

  static fromDocuments(
    docs: Document[],
    embeddings: Embeddings,
    dbConfig: Record<string, any>
  ): Promise<VectorStore>;
}
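The metadatas parameter of fromTexts accepts either an array or a single object. One plausible reading, sketched below as an assumption rather than as LangChain's actual code, is that an array is matched to the texts by position while a single object is shared by every text. The names Doc and textsToDocs are hypothetical.

```typescript
// Minimal document shape used by this sketch.
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Pair each text with its metadata:
// - array: metadata at the same index as the text (empty object if missing)
// - single object: the same metadata for every text
function textsToDocs(
  texts: string[],
  metadatas: Record<string, unknown>[] | Record<string, unknown>
): Doc[] {
  return texts.map((text, i) => {
    const metadata: Record<string, unknown> = Array.isArray(metadatas)
      ? metadatas[i] ?? {}
      : metadatas;
    return { pageContent: text, metadata };
  });
}
```

Seen this way, fromTexts is a convenience wrapper: once texts and metadata have been paired into documents, the remaining work is the same as for fromDocuments.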

The examples above use the similaritySearch method provided by the VectorStore interface. VectorStore also provides other methods (from the langchain source):

interface VectorStore {
  /**
   * Add more documents to an existing VectorStore.
   * Some providers support additional parameters, e.g. to associate custom ids
   * with added documents or to change the batch size of bulk inserts.
   * Returns an array of ids for the documents or nothing.
   */
  addDocuments(
    documents: Document[],
    options?: Record<string, any>
  ): Promise<string[] | void>;

  /**
   * Search for the most similar documents to a query
   */
  similaritySearch(
    query: string,
    k?: number,
    filter?: object | undefined
  ): Promise<Document[]>;

  /**
   * Search for the most similar documents to a query,
   * and return their similarity score
   */
  similaritySearchWithScore(
    query: string,
    k = 4,
    filter: object | undefined = undefined
  ): Promise<[object, number][]>;

  /**
   * Turn a VectorStore into a Retriever
   */
  asRetriever(k?: number): BaseRetriever;

  /**
   * Delete embedded documents from the vector store matching the passed in parameter.
   * Not supported by every provider.
   */
  delete(params?: Record<string, any>): Promise<void>;

  /**
   * Advanced: Add more documents to an existing VectorStore,
   * when you already have their embeddings
   */
  addVectors(
    vectors: number[][],
    documents: Document[],
    options?: Record<string, any>
  ): Promise<string[] | void>;

  /**
   * Advanced: Search for the most similar documents to a query,
   * when you already have the embedding of the query
   */
  similaritySearchVectorWithScore(
    query: number[],
    k: number,
    filter?: object
  ): Promise<[Document, number][]>;
}
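Of the methods listed, asRetriever is perhaps the least self-explanatory. Conceptually, a retriever can be thought of as a similarity search whose k is fixed in advance, so that callers only pass a query. The sketch below illustrates that idea under simplifying assumptions: it is synchronous (the real interface is Promise-based), it uses a keyword match as a stand-in for vector similarity, and the names Doc, SearchFn, keywordSearch, and the free function asRetriever are hypothetical rather than LangChain's implementation.

```typescript
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Any function that returns up to k documents for a query.
type SearchFn = (query: string, k: number) => Doc[];

// Wrap a search function so that k is fixed up front.
// The default of 4 mirrors the default k shown in the interface above.
function asRetriever(search: SearchFn, k: number = 4) {
  return {
    getRelevantDocuments(query: string): Doc[] {
      return search(query, k);
    },
  };
}

// Stand-in search for the demo: case-insensitive substring match.
function keywordSearch(docs: Doc[]): SearchFn {
  return (query, k) =>
    docs
      .filter((d) => d.pageContent.toLowerCase().indexOf(query.toLowerCase()) !== -1)
      .slice(0, k);
}
```

For example, asRetriever(keywordSearch(docs), 1).getRelevantDocuments("hello") returns at most one matching document. The benefit of this shape is that downstream code, such as a question-answering chain, does not need to know which vector store or which k sits behind the retriever.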

Types of vector stores

LangChain supports many vector stores. Common ones include Memory, Elasticsearch, Redis, and Chroma.

How to choose a vector store

Below is a quick guide, organized by use case, to help you choose the right vector store:

- If you’re after something that can just run inside your Node.js application, in-memory, without any other servers to stand up, then go for HNSWLib, Faiss, or LanceDB.
- If you’re looking for something that can run in-memory in browser-like environments, then go for MemoryVectorStore.
- If you come from Python and you were looking for something similar to FAISS, try HNSWLib or Faiss.
- If you’re looking for an open-source full-featured vector database that you can run locally in a docker container, then go for Chroma.
- If you’re looking for an open-source vector database that offers low-latency, local embedding of documents and supports apps on the edge, then go for Zep.
- If you’re looking for an open-source production-ready vector database that you can run locally (in a docker container) or hosted in the cloud, then go for Weaviate.
- If you’re using Supabase already then look at the Supabase vector store to use the same Postgres database for your embeddings too.
- If you’re looking for a production-ready vector store you don’t have to worry about hosting yourself, then go for Pinecone.
- If you are already utilizing SingleStore, or if you find yourself in need of a distributed, high-performance database, you might want to consider the SingleStore vector store.
- If you are looking for an online MPP (Massively Parallel Processing) data warehousing service, you might want to consider the AnalyticDB vector store.
- If you’re in search of a cost-effective vector database that allows you to run vector search with SQL, look no further than MyScale.