Writing to the Zilliz Cloud vector database with Python

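# Step 1: Install the required libraries
# pip install pymilvus      # Milvus client library, used here to talk to Zilliz Cloud
# pip install scikit-learn  # provides sklearn's TfidfVectorizer for TF-IDF vectorization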

from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection
from sklearn.feature_extraction.text import TfidfVectorizer



# Steps 2 and 3: Connect to Zilliz Cloud
connections.connect(
    alias='default',
    # Public endpoint obtained from Zilliz Cloud
    uri="https://in03-1fcf3c08b7d1dbb.api.gcp-us-west1.zillizcloud.com",
    # API key, or a colon-separated cluster username and password
    token="XXXXXXX",
)


# Step 4: Preprocess and vectorize the data
# Parse the question-answer pairs into a list of (question, answer) tuples
qa_pairs = []
with open('婚姻法事.txt', 'r', encoding='utf-8') as file:
    lines = file.readlines()

    current_question = None
    current_answer = None
    for line in lines:
        if line.startswith("问题："):
            if current_question and current_answer:
                qa_pairs.append((current_question, current_answer))
            current_question = line.strip()[3:]
            current_answer = None
        elif line.startswith("回答："):
            current_answer = line.strip()[3:]

    if current_question and current_answer:
        qa_pairs.append((current_question, current_answer))

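# Split the pairs into parallel tuples of questions and answers
questions, answers = zip(*qa_pairs)

# Fit TF-IDF on the questions and convert the sparse result to a dense array
vectorizer = TfidfVectorizer()
question_vectors = vectorizer.fit_transform(questions).toarray()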


# Step 5: Store the vectors
collection_name = "qa_collection"
schema = CollectionSchema([
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),  # primary key field
    FieldSchema(name="question", dtype=DataType.FLOAT_VECTOR, dim=len(question_vectors[0]))
], description="question answer collection")
collection = Collection(name=collection_name, schema=schema)

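# Primary keys: each ID is the question's position, so it also indexes into `answers`
ids = list(range(len(question_vectors)))

# Insert the IDs and vectors into Zilliz Cloud
mr = collection.insert([ids, question_vectors.tolist()])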



# Step 6: Build the Q&A lookup

# Create an index
index_params = {
    "metric_type": "L2",        # distance metric, here L2 (Euclidean) distance
    "index_type": "IVF_FLAT",   # index type
    "params": {"nlist": 1024},  # index parameter: number of clusters
}
collection.create_index(field_name="question", index_params=index_params)

# Load the collection
collection.load()

# Search can now be run
def answer_question(question):
    # Vectorize the query with the vectorizer fitted on the stored questions
    question_vector = vectorizer.transform([question]).toarray()[0]
    search_params = {"metric_type": "L2", "params": {"nprobe": 10}}
    results = collection.search([question_vector.tolist()], "question", search_params, limit=1)
    # The top hit's ID is the position of the matching question, hence of its answer
    top_match = results[0][0]
    answer_index = top_match.id
    return answers[answer_index]

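# Test the Q&A system
## The source file contains this pair:
##问题：如何在英国申请离婚？
##回答：在英国申请离婚需要通过提交离婚申请到家庭法院，并提供婚姻破裂的证据。

# Works (identical to the stored question)
##test_question = "如何在英国申请离婚？"
# Does not work
##test_question = "在英国，怎么申请离婚？"
test_question = "在英国如何申请离婚？"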

print(answer_question(test_question))

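This Python script builds a simple question-answering (QA) system using pymilvus (the Python client for Milvus, the open-source vector database that Zilliz Cloud hosts) and scikit-learn (for TF-IDF vectorization). It works in the following steps:

Install the libraries and import the modules:
- Install pymilvus (the Milvus client library) and scikit-learn (for TF-IDF vectorization).
- Import the required names: the pymilvus classes for database operations, and TfidfVectorizer from sklearn for text processing.
Connect to Zilliz Cloud:
- Use pymilvus to open a connection to Zilliz Cloud. This requires a URI (the cluster's public endpoint) and a token for authentication.
Preprocess and vectorize the data:
- Read a text file of question-answer pairs (婚姻法事.txt) in which lines begin with "问题：" (question) or "回答：" (answer).
- Parse the pairs into a list, then vectorize the questions with TF-IDF.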
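
The parsing step can be tried without a file or a database. The following is a minimal sketch that applies the same "问题：" / "回答：" prefix logic to an in-memory string. The function name `parse_qa_pairs` and the sample text are illustrative and not part of the original script; unlike the script, the sketch strips each line before checking its prefix.

```python
def parse_qa_pairs(text):
    """Collect (question, answer) tuples from lines prefixed with 问题： or 回答：."""
    pairs = []
    question = answer = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("问题："):
            # A new question begins: keep the previous pair only if it is complete.
            if question and answer:
                pairs.append((question, answer))
            question, answer = line[len("问题："):], None
        elif line.startswith("回答："):
            answer = line[len("回答："):]
    # Keep the final pair, which no later question line would flush.
    if question and answer:
        pairs.append((question, answer))
    return pairs


# Hypothetical sample: the second question has no answer line.
sample = """问题：如何在英国申请离婚？
回答：在英国申请离婚需要通过提交离婚申请到家庭法院，并提供婚姻破裂的证据。
问题：没有回答的问题
问题：第二个问题
回答：第二个回答
"""
pairs = parse_qa_pairs(sample)
print(len(pairs))
```

A question with no answer line is silently dropped, which matches the behavior of the loop in the script.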
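Store the vectors:
- Create a Milvus collection named qa_collection and define its schema, which has a primary key field and a question field (the vector representation of the question).
- Insert the question vectors and their IDs into the collection. The answers are not stored in the collection; they stay in an in-memory list indexed by the same IDs.
Build the Q&A lookup:
- Create an index on the collection to speed up search.
- Load the collection so that it can be searched.
- Define an answer_question function that vectorizes a question, searches the collection for the closest stored question, and returns the answer at that match's ID.
Test the Q&A system:
- Pass a test question to answer_question and print the answer.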
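
To make the lookup concrete without a running cluster, the sketch below does conceptually what the search step does with `limit=1`: it finds the stored vector with the smallest L2 distance to the query and uses its ID as an index into the answers list. This is a brute-force stand-in in plain Python, not Milvus's IVF_FLAT index, and the toy vectors, answers, and helper names are made up for illustration.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_id(query, vectors):
    """Return the position of the stored vector closest to the query."""
    return min(range(len(vectors)), key=lambda i: l2_distance(query, vectors[i]))


# Toy data: three unit-length "question vectors" and their answers.
stored_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_answers = ["answer 0", "answer 1", "answer 2"]

query = [0.1, 0.9, 0.0]
best = nearest_id(query, stored_vectors)
print(best, toy_answers[best])

# An all-zero query is equally far from every unit-length vector.
zero_distances = [l2_distance([0.0, 0.0, 0.0], v) for v in stored_vectors]
print(zero_distances)
```

The last lines illustrate a pitfall: TfidfVectorizer normalizes each row to unit length by default, and a query containing no known terms is transformed to an all-zero vector, so every stored question is equally distant and the top hit is arbitrary.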

At its core, the script turns text questions into TF-IDF vectors and uses Milvus similarity search to find the closest stored question and return its answer.
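
The test comments in the script note that the exact stored question works while a paraphrase does not. One likely cause is tokenization: TfidfVectorizer's default `token_pattern` is `(?u)\b\w\w+\b`, which splits only at whitespace and punctuation, so an unsegmented Chinese sentence becomes a single token. The sketch below applies that pattern with the standard `re` module to show the effect, then contrasts it with character bigrams. The helper names are illustrative.

```python
import re

# scikit-learn's default token_pattern for TfidfVectorizer
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def word_tokens(text):
    """Tokens as the default pattern would extract them."""
    return re.findall(DEFAULT_TOKEN_PATTERN, text)


def char_bigrams(text):
    """Set of adjacent character pairs, with punctuation dropped first."""
    chars = re.findall(r"\w", text)
    return {a + b for a, b in zip(chars, chars[1:])}


stored = "如何在英国申请离婚？"
paraphrase = "在英国，怎么申请离婚？"

# Each sentence yields whole-clause tokens, none of which are shared.
print(word_tokens(stored))
print(word_tokens(paraphrase))

# Character bigrams expose the overlap between the two sentences.
shared = char_bigrams(stored) & char_bigrams(paraphrase)
print(sorted(shared))
```

With scikit-learn itself, `TfidfVectorizer(analyzer="char", ngram_range=(1, 2))` is one way to get character n-gram features, and passing a Chinese word segmenter through the `tokenizer` parameter is another. Either choice changes the vector dimension, so the collection would have to be recreated with the new `dim`.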
