Exclusive | Scikit-LLM: Sklearn Meets Large Language Models

Author: Fareed Khan
Translator: Chen Zhiyan (陈之炎)
Proofreader: Zhao Ruxuan (赵茹萱)


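About 2,600 characters in the Chinese edition; suggested reading time: 8 minutes.
This article introduces Scikit-LLM, a toolkit for text analysis.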

Tags: LLM

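Scikit-LLM is a game changer for text analysis. It combines powerful language models such as ChatGPT with scikit-learn, giving you a single toolkit for understanding and analyzing text. With Scikit-LLM you can uncover hidden patterns, sentiment, and context in many kinds of text data, such as customer feedback, social media posts, and news articles. It brings together the strengths of language models and scikit-learn to extract valuable insights from text.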

Official GitHub repository:

https://github.com/iryna-kondr/scikit-llm

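All examples are taken directly from the official repository.

Let's get started with Scikit-LLM.

Installing Scikit-LLM

Start by installing Scikit-LLM, a library that integrates scikit-learn with language models. You can install it with pip: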

pip install scikit-llm

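Obtaining an OpenAI API key

As of May 2023, Scikit-LLM is compatible with a specific set of OpenAI models, so you must supply your own OpenAI API key for the integration to work.

First import SKLLMConfig from the Scikit-LLM library, then set your OpenAI key: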

# importing SKLLMConfig to configure OpenAI API (key and Name)
from skllm.config import SKLLMConfig


# Set your OpenAI API key
SKLLMConfig.set_openai_key("<YOUR_KEY>")


# Set your OpenAI organization (optional)
SKLLMConfig.set_openai_org("<YOUR_ORGANIZATION>")
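
To keep the key out of source files, one option is to read it from the environment before passing it to SKLLMConfig. The sketch below is an illustration only: the helper read_openai_key and the variable name OPENAI_API_KEY are assumptions made for this example and are not part of Scikit-LLM.

```python
import os
from typing import Mapping, Optional


def read_openai_key(env: Optional[Mapping[str, str]] = None,
                    var_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored under `var_name`, or raise if it is missing.

    Hypothetical helper: both the function and the default variable name
    are assumptions for this sketch.
    """
    source = os.environ if env is None else env
    key = source.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return key


# Usage (requires scikit-llm to be installed):
# from skllm.config import SKLLMConfig
# SKLLMConfig.set_openai_key(read_openai_key())
```

Failing early with a clear error is usually easier to debug than a rejected request later on.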

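As stated in the GitHub repository:

If you are using a free-trial OpenAI account, the rate limits (3 requests per minute) are not sufficient. Please switch to the "pay-as-you-go" plan first.

When calling SKLLMConfig.set_openai_org, you have to provide your organization ID, not the organization name. You can find your organization ID here: https://platform.openai.com/account/org-settings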
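
To give a feel for what a limit of 3 requests per minute means in practice, here is a minimal sliding-window throttle in plain Python. It is a sketch under stated assumptions: the RateLimiter class is hypothetical, it is not part of Scikit-LLM, and how Scikit-LLM issues its requests internally is not covered here.

```python
import time
from collections import deque


class RateLimiter:
    """Allow at most `max_calls` calls within any `period_s`-second window."""

    def __init__(self, max_calls=3, period_s=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period_s = period_s
        self._clock = clock          # injectable so the class can be tested
        self._sleep = sleep
        self._stamps = deque()       # start times of the most recent calls

    def _drop_expired(self, now):
        # Forget calls that have left the sliding window.
        while self._stamps and now - self._stamps[0] >= self.period_s:
            self._stamps.popleft()

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self._clock()
        self._drop_expired(now)
        if len(self._stamps) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            self._sleep(self.period_s - (now - self._stamps[0]))
            now = self._clock()
            self._drop_expired(now)
        self._stamps.append(now)
```

Calling wait() before each request keeps a loop under the limit. At 3 requests per minute, a few dozen samples already take many minutes, which is why the repository recommends a paid plan.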

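Zero-shot GPT classifier

One of the neat things about ChatGPT is that it can classify text without being specifically trained for the task. All it needs is descriptive labels.

Enter ZeroShotGPTClassifier, a class in Scikit-LLM that lets you create such a model just like any other scikit-learn classifier.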

# import ZeroShotGPTClassifier and the demo classification dataset
from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

# load the demo classification dataset that ships with Scikit-LLM
X, y = get_classification_dataset()

# define the model
clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")

# fit the classifier on the data
clf.fit(X, y)

# predict labels for the same texts
labels = clf.predict(X)

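Moreover, Scikit-LLM makes sure that the response it receives contains a valid label. If it does not, Scikit-LLM picks a label at random, with each label's probability taken from its frequency in the training data.

In short, Scikit-LLM handles the API calls and guarantees that you get usable labels back. When a response lacks a valid label, it fills one in according to how often each label appears in the training data.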
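
The fallback described above amounts to a frequency-weighted random draw. The following sketch shows that idea in plain Python. The function names validate_label and fallback_label are hypothetical, and this is not Scikit-LLM's actual code, only an illustration of the behavior the repository describes.

```python
import random
from collections import Counter


def fallback_label(train_labels, rng=None):
    """Draw one label at random, weighted by its frequency in `train_labels`."""
    rng = rng if rng is not None else random.Random()
    counts = Counter(train_labels)
    labels = list(counts)
    weights = [counts[label] for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]


def validate_label(response, train_labels, rng=None):
    """Keep `response` if it is a known label; otherwise draw a fallback."""
    cleaned = response.strip()
    if cleaned in set(train_labels):
        return cleaned
    return fallback_label(train_labels, rng)
```

With training labels that are 80% "positive", an unusable response would be replaced by "positive" about four times out of five.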

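What if you have no labeled data?

Here is the interesting part: you do not even need labeled data to train the model. You only need to provide a list of candidate labels: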

# import ZeroShotGPTClassifier and the demo classification dataset
from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

# load the demo dataset; only the texts are used, for prediction
X, _ = get_classification_dataset()

# define the model
clf = ZeroShotGPTClassifier()

# no training data, so pass only the candidate labels
clf.fit(None, ['positive', 'negative', 'neutral'])

# predict the labels
labels = clf.predict(X)

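Cool, right? By specifying the candidate labels directly, you can set up the classifier without any explicitly labeled data.

As stated in the GitHub repository:

In zero-shot classification, the effectiveness of the classifier depends on how the labels themselves are formulated: they should be expressed in natural language, be descriptive, and be self-explanatory.

For example, in a semantic classification task it may be beneficial to transform a label from "<semantics>" to "the semantics of the provided text is <semantics>".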
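
One way to apply this advice is to expand short labels into descriptive sentences before fitting, and to map predictions back afterwards. The helper describe_labels below is a hypothetical illustration, not a Scikit-LLM function. The template text is the one quoted above.

```python
def describe_labels(labels,
                    template="the semantics of the provided text is {label}"):
    """Build descriptive labels plus a reverse mapping back to the short ones."""
    to_descriptive = {label: template.format(label=label) for label in labels}
    to_short = {desc: label for label, desc in to_descriptive.items()}
    return to_descriptive, to_short


to_descriptive, to_short = describe_labels(["positive", "negative", "neutral"])
candidate_labels = list(to_descriptive.values())

# Usage with the classifier from the previous example (not executed here):
# clf.fit(None, candidate_labels)
# short_predictions = [to_short[p] for p in clf.predict(X)]
```

Whether the longer wording actually improves accuracy depends on the task, so it is worth comparing both variants on a small sample.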

Multi-label zero-shot text classification

Performing multi-label zero-shot text classification is easier than you might think:

# import the multi-label zero-shot classifier and its demo dataset
from skllm import MultiLabelZeroShotGPTClassifier
from skllm.datasets import get_multilabel_classification_dataset

# load the demo multi-label dataset that ships with Scikit-LLM
X, y = get_multilabel_classification_dataset()

# define the model
clf = MultiLabelZeroShotGPTClassifier(max_labels=3)

# fit the model
clf.fit(X, y)

# make predictions
labels = clf.predict(X)

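The only difference from the single-label case is that when you create an instance of MultiLabelZeroShotGPTClassifier, you specify the maximum number of labels to assign to each sample (here, max_labels=3).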
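
To make the role of max_labels concrete, the sketch below shows one plausible way to post-process a raw list of labels: discard unknown labels, drop duplicates, and keep at most max_labels entries. The function cap_labels is hypothetical, and whether Scikit-LLM performs exactly these steps internally is an assumption.

```python
def cap_labels(raw_labels, candidates, max_labels=3):
    """Keep valid, de-duplicated labels in order, up to `max_labels` of them."""
    allowed = set(candidates)
    kept = []
    for label in raw_labels:
        if len(kept) >= max_labels:
            break  # the cap has been reached
        if label in allowed and label not in kept:
            kept.append(label)
    return kept
```

For instance, a raw response listing five labels for a review would be trimmed to the first three valid ones.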

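What if you have no labeled data (multi-label case)?

In the example above, MultiLabelZeroShotGPTClassifier is trained with labeled data (X and y). You can also train the classifier without labeled data by providing a list of candidate labels. In that case, y must be of type List[List[str]].

Here is an example of training without labeled data: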

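# import the multi-label zero-shot classifier and its demo dataset
from skllm import MultiLabelZeroShotGPTClassifier
from skllm.datasets import get_multilabel_classification_dataset

# load the demo dataset; only the texts are used, for prediction
X, _ = get_multilabel_classification_dataset()

# define all the labels that may be predicted
candidate_labels = [
    "Quality",
    "Price",
    "Delivery",
    "Service",
    "Product Variety"
]

# create the model
clf = MultiLabelZeroShotGPTClassifier(max_labels=3)

# fit on the labels only; note the nested list (List[List[str]])
clf.fit(None, [candidate_labels])

# predict the labels
labels = clf.predict(X)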
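
The nested list is needed because, in the multi-label setting, each sample owns a list of labels, so [candidate_labels] looks like a single sample that carries every candidate. The sketch below shows how a unique label set could be gathered from either form of y. The helper collect_candidates is hypothetical and only illustrates the data shape; it is not taken from the library.

```python
from typing import List


def collect_candidates(y: List[List[str]]) -> List[str]:
    """Return the unique labels from nested label lists, in first-seen order."""
    seen = {}
    for sample_labels in y:
        for label in sample_labels:
            seen.setdefault(label, None)  # dicts preserve insertion order
    return list(seen)
```

Passing real per-sample labels or a single list of candidates yields the same kind of result: one flat list of distinct labels.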

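Text vectorization

Text vectorization is the process of converting text into numbers so that a machine can process and analyze it more easily. Scikit-LLM's GPTVectorizer converts a piece of text, however long, into a fixed-size set of numbers called a vector.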

# Import the GPTVectorizer class from the skllm.preprocessing module
from skllm.preprocessing import GPTVectorizer

# Create an instance of GPTVectorizer and assign it to the variable 'model'
model = GPTVectorizer()

# Transform the text data X (loaded in the earlier examples) into fixed-size vectors
vectors = model.fit_transform(X)

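Calling fit_transform on the GPTVectorizer instance fits the model to the input data X and converts the text into fixed-dimensional vectors, which are then assigned to the variable vectors.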
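
The key property here is that texts of any length map to vectors of the same size. The toy function below demonstrates only that property, using the hashing trick on word counts. It is a deliberately simple stand-in written for this illustration: it does not call OpenAI and says nothing about how GPTVectorizer computes its embeddings.

```python
import hashlib


def toy_vectorize(text, dim=8):
    """Map `text` to a `dim`-sized count vector via the hashing trick."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # Use the first 4 bytes of the hash to pick a bucket.
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    return vec


short_vec = toy_vectorize("great product")
long_vec = toy_vectorize("the delivery was late but support fixed it quickly " * 20)
# Both vectors have the same length even though the inputs differ greatly in size.
```

Because every text ends up with the same number of features, the vectors can be fed to any scikit-learn estimator that expects a fixed-width input.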

The next example combines GPTVectorizer with an XGBoost classifier in a scikit-learn pipeline, which handles text preprocessing and classification in one object:

# Import the necessary modules and classes
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from skllm.preprocessing import GPTVectorizer
from xgboost import XGBClassifier

# X_train, X_test, y_train and y_test are assumed to exist already,
# e.g. created with sklearn.model_selection.train_test_split

# Create an instance of LabelEncoder
le = LabelEncoder()

# Encode the training labels 'y_train' as integers
y_train_encoded = le.fit_transform(y_train)

# Encode the test labels 'y_test' with the same mapping
y_test_encoded = le.transform(y_test)

# Define the steps of the pipeline as a list of (name, estimator) tuples
steps = [('GPT', GPTVectorizer()), ('Clf', XGBClassifier())]

# Create a pipeline with the defined steps
clf = Pipeline(steps)

# Fit the pipeline on the training texts and the encoded training labels
clf.fit(X_train, y_train_encoded)

# Predict the (encoded) labels for the test texts
yh = clf.predict(X_test)
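
Note that the predictions yh are integer codes, not the original label strings. The plain-Python sketch below mirrors what a label encoder does (sorted classes mapped to indices) to show how codes are turned back into labels. The helper fit_label_mapping is hypothetical and written only for this illustration.

```python
def fit_label_mapping(labels):
    """Return the sorted unique classes and a label-to-index lookup."""
    classes = sorted(set(labels))
    to_index = {label: i for i, label in enumerate(classes)}
    return classes, to_index


y_demo = ["positive", "negative", "neutral", "positive"]
classes, to_index = fit_label_mapping(y_demo)

encoded = [to_index[label] for label in y_demo]   # strings to integer codes
decoded = [classes[code] for code in encoded]     # integer codes back to strings
```

With the real encoder from the pipeline example, the equivalent step is le.inverse_transform(yh).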

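Text summarization

GPT is very good at summarizing text, which is why Scikit-LLM includes a module called GPTSummarizer. You can use it in two ways: on its own, or as a preprocessing step before something else (for example, to reduce the size of the data, but working with text instead of numbers):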

# Importing the GPTSummarizer class from the skllm.preprocessing module
from skllm.preprocessing import GPTSummarizer


# Importing the get_summarization_dataset function
from skllm.datasets import get_summarization_dataset


# Calling the get_summarization_dataset function
X = get_summarization_dataset()


# Creating an instance of the GPTSummarizer
s = GPTSummarizer(openai_model='gpt-3.5-turbo', max_words=15)


# Applying the fit_transform method of the GPTSummarizer instance to the input data 'X'.
# It fits the model to the data and generates the summaries, which are assigned to the variable 'summaries'
summaries = s.fit_transform(X)

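Note that the max_words hyperparameter is a flexible limit on the number of words in the generated summary. It is passed along in the prompt but is not strictly enforced beyond that, so the actual word count of a summary may occasionally exceed the specified limit slightly, depending on the context and content of the input text. Treat max_words as a rough target for summary length rather than a hard cap.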
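
If a hard upper bound is required, the summaries can be trimmed afterwards. The sketch below shows a simple word-level truncation. The function enforce_max_words is a hypothetical post-processing helper and not part of Scikit-LLM.

```python
def enforce_max_words(summary, max_words=15):
    """Return `summary` unchanged if short enough, else its first `max_words` words."""
    words = summary.split()
    if len(words) <= max_words:
        return summary
    return " ".join(words[:max_words])


# Usage with the summaries from the example above (not executed here):
# capped = [enforce_max_words(s, max_words=15) for s in summaries]
```

Plain truncation can cut a summary mid-sentence, so it suits cases where a strict length matters more than polish.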

If you have any questions, feel free to ask!

Original title: Scikit-LLM: Sklearn Meets Large Language Models

Original link: https://medium.com/@fareedkhandev/scikit-llm-sklearn-meets-large-language-models-11fc6f30e530

Editor: Huang Jiyan (黄继彦)

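About the translator

Chen Zhiyan (陈之炎) holds a master's degree in engineering from Beijing Jiaotong University, where the major was communication and control engineering. Chen previously worked as an engineer at Great Wall Computer Software and Systems Company and at Datang Microelectronics, and now works in technical support at 北京吾译超群科技有限公司, operating and maintaining an intelligent translation teaching system. Chen has built up experience in deep learning and natural language processing (NLP). Chen enjoys translating in spare time; translated works include IEC-ISO 7816, the Iraq petroleum engineering project, and 新财税主义宣言, whose Chinese-to-English translation was formally published in the GLOBAL TIMES. Chen has joined the translation volunteer group of the THU Datapi (数据派) platform and hopes to share, exchange ideas, and make progress together with everyone.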
