Scikit-learn and large language models (LLMs) join forces!


Source: Big Data and Machine Learning Digest
This article is about 2,200 words; recommended reading time: 5 minutes.
It gives you a brief introduction to Scikit-LLM.

We previously introduced the integration of Pandas and ChatGPT, which lets you work with DataFrames without knowing Pandas.

Now someone has open-sourced Scikit-LLM, which combines powerful language models such as ChatGPT with scikit-learn. Rather than automating scikit-learn, it integrates language models into the scikit-learn workflow so that scikit-learn can also handle text data.

Scikit-learn

Scikit-learn (sklearn for short) is an open-source Python library for machine learning that provides a wealth of tools and functions for building and applying machine learning models. Powerful yet easy to use, it has become one of the most popular libraries in the field.

The Scikit-learn library provides algorithms and tools for common machine learning tasks, including classification, regression, clustering, dimensionality reduction, and model selection. It supports various supervised and unsupervised learning methods, such as Support Vector Machines (SVM), Random Forests, Logistic Regression, K-Means Clustering, and Principal Component Analysis (PCA). These algorithms are optimized and implemented to run efficiently on large-scale datasets.

In addition to algorithms and models, scikit-learn also provides tools for data preprocessing, feature selection, and evaluation. It has extensive data transformation and feature extraction capabilities to help you process and prepare your datasets. In addition, scikit-learn provides common metrics and techniques for model evaluation and parameter selection, such as cross-validation and grid search.
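For instance, here is a minimal sketch of that workflow (using the built-in iris dataset purely for illustration; nothing here is specific to Scikit-LLM), chaining preprocessing, a model, cross-validation and grid search together:

# A minimal scikit-learn workflow: preprocessing, model, cross-validation, grid search
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Load a toy dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Pipeline: scale the features, then fit an SVM classifier
pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])

# Grid search over the SVM's C parameter with 5-fold cross-validation
grid = GridSearchCV(pipe, param_grid={"svm__C": [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_, grid.score(X_test, y_test))

The same fit/predict pattern carries over to every estimator in the library, and it is exactly this interface that Scikit-LLM mimics.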

One of the design philosophies of Scikit-learn is to provide a consistent and easy-to-use API interface. This allows users to easily switch between different machine learning tasks and try different models. It also has rich documentation and sample code, providing users with resources to learn and use.

In addition to the above functions, scikit-learn is also tightly integrated with other Python libraries and tools, such as NumPy, SciPy, and Matplotlib, making it easy for users to interact and extend these libraries.

LLM

LLM (Large Language Model) refers to a large-scale language model based on deep learning. These models can generate human language with semantic and grammatical correctness by training on large amounts of text data, such as massive texts on the Internet. The training process of these models relies on deep neural networks and powerful computing resources.

A representative example of large LLMs is OpenAI's GPT (Generative Pre-trained Transformer) series, including GPT-3 and its successors. These models have billions of parameters and perform well on many language tasks, such as text generation, question answering, text classification, and machine translation.

The training of large LLMs is usually divided into two stages: pre-training and fine-tuning. In the pre-training stage, the model learns from large-scale text data in an unsupervised way, picking up the statistical structure and contextual patterns of language through tasks such as predicting the next word or filling in masked tokens. In the fine-tuning stage, the model is trained on a task-specific supervised dataset to adapt to the requirements of that task. This two-stage approach is what gives large LLMs their strong versatility across language tasks.

The advantage of large LLMs is that they can understand and generate complex language structures, with strong language understanding and generation capabilities. They can automatically generate coherent text, answer natural language questions, and in some cases even be creative. This gives them broad application potential in natural language processing, intelligent dialogue systems, content generation, and other fields.

Here I would like to share an article from Deephub Imba on how to use scikit-learn and large language models together.

Install

pip install scikit-llm

Since we want to integrate with OpenAI models, we need an API key: import the SKLLMConfig module from the Scikit-LLM library and add the OpenAI key:

# importing SKLLMConfig to configure the OpenAI API (key and organization)
from skllm.config import SKLLMConfig

# Set your OpenAI API key
SKLLMConfig.set_openai_key("<YOUR_KEY>")

# Set your OpenAI organization (optional)
SKLLMConfig.set_openai_org("<YOUR_ORGANIZATION>")

ZeroShotGPTClassifier

Thanks to the ChatGPT integration, no special training is required to classify text. ZeroShotGPTClassifier is used just like any other scikit-learn classifier and is very simple to work with.

# importing the ZeroShotGPTClassifier module and classification dataset
from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

# get the classification dataset shipped with skllm
X, y = get_classification_dataset()

# defining the model
clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")

# fitting the data
clf.fit(X, y)

# predicting the data
labels = clf.predict(X)

Scikit-LLM post-processes the response to ensure it contains exactly one valid label. If a valid label is missing from the response, it fills one in for you, choosing a label based on how often it appears in the training data.

For data without labels, we only need to provide a list of candidate labels; the code looks like this:

# importing the ZeroShotGPTClassifier module and classification dataset
from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

# get the classification dataset from skllm for prediction only
X, _ = get_classification_dataset()

# defining the model
clf = ZeroShotGPTClassifier()

# since there is no training, pass only the candidate labels
clf.fit(None, ['positive', 'negative', 'neutral'])

# predicting the labels
labels = clf.predict(X)

MultiLabelZeroShotGPTClassifier

Multi-label classification works similarly:


# importing the multi-label zero-shot module and classification dataset
from skllm import MultiLabelZeroShotGPTClassifier
from skllm.datasets import get_multilabel_classification_dataset

# get the multi-label classification dataset from skllm
X, y = get_multilabel_classification_dataset()

# defining the model
clf = MultiLabelZeroShotGPTClassifier(max_labels=3)

# fitting the model
clf.fit(X, y)

# making predictions
labels = clf.predict(X)

When creating an instance of the MultiLabelZeroShotGPTClassifier class, specify the maximum number of labels to assign to each sample (here max_labels=3).

What if the data has no labels? A classifier can be trained without labeled data by providing a list of candidate labels. The type of y should be List[List[str]]. Here is a training example with no labeled data:

# getting the classification dataset for prediction only
X, _ = get_multilabel_classification_dataset()

# defining all the candidate labels to be predicted
candidate_labels = [
    "Quality",
    "Price",
    "Delivery",
    "Service",
    "Product Variety"
]

# creating the model
clf = MultiLabelZeroShotGPTClassifier(max_labels=3)

# fitting with the candidate labels only (no training data)
clf.fit(None, [candidate_labels])

# predicting the data
labels = clf.predict(X)

Text vectorization

Text vectorization is the process of converting text to numbers. The GPTVectorizer module in Scikit-LLM converts a piece of text (no matter how long it is) into a fixed-size vector.
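In its simplest form, the vectorizer is used on its own like any other scikit-learn transformer. The short sketch below is illustrative (the example texts are made up), assuming GPTVectorizer is imported from skllm.preprocessing:

# Importing the GPTVectorizer class from the skllm.preprocessing module
from skllm.preprocessing import GPTVectorizer

# A couple of made-up example texts to embed
X = [
    "The delivery was fast and the packaging was intact.",
    "The product stopped working after two days of use."
]

# Creating the vectorizer and converting each text into a fixed-size embedding vector
model = GPTVectorizer()
vectors = model.fit_transform(X)

A more practical use is to plug GPTVectorizer into a pipeline together with a conventional model such as XGBoost, as in the following example (it assumes X_train, X_test, y_train and y_test have already been prepared):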

# Importing the necessary modules and classes
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
from skllm.preprocessing import GPTVectorizer

# Note: X_train, X_test, y_train and y_test are assumed to be prepared beforehand

# Creating an instance of the LabelEncoder class
le = LabelEncoder()

# Encoding the training labels 'y_train' using LabelEncoder
y_train_encoded = le.fit_transform(y_train)

# Encoding the test labels 'y_test' using LabelEncoder
y_test_encoded = le.transform(y_test)

# Defining the steps of the pipeline as a list of tuples
steps = [('GPT', GPTVectorizer()), ('Clf', XGBClassifier())]

# Creating a pipeline with the defined steps
clf = Pipeline(steps)

# Fitting the pipeline on the training data 'X_train' and the encoded training labels 'y_train_encoded'
clf.fit(X_train, y_train_encoded)

# Predicting the labels for the test data 'X_test' using the trained pipeline
yh = clf.predict(X_test)

Text summarization

GPT is very good at summarizing text. In Scikit-LLM there is a module called GPTSummarizer.

# Importing the GPTSummarizer class from the skllm.preprocessing module
from skllm.preprocessing import GPTSummarizer

# Importing the get_summarization_dataset function
from skllm.datasets import get_summarization_dataset

# Calling the get_summarization_dataset function
X = get_summarization_dataset()

# Creating an instance of the GPTSummarizer
s = GPTSummarizer(openai_model='gpt-3.5-turbo', max_words=15)

# Applying the fit_transform method of the GPTSummarizer instance to the input data 'X'.
# It fits the model to the data and generates the summaries, which are assigned to the variable 'summaries'
summaries = s.fit_transform(X)

Note that the max_words hyperparameter is a flexible limit on the number of words in the generated summary. While max_words sets a rough target for summary length, the summarizer may occasionally generate slightly longer summaries depending on the context and content of the input text.

Summary

The popularity of ChatGPT has driven rapid progress in general-purpose models, and this progress has also brought great changes to our daily work. Scikit-LLM integrates LLMs into the scikit-learn workflow. If you are interested, here is the source code:

https://github.com/iryna-kondr/scikit-llm

Editor: Yu Tengkai

Proofreading: Lin Yilin

