python machine learning to use sklearn library pictures, text classification (incidental sklearn installation resources and tutorials)

python machine learning to use sklearn library pictures, text classification

Download and install sklearn

sklearn is a library of python, pip need to install:

pip install sklearn

But often can not successfully install , because sklearn rely numpy and scipy, but most people are numpy pip directly installed but not the full version, so when usually installed sicpy of error.
So some people think: I go to the official website to download the file to install whl slightly. Unfortunately, the official website download speed universal wall of 10kb / s role, often not just great on the error. So I left to the reader hereResources, Extraction code: sdwi (Baidu network disk no matter how frenzied you will not be 10kb / s, right)
to get the installation package later, these two files into the Scripts folder inside your computer python directory, like this:
Here Insert Picture Descriptionalready installed numpy friends, uninstall numpy: pip uninstall numpy
then open cmd, cd all the way to this position, then:

pip install numpy-1.16.4+mkl-cp37-cp37m-win_amd64.whl
pip install scipy-1.1.0-cp37-cp37m-win_amd64.whl

Which specific front behind which I forget, there is a need can try.
You can then pip install sklearnfriends.

Preparation: download the data set

This article describes is based on the processed image data sets relevant digits (handwriting) and text (news), so it is best to download the data set in advance

  • Download News datasets
from sklearn.datasets import fetch_20newsgroups
# sample_cate 指定需要下载哪几个主题类别的新闻数据
sample_cate = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med', 'rec.sport.baseball']
# 需要从网络上下载,受连接外网速度限制, 可能要耐心等待几分钟时间
newsgroups_train = fetch_20newsgroups(subset='train', categories=sample_cate, shuffle=True)
# 以上得到训练集,以下代码得到测试集
newsgroups_test = fetch_20newsgroups(subset='test', categories=sample_cate, shuffle=True)

The training set on newsgroups_train, the test set on newsgroups_test.

Handwriting Category:

Use the SVM, Naive Bayes, the KNN, using digits Sklearn own handwriting recognition training data set
the data set is a 8 * 8 1797 grayscale pixel size, when the classifier using handwriting recognition, are the images each as 64-dimensional feature vector .

# The digits dataset
digits = datasets.load_digits()
print(digits.DESCR)

Digital display grayscale (required import matplotlib.pyplot as plt)

#显示数据集中的第一个图像
plt.imshow(digits.images[0], cmap=plt.cm.gray_r, interpolation='nearest')

This experiment requires three classifiers, knn, svm, naive Bayes, specific description can google, or end view files inside word github repository resources
Next, create a classifier:

# 创建svm分类器
svc_clf = svm.SVC(gamma=0.001)
# 创建KNN分类器
knn_clf = KNeighborsClassifier()
# 创建朴素贝叶斯分类器
nb_clf = MultinomialNB()

Then divide the data set (news which is not needed, because when we download data sets as well as good points)

# Split data into train and test subsets
# 使用train_test_split将数据集分为训练集,测试集, y_train 表示训练集中样本的类别标签, y_test表示测试集中样本的类别标签
# test_size = 0.5 表示使用一半数据进行测试, 另一半就用于训练
X_train, X_test, y_train, y_test = train_test_split(
    data, digits.target, test_size=0.5, shuffle=False)

Use fit for training:

# 调用fit方法进行训练,传入训练集样本和样本的类别标签,进行有监督学习
svc_clf.fit(X_train, y_train)
knn_clf.fit(X_train, y_train)
nb_clf.fit(X_train, y_train)

Prediction: `

# 调用predict, 用训练得到的模型在测试集进行类别预测,得到预测的类别标签
svc_predicted = svc_clf.predict(X_test)
knn_predicted = knn_clf.predict(X_test)
nb_predicted = nb_clf.predict(X_test)`

Then start output:

svc_images_and_predictions = list(zip(digits.images[n_samples // 2:], svc_predicted))
knn_images_and_predictions = list(zip(digits.images[n_samples // 2:], knn_predicted))
nb_images_and_predictions = list(zip(digits.images[n_samples // 2:], nb_predicted))

# 在图表的第二行输出svm在测试集的前四个手写体图像上的分类结果,大家可以在图上看看结果对不对
for ax, (image, svc_prediction) in zip(axes[1, :], svc_images_and_predictions[:4]):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_title('Prediction: %i' % svc_prediction)

# 在图表的第三行输出KNN在测试集的前四个手写体图像上的分类结果,大家可以在图上看看结果对不对
# 大家应该可以发现KNN把第二列的8这个手写数字识别为3,发生错误
for ax, (image, knn_prediction) in zip(axes[2, :], knn_images_and_predictions[:4]):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_title('Prediction: %i' % knn_prediction)

# 在图表的第四行输出朴素贝叶斯在测试集的前四个手写体图像上的分类结果,大家可以在图上看看结果对不对
for ax, (image, nb_prediction) in zip(axes[3, :], nb_images_and_predictions[:4]):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_title('Prediction: %i' % nb_prediction)
    
# 绘制出图
plt.show()

Output performance indicators:

# 输出三个分类器的性能指标,大家需要了解二分类、多分类的性能评估指标主要有哪些

# 输出svm的分类性能指标
print("Classification report for classifier %s:\n%s\n"
      % (svc_clf, metrics.classification_report(y_test, svc_predicted)))

# 输出KNN的分类性能指标
print("Classification report for classifier %s:\n%s\n"
      % (knn_clf, metrics.classification_report(y_test, knn_predicted)))

# 输出naive bayes的分类性能指标
print("Classification report for classifier %s:\n%s\n"
      % (nb_clf, metrics.classification_report(y_test, nb_predicted)))

Handwriting overall classification codes: GitHub link
operating results:
Here Insert Picture DescriptionHere Insert Picture Description

News text classification

It is simpler than the image of the text points, because no matrix processing.
Process and handwriting word almost, after predicting fit, and then output performance

from sklearn.datasets import fetch_20newsgroups
# sample_cate 指定需要下载哪几个主题类别的新闻数据
sample_cate = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med', 'rec.sport.baseball']
# 需要从网络上下载,受连接外网速度限制, 可能要耐心等待几分钟时间
newsgroups_train = fetch_20newsgroups(subset='train', categories=sample_cate, shuffle=True)
# 以上得到训练集,以下代码得到测试集
newsgroups_test = fetch_20newsgroups(subset='test', categories=sample_cate, shuffle=True)

print(len(newsgroups_train.data))
print(len(newsgroups_test.data))

import matplotlib.pyplot as plt
from sklearn import datasets, svm, metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

count_vectorizer = CountVectorizer(stop_words='english')
cv_news_train_vector = count_vectorizer.fit_transform(newsgroups_train.data)
print("news_train_vector.shape:",cv_news_train_vector.shape)
cv_news_test_vector = count_vectorizer.transform(newsgroups_test.data)
print("news_test_vector.shape:",cv_news_test_vector.shape)

svc_clf = svm.SVC(kernel="linear")
svc_clf.fit(cv_news_train_vector,newsgroups_train.target)
cv_svc_predict = svc_clf.predict(cv_news_test_vector)
print("Classification report for classifier %s:\n%s\n" % (svc_clf,metrics.classification_report(newsgroups_test.target,cv_svc_predict,target_names=newsgroups_test.target_names)))

knn_clf = KNeighborsClassifier()
knn_clf.fit(cv_news_train_vector,newsgroups_train.target)
cv_knn_predict = knn_clf.predict(cv_news_test_vector)
print("Classification report for classifier %s:\n%s\n" % (knn_clf,metrics.classification_report(newsgroups_test.target,cv_knn_predict,target_names=newsgroups_test.target_names)))

nb_clf = MultinomialNB()
nb_clf.fit(cv_news_train_vector,newsgroups_train.target)
cv_nb_predict = nb_clf.predict(cv_news_test_vector)
print("Classification report for classifier %s:\n%s\n" % (nb_clf,metrics.classification_report(newsgroups_test.target,cv_nb_predict,target_names=newsgroups_test.target_names)))

###################################################################

tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_news_train_vector = tfidf_vectorizer.fit_transform(newsgroups_train.data)
print("news_train_vectorizer.shape",tfidf_news_train_vector.shape)
tfidf_news_test_vector = tfidf_vectorizer.transform(newsgroups_test.data)
print("news_test_vectorizer.shape",tfidf_news_train_vector.shape)

svc_clf2 = svm.SVC(kernel="linear")
svc_clf2.fit(tfidf_news_train_vector,newsgroups_train.target)
tfidf_svc_predict2 = svc_clf2.predict(tfidf_news_test_vector)
print("Classification report for classifier %s:\n%s\n" % (svc_clf2,metrics.classification_report(newsgroups_test.target,tfidf_svc_predict2,target_names=newsgroups_test.target_names)))

knn_clf = KNeighborsClassifier()
knn_clf.fit(tfidf_news_train_vector,newsgroups_train.target)
tfidf_knn_predict = knn_clf.predict(tfidf_news_test_vector)
print("Classification report for classifier %s:\n%s\n" % (knn_clf,metrics.classification_report(newsgroups_test.target,tfidf_knn_predict,target_names=newsgroups_test.target_names)))

nb_clf = MultinomialNB()
nb_clf.fit(tfidf_news_train_vector,newsgroups_train.target)
tfidf_nb_predict = nb_clf.predict(tfidf_news_test_vector)
print("Classification report for classifier %s:\n%s\n" % (nb_clf,metrics.classification_report(newsgroups_test.target,tfidf_nb_predict,target_names=newsgroups_test.target_names)))
©

operation result:
Here Insert Picture DescriptionHere Insert Picture Description

Released three original articles · won praise 0 · Views 64

Guess you like

Origin blog.csdn.net/qq_43534805/article/details/104725604