Data Mining: Cluster Analysis, Classifiers, Association Rules, and Regression Analysis (Python)

Data mining

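Data mining is the process of extracting knowledge from large volumes of data. It draws on several fields, including statistics, machine learning, and artificial intelligence.
It typically relies on computer programs to analyze data, uncover latent relationships or rules, and produce useful information.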

Common techniques

These include:

cluster analysis,
classifiers,
association rule mining,
regression analysis.

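Resources for learning data mining (websites):

KDnuggets: a site offering data mining news, educational resources, and tools.
Journal of Data Mining and Knowledge Discovery: one of the well-regarded academic journals in the data mining field.
Data Mining Techniques: a blog offering in-depth explanations and case studies of data mining techniques and practice.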

Data mining example code:

Cluster analysis (Python)

from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
print(kmeans.labels_)
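
Beyond the labels, a fitted KMeans model also exposes the cluster centers and can assign new points to the nearest center. The following is a minimal sketch on the same six points; the two new points and the n_init=10 setting are illustrative assumptions, not part of the original example.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# n_init=10 reruns the algorithm from 10 different starts and keeps the best run
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# One row per cluster: the mean of the points assigned to it
centers = kmeans.cluster_centers_

# Assign two made-up points to whichever center is nearest
new_labels = kmeans.predict(np.array([[0, 0], [5, 3]]))

# Sum of squared distances from each point to its center (lower is tighter)
inertia = kmeans.inertia_

print(centers)
print(new_labels)
print(inertia)
```

Comparing inertia_ across several values of n_clusters is one common, if rough, way to choose the number of clusters.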

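Association rule mining (Python libraries)

Apriori: an algorithm for mining frequent itemsets and association rules. It can be run with the apyori library.
FP-growth: serves the same purpose as Apriori but is usually faster, because it compresses the transactions into a tree instead of generating candidate itemsets. It can be run with the pyfpgrowth library.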
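
Both algorithms rank itemsets and rules by two quantities: support (how often an itemset occurs) and confidence (how often the consequent occurs given the antecedent). As a rough illustration, they can be computed by hand in plain Python; the helper names support and confidence below are hypothetical and belong to neither library.

```python
transactions = [
    {"apple", "banana"},
    {"banana", "orange"},
    {"apple", "banana", "orange"},
    {"banana", "orange"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

sup_bo = support({"banana", "orange"}, transactions)            # 3 of 4 transactions
conf_o_b = confidence({"orange"}, {"banana"}, transactions)     # orange => banana
conf_b_o = confidence({"banana"}, {"orange"}, transactions)     # banana => orange

print(sup_bo, conf_o_b, conf_b_o)  # 0.75 1.0 0.75
```

Note that confidence is not symmetric: every basket with an orange also has a banana, but not the other way round.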

Association rule mining with the Apriori algorithm

from apyori import apriori

transactions = [['apple', 'banana'], ['banana', 'orange'], ['apple', 'banana', 'orange'], ['banana', 'orange']]
rules = apriori(transactions, min_support=0.5, min_confidence=0.7)
# Keep only itemsets whose support is at least 0.5.
# Emit a rule only if its confidence is at least 0.7.
for rule in rules:
    print(rule)

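Association rule mining with the FP-Growth algorithm

Install the pyfpgrowth library:

!pip install pyfpgrowth

Prepare the dataset:

This example uses an e-commerce order dataset available on Kaggle; the code below loads a CSV copy from a GitHub URL.
Each order in the dataset contains a set of items.
We extract the invoice number and the items of each order, then encode every order as a row of 0/1 flags, one column per item.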

import pandas as pd

# Load the data
df = pd.read_csv('https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/online-retail/online_retail.csv',
                 header=0, parse_dates=[4], encoding='unicode_escape')

# Build an invoice-by-item table of total quantities for United Kingdom orders
basket = (df[df['Country'] == "United Kingdom"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))

# Encode each cell as 1 if the item appears on the invoice, else 0
basket_sets = basket.applymap(lambda x: 1 if x > 0 else 0).astype(int)
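
To make the pivot step easier to follow, here is the same pattern on a tiny hand-made frame. The invoice numbers, item names, and the toy_ variable names are invented for illustration. The last step turns each 0/1 row back into a list of item names, which is the list-of-transactions shape that libraries such as pyfpgrowth take as input.

```python
import pandas as pd

# Made-up order lines: invoice A2 lists MUG twice
toy = pd.DataFrame({
    "InvoiceNo":   ["A1", "A1", "A2", "A2", "A3"],
    "Description": ["MUG", "PLATE", "MUG", "MUG", "PLATE"],
    "Quantity":    [2, 1, 1, 3, 5],
})

# One row per invoice, one column per item, cells hold total quantity
toy_basket = (toy.groupby(["InvoiceNo", "Description"])["Quantity"]
                 .sum().unstack().fillna(0))

# 1 if the item appears on the invoice, else 0
toy_sets = toy_basket.gt(0).astype(int)

# Back to one list of item names per invoice
toy_transactions = [toy_sets.columns[mask].tolist()
                    for mask in toy_sets.to_numpy(dtype=bool)]

print(toy_sets)
print(toy_transactions)  # [['MUG', 'PLATE'], ['MUG'], ['PLATE']]
```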

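Mine association rules with the pyfpgrowth library:

First set the minimum support and minimum confidence,
then call find_frequent_patterns() to find the frequent itemsets,
and finally call generate_association_rules() to generate the rules.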

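import pyfpgrowth

# Set the minimum support (as a fraction of all orders) and the minimum confidence
min_support = 0.02
min_confidence = 0.7

# pyfpgrowth expects a list of transactions (one list of item names per order),
# so turn each 0/1 row of basket_sets back into the items it contains
transactions = [basket_sets.columns[mask].tolist()
                for mask in basket_sets.to_numpy(dtype=bool)]

# Find the frequent itemsets; the threshold is an absolute count of orders
patterns = pyfpgrowth.find_frequent_patterns(transactions, int(len(transactions) * min_support))

# Generate the association rules
rules = pyfpgrowth.generate_association_rules(patterns, min_confidence)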

Display the results:
Print each generated rule with its support and confidence.

# rules maps each antecedent tuple to (consequent tuple, confidence)
for antecedent, (consequent, confidence) in sorted(rules.items(), key=lambda x: x[1][1], reverse=True):
    # Support is the number of orders that contain the antecedent and the consequent together
    support = patterns.get(tuple(sorted(antecedent + consequent)))
    print(f"{list(antecedent)} => {list(consequent)} (support={support}, confidence={confidence})")

['JUMBO BAG RED RETROSPOT'] => ['JUMBO STORAGE BAG SUKI'] (support=219, confidence=0.7777777777777778)
['REGENCY CAKESTAND 3 TIER'] => ['WHITE HANGING HEART T-LIGHT HOLDER'] (support=233, confidence=0.7209302325581395)
['JUMBO BAG PINK POLKADOT'] => ['JUMBO STORAGE BAG SUKI'] (support=242, confidence=0.7560975609756098)
['LUNCH BAG  BLACK SKULL.'] => ['LUNCH BAG RED RETROSPOT'] (support=139, confidence=0.765625)
['PARTY BUNTING'] => ['JUMBO BAG RED RETROSPOT'] (support=184, confidence=0.7551020408163266)
['JUMBO STORAGE BAG SUKI'] => ['JUMBO BAG PINK POLKADOT'] (support=242, confidence=0.7023255813953488)
['LUNCH BAG PINK POLKADOT'] => ['LUNCH BAG RED RETROSPOT'] (support=180, confidence=0.7843137254901961)
['LUNCH BAG CARS BLUE'] => ['LUNCH BAG RED RETROSPOT'] (support=177, confidence=0.8240740740740741)
['LUNCH BAG SPACEBOY DESIGN'] => ['LUNCH BAG RED RETROSPOT'] (support=161, confidence=0.7385321100917431)
['WOODEN PICTURE FRAME WHITE FINISH'] => ['WOODEN FRAME ANTIQUE WHITE '] (support=144, confidence=0.7972027972027972)
['JUMBO BAG RED RETROSPOT'] => ['JUMBO BAG PINK POLKADOT'] (support=219, confidence=0.7777777777777778)
['LUNCH BAG WOODLAND'] => ['LUNCH BAG RED RETROSPOT'] (support=158, confidence=0.7939698492462312)
['LUNCH BAG RED SPOTTY'] => ['LUNCH BAG RED RETROSPOT'] (support=228, confidence=0.9421487603305785)
['LUNCH BAG SUKI DESIGN '] => ['LUNCH BAG RED RETROSPOT'] (support=176, confidence=0.839905352113281)
['SET OF 3 CAKE TINS PANTRY DESIGN '] => ['SET OF 3 RETROSPOT CAKE TINS'] (support=134, confidence=0.8564102564102564)
['JUMBO STORAGE BAG SKULLS'] => ['JUMBO BAG RED RETROSPOT'] (support=125, confidence=0.7515151515151515)
['JUMBO BAG APPLES'] => ['JUMBO BAG RED RETROSPOT'] (support=174, confidence=0.8536585365853658)
['LUNCH BAG APPLE DESIGN'] => ['LUNCH BAG RED RETROSPOT'] (support=168, confidence=0.7428571428571429)
['RECYCLING BAG RETROSPOT'] => ['JUMBO BAG RED RETROSPOT'] (support=155, confidence=0.825531914893617)
['PANTRY ROLLING PIN'] => ['SET OF 3 RETROSPOT CAKE TINS'] (support=136, confidence=0.7431693989071039)
['PLASTERS IN TIN SPACEBOY'] => ['PLASTERS IN TIN WOODLAND ANIMALS'] (support=119, confidence=0.8415492957746478)

Association rule mining (Java)

import weka.associations.Apriori;
import weka.core.Instances;
import java.io.BufferedReader;
import java.io.FileReader;

public class AssociationRuleMining {
  public static void main(String[] args) throws Exception {
    // Load the transactions from an ARFF file
    BufferedReader reader = new BufferedReader(new FileReader("transactions.arff"));
    Instances data = new Instances(reader);
    reader.close();

    Apriori model = new Apriori();
    // Assuming 0.5 was meant as the minimum rule confidence: that is the -C option
    // (-R is a flag that takes no value)
    String[] options = {"-C", "0.5"};
    model.setOptions(options);
    model.buildAssociations(data);
    System.out.println(model);
  }
}

Classifiers (Python)

Decision tree classifier:
Implemented with the sklearn library:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the decision tree classifier
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

# Report the classifier's accuracy on the test set
print("Accuracy: %f" % clf.score(X_test, y_test))

Naive Bayes (NB) classifier:
Implemented with the sklearn library:

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the Gaussian naive Bayes classifier
clf = GaussianNB()
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

# Report the classifier's accuracy on the test set
print("Accuracy: %f" % clf.score(X_test, y_test))

Support vector machine (SVM) classifier:
Implemented with the sklearn library:

from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the support vector machine classifier
clf = SVC()
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

# Report the classifier's accuracy on the test set
print("Accuracy: %f" % clf.score(X_test, y_test))
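
The three classifiers share the same fit/predict interface, so they can be compared in a single loop. The sketch below uses 5-fold cross-validation instead of one train/test split; that is a swapped-in evaluation choice, not the method used above, and the fold count and random_state are arbitrary assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "decision tree": DecisionTreeClassifier(random_state=42),
    "naive Bayes": GaussianNB(),
    "SVM": SVC(),
}

mean_accuracy = {}
for name, model in models.items():
    # Each of the 5 folds serves once as the held-out test set
    scores = cross_val_score(model, X, y, cv=5)
    mean_accuracy[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Averaging over folds gives a steadier estimate than a single split, which matters on a dataset as small as iris (150 samples).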

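Regression analysis (Python)

Definition

Regression analysis is a statistical method for predicting numeric values; it studies how the variables in a dataset relate to one another.

Common regression methods

These include:

linear regression,
polynomial regression,
ridge regression,
lasso regression, among others. The sections below show how to run each of them in Python.

Linear regression

Linear regression is the most basic regression method; it models a linear relationship between the independent variables and the dependent variable.

It can be implemented with the LinearRegression model from scikit-learn.

First, prepare a dataset. The scenario is predicting student grades; here synthetic data with a true slope of 2 and Gaussian noise stands in for real records: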

import numpy as np

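# Generate the dataset
np.random.seed(0)
X = np.random.normal(0, 1, size=(100, 1))
y = 2*X[:, 0] + np.random.normal(0, 0.5, size=100)

Next, create a LinearRegression model and fit it to the data.

from sklearn.linear_model import LinearRegression

# Create the model
model = LinearRegression()

# Fit
model.fit(X, y)

Once the model is fitted, use it to predict and compute its goodness of fit (R-squared). Here the score is computed on the training data itself, so it measures fit rather than performance on unseen data.

# Predict
y_pred = model.predict(X)

# Compute R-squared
r_squared = model.score(X, y)
print('R-square is:', r_squared)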
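
Because the data were generated with a known slope of 2 and no intercept, the fitted parameters give a quick sanity check. The sketch below regenerates the same data, reads the slope and intercept from the model, and compares them with an ordinary least-squares solve in NumPy; the NumPy cross-check is an added illustration rather than part of the original walkthrough.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(0)
X = np.random.normal(0, 1, size=(100, 1))
y = 2 * X[:, 0] + np.random.normal(0, 0.5, size=100)

model = LinearRegression().fit(X, y)
slope = model.coef_[0]        # should land near the true slope of 2
intercept = model.intercept_  # should land near 0

# Same fit via least squares on the design matrix [x, 1]
A = np.column_stack([X[:, 0], np.ones(len(X))])
ls_slope, ls_intercept = np.linalg.lstsq(A, y, rcond=None)[0]

print(slope, intercept)
print(ls_slope, ls_intercept)
```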

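Polynomial regression

Polynomial regression extends linear regression: it adds higher-order terms (squared, cubed, and so on) as extra features so that a linear model can fit a nonlinear relationship.

Use the PolynomialFeatures class from scikit-learn to generate the higher-order features,
then fit them with a LinearRegression model.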
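
To see what "generating higher-order features" means concretely, the short sketch below expands two made-up input values with degree=2. Each input x becomes the row [1, x, x^2]; the leading 1 is the bias column that PolynomialFeatures adds by default.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two illustrative samples with a single feature each
x = np.array([[2.0], [3.0]])

poly = PolynomialFeatures(degree=2)
expanded = poly.fit_transform(x)

# Columns are: bias (1), x, x squared
print(expanded)
# [[1. 2. 4.]
#  [1. 3. 9.]]
```

The linear model then learns one coefficient per column, which is why the fitted curve can bend even though the model is still linear in its parameters.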

First, prepare the dataset:

import numpy as np
import matplotlib.pyplot as plt

# Generate the dataset
np.random.seed(0)
X = np.linspace(-1,1,100)
y = np.sin(3*np.pi*X) + np.random.normal(0, 0.1, size=100)

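Next, use PolynomialFeatures to generate the degree-2 features and fit them with a LinearRegression model.

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Generate the degree-2 features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X.reshape(-1,1))

# Create the model
model = LinearRegression()

# Fit
model.fit(X_poly, y)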

Once the model is fitted, use it to predict and plot the fitted curve.

# Predict (reuse the already-fitted transformer instead of refitting it)
X_test = np.linspace(-1,1,100)
X_test_poly = poly.transform(X_test.reshape(-1,1))
y_pred = model.predict(X_test_poly)

# Plot the fitted curve
plt.scatter(X, y, color='b')
plt.plot(X_test, y_pred, color='r')
plt.show()

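Ridge regression

Ridge regression is a regularized form of linear regression.
It adds an L2 penalty that shrinks the coefficients, which guards against the overfitting that too many features can cause.

It can be implemented with the Ridge model from scikit-learn.

First, prepare the dataset: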

import numpy as np

# Generate the dataset
np.random.seed(0)
X = np.random.normal(0, 1, size=(100, 10))
y = 2*X[:, 0] + 3*X[:, 1] + np.random.normal(0, 0.5, size=100)

Next, create a Ridge model and fit it to the data.

from sklearn.linear_model import Ridge

# Create the model
model = Ridge(alpha=1)

# Fit
model.fit(X, y)

Once the model is fitted, use it to predict and compute its goodness of fit (R-squared).

# Predict
y_pred = model.predict(X)

# Compute R-squared
r_squared = model.score(X, y)
print('R-square is:', r_squared)
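
The shrinking effect of the L2 penalty can be observed directly by refitting with increasing alpha and watching the size of the coefficient vector fall. The sketch below reuses the same synthetic data; the particular alpha values are arbitrary picks for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

np.random.seed(0)
X = np.random.normal(0, 1, size=(100, 10))
y = 2 * X[:, 0] + 3 * X[:, 1] + np.random.normal(0, 0.5, size=100)

alphas = [0.01, 1, 100, 10000]
coef_norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X, y)
    # Euclidean length of the coefficient vector under this penalty strength
    coef_norms.append(np.linalg.norm(model.coef_))
    print(f"alpha={alpha:>7}: ||coef|| = {coef_norms[-1]:.3f}")
```

A larger alpha pulls every coefficient toward zero, but ridge does not usually set any of them exactly to zero.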

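Lasso regression

Lasso regression is a regularized form of linear regression.
It adds an L1 penalty that shrinks the coefficients and can drive some of them exactly to zero, which also acts as feature selection.

It can be implemented with the Lasso model from scikit-learn.

First, prepare the dataset: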

import numpy as np

# Generate the dataset
np.random.seed(0)
X = np.random.normal(0, 1, size=(100, 10))
y = 2*X[:, 0] + 3*X[:, 1] + np.random.normal(0, 0.5, size=100)

Next, create a Lasso model and fit it to the data.

from sklearn.linear_model import Lasso

# Create the model
model = Lasso(alpha=0.1)

# Fit
model.fit(X, y)

Once the model is fitted, use it to predict and compute its goodness of fit (R-squared).

# Predict
y_pred = model.predict(X)

# Compute R-squared
r_squared = model.score(X, y)
print('R-square is:', r_squared)
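
The feature-selection effect shows up in the fitted coefficients: only the first two features drive y in this synthetic data, so the L1 penalty would be expected to zero out at least some of the other eight. A minimal sketch for inspecting this, using the same data and alpha as above (the variable names kept and n_zero are introduced here for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

np.random.seed(0)
X = np.random.normal(0, 1, size=(100, 10))
y = 2 * X[:, 0] + 3 * X[:, 1] + np.random.normal(0, 0.5, size=100)

model = Lasso(alpha=0.1).fit(X, y)

# Indices of features whose coefficient survived the penalty
kept = np.flatnonzero(model.coef_)
# Number of coefficients driven exactly to zero
n_zero = int(np.sum(model.coef_ == 0))

print(np.round(model.coef_, 3))
print("kept features:", kept, "zeroed:", n_zero)
```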

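In practice, choose the regression method that suits the characteristics of the data, and tune the model's hyperparameters appropriately to improve predictive accuracy.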
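
As one possible way to do that tuning, scikit-learn's RidgeCV picks the penalty strength by cross-validation on the training data. The sketch below is an assumption-laden example rather than a recommended recipe: the candidate alphas, the 70/30 split, and the random_state are all arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

np.random.seed(0)
X = np.random.normal(0, 1, size=(100, 10))
y = 2 * X[:, 0] + 3 * X[:, 1] + np.random.normal(0, 0.5, size=100)

# Hold out a test set so the final score reflects unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Candidate penalty strengths; RidgeCV keeps the one with the best CV score
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
model = RidgeCV(alphas=alphas).fit(X_train, y_train)

best_alpha = model.alpha_
test_r2 = model.score(X_test, y_test)

print("selected alpha:", best_alpha)
print("R-squared on held-out data:", test_r2)
```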