Data Mining_Cluster Analysis & Classifiers & Association Rules & Regression Analysis (Python)

Data mining

Data mining is the process of extracting knowledge from large amounts of data. It draws on many fields, such as statistics, machine learning, and artificial intelligence.
It typically uses computer programs to analyze data, discover underlying relationships or rules, and produce useful information.

Common techniques include:

Cluster analysis
Classification
Association rule mining
Regression analysis

Data mining learning platforms (websites):

KDnuggets: This website provides news, educational resources and tools related to data mining.
Journal of Data Mining and Knowledge Discovery: This journal is one of the most prestigious academic publications in the field of data mining.
Data Mining Techniques: This blog provides in-depth explanations and case studies on data mining techniques and practices.

Data mining sample code:

Cluster analysis (Python)

from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
print(kmeans.labels_)
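To make clearer what KMeans does with these six points, here is a minimal NumPy sketch of the underlying assign-and-update loop (Lloyd's algorithm). It is an illustration only, not scikit-learn's implementation. The helper name `lloyd_kmeans` is hypothetical, and using the first and fourth points as starting centers is an assumption made here so that the run is deterministic.

```python
import numpy as np

def lloyd_kmeans(X, init_centers, max_iter=100):
    """Hypothetical helper: plain k-means with fixed starting centers."""
    centers = np.asarray(init_centers, dtype=float)
    for _ in range(max_iter):
        # Assignment step: squared distance from every point to every center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        # (keep the old center if a cluster happens to be empty)
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])
# Assumption: start from the first and fourth points
labels, centers = lloyd_kmeans(X, init_centers=X[[0, 3]])
print(labels)   # [0 0 0 1 1 1]
print(centers)  # the two centers are (1, 2) and (4, 2)
```

With these starting centers the loop converges immediately: the three points with x = 1 form one cluster and the three with x = 4 form the other. Cluster numbering from scikit-learn's KMeans may differ, since label values are arbitrary.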

Association rule mining (Python lib)

Apriori algorithm: used for frequent itemset mining and association rule mining. It is available in the apyori library.
FP-growth algorithm: serves the same purpose as the Apriori algorithm, but is typically faster because it compresses the transactions into an FP-tree and avoids Apriori's candidate generation. It is available in the pyfpgrowth library.

Implement association rule mining using the Apriori algorithm

from apyori import apriori

transactions = [['apple', 'banana'], ['banana', 'orange'], ['apple', 'banana', 'orange'], ['banana', 'orange']]
rules = apriori(transactions, min_support=0.5, min_confidence=0.7)
# Itemsets with a support of at least 0.5 are treated as frequent.
# A rule between two itemsets is output only if its confidence is at least 0.7.
for rule in rules:
    print(rule)
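The two thresholds above are easier to interpret once support and confidence are computed by hand. The following pure-Python sketch does this for the same four transactions; the helper names `support` and `confidence` are hypothetical and are not part of apyori.

```python
transactions = [['apple', 'banana'], ['banana', 'orange'],
                ['apple', 'banana', 'orange'], ['banana', 'orange']]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support(['banana', 'orange']))       # 0.75
print(support(['apple', 'orange']))        # 0.25
print(confidence(['orange'], ['banana']))  # 1.0
print(confidence(['banana'], ['orange']))  # 0.75
print(confidence(['apple'], ['orange']))   # 0.5
```

Here {banana, orange} clears a minimum support of 0.5, while {apple, orange} appears in only one of the four transactions and does not.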

Implement association rule mining using the FP-Growth algorithm

Install the pyfpgrowth library:

!pip install pyfpgrowth

Dataset preparation:

Use an e-commerce order dataset from Kaggle (the code below loads a copy from a CSV URL).
Each order in the dataset contains a set of items.
Keep the order number and product description, and pivot them into a table with one row per order and a 0/1 flag per product.

import pandas as pd

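# Read the data
df = pd.read_csv('https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/online-retail/online_retail.csv',
                 header=0, parse_dates=[4], encoding='unicode_escape')

# Pivot UK orders into one row per invoice and one column per product,
# holding the total quantity ordered (0 if the product is not in the order)
basket = (df[df['Country'] == "United Kingdom"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))

# Binarize: 1 if the product appears in the order, otherwise 0
# (equivalent to the element-wise applymap, which newer pandas versions deprecate)
basket_sets = (basket > 0).astype(int)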
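The pivot step is easier to follow on a tiny table. The sketch below applies the same groupby/unstack pattern to a made-up three-order DataFrame and then turns each row back into a list of items. The invoice numbers and product names are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Made-up orders: invoice A2 lists TEA twice, A3 contains only MUG
orders = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "A2", "A3"],
    "Description": ["TEA", "MUG", "TEA", "TEA", "MUG"],
    "Quantity": [2, 1, 1, 3, 5],
})

# One row per invoice, one column per product, total quantity in each cell
basket = (orders.groupby(["InvoiceNo", "Description"])["Quantity"]
          .sum().unstack().fillna(0))

# 1 if the product appears in the invoice, otherwise 0
basket_sets = (basket > 0).astype(int)
print(basket_sets)

# Back to one list of products per invoice
transactions = [row[row > 0].index.tolist() for _, row in basket_sets.iterrows()]
print(transactions)  # [['MUG', 'TEA'], ['TEA'], ['MUG']]
```

The list-of-lists form at the end is the shape that transaction-based miners usually take as input.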

Use the pyfpgrowth library for association rule mining:

First, set the minimum support and minimum confidence.
Then call the find_frequent_patterns() function to find frequent itemsets. It takes a list of transactions and an absolute support count, so the fractional minimum support is converted into a number of orders.
Then call the generate_association_rules() function to generate association rules.

import pyfpgrowth

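# Set the minimum support and minimum confidence
min_support = 0.02
min_confidence = 0.7

# pyfpgrowth expects a list of transactions (one list of items per order),
# so convert each row of the 0/1 basket table into its list of products
transactions = [row[row > 0].index.tolist() for _, row in basket_sets.iterrows()]

# Find frequent itemsets; the threshold is an absolute number of orders
patterns = pyfpgrowth.find_frequent_patterns(transactions, int(len(transactions) * min_support))

# Generate association rules
rules = pyfpgrowth.generate_association_rules(patterns, min_confidence)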

Mining result display:
Output the generated association rules with their confidence and support.

for antecedent, (consequent, confidence) in sorted(rules.items(), key=lambda x: x[1][1], reverse=True):
    # pyfpgrowth maps each antecedent tuple to (consequent tuple, confidence);
    # look up the support count of the full itemset in the mined patterns
    support = patterns.get(tuple(sorted(antecedent + consequent)))
    print(f"{list(antecedent)} => {list(consequent)} (support={support}, confidence={confidence})")

['JUMBO BAG RED RETROSPOT'] => ['JUMBO STORAGE BAG SUKI'] (support=219, confidence=0.7777777777777778)
['REGENCY CAKESTAND 3 TIER'] => ['WHITE HANGING HEART T-LIGHT HOLDER'] (support=233, confidence=0.7209302325581395)
['JUMBO BAG PINK POLKADOT'] => ['JUMBO STORAGE BAG SUKI'] (support=242, confidence=0.7560975609756098)
['LUNCH BAG  BLACK SKULL.'] => ['LUNCH BAG RED RETROSPOT'] (support=139, confidence=0.765625)
['PARTY BUNTING'] => ['JUMBO BAG RED RETROSPOT'] (support=184, confidence=0.7551020408163266)
['JUMBO STORAGE BAG SUKI'] => ['JUMBO BAG PINK POLKADOT'] (support=242, confidence=0.7023255813953488)
['LUNCH BAG PINK POLKADOT'] => ['LUNCH BAG RED RETROSPOT'] (support=180, confidence=0.7843137254901961)
['LUNCH BAG CARS BLUE'] => ['LUNCH BAG RED RETROSPOT'] (support=177, confidence=0.8240740740740741)
['LUNCH BAG SPACEBOY DESIGN'] => ['LUNCH BAG RED RETROSPOT'] (support=161, confidence=0.7385321100917431)
['WOODEN PICTURE FRAME WHITE FINISH'] => ['WOODEN FRAME ANTIQUE WHITE '] (support=144, confidence=0.7972027972027972)
['JUMBO BAG RED RETROSPOT'] => ['JUMBO BAG PINK POLKADOT'] (support=219, confidence=0.7777777777777778)
['LUNCH BAG WOODLAND'] => ['LUNCH BAG RED RETROSPOT'] (support=158, confidence=0.7939698492462312)
['LUNCH BAG RED SPOTTY'] => ['LUNCH BAG RED RETROSPOT'] (support=228, confidence=0.9421487603305785)
['LUNCH BAG SUKI DESIGN '] => ['LUNCH BAG RED RETROSPOT'] (support=176, confidence=0.839905352113281)
['SET OF 3 CAKE TINS PANTRY DESIGN '] => ['SET OF 3 RETROSPOT CAKE TINS'] (support=134, confidence=0.8564102564102564)
['JUMBO STORAGE BAG SKULLS'] => ['JUMBO BAG RED RETROSPOT'] (support=125, confidence=0.7515151515151515)
['JUMBO BAG APPLES'] => ['JUMBO BAG RED RETROSPOT'] (support=174, confidence=0.8536585365853658)
['LUNCH BAG APPLE DESIGN'] => ['LUNCH BAG RED RETROSPOT'] (support=168, confidence=0.7428571428571429)
['RECYCLING BAG RETROSPOT'] => ['JUMBO BAG RED RETROSPOT'] (support=155, confidence=0.825531914893617)
['PANTRY ROLLING PIN'] => ['SET OF 3 RETROSPOT CAKE TINS'] (support=136, confidence=0.7431693989071039)
['PLASTERS IN TIN SPACEBOY'] => ['PLASTERS IN TIN WOODLAND ANIMALS'] (support=119, confidence=0.8415492957746478)

Association rule mining (Java)

import weka.associations.Apriori;
import weka.core.Instances;
import java.io.BufferedReader;
import java.io.FileReader;

public class AssociationRuleMining {

  public static void main(String[] args) throws Exception {
    // Load the transactions from an ARFF file
    BufferedReader reader = new BufferedReader(new FileReader("transactions.arff"));
    Instances data = new Instances(reader);
    reader.close();

    // Configure Apriori: -C sets the minimum confidence of a rule
    // (in Weka, -R is a flag that removes all-missing columns and takes no value)
    Apriori model = new Apriori();
    String[] options = {"-C", "0.5"};
    model.setOptions(options);
    model.buildAssociations(data);
    System.out.println(model);
  }
}

Classifier (Python)

Decision tree classifier, implemented using the sklearn library:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the decision tree classifier
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

# Report the classifier's performance
print("Accuracy: %f" % clf.score(X_test, y_test))
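A single accuracy figure hides which classes get confused with each other. The sketch below reuses the same split and puts the predictions to work through `accuracy_score` and `confusion_matrix` from `sklearn.metrics`. Setting `random_state=0` on the tree is an assumption added here for repeatable results.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# random_state is fixed here only so that repeated runs give the same tree
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Accuracy is the fraction of test samples predicted correctly
acc = accuracy_score(y_test, y_pred)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)

print("Accuracy:", acc)
print(cm)
```

The diagonal of the matrix counts correct predictions, so its sum divided by the number of test samples equals the accuracy.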

Naive Bayes (NB) classifier, implemented using the sklearn library:

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the naive Bayes classifier
clf = GaussianNB()
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

# Report the classifier's performance
print("Accuracy: %f" % clf.score(X_test, y_test))

Support vector machine (SVM) classifier, implemented using the sklearn library:

from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the support vector machine classifier
clf = SVC()
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

# Report the classifier's performance
print("Accuracy: %f" % clf.score(X_test, y_test))
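Since the three classifiers share the same fit/predict interface, they can be compared side by side. The following sketch uses 5-fold cross-validation rather than a single train/test split. The choice of 5 folds and the fixed `random_state` on the tree are assumptions made here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
}

mean_scores = {}
for name, model in models.items():
    # Each model is trained and scored on 5 different train/validation splits
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```

Averaging over several folds gives a less split-dependent estimate than the single 70/30 split used above.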

Regression Analysis (Python)

Definition

Regression analysis is a statistical method used to predict numerical values and to study the relationships between variables in a dataset.

Commonly used regression analysis methods include:

Linear regression
Polynomial regression
Ridge regression
Lasso regression

The following describes how to perform regression analysis in Python.

Linear regression

Linear regression is the most basic regression analysis method, used to analyze the linear relationship between independent variables and dependent variables.

Use the LinearRegression model from the scikit-learn library to implement linear regression.

First, prepare the dataset; here a synthetic one stands in for a task such as predicting student grades:

import numpy as np

# Generate the dataset
np.random.seed(0)
X = np.random.normal(0, 1, size=(100, 1))
y = 2*X[:, 0] + np.random.normal(0, 0.5, size=100)

Next, create a LinearRegression model and fit it to the data.

from sklearn.linear_model import LinearRegression

# Build the model
model = LinearRegression()

# Fit the model
model.fit(X, y)

After the model is fitted, you can use it to make predictions and calculate its goodness of fit (R-squared).

# Predict
y_pred = model.predict(X)

# Compute R-squared
r_squared = model.score(X, y)
print('R-square is:', r_squared)
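To show what the score measures, the sketch below recomputes R-squared from its definition, 1 - SS_res / SS_tot, and also prints the fitted slope and intercept. It regenerates the same synthetic data so that it runs on its own. The variable names `ss_res`, `ss_tot` and `r_squared_manual` are introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(0)
X = np.random.normal(0, 1, size=(100, 1))
y = 2*X[:, 0] + np.random.normal(0, 0.5, size=100)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# Residual sum of squares: variation the model fails to explain
ss_res = np.sum((y - y_pred) ** 2)
# Total sum of squares: variation of y around its mean
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared_manual = 1 - ss_res / ss_tot

print('slope:', model.coef_[0], 'intercept:', model.intercept_)
print('R-square (manual):', r_squared_manual)
print('R-square (score): ', model.score(X, y))
```

The two R-squared values should agree, and the fitted slope should land near the true value of 2 used to generate the data.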

Polynomial regression

Polynomial regression is an extension of linear regression that fits nonlinear relationships in the data by adding higher-order terms (such as quadratic terms, cubic terms, etc.).

Use the PolynomialFeatures class from the scikit-learn library to generate higher-order term features, and then fit them using a LinearRegression model.

First, prepare the dataset:

import numpy as np
import matplotlib.pyplot as plt

# Generate the dataset
np.random.seed(0)
X = np.linspace(-1,1,100)
y = np.sin(3*np.pi*X) + np.random.normal(0, 0.1, size=100)

Next, use the PolynomialFeatures class to generate quadratic term features and fit them using the LinearRegression model.

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Generate quadratic term features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X.reshape(-1,1))

# Build the model
model = LinearRegression()

# Fit the model
model.fit(X_poly, y)

After the model is fitted, you can use it to make predictions and draw its fitted curve.

# Predict (reuse the already fitted transformer; do not refit it on the test inputs)
X_test = np.linspace(-1,1,100)
X_test_poly = poly.transform(X_test.reshape(-1,1))
y_pred = model.predict(X_test_poly)

# Plot the fitted curve
plt.scatter(X, y, color='b')
plt.plot(X_test, y_pred, color='r')
plt.show()
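The sine curve in this dataset oscillates several times over the interval, so a quadratic is unlikely to follow it closely. The sketch below compares a few polynomial degrees on the same data using `make_pipeline`. The degrees 2, 5 and 9 are arbitrary choices made here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(0)
X = np.linspace(-1, 1, 100).reshape(-1, 1)
y = np.sin(3*np.pi*X[:, 0]) + np.random.normal(0, 0.1, size=100)

r2_by_degree = {}
for degree in (2, 5, 9):
    # The pipeline expands the features and fits the linear model in one step
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X, y)
    r2_by_degree[degree] = model.score(X, y)
    print(f"degree={degree}: training R-square = {r2_by_degree[degree]:.3f}")
```

Training R-squared can only rise as the degree grows, because each higher-degree model contains the lower-degree ones. A higher training score therefore does not by itself mean better predictions on new data.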

Ridge regression

Ridge regression is a regularization method based on linear regression.
It shrinks the feature coefficients by adding an L2 regularization term, which helps prevent overfitting caused by too many features.

Use the Ridge model from the scikit-learn library to implement ridge regression.

First, prepare the dataset:

import numpy as np

# Generate the dataset
np.random.seed(0)
X = np.random.normal(0, 1, size=(100, 10))
y = 2*X[:, 0] + 3*X[:, 1] + np.random.normal(0, 0.5, size=100)

Next, create a Ridge model and fit it to the data.

from sklearn.linear_model import Ridge

# Build the model
model = Ridge(alpha=1)

# Fit the model
model.fit(X, y)

After the model is fitted, you can use it to make predictions and calculate its goodness of fit (R-squared).

# Predict
y_pred = model.predict(X)

# Compute R-squared
r_squared = model.score(X, y)
print('R-square is:', r_squared)
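The strength of the L2 penalty is controlled by `alpha`. As a rough illustration of the shrinkage described above, the sketch below refits the model for several values of `alpha` and prints the length of the coefficient vector each time. The particular alpha values are arbitrary choices made here.

```python
import numpy as np
from sklearn.linear_model import Ridge

np.random.seed(0)
X = np.random.normal(0, 1, size=(100, 10))
y = 2*X[:, 0] + 3*X[:, 1] + np.random.normal(0, 0.5, size=100)

coef_norms = []
for alpha in (0.01, 1, 100, 10000):
    model = Ridge(alpha=alpha).fit(X, y)
    # Euclidean length of the coefficient vector
    coef_norms.append(np.linalg.norm(model.coef_))
    print(f"alpha={alpha}: ||coef|| = {coef_norms[-1]:.3f}, "
          f"R-square = {model.score(X, y):.3f}")
```

Larger values of `alpha` pull the coefficients toward zero, at the cost of a lower fit on the training data.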

Lasso regression

Lasso regression is a regularization method based on linear regression.
It shrinks the feature coefficients by adding an L1 regularization term, and can also perform feature selection by driving some coefficients exactly to zero.

Use the Lasso model from the scikit-learn library to implement lasso regression.

First, prepare the dataset:

import numpy as np

# Generate the dataset
np.random.seed(0)
X = np.random.normal(0, 1, size=(100, 10))
y = 2*X[:, 0] + 3*X[:, 1] + np.random.normal(0, 0.5, size=100)

Next, create a Lasso model and fit it to the data.

from sklearn.linear_model import Lasso

# Build the model
model = Lasso(alpha=0.1)

# Fit the model
model.fit(X, y)

After the model is fitted, you can use it to make predictions and calculate its goodness of fit (R-squared).

# Predict
y_pred = model.predict(X)

# Compute R-squared
r_squared = model.score(X, y)
print('R-square is:', r_squared)
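The feature-selection effect can be checked by inspecting the fitted coefficients. In this synthetic dataset only the first two of the ten features influence y, so the sketch below prints every coefficient and counts how many are exactly zero. The exact count depends on the random noise and is not asserted here.

```python
import numpy as np
from sklearn.linear_model import Lasso

np.random.seed(0)
X = np.random.normal(0, 1, size=(100, 10))
y = 2*X[:, 0] + 3*X[:, 1] + np.random.normal(0, 0.5, size=100)

model = Lasso(alpha=0.1).fit(X, y)

for i, coef in enumerate(model.coef_):
    print(f"feature {i}: coefficient = {coef:.3f}")

# Count coefficients that the L1 penalty has set exactly to zero
n_zero = int(np.sum(model.coef_ == 0))
print("coefficients set exactly to zero:", n_zero)
```

The coefficients for features 0 and 1 should stay clearly away from zero, slightly below the true values of 2 and 3 because of the shrinkage, while most or all of the remaining ones are removed.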

In practical applications, a suitable regression method can be selected based on the characteristics of the data, and the model's parameters can be tuned to improve its prediction accuracy.
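One common way to tune such a parameter is a cross-validated grid search. The sketch below uses scikit-learn's `GridSearchCV` to pick `alpha` for the ridge model on the same synthetic data. The candidate grid, the 5 folds and the R-squared scoring are assumptions made here for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

np.random.seed(0)
X = np.random.normal(0, 1, size=(100, 10))
y = 2*X[:, 0] + 3*X[:, 1] + np.random.normal(0, 0.5, size=100)

# Candidate regularization strengths (an arbitrary grid for illustration)
param_grid = {"alpha": [0.01, 0.1, 1, 10, 100]}

# Every alpha is scored by 5-fold cross-validated R-squared
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="r2")
search.fit(X, y)

print("best alpha:", search.best_params_["alpha"])
print("best cross-validated R-square:", search.best_score_)
```

The same pattern applies to Lasso or to the polynomial degree. Because the scores come from held-out folds, they reflect performance on unseen data rather than the training fit.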