Top 10 algorithms for data mining

Classification algorithms: C4.5, Naive Bayes, SVM, KNN, AdaBoost, CART. Clustering algorithms: K-Means, EM. Association analysis: Apriori. Link analysis: PageRank


Preface

ICDM (the IEEE International Conference on Data Mining), an authoritative international academic organization, selected ten classic data mining algorithms. By purpose, these algorithms can be divided into four categories.


Classification algorithms: C4.5, Naive Bayes, SVM, KNN, AdaBoost, CART
Clustering algorithms: K-Means, EM
Association analysis: Apriori
Link analysis: PageRank

One, C4.5 algorithm

C4.5 is a decision tree algorithm. It creatively prunes branches while the tree is being built, and it can handle both continuous attributes and incomplete data. It can fairly be called a landmark algorithm in decision tree classification.
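
As a quick illustration, here is a minimal decision tree sketch in Python. One caveat: scikit-learn (assumed here; the original article does not use it for this) implements an optimized CART rather than C4.5 itself, but criterion='entropy' uses the same information-based splitting idea that C4.5 popularized, and max_depth stands in for pruning:

# A decision tree sketch; scikit-learn's tree is CART-based, not C4.5,
# but entropy-based splitting follows the same information-gain idea
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth acts as a simple stand-in for pruning
tree = DecisionTreeClassifier(criterion='entropy', max_depth=3)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))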

Two, SVM algorithm

SVM is short for Support Vector Machine. During training, an SVM builds a hyperplane model that separates the classes.
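
A minimal sketch of training such a hyperplane classifier, assuming scikit-learn and its built-in iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# kernel='linear' fits a separating hyperplane directly; other kernels
# (e.g. 'rbf') implicitly map the data to a higher-dimensional space first
clf = SVC(kernel='linear')
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))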

Three, KNN algorithm

KNN is also called the K-nearest-neighbors algorithm (K-Nearest Neighbor). The idea is that each sample can be represented by the K samples closest to it: if most of a sample's K nearest neighbors belong to category A, then the sample is also assigned to category A.
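
The same idea in code, again a sketch assuming scikit-learn:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each test sample gets the majority class among its 5 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))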

Four, AdaBoost algorithm

AdaBoost builds a combined classification model during training. "Boost" means to strengthen, so AdaBoost is a boosting algorithm for building classifiers: it lets us combine multiple weak classifiers into one strong classifier, which makes AdaBoost a commonly used classification algorithm.
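
A minimal sketch of this weak-to-strong combination, assuming scikit-learn:

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# By default each weak learner is a one-level decision tree ("stump");
# 50 of them are weighted and combined into one strong classifier
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
ada.fit(X_train, y_train)
print(ada.score(X_test, y_test))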

Five, CART algorithm

CART stands for Classification and Regression Trees. As the name says, it builds two kinds of trees: classification trees and regression trees. Like C4.5, it is a decision tree learning method.
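
A sketch showing both kinds of trees, assuming scikit-learn (whose tree module is CART-based) and two of its built-in datasets:

from sklearn.datasets import load_diabetes, load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: CART splits on Gini impurity by default
X_c, y_c = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion='gini', max_depth=3).fit(X_c, y_c)

# Regression tree: predicts a continuous target by minimizing squared error
X_r, y_r = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(max_depth=3).fit(X_r, y_r)

print(clf.score(X_c, y_c), reg.score(X_r, y_r))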

Six, Apriori algorithm

Apriori is an algorithm for mining association rules. It reveals associations between items by mining frequent itemsets, and it is widely used in business analysis and network security. A frequent itemset is a set of items that often appear together, and an association rule suggests that there may be a strong relationship between two items.
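
To make "frequent itemset" concrete, here is a brute-force sketch of the level-by-level counting step in plain Python. The transactions and the min_support threshold are made-up examples, and a real Apriori implementation would additionally prune candidates whose subsets are already infrequent:

from itertools import combinations

# Made-up shopping-basket transactions (illustrative only)
transactions = [{'milk', 'bread'}, {'milk', 'diaper', 'beer'},
                {'milk', 'bread', 'diaper'}, {'bread', 'diaper', 'beer'}]
min_support = 2  # minimum number of transactions an itemset must appear in

def support(itemset):
    return sum(itemset <= t for t in transactions)

# Level 1: frequent single items
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support({i}) >= min_support}]

# Level k: join frequent (k-1)-itemsets and keep those with enough support
k = 2
while frequent[-1]:
    candidates = {a | b for a in frequent[-1] for b in frequent[-1]
                  if len(a | b) == k}
    frequent.append({c for c in candidates if support(c) >= min_support})
    k += 1

for level in frequent:
    for itemset in level:
        print(set(itemset), support(itemset))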

Seven, K-Means algorithm

K-Means is a clustering algorithm that divides objects into K categories. Assume that each category has a "center point", which is the core of that category. When a new point needs to be classified, you only have to compute the distance between the new point and the K center points: the new point joins the category whose center point is closest.
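
A short sketch, assuming scikit-learn and some made-up two-dimensional points scattered around two centers:

import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points around two made-up centers (illustrative data)
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + [0, 0], rng.randn(50, 2) + [5, 5]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)

# A new point is assigned to whichever center is closest
print(km.predict([[4.5, 5.2]]))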

Eight, Naive Bayes algorithm

The Naive Bayes model is based on probability theory. Its idea is as follows: for an unknown object to be classified, compute the probability of each category given the features of the object, and assign the object to the category whose probability is largest.
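
A minimal sketch, assuming scikit-learn's Gaussian variant of Naive Bayes:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# For each class, estimate the prior P(class) and per-feature likelihoods,
# then predict the class with the highest posterior probability
nb = GaussianNB().fit(X_train, y_train)
print(nb.score(X_test, y_test))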

Nine, EM algorithm

The EM algorithm, also called the expectation-maximization algorithm, is a method for obtaining maximum likelihood estimates of parameters. The principle is this: suppose we want to estimate parameters A and B, both unknown at the start, but knowing A lets us derive B and, conversely, knowing B lets us derive A. We can first give A some initial value to get an estimate of B, then use that estimate of B to re-estimate A, and repeat this process until it converges. The EM algorithm is often used in clustering and other areas of machine learning.
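
One common concrete use of EM is fitting a Gaussian mixture model. A sketch with made-up one-dimensional data, assuming scikit-learn, whose GaussianMixture runs EM internally:

import numpy as np
from sklearn.mixture import GaussianMixture

# Two made-up Gaussian blobs; GaussianMixture recovers their parameters
# by alternating E-steps and M-steps until convergence
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 1) + 0, rng.randn(100, 1) + 5])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gm.means_)       # estimated component means
print(gm.converged_)   # True once the EM iterations have converged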

Ten, PageRank algorithm

PageRank originated from a method of measuring the influence of academic papers: the more often a paper is cited, the stronger its influence. Google creatively applied the same idea to computing the weight of web pages: the more pages that link to a page, the more "citations" it has, and the more often it is linked, the higher its weight. Based on this principle, we can derive a ranking of web pages.
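
A minimal power-iteration sketch in plain NumPy over a tiny made-up four-page link graph; the damping factor 0.85 is the conventional choice:

import numpy as np

# links[i] lists the pages that page i links to (made-up graph)
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n, d = 4, 0.85  # number of pages and the usual damping factor

# Column-stochastic transition matrix: M[j, i] = 1/outdegree(i) if i links to j
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = (1 - d) / n + d * M.dot(rank)
print(rank)  # pages with more (and stronger) in-links score higher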

Usage steps

1. Import the libraries

The code is as follows (example):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress library warnings to keep the output clean
import warnings
warnings.filterwarnings('ignore')

# Skip HTTPS certificate verification so the dataset below can be downloaded
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

2. Read in the data

The code is as follows (example):

# Load the Adult (census income) dataset directly from its URL
data = pd.read_csv(
    'https://labfile.oss.aliyuncs.com/courses/1283/adult.data.csv')
print(data.head())

The data used here is requested over the network from the URL above.


Summary

The above is what we talked about today. Beyond the ten classic algorithms, this article only briefly introduced the use of pandas; pandas provides a large number of functions and methods that let us process data quickly and conveniently.
