Data Mining Course Design Report Summary

1. Experimental topic

Experiment 1 Design and Application of Apriori Algorithm

2. Background introduction

The Apriori algorithm is a classic algorithm for mining the frequent itemsets needed for association rules. Its core idea is to mine frequent itemsets in two stages: candidate set generation and downward-closure (pruning) testing.

3. Experimental content

1.3.1 Theoretical knowledge applied

Association rule mining is one of the most active research areas in data mining. It was originally motivated by the problem of market basket analysis, and its purpose is to discover association rules between different commodities in a transaction database. Using the minimum support given by the user, find all frequent itemsets, i.e., all itemsets whose support is not less than Minsupport; then, using the minimum confidence given by the user, find within each maximal frequent itemset the association rules whose confidence is not less than Minconfidence.
Support: the support is the proportion of transactions in the data set in which the associated items appear together, i.e., the probability that the items occur together.
Confidence: the confidence reflects the probability that one item appears given that another item has appeared, i.e., a conditional probability.
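
As a quick illustration of the two measures (separate from the experiment's code), they can be computed directly over a small transaction list; the transactions and item names below are made up for the example:

import java.util.*;

public class SupportConfidenceDemo {

    //fraction of transactions that contain every item of the itemset
    static double support(List<Set<String>> transactions, Set<String> itemset) {
        int hits = 0;
        for (Set<String> t : transactions) {
            if (t.containsAll(itemset)) hits++;
        }
        return (double) hits / transactions.size();
    }

    public static void main(String[] args) {
        //hypothetical transaction database (not the experiment's data set)
        List<Set<String>> transactions = new ArrayList<>();
        transactions.add(new HashSet<>(Arrays.asList("milk", "bread")));
        transactions.add(new HashSet<>(Arrays.asList("milk", "bread", "butter")));
        transactions.add(new HashSet<>(Arrays.asList("bread")));
        transactions.add(new HashSet<>(Arrays.asList("milk", "butter")));

        double supMilkBread = support(transactions, new HashSet<>(Arrays.asList("milk", "bread"))); //2 of 4 transactions = 0.5
        //confidence(milk ==> bread) = support({milk, bread}) / support({milk}) = 0.5 / 0.75 ≈ 0.67
        double confidence = supMilkBread / support(transactions, new HashSet<>(Arrays.asList("milk")));

        System.out.println("support({milk, bread}) = " + supMilkBread);
        System.out.println("confidence(milk ==> bread) = " + confidence);
    }
}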

1.3.2 Experimental principle

The basic idea of the Apriori algorithm:
First scan the data to be analyzed in the database. After setting the minimum support and minimum confidence, count the frequency of every itemset containing one element and keep those whose support is not less than the minimum support; this yields the 1-frequent itemsets. Next, self-join the frequent itemsets, and if any subset of a joined itemset is not itself frequent, prune it. The surviving itemsets become the new candidate sets, and the loop repeats, generating k-frequent itemsets until no more can be found.

1.3.3 Algorithm detailed design

(1) Define the data set, the minimum support count, the minimum confidence, and a Map collection that stores all frequent itemsets. Call the encapsulated initDataList() method to initialize the data.
(2) Call the getAllElement() method to obtain allElement, the array of all distinct elements contained in the data set, and combine the allElement array to obtain the candidate set candidateL1. Call the getItemSets() method to traverse candidateL1 and add the items whose occurrence count is not less than the minimum support count minSupportCount to itemSets. When the traversal completes, the 1-frequent itemsets L1 are obtained, and L1 is printed.
(3) Loop to find the k-frequent itemsets until the obtained k-frequent itemsets Lk are empty, then exit the loop. Inside the loop body, call the getCombination() method with allElementLast, the array of all distinct elements contained in the (k-1)-candidate set candidateLast, as a parameter to obtain the k-candidate set candidateNow; traverse the k-candidate set, add the items whose occurrence count is not less than the minimum support count to itemSets, and return them as the frequent itemsets Lk. If the k-frequent itemsets are empty, the loop ends; if Lk is not empty, add Lk to allItemSets, the collection of all frequent itemsets, which is convenient when looking for strong association rules later.
(4) Call the correlation() method to mine the strong association rules. Inside the method body, traverse all frequent itemsets; for each frequent itemset, obtain the set subSet of its non-empty proper subsets; traverse those subsets, and with each subset and the frequent itemset as parameters call the isConfidence() method (a sketch of this method is given below) to judge whether the confidence is not less than the minimum confidence; if so, output the rule: non-empty subset ==> complement of the non-empty subset.
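
The isConfidence() method itself is not among the key source code listed below; a minimal sketch consistent with the design might look like the following, where getCount() (a helper returning how many transactions contain all the given items) and the minConfidence field are assumptions:

/**
 * Hypothetical sketch -- the report does not list isConfidence().
 * confidence(subset ==> complement) = count(full itemset) / count(subset)
 */
public static boolean isConfidence(String[] subSetItem, String[] itemSet){
    double itemSetCount = getCount(itemSet);   //transactions containing the whole frequent itemset (getCount is an assumed helper)
    double subSetCount = getCount(subSetItem); //transactions containing the subset
    if (subSetCount == 0) return false;        //avoid division by zero
    return itemSetCount / subSetCount >= minConfidence; //minConfidence: assumed field holding the minimum confidence
}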

1.3.4 Key source code

/**
 * Get the candidate set
 * @param allItemStr array containing all elements of the frequent itemsets
 * @param k size of the k-candidate set to generate
 * @return the k-candidate set
 */
public static List<String[]> getCombination(String[] allItemStr, int k){
    //define the candidate set
    List<String[]> candidateSets = new ArrayList<>();
    //take all k-combinations of allItemStr
    IGenerator<List<String>> candidateList = Generator.combination(allItemStr).simple(k);
    for (List<String> candidate : candidateList) {
        String[] candidateStr = candidate.toArray(new String[candidate.size()]);
        candidateSets.add(candidateStr); //add each combination to the candidate set
    }
    return candidateSets;
}
/**
 * Process the candidate set to obtain the frequent itemsets
 * @param itemList the candidate set
 * @return the frequent itemsets
 */
public static Multimap<Integer, String[]> getItemSets(List<String[]> itemList){
    Multimap<Integer, String[]> itemSets = ArrayListMultimap.create(); //the frequent itemsets
    //get how many times each candidate item appears in the data set
    Multimap<Integer, String[]> itemCount = getItemCount(itemList);
    //iterate over the multimap
    Iterator<Map.Entry<Integer, String[]>> iterator = itemCount.entries().iterator();
    //add every candidate whose occurrence count is not less than minSupportCount to itemSets
    while (iterator.hasNext()){
        Map.Entry<Integer, String[]> entry = iterator.next();
        if (entry.getKey() >= minSupportCount){
            itemSets.put(entry.getKey(), entry.getValue());
        }
    }
    return itemSets;
}
/**
 * Find the strong association rules
 */
public static void correlation(){
    //iterate over all frequent itemsets
    for (int k = 1; k <= allItemSets.size(); k++) {
        //get the k-frequent itemsets
        Multimap<Integer, String[]> keyItemSet = allItemSets.get(k);
        Iterator<Map.Entry<Integer, String[]>> iterator = keyItemSet.entries().iterator();
        //iterate over the k-frequent itemsets
        while (iterator.hasNext()){
            Map.Entry<Integer, String[]> entry = iterator.next();
            String[] value = entry.getValue();
            //enumerate the non-empty proper subsets of value
            for (int i = 1; i < value.length; i++) {
                List<String[]> subSet = getCombination(value, i); //the set of non-empty subsets of size i
                for (String[] subSetItem : subSet) { //subSetItem is one non-empty subset of the frequent itemset
                    List<String> valueList = new ArrayList<>();
                    Collections.addAll(valueList, value);
                    List<String> subSetItemList = Arrays.asList(subSetItem);
                    //remove the subset's items; valueList now holds the complement of the non-empty subset
                    valueList.removeAll(subSetItemList);
                    if (isConfidence(subSetItem, value)){
                        System.out.println(Arrays.toString(subSetItem) + "==>" + Arrays.toString(valueList.toArray(new String[valueList.size()])));
                    }
                }
            }
        }
    }
}

4. Experimental results and analysis

1.4.1 Test code

public static void main(String[] args) {
    //initialize the data set
    initDataList();
    String[] allElement = getAllElement(dataList);
    //get the candidate set candidateL1 and the 1-frequent itemsets L1
    List<String[]> candidateL1 = getCombination(allElement, 1);
    Multimap<Integer, String[]> itemSetsL1 = getItemSets(candidateL1);
    allItemSets.put(1, itemSetsL1);
    printResult(itemSetsL1, 1);
    List<List<String[]>> stack = new ArrayList<>();
    stack.add(candidateL1);
    for (int k = 2; true; k++) {
        List<String[]> candidateLast = stack.get(0);
        String[] allElementLast = getAllElement(candidateLast);
        List<String[]> candidateNow = getCombination(allElementLast, k);
        Multimap<Integer, String[]> itemSetsLk = getItemSets(candidateNow);
        if (itemSetsLk.isEmpty()) break;
        allItemSets.put(k, itemSetsLk);
        printResult(itemSetsLk, k);
        stack.remove(0);
        stack.add(candidateNow);
    }
    correlation();
}

1.4.2 Test results


Since there are too many strong association rules, only some of them are shown, as in Figure 1-4-2.


Finally, among all the frequent itemsets, the largest is the 5-frequent itemset. Since I did not know whether the result was correct, I ran the program on a data set with a known answer, and the output matched. I feel that the minimum confidence and minimum support given in this experiment are too small, or that the items in the given data set overlap too much, resulting in too many strong association rules; the confidence of almost every candidate rule exceeds the minimum confidence.

5. Summary and experience

This experiment is implemented in Java. I think the implementation language is not important; what matters is understanding the basic idea and the principle of the algorithm. Apriori is one of the ten classic algorithms of data mining, and mastering its idea is very important. The implementation totals 260 lines; I think some of the code is still a bit redundant and could be streamlined.
What is notable about this implementation is that, when generating candidate sets, the author did not hand-write the usual combination algorithm. Instead, the combination() method from the combinatoricslib3 dependency is used: passing in the array of all distinct elements and the value k directly yields the k-combinations of the array. The implementation also uses the Multimap collection from Google's Guava dependency; a Multimap can store key-value pairs with duplicate keys, which is convenient for storing candidate itemsets together with the number of times they appear in the data set.

1. Experimental topic

Experiment 5 KNN Algorithm Design and Application

2. Background introduction

k-Nearest Neighbor (kNN) selects the k training samples closest to an input data point and assigns the point the category that occurs most often among those k neighbors (the majority voting rule).

3. Experimental content

2.3.1 Theoretical knowledge applied

Classification is a very important task in data mining. Its purpose is to learn a classification function or classification model (often called a classifier) that can map data items in a database to one of a set of given categories. Classification can be used for prediction: the purpose of prediction is to derive, automatically from historical data records, a description of the trend in the given data, so that future data can be forecast. A commonly used forecasting method in statistics is regression. Classification in data mining and regression in statistics are a pair of related but distinct concepts: in general, the output of classification is a discrete category value, while the output of regression is a continuous value.
Given a database D = {t1, t2, ..., tn} and a set of classes C = {C1, C2, ..., Cm}, for any tuple ti = {ti1, ti2, ..., tik} ∈ D, if there exists a Cj ∈ C such that sim(ti, Cj) ≥ sim(ti, Cp) for every Cp ∈ C with Cp ≠ Cj, then ti is assigned to class Cj, where sim(ti, Cj) is called the similarity.
In actual computation, similarity is often represented by distance: the closer the distance, the greater the similarity; the farther the distance, the smaller the similarity.
To compute the similarity, we first need a vector representing each class. There are many ways to obtain one; for example, the representative vector of each class can be its center. In pattern recognition, a predefined template represents each class, and classification compares the samples to be classified against the predefined templates.

2.3.2 Experimental principle

The basic idea of the KNN algorithm:
The idea of the KNN algorithm is relatively simple. Assume each class contains multiple training samples and each training sample carries a unique category label. The KNN algorithm computes the distance between every training sample and the tuple to be classified, takes the k training samples closest to that tuple, and assigns the tuple to whichever category holds the majority among those k samples.

2.3.3 Algorithm detailed design

(1) Define the student class Student, with attributes such as name, height, and rank; the @Data annotation from the lombok dependency generates the getters and setters of the Student class. Define the initial data set, create 14 Student entities, and add the 14 Student entities to the initial data set dataList (a sketch of this class and of initData() appears after this list).
(2) Call the initData() method to initialize the data set, define a Student object stuV0 and instantiate its name and height as the input, call the Knn() method to obtain the Student object student with its rank filled in, and print student.
(3) Inside the Knn() method body, the first 5 items of the data set are initially added to the categoryList collection; categoryList stores the k students closest to stuV0 and initially holds just those first 5 items. Traverse the data set dataList from item 6 onward: compute the distance v0Tod between stuV0 and the current item, and call the getCalculate() method to find stuU, the student in categoryList farthest from stuV0. If that distance uToV0 is greater than v0Tod, remove stuU from categoryList and add the current item of the data set to categoryList instead.
(4) Inside the getCalculate() method body, define the variable maxHeight to store the farthest distance between stuV0 and the neighbor collection categoryList, and define the Student object resultStu to store the student to be returned, i.e., the student in categoryList farthest from stuV0. Traverse categoryList: if the distance between stuU and stuV0 is greater than maxHeight, assign v0ToU to maxHeight and stuU to resultStu; finally return resultStu.
(5) Call the getCategoryStudent() method to find the rank accounting for the largest share of the students in categoryList, instantiate the rank attribute of stuV0 with it, and return stuV0.
(6) Inside the getCategoryStudent() method body, traverse categoryList, count which rank (tall, medium, or short) has the most students, and return the rank with the largest count.
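
The Student class and the initData() method are not listed in the report; a minimal sketch consistent with the design might look like this. The sample records are made up, while the real experiment uses 14 labeled students:

import lombok.Data;

import java.util.ArrayList;
import java.util.List;

@Data //lombok generates the getters, setters, toString(), etc.
class Student {
    private String name;
    private int height; //height in cm
    private String rank; //"高" (tall), "中等" (medium) or "矮" (short)

    public Student() {}
    public Student(String name, int height) {
        this.name = name;
        this.height = height;
    }
    public Student(String name, int height, String rank) {
        this.name = name;
        this.height = height;
        this.rank = rank;
    }
}

static List<Student> dataList = new ArrayList<>();

//hypothetical initialization -- the 14 labeled records of the real experiment are not shown in the report
static void initData() {
    dataList.add(new Student("甲", 185, "高"));
    dataList.add(new Student("乙", 172, "中等"));
    dataList.add(new Student("丙", 160, "矮"));
    //... 11 more labeled Student records in the actual experiment
}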

2.3.4 Key source code

/**
 * Run KNN on the input student, instantiate the student's rank, and return the student
 * @param stuV0 the student to classify
 * @return the student with its rank filled in
 */
public static Student Knn(Student stuV0){
    List<Student> categoryList = new ArrayList<>(); //the k students closest to stuV0; initially the first 5 items of the data set
    for (int i = 0; i < dataList.size(); i++) {
        if (i < 5) categoryList.add(dataList.get(i));
        else {
            //distance between stuV0 and the current item of the data set
            int v0Tod = Math.abs(stuV0.getHeight() - dataList.get(i).getHeight());
            Student stuU = getCalculate(stuV0, categoryList); //the student in categoryList farthest from stuV0
            int uToV0 = Math.abs(stuU.getHeight() - stuV0.getHeight());
            if (uToV0 > v0Tod){
                categoryList.remove(stuU); //remove stuU from the neighbor list
                categoryList.add(dataList.get(i));
            }
        }
    }
    System.out.println(categoryList.toString());
    String rank = getCategoryStudent(categoryList);
    stuV0.setRank(rank);
    return stuV0;
}
/**
 * Find the student in categoryList farthest from stuV0
 * @param stuV0 the student to classify
 * @param categoryList the current neighbor list
 * @return the farthest student
 */
public static Student getCalculate(Student stuV0, List<Student> categoryList) {
    int maxHeight = 0; //farthest distance between stuV0 and the neighbor collection categoryList
    Student resultStu = new Student(); //the student to return, i.e., the one in categoryList farthest from stuV0
    for (Student stuU : categoryList) {
        int v0ToU = Math.abs(stuV0.getHeight() - stuU.getHeight()); //distance between stuV0 and stuU
        if (v0ToU > maxHeight){ //the distance exceeds maxHeight, so update maxHeight and resultStu
            maxHeight = v0ToU;
            resultStu = stuU;
        }
    }
    return resultStu;
}
/**
 * Find the rank held by the largest share of the neighbor list
 * @param categoryList the neighbor list
 * @return the majority rank
 */
public static String getCategoryStudent(List<Student> categoryList){
    int tallCount = 0;
    int midCount = 0;
    int smallCount = 0;
    for (Student stuU : categoryList) {
        if (stuU.getRank().equals("高")) tallCount++; //"高" = tall
        else if (stuU.getRank().equals("中等")) midCount++; //"中等" = medium
        else smallCount++; //"矮" = short
    }
    //find the maximum of the three counts with the ternary operator
    int max = 0;
    max = tallCount > midCount ? tallCount : midCount;
    max = smallCount > max ? smallCount : max;
    if (smallCount == max) return "矮";
    else if (tallCount == max) return "高";
    else return "中等";
}

4. Experimental results and analysis

2.4.1 Test code

public static void main(String[] args) {
    initData();
    Student stuV0 = new Student("易昌", 174);
    Student student = Knn(stuV0);
    System.out.println(student.toString());
}

2.4.2 Test results

In addition to outputting the student's height rank required by the question, the author also outputs the set of nearest neighbors (the categoryList) of the input student, for comparison against the question to make sure the result is correct.
The student Yi Chang, whose height is 174, is finally assigned the height rank "medium"; all the students in his neighbor set are shown in Table 2-4-1 above. In the author's view, this data set and input are not ideal: the final neighbors all have the rank "medium", so Yi Chang is naturally "medium". If the final neighbor set instead contained 3 students of rank "medium" and 2 of rank "tall", the data set would be more representative and would better reflect the idea of the KNN algorithm.

5. Summary and experience

This experiment is implemented in Java. The KNN algorithm felt quite simple to implement, personally much simpler than the Apriori algorithm, though the author may have written the Apriori implementation in an overly complicated way. As one of the ten classic algorithms of data mining, KNN is very helpful for learning the classification ideas in data mining. The code totals 122 lines. This implementation also creates the Student class, exercising Java's object-oriented programming style.
Some problems were encountered during the implementation, for example in the getCategoryStudent() method. The method itself is not difficult; it just compares a few counts to find the maximum. The author first looked for a ready-made Java library method to do this, found nothing suitable after a long search, and wrote it by hand. One small point of interest: when finding the maximum of the three counts, the usual if statements are not used; the ternary operator is used instead. Because finding a maximum is simple, the ternary form stays readable here; for logic that is harder to follow, the author thinks plain if statements are the better choice for readability.
Through this experiment, the author gained a better understanding of the KNN classification algorithm, more fluency in Java, and a better grasp of Java's object-oriented programming ideas.

1. Experimental topic

Experiment 7 DBSCAN Algorithm Design and Application

2. Background introduction

DBSCAN algorithm: if the neighborhood of a point q contains at least MinPts objects, create a cluster with q as a core object. Then iteratively collect the objects that are directly density-reachable from these core objects, merging clusters that are density-reachable from one another. The process ends when no new point can be added to any cluster.

3. Experimental content

3.3.1 Theoretical knowledge applied

DBSCAN is a representative density-based clustering algorithm. Unlike partitioning and hierarchical clustering methods, it defines a cluster as a maximal set of density-connected points; it can divide regions of sufficiently high density into clusters and can find clusters of arbitrary shape in a "noisy" spatial database. Clustering groups data objects into multiple classes or clusters so that objects in the same cluster are highly similar while objects in different clusters differ markedly. Unlike classification, the classes to be formed are unknown in advance; their formation is entirely data-driven, making clustering an unsupervised learning method.
ε-neighborhood of an object: the area within radius ε of the given object.
Core object: if the ε-neighborhood of an object contains at least MinPts objects, the object is called a core object.
Directly density-reachable: given a set of objects D, if p is within the ε-neighborhood of q and q is a core object, we say that object p is directly density-reachable from object q.
Density-connected: if there exists an object o in D such that both p and q are density-reachable from o with respect to ε and MinPts, then p and q are density-connected with respect to ε and MinPts.

3.3.2 Experimental principle

The basic idea of the DBSCAN algorithm:
Take an unprocessed point from the database. If the point is a core point, find all objects density-reachable from it; together they form a cluster. If the point is an edge point (not a core object), skip it and look at the next point, until all points have been processed.

3.3.3 Detailed design of the algorithm

(1) Define the Point class, containing the attributes abscissa x and ordinate y, plus the static methods getIsSame(), which judges whether two Point objects are the same; calculateDistance(), which computes the Euclidean distance between two Point objects; and calculateMHDDistance(), which computes the Manhattan distance between two Point objects. Define the ExcelData class, containing the abscissa x (annotated @ExcelProperty(value="abscissa")) and the ordinate y (annotated @ExcelProperty(value="ordinate")); ExcelData is mainly used for reading the Excel file and mapping its rows to objects. Define the Cluster class, containing the attribute corePoint, the core point, and sameList, the collection of all points in the cluster (a sketch of Point and Cluster appears after this list).
(2) Define the initial data set dataList, the radius e, and MinPts, the minimum number of objects required in a core object's e-neighborhood, and call the getFileData() method to initialize the data set. Inside getFileData(), use EasyExcel to read the Excel file and map each row to an ExcelData object; use the x and y attributes of each ExcelData object as constructor arguments to instantiate a Point object, and add every point to the dataList collection to complete the initialization.
(3) Create a clusterList collection to store all clusters, then traverse each Point object point in dataList. Inside the loop body, first call the isExistCluster() method to judge whether point already belongs to some cluster; points already in a cluster are not considered again, the rest of the loop body is skipped, and the loop moves directly to the next point. Otherwise, call the getEPointList() method to obtain ePointList, the set of all points within the e-neighborhood of point. If the size of ePointList is not less than MinPts, point is a core object: instantiate a cluster with point as the core object and ePointList as its sameList, then call the canReachPoint() method to traverse the points directly density-reachable from the core object and merge all its density-reachable points, and add the resulting cluster newCluster to clusterList. The loop ends after every item of dataList has been visited.
(4) Traverse the clusterList collection and output each cluster in it.
(5) Inside the isExistCluster() method body, judge whether the point object already belongs to an existing cluster: traverse each cluster in the clusterList collection, get the cluster's sameList attribute, and return true if sameList contains point. After the traversal, return false.
(6) Inside the canReachPoint() method body, traverse all points contained in the cluster and judge, for every point other than the core object, whether it is itself a core object. If it is, all points in its neighborhood are density-reachable from the cluster's core point and can also be merged into the cluster, so they are added to the density-reachable point set reachPointList. When the loop ends, all points in reachPointList are added to the cluster's sameList, the cluster is re-instantiated, and the cluster is returned.
(7) The getEPointList() method returns the set of all points within the e-neighborhood of a point. Inside the method body, define the point set pointList to store all points in the e-neighborhood of point, traverse each point p in the data set dataList, call the static method calculateMHDDistance() of the Point class to compute the Manhattan distance between point and p, and store it in the variable ptoPoint. If ptoPoint is not greater than the radius e, the point p lies in the e-neighborhood of point and is added to the pointList collection; finally return pointList.
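
The Point and Cluster classes themselves are not listed in the report; a minimal sketch consistent with the methods called above might be the following (the ExcelData class and the EasyExcel reading code are omitted; field and method names follow the report):

import java.util.ArrayList;
import java.util.List;

class Point {
    private double x; //abscissa
    private double y; //ordinate

    Point(double x, double y) { this.x = x; this.y = y; }
    double getX() { return x; }
    double getY() { return y; }

    //whether two points have identical coordinates
    static boolean getIsSame(Point a, Point b) {
        return a.x == b.x && a.y == b.y;
    }
    //Euclidean distance
    static double calculateDistance(Point a, Point b) {
        return Math.sqrt(Math.pow(a.x - b.x, 2) + Math.pow(a.y - b.y, 2));
    }
    //Manhattan distance
    static double calculateMHDDistance(Point a, Point b) {
        return Math.abs(a.x - b.x) + Math.abs(a.y - b.y);
    }
}

class Cluster {
    private Point corePoint;                          //core object of the cluster
    private List<Point> sameList = new ArrayList<>(); //all points in the cluster

    Cluster(Point corePoint) { this.corePoint = corePoint; }
    Point getCorePoint() { return corePoint; }
    void setCorePoint(Point corePoint) { this.corePoint = corePoint; }
    List<Point> getSameList() { return sameList; }
    void setSameList(List<Point> sameList) { this.sameList = sameList; }
}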

3.3.4 Key source code

/**
 * Judge whether point already belongs to an existing cluster
 * @param point the point to check
 * @param clusterList the clusters found so far
 * @return true if some cluster already contains point
 */
public static boolean isExistCluster(Point point, List<Cluster> clusterList){
    for (Cluster cluster : clusterList) {
        List<Point> pointList = cluster.getSameList();
        if (pointList.contains(point)) return true;
    }
    return false;
}
/**
 * Traverse the points directly density-reachable from the core object and merge all density-reachable points
 * @param cluster the cluster seeded with a core object
 * @return the cluster after merging
 */
public static Cluster canReachPoint(Cluster cluster){
    List<Point> pointList = cluster.getSameList();
    List<Point> reachPointList = new ArrayList<>(); //all density-reachable points of the core point (staging list for points to be added)
    for (Point point : pointList) {
        Point corePoint = cluster.getCorePoint();
        if (Point.getIsSame(corePoint, point)) continue; //do not revisit the core object itself
        List<Point> reachList = getEPointList(point); //all points in the e-neighborhood of this directly density-reachable point
        if (reachList.size() >= MinPts){ //point is also a core object, so all points in its neighborhood can be merged into the cluster
            for (Point reachPoint : reachList) {
                if (pointList.contains(reachPoint)) continue; //do not re-add points already in pointList
                reachPointList.add(reachPoint); //stage the density-reachable point
            }
        }
    }
    pointList.addAll(reachPointList); //add all density-reachable points into the cluster
    cluster.setSameList(pointList);
    return cluster;
}
/**
 * Get the set of all points within the e-neighborhood of a point
 * @param point the center point
 * @return all points within distance e of point
 */
public static List<Point> getEPointList(Point point){
    List<Point> pointList = new ArrayList<>(); //all points in the e-neighborhood of point
    for (Point p : dataList) {
        double ptoPoint = Point.calculateMHDDistance(point, p);
        if (ptoPoint <= e) pointList.add(p); //p is in the e-neighborhood of point
    }
    return pointList;
}
4. Experimental results and analysis

3.4.1 Test code

public static void main(String[] args) {
    getFileData();
    //initDataList(); //alternative: the textbook example data set used for verification
    List<Cluster> clusterList = new ArrayList<>();
    for (Point point : dataList) {
        if (isExistCluster(point, clusterList)) continue; //points already in a cluster are not considered again
        List<Point> ePointList = getEPointList(point);
        if (ePointList.size() >= MinPts){ //point is a core object
            Cluster cluster = new Cluster(point);
            cluster.setSameList(ePointList);
            Cluster newCluster = canReachPoint(cluster);
            clusterList.add(newCluster);
        }
    }
    int pointSum = 0;
    for (Cluster cluster : clusterList) {
        System.out.println(cluster);
        pointSum += cluster.getSameList().size();
    }
    System.out.println(pointSum);
}

3.4.2 Test results


In fact, the author was quite confused after seeing the output: the number of points in the output clusters seemed too small, so the total was printed as well, and the result was 27 clustered points. This experiment provides 63 sample points, yet only 27 ended up in clusters; could the other 36 all be noise points? Worried that the code was wrong, the author tested the DBSCAN example from the book, and with the book's data set the result matched the book's answer. The author then checked the 63 sample data points and found that some were duplicates; a unit test counting the distinct points among the 63 samples output only 36, that is, 27 of the 63 sample points are exact duplicates of others.
The final result contains 8 clusters in total, holding 27 points. Looking at the final output, some points appear in more than one cluster. The author wondered whether the code was at fault, but found no problem after several rounds of debugging; more likely the data set is a bit odd, and the chosen radius e and the neighborhood threshold MinPts do not suit it very well.

5. Summary and experience

This experiment is implemented in Java. The DBSCAN algorithm turned out to be a paper tiger. This experiment was actually written last: with 63 sample points, creating entity objects one by one, as in the earlier experiments, and adding them to the data set collection was unrealistic, so the sample point data had to be stored as txt text or an Excel file and read into the dataList data set by code. Because the data needed this preprocessing, the author saved the experiment for last; and to leave more time for thinking through the DBSCAN implementation, the example from the book was used first (its initial data set has fewer sample points, and it comes with an answer).
During the implementation the author found the code idea less difficult than expected, but still hit some problems. For example, the initial implementation of the canReachPoint() method had a bug that only surfaced as a runtime error: while traversing the pointList collection, density-reachable points were added directly to pointList, changing the collection's length in the middle of the for-each loop (in Java, modifying an ArrayList while iterating it this way throws a ConcurrentModificationException). After running the debugger and finding the bug, the author defined the reachPointList collection before the loop to stage all the density-reachable points of the core points, and only after the loop over pointList finished called the list's addAll() method to add everything in reachPointList into pointList. This neatly sidesteps the bug, and the code runs successfully.
Through this experiment, the author gained a better understanding of how to implement the DBSCAN algorithm and became more proficient with Java collections.

1. Experimental topic

Experiment 8 K-means Algorithm Design and Application

2. Background introduction

The k-means algorithm, also known as the k-means or k-average method, is one of the most widely used clustering algorithms; its similarity computation is based on the mean value of the objects in a cluster. The algorithm first randomly selects k objects, each initially representing the mean or center of one cluster. Each remaining object is assigned to the nearest cluster according to its distance from each cluster center, and the mean of each cluster is then recomputed. This process repeats until the criterion function converges.

3. Experimental content

4.3.1 Theoretical knowledge applied

Clustering groups data objects into multiple classes or clusters so that objects in the same cluster are highly similar while objects in different clusters differ markedly. Unlike classification, the classes to be formed are unknown in advance; their formation is entirely data-driven, making clustering an unsupervised learning method.
Cluster analysis originates in many research fields, including data mining, statistics, machine learning, and pattern recognition. It is one function of data mining, but it can also serve as a standalone tool for understanding data distribution, summarizing the characteristics of each cluster, or focusing further analysis on specific clusters. Furthermore, cluster analysis can serve as a preprocessing step for other algorithms (such as association rules or classification) that then operate on the resulting clusters.
Clustering: the process of classifying and organizing data members that are similar in some respect; it is a technique for discovering this internal structure, and clustering techniques are often referred to as unsupervised learning.
K-means clustering: K-means is the best-known partitioning clustering algorithm, and thanks to its simplicity and efficiency it is the most widely used of all clustering algorithms. Given a set of data points and a required number of clusters k (specified by the user), the k-means algorithm repeatedly divides the data into k clusters according to some distance function.

4.3.2 Experimental principle

The basic idea of the K-means clustering algorithm:
The K-means clustering algorithm first randomly selects K objects as the initial cluster centers. It then computes the distance between each object and each cluster center and assigns every object to the nearest center; a center together with the objects assigned to it represents one cluster. Once all objects have been assigned, each cluster's center is recomputed from the objects currently in the cluster. The process repeats until a termination condition is met: no (or a minimum number of) objects are reassigned to different clusters, no (or a minimum number of) cluster centers change, or the sum of squared errors reaches a local minimum.

4.3.3 Algorithm detailed design

(1) Define the Point class, containing the attributes abscissa x and ordinate y, plus the static methods getIsSame(), which judges whether two Point objects are the same; calculateDistance(), which computes the Euclidean distance between two Point objects; and calculateMHDDistance(), which computes the Manhattan distance. Define the Cluster class, containing the attribute corePoint, the cluster center, and sameList, the collection of all points in the cluster.
(2) Define the initial data set dataList and the number of clusters k, call the initDataList() method to initialize the data set, and call the getInitCluster() method to initialize the clusters. The main function of getInitCluster() is to select arbitrary k objects as initial cluster centers and return a collection containing k clusters. Inside the getInitCluster() method body, define the clusterList collection to store the k clusters, call the getRandomArray() method to obtain an array randomArray of k distinct random numbers, the indices of k objects in the data set (a sketch of getRandomArray() appears after this list), traverse randomArray to take out the k Point objects at those indices as the core objects of the corresponding clusters, add each cluster defined and instantiated in the loop to clusterList, and finally return the clusterList collection.
(3) Enter the while loop: traverse each point in the data set dataList, call the getBelongCluster() method to obtain the index in clusterList of the cluster the point belongs to, take out the cluster at that index in clusterList, and add the point to the cluster's sameList. After traversing the data set, call the calculateClusterCore() method to compute the new cluster centers and determine whether the point set of any cluster changed. If nothing changed, break out of the while loop, since the K-means clustering has converged; otherwise start the next iteration of the while loop. At the start of each iteration, before traversing the data set, the sameList collection of every cluster in clusterList must be cleared.
(4) Traverse the clusterList collection and output each cluster in it.
(5) The main function of the getBelongCluster() method is to obtain the index of the cluster a given point belongs to. Inside the method body, define the variables closestDistance and resultClusterIndex to store, respectively, the smallest distance from the point to a cluster center and the index of the cluster the point belongs to. Traverse the cluster collection clusterList, call the static method calculateDistance() of the Point class to compute the distance between the point and the cluster's center and assign it to distance, and assign the distance obtained in the first iteration to closestDistance. In subsequent iterations, whenever distance is smaller than closestDistance, assign distance to closestDistance and the current index to resultClusterIndex. When the traversal ends, return resultClusterIndex.
(6) The main function of the calculateClusterCore() method is to compute the new cluster centers and return whether any cluster's point set changed. Define the flag variable flag inside the method body, then traverse each cluster in the clusterList collection; define the variables sumX and sumY to accumulate the x coordinates and the y coordinates of all points in the cluster, compute the averages of sumX and sumY to build the new cluster center clusterCore, and call the static method getIsSame() of the Point class to judge whether clusterCore equals the original cluster center; if not, set flag to true. After traversing the cluster collection, return the flag value. Note that the formal parameter here is a List collection, which is passed by its address: modifying the collection in the method body changes the caller's collection as well.
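
The getRandomArray() method is not listed in the report, but section 4.4.2 describes it: k random indices between 0 and 12 are drawn, with a while loop retrying until the new index is not already among those chosen. A sketch under those assumptions (k and dataList are the class fields described above):

/**
 * Hypothetical sketch: pick k distinct random indices into dataList
 */
public static int[] getRandomArray(){
    java.util.Random random = new java.util.Random();
    List<Integer> chosen = new ArrayList<>(); //indices already taken
    while (chosen.size() < k) {
        int index = random.nextInt(dataList.size()); //0 .. dataList.size()-1
        if (!chosen.contains(index)) chosen.add(index); //retry until the index is new
    }
    int[] randomArray = new int[k];
    for (int i = 0; i < k; i++) randomArray[i] = chosen.get(i);
    return randomArray;
}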

4.3.4 Key source code

/**
 * Compute the new cluster centers and return whether any cluster's point set changed
 * @param clusterList the list of clusters
 * @return true if at least one cluster center changed
 */
public static boolean calculateClusterCore(List<Cluster> clusterList){
    boolean flag = false;
    //traverse every cluster in the collection and update its center
    for (Cluster cluster : clusterList) {
        List<Point> sameList = cluster.getSameList();
        double sumX = 0; //sum of the X coordinates of all points in the cluster
        double sumY = 0; //sum of the Y coordinates of all points in the cluster
        for (Point point : sameList) {
            sumX += point.getX();
            sumY += point.getY();
        }
        //update the center of the cluster
        Point clusterCore = new Point(sumX * 1.0 / sameList.size(), sumY * 1.0 / sameList.size());
        if (!Point.getIsSame(clusterCore, cluster.getCorePoint())) flag = true;
        cluster.setCorePoint(clusterCore);
    }
    return flag;
}
/**
 * Get the index of the cluster a given point belongs to
 * @param point the point to assign
 * @param clusterList the list of clusters
 * @return the index in clusterList of the nearest cluster
 */
public static int getBelongCluster(Point point, List<Cluster> clusterList){
    double closestDistance = 0.0; //smallest distance from point to a cluster center
    int resultClusterIndex = 0; //index of the cluster point belongs to
    int index = 0;
    //traverse the clusters, compute the distance from point to each center, and find the cluster point belongs to
    for (Cluster cluster : clusterList) {
        double distance = Point.calculateDistance(point, cluster.getCorePoint());
        if (index == 0) closestDistance = distance;
        if (distance < closestDistance){
            closestDistance = distance;
            resultClusterIndex = index;
        }
        index++;
    }
    return resultClusterIndex;
}
/**
 * Take arbitrary k objects as initial cluster centers and return the collection of k clusters
 * @return the initial clusters
 */
public static List<Cluster> getInitCluster(){
    List<Cluster> clusterList = new ArrayList<>();
    int[] randomArray = getRandomArray();
    //select arbitrary k objects as initial cluster centers; their indices in the data set are stored in randomArray
    for (int i = 0; i < randomArray.length; i++) {
        Point point = dataList.get(randomArray[i]);
        Cluster cluster = new Cluster(point);
        clusterList.add(cluster);
    }
    return clusterList;
}

4. Experimental results and analysis

4.4.1 Test code

public static void main(String[] args) {
    //initialize the data set and the initial clusters
    initDataList();
    List<Cluster> clusterList = getInitCluster();
    while(true){
        //clear every cluster's point set before reassigning the points
        for (int j = 0; j < k; j++) {
            clusterList.get(j).getSameList().clear();
        }
        for (Point point : dataList) {
            int index = getBelongCluster(point, clusterList); //index in clusterList of the cluster point belongs to
            clusterList.get(index).getSameList().add(point); //add point to the corresponding cluster in clusterList
        }
        if (!calculateClusterCore(clusterList)) break;
    }
    for (Cluster cluster : clusterList) {
        System.out.println(cluster);
    }
}

4.4.2 Test results


The output is, unsurprisingly, 3 clusters, since the value of k is 3. The author ran the test several times and found that the results vary across runs. Upon reflection, the author considers this normal: the cluster centers are selected randomly at the start, and centers that happen to lie too close together or too spread out affect the final output. Across many runs, one set of results appeared most frequently; it is shown in Figure 4-4-1.
The final result looks normal. To initialize the set of clusters, the author randomly selects 3 indices between 0 and 12; to avoid picking duplicates, a while loop retries until the new index is not already in the chosen set. After several rounds of debugging, no bugs have been found so far; as for the differing results across runs, the author considers that normal.

5. Summary and experience

This experiment is implemented in Java; the K-means clustering algorithm is one of the ten classic algorithms of data mining. Some problems were encountered during the implementation. When using a List collection as a method parameter, the author originally planned to keep oldClusterList as a snapshot of clusterList before the call and newClusterList as a snapshot after it, having overlooked at first that a List parameter passes the collection's address: changing the collection inside the method body changes the caller's collection too. The author discovered this during later debugging and corrected it.
Through this experiment, the author gained a better understanding of the K-means clustering algorithm. The one thing that would improve this experiment most is visualizing the point sets, which would make the clustering results much easier to see; for lack of time this was not done, and it may be added later when time and energy permit.

1. Experimental topic

Experiment 9 PageRank Algorithm Design and Application

2. Background introduction

PageRank algorithm: compute a PageRank value for each web page, then rank the pages by importance according to that value.

3. Experimental content

5.3.1 Theoretical knowledge applied

From the user's point of view, a website is a collection of pages; for the site's designer, however, those pages are carefully organized and connected into a whole through links. Web structure mining therefore mainly aims to discover the link structure among the pages of a site. For example, when designing services such as search engines, mining the link structure of Web pages yields useful knowledge that improves retrieval efficiency and quality.
In general, links between Web pages resemble academic citations: an important page tends to have many links pointing to it. That is, if many links point to a page, it is probably an important page; likewise, a page that links to many pages also has significant value.
Suppose u is a Web page, Bu is the set of all pages pointing to u, Fv is the set of all pages that page v points to, and c (< 1) is a normalization factor. The rank R(u) of page u is then defined as R(u) = c * Σ(v∈Bu) R(v)/|Fv|. Clearly, this basic page ranking method mainly considers a page's in-links: a page's rank is derived from the ranks of the pages pointing to it. At the same time, when a page passes on its rank value, it distributes it evenly to all the pages it points to, i.e., every page it links to receives an equal share of its rank value.
The core of the PageRank algorithm can be described in terms of a directed graph; the most typical approach builds an adjacency matrix from the graph. An element ai,j ∈ [0,1] of the adjacency matrix A = (ai,j) represents the probability of going from page j to page i; for instance, if page j links to exactly two pages, each of those two pages gets the entry 1/2 in column j, and every other entry in that column is 0.

5.3.2 Experimental principle

The basic idea of the PageRank algorithm:
When the basic PageRank algorithm computes rank values, each page distributes its own rank value evenly among the pages it links to. If a page has rank value 1 and contains n hyperlinks, each linked page is assigned 1/n; this can be read as the page jumping, with equal probability, to any one of the pages it links to.
Generally, the adjacency matrix A is converted into the so-called transition probability matrix M to implement the PageRank algorithm: M = (1-d)*Q + d*A, where Q is a constant matrix, most commonly Q = (qi,j) with qi,j = 1/n. The transition probability matrix M can then be used as a vector transformation to drive the iterative computation of the page rank vector R: R(i+1) = M * R(i).
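
As a concrete illustration of the two formulas above (separate from the experiment's fraction-based implementation below), one iteration step can be written with plain doubles. The 3-page link graph, the value d = 0.85, and the class name here are all made up for the example:

public class PageRankStep {
    public static void main(String[] args) {
        double d = 0.85; //damping factor, an assumed value for this sketch
        int n = 3;
        //adjacency matrix A: a[i][j] = probability of going from page j to page i
        //made-up graph: page 0 links to pages 1 and 2; page 1 links to page 0; page 2 links to pages 0 and 1
        double[][] a = {
            {0.0, 1.0, 0.5},
            {0.5, 0.0, 0.5},
            {0.5, 0.0, 0.0}
        };
        //transition matrix M = (1-d)*Q + d*A, with q[i][j] = 1/n
        double[][] m = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                m[i][j] = (1 - d) / n + d * a[i][j];
        //one iteration R(k+1) = M * R(k), starting from the uniform vector
        double[] r = {1.0 / n, 1.0 / n, 1.0 / n};
        double[] next = new double[n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                next[i] += m[i][j] * r[j];
        System.out.println(java.util.Arrays.toString(next)); //repeated until the vector stops changing
    }
}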

5.3.3 Algorithm detailed design

(1) Define the fraction class Score, containing the numerator attribute son and the denominator attribute mom, together with the static method simplify() for reducing a fraction, the static method getAdd() for adding two fractions and simplifying the result, and the static method getMultiply() for multiplying two fractions and calling simplify() on the result (a sketch of this class appears after this list).
(2) Define the initial data set dataList, define each Score object of the transition matrix, and call the initData() method to initialize the data set.
(3) Define the Score object v0Score and instantiate it; v0Score represents the value 1/4, meaning that users are equally likely, 1/4 each, to visit any of the four pages A, B, C, and D. Define the one-dimensional array V0 holding four v0Score entries; the initial PageRank vector is also assigned V0.
(4) Enter the while loop and call the getPageRank() method to obtain the next PageRank vector. Inside the getPageRank() method body, define the pageRankList collection to store the resulting PageRank values, traverse each row dataItem of the dataList data set, and use a for loop over dataItem so that dataItem and Vk complete a matrix-vector multiplication, accumulating the product of each pair of entries into itemSum. After traversing dataItem, add itemSum to the pageRankList collection. After all rows of the dataList collection have been processed, convert pageRankList into an array and return it.
(5) Call the isRankEqual() method to judge whether the newly computed PageRank vector equals the previous one; if not, keep iterating, and if they are equal, stop iterating. isRankEqual() compares the values in the PageRank vectors of the formal parameters V1 and V2 and returns a boolean: the method body traverses the array V1 and compares each entry's numerator and denominator with those of the corresponding entry of V2, returning false on any mismatch and true if no difference is found after the traversal.
(6) While the newly computed PageRank vector differs from the previous one, assign the new vector to the pageRank variable used for the final output, and print the PageRank vector obtained in each iteration.
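
The Score class itself is not listed in the report; a minimal sketch consistent with the operations used below could look like the following. The field and method names follow the report, while the gcd-based reduction and the handling of the Score(0,0) accumulator seeded in getPageRank() are assumptions:

class Score {
    private int son; //numerator
    private int mom; //denominator

    Score(int son, int mom) { this.son = son; this.mom = mom; }
    int getSon() { return son; }
    int getMom() { return mom; }

    //reduce the fraction by the greatest common divisor (assumed implementation)
    static Score simplify(Score s) {
        if (s.son == 0 || s.mom == 0) return s;
        int g = gcd(Math.abs(s.son), Math.abs(s.mom));
        return new Score(s.son / g, s.mom / g);
    }
    //add two fractions and simplify the result
    static Score getAdd(Score a, Score b) {
        if (a.mom == 0) return b; //getPageRank() seeds the accumulator with Score(0,0)
        if (b.mom == 0) return a;
        return simplify(new Score(a.son * b.mom + b.son * a.mom, a.mom * b.mom));
    }
    //multiply two fractions and simplify the result
    static Score getMultiply(Score a, Score b) {
        return simplify(new Score(a.son * b.son, a.mom * b.mom));
    }
    private static int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

    @Override
    public String toString() { return son + "/" + mom; }
}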

5.3.4 Key source code

/**
 * Judge whether the values in the PageRank vectors V1 and V2 are equal
 * @param V1 the previous PageRank vector
 * @param V2 the new PageRank vector
 * @return true if every entry matches
 */
public static boolean isRankEqual(Score[] V1, Score[] V2){
    for (int i = 0; i < V1.length; i++) {
        int subSon = V1[i].getSon() - V2[i].getSon();
        int subMom = V1[i].getMom() - V2[i].getMom();
        if (subSon != 0 || subMom != 0) return false;
    }
    return true;
}
/**
 * Compute the next PageRank vector
 * @param Vk the current PageRank vector
 * @return the next PageRank vector, M * Vk
 */
public static Score[] getPageRank(Score[] Vk){
    List<Score> pageRankList = new ArrayList<>();
    for (Score[] dataItem : dataList) {
        Score itemSum = new Score(0,0); //itemSum holds one entry of the PageRank vector
        //matrix-vector multiplication: each row of the data set times the column vector Vk
        for (int i = 0; i < dataItem.length; i++) {
            Score multiply = Score.getMultiply(dataItem[i], Vk[i]);
            itemSum = Score.getAdd(multiply, itemSum); //accumulate the product of corresponding entries into itemSum
        }
        pageRankList.add(itemSum);
    }
    return pageRankList.toArray(new Score[pageRankList.size()]);
}

4. Experimental results and analysis

5.4.1 Test code

public static void main(String[] args) {
    initData();
    Score v0Score = new Score(1, 4);
    Score[] V0 = {v0Score, v0Score, v0Score, v0Score};
    Score[] pageRank = V0;
    while (true){
        Score[] tmpVk = getPageRank(pageRank);
        if (isRankEqual(pageRank, tmpVk)) break; //keep iterating while the new PageRank vector differs from the previous one; stop once they match
        pageRank = tmpVk;
        System.out.println(Arrays.toString(pageRank));
    }
    System.out.println(Arrays.toString(pageRank));
}

5.4.2 Test results

The author still has some doubts about this iteration result. The book says the iteration stabilizes at the final result [3/9, 2/9, 2/9, 2/9], but this experiment never reached a satisfying stable vector; instead the program terminated because of data overflow in the fraction arithmetic. Still, the last printed result is very close to the correct answer, as are the results of every iteration. The cause remains to be found.
The final result is indeed very close to the book's answer [3/9, 2/9, 2/9, 2/9]. The author has not found any problem in the code so far; perhaps the book's answer is itself a rough estimate settled on as the stable value. The report will be updated promptly if anything new is found.

5. Summary and experience

This experiment is implemented in Java. The PageRank algorithm was the second experiment the author wrote; the algorithm felt thoroughly familiar, since the teacher stressed on the eve of the exam that it was the most important algorithm and the author reviewed it many times during final revision, so it was a natural one to start with. The implementation of the PageRank algorithm is relatively simple. Worth mentioning are the fraction class Score and the matrix multiplication: to keep the computation easy to follow, Score also defines static methods for fraction addition and fraction multiplication, convenient to call directly inside the algorithm.
Some difficulties were encountered along the way; for example, the final output never reached the expected [3/9, 2/9, 2/9, 2/9], and the author has verified the code many times without finding a problem so far.
Through this experiment, the author gained a better grasp of Java's object-oriented programming ideas and became more familiar with the ideas behind the PageRank algorithm.
