Data Mining Experiment-Principal Component Analysis and Class Characterization

Dataset & code: https://www.aliyundrive.com/s/ibeJivEcqhm

1. Principal component analysis

1. Experimental purpose

  • Understand the purpose, content and process of principal component analysis.

  • Master principal component analysis and be able to implement it in a program.

2. Experimental principle

The purpose of principal component analysis

Principal component analysis transforms the original multiple indicators into a few representative comprehensive indicators. These few indicators reflect most of the information in the original indicators (more than 85%) while remaining mutually independent, which avoids overlapping information. Principal component analysis mainly serves to reduce dimensionality and simplify the data structure.

Mathematical model of principal component analysis

Assume the problem under discussion involves p indicators. Treat these p indicators as p random variables, denoted X1, X2, ..., Xp. Principal component analysis transforms the discussion of these p indicators into a discussion of m new indicators F1, F2, ..., Fm (m < p), which fully reflect the information of the original indicators in accordance with the principle of retaining the main amount of information, and which are mutually independent.

The method of principal component analysis is to find linear combinations F_i of the original indicators,

F_i = a_{1i}X_1 + a_{2i}X_2 + \cdots + a_{pi}X_p, \quad i = 1, 2, \ldots, m

that satisfy the following conditions:

  • The sum of the squared coefficients of each principal component is 1: a_{1i}^2 + a_{2i}^2 + \cdots + a_{pi}^2 = 1;

  • The principal components are mutually independent and share no overlapping information: \mathrm{Cov}(F_i, F_j) = 0 \ (i \neq j);

  • The variances of the principal components decrease in sequence: \mathrm{Var}(F_1) \geq \mathrm{Var}(F_2) \geq \cdots \geq \mathrm{Var}(F_m) > 0.

The process of solving the principal components

  1. Compute the sample mean \overline{X} and the sample covariance matrix S;

  2. Find the eigenvalues \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0 of S;

  3. Find the unit eigenvector corresponding to each eigenvalue;

  4. Write out the expression for each principal component.

When selecting the principal components, the first m principal components are chosen according to the cumulative contribution rate.

Selection principle: choose the smallest m whose cumulative contribution rate exceeds a given threshold (85% as quoted above, or 80% as used in the experiments below), so that the retained components reflect most of the information in the original indicators.
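In symbols, since \mathrm{Var}(F_i) = \lambda_i, the selection rule can be written as follows (a standard formulation, consistent with the thresholds quoted above):

\text{contribution rate of } F_i = \frac{\lambda_i}{\sum_{k=1}^{p} \lambda_k}, \qquad \text{cumulative contribution rate} = \frac{\sum_{i=1}^{m} \lambda_i}{\sum_{k=1}^{p} \lambda_k} \geq 80\%\text{–}85\%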

3. Experimental process

R has a built-in function for principal component analysis, and other packages can perform it as well. In this experiment, R's built-in PCA function was used. The following takes the cancer data set BLCA as an example to illustrate the process; the method for the other data sets is the same.

(1) Environment preparation

Install and import the factoextra package for multivariate statistical analysis visualization.

options(repos=structure(c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))) 
install.packages("factoextra",dependencies = TRUE)
library(factoextra)

Import the data.

data <- read.csv('D:\\Files\\文档\\HNU5\\数据挖掘\\数据集\\BLCA\\rna.csv')   

Transpose the data so that the samples are on the rows and the attributes (gene_id) are on the columns.

data <- t(data)

(2) Perform principal component analysis

Use the built-in prcomp function in R language to perform principal component analysis:

cancer.pr <- prcomp(data)

The prcomp function in R takes two relevant optional parameters:

  • scale.: set to TRUE to standardize the data before performing principal component analysis.

  • rank.: the maximum number of principal components to compute.

Standardization is used when the variables have different scales. Since the data analyzed here are all gene expression values of samples, the scales are assumed to be the same and no standardization is performed.

Use the summary function to view the principal component contribution rate:

summary(cancer.pr)

Some results are as follows:

Here, Standard deviation is the standard deviation of each principal component, Proportion of Variance is the contribution rate of a single principal component, and Cumulative Proportion is the cumulative contribution rate. According to the selection principle, the cumulative contribution rate of the selected m principal components should exceed 80%, so the first 116 principal components need to be selected. At that point the cumulative contribution rate is 80.15%, which reflects most of the information in the original indicators.
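For readers working in Python, a minimal sketch of the same selection rule using scikit-learn (the array X, holding the transposed expression matrix, is an assumption, not part of the original experiment):

import numpy as np
from sklearn.decomposition import PCA

# X: samples on rows, genes on columns (assumed already loaded)
pca = PCA()                                    # centered but not standardized, matching prcomp(data)
scores = pca.fit_transform(X)                  # principal component scores, like predict(cancer.pr)
cum = np.cumsum(pca.explained_variance_ratio_)
m = int(np.searchsorted(cum, 0.80)) + 1        # smallest m with cumulative contribution >= 80%
print(m, cum[m - 1])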

When selecting principal components, you can also generate a scree plot to view the contributions arranged from largest to smallest:

fviz_eig(cancer.pr, addlabels = TRUE, ylim = c(0, 100))  

It can be seen that the largest single contribution is 15% and most contributions are quite small, which is why 116 principal components are needed to reach a cumulative contribution rate of 80%. This also shows that the original data cannot be reduced to a very small dimension.

You can also draw a variable correlation plot. The further a variable lies from the origin, the better it is represented by the principal components (the stronger the correlation); variables close to each other are positively correlated.

fviz_pca_var(cancer.pr)

There are also other plotting functions and principal component analysis outputs, showing how well the principal components represent the variables, the contribution of each variable to a specific principal component, and so on. Since the dimensionality of the data remains very high after principal component analysis, that analysis is not done here.

(3) Calculate principal component data

Principal component analysis mainly serves to reduce dimensionality and simplify the data structure. Having determined above that the first 116 principal components should be retained, only the principal component values of each sample need to be calculated; afterwards, analysis of the data only needs the principal component data rather than the original data. Finally, the results are exported, completing the principal component analysis.

cancer.pca <- predict(cancer.pr)[,1:116]
write.table(cancer.pca, file = "D:\\Files\\文档\\HNU5\\数据挖掘\\数据集\\BLCA\\pca.csv", sep = ",", row.names = TRUE, col.names = TRUE, quote = TRUE)

For samples after principal component analysis, coordinate visualization can be performed:

fviz_pca_ind(cancer.pr)  # sample scores

For data sets of different cancer types, the results after principal component analysis are:

| Type | Number of principal components | Cumulative contribution rate |
| ---- | ------------------------------ | ---------------------------- |
| BLCA | 116 | 80.15% |
| BRCA | 212 | 80.072% |
| KIRC | 114 | 80.125% |
| CLAY | 140 | 80.099% |
| PATH | 45 | 80.15% |

4. Analysis of experimental results

After principal component analysis, the dimensionality of each data set is significantly reduced, reflecting the role of principal component analysis in dimensionality reduction. However, many principal components still need to be retained to meet the 80% cumulative contribution requirement, which reflects the complexity of the data. The data sets after principal component analysis no longer share the same attributes, and it was unclear how to compare the transformed data across cancer types, so that comparison was not completed.

2. Class concept description and characterization analysis

1. Experimental purpose

  • Understand the purpose, content and process of class characterization and class comparative analysis.

  • Master class characterization and class comparative analysis, and be able to implement them in a program.

  • Master attribute correlation analysis, and implement attribute correlation analysis based on information gain.

2. Experimental principle

Class Characterization and Class Comparative Analysis

Concept (class) description is the most basic form of descriptive data mining. A class description consists of class characterization and class comparison.

  • Class characterization: summarize and describe a data set called the target class.

  • Class comparison: summarize and distinguish a data set called the target class from other data sets called contrasting classes.

Attribute correlation analysis

  • By identifying irrelevant or weakly relevant attributes and excluding them from the concept description process, we determine which attributes should be included in class characterization and class comparison.

  • The basic idea of attribute correlation analysis is to compute some measure that quantifies the correlation of an attribute with a given class or concept. Usable measures include information gain, uncertainty, and the correlation coefficient.

Information gain

Information gain determines the relevance of an attribute to the current characterization task by computing the expected information of a sample classification and the entropy of the attribute, from which the attribute's information gain is obtained.

Information gain is calculated as follows:

Let S be a set of training samples whose class labels are known; each sample is a tuple, and one attribute is used to determine the class label of the training samples.

Suppose there are m classes in S and s training samples in total, and each class Ci contains si samples. The probability that an arbitrary sample belongs to class Ci is then si/s, and the expected information needed to classify a given sample is:

I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} p_i \log_2 p_i, \quad p_i = s_i / s

An attribute A with v values \{a_1, a_2, \ldots, a_v\} partitions S into v subsets \{S_1, S_2, \ldots, S_v\}, where S_j contains the samples of S that take value a_j on A. Suppose S_j contains s_{ij} samples of class Ci. The expected information of this partition by A is called the entropy of A:

E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})

The information gain obtained by partitioning on A is defined as:

Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A)

Attributes with high information gain are the highly discriminative attributes in the given set. Therefore, by computing the information gain of every attribute of the samples in S, a relevance ranking of the attributes is obtained, from which irrelevant or weakly relevant attributes can be identified and excluded from the concept description process.
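As a small worked example (the numbers are illustrative, not from the experiment's data): for two classes with 9 and 5 samples,

I(9, 5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.940

and an attribute whose partition gives E(A) = 0.694 would yield Gain(A) = 0.940 - 0.694 = 0.246.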

The process of class characterization including attribute correlation analysis

  • Data collection

    • Collect relevant data in the database through query processing and divide it into a target class and one or more contrasting classes.

  • Preliminary correlation analysis

    • Identify collections of attributes and dimensions

    • Remove or generalize attributes that have a large number of distinct values


    • Generate candidate relationships

  • Remove irrelevant and weakly correlated attributes using selected correlation analysis measures

    • Evaluate each attribute in a candidate relationship using a correlation analysis metric (information gain)

    • Sort attributes based on calculated relevance

    • Irrelevant and weakly relevant attributes below the critical value are removed

    • Generate initial target class working relationship

  • Generate concept descriptions

The process of class comparison analysis including attribute correlation analysis

The process of class comparative analysis is as follows:

  • Data collection:

Collect relevant data in the database through query processing and divide it into a target class and one or more contrasting classes.

  • Dimension correlation analysis:

Use attribute correlation analysis to include only the strongly correlated dimensions in the task.

  • Simultaneous generalization

Simultaneously generalize the target class and the contrasting classes to obtain the main target class relation/cube and the main contrasting class relation/cube.

  • Presentation of the derived comparison

Use visualization techniques to express the class comparison description, which usually includes "contrast" measures that reflect the comparison between the target class and the contrasting classes.

3. Experimental process

Class characterization and class comparison analysis are performed using attribute correlation analysis, implemented in Python.

(1) Data collection

Using the given data sets for class characterization and class comparison analysis, three cancer types, BLCA, BRCA, and KIRC, were selected. BLCA is the target class, and BRCA and KIRC are the contrasting classes. The three original data sets are used directly without processing here; the data will be processed as needed during attribute correlation analysis.

(2) Pre-correlation analysis

This step requires deleting or generalizing attributes with a large number of distinct values, and it should be as conservative as possible so that more attributes are retained for subsequent analysis. For the cancer gene data set, each attribute is a gene, and it is difficult to decide which attributes can be deleted. Some attributes might be generalized by climbing a concept hierarchy, but that requires domain knowledge of the genes. For example, some gene IDs share a prefix, such as CXCL11|6373 and CXCL14|9547; a lookup shows that CXC is a motif of chemokine-encoding genes, so those attributes could potentially be generalized upward along a concept hierarchy (but only after the raw values have first been generalized one step).

Considering that the data set has too many attributes and it is hard to delete attributes or merge equal generalized tuples, no attributes are deleted or merged here. Instead, the most basic generalization is performed: the value of each gene in the original data set is generalized into a level by partitioning the range of the original data. This is similar to generalizing GPA values into levels according to a concept hierarchy when mining the general characteristics of graduate students.

There should be a standard for the interval partition used to generalize the values, similar to classifying a score below 60 as failing and above 85 as excellent; that standard should come from the meaning of the data set, and for the cancer data it ought to be set by someone who understands the data. Since this is only an exercise in class characterization and class comparison analysis, the data are simply divided into equal-width intervals: the maximum and minimum values across the three data sets are found (this step is simple and was done directly in an Excel sheet). The largest value in the three cancer data sets is 4.62 and the smallest is -1.85, so the range is divided into 6 intervals of length (4.62 - (-1.85)) / 6 ≈ 1.08 each. A Python program performs the generalization and saves the result to a file.
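The interval boundaries can be checked quickly in Python (a sketch; the code below uses the slightly padded base value -1.86, so its boundaries are shifted marginally):

import numpy as np

# 7 equally spaced boundary points over [-1.85, 4.62] delimit 6 equal-width intervals
print(np.linspace(-1.85, 4.62, 7))   # ≈ [-1.85, -0.77, 0.31, 1.39, 2.46, 3.54, 4.62]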

The core code is as follows:

import os
import pandas as pd

step = 1.08
base_value = -1.86
base_path = "./属性概化结果"

# Basic generalization: map each gene expression value to one of 6 levels
def generalize(file_path: str):
    file_name = os.path.basename(file_path)              # file name
    output_path = os.path.join(base_path, file_name)     # where to save the result
    df = pd.read_csv(file_path, index_col="gene_id")
    df = df.T                                            # transpose: rows = samples, columns = gene_id
    for index, row in df.iterrows():
        for gene_id in df.columns:
            # equal-width level; min() clamps the maximum value into the top level
            level = min(int((row[gene_id] - base_value) // step), 5)
            df.at[index, gene_id] = level                # write back to df (row is a copy)
    df.to_csv(output_path)
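Usage is then simply the following (the input path is hypothetical; the actual file locations follow the experiment's directory layout):

generalize("./数据集/BLCArna.csv")   # hypothetical path; repeated for the BRCA and KIRC files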

The result is obtained in the following form:

 (3) Attribute correlation analysis based on information gain

After completing the above attribute generalization, calculate the information gain of each attribute and remove irrelevant or weakly relevant attributes.

To make the information gain easier to compute, statistics are first gathered on the generalized data, counting for each gene the number of samples at each level:

import os
import pandas as pd

base_path = "./类特征化与对比分析数据"

def pre(file_path):
    file_name = os.path.basename(file_path)
    output_path = os.path.join(base_path, file_name)
    df = pd.read_csv(file_path, index_col=0)
    labels = list(df.columns.values)
    data = {}
    for gene_id in labels:
        data[gene_id] = [0, 0, 0, 0, 0, 0]       # sample counts for levels 0..5
    for index, row in df.iterrows():
        for gene_id in labels:
            data[gene_id][int(row[gene_id])] += 1
    newdf = pd.DataFrame(data)
    newdf.to_csv(output_path)

def main():
    # gather level statistics
    pre("./属性概化数据/BLCArna.csv")
    pre("./属性概化数据/BRCArna.csv")
    pre("./属性概化数据/KIRCrna.csv")

After performing the above statistics, the statistical data will be in the following form:

The expected information for classifying a given sample is computed as follows. Suppose the expected information for classification on gene A is to be computed; there are three classes in total, so s1, s2, s3 are the sample counts of the three classes on gene A:

I(s_1, s_2, s_3) = -\sum_{i=1}^{3} \frac{s_i}{s} \log_2 \frac{s_i}{s}, \quad s = s_1 + s_2 + s_3

import math

# expected information I(s1, ..., sm) for classifying a given sample
def I(s: list) -> float:
    sum_s = 0
    res_I = 0.0
    for si in s:
        sum_s += si                                 # total number of samples
    for si in s:
        if si > 0:
            res_I += -(si / sum_s * math.log(si / sum_s, 2))
    return res_I
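As a quick sanity check against the worked example in the principle section (illustrative numbers only):

print(I([9, 5]))   # ≈ 0.940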

If the samples are divided into v subsets by gene A (here v = 6, one subset per level), the expected information required to classify a given sample, i.e. the entropy of A, is:

E(A) = \sum_{j=1}^{v} \frac{s_{1j} + s_{2j} + s_{3j}}{s} \, I(s_{1j}, s_{2j}, s_{3j})

The information gain obtained by this division on A is:

Gain(A) = I(s_1, s_2, s_3) - E(A)

To simplify the data handling, the calculation of E(A) is folded into the calculation of Gain(A). The complete calculation is implemented as follows:

def main():
    # gather level statistics
    pre("./属性概化数据/BLCArna.csv")
    pre("./属性概化数据/BRCArna.csv")
    pre("./属性概化数据/KIRCrna.csv")
    df1 = pd.read_csv("./类特征化与对比分析数据/BLCArna.csv", index_col=0)
    df2 = pd.read_csv("./类特征化与对比分析数据/BRCArna.csv", index_col=0)
    df3 = pd.read_csv("./类特征化与对比分析数据/KIRCrna.csv", index_col=0)
    result_path = "./类特征化与对比分析数据/Gain.csv"
    # compute the information gain of every attribute
    labels = list(df1.columns.values)               # gene_ids; compute the gain of each one
    result = {}
    for gene_id in labels:
        gain = 0
        # first compute I(s1, s2, s3)
        s = [0, 0, 0]
        for i in range(6):
            s[0] += df1[gene_id][i]
            s[1] += df2[gene_id][i]
            s[2] += df3[gene_id][i]
        I_s = I(s)                                  # expected information of classification for this gene
        sum_s = s[0] + s[1] + s[2]                  # total number of samples for this gene
        # compute E(gene_id)
        E = 0
        for i in range(6):
            s1i = df1[gene_id][i]
            s2i = df2[gene_id][i]
            s3i = df3[gene_id][i]
            silist = [s1i, s2i, s3i]
            E += I(silist) * ((s1i + s2i + s3i) / sum_s)  # weighted sum over subsets = entropy of the attribute
        gain = I_s - E                              # information gain of this gene
        result[gene_id] = gain
    resultdf = pd.DataFrame(result, index=["gain"])
    # transpose: rows are genes, the column is the information gain
    resultdf = pd.DataFrame(resultdf.values.T, index=resultdf.columns, columns=resultdf.index)
    # sort by information gain, ascending
    resultdf = resultdf.sort_values(by='gain', axis=0, ascending=True)
    resultdf.to_csv(result_path)

The information gain after calculation is completed is as follows:

After obtaining the information gain of each attribute, irrelevant and weakly relevant attributes can be deleted based on it. As for the threshold separating irrelevant and weakly relevant attributes: if the total number of attributes is small, the threshold can be lowered to retain more attributes; if there are many attributes, as in this data set, the threshold may need to be raised so that large numbers of weakly relevant attributes are deleted and only strongly relevant attributes are retained.

Draw a simple density plot to observe the information gain of the attributes:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("./类特征化与对比分析数据/Gain.csv", index_col=0)
df.plot.kde()
plt.show()

It can be observed that the information gain of most attributes is very small, below 0.25, so a threshold of 0.25 would remove most of the weakly relevant and irrelevant attributes. Since the dimensionality of the data set is very large, many attributes would still remain even after that removal. Because the comparative analysis below uses only the most strongly correlated attributes, no attributes were actually deleted at a specific threshold here.
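If one did want to apply the 0.25 threshold mentioned above, a minimal sketch (assuming the Gain.csv produced earlier, with its single gain column):

import pandas as pd

df = pd.read_csv("./类特征化与对比分析数据/Gain.csv", index_col=0)
strong = df[df["gain"] > 0.25]                  # keep only strongly relevant attributes
print(len(strong), "of", len(df), "attributes retained")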

(4) Characterization and comparative analysis

After completing the above attribute generalization and attribute correlation analysis, the final characterization and comparative analysis can be carried out. Quantitative rules can be adopted, with t_weight representing the typicality of a tuple in the main generalized relation: t\_weight = count(q_a) / \sum_{i=1}^{n} count(q_i), where q_a is one of the n generalized tuples of the target class working relation.

Only the three attributes with the largest information gain from the attribute correlation analysis are used to derive the generalized representation and the comparative representation.

result_path = "./类特征化与对比分析数据/characterize_discriminate.csv"

def main():
    df = pd.read_csv("./类特征化与对比分析数据/Gain.csv", index_col=0)
    # take the 3 attributes with the highest information gain (the file is sorted ascending)
    gene_id = list(df.index[-3:])
    # derive the generalized and comparative representations
    data_count("./属性概化数据/BLCArna.csv", gene_id)
    data_count("./属性概化数据/BRCArna.csv", gene_id)
    data_count("./属性概化数据/KIRCrna.csv", gene_id)
            
def data_count(file_name, gene_id):
    # derive the generalized representation
    df = pd.read_csv(file_name, index_col=0)
    # keep only the specified attributes (the 3 strongly relevant ones), then count
    df = df[gene_id]
    count = {}
    total = len(df)
    cols = gene_id.copy()
    cols.append('count')                                        # the count (t_weight) column
    newdf = pd.DataFrame(columns=cols)                          # result
    for row in df.iterrows():
        item = tuple(row[1][gene_id])                           # one combination of the three attributes' levels
        if item not in count:
            count[item] = 1
        else:
            count[item] += 1
    for key, value in count.items():                            # save the results to file
        n_row = list(key)
        n_row.append("%.2f%%" % (round(value / total, 4) * 100))
        newdf.loc[len(newdf)] = n_row
    newdf.to_csv(result_path, index=False, mode='a')

The exported results are as follows, from top to bottom, the generalized representation results of BLCA, BRCA, and KIRC:

Main generalization results of target class BLCA:

The distinguishing characteristics of the target class and the contrasting classes in a class comparison description can also be expressed by quantitative rules, i.e. quantitative discriminant rules, with

d\_weight = \frac{count(q_a \in C_j)}{\sum_{i=1}^{m} count(q_a \in C_i)}

where q_a is a generalized tuple, C_j is the target class, and the sum runs over the target class and the contrasting classes. This is similar to how t_weight is calculated in class characterization, except that what is computed is the ratio of a tuple's count in the initial target class working relation to the total count of that tuple across the working relations of the target class and the contrasting classes. The implementation is as follows:

def d_weight(gene_id):
    # computing d_weight needs the data of all three data sets
    df1 = pd.read_csv("./属性概化数据/BLCArna.csv", index_col=0)
    df2 = pd.read_csv("./属性概化数据/BRCArna.csv", index_col=0)
    df3 = pd.read_csv("./属性概化数据/KIRCrna.csv", index_col=0)
    # keep only the specified attributes
    df1 = df1[gene_id]
    df2 = df2[gene_id]
    df3 = df3[gene_id]
    count = {}
    # similar to characterization, but count now holds per-class tuple counts for d_weight,
    # and the result gains an extra column for the class type
    cols = gene_id.copy()
    cols.append('count')                                        # the count (d_weight) column
    cols.insert(0, 'type')                                      # add a class type column
    newdf = pd.DataFrame(columns=cols)
    # first count the generalized tuples of each class
    for row in df1.iterrows():
        item = tuple(row[1][gene_id])
        if item not in count:
            count[item] = [1, 0, 0]                             # counts in the target class and the two contrasting classes
        else:
            count[item][0] += 1
    for row in df2.iterrows():
        item = tuple(row[1][gene_id])
        if item not in count:
            count[item] = [0, 1, 0]                             # generalized tuple of a contrasting class
        else:
            count[item][1] += 1
    for row in df3.iterrows():
        item = tuple(row[1][gene_id])
        if item not in count:
            count[item] = [0, 0, 1]
        else:
            count[item][2] += 1
    # compute d_weight
    for key, value in count.items():                            # save the results to file
        n_row = list(key)
        sum_num = value[0] + value[1] + value[2]                # total tuple count
        n_row1 = n_row.copy()
        n_row2 = n_row.copy()
        n_row3 = n_row.copy()
        n_row1.append("%.2f%%" % (round(value[0] / sum_num, 4) * 100))
        n_row2.append("%.2f%%" % (round(value[1] / sum_num, 4) * 100))
        n_row3.append("%.2f%%" % (round(value[2] / sum_num, 4) * 100))
        n_row1.insert(0, "BLCA")
        n_row2.insert(0, "BRCA")
        n_row3.insert(0, "KIRC")
        newdf.loc[len(newdf)] = n_row1
        newdf.loc[len(newdf)] = n_row2
        newdf.loc[len(newdf)] = n_row3
    newdf.to_csv("./类特征化与对比分析数据/d_weight.csv", index=False)

The result is in the form:

A higher count (i.e. d_weight) value indicates that the concept represented by the generalized tuple comes mainly from the target class. However, in the final result set the d_weight values of the target class BLCA are not very high; instead, the d_weight of the contrasting classes BRCA and KIRC is close to 100% on many generalized tuples, which may mean the characteristics of the contrasting classes are more distinctive.

4. Analysis of experimental results

Regarding the characterization results, the characteristics of the target class BLCA do not seem obvious: the t_weight of its generalized tuples is generally low. In the contrasting classes BRCA and KIRC, some t_weight values are very high. For example, BRCA samples whose levels on the three attributes are 0, 1, 0 (level 0 corresponds to original values -1.85 to -0.77, level 1 to -0.77 to 0.31) account for 60% of all samples, which can be considered an important feature of that class. Finally, the class comparison analysis using quantitative discriminant rules behaves similarly to the characterization: the d_weight of the target class is not large, while the contrasting classes have significant d_weight on some generalized tuples, and those tuples clearly distinguish the target class from the contrasting classes.

Some simplifications made during processing may affect the results. For example, when generalizing the attributes, the values were divided into equal-width intervals between the global maximum and minimum; for a given gene, such a division may split its actual range of values into only two levels, leading to inaccurate results.

3. Experiment summary

Through this experiment, we deepened our understanding of principal component analysis and of the class characterization and class comparison parts of concept description, and implemented them. The main problem in the first task is that there are far more genes than samples, so the result of principal component analysis is still very high-dimensional; whether the result would be better with sufficient samples is unknown, and how to further process or analyze the principal component results also remains unclear. The second task is mainly about implementing class characterization and class comparison analysis, the most important part being attribute correlation analysis. The implementation focused on learning and trying out the methods and steps, and many processes were simplified in ways that are not entirely reasonable, for example the interval division used to generalize attribute values during preprocessing, and the fact that in the end only the three most strongly correlated attributes, chosen by the computed information gain, were used for class characterization and class comparison. Although many problems remain (issues with the process itself, lack of skill in visualizing the data, etc.), the analysis was carried out fairly completely, and the purpose and steps of principal component analysis, class characterization, and class comparison analysis were understood.


Origin blog.csdn.net/Aaron503/article/details/127595164