GCN: Getting Familiar with the Cora Dataset (a memory-jogging handbook for my forgetful self)

Code source: https://github.com/tkipf/pygcn
These are personal notes only, kept for easy review and lookup.
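The Cora dataset is a dataset for classifying machine learning papers. It consists of:
– README: an introduction to the dataset;
– cora.cites: the citation graph between papers. Each line contains two paper IDs: the first is the ID of the cited paper, the second is the ID of the citing paper.
– cora.content: information on 2708 papers, one paper per line, in the format <paper_id> <word_attributes>+ <class_label>.
paper_id is the unique identifier of the paper;
word_attributes is a 1433-dimensional binary word vector. Each element corresponds to one word: 0 means the word does not appear in the paper, 1 means it does.
class_label is the category of the paper. Each paper is assigned to one of the following 7 classes: Case_Based, Genetic_Algorithms, Neural_Networks, Probabilistic_Methods, Reinforcement_Learning, Rule_Learning, Theory.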
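
To make the cora.content row layout concrete, here is a minimal sketch that splits one row into its three parts. The row is invented for illustration and shortened to 5 word attributes (the real file has 1433), and `parse_content_line` is a hypothetical helper, not part of pygcn.

```python
import numpy as np

# Hypothetical row in the cora.content layout: ID, word attributes, label.
# Shortened to 5 word attributes; the ID and values are made up.
line = "12345\t0\t1\t0\t0\t1\tNeural_Networks"

def parse_content_line(line):
    """Split one tab-separated row into (paper_id, features, label)."""
    fields = line.rstrip("\n").split("\t")
    paper_id = int(fields[0])                                      # first field: paper ID
    features = np.array([int(x) for x in fields[1:-1]], dtype=np.int8)  # middle fields: 0/1 word flags
    label = fields[-1]                                             # last field: class label
    return paper_id, features, label

paper_id, features, label = parse_content_line(line)
print(paper_id, features.tolist(), label)
```

On the real file the same slicing (first field, middle fields, last field) should yield a feature vector of length 1433.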



import pandas as pd
import numpy as np
import scipy.sparse as sp
# Load the data; fields are tab-separated
raw_data_content = pd.read_csv('cora/cora.content', sep='\t', header=None)

# Expected shape: 2708 rows x 1435 columns
(row, col) = raw_data_content.shape
print("Cora Content's Row: {}, Col: {}".format(row, col))
print("=============================================")

# Each row has 1435 fields: the first is the paper ID, the last is the paper's label
raw_data_sample = raw_data_content.head(3)        # first 3 rows
features_sample = raw_data_sample.iloc[:, 1:-1]   # iloc selects by position; drop the ID and label columns
labels_sample = raw_data_sample.iloc[:, -1]       # last column only: the label
labels_onehot_sample = pd.get_dummies(labels_sample)

print("features:{}".format(features_sample))

print("=============================================")

print("labels:{}".format(labels_sample))
print("=============================================")

print("labels one hot:{}".format(labels_onehot_sample))
raw_data_cites = pd.read_csv('cora/cora.cites', sep='\t', header=None)
# Expected shape: 5429 rows x 2 columns
(row, col) = raw_data_cites.shape

print("Cora Cites's Row: {}, Col: {}".format(row, col))

print("=============================================")

raw_data_cites_sample = raw_data_cites.head(10)

print(raw_data_cites_sample)

print("=============================================")
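
The column order of cora.cites (cited paper first, citing paper second) is easy to mix up, so here is a small sketch on invented toy edges. The IDs and the column names `cited` and `citing` are assumptions made for this example; the real file has no header.

```python
import pandas as pd

# Invented toy edges in the cora.cites layout:
# column 0 = cited paper, column 1 = citing paper.
toy_cites = pd.DataFrame(
    [[35, 1033], [35, 2001], [40, 1033]],
    columns=["cited", "citing"],
)

# How many times each paper is cited
times_cited = toy_cites["cited"].value_counts().to_dict()
# How many references each paper makes
refs_made = toy_cites["citing"].value_counts().to_dict()

print(times_cited)
print(refs_made)
```

Here paper 35 is cited twice and paper 1033 cites two papers, which matches reading each row as "second column cites first column".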



# Convert the citation list into an adjacency matrix
idx = np.array(raw_data_content.iloc[:, 0], dtype=np.int32)
idx_map = {j: i for i, j in enumerate(idx)}   # paper ID -> row index

# Express the citations in terms of row indices instead of paper IDs
edge_indexs = np.array(list(map(idx_map.get, raw_data_cites.values.flatten())), dtype=np.int32)
edge_indexs = edge_indexs.reshape(raw_data_cites.shape)   # back to (num_edges, 2): one (cited, citing) index pair per row
# Build the adjacency matrix as a COO sparse matrix; it is not symmetric at this point.
# The shape must be n x n with n = number of papers (2708), not the number of edges (5429).
num_nodes = len(idx)
adjacency = sp.coo_matrix((np.ones(len(edge_indexs)),
            (edge_indexs[:, 0], edge_indexs[:, 1])),
            shape=(num_nodes, num_nodes), dtype="float32")
print(adjacency)
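
The adjacency matrix built here is directed (non-symmetric). A GCN usually works on a symmetric matrix with self-loops, row-normalized. The sketch below shows one way to do that on an invented 3-node toy graph. As far as these notes can tell it resembles the preprocessing in the pygcn repository linked above, but treat it as an illustration under that assumption rather than the repository's exact code.

```python
import numpy as np
import scipy.sparse as sp

# Toy directed graph with 3 nodes and edges 0->1, 1->2 (invented data)
rows = np.array([0, 1])
cols = np.array([1, 2])
adj = sp.coo_matrix((np.ones(2), (rows, cols)), shape=(3, 3), dtype=np.float32)

# Symmetrize: keep an edge if it exists in either direction
adj_sym = adj + adj.T.multiply(adj.T > adj) - adj.multiply(adj.T > adj)

# Add self-loops, then row-normalize: D^-1 (A + I)
adj_hat = adj_sym + sp.eye(adj_sym.shape[0])
row_sum = np.array(adj_hat.sum(axis=1)).flatten()
inv = np.power(row_sum, -1.0)
inv[np.isinf(inv)] = 0.0          # guard against empty rows
adj_norm = sp.diags(inv).dot(adj_hat)

print(adj_norm.toarray())
```

After normalization every row sums to 1, so multiplying by the feature matrix averages each node's features with those of its neighbours.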
