Formats of the Cora, Citeseer, Pubmed, and Tox21 data sets used by GCN


This article describes the formats of several data sets used with graph convolutional networks (GCN).

Cora, Citeseer, Pubmed

Data set   Source                                                          #Graphs  #Nodes  #Edges  #Features  #Labels (y)
Cora       "Collective Classification in Network Data," AI Magazine, 2008     1      2708    5429     1433         7
Citeseer   "Collective Classification in Network Data," AI Magazine, 2008     1      3327    4732     3703         6
Pubmed     "Collective Classification in Network Data," AI Magazine, 2008     1     19717   44338      500         3
├── gcn
│   ├── data          // graph data
│   │   ├── ind.citeseer.allx
│   │   ├── ind.citeseer.ally
│   │   ├── ind.citeseer.graph
│   │   ├── ind.citeseer.test.index
│   │   ├── ind.citeseer.tx
│   │   ├── ind.citeseer.ty
│   │   ├── ind.citeseer.x
│   │   ├── ind.citeseer.y
│   │   ├── ind.cora.allx
│   │   ├── ind.cora.ally
│   │   ├── ind.cora.graph
│   │   ├── ind.cora.test.index
│   │   ├── ind.cora.tx
│   │   ├── ind.cora.ty
│   │   ├── ind.cora.x
│   │   ├── ind.cora.y
│   │   ├── ind.pubmed.allx
│   │   ├── ind.pubmed.ally
│   │   ├── ind.pubmed.graph
│   │   ├── ind.pubmed.test.index
│   │   ├── ind.pubmed.tx
│   │   ├── ind.pubmed.ty
│   │   ├── ind.pubmed.x
│   │   └── ind.pubmed.y
│   ├── __init__.py
│   ├── inits.py    // common initialization functions
│   ├── layers.py   // GCN layer definitions
│   ├── metrics.py  // computation of evaluation metrics
│   ├── models.py   // model definitions
│   ├── train.py    // training script
│   └── utils.py    // utility functions
├── LICENCE
├── README.md
├── requirements.txt
└── setup.py

Each of the three data sets consists of the following eight files, stored in the same format:

ind.dataset_str.x => the feature vectors of the training instances as scipy.sparse.csr.csr_matrix object;
ind.dataset_str.tx => the feature vectors of the test instances as scipy.sparse.csr.csr_matrix object;
ind.dataset_str.allx => the feature vectors of both labeled and unlabeled training instances 
    (a superset of ind.dataset_str.x) as scipy.sparse.csr.csr_matrix object;
    
ind.dataset_str.y => the one-hot labels of the labeled training instances as numpy.ndarray object;
ind.dataset_str.ty => the one-hot labels of the test instances as numpy.ndarray object;
ind.dataset_str.ally => the labels for instances in ind.dataset_str.allx as numpy.ndarray object;

ind.dataset_str.graph => a dict in the format {index: [index_of_neighbor_nodes]} as collections.defaultdict object;
ind.dataset_str.test.index => the indices of test instances in graph, for the inductive setting as list object.

All objects above must be saved using python pickle module.
    
Taking Cora as an example:
ind.dataset_str.x => feature vectors of the labeled training instances, a scipy.sparse.csr.csr_matrix object, shape: (140, 1433)
ind.dataset_str.tx => feature vectors of the test instances, shape: (1000, 1433)
ind.dataset_str.allx => feature vectors of the labeled + unlabeled training instances, a superset of ind.dataset_str.x, shape: (1708, 1433)

ind.dataset_str.y => one-hot labels of the labeled training instances, a numpy.ndarray object, shape: (140, 7)
ind.dataset_str.ty => one-hot labels of the test instances, a numpy.ndarray object, shape: (1000, 7)
ind.dataset_str.ally => one-hot labels corresponding to ind.dataset_str.allx, shape: (1708, 7)

ind.dataset_str.graph => the graph data, a collections.defaultdict object in the format {index: [index_of_neighbor_nodes]}
ind.dataset_str.test.index => the indices of the test instances, 1000 lines (indices 1708-2707 for Cora)

All of the above files must be stored with Python's pickle module.
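For reference, the seven pickled files can be read with a small helper along these lines (a sketch following the per-file snippets shown later; the name load_planetoid_files is my own, and test.index is read separately as plain text):

import sys
import pickle as pkl

def load_planetoid_files(dataset_str):
    """Load the seven pickled objects of a data set such as 'cora'."""
    names = ['x', 'y', 'tx', 'ty', 'allx', 'ally', 'graph']
    objects = []
    for name in names:
        with open("data/ind.{}.{}".format(dataset_str, name), 'rb') as f:
            if sys.version_info > (3, 0):
                # the files were pickled under Python 2, hence encoding='latin1'
                objects.append(pkl.load(f, encoding='latin1'))
            else:
                objects.append(pkl.load(f))
    return tuple(objects)

x, y, tx, ty, allx, ally, graph = load_planetoid_files('cora')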
  • The GCN in the paper "Semi-Supervised Classification with Graph Convolutional Networks" is trained semi-supervised, so some of the training instances have labels and some do not.

Take Cora as an example

Original data set link: http://linqs.cs.umd.edu/projects/projects/lbc/
Data set division method: https://github.com/kimiyoung/planetoid (Zhilin Yang, William W. Cohen, Ruslan Salakhutdinov, Revisiting Semi-Supervised Learning with Graph Embeddings, ICML 2016)

The Cora data set consists of machine learning papers and has become very popular for graph deep learning in recent years. Each paper belongs to one of the following seven categories:

  • Case-Based
  • Genetic Algorithms
  • Neural Networks
  • Probabilistic Methods
  • Reinforcement Learning
  • Rule Learning
  • Theory

Papers were selected so that each one cites or is cited by at least one other paper in the final corpus, which contains 2708 papers in total.

After stemming and removing stop words, and deleting all words with a document frequency of less than 10, 1433 unique words remain. The Cora feature vectors are therefore 1433-dimensional, with each entry being 0 or 1 to indicate whether the corresponding word appears in the paper.
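As a toy illustration of this 0/1 word-presence encoding (the vocabulary and the paper's word set below are made up, not taken from Cora):

import numpy as np

# toy vocabulary: word -> feature index (Cora's real vocabulary has 1433 words)
vocab = {'neural': 0, 'network': 1, 'markov': 2, 'policy': 3, 'gradient': 4}
paper_words = {'neural', 'network', 'gradient'}  # words occurring in one paper

features = np.zeros(len(vocab), dtype=int)
for word in paper_words:
    if word in vocab:
        features[vocab[word]] = 1

print(features)  # [1 1 0 0 1]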

In the snippets below, the variable data is a scipy.sparse.csr.csr_matrix (a sparse matrix); printing it lists the (row, column) coordinates of the nonzero entries together with their values.

Data format example
(1)--------------------------------------ind.cora.x
import sys
import pickle as pkl

def load_cora():
    with open("data/ind.cora.x", 'rb') as f:
        if sys.version_info > (3, 0):
            print(f)  # <_io.BufferedReader name='data/ind.cora.x'>
            data = pkl.load(f, encoding='latin1')
            print(type(data)) #<class 'scipy.sparse.csr.csr_matrix'>

            print(data.shape)   # (140, 1433) - ind.cora.x has 140 rows and 1433 columns
            print(data.shape[0]) #row:140
            print(data.shape[1]) #column:1433
            print(data[1])
  # data is a scipy.sparse.csr.csr_matrix; printing it shows the nonzero (row, column) coordinates and their values
  # (0, 19)	1.0
  # (0, 88)	1.0
  # (0, 149)	1.0
  # (0, 212)	1.0
  # (0, 233)	1.0
  # (0, 332)	1.0
  # (0, 336)	1.0
  # (0, 359)	1.0
  # (0, 472)	1.0
  # (0, 507)	1.0
  # (0, 548)	1.0
  # ...

# print(data[100][1]) #IndexError: index (1) out of range
            nonzero=data.nonzero()
            print(nonzero)     # the row and column coordinates of the nonzero entries
# (array([  0,   0,   0, ..., 139, 139, 139], dtype=int32), array([  19,   81,  146, ..., 1263, 1274, 1393], dtype=int32))
            # nonzero是个tuple
            print(type(nonzero)) #<class 'tuple'>
            print(nonzero[0])    # rows:    [  0   0   0 ... 139 139 139]
            print(nonzero[1])    # columns: [  19   81  146 ... 1263 1274 1393]
            print(nonzero[1][0])  #19
            print(data.toarray())
# [[0. 0. 0. ... 0. 0. 0.]
#  [0. 0. 0. ... 0. 0. 0.]
#  [0. 0. 0. ... 0. 0. 0.]
#  ...
#  [0. 0. 0. ... 0. 1. 0.]
#  [0. 0. 0. ... 0. 0. 0.]
#  [0. 1. 0. ... 0. 0. 0.]]

(2)--------------------------------------ind.cora.y

def load_cora():
    with open("data/ind.cora.y", 'rb') as f:
        if sys.version_info > (3, 0):
            print(f)  #<_io.BufferedReader name='data/ind.cora.y'>
            data = pkl.load(f, encoding='latin1')
            print(type(data)) #<class 'numpy.ndarray'>
            print(data.shape)   #(140, 7)
            print(data.shape[0]) #row:140
            print(data.shape[1]) #column:7
            print(data[1]) #[0 0 0 0 1 0 0]
            
(3)--------------------------------------ind.cora.graph

def load_cora():
    with open("data/ind.cora.graph", 'rb') as f:
        if sys.version_info > (3, 0):
            data = pkl.load(f, encoding='latin1')
            print(type(data)) #<class 'collections.defaultdict'>
            print(data) 
# defaultdict(<class 'list'>, {0: [633, 1862, 2582], 1: [2, 652, 654], 2: [1986, 332, 1666, 1, 1454], 
#   , ... , 
#   2706: [165, 2707, 1473, 169], 2707: [598, 165, 1473, 2706]})
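This dict of neighbor lists is what the gcn code turns into a sparse adjacency matrix via networkx; a minimal sketch, with data being the defaultdict loaded above:

import networkx as nx

# build an undirected graph from {node: [neighbors]} and take its adjacency matrix
adj = nx.adjacency_matrix(nx.from_dict_of_lists(data))
print(adj.shape)  # (2708, 2708) for Cora
print(type(adj))  # a scipy sparse matrix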


(4)--------------------------------------ind.cora.test.index

dataset_str = 'cora'
test_idx_reorder = parse_index_file("data/ind.{}.test.index".format(dataset_str))
print("test index:",test_idx_reorder)
#test index: [2692, 2532, 2050, 1715, 2362, 2609, 2622, 1975, 2081, 1767, 2263,..]
print("min_index:",min(test_idx_reorder))
# min_index: 1708
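parse_index_file is not shown in the excerpt above; in the gcn repository it is a small helper that simply reads one integer index per line, roughly:

def parse_index_file(filename):
    """Read a test.index file: one integer node index per line."""
    index = []
    for line in open(filename):
        index.append(int(line.strip()))
    return index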

(5) Special handling of isolated nodes in the citeseer data set
    # handle the isolated nodes in citeseer
    if dataset_str == 'citeseer':
        # Fix citeseer dataset (there are some isolated nodes in the graph)
        # Find isolated nodes, add them as zero-vecs into the right position

        test_idx_range_full = range(min(test_idx_reorder), max(test_idx_reorder)+1)
        # print("test_idx_range_full.length",len(test_idx_range_full))
        #test_idx_range_full.length 1015

        # convert to a LIL-format sparse matrix, tx_extended.shape = (1015, 3703)
        tx_extended = sp.lil_matrix((len(test_idx_range_full), x.shape[1]))
        # print(test_idx_range)
        # [2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325
        # ....
        #  3321 3322 3323 3324 3325 3326]

        # test_idx_range - min(test_idx_range): subtract min(test_idx_range) from every element,
        # i.e. renumber the test indices so that they start from 0
        tx_extended[test_idx_range-min(test_idx_range), :] = tx
        # print(tx_extended.shape) #(1015, 3703)

        # print(tx_extended)
        # (0, 19) 1.0
        # (0, 21) 1.0
        # (0, 169) 1.0
        # (0, 170) 1.0
        # (0, 425) 1.0
        #  ...
        # (1014, 3243) 1.0
        # (1014, 3351) 1.0
        # (1014, 3472) 1.0

        tx = tx_extended
        # print(tx.shape)
        # (1015, 3703)
        # rows 997, 994, 993, 980, 938, ... (15 in total) are all zeros: the isolated nodes


        ty_extended = np.zeros((len(test_idx_range_full), y.shape[1]))
        ty_extended[test_idx_range-min(test_idx_range), :] = ty
        ty = ty_extended
        # for i in range(ty.shape[0]):
        #     print(i," ",ty[i])
        #     # 980 [0. 0. 0. 0. 0. 0.]
        #     # 994 [0. 0. 0. 0. 0. 0.]
        #     # 993 [0. 0. 0. 0. 0. 0.]

  • allx contains all training instances (labeled and unlabeled), indices 0-1707, 1708 in total
  • ally contains the labels corresponding to allx (indices 0-1707); the test instances occupy indices 1708-2707, 1000 in total
  • The citeseer test set has some isolated nodes: 15 indices in the test index range do not appear in test.index. They are added to the test features tx as all-zero rows, with matching all-zero rows in the labels ty
  • The model takes the whole graph as input, so tx and allx are stacked together to form the full feature matrix (see the sketch after the LIL example below)
  • Instances without labels get an all-zero label vector, e.g. [0,0,0,0,0,0,0] for Cora
  • The features in the data set are sparse and are handled as LIL sparse matrices, whose print format looks like this:
import numpy as np
import scipy.sparse as sp

A = np.array([[1, 0, 2, 0], [0, 0, 0, 0], [3, 0, 0, 0], [1, 0, 0, 4]])
AS = sp.lil_matrix(A)
print(AS)
# (0, 0) 1
# (0, 2) 2
# (2, 0) 3
# (3, 0) 1
# (3, 3) 4
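Putting the pieces together, the full feature and label matrices are built by stacking allx/ally on top of tx/ty and then moving the test rows back to the positions given by test.index. A condensed sketch of that step (the variables are assumed to have been loaded as in the snippets above):

import numpy as np
import scipy.sparse as sp

# test_idx_reorder: indices read from ind.cora.test.index (unsorted)
test_idx_range = np.sort(test_idx_reorder)

# (1708 + 1000, 1433) = (2708, 1433) for Cora
features = sp.vstack((allx, tx)).tolil()
# the stacked test rows are in sorted order; reorder them so that row i of
# `features` really is the feature vector of node i in the graph
features[test_idx_reorder, :] = features[test_idx_range, :]

labels = np.vstack((ally, ty))  # (2708, 7) for Cora
labels[test_idx_reorder, :] = labels[test_idx_range, :]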

Tox21 data set

This data set comes from a 2014 competition, the Tox21 Data Challenge: https://tripod.nih.gov/tox21/challenge/about.jsp
The data is also available through PubChem, the open chemistry database of the US National Institutes of Health (NIH) and the largest collection of freely available chemical information.
PubChem data is contributed by hundreds of data sources, including government agencies, chemical suppliers, journal publishers, and others.

The Toxicology in the 21st Century (Tox21) program is a federal collaboration among the NIH, the Environmental Protection Agency, and the Food and Drug Administration to develop better toxicity-assessment methods. The goal is to quickly and efficiently test whether certain compounds have the potential to disrupt processes in the human body that may cause adverse health effects. The Tox21 data set is one of the data sets used in the competition. It contains the structures of chemical compounds together with the outcomes of 12 toxicological assays.

  • Estrogen receptor α, LBD (ER, LBD)
  • Estrogen receptor α, full (ER, full)
  • Aromatase
  • Aryl hydrocarbon receptor (AhR)
  • Androgen receptor, full (AR, full)
  • Androgen receptor, LBD (AR, LBD)
  • Peroxisome proliferator-activated receptor γ (PPAR-γ)
  • Nuclear factor (erythroid-derived 2)-like 2/antioxidant response element (Nrf2/ARE)
  • Heat shock factor response element (HSE)
  • ATAD5
  • Mitochondrial membrane potential (MMP)
  • p53

Each toxicological assay tested the substances with PUBCHEM_SID 144203552-144214049, 10486 compounds in total, including environmental chemicals, approved drugs, and other substances.
For example, the measurement results of the p53 assay can be viewed online.

  • PubChem AID: BioAssay ID, identifying a biological assay record
  • PubChem SID: Substance ID
  • PubChem CID: Compound ID

The data set can be downloaded here: https://tripod.nih.gov/tox21/challenge/data.jsp#

The training set and test set are SDF files, each made up of many molecular structure records.
A single molecule record is stored as follows:

  • First line: usually the molecule name, e.g. NCGC00255644-01; sometimes blank
  • Second line: program/comment line, e.g. Marvin 07111412562D
  • Third line: usually blank
  • Fourth line: the counts line, giving the number of atoms, bonds, etc.
  • The atom and bond information ends at the line containing M  END

After M  END come the property (data) fields; the number of properties is variable:

  • Property 1 name
  • Property 1 value
  • Blank line
  • Property 2 name
  • Property 2 value
  • Blank line
  • A record ends with four dollar signs ($$$$)
  • In the training set, the label field is "Active": 1 means active, 0 means inactive
  • In the test set there is no "Active" field

The information storage format of a molecule in the training set is as follows:

NCGC00255644-01
  Marvin  07111412562D          

 26 27  0  0  1  0            999 V2000
    4.5831   -4.3075    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.2840   -3.9061    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.9910   -4.3075    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.2840   -3.0973    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.4379   -1.6595    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4379   -2.4863    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
    2.1508   -2.0609    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4379   -3.3010    0.0000 C   0  0  2  0  0  0  0  0  0  0  0  0
    0.7070   -2.0609    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.8577   -2.4863    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.1508   -1.2342    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7070   -3.7084    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
    2.1508   -3.7084    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000   -2.4863    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.8577   -3.3010    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.5646   -2.0609    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.8577   -0.8388    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.1323   -4.4273    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.3056   -4.4273    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000   -3.3010    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.5646   -1.2342    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7189   -5.1463    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    4.2955   -0.8388    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.0085   -1.2342    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.2955    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4379   -4.1338    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  2  3  1  0  0  0  0
  2  4  2  0  0  0  0
  6  5  1  1  0  0  0
  6  7  1  0  0  0  0
  6  8  1  0  0  0  0
  6  9  1  0  0  0  0
  7 10  1  0  0  0  0
  7 11  2  0  0  0  0
  8 12  1  0  0  0  0
  8 13  1  0  0  0  0
  8 26  1  6  0  0  0
  9 14  1  0  0  0  0
 10 15  1  0  0  0  0
 10 16  2  0  0  0  0
 11 17  1  0  0  0  0
 12 18  1  6  0  0  0
 12 19  1  1  0  0  0
 12 20  1  0  0  0  0
 13 15  1  0  0  0  0
 14 20  1  0  0  0  0
 16 21  1  0  0  0  0
 17 21  2  0  0  0  0
 18 22  1  0  0  0  0
 21 23  1  0  0  0  0
 23 24  1  0  0  0  0
 23 25  1  0  0  0  0
M  END
>  <Formula>
C22H35NO2

>  <FW>
345.5188 (60.0520+285.4668)

>  <DSSTox_CID>
27102

>  <Active>
0

$$$$
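The property fields of such a record can be pulled out with a few lines of plain Python. A rough sketch (the helper and the file name tox21_training.sdf are illustrative assumptions, not part of the data set):

def read_sdf_properties(record_text):
    """Extract the '>  <Name>' / value pairs from a single SDF record."""
    props = {}
    lines = record_text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        if line.startswith('>') and '<' in line and line.endswith('>'):
            name = line[line.index('<') + 1:line.rindex('>')]
            if i + 1 < len(lines):
                props[name] = lines[i + 1].strip()
            i += 2
        else:
            i += 1
    return props

with open('tox21_training.sdf') as f:              # hypothetical file name
    first_record = f.read().split('$$$$')[0]
print(read_sdf_properties(first_record))
# e.g. {'Formula': 'C22H35NO2', 'FW': '345.5188 (60.0520+285.4668)',
#       'DSSTox_CID': '27102', 'Active': '0'}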

The information storage format of a molecule in the test set is as follows:

NCGC00261443
  Marvin  10161415332D          

 20 22  0  0  1  0            999 V2000
    0.5185    2.9762    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.2330    2.5637    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    1.2330    1.7387    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.5185    1.3262    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.2661    1.5812    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -0.7510    0.9137    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.2661    0.2463    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -0.5210   -0.5383    0.0000 C   0  0  2  0  0  0  0  0  0  0  0  0
   -1.3056   -0.7933    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -1.3056   -1.6183    0.0000 C   0  0  2  0  0  0  0  0  0  0  0  0
   -1.9731   -2.1032    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.7268   -1.7676    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -0.5210   -1.8732    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.2661   -2.6578    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -0.0361   -1.2058    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
    0.7889   -1.2058    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.5185    0.5012    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.2330    0.0887    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    1.9475    0.5012    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9475    1.3262    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  2  3  1  0  0  0  0
  3  4  2  0  0  0  0
  4  5  1  0  0  0  0
  5  6  2  0  0  0  0
  6  7  1  0  0  0  0
  7  8  1  0  0  0  0
  8  9  1  1  0  0  0
  9 10  1  0  0  0  0
 10 11  1  1  0  0  0
 11 12  1  0  0  0  0
 10 13  1  0  0  0  0
 13 14  1  0  0  0  0
 13 15  1  0  0  0  0
  8 15  1  0  0  0  0
 15 16  1  6  0  0  0
  7 17  1  0  0  0  0
  4 17  1  0  0  0  0
 17 18  2  0  0  0  0
 18 19  1  0  0  0  0
 19 20  2  0  0  0  0
  3 20  1  0  0  0  0
M  END
>  <Compound ID>
NCGC00261443

>  <Compound Batch ID>
NCGC00261443-01

>  <NR-AR>
0

>  <NR-AR-LBD>
0

>  <NR-AhR>
0

>  <NR-ER>
0

>  <NR-ER-LBD>
0

>  <NR-PPAR-gamma>
0

>  <SR-ARE>
0

>  <SR-ATAD5>
1

>  <SR-HSE>
0

>  <SR-MMP>
0

>  <SR-p53>
0

$$$$

The goal is presumably to predict the activity of the test-set molecules from the molecular structures and activity labels in the training set. Each molecule can be viewed as a graph in which the atoms form the nodes and the bonds form the edges. However, I have not found a more detailed description of the atom and bond blocks in this data set, so the exact meaning of every row is unclear to me.
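One way to get at the atoms and bonds without decoding the V2000 rows by hand is an SDF parser such as RDKit. A minimal sketch, assuming RDKit is installed and the training file is named tox21_training.sdf (an assumed name):

from rdkit import Chem

suppl = Chem.SDMolSupplier('tox21_training.sdf')   # hypothetical file name
for mol in suppl:
    if mol is None:          # records RDKit cannot parse come back as None
        continue
    label = mol.GetProp('Active') if mol.HasProp('Active') else None
    # atoms become the nodes and bonds the edges of the molecular graph
    nodes = [atom.GetSymbol() for atom in mol.GetAtoms()]
    edges = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in mol.GetBonds()]
    print(len(nodes), len(edges), label)
    break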

If you spot any errors, please point them out; discussion and exchange about GNNs and GCNs is welcome.

Origin blog.csdn.net/yyl424525/article/details/100831452