1. Graph Neural Network (GNN) Basics

Install PyTorch Geometric

Installing the packages

Open the link https://github.com/pyg-team/pytorch_geometric, follow the link indicated in the repository's installation instructions, and install from the precompiled (prebuilt) wheels.

Use the following code snippets to view PyTorch, CUDA, and Python versions:

import torch

# Check the PyTorch version
print("PyTorch version:", torch.__version__)

# Check the CUDA version (if using a GPU)
if torch.cuda.is_available():
    print("CUDA version:", torch.version.cuda)
else:
    print("No available CUDA found")

# Check the Python version
import sys
print("Python version:", sys.version)

Running screenshot:


Click the link that matches your PyTorch and CUDA versions, then choose the dependency wheels that match your Python version and install each one with pip install <package name>.

Finally, execute the command pip install torch_geometric.

Node classification using graph convolutional networks (GCN) on the KarateClub dataset

This part walks through the code; you can paste it into a Jupyter notebook and run it.

Two drawing functions

%matplotlib inline
import torch
import networkx as nx
import matplotlib.pyplot as plt


def visualize_graph(G, color):
    plt.figure(figsize=(7,7))
    plt.xticks([])
    plt.yticks([])
    nx.draw_networkx(G, pos=nx.spring_layout(G, seed=42), with_labels=False,
                     node_color=color, cmap="Set2")
    plt.show()


def visualize_embedding(h, color, epoch=None, loss=None):
    plt.figure(figsize=(7,7))
    plt.xticks([])
    plt.yticks([])
    h = h.detach().cpu().numpy()
    plt.scatter(h[:, 0], h[:, 1], s=140, c=color, cmap="Set2")
    if epoch is not None and loss is not None:
        plt.xlabel(f'Epoch: {epoch}, Loss: {loss.item():.4f}', fontsize=16)
    plt.show()

Parse the above code:

The code mainly provides two pieces of functionality: using networkx and matplotlib to visualize the graph structure (G) and the embedding vectors (h).

First, we parse the code line by line:

  1. %matplotlib inline: Jupyter Notebook's magic command, which ensures that the drawn graphics are displayed inside the notebook.

  2. Import the required libraries:

    • torch: An open source deep learning library.
    • networkx as nx: A Python library for creating, manipulating, and studying complex network structures and dynamics.
    • matplotlib.pyplot as plt: Library for drawing.
  3. visualize_graph(G, color)function:

    • Function: visualize the graph G.
    • Parameters:
      • G: the graph to be visualized.
      • color: the color of the nodes in the graph.
    • Code analysis:
      • Set the figure size to 7x7.
      • Remove the x- and y-axis ticks.
      • Use nx.draw_networkx to draw the graph; nx.spring_layout is a layout strategy that simulates spring forces between nodes so the layout looks balanced.
      • Display the figure.
  4. visualize_embedding(h, color, epoch=None, loss=None)function:

    • Function: visualize the embedding vectors h.
    • Parameters:
      • h: the embeddings to be visualized.
      • color: the color of each point.
      • epoch (optional): the current training epoch.
      • loss (optional): the current loss value.
    • Code analysis:
      • Set the figure size to 7x7.
      • Remove the x- and y-axis ticks.
      • Move the embeddings from GPU to CPU and convert the PyTorch tensor to a numpy array.
      • Use the plt.scatter function to plot each embedding in 2D space.
      • If epoch and loss are provided, these values are displayed in the plot's x-axis label.
      • Display the figure.

In short, this code provides two functions, one for visualizing the graph structure and another for visualizing the embeddings. This is particularly useful in the context of graph neural networks, for example when one needs to see the evolution of node embeddings or compare with the original graph structure.

Graph Neural Networks

  • GNNs are designed for irregular data structures (images and text have fixed formats, whereas social networks and chemical molecules do not).
  • A GNN iteratively updates each node based on its own information and that of its neighbors in the graph, which can be expressed as follows:

$$\mathbf{x}_v^{(\ell + 1)} = f^{(\ell + 1)}_{\theta} \left( \mathbf{x}_v^{(\ell)}, \left\{ \mathbf{x}_w^{(\ell)} : w \in \mathcal{N}(v) \right\} \right)$$

The node features $\mathbf{x}_v^{(\ell)}$ of each node $v \in \mathcal{V}$ in the graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ are updated based on the information from its neighbors $\mathcal{N}(v)$.
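
To make the update rule above concrete, here is a toy sketch of one such layer (an illustration only, not PyG's implementation and not the model used later). It assumes a simple mean over neighbor features as the aggregation and a tanh nonlinearity:

import torch
from torch.nn import Linear

def gnn_layer(x, neighbors, f_self, f_neigh):
    # One generic update x_v^(l+1) = f(x_v^(l), {x_w^(l) : w in N(v)}),
    # here with the mean over neighbor features as the aggregation.
    # x: [num_nodes, in_dim]; neighbors: dict node index -> list of neighbor indices.
    out = []
    for v in range(x.size(0)):
        if neighbors.get(v):
            agg = x[neighbors[v]].mean(dim=0)    # aggregate neighbor features
        else:
            agg = torch.zeros(x.size(1))         # isolated node: nothing to aggregate
        out.append(f_self(x[v]) + f_neigh(agg))  # combine self and neighborhood information
    return torch.stack(out).tanh()

# Toy usage: 3 nodes with 4-dimensional features and a small neighborhood structure.
x = torch.randn(3, 4)
neighbors = {0: [1, 2], 1: [0], 2: [0]}
print(gnn_layer(x, neighbors, Linear(4, 8), Linear(4, 8)).shape)  # torch.Size([3, 8])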

Dataset: Zachary's karate club network.

This graph describes the social relationships among the members of a karate club, with the 34 members as nodes; an edge is added between two members if they maintain a social relationship outside the club.
Each node has a 34-dimensional feature vector, and the graph has 78 (undirected) edges.
While the data was being collected, a conflict arose between the administrator John A and the instructor Mr. Hi (a pseudonym), and the members took sides: about half followed Mr. Hi to found a new club, while the rest either found a new instructor or left the club.

PyTorch Geometric

  • This is the core library: the standard graph neural network methods are already implemented here.
  • We can simply call the PyTorch Geometric (PyG) library directly.

Dataset introduction

  • You can directly refer to its API: https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html#torch_geometric.datasets.KarateClub
from torch_geometric.datasets import KarateClub

dataset = KarateClub()
print(f'Dataset: {dataset}:')
print('======================')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of features: {dataset.num_features}')
print(f'Number of classes: {dataset.num_classes}')

data = dataset[0]  # Get the first graph object.
print(data)

The output is:

Dataset: KarateClub():
======================
Number of graphs: 1
Number of features: 34
Number of classes: 4

Data(x=[34, 34], edge_index=[2, 156], y=[34], train_mask=[34])

Parsing the above code:
This code uses the KarateClub dataset from torch_geometric.datasets, a classic dataset often used in graph neural network research. The KarateClub dataset describes the relationships between the members of a karate club, where the members are divided into two factions.

Next, we parse the code step by step:

  1. Import dataset :

    from torch_geometric.datasets import KarateClub
    
  2. Load the dataset :

    dataset = KarateClub()
    

    This instantiates the KarateClub dataset and loads its data.

  3. Print general information about the dataset :

    • print(f'Dataset: {dataset}:'): Print a description of the dataset.
    • print(f'Number of graphs: {len(dataset)}'): Print the number of graphs in the dataset. The output shows there is only one graph.
    • print(f'Number of features: {dataset.num_features}'): Print the number of features for each node. The output shows that each node has 34 features.
    • print(f'Number of classes: {dataset.num_classes}'): Print the number of classes in the dataset. The output shows 4 classes (a four-class classification task).
  4. Get the first graph object :

    data = dataset[0]
    

    This gets the first (and only) graph object in the dataset. Graph data in torch_geometric is usually represented by Data objects, which contain the nodes, edges, and other related information.

  5. Print the description of the graph object :

    print(data)
    

    The output is Data(x=[34, 34], edge_index=[2, 156], y=[34], train_mask=[34]). We can learn from this:

    • x=[34, 34]: Indicates that there are 34 nodes in the graph, and each node has 34 features.

    • edge_index=[2, 156]: Describes the edges of the graph. It is a 2x156 integer tensor where each column represents an edge from a source node to a target node. 156 means the tensor stores 156 directed edge entries (the 78 undirected edges, each stored in both directions).

    • y=[34]: is an integer tensor of length 34, representing the label of each node.

    • train_mask=[34]: is a Boolean tensor of length 34 that indicates which nodes should be used for training. This is common in semi-supervised learning settings, where the labels of only a small subset of nodes are known.

      When we learn in graph data, especially in the context of semi-supervised learning, we may only have labels for some nodes in the graph. Semi-supervised learning means that only a small part of our data is labeled, and most of the data is unlabeled. The goal is to use this small amount of labeled data to learn the representation of the entire data at the same time.

      In the context of graph neural networks, we may only have labels for certain nodes in the graph. Therefore, in order to consider only these labeled nodes during training, we need a way to distinguish which nodes are used for training and which are not. That is exactly what train_mask does.

      train_mask is a Boolean tensor whose length equals the number of nodes in the graph. If a value in train_mask is True, the node at that position is used for training (that is, the node has a label); if the value is False, the node is not used for training.

      For example, assume we have the following train_mask:

      train_mask = [True, True, False, False, True]
      

      This means that the first, second, and fifth nodes in the graph have labels and will be used for training; while the third and fourth nodes have no labels and will not be used for training.

      During training of the graph neural network, we only compute and backpropagate the loss for the nodes whose train_mask value is True, allowing the model to exploit this known label information (a short sketch of this masked indexing follows at the end of this section).

Overall, this code simply loads the KarateClub dataset and displays its basic information. The dataset describes the relationship network within a karate club, containing 34 nodes (members) and 156 directed edge entries (78 undirected relationships). Each node has a 34-dimensional feature vector and a class label.
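
To make this concrete, here is a minimal sketch (not from the original post; the tensors are made up for illustration) of how a Boolean train_mask selects only the labeled nodes when computing the loss:

import torch

out = torch.randn(5, 4)                       # model scores for 5 nodes and 4 classes
y = torch.tensor([0, 2, 1, 3, 0])             # node labels (only some are actually known)
train_mask = torch.tensor([True, True, False, False, True])

criterion = torch.nn.CrossEntropyLoss()
loss = criterion(out[train_mask], y[train_mask])  # loss computed over nodes 0, 1, and 4 only
print(loss)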

edge_index

  • edge_index: describes the graph's connectivity (two sequences: source nodes and target nodes)
  • node features: the features of each node
  • node labels: the label of each node
  • train_mask: some nodes have no labels; the mask indicates which nodes should contribute to the loss
edge_index = data.edge_index
print(edge_index.t())

Parse the above code:

These two lines of code focus on the edge information in the graph, especially the edge_index attribute. In PyTorch Geometric (a popular graph neural network library), edge information is usually stored in the form of edge_index.

  1. edge_index = data.edge_index :
    This line takes the edge_index attribute from the data object and assigns it to a new variable edge_index. In PyTorch Geometric, edge_index is a tensor representing all edges in the graph.

    The dimension of edge_index is [2, E], where E is the number of edges in the graph. Each column represents an edge: the value in the first row is the edge's source node and the value in the second row is its target node.

    For example, consider the following edge_index:

    tensor([[0, 2, 2],
            [1, 0, 3]])
    

    This means that there are three edges in the graph: from node 0 to node 1, from node 2 to node 0, and from node 2 to node 3.

  2. print(edge_index.t()) :
    This line first transposes the edge_index tensor using the .t() method and then prints it. The transpose turns a tensor of shape [2, E] into one of shape [E, 2].

    Continuing the above example, the transposed tensor is:

    tensor([[0, 1],
            [2, 0],
            [2, 3]])
    

    This makes each row represent an edge, where the first value is the starting node of the edge and the second value is the ending node of the edge. This representation is more intuitive and easier to read, especially when the number of edges is very large.

In summary, these two lines of code extract edge information from the graph data object and print it out in a more readable format.
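
The toy example above can be reproduced directly in code (a small illustrative snippet, not part of the original post):

import torch

# Three edges: 0 -> 1, 2 -> 0, 2 -> 3 (first row: source nodes, second row: target nodes).
edge_index = torch.tensor([[0, 2, 2],
                           [1, 0, 3]])

print(edge_index.shape)  # torch.Size([2, 3]), i.e. [2, E] with E = 3
print(edge_index.t())    # one edge per row: [[0, 1], [2, 0], [2, 3]]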

Visual display using networkx

from torch_geometric.utils import to_networkx

G = to_networkx(data, to_undirected=True)
visualize_graph(G, color=data.y)

The output is as follows:


Parsing the above code:
The purpose of this code is to convert the PyTorch Geometric graph data into NetworkX format and visualize it using the previously defined visualize_graph function. Here is a detailed breakdown of the code:

  1. Import the required tools :

    from torch_geometric.utils import to_networkx
    

    This line imports the to_networkx function from torch_geometric.utils, which converts PyTorch Geometric graph data into NetworkX graph format.

  2. Convert graph data to NetworkX format :

    G = to_networkx(data, to_undirected=True)
    
    • Here, to_networkx(data, to_undirected=True) converts the PyTorch Geometric graph data data into the NetworkX graph G. The parameter to_undirected=True indicates that even though the original graph data may be directed, we want an undirected graph.
  3. Visualize the converted graph :

    visualize_graph(G, color=data.y)
    
    • Call the previously defined visualize_graph function to visualize the NetworkX graph G.
    • The parameter color=data.y means that node colors are based on the labels in data.y. This way, different node labels are given different colors, making them easy to distinguish in the visualization.

To sum up, this code first converts the PyTorch Geometric format graph data into a NetworkX format graph, and then colors and visualizes the graph using the given node labels. This kind of visualization is often helpful in understanding the structure of the graph and the relationships between nodes, especially when the node labels are meaningful (e.g., representing different communities or categories).

Graph Neural Network definition:

$$\mathbf{x}_v^{(\ell + 1)} = \mathbf{W}^{(\ell + 1)} \sum_{w \in \mathcal{N}(v) \,\cup\, \{v\}} \frac{1}{c_{w,v}} \cdot \mathbf{x}_w^{(\ell)}$$
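
Before turning to GCNConv below, it may help to see a rough, simplified sketch of what this propagation rule computes. The sketch assumes the symmetric normalization c_{w,v} = sqrt(deg(w) * deg(v)) over the graph with self-loops added (the convention GCNConv follows by default); it is an illustration only, not PyG's actual implementation:

import torch

def gcn_propagate(x, edge_index, weight):
    # Simplified version of the update above: add self-loops, sum neighbor features
    # weighted by 1 / sqrt(deg(w) * deg(v)), then apply the linear map W.
    num_nodes = x.size(0)
    loop = torch.arange(num_nodes)
    edge_index = torch.cat([edge_index, torch.stack([loop, loop])], dim=1)
    src, dst = edge_index
    deg = torch.zeros(num_nodes).scatter_add_(0, dst, torch.ones(dst.size(0)))
    norm = (deg[src] * deg[dst]).rsqrt()                 # 1 / c_{w,v}
    out = torch.zeros_like(x)
    out.index_add_(0, dst, norm.unsqueeze(1) * x[src])   # sum over w in N(v) and v itself
    return out @ weight                                  # apply W^(l+1)

# Toy usage: 4 nodes with 3-dimensional features mapped to 2 dimensions.
x = torch.randn(4, 3)
edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]])  # undirected edges stored in both directions
weight = torch.randn(3, 2)
print(gcn_propagate(x, edge_index, weight).shape)        # torch.Size([4, 2])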

import torch
from torch.nn import Linear
from torch_geometric.nn import GCNConv


class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        torch.manual_seed(1234)
        self.conv1 = GCNConv(dataset.num_features, 4) # Just specify the input and output feature sizes
        self.conv2 = GCNConv(4, 4)
        self.conv3 = GCNConv(4, 2)
        self.classifier = Linear(2, dataset.num_classes)

    def forward(self, x, edge_index):
        h = self.conv1(x, edge_index) # Input features and edge_index (in the format shown above)
        h = h.tanh()
        h = self.conv2(h, edge_index)
        h = h.tanh()
        h = self.conv3(h, edge_index)
        h = h.tanh()  
        
        # Classification layer
        out = self.classifier(h)

        return out, h

model = GCN()
print(model)

The output is as follows:

GCN(
  (conv1): GCNConv(34, 4)
  (conv2): GCNConv(4, 4)
  (conv3): GCNConv(4, 2)
  (classifier): Linear(in_features=2, out_features=4, bias=True)
)

Parse the above code:

The code defines a simple graph convolutional network (GCN) model. A detailed analysis of this code:

  1. Import the necessary libraries and modules:

    import torch
    from torch.nn import Linear
    from torch_geometric.nn import GCNConv
    

    These libraries and modules are required to build the model.

  2. Define the GCN class:

    class GCN(torch.nn.Module):
    

    By inheriting from torch.nn.Module, we define a new neural network model class GCN.

  3. Initialization method:

    def __init__(self):
        super().__init__()
        torch.manual_seed(1234)
        ...
    
    • super().__init__() calls the parent class's initializer.
    • torch.manual_seed(1234) sets a random seed so that the model's weight initialization is deterministic.
  4. Define the graph convolutional layers and the classifier:

    • self.conv1 = GCNConv(dataset.num_features, 4): defines the first graph convolutional layer, which maps the input features (the number of features per node, here 34) to 4 features.
    • The next two graph convolutional layers, self.conv2 and self.conv3, transform the features further.
    • self.classifier = Linear(2, dataset.num_classes): a linear classification layer that maps the output of the last graph convolutional layer (2 features) to the number of target classes (here 4).
  5. Define the forward pass:

    def forward(self, x, edge_index):
        ...
        return out, h
    
    • This defines how the input data is processed to produce the model's output.
    • The data passes through three graph convolutional layers, with the hyperbolic tangent activation tanh applied after each layer.
    • The output passes through the classifier and is returned, together with the 2D embedding h.
  6. Instantiate the model and print it:

    model = GCN()
    print(model)
    

    These lines instantiate the GCN class defined above and print the model's structure. The output shows the layers the model contains and their configuration.

Parsing the output:

GCN(
  (conv1): GCNConv(34, 4)
  (conv2): GCNConv(4, 4)
  (conv3): GCNConv(4, 2)
  (classifier): Linear(in_features=2, out_features=4, bias=True)
)

This output describes GCNthe structure of the model. It has three graph convolutional layers and a linear classifier. For example, (conv1): GCNConv(34, 4)means the first graph convolutional layer accepts 34 features as input and outputs 4 features. Finally, a linear classifier maps 2 features to 4 output categories.

Overall, this code defines a three-layer graph convolutional network and generates a classification score for each node.

Output feature display

  • Didn’t we output two-dimensional features in the end? Let’s draw it and see what it looks like.
  • But wait: our model has not been trained yet...
model = GCN()

_, h = model(data.x, data.edge_index)
print(f'Embedding shape: {list(h.shape)}')

visualize_embedding(h, color=data.y)

The output is as follows:


Parse the above code:

This code mainly focuses on two things: first, it runs the defined GCNmodel on a graph to obtain node embeddings; second, it uses a visualization function to display these embeddings. Here is a detailed breakdown of the code:

  1. Model instantiation :

    model = GCN()
    

    This line creates a new instance of the GCN class. The model was defined in the previous code and contains three graph convolutional layers and a linear classifier.

  2. Model forward propagation :

    _, h = model(data.x, data.edge_index)
    

    This line calls the GCN model's forward pass, passing in the node features data.x and the edge indices data.edge_index as arguments. Both come from data, a PyTorch Geometric graph data object.

    The output is a tuple: the first element (assigned to _) is the model's main output (the classification scores), while the second element h is the output of the model's last graph convolutional layer and represents the node embeddings.

  3. Print embedded shapes :

    print(f'Embedding shape: {list(h.shape)}')
    

    This line prints the shape of the embedding tensor h. This helps us understand the dimensions of the embedding, which are typically [num_nodes, embedding_dim].

  4. Visual embedding :

    visualize_embedding(h, color=data.y)
    

    Using the previously defined visualize_embedding function, this line visualizes the embedding h as a scatter plot. Each point in the scatter plot represents a node, and its position is determined by its embedding. The color of the points is based on data.y, which represents the label or category of each node.

Summary: This code runs a graph convolutional network model, obtains node embeddings, and visualizes these embeddings. This visualization helps us understand how the model distributes nodes in the embedding space, and the similarities and differences between nodes.

Training model (semi-supervised)

import time

model = GCN()
criterion = torch.nn.CrossEntropyLoss()  # Define loss criterion.
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # Define optimizer.

def train(data):
    optimizer.zero_grad()  
    out, h = model(data.x, data.edge_index)  # h is a 2D vector, mainly so that we can plot it
    loss = criterion(out[data.train_mask], data.y[data.train_mask])  # semi-supervised
    loss.backward()  
    optimizer.step()  
    return loss, h

for epoch in range(401):
    loss, h = train(data)
    if epoch % 10 == 0:
        visualize_embedding(h, color=data.y, epoch=epoch, loss=loss)
        time.sleep(0.3)

Parse the above code:

This code defines the training process and runs 401 training iterations (epochs 0 through 400). Next, we analyze each part of the code step by step.

  1. Initialize models and tools :

    model = GCN()
    criterion = torch.nn.CrossEntropyLoss()  # Define loss criterion.
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # Define optimizer.
    
    • model = GCN(): Creates a new instance of the previously defined GCN class.
    • criterion: The loss function is defined, here cross-entropy loss is used, which is suitable for classification problems.
    • optimizer: Defines the optimizer for updating the weights of the model. The Adam optimizer is used here, and the learning rate is set to 0.01.
  2. Define the training function :

    def train(data):
        ...
        return loss, h
    

    This function defines a training step of the model and returns the loss and node embeddings. Specific steps are as follows:

    • optimizer.zero_grad(): Clear the gradients left over from the previous iteration.
    • out, h = model(data.x, data.edge_index): Perform forward propagation on the model.
    • loss = criterion(out[data.train_mask], data.y[data.train_mask]): Calculate the loss. Since this is a semi-supervised learning task, we only compute the loss on the labeled nodes, which are indicated by train_mask.
    • loss.backward(): Perform backpropagation based on the calculated loss and calculate the gradient.
    • optimizer.step(): Use the optimizer to update the model's weights.
  3. Training loop :

    for epoch in range(401):
        loss, h = train(data)
        ...
    

    This loop runs for 401 training epochs. At each epoch, it calls the train function defined above to obtain the loss and the embeddings.

  4. Visualization :

    if epoch % 10 == 0:
        visualize_embedding(h, color=data.y, epoch=epoch, loss=loss)
        time.sleep(0.3)
    

    Every 10 epochs, the code calls the previously defined visualize_embedding function to display the node embeddings. time.sleep(0.3) adds a 0.3-second pause between visualizations so they do not change too quickly, making it easier to observe how the node embeddings evolve.

Overall, this code defines the training process of the graph convolutional network and visualizes the embeddings every 10 epochs, so that we can see how the model gradually learns to place similar nodes close together in the embedding space.

Review

Overview

At the beginning of training, the embeddings (represented by h) may be random or uniformly distributed in the embedding space. This means that, in a scatter plot, nodes with different labels may be mixed together without obvious clustering.

As training proceeds, the model attempts to place similar nodes (based on their characteristics and their position in the graph) at similar locations in the embedding space, and to separate dissimilar nodes. This is represented in the visualization as:

  1. Nodes with the same label start to cluster together.
  2. Clear boundaries are formed between different clusters or categories.

After many iterations, if the model is successfully trained, we should see several clear clusters, where each cluster represents a category of nodes. In the KarateClub dataset, since there are 4 classes, we expect to see 4 clusters.


Review and summarize the above code and practices.

1. Goal :

Node classification using graph convolutional networks (GCN) on the KarateClub dataset.

2. Data loading :

The KarateClub dataset is loaded first; it is a classic small graph dataset. It describes the relationships among the 34 members of a karate club, and the goal is to predict each member's group affiliation based on their social relationships.

3. Data visualization :

The networkx library and the helper functions defined earlier are used to visualize the graph data as a network structure diagram.

4. Model definition :

A simple GCN model is defined, consisting of three GCNConv layers that learn embedding representations of the nodes in the graph and a linear classifier that maps these embeddings to predicted classes.

5. Embedding visualization :

Immediately after the model definition, a forward pass was performed and the initial node embeddings were visualized using the provided utility functions. This provides a reference point for understanding the initial state of node embeddings before model training.

6. Model training :

  • Cross-entropy loss and Adam optimizer are set.
  • A train function is defined to describe a single training iteration.
  • The model was trained for 400 epochs and node embeddings were visualized every 10 epochs to observe how the model gradually updated and optimized the embeddings.

Summarize:

We successfully apply graph neural networks to the node classification task of the KarateClub dataset. The data was first loaded and visualized, then a GCN model was defined, and the initial node embeddings were visualized. During the following training process, we observed how the embeddings gradually formed clusters as training progressed, so that nodes with the same label were closer together. This practice demonstrates how graph neural networks work and their effectiveness on node classification tasks.

Supplement

Regarding embedding (denoted by h):

Embedding is a very common concept in deep learning and natural language processing. Embedding is the conversion of a certain type of data (such as words, nodes, users, or other entities) into a fixed-size vector so that machine learning models can process it more easily.

In the code, the embedding (represented by h) refers specifically to node embeddings of the graph: every node in the graph is converted into a vector. These vectors capture the node's features and its structural position in the graph.

Here are some detailed points about embedding:

  1. Capture information : Embedding vectors are usually designed to capture meaningful information about the original data. For example, in word embeddings, similar words will have similar embeddings.

  2. Fixed size : Embedding vectors have a fixed length regardless of the size or form of the original data. This is useful for machine learning models as they require fixed size inputs.

  3. Graph node embedding : In graph neural networks (such as GCN), node embedding captures the characteristic information of nodes and their adjacency relationships in the graph. Adjacent or similar nodes may have similar embeddings.

  4. Visualization : Since embeddings are vector representations of high-dimensional data, they can be used for visualization. In the code, each node embedding h is a 2D vector, which allows the nodes to be drawn directly on the plane. This visualization helps us understand how the model organizes nodes in the embedding space.

In this context, h is the output of the graph convolutional network (GCN), which provides a 2D embedding for each node in the graph. These embedding vectors are updated as the model is trained, so that they better reflect each node's features and its position in the graph.

Node classification using graph convolutional networks (GCN) on paper citation datasets

Node classification task learning

Cora dataset (dataset description: Yang et al., 2016)

  • A paper-citation dataset; each node has a 1433-dimensional feature vector
  • Each node is classified into one of 7 classes (only 20 nodes per class are labeled for training)
from torch_geometric.datasets import Planetoid  # used to download the dataset
from torch_geometric.transforms import NormalizeFeatures

dataset = Planetoid(root='data/Planetoid', name='Cora', transform=NormalizeFeatures())  # transform: feature preprocessing

print()
print(f'Dataset: {dataset}:')
print('======================')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of features: {dataset.num_features}')
print(f'Number of classes: {dataset.num_classes}')

data = dataset[0]  # Get the first graph object.

print()
print(data)
print('===========================================================================================================')

# Gather some statistics about the graph.
print(f'Number of nodes: {data.num_nodes}')
print(f'Number of edges: {data.num_edges}')
print(f'Average node degree: {data.num_edges / data.num_nodes:.2f}')
print(f'Number of training nodes: {data.train_mask.sum()}')
print(f'Training node label rate: {int(data.train_mask.sum()) / data.num_nodes:.2f}')
print(f'Has isolated nodes: {data.has_isolated_nodes()}')
print(f'Has self-loops: {data.has_self_loops()}')
print(f'Is undirected: {data.is_undirected()}')

The output is as follows:

Dataset: Cora():
======================
Number of graphs: 1
Number of features: 1433
Number of classes: 7

Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])
===========================================================================================================
Number of nodes: 2708
Number of edges: 10556
Average node degree: 3.90
Number of training nodes: 140
Training node label rate: 0.05
Has isolated nodes: False
Has self-loops: False
Is undirected: True

Parse the above code:

The Planetoid datasets are commonly used in graph neural network research; here the Cora dataset is loaded, a single large graph mainly used for document classification tasks. In the Cora dataset, each node represents a document, each edge represents a citation relationship between documents, and a node's features encode the presence of words in the document (a bag-of-words representation).

Here is a detailed analysis of the code:

Data loading and preprocessing:

  1. Planetoid: Used to download and load Planetoid datasets (Cora, CiteSeer and PubMed).
  2. NormalizeFeatures: This is a preprocessing step that normalizes node features.

Statistics:

Next, some statistical information about the data is printed:

  1. Dataset information :

    • Number of graphs: The number of graphs in the dataset. Cora only has one graph, so this number is 1.
    • Number of features: Number of features per node. In Cora, this represents a bag-of-words representation of a document.
    • Number of classes: The number of categories to be classified. This is the number of categories for the document classification task.
  2. Graph data :

    • Data(...): shows the main properties of the graph.

    This output describes a graph structure in the Cora dataset, which contains the graph’s node features, edges, node labels, and other information. Let’s analyze them one by one:

    1. x=[2708, 1433]:

      • This is a node feature matrix, which contains 2708 nodes.
      • Each node has 1433 features. These features are a bag-of-words representation of the document: each document (node) is represented by a 1433-dimensional vector, where each dimension corresponds to a word in the vocabulary and its value indicates the word's occurrence count or importance in the document.
    2. edge_index=[2, 10556]:

      • This is a tensor that defines the edges in the graph.
      • "2" represents the number of rows of the tensor, each row is a connected pair of nodes. This way is used to define edges in the graph. The first row contains the index of the source node, while the second row contains the index of the target node.
      • "10556" means there are a total of 10556 edges in the graph.
    3. y=[2708]:

      • This is a vector of node labels.
      • It contains 2708 labels, corresponding to 2708 nodes. Each label may be an integer value indicating the category it belongs to.
    4. train_mask=[2708], val_mask=[2708], test_mask=[2708]:

      • These are Boolean masks used to separate nodes in the dataset into training, validation and test sets.
      • All three masks contain 2708 Boolean values. If train_mask[i] is True, the i-th node is used for training; similarly, val_mask and test_mask mark the nodes used for validation and testing respectively.
      • This separation method is for a semi-supervised learning setting, where only some of the node labels are known and used for training, while the labels of the remaining nodes are used for validation and testing.

    To sum up, the Cora data set contains a large graph with 2708 nodes and 10556 edges. Each node has 1433 bag-of-words model-based features and is divided into 7 categories. In order to perform machine learning tasks such as node classification, these nodes are divided into training, validation and test sets.

  3. Graph statistics :

    • Number of nodes: The number of nodes in the graph.
    • Number of edges: The number of edges in the graph.
    • Average node degree: Average node degree, indicating the average number of edges connected to each node.
    • Number of training nodes: Number of nodes used for training.
    • Training node label rate: The proportion of nodes used for training.
    • Has isolated nodes: Whether there are isolated nodes in the graph.
    • Has self-loops: Whether there is a self-loop in the graph, that is, whether a node has an edge pointing to itself.
    • Is undirected: Whether the graph is undirected.

Output analysis:

From the output, we can see the following information for the Cora dataset:

  • It consists of a large graph with 2708 nodes and 10556 edges.
  • Each node has 1433 features, which are based on the bag-of-word representation of the literature.
  • Documents need to be classified into 7 different categories.
  • On average, each node has 3.90 edges.
  • Only 140 node labels are used for training, which means most node labels are hidden. This setting simulates a semi-supervised learning scenario.
  • The graph has no isolated nodes, has no self-loops, and is undirected.

Overall, this code loads the Cora dataset, performs preprocessing, and obtains detailed statistics about the graph structure and its properties.

Explanation - "Each node has 1433 features, which are based on the bag-of-word representation of the literature"

"Each node has 1433 features, and these features are based on the bag-of-word representation of the literature." This sentence involves two important concepts: node features and bag-of-words representation (Bag-of-Words, BoW). Let’s analyze them one by one:

  1. Node Characteristics : In a graph, each node can have one or more attributes or characteristics. In many graph neural network tasks, these features are used to predict node labels, classification, etc. For example, in a social network, a node may represent a person, and node characteristics may include the person's age, gender, occupation, etc.

  2. Bag of Words Representation (BoW) : BoW is a technology that converts text data (such as sentences, paragraphs, or entire documents) into numerical features. Specifically, it is a text model where each document is represented as a fixed-length vector. The length of this vector is usually the same as the size of the vocabulary (or the set of all considered words). The elements of each vector represent the number of times the corresponding word in the vocabulary appears in the document.

In the context of the Cora dataset:

  • Each node represents a document : The nodes in the Cora dataset represent academic documents.

  • 1433 features : means 1433 different words were considered (or possibly 1433 features extracted by other text processing techniques, such as TF-IDF).

  • Document-based bag-of-word representation : The 1433 feature values ​​of each node (or document) are created based on the document content. A specific feature value represents the number of occurrences of the corresponding word in the document or other related measures.

For example, suppose our vocabulary only has three words: ["apple", "banana", "cherry"]. If "apple" is mentioned 10 times and "banana" 5 times in a document, but "cherry" is not mentioned, then the BoW representation of this document will be [10, 5, 0].

In the actual context of Cora, each document is transformed into a 1433-dimensional vector, each dimension represents a word in the vocabulary, and the value represents the importance or number of occurrences of the word in the document.
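
As an illustration (not part of the original post; the documents below are made up), a bag-of-words representation can be computed with scikit-learn's CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["apple apple banana", "banana cherry cherry cherry"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # ['apple' 'banana' 'cherry']
print(bow.toarray())
# [[2 1 0]
#  [0 1 3]]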

# Visualization
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def visualize(h, color):
    z = TSNE(n_components=2).fit_transform(h.detach().cpu().numpy())

    plt.figure(figsize=(10,10))
    plt.xticks([])
    plt.yticks([])

    plt.scatter(z[:, 0], z[:, 1], s=70, c=color, cmap="Set2")
    plt.show()

Parse the above code:

The purpose of this code is to visualize high-dimensional data. In order to facilitate observation in two-dimensional space, the code uses the t-SNE (t-distributed stochastic neighbor embedding) algorithm to map high-dimensional data to two-dimensional space. Here’s a detailed breakdown of each part:

  1. Import related libraries :

    %matplotlib inline
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE
    
    • %matplotlib inline: is a special command of Jupyter Notebook, which allows the generated image to be displayed directly in the notebook.
    • import matplotlib.pyplot as plt: Import the drawing module of matplotlib, which is often used for data visualization.
    • from sklearn.manifold import TSNE: Import the t-SNE implementation in the sklearn library.
  2. Define the visualization function visualize :

    def visualize(h, color):
        z = TSNE(n_components=2).fit_transform(h.detach().cpu().numpy())
        ...
    
    • The visualize function has two parameters: h and color. h is the high-dimensional data to be visualized, and color is a list used to assign a color to each data point.
  3. Use t-SNE for data mapping :

    z = TSNE(n_components=2).fit_transform(h.detach().cpu().numpy())
    
    • t-SNE is a nonlinear dimensionality reduction method often used to visualize high-dimensional data. Here, we reduce the data to 2 dimensions.
    • h.detach().cpu().numpy(): Since h may be a PyTorch tensor, first use detach() so that it no longer carries gradient information; then use cpu() to make sure it is on the CPU; finally, convert it to a numpy array.
  4. Drawing settings :

    plt.figure(figsize=(10,10))
    plt.xticks([])
    plt.yticks([])
    
    • plt.figure(figsize=(10,10)): Define the size of the image to be 10x10 units.
    • plt.xticks([]) and plt.yticks([]): Hide the x- and y-axis ticks.
  5. Draw a scatter plot :

    plt.scatter(z[:, 0], z[:, 1], s=70, c=color, cmap="Set2")
    
    • z[:, 0] and z[:, 1]: the x and y coordinates of the two-dimensional data produced by the t-SNE algorithm.

    • s=70: Specifies the scatter point size to be 70.

    • c=color: Specifies the color of each scatter point. color is the argument passed to the visualize function, usually a list or array with one entry per data point, representing each point's color or category.

    • cmap="Set2": Defines a color map that ensures that the scatterplot colors are chosen from the "Set2" palette.

  6. Display graphics :

    plt.show()
    
    • Use plt.show()to display previously defined and configured images. In Jupyter Notebook, this displays the image directly below the cell.

In summary, this code defines a function called visualize which accepts high-dimensional data and color information, then uses the t-SNE algorithm to reduce the data to two dimensions and display it as a scatter plot. Through this visualization, we can observe the distribution and clustering trends of data points in a low-dimensional space, thereby gaining an intuitive understanding of the data's structure and patterns.
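
As a quick sanity check (not in the original post), the function can be called on random, made-up data to verify that the whole pipeline runs:

import torch

h_fake = torch.randn(100, 16)           # 100 points with 16-dimensional "embeddings"
labels = torch.randint(0, 7, (100,))    # 7 fake classes used only for coloring
visualize(h_fake, color=labels)         # t-SNE reduces 16 dims to 2 and plots the points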

What if we directly use traditional fully connected layers (a multi-layer perceptron network)?

import torch
from torch.nn import Linear
import torch.nn.functional as F


class MLP(torch.nn.Module):
    def __init__(self, hidden_channels):
        super().__init__()
        torch.manual_seed(12345)
        self.lin1 = Linear(dataset.num_features, hidden_channels)
        self.lin2 = Linear(hidden_channels, dataset.num_classes)

    def forward(self, x):
        x = self.lin1(x)
        x = x.relu()
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.lin2(x)
        return x

model = MLP(hidden_channels=16)
print(model)

The output is as follows:

MLP(
  (lin1): Linear(in_features=1433, out_features=16, bias=True)
  (lin2): Linear(in_features=16, out_features=7, bias=True)
)

Parse the above code:

This code defines a simple multilayer perceptron (MLP) and initializes it based on the given data set. Here is a detailed analysis of the code:

  1. Import necessary libraries and modules :

    import torch
    from torch.nn import Linear
    import torch.nn.functional as F
    
    • torch: PyTorch library for tensor computation and neural networks.
    • Linear: A module representing a fully connected linear layer.
    • F: This is a functional module in PyTorch, which contains various neural network operations such as ReLU and dropout.
  2. Define MLP class :

    class MLP(torch.nn.Module):
    

    This class inherits from torch.nn.Module, indicating that it is a PyTorch model.

  3. Constructor( __init__) :

    def __init__(self, hidden_channels):
        super().__init__()
        torch.manual_seed(12345)
        self.lin1 = Linear(dataset.num_features, hidden_channels)
        self.lin2 = Linear(hidden_channels, dataset.num_classes)
    
    • hidden_channels: This is the size of the hidden layer of the MLP.
    • torch.manual_seed(12345): Set a random seed to ensure that the model's weight initialization is repeatable.
    • self.lin1 and self.lin2: define two linear layers. The first linear layer maps the input features to the hidden layer, and the second linear layer maps the hidden layer to the output layer.
  4. Forward propagation (the forward method) :

    def forward(self, x):
        x = self.lin1(x)
        x = x.relu()
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.lin2(x)
        return x
    
    • The input x first passes through the lin1 linear layer.
    • Then apply the ReLU activation function.
    • Then a 50% dropout is applied. Dropout is a regularization technique that randomly turns off some neurons during training to prevent overfitting.
    • Finally, the data passes through the lin2 linear layer.
  5. Model instantiation :

    model = MLP(hidden_channels=16)
    

    Here a new MLP model instance is created with a hidden layer size of 16.

  6. Print model :

    print(model)
    

    When you print a model, PyTorch displays the structure of the model. For this particular model, the output is as follows:

    MLP(
      (lin1): Linear(in_features=1433, out_features=16, bias=True)
      (lin2): Linear(in_features=16, out_features=7, bias=True)
    )
    

    This means that the MLP has two linear layers. The first linear layer accepts 1433 features as input and outputs 16 hidden channels. The second linear layer accepts input from 16 hidden channels and outputs 7 results, which correspond to the 7 categories in the dataset.


model = MLP(hidden_channels=16)
criterion = torch.nn.CrossEntropyLoss()  # Define loss criterion.
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)  # Define optimizer.

def train():
    model.train()
    optimizer.zero_grad()  # Clear gradients.
    out = model(data.x)  # Perform a single forward pass.
    loss = criterion(out[data.train_mask], data.y[data.train_mask])  # Compute the loss solely based on the training nodes.
    loss.backward()  # Derive gradients.
    optimizer.step()  # Update parameters based on gradients.
    return loss

def test():
    model.eval()
    out = model(data.x)
    pred = out.argmax(dim=1)  # Use the class with highest probability.
    test_correct = pred[data.test_mask] == data.y[data.test_mask]  # Check against ground-truth labels.
    test_acc = int(test_correct.sum()) / int(data.test_mask.sum())  # Derive ratio of correct predictions.
    return test_acc

for epoch in range(1, 201):
    loss = train()
    print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}')

Parse the above code:

This code describes the process of training and testing on graph data using a multilayer perceptron (MLP). Here is a detailed analysis of the code:

  1. Model initialization :

    model = MLP(hidden_channels=16)
    

    Here an MLP model instance is created with a hidden layer size of 16.

  2. Loss function definition :

    criterion = torch.nn.CrossEntropyLoss()
    

    This defines the cross-entropy loss function, which is commonly used in multi-classification tasks.

  3. Optimizer definition :

    optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
    

    The model was trained using the Adam optimizer with the learning rate set to 0.01 and weight decay applied, which helps prevent the model from overfitting.

  4. Training function :

    def train():
    

    This function defines the training steps of the model:

    • model.train(): Set the model to training mode.
    • optimizer.zero_grad(): Clear gradients at the beginning of each training iteration.
    • out = model(data.x): Perform a forward propagation through the model.
    • loss = criterion(...): Calculate loss based on training nodes only.
    • loss.backward(): Calculate the gradient of the loss.
    • optimizer.step(): Update model parameters.
  5. Test function :

    def test():
    

    This function defines the test steps:

    • model.eval(): Sets the model to evaluation mode, which means layers such as dropout and batch normalization will run in inference mode.
    • pred = out.argmax(dim=1): Take the class with the highest output score for each node as the predicted class.
    • test_correct = ...: Check if the predicted value is equal to the true label.
    • test_acc = ...: Calculates the rate of correct predictions.
  6. Training loop :

    for epoch in range(1, 201):
        loss = train()
        print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}')
    

    This loop runs for 200 training epochs. In each epoch, the model is trained once on the training data and the loss for that epoch is printed.

Summary: This code describes how to use a simple multilayer perceptron to train and evaluate graph data. It first defines the model, loss function and optimizer, then performs the training and evaluation of the model through the training and testing functions, and finally trains the model in 200 training epochs.
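
Note that the loop above only prints the loss and never calls the test function. To obtain the MLP's test accuracy (the roughly 59% figure referred to at the end of this post), you can evaluate after training, for example:

# Evaluate the trained MLP on the held-out test nodes.
test_acc = test()
print(f'Test Accuracy: {test_acc:.4f}')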

Graph Neural Network (GNN)

Replace the fully connected layer with a GCN layer

from torch_geometric.nn import GCNConv


class GCN(torch.nn.Module):
    def __init__(self, hidden_channels):
        super().__init__()
        torch.manual_seed(1234567)
        self.conv1 = GCNConv(dataset.num_features, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, dataset.num_classes)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = x.relu()
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return x

model = GCN(hidden_channels=16)
print(model)

The output is as follows:

GCN(
  (conv1): GCNConv(1433, 16)
  (conv2): GCNConv(16, 7)
)

Parse the above code:

This code describes the definition of using a graph convolutional network (GCN) model to operate on graph data. Here is a detailed analysis of the code:

  1. Model definition :

    class GCN(torch.nn.Module):
    

    defines a class GCN based on torch.nn.Module, which means it is a PyTorch model class.

  2. Constructor initialization :

    def __init__(self, hidden_channels):
    

    The constructor of GCN takes one parameter, hidden_channels, which defines the size of the hidden layer.

  3. Layer definition :

    • torch.manual_seed(1234567): Set a random seed to ensure reproducibility of the experiment.
    • self.conv1 = GCNConv(dataset.num_features, hidden_channels): The first layer is a graph convolutional layer that maps node features from the original feature size (dataset.num_features, here 1433) to hidden_channels features (here 16).
    • self.conv2 = GCNConv(hidden_channels, dataset.num_classes): The second layer maps the hidden-layer output to the number of classes, which is 7 in this case.
  4. Forward propagation :

    def forward(self, x, edge_index):
    

    The forward propagation process of the model is defined:

    • x = self.conv1(x, edge_index): Through the first layer of graph convolution.
    • x = x.relu(): Apply the ReLU activation function to the output.
    • x = F.dropout(x, p=0.5, training=self.training): Apply dropout to prevent overfitting, with a dropout rate of 0.5.
    • x = self.conv2(x, edge_index): Through the second layer of graph convolution.
  5. Model initialization :

    model = GCN(hidden_channels=16)
    

    This line of code creates a GCNmodel instance with a hidden layer size of 16.

  6. Output :

    GCN(
      (conv1): GCNConv(1433, 16)
      (conv2): GCNConv(16, 7)
    )
    

    This is a printout of the model. It shows two graph convolutional layers of the model, the first layer accepts 1433 features and outputs 16 features, while the second layer accepts these 16 features and outputs 7 categories.

Summary: This code describes a simple neural network model based on graph convolution. It consists of two graph convolutional layers, where the first graph convolutional layer is responsible for feature transformation, while the second graph convolutional layer produces the final category output. This model produces a category output for each node in the graph, which can be used for classification tasks of nodes in the graph.
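
As a small optional check (not in the original post), you can count the model's trainable parameters to get a sense of its size:

# Count the trainable parameters of the GCN instantiated above.
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'Trainable parameters: {num_params}')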

During visualization, since the output is a 7-dimensional vector, the dimension is reduced to 2 dimensions for display.

model = GCN(hidden_channels=16)
model.eval()

out = model(data.x, data.edge_index)
visualize(out, color=data.y)

The output is as follows:



Train GCN model

model = GCN(hidden_channels=16)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
criterion = torch.nn.CrossEntropyLoss()

def train():
    model.train()
    optimizer.zero_grad()  
    out = model(data.x, data.edge_index)  
    loss = criterion(out[data.train_mask], data.y[data.train_mask])  
    loss.backward() 
    optimizer.step()  
    return loss

def test():
    model.eval()
    out = model(data.x, data.edge_index)
    pred = out.argmax(dim=1)  
    test_correct = pred[data.test_mask] == data.y[data.test_mask]  
    test_acc = int(test_correct.sum()) / int(data.test_mask.sum())  
    return test_acc


for epoch in range(1, 101):
    loss = train()
    print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}')

Parse the above code:

This code implements the training process of the GCN model based on graph data, and also provides a test function to evaluate the performance of the model. Below I will analyze this code in detail:

  1. Model initialization :

    model = GCN(hidden_channels=16)
    

    Here a new GCN model instance is created with a hidden layer size of 16.

  2. Optimizer initialization :

    optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
    

    Use the Adam optimizer to update the model's weights. The learning rate is set to 0.01 and the weight decay (used for regularization) is 5e-4.

  3. Define the loss function :

    criterion = torch.nn.CrossEntropyLoss()
    

    Choose the cross-entropy loss function, which is commonly used in multi-classification problems.

  4. Define the training function :

    • model.train(): Set the model to training mode.
    • optimizer.zero_grad(): Clear all optimized gradients.
    • out = model(data.x, data.edge_index): Use the model for forward propagation.
    • loss = criterion(...): Calculate the loss using training nodes.
    • loss.backward(): Perform backpropagation to calculate gradients.
    • optimizer.step(): Use the optimizer to update model parameters.
  5. Define test function :

    • model.eval(): Set the model to evaluation mode.
    • out = model(...): Perform forward propagation.
    • pred = out.argmax(dim=1): Get the index of the maximum value for each node's output, which represents the predicted category.
    • test_correct = ...: Check if the predicted class matches the true class.
    • test_acc = ...: Calculate the accuracy on the test set.
  6. Training loop :

    for epoch in range(1, 101):
        loss = train()
        print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}')
    

    The model is trained for 100 epochs and the current loss is printed after each epoch.

In summary, this code describes how to use the GCN model, the Adam optimizer, and the cross-entropy loss function for training on graph data. Two functions, train and test, are defined to implement the training and testing logic respectively, and the model is trained in the main loop.


# Accuracy computation
test_acc = test()
print(f'Test Accuracy: {test_acc:.4f}')

The output is as follows:

Test Accuracy: 0.8150

From 59% (with the MLP) to about 81% (with the GCN), this is a substantial improvement; the visualization after training is shown below:

model.eval()

out = model(data.x, data.edge_index)
visualize(out, color=data.y)

The output is as follows:


Origin: blog.csdn.net/Waldocsdn/article/details/132928997