Introduction to Graphs
Graph-structured data have been widely utilized in many real-world scenarios. For example, each user on Facebook can be seen as a vertex and their relations like friendship or followership can be seen as edges in the graph. We might be interested in predicting the interests of users, or whether a pair of nodes in a network might have an edge connecting them.
How to represent a graph in CogDL
A graph is used to store information of structured data. CogDL represents a graph with a
Graph holds the following attributes:
x: Node feature matrix with shape
[num_nodes, num_features], torch.Tensor
edge_index: COO format sparse matrix, Tuple
edge_weight: Edge weight with shape
edge_attr: Edge attribute matrix with shape
y: Target labels of each node, with shape
[num_nodes,]for single label case and [num_nodes, num_labels] for mult-label case
row_indptr: Row index pointer for CSR sparse matrix, torch.Tensor.
col_indices: Column indices for CSR sparse matrix, torch.Tensor.
num_nodes: The number of nodes in graph.
num_edges: The number of edges in graph.
The above are the basic attributes but are not necessary. You may define a graph with g = Graph(edge_index=edges) and omit the others.
Graph is not restricted to these attributes and other self-defined attributes, e.g., graph.mask = mask, are also supported.
Graph stores sparse matrix with COO or CSR format. COO format is easier to add or remove edges, e.x. add_self_loops, and CSR is stored for fast message-passing.
Graph automatically convert between two formats and you can use both on demands without worrying. You can create a Graph with edges or assign edges
to a created graph. edge_weight will be automatically initialized as all ones, and you can modify it to fit your need.
import torch from cogdl.data import Graph edges = torch.tensor([[0,1],[1,3],[2,1],[4,2],[0,3]]).t() g = Graph() g.edge_index = edges g = Graph(edge_index=edges) # equivalent to that above print(g.edge_weight) >> tensor([1., 1., 1., 1., 1.]) g.num_nodes >> 5 g.num_edges >> 5 g.edge_weight = torch.rand(5) print(g.edge_weight) >> tensor([0.8399, 0.6341, 0.3028, 0.0602, 0.7190])
We also implement commonly used operations in
add_self_loops: add self loops for nodes in graph,
add_remaining_self_loops: add self-loops for nodes without it.
sym_norm: symmetric normalization of edge_weight used GCN:
row_norm: row-wise normalization of edge_weight:
degrees: get degrees for each node. For directed graph, this function returns in-degrees of each node.
import torch from cogdl.data import Graph edge_index = torch.tensor([[0,1],[1,3],[2,1],[4,2],[0,3]]).t() g = Graph(edge_index=edge_index) >> Graph(edge_index=[2, 5]) g.add_remaining_self_loops() >> Graph(edge_index=[2, 10], edge_weight=) >> print(edge_weight) # tensor([1., 1., ..., 1.]) g.row_norm() >> print(edge_weight) # tensor([0.3333, ..., 0.50])
subgraph: get a subgraph containing given nodes and edges between them.
edge_subgraph: get a subgraph containing given edges and corresponding nodes.
sample_adj: sample a fixed number of neighbors for each given node.
from cogdl.datasets import build_dataset_from_name g = build_dataset_from_name("cora") g.num_nodes >> 2707 g.num_edges >> 10184 # Get a subgraph contaning nodes [0, .., 99] sub_g = g.subgraph(torch.arange(100)) >> Graph(x=[100, 1433], edge_index=[2, 18], y=) # Sample 3 neighbors for each nodes in [0, .., 99] nodes, adj_g = g.sample_adj(torch.arange(100), size=3) >> Graph(edge_index=[2, 300]) # adj_g
train/eval: In inductive settings, some nodes and edges are unseen during training,
train/evalprovides access to switching backend graph for training/evaluation. In transductive setting, you may ignore this.
# train_step model.train() graph.train() # inference_step model.eval() graph.eval()
How to construct mini-batch graphs
In node classification, all operations are in one single graph. But in tasks like graph classification, we need to deal with
many graphs with mini-batch. Datasets for graph classification contains graphs which can be accessed with index, e.x.
To support mini-batch training/inference, CogDL combines graphs in a batch into one whole graph, where adjacency matrices form sparse block diagnal matrices
and others(node features, labels) are concatenated in node dimension.
cogdl.data.Dataloader handles the process.
from cogdl.data import DataLoader from cogdl.datasets import build_dataset_from_name dataset = build_dataset_from_name("mutag") >> MUTAGDataset(188) dataswet >> Graph(x=[17, 7], y=, edge_index=[2, 38]) loader = DataLoader(dataset, batch_size=8) for batch in loader: model(batch) >> Batch(x=[154, 7], y=, batch=, edge_index=[2, 338])
batch is an additional attributes that indicate the respective graph the node belongs to. It is mainly used to do global
pooling, or called readout to generate graph-level representation. Concretely,
batch is a tensor like:
The following code snippet shows how to do global pooling to sum over features of nodes in each graph:
def batch_sum_pooling(x, batch): batch_size = int(torch.max(batch.cpu())) + 1 res = torch.zeros(batch_size, x.size(1)).to(x.device) out = res.scatter_add_( dim=0, index=batch.unsqueeze(-1).expand_as(x), src=x ) return out
How to edit the graph?
Changes can be applied to edges in some settings. In such cases, we need to generate a graph for calculation while keep the original graph. CogDL provides graph.local_graph to set up a local scape and any out-of-place operation will not reflect to the original graph. However, in-place operation will affect the original graph.
graph = build_dataset_from_name("cora") graph.num_edges >> 10184 with graph.local_graph(): mask = torch.arange(100) row, col = graph.edge_index graph.edge_index = (row[mask], col[mask]) graph.num_edges >> 100 graph.num_edges >> 10184 graph.edge_weight >> tensor([1.,...,1.]) with graph.local_graph(): graph.edge_weight += 1 graph.edge_weight >> tensor([2.,...,2.])
Common graph datasets
CogDL provides a bunch of commonly used datasets for graph tasks like node classification, graph classification and others. You can access them conveniently shown as follows. Statistics of datasets are on this page .
from cogdl.datasets import build_dataset_from_name dataset = build_dataset_from_name("cora") from cogdl.datasets import build_dataset dataset = build_dataset(args) # if args.dataet = "cora"
For all datasets for node classification, we use train_mask, val_mask, test_mask to denote train/validation/test split for nodes.