Unsupervised Node Classification

In this tutorial, we will introduce an important task: unsupervised node classification. In this task, we usually apply L2-normalized logistic regression to train a classifier on the learned node embeddings and use F1-score or Accuracy to measure the performance.
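
As a rough sketch of this evaluation protocol (illustrative only, not CogDL's API; it assumes the learned embeddings and integer labels are already available as NumPy arrays):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize

# stand-ins for learned node embeddings and single-label targets
embeddings = np.random.rand(1000, 128)
labels = np.random.randint(0, 5, size=1000)

X = normalize(embeddings, norm="l2")  # L2-normalize the node representations
X_train, X_test, y_train, y_test = train_test_split(X, labels, train_size=0.5, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Micro-F1:", f1_score(y_test, clf.predict(X_test), average="micro"))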

Unsupervised node classification includes network embedding methods (DeepWalk, LINE, ProNE, etc.) and GNN self-supervised methods (DGI, GraphSAGE, etc.). In this section, we mainly introduce the network embedding part; the GNN self-supervised part will be presented in the next section, trainer.

Unsupervised Graph Embedding Methods

These methods differ in whether they support weighted graphs, whether they are shallow-network or matrix-factorization based, their reproducibility, and GPU support. The supported methods are:

DeepWalk, LINE, Node2Vec, NetMF, NetSMF, HOPE, GraRep, SDNE, DNGR, and ProNE.

Unsupervised Graph Neural Network Representation Learning Methods

These methods differ in whether they support sampling, whether they are inductive, and their reproducibility. The supported methods are:

DGI, MVGRL, GRACE, and GraphSAGE.

First, we define the UnsupervisedNodeClassification class, which has two parameters: hidden-size and num-shuffle. hidden-size represents the dimension of the node representation, while num-shuffle is the number of times the data are shuffled during classifier evaluation.

@register_task("unsupervised_node_classification")
class UnsupervisedNodeClassification(BaseTask):
    """Node classification task."""

    @staticmethod
    def add_args(parser):
        """Add task-specific arguments to the parser."""
        # fmt: off
        parser.add_argument("--hidden-size", type=int, default=128)
        parser.add_argument("--num-shuffle", type=int, default=5)
        # fmt: on

    def __init__(self, args):
        super(UnsupervisedNodeClassification, self).__init__(args)
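
As a small, hypothetical illustration of how these arguments flow through argparse (CogDL's own option parsing handles this wiring in practice):

import argparse

parser = argparse.ArgumentParser()
UnsupervisedNodeClassification.add_args(parser)
args = parser.parse_args(["--hidden-size", "128", "--num-shuffle", "5"])
print(args.hidden_size, args.num_shuffle)  # -> 128 5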

Then we can build the dataset according to the input graph's type and obtain self.label_matrix.

dataset = build_dataset(args)
self.data = dataset[0]
if issubclass(dataset.__class__.__bases__[0], InMemoryDataset):
    # single-label dataset: turn integer class ids into a one-hot label matrix
    self.num_nodes = self.data.y.shape[0]
    self.num_classes = dataset.num_classes
    self.label_matrix = np.zeros((self.num_nodes, self.num_classes), dtype=int)
    self.label_matrix[range(self.num_nodes), self.data.y] = 1
    self.data.edge_attr = self.data.edge_attr.t()
else:
    # multi-label dataset: y is already a (num_nodes, num_classes) matrix
    self.label_matrix = self.data.y
    self.num_nodes, self.num_classes = self.data.y.shape
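
As a tiny illustration of the one-hot construction above (toy values, three nodes and three classes):

import numpy as np

y = np.array([2, 0, 1])                      # integer class id per node
label_matrix = np.zeros((3, 3), dtype=int)
label_matrix[range(3), y] = 1
# label_matrix is now:
# [[0 0 1]
#  [1 0 0]
#  [0 1 0]]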

After that, we can build the model and run model.train(G) to obtain the node representations.

self.model = build_model(args)
self.is_weighted = self.data.edge_attr is not None


def train(self):
    G = nx.Graph()
    if self.is_weighted:
        edges, weight = (
            self.data.edge_index.t().tolist(),
            self.data.edge_attr.tolist(),
        )
        G.add_weighted_edges_from(
            [(edges[i][0], edges[i][1], weight[0][i]) for i in range(len(edges))]
        )
    else:
        G.add_edges_from(self.data.edge_index.t().tolist())
    embeddings = self.model.train(G)
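
To make the edge_index handling above concrete, here is a toy example (not taken from CogDL) showing how the (2, num_edges) tensor and the transposed edge_attr are turned into a weighted networkx graph:

import torch
import networkx as nx

edge_index = torch.tensor([[0, 1, 2],
                           [1, 2, 0]])        # shape (2, num_edges)
edge_attr = torch.tensor([[0.5, 1.0, 2.0]])   # shape (1, num_edges), as after .t()

G = nx.Graph()
edges, weight = edge_index.t().tolist(), edge_attr.tolist()
G.add_weighted_edges_from(
    [(edges[i][0], edges[i][1], weight[0][i]) for i in range(len(edges))]
)
# G now contains edges (0, 1), (1, 2) and (0, 2) with weights 0.5, 1.0 and 2.0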

The spectral propagation in ProNE/ProNE++ can improve the quality of representations learned by other methods, so we can use enhance_emb to enhance the embeddings. ProNE++ automatically searches for the best graph filter to improve the embedding.

if self.enhance is True:
    embeddings = self.enhance_emb(G, embeddings)
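
As a rough, simplified illustration of the idea behind spectral propagation, the sketch below just smooths embeddings over the row-normalized adjacency matrix. It is a hypothetical stand-in, not the actual ProNE/ProNE++ filters (which use more sophisticated kernels such as truncated Chebyshev expansions), and it assumes a recent networkx version that provides to_scipy_sparse_array:

import numpy as np
import networkx as nx
import scipy.sparse as sp

def naive_propagation(G, embeddings, steps=3, alpha=0.5):
    """Smooth each embedding dimension over the graph structure."""
    A = nx.to_scipy_sparse_array(G, nodelist=sorted(G.nodes()), format="csr")
    deg = np.asarray(A.sum(axis=1)).ravel()
    deg[deg == 0] = 1.0
    P = sp.diags(1.0 / deg) @ A               # row-normalized adjacency
    smoothed = embeddings.copy()
    for _ in range(steps):
        smoothed = (1 - alpha) * embeddings + alpha * (P @ smoothed)
    return smoothed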

Once the embeddings are obtained, we can save them to self.save_dir.
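
For example, one straightforward way to do this (the exact filename and format CogDL uses may differ) is:

import os
import numpy as np

os.makedirs(self.save_dir, exist_ok=True)
np.save(os.path.join(self.save_dir, "embeddings.npy"), embeddings)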

At last, we evaluate the embeddings by running the classification num_shuffle times under different training ratios, using features_matrix and label_matrix.

def _evaluate(self, features_matrix, label_matrix, num_shuffle):
    # shuffle, to create train/test groups
    shuffles = []
    for _ in range(num_shuffle):
        shuffles.append(skshuffle(features_matrix, label_matrix))

    # score each train/test group
    all_results = defaultdict(list)
    training_percents = [0.1, 0.3, 0.5, 0.7, 0.9]
    for train_percent in training_percents:
        for shuf in shuffles:

In each shuffle, the data are split into two parts (training and testing) and evaluated with LogisticRegression.

# ... shuffle to generate train/test set X_train/X_test, y_train/y_test

clf = TopKRanker(LogisticRegression())
clf.fit(X_train, y_train)

# find out how many labels should be predicted
top_k_list = list(map(int, y_test.sum(axis=1).T.tolist()[0]))
preds = clf.predict(X_test, top_k_list)
result = f1_score(y_test, preds, average="micro")
all_results[train_percent].append(result)
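
The elided split above could look roughly like the following sketch (hedged; variable names mirror the snippet, and it assumes both matrices have one row per node):

X, y = shuf                                   # one shuffled (features, labels) pair
training_size = int(train_percent * X.shape[0])

X_train, y_train = X[:training_size, :], y[:training_size, :]
X_test, y_test = X[training_size:, :], y[training_size:, :]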

A node in the graph may have multiple labels, so we conduct multi-label classification with TopKRanker, which is built on top of scikit-learn's OneVsRestClassifier.

import numpy as np
import scipy.sparse as sp
from sklearn.multiclass import OneVsRestClassifier

class TopKRanker(OneVsRestClassifier):
    def predict(self, X, top_k_list):
        assert X.shape[0] == len(top_k_list)
        probs = np.asarray(super(TopKRanker, self).predict_proba(X))
        all_labels = sp.lil_matrix(probs.shape)

        for i, k in enumerate(top_k_list):
            probs_ = probs[i, :]
            # take the k labels with the highest predicted probability
            labels = self.classes_[probs_.argsort()[-k:]].tolist()
            for label in labels:
                all_labels[i, label] = 1
        return all_labels
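
A toy usage of this ranker (illustrative data only):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.rand(4, 16)                     # 4 nodes, 16-dimensional embeddings
y = np.array([[1, 0, 1],                      # multi-label targets
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])

clf = TopKRanker(LogisticRegression())
clf.fit(X, y)
top_k_list = y.sum(axis=1).tolist()           # predict as many labels as are true
preds = clf.predict(X, top_k_list)            # sparse 0/1 prediction matrix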

Finally, we get the Micro-F1 scores under different training ratios for different models on each dataset.

CogDL also supports evaluating pre-trained embeddings without running the training process. With --load-emb-path set to the path of your saved embeddings, CogDL will skip training and directly evaluate the embeddings.
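
For example, assuming the embeddings were saved to ./saved/wikipedia_prone.npy (a hypothetical path):

python scripts/train.py --task unsupervised_node_classification --dataset wikipedia --model prone --load-emb-path ./saved/wikipedia_prone.npy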

The overall implementation of UnsupervisedNodeClassification can be found at https://github.com/THUDM/cogdl/blob/master/cogdl/tasks/unsupervised_node_classification.py.

To run UnsupervisedNodeClassification, we can use the following command:

python scripts/train.py --task unsupervised_node_classification --dataset ppi wikipedia --model deepwalk prone --seed 0 1

Then we get experimental results like these:

Variant                   | Micro-F1 0.1  | Micro-F1 0.3  | Micro-F1 0.5  | Micro-F1 0.7  | Micro-F1 0.9
('ppi', 'deepwalk')       | 0.1547±0.0002 | 0.1846±0.0002 | 0.2033±0.0015 | 0.2161±0.0009 | 0.2243±0.0018
('ppi', 'prone')          | 0.1777±0.0016 | 0.2214±0.0020 | 0.2397±0.0015 | 0.2486±0.0022 | 0.2607±0.0096
('wikipedia', 'deepwalk') | 0.4255±0.0027 | 0.4712±0.0005 | 0.4916±0.0011 | 0.5011±0.0017 | 0.5166±0.0043
('wikipedia', 'prone')    | 0.4834±0.0009 | 0.5320±0.0020 | 0.5504±0.0045 | 0.5586±0.0022 | 0.5686±0.0072