Skip to content

Latest commit

 

History

History
311 lines (282 loc) · 13.3 KB

README.md

File metadata and controls

311 lines (282 loc) · 13.3 KB

PBGENA

Parallelized Binary embedding GENerator for Attributed graphs

A Sketch-based Approach towards Scalable and Efficient Attributed Network Embedding

Key Features:

  1. BGENA is a super-fast sketch-based ANE solver, which uses BinSketch coupled with a novel edge propagation mechanism.
  2. PBGENA, Parallelized BGENA, works with MPI to leverage a system's multi-core architecture to further accelerate the embedding process.
  3. PBGENA's binary embeddings allow efficient bitarray/sparse-matrix storage.
  4. PBGENA embeddings achieve state-of-the-art performance in graph analysis tasks like node classification and link prediction.
  5. PBGENA is highly flexible and can work with just the topology or the attributes of the graph and enables turning off propagation altogether.

Get PBGENA:

git clone https://github.com/tapadeep/PBGENA.git

Satisfy all requirements for PBGENA:

pip install -r PBGENA/requirements.txt

Execute BGENA:

Perform Node Embedding:

cd PBGENA/Code/Algorithm
python BGENA.py --graph Facebook --N 2000 --alpha 0.5 --b_a 0.7 --b_t 0.6

Perform Node Classification:

cd 'PBGENA/Code/Node Classification'
python node_classification.py --graph Facebook --algorithm BGENA --tr 0.7 --multi True

Ignore the --multi flag for graphs that are not multi-labeled.

Perform Link Prediction:

cd 'PBGENA/Code/Link Prediction'
python -B link_prediction.py --graph Facebook --algorithm BGENA --erf 0.3 --N 2000 --alpha 0.8 --b_a 1 --b_t 0

Execute PBGENA:

Perform Node Embedding:

cd PBGENA/Code/Algorithm
python PBGENA.py --graph PubMed --N 8000 --alpha 0.65 --b_a 0 --b_t 0.8 --p 6

The system must have Open MPI, MPICH, or Microsoft MPI installed.

Perform Node Classification:

cd 'PBGENA/Code/Node Classification'
python node_classification.py --graph PubMed --algorithm PBGENA --tr 0.7

Perform Link Prediction:

cd 'PBGENA/Code/Link Prediction'
python -B link_prediction.py --graph PubMed --algorithm PBGENA --erf 0.3 --N 8000 --alpha 1 --b_a 0.25 --b_t 0 --p 6

Execute Baselines:

Perform Node Embedding:

cd PBGENA/Code/Algorithm
python Baseline.py --graph CiteSeer --algorithm FeatherNode --reduction_dimensions 25 --eval_points 5 --order 1

Perform Node Classification:

cd 'PBGENA/Code/Node Classification'
python node_classification.py --graph CiteSeer --algorithm FeatherNode --tr 0.7

Perform Link Prediction:

cd 'PBGENA/Code/Link Prediction'
python -B link_prediction.py --graph CiteSeer --algorithm FeatherNode --erf 0.3 --reduction_dimensions 25 --eval_points 5 --order 1

Relevant Flags:

Flag Type Description Range Default
--algorithm STRING Embedding Method to be used {BGENA, PBGENA,
Baselines}
PBGENA
--alpha FLOAT Fraction of the Embedding Dimension
to be used for attributes
[0, 1] -
--b_a FLOAT Attribute Bitset Probability [0, 1] -
--b_t FLOAT Topology Bitset Probability [0, 1] -
--erf FLOAT Fraction of edges to be removed
for link prediction
(0, 1) 0.3
--f INTEGER Fragment parameter for PBGENA
communication calls
1
--f_a FLOAT Reduction fraction for b_a each pass [1, ∞) 2
--f_t FLOAT Reduction fraction for b_t each pass [1, ∞) 2
--graph STRING Network to be used for embedding Any Network -
--l_a INTEGER Number of passes of attribute
edge propagation
1
--l_t INTEGER Number of passes of topology
edge propagation
1
--multi BOOLEAN Identify if the graph is Multi-Labeled {True, False} False
--N INTEGER Embedding Dimension 2000
--p INTEGER Number of workers to use 32
--tr FLOAT Training Ratio for Node Classification (0, 1) 0.7

Some of the others are standard flags, and the remaining flag descriptions are available with the python graph learning framework Karate Club.

Examples:

An online Google Colaboratory Notebook demonstrates PBGENA's working, and other examples are in the Examples folder.

Relevant Hyperparameters:

Node Classification Hyperparameters:

Graph α b_a b_t
Wikipedia 0.85 0.00 0.20
Cora 0.60 0.80 0.80
CiteSeer 0.80 0.90 0.40
Facebook 0.50 0.70 0.60
BlogCatalog 0.60 0.00 0.00
Flickr 1.00 0.00 0.00
PubMed 0.65 0.00 0.80
PPI 0.10 0.95 0.50
Twitter 0.60 0.00 0.00
Google+ 0.00 0.00 0.00
Reddit 0.00 0.00 0.00
TWeibo 0.60 0.00 0.00
MAKG 0.85 0.86 0.60

Link Prediction Hyperparameters:

Graph α b_a b_t
Wikipedia 0.95 0.20 0.20
Cora 0.60 0.30 0.40
CiteSeer 0.90 0.20 0.40
Facebook 0.80 1.00 0.00
BlogCatalog 0.90 0.05 0.00
Flickr 0.90 0.00 0.00
PubMed 1.00 0.25 0.00
PPI 0.00 0.00 0.30
Twitter 1.00 0.00 0.00
Google+ 0.90 0.05 0.10
Reddit 0.95 0.20 0.00
TWeibo 0.95 0.00 0.00
MAKG 0.85 0.86 0.60

The other baseline hyperparameters used in the experiments can be obtained here.

Datasets:

Graph #Vertices #Edges #Attributes #Labels Multi-
labeled?
Wikipedia 2,405 11,596 4,973 17 No
Cora 2,708 5,278 1,433 7 No
CiteSeer 3,312 4,536 3,703 6 No
Facebook 4,039 88,234 1,283 193 Yes
BlogCatalog 5,196 171,743 8,189 6 No
Flickr 7,575 239,738 12,047 9 No
PubMed 19,717 44,324 500 3 No
PPI 56,944 793,632 50 121 Yes
Twitter 81,306 1,342,296 216,839 4,065 Yes
Google+ 107,614 12,238,285 15,907 468 Yes
Reddit 232,965 57,307,946 602 41 No
TWeibo 2,320,895 50,133,369 1,657 9 No
MAKG 59,249,719 976,901,586 7,211 100 Yes

Some preprocessed networks are added to the Datasets folder, and the remaining can be downloaded online. All of the datasets are obtained from Dr. Jieming Shi's website. To generate files suitable for PBGENA with data from their website, use:

cd PBGENA/Datasets
python data_processing.py --file cora --graph Cora

This converts cora.attr.tar.gz to the Cora folder containing relevant files which can be used to train PBGENA. Use the --multi True flag for converting multi-labeled networks.

Bechmarked Performance:

Graph Node Classification (%) Link Prediction (%) Time (s)
Macro-F1 Micro-F1 AUC-ROC Avg Precision Serial Parallel
Wikipedia 70.23±4.45 79.53±1.78 85.32±0.26 86.49±0.27 0.85 0.08
Cora 84.88±1.56 85.66±1.15 89.50±0.60 90.42±0.65 0.64 0.08
CiteSeer 68.95±0.95 72.07±0.54 92.93±0.55 94.33±0.45 0.80 0.09
Facebook 32.89±1.65 75.41±0.88 97.70±0.05 97.88±0.04 1.22 0.16
BlogCatalog 91.97±0.58 92.07±0.59 80.23±0.14 80.46±0.17 0.36 0.06
Flickr 78.08±0.76 77.92±0.81 90.40±0.07 90.49±0.08 0.09 0.06
PubMed 86.55±0.24 86.75±0.22 91.91±0.23 92.38±0.28 2.09 0.20
PPI 40.30±0.06 54.58±0.05 96.12±0.03 96.82±0.03 16.30 1.05
Twitter 25.43±0.33 56.47±0.74 97.46±0.02 97.50±0.02 46.19 1.61
Google+ 59.24±0.41 84.57±0.23 93.63±0.05 93.21±0.12 13.83 1.14
Reddit 87.01±0.11 88.13±0.12 88.10±0.01 86.16±0.03 75.32 3.68
TWeibo 17.14±0.20 56.83±0.09 96.22±0.02 96.39±0.01 84.08 14.02
MAKG 37.77±0.16 50.26±0.21 85.15±0.03 84.81±0.05 24504.66 6136.49

Except for MAKG all of the above reported results were performed with N=2000, tr=0.7, erf=0.3, and p=32 in a server with 270GB RAM with proper hyperparameters. For MAKG, we used tr=0.1, p=6, and f=100. To get even better results perform the experiments at N=8000. A more comprehensive performance report can be viewed here.

Create your own network:

To create your network, you need three files: edge_list.npy, attribute_matrix.npz, and label_array.npy or label_array.npz (depending on whether the graph is single-labeled or multi-labeled). Ensure all your vertex IDs are in the range (0, nodes-1). Create a numpy array of the edges of shape (edges, 2) and save that file as edge_list.npy. Store the attributes as a sparse CSR matrix of shape (nodes, attributes). To run PBGENA for non-attributed graphs, create an empty attribute matrix for the graph of the proper shape and set α=0. Save the attribute file as attribute_matrix.npz. The file for label array is required only for node classification and can be ignored for performing link prediction on unlabeled graphs. If the graph is single-labeled, the file label_array.npy is a simple 1D array where the i'th number denotes the label for node i. For multi-labeled graphs, label_array.npz is a MultiLabelBinarized CSR-matrix of shape (nodes, labels). Finally, put these three files in a folder with the graph name and add it to the Datasets folder.