This notebook demonstrates different methods for node embeddings and how to further reduce their dimensionality to be able to visualize them in a 2D plot.
Node embeddings are essentially an array of floating point numbers (length = embedding dimension) that can be used as "features" in machine learning. These numbers approximate the relationship and similarity information of each node and can also be seen as a way to encode the topology of the graph.
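To make "similarity information" concrete, here is a minimal sketch of comparing two embedding vectors by cosine similarity. The vectors are made up for illustration and the helper name `cosine_similarity` is hypothetical; the notebook itself may compare embeddings differently.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Made-up 4-dimensional embeddings, for illustration only.
node_a = [0.2, 0.1, -0.4, 0.3]
node_b = [0.4, 0.2, -0.8, 0.6]  # points in the same direction as node_a
node_c = [-0.3, 0.4, 0.1, 0.2]  # orthogonal to node_a

similarity_ab = cosine_similarity(node_a, node_b)  # close to 1: very similar
similarity_ac = cosine_similarity(node_a, node_c)  # close to 0: unrelated
```

A value near 1 means the two nodes are encoded as similar, a value near 0 means the embedding sees no particular relation between them.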
Dimensionality reduction loses some information, especially when node embeddings are visualized in only two dimensions. Nevertheless, it helps to build an intuition for what node embeddings are and how much of the similarity and neighborhood information is retained. The latter can be observed by how well nodes of the same color (and therefore the same community) are placed together and by how strongly the larger nodes (those with a high centrality score) influence them.
If the visualization doesn't show a reasonably clear separation between the communities (colors), here are some ideas for tuning:
- Clean the data, e.g. filter out the few nodes with an extremely high degree that aren't actually that important
- Try directed vs. undirected projections
- Tune the embedding algorithm, e.g. use a higher dimensionality
- Tune t-SNE, which is used to reduce the node embeddings to two dimensions for visualization
It could also be the case that the node embeddings are already well suited for the downstream task, such as node classification or link prediction, even if their visualization suggests otherwise. In that case, it makes sense to see how the whole pipeline performs before tuning the node embeddings in detail.
PageRank centrality and Leiden community are also fetched from the graph and need to be calculated first. They make it easier to see in the plot whether the embeddings approximate the structural information of the graph. If these properties are missing, you will only see black dots that are all the same size.
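The fallback to uniform black dots could look roughly like the sketch below. The helper `marker_style`, the palette, and the size scaling are hypothetical and only illustrate the idea; they are not the notebook's plotting code.

```python
DEFAULT_COLOR = "black"
DEFAULT_SIZE = 20.0
PALETTE = ("tab:blue", "tab:orange", "tab:green", "tab:red")  # assumed colors


def marker_style(node, palette=PALETTE, size_scale=2000.0):
    """Return (color, size) for one node record.

    Falls back to a black dot of default size when the community
    or the centrality property is missing.
    """
    community_id = node.get("communityId")
    centrality = node.get("centrality")
    if community_id is None:
        color = DEFAULT_COLOR
    else:
        color = palette[community_id % len(palette)]
    if centrality is None:
        size = DEFAULT_SIZE
    else:
        size = DEFAULT_SIZE + centrality * size_scale
    return color, size


with_properties = marker_style({"communityId": 0, "centrality": 0.073179})
without_properties = marker_style({})  # -> black dot with the default size
```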
- node2vec (neo4j) computes a vector representation of a node based on second order random walks in the graph.
- Complete guide to understanding Node2Vec algorithm
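As an illustration of what "second order random walk" means, the following minimal sketch biases each step by the previously visited node, using the return parameter `p` and the in-out parameter `q` from the node2vec paper. The toy graph and the function name `node2vec_walk` are made up; this is not the Neo4j implementation, which additionally trains embeddings from many such walks.

```python
import random


def node2vec_walk(graph, start, length, p=1.0, q=1.0, rng=None):
    """Biased second-order random walk over an undirected adjacency dict."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbors = sorted(graph[current])
        if not neighbors:
            break  # dead end
        if len(walk) == 1:
            # The first step has no previous node, so it is unbiased.
            walk.append(rng.choice(neighbors))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / p)  # step back to the previous node
            elif candidate in graph[previous]:
                weights.append(1.0)      # stay close to the previous node
            else:
                weights.append(1.0 / q)  # move further away
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Made-up toy graph, for illustration only.
toy_graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b", "e"},
    "e": {"d"},
}
walk = node2vec_walk(toy_graph, "a", length=8, p=0.5, q=2.0, rng=random.Random(47))
```

A low `p` makes the walk return more often and stay local, while a low `q` pushes it outwards to explore more distant parts of the graph.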
The openTSNE version is: 1.0.1
The pandas version is: 1.5.1
The following function takes the original node embeddings with a higher dimensionality, e.g. 64 floating point numbers, and reduces them into a two dimensional array for visualization.
t-SNE converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and those of the high-dimensional data.
(see https://opentsne.readthedocs.io)
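To give a feeling for the quantity being minimized, here is a minimal sketch of the Kullback-Leibler divergence between two small discrete distributions. The numbers are made up, and t-SNE applies the same formula to pairwise joint probabilities rather than to three-element lists.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # a term with p_i = 0 contributes nothing
        if q_i == 0.0:
            return math.inf  # Q rules out something that P considers possible
        total += p_i * math.log(p_i / q_i)
    return total


# Made-up probabilities, for illustration only.
high_dimensional = [0.5, 0.3, 0.2]
good_layout = [0.5, 0.3, 0.2]  # preserves the similarities exactly
poor_layout = [0.2, 0.3, 0.5]  # swaps the strongest and weakest similarity

divergence_good = kl_divergence(high_dimensional, good_layout)
divergence_poor = kl_divergence(high_dimensional, poor_layout)
```

The closer the low-dimensional probabilities match the high-dimensional ones, the smaller the divergence, which is why lower values in the optimization log indicate a better fit.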
Fast Random Projection is used to reduce the dimensionality of the node feature space while preserving most of the distance information. Nodes with similar neighborhoods get similar embedding vectors.
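The underlying idea can be sketched in a few lines: every node gets a random vector, and a node's embedding is the average of its neighbors' random vectors. This is a strongly simplified, single-iteration illustration with a made-up graph and a hypothetical function name. It is not the Fast Random Projection implementation in Neo4j, which uses sparse random vectors, several weighted iterations and normalization.

```python
import random


def random_projection_embeddings(adjacency, dimension, seed=47):
    """Average the random +/-1 vectors of each node's neighbors."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    # One random direction per node, shared by all nodes that link to it.
    random_vectors = {
        node: [rng.choice((-1.0, 1.0)) for _ in range(dimension)]
        for node in nodes
    }
    embeddings = {}
    for node in nodes:
        neighbors = sorted(adjacency[node])
        vector = [0.0] * dimension
        for neighbor in neighbors:
            for i, value in enumerate(random_vectors[neighbor]):
                vector[i] += value
        scale = max(len(neighbors), 1)  # avoid division by zero
        embeddings[node] = [value / scale for value in vector]
    return embeddings


# Made-up toy graph: "x" and "y" have exactly the same neighbors.
toy_adjacency = {
    "x": {"hub1", "hub2"},
    "y": {"hub1", "hub2"},
    "z": {"hub3"},
    "hub1": {"x", "y"},
    "hub2": {"x", "y"},
    "hub3": {"z"},
}
embeddings = random_projection_embeddings(toy_adjacency, dimension=8)
```

In this sketch `embeddings["x"]` and `embeddings["y"]` are identical because both nodes share the same neighborhood.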
👉 Hint: To skip existing node embeddings and always calculate them based on the parameters below, edit Node_Embeddings_0a_Query_Calculated so that it won't return any results.
The results have been provided by the query filename: ../cypher/Node_Embeddings/Node_Embeddings_0a_Query_Calculated.cypher
| | codeUnitName | shortCodeUnitName | projectName | communityId | centrality | embedding |
---|---|---|---|---|---|---|
0 | org.axonframework.commandhandling | commandhandling | axon-messaging-4.10.3 | 0 | 0.073179 | [0.024917516857385635, 0.1452712118625641, 0.1... |
1 | org.axonframework.commandhandling.callbacks | callbacks | axon-messaging-4.10.3 | 0 | 0.015708 | [-0.0323147177696228, 0.11063505709171295, 0.2... |
2 | org.axonframework.commandhandling.distributed | distributed | axon-messaging-4.10.3 | 0 | 0.023111 | [-0.06542801856994629, 0.20766612887382507, 0.... |
3 | org.axonframework.commandhandling.distributed.... | commandfilter | axon-messaging-4.10.3 | 0 | 0.013919 | [-0.17967315018177032, 0.03907765448093414, 0.... |
4 | org.axonframework.commandhandling.gateway | gateway | axon-messaging-4.10.3 | 0 | 0.013360 | [-0.016019124537706375, 0.17991754412651062, 0... |
This step takes the original node embeddings with a higher dimensionality, e.g. 64 floating point numbers, and reduces them to a two-dimensional array for visualization. For more details, see the declaration of the function "prepare_node_embeddings_for_2d_visualization".
--------------------------------------------------------------------------------
TSNE(early_exaggeration=12, random_state=47, verbose=1)
--------------------------------------------------------------------------------
===> Finding 90 nearest neighbors using exact search using euclidean distance...
--> Time elapsed: 0.03 seconds
===> Calculating affinity matrix...
--> Time elapsed: 0.00 seconds
===> Calculating PCA-based initialization...
--> Time elapsed: 0.00 seconds
===> Running optimization with exaggeration=12.00, lr=9.50 for 250 iterations...
Iteration 50, KL divergence -0.5852, 50 iterations in 0.0569 sec
Iteration 100, KL divergence 1.2084, 50 iterations in 0.0158 sec
Iteration 150, KL divergence 1.2084, 50 iterations in 0.0147 sec
Iteration 200, KL divergence 1.2084, 50 iterations in 0.0147 sec
Iteration 250, KL divergence 1.2084, 50 iterations in 0.0147 sec
--> Time elapsed: 0.12 seconds
===> Running optimization with exaggeration=1.00, lr=114.00 for 500 iterations...
Iteration 50, KL divergence 0.1750, 50 iterations in 0.0512 sec
Iteration 100, KL divergence 0.1544, 50 iterations in 0.0505 sec
Iteration 150, KL divergence 0.1507, 50 iterations in 0.0448 sec
Iteration 200, KL divergence 0.1510, 50 iterations in 0.0443 sec
Iteration 250, KL divergence 0.1500, 50 iterations in 0.0440 sec
Iteration 300, KL divergence 0.1500, 50 iterations in 0.0446 sec
Iteration 350, KL divergence 0.1500, 50 iterations in 0.0455 sec
Iteration 400, KL divergence 0.1500, 50 iterations in 0.0440 sec
Iteration 450, KL divergence 0.1501, 50 iterations in 0.0440 sec
Iteration 500, KL divergence 0.1498, 50 iterations in 0.0440 sec
--> Time elapsed: 0.46 seconds
(114, 2)
| | codeUnit | artifact | communityId | centrality | x | y |
---|---|---|---|---|---|---|
0 | org.axonframework.commandhandling | axon-messaging-4.10.3 | 0 | 0.073179 | -4.877143 | -2.958076 |
1 | org.axonframework.commandhandling.callbacks | axon-messaging-4.10.3 | 0 | 0.015708 | -4.769724 | -4.035093 |
2 | org.axonframework.commandhandling.distributed | axon-messaging-4.10.3 | 0 | 0.023111 | -3.712223 | -4.364472 |
3 | org.axonframework.commandhandling.distributed.... | axon-messaging-4.10.3 | 0 | 0.013919 | -4.061612 | -4.657984 |
4 | org.axonframework.commandhandling.gateway | axon-messaging-4.10.3 | 0 | 0.013360 | -4.987060 | -3.558041 |
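The numbers in the t-SNE log above can be cross-checked with a little arithmetic. The sketch assumes a perplexity of 30 (the `TSNE(...)` line does not print it) and infers from the log that the neighbor count is three times the perplexity and that the learning rate is the sample count divided by the exaggeration factor. Treat these rules as assumptions that happen to match the log, not as a statement of the library's internals.

```python
def expected_log_values(sample_count, perplexity=30, early_exaggeration=12.0):
    """Reproduce the neighbor count and learning rates seen in the log."""
    return {
        # Assumed rule: three times the perplexity, capped by the sample count.
        "neighbors": min(sample_count - 1, int(3 * perplexity)),
        # Assumed rule: learning rate = samples / exaggeration of the phase.
        "early_learning_rate": sample_count / early_exaggeration,
        "final_learning_rate": sample_count / 1.0,
    }


# 114 is the number of rows reported by the shape (114, 2).
values = expected_log_values(114)
```

With 114 samples this gives 90 neighbors, a learning rate of 9.5 for the early exaggeration phase and 114.0 for the final phase, matching the log lines.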
HashGNN resembles Graph Neural Networks (GNN) but does not include a model or require training. It combines ideas of GNNs and fast randomized algorithms. For more details see HashGNN. Here, the last three steps are combined into one for HashGNN.
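One of the fast randomized building blocks in this family of algorithms is min-hashing, which estimates how much two feature sets overlap without comparing them element by element. The sketch below is a loose illustration of that idea only. The feature sets and helper names are hypothetical, and it is not the HashGNN implementation in Neo4j.

```python
import hashlib


def stable_hash(feature, seed):
    """Deterministic hash of a feature name for a given seed."""
    digest = hashlib.sha256(f"{seed}:{feature}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def min_hash_signature(features, hash_count):
    """For each seed, keep the feature with the smallest hash value."""
    return [
        min(features, key=lambda feature: stable_hash(feature, seed))
        for seed in range(hash_count)
    ]


def estimated_jaccard(signature_a, signature_b):
    """Fraction of positions where both signatures picked the same feature."""
    matches = sum(1 for a, b in zip(signature_a, signature_b) if a == b)
    return matches / len(signature_a)


# Made-up neighborhood feature sets, for illustration only.
node_one = {"commandhandling", "gateway", "callbacks"}
node_two = {"commandhandling", "gateway", "callbacks"}
node_three = {"eventsourcing", "snapshotting"}

same = estimated_jaccard(min_hash_signature(node_one, 16), min_hash_signature(node_two, 16))
disjoint = estimated_jaccard(min_hash_signature(node_one, 16), min_hash_signature(node_three, 16))
```

Identical feature sets always produce identical signatures, while sets without any common feature never agree in a single position.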
The results have been provided by the query filename: ../cypher/Node_Embeddings/Node_Embeddings_0a_Query_Calculated.cypher
| | codeUnitName | shortCodeUnitName | projectName | communityId | centrality | embedding |
---|---|---|---|---|---|---|
0 | org.axonframework.commandhandling | commandhandling | axon-messaging-4.10.3 | 0 | 0.073179 | [-0.21650634706020355, 0.6495190411806107, -1.... |
1 | org.axonframework.commandhandling.callbacks | callbacks | axon-messaging-4.10.3 | 0 | 0.015708 | [1.2990380823612213, 0.6495190411806107, -1.73... |
2 | org.axonframework.commandhandling.distributed | distributed | axon-messaging-4.10.3 | 0 | 0.023111 | [0.4330126941204071, -0.21650634706020355, -1.... |
3 | org.axonframework.commandhandling.distributed.... | commandfilter | axon-messaging-4.10.3 | 0 | 0.013919 | [1.0825317353010178, 0.6495190411806107, -1.94... |
4 | org.axonframework.commandhandling.gateway | gateway | axon-messaging-4.10.3 | 0 | 0.013360 | [0.21650634706020355, 0.6495190411806107, -2.1... |
--------------------------------------------------------------------------------
TSNE(early_exaggeration=12, random_state=47, verbose=1)
--------------------------------------------------------------------------------
===> Finding 90 nearest neighbors using exact search using euclidean distance...
--> Time elapsed: 0.00 seconds
===> Calculating affinity matrix...
--> Time elapsed: 0.00 seconds
===> Calculating PCA-based initialization...
--> Time elapsed: 0.00 seconds
===> Running optimization with exaggeration=12.00, lr=9.50 for 250 iterations...
Iteration 50, KL divergence -0.1234, 50 iterations in 0.0674 sec
Iteration 100, KL divergence 1.2090, 50 iterations in 0.0172 sec
Iteration 150, KL divergence 1.2090, 50 iterations in 0.0146 sec
Iteration 200, KL divergence 1.2090, 50 iterations in 0.0145 sec
Iteration 250, KL divergence 1.2090, 50 iterations in 0.0146 sec
--> Time elapsed: 0.13 seconds
===> Running optimization with exaggeration=1.00, lr=114.00 for 500 iterations...
Iteration 50, KL divergence 0.5769, 50 iterations in 0.0589 sec
Iteration 100, KL divergence 0.5679, 50 iterations in 0.0488 sec
Iteration 150, KL divergence 0.5639, 50 iterations in 0.0464 sec
Iteration 200, KL divergence 0.5571, 50 iterations in 0.0468 sec
Iteration 250, KL divergence 0.5559, 50 iterations in 0.0490 sec
Iteration 300, KL divergence 0.5549, 50 iterations in 0.0469 sec
Iteration 350, KL divergence 0.5548, 50 iterations in 0.0472 sec
Iteration 400, KL divergence 0.5551, 50 iterations in 0.0466 sec
Iteration 450, KL divergence 0.5553, 50 iterations in 0.0465 sec
Iteration 500, KL divergence 0.5551, 50 iterations in 0.0464 sec
--> Time elapsed: 0.48 seconds
(114, 2)
| | codeUnit | artifact | communityId | centrality | x | y |
---|---|---|---|---|---|---|
0 | org.axonframework.commandhandling | axon-messaging-4.10.3 | 0 | 0.073179 | 0.678880 | 5.248723 |
1 | org.axonframework.commandhandling.callbacks | axon-messaging-4.10.3 | 0 | 0.015708 | 6.818006 | -2.306395 |
2 | org.axonframework.commandhandling.distributed | axon-messaging-4.10.3 | 0 | 0.023111 | 6.656165 | -1.290762 |
3 | org.axonframework.commandhandling.distributed.... | axon-messaging-4.10.3 | 0 | 0.013919 | 6.599361 | -2.317792 |
4 | org.axonframework.commandhandling.gateway | axon-messaging-4.10.3 | 0 | 0.013360 | 5.806672 | 1.136186 |
The results have been provided by the query filename: ../cypher/Node_Embeddings/Node_Embeddings_0a_Query_Calculated.cypher
| | codeUnitName | shortCodeUnitName | projectName | communityId | centrality | embedding |
---|---|---|---|---|---|---|
0 | org.axonframework.commandhandling | commandhandling | axon-messaging-4.10.3 | 0 | 0.073179 | [0.519773006439209, -0.4005315601825714, 0.260... |
1 | org.axonframework.commandhandling.callbacks | callbacks | axon-messaging-4.10.3 | 0 | 0.015708 | [0.46675097942352295, -0.28508755564689636, 0.... |
2 | org.axonframework.commandhandling.distributed | distributed | axon-messaging-4.10.3 | 0 | 0.023111 | [0.24383212625980377, -0.49725285172462463, 0.... |
3 | org.axonframework.commandhandling.distributed.... | commandfilter | axon-messaging-4.10.3 | 0 | 0.013919 | [0.31580623984336853, -0.4177212715148926, 0.2... |
4 | org.axonframework.commandhandling.gateway | gateway | axon-messaging-4.10.3 | 0 | 0.013360 | [0.6338015794754028, -0.3599361479282379, 0.22... |
--------------------------------------------------------------------------------
TSNE(early_exaggeration=12, random_state=47, verbose=1)
--------------------------------------------------------------------------------
===> Finding 90 nearest neighbors using exact search using euclidean distance...
--> Time elapsed: 0.00 seconds
===> Calculating affinity matrix...
--> Time elapsed: 0.00 seconds
===> Calculating PCA-based initialization...
--> Time elapsed: 0.00 seconds
===> Running optimization with exaggeration=12.00, lr=9.50 for 250 iterations...
Iteration 50, KL divergence -0.5602, 50 iterations in 0.0656 sec
Iteration 100, KL divergence 1.1655, 50 iterations in 0.0170 sec
Iteration 150, KL divergence 1.1655, 50 iterations in 0.0148 sec
Iteration 200, KL divergence 1.1655, 50 iterations in 0.0148 sec
Iteration 250, KL divergence 1.1655, 50 iterations in 0.0148 sec
--> Time elapsed: 0.13 seconds
===> Running optimization with exaggeration=1.00, lr=114.00 for 500 iterations...
Iteration 50, KL divergence 0.3424, 50 iterations in 0.0534 sec
Iteration 100, KL divergence 0.3257, 50 iterations in 0.0488 sec
Iteration 150, KL divergence 0.3173, 50 iterations in 0.0511 sec
Iteration 200, KL divergence 0.3175, 50 iterations in 0.0506 sec
Iteration 250, KL divergence 0.3173, 50 iterations in 0.0507 sec
Iteration 300, KL divergence 0.3173, 50 iterations in 0.0513 sec
Iteration 350, KL divergence 0.3173, 50 iterations in 0.0505 sec
Iteration 400, KL divergence 0.3172, 50 iterations in 0.0504 sec
Iteration 450, KL divergence 0.3172, 50 iterations in 0.0504 sec
Iteration 500, KL divergence 0.3173, 50 iterations in 0.0507 sec
--> Time elapsed: 0.51 seconds
(114, 2)
| | codeUnit | artifact | communityId | centrality | x | y |
---|---|---|---|---|---|---|
0 | org.axonframework.commandhandling | axon-messaging-4.10.3 | 0 | 0.073179 | -5.161687 | 2.887667 |
1 | org.axonframework.commandhandling.callbacks | axon-messaging-4.10.3 | 0 | 0.015708 | -6.285653 | 3.105135 |
2 | org.axonframework.commandhandling.distributed | axon-messaging-4.10.3 | 0 | 0.023111 | -3.713993 | 3.977746 |
3 | org.axonframework.commandhandling.distributed.... | axon-messaging-4.10.3 | 0 | 0.013919 | -3.784379 | 4.204248 |
4 | org.axonframework.commandhandling.gateway | axon-messaging-4.10.3 | 0 | 0.013360 | -6.043272 | 3.018806 |