Node Embeddings

This notebook demonstrates different methods for node embeddings and how to further reduce their dimensionality to be able to visualize them in a 2D plot.

A node embedding is essentially an array of floating point numbers (length = embedding dimension) that can be used as "features" in machine learning. These numbers approximate the relationship and similarity information of each node and can also be seen as a way to encode the topology of the graph.
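To make the idea of "similar nodes get similar vectors" concrete, here is a minimal sketch that compares embedding vectors with cosine similarity. The package names and the 4-dimensional vectors are made up for illustration and do not come from the graph analyzed in this notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings (illustration only, not real results).
embeddings = {
    "package.a": [0.9, 0.1, 0.0, 0.2],
    "package.b": [0.8, 0.2, 0.1, 0.2],
    "package.c": [-0.1, 0.9, 0.7, 0.0],
}

# package.a and package.b point in nearly the same direction,
# package.a and package.c are almost orthogonal.
similar = cosine_similarity(embeddings["package.a"], embeddings["package.b"])
dissimilar = cosine_similarity(embeddings["package.a"], embeddings["package.c"])
print(f"a~b: {similar:.3f}, a~c: {dissimilar:.3f}")
```

Downstream tasks such as node classification typically consume the raw vectors directly; the similarity measure here only serves to illustrate what "close" means in the embedding space.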

Considerations

Due to dimensionality reduction some information gets lost, especially when visualizing node embeddings in two dimensions. Nevertheless, it helps to get an intuition on what node embeddings are and how much of the similarity and neighborhood information is retained. The latter can be observed by how well nodes of the same color and therefore same community are placed together and how much bigger nodes with a high centrality score influence them.

If the visualization doesn't show a reasonably clear separation between the communities (colors), here are some ideas for tuning:

  • Clean the data, e.g. filter out the few nodes with an extremely high degree that aren't actually that important
  • Try directed vs. undirected projections
  • Tune the embedding algorithm, e.g. use a higher dimensionality
  • Tune t-SNE, which is used to reduce the node embeddings to two dimensions for visualization

It could also be the case that the node embeddings are already well suited for the downstream task, such as node classification or link prediction, even if their visualization doesn't look convincing. In that case, it makes sense to evaluate how the whole pipeline performs before tuning the node embeddings in detail.

Note about data dependencies

The PageRank centrality and the Leiden community of each node are also fetched from the graph and need to be calculated first. This makes it easier to see in the plot whether the embeddings approximate the structural information of the graph. If these properties are missing, you will only see black dots, all of the same size.
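The fallback described above could look roughly like the following sketch. The helper name `plot_attributes`, its parameters, and the scaling constants are hypothetical and are not taken from the notebook's actual plotting code.

```python
def plot_attributes(row, base_size=10.0, size_factor=300.0):
    """Derive a color key and a marker size for one node row (a dict).

    Hypothetical helper: falls back to a black, base-sized marker
    when community or centrality have not been calculated yet.
    """
    community = row.get("communityId")
    centrality = row.get("centrality")
    color = "black" if community is None else f"community-{community}"
    size = base_size if centrality is None else base_size + size_factor * centrality
    return color, size


# Node with both properties present vs. a node without them.
print(plot_attributes({"communityId": 0, "centrality": 0.073179}))
print(plot_attributes({}))
```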



Dimensionality reduction with t-distributed stochastic neighbor embedding (t-SNE)

The following function takes the original node embeddings with a higher dimensionality, e.g. 64 floating point numbers, and reduces them into a two dimensional array for visualization.

t-SNE converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and those of the high-dimensional data.

(see https://opentsne.readthedocs.io)
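The objective described above can be illustrated with a small, dependency-free sketch. It is a simplification and not how openTSNE works internally: it uses one fixed Gaussian bandwidth (`sigma`) instead of calibrating a bandwidth per point from the perplexity, and it only evaluates the Kullback-Leibler divergence for two hand-made layouts instead of optimizing one. All function names and data points are assumptions for illustration.

```python
import math


def squared_distances(points):
    """Pairwise squared Euclidean distances."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points] for p in points]


def gaussian_joint_probabilities(points, sigma=1.0):
    """High-dimensional similarities P (simplified: one fixed sigma for all points)."""
    d = squared_distances(points)
    n = len(points)
    w = [[0.0 if i == j else math.exp(-d[i][j] / (2 * sigma ** 2)) for j in range(n)]
         for i in range(n)]
    total = sum(map(sum, w))
    return [[v / total for v in row] for row in w]


def student_t_joint_probabilities(points):
    """Low-dimensional similarities Q using a Student-t kernel with one degree of freedom."""
    d = squared_distances(points)
    n = len(points)
    w = [[0.0 if i == j else 1.0 / (1.0 + d[i][j]) for j in range(n)] for i in range(n)]
    total = sum(map(sum, w))
    return [[v / total for v in row] for row in w]


def kl_divergence(p, q):
    """KL(P || Q) summed over all pairs with non-zero p."""
    return sum(pij * math.log(pij / qij)
               for p_row, q_row in zip(p, q)
               for pij, qij in zip(p_row, q_row) if pij > 0.0)


# Two tight clusters in three dimensions (made-up data).
high_dimensional = [[0, 0, 0], [0.1, 0, 0], [3, 3, 3], [3.1, 3, 3]]
# A 2D layout that keeps the clusters together and one that mixes them up.
good_layout = [[0, 0], [0.1, 0], [5, 5], [5.1, 5]]
bad_layout = [[0, 0], [5, 5], [0.1, 0], [5.1, 5]]

p = gaussian_joint_probabilities(high_dimensional)
kl_good = kl_divergence(p, student_t_joint_probabilities(good_layout))
kl_bad = kl_divergence(p, student_t_joint_probabilities(bad_layout))
print(f"KL good layout: {kl_good:.4f}, KL bad layout: {kl_bad:.4f}")
```

The layout that preserves the neighborhoods yields the lower divergence, which is the quantity t-SNE reduces step by step during optimization.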

1. Java Packages

1.1 Generate Node Embeddings using Fast Random Projection (Fast RP) for Java Packages

Fast Random Projection is used to reduce the dimensionality of the node feature space while preserving most of the distance information. Nodes with similar neighborhoods get similar embedding vectors.
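The following sketch illustrates the basic idea under strong simplifications: every node starts with a sparse random vector, and intermediate embeddings are built by repeatedly averaging the vectors of the neighbors. It is not the implementation used by the graph database; normalization and per-iteration weights are left out, and the function name, parameters, and example graph are assumptions.

```python
import random


def fast_random_projection_sketch(adjacency, dimension=8, iterations=2, seed=47):
    """Simplified random-projection embedding: random start vectors plus neighbor averaging."""
    rng = random.Random(seed)
    # Sparse random start vector per node with entries from {-1, 0, +1}.
    current = {
        node: [rng.choice((-1.0, 0.0, 0.0, 1.0)) for _ in range(dimension)]
        for node in sorted(adjacency)
    }
    embedding = {node: [0.0] * dimension for node in adjacency}
    for _ in range(iterations):
        averaged = {}
        for node, neighbors in adjacency.items():
            if neighbors:
                averaged[node] = [
                    sum(current[n][k] for n in neighbors) / len(neighbors)
                    for k in range(dimension)
                ]
            else:
                averaged[node] = [0.0] * dimension
        # Simplification: all iterations contribute with equal weight.
        for node in adjacency:
            embedding[node] = [e + v for e, v in zip(embedding[node], averaged[node])]
        current = averaged
    return embedding


# Made-up graph: "a" and "b" share exactly the same neighbors.
graph = {
    "a": ["c", "d"],
    "b": ["c", "d"],
    "c": ["a", "b"],
    "d": ["a", "b"],
    "e": ["f"],
    "f": ["e"],
}
result = fast_random_projection_sketch(graph)
print(result["a"] == result["b"])  # identical neighborhoods lead to identical vectors
```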

👉 Hint: To skip existing node embeddings and always calculate them based on the parameters below, edit Node_Embeddings_0a_Query_Calculated so that it won't return any results.

The results have been provided by the query filename: ../cypher/Node_Embeddings/Node_Embeddings_0a_Query_Calculated.cypher
| | codeUnitName | shortCodeUnitName | projectName | communityId | centrality | embedding |
|---|---|---|---|---|---|---|
| 0 | org.axonframework.commandhandling | commandhandling | axon-messaging-4.10.3 | 0 | 0.073179 | [0.024917516857385635, 0.1452712118625641, 0.1... |
| 1 | org.axonframework.commandhandling.callbacks | callbacks | axon-messaging-4.10.3 | 0 | 0.015708 | [-0.0323147177696228, 0.11063505709171295, 0.2... |
| 2 | org.axonframework.commandhandling.distributed | distributed | axon-messaging-4.10.3 | 0 | 0.023111 | [-0.06542801856994629, 0.20766612887382507, 0.... |
| 3 | org.axonframework.commandhandling.distributed.... | commandfilter | axon-messaging-4.10.3 | 0 | 0.013919 | [-0.17967315018177032, 0.03907765448093414, 0.... |
| 4 | org.axonframework.commandhandling.gateway | gateway | axon-messaging-4.10.3 | 0 | 0.013360 | [-0.016019124537706375, 0.17991754412651062, 0... |

1.2 Dimensionality reduction with t-distributed stochastic neighbor embedding (t-SNE)

This step takes the original node embeddings with a higher dimensionality, e.g. 64 floating point numbers, and reduces them to a two-dimensional array for visualization. For more details, look up the function declaration for "prepare_node_embeddings_for_2d_visualization".

--------------------------------------------------------------------------------
TSNE(early_exaggeration=12, random_state=47, verbose=1)
--------------------------------------------------------------------------------
===> Finding 90 nearest neighbors using exact search using euclidean distance...
   --> Time elapsed: 0.03 seconds
===> Calculating affinity matrix...
   --> Time elapsed: 0.00 seconds
===> Calculating PCA-based initialization...
   --> Time elapsed: 0.00 seconds
===> Running optimization with exaggeration=12.00, lr=9.50 for 250 iterations...
Iteration   50, KL divergence -0.5852, 50 iterations in 0.0569 sec
Iteration  100, KL divergence 1.2084, 50 iterations in 0.0158 sec
Iteration  150, KL divergence 1.2084, 50 iterations in 0.0147 sec
Iteration  200, KL divergence 1.2084, 50 iterations in 0.0147 sec
Iteration  250, KL divergence 1.2084, 50 iterations in 0.0147 sec
   --> Time elapsed: 0.12 seconds
===> Running optimization with exaggeration=1.00, lr=114.00 for 500 iterations...
Iteration   50, KL divergence 0.1750, 50 iterations in 0.0512 sec
Iteration  100, KL divergence 0.1544, 50 iterations in 0.0505 sec
Iteration  150, KL divergence 0.1507, 50 iterations in 0.0448 sec
Iteration  200, KL divergence 0.1510, 50 iterations in 0.0443 sec
Iteration  250, KL divergence 0.1500, 50 iterations in 0.0440 sec
Iteration  300, KL divergence 0.1500, 50 iterations in 0.0446 sec
Iteration  350, KL divergence 0.1500, 50 iterations in 0.0455 sec
Iteration  400, KL divergence 0.1500, 50 iterations in 0.0440 sec
Iteration  450, KL divergence 0.1501, 50 iterations in 0.0440 sec
Iteration  500, KL divergence 0.1498, 50 iterations in 0.0440 sec
   --> Time elapsed: 0.46 seconds



(114, 2)
| | codeUnit | artifact | communityId | centrality | x | y |
|---|---|---|---|---|---|---|
| 0 | org.axonframework.commandhandling | axon-messaging-4.10.3 | 0 | 0.073179 | -4.877143 | -2.958076 |
| 1 | org.axonframework.commandhandling.callbacks | axon-messaging-4.10.3 | 0 | 0.015708 | -4.769724 | -4.035093 |
| 2 | org.axonframework.commandhandling.distributed | axon-messaging-4.10.3 | 0 | 0.023111 | -3.712223 | -4.364472 |
| 3 | org.axonframework.commandhandling.distributed.... | axon-messaging-4.10.3 | 0 | 0.013919 | -4.061612 | -4.657984 |
| 4 | org.axonframework.commandhandling.gateway | axon-messaging-4.10.3 | 0 | 0.013360 | -4.987060 | -3.558041 |
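The numbers in the t-SNE log above (90 nearest neighbors, lr=9.50, lr=114.00) can be cross-checked with a few lines of arithmetic. The sketch assumes that openTSNE runs with its default perplexity of 30, searches for three times the perplexity as nearest neighbors, and picks the learning rate automatically as the number of samples divided by the exaggeration. The function name is hypothetical.

```python
def expected_tsne_log_values(sample_count, perplexity=30, early_exaggeration=12, exaggeration=1):
    """Values one would expect in the log under the assumptions stated above."""
    return {
        # Neighbor search is capped by the number of other samples.
        "nearest_neighbors": min(sample_count - 1, 3 * perplexity),
        # Assumed automatic learning rate: samples / exaggeration of the phase.
        "early_learning_rate": sample_count / early_exaggeration,
        "final_learning_rate": sample_count / exaggeration,
    }


# 114 is the number of rows reported as the shape (114, 2).
values = expected_tsne_log_values(114)
print(values)
```

Under these assumptions the computed values agree with the log, which suggests that only `early_exaggeration`, `random_state` and `verbose` were set explicitly.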

1.3 Visualization of the node embeddings reduced to two dimensions

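[Figure: 2D scatter plot of the Fast Random Projection node embeddings for Java packages, colored by community and sized by centrality]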

1.4 Node Embeddings for Java Packages using HashGNN

HashGNN resembles Graph Neural Networks (GNN) but does not include a model or require training. It combines ideas of GNNs and fast randomized algorithms. For more details see HashGNN. Here, the three steps shown above for Fast Random Projection (generating the embeddings, reducing their dimensionality, and visualizing them) are combined into one for HashGNN.
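As an illustration of the kind of fast randomized technique meant here, the following sketch shows MinHash, a locality-sensitive hashing scheme in whose flavor HashGNN replaces the neural network layers. It is not the HashGNN algorithm itself: it only shows that sets of binary features with large overlap tend to produce matching hash values. The function names, the feature ids, and the choice of hash family are assumptions.

```python
import random

PRIME = 2_147_483_647  # large prime used as modulus for the hash family


def minhash_signature(features, hash_count=64, seed=47):
    """One minimum per random hash function over a non-empty set of integer feature ids."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(hash_count)]
    return [min((a * f + b) % PRIME for f in features) for a, b in params]


def estimated_jaccard(signature_a, signature_b):
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    matches = sum(x == y for x, y in zip(signature_a, signature_b))
    return matches / len(signature_a)


# Made-up binary features, represented as sets of active feature ids.
node_a = {1, 2, 3, 4}
node_b = {1, 2, 3, 5}   # shares three of five distinct features with node_a
node_c = {10, 11, 12}   # shares nothing with node_a

same = estimated_jaccard(minhash_signature(node_a), minhash_signature(node_a))
overlap = estimated_jaccard(minhash_signature(node_a), minhash_signature(node_b))
disjoint = estimated_jaccard(minhash_signature(node_a), minhash_signature(node_c))
print(same, overlap, disjoint)
```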

The results have been provided by the query filename: ../cypher/Node_Embeddings/Node_Embeddings_0a_Query_Calculated.cypher
| | codeUnitName | shortCodeUnitName | projectName | communityId | centrality | embedding |
|---|---|---|---|---|---|---|
| 0 | org.axonframework.commandhandling | commandhandling | axon-messaging-4.10.3 | 0 | 0.073179 | [-0.21650634706020355, 0.6495190411806107, -1.... |
| 1 | org.axonframework.commandhandling.callbacks | callbacks | axon-messaging-4.10.3 | 0 | 0.015708 | [1.2990380823612213, 0.6495190411806107, -1.73... |
| 2 | org.axonframework.commandhandling.distributed | distributed | axon-messaging-4.10.3 | 0 | 0.023111 | [0.4330126941204071, -0.21650634706020355, -1.... |
| 3 | org.axonframework.commandhandling.distributed.... | commandfilter | axon-messaging-4.10.3 | 0 | 0.013919 | [1.0825317353010178, 0.6495190411806107, -1.94... |
| 4 | org.axonframework.commandhandling.gateway | gateway | axon-messaging-4.10.3 | 0 | 0.013360 | [0.21650634706020355, 0.6495190411806107, -2.1... |
--------------------------------------------------------------------------------
TSNE(early_exaggeration=12, random_state=47, verbose=1)
--------------------------------------------------------------------------------
===> Finding 90 nearest neighbors using exact search using euclidean distance...
   --> Time elapsed: 0.00 seconds
===> Calculating affinity matrix...
   --> Time elapsed: 0.00 seconds
===> Calculating PCA-based initialization...
   --> Time elapsed: 0.00 seconds
===> Running optimization with exaggeration=12.00, lr=9.50 for 250 iterations...
Iteration   50, KL divergence -0.1234, 50 iterations in 0.0674 sec
Iteration  100, KL divergence 1.2090, 50 iterations in 0.0172 sec
Iteration  150, KL divergence 1.2090, 50 iterations in 0.0146 sec
Iteration  200, KL divergence 1.2090, 50 iterations in 0.0145 sec
Iteration  250, KL divergence 1.2090, 50 iterations in 0.0146 sec
   --> Time elapsed: 0.13 seconds
===> Running optimization with exaggeration=1.00, lr=114.00 for 500 iterations...
Iteration   50, KL divergence 0.5769, 50 iterations in 0.0589 sec
Iteration  100, KL divergence 0.5679, 50 iterations in 0.0488 sec
Iteration  150, KL divergence 0.5639, 50 iterations in 0.0464 sec
Iteration  200, KL divergence 0.5571, 50 iterations in 0.0468 sec
Iteration  250, KL divergence 0.5559, 50 iterations in 0.0490 sec
Iteration  300, KL divergence 0.5549, 50 iterations in 0.0469 sec
Iteration  350, KL divergence 0.5548, 50 iterations in 0.0472 sec
Iteration  400, KL divergence 0.5551, 50 iterations in 0.0466 sec
Iteration  450, KL divergence 0.5553, 50 iterations in 0.0465 sec
Iteration  500, KL divergence 0.5551, 50 iterations in 0.0464 sec
   --> Time elapsed: 0.48 seconds



(114, 2)
| | codeUnit | artifact | communityId | centrality | x | y |
|---|---|---|---|---|---|---|
| 0 | org.axonframework.commandhandling | axon-messaging-4.10.3 | 0 | 0.073179 | 0.678880 | 5.248723 |
| 1 | org.axonframework.commandhandling.callbacks | axon-messaging-4.10.3 | 0 | 0.015708 | 6.818006 | -2.306395 |
| 2 | org.axonframework.commandhandling.distributed | axon-messaging-4.10.3 | 0 | 0.023111 | 6.656165 | -1.290762 |
| 3 | org.axonframework.commandhandling.distributed.... | axon-messaging-4.10.3 | 0 | 0.013919 | 6.599361 | -2.317792 |
| 4 | org.axonframework.commandhandling.gateway | axon-messaging-4.10.3 | 0 | 0.013360 | 5.806672 | 1.136186 |

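[Figure: 2D scatter plot of the HashGNN node embeddings for Java packages, colored by community and sized by centrality]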

1.5 Node Embeddings for Java Packages using node2vec

The results have been provided by the query filename: ../cypher/Node_Embeddings/Node_Embeddings_0a_Query_Calculated.cypher
| | codeUnitName | shortCodeUnitName | projectName | communityId | centrality | embedding |
|---|---|---|---|---|---|---|
| 0 | org.axonframework.commandhandling | commandhandling | axon-messaging-4.10.3 | 0 | 0.073179 | [0.519773006439209, -0.4005315601825714, 0.260... |
| 1 | org.axonframework.commandhandling.callbacks | callbacks | axon-messaging-4.10.3 | 0 | 0.015708 | [0.46675097942352295, -0.28508755564689636, 0.... |
| 2 | org.axonframework.commandhandling.distributed | distributed | axon-messaging-4.10.3 | 0 | 0.023111 | [0.24383212625980377, -0.49725285172462463, 0.... |
| 3 | org.axonframework.commandhandling.distributed.... | commandfilter | axon-messaging-4.10.3 | 0 | 0.013919 | [0.31580623984336853, -0.4177212715148926, 0.2... |
| 4 | org.axonframework.commandhandling.gateway | gateway | axon-messaging-4.10.3 | 0 | 0.013360 | [0.6338015794754028, -0.3599361479282379, 0.22... |
--------------------------------------------------------------------------------
TSNE(early_exaggeration=12, random_state=47, verbose=1)
--------------------------------------------------------------------------------
===> Finding 90 nearest neighbors using exact search using euclidean distance...
   --> Time elapsed: 0.00 seconds
===> Calculating affinity matrix...
   --> Time elapsed: 0.00 seconds
===> Calculating PCA-based initialization...
   --> Time elapsed: 0.00 seconds
===> Running optimization with exaggeration=12.00, lr=9.50 for 250 iterations...
Iteration   50, KL divergence -0.5602, 50 iterations in 0.0656 sec
Iteration  100, KL divergence 1.1655, 50 iterations in 0.0170 sec
Iteration  150, KL divergence 1.1655, 50 iterations in 0.0148 sec
Iteration  200, KL divergence 1.1655, 50 iterations in 0.0148 sec
Iteration  250, KL divergence 1.1655, 50 iterations in 0.0148 sec
   --> Time elapsed: 0.13 seconds
===> Running optimization with exaggeration=1.00, lr=114.00 for 500 iterations...
Iteration   50, KL divergence 0.3424, 50 iterations in 0.0534 sec
Iteration  100, KL divergence 0.3257, 50 iterations in 0.0488 sec
Iteration  150, KL divergence 0.3173, 50 iterations in 0.0511 sec
Iteration  200, KL divergence 0.3175, 50 iterations in 0.0506 sec
Iteration  250, KL divergence 0.3173, 50 iterations in 0.0507 sec
Iteration  300, KL divergence 0.3173, 50 iterations in 0.0513 sec
Iteration  350, KL divergence 0.3173, 50 iterations in 0.0505 sec
Iteration  400, KL divergence 0.3172, 50 iterations in 0.0504 sec
Iteration  450, KL divergence 0.3172, 50 iterations in 0.0504 sec
Iteration  500, KL divergence 0.3173, 50 iterations in 0.0507 sec
   --> Time elapsed: 0.51 seconds



(114, 2)
| | codeUnit | artifact | communityId | centrality | x | y |
|---|---|---|---|---|---|---|
| 0 | org.axonframework.commandhandling | axon-messaging-4.10.3 | 0 | 0.073179 | -5.161687 | 2.887667 |
| 1 | org.axonframework.commandhandling.callbacks | axon-messaging-4.10.3 | 0 | 0.015708 | -6.285653 | 3.105135 |
| 2 | org.axonframework.commandhandling.distributed | axon-messaging-4.10.3 | 0 | 0.023111 | -3.713993 | 3.977746 |
| 3 | org.axonframework.commandhandling.distributed.... | axon-messaging-4.10.3 | 0 | 0.013919 | -3.784379 | 4.204248 |
| 4 | org.axonframework.commandhandling.gateway | axon-messaging-4.10.3 | 0 | 0.013360 | -6.043272 | 3.018806 |

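[Figure: 2D scatter plot of the node2vec node embeddings for Java packages, colored by community and sized by centrality]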