snap -> gml -> gt:
- `convert_snap_network.py`: to gml
- `convert_graphml_to_gt.py`: to gt
- `scripts/test_paper_experiment.sh`
- `/home/cloud-user/documents/order-steiner-tree/new_result.tex` (result document)
- `scripts/gen_paper_experiment_cmds.sh` (needs to be edited)
- `scripts/gen_eval_cmds.sh` (needs to be edited)
- `paper_experiment_plot.ipynb`: to plot
- `crop.sh`: to crop figures
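A minimal sketch of what the two conversion steps could look like; the file names and the use of networkx/graph-tool here are my assumptions, not necessarily what the `convert_*` scripts actually do:

```python
import networkx as nx
from graph_tool.all import load_graph

# snap -> gml: SNAP datasets are whitespace-separated edge lists with
# '#' comment lines, which read_edgelist skips (hypothetical file name).
g = nx.read_edgelist('p2p-Gnutella08.txt', comments='#')
nx.write_gml(g, 'p2p-Gnutella08.gml')

# gml -> gt: graph-tool infers the format from the extension.
gt_g = load_graph('p2p-Gnutella08.gml')
gt_g.save('p2p-Gnutella08.gt')
```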
done:
- (`q` varies): P2P, arxiv-hep
- `query_count_vs_graph_size.py`
- `query_count_on_dataset.py`
simulation:
- single obs: `how-well-can-we-model-probability.py`
- DRS: `how-well-can-we-model-probability-drs.py`
plots:
- cutting plane plot (comparing different modeling approaches): `plot_source_likelihood_modeling_comparison_2d.py`
- surface plot by graph types and sizes: `plot_source_likelihood_modeling_by_graphs_and_sizes.py`
- surface plot by graph types: `plot_source_likelihood_modeling.py`
- `edge_mwu`: why does it suck when setting `mu[q]=0` for a query node that is not the source?
- `edge_mwu`: when doing neighborhood querying, why does updating at each neighbor suck?
- `max_mu`: why is it so unstable?
- `tree_binary_search`: maximum recursion depth exceeded
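For context, a minimal sketch of the multiplicative-weights-style update being questioned here; `mu`, `penalty`, and `eps` are assumed names, not the actual `edge_mwu` code:

```python
import numpy as np

def mwu_update(mu, penalty, eps=0.2):
    """One multiplicative-weights step over candidate sources (sketch).

    mu      : current weight per candidate source
    penalty : per-node penalties in [0, 1] from the latest query outcome
    """
    # Down-weight multiplicatively instead of hard-zeroing mu[q]:
    # a zeroed node can never recover, which may be why mu[q] = 0 hurts.
    mu = mu * (1.0 - eps) ** penalty
    total = mu.sum()
    if total == 0:
        # guards the ZeroDivisionError noted below; restarting from
        # uniform is one possible design choice
        mu = np.ones_like(mu)
        total = mu.sum()
    return mu / total
```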
`edge_mwu`
- done:
  - `simulations.py` and `cascade.py` should be merged into `ic.py`
  - `synthetic_data.py` and `core.py`
  - `bfs_source_finding.ipynb` and `pagerank_source_finding.ipynb`
Stopping criterion: the current query node is the source.
PageRank is better than BFS. For example, the mean query ratio against cascade size is 0.2 for PR and 0.32 for BFS.
- In reality, this stopping criterion is not realistic
- Why is the PageRank approach better than the BFS approach?
  - one possibility: PR uses information from uninfected nodes while BFS doesn't. What if BFS also used this information?
- What's the intuition behind PageRank?
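A sketch of what PageRank-based querying might look like; this is my reconstruction, not necessarily what `pagerank_source_finding.ipynb` does. Putting the personalization mass on the observed infected nodes is one way to use information beyond plain BFS expansion:

```python
import networkx as nx

def pagerank_query_order(g, observed_infected):
    """Rank candidate query nodes by personalized PageRank (sketch)."""
    # Personalization mass on observed infected nodes: high-scoring
    # nodes are central with respect to the whole cascade, not just
    # close to a single start node as in BFS.
    personalization = {v: 1.0 for v in observed_infected}
    scores = nx.pagerank(g, personalization=personalization)
    return sorted(scores, key=scores.get, reverse=True)

# usage: query nodes in this order until the source is hit
# for v in pagerank_query_order(g, observed): ...
```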
The larger the source degree, the larger the cascade size. This seems obvious.
Check out `cascade_size_vs_source_infected_neighbors.ipynb`.
Why is this useful?
The larger the source's core number, the higher the mean and std of its neighbors' infection times.
To generalize, any node with a high core number has a high mean and std.
Why is this useful?
Check out `cascade_size_vs_source_infected_neighbors.ipynb`.
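One way to check this claim empirically; a sketch assuming `infection_time` is a dict from node to infection time, with uninfected nodes absent:

```python
import networkx as nx
import numpy as np

def neighbor_time_stats(g, infection_time):
    """(core number, mean, std) of infected neighbors' times per node (sketch)."""
    core = nx.core_number(g)
    rows = []
    for v in g:
        times = [infection_time[u] for u in g.neighbors(v) if u in infection_time]
        if len(times) >= 2:  # std needs at least two observations
            rows.append((core[v], np.mean(times), np.std(times)))
    return rows
```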
- Performance metric priority:
  - correctness
    - able to find the source
    - number of queries should be minimized
  - simplicity and applicability to various models
    - a simple method can be applied to a variety of models
      - single p
      - different p on edges
  - robustness to
    - different node sampling methods
  - computation speed
    - if run in an online fashion (new observations are generated on the fly), then speed is important
Bounding the likelihood:
- Can we derive some lower bound on the likelihood of on-edge nodes?
- Can we derive some upper bound on the likelihood of off-edge nodes?
  - or some mean, if the likelihood function is exponentially distributed
Particle filter:
- Can we use it here?
- ZeroDivisionError: all mu values drop to zero, so the total is zero
- Why does the same input produce different querying strategies?
- Does the sampling method converge?
- consider structure when calculating p_mu, not just p
- neighbor query order: the query node should be close to the earlier infected nodes
- different initializations of mu
- an easier and faster way?
- should the baseline start with the node with the highest mu?
- need to justify the multiplicative weights algorithm.
- uninfected nodes
  - should nearby nodes decrease mu?
- another baseline:
  - iteratively add new observations and infer the new source likelihood
- sampling by (see the sketch after this list):
  - node degree: high-degree nodes are more likely to be sampled
  - infection time: later nodes are more likely to be sampled
  - uniform
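A sketch of the three sampling schemes; the interface (`times` as a dict of infection times over infected nodes, a networkx-style `g.degree`) is assumed:

```python
import numpy as np

def sample_observed(g, times, k, method='uniform', seed=None):
    """Sample k observed nodes under one of the three schemes (sketch)."""
    rng = np.random.default_rng(seed)
    nodes = list(times)  # infected nodes only
    if method == 'degree':      # high-degree nodes more likely
        w = np.array([g.degree(v) for v in nodes], dtype=float)
    elif method == 'late':      # later-infected nodes more likely
        w = np.array([times[v] for v in nodes], dtype=float) + 1.0  # +1 so t=0 keeps mass
    else:                       # uniform
        w = np.ones(len(nodes))
    # sample indices rather than labels to avoid numpy dtype issues
    idx = rng.choice(len(nodes), size=k, replace=False, p=w / w.sum())
    return [nodes[i] for i in idx]
```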
- source node (square or star)
- query and response (different colors), with an arrow from query to response
- path nodes drawn bigger
- observed nodes (circle with a tick inside)
Note: uninfected nodes are excluded from the following analysis.
When p=0.7:
- correlation between shortest path length and ratio of infected nodes: weakly positive
- infection time distribution: the shape looks like a Poisson distribution
When p=0.4:
- the correlation is strong
- again, the distribution looks like a Poisson distribution
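The correlation itself is a one-liner; a sketch with toy numbers standing in for the real per-cascade measurements:

```python
import numpy as np
from scipy.stats import pearsonr

# toy stand-ins: shortest-path lengths and infected-node ratios per cascade
dists = np.array([1, 2, 2, 3, 4, 5])
infected_ratio = np.array([0.30, 0.50, 0.20, 0.60, 0.40, 0.70])

r, p_value = pearsonr(dists, infected_ratio)  # r is the reported correlation
```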
Using the late-nodes sampling method, epsilon needs to be large enough (>0.6) to beat the baseline.
When eps = 0.8, the query count is almost halved.
However, there are certain cases where it queries almost every node.
The reason: because of stochasticity, it's possible that some non-source node explains the cascade better than the actual source. Note that the process is random, so a source might produce a cascade that is unlikely to happen.
To make it more robust, we can combine the baseline algorithm with this method.
To query all nodes (grid):
- Random: 1s
- Min consensus: ~20s
For the Kronecker graph:
- Min consensus: 2min 37s
- For cliques, you need to query all the nodes.
- Remember to set mu to zero.
Penalty definition: `abs(hmean - outcome)`
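In code, the penalty could look like this; reading `hmean` as scipy's harmonic mean is my assumption, as is what the values are:

```python
from scipy.stats import hmean

def penalty(values, outcome):
    """abs(hmean - outcome), per the definition above (sketch).

    values  : positive numbers whose harmonic mean is taken
    outcome : the observed outcome, e.g. 0/1 from a query response
    """
    return abs(hmean(values) - outcome)
```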
- when deciding whether a node is the source, the queried neighbors can be used to update mu as well.
- efficient implementation of generalized Jaccard similarity
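For reference, the generalized Jaccard similarity of two non-negative weight vectors is the sum of element-wise minima over the sum of maxima; a vectorized numpy sketch:

```python
import numpy as np

def generalized_jaccard(x, y):
    """sum(min(x_i, y_i)) / sum(max(x_i, y_i)) for non-negative vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    denom = np.maximum(x, y).sum()
    # two all-zero vectors: define similarity as 1.0
    return np.minimum(x, y).sum() / denom if denom > 0 else 1.0
```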
networkx `nodes_iter` and `Parallel`:
- the `nodes_iter` order is inconsistent with and without `Parallel`
- each job should load only what it needs; otherwise, data loading can be time-consuming
- parallel appending to the same file is fine: when the appended content is small (under `PIPE_BUF`), there is no need to use a file lock
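A sketch of that append pattern; the file name is hypothetical, and the no-lock guarantee relies on the note above (one small write per record, opened in append mode):

```python
import os

LOG = 'results.csv'  # hypothetical shared output file

def append_record(line):
    """Append one small record without a file lock (sketch)."""
    data = (line.rstrip('\n') + '\n').encode()
    # stay under PIPE_BUF so the single write cannot interleave
    assert len(data) <= os.pathconf('.', 'PC_PIPE_BUF'), 'record too large'
    fd = os.open(LOG, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, data)  # one write call => record lands contiguously
    finally:
        os.close(fd)
```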
sudo apt install -y libcgal-dev libcairo2-dev libcairomm-1.0 libcairomm-1.0-dev python3-cairo python3-cairo-dev libsparsehash-dev
- also: `python3-gi python3-click python3-gi-cairo python3-cairo gir1.2-gtk-3.0`
Then:
`./configure CXXFLAGS="-std=gnu++14"`
- [the reason for the flag](https://git.skewed.de/count0/graph-tool/issues/359)