Implement neighborhood search based on CellListMap.jl #8
Conversation
Codecov Report

Coverage Diff (main vs. #8):
Coverage: 88.15% → 89.20% (+1.05%)
Files: 15 → 16 (+1)
Lines: 498 → 556 (+58)
Hits: 439 → 496 (+57)
Misses: 59 → 60 (+1)
@lmiq It would be great if you could have a look as well if you find the time, to check that I'm using CellListMap.jl correctly here and that I'm getting the best performance out of it.
Can you point me to an MWE of a running benchmark with one of the methods you implemented? It is not obvious to me how to run the benchmarks.
I should probably update the plotting script and add this to the docs somewhere. Run

include("benchmarks/benchmarks.jl"); plot_benchmarks(benchmark_n_body, (10, 10, 10), 9, parallel=true)

where 9 means 9 iterations. The benchmark plotting code is:

function plot_benchmarks(benchmark, n_points_per_dimension, iterations;
parallel = true, title = "",
seed = 1, perturbation_factor_position = 1.0)
neighborhood_searches_names = [#"TrivialNeighborhoodSearch";;
"DictionaryCellList";;
"FullGridCellList";;
# "PrecomputedNeighborhoodSearch";;
"CellListMapNeighborhoodSearch"]
# Multiply number of points in each iteration (roughly) by this factor
scaling_factor = 4
per_dimension_factor = scaling_factor^(1 / length(n_points_per_dimension))
sizes = [round.(Int, n_points_per_dimension .* per_dimension_factor^(iter - 1))
for iter in 1:iterations]
n_particles_vec = prod.(sizes)
times = zeros(iterations, length(neighborhood_searches_names))
for iter in 1:iterations
coordinates = point_cloud(sizes[iter], seed = seed,
perturbation_factor_position = perturbation_factor_position)
search_radius = 3.0
NDIMS = size(coordinates, 1)
n_particles = size(coordinates, 2)
neighborhood_searches = [
# TrivialNeighborhoodSearch{NDIMS}(; search_radius, eachpoint = 1:n_particles),
GridNeighborhoodSearch{NDIMS}(; search_radius, n_points = n_particles),
GridNeighborhoodSearch{NDIMS}(; cell_list = FullGridCellList(; search_radius,
min_corner = minimum(coordinates, dims = 2) .- search_radius,
max_corner = maximum(coordinates, dims = 2) .+ search_radius),
search_radius, n_points = n_particles),
CellListMapNeighborhoodSearch(NDIMS; search_radius)
]
for i in eachindex(neighborhood_searches)
neighborhood_search = neighborhood_searches[i]
initialize!(neighborhood_search, coordinates, coordinates)
time = benchmark(neighborhood_search, coordinates, parallel = parallel)
times[iter, i] = time
time_string = BenchmarkTools.prettytime(time * 1e9)
println("$(neighborhood_searches_names[i])")
println("with $(join(sizes[iter], "x")) = $(prod(sizes[iter])) particles finished in $time_string\n")
end
end
plot(n_particles_vec, times ./ n_particles_vec .* 1e9,
xaxis = :log, yaxis = :log,
xticks = (n_particles_vec, n_particles_vec), linewidth = 2,
xlabel = "#particles", ylabel = "runtime per particle [ns]",
legend = :outerright, size = (750, 400), dpi = 200, margin = 4 * Plots.mm,
label = neighborhood_searches_names, title = title)
end
Ok, I managed to run a very simple case:

julia> using CellListMap, PointNeighbors, StaticArrays
julia> function test_celllistmap(x)
sys = ParticleSystem(positions=x, cutoff=5.0, output=zero(x))
map_pairwise!(sys) do x, y, i, j, d2, f
d = sqrt(d2)
fij = (y - x) / d^3
f[:,i] .+= fij
f[:,j] .-= fij
return f
end
return sys.output
end
test_celllistmap (generic function with 1 method)
julia> function test_pointneighbors(x)
f = zero(x)
neighborhood_search = GridNeighborhoodSearch{3}(; search_radius=5.0, n_points=size(x,2))
initialize!(neighborhood_search, x, x)
foreach_point_neighbor(x, x, neighborhood_search, parallel=true) do i, j, pos_diff, distance
distance < sqrt(eps()) && return
dv = -pos_diff / distance^3
for dim in axes(f, 1)
@inbounds f[dim, i] += dv[dim]
end
end
return f
end
test_pointneighbors (generic function with 1 method)
julia> x, _ = CellListMap.xgalactic(10^5); # 10^5 SVector{3,Float64} with "galactic" density.
julia> xmat = Matrix(reinterpret(reshape, Float64, x));
julia> @time test_celllistmap(xmat) # second run
3.557435 seconds (468.93 M allocations: 21.058 GiB, 42.83% gc time, 9 lock conflicts)
3×100000 Matrix{Float64}:
40.841 -140.986 12.1143 77.6912 11.3534 16.8916 10.9625 … 32.2969 43.5096 7.40524 17.4521 82.746 4.78501
-4.60323 10.6294 28.8101 -11.418 -33.4972 -12.3769 -17.9138 -0.0016226 -222.654 -4.39804 -80.1207 37.4482 51.3636
-57.4807 14.2068 15.9348 71.1715 68.6271 30.9155 -120.342 -2.29914 -55.8974 3.31074 117.948 -27.3466 -26.7153
julia> @time test_pointneighbors(xmat) # second run
0.857072 seconds (967 allocations: 49.540 MiB, 0.49% gc time)
3×100000 Matrix{Float64}:
40.841 -140.986 12.1143 77.6912 11.3534 16.8916 10.9625 … 32.2969 43.5096 7.40524 17.4521 82.746 4.78501
-4.60323 10.6294 28.8101 -11.418 -33.4972 -12.3769 -17.9138 -0.0016226 -222.654 -4.39804 -80.1207 37.4482 51.3636
-57.4807 14.2068 15.9348 71.1715 68.6271 30.9155 -120.342 -2.29914 -55.8974 3.31074 117.948 -27.3466 -26.7153
julia> x, _ = CellListMap.xgalactic(10^6); # 10^6 SVector{3,Float64} with "galactic" density.
julia> xmat = Matrix(reinterpret(reshape, Float64, x));
julia> @time test_celllistmap(xmat)
40.969504 seconds (5.51 G allocations: 247.261 GiB, 45.60% gc time, 13 lock conflicts)
3×1000000 Matrix{Float64}:
9.20059 -2.41152 0.114992 107.299 -23.107 16.395 -19.8264 … 135.155 -4.36814 -3.9865 5.18173 107.441 -24.7465
-26.039 16.8446 -37.0278 -13.3514 74.7956 15.8967 34.0362 -57.17 19.5576 46.6715 -22.3012 26.9519 -156.584
-168.725 -63.8354 -163.919 -18.8817 51.9869 19.4064 34.0843 58.2957 -15.1018 37.0779 -16.5614 -23.6634 -5.18014
julia> @time test_pointneighbors(xmat)
23.620195 seconds (6.68 k allocations: 496.365 MiB, 0.02% gc time)
3×1000000 Matrix{Float64}:
9.20059 -2.41152 0.114992 107.299 -23.107 16.395 -19.8264 … 135.155 -4.36814 -3.9865 5.18173 107.441 -24.7465
-26.039 16.8446 -37.0278 -13.3514 74.7956 15.8967 34.0362 -57.17 19.5576 46.6715 -22.3012 26.9519 -156.584
-168.725 -63.8354 -163.919 -18.8817 51.9869 19.4064 34.0843 58.2957 -15.1018 37.0779 -16.5614 -23.6634 -5.18014
I guess this is more or less consistent with what you see?

If the two sets are different, I get:

julia> x, _ = CellListMap.xgalactic(10^5); # 10^5 SVector{3,Float64} with "galactic" density.
julia> y, _ = CellListMap.xgalactic(10^5); # 10^5 SVector{3,Float64} with "galactic" density.
julia> xmat = Matrix(reinterpret(reshape, Float64, x));
julia> ymat = Matrix(reinterpret(reshape, Float64, y));
julia> function test_celllistmap(x,y)
sys = ParticleSystem(xpositions=x, ypositions=y, cutoff=5.0, output=zero(x))
map_pairwise!(sys) do x, y, i, j, d2, f
d = sqrt(d2)
fij = (y - x) / d^3
f[:,i] .+= fij
return f
end
return sys.output
end
test_celllistmap (generic function with 2 methods)
julia> @time test_celllistmap(xmat,ymat)
2.527272 seconds (469.03 M allocations: 21.066 GiB, 39.28% gc time, 12 lock conflicts, 15.07% compilation time)
3×100000 Matrix{Float64}:
21.8077 -0.183917 -25.5396 -38.6599 -7.02715 -22.8335 -115.32 … -10.6988 21.6762 12.2227 -1.46057 13.9872 7.5366
-17.4007 -35.3851 14.0842 6.13937 30.6116 -10.2836 309.663 116.57 -6.51407 138.022 8.54818 25.051 155.299
-52.4528 70.1513 26.3846 31.3363 -72.5124 -166.564 31.0876 29.6242 -15.704 -5.80962 11.3542 -157.471 4.72672
julia> function test_pointneighbors(x,y)
f = zero(x)
neighborhood_search = GridNeighborhoodSearch{3}(; search_radius=5.0, n_points=size(x,2))
initialize!(neighborhood_search, x, y)
foreach_point_neighbor(x, y, neighborhood_search, parallel=true) do i, j, pos_diff, distance
distance < sqrt(eps()) && return
dv = -pos_diff / distance^3
for dim in axes(f, 1)
@inbounds f[dim, i] += dv[dim]
end
end
return f
end
test_pointneighbors (generic function with 2 methods)
julia> @time test_pointneighbors(xmat,ymat)
1.100204 seconds (942.08 k allocations: 92.813 MiB, 0.89% gc time, 21.68% compilation time)
3×100000 Matrix{Float64}:
21.8077 -0.183917 -25.5396 -38.6599 -7.02715 -22.8335 -115.32 … -10.6988 21.6762 12.2227 -1.46057 13.9872 7.5366
-17.4007 -35.3851 14.0842 6.13937 30.6116 -10.2836 309.663 116.57 -6.51407 138.022 8.54818 25.051 155.299
-52.4528 70.1513 26.3846 31.3363 -72.5124 -166.564 31.0876 29.6242 -15.704 -5.80962 11.3542 -157.471 4.72672
Let me know if (when) you are supporting periodic boundary conditions :-).

Edit: It seems that orthorhombic boxes are already supported? (It is unusual to provide minimum and maximum coordinates instead of lengths, though. Does it make any difference if the lengths are the same but the corners are placed differently? I guess it doesn't.)
I don't think your benchmarks are fair: the broadcasted slice updates f[:,i] .+= fij and f[:,j] .-= fij in the CellListMap.jl version allocate in every pair interaction.
This originated from the way we initialize simulations. We might have a pipe flow with open boundaries, then want to use periodic boundaries instead, and then we can just provide the bounding box of the domain. We don't have any plans to support more complex periodic boundaries ourselves, but when this PR is merged, we can add support for them with the CellListMap.jl backend.
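Regarding the parenthetical question above (whether it matters where the corners are placed if the side lengths are the same): for an orthorhombic box, the minimum-image wrapping only involves the side lengths, so the corner placement should indeed make no difference. A minimal sketch, with a hypothetical minimum_image helper that is not part of either library:

```julia
# Minimum-image wrapping of a displacement in an orthorhombic periodic box.
# Only the side lengths enter the formula; the corner positions never appear.
# (Hypothetical helper for illustration, not CellListMap.jl or PointNeighbors.jl API.)
minimum_image(dx, lengths) = dx .- lengths .* round.(dx ./ lengths)

# Two boxes with the same side lengths but corners placed differently
lengths_a = [20.0, 20.0, 20.0] .- [0.0, 0.0, 0.0]        # corners at 0 and 20
lengths_b = [10.0, 10.0, 10.0] .- [-10.0, -10.0, -10.0]  # corners at -10 and 10

dx = [19.0, -19.0, 0.5]
minimum_image(dx, lengths_a) == minimum_image(dx, lengths_b)  # true
```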
Yes, good catch (although why it allocates I don't know - it should not). Unfortunately, everything I do lately I have to do in a rush... With that fixed I get:

julia> function test_celllistmap(x)
sys = ParticleSystem(positions=x, cutoff=5.0, output=zero(x))
map_pairwise!(sys) do x, y, i, j, d2, f
d = sqrt(d2)
fij = (y - x) / d^3
for k in eachindex(fij)
f[k,i] += fij[k]
f[k,j] -= fij[k]
end
return f
end
return sys.output
end
test_celllistmap (generic function with 2 methods)
julia> x, _ = CellListMap.xgalactic(10^5); # 10^5 SVector{3,Float64} with "galactic" density.
julia> xmat = Matrix(reinterpret(reshape, Float64, x));
julia> @time test_celllistmap(xmat) # second run
0.298172 seconds (13.65 k allocations: 97.775 MiB, 1.98% gc time, 12 lock conflicts)
3×100000 Matrix{Float64}:
-88.6852 -129.101 -46.6896 -48.3546 147.853 -24.7797 18.0712 … -6.32157 9.97575 -1.63363 10.3277 15.7659 -29.5342
5.05557 284.657 42.8502 -10.2702 101.038 69.9032 -101.463 21.2009 -36.1706 -128.78 32.0599 6.22065 4.78016
101.876 97.5458 80.0019 1.7858 147.629 -120.023 -20.9682 -19.8745 -38.8108 26.2592 -2.34381 0.131151 108.677
julia> @time test_pointneighbors(xmat) # second run
0.937435 seconds (968 allocations: 49.540 MiB, 2.56% gc time)
3×100000 Matrix{Float64}:
-88.6852 -129.101 -46.6896 -48.3546 147.853 -24.7797 18.0712 … -6.32157 9.97575 -1.63363 10.3277 15.7659 -29.5342
5.05557 284.657 42.8502 -10.2702 101.038 69.9032 -101.463 21.2009 -36.1706 -128.78 32.0599 6.22065 4.78016
101.876 97.5458 80.0019 1.7858 147.629 -120.023 -20.9682 -19.8745 -38.8108 26.2592 -2.34381 0.131151 108.677

and for the 10^6 case:

julia> x, _ = CellListMap.xgalactic(10^6); # 10^6 SVector{3,Float64} with "galactic" density.
julia> xmat = Matrix(reinterpret(reshape, Float64, x));
julia> @time test_celllistmap(xmat)
3.535903 seconds (103.01 k allocations: 1.097 GiB, 5.36% gc time, 11 lock conflicts)
3×1000000 Matrix{Float64}:
15.8458 6.61017 129.444 16.249 -36.1833 -47.6617 -4.15331 … 83.2507 -135.724 -8.00536 8.17124 -9.491 -13.0002
-2.91166 17.8658 -28.7162 17.7136 -39.6242 44.6349 -14.1746 12.7022 -14.5613 -3.03081 -8.79777 -136.763 4.94949
19.9325 2.57851 -18.5558 -17.7504 -31.7794 173.664 -9.83589 -13.8241 37.9389 -7.4949 175.836 -37.2366 35.1991
julia> @time test_pointneighbors(xmat)
24.696261 seconds (6.68 k allocations: 496.365 MiB, 0.01% gc time)
3×1000000 Matrix{Float64}:
15.8458 6.61017 129.444 16.249 -36.1833 -47.6617 -4.15331 … 83.2507 -135.724 -8.00536 8.17124 -9.491 -13.0002
-2.91166 17.8658 -28.7162 17.7136 -39.6242 44.6349 -14.1746 12.7022 -14.5613 -3.03081 -8.79777 -136.763 4.94949
19.9325 2.57851 -18.5558 -17.7504 -31.7794 173.664 -9.83589 -13.8241 37.9389 -7.4949 175.836 -37.2366 35.1991

(Does this make sense?)

I tried to run with the FullGridCellList, but I get an error:

julia> function test_pointneighbors(x)
f = zero(x)
neighborhood_search = GridNeighborhoodSearch{3}(; search_radius=5.0, n_points=size(x,2),
cell_list=FullGridCellList(search_radius=5.0, min_corner=[0.0, 0.0, 0.0], max_corner=[20.274, 20.274, 20.274]))
initialize!(neighborhood_search, x, x)
foreach_point_neighbor(x, x, neighborhood_search, parallel=true) do i, j, pos_diff, distance
distance < sqrt(eps()) && return
dv = -pos_diff / distance^3
for dim in axes(f, 1)
@inbounds f[dim, i] += dv[dim]
end
end
return f
end
test_pointneighbors (generic function with 2 methods)
julia> @time test_pointneighbors(xmat)
ERROR: BoundsError: attempt to access 100×216 Matrix{Int32} at index [101, 115]
Stacktrace:
[1] throw_boundserror(A::Matrix{Int32}, I::Tuple{Int64, Int64})
@ Base ./essentials.jl:14
[2] checkbounds
@ ./abstractarray.jl:699 [inlined]
[3] setindex!
@ ./array.jl:993 [inlined]
[4] pushat!
@ ~/.julia/packages/PointNeighbors/vKy7i/src/vector_of_vectors.jl:64 [inlined]
[5] push_cell!
@ ~/.julia/packages/PointNeighbors/vKy7i/src/cell_lists/full_grid.jl:124 [inlined]
[6] initialize_grid!(neighborhood_search::GridNeighborhoodSearch{…}, y::Matrix{…})
@ PointNeighbors ~/.julia/packages/PointNeighbors/vKy7i/src/nhs_grid.jl:176
[7] initialize!(neighborhood_search::GridNeighborhoodSearch{…}, x::Matrix{…}, y::Matrix{…})
@ PointNeighbors ~/.julia/packages/PointNeighbors/vKy7i/src/nhs_grid.jl:162
[8] test_pointneighbors(x::Matrix{Float64})
@ Main ./REPL[85]:5
[9] macro expansion
@ ./timing.jl:581 [inlined]
[10] top-level scope
@ ./REPL[86]:1
Some type information was truncated. Use `show(err)` to see complete types.
PS: The allocations come from the combination of the slice and broadcasting: f[:,i] .+= fij gets lowered to f[:,i] .= f[:,i] .+ fij, and the f[:,i] on the right-hand side is a regular getindex, which allocates a copy.
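To make that concrete, here is a minimal sketch (my own illustration, not code from the PR) of the allocation and two ways to avoid it, using @views or an explicit loop as in the fixed benchmark above:

```julia
using BenchmarkTools, StaticArrays

f = zeros(3, 100)
fij = SVector(1.0, 2.0, 3.0)

# Allocates: `f[:, i] .+= fij` lowers to `f[:, i] .= f[:, i] .+ fij`,
# and the slice on the right-hand side is a regular getindex, i.e. a copy.
add_slice!(f, fij, i) = (f[:, i] .+= fij)

# No allocation: @views turns the slices into views.
add_view!(f, fij, i) = (@views f[:, i] .+= fij)

# No allocation: explicit loop over the components, as in the fixed benchmark.
function add_loop!(f, fij, i)
    for k in eachindex(fij)
        f[k, i] += fij[k]
    end
    return nothing
end

@btime add_slice!($f, $fij, 7)  # 1 allocation per call
@btime add_view!($f, $fij, 7)   # 0 allocations
@btime add_loop!($f, $fij, 7)   # 0 allocations
```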
CellListMap.map_pairwise!(0, box, cell_list,
                          parallel = parallel !== false) do x, y, i, j, d2, output
    # Skip all indices not in `points`
    i in points || return output
Doesn't this produce a notable overhead which affects the benchmark?
I tried. It's negligible.
Just out of interest: am I missing something?
This check is done for each particle, so multiple times. For a large set, the overhead accumulates, no?
julia> f(i, points) = i in points
f (generic function with 1 method)
julia> points = rand(Int, 1_000_000);
julia> @benchmark f($5, $points)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 221.584 μs … 380.625 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 221.834 μs ┊ GC (median): 0.00%
Time (mean ± σ): 223.171 μs ± 4.460 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▅▂▄▁ ▅ ▁ ▁ ▁
█████▇▆██▇██▆██▇▆▆▆▇▅▆▆▅▅▆▅▇▅▄▄▃▅▄▄▅▃▃▄▃▄▂▂▃▃▃▃▂▃▂▂▄▅▇▆▄▃▅▂▃▇ █
222 μs Histogram: log(frequency) by time 243 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
The difference is that this is about 1-2% of the actual computation when you have a real example where you are doing actual work per particle.
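As a side note on the micro-benchmark above (a generic sketch, not the PR's actual setup): the cost of `i in points` also depends strongly on the type of `points`. For a Vector it is a linear scan, while for a range it is a constant-time bounds check:

```julia
using BenchmarkTools

points_vec = collect(1:1_000_000)  # Vector{Int}: `in` scans linearly
points_range = 1:1_000_000         # UnitRange:   `in` is just a bounds check

check(i, points) = i in points

@btime check(999_999, $points_vec)    # hundreds of microseconds
@btime check(999_999, $points_range)  # a few nanoseconds
```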
@lmiq This is a bit tricky. You have to pad the bounding box with the search radius and increase the maximum number of particles per cell. The former we want to change (#80), the latter should have a more meaningful error message (#65). Here is a benchmark for only the interaction, not the initialization:

using CellListMap, PointNeighbors, StaticArrays, BenchmarkTools
x, _ = CellListMap.xgalactic(10^6);
xmat = Matrix(reinterpret(reshape, Float64, x));
f = similar(xmat)
f2 = similar(xmat)
f3 = similar(xmat)
f4 = similar(xmat)
nhs1 = GridNeighborhoodSearch{3}(; search_radius=5.0, n_points=size(xmat,2))
initialize!(nhs1, xmat, xmat)
min_corner = minimum(xmat, dims=2) .- 5
max_corner = maximum(xmat, dims=2) .+ 5
nhs2 = GridNeighborhoodSearch{3}(; search_radius=5.0, n_points=size(xmat,2),
cell_list=FullGridCellList(search_radius=5.0;
min_corner, max_corner,
max_points_per_cell=2000))
initialize!(nhs2, xmat, xmat)
sys = ParticleSystem(positions=x, cutoff=5.0, output=f)
sys2 = ParticleSystem(xpositions=x, ypositions=x, cutoff=5.0, output=f2)
function test_celllistmap(_, sys)
sys.output .= 0
map_pairwise!(sys) do x, y, i, j, d2, f
d = sqrt(d2)
fij = (y - x) / d^3
for dim in axes(f, 1)
@inbounds f[dim, i] += fij[dim]
@inbounds f[dim, j] -= fij[dim]
end
return f
end
return sys.output
end
function test_celllistmap_different(_, sys)
sys.output .= 0
map_pairwise!(sys) do x, y, i, j, d2, f
d2 < eps() && return f
d = sqrt(d2)
fij = (y - x) / d^3
for dim in axes(f, 1)
@inbounds f[dim, i] += fij[dim]
end
return f
end
return sys.output
end
function test_pointneighbors(x, f, nhs)
f .= 0
foreach_point_neighbor(x, x, nhs, parallel=true) do i, j, pos_diff, distance
distance < sqrt(eps()) && return
dv = -pos_diff / distance^3
for dim in axes(f, 1)
@inbounds f[dim, i] += dv[dim]
end
end
return f
end
I get this:
The last three seem reasonable based on my original benchmarks in the first post. The first one looks suspicious.
Why is the first one so much faster? In my benchmarks in the first post, I found that the speedup comes only from using symmetry to skip half the pairs. But that should give a speedup of at most 2x.
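For reference, a minimal brute-force sketch (my own illustration, not the package code) of the symmetry argument: visiting each unordered pair once and applying ±fij to both particles does the same work as the one-sided loop while touching only half the pairs, hence a speedup of at most 2x.

```julia
using StaticArrays

# One-sided: every ordered pair (i, j) with i != j within the cutoff.
function forces_onesided(x, cutoff)
    f = zeros(SVector{3, Float64}, length(x))
    for i in eachindex(x), j in eachindex(x)
        i == j && continue
        d = x[j] - x[i]
        r = sqrt(sum(abs2, d))
        r > cutoff && continue
        f[i] += d / r^3
    end
    return f
end

# Symmetric: each unordered pair (i < j) once, updating both particles.
function forces_symmetric(x, cutoff)
    f = zeros(SVector{3, Float64}, length(x))
    for i in eachindex(x), j in (i + 1):length(x)
        d = x[j] - x[i]
        r = sqrt(sum(abs2, d))
        r > cutoff && continue
        fij = d / r^3
        f[i] += fij   # Newton's third law: fij on i ...
        f[j] -= fij   # ... and -fij on j
    end
    return f
end

x = rand(SVector{3, Float64}, 1000)
forces_onesided(x, 0.2) ≈ forces_symmetric(x, 0.2)  # true
```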
Indeed, there's something not scaling well there in the two-set case. The parallelization mechanisms and the splitting of tasks are different between the one-set and two-set cases, and the reason for that is empirical testing of different schemes at some point in the development. I will probably have to revisit the behavior of the two-set case at some point to check these issues.

For the record, this is what I got when increasing the number of particles:

julia> run(;n=5*10^4)
CellListMap (t1): 0.163701008 s
CellListMap different (t2/t1): 1.8283861147635694
PointNeighbors nhs1 (t3/t1): 2.213968633595708
PointNeighbors nhs2 (t4/t1): 2.1056406384498256
Test Passed
julia> run(;n=10^5)
CellListMap (t1): 0.305206052 s
CellListMap different (t2/t1): 2.0926074362378633
PointNeighbors nhs1 (t3/t1): 2.8729478044557255
PointNeighbors nhs2 (t4/t1): 2.8898966918257574
Test Passed
julia> run(;n=5*10^5)
CellListMap (t1): 1.581912255 s
CellListMap different (t2/t1): 2.4655888565703035
PointNeighbors nhs1 (t3/t1): 4.414566006380676
PointNeighbors nhs2 (t4/t1): 4.29086459855512
Test Passed
julia> run(;n=10^6)
CellListMap (t1): 3.4457909320000004 s
CellListMap different (t2/t1): 3.6117841437862372
PointNeighbors nhs1 (t3/t1): 6.811157335183328
PointNeighbors nhs2 (t4/t1): 6.209188126681157
Test Passed

We can see that for a smallish number of particles the ratio between "one set" and "two sets" is roughly 2x, which is what we expect, but it gets worse as the number of particles increases. This is the code I ran:

using CellListMap, PointNeighbors, StaticArrays, BenchmarkTools, Chairmarks
using Test
function test_celllistmap(_, sys)
sys.output .= 0
map_pairwise!(sys) do x, y, i, j, d2, f
d = sqrt(d2)
fij = (y - x) / d^3
for dim in axes(f, 1)
@inbounds f[dim, i] += fij[dim]
@inbounds f[dim, j] -= fij[dim]
end
return f
end
return sys.output
end
function test_celllistmap_different(_, sys)
sys.output .= 0
map_pairwise!(sys) do x, y, i, j, d2, f
d2 < eps() && return f
d = sqrt(d2)
fij = (y - x) / d^3
for dim in axes(f, 1)
@inbounds f[dim, i] += fij[dim]
end
return f
end
return sys.output
end
function test_pointneighbors(x, f, nhs)
f .= 0
foreach_point_neighbor(x, x, nhs, parallel=true) do i, j, pos_diff, distance
distance < sqrt(eps()) && return
dv = -pos_diff / distance^3
for dim in axes(f, 1)
@inbounds f[dim, i] += dv[dim]
end
end
return f
end
function run(;n=10^5)
x, _ = CellListMap.xgalactic(n);
xmat = Matrix(reinterpret(reshape, Float64, x));
f = similar(xmat)
f2 = similar(xmat)
f3 = similar(xmat)
f4 = similar(xmat)
nhs1 = GridNeighborhoodSearch{3}(; search_radius=5.0, n_points=size(xmat,2))
initialize!(nhs1, xmat, xmat)
min_corner = minimum(xmat, dims=2) .- 5
max_corner = maximum(xmat, dims=2) .+ 5
nhs2 = GridNeighborhoodSearch{3}(; search_radius=5.0, n_points=size(xmat,2),
cell_list=FullGridCellList(search_radius=5.0;
min_corner, max_corner,
max_points_per_cell=2000))
initialize!(nhs2, xmat, xmat)
sys = ParticleSystem(positions=x, cutoff=5.0, output=f)
sys2 = ParticleSystem(xpositions=x, ypositions=x, cutoff=5.0, output=f2)
b1 = @b test_celllistmap($(nothing), $sys)
println("CellListMap (t1): ", b1.time, " s")
b2 = @b test_celllistmap_different($(nothing), $sys2)
println("CellListMap different (t2/t1): ", b2.time / b1.time)
b3 = @b test_pointneighbors($xmat, $f3, $nhs1)
println("PointNeighbors nhs1 (t3/t1): ", b3.time / b1.time)
b4 = @b test_pointneighbors($xmat, $f4, $nhs2)
println("PointNeighbors nhs2 (t4/t1): ", b4.time / b1.time)
@test f ≈ f2 ≈ f3 ≈ f4
end
Many thanks for the discussion!
Interesting that I didn't see this in my original benchmark, even though I'm using much larger problems there. Probably because our use cases, and therefore our benchmarks, are very different: you use a much larger search radius with almost 2000 particles per cell, while in our simulations and benchmarks we usually have fewer than 30 particles per cell and ~100 neighbor particles within the search radius.
Please report back when you find something interesting.
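A quick back-of-the-envelope check of those numbers (assuming the cell size equals the search radius h and roughly uniform density): a cell holds about rho*h^3 particles, while a search sphere holds rho*(4/3)*pi*h^3, so the neighbor count is roughly 4.2 times the particles per cell.

```julia
# Ratio of search-sphere volume (radius h) to cell volume (edge h)
sphere_to_cell = 4π / 3     # ≈ 4.19, independent of h

30 * sphere_to_cell         # ≈ 126 neighbors for ~30 particles per cell
2000 * sphere_to_cell       # ≈ 8400 neighbors for ~2000 particles per cell
```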
Indeed, the density relative to the cutoff makes a lot of difference, and it is something I have tried to optimize for. The atomic systems are less dense (relative to the cutoff); they are generated with the xatomic function. With those I get:

julia> run(;n=10^5)
CellListMap (t1): 0.053651925 s
CellListMap different (t2/t1): 2.118421175009098
Test Passed
julia> run(;n=5*10^5)
CellListMap (t1): 0.272764123 s
CellListMap different (t2/t1): 2.6336540088155216
Test Passed
julia> run(;n=10^6)
CellListMap (t1): 0.5655830310000001 s
CellListMap different (t2/t1): 3.85268252505263
Test Passed

(I couldn't run the PointNeighbors.jl benchmarks for this case.)
The difference in performance, though, has nothing to do with the scaling of the parallelization. It is the algorithm that is different. In fact, I had already noted that possibility, and there is a comment about it in the code. The thing is that when using a full cell-list approach, I'm using a projection approach to eliminate the calculation of unnecessary distances. It consists of projecting the positions of the particles onto the vector connecting the centers of the cells being considered, sorting the positions along that line, and only computing interactions if the distance along that line is smaller than the cutoff. This method was proposed in [1] and [2].

For the two-set case, this is clearly very advantageous if both sets are large, but if one of the sets is small, the cost of computing the cell lists for the larger set becomes less favorable (at least that was the case when I benchmarked this). I will probably need to test again whether computing the lists for all particles is worth the effort now, or at least switch the method automatically using some criterion.

Code run for the benchmark above:

using CellListMap, PointNeighbors, StaticArrays
using Chairmarks
using Test
function test_celllistmap(_, sys)
sys.output .= 0
map_pairwise!(sys) do x, y, i, j, d2, f
d = sqrt(d2)
fij = (y - x) / d^3
for dim in axes(f, 1)
@inbounds f[dim, i] += fij[dim]
@inbounds f[dim, j] -= fij[dim]
end
return f
end
return sys.output
end
function test_celllistmap_different(_, sys)
sys.output .= 0
map_pairwise!(sys) do x, y, i, j, d2, f
d2 < eps() && return f
d = sqrt(d2)
fij = (y - x) / d^3
for dim in axes(f, 1)
@inbounds f[dim, i] += fij[dim]
end
return f
end
return sys.output
end
function test_pointneighbors(x, f, nhs)
f .= 0
foreach_point_neighbor(x, x, nhs, parallel=true) do i, j, pos_diff, distance
distance < sqrt(eps()) && return
dv = -pos_diff / distance^3
for dim in axes(f, 1)
@inbounds f[dim, i] += dv[dim]
end
end
return f
end
function run(;n=10^5)
x, box = CellListMap.xatomic(n);
xmat = Matrix(reinterpret(reshape, Float64, x));
f = similar(xmat)
f2 = similar(xmat)
f3 = similar(xmat)
f4 = similar(xmat)
nhs1 = GridNeighborhoodSearch{3}(; search_radius=box.cutoff, n_points=size(xmat,2))
initialize!(nhs1, xmat, xmat)
min_corner = minimum(xmat, dims=2) .- 5
max_corner = maximum(xmat, dims=2) .+ 5
nhs2 = GridNeighborhoodSearch{3}(; search_radius=box.cutoff, n_points=size(xmat,2),
cell_list=FullGridCellList(search_radius=box.cutoff;
min_corner, max_corner,
max_points_per_cell=2000))
initialize!(nhs2, xmat, xmat)
sys = ParticleSystem(positions=x, cutoff=box.cutoff, output=f)
sys2 = ParticleSystem(xpositions=x, ypositions=x, cutoff=box.cutoff, output=f2)
b1 = @b test_celllistmap($(nothing), $sys)
println("CellListMap (t1): ", b1.time, " s")
b2 = @b test_celllistmap_different($(nothing), $sys2)
println("CellListMap different (t2/t1): ", b2.time / b1.time)
# b3 = @b test_pointneighbors($xmat, $f3, $nhs1)
# println("PointNeighbors nhs1 (t3/t1): ", b3.time / b1.time)
# b4 = @b test_pointneighbors($xmat, $f4, $nhs2)
# println("PointNeighbors nhs2 (t4/t1): ", b4.time / b1.time)
# @test f ≈ f2 ≈ f3 ≈ f4
@test f ≈ f2
end
In this branch I have implemented the strategy of computing cell lists for both sets of particles, and then using the same method to avoid the computation of distances as was used in the single-set problems: https://github.com/m3g/CellListMap.jl/tree/new_CellListPair

Now I get the following results from the benchmark in #8 (comment):

julia> run(;n=5*10^4)
CellListMap (t1): 0.152432735 s
CellListMap different (t2/t1): 1.6050779643886859
PointNeighbors nhs1 (t3/t1): 2.2481417918533046
PointNeighbors nhs2 (t4/t1): 2.258499921293153
Test Passed
julia> run(;n=10^5)
CellListMap (t1): 0.299085832 s
CellListMap different (t2/t1): 1.710319695116819
PointNeighbors nhs1 (t3/t1): 2.901262748547715
PointNeighbors nhs2 (t4/t1): 2.868669272170673
Test Passed
julia> run(;n=5*10^5)
CellListMap (t1): 1.582251368 s
CellListMap different (t2/t1): 1.6601796706425727
PointNeighbors nhs1 (t3/t1): 4.411138771725176
PointNeighbors nhs2 (t4/t1): 4.24521378577819
Test Passed
julia> run(;n=10^6)
CellListMap (t1): 3.4587672630000004 s
CellListMap different (t2/t1): 1.725844078859029
PointNeighbors nhs1 (t3/t1): 6.799690402586073
PointNeighbors nhs2 (t4/t1): 6.206536996473243
Test Passed

For these tests, this is a clear improvement for the two-set case. Nevertheless, for smaller but relevant (maybe even more common for us) systems it becomes somewhat slower, and by constructing cell lists for both sets we use more memory, which can be a problem for very large systems. Thus, I'm still wondering if I'll merge that change as it is.

But it is important to remark that this method to avoid unnecessary computations is relevant for relatively dense cells (many particles per cell), or if the computations to be performed for each pair of particles are expensive, such that avoiding any computation is important. For lower densities per cell and fast operations per pair, it might be detrimental to performance.
Why do I get this on your branch?
Can you try now? I'm in the middle of updating other things and you might have reached an unusable state.
Very interesting. Even for my benchmark in the first post with much less dense cells, this is significantly faster than the original version:
Pretty much a factor of 2 now compared to the symmetric version. |
Indeed, the strategy to avoid unnecessary computations is surprisingly effective. It is something I guess you could implement as well; it is a local computation (you just need a data structure that keeps the projection and sorting of the particles local).
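A rough sketch of how I read that projection idea (my own illustration under stated assumptions, not CellListMap.jl's internal code): for a pair of cells, project the particles onto the unit vector joining the cell centers, sort one cell along that axis, and sweep, skipping pairs whose projected separation already exceeds the cutoff, since the projected distance is a lower bound on the true distance.

```julia
using LinearAlgebra, StaticArrays

# Visit pairs between two cells, skipping pairs whose separation along the
# line joining the cell centers already exceeds the cutoff.
# Illustration of the idea only; `g` is the pairwise callback.
function project_and_sweep(g, x_cell1, x_cell2, center1, center2, cutoff)
    axis = (center2 - center1) / norm(center2 - center1)

    # Scalar coordinates of all particles along that axis
    s1 = [dot(p, axis) for p in x_cell1]
    s2 = [dot(p, axis) for p in x_cell2]

    # Sort the second cell along the axis
    perm = sortperm(s2)

    for i in eachindex(x_cell1)
        for p in perm
            ds = s2[p] - s1[i]
            ds > cutoff && break       # sorted: everything further is too far
            ds < -cutoff && continue   # too far on the other side
            g(i, p, x_cell1[i], x_cell2[p])  # candidate pair; check true distance here
        end
    end
end

# Usage sketch: the callback still checks the true distance.
x1 = [@SVector(rand(3)) for _ in 1:50]
x2 = [@SVector(rand(3)) .+ SVector(1.0, 0.0, 0.0) for _ in 1:50]
project_and_sweep(x1, x2, SVector(0.5, 0.5, 0.5), SVector(1.5, 0.5, 0.5), 0.3) do i, j, xi, xj
    d = norm(xj - xi)
    d <= 0.3 && println("pair ($i, $j) at distance $d")
end
```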
Thanks! I will look into this.
I finally properly implemented the prototype from trixi-framework/TrixiParticles.jl#174. This doesn't have an advantage over existing implementations yet, but we can use the advanced periodicity features of CellListMap.jl in the future. For now, it's mostly useful for benchmarking against CellListMap.jl.
Here is a benchmark using the n-body benchmark on a single thread of an AMD Threadripper 3990X.
It's the usual plot, showing the runtime of a single interaction divided by the number of particles.
Lower is better, and a horizontal line means constant runtime per particle, so a perfect scaling with the problem size.
- The DictionaryCellList implementation and CellListMap.jl start out with equal performance, but CellListMap.jl scales better.
- The FullGridCellList is now faster than CellListMap.jl.
- "CellListMap equal" is the CellListMap.jl implementation with points_equal_neighbors = true. For context, if this is set to false, a CellListPair is used, which works almost the same as the FullGridCellList (although storing coordinates together with indices, see Store coordinates together with the points in a cell list #52). When the two coordinate matrices are identical (so we are looking for fluid neighbors of fluid particles), we can use a regular CellList, which makes use of symmetry to skip half the particle-neighbor pairs. This implementation is faster here.
- The last one is the FullGridCellList, but modified to also use symmetry as explained above. I implemented this in [Proof of Concept] Use symmetry to speed up the main loop #85. This reaches the same performance as the CellListMap.jl implementation.

The same plot using 128 threads on the Threadripper:
The two CellListMap.jl implementations start quite slow because they are not using Polyester.jl. The regular CellListMap.jl implementation is also slower again for large problems for some reason.
For a fair comparison, I added "CellListMap Polyester.jl", which is the same as the "CellListMap" implementation but using Polyester.jl on the particle loop, just like we do in PointNeighbors.jl. This is implemented in m3g/CellListMap.jl#104.
This plot looks very similar to the serial plot.
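For context, a minimal sketch of what "using Polyester.jl on the particle loop" means (a generic pattern with hypothetical placeholders, not the actual code of either package): the outer loop over particles is chunked across threads with Polyester.@batch, which typically has lower scheduling overhead than spawning one task per chunk.

```julia
using Polyester  # provides @batch

# Generic pattern: low-overhead threading of the outer particle loop.
# `neighbors_of` and `interact!` are hypothetical placeholders for the
# cell-list query and the pairwise kernel.
function interact_all!(f, x, neighbors_of, interact!)
    @batch for i in axes(x, 2)
        for j in neighbors_of(i)
            interact!(f, x, i, j)
        end
    end
    return f
end
```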
Here are the same plots for a single WCSPH interaction loop:
Again, looking very similar to the N-Body benchmark, but "CellListMap equal" is a bit slower.
Here is a benchmark for the updates:
It is clear that one of our main features is: