
Experiment to mitigate StorageId union access patterns #16939

Open · omaskery wants to merge 18 commits into main

Conversation

@omaskery (Contributor) commented Dec 22, 2024

Objective

  • In the query iteration code there are some types which rely on comments & programmer discipline to ensure correct access of fields:
    • StorageId - a union of ArchetypeId and TableId
    • QueryIterationCursor - several fields are only valid, and some of their types vary, depending on a boolean value:
      is_dense: bool,
      storage_id_iter: core::slice::Iter<'s, StorageId>,
      table_entities: &'w [Entity],
      archetype_entities: &'w [ArchetypeEntity],
    • QueryState - similar to above:
      // NOTE: we maintain both a bitset and a vec because iterating the vec is faster
      pub(super) matched_storage_ids: Vec<StorageId>,
      // Represents whether this query iteration is dense or not. When this is true
      // `matched_storage_ids` stores `TableId`s, otherwise it stores `ArchetypeId`s.
      pub(super) is_dense: bool,
  • On Discord, I asked why this wasn't modeled with an enum of some kind.
  • Somebody replied that the reasoning is historical (is_dense used to be derived from a const generic or similar) and that my proposed enum refactoring was worth considering.
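To make the hazard concrete, here is a minimal sketch of the two approaches with simplified types (Bevy's real definitions carry more fields and derives; this is illustrative, not the actual code). With the union, which field is valid is tracked by a separate `is_dense` flag, so reading the wrong field is unchecked; with an enum, the discriminant travels with the value and a `match` covers both cases:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
struct TableId(u32);
#[derive(Clone, Copy, PartialEq, Debug)]
struct ArchetypeId(u32);

// Union version (as on main): correctness relies on an `is_dense` flag
// checked somewhere else entirely - the compiler cannot help here.
union StorageIdUnion {
    table_id: TableId,
    archetype_id: ArchetypeId,
}

// Enum version (this PR's direction): accessing the wrong variant is
// impossible without an explicit (and visible) unsafe escape hatch.
#[derive(Clone, Copy)]
enum StorageId {
    Table(TableId),
    Archetype(ArchetypeId),
}

fn describe(id: StorageId) -> &'static str {
    match id {
        StorageId::Table(_) => "dense (table) storage",
        StorageId::Archetype(_) => "sparse (archetype) storage",
    }
}
```

The trade-off is that the enum stores a discriminant the union avoided, which is part of what the rest of this PR discussion is about.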

Solution

  • The fields whose access patterns must be managed are now largely moved into enums, making it broadly impossible to access the wrong fields at the wrong times.
  • I have tried, where possible, to keep the number of checks no greater than before - i.e. where we previously only checked is_dense, we now do a single match on the enum.
  • However, I have kept the concept of a StorageId type and made it a normal enum rather than a union. I have then tried to only use this in places where the downsides are hopefully minimal, such as:
    • Where I think the discriminator being stored won't be significant (e.g. it's just transient as part of an iterator)
    • Where I think the cost of accessing the variant is minimal (we would've had to check is_dense anyway)
  • This is largely because I didn't want it to be too drastic a refactor in the fairly scary code around fold_over_storage_range, but I'm open to people giving advice on how to proceed there, whether I should be braver (or not).
  • I have introduced the concept of unsafely accessing the ArchetypeId or TableId via debug_checked_as_x(), to preserve the idea that once the QueryIter knows whether the iteration is dense, it doesn't pay the cost of checking that again when asked to iterate storage by a StorageId. The old code didn't pay that cost, so I was wary of introducing it. If people feel this is overly cautious perf-wise, I can make it safe and have it fail in some way (or whatever you suggest).
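A rough sketch of what a `debug_checked_as_x()` accessor could look like (the name and exact shape are assumptions for illustration, not the PR's literal code): the variant is asserted in debug builds, while in release builds the check compiles away and the caller's safety contract takes over, matching the pattern of helpers like Bevy's `debug_checked_unwrap`:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
struct TableId(u32);
#[derive(Clone, Copy, PartialEq, Debug)]
struct ArchetypeId(u32);

#[derive(Clone, Copy)]
enum StorageId {
    Table(TableId),
    Archetype(ArchetypeId),
}

impl StorageId {
    /// # Safety
    /// The caller must guarantee this is the `Table` variant (e.g. because
    /// the surrounding query iteration is known to be dense).
    unsafe fn debug_checked_as_table(self) -> TableId {
        match self {
            StorageId::Table(id) => id,
            StorageId::Archetype(_) => {
                // Caught loudly in debug builds; a no-op in release builds.
                debug_assert!(false, "StorageId was not a TableId");
                // SAFETY: the caller promised this arm is unreachable.
                unsafe { core::hint::unreachable_unchecked() }
            }
        }
    }
}
```

In release builds this optimizes to a plain field read, so callers that already know `is_dense` pay no extra branch, which is the cost profile the old union had.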

Testing

  • I've run the bevy_ecs unit tests, but don't know what else to do. Please advise! I'm also asking in #bevy_ecs on Discord.
  • I've now also run the full CI pipeline locally which passed, and the checks GitHub has run on this PR so far have passed.
  • I've also run cargo miri test -p bevy_ecs and found no issues that weren't already present on main (there are ~7 memory leaks reported on both main and my branch!).
  • I haven't added any new tests as it is a refactor that should have no external effects 🤞

@omaskery force-pushed the try-mitigating-storage-id-union branch from 0b90fd9 to a4c4e9e on December 23, 2024 at 15:13
@omaskery omaskery marked this pull request as ready for review December 23, 2024 15:15
@BenjaminBrienen added the A-ECS, D-Modest, and S-Needs-Review labels on Dec 23, 2024
@omaskery (Contributor, Author) commented:

To make the status of this PR clear: PR feedback so far has been addressed, but work is still ongoing to check for perf regressions and address them.

@omaskery force-pushed the try-mitigating-storage-id-union branch from d7bec23 to 6b11229 on December 24, 2024 at 11:59
@chescock (Contributor) left a comment:

Looks good to me!

I'll be surprised if the benchmarks show a slowdown from this, but I'm often surprised by the results of benchmarks :).

@BD103 (Member) commented Dec 24, 2024

I ran the benchmarks for this and found some significant performance regressions. I did the following:

$ git checkout main
# Build ECS benchmarks.
$ cargo build -p benches --bench ecs
# Run ECS benchmarks directly, pinning them to the first CPU. Save the results as a baseline named "main".
$ taskset --cpu-list 0 target/release/deps/ecs-5a85551b99999190 iter_ --bench --save-baseline main

# Switch to this PR's branch.
$ git switch try-mitigating-storage-id-union
# Rebuild benchmarks.
$ cargo build -p benches --bench ecs
# Run the new benchmarks, comparing the results with the saved baseline.
$ taskset --cpu-list 0 target/release/deps/ecs-5a85551b99999190 iter_ --baseline main --bench

I found significant regressions on the following:

  • iter_fragmented/foreach_wide: +90%
  • iter_fragmented_sparse/foreach_wide: +81%

There were several other performance gains and regressions (between 2% and 7%), which I've included in the results below. I've also included the HTML report, with all of its graphs, as a ZIP file for further analysis. Note that the two regressions above were in the microseconds and nanoseconds, so this may be negligible in an actual program.

Hope this helps! I don't have enough knowledge of the ECS to figure out why the performance regressed, but this should be a good starting point.

criterion-report.zip

Benchmark Output
iter_fragmented/base    time:   [303.60 ns 303.65 ns 303.73 ns]
                        change: [-25.937% -25.724% -25.539%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  8 (8.00%) high severe
iter_fragmented/wide    time:   [3.9187 µs 3.9514 µs 3.9834 µs]
                        change: [+1.2996% +2.1145% +2.8990%] (p = 0.00 < 0.05)
                        Performance has regressed.
iter_fragmented/foreach time:   [113.56 ns 118.16 ns 123.32 ns]
                        change: [+0.5727% +4.8012% +9.3791%] (p = 0.03 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild
iter_fragmented/foreach_wide
                        time:   [4.8465 µs 4.8818 µs 4.9181 µs]
                        change: [+88.946% +90.418% +91.872%] (p = 0.00 < 0.05)
                        Performance has regressed.

iter_fragmented_sparse/base
                        time:   [4.6012 ns 4.6214 ns 4.6463 ns]
                        change: [-10.872% -9.8044% -8.9351%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe
iter_fragmented_sparse/wide
                        time:   [51.661 ns 52.295 ns 52.938 ns]
                        change: [+1.9106% +2.9326% +3.9490%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
iter_fragmented_sparse/foreach
                        time:   [5.1473 ns 5.1730 ns 5.2047 ns]
                        change: [-3.2390% -1.8911% -0.6794%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 19 outliers among 100 measurements (19.00%)
  10 (10.00%) high mild
  9 (9.00%) high severe
iter_fragmented_sparse/foreach_wide
                        time:   [63.093 ns 64.442 ns 66.973 ns]
                        change: [+78.468% +81.425% +85.276%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

iter_simple/base        time:   [5.1535 µs 5.1571 µs 5.1626 µs]
                        change: [+0.2114% +0.3693% +0.5396%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
iter_simple/wide        time:   [35.295 µs 35.329 µs 35.368 µs]
                        change: [+0.9023% +1.0881% +1.2520%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe
iter_simple/system      time:   [5.3246 µs 5.3248 µs 5.3251 µs]
                        change: [+2.9911% +3.2136% +3.3466%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 19 outliers among 100 measurements (19.00%)
  2 (2.00%) low severe
  7 (7.00%) low mild
  7 (7.00%) high mild
  3 (3.00%) high severe
iter_simple/sparse_set  time:   [15.684 µs 15.691 µs 15.700 µs]
                        change: [+4.2280% +4.4092% +4.5419%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
iter_simple/wide_sparse_set
                        time:   [78.096 µs 78.192 µs 78.280 µs]
                        change: [+0.7110% +0.8327% +0.9447%] (p = 0.00 < 0.05)
                        Change within noise threshold.
iter_simple/foreach     time:   [5.1049 µs 5.1054 µs 5.1061 µs]
                        change: [+2.6787% +2.8166% +2.9249%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
iter_simple/foreach_wide
                        time:   [38.040 µs 38.061 µs 38.083 µs]
                        change: [+6.5819% +6.7352% +6.8683%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe
iter_simple/foreach_sparse_set
                        time:   [14.405 µs 14.420 µs 14.436 µs]
                        change: [+4.1729% +4.3148% +4.4363%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 21 outliers among 100 measurements (21.00%)
  7 (7.00%) high mild
  14 (14.00%) high severe
iter_simple/foreach_wide_sparse_set
                        time:   [79.613 µs 79.629 µs 79.647 µs]
                        change: [+1.0849% +1.1762% +1.2634%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe
iter_simple/foreach_hybrid
                        time:   [6.9573 µs 6.9831 µs 7.0189 µs]
                        change: [+3.1587% +5.7356% +8.8315%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) high mild
  13 (13.00%) high severe

par_iter_simple/with_0_fragment
                        time:   [51.424 µs 51.436 µs 51.448 µs]
                        change: [+0.7495% +0.8385% +0.9118%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
par_iter_simple/with_10_fragment
                        time:   [51.470 µs 51.485 µs 51.502 µs]
                        change: [+0.6942% +0.7896% +0.8947%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe
par_iter_simple/with_100_fragment
                        time:   [52.126 µs 52.152 µs 52.180 µs]
                        change: [-0.2657% -0.0798% +0.0895%] (p = 0.39 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
par_iter_simple/with_1000_fragment
                        time:   [59.121 µs 59.224 µs 59.339 µs]
                        change: [-8.9503% -7.3700% -5.9287%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
par_iter_simple/hybrid  time:   [144.31 µs 144.37 µs 144.42 µs]
                        change: [-2.7294% -1.7431% -0.6849%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 17 outliers among 100 measurements (17.00%)
  10 (10.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe

iter_fragmented(4096)_empty/foreach_table
                        time:   [1.8577 µs 1.8606 µs 1.8639 µs]
                        change: [-1.0602% -0.4868% -0.1011%] (p = 0.03 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
iter_fragmented(4096)_empty/foreach_sparse
                        time:   [5.9155 µs 5.9474 µs 5.9762 µs]
                        change: [-1.3398% -0.8089% -0.3228%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  10 (10.00%) low mild

@omaskery force-pushed the try-mitigating-storage-id-union branch from 6b11229 to 4d8e2f1 on December 24, 2024 at 21:29
@omaskery
Copy link
Contributor Author

I think I've found the source of the performance regressions; I had made a bit of a silly mistake:

  • There are a few places where iteration of storages occurs, agnostic of the storage type, but I somehow convinced myself they were not hot-path and that the cost of iterating and constructing a StorageId for each entry was cheap - so I implemented Iterator<Item = StorageId> and didn't think too much about it.
  • Unfortunately, one of those places was actually the main query iteration hot path.
  • I think that because I was iterating StorageIds, every single iteration produced a new StorageId which - from the compiler's point of view - could have been either variant (Archetype or Table) - and I suspect that stopped it being able to easily optimise the branching inside fold_over_storage_range. I suspect it previously optimised this better because every check referenced the same couple of is_dense variables, making the invariant more obvious to the compiler.
  • So now I've pulled the branching over the storage type to iterate out to the QueryIter::fold implementation, allowing it to call the correct fold_over_xxxx_range_by_id implementation directly.
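The shape of the fix can be sketched like this (simplified, hypothetical types standing in for the real query iteration code): matching per item inside the hot loop versus branching once and then folding over a homogeneous slice:

```rust
#[derive(Clone, Copy)]
enum StorageId {
    Table(u32),
    Archetype(u32),
}

// Before (simplified): the discriminant check runs on every yielded id, and
// since each StorageId could in principle be either variant, the optimizer
// struggles to hoist the branch out of the loop.
fn fold_per_item(ids: &[StorageId]) -> u32 {
    ids.iter().fold(0, |acc, id| match id {
        StorageId::Table(t) => acc + t,
        StorageId::Archetype(a) => acc + a,
    })
}

// After (simplified): branch once on the storage kind, then run a loop over
// a homogeneous slice with no per-item discriminant check.
enum MatchedStorage {
    Tables(Vec<u32>),
    Archetypes(Vec<u32>),
}

fn fold_once(storage: &MatchedStorage) -> u32 {
    match storage {
        MatchedStorage::Tables(tables) => tables.iter().sum(),
        MatchedStorage::Archetypes(archetypes) => archetypes.iter().sum(),
    }
}
```

Both compute the same result; the second form restores the branch-hoisting opportunity the old `is_dense`-based code gave the compiler.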

@chescock chescock self-requested a review December 25, 2024 02:16
@BD103 (Member) commented Dec 26, 2024

@omaskery I encourage you to run the benchmarks yourself once you feel this PR is ready, but let me know if you don't use Linux or need help with the instructions. :)

@omaskery (Contributor, Author) replied:

> @omaskery I encourage you to run the benchmarks yourself once you feel this PR is ready, but let me know if you don't use Linux or need help with the instructions. :)

@BD103 thanks, I have been running the benchmarks, and used perf diff to identify the issue I mentioned in my previous comment. I'm currently working on the feedback from @chescock - particularly how to approach the par_iter implementation.

@omaskery (Contributor, Author) commented Dec 26, 2024

Latest benchmarks: criterion-report.zip

Benchmark Output:
iter_fragmented/base    time:   [506.80 ns 508.79 ns 511.09 ns]
                        change: [-5.5501% -5.0584% -4.5441%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
iter_fragmented/wide    time:   [6.4090 µs 6.4291 µs 6.4489 µs]
                        change: [-1.1382% -0.6173% -0.0846%] (p = 0.02 < 0.05)
                        Change within noise threshold.
iter_fragmented/foreach time:   [215.93 ns 223.90 ns 232.65 ns]
                        change: [-1.1526% +1.9866% +5.0240%] (p = 0.20 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
iter_fragmented/foreach_wide
                        time:   [4.9398 µs 4.9622 µs 4.9835 µs]
                        change: [-8.7052% -6.3693% -4.1158%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

iter_fragmented_sparse/base
                        time:   [8.7403 ns 8.8672 ns 8.9988 ns]
                        change: [-5.6032% -4.7100% -3.8123%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  9 (9.00%) high mild
  2 (2.00%) high severe
iter_fragmented_sparse/wide
                        time:   [68.481 ns 69.769 ns 71.290 ns]
                        change: [+18.747% +21.111% +23.546%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  8 (8.00%) high mild
  4 (4.00%) high severe
iter_fragmented_sparse/foreach
                        time:   [10.037 ns 10.083 ns 10.134 ns]
                        change: [-14.318% -11.335% -8.2538%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
iter_fragmented_sparse/foreach_wide
                        time:   [52.544 ns 52.876 ns 53.232 ns]
                        change: [+25.600% +26.624% +27.616%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

iter_simple/base        time:   [10.437 µs 10.653 µs 10.924 µs]
                        change: [+2.4153% +4.5453% +6.9547%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
  5 (5.00%) high mild
  9 (9.00%) high severe
iter_simple/wide        time:   [52.925 µs 53.135 µs 53.349 µs]
                        change: [-1.3019% +0.4637% +1.8934%] (p = 0.62 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
iter_simple/system      time:   [10.234 µs 10.255 µs 10.279 µs]
                        change: [-0.0113% +0.5585% +1.3869%] (p = 0.10 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
iter_simple/sparse_set  time:   [23.959 µs 24.060 µs 24.196 µs]
                        change: [-8.5315% -6.2355% -3.6832%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) high mild
  7 (7.00%) high severe
iter_simple/wide_sparse_set
                        time:   [127.20 µs 127.49 µs 127.80 µs]
                        change: [+4.6723% +5.2593% +5.8406%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) low mild
  5 (5.00%) high mild
  1 (1.00%) high severe
iter_simple/foreach     time:   [10.355 µs 10.458 µs 10.573 µs]
                        change: [-6.1792% -4.5931% -3.2888%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  8 (8.00%) high mild
  5 (5.00%) high severe
iter_simple/foreach_wide
                        time:   [57.668 µs 57.772 µs 57.887 µs]
                        change: [-7.1458% -5.4034% -3.7904%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  7 (7.00%) low mild
  6 (6.00%) high mild
  1 (1.00%) high severe
iter_simple/foreach_sparse_set
                        time:   [22.267 µs 22.311 µs 22.360 µs]
                        change: [+0.1102% +0.4825% +0.8512%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
iter_simple/foreach_wide_sparse_set
                        time:   [133.41 µs 137.98 µs 143.67 µs]
                        change: [+7.4944% +9.8618% +12.698%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) high mild
  9 (9.00%) high severe
iter_simple/foreach_hybrid
                        time:   [14.422 µs 14.449 µs 14.479 µs]
                        change: [-13.412% -9.2318% -5.1831%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  7 (7.00%) high mild
  3 (3.00%) high severe

par_iter_simple/with_0_fragment
                        time:   [47.125 µs 47.362 µs 47.598 µs]
                        change: [-10.509% -8.5731% -6.9490%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
par_iter_simple/with_10_fragment
                        time:   [48.112 µs 48.709 µs 49.548 µs]
                        change: [-9.1211% -6.4227% -3.2409%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
par_iter_simple/with_100_fragment
                        time:   [48.288 µs 48.659 µs 49.032 µs]
                        change: [-9.2277% -7.8553% -6.5745%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
par_iter_simple/with_1000_fragment
                        time:   [60.481 µs 61.683 µs 63.076 µs]
                        change: [-1.8422% +0.0320% +1.9742%] (p = 0.97 > 0.05)
                        No change in performance detected.
Found 15 outliers among 100 measurements (15.00%)
  9 (9.00%) high mild
  6 (6.00%) high severe
par_iter_simple/hybrid  time:   [90.960 µs 91.182 µs 91.418 µs]
                        change: [-11.332% -7.9212% -4.7275%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) high mild
  8 (8.00%) high severe

iter_fragmented(4096)_empty/foreach_table
                        time:   [6.0752 µs 6.0975 µs 6.1199 µs]
                        change: [+3.5293% +5.8784% +8.0996%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
  7 (7.00%) low severe
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe
iter_fragmented(4096)_empty/foreach_sparse
                        time:   [17.890 µs 17.995 µs 18.114 µs]
                        change: [+0.3374% +1.3626% +2.3091%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) high mild
  5 (5.00%) high severe
