test: convert testsuite to be function based #7221

Merged: hzhou merged 16 commits into pmodels:main from 2411_mpitests on Dec 20, 2024

@hzhou hzhou commented Nov 17, 2024

Pull Request Description

This is a split-off and renewal of #5725.

The current testsuite consists of thousands of individual MPI test programs. Running the entire testsuite means invoking the process manager to spawn MPI processes for every program, and each process goes through MPI_INIT again and again. Both the process spawning and the MPI initialization are very slow compared to the tested MPI operation itself. The current testsuite runs for a couple of hours, and we run hundreds of them every day.

This PR attempts to convert individual tests into functions, so that multiple tests can run within a single MPI_Init/MPI_Finalize window. I believe this can significantly reduce the CI testing time.
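
As a rough illustration of the idea (not the actual MPICH code, which is written in C), the sketch below runs several test functions between one initialization call and one finalization call. The names `run_tests_in_one_window`, `fake_init`, `fake_finalize`, and the test names are hypothetical; `fake_init` and `fake_finalize` merely stand in for MPI_Init and MPI_Finalize, and the convention that a test returns its error count is an assumption.

```python
def run_tests_in_one_window(tests, init, finalize):
    """Run every test function between a single init() and finalize() call."""
    init()
    try:
        # Assumed convention: each test returns its error count (0 = pass).
        return {name: run() for name, run in tests.items()}
    finally:
        finalize()


# Stand-ins for MPI_Init / MPI_Finalize that only count how often they run.
calls = {"init": 0, "finalize": 0}


def fake_init():
    calls["init"] += 1


def fake_finalize():
    calls["finalize"] += 1


# Hypothetical test functions; real tests would exercise MPI operations.
results = run_tests_in_one_window(
    {"attr_test": lambda: 0, "coll_test": lambda: 0},
    fake_init,
    fake_finalize,
)
print(results, calls)
```

However many tests are registered, the expensive setup and teardown happen once, which is where the expected CI savings come from.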

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

@hzhou hzhou force-pushed the 2411_mpitests branch 2 times, most recently from 0c7f2f3 to 5a82dff Compare November 18, 2024 19:25

hzhou commented Nov 18, 2024

test:mpich/custom ✔️

This significantly accelerates the converted collective tests.

Running tests in ./attr/testlist [20 tests - 00:00:00]
    run_mpitests np=1, 17 tests...
    run_mpitests np=2, 1 tests...
    run_mpitests np=4, 2 tests...
Running tests in ./attr/testlist.dtp [2 tests - 00:00:00]
    run_mpitests np=1, 2 tests...
Running tests in ./coll/testlist [89 tests - 00:00:00]
    run_mpitests np=1, 2 tests...
    run_mpitests np=2, 1 tests...
    run_mpitests np=3, 1 tests...
    run_mpitests np=4, 40 tests...
    run_mpitests np=5, 9 tests...
    run_mpitests np=6, 1 tests...
    run_mpitests np=7, 2 tests...
    run_mpitests np=8, 11 tests...
    run_mpitests np=10, 22 tests...
Running tests in ./coll/testlist.collalgo [1296 tests - 00:00:03]
    run_mpitests np=1, 6 tests...
    run_mpitests np=2, 6 tests...
    run_mpitests np=4, 428 tests...
    run_mpitests np=5, 191 tests...
    run_mpitests np=6, 10 tests...
    run_mpitests np=7, 159 tests...
    run_mpitests np=8, 103 tests...
    run_mpitests np=10, 393 tests...
Running tests in ./attr/testlist [3 tests - 00:00:59]
Running tests in ./coll/testlist [118 tests - 00:01:00]
Running tests in ./coll/testlist.collalgo [118 tests - 00:02:15]
Running tests in ./coll/testlist.dtp [12 tests - 00:02:43]
Running tests in ./comm/testlist [46 tests - 00:04:53]
...

@hzhou hzhou requested a review from raffenet November 18, 2024 20:12

hzhou commented Nov 18, 2024

test:mpich/ch3/most
test:mpich/ch4/most

2 TIMEOUT in ch4-ofi-asan. I don't think they are related to this PR, but it is good that they prove this PR works.

Take ch4-ucx-asan for example. Before this PR:

Running tests in ./attr [00:00:00]
Running tests in ./coll [00:00:52]
Running tests in ./comm [01:36:07]

That is 1 hour 36 minutes to finish the collective tests. After this PR:

Running tests in ./attr/testlist [20 tests - 00:00:00]
    run_mpitests np=1, 17 tests...
    run_mpitests np=2, 1 tests...
    run_mpitests np=4, 2 tests...
Running tests in ./attr/testlist.dtp [2 tests - 00:00:04]
    run_mpitests np=1, 2 tests...
Running tests in ./coll/testlist [89 tests - 00:00:05]
    run_mpitests np=1, 2 tests...
    run_mpitests np=2, 1 tests...
    run_mpitests np=3, 1 tests...
    run_mpitests np=4, 40 tests...
    run_mpitests np=5, 9 tests...
    run_mpitests np=6, 1 tests...
    run_mpitests np=7, 2 tests...
    run_mpitests np=8, 11 tests...
    run_mpitests np=10, 22 tests...
Running tests in ./coll/testlist.collalgo [1296 tests - 00:00:20]
    run_mpitests np=1, 6 tests...
    run_mpitests np=2, 6 tests...
    run_mpitests np=4, 428 tests...
    run_mpitests np=5, 191 tests...
    run_mpitests np=6, 10 tests...
    run_mpitests np=7, 159 tests...
    run_mpitests np=8, 103 tests...
    run_mpitests np=10, 393 tests...
Running tests in ./attr/testlist [3 tests - 00:01:44]
Running tests in ./coll/testlist [117 tests - 00:01:48]
Running tests in ./coll/testlist.collalgo [118 tests - 00:06:18]
Running tests in ./coll/testlist.dtp [12 tests - 00:10:22]
Running tests in ./comm/testlist [46 tests - 00:16:10]

So 1:36 -> 0:16
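
For reference, the arithmetic behind that comparison can be sketched as follows. It uses only the two elapsed-time stamps at which ./comm starts in the logs above; the helper name `to_seconds` is made up for illustration.

```python
def to_seconds(stamp):
    """Convert an HH:MM:SS elapsed-time stamp into seconds."""
    hours, minutes, seconds = (int(part) for part in stamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds


before = to_seconds("01:36:07")  # ./comm starts here before this PR
after = to_seconds("00:16:10")   # ./comm starts here after this PR
speedup = before / after

print(f"{before} s -> {after} s, about {speedup:.1f}x faster")
```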


hzhou commented Nov 19, 2024

test:mpich/whitespace

@raffenet
Contributor

I think my only question is how much this differs in timing from configuring MPICH --without-hwloc and skipping all the topology discovery stuff. If there's still substantial savings for most CI jobs, then this is probably worth it in the long run.


hzhou commented Nov 22, 2024

Let's find out

test:mpich/custom
netmod: ch4:ofi
config: nohwloc

EDIT: oh, I need to run this against the main branch... running in #7204 (comment)


hzhou commented Nov 25, 2024

I think my only question is how much this differs in timing from configuring MPICH --without-hwloc and skipping all the topology discovery stuff. If there's still substantial savings for most CI jobs, then this is probably worth it in the long run.

https://jenkins-pmrs.cels.anl.gov/job/mpich-review-custom/1182/console

Running tests in ./attr [00:00:00]
Running tests in ./coll [00:00:09]
Running tests in ./comm [00:31:54]

So ~32min without hwloc.


@raffenet raffenet left a comment

OK, I think the increased throughput is going to be a big enough benefit to warrant the changes. Let's do it.

hzhou added 16 commits December 19, 2024 20:26
Refactor the code that checks and acquires the file lock into a routine. It
is a common part of running a test. Wrapping it into a routine makes it
easier to reuse.
Add a version of MTestArgListCreate that parses a command line string.
Make -arg equivalent to -arg=1.
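
A minimal Python sketch of the parsing rule described in the two messages above, assuming options are written as `-name` or `-name=value`. This is not the C implementation of MTestArgListCreate; the function name `parse_args` and the option names in the example are hypothetical.

```python
import shlex


def parse_args(cmdline):
    """Parse a command line string into a dict of option name -> value."""
    args = {}
    for token in shlex.split(cmdline):
        if not token.startswith("-"):
            continue  # this sketch ignores anything that is not an option
        name, sep, value = token[1:].partition("=")
        # A bare "-arg" is treated exactly like "-arg=1".
        args[name] = value if sep else "1"
    return args


print(parse_args("-nonblocking -count=16"))
```

With this rule, `parse_args("-nonblocking")` and `parse_args("-nonblocking=1")` give the same result, so boolean-style options need no explicit value.
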
Tests that use mpicolltest.h can be compiled with -DUSE_MTEST_NBC to
become non-blocking tests. Compiling the same source multiple times with
different macros does not fit the multi-tests framework, which runs
multiple tests inside a single MPI_Init/MPI_Finalize window. Convert it
to use an explicit option instead.
The new framework will allow running tests inside a single
MPI_Init/MPI_Finalize window by giving each test a uniform
function interface.

Each test file defines a run function that will run the test. The test
file is linked with a stub, util/run_mpitests.c, to create an individual
test program that should work exactly as before. In addition, all the test
files are linked together into the binary run_mpitests, which can be
used to run multiple tests within a single MPI_Init/MPI_Finalize window.

All such functional tests are listed in test/mpi/maint/all_mpitests.txt.
During autogen, gen_all_mpitests.py will load this file and generate all
the Makefile targets.

In this commit we didn't modify runtests. All tests should still work by
running them individually. We'll add the ability to run multiple tests
in runtests in the next commits.
Filter the test list and find all tests that can be run using
run_mpitests and run them first. Tests are grouped by testlist and
number of processes.
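
The filtering and grouping step could be sketched as below. This is a Python illustration only; runtests itself is a separate script, and the tuple layout, the function name `group_tests`, the test names other than attrend2, and the process counts are all assumptions.

```python
from collections import defaultdict


def group_tests(entries, convertible):
    """Split tests into run_mpitests groups and individually run tests.

    entries: iterable of (testlist, test_name, np) tuples.
    convertible: set of test names that run_mpitests can run.
    """
    groups = defaultdict(list)
    individual = []
    for testlist, name, np in entries:
        if name in convertible:
            # One batch per (testlist, number of processes) pair.
            groups[(testlist, np)].append(name)
        else:
            individual.append((testlist, name, np))
    return dict(groups), individual


# Hypothetical entries; the np values are made up for the example.
entries = [
    ("./attr/testlist", "test_a", 1),
    ("./attr/testlist", "test_b", 1),
    ("./attr/testlist", "attrend2", 1),
    ("./coll/testlist", "test_c", 4),
]
groups, individual = group_tests(entries, {"test_a", "test_b", "test_c"})
print(groups)
print(individual)
```

Grouping by process count matters because every test in a batch shares the same set of launched processes.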

Test execution is controlled by runtests through the input/output
pipes, so we still have granular control of individual tests.

When run_mpitests aborts due to, e.g., a segfault, restart it with the next
test in the testlist. Track the number of such restarts and abort if
something systematic causes it to fail repeatedly, for example an error
in run_mpitests itself.
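
A simplified sketch of the restart logic, assuming a hypothetical `run_batch` callable that stands in for launching run_mpitests and returns how many tests finished before the runner exited. The real control flow goes through pipes; the names, the status strings, and the default limit of 3 restarts are assumptions for illustration.

```python
def run_with_restarts(tests, run_batch, max_restarts=3):
    """Run tests in batches, restarting after the test that killed the runner."""
    results = {}
    pending = list(tests)
    restarts = 0
    while pending:
        finished = run_batch(pending)  # count of tests completed before exit
        for name in pending[:finished]:
            results[name] = "done"
        if finished == len(pending):
            break
        # The test after the last finished one took the runner down.
        results[pending[finished]] = "abort"
        pending = pending[finished + 1:]
        if pending:
            restarts += 1
            if restarts > max_restarts:
                raise RuntimeError("runner aborts repeatedly; giving up")
    return results


def fake_runner(crashing):
    """Build a stand-in runner that dies on any test named in `crashing`."""
    def run_batch(pending):
        finished = 0
        for name in pending:
            if name in crashing:
                return finished
            finished += 1
        return finished
    return run_batch


print(run_with_restarts(["a", "b", "c", "d"], fake_runner({"b"})))
```

The restart cap guards against a broken runner turning every remaining test into its own failed launch.
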
Instead of requiring `echo something > .stopfile`, make `touch .stopfile`
work as well. Refactor the code so that a single stopfile check aborts all tests.
It is cleaner to split the utilities for multi-tests into their own source
file. Since the file will only be used in run_mpitests.c, it is simpler to
just include it as static code.
Use alarm() to enforce timeouts.
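
The alarm-based timeout pattern can be sketched in Python with the standard `signal` module (Unix only, main thread only). This is an analogy to the C alarm() call, not the code in this commit; `TestTimeout` and `run_with_timeout` are hypothetical names.

```python
import signal
import time


class TestTimeout(Exception):
    """Raised inside the running test when the alarm fires."""


def _on_alarm(signum, frame):
    raise TestTimeout()


def run_with_timeout(func, seconds):
    """Run func(), giving up after `seconds` whole seconds."""
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)  # ask the OS to deliver SIGALRM later
    try:
        func()
        return "done"
    except TestTimeout:
        return "timeout"
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)


print(run_with_timeout(lambda: None, 1))
print(run_with_timeout(lambda: time.sleep(5), 1))
```

Cancelling the alarm in the `finally` block keeps a late signal from one test from interrupting the next one.
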
If any of the environment variables affect init, we need to run that test
individually so its settings do not affect the rest of the tests.
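
A sketch of that filter, assuming each test carries a dict of environment variables from its testlist entry. Which variables actually affect initialization is not stated here, so the set below contains a made-up placeholder name, and `must_run_individually` is a hypothetical helper.

```python
# Placeholder only: the real list of init-affecting variables is not given here.
INIT_SENSITIVE = {"EXAMPLE_INIT_SETTING"}


def must_run_individually(test_env, init_sensitive=INIT_SENSITIVE):
    """Return True if any variable set for this test could change init."""
    return any(name in init_sensitive for name in test_env)


print(must_run_individually({"EXAMPLE_INIT_SETTING": "1"}))
print(must_run_individually({"EXAMPLE_RUNTIME_SETTING": "1"}))
print(must_run_individually({}))
```

Tests flagged this way fall back to their own launch, so their settings never leak into a shared MPI_Init/MPI_Finalize window.
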
The only test in the attr folder that is not converted is attrend2, since that
test checks MPI_Finalize behavior.
Convert all tests used in testlist.cvar to mpitests framework.
@hzhou hzhou merged commit 8c9a8fa into pmodels:main Dec 20, 2024
4 checks passed
@hzhou hzhou deleted the 2411_mpitests branch December 20, 2024 03:28