test: convert testsuite to be function based #7221

hzhou · 2024-11-17T18:48:32Z

Pull Request Description

This is a split/renewal of #5725.

The current testsuite consists of thousands of individual MPI test programs. Running the entire testsuite involves invoking the process manager to spawn MPI processes, and each process goes through MPI_INIT again and again. Both process spawning and MPI initialization are very slow compared to the tested MPI operation itself. A single run of the current testsuite takes a couple of hours, and we run hundreds of them every day.

This PR attempts to convert individual tests into functions, so that multiple tests can run within a single MPI_Init/MPI_Finalize window. I believe this can significantly reduce the CI testing time.

Author Checklist

Provide Description
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits Follow Good Practice
Commits are self-contained and do not do two things at once.
Commit message is of the form: module: short description
Commit message explains what's in the commit.
Passes All Tests
Whitespace checker. Warnings test. Additional tests via comments.
Contribution Agreement
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your companies PR approval manager.

hzhou · 2024-11-18T19:27:48Z

test:mpich/custom ✔️

This significantly accelerates the converted collective tests.

Running tests in ./attr/testlist [20 tests - 00:00:00]
    run_mpitests np=1, 17 tests...
    run_mpitests np=2, 1 tests...
    run_mpitests np=4, 2 tests...
Running tests in ./attr/testlist.dtp [2 tests - 00:00:00]
    run_mpitests np=1, 2 tests...
Running tests in ./coll/testlist [89 tests - 00:00:00]
    run_mpitests np=1, 2 tests...
    run_mpitests np=2, 1 tests...
    run_mpitests np=3, 1 tests...
    run_mpitests np=4, 40 tests...
    run_mpitests np=5, 9 tests...
    run_mpitests np=6, 1 tests...
    run_mpitests np=7, 2 tests...
    run_mpitests np=8, 11 tests...
    run_mpitests np=10, 22 tests...
Running tests in ./coll/testlist.collalgo [1296 tests - 00:00:03]
    run_mpitests np=1, 6 tests...
    run_mpitests np=2, 6 tests...
    run_mpitests np=4, 428 tests...
    run_mpitests np=5, 191 tests...
    run_mpitests np=6, 10 tests...
    run_mpitests np=7, 159 tests...
    run_mpitests np=8, 103 tests...
    run_mpitests np=10, 393 tests...
Running tests in ./attr/testlist [3 tests - 00:00:59]
Running tests in ./coll/testlist [118 tests - 00:01:00]
Running tests in ./coll/testlist.collalgo [118 tests - 00:02:15]
Running tests in ./coll/testlist.dtp [12 tests - 00:02:43]
Running tests in ./comm/testlist [46 tests - 00:04:53]
...

hzhou · 2024-11-18T20:43:44Z

test:mpich/ch3/most
test:mpich/ch4/most

There are 2 TIMEOUTs in ch4-ofi-asan. I don't think they are related to this PR, but it is good that they prove this PR works.

Take ch4-ucx-asan for example. Before this PR:

Running tests in ./attr [00:00:00]
Running tests in ./coll [00:00:52]
Running tests in ./comm [01:36:07]

That is 1 hour 36 min to finish the collective tests. After this PR:

Running tests in ./attr/testlist [20 tests - 00:00:00]
    run_mpitests np=1, 17 tests...
    run_mpitests np=2, 1 tests...
    run_mpitests np=4, 2 tests...
Running tests in ./attr/testlist.dtp [2 tests - 00:00:04]
    run_mpitests np=1, 2 tests...
Running tests in ./coll/testlist [89 tests - 00:00:05]
    run_mpitests np=1, 2 tests...
    run_mpitests np=2, 1 tests...
    run_mpitests np=3, 1 tests...
    run_mpitests np=4, 40 tests...
    run_mpitests np=5, 9 tests...
    run_mpitests np=6, 1 tests...
    run_mpitests np=7, 2 tests...
    run_mpitests np=8, 11 tests...
    run_mpitests np=10, 22 tests...
Running tests in ./coll/testlist.collalgo [1296 tests - 00:00:20]
    run_mpitests np=1, 6 tests...
    run_mpitests np=2, 6 tests...
    run_mpitests np=4, 428 tests...
    run_mpitests np=5, 191 tests...
    run_mpitests np=6, 10 tests...
    run_mpitests np=7, 159 tests...
    run_mpitests np=8, 103 tests...
    run_mpitests np=10, 393 tests...
Running tests in ./attr/testlist [3 tests - 00:01:44]
Running tests in ./coll/testlist [117 tests - 00:01:48]
Running tests in ./coll/testlist.collalgo [118 tests - 00:06:18]
Running tests in ./coll/testlist.dtp [12 tests - 00:10:22]
Running tests in ./comm/testlist [46 tests - 00:16:10]

So 1:36 -> 0:16
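
As a rough check of that ratio, the sketch below converts the two elapsed-time stamps at which the `./comm` tests start (taken from the ch4-ucx-asan logs above) into seconds and divides them. The helper name `to_seconds` is made up for illustration, and the result is only an approximation because the stamps are cumulative elapsed times, not per-directory durations.

```python
def to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS elapsed-time stamp into seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


# Elapsed time when ./comm starts, i.e. when attr and coll are done.
before = to_seconds("01:36:07")  # before this PR
after = to_seconds("00:16:10")   # after this PR
speedup = before / after

print(f"{before} s -> {after} s, about {speedup:.1f}x faster")
```

So the attr and coll portion finishes roughly six times sooner in this one job.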

hzhou · 2024-11-19T22:20:19Z

test:mpich/whitespace

raffenet · 2024-11-22T20:30:39Z

I think my only question is how much this differs in timing from configuring MPICH --without-hwloc and skipping all the topology discovery stuff. If there are still substantial savings for most CI jobs, then this is probably worth it in the long run.

hzhou · 2024-11-22T20:37:04Z

Let's find out

test:mpich/custom
netmod: ch4:ofi
config: nohwloc

EDIT: oh, I need to run this against the main branch... running in #7204 (comment)

hzhou · 2024-11-25T15:51:24Z

> I think my only question is how much this differs in timing from configuring MPICH --without-hwloc and skipping all the topology discovery stuff. If there are still substantial savings for most CI jobs, then this is probably worth it in the long run.

https://jenkins-pmrs.cels.anl.gov/job/mpich-review-custom/1182/console

Running tests in ./attr [00:00:00]
Running tests in ./coll [00:00:09]
Running tests in ./comm [00:31:54]

So ~32min without hwloc.

raffenet

OK, I think the increased throughput is going to be a big enough benefit to warrant the changes. Let's do it.

Tests that use mpicolltest.h can be compiled with -DUSE_MTEST_NBC to become non-blocking tests. Compiling the same source multiple times with different macros does not fit the multi-tests framework, which runs multiple tests inside a single MPI_Init/MPI_Finalize window. Convert these tests to use an explicit option instead.

The new framework allows running multiple tests inside a single MPI_Init/MPI_Finalize window by giving each test a uniform function interface. Each test file defines a run function that runs the test. The test file is linked with a stub, util/run_mpitests.c, to create an individual test program that should work exactly as before. In addition, all the test files are linked together into the binary run_mpitests, which can be used to run multiple tests within a single MPI_Init/MPI_Finalize window. All such function-based tests are listed in test/mpi/maint/all_mpitests.txt. During autogen, gen_all_mpitests.py loads this file and generates all the Makefile targets. This commit does not modify runtests, so all tests should still work when run individually. The ability to run multiple tests from runtests is added in the next commits.
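
To illustrate the idea of a uniform function interface, here is a minimal sketch in Python rather than the C used by the testsuite. Everything in it is hypothetical: the registry, the `register` decorator, the test names, and the "returns an error count" convention are assumptions made for illustration, and the `events` list merely stands in for where MPI_Init and MPI_Finalize would be called.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical uniform interface: a test takes its argument string and
# returns the number of errors it found (0 means pass).
TestFn = Callable[[str], int]

REGISTRY: Dict[str, TestFn] = {}
events: List[str] = []  # records when the shared setup/teardown happens


def register(name: str) -> Callable[[TestFn], TestFn]:
    """Decorator that adds a test function to the registry under `name`."""
    def wrap(fn: TestFn) -> TestFn:
        REGISTRY[name] = fn
        return fn
    return wrap


@register("always_pass")
def always_pass(args: str) -> int:
    return 0


@register("count_tokens")
def count_tokens(args: str) -> int:
    # Toy test: pretend every whitespace-separated token is one error.
    return len(args.split())


def run_many(requests: List[Tuple[str, str]]) -> Dict[str, int]:
    """Run several (name, args) requests inside one setup/teardown window."""
    events.append("init")          # stands in for MPI_Init
    try:
        return {name: REGISTRY[name](args) for name, args in requests}
    finally:
        events.append("finalize")  # stands in for MPI_Finalize


results = run_many([("always_pass", ""), ("count_tokens", "a b")])
```

The point of the design is that the expensive setup happens once per batch while each test still reports its own result.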

Filter the test list to find all tests that can be run using run_mpitests, and run them first. Tests are grouped by testlist and number of processes. runtests controls the running tests through input/output pipes, so we still have granular control of individual tests. When run_mpitests aborts, e.g. due to a segfault, restart it with the next test in the testlist. Track the number of such restarts and abort if something systematic, for example an error in run_mpitests itself, causes it to fail repeatedly.
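
The grouping and restart logic described above can be sketched as follows. This is a Python illustration, not the actual runtests code: `group_tests`, `run_group`, the `run_batch` callback, the result labels, and the `max_restarts` default are all assumptions. `run_batch(pending)` models one run_mpitests process and returns how many tests completed before the process died.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def group_tests(tests: List[Tuple[str, str, int]]) -> Dict[Tuple[str, int], List[str]]:
    """Group (testlist, name, np) entries by (testlist, np), keeping order."""
    groups: Dict[Tuple[str, int], List[str]] = defaultdict(list)
    for testlist, name, np in tests:
        groups[(testlist, np)].append(name)
    return dict(groups)


def run_group(names: List[str],
              run_batch: Callable[[List[str]], int],
              max_restarts: int = 3) -> Dict[str, str]:
    """Run one group, restarting after a crash with the next test in line."""
    results: Dict[str, str] = {}
    pending = list(names)
    restarts = 0
    while pending:
        done = run_batch(pending)  # tests completed before the process died
        for name in pending[:done]:
            results[name] = "pass"
        if done == len(pending):
            break
        results[pending[done]] = "crash"  # the test that took the process down
        pending = pending[done + 1:]
        restarts += 1
        if restarts > max_restarts:
            # Something systematic is wrong; give up on the rest.
            for name in pending:
                results[name] = "skipped"
            break
    return results


def crash_on(bad: set) -> Callable[[List[str]], int]:
    """Build a fake run_batch that dies on any test named in `bad`."""
    def run_batch(pending: List[str]) -> int:
        for i, name in enumerate(pending):
            if name in bad:
                return i
        return len(pending)
    return run_batch


groups = group_tests([("coll", "bcast", 4), ("coll", "reduce", 4), ("attr", "attrt", 2)])
one_crash = run_group(["a", "b", "c"], crash_on({"b"}))
```

With one crashing test, the tests before and after it still pass; with a batch runner that always crashes, the loop stops after `max_restarts` restarts.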

hzhou force-pushed the 2411_mpitests branch 2 times, most recently from 0c7f2f3 to 5a82dff Compare November 18, 2024 19:25

hzhou requested a review from raffenet November 18, 2024 20:12

raffenet approved these changes Dec 19, 2024

hzhou added 16 commits December 19, 2024 20:26

runtests: refactor test_get_lock

8a37b54

Refactor the code that checks and acquires the file lock into a routine. It is a common part of running a test. Wrapping it in a routine makes it easier to reuse.

mtest_common: add MTestArgListCreate_arg

bc3ada5

Add a version of MTestArgListCreate that parses a command line string.

mtest_common: support options such as -arg

ede53e6

Make -arg equivalent to -arg=1.
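
The two mtest_common changes above (parsing a command line string, and treating a bare `-arg` as `-arg=1`) can be illustrated with a short sketch. It is written in Python rather than C and is not the MTestArgListCreate implementation: the function name `parse_args` and the option names in the usage line are made up for illustration.

```python
import shlex
from typing import Dict


def parse_args(cmdline: str) -> Dict[str, str]:
    """Parse '-key=value' and bare '-key' tokens from a command line string."""
    opts: Dict[str, str] = {}
    for token in shlex.split(cmdline):
        if not token.startswith("-"):
            continue  # positional tokens are ignored in this sketch
        key, sep, value = token.lstrip("-").partition("=")
        # A bare option with no '=' is treated as if '=1' had been given.
        opts[key] = value if sep else "1"
    return opts


opts = parse_args("-nonblocking -count=10")
```

Here the hypothetical `-nonblocking` flag ends up with the value `"1"`, the same as if `-nonblocking=1` had been passed.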

runtests: refactor and fix stopfile mechanism

0ea450c

Make `touch .stopfile` work as well, not only `echo something > .stopfile`. Refactor the code so that a single stopfile check aborts all tests.

run_mpitests: split utilities into multi_tests.c

b75d2f2

It is cleaner to split the utilities for multi-tests into their own source file. Since the file will only be used in run_mpitests.c, it is simpler to just include it as static code.

run_mpitests: reset error handler before running test

73bd846

run_mpitests: add ability to set and reset CVARs

9d6b093

run_mpitests: add timeout

8d1ed0f

Use alarm() to enforce timeouts.
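
For readers unfamiliar with the technique, the sketch below shows an alarm-based timeout using Python's `signal.alarm`, which wraps the same POSIX `alarm()` call. It is only an illustration: run_mpitests is C code, and the commit message does not say what it does when the alarm fires, so the exception-based handling and the names `run_with_timeout` and `TestTimeout` are assumptions. The sketch is Unix-only and must run in the main thread.

```python
import signal
import time


class TestTimeout(Exception):
    """Raised by the SIGALRM handler when the time budget is exhausted."""


def _on_alarm(signum, frame):
    raise TestTimeout()


def run_with_timeout(fn, seconds: int):
    """Call fn(); return its result, or the string 'timeout' if it runs too long."""
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)  # ask the kernel to deliver SIGALRM after `seconds`
    try:
        return fn()
    except TestTimeout:
        return "timeout"
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)


fast = run_with_timeout(lambda: 0, 5)
slow = run_with_timeout(lambda: time.sleep(3), 1)
```

Cancelling the alarm in the `finally` block matters in a multi-test runner, since a leftover alarm from one test would otherwise fire during the next.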

runtests: check locking in run_mpitests

9f8f66d

runtests: run tests with init-time envs individually

4771427

If any of a test's environment variables affects init, we need to run that test individually so that its settings do not affect the rest of the tests.
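
A minimal sketch of that partitioning step, in Python and under stated assumptions: the function name, the test names, and the variable names are placeholders, and how runtests actually decides which variables affect init is not described here.

```python
from typing import Dict, Iterable, List, Set, Tuple


def split_by_init_env(tests: Iterable[Tuple[str, Dict[str, str]]],
                      init_time_envs: Set[str]) -> Tuple[List[str], List[str]]:
    """Split tests into (batchable, individual) based on their env settings."""
    batch: List[str] = []
    individual: List[str] = []
    for name, env in tests:
        if any(var in init_time_envs for var in env):
            individual.append(name)  # must get its own MPI_Init
        else:
            batch.append(name)       # safe to share an MPI_Init window
    return batch, individual


# Placeholder variable names, not real MPICH settings.
INIT_TIME_ENVS = {"EXAMPLE_INIT_TIME_VAR"}

batch, individual = split_by_init_env(
    [
        ("test_a", {}),
        ("test_b", {"EXAMPLE_INIT_TIME_VAR": "1"}),
        ("test_c", {"EXAMPLE_RUNTIME_VAR": "2"}),
    ],
    INIT_TIME_ENVS,
)
```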

test: convert attr/attrt to test function

f9efa26

test: convert rest of the attr tests to functions

78fea78

The only test in the attr folder that is not converted is attrend2, since it tests MPI_Finalize behavior.

test: convert some coll tests to test function

e14ad48

Convert all tests used in testlist.cvar to the mpitests framework.

hzhou force-pushed the 2411_mpitests branch from 5a82dff to e14ad48 Compare December 20, 2024 02:26

hzhou merged commit 8c9a8fa into pmodels:main Dec 20, 2024
4 checks passed

hzhou deleted the 2411_mpitests branch December 20, 2024 03:28
