Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cygwin: signal: Do not handle signal when __SIGFLUSHFAST is sent #251

Merged
merged 1 commit into from
Dec 25, 2024

Conversation

dscho
Copy link
Collaborator

@dscho dscho commented Dec 25, 2024

After the commit d243e51, zsh sometimes hangs at startup. This occurs because SIGCHLD, which should trigger sigsuspend(), is handled in cygwait() that is used to wait for a wakeup event in sig_send(), even when __SIGFLUSHFAST is sent. Despite __SIGFLUSHFAST being required to return before handling the signal, this does not happen. With this patch, if the signal currently being sent is __SIGFLUSHFAST, do not handle the received signal and keep it asserted after the cygwait() for the wakeup event. Apply the same logic to the cygwait() in the retrying loop for WriteFile() as well.

Applied-from: https://inbox.sourceware.org/cygwin-patches/20241223013332.1269-1-takashi.yano@nifty.ne.jp Addresses: https://cygwin.com/pipermail/cygwin/2024-December/256954.html Fixes: d243e51 ("Cygwin: signal: Fix deadlock between main thread and sig thread")
Reported-by: Daisuke Fujimura booleanlabel@gmail.com
Reviewed-by:

@dscho dscho requested a review from jeremyd2019 December 25, 2024 18:27
@dscho dscho self-assigned this Dec 25, 2024
@dscho dscho mentioned this pull request Dec 25, 2024
Copy link
Member

@jeremyd2019 jeremyd2019 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This member was renamed on master but not in 3.5 branch

winsup/cygwin/sigproc.cc Outdated Show resolved Hide resolved
winsup/cygwin/sigproc.cc Outdated Show resolved Hide resolved
After the commit d243e51, zsh sometimes hangs at startup. This
occurs because SIGCHLD, which should trigger sigsuspend(), is handled
in cygwait() that is used to wait for a wakeup event in sig_send(),
even when __SIGFLUSHFAST is sent. Despite __SIGFLUSHFAST being
required to return before handling the signal, this does not happen.
With this patch, if the signal currently being sent is __SIGFLUSHFAST,
do not handle the received signal and keep it asserted after the
cygwait() for the wakeup event.  Apply the same logic to the cygwait()
in the retrying loop for WriteFile() as well.

Applied-from: https://inbox.sourceware.org/cygwin-patches/20241223013332.1269-1-takashi.yano@nifty.ne.jp
Addresses: https://cygwin.com/pipermail/cygwin/2024-December/256954.html
Fixes: d243e51 ("Cygwin: signal: Fix deadlock between main thread and sig thread")
Reported-by: Daisuke Fujimura <booleanlabel@gmail.com>
Reviewed-by:
Signed-off-by: Takashi Yano <takashi.yano@nifty.ne.jp>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
@dscho dscho force-pushed the special-case-sigflushfast branch from 6326293 to f1affcc Compare December 25, 2024 20:36
@dscho
Copy link
Collaborator Author

dscho commented Dec 25, 2024

This member was renamed on master but not in 3.5 branch

Thanks @jeremyd2019! The build is now successful.

@jeremyd2019
Copy link
Member

I tested the msys-2.0.dll from the CI artifact on the arm64 autobuild, rebuilding imagemagick which hung 3 times in a row with 3.5.5-1, and it succeeded.

@jeremyd2019 jeremyd2019 merged commit 33799c2 into msys2:msys2-3.5.5 Dec 25, 2024
1 check passed
@jeremyd2019
Copy link
Member

@dscho dscho deleted the special-case-sigflushfast branch December 26, 2024 08:27
@dscho
Copy link
Collaborator Author

dscho commented Dec 26, 2024

Hrm. This is apparently still not good enough. I noticed that Git's test suite would time out after 6h with MSYS2 runtime v3.5.5, opened git-for-windows/msys2-runtime#80 to apply this here fix, and then triggered another Git test suite run. It looks like it is hanging again (albeit in less matrix jobs, which is good).

lazka added a commit to lazka/MSYS2-packages that referenced this pull request Dec 26, 2024
lazka added a commit to msys2/MSYS2-packages that referenced this pull request Dec 26, 2024
@jeremyd2019
Copy link
Member

Have you reported this to upstream Cygwin/Takashi?

@dscho
Copy link
Collaborator Author

dscho commented Dec 26, 2024

Have you reported this to upstream Cygwin/Takashi?

@jeremyd2019 No, I didn't have time to investigate yet... so I do not even have a reproducer.

@dscho
Copy link
Collaborator Author

dscho commented Dec 26, 2024

Have you reported this to upstream Cygwin/Takashi?

@jeremyd2019 No, I didn't have time to investigate yet... so I do not even have a reproducer.

I spent quite a while on trying to reproduce, and unsurprisingly it is hard. It is almost certainly a race condition, indicating that some sort of locking is needed and just isn't there.

I'll just jot down some notes to be able to jog my memory later:

The test script in Git's test suite that I zoomed in on is t7814-grep-recurse-submodules (because it should have completed as part of https://github.com/dscho/git-for-windows-automation/actions/runs/12502682476/job/34882008623 but has not).

I actually ran make -j15 GIT_PROVE_OPTS=-j15 GIT_TEST_OPTS=-iVx T='t7602-merge-octopus-many.sh t7004-tag.sh t3425-rebase-topology-merges.sh t3600-rm.sh t7814-grep-recurse-submodules.sh' prove with the still-not-quite-fixed MSYS2 runtime, first thing, and t7602 hung in its 18th test case, "grep using relative path".

Apparently this was "lucky" because re-running that test script individually did not cause a hang.

Then, I ran sh t7814-grep-recurse-submodules.sh -iVx --run=1,18 --stress-jobs=20 --stress-limit=20 (the --stress-* options are designed to catch test failures that are hard to reproduce and require either a high CPU load or resource depletion to reproduce more reliably). It was a bit hard to see from its output that after 5 full rounds of 20 concurrent runs, one of them did not actually finish. And a couple more rounds later (must have been somewhere between 2-5 more rounds), a second job got stuck, but even after 15 more rounds no other job got stuck.

Now, this is where it gets interesting: both of those jobs got stuck in the first "test case" (Git's test suite uses a home-grown test framework that lacks modern test frameworks' features like "setup" and "teardown" functions, therefore the first test case frequently serves as a common "setup" step and it is quite normal, unfortunately so, for the test cases to depend on side effects of previous test cases; As you can tell, I am absolutely not a fan of the way Git's test suite is architected), when running git submodule add ./submodule, which is very similar to the git -C parent submodule add ../sub call that had originally hung in the 18th test case of that test script.

The git submodule command is still implemented as a shell script, therefore it is a frequent guest in any season of the show called "MSYS2 runtime regression detected via Git".

When I had Ctrl+C'ed the hung test script and there were still a couple of left-over processes blocking the working directory of that test from being deleted, I had actually tried to make some sense of the process topology when that git -C parent submodule add ../sub command hung. There were the following processes:

  • winpid 34996 (parent no longer existed): D:\git-sdk-64-minimal\usr\bin\sh.exe /d/git-sdk-64/usr/src/git/bin-wrappers/git -C parent submodule add ../sub
  • winpid 32256 (parent 34996): D:\git-sdk-64\usr\src\git\git.exe -C parent submodule add ../sub
  • winpid 34620 (parent 32256): sh "D:/git-sdk-64/usr/src/git\\git-submodule" add ../sub
  • winpid 14252 (parent no longer existed): D:\git-sdk-64-minimal\usr\bin\sh.exe t7814-grep-recurse-submodules.sh -iVx
  • winpid 32904 (parent no longer existed): D:\git-sdk-64-minimal\usr\bin\sh.exe t7814-grep-recurse-submodules.sh -iVx
  • winpid 6668 (parent 32904): D:\git-sdk-64-minimal\usr\bin\sh.exe t7814-grep-recurse-submodules.sh -iVx

I had to stop, manually (via wmic process ... delete) both 14252 and 34620 (but not 6668 nor any of the listed parent processes) for the worktree to be unblocked from being deleted.

This is an unsatisfying place to take a break from my investigation, but it'll have to do for now.

Potential next steps:

  • do some set -x style debugging
  • try to reproduce with a single script that simply adds a submodule, then removes it, then repeats, over and over again.
  • try to bisect when this started (although historically, it is my experience that it is unlikely to help because the commits I found typically had commit messages that left me as puzzled as before, and when I asked for clarification from the author, I got even less clear explanations). But at least it might be possible to revert those patches for now specifically for Git for Windows because we're super late in the v2.48.0 release cycle and I had actually intended to take some time off from work during my vacation.

@jeremyd2019
Copy link
Member

The git submodule command is still implemented as a shell script, therefore it is a frequent guest in any season of the show called "MSYS2 runtime regression detected via Git".

When I had Ctrl+C'ed the hung test script and there were still a couple of left-over processes blocking the working directory of that test from being deleted, I had actually tried to make some sense of the process topology when that git -C parent submodule add ../sub command hung. There were the following processes:

  • winpid 34996 (parent no longer existed): D:\git-sdk-64-minimal\usr\bin\sh.exe /d/git-sdk-64/usr/src/git/bin-wrappers/git -C parent submodule add ../sub
  • winpid 32256 (parent 34996): D:\git-sdk-64\usr\src\git\git.exe -C parent submodule add ../sub
  • winpid 34620 (parent 32256): sh "D:/git-sdk-64/usr/src/git\\git-submodule" add ../sub
  • winpid 14252 (parent no longer existed): D:\git-sdk-64-minimal\usr\bin\sh.exe t7814-grep-recurse-submodules.sh -iVx
  • winpid 32904 (parent no longer existed): D:\git-sdk-64-minimal\usr\bin\sh.exe t7814-grep-recurse-submodules.sh -iVx
  • winpid 6668 (parent 32904): D:\git-sdk-64-minimal\usr\bin\sh.exe t7814-grep-recurse-submodules.sh -iVx

I had to stop, manually (via wmic process ... delete) both 14252 and 34620 (but not 6668 nor any of the listed parent processes) for the worktree to be unblocked from being deleted.

This is odd, so far the issues that this patch helped could be killed with SIGTERM within Cygwin.

@lazka
Copy link
Member

lazka commented Dec 27, 2024

There were two stuck jobs here right now: https://github.com/msys2/msys2-autobuild/actions/runs/12514784401/job/34911395541

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants