maintenance daemon crashing when dropping extension #4321

Closed
pykello opened this issue Nov 18, 2020 · 7 comments · Fixed by #4561

@pykello (Contributor) commented Nov 18, 2020

When running check-columnar, the am_drop test crashes when it executes DROP EXTENSION citus CASCADE; with the following stack trace:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f4b60e44859 in __GI_abort () at abort.c:79
#2  0x0000563faa15363a in ExceptionalCondition (conditionName=conditionName@entry=0x7f4b5dc4ab3b "!(found)", 
    errorType=errorType@entry=0x7f4b5dc42120 "FailedAssertion", 
    fileName=fileName@entry=0x7f4b5dc67553 "utils/maintenanced.c", lineNumber=lineNumber@entry=185) at assert.c:54
#3  0x00007f4b5dc2a7ee in InitializeMaintenanceDaemonBackend () at utils/maintenanced.c:185
#4  0x00007f4b5db9e0cd in StartupCitusBackend () at shared_library_init.c:426
#5  0x00007f4b5dbdd0a5 in CitusHasBeenLoaded () at metadata/metadata_cache.c:1652
#6  0x00007f4b5dc2aeb7 in CitusMaintenanceDaemonMain (main_arg=<optimized out>) at utils/maintenanced.c:517
#7  0x0000563fa9f92b8d in StartBackgroundWorker () at bgworker.c:834
#8  0x0000563fa9fa05f3 in do_start_bgworker (rw=<optimized out>) at postmaster.c:5770
#9  maybe_start_bgworkers () at postmaster.c:5996
#10 0x0000563fa9fa105c in sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5167
#11 <signal handler called>
#12 0x00007f4b60f370da in __GI___select (nfds=nfds@entry=6, readfds=readfds@entry=0x7fff4c324a70, 
    writefds=writefds@entry=0x0, exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7fff4c3249d0)
    at ../sysdeps/unix/sysv/linux/select.c:41
#13 0x0000563fa9fa18ce in ServerLoop () at postmaster.c:1668
#14 0x0000563fa9fa3512 in PostmasterMain (argc=5, argv=<optimized out>) at postmaster.c:1377
#15 0x0000563fa9cca651 in main (argc=5, argv=0x563fab1c36e0) at main.c:228

This doesn't happen every time, maybe once in every five runs.
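
For reference, the conditionName "!(found)" in frame #2 suggests the failing check is an Assert(found) right after a lookup in InitializeMaintenanceDaemonBackend(). The following is a minimal, self-contained sketch of that failure mode (a backend asserts that its entry still exists in a shared table even though the entry was removed concurrently, e.g. by DROP EXTENSION). It is not the actual Citus source; all names in it are invented for illustration.

/*
 * Simplified stand-in for the pattern implied by the backtrace: look up our
 * own entry in a shared table and assert that it exists, while a concurrent
 * DROP EXTENSION has already removed it.  Compile with assertions enabled
 * (no -DNDEBUG) to see the abort.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct DaemonEntry
{
	unsigned int databaseId;
	bool inUse;					/* false once the entry has been removed */
} DaemonEntry;

/* stand-in for the shared maintenance-daemon hash */
static DaemonEntry daemonTable[8];

static DaemonEntry *
daemon_table_lookup(unsigned int databaseId, bool *found)
{
	for (int i = 0; i < 8; i++)
	{
		if (daemonTable[i].inUse && daemonTable[i].databaseId == databaseId)
		{
			*found = true;
			return &daemonTable[i];
		}
	}

	*found = false;
	return NULL;
}

int
main(void)
{
	bool found = false;

	/* the daemon registered itself for database 16384 ... */
	daemonTable[0] = (DaemonEntry) { .databaseId = 16384, .inUse = true };

	/* ... but a concurrent DROP EXTENSION removed the entry again */
	daemonTable[0].inUse = false;

	(void) daemon_table_lookup(16384, &found);

	/* equivalent of the failing Assert(found) in the backtrace */
	assert(found);

	printf("entry found\n");
	return 0;
}

If this reading is right, the crash is purely a timing question: it only shows up when the removal wins the race, which would match it failing roughly once in five runs.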

@metdos added this to the 10.0 Release milestone Nov 18, 2020
@metdos added the bug label Nov 18, 2020
@marcocitus (Member) commented:

I see it occasionally (not consistently) in the regression tests. I suspect we mainly need to remove the failing Assert.
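
In terms of the simplified sketch after the backtrace above (not the real maintenanced.c, and not necessarily what ended up in #4561), removing the failing assertion would roughly mean treating a missing entry as a normal condition instead of an invariant:

-	/* equivalent of the failing Assert(found) in the backtrace */
-	assert(found);
+	if (!found)
+	{
+		/* entry was removed concurrently (e.g. by DROP EXTENSION); bail out */
+		printf("entry already removed; nothing to do\n");
+		return 0;
+	}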

@agedemenli (Contributor) commented:

> I see it occasionally (not consistently) in the regression tests. I suspect we mainly need to remove the failing Assert.

To see what is failing, I tried to reproduce it 30+ times, but it never failed.

@pykello (Contributor, Author) commented Jan 22, 2021

> I see it occasionally (not consistently) in the regression tests. I suspect we mainly need to remove the failing Assert.
>
> To see what is failing, I tried to reproduce it 30+ times, but it never failed.

This is probably because the CitusHasBeenLoaded() call in CitusMaintenanceDaemonMain() only happens once every few seconds, so its timing can differ between machines.

When I apply the following patch so that the call happens more often, I see the crash more often:

diff --git a/src/backend/distributed/utils/maintenanced.c b/src/backend/distributed/utils/maintenanced.c
index 5312663e2..212545ad3 100644
--- a/src/backend/distributed/utils/maintenanced.c
+++ b/src/backend/distributed/utils/maintenanced.c
@@ -397,6 +397,18 @@ CitusMaintenanceDaemonMain(Datum main_arg)
 
 		CitusTableCacheFlushInvalidatedEntries();
 
+		InvalidateMetadataSystemCache();
+		StartTransactionCommand();
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		if (CheckCitusVersion(DEBUG1))
+			CitusHasBeenLoaded();
+
+		PopActiveSnapshot();
+		CommitTransactionCommand();
+
+		continue;
+
 		/*
 		 * XXX: Each task should clear the metadata cache before every iteration
 		 * by calling InvalidateMetadataSystemCache(), because otherwise it

Of course we won't want to merge the above patch, but it makes a crash with a similar cause to the original one happen more often. Can you try it and see if you can reproduce it?

I'm not sure if it matters, but my ./configure flags for postgres were:

'--enable-cassert' '--enable-debug' '--enable-depend' '--enable-nls' '--with-openssl' '--with-llvm' '--with-icu'

Also note that I only see this at am_drop when running make -C src/test/regress/ check-columnar; I don't see it in other tests.

@marcocitus (Member) commented:

> #6 0x00007f4b5dc2aeb7 in CitusMaintenanceDaemonMain (main_arg=<optimized out>) at utils/maintenanced.c:517

@pykello any thoughts on which commit this refers to? I'm wondering whether the maintenance daemon is calling CitusHasBeenLoaded without holding a lock on the Citus extension.
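
For reference, a minimal sketch of what "holding a lock on the Citus extension" around that check could look like, using PostgreSQL's generic object-lock API (LockDatabaseObject() on the pg_extension entry). This is only an illustration of the idea and assumes it runs inside a transaction; the helper name is invented, and this is not the code Citus actually runs or the eventual fix.

#include "postgres.h"

#include "catalog/pg_extension.h"
#include "commands/extension.h"
#include "storage/lmgr.h"

/*
 * Hypothetical helper: check that the citus extension still exists and lock
 * its pg_extension object so that a concurrent DROP EXTENSION (which takes
 * AccessExclusiveLock on the same object) cannot remove it while the caller
 * inspects extension metadata.  Must be called inside a transaction.
 */
static bool
CitusExtensionExistsAndIsLocked(void)
{
	/* look up the extension by name; InvalidOid means it is already gone */
	Oid extensionOid = get_extension_oid("citus", /* missing_ok */ true);

	if (!OidIsValid(extensionOid))
		return false;

	/* block against a concurrent DROP EXTENSION citus */
	LockDatabaseObject(ExtensionRelationId, extensionOid, 0, AccessShareLock);

	/* re-check: the extension may have been dropped before we got the lock */
	return OidIsValid(get_extension_oid("citus", true));
}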

@pykello (Contributor, Author) commented Jan 26, 2021

@marcocitus I can reproduce this on the master branch on my local machine.

Even with an extension lock added to the above diff, I still get the assertion failure consistently.

@onurctirtir (Member) commented:

Yesterday's valgrind test crashed in multi_extension when executing DROP EXTENSION citus:

https://github.com/citusdata/release-test-results/blob/onur-valgrind-02-03-2021/02_03_2021_1612344724/periodic_job_results/02_03_2021_1612344724/regression.diffs#L395..L399

(Unfortunately we don't have a core dump, and I couldn't reproduce the issue even after re-running the valgrind tests 3 times.)

I know we merged a fix for that, but could this be related?

@onurctirtir (Member) commented Mar 1, 2021

> Yesterday's valgrind test crashed in multi_extension when executing DROP EXTENSION citus:
>
> https://github.com/citusdata/release-test-results/blob/onur-valgrind-02-03-2021/02_03_2021_1612344724/periodic_job_results/02_03_2021_1612344724/regression.diffs#L395..L399
>
> (Unfortunately we don't have a core dump, and I couldn't reproduce the issue even after re-running the valgrind tests 3 times.)
>
> I know we merged a fix for that, but could this be related?

I guess we hit the same issue again, but we still don't have a core dump :/

From citusdata/test-automation#204:

> We should collect stack traces from core dumps if any crash happens

I think it would be better to implement ^ ASAP.


https://github.com/citusdata/release-test-results/blob/citusbot_valgrind_test_resource_group/03_01_2021_1614591130/periodic_job_results/03_01_2021_1614591130/regression.diffs#L408..L412
