maintenance daemon crashing when dropping extension #4321

Closed
pykello opened this issue Nov 18, 2020 · 7 comments · Fixed by #4561

@pykello (Contributor) commented Nov 18, 2020

When running check-columnar, the am_drop test crashes when it executes DROP EXTENSION citus CASCADE; with the following stack trace:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f4b60e44859 in __GI_abort () at abort.c:79
#2  0x0000563faa15363a in ExceptionalCondition (conditionName=conditionName@entry=0x7f4b5dc4ab3b "!(found)", 
    errorType=errorType@entry=0x7f4b5dc42120 "FailedAssertion", 
    fileName=fileName@entry=0x7f4b5dc67553 "utils/maintenanced.c", lineNumber=lineNumber@entry=185) at assert.c:54
#3  0x00007f4b5dc2a7ee in InitializeMaintenanceDaemonBackend () at utils/maintenanced.c:185
#4  0x00007f4b5db9e0cd in StartupCitusBackend () at shared_library_init.c:426
#5  0x00007f4b5dbdd0a5 in CitusHasBeenLoaded () at metadata/metadata_cache.c:1652
#6  0x00007f4b5dc2aeb7 in CitusMaintenanceDaemonMain (main_arg=<optimized out>) at utils/maintenanced.c:517
#7  0x0000563fa9f92b8d in StartBackgroundWorker () at bgworker.c:834
#8  0x0000563fa9fa05f3 in do_start_bgworker (rw=<optimized out>) at postmaster.c:5770
#9  maybe_start_bgworkers () at postmaster.c:5996
#10 0x0000563fa9fa105c in sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5167
#11 <signal handler called>
#12 0x00007f4b60f370da in __GI___select (nfds=nfds@entry=6, readfds=readfds@entry=0x7fff4c324a70, 
    writefds=writefds@entry=0x0, exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7fff4c3249d0)
    at ../sysdeps/unix/sysv/linux/select.c:41
#13 0x0000563fa9fa18ce in ServerLoop () at postmaster.c:1668
#14 0x0000563fa9fa3512 in PostmasterMain (argc=5, argv=<optimized out>) at postmaster.c:1377
#15 0x0000563fa9cca651 in main (argc=5, argv=0x563fab1c36e0) at main.c:228

This doesn't happen every time, maybe once in every five runs.
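
For reference, the conditionName "!(found)" in frame #2 suggests the failing check is an Assert(found) right after a lookup in InitializeMaintenanceDaemonBackend(). The following is a minimal, self-contained sketch of that failure mode (a backend asserts that its entry still exists in a shared table even though the entry was removed concurrently, e.g. by DROP EXTENSION). It is not the actual Citus source; all names in it are invented for illustration.

/*
 * Simplified stand-in for the pattern implied by the backtrace: look up our
 * own entry in a shared table and assert that it exists, while a concurrent
 * DROP EXTENSION has already removed it.  Compile with assertions enabled
 * (no -DNDEBUG) to see the abort.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct DaemonEntry
{
	unsigned int databaseId;
	bool inUse;					/* false once the entry has been removed */
} DaemonEntry;

/* stand-in for the shared maintenance-daemon hash */
static DaemonEntry daemonTable[8];

static DaemonEntry *
daemon_table_lookup(unsigned int databaseId, bool *found)
{
	for (int i = 0; i < 8; i++)
	{
		if (daemonTable[i].inUse && daemonTable[i].databaseId == databaseId)
		{
			*found = true;
			return &daemonTable[i];
		}
	}

	*found = false;
	return NULL;
}

int
main(void)
{
	bool found = false;

	/* the daemon registered itself for database 16384 ... */
	daemonTable[0] = (DaemonEntry) { .databaseId = 16384, .inUse = true };

	/* ... but a concurrent DROP EXTENSION removed the entry again */
	daemonTable[0].inUse = false;

	(void) daemon_table_lookup(16384, &found);

	/* equivalent of the failing Assert(found) in the backtrace */
	assert(found);

	printf("entry found\n");
	return 0;
}

If this reading is right, the crash is purely a timing question: it only shows up when the removal wins the race, which would match it failing roughly once in five runs.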

@metdos added this to the 10.0 Release milestone Nov 18, 2020
@metdos added the bug label Nov 18, 2020
@marcocitus (Member) commented:

I see it occasionally (not consistently) in the regression tests. I suspect we mainly need to remove the failing Assert.
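
In terms of the simplified sketch after the backtrace above (not the real maintenanced.c, and not necessarily what ended up in #4561), removing the failing assertion would roughly mean treating a missing entry as a normal condition instead of an invariant:

-	/* equivalent of the failing Assert(found) in the backtrace */
-	assert(found);
+	if (!found)
+	{
+		/* entry was removed concurrently (e.g. by DROP EXTENSION); bail out */
+		printf("entry already removed; nothing to do\n");
+		return 0;
+	}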

@agedemenli (Contributor) commented:

> I see it occasionally (not consistently) in the regression tests. I suspect we mainly need to remove the failing Assert.

To see what is failing, I tried to reproduce it 30+ times, but it never failed.

@pykello (Contributor, Author) commented Jan 22, 2021

> I see it occasionally (not consistently) in the regression tests. I suspect we mainly need to remove the failing Assert.
>
> To see what is failing, I tried to reproduce it 30+ times, but it never failed.

This is probably because the CitusHasBeenLoaded() call in CitusMaintenanceDaemonMain() only happens once every few seconds, so its timing can differ between machines.

When I apply the following patch so that the call happens more often, I see the crash more often:

diff --git a/src/backend/distributed/utils/maintenanced.c b/src/backend/distributed/utils/maintenanced.c
index 5312663e2..212545ad3 100644
--- a/src/backend/distributed/utils/maintenanced.c
+++ b/src/backend/distributed/utils/maintenanced.c
@@ -397,6 +397,18 @@ CitusMaintenanceDaemonMain(Datum main_arg)
 
 		CitusTableCacheFlushInvalidatedEntries();
 
+		InvalidateMetadataSystemCache();
+		StartTransactionCommand();
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		if (CheckCitusVersion(DEBUG1))
+			CitusHasBeenLoaded();
+
+		PopActiveSnapshot();
+		CommitTransactionCommand();
+
+		continue;
+
 		/*
 		 * XXX: Each task should clear the metadata cache before every iteration
 		 * by calling InvalidateMetadataSystemCache(), because otherwise it

Of course we won't want to merge the above patch, but it makes a crash with a similar cause to the original one happen more often. Can you try it and see if you can reproduce it?

I'm not sure if it matters, but my ./configure flags for postgres were:

'--enable-cassert' '--enable-debug' '--enable-depend' '--enable-nls' '--with-openssl' '--with-llvm' '--with-icu'

Also note that I only see this at am_drop when running make -C src/test/regress/ check-columnar; I don't see it in other tests.

@marcocitus (Member) commented:

> #6 0x00007f4b5dc2aeb7 in CitusMaintenanceDaemonMain (main_arg=<optimized out>) at utils/maintenanced.c:517

@pykello any thoughts on which commit this refers to? I'm wondering whether the maintenance daemon is calling CitusHasBeenLoaded without holding a lock on the Citus extension.
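
For reference, a minimal sketch of what "holding a lock on the Citus extension" around that check could look like, using PostgreSQL's generic object-lock API (LockDatabaseObject() on the pg_extension entry). This is only an illustration of the idea and assumes it runs inside a transaction; the helper name is invented, and this is not the code Citus actually runs or the eventual fix.

#include "postgres.h"

#include "catalog/pg_extension.h"
#include "commands/extension.h"
#include "storage/lmgr.h"

/*
 * Hypothetical helper: check that the citus extension still exists and lock
 * its pg_extension object so that a concurrent DROP EXTENSION (which takes
 * AccessExclusiveLock on the same object) cannot remove it while the caller
 * inspects extension metadata.  Must be called inside a transaction.
 */
static bool
CitusExtensionExistsAndIsLocked(void)
{
	/* look up the extension by name; InvalidOid means it is already gone */
	Oid extensionOid = get_extension_oid("citus", /* missing_ok */ true);

	if (!OidIsValid(extensionOid))
		return false;

	/* block against a concurrent DROP EXTENSION citus */
	LockDatabaseObject(ExtensionRelationId, extensionOid, 0, AccessShareLock);

	/* re-check: the extension may have been dropped before we got the lock */
	return OidIsValid(get_extension_oid("citus", true));
}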

@pykello (Contributor, Author) commented Jan 26, 2021

@marcocitus I can reproduce this on the master branch on my local machine.

Even with an extension lock added to the above diff, I still get the assertion failure consistently.

@onurctirtir (Member) commented:

Yesterday's valgrind test crashed in multi_extension when executing DROP EXTENSION citus:

https://github.com/citusdata/release-test-results/blob/onur-valgrind-02-03-2021/02_03_2021_1612344724/periodic_job_results/02_03_2021_1612344724/regression.diffs#L395..L399

(Unfortunately we don't have a core dump, and I couldn't reproduce the issue even after re-running the valgrind tests 3 times.)

I know we merged a fix for that, but could this be related?

@onurctirtir (Member) commented Mar 1, 2021

> Yesterday's valgrind test crashed in multi_extension when executing DROP EXTENSION citus:
>
> https://github.com/citusdata/release-test-results/blob/onur-valgrind-02-03-2021/02_03_2021_1612344724/periodic_job_results/02_03_2021_1612344724/regression.diffs#L395..L399
>
> (Unfortunately we don't have a core dump, and I couldn't reproduce the issue even after re-running the valgrind tests 3 times.)
>
> I know we merged a fix for that, but could this be related?

I guess we hit the same issue again, but we still don't have a core dump :/

From citusdata/test-automation#204:

> We should collect stack traces from core dumps if any crash happens

I think it would be better to implement ^ ASAP.


https://github.com/citusdata/release-test-results/blob/citusbot_valgrind_test_resource_group/03_01_2021_1614591130/periodic_job_results/03_01_2021_1614591130/regression.diffs#L408..L412
