
Add an end-to-end test for trim gap handling using snapshots #2463

Merged
merged 13 commits into from
Jan 8, 2025

Conversation

Contributor

@pcholakov pcholakov commented Dec 31, 2024

This test simulates a trim gap and verifies the behavior with and without a suitable snapshot present to enable fast-forward over the gap.

This is a follow-up to #2456 adding an e2e test for snapshot-based fast-forward over a log trim gap.

There are several to-dos here that require deeper changes - I'd like to do those as separate PRs to avoid delaying merging of trim-gap support itself. At a minimum this includes:

  • the create-snapshot admin API should return the min captured LSN of snapshots
  • the trim admin API should include the effective new trim point; currently BifrostAdmin can decide to no-op the request if the trim point is greater than the global tail it knows about, which makes it hard to test
  • [optional] we don't have a good way (that I'm aware of) to externally ask a specific partition processor to become leader; this would be useful for testing and potentially manual operations

Primary reviewer: @tillrohrmann

cc: @jackkleeman as an optional reviewer, since I modified some test cluster infra and a test you previously added - but feel free to ignore!


github-actions bot commented Dec 31, 2024

Test Results

7 files ±0 · 7 suites ±0 · 4m 29s ⏱️ +8s
47 tests ±0: 46 ✅ ±0, 1 💤 ±0, 0 ❌ ±0
182 runs ±0: 179 ✅ ±0, 3 💤 ±0, 0 ❌ ±0

Results for commit 9592d6a. ± Comparison against base commit d55eee4.


@pcholakov pcholakov force-pushed the feat/trim-gap-e2e-test branch from 12b10fb to 713e010 Compare December 31, 2024 16:18
@@ -749,6 +749,30 @@ impl StartedNode {
}
}

impl Drop for StartedNode {
fn drop(&mut self) {
Contributor Author

I added this to avoid leaking restate-server processes from tests.
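A minimal sketch of the idea, assuming a wrapper struct that owns the spawned `std::process::Child` (the real `StartedNode` holds more state and signals via nix; names below are illustrative):

```rust
use std::process::{Child, Command};

// Hypothetical stand-in for StartedNode: owns the spawned server process and
// kills it on drop so a failed test cannot leak a running restate-server.
struct StartedNode {
    process: Child,
}

impl Drop for StartedNode {
    fn drop(&mut self) {
        // Errors are ignored: the process may have exited on its own already.
        let _ = self.process.kill();
        let _ = self.process.wait(); // reap the child to avoid a zombie
    }
}

fn spawn_node() -> std::io::Result<StartedNode> {
    // `sleep` stands in for the restate-server binary in this sketch.
    Ok(StartedNode {
        process: Command::new("sleep").arg("30").spawn()?,
    })
}
```

Dropping the handle runs during panic unwinding too, which is why the test macro choice discussed below matters: a macro that aborts the process on panic skips these destructors.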


mod common;

#[tokio::test]
Contributor Author

I am using this rather than test(restate_core::test) because that macro just exits the process on panic, which prevents unwinding from running - and can lead to leaked spawned processes on test failure.

Contributor

I think the reason we have this panic hook is to ensure that if a panic occurs within a spawned task, the tests will fail. Otherwise, the panic might just be swallowed by the task.

Contributor Author

Ah! Of course; I recall the discussion now - switched to #[test_log::test(tokio::test)] as a middle ground to ensure that the Drop callback works.

Contributor Author

The new trim gap fast-forward test covers the same paths as this one.

@@ -34,6 +34,7 @@ description = "Restate makes distributed applications easy!"
[workspace.dependencies]
# Own crates
codederror = { path = "crates/codederror" }
mock-service-endpoint = { path = "tools/mock-service-endpoint" }
Contributor Author

With this PR, the mock service handler is now usable from other packages - this is handy for e2e testing.

@@ -89,7 +89,6 @@ pub struct TestEnv {
pub loglet: Arc<dyn Loglet>,
pub metadata_writer: MetadataWriter,
pub metadata_store_client: MetadataStoreClient,
pub cluster: StartedCluster,
Contributor Author

I removed passing the cluster to the test routine as it is easy to accidentally drop it, and kill the cluster in the process. We can reintroduce it as a reference if it's needed in the future.

@pcholakov pcholakov marked this pull request as ready for review January 2, 2025 16:32
@pcholakov pcholakov force-pushed the feat/trim-gap-e2e-test branch 2 times, most recently from e7bd6c7 to 071338c Compare January 3, 2025 13:56
Base automatically changed from feat/trim-gap-handling to main January 3, 2025 14:31
Contributor

@tillrohrmann tillrohrmann left a comment

Thanks for creating the end-to-end test for our snapshots @pcholakov. The changes look good to me.

The one aspect that makes me a bit uneasy is that it seems we cannot reliably guarantee that a trim has happened. If this is correct, then we might be adding a test which is unstable in our CI environment. Because of this, maybe it's worth first adding the functionality to report back which LSN was trimmed, so that we can make the trim_log function reliable?

pid,
);
match nix::sys::signal::kill(
nix::unistd::Pid::from_raw(pid.try_into().unwrap()),
Contributor

Is this try_into infallible or why is unwrap ok here?

Contributor

Why is it ok to unwrap here?

Contributor Author

I never answered you this - I blindly duplicated the kill implementation above without looking too closely; it appears this is completely safe. Tokio's Child::id() just passes through std::sys::pal::unix::process::Process::id's return type which is u32 but Process internally holds the pid as a pid_t = i32 and does a blind cast to u32 when returning:

https://doc.rust-lang.org/nightly/src/std/sys/pal/unix/process/process_unix.rs.html#943-945

I couldn't find any background on why, other than other people asking essentially the same question (https://users.rust-lang.org/t/std-id-vs-libc-pid-t-how-to-handle/78281/3).

Two interesting factoids I learned in the process 😀

I'll update this to use expect() with a comment before merging.
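A small sketch of what the checked conversion with `expect()` could look like (the helper name is mine, not from the PR):

```rust
// std returns the child's pid as u32, even though the kernel's pid_t is i32:
// std::sys::pal::unix::process stores an i32 and blindly casts to u32 on
// return. Kernel pids are small positive values, so converting back for
// nix::sys::signal::kill cannot fail in practice; expect() documents that
// reasoning at the call site instead of a bare unwrap().
fn pid_for_kill(pid: u32) -> i32 {
    i32::try_from(pid).expect("std returns a positive i32 pid cast to u32")
}
```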



Comment on lines 47 to 54
tracing_subscriber::fmt()
.event_format(tracing_subscriber::fmt::format().compact())
.with_env_filter(
tracing_subscriber::EnvFilter::builder()
.with_default_directive(LevelFilter::INFO.into())
.from_env_lossy(),
)
.init();
Contributor

If you use test_log::test(tokio::test), then you don't have to set these things up yourself.

Contributor Author

TY!

Comment on lines +204 to +202
// todo(pavel): promote node 3 to be the leader for partition 0 and invoke the service again
// right now, all we are asserting is that the new node is applying newly appended log records
Contributor

You could do this by manually changing the SchedulingPlan.

Contributor Author

I didn't think it would be this easy... and it seems like it isn't. I added a step to manually set the SchedulingPlan, but it only works intermittently - Scheduler::update_scheduling_plan nukes the changes as soon as it picks them up. I think this is important, so let's definitely do it, but maybe as a follow-up task to provide a leadership hint to the scheduler?

Contributor Author

Good news! With a bit of effort, I got this to work reliably - still a draft but will get it ready for review soon: #2471

State::Alive(s) => s
.partitions
.values()
.any(|p| p.effective_mode.cmp(&1).is_eq()),
Contributor

I think it is clearer if you compared against RunMode instead of the ordinal value which is harder to remember.

Contributor Author

The magic of try_from! Thanks for the tip :-)
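A sketch of the suggestion using a stand-in enum (prost-generated protobuf enums expose a similar `TryFrom<i32>`; the variant names and ordinals here are illustrative, not the real generated type):

```rust
// Stand-in for the generated RunMode protobuf enum.
#[derive(Debug, PartialEq)]
enum RunMode {
    Unknown,
    Leader,
    Follower,
}

impl TryFrom<i32> for RunMode {
    type Error = i32;
    fn try_from(value: i32) -> Result<Self, i32> {
        match value {
            0 => Ok(RunMode::Unknown),
            1 => Ok(RunMode::Leader),
            2 => Ok(RunMode::Follower),
            other => Err(other),
        }
    }
}

// Clearer than `p.effective_mode.cmp(&1).is_eq()`: the variant name says
// what ordinal 1 means, and unknown ordinals fall through to false.
fn is_leader(effective_mode: i32) -> bool {
    RunMode::try_from(effective_mode) == Ok(RunMode::Leader)
}
```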

Comment on lines 264 to 272
let mut i = 0;
loop {
client
.trim_log(TrimLogRequest {
log_id: 0,
trim_point,
})
.await?;
if i >= 2 {
break;
}
tokio::time::sleep(Duration::from_secs(1)).await;
i += 1;
}
Contributor

@tillrohrmann tillrohrmann Jan 3, 2025

How did you come up with the magic number of 3 attempts?

Contributor Author

Empirically! I think Azmy suggested it may be related to the heartbeat interval and updating the global tail. Moot now; I've converted this to a retry until the desired effective trim is reached.
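The retry-until-converged shape could look roughly like this (a synchronous, std-only sketch; `request_trim` stands in for the trim_log RPC, which after #2468 is assumed to report back the effective trim point):

```rust
use std::time::{Duration, Instant};

// Retry the trim request until the reported effective trim point reaches the
// target LSN, instead of issuing a magic number of attempts.
fn trim_until(
    mut request_trim: impl FnMut() -> u64, // returns the effective trim point
    target_lsn: u64,
    timeout: Duration,
) -> Result<u64, String> {
    let deadline = Instant::now() + timeout;
    loop {
        let effective = request_trim();
        if effective >= target_lsn {
            return Ok(effective);
        }
        if Instant::now() >= deadline {
            return Err(format!("trim stalled at lsn {effective}"));
        }
        std::thread::sleep(Duration::from_millis(100));
    }
}
```

This makes the test deterministic: it either observes the trim reaching the target LSN or fails with a clear message, rather than hoping three attempts were enough.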

Comment on lines 254 to 258
async fn trim_log(
client: &mut ClusterCtrlSvcClient<Channel>,
trim_point: u64,
) -> googletest::Result<()> {
// todo(pavel): this is flimsy, ensure we actually trim the log to a particular LSN
Contributor

If this method does not do anything because the admin node didn't have the up-to-date log tail, then the remaining test will be stuck. This might be a problem for the stability of the test. Something to observe on our CI infra, where timings can be quite skewed.

Contributor Author

Sorry, I created the wrong impression with the todo comment - I've rebased on #2468 which allows this to be deterministic :-)

Comment on lines 338 to 468
async fn grpc_connect(address: AdvertisedAddress) -> Result<Channel, tonic::transport::Error> {
match address {
AdvertisedAddress::Uds(uds_path) => {
// dummy endpoint required to specify an uds connector, it is not used anywhere
Endpoint::try_from("http://127.0.0.1")
.expect("/ should be a valid Uri")
.connect_with_connector(service_fn(move |_: Uri| {
let uds_path = uds_path.clone();
async move {
Ok::<_, io::Error>(TokioIo::new(UnixStream::connect(uds_path).await?))
}
})).await
}
AdvertisedAddress::Http(uri) => {
Channel::builder(uri)
.connect_timeout(Duration::from_secs(2))
.timeout(Duration::from_secs(2))
.http2_adaptive_window(true)
.connect()
.await
}
}
}
Contributor

This looks quite similar to create_tonic_channel_from_advertised_address. Could this be reused?

Contributor Author

@pcholakov pcholakov Jan 6, 2025

I copied it nearly verbatim from restatectl's grpc_connect utility - which looks like it may have been the origin of create_tonic_channel_from_advertised_address, too. I've done this under its own PR here:

#2469

@pcholakov pcholakov force-pushed the feat/trim-gap-e2e-test branch from 071338c to 446bd56 Compare January 6, 2025 13:31
@pcholakov pcholakov changed the base branch from main to feat/trim-log-report-lsn January 6, 2025 13:33
@pcholakov pcholakov force-pushed the feat/trim-gap-e2e-test branch from 60dbc4e to 446bd56 Compare January 6, 2025 16:06
@pcholakov pcholakov requested review from tillrohrmann and removed request for jackkleeman January 6, 2025 16:10
@pcholakov
Contributor Author

The one aspect that makes me a bit uneasy is that it seems we cannot reliably guarantee that a trim has happened. If this is correct, then we might be adding a test which is unstable in our CI environment. Because of this, maybe it's worth first adding the functionality to report back which LSN was trimmed, so that we can make the trim_log function reliable?

Yes, definitely! I was already working on that - I realize my todo might have created the wrong impression :-) Here is the change, on which this PR is now rebased: #2468.

I wasn't able to get the leadership change to work reliably but I'm pretty keen to do that too. However, I believe that the test as it stands should be reasonably robust to merge and won't cause undue noise in CI.

Contributor

@tillrohrmann tillrohrmann left a comment

Thanks for adding this end-to-end test for testing snapshots and trim gap handling @pcholakov. The changes look good to me :-) +1 for merging.


server/tests/trim_gap_handling.rs (outdated review comment, resolved)
async fn trim_log(
client: &mut ClusterCtrlSvcClient<Channel>,
trim_point: u64,
timeout: Duration,
Contributor

Nit: The timeout handling could also happen at the call site via tokio::time::timeout(timeout, trim_log(client, trim_point) if you want to keep the inner logic of this function a tad bit simpler and make it more compositional. The same applies to applied_lsn_converged.

Contributor Author

Good point, I'll refactor all the helpers where we don't need to track timeouts internally.
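The compositional shape the reviewer describes, as a std-only analogue (in the async test itself this would simply be `tokio::time::timeout(timeout, trim_log(client, trim_point))`; the thread/channel version below is just for illustration):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Keep helpers timeout-free and apply the deadline at the call site instead,
// so each helper stays simple and composes with any timeout policy.
fn with_timeout<T: Send + 'static>(
    timeout: Duration,
    f: impl FnOnce() -> T + Send + 'static,
) -> Result<T, &'static str> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone if we timed out; ignore send errors.
        let _ = tx.send(f());
    });
    rx.recv_timeout(timeout).map_err(|_| "timed out")
}
```

The same pattern applies to applied_lsn_converged: the polling loop only decides when it has converged, and the caller decides how long it is willing to wait.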

@pcholakov pcholakov force-pushed the feat/trim-gap-e2e-test branch from 446bd56 to a60f576 Compare January 7, 2025 14:11
@pcholakov pcholakov force-pushed the feat/trim-log-report-lsn branch from b3bc1c4 to 73943fb Compare January 7, 2025 15:11
@pcholakov pcholakov force-pushed the feat/trim-gap-e2e-test branch 5 times, most recently from 5200c2a to 81d3b2e Compare January 8, 2025 08:42
@pcholakov pcholakov force-pushed the feat/trim-gap-e2e-test branch from 81d3b2e to 9592d6a Compare January 8, 2025 13:48
@pcholakov pcholakov changed the base branch from feat/trim-log-report-lsn to main January 8, 2025 13:49
@pcholakov pcholakov merged commit 22fcef1 into main Jan 8, 2025
15 checks passed
@pcholakov pcholakov deleted the feat/trim-gap-e2e-test branch January 8, 2025 16:07