
Add an end-to-end test for trim gap handling using snapshots #2463

Merged
merged 13 commits into from
Jan 8, 2025

Conversation

Contributor

@pcholakov pcholakov commented Dec 31, 2024

This test simulates a trim gap and verifies the behavior with and without a suitable snapshot present to enable fast-forward over the gap.

This is a follow-up to #2456 adding an e2e test for snapshot-based fast-forward over a log trim gap.

There are several to-dos here that require deeper changes - I'd like to do those as separate PRs to avoid delaying merging of trim-gap support itself. At a minimum this includes:

  • the create-snapshot admin API should return the min captured LSN of snapshots
  • the trim admin API should include the effective new trim point; currently BifrostAdmin can decide to no-op the request if the trim point is greater than the global tail it knows about, which makes it hard to test
  • [optional] we don't have a good way (that I'm aware of) to externally ask a specific partition processor to become leader; this would be useful for testing and potentially manual operations

Primary reviewer: @tillrohrmann

cc: @jackkleeman as an optional reviewer, since I modified some test cluster infra and a test you previously added - but feel free to ignore!


github-actions bot commented Dec 31, 2024

Test Results

7 files ±0 · 7 suites ±0 · 4m 29s ⏱️ +8s
47 tests ±0: 46 ✅ ±0, 1 💤 ±0, 0 ❌ ±0
182 runs ±0: 179 ✅ ±0, 3 💤 ±0, 0 ❌ ±0

Results for commit 9592d6a. ± Comparison against base commit d55eee4.


@pcholakov pcholakov force-pushed the feat/trim-gap-e2e-test branch from 12b10fb to 713e010 Compare December 31, 2024 16:18
@@ -749,6 +749,30 @@ impl StartedNode {
}
}

impl Drop for StartedNode {
fn drop(&mut self) {
Contributor Author

I added this to avoid leaking restate-server processes from tests.
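A minimal sketch of the idea, assuming a wrapper struct that owns the spawned `std::process::Child` (the real `StartedNode` holds more state and signals via nix; names below are illustrative):

```rust
use std::process::{Child, Command};

// Hypothetical stand-in for StartedNode: owns the spawned server process and
// kills it on drop so a failed test cannot leak a running restate-server.
struct StartedNode {
    process: Child,
}

impl Drop for StartedNode {
    fn drop(&mut self) {
        // Errors are ignored: the process may have exited on its own already.
        let _ = self.process.kill();
        let _ = self.process.wait(); // reap the child to avoid a zombie
    }
}

fn spawn_node() -> std::io::Result<StartedNode> {
    // `sleep` stands in for the restate-server binary in this sketch.
    Ok(StartedNode {
        process: Command::new("sleep").arg("30").spawn()?,
    })
}
```

Dropping the handle runs during panic unwinding too, which is why the test macro choice discussed below matters: a macro that aborts the process on panic skips these destructors.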


mod common;

#[tokio::test]
Contributor Author

I am using this rather than test(restate_core::test) because that macro just exits the process on panic, which prevents unwinding from running - and can lead to leaked spawned processes on test failure.

Contributor

I think the reason we have this panic hook is to ensure that if a panic occurs within a spawned task, the tests will fail. Otherwise, the panic might just be swallowed by the task.

Contributor Author

Ah! Of course; I recall the discussion now - switched to #[test_log::test(tokio::test)] as a middle ground to ensure that the Drop callback works.

Contributor Author

The new trim gap fast-forward test covers the same paths as this one.

@@ -34,6 +34,7 @@ description = "Restate makes distributed applications easy!"
[workspace.dependencies]
# Own crates
codederror = { path = "crates/codederror" }
mock-service-endpoint = { path = "tools/mock-service-endpoint" }
Contributor Author

With this PR, the mock service handler is now usable from other packages - this is handy for e2e testing.

@@ -89,7 +89,6 @@ pub struct TestEnv {
pub loglet: Arc<dyn Loglet>,
pub metadata_writer: MetadataWriter,
pub metadata_store_client: MetadataStoreClient,
pub cluster: StartedCluster,
Contributor Author

I removed passing the cluster to the test routine as it is easy to accidentally drop it, and kill the cluster in the process. We can reintroduce it as a reference if it's needed in the future.

@pcholakov pcholakov marked this pull request as ready for review January 2, 2025 16:32
@pcholakov pcholakov force-pushed the feat/trim-gap-e2e-test branch 2 times, most recently from e7bd6c7 to 071338c Compare January 3, 2025 13:56
Base automatically changed from feat/trim-gap-handling to main January 3, 2025 14:31
Contributor

@tillrohrmann tillrohrmann left a comment

Thanks for creating the end-to-end test for our snapshots @pcholakov. The changes look good to me.

The one aspect that makes me a bit uneasy is that it seems we cannot reliably guarantee that a trim has happened. If this is correct, then we might be adding a test which is unstable in our CI environment. Because of this, maybe it's worth first adding the functionality to report back which LSN was trimmed, so that we can make the trim_log function reliable?

pid,
);
match nix::sys::signal::kill(
nix::unistd::Pid::from_raw(pid.try_into().unwrap()),
Contributor

Is this try_into infallible or why is unwrap ok here?

Contributor

Why is it ok to unwrap here?

Contributor Author

I never answered you this - I blindly duplicated the kill implementation above without looking too closely; it appears this is completely safe. Tokio's Child::id() just passes through std::sys::pal::unix::process::Process::id's return type which is u32 but Process internally holds the pid as a pid_t = i32 and does a blind cast to u32 when returning:

https://doc.rust-lang.org/nightly/src/std/sys/pal/unix/process/process_unix.rs.html#943-945

I couldn't find any background on why, other than other people asking essentially the same question (https://users.rust-lang.org/t/std-id-vs-libc-pid-t-how-to-handle/78281/3).

Two interesting factoids I learned in the process 😀

I'll update this to use expect() with a comment before merging.
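A small sketch of what the checked conversion with `expect()` could look like (the helper name is mine, not from the PR):

```rust
// std returns the child's pid as u32, even though the kernel's pid_t is i32:
// std::sys::pal::unix::process stores an i32 and blindly casts to u32 on
// return. Kernel pids are small positive values, so converting back for
// nix::sys::signal::kill cannot fail in practice; expect() documents that
// reasoning at the call site instead of a bare unwrap().
fn pid_for_kill(pid: u32) -> i32 {
    i32::try_from(pid).expect("std returns a positive i32 pid cast to u32")
}
```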



Comment on lines 47 to 54
tracing_subscriber::fmt()
.event_format(tracing_subscriber::fmt::format().compact())
.with_env_filter(
tracing_subscriber::EnvFilter::builder()
.with_default_directive(LevelFilter::INFO.into())
.from_env_lossy(),
)
.init();
Contributor

If you use test_log::test(tokio::test), then you don't have to set these things up yourself.

Contributor Author

TY!

Comment on lines +204 to +202
// todo(pavel): promote node 3 to be the leader for partition 0 and invoke the service again
// right now, all we are asserting is that the new node is applying newly appended log records
Contributor

You could do this by manually changing the SchedulingPlan.

Contributor Author

I didn't think it would be this easy... and it seems like it isn't. I added a step to manually set the SchedulingPlan, but it only works intermittently - Scheduler::update_scheduling_plan nukes the changes as soon as it picks them up. I think this is important, so let's definitely do it, but maybe as a follow-up task to provide a leadership hint to the scheduler?

Contributor Author

Good news! With a bit of effort, I got this to work reliably - still a draft but will get it ready for review soon: #2471

State::Alive(s) => s
.partitions
.values()
.any(|p| p.effective_mode.cmp(&1).is_eq()),
Contributor

I think it is clearer if you compared against RunMode instead of the ordinal value which is harder to remember.

Contributor Author

The magic of try_from! Thanks for the tip :-)
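A sketch of the suggestion using a stand-in enum (prost-generated protobuf enums expose a similar `TryFrom<i32>`; the variant names and ordinals here are illustrative, not the real generated type):

```rust
// Stand-in for the generated RunMode protobuf enum.
#[derive(Debug, PartialEq)]
enum RunMode {
    Unknown,
    Leader,
    Follower,
}

impl TryFrom<i32> for RunMode {
    type Error = i32;
    fn try_from(value: i32) -> Result<Self, i32> {
        match value {
            0 => Ok(RunMode::Unknown),
            1 => Ok(RunMode::Leader),
            2 => Ok(RunMode::Follower),
            other => Err(other),
        }
    }
}

// Clearer than `p.effective_mode.cmp(&1).is_eq()`: the variant name says
// what ordinal 1 means, and unknown ordinals fall through to false.
fn is_leader(effective_mode: i32) -> bool {
    RunMode::try_from(effective_mode) == Ok(RunMode::Leader)
}
```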

Comment on lines 264 to 272
let mut i = 0;
loop {
client
.trim_log(TrimLogRequest {
log_id: 0,
trim_point,
})
.await?;
if i >= 2 {
break;
}
tokio::time::sleep(Duration::from_secs(1)).await;
i += 1;
}
Contributor

@tillrohrmann tillrohrmann Jan 3, 2025

How did you come up with the magic number of 3 attempts?

Contributor Author

Empirically! I think Azmy suggested it may be related to the heartbeat interval and updating the global tail. Moot now; I've converted this to a retry until the desired effective trim is reached.
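The retry-until-converged shape could look roughly like this (a synchronous, std-only sketch; `request_trim` stands in for the trim_log RPC, which after #2468 is assumed to report back the effective trim point):

```rust
use std::time::{Duration, Instant};

// Retry the trim request until the reported effective trim point reaches the
// target LSN, instead of issuing a magic number of attempts.
fn trim_until(
    mut request_trim: impl FnMut() -> u64, // returns the effective trim point
    target_lsn: u64,
    timeout: Duration,
) -> Result<u64, String> {
    let deadline = Instant::now() + timeout;
    loop {
        let effective = request_trim();
        if effective >= target_lsn {
            return Ok(effective);
        }
        if Instant::now() >= deadline {
            return Err(format!("trim stalled at lsn {effective}"));
        }
        std::thread::sleep(Duration::from_millis(100));
    }
}
```

This makes the test deterministic: it either observes the trim reaching the target LSN or fails with a clear message, rather than hoping three attempts were enough.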

Comment on lines 254 to 258
async fn trim_log(
client: &mut ClusterCtrlSvcClient<Channel>,
trim_point: u64,
) -> googletest::Result<()> {
// todo(pavel): this is flimsy, ensure we actually trim the log to a particular LSN
Contributor

If this method does not do anything because the admin node didn't have the up-to-date log tail, then the remaining test will be stuck. This might be a problem for the stability of the test. Something to observe on our CI infra, where timings can be quite skewed.

Contributor Author

Sorry, I created the wrong impression with the todo comment - I've rebased on #2468 which allows this to be deterministic :-)

Comment on lines 338 to 468
async fn grpc_connect(address: AdvertisedAddress) -> Result<Channel, tonic::transport::Error> {
match address {
AdvertisedAddress::Uds(uds_path) => {
// dummy endpoint required to specify an uds connector, it is not used anywhere
Endpoint::try_from("http://127.0.0.1")
.expect("/ should be a valid Uri")
.connect_with_connector(service_fn(move |_: Uri| {
let uds_path = uds_path.clone();
async move {
Ok::<_, io::Error>(TokioIo::new(UnixStream::connect(uds_path).await?))
}
})).await
}
AdvertisedAddress::Http(uri) => {
Channel::builder(uri)
.connect_timeout(Duration::from_secs(2))
.timeout(Duration::from_secs(2))
.http2_adaptive_window(true)
.connect()
.await
}
}
}
Contributor

This looks quite similar to create_tonic_channel_from_advertised_address. Could this be reused?

Contributor Author

@pcholakov pcholakov Jan 6, 2025

I copied it nearly verbatim from restatectl's grpc_connect utility - which looks like it may have been the origin of create_tonic_channel_from_advertised_address, too. I've done this under its own PR here:

#2469

@pcholakov pcholakov force-pushed the feat/trim-gap-e2e-test branch from 071338c to 446bd56 Compare January 6, 2025 13:31
@pcholakov pcholakov changed the base branch from main to feat/trim-log-report-lsn January 6, 2025 13:33
@pcholakov pcholakov force-pushed the feat/trim-gap-e2e-test branch from 60dbc4e to 446bd56 Compare January 6, 2025 16:06
@pcholakov pcholakov requested review from tillrohrmann and removed request for jackkleeman January 6, 2025 16:10
@pcholakov
Contributor Author

The one aspect that makes me a bit uneasy is that it seems we cannot reliably guarantee that a trim has happened. If this is correct, then we might be adding a test which is unstable in our CI environment. Because of this, maybe it's worth first adding the functionality to report back which LSN was trimmed, so that we can make the trim_log function reliable?

Yes, definitely! I was already working on that - I realize my todo might have created the wrong impression :-) Here is the change, on which this PR is now rebased: #2468.

I wasn't able to get the leadership change to work reliably but I'm pretty keen to do that too. However, I believe that the test as it stands should be reasonably robust to merge and won't cause undue noise in CI.

Contributor

@tillrohrmann tillrohrmann left a comment

Thanks for adding this end-to-end test for testing snapshots and trim gap handling @pcholakov. The changes look good to me :-) +1 for merging.


server/tests/trim_gap_handling.rs (outdated review comment, resolved)
async fn trim_log(
client: &mut ClusterCtrlSvcClient<Channel>,
trim_point: u64,
timeout: Duration,
Contributor

Nit: The timeout handling could also happen at the call site via tokio::time::timeout(timeout, trim_log(client, trim_point) if you want to keep the inner logic of this function a tad bit simpler and make it more compositional. The same applies to applied_lsn_converged.

Contributor Author

Good point, I'll refactor all the helpers where we don't need to track timeouts internally.
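The compositional shape the reviewer describes, as a std-only analogue (in the async test itself this would simply be `tokio::time::timeout(timeout, trim_log(client, trim_point))`; the thread/channel version below is just for illustration):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Keep helpers timeout-free and apply the deadline at the call site instead,
// so each helper stays simple and composes with any timeout policy.
fn with_timeout<T: Send + 'static>(
    timeout: Duration,
    f: impl FnOnce() -> T + Send + 'static,
) -> Result<T, &'static str> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone if we timed out; ignore send errors.
        let _ = tx.send(f());
    });
    rx.recv_timeout(timeout).map_err(|_| "timed out")
}
```

The same pattern applies to applied_lsn_converged: the polling loop only decides when it has converged, and the caller decides how long it is willing to wait.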

@pcholakov pcholakov force-pushed the feat/trim-gap-e2e-test branch from 446bd56 to a60f576 Compare January 7, 2025 14:11
@pcholakov pcholakov force-pushed the feat/trim-log-report-lsn branch from b3bc1c4 to 73943fb Compare January 7, 2025 15:11
@pcholakov pcholakov force-pushed the feat/trim-gap-e2e-test branch 5 times, most recently from 5200c2a to 81d3b2e Compare January 8, 2025 08:42
@pcholakov pcholakov force-pushed the feat/trim-gap-e2e-test branch from 81d3b2e to 9592d6a Compare January 8, 2025 13:48
@pcholakov pcholakov changed the base branch from feat/trim-log-report-lsn to main January 8, 2025 13:49
@pcholakov pcholakov merged commit 22fcef1 into main Jan 8, 2025
15 checks passed
@pcholakov pcholakov deleted the feat/trim-gap-e2e-test branch January 8, 2025 16:07