Use EC2 Root Volume Replacement to replace macOS hosts #149

Open
wants to merge 14 commits into
base: master

Conversation

@lyoung-confluent lyoung-confluent commented Feb 25, 2024

Note

I have not tested this change; it is based purely on reading the AWS documentation.

Currently, this stack recycles (terminates) each EC2 instance after it has successfully executed a single CI job to ensure that each CI job has a fresh environment without any risk of broken tooling/configuration from a previous job/execution.

However, this pattern is problematic for macOS machines because EC2 Mac instances are subject to a 24-hour minimum allocation period:

Billing for EC2 Mac instances is per second with a 24-hour minimum allocation period to comply with the Apple macOS Software License Agreement

As a result, a CI job that takes only a few seconds, or even a full hour (the default timeout), is billed for 24 hours of compute time, which can quickly become expensive.

Fortunately, there is a potential solution. EC2 supports quickly restoring an instance to its original state, as described in Quickly Restore Amazon EC2 Mac Instances using Replace Root Volume capability:

The second use case is during continuous integration and continuous deployment (CI/CD) when you need to restore an Amazon EC2 Mac instance to a defined well-known state at the end of a build.

To restart your EC2 Mac instance in its initial state without stopping or terminating them, we created the ability to replace the root volume of an Amazon EC2 Mac instance with another EBS volume. This new EBS volume is created either from a new AMI, an Amazon EBS Snapshot, or from the initial volume state during boot.

This PR attempts to implement this feature by adding the ec2:CreateReplaceRootVolumeTask IAM action/permission to the instance IAM role (using the aws:userid condition trick to limit that action to the instance making the request). Then, instead of calling TerminateInstanceInAutoScalingGroup when an instance needs to be replaced, it calls CreateReplaceRootVolumeTask with its own instance ID and AMI ID (extracted via the metadata service) to restart the machine with a fresh OS image.
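A minimal sketch of that call, written in Python with boto3-style clients rather than the shell scripts this stack uses. The helper names `imds_get` and `replace_own_root_volume` are hypothetical, and `DeleteReplacedRootVolume=True` is an assumption about the desired cleanup behavior, not something this PR specifies. It has not been run against a real EC2 Mac instance.

```python
import urllib.request

IMDS_BASE = "http://169.254.169.254/latest"


def imds_get(path, opener=urllib.request.urlopen):
    """Fetch one instance metadata value using an IMDSv2 session token."""
    token_req = urllib.request.Request(
        f"{IMDS_BASE}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with opener(token_req, timeout=2) as resp:
        token = resp.read().decode()

    data_req = urllib.request.Request(
        f"{IMDS_BASE}/meta-data/{path}",
        headers={"X-aws-ec2-metadata-token": token},
    )
    with opener(data_req, timeout=2) as resp:
        return resp.read().decode()


def replace_own_root_volume(ec2_client, instance_id, image_id):
    """Ask EC2 to restore the root volume from the given AMI.

    `ec2_client` is expected to behave like boto3.client("ec2").
    Returns the ID of the replace-root-volume task that was created.
    """
    response = ec2_client.create_replace_root_volume_task(
        InstanceId=instance_id,
        ImageId=image_id,
        # Assumption: the old root volume is not needed after the swap.
        DeleteReplacedRootVolume=True,
    )
    return response["ReplaceRootVolumeTask"]["ReplaceRootVolumeTaskId"]
```

On the instance itself this would be used roughly as `replace_own_root_volume(boto3.client("ec2"), imds_get("instance-id"), imds_get("ami-id"))`. Passing the client in as an argument keeps the function easy to exercise with a stub, which matters here because a real test requires a Mac dedicated host.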

I am a bit concerned about this note in the AWS blog post:

During the replacement, the instance will be unable to respond to health checks and hence might be marked as unhealthy if placed inside an Auto Scaled Group. You can write a custom health check to change that behavior.

It's unclear whether such a custom health check would be required for this use case or whether the instance will restart quickly enough. To avoid the problem, I've updated the code to put the instance into standby mode temporarily before the replacement starts, which stops the Auto Scaling group from checking and enforcing the health check. The instance then exits standby in start-agent.sh.
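The standby handling can be sketched as follows, again in Python with boto3-style clients instead of the stack's shell scripts. The function names and the `asg_name` parameter are hypothetical. `ShouldDecrementDesiredCapacity=True` is my assumption: with `False`, the Auto Scaling group would launch an additional instance to cover for the one in standby, which is presumably not wanted here.

```python
def enter_standby(autoscaling_client, asg_name, instance_id):
    """Move the instance to Standby so the group stops health-checking it."""
    autoscaling_client.enter_standby(
        InstanceIds=[instance_id],
        AutoScalingGroupName=asg_name,
        # Assumption: lower desired capacity so no extra instance is launched.
        # This call fails if it would push the group below its minimum size.
        ShouldDecrementDesiredCapacity=True,
    )


def exit_standby(autoscaling_client, asg_name, instance_id):
    """Return the instance to service; desired capacity is raised again."""
    autoscaling_client.exit_standby(
        InstanceIds=[instance_id],
        AutoScalingGroupName=asg_name,
    )


def recycle_in_place(autoscaling_client, ec2_client, asg_name, instance_id, image_id):
    """Enter standby first, then request the root volume replacement."""
    enter_standby(autoscaling_client, asg_name, instance_id)
    ec2_client.create_replace_root_volume_task(
        InstanceId=instance_id,
        ImageId=image_id,
    )
```

`exit_standby` would be called from whatever runs after the reboot (start-agent.sh in this PR). The instance role also needs `autoscaling:EnterStandby` and `autoscaling:ExitStandby` for these calls to succeed.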

@lucaspin
Collaborator

lucaspin commented Mar 6, 2024

@lyoung-confluent thanks for the pull request. This looks interesting. I'll try to run a few tests and see if we need to adjust anything here.

@lyoung-confluent
Author

@lucaspin Did you by chance get time to test this? I'm not set up to test this easily myself; otherwise I would have.

@lucaspin
Collaborator

@lyoung-confluent unfortunately, not yet. I'll try to reserve a little bit of time next week for this.

@lyoung-confluent
Author

@lucaspin Checking in on this. Any luck finding some time to test this PR?

@lucaspin
Collaborator

@lyoung-confluent unfortunately, no. This completely slipped my mind again. It might be best to create a support ticket for this one, so we can more easily track and prioritize it alongside the other things we currently have going on.

@lucaspin
Collaborator

lucaspin commented Jan 6, 2025

@lyoung-confluent either the aws:userid condition trick is not working or something else in the permission is incorrect, because we are getting this error when trying to call the create-replace-root-volume-task operation:

An error occurred (UnauthorizedOperation) when calling the CreateReplaceRootVolumeTask operation: You are not authorized to perform this operation. User: arn:aws:sts::{ACCOUNT_ID}:assumed-role/{ROLE_NAME}/{INSTANCE_ID} is not authorized to perform: ec2:CreateReplaceRootVolumeTask on resource: arn:aws:ec2:{REGION}::image/{AMI_ID} because no identity-based policy allows the ec2:CreateReplaceRootVolumeTask action
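One possible reading of this error, offered as a hypothesis rather than a confirmed diagnosis: the denied resource is the image ARN, not the instance, so a statement whose resource or condition only matches the calling instance would not cover it. The sketch below builds a policy document with a separate, unconditioned statement for the image. The function name and both `Sid` values are hypothetical, and the self-scoping condition is passed in unchanged because its exact shape in this PR is not shown here.

```python
import json


def build_replace_root_volume_policy(self_scope_condition):
    """Build an IAM policy document that also allows the action on images.

    `self_scope_condition` is the condition block that restricts the
    instance statement to the caller's own instance.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReplaceOwnRootVolume",
                "Effect": "Allow",
                "Action": "ec2:CreateReplaceRootVolumeTask",
                "Resource": "arn:aws:ec2:*:*:instance/*",
                "Condition": self_scope_condition,
            },
            {
                # The error above names an image ARN, which has no account ID.
                # Assumption: other resource types (for example volumes) may
                # need a similar statement; only the image is confirmed here.
                "Sid": "ReplaceRootVolumeFromImage",
                "Effect": "Allow",
                "Action": "ec2:CreateReplaceRootVolumeTask",
                "Resource": "arn:aws:ec2:*::image/*",
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder condition for illustration only; the instance ID is made up.
    condition = {"StringLike": {"aws:userid": "*:i-0123456789abcdef0"}}
    print(json.dumps(build_replace_root_volume_policy(condition), indent=2))
```

If the call still fails after a change like this, the resource named in the next error message should indicate which resource type is missing.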

Also, I'm not sure whether the start-agent.sh script will run when the instance reboots after the root volume is replaced. That script is called from the instance userdata, and on macOS the userdata is configured to run once per instance.
