feat: add spot termination handler (#4176)
## Description

This PR adds a lambda to log and create metrics for spot terminations,
based on the CloudTrail event `BidEvictedEvent`. The feature is
experimental and disabled by default.

## Future directions

The current implementation only makes spot terminations visible to an
admin team. In the future we want to enrich a runner, via tagging, with
information about which job is active. This would allow the termination
handler to also inform the user by adding a job annotation once a spot
termination occurs.

## Migration

No migration is required. By default the watcher is disabled.

- logging for the watcher is changed
- resources will be recreated for notification warning watcher

---------

Co-authored-by: philips-labs-pr|bot <philips-labs-pr[bot]@users.noreply.github.com>
Co-authored-by: Stuart Pearson <1926002+stuartp44@users.noreply.github.com>
3 people authored Oct 10, 2024
1 parent 3fb1729 commit 8ba0a82
Showing 35 changed files with 861 additions and 221 deletions.
3 changes: 2 additions & 1 deletion README.md
@@ -165,7 +165,7 @@ Talk to the forestkeepers in the `runners-channel` on Slack.
| <a name="input_instance_max_spot_price"></a> [instance\_max\_spot\_price](#input\_instance\_max\_spot\_price) | Max price price for spot instances per hour. This variable will be passed to the create fleet as max spot price for the fleet. | `string` | `null` | no |
| <a name="input_instance_profile_path"></a> [instance\_profile\_path](#input\_instance\_profile\_path) | The path that will be added to the instance\_profile, if not set the environment name will be used. | `string` | `null` | no |
| <a name="input_instance_target_capacity_type"></a> [instance\_target\_capacity\_type](#input\_instance\_target\_capacity\_type) | Default lifecycle used for runner instances, can be either `spot` or `on-demand`. | `string` | `"spot"` | no |
| <a name="input_instance_termination_watcher"></a> [instance\_termination\_watcher](#input\_instance\_termination\_watcher) | Configuration for the instance termination watcher. This feature is Beta, changes will not trigger a major release as long in beta.<br/><br/>`enable`: Enable or disable the spot termination watcher.<br/>`memory_size`: Memory size linit in MB of the lambda.<br/>`s3_key`: S3 key for syncer lambda function. Required if using S3 bucket to specify lambdas.<br/>`s3_object_version`: S3 object version for syncer lambda function. Useful if S3 versioning is enabled on source bucket.<br/>`timeout`: Time out of the lambda in seconds.<br/>`zip`: File location of the lambda zip file. | <pre>object({<br/> enable = optional(bool, false)<br/> enable_metric = optional(string, null) # deprectaed<br/> memory_size = optional(number, null)<br/> s3_key = optional(string, null)<br/> s3_object_version = optional(string, null)<br/> timeout = optional(number, null)<br/> zip = optional(string, null)<br/> })</pre> | `{}` | no |
| <a name="input_instance_termination_watcher"></a> [instance\_termination\_watcher](#input\_instance\_termination\_watcher) | Configuration for the instance termination watcher. This feature is Beta, changes will not trigger a major release as long in beta.<br/><br/>`enable`: Enable or disable the spot termination watcher.<br/>'features': Enable or disable features of the termination watcher.<br/>`memory_size`: Memory size linit in MB of the lambda.<br/>`s3_key`: S3 key for syncer lambda function. Required if using S3 bucket to specify lambdas.<br/>`s3_object_version`: S3 object version for syncer lambda function. Useful if S3 versioning is enabled on source bucket.<br/>`timeout`: Time out of the lambda in seconds.<br/>`zip`: File location of the lambda zip file. | <pre>object({<br/> enable = optional(bool, false)<br/> enable_metric = optional(string, null) # deprectaed<br/> features = optional(object({<br/> enable_spot_termination_handler = optional(bool, true)<br/> enable_spot_termination_notification_watcher = optional(bool, true)<br/> }), {})<br/> memory_size = optional(number, null)<br/> s3_key = optional(string, null)<br/> s3_object_version = optional(string, null)<br/> timeout = optional(number, null)<br/> zip = optional(string, null)<br/> })</pre> | `{}` | no |
| <a name="input_instance_types"></a> [instance\_types](#input\_instance\_types) | List of instance types for the action runner. Defaults are based on runner\_os (al2023 for linux and Windows Server Core for win). | `list(string)` | <pre>[<br/> "m5.large",<br/> "c5.large"<br/>]</pre> | no |
| <a name="input_job_queue_retention_in_seconds"></a> [job\_queue\_retention\_in\_seconds](#input\_job\_queue\_retention\_in\_seconds) | The number of seconds the job is held in the queue before it is purged. | `number` | `86400` | no |
| <a name="input_job_retry"></a> [job\_retry](#input\_job\_retry) | Experimental! Can be removed / changed without trigger a major release.Configure job retries. The configuration enables job retries (for ephemeral runners). After creating the insances a message will be published to a job retry queue. The job retry check lambda is checking after a delay if the job is queued. If not the message will be published again on the scale-up (build queue). Using this feature can impact the reate limit of the GitHub app.<br/><br/>`enable`: Enable or disable the job retry feature.<br/>`delay_in_seconds`: The delay in seconds before the job retry check lambda will check the job status.<br/>`delay_backoff`: The backoff factor for the delay.<br/>`lambda_memory_size`: Memory size limit in MB for the job retry check lambda.<br/>`lambda_timeout`: Time out of the job retry check lambda in seconds.<br/>`max_attempts`: The maximum number of attempts to retry the job. | <pre>object({<br/> enable = optional(bool, false)<br/> delay_in_seconds = optional(number, 300)<br/> delay_backoff = optional(number, 2)<br/> lambda_memory_size = optional(number, 256)<br/> lambda_timeout = optional(number, 30)<br/> max_attempts = optional(number, 1)<br/> })</pre> | `{}` | no |
@@ -257,6 +257,7 @@ Talk to the forestkeepers in the `runners-channel` on Slack.
| Name | Description |
|------|-------------|
| <a name="output_binaries_syncer"></a> [binaries\_syncer](#output\_binaries\_syncer) | n/a |
| <a name="output_instance_termination_handler"></a> [instance\_termination\_handler](#output\_instance\_termination\_handler) | n/a |
| <a name="output_instance_termination_watcher"></a> [instance\_termination\_watcher](#output\_instance\_termination\_watcher) | n/a |
| <a name="output_queues"></a> [queues](#output\_queues) | SQS queues. |
| <a name="output_runners"></a> [runners](#output\_runners) | n/a |
22 changes: 18 additions & 4 deletions docs/configuration.md
@@ -215,21 +215,35 @@ In case the setup does not work as intended, trace the events through this seque

### Termination watcher

This feature is at an early stage and is therefore disabled by default.
This feature is at an early stage and is therefore disabled by default. To enable the watcher, set `instance_termination_watcher.enable = true`.

The termination watcher currently watches for spot termination notifications. By default, when the module is deployed as part of one of the main modules (root or multi-runner), it only takes into account events for instances tagged with `ghr:environment`. The module can also be deployed stand-alone; in that case the tag filter needs to be tuned.
The termination watcher currently watches for spot terminations. By default, when the module is deployed as part of one of the main modules (root or multi-runner), it only takes into account events for instances tagged with `ghr:environment`. The module can also be deployed stand-alone; in this case, the tag filter needs to be tuned.

### Termination notification

The watcher listens for spot termination warnings and creates a log message and, optionally, a metric. The watcher is disabled by default. The feature is enabled once the watcher is enabled; it can be disabled explicitly by setting `instance_termination_watcher.features.enable_spot_termination_notification_watcher = false`.

- Logs: The module logs all termination notifications. For each warning it looks up the instance details and logs the environment, the instance type, and how long the instance has been running, along with some other details.
- Metrics: Metrics are disabled by default to avoid costs. Once enabled, a metric is created for each warning, with at least dimensions for the environment and instance type. The metric namespace can be configured via the variables. The metric name used is `SpotInterruptionWarning`.

#### Log example
### Termination handler

!!! warning
    This feature only works once CloudTrail is enabled.

The termination handler listens for spot terminations by capturing the `BidEvictedEvent` via CloudTrail. The handler logs and optionally creates a metric for each termination. The intent is to extend the logic to inform the user about the termination via the GitHub job or workflow run. The feature is disabled by default. It is enabled once the watcher is enabled, and can be disabled explicitly by setting `instance_termination_watcher.features.enable_spot_termination_handler = false`.

- Logs: The module logs all terminations. For each termination it looks up the instance details and logs the environment, the instance type, and how long the instance has been running, along with some other details.
- Metrics: Metrics are disabled by default to avoid costs. Once enabled, a metric is created for each termination, with at least dimensions for the environment and instance type. The metric namespace can be configured via the variables. The metric name used is `SpotTermination`.
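
As a rough illustration of the first step such a handler needs, the sketch below pulls the evicted instance IDs out of a `BidEvictedEvent`. The `BidEvictedLike` type and the `evictedInstanceIds` helper are hypothetical and trimmed to the fields used here; the check on `eventName` is an assumption, not necessarily what the module's handler does.

```typescript
// Hypothetical, trimmed-down shape of the CloudTrail event; only the fields used below.
interface BidEvictedLike {
  detail: {
    eventName: string;
    serviceEventDetails: { instanceIdSet: string[] };
  };
}

// Returns the evicted instance IDs, or an empty list for any other event name.
// (The event-name guard is an assumption made for this sketch.)
function evictedInstanceIds(event: BidEvictedLike): string[] {
  if (event.detail.eventName !== 'BidEvictedEvent') {
    return [];
  }
  return event.detail.serviceEventDetails.instanceIdSet;
}

const sampleEvent: BidEvictedLike = {
  detail: {
    eventName: 'BidEvictedEvent',
    serviceEventDetails: { instanceIdSet: ['i-12345678901234567'] },
  },
};

const evictedIds = evictedInstanceIds(sampleEvent);
console.log(evictedIds);
```

Each returned ID could then be looked up in EC2 and matched against the tag filter before logging or creating a metric.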

### Log example (both warnings and terminations)

Below is an example of the log messages created.

```
{
"level": "INFO",
"message": "Received spot notification warning:",
"message": "Received spot notification for ${metricName}",
"environment": "default",
"instanceId": "i-0039b8826b3dcea55",
"instanceType": "c5.large",
```
4 changes: 2 additions & 2 deletions examples/default/main.tf
@@ -78,7 +78,7 @@ module "runners" {
# Let the module manage the service linked role
# create_service_linked_role_spot = true

instance_types = ["m5.large", "c5.large"]
instance_types = ["m7a.large", "m5.large"]

# override delay of events in seconds
delay_webhook_event = 5
@@ -122,7 +122,7 @@ module "runners" {
# metric = {
# enable_spot_termination_warning = true
# enable_job_retry = false
# enable_github_app_rate_limit = true
# enable_github_app_rate_limit = false
# }
# }

2 changes: 2 additions & 0 deletions lambdas/functions/termination-watcher/src/ConfigResolver.ts
@@ -2,6 +2,7 @@ import { createChildLogger } from '@aws-github-runner/aws-powertools-util';

export class Config {
createSpotWarningMetric: boolean;
createSpotTerminationMetric: boolean;
tagFilters: Record<string, string>;
prefix: string;

@@ -11,6 +12,7 @@ export class Config {
logger.debug('Loading config from environment variables', { env: process.env });

this.createSpotWarningMetric = process.env.ENABLE_METRICS_SPOT_WARNING === 'true';
this.createSpotTerminationMetric = process.env.ENABLE_METRICS_SPOT_TERMINATION === 'true';
this.prefix = process.env.PREFIX ?? '';
this.tagFilters = { 'ghr:environment': this.prefix };
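
The flag handling above can be sketched in isolation. The `resolveFlags` helper and the `Env` type are hypothetical stand-ins for the `Config` constructor, written so the example runs without the Lambda environment; the point is that only the exact string `'true'` enables a metric.

```typescript
// Hypothetical stand-in for process.env, so the sketch is self-contained.
type Env = Record<string, string | undefined>;

// Mirrors the `=== 'true'` comparisons and the `?? ''` default from the Config constructor.
function resolveFlags(env: Env) {
  return {
    createSpotWarningMetric: env.ENABLE_METRICS_SPOT_WARNING === 'true',
    createSpotTerminationMetric: env.ENABLE_METRICS_SPOT_TERMINATION === 'true',
    prefix: env.PREFIX ?? '',
  };
}

// 'TRUE' is not the exact string 'true', so the warning metric stays disabled.
const flags = resolveFlags({
  ENABLE_METRICS_SPOT_TERMINATION: 'true',
  ENABLE_METRICS_SPOT_WARNING: 'TRUE',
});
console.log(flags);
```

An unset `PREFIX` resolves to an empty string, which then becomes the value of the `ghr:environment` tag filter.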

93 changes: 93 additions & 0 deletions lambdas/functions/termination-watcher/src/ec2.test.ts
@@ -0,0 +1,93 @@
import { EC2Client, DescribeInstancesCommand, DescribeInstancesResult } from '@aws-sdk/client-ec2';
import { mockClient } from 'aws-sdk-client-mock';
import { getInstances, tagFilter } from './ec2';

const ec2Mock = mockClient(EC2Client);

describe('getInstances', () => {
beforeEach(() => {
ec2Mock.reset();
});

it('should return the instance when found', async () => {
const instanceId = 'i-1234567890abcdef0';
const instance = { InstanceId: instanceId };
ec2Mock.on(DescribeInstancesCommand).resolves({
Reservations: [{ Instances: [instance] }],
});

const result = await getInstances(new EC2Client({}), [instanceId]);
expect(result).toEqual([instance]);
});

describe('should return an empty array when the instance is not found', () => {
it.each([{ Reservations: [] }, {}, { Reservations: undefined }])(
'with %p',
async (item: DescribeInstancesResult) => {
const instanceId = 'i-1234567890abcdef0';
ec2Mock.on(DescribeInstancesCommand).resolves(item);

const result = await getInstances(new EC2Client({}), [instanceId]);
expect(result).toEqual([]);
},
);
});
});

describe('tagFilter', () => {
describe('should return true when the instance matches the tag filters', () => {
it.each([{ Environment: 'production' }, { Environment: 'prod' }])(
'with %p',
(tagFilters: Record<string, string>) => {
const instance = {
Tags: [
{ Key: 'Name', Value: 'test-instance' },
{ Key: 'Environment', Value: 'production' },
],
};

const result = tagFilter(instance, tagFilters);
expect(result).toBe(true);
},
);
});

it('should return false when the instance does not have all the tags', () => {
const instance = {
Tags: [{ Key: 'Name', Value: 'test-instance' }],
};
const tagFilters = { Name: 'test', Environment: 'prod' };

const result = tagFilter(instance, tagFilters);
expect(result).toBe(false);
});

it('should return false when the instance does not have any tags', () => {
const instance = {};
const tagFilters = { Name: 'test', Environment: 'prod' };

const result = tagFilter(instance, tagFilters);
expect(result).toBe(false);
});

it('should return true if the tag filters are empty', () => {
const instance = {
Tags: [
{ Key: 'Name', Value: 'test-instance' },
{ Key: 'Environment', Value: 'production' },
],
};
const tagFilters = {};

const result = tagFilter(instance, tagFilters);
expect(result).toBe(true);
});

it('should return false if instance is null', () => {
const instance = null;
const tagFilters = { Name: 'test', Environment: 'prod' };

const result = tagFilter(instance, tagFilters);
expect(result).toBe(false);
});
});
13 changes: 13 additions & 0 deletions lambdas/functions/termination-watcher/src/ec2.ts
@@ -0,0 +1,13 @@
import { DescribeInstancesCommand, EC2Client, Instance } from '@aws-sdk/client-ec2';

// Look up the given instance IDs. Note: only the first reservation in the response is read.
export async function getInstances(ec2: EC2Client, instanceId: string[]): Promise<Instance[]> {
  const result = await ec2.send(new DescribeInstancesCommand({ InstanceIds: instanceId }));
  const instances = result.Reservations?.[0]?.Instances;
  return instances ?? [];
}

// True when every filter key is present on the instance and the tag value starts with the filter value.
export function tagFilter(instance: Instance | null, tagFilters: Record<string, string>): boolean {
  return Object.keys(tagFilters).every((key) => {
    return instance?.Tags?.find((tag) => tag.Key === key && tag.Value?.startsWith(tagFilters[key]));
  });
}
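
To make the prefix semantics of `tagFilter` concrete, here is a minimal self-contained sketch. `TaggedInstance` and `matchesTagFilters` are hypothetical stand-ins for the SDK's `Instance` type and the function above, used so the example runs without `@aws-sdk/client-ec2`; the tag values are made up for illustration.

```typescript
// Hypothetical stand-in for the `Instance` type from @aws-sdk/client-ec2.
interface TaggedInstance {
  Tags?: { Key?: string; Value?: string }[];
}

// Same rule as `tagFilter`: every filter key must exist on the instance,
// and the tag value must start with the filter value.
function matchesTagFilters(instance: TaggedInstance | null, tagFilters: Record<string, string>): boolean {
  return Object.keys(tagFilters).every((key) =>
    Boolean(instance?.Tags?.some((tag) => tag.Key === key && tag.Value?.startsWith(tagFilters[key]) === true)),
  );
}

const runner: TaggedInstance = { Tags: [{ Key: 'ghr:environment', Value: 'default-linux-x64' }] };

const prefixMatch = matchesTagFilters(runner, { 'ghr:environment': 'default' });
const mismatch = matchesTagFilters(runner, { 'ghr:environment': 'prod' });
const emptyFilter = matchesTagFilters(runner, {});
const nullInstance = matchesTagFilters(null, { 'ghr:environment': 'default' });
console.log({ prefixMatch, mismatch, emptyFilter, nullInstance });
```

Because the comparison is a prefix match, an empty filter value matches any instance that carries the tag at all.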
78 changes: 73 additions & 5 deletions lambdas/functions/termination-watcher/src/lambda.test.ts
@@ -3,14 +3,16 @@ import { Context } from 'aws-lambda';
import { mocked } from 'jest-mock';

import { handle as interruptionWarningHandlerImpl } from './termination-warning';
import { interruptionWarning } from './lambda';
import { SpotInterruptionWarning, SpotTerminationDetail } from './types';
import { handle as terminationHandlerImpl } from './termination';
import { interruptionWarning, termination } from './lambda';
import { BidEvictedDetail, BidEvictedEvent, SpotInterruptionWarning, SpotTerminationDetail } from './types';

jest.mock('./termination-warning');
jest.mock('./termination');

process.env.POWERTOOLS_METRICS_NAMESPACE = 'test';
process.env.POWERTOOLS_TRACE_ENABLED = 'true';
const event: SpotInterruptionWarning<SpotTerminationDetail> = {
const spotInstanceInterruptionEvent: SpotInterruptionWarning<SpotTerminationDetail> = {
version: '0',
id: '1',
'detail-type': 'EC2 Spot Instance Interruption Warning',
@@ -25,6 +27,42 @@ const event: SpotInterruptionWarning<SpotTerminationDetail> = {
},
};

const bidEvictedEvent: BidEvictedEvent<BidEvictedDetail> = {
version: '0',
id: '186d7999-3121-e749-23f3-c7caec1084e1',
'detail-type': 'AWS Service Event via CloudTrail',
source: 'aws.ec2',
account: '123456789012',
time: '2024-10-09T11:48:46Z',
region: 'eu-west-1',
resources: [],
detail: {
eventVersion: '1.10',
userIdentity: {
accountId: '123456789012',
invokedBy: 'sec2.amazonaws.com',
},
eventTime: '2024-10-09T11:48:46Z',
eventSource: 'ec2.amazonaws.com',
eventName: 'BidEvictedEvent',
awsRegion: 'eu-west-1',
sourceIPAddress: 'ec2.amazonaws.com',
userAgent: 'ec2.amazonaws.com',
requestParameters: null,
responseElements: null,
requestID: 'ebf032e3-5009-3484-aae8-b4946ab2e2eb',
eventID: '3a15843b-96c2-41b1-aac1-7d62dc754547',
readOnly: false,
eventType: 'AwsServiceEvent',
managementEvent: true,
recipientAccountId: '123456789012',
serviceEventDetails: {
instanceIdSet: ['i-12345678901234567'],
},
eventCategory: 'Management',
},
};

const context: Context = {
awsRequestId: '1',
callbackWaitsForEmptyEventLoop: false,
@@ -48,22 +86,52 @@ const context: Context = {

// Docs for testing async with jest: https://jestjs.io/docs/tutorial-async
describe('Handle spot termination interruption warning', () => {
beforeEach(() => {
jest.clearAllMocks();
});

it('should not throw or log in error.', async () => {
const mock = mocked(interruptionWarningHandlerImpl);
mock.mockImplementation(() => {
return new Promise((resolve) => {
resolve();
});
});
await expect(interruptionWarning(event, context)).resolves.not.toThrow();
await expect(interruptionWarning(spotInstanceInterruptionEvent, context)).resolves.not.toThrow();
});

it('should not throw only log in error in case of an exception.', async () => {
const logSpy = jest.spyOn(logger, 'error');
const error = new Error('An error.');
const mock = mocked(interruptionWarningHandlerImpl);
mock.mockRejectedValue(error);
await expect(interruptionWarning(event, context)).resolves.toBeUndefined();
await expect(interruptionWarning(spotInstanceInterruptionEvent, context)).resolves.toBeUndefined();

expect(logSpy).toHaveBeenCalledTimes(1);
});
});

describe('Handle spot termination (BidEvictedEvent)', () => {
beforeEach(() => {
jest.clearAllMocks();
});

it('should not throw or log in error.', async () => {
const mock = mocked(terminationHandlerImpl);
mock.mockImplementation(() => {
return new Promise((resolve) => {
resolve();
});
});
await expect(termination(bidEvictedEvent, context)).resolves.not.toThrow();
});

it('should not throw only log in error in case of an exception.', async () => {
const logSpy = jest.spyOn(logger, 'error');
const error = new Error('An error.');
const mock = mocked(terminationHandlerImpl);
mock.mockRejectedValue(error);
await expect(termination(bidEvictedEvent, context)).resolves.toBeUndefined();

expect(logSpy).toHaveBeenCalledTimes(1);
});
15 changes: 14 additions & 1 deletion lambdas/functions/termination-watcher/src/lambda.ts
@@ -4,7 +4,8 @@ import { logMetrics } from '@aws-lambda-powertools/metrics/middleware';
import { Context } from 'aws-lambda';

import { handle as handleTerminationWarning } from './termination-warning';
import { SpotInterruptionWarning, SpotTerminationDetail } from './types';
import { handle as handleTermination } from './termination';
import { BidEvictedDetail, BidEvictedEvent, SpotInterruptionWarning, SpotTerminationDetail } from './types';
import { Config } from './ConfigResolver';

const config = new Config();
@@ -24,6 +25,18 @@ export async function interruptionWarning(
}
}

export async function termination(event: BidEvictedEvent<BidEvictedDetail>, context: Context): Promise<void> {
  setContext(context, 'lambda.ts');
  logger.logEventIfEnabled(event);
  logger.debug('Configuration of the lambda', { config });

  try {
    await handleTermination(event, config);
  } catch (e) {
    logger.error(`${(e as Error).message}`, { error: e as Error });
  }
}
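
The catch-and-log pattern used by `termination` can be shown on its own. The sketch below is an illustration under assumptions: `demoLogger`, `safely`, and `runDemo` are hypothetical names, and the in-memory logger replaces the project's real logger so the example has no dependencies.

```typescript
// Hypothetical in-memory logger; the real code uses the project's logger.
const loggedErrors: string[] = [];
const demoLogger = {
  error: (message: string): void => {
    loggedErrors.push(message);
  },
};

// Runs a handler and logs any failure instead of rethrowing it,
// so the returned promise always resolves.
async function safely<E>(handler: (event: E) => Promise<void>, event: E): Promise<void> {
  try {
    await handler(event);
  } catch (e) {
    demoLogger.error((e as Error).message);
  }
}

async function runDemo(): Promise<string[]> {
  // A failing handler: the error is logged, not thrown.
  await safely(async () => {
    throw new Error('An error.');
  }, { id: '1' });
  // A succeeding handler: nothing is logged.
  await safely(async () => {}, { id: '2' });
  return loggedErrors;
}

const demoResult = runDemo();
demoResult.then((errors) => console.log(errors));
```

This matches what the tests in `lambda.test.ts` assert: the wrapper resolves to `undefined` and the error logger is called once when the underlying handler rejects.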

const addMiddleware = () => {
const middleware = middy(interruptionWarning);
