feat: SSM housekeeper (#3577)
# Description
The runner module uses SSM to provide the JIT config or registration token to the runner. If a runner does not start healthy, its SSM parameter is never deleted. This PR adds a Lambda that, by default, removes SSM parameters in the token path that are older than a day.

The Lambda is deployed by default as part of the control plane and manages the tokens in the path used by the scale-up runner function.
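For module consumers, running the housekeeper in dry-run mode first is a safe way to verify which parameters would be removed. A minimal sketch — the module source line and the surrounding required inputs are assumptions for illustration, not part of this change:

```hcl
module "runners" {
  # Assumed registry source; pin to a release that includes this feature.
  source = "philips-labs/github-runner/aws"

  # ... other required inputs (github_app, vpc_id, subnet_ids, ...) ...

  runners_ssm_housekeeper = {
    schedule_expression = "rate(1 day)" # default cadence
    enabled             = true
    lambda_timeout      = 60
    config = {
      dryRun         = true # log deletion candidates only; set false once verified
      minimumDaysOld = 1
      # tokenPath defaults to the module's token path when unset
    }
  }
}
```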

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Scott Guymer <scott.guymer@philips.com>
3 people authored Oct 30, 2023
1 parent f38f20a commit 340deea
Showing 20 changed files with 460 additions and 14 deletions.
3 changes: 2 additions & 1 deletion README.md
@@ -97,7 +97,7 @@ To be able to support a number of use-cases the module has quite a lot of config

### AWS SSM Parameters

The module uses the AWS System Manager Parameter Store to store configuration for the runners, as well as registration tokens and secrets for the Lambdas. Paths for the parameters can be configured via the variable `ssm_paths`. The location of the configuration parameters is retrieved by the runners via the instance tag `ghr:ssm_config_path`. The following default paths will be used.
The module uses the AWS System Manager Parameter Store to store configuration for the runners, as well as registration tokens and secrets for the Lambdas. Paths for the parameters can be configured via the variable `ssm_paths`. The location of the configuration parameters is retrieved by the runners via the instance tag `ghr:ssm_config_path`. The following default paths will be used. Tokens or JIT config stored in the token path are deleted after retrieval by the instance; any data not deleted within a day is removed by an SSM housekeeper Lambda.

| Path | Description |
| ----------- | ----------- |
@@ -585,6 +585,7 @@ We welcome any improvement to the standard module to make the default as secure
| <a name="input_runners_maximum_count"></a> [runners\_maximum\_count](#input\_runners\_maximum\_count) | The maximum number of runners that will be created. | `number` | `3` | no |
| <a name="input_runners_scale_down_lambda_timeout"></a> [runners\_scale\_down\_lambda\_timeout](#input\_runners\_scale\_down\_lambda\_timeout) | Time out for the scale down lambda in seconds. | `number` | `60` | no |
| <a name="input_runners_scale_up_lambda_timeout"></a> [runners\_scale\_up\_lambda\_timeout](#input\_runners\_scale\_up\_lambda\_timeout) | Time out for the scale up lambda in seconds. | `number` | `30` | no |
| <a name="input_runners_ssm_housekeeper"></a> [runners\_ssm\_housekeeper](#input\_runners\_ssm\_housekeeper) | Configuration for the SSM housekeeper lambda. This lambda deletes token / JIT config from SSM.<br><br> `schedule_expression`: is used to configure the schedule for the lambda.<br> `enabled`: enable or disable the lambda trigger via the EventBridge.<br> `lambda_timeout`: timeout for the lambda in seconds.<br> `config`: configuration for the lambda function. Token path will be read by default from the module. | <pre>object({<br> schedule_expression = optional(string, "rate(1 day)")<br> enabled = optional(bool, true)<br> lambda_timeout = optional(number, 60)<br> config = object({<br> tokenPath = optional(string)<br> minimumDaysOld = optional(number, 1)<br> dryRun = optional(bool, false)<br> })<br> })</pre> | <pre>{<br> "config": {}<br>}</pre> | no |
| <a name="input_scale_down_schedule_expression"></a> [scale\_down\_schedule\_expression](#input\_scale\_down\_schedule\_expression) | Scheduler expression to check every x for scale down. | `string` | `"cron(*/5 * * * ? *)"` | no |
| <a name="input_scale_up_reserved_concurrent_executions"></a> [scale\_up\_reserved\_concurrent\_executions](#input\_scale\_up\_reserved\_concurrent\_executions) | Amount of reserved concurrent executions for the scale-up lambda function. A value of 0 disables lambda from being triggered and -1 removes any concurrency limitations. | `number` | `1` | no |
| <a name="input_ssm_paths"></a> [ssm\_paths](#input\_ssm\_paths) | The root path used in SSM to store configuration and secrets. | <pre>object({<br> root = optional(string, "github-action-runners")<br> app = optional(string, "app")<br> runners = optional(string, "runners")<br> use_prefix = optional(bool, true)<br> })</pre> | `{}` | no |
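To illustrate the `runners_ssm_housekeeper` object documented in the table above: only overridden fields need to be set; the rest fall back to the defaults in the schema. A hedged sketch (the chosen values are examples, not recommendations) that retains tokens longer and runs the cleanup twice a day:

```hcl
runners_ssm_housekeeper = {
  schedule_expression = "rate(12 hours)" # run twice daily instead of the 1-day default
  config = {
    minimumDaysOld = 3 # keep token / JIT config parameters for three days
  }
}
```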
8 changes: 4 additions & 4 deletions lambdas/functions/control-plane/jest.config.ts
@@ -6,10 +6,10 @@ const config: Config = {
...defaultConfig,
coverageThreshold: {
global: {
statements: 97.6,
branches: 94.6,
functions: 97,
lines: 98,
statements: 97.89,
branches: 94.64,
functions: 97.33,
lines: 98.21,
},
},
};
2 changes: 1 addition & 1 deletion lambdas/functions/control-plane/package.json
@@ -8,7 +8,7 @@
"test": "NODE_ENV=test jest",
"test:watch": "NODE_ENV=test jest --watch",
"lint": "yarn eslint src",
"watch": "ts-node-dev --respawn --exit-child src/local.ts",
"watch": "ts-node-dev --respawn --exit-child src/local-ssm-housekeeper.ts",
"build": "ncc build src/lambda.ts -o dist",
"dist": "yarn build && cd dist && zip ../runners.zip index.js",
"format": "prettier --write \"**/*.ts\"",
39 changes: 33 additions & 6 deletions lambdas/functions/control-plane/src/lambda.test.ts
@@ -2,11 +2,12 @@ import { logger } from '@terraform-aws-github-runner/aws-powertools-util';
import { Context, SQSEvent, SQSRecord } from 'aws-lambda';
import { mocked } from 'jest-mock';

import { adjustPool, scaleDownHandler, scaleUpHandler } from './lambda';
import { adjustPool, scaleDownHandler, scaleUpHandler, ssmHousekeeper } from './lambda';
import { adjust } from './pool/pool';
import ScaleError from './scale-runners/ScaleError';
import { scaleDown } from './scale-runners/scale-down';
import { ActionRequestMessage, scaleUp } from './scale-runners/scale-up';
import { cleanSSMTokens } from './scale-runners/ssm-housekeeper';

const body: ActionRequestMessage = {
eventType: 'workflow_job',
@@ -61,6 +62,7 @@ const context: Context = {
jest.mock('./scale-runners/scale-up');
jest.mock('./scale-runners/scale-down');
jest.mock('./pool/pool');
jest.mock('./scale-runners/ssm-housekeeper');
jest.mock('@terraform-aws-github-runner/aws-powertools-util');

// Docs for testing async with jest: https://jestjs.io/docs/tutorial-async
@@ -87,7 +89,7 @@ describe('Test scale up lambda wrapper.', () => {
const error = new Error('Non scale should resolve.');
const mock = mocked(scaleUp);
mock.mockRejectedValue(error);
await expect(scaleUpHandler(sqsEvent, context)).resolves;
await expect(scaleUpHandler(sqsEvent, context)).resolves.not.toThrow();
});

it('Scale should be rejected', async () => {
@@ -110,7 +112,7 @@ async function testInvalidRecords(sqsRecords: SQSRecord[]) {
Records: sqsRecords,
};

await expect(scaleUpHandler(sqsEventMultipleRecords, context)).resolves;
await expect(scaleUpHandler(sqsEventMultipleRecords, context)).resolves.not.toThrow();

expect(logWarnSpy).toHaveBeenCalledWith(
expect.stringContaining(
@@ -127,14 +129,14 @@ describe('Test scale down lambda wrapper.', () => {
resolve();
});
});
await expect(scaleDownHandler({}, context)).resolves;
await expect(scaleDownHandler({}, context)).resolves.not.toThrow();
});

it('Scaling down with error.', async () => {
const error = new Error('Scaling down with error.');
const mock = mocked(scaleDown);
mock.mockRejectedValue(error);
await expect(await scaleDownHandler({}, context)).resolves;
await expect(scaleDownHandler({}, context)).resolves.not.toThrow();
});
});

@@ -146,7 +148,7 @@ describe('Adjust pool.', () => {
resolve();
});
});
await expect(adjustPool({ poolSize: 2 }, context)).resolves;
await expect(adjustPool({ poolSize: 2 }, context)).resolves.not.toThrow();
});

it('Handle error for adjusting pool.', async () => {
@@ -158,3 +160,28 @@ describe('Adjust pool.', () => {
expect(logSpy).lastCalledWith(expect.stringContaining(error.message), expect.anything());
});
});

describe('Test ssm housekeeper lambda wrapper.', () => {
it('Invoke without errors.', async () => {
const mock = mocked(cleanSSMTokens);
mock.mockImplementation(() => {
return new Promise((resolve) => {
resolve();
});
});

process.env.SSM_CLEANUP_CONFIG = JSON.stringify({
dryRun: false,
minimumDaysOld: 1,
tokenPath: '/path/to/tokens/',
});

await expect(ssmHousekeeper({}, context)).resolves.not.toThrow();
});

it('Errors not thrown.', async () => {
const mock = mocked(cleanSSMTokens);
mock.mockRejectedValue(new Error());
await expect(ssmHousekeeper({}, context)).resolves.not.toThrow();
});
});
13 changes: 13 additions & 0 deletions lambdas/functions/control-plane/src/lambda.ts
@@ -6,6 +6,7 @@ import { PoolEvent, adjust } from './pool/pool';
import ScaleError from './scale-runners/ScaleError';
import { scaleDown } from './scale-runners/scale-down';
import { scaleUp } from './scale-runners/scale-up';
import { SSMCleanupOptions, cleanSSMTokens } from './scale-runners/ssm-housekeeper';

export async function scaleUpHandler(event: SQSEvent, context: Context): Promise<void> {
setContext(context, 'lambda.ts');
@@ -48,3 +49,15 @@ export async function adjustPool(event: PoolEvent, context: Context): Promise<vo
logger.error(`${(e as Error).message}`, { error: e as Error });
}
}

export async function ssmHousekeeper(event: unknown, context: Context): Promise<void> {
setContext(context, 'lambda.ts');
logger.logEventIfEnabled(event);
const config = JSON.parse(process.env.SSM_CLEANUP_CONFIG) as SSMCleanupOptions;

try {
await cleanSSMTokens(config);
} catch (e) {
logger.error(`${(e as Error).message}`, { error: e as Error });
}
}
15 changes: 15 additions & 0 deletions lambdas/functions/control-plane/src/local-ssm-housekeeper.ts
@@ -0,0 +1,15 @@
import { cleanSSMTokens } from './scale-runners/ssm-housekeeper';

export function run(): void {
cleanSSMTokens({
dryRun: true,
minimumDaysOld: 3,
tokenPath: '/ghr/my-env/runners/tokens',
})
.then()
.catch((e) => {
console.log(e);
});
}

run();
1 change: 1 addition & 0 deletions lambdas/functions/control-plane/src/modules.d.ts
@@ -14,6 +14,7 @@ declare namespace NodeJS {
RUNNER_OWNER: string;
SCALE_DOWN_CONFIG: string;
SSM_TOKEN_PATH: string;
SSM_CLEANUP_CONFIG: string;
SUBNET_IDS: string;
INSTANCE_TYPES: string;
INSTANCE_TARGET_CAPACITY_TYPE: 'on-demand' | 'spot';
@@ -0,0 +1,117 @@
import { DeleteParameterCommand, GetParametersByPathCommand, SSMClient } from '@aws-sdk/client-ssm';
import { mockClient } from 'aws-sdk-client-mock';
import 'aws-sdk-client-mock-jest';
import { cleanSSMTokens } from './ssm-housekeeper';

process.env.AWS_REGION = 'eu-east-1';

const mockSSMClient = mockClient(SSMClient);

const deleteAmisOlderThenDays = 1;
const now = new Date();
const dateOld = new Date();
dateOld.setDate(dateOld.getDate() - deleteAmisOlderThenDays - 1);

const tokenPath = '/path/to/tokens/';

describe('clean SSM tokens / JIT config', () => {
beforeEach(() => {
mockSSMClient.reset();
mockSSMClient.on(GetParametersByPathCommand).resolves({
Parameters: undefined,
});
mockSSMClient.on(GetParametersByPathCommand, { Path: tokenPath }).resolves({
Parameters: [
{
Name: tokenPath + 'i-old-01',
LastModifiedDate: dateOld,
},
],
NextToken: 'next',
});
mockSSMClient.on(GetParametersByPathCommand, { Path: tokenPath, NextToken: 'next' }).resolves({
Parameters: [
{
Name: tokenPath + 'i-new-01',
LastModifiedDate: now,
},
],
NextToken: undefined,
});
});

it('should delete parameters older than minimumDaysOld', async () => {
await cleanSSMTokens({
dryRun: false,
minimumDaysOld: deleteAmisOlderThenDays,
tokenPath: tokenPath,
});

expect(mockSSMClient).toHaveReceivedCommandWith(GetParametersByPathCommand, { Path: tokenPath });
expect(mockSSMClient).toHaveReceivedCommandWith(DeleteParameterCommand, { Name: tokenPath + 'i-old-01' });
expect(mockSSMClient).not.toHaveReceivedCommandWith(DeleteParameterCommand, { Name: tokenPath + 'i-new-01' });
});

it('should not delete when dry run is activated', async () => {
await cleanSSMTokens({
dryRun: true,
minimumDaysOld: deleteAmisOlderThenDays,
tokenPath: tokenPath,
});

expect(mockSSMClient).toHaveReceivedCommandWith(GetParametersByPathCommand, { Path: tokenPath });
expect(mockSSMClient).not.toHaveReceivedCommandWith(DeleteParameterCommand, { Name: tokenPath + 'i-old-01' });
expect(mockSSMClient).not.toHaveReceivedCommandWith(DeleteParameterCommand, { Name: tokenPath + 'i-new-01' });
});

it('should not call delete when no parameters are found.', async () => {
await expect(
cleanSSMTokens({
dryRun: false,
minimumDaysOld: deleteAmisOlderThenDays,
tokenPath: 'no-exist',
}),
).resolves.not.toThrow();

expect(mockSSMClient).not.toHaveReceivedCommandWith(DeleteParameterCommand, { Name: tokenPath + 'i-old-01' });
expect(mockSSMClient).not.toHaveReceivedCommandWith(DeleteParameterCommand, { Name: tokenPath + 'i-new-01' });
});

it('should not error on delete failure.', async () => {
mockSSMClient.on(DeleteParameterCommand).rejects(new Error('ParameterNotFound'));

await expect(
cleanSSMTokens({
dryRun: false,
minimumDaysOld: deleteAmisOlderThenDays,
tokenPath: tokenPath,
}),
).resolves.not.toThrow();
});

it('should only accept valid options.', async () => {
await expect(
cleanSSMTokens({
dryRun: false,
minimumDaysOld: undefined as unknown as number,
tokenPath: tokenPath,
}),
).rejects.toBeInstanceOf(Error);

await expect(
cleanSSMTokens({
dryRun: false,
minimumDaysOld: 0,
tokenPath: tokenPath,
}),
).rejects.toBeInstanceOf(Error);

await expect(
cleanSSMTokens({
dryRun: false,
minimumDaysOld: 1,
tokenPath: undefined as unknown as string,
}),
).rejects.toBeInstanceOf(Error);
});
});
@@ -0,0 +1,61 @@
import { DeleteParameterCommand, GetParametersByPathCommand, SSMClient } from '@aws-sdk/client-ssm';
import { logger } from '@terraform-aws-github-runner/aws-powertools-util';

export interface SSMCleanupOptions {
dryRun: boolean;
minimumDaysOld: number;
tokenPath: string;
}

function validateOptions(options: SSMCleanupOptions): void {
const errorMessages: string[] = [];
if (!options.minimumDaysOld || options.minimumDaysOld < 1) {
errorMessages.push(`minimumDaysOld must be greater than 0, value is set to "${options.minimumDaysOld}"`);
}
if (!options.tokenPath) {
errorMessages.push('tokenPath must be defined');
}
if (errorMessages.length > 0) {
throw new Error(errorMessages.join(', '));
}
}

export async function cleanSSMTokens(options: SSMCleanupOptions): Promise<void> {
logger.info(`Cleaning tokens / JIT config older than ${options.minimumDaysOld} days, dryRun: ${options.dryRun}`);
logger.debug('Cleaning with options', { options });
validateOptions(options);

const client = new SSMClient({ region: process.env.AWS_REGION });
const parameters = await client.send(new GetParametersByPathCommand({ Path: options.tokenPath }));
while (parameters.NextToken) {
const nextParameters = await client.send(
new GetParametersByPathCommand({ Path: options.tokenPath, NextToken: parameters.NextToken }),
);
parameters.Parameters?.push(...(nextParameters.Parameters ?? []));
parameters.NextToken = nextParameters.NextToken;
}
logger.info(`Found #${parameters.Parameters?.length} parameters in path ${options.tokenPath}`);
logger.debug('Found parameters', { parameters });

// minimumDate = today - minimumDaysOld
const minimumDate = new Date();
minimumDate.setDate(minimumDate.getDate() - options.minimumDaysOld);

for (const parameter of parameters.Parameters ?? []) {
if (parameter.LastModifiedDate && new Date(parameter.LastModifiedDate) < minimumDate) {
logger.info(`Deleting parameter ${parameter.Name} with last modified date ${parameter.LastModifiedDate}`);
try {
if (!options.dryRun) {
// sleep 50 ms between deletes to avoid hitting SSM rate limits
await new Promise((resolve) => setTimeout(resolve, 50));
await client.send(new DeleteParameterCommand({ Name: parameter.Name }));
}
} catch (e) {
logger.warn(`Failed to delete parameter ${parameter.Name} with error ${(e as Error).message}`);
logger.debug('Failed to delete parameter', { e });
}
} else {
logger.debug(`Skipping parameter ${parameter.Name} with last modified date ${parameter.LastModifiedDate}`);
}
}
}
3 changes: 2 additions & 1 deletion main.tf
@@ -277,6 +277,8 @@ module "runners" {
pool_lambda_timeout = var.pool_lambda_timeout
pool_runner_owner = var.pool_runner_owner
pool_lambda_reserved_concurrent_executions = var.pool_lambda_reserved_concurrent_executions

ssm_housekeeper = var.runners_ssm_housekeeper
}

module "runner_binaries" {
@@ -318,7 +320,6 @@ module "runner_binaries" {
lambda_security_group_ids = var.lambda_security_group_ids
aws_partition = var.aws_partition


lambda_principals = var.lambda_principals
}

1 change: 1 addition & 0 deletions modules/multi-runner/README.md
@@ -166,6 +166,7 @@ module "multi-runner" {
| <a name="input_runners_lambda_zip"></a> [runners\_lambda\_zip](#input\_runners\_lambda\_zip) | File location of the lambda zip file for scaling runners. | `string` | `null` | no |
| <a name="input_runners_scale_down_lambda_timeout"></a> [runners\_scale\_down\_lambda\_timeout](#input\_runners\_scale\_down\_lambda\_timeout) | Time out for the scale down lambda in seconds. | `number` | `60` | no |
| <a name="input_runners_scale_up_lambda_timeout"></a> [runners\_scale\_up\_lambda\_timeout](#input\_runners\_scale\_up\_lambda\_timeout) | Time out for the scale up lambda in seconds. | `number` | `30` | no |
| <a name="input_runners_ssm_housekeeper"></a> [runners\_ssm\_housekeeper](#input\_runners\_ssm\_housekeeper) | Configuration for the SSM housekeeper lambda. This lambda deletes token / JIT config from SSM.<br><br> `schedule_expression`: is used to configure the schedule for the lambda.<br> `enabled`: enable or disable the lambda trigger via the EventBridge.<br> `lambda_timeout`: timeout for the lambda in seconds.<br> `config`: configuration for the lambda function. Token path will be read by default from the module. | <pre>object({<br> schedule_expression = optional(string, "rate(1 day)")<br> enabled = optional(bool, true)<br> lambda_timeout = optional(number, 60)<br> config = object({<br> tokenPath = optional(string)<br> minimumDaysOld = optional(number, 1)<br> dryRun = optional(bool, false)<br> })<br> })</pre> | <pre>{<br> "config": {}<br>}</pre> | no |
| <a name="input_ssm_paths"></a> [ssm\_paths](#input\_ssm\_paths) | The root path used in SSM to store configuration and secrets. | <pre>object({<br> root = optional(string, "github-action-runners")<br> app = optional(string, "app")<br> runners = optional(string, "runners")<br> })</pre> | `{}` | no |
| <a name="input_subnet_ids"></a> [subnet\_ids](#input\_subnet\_ids) | List of subnets in which the action runners will be launched, the subnets needs to be subnets in the `vpc_id`. | `list(string)` | n/a | yes |
| <a name="input_syncer_lambda_s3_key"></a> [syncer\_lambda\_s3\_key](#input\_syncer\_lambda\_s3\_key) | S3 key for syncer lambda function. Required if using S3 bucket to specify lambdas. | `string` | `null` | no |
2 changes: 2 additions & 0 deletions modules/multi-runner/runners.tf
@@ -104,4 +104,6 @@ module "runners" {
pool_runner_owner = each.value.runner_config.pool_runner_owner
pool_lambda_reserved_concurrent_executions = var.pool_lambda_reserved_concurrent_executions
associate_public_ipv4_address = var.associate_public_ipv4_address

ssm_housekeeper = var.runners_ssm_housekeeper
}