Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: MIT-0
- DISCLAIMER
- Data perimeter helper – Overview
- Getting started
- How to use data perimeter helper
- Data perimeter helper queries
- Data perimeter helper – Example use cases
- Data perimeter helper documentation
- Definitions
- Project structure
- Uninstallation
- Resources
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards.
AWS offers a set of features and capabilities you can use to implement a data perimeter to help prevent data disclosure and unintended access to your data. A data perimeter is a set of preventive controls you use to help ensure that only your trusted identities are accessing trusted resources from expected networks.
If you are new to data perimeter concepts, see Data perimeters on AWS.
Data perimeter helper is a tool that helps you design and anticipate the impact of your data perimeter controls by analyzing access activity in your AWS CloudTrail logs.
Data perimeter helper provides a set of queries tailored for specific data perimeter objectives. For example, you can use the data perimeter helper query `common_from_public_cidr_ipv4` to review API calls that are made from public IPv4 addresses and facilitate your network perimeter implementation.
Based on your business and security requirements, you define your data perimeters by setting what trusted identities, trusted resources and expected networks mean for your organization. You can document your data perimeter definition in the data perimeter helper configuration file to fine-tune data perimeter helper results.
Data perimeter helper uses tailored Amazon Athena queries to analyze your CloudTrail logs. It then performs data processing to enrich query results with your resource configuration information retrieved from AWS Config and AWS Organizations and to remove irrelevant information. And finally, it returns the API calls that do not adhere to your data perimeter definition, helping you to:
- Build and fine-tune your data perimeter policies.
- Assess potential impact on your workloads before deployment.
- Support troubleshooting activities after deployment.
The following is an example of data perimeter helper usage:
One of the security objectives pursued by companies is ensuring that their Amazon S3 buckets can be accessed only by AWS Identity and Access Management (IAM) principals belonging to their AWS organization. You can help achieve this security objective by implementing an identity perimeter on your buckets.
Before enforcing identity perimeter controls, you can run the data perimeter helper query `s3_bucket_policy_identity_perimeter_org_boundary` to identify principals that do not belong to your AWS organization and have performed API calls to your Amazon S3 buckets in a given time frame.
In this example, data perimeter helper performs the following actions:
- Analyzes CloudTrail events to identify principals that have performed Amazon S3 API calls in a given time frame.
- Removes API calls performed by principals belonging to your AWS organization and those documented as trusted in the data perimeter helper configuration file.
- Enriches results to ease analysis. For example, data perimeter helper adds a field `isAssumableBy` with the list of principals in IAM roles’ trust policies.
- Exports the results to a file for analysis (HTML, Excel, or JSON).
You can then review data perimeter helper results to identify principals not belonging to your AWS organization that performed API calls on your S3 buckets. If you determine that the principals have a legitimate reason to access your buckets, you can update your identity perimeter controls and document the principals as trusted in the data perimeter helper configuration file. Otherwise, you can investigate with your development teams to understand why the principals performed those API calls and proceed according to your policy.
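The filtering step in this example can be pictured with a minimal Python sketch. It is not the tool's implementation: the `ApiCall` record shape, the function name, and the account IDs are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiCall:
    """Simplified stand-in for one CloudTrail record (hypothetical shape)."""
    principal_arn: str
    principal_account_id: str
    event_name: str
    bucket_name: str


def find_calls_outside_identity_perimeter(calls, org_account_ids, trusted_principal_arns):
    """Keep only calls made by principals that are neither in the
    organization nor documented as trusted."""
    return [
        call for call in calls
        if call.principal_account_id not in org_account_ids
        and call.principal_arn not in trusted_principal_arns
    ]


calls = [
    ApiCall("arn:aws:iam::111111111111:role/app", "111111111111", "GetObject", "my-bucket"),
    ApiCall("arn:aws:iam::999999999999:role/partner", "999999999999", "GetObject", "my-bucket"),
    ApiCall("arn:aws:iam::888888888888:role/unknown", "888888888888", "PutObject", "my-bucket"),
]
remaining = find_calls_outside_identity_perimeter(
    calls,
    org_account_ids={"111111111111"},
    trusted_principal_arns={"arn:aws:iam::999999999999:role/partner"},
)
for call in remaining:
    print(call.principal_arn, call.event_name)
```

Only the call from the account that is neither in the organization nor documented as trusted remains for review.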
Note that data perimeter helper is intended to accelerate your analysis, not to replace human analysis. The helper relies on CloudTrail logs to help you reason about potential impacts of policies you write. Although CloudTrail provides information about the parameters of a request, it does not reflect the values of IAM condition keys present in the request.
Data perimeter helper is built on Python. To run it, you need the following to be installed on your system:
- Python 3 (tested with 3.11.4).
- pip (tested with 23.2.1).
- virtualenv (tested with 20.24.6). You can install virtualenv by using the following command:
$ python -m pip install virtualenv
Data perimeter helper uses AWS services and features. Be sure to configure the following in your AWS environment:
- AWS Organizations.
- A CloudTrail organization trail with management events enabled and sent to a central Amazon S3 bucket. If you use AWS services that support CloudTrail data events and want to analyze the associated API calls, enable the relevant data events.
- An Athena table to perform queries on CloudTrail logs stored in the previously mentioned Amazon S3 bucket:
Data perimeter helper queries expect a specific table schema. You can use the following query to create a table that meets the requirement: ./prerequisites/athena/organization_trail_athena_table.sql.
Data perimeter helper supports a setup where management events are stored in one bucket and data events in another.
- AWS Config organization aggregator.
To accelerate the deployment of the preceding prerequisites, you can use the following Terraform templates: ./prerequisites/terraform/README.md.
- (Optional) AWS IAM Access Analyzer external access analyzer.
You can configure data perimeter helper to retrieve IAM Access Analyzer external access findings from AWS IAM Access Analyzer or AWS Security Hub.
Security Hub provides cross-region aggregation, which makes it easier to retrieve findings across your organization.
If you choose Security Hub, you need to have Security Hub cross-region aggregation enabled for your organization.
Data perimeter helper uses three IAM principals, plus an optional fourth, to perform its operations:
- A principal with permissions to run Athena queries, read CloudTrail logs, and read/write Athena query results.
- A principal with AWS Config permissions to run AWS Config advanced queries.
- A principal with permissions to list AWS accounts in AWS Organizations.
- (Optional) A principal with permissions to get IAM Access Analyzer external access findings.
The following diagram demonstrates how these permissions are utilized by the tool:
Optionally, you can use a single IAM principal with all required permissions granted in its permissions policy. For sample IAM policies, see: ./prerequisites/dph/dph_principals.md.
To install data perimeter helper, follow these steps:
- At the root level of the project, create a Python virtual environment by using the following command:
$ python -m venv .venv
- Activate the virtual environment by using the following command:
$ source .venv/bin/activate (on Unix)
% .venv/Scripts/activate.ps1 (on Windows)
- Upgrade pip to the latest version by using the following command:
$ python -m pip install --upgrade pip
- Install the data perimeter helper package by using the following command:
$ pip install -e ./
- Test that the `data_perimeter_helper` package has been correctly installed by using the following command:
$ data_perimeter_helper --version
The command line will display the current version of the helper:
Data perimeter helper - vX.Y.Z
With data perimeter helper installed, the next step is to provide your configuration parameters:
- Update variables.yaml by specifying the names of AWS CLI credential profiles, the name of your Athena table, and other parameters specific to your environment.
For an example, see ./prerequisites/dph/variable.sample.yaml.
- Update the data perimeter helper configuration file data_perimeter.yaml with your data perimeter definition by entering your trusted identities, expected networks, and trusted resources.
For an example, see ./prerequisites/dph/data_perimeter_sample.yaml.
The following list provides common data perimeter helper commands:
- Get help:
$ data_perimeter_helper/dph -h
- Run a specific query on one account:
$ dph --list-account/-la <ACCOUNT_ID> --list-query/-lq <QUERY_NAME>
You can use `--list-account/-la` with an account ID.
- Run a specific query on multiple accounts.
You can use `--list-account/-la` with multiple account IDs or account names:
$ dph --list-account/-la <ACCOUNT_ID_1> <ACCOUNT_ID_2> -lq <QUERY_NAME>
If an account name contains spaces, put the name in quotes.
You can run your query across your entire organization by setting `--list-account/-la` to `all`:
$ dph --list-account/-la all -lq <QUERY_NAME>
If you want to run your query against accounts that descend from specific organizational units, you can use `--list-ou/-lo` with multiple organizational unit IDs:
$ dph --list-ou/-lo <OU_ID_1> <OU_ID_2> -lq <QUERY_NAME>
- Run a specific query on one account using the substring of a query name:
$ dph -la <ACCOUNT_ID> -lq <SUBSTRING_OF_QUERY_NAME>
For example, by using `bucket` as a substring, any query name containing `bucket` would be selected.
- Run multiple queries on multiple accounts:
$ dph -la <ACCOUNT_ID_1> <ACCOUNT_ID_2> -lq <QUERY_NAME_1> <QUERY_NAME_2>
- Use a custom variables file:
By default, variable values are retrieved from the variables.yaml file.
You can override this behavior with the `--variable-file/-vf` flag, for example if you have multiple configuration profiles.
$ dph -la <ACCOUNT_ID> -lq <QUERY_NAME> --variable-file/-vf my_custom_variable_file.yaml
- Specify the export file format:
Supported formats: HTML, Excel, and JSON.
By default, if the `--export-format/-ef` flag is not set, HTML and Excel formats are used.
$ dph --export-format/-ef html excel json
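To make the flag semantics above concrete, here is a rough `argparse` sketch. It only re-creates the documented flags for illustration and is an assumption about their behavior, not the tool's own parser. `build_parser` and `select_queries` are hypothetical names.

```python
import argparse
import shlex


def build_parser():
    # Hypothetical re-creation of the documented flags; not the tool's parser.
    parser = argparse.ArgumentParser(prog="dph")
    parser.add_argument("-la", "--list-account", nargs="+", default=[])
    parser.add_argument("-lo", "--list-ou", nargs="+", default=[])
    parser.add_argument("-lq", "--list-query", nargs="+", default=[])
    parser.add_argument("-vf", "--variable-file", default="variables.yaml")
    parser.add_argument("-ef", "--export-format", nargs="+",
                        choices=["html", "excel", "json"],
                        default=["html", "excel"])
    return parser


def select_queries(available, patterns):
    """Return every available query whose name contains one of the patterns."""
    return [name for name in available if any(p in name for p in patterns)]


# shlex.split mimics shell quoting: the quoted account name stays one argument.
args = build_parser().parse_args(
    shlex.split('-la 111111111111 "My Sandbox Account" -lq bucket')
)
available = [
    "common_from_public_cidr_ipv4",
    "s3_bucket_policy_identity_perimeter_org_boundary",
    "sns_network_perimeter_ipv4",
]
selected = select_queries(available, args.list_query)
print(args.list_account, args.export_format, selected)
```

The sketch shows why an account name with spaces needs quotes: without them, the shell would pass each word as a separate account value.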
You can use `common` queries to analyze activity in your AWS organization against data perimeter objectives without focusing on a specific AWS service. Example: review all API calls from public IPv4 addresses.
The `common` queries are prefixed with the keyword `common` and are not tied to a specific AWS service. For more information, see: ./data_perimeter_helper/queries/common/README.md.
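As an illustration of the public IPv4 example, the following sketch classifies a CloudTrail source address with Python's standard `ipaddress` module. The helper name `is_public_ipv4` is hypothetical, and the actual query may apply different rules (for instance, in Athena SQL rather than Python).

```python
import ipaddress


def is_public_ipv4(source_ip_address: str) -> bool:
    """Return True only for globally routable IPv4 addresses."""
    try:
        ip = ipaddress.ip_address(source_ip_address)
    except ValueError:
        # The field can hold a service DNS name instead of an IP address.
        return False
    return ip.version == 4 and ip.is_global


for value in ["8.8.8.8", "10.0.0.12", "cloudtrail.amazonaws.com"]:
    print(value, is_public_ipv4(value))
```

Non-IP values, such as a service DNS name, are treated as not public here so they can be handled by separate logic.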
You can use `referential` queries to get insights on your resource configurations.
The `referential` queries rely on AWS Config advanced queries and AWS Organizations API calls.
The `referential` queries are prefixed with the keyword `referential` and are not tied to a specific AWS service. For more information, see: ./data_perimeter_helper/queries/common/README.md.
You can use `findings` queries to get insights on your AWS IAM Access Analyzer external access findings.
The `findings` queries rely on AWS IAM Access Analyzer or AWS Security Hub API calls, depending on the value of the variable `external_access_findings` in the variables file.
The `findings` queries are prefixed with the keyword `findings` and are not tied to a specific AWS service. For more information, see: ./data_perimeter_helper/queries/findings/README.md.
You can use `s3` queries to analyze activity in your AWS organization against data perimeter objectives while focusing exclusively on Amazon S3 API calls.
The `s3` queries are prefixed with the keyword `s3`. For more information, see: ./data_perimeter_helper/queries/s3/README.md.
You can use `sns` queries to analyze activity in your AWS organization against data perimeter objectives while focusing exclusively on Amazon SNS API calls.
The `sns` queries are prefixed with the keyword `sns`. For more information, see: ./data_perimeter_helper/queries/sns/README.md.
You can add your custom queries to the tool by using a query template file: ./data_perimeter_helper/queries/template.py.
Follow these steps to add a new query:
- Copy and paste the query template file in a direct child folder of ./data_perimeter_helper/queries/ (such as `common` or `s3`). You can create new subfolders, if necessary.
- Rename the file and the Python class name `query_name_replace_me` inside it with your query name. The file and class names must match (for example, file name: `sns_identity_perimeter.py`, class name: `sns_identity_perimeter`). Also, the file and class names must be prefixed with the folder name (for example, folder name: `s3`, file name: `s3_my_query_name`).
- Configure your class with the applicable Athena SQL query and your data processing logic (see the following section for detailed instructions).
Queries are then automatically discovered by data perimeter helper and made available for use.
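The naming rules above can be checked before you run the tool. The following sketch is a hypothetical pre-flight helper (`check_query_naming` is not part of data perimeter helper) that applies the two rules as stated: the file and class names match, and both start with the folder name.

```python
from pathlib import PurePosixPath


def check_query_naming(file_path: str, class_name: str) -> list:
    """Return a list of naming problems; an empty list means the rules hold."""
    path = PurePosixPath(file_path)
    folder, stem = path.parent.name, path.stem
    problems = []
    if stem != class_name:
        problems.append(f"file name '{stem}' and class name '{class_name}' differ")
    if not stem.startswith(folder):
        problems.append(f"file name '{stem}' is not prefixed with folder name '{folder}'")
    return problems


print(check_query_naming(
    "data_perimeter_helper/queries/s3/s3_my_query_name.py", "s3_my_query_name"
))
print(check_query_naming(
    "data_perimeter_helper/queries/s3/my_query_name.py", "my_query_name"
))
```

The first call reports no problems; the second reports the missing folder prefix.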
Data perimeter helper queries are defined as Python classes. All queries inherit from their parent class `Query` defined in Query.py. For example, the data perimeter helper query `s3_scp_network_perimeter_ipv4` is declared with `class s3_scp_network_perimeter_ipv4(Query)`.
To configure a query, you need to update three main sections in your query file:
- Update the Python class constructor `__init__`:
  - Set the variable `depends_on_resource_type` with resource types you use as part of your data processing logic. Data perimeter helper uses this information to parallelize retrieval of resource configuration data with threading to speed up query execution.
    - If you do not declare resource dependencies, data perimeter helper will retrieve resource configuration information during its execution without threading.
    - If you use AWS Config advanced queries to retrieve the resource configuration information, you need to specify resource type values supported by AWS Config (for example, `AWS::S3::Bucket`, which is case sensitive). See the list of AWS Config supported resource types.
  - Set the variable `depends_on_iam_access_analyzer` to `True` if your query relies on AWS IAM Access Analyzer external access findings.
  - Set the variable `use_split_table`:
    - If you store CloudTrail management events and data events in two different buckets:
      - Set `use_split_table = True` if you want to analyze data events as part of your query.
      - Set `use_split_table = False` if the analyzed services do not support data events or you do not want to analyze data events.
    - If you store all CloudTrail logs in one bucket:
      - Set `use_split_table = False`.
- Update the `generate_athena_statement` function with your Athena SQL query. The template file provides the expected skeleton of the Athena query.
- Update the `submit_query` function with your data processing steps performed after the Athena SQL query result is returned.
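The three sections can be visualized with the following sketch. It is only an approximation: the `Query` class here is a simplified stand-in for the real parent class, and the method signatures, table name, and SQL columns are assumptions. Use ./data_perimeter_helper/queries/template.py as the authoritative starting point.

```python
class Query:
    """Simplified stand-in for the parent class in Query.py (assumption)."""

    def __init__(self, name, depends_on_resource_type=None,
                 depends_on_iam_access_analyzer=False, use_split_table=False):
        self.name = name
        self.depends_on_resource_type = depends_on_resource_type or []
        self.depends_on_iam_access_analyzer = depends_on_iam_access_analyzer
        self.use_split_table = use_split_table


class s3_my_query_name(Query):
    def __init__(self):
        # Section 1: declare dependencies and the CloudTrail table layout.
        super().__init__(
            name="s3_my_query_name",
            depends_on_resource_type=["AWS::S3::Bucket"],
            depends_on_iam_access_analyzer=False,
            use_split_table=True,
        )

    def generate_athena_statement(self):
        # Section 2: placeholder SQL; table and column names are illustrative.
        return (
            "SELECT eventname, sourceipaddress "
            "FROM my_cloudtrail_table "
            "WHERE eventsource = 's3.amazonaws.com'"
        )

    def submit_query(self, rows):
        # Section 3: post-processing; here, drop duplicate rows and keep order.
        seen, unique = set(), []
        for row in rows:
            key = (row["eventname"], row["sourceipaddress"])
            if key not in seen:
                seen.add(key)
                unique.append(row)
        return unique


query = s3_my_query_name()
rows = [
    {"eventname": "GetObject", "sourceipaddress": "8.8.8.8"},
    {"eventname": "GetObject", "sourceipaddress": "8.8.8.8"},
    {"eventname": "PutObject", "sourceipaddress": "8.8.8.8"},
]
print(query.generate_athena_statement())
print(query.submit_query(rows))
```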
The `Query` class provides common attributes and functions shared across all queries. You can, for instance, use the following functions for your data processing logic (see Available functions for data processing for more details):
- add_column_vpc_id
- add_column_vpce_account_id
- add_column_is_service_role
As part of your data processing, you might need to retrieve resource configuration information from an AWS Config aggregator. In data perimeter helper, resources are defined as Python classes and located in the ./data_perimeter_helper/referential/ folder. All resource classes inherit from their parent `ResourceType` defined in ResourceType.py. For example, the resource `iam_role`/`AWS::IAM::Role` is declared with `class iam_role(ResourceType)`.
Resources are defined by using their resource type with the following format: `AWS::<SERVICE_NAME>::<RESOURCE_NAME>`. If you use AWS Config advanced queries to retrieve the resource configuration information, you need to specify resource type values supported by AWS Config (for example, `AWS::S3::Bucket`, which is case sensitive). See the list of AWS Config supported resource types.
A generic AWS Config advanced query is available and described in ./data_perimeter_helper/referential/generic.py:
SELECT
accountId, awsRegion, resourceId
WHERE
resourceType = '{resource_type}'
This query allows you to retrieve `accountId`, `awsRegion`, and `resourceId` for any resource type inventoried by an AWS Config aggregator. If you need to retrieve additional configuration parameters, you can create a custom query by using the generic query as a template. For an example of a custom query, see ./data_perimeter_helper/referential/iam_role.py.
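As a sketch of how the `{resource_type}` placeholder in the generic query might be filled in, the following snippet formats the query text for a given resource type. The function name `build_config_query` and the format check are assumptions added for illustration; see generic.py for the tool's actual logic.

```python
import re

GENERIC_QUERY = """
SELECT
    accountId, awsRegion, resourceId
WHERE
    resourceType = '{resource_type}'
"""

# Assumed shape: AWS::<SERVICE_NAME>::<RESOURCE_NAME>, case sensitive.
RESOURCE_TYPE_PATTERN = re.compile(r"^AWS::[A-Za-z0-9]+::[A-Za-z0-9]+$")


def build_config_query(resource_type: str) -> str:
    """Fill the generic query, rejecting malformed resource types early."""
    if not RESOURCE_TYPE_PATTERN.match(resource_type):
        raise ValueError(f"unexpected resource type format: {resource_type!r}")
    return GENERIC_QUERY.format(resource_type=resource_type).strip()


print(build_config_query("AWS::S3::Bucket"))
```

Checking the format first catches mistakes such as a single colon (`AWS::IAM:Role`) before the query is sent.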
The following custom resource types are available to speed up the configuration information retrieval process:
- `AWS::EC2::VPCEndpoint::<SERVICE_NAME>`: Use this resource to retrieve configuration information for VPC endpoints of a given service name. To retrieve configuration information for all VPC endpoints, use `AWS::EC2::VPCEndpoint`.
- `AWS::Organizations::Tree`: Use this resource to retrieve information about the organization structure. This resource provides the list of account IDs, account names, parent lists, and organizational unit boundaries. You need this resource for queries performed at the organizational unit boundary.
- `AWS::Organizations::Account`: Provides only the list of account IDs and names.
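The relationship between `AWS::EC2::VPCEndpoint` and its service-scoped variant can be sketched as a small parser. This is purely illustrative: `split_vpce_resource_type` is a hypothetical helper, and the `s3` service name is an assumed example value for `<SERVICE_NAME>`.

```python
VPCE_PREFIX = "AWS::EC2::VPCEndpoint"


def split_vpce_resource_type(resource_type: str):
    """Return (base_type, service_name); service_name is None for all endpoints."""
    if resource_type == VPCE_PREFIX:
        return VPCE_PREFIX, None
    if resource_type.startswith(VPCE_PREFIX + "::"):
        return VPCE_PREFIX, resource_type[len(VPCE_PREFIX) + 2:]
    raise ValueError(f"not a VPC endpoint resource type: {resource_type!r}")


print(split_vpce_resource_type("AWS::EC2::VPCEndpoint"))
print(split_vpce_resource_type("AWS::EC2::VPCEndpoint::s3"))
```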
The following functions are available in the `Query` class. You can use them to enrich your query results:
- `add_column_vpc_id`: Add a column with VPC IDs of VPC endpoints recorded in CloudTrail events.
- `add_column_vpce_account_id`: Add a column with account IDs of VPC endpoints recorded in CloudTrail events.
- `add_column_is_assumable_by`: Add a column with values of the principal element in trust policies for IAM roles recorded in CloudTrail events.
- `add_column_is_service_role`: Add a column with a Boolean value to denote if an IAM principal recorded in CloudTrail events is a service role.
- `add_column_is_service_linked_role`: Add a column with a Boolean value to denote if an IAM principal recorded in CloudTrail events is a service-linked role.
- `remove_calls_from_service_on_behalf_of_principal`: Remove a subset of API calls made by an AWS service using forward access sessions (FAS):
  - Remove from the query results the API calls made from an AWS service network by using a service role and where the `sourceipaddress` field in the CloudTrail record is populated with the service’s DNS name that does not match the one specified in the role’s trust policy.
  - Remove from the query results the API calls made from an AWS service network by a principal that is neither a service role nor a service-linked role and where the `sourceipaddress` field in the CloudTrail record is populated with the service’s DNS name.
- `remove_trusted_vpc_id`: Remove calls made from VPCs whose IDs are documented as expected in the data_perimeter.yaml file.
- `remove_resource_exception`: Remove calls where values of relevant request parameters match resource-specific exceptions documented in the data_perimeter.yaml file.
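The two removal rules described for `remove_calls_from_service_on_behalf_of_principal` can be read as a predicate over enriched rows. The sketch below is one interpretation of that description, not the function's actual code. The row keys and the `.amazonaws.com` suffix test for service DNS names are assumptions.

```python
def is_service_dns_name(value: str) -> bool:
    # Assumption: service DNS names look like "<service>.amazonaws.com".
    return value.endswith(".amazonaws.com")


def is_call_on_behalf_of_principal(row: dict) -> bool:
    """Return True if the row matches one of the two documented removal rules."""
    source = row["sourceipaddress"]
    if not is_service_dns_name(source):
        return False
    if row["is_service_linked_role"]:
        return False
    if row["is_service_role"]:
        # Rule 1: service role, but the calling service is not in its trust policy.
        return source not in row["is_assumable_by"]
    # Rule 2: neither a service role nor a service-linked role.
    return True


rows = [
    {"sourceipaddress": "athena.amazonaws.com", "is_service_role": True,
     "is_service_linked_role": False, "is_assumable_by": ["ec2.amazonaws.com"]},
    {"sourceipaddress": "ec2.amazonaws.com", "is_service_role": True,
     "is_service_linked_role": False, "is_assumable_by": ["ec2.amazonaws.com"]},
    {"sourceipaddress": "cloudformation.amazonaws.com", "is_service_role": False,
     "is_service_linked_role": False, "is_assumable_by": []},
    {"sourceipaddress": "8.8.8.8", "is_service_role": False,
     "is_service_linked_role": False, "is_assumable_by": []},
]
kept = [row for row in rows if not is_call_on_behalf_of_principal(row)]
print([row["sourceipaddress"] for row in kept])
```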
A data perimeter is a set of preventive guardrails in your AWS environment you use to help ensure that only your trusted identities are accessing trusted resources from expected networks.
In the Data Perimeters Blog Post Series, we cover in depth the objectives and foundational elements needed to enforce each perimeter type:
- Identity perimeter: Allow only trusted identities to access company data.
- Network perimeter: Allow access to company data only from expected networks.
- Resource perimeter: Allow only trusted resources from my organization.
The following is a high-level diagram of controls that compose a data perimeter:
See the post Establishing a data perimeter on AWS: Analyze your account activity to evaluate impact and refine controls.
You can use `dph_doc` (installed with the `data_perimeter_helper` package) to generate the documentation of data perimeter helper queries automatically.
`dph_doc` parses the query definition and matches the instructions with previously documented instructions in dph_doc.py.
- To generate the documentation of all queries, you can use the following command:
$ dph_doc
- To generate documentation for specific queries, you can use the following command:
$ dph_doc -lq <QUERY_NAME>
For example, for queries on Amazon S3:
$ dph_doc -lq s3
`dph_doc` generates README files named `README.auto.local.md` in each query folder.
You can then use these files as input to create your own README files.
A principal is a human user or workload that can make a request for an action or operation on an AWS resource.
A service principal is an identifier for a service. The service principal is defined and owned by the service. Example: cloudtrail.amazonaws.com.
A service role is an IAM role that can be assumed by an AWS service.
A service-linked role is a type of service role that is linked to an AWS service. The service can assume the role to perform an action on your behalf. Service-linked roles appear in your AWS account and are owned by the service. An IAM administrator can view but not edit the permissions for service-linked roles.
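The distinction between a service role and a service-linked role can be sketched in Python. This is a simplified illustration with hypothetical helper names: it relies on service-linked role ARNs using the `/aws-service-role/` path and on service roles naming a service principal in their trust policy. The tool's own detection logic may differ.

```python
def is_service_linked_role(role_arn: str) -> bool:
    """Service-linked roles live under the /aws-service-role/ path."""
    return ":role/aws-service-role/" in role_arn


def is_service_role(trust_policy: dict) -> bool:
    """True if any Allow statement trusts a service principal."""
    statements = trust_policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    for statement in statements:
        principal = statement.get("Principal", {})
        if (statement.get("Effect") == "Allow"
                and isinstance(principal, dict)
                and principal.get("Service")):
            return True
    return False


trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "cloudtrail.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
print(is_service_role(trust_policy))
print(is_service_linked_role(
    "arn:aws:iam::111111111111:role/aws-service-role/"
    "config.amazonaws.com/AWSServiceRoleForConfig"
))
```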
Project Structure
├── data_perimeter_helper/
│   ├── queries/
│   │   ├── common/
│   │   │   ├── README.md
│   │   │   ├── common_from_public_cidr_ipv4.py
│   │   │   └── <other_queries>
│   │   ├── referential/
│   │   │   └── referential_service_role.py
│   │   ├── s3/
│   │   │   ├── README.md
│   │   │   ├── s3_bucket_policy_identity_perimeter_org_boundary.py
│   │   │   └── <other_queries>
│   │   ├── sns/
│   │   │   ├── README.md
│   │   │   └── sns_network_perimeter_ipv4.py
│   │   ├── helper.py
│   │   ├── import_query.py
│   │   ├── Query.py
│   │   └── template.py
│   ├── referential/
│   │   ├── account.py
│   │   ├── config_adv.py
│   │   ├── generic.py
│   │   ├── iam_role.py
│   │   ├── import_referential.py
│   │   ├── Referential.py
│   │   ├── ResourceType.py
│   │   ├── s3_bucket.py
│   │   └── vpce.py
│   ├── toolbox/
│   │   ├── cli.py
│   │   ├── exporter.py
│   │   └── utils.py
│   ├── __init__.py
│   ├── data_perimeter.yaml
│   ├── main.py
│   ├── variables.local.yaml
│   ├── variables.py
│   └── variables.yaml
├── docs/
├── LICENSE
├── LICENSE-SAMPLECODE
├── lint/
│   ├── python_licenses.txt
│   └── .flake8
├── outputs/
├── prerequisites/
│   ├── athena/
│   │   └── organization_trail_athena_table.sql
│   ├── dph/
│   │   ├── dph_principals.md
│   │   └── variable.sample.yaml
│   └── terraform/
│       ├── examples/
│       │   ├── cloudtrail_local_trail/
│       │   │   ├── data.tf
│       │   │   ├── local.tf
│       │   │   ├── main.tf
│       │   │   ├── provider.tf
│       │   │   ├── root.auto_.tfvars
│       │   │   └── variables.tf
│       │   ├── cloudtrail_org_trail/
│       │   │   ├── data.tf
│       │   │   ├── local.tf
│       │   │   ├── main.tf
│       │   │   ├── provider.tf
│       │   │   ├── root.auto_.tfvars
│       │   │   └── variables.tf
│       │   ├── config_org_aggregator/
│       │   │   ├── data.tf
│       │   │   ├── main.tf
│       │   │   ├── provider.tf
│       │   │   ├── root.auto_.tfvars
│       │   │   └── variables.tf
│       │   └── config_with_invit_aggregator/
│       │       ├── data.tf
│       │       ├── main.tf
│       │       ├── provider.tf
│       │       ├── root.auto_.tfvars
│       │       └── variables.tf
│       ├── modules/
│       │   ├── athena_workgroup/
│       │   │   ├── athena.tf
│       │   │   ├── athena_output_bucket.tf
│       │   │   ├── athena_output_kms_key.tf
│       │   │   ├── athena_queries/
│       │   │   │   ├── cloudtrail_create_table_local.sql.tftpl
│       │   │   │   └── cloudtrail_create_table_org.sql.tftpl
│       │   │   ├── data.tf
│       │   │   ├── local.tf
│       │   │   ├── provider.tf
│       │   │   └── variables.tf
│       │   ├── cloudtrail_bucket/
│       │   │   ├── cloudtrail_logs_bucket.tf
│       │   │   ├── cloudtrail_logs_kms_key.tf
│       │   │   ├── data.tf
│       │   │   ├── local.tf
│       │   │   ├── outputs.tf
│       │   │   ├── provider.tf
│       │   │   └── variables.tf
│       │   ├── cloudtrail_trail/
│       │   │   ├── data.tf
│       │   │   ├── local.tf
│       │   │   ├── provider.tf
│       │   │   ├── trail.tf
│       │   │   └── variables.tf
│       │   ├── config_aggregator_central/
│       │   │   ├── aggregator.tf
│       │   │   ├── org_aggregator_role.tf
│       │   │   ├── provider.tf
│       │   │   └── variables.tf
│       │   └── config_aggregator_invited/
│       │       ├── authorization.tf
│       │       ├── provider.tf
│       │       └── variables.tf
│       └── README.md
├── tests/
│   ├── conftest.py
│   ├── context.py
│   ├── for_test_utils.py
│   ├── test_end_to_end.py
│   └── test_help.py
├── pyproject.toml
├── README.md
├── CHANGELOG.md
├── CODE_OF_CONDUCTS
├── CONTRIBUTING
├── requirements.txt
└── setup.py
- Use the following command to uninstall the `data_perimeter_helper` package:
$ pip uninstall data_perimeter_helper
- Use the following command to exit the virtual environment:
$ deactivate
See the following resources for more insights: