This repository contains code samples that will show you how to analyze document and image files stored in S3 using the following techniques:
- Using Large Language Models (LLMs) hosted on Amazon Bedrock.
- Using Amazon Bedrock Data Automation.
Both of these approaches show you how to generate metadata based on the content of the files and store it as key-value pairs (tags) in an Amazon DynamoDB table, with a reference to the files in S3.
Amazon S3 is a popular object storage service on AWS. You can store any type of file as an object in an S3 bucket. Although you can organize files of a specific type or context under a specific directory structure (path) in S3, it is useful to add metadata to the files, such as a content description, owner, and context, so you can easily retrieve the file that you are looking for. There are two ways to do this:
Option 1: Use the user-defined metadata feature in S3
When uploading an object to an S3 bucket, you can optionally assign user-defined metadata as key-value pairs to the object. This metadata is stored along with the object, but it cannot be added later to an existing object: the only way to modify object metadata is to make a copy of the object and set the metadata on the copy.
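Here is a minimal sketch of both steps with boto3; the bucket name, key, and tag values are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Upload an object with user-defined metadata (stored as x-amz-meta-* headers).
with open("assets/Document_1.pdf", "rb") as f:
    s3.put_object(
        Bucket="my-input-bucket",  # placeholder bucket name
        Key="documents/Document_1.pdf",
        Body=f,
        Metadata={"description": "Sample invoice", "owner": "finance-team"},
    )

# Changing metadata on an existing object means copying it over itself with REPLACE.
s3.copy_object(
    Bucket="my-input-bucket",
    Key="documents/Document_1.pdf",
    CopySource={"Bucket": "my-input-bucket", "Key": "documents/Document_1.pdf"},
    Metadata={"description": "Sample invoice", "owner": "finance-team", "context": "2024-Q1"},
    MetadataDirective="REPLACE",
)
```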
Option 2: Store the metadata in an external system with a reference to the object in S3
If you want to set metadata on an existing object in S3 without copying that object, or if you want to add to a metadata system that already exists, then it makes sense to store the metadata in an external system, such as an Amazon DynamoDB table. This option also applies if the data is stored outside S3 and needs to be tagged with metadata.
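A minimal sketch of this approach with boto3, assuming a DynamoDB table named file-metadata whose partition key is the S3 URI of the object:

```python
import boto3

# Assumed table: partition key "s3_uri" (string); the table name is a placeholder.
table = boto3.resource("dynamodb").Table("file-metadata")

# Store the tags as item attributes, keyed by a reference to the S3 object.
table.put_item(
    Item={
        "s3_uri": "s3://my-input-bucket/documents/Document_1.pdf",
        "description": "Sample invoice",
        "owner": "finance-team",
        "context": "2024-Q1",
    }
)
```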
In both of these options, if you do not know the metadata that describes the data stored in the object, you have to read the object, analyze its content, and generate the appropriate metadata. This is where AI can help.
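As a rough sketch of that idea (not the exact prompt, model, or parsing used in the notebooks), you can send the file bytes to an LLM through the Amazon Bedrock Converse API and ask it to propose tags:

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

with open("assets/Document_1.pdf", "rb") as f:
    document_bytes = f.read()

# Ask the model to propose metadata key-value pairs describing the document.
response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # any Converse-capable model
    messages=[
        {
            "role": "user",
            "content": [
                {"document": {"format": "pdf", "name": "Document 1", "source": {"bytes": document_bytes}}},
                {"text": "Describe this document as JSON key-value pairs: description, owner, context."},
            ],
        }
    ],
)

print(response["output"]["message"]["content"][0]["text"])
```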
- Choose an AWS Account to use and make sure to create all resources in that Account.
- Identify an AWS Region that has Amazon Bedrock with Amazon Nova 1.0 or Anthropic Claude 3/3.5 or Meta Llama 3.2 models and Amazon Bedrock Data Automation (a quick way to check model availability with boto3 is sketched after this list).
- In that Region, create two new Amazon S3 buckets or use two existing buckets of your choice. These will serve as the input and output S3 buckets.
- In the S3 bucket that you designate as the input bucket, upload all the files from the assets folder.
- In that same Region, create a new Amazon SageMaker notebook instance with Amazon Linux 2, Jupyter Lab 4 (notebook-al2-v3) as the Platform Identifier.
- Clone this GitHub repo to that notebook instance.
- In that notebook instance, open the relevant Jupyter notebook by navigating to the Amazon SageMaker notebook instances console and clicking on the Open Jupyter link:
  - For using Large Language Models (LLMs) hosted on Amazon Bedrock, open file-tagger-with-bedrock-llms.ipynb.
  - For using Amazon Bedrock Data Automation, open file-tagger-with-bedrock-data-automation.ipynb.
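To confirm that your chosen Region offers the models listed above, one option (a sketch; it does not check Amazon Bedrock Data Automation availability) is to query the Bedrock control plane with boto3:

```python
import boto3

# List the foundation models available in the chosen Region (placeholder Region shown).
bedrock = boto3.client("bedrock", region_name="us-east-1")
model_ids = [m["modelId"] for m in bedrock.list_foundation_models()["modelSummaries"]]

# Check for Amazon Nova, Anthropic Claude 3/3.5, and Meta Llama 3.2 model IDs.
for prefix in ("amazon.nova", "anthropic.claude-3", "meta.llama3-2"):
    print(prefix, "available:", any(prefix in model_id for model_id in model_ids))
```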
This repository contains:
- Two Jupyter Notebooks to get started.
- An assets folder with files that represent various types of documents and images that will be processed by the notebooks. Note: Of these, Document_2.pdf is not shared under the MIT-0 license.
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.