jseluis/data_science_and_analytics

Data Science using Machine Learning Pipelines

My Certificate

Analyze Datasets and Train ML Models using AutoML Certificate

Build, Train and Deploy ML Pipelines using BERT

Optimize ML Models and Deploy Human-in-the-Loop Pipelines Certificate

Analyze Datasets and Train ML Models using AutoML

Project 1 - Amazon SageMaker - Register and visualize dataset

Register and visualize dataset

List and access the Women's Clothing Reviews dataset files hosted in an S3 bucket

Install and import AWS Data Wrangler

Create an AWS Glue Catalog database and list all Glue Catalog databases

Register dataset files with the AWS Glue Catalog

Write SQL queries to answer specific questions on your dataset and run your queries with Amazon Athena

Return the query results in a pandas dataframe

Produce and select different plots and visualizations that address your questions
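
The query step above can be tried locally before touching AWS. The sketch below is only a stand-in: it uses an in-memory SQLite table instead of Amazon Athena, and the table name `reviews`, the columns `product_category` and `sentiment`, and the sample rows are assumptions made for illustration (only `review_body` is named elsewhere in this README).

```python
import sqlite3

# Hypothetical sample rows standing in for the registered reviews table.
rows = [
    ("Dresses", 1, "Love this dress"),
    ("Dresses", -1, "Poor fit"),
    ("Blouses", 1, "Great fabric"),
    ("Blouses", 0, "It is okay"),
    ("Dresses", 1, "Perfect"),
]

# In-memory SQLite database used as a local stand-in for Athena.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reviews (product_category TEXT, sentiment INTEGER, review_body TEXT)"
)
conn.executemany("INSERT INTO reviews VALUES (?, ?, ?)", rows)

# Question: how many reviews does each category have, and what is its average sentiment?
query = """
SELECT product_category,
       COUNT(*) AS review_count,
       AVG(sentiment) AS avg_sentiment
FROM reviews
GROUP BY product_category
ORDER BY review_count DESC
"""
results = conn.execute(query).fetchall()

for category, review_count, avg_sentiment in results:
    print(category, review_count, round(avg_sentiment, 2))
```

In the project itself the same kind of SQL runs on Athena and the result comes back as a pandas dataframe, which is then plotted.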

Project 2 - Detect data bias with Amazon SageMaker Clarify

Download and save raw unbalanced dataset

Analyze bias with open source Clarify

Balance the dataset

Analyze bias at scale with an Amazon SageMaker processing job and Clarify

Analyze bias reports before and after balancing the dataset
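
To make the "before and after balancing" comparison concrete, here is a minimal sketch in plain Python rather than Clarify itself. It computes the class imbalance metric, CI = (n_a - n_d) / (n_a + n_d), for two groups and balances the data by random downsampling. The facet name `product_category`, the toy records, and the downsampling strategy are assumptions for illustration.

```python
import random
from collections import Counter


def class_imbalance(n_a, n_d):
    """CI = (n_a - n_d) / (n_a + n_d); 0 means the two groups are balanced."""
    return (n_a - n_d) / (n_a + n_d)


def balance_by_downsampling(records, key, seed=0):
    """Randomly downsample every group to the size of the smallest group."""
    groups = {}
    for record in records:
        groups.setdefault(record[key], []).append(record)
    smallest = min(len(group) for group in groups.values())
    rng = random.Random(seed)
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced


# Hypothetical toy data: 6 reviews in one category, 2 in another.
records = [{"product_category": "Dresses", "review_id": i} for i in range(6)]
records += [{"product_category": "Blouses", "review_id": i} for i in range(6, 8)]

before = Counter(r["product_category"] for r in records)
balanced = balance_by_downsampling(records, key="product_category")
after = Counter(r["product_category"] for r in balanced)

ci_before = class_imbalance(before["Dresses"], before["Blouses"])
ci_after = class_imbalance(after["Dresses"], after["Blouses"])
print(ci_before, ci_after)
```

A CI close to 0 after balancing is the kind of change to look for when comparing the two bias reports.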

Project 3 - Train a model with Amazon SageMaker Autopilot

Dataset review

Configure the Autopilot job

Launch Autopilot job

Track Autopilot job progress

Feature engineering

Model training and tuning

Review all output

Deploy and test best candidate model

Project 4 - Train a text classifier using Amazon SageMaker BlazingText built-in algorithm

Prepare dataset

Train the model with Amazon SageMaker BlazingText

Deploy the model

Test the model
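
For the dataset preparation step, BlazingText in supervised (text classification) mode expects one example per line, with the label prefixed by `__label__` and followed by space-separated tokens. The sketch below shows that formatting only. The helper name `to_blazingtext_line` is hypothetical, and the lowercase-and-split tokenization is a simplification, not necessarily the tokenizer used in the project.

```python
def to_blazingtext_line(sentiment, review_body):
    """Format one labeled review as '__label__<label> token token ...'."""
    # Simplified tokenization: lowercase and split on whitespace.
    tokens = review_body.lower().split()
    return f"__label__{sentiment} " + " ".join(tokens)


# Hypothetical sample reviews with sentiment labels -1, 0, 1.
samples = [
    (1, "Love this Dress"),
    (0, "It is okay"),
    (-1, "Poor fit  and scratchy fabric"),
]

lines = [to_blazingtext_line(sentiment, body) for sentiment, body in samples]
for line in lines:
    print(line)
```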

Build, Train, and Deploy ML Pipelines using BERT

  • Automate a natural language processing task by building an end-to-end machine learning pipeline using Hugging Face’s highly-optimized implementation of the state-of-the-art BERT algorithm with Amazon SageMaker Pipelines. The pipeline will first transform the dataset into BERT-readable features and store the features in the Amazon SageMaker Feature Store. It will then fine-tune a text classification model to the dataset using a Hugging Face pre-trained model, which has learned to understand the human language from millions of Wikipedia documents. Finally, your pipeline will evaluate the model’s accuracy and only deploy the model if the accuracy exceeds a given threshold.

  • Practical data science is geared towards handling massive datasets that do not fit in your local hardware and could originate from multiple sources. One of the biggest benefits of developing and running data science projects in the cloud is the agility and elasticity that the cloud offers to scale up and out at a minimum cost.

  • The focus of these projects is on developing ML workflows using Amazon SageMaker, Python, and SQL. I recommend these projects if you want to learn how to build, train, and deploy scalable, end-to-end ML pipelines in the AWS cloud.

  • Steps ML_Pipelines_using_BERT

    1. Configure the SageMaker Feature Store

    2. Transform the dataset

    3. Inspect the transformed data

    4. Query the Feature Store

  • Train a review classifier with BERT and Amazon SageMaker

    1. Configure dataset

    2. Configure model hyper-parameters

    3. Set up evaluation metrics, debugger and profiler

    4. Train model

    5. Analyze debugger results

    6. Deploy and test the model

  • SageMaker pipelines to train a BERT-Based text classifier

    1. Configure dataset and processing step

    2. Configure training step

    3. Configure model-evaluation step

    4. Configure register model step

    5. Create model for deployment step

    6. Check accuracy condition step

    7. Create and start pipeline

    8. List pipeline artifacts

    9. Approve and deploy model

SageMaker pipelines to train a BERT-Based text classifier

My Quick Lab from the Practical Data Science Specialization, Coursera.

In this lab, you will do the following:

  • Define and run a pipeline using a directed acyclic graph (DAG) with specific pipeline parameters and model hyper-parameters
  • Define a processing step that cleans, balances, transforms, and splits our dataset into train, validation, and test datasets
  • Define a training step that trains a model using the train and validation datasets
  • Define a processing step that evaluates the trained model's performance on the test dataset
  • Define a register model step that creates a model package from the trained model
  • Define a conditional step that checks the model's performance and conditionally registers the model for deployment
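
The idea of a pipeline as a directed acyclic graph can be illustrated without SageMaker. This minimal sketch uses Python's standard `graphlib` module to derive a valid execution order from step dependencies. The step names are hypothetical labels for the steps listed above, not identifiers taken from the lab notebook.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (hypothetical names).
dependencies = {
    "Processing": set(),
    "Train": {"Processing"},
    "EvaluateModel": {"Train"},
    "AccuracyCondition": {"EvaluateModel"},
    "RegisterModel": {"AccuracyCondition"},
    "CreateModel": {"AccuracyCondition"},
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Because the graph is acyclic, every step runs only after the steps it depends on have finished. The relative order of `RegisterModel` and `CreateModel` is not fixed, since neither depends on the other.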

Terminology

This notebook focuses on the following features of Amazon SageMaker Pipelines:

  • Pipelines - a directed acyclic graph (DAG) of steps and conditions to orchestrate SageMaker jobs and resource creation
  • Processing job steps - a simplified, managed experience on SageMaker to run data processing workloads, such as feature engineering, data validation, model evaluation, and model explainability
  • Training job steps - an iterative process that teaches a model to make predictions on new data by presenting examples from a training dataset
  • Conditional step execution - provides conditional execution of branches in a pipeline
  • Registering models - register a model in a model registry to create deployable models in Amazon SageMaker
  • Parameterized pipeline executions - allows pipeline executions to vary by supplied parameters
  • Model endpoint - hosts the model as a REST endpoint to serve predictions from new data

BERT Pipeline

The pipeline that you will create follows a typical machine learning application pattern of pre-processing, training, evaluation, and model registration.

In the processing step, you will perform feature engineering to transform the review_body text into BERT embeddings using the pre-trained BERT model and split the dataset into train, validation and test files. The transformed dataset is stored in a feature store. To optimize for TensorFlow training, the transformed dataset files are saved using the TFRecord format in Amazon S3.

In the training step, you will fine-tune the BERT model to the customer reviews dataset and add a new classification layer to predict the sentiment for a given review_body.

In the evaluation step, you will take the trained model and a test dataset as input, and produce a JSON file containing classification evaluation metrics.

In the condition step, you will register the trained model if the accuracy of the model, as determined by our evaluation step, exceeds a given threshold value.
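
The logic of the condition step can be sketched in a few lines of plain Python. This is not the SageMaker Pipelines API, only the decision it encodes. The JSON layout (`metrics.accuracy.value`), the function name `should_register`, and the threshold values are assumptions for illustration.

```python
import json


def should_register(evaluation_json, min_accuracy):
    """Return True when the reported accuracy exceeds the threshold."""
    report = json.loads(evaluation_json)
    # Assumed report layout: {"metrics": {"accuracy": {"value": <float>}}}
    accuracy = report["metrics"]["accuracy"]["value"]
    return accuracy > min_accuracy


# Hypothetical evaluation report produced by the evaluation step.
evaluation_json = json.dumps({"metrics": {"accuracy": {"value": 0.92}}})

print(should_register(evaluation_json, min_accuracy=0.90))
print(should_register(evaluation_json, min_accuracy=0.95))
```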

Optimize ML Models and Deploy Human-in-the-Loop Pipelines

  • Project 1 - Optimize models using Automatic Model Tuning

    When training ML models, hyperparameter tuning searches for the configuration that yields the best-performing model. In this lab you will use the random search strategy of Amazon SageMaker Automatic Model Tuning to train a BERT-based natural language processing (NLP) classifier. The model analyzes customer feedback and classifies the messages into positive (1), neutral (0), and negative (-1) sentiments.

    1. Configure dataset

    2. Configure and run hyper-parameter tuning job

    3. Evaluate the results
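
As a rough illustration of random search, the sketch below samples hyperparameter configurations at random and keeps the best one. It is plain Python, not the SageMaker tuning API. The hyperparameter names and ranges, and especially `toy_objective` (a stand-in for the validation accuracy a real training job would report), are assumptions.

```python
import random

# Hypothetical search space: one continuous and one categorical hyperparameter.
search_space = {
    "learning_rate": (1e-5, 5e-5),
    "train_batch_size": [128, 256],
}


def sample_config(rng):
    """Draw one configuration uniformly at random from the search space."""
    low, high = search_space["learning_rate"]
    return {
        "learning_rate": rng.uniform(low, high),
        "train_batch_size": rng.choice(search_space["train_batch_size"]),
    }


def toy_objective(config):
    """Stand-in for validation accuracy; highest when learning_rate is 3e-5."""
    return 1.0 - abs(config["learning_rate"] - 3e-5) * 1e4


def random_search(n_trials, seed=0):
    """Evaluate n_trials random configurations and return the best one."""
    rng = random.Random(seed)
    trials = [sample_config(rng) for _ in range(n_trials)]
    return max(trials, key=toy_objective)


best = random_search(n_trials=10)
print(best, round(toy_objective(best), 3))
```

In a real tuning job each trial is a full training run, so the number of trials is the main cost lever.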

  • Project 2 - A/B testing, traffic shifting and autoscaling

    Create an endpoint with multiple variants, splitting the traffic between them. Then after testing and reviewing the endpoint performance metrics, you will shift the traffic to one variant and configure it to autoscale.

    1. Configure and create REST Endpoint with multiple variants

    2. Test the model

    3. Show the metrics for each variant

    4. Shift all traffic to one variant

    5. Configure one variant to autoscale
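
Traffic splitting between variants is driven by relative weights: each variant receives its weight divided by the sum of all weights. The sketch below simulates that rule locally. The variant names, the weights, and the helper functions are hypothetical, and no endpoint is created or called.

```python
import random
from collections import Counter


def traffic_shares(variant_weights):
    """Share of traffic per variant = weight / sum of all weights."""
    total = sum(variant_weights.values())
    if total <= 0:
        raise ValueError("at least one variant needs a positive weight")
    return {name: weight / total for name, weight in variant_weights.items()}


def route(variant_weights, n_requests, seed=0):
    """Simulate routing n_requests according to the variant weights."""
    rng = random.Random(seed)
    names = list(variant_weights)
    weights = [variant_weights[name] for name in names]
    return Counter(rng.choices(names, weights=weights, k=n_requests))


# A/B test: split traffic evenly between two hypothetical variants.
ab_test = {"VariantA": 50, "VariantB": 50}
print(traffic_shares(ab_test), route(ab_test, 1000))

# After reviewing the metrics: shift all traffic to one variant.
shifted = {"VariantA": 0, "VariantB": 100}
print(traffic_shares(shifted), route(shifted, 1000))
```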

  • Project 3 - Data labeling and human-in-the-loop pipelines with Amazon Augmented AI (A2I)

    Create your own human workforce, a human task UI, and then define the human review workflow to perform data labeling. You will make the original predictions of the labels with the custom ML model, and then create a human loop if the probability scores are lower than the preset threshold. After the completion of the human loop tasks, you will review the results and prepare data for re-training.

    1. Set up private workforce and Cognito pool

    2. Create the Human Task UI using a Worker Task Template

    3. Create a Flow Definition

    4. Start and check the status of human loop

    5. Verify the completion

    6. View the labels and prepare data for training
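
The rule that decides when a human loop starts can be sketched as a simple confidence check. This is plain Python, not the A2I API. The threshold of 0.9, the field names, and the sample predictions are assumptions for illustration.

```python
def needs_human_review(probability, threshold=0.9):
    """Send a prediction to human review when its score is below the threshold."""
    return probability < threshold


# Hypothetical model outputs: predicted sentiment and its probability score.
predictions = [
    {"review_body": "I love it", "predicted_label": 1, "probability": 0.97},
    {"review_body": "It is fine, I guess", "predicted_label": 0, "probability": 0.55},
    {"review_body": "Terrible quality", "predicted_label": -1, "probability": 0.93},
]

# Low-confidence predictions become inputs for human loops.
human_loop_inputs = [
    p for p in predictions if needs_human_review(p["probability"])
]
# High-confidence predictions are accepted as they are.
auto_accepted = [
    p for p in predictions if not needs_human_review(p["probability"])
]

print(len(human_loop_inputs), len(auto_accepted))
```

Once the reviewers finish, their labels replace the low-confidence predictions, and the combined set is what gets prepared for re-training.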

References:

Advanced model training, tuning, and evaluation:

Hyperband

Bayesian Optimization

Amazon SageMaker Automatic Model Tuning

Advanced model deployment, and monitoring:

A/B Testing

Autoscaling

Multi-armed bandit

Batch Transform

Inference Pipeline

Model Monitor

Data labeling and human-in-the-loop pipelines:

Towards Automated Data Quality Management for Machine Learning

Amazon SageMaker Ground Truth Developer Guide

Create high-quality instructions for Amazon SageMaker Ground Truth labeling jobs

Amazon SageMaker Augmented AI (Amazon A2I) Developer Guide

Amazon Augmented AI Sample Task UIs

Liquid open source Template Language

Elastic Machine Learning Algorithms in Amazon SageMaker

Word2Vec algorithm

GloVe algorithm

FastText algorithm

Transformer architecture, "Attention Is All You Need"

BlazingText algorithm

ELMo algorithm

GPT model architecture

BERT model architecture

Built-in algorithms

Amazon SageMaker BlazingText
