A. Portfolio focused Project Based Learning
B. Self Directed Configuration of VSCode and Python Locally
...
Artefacts from Live Technical Sessions in the form of:
Session 1
: Python Fundamentals (for beginners and new to Python). (2024.06.19)🖇️ Session1.ipnyb
:
CoLab Run -> :- NB: Was familar with Python Fundamentals from previous software engineering efforts and courses.
- i) Lists, Tuples, and Dictionaries
- ii) Basic Python Operations
- iii) Flow Control Structoures
- iv) Handling errors
- v) Functions
- Recommended Activities
- Code with Mosh Complete Python Mastery
- Practice Katas, for example, Code Wars, CodeSignal
- NB: Was familar with Python Fundamentals from previous software engineering efforts and courses.
Session 2
: Machine Learning Models and Methodologies Fundamentals. (2024.07.02)🖇️ Session2.ipnyb
CoLab Run -> :- i) Regressions
- ii) Classifications
- iii) Clustering
- iv) Recommender Systems
Session 3
: Generative AI Lab (2024.07.16)🖇️ Session3_VAE.ipnyb
:
CoLab Run ->- i) Load Datasets
- ii) Encoders
- iii) VAE Sampling
- iv) Decoders
- v) VAE Model
- vi) VAE Loss
- vii) Model Training
- viii) Display Images (func)
🖇️ Session3_Transformers.ipnyb
:
CoLab Run ->- i) Setups/Imports
- ii) Load Datasets
- iii) Load Transformer Model (BERT)
- iv) Training Params
- v) Trainer
- vi) Model Evaluation
- vii) Predictions
Session 4
:OpenAIAnthropic Text Completions (2024.07.30)🖇️ Session4 Anthropic Text Completion.ipynb
:
CoLab Run -> , Anthropic,not OpenAI- i) Install
- ii) Intiatiate API Key
- iii) Model Functions
- Original
- Refactored
- vi) Examples
- v) Interactive Prompt
Session 2
: Unsupervised Learning Models.Session 3
:- 3.1 GenAI: VAE
- 3.2 GenAI: Tuning Transformers
Session 4
Embeddable AI: ChatBot & Text Completion
Use the Jumpto buttons to launch Google Colab per Sessions' cell
- The OpenAI API Key was not being issued due to a CORS policy, so Anthropic was switched, and the Session4 notebook was duplicated. All subsequent references will incluude Anthropic, not OpenAI, as alternative LLM provider
Objective: Understand the theory and hands-on implementation of:
1️⃣ Regression,
2️⃣ Classification,
3️⃣ Clustering and
4️⃣ Recommender Systems.
NumPy, short for "Numerical Python," is a powerful library used in Python programming for numerical and scientific computing.
NumPy like a supercharged version of Python's built-in list data structure, designed to handle large amounts of data more efficiently.
Matplotlib is a powerful library in Python used for creating visualizations, such as graphs and charts.
MatplotLibs is particularly useful for data scientists, engineers, and anyone who needs to visualize data to understand and communicate trends, patterns, and insights
...
Scikit-learn is a popular Python library for machine learning, offering simple and efficient tools for data analysis and modeling
SciKit_Learn (sklearn
) provides a wide range of algorithms for classification, regression, clustering, and dimensionality reduction.
- It integrates well with other scientific libraries like NumPy and pandas
- As such, makes it easy to build and evaluate machine learning models.
- Is widely used for its
- ease of use,
- comprehensive documentation, and
- versatility in handling different machine learning tasks.
Logistic Regression models, as a type of linear models, are used as a common workflow in classification tasks (like binary classification) where you want to estimate the likelihood of a data point belonging to different categories.
- K-Means Clustering
- Hierarchical Clustering
- DBSCAN
Are just 3 of the 26 algorithms from
sklearn.cluster
, and the rest are out of scope for this purpose.
K-Means is an unsupervised machine learning algorithm that partitions a dataset into k distinct clusters based on similarities, aiming to minimize the sum of squared distances between data points and their assigned cluster centroids
It minimizes within-cluster variances (squared Euclidean distances), facilitating partitioning by mean rather than Euclidean distances.
Hierarchical Clustering (a la Agglomerative Clustering) is an unsupervised machine learning algorithm that groups unlabeled data points into a hierarchy of clusters based on their similarity. An analytical method that seeks to build a hierarchy of clusters by either merging or splitting them based on data observations.
It builds a cluster hierarchy in the form of a tree-like structure called a dendrogram, where each merge or split is represented by a node
- Agglomerative (Bottom-up) - Starting small, think of this as starting with one feature as its own group
- Divisive (Top-down) - Starting big, think of this as starting with the whole box of features as one big group
DBSCAN is an unsupervised clustering algorithm that groups together closely packed data points based on their density, while identifying points in low-density regions as outliers or noise.
- DBSCAN is known as Density-Based Spatial Clustering of Applications with Noise.
It operates by defining clusters as areas where a minimum number of points (minPts) exist within a specified radius (epsilon) around each point, allowing it to detect clusters of arbitrary shapes and effectively handle noise in datasets
Recommender systems are a type of information filtering system that predict the "rating" or "preference" a user would give to an item. They help users discover items they might like but haven't encountered yet. The algorthmic steps are somewhat as follows:
i. Idenitify a target to compare.
ii. Find similar targets.
iii. Calculate an average value for similar targets.
iv. Sort the high ranking values for recommendations.
v. Display the recommendation.
Pandas aims to be the fundamental high-level building block for doing practical, real world data analysis in Python; designed to make working with "relational" or "labeled" data both easy and intuitive
- Pandas, as a python package, has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language, via fast, flexible, and expressive data structures. It is already well on its way towards this goal.
SKLearn.metrics is part of 3 APIs used for evaluating the quality of a model's predicitions; specifically implementing functions assessing prediction error for targeted purposes.
- Score functions, performance metrics, pairwise metrics and distance computations.
SciKit's Metric Pairwise sub module implements utilities to evaluate pairwise distances or affinity of sets of samples.
-
Pairwise metrics is involved with subset of data transformations Pairwise metrics, affinities and Kernels, specifically covering transforming feature spaces into affinity spaces.
-
Cosine Similarity
is a popular choice for computing the similarity of documents represented as tf-idf vectors.tf-idf
vectors:- TF-IDF stands for: Term Frequency-Inverse Document Frequency.
- They represent text documents as numerical vectors, where each dimension corresponds to a unique word
- called so as Euclidean (L2) normalization projects the vectors onto the unit sphere.
- As their dot product is then the cosine of the angle between the points denoted by the vectors.
- accepts
scipy.sparse
matrices. - computes the L2-normalized dot product of vectors. That is, if
\(x\)
and\(y\)
are row vectors, their cosine similarity\(k\)
as follows for equation display:
The following equation represents the function \( k(x, y) \
):
$k(x, y) = \frac{x y^\top}{\|x\| \|y\|}$
These sessions needs to be run on if local system compute are not configured or specified for GPU loads.
VAE is an unsupervised learning technique where the machine is using and analyzing unlabeled data sets. With this method, the model can learn patterns in the data and learn how to reconstruct the inputs as its outputs after significantly downsizing it.
- Autoencoders have four main layers:
encoder
,bottleneck
,decoder
, and thereconstruction loss
.- The
encoder
is the given input with reduced dimensionality. - The
bottleneck
is the compressed representation of the encoded data. - The
decoder
is the reconstructed version of the original output. - The
reconstruction
loss is the difference between the original output and the reconstructed output.
- The
Input ➡️ Encoder ➡️ Bottleneck ➡️ Decoder ➡️ Ouput
TensorFlow is an end-to-end open source platform for machine learning and it is easy to create ML models that can run in any environment.
- It has a comprehensive, flexible ecosystem of tools, libraries, and community resources to build and deploy ML-powered applications.
- Lite lirbaries for mobile and edge devices
- Browser libraries
- ML models & datasets
- Developer tools for model evaluation, performance optimisation and productising ML workflows.
Keras is a multi-backend deep learning framework, with support for JAX, TensorFlow, and PyTorch
- It provides an approachable, highly-productive interface for solving machine learning (ML) problems, with a focus on modern deep learning.
- Build and train models for computer vision, natural language processing, audio processing, timeseries forecasting, recommender systems, etc.
- To use keras, you should also install the backend of choice:
tensorflow
,jax
, ortorch
. - NB: Note that
tensorflow
is required for using certain Keras 3 features: certain preprocessinglayers
as well astf.data
pipelines.- Keras 3 is intended to work as a drop-in replacement for
tf.keras
(when using the TensorFlow backend).
- Keras 3 is intended to work as a drop-in replacement for
Foundational Models are large scale models pre-trained on vast ammount of data, broad and diverse datasets, for adaption to downstream tasks. These can be fine tuned for specific applications by building more specialized models.
- BERT, a type of transformer model, is used in this session (3.2).
- Designed to underdstand a words's context in search queries.
- It does this by looking at the words that come before and after it
- It is bidirectional: BERT reads entire sequence of words one, considering the full context of each word.
- Is excellent for understanding text and it's context, thus ideal for deep understanding and analysis of language.
- Pre-training models allows models to learn general language patterns, structures, and representations.
- Fine-tuning: The process of adapting pre-trained models to specific tasks, using smaller, task specific datasets.
- Customises the model to improve performance in specific applications without needing to train it from scratch.
HuggingFace's 🤗 library that enables the same
PyTorch
code to be run across any distributed configuration.
- It's run your raw
PyTorch
training script on any kind of device. - Accerlate was created for
PyTorch
users who like to write the training loop ofPyTorch
models ..- ... but are reluctant to write and maintain the boilerplate code needed to use multi-GPUs/TPU/fp16.
- Accelerate abstracts exactly and only the boilerplate code related to multi-GPUs/TPU/fp16.
HuggingFace provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and share them on HuggingFace's mode; hub.
- It provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio.
UseCases (Source:PyPi)
These models can be applied on:
- 📝 Text, for tasks like text classification, information extraction, question answering, summarization, translation, and text generation, in over 100 languages.
- 🖼️ Images, for tasks like image classification, object detection, and segmentation.
- 🗣️ Audio, for tasks like speech recognition and audio classification.
- Transformer models can also perform tasks on several modalities combined, such as
- Table question answering,
- Ooptical character recognition,
- Information extraction from scanned documents,
- Video classification, and
- Visual question answering.
These sessions needs to be run on if local system compute are not configured or specified for GPU loads.
- 1️⃣ Embedded AI- Hands-on Chatbots
- Embedded AI- Hands-on Chatbots using Python, Jupyter Notebook.
- Integrate the chatbot with OpenAI's GPT-4-o model to give it a high level of intelligence and the ability to understand and respond to user requests
- 2️⃣ Embedded AI - IBM Watson Speach to Text / Text-Speech
- Embedded AI- Hands-on Chatbots using Python, Flask, HTML, CSS, and Javascript.
- Implement IBM Watson Speech-to-Text functionality to allow the chatbot to understand voice input from users.
- Implement IBM Watson Text-to-Speech functionality to allow the chatbot to communicate with users through voice output.
OpenAI is a leading AI research company that offers powerful models for text completion and chatbot development.
Its GPT models excel at understanding and generating human-like text, enabling applications like:
- Text Completion: Predicting and suggesting subsequent words in a sentence or paragraph.
- Chatbots: Building interactive conversational agents that can engage in natural, dynamic dialogue with users
- OpenAI provides APIs and tools for easy integration of these capabilities into various platforms and applications.
- LLM Provider registration/login and Freeemium Subscription 💳🔐
- Issue: #7 | 🐛 [Bug]: External | OpenAI bug with API Key generation
- ➡️➡️ Switch to another LLM Provider/Platform: Anthropic Claude Sonnet 3.5
- Request: #8: Update Session4 Notebook or duplicate/mirror OpenAi variant.
https://docs.anthropic.com/en/api/client-sdks
-
For a live session, this was not covered, due to the extant requirements of IBM WatsonX.ai registration, credit card approvals and multiple Entitlements and Trial License; as well as local access/technical configuration of each of these SST and TTS models.
- IBM ID/Cloud account registration/login required - 🔐 (** elective)
-
This is guided project is optional to the requirments of accrditation completion and will be updated here in an external (private) repository. Access by demand. WIP.
- n
- n
- n
- n IBM Developer (2023-12-08) "Implement autoencoders using TensorFlow" (Accessed: July 2024); URL https://developer.ibm.com/tutorials/implement-autoencoders-using-tensorflow/
- n
- n
- n
Datea | Version | Changed By | Change | Activity |
---|---|---|---|---|
2024-07-16 | 0.1 | Charles J Fowler | Initial version created | Create |
2024-07-27 | 0.2 | Charles J Fowler | Draft Portfolio version | Modify |
a: YYYY-MM-DD |