
A course project on click-through rate (CTR) prediction. CTR is calculated as the number of clicks an ad receives divided by the number of times the ad is shown (impressions), expressed as a percentage.


anhtngc/IS252-ClickThroughRatePrediction

 
 


📊 Data Mining Demo App

A Streamlit app showcasing various data mining techniques, including data processing, outlier detection, correlation analysis, binning, smoothing, clustering, and prediction. The app demonstrates how to apply machine learning algorithms to real-world datasets and perform basic exploratory data analysis (EDA).


Introduction

The Data Mining Demo App provides an interactive platform to explore and apply key data mining techniques. This app is designed to be user-friendly, with a simple interface allowing users to process, analyze, and visualize datasets in various ways.

Key Features:

  • Data preprocessing and cleaning
  • Outlier detection using K-Nearest Neighbors (KNN)
  • Correlation analysis between features
  • Binning and smoothing techniques for noise reduction
  • K-Means clustering and machine learning models for prediction

This tool is ideal for learners, data scientists, and anyone interested in performing data mining on small to medium datasets.

Technologies Used

  • Python
  • Streamlit
  • Scikit-learn
  • Docker
  • Jupyter Notebook

Installation

Follow these steps:

  1. Clone the repository and move into the project directory:
git clone https://github.com/LoylP/Data_Mining_App.git
cd Data_Mining_App

  2. Run the app with Docker:
docker build -t streamlit-app .
docker run -p 8080:8080 streamlit-app

Alternatively, run the app locally. Creating and activating a virtual environment first is optional:

  • For Ubuntu/macOS:
python3 -m venv venv
source venv/bin/activate
  • For Windows:
python -m venv venv
.\venv\Scripts\activate
  3. Install the required dependencies:
pip install -r requirements.txt
  4. Run the app:
streamlit run app.py

Features

🏠 Home

  • Overview: Displays the dataset summary, including the shape and the number of missing values for each column.
  • File Upload: Users can upload a CSV file for analysis, which is then displayed in a tabular format.
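
The dataset summary described above can be sketched with pandas alone. This is a minimal illustration, not the app's actual code: the function name `summarize_csv` and the sample columns are hypothetical, and an in-memory `io.StringIO` buffer stands in for the uploaded file.

```python
import io

import pandas as pd


def summarize_csv(file_like):
    """Read a CSV and report its shape and per-column missing-value counts."""
    df = pd.read_csv(file_like)
    missing = {col: int(n) for col, n in df.isna().sum().items()}
    return {"shape": df.shape, "missing": missing}


# Hypothetical sample data; empty fields are parsed as missing values.
csv_text = "user_id,clicks,device\n1,3,mobile\n2,,desktop\n3,5,\n"
summary = summarize_csv(io.StringIO(csv_text))
print(summary)
# {'shape': (3, 3), 'missing': {'user_id': 0, 'clicks': 1, 'device': 1}}
```

In a Streamlit page, the object returned by `st.file_uploader` is file-like, so it could likely be passed to the same function in place of the buffer.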

🛠 Data Processing

  • Categorical and Numerical Columns: Automatically detect and classify columns into categorical or numerical types.
  • Missing Value Handling: Display and provide options to fill or drop missing values.
  • Encoding Categorical Data: Convert categorical columns to numerical values using techniques like One-Hot Encoding.
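
The three processing steps above might look roughly like the following sketch. The imputation strategies (median for numerical columns, most frequent value for categorical ones) are assumptions chosen for illustration, as are the function name `preprocess` and the sample columns; the app may offer different options.

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Split columns by dtype, impute missing values, and one-hot encode."""
    numeric_cols = df.select_dtypes(include="number").columns.tolist()
    categorical_cols = [c for c in df.columns if c not in numeric_cols]

    out = df.copy()
    # Numerical gaps: fill with the column median (assumed strategy).
    for col in numeric_cols:
        out[col] = out[col].fillna(out[col].median())
    # Categorical gaps: fill with the most frequent value (assumed strategy).
    # A column with no observed values at all would need separate handling.
    for col in categorical_cols:
        out[col] = out[col].fillna(out[col].mode().iloc[0])

    # One-hot encode every categorical column.
    return pd.get_dummies(out, columns=categorical_cols)


# Hypothetical sample data with one gap in each column.
sample = pd.DataFrame({
    "age": [25, None, 40, 31],
    "device": ["mobile", "desktop", None, "mobile"],
})
processed = preprocess(sample)
print(processed)
```

Dropping rows instead of imputing would be a one-line alternative (`df.dropna()`), at the cost of losing data.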

🚫 Outlier Detection

  • KNN-based Outlier Detection: Identify data points that significantly differ from the rest of the dataset.
  • Outlier Removal: Optionally remove outliers based on a set threshold, improving the quality of analysis.
  • Distribution Visualization: Visualize the distance distribution to help users set an appropriate threshold for outlier detection.
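
One common way to implement KNN-based outlier detection is to score each point by its mean distance to its k nearest neighbours and flag points whose score exceeds a threshold. The sketch below assumes that scoring rule; the function name, the value of `k`, and the threshold are hypothetical and not taken from the app.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def knn_outlier_scores(X: np.ndarray, k: int = 3) -> np.ndarray:
    """Return each row's mean distance to its k nearest neighbours."""
    # Request k + 1 neighbours because each point is returned as its own
    # nearest neighbour at distance 0; that first column is dropped below.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return distances[:, 1:].mean(axis=1)


# Four tightly grouped points and one point far away from them.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
scores = knn_outlier_scores(X, k=3)

threshold = 1.0  # hypothetical cut-off, normally chosen from the score distribution
outlier_mask = scores > threshold
X_clean = X[~outlier_mask]
print(outlier_mask)  # only the last point is flagged
```

Plotting a histogram of `scores` is what makes the threshold choice practical: inliers cluster near small distances and outliers appear in the tail.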

🔗 Correlation Analysis

  • Correlation Matrix: Visualize correlations between numerical columns using a heatmap.
  • Threshold Selection: Choose a minimum correlation threshold to filter out weak correlations and focus on the strongest relationships.
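
Threshold filtering on a correlation matrix can be sketched as follows. This assumes Pearson correlation (the pandas default) and compares the absolute value against the threshold so that strong negative correlations are kept; the function name and sample columns are hypothetical.

```python
import pandas as pd


def strong_correlations(df: pd.DataFrame, threshold: float = 0.5):
    """List column pairs whose absolute Pearson correlation meets the threshold."""
    corr = df.select_dtypes(include="number").corr()
    cols = list(corr.columns)
    pairs = []
    # Walk the upper triangle only, so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], r))
    # Strongest relationships first.
    return sorted(pairs, key=lambda p: -abs(p[2]))


# Hypothetical sample data: clicks is exactly proportional to impressions.
df = pd.DataFrame({
    "impressions": [100, 200, 300, 400, 500],
    "clicks": [10, 20, 30, 40, 50],
    "hour": [3, 1, 4, 1, 5],
})
pairs = strong_correlations(df, threshold=0.5)
print(pairs)  # only the (impressions, clicks) pair passes the threshold
```

The full matrix `corr` is what a heatmap would display; the filtered list is a compact view of the same information.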

🌫 Binning/Smoothing

  • Data Binning: Group continuous data into bins (e.g., age ranges, income groups).

  • Smoothing: Apply smoothing techniques to reduce noise in binned data, improving the quality of the analysis.

  • Apriori Algorithm: Perform association rule mining to find frequent itemsets and learn associations between features.
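
As an illustration of binning followed by smoothing, the sketch below uses equal-frequency bins and smoothing by bin means, a standard textbook combination. Which binning and smoothing variants the app actually exposes is not stated here, so treat the choice and the function name as assumptions.

```python
def smooth_by_bin_means(values, n_bins):
    """Equal-frequency binning, then replace each value with its bin's mean."""
    # Indices of the values in ascending order, so results map back to inputs.
    order = sorted(range(len(values)), key=lambda i: values[i])
    smoothed = [0.0] * len(values)
    size, extra = divmod(len(values), n_bins)
    start = 0
    for b in range(n_bins):
        # The first `extra` bins take one additional value when the split is uneven.
        end = start + size + (1 if b < extra else 0)
        members = order[start:end]
        if members:
            mean = sum(values[i] for i in members) / len(members)
            for i in members:
                smoothed[i] = mean
        start = end
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, n_bins=3)
print(smoothed_prices)
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Smoothing by bin medians or bin boundaries follows the same structure; only the value written back into each bin changes.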

🚀 Clustering

  • K-Means Clustering: Apply K-Means clustering on numerical features to group similar data points.

  • Cluster Visualization: Visualize clustering results in 2D or 3D plots for better interpretation.

  • Cluster Count Selection: Select the number of clusters using the Elbow method or manual input.
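
The Elbow method can be sketched by fitting K-Means for several values of k and comparing the within-cluster sum of squares (`inertia_` in scikit-learn). The synthetic data, seed, and function name below are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans


def inertia_by_k(X, k_values, seed=0):
    """Fit K-Means for each k and return the within-cluster sum of squares."""
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in k_values
    }


# Synthetic data: three well-separated blobs of 30 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

inertias = inertia_by_k(X, range(1, 7))
for k, value in inertias.items():
    print(k, round(value, 1))

# Inertia drops sharply up to k=3 and flattens afterwards, so 3 is the elbow.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```

On real data the elbow is rarely this clean, which is presumably why a manual cluster count is also offered.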

📈 EDA (Exploratory Data Analysis)

  • Descriptive Statistics: Display key statistics like mean, median, standard deviation, and percentiles.
  • Visualizations: Generate histograms, scatter plots, box plots, and pair plots for deeper insights into the data distribution and relationships.
  • Feature Correlation: Visualize and analyze the correlation between selected features to understand dependencies.

🔎 Prediction

  • Model Training: Train machine learning models such as linear regression, decision trees, and random forests.

  • Model Evaluation: Display performance metrics like accuracy, precision, recall, F1 score, and confusion matrices.

  • Prediction Input: Allow users to input new data and predict outcomes based on the trained model.
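
A train, evaluate, and predict cycle along these lines could be sketched as below. It uses a random forest classifier on synthetic data, since the listed metrics (accuracy, precision, recall, F1, confusion matrix) apply to classification; the dataset, split ratio, and hyperparameters are all assumptions, not the app's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for an uploaded dataset.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluate on the held-out rows only.
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
cm = confusion_matrix(y_test, y_pred)
print(metrics)
print(cm)

# Predict for a single new row (here, the first held-out row as a stand-in).
new_row = X_test[:1]
predicted = model.predict(new_row)
```

A regression model such as linear regression would need different metrics, for example mean squared error or R², rather than the ones shown here.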

Usage

Upload Data:

  • Click on the "Upload CSV" button in the sidebar to upload a CSV dataset.
  • After the file is uploaded, it will be displayed in a table format on the main page.

Data Processing:

  • Categorization: View and manage categorical and numerical columns.
  • Missing Value Handling: Choose to either drop or impute missing values.
  • Encoding: Convert categorical columns to numerical format using one-hot encoding.

Outlier Detection:

  • Run Outlier Detection: Detect outliers using the KNN-based method.
  • Adjust Threshold: Set the threshold for outlier removal.
  • Visualize: See the distribution of average distances and outliers highlighted on the graph.

Clustering and Prediction:

  • K-Means Clustering: Apply K-Means clustering on selected numerical columns.
  • Train Model: Train machine learning models and evaluate them using metrics like accuracy and precision.
  • Prediction: Input new data points to make predictions based on the trained models.

📄 License

MIT License


Languages

  • Jupyter Notebook 94.3%
  • Python 5.7%