This project uses PySpark to run K-means, Bisecting K-means, and Decision Tree models on the Iris dataset.
The Iris dataset is a classic benchmark in machine learning and statistics. It contains 150 samples, each with four measurements (sepal length, sepal width, petal length, and petal width) and one of three species labels.
The project consists of three main components:
- K-means Clustering: A traditional clustering algorithm that partitions data into k distinct clusters.
- Bisecting K-means Clustering: A hierarchical (divisive) variant of K-means that starts with all points in one cluster and repeatedly splits a cluster in two until k clusters remain.
- Decision Tree Classification: A classification algorithm that uses a tree-like model to make decisions based on input features.
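To make the difference between the two clustering approaches concrete, here is a minimal plain-Python sketch. It is not PySpark's implementation: the function names `kmeans` and `bisecting_kmeans` are illustrative, the fixed iteration count is an assumption, and always splitting the largest cluster is a simplification chosen for brevity.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: returns k clusters as lists of points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k initial centers from the data
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters


def bisecting_kmeans(points, k, seed=0):
    """Start with one cluster and split the largest in two until k remain."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    clusters = [list(points)]
    while len(clusters) < k:
        largest = max(clusters, key=len)  # simplification: split by size
        clusters.remove(largest)
        clusters.extend(kmeans(largest, 2, seed=seed))
    return clusters
```

For example, `kmeans(points, 3)` partitions the points in one pass of repeated assign-and-update steps, while `bisecting_kmeans(points, 3)` reaches three clusters through two successive two-way splits.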
To run this project, you need PySpark, which in turn requires a Java runtime on your machine. You can install PySpark using pip:
pip install pyspark
The visualizations additionally require Matplotlib and Seaborn:
pip install matplotlib seaborn
- Initialize Spark Session: The Spark session is initialized with the name "IrisAnalysis".
- Load Data: The Iris dataset is loaded from a CSV file.
- Prepare Features: Features are prepared using VectorAssembler.
- K-means Clustering: The K-means algorithm is applied to the data, and the results are visualized.
- Bisecting K-means Clustering: The Bisecting K-means algorithm is applied to the same data, and the results are visualized.
- Decision Tree Classification: A decision tree classifier is trained and evaluated, and the results are visualized.
- K-means Silhouette Score: The silhouette score for the K-means clustering model is 0.7482.
- Bisecting K-means Silhouette Score: The silhouette score for the Bisecting K-means clustering model is 0.6682.
- Decision Tree Accuracy: The accuracy of the Decision Tree classifier is 1.0.
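For readers unfamiliar with these metrics, the following plain-Python sketch shows what they measure. It is an illustration only and not how the numbers above were produced: it uses ordinary Euclidean distance, whereas PySpark's `ClusteringEvaluator` defaults to squared Euclidean distance, so its values will generally differ. The function names are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_score(points, labels):
    """Mean silhouette coefficient; ranges from -1 (poor) to 1 (well separated)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)
```

A silhouette score near 1 means points sit much closer to their own cluster than to the nearest other one, so a higher value indicates tighter, better separated clusters. An accuracy of 1.0 means every evaluated sample was classified correctly.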
Visualizations for each clustering method and the decision tree classification are generated using Matplotlib and Seaborn.
- PySpark: For data processing and machine learning.
- Matplotlib: For data visualization.
- Seaborn: For enhanced data visualization.
Contributions are welcome! Please open an issue or submit a pull request for any improvements or additions.
This project is licensed under the MIT License.