K-mean parallel with Map-Reduce

Description

This is my project to implement about the K-means parallel using Map-Reduce in Hadoop with Python code.

Word File

I write some word file to save userful information about my proplems, introduction to K-mean, how map-reduce work with K-mean and also guide to install the Hadoop on Ubuntu 22.04 system. Here is some information you can get in the Word file (It's only Vietnamese language but you can contact me if you need help).

Install: Guide to install the Hadoop Single Node on Ubuntu 22.04.
Review-Kmean: Short review about the K-mean algorithm and how it work.
Case-Study: This part is talking about the sequential and parallel in K-mean.
Evaluation: Some evaluation on time speed while running the parallel and sequential.

How to run demo

Here is the step-by-step how you can run this code in your Ubuntu 22.04 after you install Hadoop follow the Install guide above.

cd to ~
Clone this repository to your ubuntu server: git clone git@github.com:tanaha2002/K-means-parallel.git (may you need to install git on your ubuntu server first).
Copy the k_mean folder to /home/ubuntu.
Install the python with command sudo apt install python3
Cd to k_mean and run pip3 install -r requirements.txt for install some depedencies python library.
Now you can go to datagen folder and open the data_gen_labels.py then change the num_samples` to how many data row you wanna gen to text. You will also get the init centroids and image about the ground truth cluster.
Turn on the HDFS and yarn with command start-dfs.sh and start-yarn.sh. Then go to the parallel_hdfs folder and run command bash data2hdfs.sh to put your data to hdfs and remove some old logs if you have.
Then run bash run.sh for run the parallel K-mean. You will get the visual of the images in each iter.
Now go to the sequential folder and run python3 k_means.py for run the sequential version.
After run all two version parallel and sequential. Go to the evaluation folder and run python3 evaluation and you will get the plot of time running of two version for evaluation.

Note: This version is just can run with 2 feature in data_gen_labels.py, you can clone this repo and update your own version to let it can run with any feature-dim you want, maybe 10-20... for get more true parallel time you can compute with the sequential.

Demo

demo-k-mean.mp4

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
k_mean		k_mean
Case-Study.docx		Case-Study.docx
Evaluation.docx		Evaluation.docx
Install.docx		Install.docx
README.md		README.md
Review-Kmean.docx		Review-Kmean.docx
~$Demo.docx		~$Demo.docx
~$se-Study.docx		~$se-Study.docx

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

K-mean parallel with Map-Reduce

Description

Word File

How to run demo

Demo

About

Releases

Packages

Languages

tanaha2002/K-means-parallel

Folders and files

Latest commit

History

Repository files navigation

K-mean parallel with Map-Reduce

Description

Word File

How to run demo

Demo

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages