STAT 547M Homework 9 Repository of Rebecca Asiimwe

Theme: Automate Tasks and Pipelines

Assignment overview

Pipelines are common in most walks of life: digital circuits, software, transportation, industry, sales. They are almost everywhere! A data-analysis pipeline takes inputs through a number of processing steps that are chained together to produce the desired output. It is a chain of commands that can be run on one or more data sets, which is very helpful when an analysis has to be rerun, especially across multiple files. A Makefile describes a pipeline of shell commands and the interdependencies of the input and output files of those commands.

Benefits of automating analytics pipelines using R and Make:

Automation helps us reproduce previous results and recreate deleted results. We can also rerun a pipeline with updated software or on a different dataset.
Make can resume a pipeline after a failed command without needing to start over. It also has the ability to run independent jobs in parallel.
A Makefile can be easily displayed as a graphical flow chart of files and shell commands, which is a powerful way of interpreting and communicating pipelines.
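
Make can resume a pipeline because it compares file timestamps: a target is rebuilt only when it is missing or older than one of its prerequisites. The sketch below illustrates that rule in Python. The function name `needs_rebuild` is hypothetical, and this is an illustration of the idea, not how Make itself is implemented.

```python
import os


def needs_rebuild(target, prerequisites):
    """Return True if `target` is missing or older than any prerequisite.

    Hypothetical helper that mirrors Make's basic out-of-date rule.
    """
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # Any prerequisite modified after the target makes the target stale.
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)
```

A step whose output is already newer than all of its inputs is skipped, which is why a rerun after a failure picks up where the last run stopped.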

The developed pipeline:

Problem at hand:

I have a number of text files that contain words. Unfortunately, the files are fragmented: they could have come from another pipeline as separate pieces of the same file. I would like to combine all of them into one large master file that I can use as my dataset.

The Pipeline

The pipeline starts with a Python seed component, merger.py, which traverses the directory tree under the specified path and looks for files whose names match a certain pattern. The script then concatenates those files into one, assuming that all files from a "prior" pipeline were put in one directory.
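
The traverse-and-concatenate step can be sketched as follows. This is a minimal illustration and not the actual contents of merger.py: the function name `merge_files`, its parameters, and the sorted merge order are all assumptions.

```python
import fnmatch
import os


def merge_files(root, pattern, output_path):
    """Concatenate every file under `root` whose name matches `pattern`.

    Hypothetical sketch; returns the list of files that were merged.
    """
    output_abs = os.path.abspath(output_path)
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, pattern):
            full = os.path.join(dirpath, name)
            # Skip the output file itself in case it lives under `root`.
            if os.path.abspath(full) != output_abs:
                matches.append(full)
    matches.sort()  # assumed ordering, so repeated runs give the same result

    with open(output_path, "w", encoding="utf-8") as out:
        for path in matches:
            with open(path, encoding="utf-8") as src:
                text = src.read()
            out.write(text)
            # Keep pieces on separate lines if a file lacks a final newline.
            if text and not text.endswith("\n"):
                out.write("\n")
    return matches
```

For example, `merge_files("files/", "*.txt", "dataset_merge.txt")` would collect every `.txt` file below `files/` into a single dataset file.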

The output of merger.py is dataset_merge.txt. This file is then fed into trump_words.R, which runs the required analyses on dataset_merge.txt and generates the required plots, as shown in the md file. The md file was generated purely for visualization.
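
The analysis itself lives in the R script, which relies on the tm, SnowballC and wordcloud packages. To give a feel for the kind of word-frequency count that sits behind a bar plot or word cloud, here is a rough Python stand-in. It is not a translation of trump_words.R: the function name `top_words` and the tiny stopword set are illustrative assumptions, and no stemming is done.

```python
import re
from collections import Counter

# Illustrative stopword set only; a real analysis would use a fuller list.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def top_words(text, n=10, stopwords=STOPWORDS):
    """Return the `n` most frequent non-stopword words as (word, count) pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)
```

Calling `top_words(open("dataset_merge.txt").read())` on the merged file would give the most common words and their counts.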

For the pipeline to run, please note the following:

Install Python using a method suited to your operating system.
Change the path in the merger.py file: path="/Users/rasiimwe/hw09-rasiimwe/files/" to align with your system environment.
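
One way to avoid editing the script on each machine is to read the directory from the command line. The sketch below is a suggestion only and is not part of merger.py as it stands: the `--path` option and its default value are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse a hypothetical --path option naming the input directory."""
    parser = argparse.ArgumentParser(
        description="Merge text files into one dataset."
    )
    parser.add_argument(
        "--path",
        default="files/",  # assumed default, relative to the repo root
        help="directory that holds the input text files",
    )
    return parser.parse_args(argv)
```

With this in place the script could be called as `python merger.py --path /some/other/dir/`, and a relative default keeps the pipeline portable across systems.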

Install required packages

install.packages("tm")  # to support text mining
install.packages("SnowballC") # to support text stemming
install.packages("wordcloud") # word-cloud generator

Repo Navigation:- Please visit the following main files 👇:

Homework Files	Description
README.md	This README.md file gives an overview of this repo and useful pointers to key files in my homework-09 repo. It also contains links to past files that provide an introduction to data exploration and analysis
Link to Makefile	This file describes the pipeline commands and the interdependencies of each of the input and output files
Link to md file	This file was rendered purely for visualization
Link to R script	R source code that does the analyses on the merged dataset and provides required pipeline plots
Files	Directory that contains the base files that were used for the merger done by the python script (*.txt)

Sources to Acknowledge:

STAT 547M notes on automating Data-analysis Pipelines

Text mining and word cloud fundamentals in R : 5 simple steps you should know
