Pipelines are common in most walks of life: digital circuits, software, transportation, industry, sales. Pipelines are almost everywhere! A data analysis pipeline takes inputs through a number of processing steps that are chained together to produce the desired output. It is a chain of commands that can be run on one or more datasets, which is very helpful when an analysis has to be rerun, especially across multiple files. A Makefile describes a pipeline of shell commands and the interdependencies of the input and output files of those commands.
- Automation helps us reproduce previous results and recreate deleted results. We can also rerun a pipeline with updated software or on a different dataset.
- Make can resume a pipeline after a failed command without needing to start over. It also has the ability to run independent jobs in parallel.
- A Makefile can be easily displayed as a graphical flow chart of files and shell commands, which is a powerful way of interpreting and communicating pipelines.
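Make can resume where it left off because it compares file modification times: a target is rebuilt only when it is missing or older than one of its prerequisites. The rule can be sketched in Python as below. This is an illustration only, not how Make is implemented; the function name `needs_rebuild` is hypothetical, and real Make also handles phony targets, implicit rules and more.

```python
import os


def needs_rebuild(target, prerequisites):
    """Return True if `target` is missing or older than any prerequisite.

    Hypothetical helper that mimics Make's basic timestamp rule.
    """
    if not os.path.exists(target):
        # A missing target always has to be built.
        return True
    target_mtime = os.path.getmtime(target)
    # Rebuild if any input was modified after the target was last written.
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)
```

For example, `needs_rebuild("dataset_merge.txt", ["merger.py"])` would report whether the merged dataset is out of date relative to the script, assuming `merger.py` is listed as its prerequisite.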
Problem at hand:
I have a number of text files that contain words. Unfortunately, the files are fragmented. They could have come from another pipeline as separate pieces of the same file, and I would like to combine all of them into one large master file that I can use as my dataset.
The Pipeline
The pipeline starts with a Python seed component, `merger.py`, that traverses the directory tree under the specified path and looks for files matching a certain pattern. The script then concatenates those files into one (assuming that all files from a "prior" pipeline were put in one directory).
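The traverse-and-concatenate step could look roughly like the sketch below. It is not the actual contents of `merger.py`: the function name `merge_files`, its parameters, the `*.txt` default pattern and the sorted merge order are all assumptions made for illustration.

```python
import fnmatch
import os


def merge_files(root, pattern="*.txt", output="dataset_merge.txt"):
    """Concatenate every file under `root` whose name matches `pattern`.

    Hypothetical sketch; returns the list of files that were merged.
    """
    output_abs = os.path.abspath(output)
    matches = []
    # Walk the whole directory tree below `root`.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip the output file itself so it is never merged into itself.
            if fnmatch.fnmatch(name, pattern) and os.path.abspath(path) != output_abs:
                matches.append(path)
    matches.sort()  # assumed: a deterministic merge order

    with open(output, "w", encoding="utf-8") as merged:
        for path in matches:
            with open(path, encoding="utf-8") as part:
                text = part.read()
            merged.write(text)
            # Keep pieces on separate lines if one lacks a trailing newline.
            if text and not text.endswith("\n"):
                merged.write("\n")
    return matches
```

Calling `merge_files("files/")` would then write the combined text to `dataset_merge.txt` in the current working directory.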
The output of `merger.py` is `dataset_merge.txt`. This output file is then fed into the `trump_words.R` script, which runs the required analyses on `dataset_merge.txt` and generates the required plots, as shown in the md file. The md file was generated purely for visualization.
- Install Python using any of the methods specified here or here, based on your operating system.
- Change the path in the `merger.py` file, `path="/Users/rasiimwe/hw09-rasiimwe/files/"`, to align with your system environment.
install.packages("tm") # to support text mining
install.packages("SnowballC") # to support text stemming
install.packages("wordcloud") # word-cloud generator
Homework Files | Description |
---|---|
README.md | This README.md file provides an overview of the gist of this repo and useful pointers to key files in my homework-09 repo. It also contains links to past files that provide an introduction to data exploration and analysis |
Link to Makefile | This file describes the pipeline commands and the interdependencies of each of the input and output files |
Link to md file | This file was rendered specifically for visualization |
Link to R script | R source code that does the analyses on the merged dataset and provides required pipeline plots |
Files | Directory that contains the base files that were used for the merger done by the python script (*.txt) |
STAT 5457M notes on automating Data-analysis Pipelines
Text mining and word cloud fundamentals in R : 5 simple steps you should know