A tweet retriever & Hadoop wordcounter job. Built to demonstrate Hadoop HDFS usage for a 'CSC338: Parallel & Distributed Processing' group project at Missouri State University.
It is intended for sentiment analysis of the most recent tweets containing a particular hashtag. For example, if you run the job on tweets containing the hashtag #food, you may be able to draw conclusions about the most-discussed meals within the window of time in which the tweets were retrieved.
There are two major steps in running the project:

- Retrieve tweets via the Twitter API by hashtag and write them to a text file
- Run a wordcounter script with Hadoop to count the number of word occurrences in the retrieved tweets file
- Make sure you have Hadoop 2.7.3 and Python 3 installed on the server or local computer where you will host the project.
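A quick way to confirm the prerequisites are on your PATH:

```sh
hadoop version      # should report 2.7.3
python3 --version   # any Python 3.x (note: the commands below invoke `python`)
```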
- Run the tweet retriever script with `./get-tweets.sh #hashtag-word`, where `#hashtag-word` is your desired hashtag. To quit the script after the desired number of tweets has been retrieved, press Ctrl-C. (A sketch of the retrieval idea appears below.)
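The repository ships `get-tweets.sh` as the entry point; purely as an illustration of the retrieval step, a hashtag stream listener might look like the sketch below. It assumes the tweepy 3.x library and placeholder credentials, neither of which is confirmed by this README, so treat it as a sketch rather than the actual script.

```python
# Hypothetical sketch of a hashtag stream retriever.
# Assumes tweepy 3.x (tweepy 4.x removed StreamListener).
import sys
import tweepy

CONSUMER_KEY = "..."      # placeholders: your Twitter API credentials
CONSUMER_SECRET = "..."
ACCESS_TOKEN = "..."
ACCESS_SECRET = "..."

class FileListener(tweepy.StreamListener):
    """Append each tweet's text to fetched_tweets.txt until Ctrl-C."""
    def __init__(self, path="fetched_tweets.txt"):
        super().__init__()
        self.out = open(path, "a")

    def on_status(self, status):
        self.out.write(status.text + "\n")

    def on_error(self, status_code):
        return False  # stop the stream on any API error

if __name__ == "__main__":
    hashtag = sys.argv[1] if len(sys.argv) > 1 else "#food"
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    stream = tweepy.Stream(auth=auth, listener=FileListener())
    stream.filter(track=[hashtag])  # blocks until interrupted with Ctrl-C
```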
- To copy the text file to HDFS and run the Hadoop job on it, run `./job.sh [textfilename.txt]`. NOTE: the default text file created by the tweet retriever in the previous step is `fetched_tweets.txt`, so use that name unless you plan to run the wordcounter on a different file.
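For context, the streaming invocation that `job.sh` presumably wraps looks roughly like the following. The HDFS paths and the mapper/reducer file names are assumptions, and the jar location shown is Hadoop 2.7.3's default layout, so consult the actual script for specifics:

```sh
# Rough sketch of the steps job.sh likely performs (paths are assumptions)
hdfs dfs -mkdir -p /wordcount-input
hdfs dfs -put fetched_tweets.txt /wordcount-input

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /wordcount-input/fetched_tweets.txt \
    -output /wordcount-output

# Copy the result back to the local working directory
hdfs dfs -get /wordcount-output $PWD/completed-wordcount
```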
- Assuming the job completes successfully, the output will be placed in the local folder where the repository files are located.
- Open and view the text file inside the '/completed-wordcount' directory. Words are sorted from most common to least common.
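The exact output file name depends on how `job.sh` copies results back; assuming Hadoop's default `part-00000` naming (an assumption, not stated here), viewing the top entries might look like:

```sh
head completed-wordcount/part-00000   # assumed default part-file name
```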
- There is a serial wordcount script that can be used in lieu of the Hadoop job script. Run `python serial-wordcount.py [filename.txt]` instead of the job.sh script. (A sketch of the serial approach follows.)
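The contents of `serial-wordcount.py` aren't reproduced here; a minimal sketch of the same serial idea, assuming whitespace tokenization and the most-to-least-common ordering described above, would be:

```python
# Minimal serial word count sketch; the real serial-wordcount.py may
# tokenize or format its output differently.
import sys
from collections import Counter

def main(path):
    counts = Counter()
    with open(path) as f:
        for line in f:
            counts.update(line.split())   # whitespace tokenization (assumed)
    for word, n in counts.most_common():  # most common words first
        print('%s\t%d' % (word, n))

if __name__ == "__main__":
    main(sys.argv[1])
```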
- Commands to manipulate HDFS begin with `hdfs dfs`, followed by the command to execute and any other arguments. For example, `hdfs dfs -ls [directory-name]` can be used to verify that files were copied to HDFS.
- If you need to create a directory on HDFS, run `hdfs dfs -mkdir /directory-name`.
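Put together, a short HDFS session using those commands might look like this (the directory name is only an example):

```sh
hdfs dfs -mkdir /tweets                    # create a directory on HDFS
hdfs dfs -put fetched_tweets.txt /tweets   # copy a local file into it
hdfs dfs -ls /tweets                       # verify the copy succeeded
hdfs dfs -cat /tweets/fetched_tweets.txt   # print the file's contents
```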
- You can type `hadoop-streaming-*.jar` instead of remembering the exact version number when accessing the Hadoop jar file.
- `$PWD` gives the current working directory.
- The Mapper & Reducer are based on this Hadoop Application Walkthrough
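That walkthrough's word-count pattern for Hadoop streaming is the standard stdin/stdout pair sketched below; the scripts in this repository follow the same shape but are not necessarily identical to this sketch.

```python
# mapper.py -- emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print('%s\t1' % word)
```

```python
# reducer.py -- sum the counts for each word; Hadoop streaming delivers
# mapper output sorted by key, so identical words arrive consecutively
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split('\t')
    count = int(count)
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word, current_count = word, count

if current_word is not None:
    print('%s\t%d' % (current_word, current_count))
```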