Home
Before you do anything else, make sure you have the right version of [lein](https://github.com/technomancy/leiningen). If you don't have the right version, run `lein upgrade` and then run `lein clean, deps`.
If inexplicable things happen, make sure your local repository (`~/.m2`) has the latest versions of all dependencies. We've been burned by outdated versions more than once; we're using development versions of a few things, after all.
One issue we've run into is having an outdated version of Cascalog. The solution seems to be the following:
rm -r ~/.m2/repository/cascalog
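The same cleanup can be wrapped in a small helper so it is safe to repeat. This is only a sketch: `purge_cached_dep` is a hypothetical name (it is not part of lein), and it assumes the default local repository location of `~/.m2/repository` unless a second argument says otherwise.

```shell
# Remove one cached group directory from the local Maven repository so the
# next dependency fetch has to download it again.
# purge_cached_dep is a hypothetical helper, not something lein provides.
purge_cached_dep() {
  group="$1"                          # e.g. cascalog
  repo="${2:-$HOME/.m2/repository}"   # assumed default repository location
  if [ -z "$group" ]; then
    # Refuse to run without a group, otherwise we would target the whole repo.
    echo "usage: purge_cached_dep GROUP [REPO_DIR]" >&2
    return 2
  fi
  target="$repo/$group"
  if [ -d "$target" ]; then
    rm -r "$target"
    echo "removed $target"
  else
    echo "nothing cached at $target"
  fi
}
```

For example, `purge_cached_dep cascalog` followed by `lein deps` should pull a fresh copy.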
Now that you have the right code, you need to download and install the elastic-mapreduce Ruby script. Installation is simple: unzip the file you downloaded, then add that location to the PATH in your `.bash_profile`, akin to this:
PATH="/path/to/elastic-mapreduce-ruby:${PATH}"
export PATH
Then either run the command `source ~/.bash_profile` or open a new terminal window and work in that. You'll set up the credentials for actually using elastic-mapreduce below.
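To confirm that the PATH change took effect, something like the following sketch can help. `on_path` is a hypothetical helper name; it relies only on the standard `command -v` builtin.

```shell
# Report whether a command can be found on the current PATH.
# on_path is a hypothetical helper name used for illustration.
on_path() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found at $(command -v "$1")"
  else
    echo "$1: not found on PATH" >&2
    return 1
  fi
}
```

Running `on_path elastic-mapreduce` in a new terminal should report where the script was found; if it reports that the script is not found, recheck the directory you added.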
Finally, and most importantly, pronounce lein as LINE and not LEAN, or else the cluster will shut down in a non-deterministic way.
Save the JSON object below in a file called `credentials.json` in the `forma-deploy` directory.
{
"access-id":"AKIAJ56QWQ45GBJELGQA",
"private-key":"6L7JV5+qJ9yXz1E30e3qmm4Yf7E1Xs4pVhuEL8LV",
"key-pair":"forma-keypair",
"key-pair-file":"~/.ssh/id_rsa-forma-keypair",
"log-uri":"s3n://reddemrlogs"
}
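A quick sanity check on the file can catch a missing field before a deploy fails. The sketch below is an assumption-laden illustration: `check_credentials` is a hypothetical helper, and it only greps for the five key names shown above rather than parsing the JSON or validating the values.

```shell
# Check that a credentials file mentions every key shown in the example above.
# check_credentials is a hypothetical helper; it does not parse JSON and does
# not verify that the values are correct.
check_credentials() {
  file="${1:-credentials.json}"
  if [ ! -f "$file" ]; then
    echo "missing file: $file" >&2
    return 1
  fi
  status=0
  for key in access-id private-key key-pair key-pair-file log-uri; do
    # Match the quoted key name followed by a colon, e.g. "log-uri":
    if ! grep -q "\"$key\"[[:space:]]*:" "$file"; then
      echo "missing key: $key" >&2
      status=1
    fi
  done
  return "$status"
}
```

Run it from the `forma-deploy` directory as `check_credentials`, or pass a path explicitly.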
The `id_rsa-forma-keypair` file referenced in `credentials.json` is required. Ask Dan, Sam, or Robin for it. Once you have it, be sure the permissions are set correctly: if they are too loose, ssh will refuse to use the key and you won't be able to do much. The command to properly set the permissions is:
$ chmod 600 ~/.ssh/id_rsa-forma-keypair
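To verify the result, a sketch like the one below prints the key's mode and flags anything looser than owner-only access. Both function names are hypothetical. It tries GNU `stat -c` first and falls back to BSD/macOS `stat -f`, and it treats `600` and `400` as acceptable, which is an assumption about what you want to allow.

```shell
# Print the octal permission bits of a file.
# GNU stat uses -c, BSD/macOS stat uses -f; try one, then the other.
key_mode() {
  stat -c '%a' "$1" 2>/dev/null || stat -f '%Lp' "$1" 2>/dev/null
}

# Succeed only if the key is readable (and optionally writable) by its owner
# alone. check_key_perms is a hypothetical helper name.
check_key_perms() {
  mode="$(key_mode "$1")" || return 2
  if [ "$mode" = "600" ] || [ "$mode" = "400" ]; then
    echo "ok: $1 has mode $mode"
  else
    echo "too open: $1 has mode $mode, run: chmod 600 $1" >&2
    return 1
  fi
}
```

For example: `check_key_perms ~/.ssh/id_rsa-forma-keypair`.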
First, upload the latest `forma-clj/dev/forma_bootstrap.sh` script to S3 here:
https://s3.amazonaws.com/reddconfig/bootstrap-actions/forma_bootstrap.sh
Note that the above URL is hard-coded here:
https://github.com/reddmetrics/forma-deploy/blob/master/src/forma/hadoop/cluster.clj#L241
The CLI for deploying a cluster lives in `forma-deploy/src/forma/hadoop/cli.clj`. From within the `forma-deploy` directory, use this:
lein run --type large --emr --size 25
There are three supported cluster types, defined in `forma-deploy/src/forma/hadoop/cluster.clj`: `large`, which launches instances of type `m1.large` (ami-4db76624); `high-memory`, which launches instances of type `m2.4xlarge`; and `cluster-compute`, which launches instances of type `cc1.4xlarge`.
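As a quick reference, the mapping above can be written as a lookup. This is only a restatement of the three types listed here, not the code in `cluster.clj`, and `instance_type_for` is a hypothetical name.

```shell
# Map a --type value to the EC2 instance type described above.
# instance_type_for is a hypothetical helper; the real definitions live in
# forma-deploy/src/forma/hadoop/cluster.clj.
instance_type_for() {
  case "$1" in
    large)           echo "m1.large" ;;
    high-memory)     echo "m2.4xlarge" ;;
    cluster-compute) echo "cc1.4xlarge" ;;
    *)
      echo "unknown cluster type: $1" >&2
      return 1
      ;;
  esac
}
```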
It takes a while to boot up the cluster. Here are a few helpful commands for monitoring and terminating a cluster:
$ elastic-mapreduce --list --active
$ # terminate all active clusters/job flows
$ elastic-mapreduce --list --active --terminate
$ # get a cluster/job flow id, and terminate it
$ elastic-mapreduce --terminate --jobflow j-ABABABASABA
$ # log into master node
$ elastic-mapreduce --ssh --jobflow j-ABABABABA
More helpful hints from AWS: http://aws.amazon.com/developertools/2264
You can use the `--all` option to list all past jobs on EMR, not just the ones from the previous week or two.
You can also monitor the cluster via the AWS EMR console:
https://console.aws.amazon.com/elasticmapreduce/home?region=us-east-1
And to monitor nodes coming online, see the AWS EC2 console:
https://console.aws.amazon.com/ec2/home?region=us-east-1#s=Instances
If your cluster gets killed due to spot price spikes, try moving it to a different availability zone (AZ). For now you need to edit the file `src/forma/hadoop/cluster.clj` and add a line like the following (choose your own AZ) in the `boot-emr!` function:
--availability-zone us-east-1e
After the cluster on EMR is in the `Waiting` state, we're ready. Let's `lein uberjar` the `forma-clj` project and then upload it to the job tracker:
$ lein uberjar
# Creates forma-0.2.0-SNAPSHOT-standalone.jar
Then from within `forma-deploy`:
$ elastic-mapreduce --list # Get the job tracker URL (e.g., ec2-67-202-41-140.compute-1.amazonaws.com)
$ sftp -i ~/.ssh/id_rsa-forma-keypair hadoop@ec2-67-202-41-140.compute-1.amazonaws.com
sftp> put forma-0.2.0-SNAPSHOT-standalone.jar
# alternatively, for the scp-inclined:
$ scp -i ~/.ssh/id_rsa-forma-keypair forma-0.2.0-SNAPSHOT-standalone.jar hadoop@ec2-67-202-41-140.compute-1.amazonaws.com:
Next, SSH in as the hadoop user and create a new screen:
$ ssh -i ~/.ssh/id_rsa-forma-keypair hadoop@ec2-67-202-41-140.compute-1.amazonaws.com
$
$ # this gets you a fresh, ready-to-use "window" in screen,
$ # and turns on logging for anything printed to stdout.
$ screen -Lm
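Then you're ready to run a job! Try this, for 500m data in Indonesia. The `*` is quoted so that the shell passes it through to the job instead of expanding it to file names:
$ hadoop jar forma-0.2.0-SNAPSHOT-standalone.jar forma.hadoop.jobs.modis s3n://modisfiles/MOD13A1 s3n://pailbucket/master "*" :IDN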
Or launch a REPL and run your job:
$ hadoop jar forma-0.2.0-SNAPSHOT-standalone.jar clojure.main
Once in the REPL:
(use 'forma.hadoop.jobs.scatter)
(in-ns 'forma.hadoop.jobs.scatter)
(ultrarunner "/user/hadoop/checkpoint"
             "s3n://formaresults/staticbuckettemp"
             "s3n://formaresults/finalbuckettemp"
             "s3n://formaresults/finaloutput")
It can be useful to log into slave nodes on occasion, whether to check memory usage or browse local log files. You can do this by grabbing the private IP address of an instance using the Hadoop Jobtracker. For example, this page gives you the list of nodes you're running:
http://ec2-23-20-44-190.compute-1.amazonaws.com:9100/machines.jsp?type=active
The `Host` column gives you the private IP address for each node, for example `10.28.100.164`. Log into the AWS console and go to the EC2 page, and enter the private IP address in the search/filter box. The instance this matches will become the only one visible. Click as usual to get its public DNS address. If you have `function emr() { ssh -i ~/.ssh/id_rsa-forma-keypair hadoop@$@; }` in your `~/.bash_profile`, you can log in using the public DNS like this:
$ emr ec2-23-20-44-190.compute-1.amazonaws.com
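A slightly more defensive variant of that function is sketched below. The argument check, the `EMR_KEY` override, and the `EMR_DRY_RUN` switch are all hypothetical additions for illustration; only the `ssh -i ... hadoop@HOST` invocation comes from the function above.

```shell
# Log into an EMR node as the hadoop user, given its public DNS name.
# EMR_KEY and EMR_DRY_RUN are hypothetical environment variables:
#   EMR_KEY      overrides the key path (defaults to the one used above)
#   EMR_DRY_RUN  if set, print the ssh command instead of running it
emr() {
  if [ "$#" -ne 1 ]; then
    echo "usage: emr PUBLIC_DNS" >&2
    return 2
  fi
  key="${EMR_KEY:-$HOME/.ssh/id_rsa-forma-keypair}"
  if [ -n "${EMR_DRY_RUN:-}" ]; then
    echo "ssh -i $key hadoop@$1"
  else
    ssh -i "$key" "hadoop@$1"
  fi
}
```

Requiring exactly one argument avoids the surprising command that `hadoop@$@` would build if the function were called with zero or several arguments.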
Nice tutorial for getting started. And here are the basic screen commands you'll need:
Launching:
launch: screen
launch with no welcome screen: screen -m
launch with logging: screen -Lm
launch and immediately disconnect: screen -dLm
Once in screen:
disconnect and kill: ^-d
disconnect without killing: ^-a d
view stdout history: ^-a [
see key bindings: ^-a ?
Reconnect:
get list of existing screens (if more than one): screen -r
reconnect to existing screen (if only one): screen -r
reconnect to most recent screen (if one or more): screen -rr
reconnect to specific screen: screen -r internal-address.ttys.other-stuff
If you somehow get disconnected and screen says something is still attached to a "window", use this to force a disconnect and reconnect immediately:
screen -dr internal-address.ttys.other-stuff
Open the jobtracker in a browser at:
http://<Master DNS>:9100/jobtracker.jsp
For example,
http://ec2-107-21-135-90.compute-1.amazonaws.com:9100/jobtracker.jsp
You can find the DNS here.
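If you look these pages up often, the URLs can be built from the master's public DNS name. This is a small sketch: the function names are hypothetical, and the port and paths are simply the ones that appear in the example URLs on this page.

```shell
# Build the jobtracker URLs from the master node's public DNS name.
# Function names are hypothetical; port 9100 and the .jsp paths are taken
# from the example URLs on this page.
jobtracker_url() {
  echo "http://$1:9100/jobtracker.jsp"
}

active_machines_url() {
  echo "http://$1:9100/machines.jsp?type=active"
}
```

For example, `jobtracker_url ec2-107-21-135-90.compute-1.amazonaws.com` prints the second URL shown above.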
Sometimes `forma_bootstrap.sh` is messed up, or it's not public on S3, causing the EMR cluster to automatically fail and shut down. If this happens, stay calm! Just comment out the bootstrap step here, spin up a small cluster, upload `forma_bootstrap.sh` to the job tracker, run it, and investigate the errors.
Make sure you have the latest version of the EMR CLI tools. Defaults change across versions, and the bootstrap script as of 2/13/2011 expects the version released in December 2011.
To examine the EMR logs, navigate to the `reddemrlogs` bucket on the S3 console or some other S3 access service. The folders are tagged with the Job Flow ID, which can be found by navigating to the Elastic MapReduce tab on the AWS console. Click on the relevant EMR job, and the Job Flow ID will be at the top of the descriptive panel.
If there are memory errors, try out a few things listed on this StackOverflow page
If there's a problem with bootstrapping the cluster, it can be useful to run the bootstrapping script on the AMI we use: `ami-4db76624`.