Spark, EC2 and easy spark-shell deployment

If you have a Scala Spark project and need to deliver the resulting über-jar to an Amazon EC2 Spark master, then launch spark-shell with that jar on the classpath, I have a bash script for you.

#!/usr/bin/env bash

set -o nounset  
set -o errexit

readonly default_env="${0#*-}"  
readonly ENV=${1:-$default_env}  
readonly JAR_NAME="your-analytics.jar"  
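# Resolve the jar path against the absolute project root so it stays valid after the cd below
readonly UPLOAD_JAR="$(cd "$(dirname "$0")/.." && pwd)/target/scala-2.10/$JAR_NAME"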

case $ENV in  
  integration )
    HOST=your-integration-host.compute.amazonaws.com
    ;;
  prod )
    HOST=your-prod-host.compute.amazonaws.com
    ;;
  * )
    echo "environment $ENV not supported"
    exit 1
    ;;
esac

cd "$(dirname "$0")/.."
./sbt assembly

# The remote checksum comes back empty when the jar has not been uploaded yet
readonly remote_md5sum=$(ssh root@$HOST "/usr/bin/md5sum /root/$JAR_NAME | cut -d ' ' -f1")
readonly local_md5sum=$(md5 -q "$UPLOAD_JAR")

if [ "$remote_md5sum" != "$local_md5sum" ]; then  
  scp $UPLOAD_JAR root@$HOST:~
else  
  echo "MD5 sum $remote_md5sum matched local JAR; not uploading"
fi

ssh -t root@$HOST "/root/spark/bin/spark-shell \  
  --jars /root/$JAR_NAME \
  --conf 'config.resource=$ENV.conf' \
  --conf 'spark.executor.extraJavaOptions=-Dconfig.resource=$ENV.conf'"

To use this, assuming it's placed in the bin directory under your Scala project's root directory (with target as a sibling of bin):

$ ./bin/deploy integration

The script has a neat trick: invoke it through a symlink whose name contains a dash, and whatever follows the dash becomes the environment name (that's the default_env="${0#*-}" variable). With zsh partial completion you can then run the same command with fewer keystrokes:

$ cd bin && ln -s deploy deploy-integration && cd ..
$ ./bin/inte[TAB] # zsh partial completion, turns into:
$ ./bin/deploy-integration
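
To see what the "${0#*-}" expansion does, here is a small sketch that applies it to a few example paths. The env_from_name helper is hypothetical and exists only for illustration; the script itself uses the expansion inline on $0.

```shell
#!/usr/bin/env bash

# Hypothetical helper: applies the same expansion the script uses on $0.
# "${name#*-}" strips the shortest leading match of "*-", i.e. everything
# up to and including the first dash.
env_from_name() {
  local name="$1"
  echo "${name#*-}"
}

env_from_name "./bin/deploy-integration"   # integration
env_from_name "./bin/deploy-prod"          # prod
env_from_name "./bin/deploy"               # ./bin/deploy (no dash, nothing stripped)
```

When the name has no dash it passes through unchanged and lands in the script's "not supported" branch unless you pass the environment as an argument. One caveat: the match stops at the first dash, so a dash earlier in the path (say, a project directory called my-project) would leave more than the environment name behind. Using "${0##*-}" strips up to the last dash instead.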

Substitute the appropriate EC2 hosts for your environment(s). You can pull them off your EC2 console or use the ec2/spark-ec2 login cluster-name command to locate your master node's public address.

Here's a breakdown of what this script does:

- takes the environment name from the first argument, falling back to the script filename if none is given
- selects the matching Spark EC2 master host name
- rebuilds your über-jar
- calculates the local and remote MD5 sums for your über-jar and compares them (it's okay if the remote has no copy of the über-jar)
- if the MD5 sums differ, scps the local über-jar to your Spark master
- finally opens an ssh connection to the Spark master and fires up the spark-shell, which:
  - includes your über-jar in the spark-shell session
  - specifies that the typesafe config resource is the environment, e.g., src/main/resources/integration.conf
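
The checksum step above can be looked at in isolation. The sketch below uses a hypothetical needs_upload helper (not in the original script) to show why a missing remote jar is harmless: the remote checksum is simply empty, so it never matches and the upload goes ahead.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds (exit 0) when the jar should be uploaded,
# i.e. when the two checksums differ. An empty remote checksum, which is
# what you get when the remote file is missing, always differs.
needs_upload() {
  local local_sum="$1"
  local remote_sum="$2"
  [ "$local_sum" != "$remote_sum" ]
}

# Made-up checksum values, for illustration only.
if needs_upload "0123abcd" ""; then
  echo "remote jar missing or stale; would scp"
fi

if ! needs_upload "0123abcd" "0123abcd"; then
  echo "checksums match; would skip the upload"
fi
```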

And here are the assumptions:

- you used your Spark distribution's ec2/spark-ec2 to provision the cluster, so Spark is installed in the default location
- you use the sbt-assembly 0.12.0 sbt plugin to assemble the über-jar
- sbt is a local script with sbt-launch.jar (0.13.7) in the root of the project directory
- you use sbt 0.13.7, both for the auto-plugin support that sbt-assembly 0.12.0 relies on and to fix a problem where scala-logging-slf4j 2.1.2 couldn't be included in the über-jar (not everyone can use Scala 2.11.x for the latest scala-logging...)
- your Spark cluster username is root
- you are on OS X (for the md5 command; on Linux, replace it with md5sum piped through cut, as the script already does on the remote side)
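
If the script has to run on both OS X and Linux, the local checksum step could pick whichever tool is installed. This is a sketch rather than part of the original script, and local_md5 is a hypothetical helper name.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the bare MD5 hex digest of a file using
# whichever tool is available (md5sum on Linux, md5 on OS X).
local_md5() {
  local file="$1"
  if command -v md5sum >/dev/null 2>&1; then
    # md5sum prints "<digest>  <filename>"; keep only the digest
    md5sum "$file" | cut -d ' ' -f1
  elif command -v md5 >/dev/null 2>&1; then
    # -q makes the OS X md5 print only the digest
    md5 -q "$file"
  else
    echo "no md5 tool found" >&2
    return 1
  fi
}
```

In the deploy script this would replace the md5 -q call, for example: readonly local_md5sum=$(local_md5 "$UPLOAD_JAR").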

It took me a little while to sort out the assumptions above, mostly scala-logging-slf4j, because of the fragmentation across Scala versions in the official Typesafe libraries. It's likely that if you are reading this a few months down the line, you can get away with using Scala 2.11.x.

The alternatives to this approach are to perform the steps manually (ick), use an AWS VPC with a VPN connection, or create an ssh tunnel. I will use a script similar to this one for firing off detached spark-submit jobs from a build server in the future, so it seemed like a good time investment.
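
For the detached spark-submit idea, one possible shape is to build the remote command as a string and hand it to ssh. This is only a sketch under assumptions: the detached_submit_command helper, the main class name, and the log path are all made up for illustration, and the --conf flags simply mirror the ones the deploy script passes to spark-shell.

```shell
#!/usr/bin/env bash

# Hypothetical helper: build the remote command line for a detached
# spark-submit run. nohup, the output redirects, and the trailing "&" let
# the job keep running after the ssh session ends.
detached_submit_command() {
  local env="$1"
  local main_class="$2"
  local jar_name="$3"
  echo "nohup /root/spark/bin/spark-submit" \
    "--class $main_class" \
    "--conf 'config.resource=$env.conf'" \
    "--conf 'spark.executor.extraJavaOptions=-Dconfig.resource=$env.conf'" \
    "/root/$jar_name" \
    "> /root/$env-job.log 2>&1 < /dev/null &"
}

# Print the command instead of running it. A build server would pass it on:
#   ssh root@$HOST "$(detached_submit_command integration com.example.Main your-analytics.jar)"
detached_submit_command integration com.example.Main your-analytics.jar
```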

And here's a gist of an example project structure. Imagine the _s are /s, as GitHub gists don't support subdirectories in filenames.