If you have a Scala Spark project and you need to deliver the resulting über-jar to an Amazon EC2 Spark master and then execute spark-shell with that jar on the classpath, I have a bash script for you.
#!/usr/bin/env bash
set -o nounset
set -o errexit

# Default environment comes from the script name after the first dash
# (e.g. deploy-integration -> integration); an explicit first argument wins.
readonly default_env="${0#*-}"
readonly ENV=${1:-$default_env}
readonly JAR_NAME="your-analytics.jar"
readonly UPLOAD_JAR="$(dirname "$0")/../target/scala-2.10/$JAR_NAME"

# Pick the Spark EC2 master for the chosen environment.
case $ENV in
  integration )
    HOST=your-integration-host.compute.amazonaws.com
    ;;
  prod )
    HOST=your-prod-host.compute.amazonaws.com
    ;;
  * )
    echo "environment $ENV not supported"
    exit 1
    ;;
esac

# Rebuild the über-jar.
cd "$(dirname "$0")/.."
./sbt assembly

# Upload only if the remote copy is missing or differs from the local build.
readonly remote_md5sum=$(ssh root@$HOST "/usr/bin/md5sum /root/$JAR_NAME | cut -d ' ' -f1")
readonly local_md5sum=$(md5 -q "$UPLOAD_JAR")
if [ "$remote_md5sum" != "$local_md5sum" ]; then
  scp "$UPLOAD_JAR" root@$HOST:~
else
  echo "MD5 sum $remote_md5sum matched local JAR; not uploading"
fi

# Fire up spark-shell on the master with the über-jar on the classpath and
# the environment-specific Typesafe config resource.
ssh -t root@$HOST "/root/spark/bin/spark-shell \
  --jars /root/$JAR_NAME \
  --conf 'config.resource=$ENV.conf' \
  --conf 'spark.executor.extraJavaOptions=-Dconfig.resource=$ENV.conf'"
To use this, assuming it's placed in the bin directory of your Scala project root directory (with target as a sibling directory of bin):
$ ./bin/deploy integration
This script uses a neat trick: you can give it a symlink name containing a dash, where the environment name follows the dash (that's what the default_env="${0#*-}" variable picks up), and basically run the same command using zsh partial completion:
$ cd bin && ln -s deploy deploy-integration && cd ..
$ ./bin/inte[TAB] # zsh partial completion, turns into:
$ ./bin/deploy-integration
Substitute the appropriate EC2 hosts for your environment(s). You can pull them off your EC2 console or use the ec2/spark-ec2 login cluster-name command to locate your master node's public address.
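If you provisioned with ec2/spark-ec2, it can also report the master's address for you; something along these lines should work (the key pair, identity file, and cluster name here are placeholders):
$ ./ec2/spark-ec2 -k your-keypair -i ~/.ssh/your-keypair.pem get-master your-cluster-name
$ ./ec2/spark-ec2 -k your-keypair -i ~/.ssh/your-keypair.pem login your-cluster-name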
Here's a breakdown of what this script does:
- derives the environment from the script filename, if necessary, but accepts the first argument as the environment name
- selects the matching Spark EC2 master host name
- rebuilds your über-jar
- calculates the local and remote MD5 sums for your über-jar and compares them (it's okay if the remote has no copy of the über-jar)
- if the MD5 sums are different, scps the local über-jar to your Spark master (a manual sanity check is sketched just after this list)
- finally, opens an ssh connection to the Spark master and fires up the spark-shell
- includes your über-jar in the spark-shell session
- specifies that the Typesafe config resource is the environment, e.g., src/main/resources/integration.conf
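If you ever want to check the upload by hand, comparing the two MD5 sums directly looks roughly like this (host and jar name as in the script above):
$ ssh root@your-integration-host.compute.amazonaws.com "/usr/bin/md5sum /root/your-analytics.jar"
$ md5 -q target/scala-2.10/your-analytics.jar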
And here are the assumptions:
- you used your Spark distribution's ec2/spark-ec2 script to provision the cluster, so Spark is installed in the default location
- the sbt-assembly 0.12.0 sbt plugin assembles the über-jar
- sbt is a local script with sbt-launch.jar (0.13.7) in the root of the project directory
- sbt 0.13.7 for auto-plugin support for sbt-assembly 0.12.0 and to fix a problem where scala-logging-slf4j 2.1.2 couldn't be included in the über-jar (not everyone can use Scala 2.11.x for the latest scala-logging...)
- your Spark cluster username is root
- you are on OS X (for the md5 command; replace it with md5sum if using Linux, as sketched just after this list)
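On Linux there is no md5 -q, so the local checksum line in the script would look something like this instead (mirroring the remote md5sum line):
readonly local_md5sum=$(md5sum "$UPLOAD_JAR" | cut -d ' ' -f1)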
It took me a little while to sort out the assumptions above, mostly the scala-logging-slf4j one, because of the fragmentation in Scala versions across the official Typesafe libraries. It's likely that if you are reading this a few months down the line you can get away with using Scala 2.11.x.
The alternatives to this approach are to perform the steps manually (ick), use an AWS VPC with a VPN connection, or create an ssh tunnel. I will use a script much like this one for firing off detached spark-submit jobs from a build server in the future, so the script seemed like a good time investment.
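For that build-server case, the tail end of the script would trade the interactive spark-shell for a detached spark-submit; a rough sketch, where the main class and log path are made up:
ssh root@$HOST "nohup /root/spark/bin/spark-submit \
  --class com.example.AnalyticsJob \
  --conf 'config.resource=$ENV.conf' \
  --conf 'spark.executor.extraJavaOptions=-Dconfig.resource=$ENV.conf' \
  /root/$JAR_NAME > /root/analytics.log 2>&1 &"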
And here's a gist of an example project structure. Imagine the _s are /s, as GitHub gists don't support subdirectory filenames.
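For reference, the layout this script assumes looks roughly like this (project and file names are illustrative; the sbt build files are the usual suspects):
your-analytics/
  bin/deploy
  bin/deploy-integration -> deploy
  sbt
  sbt-launch.jar
  build.sbt
  project/plugins.sbt
  src/main/resources/integration.conf
  src/main/resources/prod.conf
  target/scala-2.10/your-analytics.jar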