Getting S3A working correctly on Spark can be a frustrating experience; using S3 as a cost-effective semi-solution for HDFS pretty much requires it because of its various performance improvements. There are bits and pieces of what you need to know scattered across the Internet. This is what I've distilled, along with the stack traces I've faced along the way.
Your Spark cluster will need Hadoop version 2.x or greater. If you use the Spark EC2 setup scripts and maybe missed it, the switch for using something other than 1.0 is to specify
--hadoop-major-version 2 (which uses CDH 4.2 as of this writing). It is possible to upgrade the Hadoop version in place after deployment; see the bottom of this post for what I did.
You'll need to include what may at first seem to be an out-of-date AWS SDK library (built in 2014 as version 1.7.4) for versions of Hadoop as late as 2.7.1 (stable): aws-java-sdk 1.7.4. As far as I can tell, using this along with the specific AWS SDK JARs for 1.10.8 hasn't broken anything. The reason for using an older version of the library is a breaking change in the AWS API. This JAR file can be uploaded to your cluster and shared with every executor, or assembled into an uberjar/sbt assembly. I have an alternative solution that might be more reliable if the older version of the AWS SDK is causing problems beyond those covered here; it appears near the bottom of this post.
You'll also need the hadoop-aws 2.7.1 JAR on the classpath. Again this JAR can be uploaded to your cluster and shared with every executor or assembled into an uberjar/sbt assembly. This JAR contains the class
In spark.properties you probably want some settings that look like this:
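As a rough illustration of the kind of settings meant here, this is a minimal Python sketch that renders three S3A properties as `key=value` lines. It assumes Hadoop 2.7.x property names (`fs.s3a.impl`, `fs.s3a.access.key`, `fs.s3a.secret.key`) and relies on Spark copying any `spark.hadoop.*` property into the Hadoop configuration. The helper names and the credential placeholders are hypothetical, not something from my cluster.

```python
# Hypothetical placeholders: substitute real credentials and never commit them.
ACCESS_KEY = "YOUR_ACCESS_KEY"
SECRET_KEY = "YOUR_SECRET_KEY"


def s3a_spark_properties(access_key, secret_key):
    """Return S3A settings keyed the way Spark expects them.

    Spark copies any `spark.hadoop.*` property into the Hadoop
    configuration with the `spark.hadoop.` prefix removed.
    """
    return {
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }


def render_properties(props):
    """Render a dict as key=value lines suitable for a properties file."""
    return "\n".join(f"{key}={value}" for key, value in props.items())


if __name__ == "__main__":
    # Print the lines so they can be pasted into spark.properties.
    print(render_properties(s3a_spark_properties(ACCESS_KEY, SECRET_KEY)))
```

Running the script prints the three lines, which could then be pasted into the properties file passed to `spark-submit --properties-file`.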
Alternatively you can set these configuration options in your
hdfs-site.xml and deploy that file to all the executors. I prefer
At this point you might execute your job and encounter a nasty NPE:
Lost task 1.3 in stage 4.0 (TID 42, 0.0.0.0): java.lang.NullPointerException
This has to do with missing configuration for the local buffer temp directory. I didn't need to fix this problem myself because I knew my files were small enough to buffer directly from memory to S3; instead in my job I just modified the Hadoop configuration on the
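For reference, a minimal PySpark sketch of flipping that switch from inside a job might look like the following. It assumes PySpark is installed and the S3A JARs are on the classpath; `enable_fast_upload` and the app name are hypothetical, and `_jsc` is PySpark's internal handle to the underlying JavaSparkContext. Because the key is set directly on the Hadoop configuration, it carries no `spark.hadoop.` prefix.

```python
def enable_fast_upload(hadoop_conf):
    """Enable S3A's in-memory upload path.

    `hadoop_conf` can be any object with a set(key, value) method,
    such as the Hadoop Configuration exposed by a SparkContext.
    """
    hadoop_conf.set("fs.s3a.fast.upload", "true")
    return hadoop_conf


def main():
    # Imported here so the helper above stays usable without PySpark.
    from pyspark import SparkContext

    sc = SparkContext(appName="s3a-fast-upload-sketch")  # hypothetical name
    try:
        # _jsc is the JavaSparkContext behind the Python SparkContext.
        enable_fast_upload(sc._jsc.hadoopConfiguration())
        # ... read from and write to s3a:// paths here ...
    finally:
        sc.stop()


if __name__ == "__main__":
    main()
```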
But if your files are too large to fit in memory, or the thought of using
fs.s3a.fast.upload creates anxiety - it's still in beta, after all - then you can set:
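A small sketch of what such a setting could look like, assuming the Hadoop property in question is `fs.s3a.buffer.dir` and that it takes a comma-separated list of local directories. The helper name and the example paths are hypothetical; the optional `spark.hadoop.` prefix is what Spark needs when the key lives in a Spark properties file rather than in the Hadoop configuration itself.

```python
def buffer_dir_property(directories, for_spark_properties=True):
    """Build the (key, value) pair for the S3A local buffer directories.

    Assumes the Hadoop key is fs.s3a.buffer.dir. When the pair is
    destined for a Spark properties file, the key gets the
    spark.hadoop. prefix so Spark forwards it to Hadoop.
    """
    key = "fs.s3a.buffer.dir"
    if for_spark_properties:
        key = "spark.hadoop." + key
    return key, ",".join(directories)


if __name__ == "__main__":
    # Hypothetical local scratch directories on each executor.
    key, value = buffer_dir_property(["/mnt/tmp", "/mnt2/tmp"])
    print(f"{key}={value}")
```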
(A comma-separated list of local directories used to buffer results prior to transmitting them to S3. Ignored if
fs.s3a.fast.upload is set to
true.) If you set this in
spark.properties, which is probably a good idea since it's a decent default when you cannot use
fs.s3a.fast.upload remember that the key includes
Here's the complete list of S3A configuration options.
If you see a different exception message:
Then make sure you're using
aws-java-sdk-1.7.4.jar and not a more recent version.
And if you see this exception message:
Then it's because you're running Hadoop 1.0. Specify a
--hadoop-major-version parameter of either 2 or
yarn when using
spark-ec2 to recreate your cluster.
Finally if you see this exception message while trying to use
Ensure that the
jets3t library is on your classpath for the driver (if you merge results) and executors. I ended up bundling version 0.9.0 for my master and slaves, although my Hadoop distribution had
net.java.dev.jets3t % jets3t % 0.9.0
(For more detailed information on steps 2 and 3 see this gist.)
aws-java-sdk-1.7.4 with a
(Hadoop 2.8 branch)