Getting S3A working correctly on Spark can be a frustrating experience, but using S3 as a cost-effective semi-solution for HDFS pretty much requires it because of its various performance (speed) improvements. There are bits and pieces of what you need to know scattered across the Internet. This is what I've distilled, along with the stack traces I've faced along the way.

  1. Your Spark cluster will need Hadoop version 2.x or greater. If you use the Spark EC2 setup scripts and missed it, the switch for using something other than 1.0 is --hadoop-major-version 2 (which uses CDH 4.2 as of this writing). It is possible to upgrade the Hadoop version in place after deployment; see the bottom of this post for what I did.

  2. You'll need to include what may at first seem to be an out-of-date AWS SDK library (built in 2014 as version 1.7.4) for versions of Hadoop as late as 2.7.1 (stable): aws-java-sdk 1.7.4. As far as I can tell, using this alongside the specific AWS SDK JARs for 1.10.8 hasn't broken anything. The older version of the library is needed because of a breaking change in the AWS SDK's API. This JAR file can be uploaded to your cluster and shared with every executor, or assembled into an uberjar/sbt assembly. I have an alternative solution that might be more reliable if the older version of the AWS SDK is causing problems beyond those covered here; it appears near the bottom of this post.

  3. You'll also need the hadoop-aws 2.7.1 JAR on the classpath. Again, this JAR can be uploaded to your cluster and shared with every executor, or assembled into an uberjar/sbt assembly. It contains the class org.apache.hadoop.fs.s3a.S3AFileSystem.

  4. In spark.properties you probably want some settings that look like this:

spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem  
spark.hadoop.fs.s3a.access.key=ACCESSKEY  
spark.hadoop.fs.s3a.secret.key=SECRETKEY  

Alternatively you can set these configuration options in your hdfs-site.xml and deploy that file to all the executors. I prefer spark.properties myself.
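If you go the uberjar/sbt assembly route for steps 2 and 3, the two dependencies might be declared roughly like this. This is a minimal sketch: the group IDs are the usual Maven Central coordinates, the versions are the ones named above, and the rest of your build.sbt (assembly plugin, merge strategy, Spark itself) is assumed to exist already.

```scala
// build.sbt (fragment): the two JARs from steps 2 and 3.
libraryDependencies ++= Seq(
  // The older AWS SDK that hadoop-aws 2.7.1 expects.
  "com.amazonaws" % "aws-java-sdk" % "1.7.4",
  // Provides org.apache.hadoop.fs.s3a.S3AFileSystem.
  "org.apache.hadoop" % "hadoop-aws" % "2.7.1"
)
```

If you upload the JARs to the cluster instead, nothing needs to change in the build.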

At this point you might execute your job and encounter a nasty NPE:

Lost task 1.3 in stage 4.0 (TID 42, 0.0.0.0): java.lang.NullPointerException  
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:268)
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:344)
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.createTmpFileForWrite(LocalDirAllocator.java:412)
        at org.apache.hadoop.fs.LocalDirAllocator.createTmpFileForWrite(LocalDirAllocator.java:198)
        at org.apache.hadoop.fs.s3a.S3AOutputStream.<init>(S3AOutputStream.java:87)

This has to do with a missing configuration value for the local buffer (temp) directory. I didn't need to fix this problem myself because I knew my files were small enough to buffer directly from memory to S3; instead, in my job I just modified the Hadoop configuration on the SparkContext:

sc.hadoopConfiguration.set("fs.s3a.fast.upload", "true")  

But if your files are too large to fit in memory, or the thought of using fs.s3a.fast.upload creates anxiety - it's still in beta, after all - then you can set:

sc.hadoopConfiguration.set("fs.s3a.buffer.dir", "/root/spark/work,/tmp")  

(A comma-separated list of local directories used to buffer results prior to transmitting them to S3. Ignored if fs.s3a.fast.upload is set to true.) If you set this in spark.properties, which is probably a good idea since it's a decent default when you cannot use fs.s3a.fast.upload, remember that the key includes the spark.hadoop prefix, i.e., spark.hadoop.fs.s3a.buffer.dir=/root/spark/work,/tmp.
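To make the prefix rule concrete, here is a rough sketch of the same settings applied from code. It assumes the job is launched with spark-submit (so the master is supplied externally); the object name and app name are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical object name; launch with spark-submit so the master is set.
object S3AConfigSketch {
  def main(args: Array[String]): Unit = {
    // Keys on SparkConf (or in spark.properties) carry the spark.hadoop.
    // prefix; Spark copies them into the Hadoop configuration without it.
    val conf = new SparkConf()
      .setAppName("s3a-config-sketch") // illustrative app name
      .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
      .set("spark.hadoop.fs.s3a.buffer.dir", "/root/spark/work,/tmp")

    val sc = new SparkContext(conf)

    // Keys set directly on the Hadoop configuration take no prefix.
    // For small files you could instead use:
    //   sc.hadoopConfiguration.set("fs.s3a.fast.upload", "true")

    // Print the values the S3A filesystem will actually read.
    Seq("fs.s3a.impl", "fs.s3a.buffer.dir").foreach { key =>
      println(s"$key=${sc.hadoopConfiguration.get(key)}")
    }

    sc.stop()
  }
}
```

Printing the resolved values on the driver is a quick way to confirm that a setting made it through before blaming S3A itself.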

Here's the complete list of S3A configuration options.

If you see a different exception message:

java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManagerConfiguration.setMultipartUploadThreshold(I)V  

Then make sure you're using aws-java-sdk-1.7.4.jar and not a more recent version.
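The (I)V at the end of that message is the JVM descriptor for a method that takes one int and returns nothing, which is the signature hadoop-aws was compiled against. If it isn't obvious which SDK JAR is actually being loaded, a reflection check along these lines may help. It is only a sketch: the WhichJar object and its method names are my own, and it inspects whichever JVM it runs in, so it needs to run on an executor to say anything about the executors.

```scala
object WhichJar {
  // The JAR (or directory) a class was loaded from, if the JVM reports one.
  def locationOf(className: String): Option[String] =
    Option(Class.forName(className).getProtectionDomain.getCodeSource)
      .flatMap(cs => Option(cs.getLocation))
      .map(_.toString)

  // Parameter types of every public overload of the named method.
  def overloads(className: String, method: String): Seq[String] =
    Class.forName(className).getMethods.toSeq
      .filter(_.getName == method)
      .map(_.getParameterTypes.map(_.getName).mkString("(", ", ", ")"))

  def main(args: Array[String]): Unit = {
    val cls = "com.amazonaws.services.s3.transfer.TransferManagerConfiguration"
    // Which JAR supplied the class?
    println(locationOf(cls).getOrElse("unknown location"))
    // hadoop-aws 2.7.1 needs an overload taking a single int here.
    overloads(cls, "setMultipartUploadThreshold").foreach(println)
  }
}
```

If the printed location is not aws-java-sdk-1.7.4.jar, or no overload taking an int shows up, a newer SDK is shadowing the one you meant to ship.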

And if you see this exception message:

java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.getTrimmed(Ljava/lang/String;Ljava/lang/String;)Ljava/lang/String;  

Then it's because you're running Hadoop 1.0. Specify a --hadoop-major-version parameter of either 2 or yarn when using spark-ec2 to recreate your cluster.

Finally if you see this exception message while trying to use s3 or s3n:

java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException  

Ensure that the jets3t library is on your classpath for the driver (if you merge results) and the executors. I ended up bundling version 0.9.0 for my master and slaves, even though my Hadoop distribution already had ./ephemeral-hdfs/share/hadoop/common/lib/jets3t-0.6.1.jar.

"net.java.dev.jets3t" % "jets3t" % "0.9.0"

(For more detailed information on steps 2 and 3 see this gist.)

Replacing aws-java-sdk-1.7.4 with a hadoop-aws-3.0.0-SNAPSHOT.jar

(Hadoop 2.8 branch)

TODO