Friday, July 25, 2008

Fast Amazon S3 downloads to EC2 using curl

Our company is building an application that depends heavily on Amazon Web Services: Simple Storage Service (S3), Elastic Compute Cloud (EC2), and Simple Queue Service (SQS). We use Java for much of the business logic and Groovy for the 'glue code' that interacts with frameworks and the like.

Anyway, I've spent some time tuning downloads and found that they run roughly an order of magnitude faster when I shell out to 'curl' than when I use JetS3t or the lower-level HttpClient. Note that this speed-up only occurs when moving an S3 object to an EC2 instance, not when moving it outside the cloud (to my laptop, for instance).

For some reason, uploads using JetS3t are very fast. Our guess at this point is that HttpClient (which JetS3t depends upon) is causing the download slowdown, either because it can't deal with the extra-large packet sizes that AWS allows within its cloud or because we haven't configured it properly to do so.

Being a developer on a schedule, I punted and shelled out to curl, using a signed URL that JetS3t generates for my S3 object GET.

Here is the pseudo-code in Groovy (would be trivial to convert to plain Java) for the shell-out:


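import org.jets3t.service.S3Service
import org.jets3t.service.model.S3Bucket

S3Service s3service = ... // Inject an instance of JetS3t S3Service
File file = ... // File representing download location on disk
S3Bucket bucket = ... // Bucket object (could just be a string)
String key = ... // Key of the S3 object to download
long low = ..., hi = ... // Inclusive byte range to fetch

// Sign a GET URL that expires 60 seconds from now.
Date date = s3service.getCurrentTimeWithOffset()
long secondsSinceEpoch = date.time.intdiv(1000) + 60L
def url = new URL(S3Service.createSignedUrl('GET', bucket.name, key, null, null, s3service.AWSCredentials, secondsSinceEpoch, false, false))

def cmd = ['curl']
// I break up large downloads so here is an optional byte range.
cmd += ['--range', "${low}-${hi}"]
// --silent hides the progress meter; --show-error still prints failures.
cmd += ['--silent', '--show-error']
cmd += ['--connect-timeout', '30']
cmd += ['--retry', '5']
cmd += ['--output', file.absolutePath]
cmd += [url.toString()]

Process p = cmd.execute()
// Drain stdout/stderr so curl cannot block on a full pipe buffer.
p.consumeProcessOutput(System.out, System.err)
p.waitFor()
if (p.exitValue() != 0) {
    throw new IllegalStateException("Curl process exited with error code ${p.exitValue()}")
}
LOG.info("${file.name} download completed")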
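
For anyone who wants the plain Java version, here is a rough sketch of the same shell-out using ProcessBuilder. It assumes you already have the signed URL as a string (from JetS3t, as above). The class name CurlDownload, its method names, and the splitRanges helper for chunking a large object are made up for illustration, and the --silent flag is my addition to keep curl's progress meter out of the logs.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class CurlDownload {

    /** Splits an object of totalLength bytes into inclusive "low-hi" ranges of at most chunkSize bytes. */
    public static List<String> splitRanges(long totalLength, long chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<String> ranges = new ArrayList<>();
        for (long low = 0; low < totalLength; low += chunkSize) {
            // HTTP byte ranges are inclusive on both ends, hence the -1.
            long hi = Math.min(low + chunkSize, totalLength) - 1;
            ranges.add(low + "-" + hi);
        }
        return ranges;
    }

    /** Builds the curl argument list. Each element is passed as one argument, so no shell quoting is needed. */
    public static List<String> buildCommand(String signedUrl, File target, String range) {
        List<String> cmd = new ArrayList<>();
        cmd.add("curl");
        if (range != null) { // the byte range is optional
            cmd.add("--range");
            cmd.add(range);
        }
        cmd.add("--silent");
        cmd.add("--show-error");
        cmd.add("--connect-timeout");
        cmd.add("30");
        cmd.add("--retry");
        cmd.add("5");
        cmd.add("--output");
        cmd.add(target.getAbsolutePath());
        cmd.add(signedUrl);
        return cmd;
    }

    /** Runs curl and throws if it exits with a non-zero code. */
    public static void download(String signedUrl, File target, String range)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand(signedUrl, target, range))
                .inheritIO() // curl's stderr goes straight to ours, so no pipe can fill up
                .start();
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IllegalStateException("Curl process exited with error code " + exit);
        }
    }
}
```

Each range from splitRanges would presumably go to its own download call with its own target file, and the pieces would be concatenated afterwards. How you schedule those calls (serially or in parallel) is up to you.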


One final note: this will capture a curl process error but not many of the errors that you could experience when working with S3. For example, if the key did not exist, the curl process would succeed but the downloaded file would contain the Amazon error response XML instead of the intended file. So it is your responsibility to first call s3service.getObjectDetails(..) to make sure the object exists, and then to check the downloaded content length (and possibly content type) to ensure that you received your object and not an error. (curl's --fail option makes it exit non-zero on HTTP responses of 400 and above, which may catch some of these cases, but the curl documentation describes it as not fail-safe, so I would still do the checks.)
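
A rough sketch of the post-download check in plain Java could look like the following. It assumes you already know the expected byte count, either from the content length reported by getObjectDetails or from the byte range you requested. The DownloadCheck class and its method names are made up for illustration, and the error sniffing is only a heuristic based on the `<Error>` and `<Code>` elements that S3 error documents contain.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class DownloadCheck {

    /** Number of bytes expected from an inclusive HTTP byte range "low-hi". */
    public static long expectedRangeLength(long low, long hi) {
        return hi - low + 1;
    }

    /** Heuristic: does the start of the file look like an S3 XML error document? */
    public static boolean looksLikeS3Error(Path file) throws IOException {
        byte[] head = new byte[512];
        int n;
        try (InputStream in = Files.newInputStream(file)) {
            n = in.readNBytes(head, 0, head.length); // reads up to 512 bytes
        }
        String text = new String(head, 0, n, StandardCharsets.ISO_8859_1);
        return text.contains("<Error>") && text.contains("<Code>");
    }

    /** Throws if the downloaded file does not have the expected size. */
    public static void verify(Path file, long expectedLength) throws IOException {
        long actual = Files.size(file);
        if (actual != expectedLength) {
            String hint = looksLikeS3Error(file) ? " (file looks like an S3 error response)" : "";
            throw new IllegalStateException(
                    "Expected " + expectedLength + " bytes but got " + actual + hint);
        }
    }
}
```

For a ranged download you would call something like DownloadCheck.verify(path, DownloadCheck.expectedRangeLength(low, hi)) right after curl returns. The length comparison is what decides pass or fail; the XML sniffing only adds a hint to the exception message.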
