Wednesday, April 15, 2020

S3 Multipart Upload of Memory Mapped File in Java

In AWS, when uploading files bigger than 5 GB, you have no choice but to use a Multipart Upload. The first versions of the AWS SDK for Java had a TransferManager class to handle all the low-level bits for you, but I did not find any trace of it in the latest versions.
The low-level Multipart Upload operations expect you to split your file into parts, each time providing a buffer for the part you want to upload. The best way to walk through a big file without reading it all into memory is to use a Memory Mapped File. Here is my code:
// Requires the AWS SDK for Java v2 (software.amazon.awssdk:s3).
// REGION and BUFFER_SIZE are class constants; BUFFER_SIZE is the part size in bytes.
private void uploadMultiPartFile(String bucketName, Path path, String key) throws IOException {
    S3Client s3 = S3Client.builder().region(REGION).build();

    // Start the multipart upload and keep the upload id for the part requests.
    CreateMultipartUploadRequest createMultipartUploadRequest = CreateMultipartUploadRequest.builder()
        .bucket(bucketName)
        .key(key)
        .build();
    CreateMultipartUploadResponse response = s3.createMultipartUpload(createMultipartUploadRequest);
    String uploadId = response.uploadId();

    long position = 0;
    long fileSize = Files.size(path);
    int part = 1;
    List<CompletedPart> completedParts = new ArrayList<>();

    try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
        while (position < fileSize) {
            // Map only the slice of the file corresponding to the current part.
            long remaining = fileSize - position;
            int toRead = (int) Math.min(BUFFER_SIZE, remaining);
            MappedByteBuffer map = channel.map(MapMode.READ_ONLY, position, toRead);

            // Upload the part and remember its ETag: it is needed to complete the upload.
            UploadPartRequest uploadPartRequest = UploadPartRequest.builder()
                .bucket(bucketName)
                .key(key)
                .uploadId(uploadId)
                .partNumber(part)
                .build();
            String etag = s3.uploadPart(uploadPartRequest, RequestBody.fromByteBuffer(map)).eTag();
            completedParts.add(CompletedPart.builder().partNumber(part).eTag(etag).build());

            position += toRead;
            part++;
        }
    }

    // Tell S3 to assemble the uploaded parts into the final object.
    CompletedMultipartUpload completedMultipartUpload = CompletedMultipartUpload.builder()
        .parts(completedParts)
        .build();
    CompleteMultipartUploadRequest completeMultipartUploadRequest = CompleteMultipartUploadRequest.builder()
        .bucket(bucketName)
        .key(key)
        .uploadId(uploadId)
        .multipartUpload(completedMultipartUpload)
        .build();
    s3.completeMultipartUpload(completeMultipartUploadRequest);
}
For my case, I set BUFFER_SIZE to 100 MB. The AWS API states that a part can be up to 5 GB, but I had the unpleasant surprise of discovering that the buffer size parameter in the Java SDK is an int instead of a long, so you cannot use anything above 2 GB. When I tried 2 GB, the process would just hang, so I guess there are other limitations in the OS.
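If you would rather derive the part size from the file size than hard-code it, a helper along these lines works (choosePartSize is not part of the code above, just a minimal sketch): it respects the documented S3 limits of at least 5 MiB per part (except the last one) and at most 10,000 parts, and refuses anything that does not fit in an int because of the limitation I just mentioned.

// Sketch of a helper that derives a part size from the file size, within the
// documented S3 limits: parts of at least 5 MiB (except the last one), at most
// 10,000 parts, and at most Integer.MAX_VALUE bytes per part because of the
// int limitation above.
private static int choosePartSize(long fileSize) {
    final long MIN_PART_SIZE = 5L * 1024 * 1024; // 5 MiB, the S3 minimum part size
    final int MAX_PARTS = 10_000;                // the S3 maximum number of parts
    long partSize = Math.max(MIN_PART_SIZE, (fileSize + MAX_PARTS - 1) / MAX_PARTS);
    if (partSize > Integer.MAX_VALUE) {
        throw new IllegalArgumentException("File too big for int-sized parts: " + fileSize);
    }
    return (int) partSize;
}

Even with int-sized parts, 10,000 parts cover far more than the 5 TB maximum object size, so in practice the int limitation only hurts if you want very few, very large parts.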
However, 100 MB is fine for me because it allows me to upload files of up to 1 TB (10,000 parts is the maximum allowed). But you can play with this parameter: a bigger part size means fewer part requests and a bigger maximum file size. Another optimization would be to use a parallel stream to run several part uploads in parallel, but I didn't try it.
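An untested sketch of that parallel idea, assuming the part size and part count are computed beforehand and reusing the same S3Client (the v2 clients are thread-safe), could look like this:

// Untested sketch: upload the parts from a parallel stream. Each part maps its
// own slice of the channel. partSize, partCount, fileSize, bucketName, key,
// uploadId, s3 and channel are assumed to be in scope, as in the method above.
// (Needs java.util.stream.IntStream, java.util.stream.Collectors and
// java.io.UncheckedIOException.)
List<CompletedPart> completedParts = IntStream.rangeClosed(1, partCount)
    .parallel()
    .mapToObj(part -> {
        long position = (long) (part - 1) * partSize;
        int toRead = (int) Math.min(partSize, fileSize - position);
        try {
            MappedByteBuffer map = channel.map(MapMode.READ_ONLY, position, toRead);
            UploadPartRequest uploadPartRequest = UploadPartRequest.builder()
                .bucket(bucketName)
                .key(key)
                .uploadId(uploadId)
                .partNumber(part)
                .build();
            String etag = s3.uploadPart(uploadPartRequest, RequestBody.fromByteBuffer(map)).eTag();
            return CompletedPart.builder().partNumber(part).eTag(etag).build();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    })
    .collect(Collectors.toList());

The parts must be passed to CompleteMultipartUpload in ascending part number order; collecting an ordered IntStream preserves that order even when the work runs in parallel.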
One last piece of advice if you start playing with Multipart Uploads: if your upload fails at some point, the parts already uploaded remain stored in S3, but you cannot see them, and you still pay for them. So do not forget to attach a lifecycle rule that aborts incomplete Multipart Uploads.
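As a sketch of that last point, attaching such a rule with the v2 SDK looks roughly like this (the rule id and the 7-day delay are arbitrary choices, and the empty prefix makes the rule apply to the whole bucket):

// Sketch: lifecycle rule that aborts incomplete multipart uploads after 7 days,
// so the invisible (but billed) orphaned parts get cleaned up automatically.
s3.putBucketLifecycleConfiguration(PutBucketLifecycleConfigurationRequest.builder()
    .bucket(bucketName)
    .lifecycleConfiguration(BucketLifecycleConfiguration.builder()
        .rules(LifecycleRule.builder()
            .id("abort-incomplete-multipart-uploads")
            .status(ExpirationStatus.ENABLED)
            .filter(LifecycleRuleFilter.builder().prefix("").build())
            .abortIncompleteMultipartUpload(AbortIncompleteMultipartUpload.builder()
                .daysAfterInitiation(7)
                .build())
            .build())
        .build())
    .build());

The same rule can of course be set from the S3 console or the CLI instead.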
