Wednesday, April 15, 2020

S3 Multipart Upload of Memory Mapped File in Java

In AWS, when uploading files bigger than 5 GB, you have no choice but to use a Multipart Upload. In the first versions of the AWS SDK for Java, you had a TransferManager class to handle all the low-level bits for you. Unfortunately, I did not find any trace of it in the latest versions.
The low-level operations of the Multipart Upload expect you to divide your file into parts, each time providing a buffer holding the part you want to upload. The best way to go through a big file without reading it all into memory is to use a memory-mapped file. Here is my code:
private void uploadMultiPartFile(String bucketName, Path path, String key) throws IOException {
    S3Client s3 = S3Client.builder().region(REGION).build();
 
    CreateMultipartUploadRequest createMultipartUploadRequest = CreateMultipartUploadRequest.builder()
       .bucket(bucketName)
       .key(key)
       .build();

    // Initiate the multipart upload; S3 returns an upload id used in every subsequent call.
    CreateMultipartUploadResponse response = s3.createMultipartUpload(createMultipartUploadRequest);
    String uploadId = response.uploadId();

    long position = 0;
    long fileSize = Files.size(path);
    int part = 1;
    List<CompletedPart> completedParts = new ArrayList<>();

    try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
        while (position < fileSize) {
            long remaining = fileSize - position;
            int toRead = (int) Math.min(BUFFER_SIZE, remaining);
            // Map only the current slice of the file; the rest is never read into memory.
            MappedByteBuffer map = channel.map(MapMode.READ_ONLY, position, toRead);

            UploadPartRequest uploadPartRequest = UploadPartRequest.builder()
                .bucket(bucketName)
                .key(key)
                .uploadId(uploadId)
                .partNumber(part)
                .build();
            // S3 returns an ETag for each part; it is needed to complete the upload at the end.
            String etag = s3.uploadPart(uploadPartRequest, RequestBody.fromByteBuffer(map)).eTag();
            CompletedPart completed = CompletedPart.builder().partNumber(part).eTag(etag).build();
            completedParts.add(completed);
   
            position += BUFFER_SIZE;
            part++;
        }
    }  
 
    // Finally, tell S3 the upload is done, listing every part with its ETag.
    CompletedMultipartUpload completedMultipartUpload = CompletedMultipartUpload.builder().parts(completedParts).build();
    CompleteMultipartUploadRequest completeMultipartUploadRequest = CompleteMultipartUploadRequest.builder()
        .bucket(bucketName)
        .key(key)
        .uploadId(uploadId)
        .multipartUpload(completedMultipartUpload)
        .build();
    s3.completeMultipartUpload(completeMultipartUploadRequest);
}
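The method refers to two constants that are not defined above. As an illustration only, they could be declared like this (the region is just an example, and the buffer size matches the 100 MB choice discussed below):

// Illustrative declarations: the method above assumes these two constants exist.
private static final Region REGION = Region.EU_WEST_1;      // software.amazon.awssdk.regions.Region
private static final long BUFFER_SIZE = 100L * 1024 * 1024; // 100 MB per part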
In my case, I set BUFFER_SIZE to 100 MB. The AWS API states that parts can be up to 5 GB, but I had the unpleasant surprise of discovering that the buffer size parameter in the Java SDK is an int instead of a long, so you cannot use anything above 2 GB. When I tried to use 2 GB, the process would just hang, so I guess there are other limitations in the OS as well.
However, 100 MB is fine for me because it allows me to upload files up to 1 TB (10,000 parts is the maximum allowed). You can still play with this parameter: bigger parts mean fewer requests, which can make the upload faster. Another optimization would be to use a parallel stream to upload several parts in parallel, but I didn't try it.
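Since I have not tried it, treat this as an untested sketch, but the sequential while loop above could be replaced by something like the following (it reuses the same s3 client, which is thread-safe in SDK v2, plus bucketName, key, uploadId, channel, fileSize and BUFFER_SIZE, and needs java.util.stream.IntStream, java.util.stream.Collectors and java.io.UncheckedIOException):

// Untested sketch: upload the parts in parallel instead of the sequential loop above.
int partCount = (int) ((fileSize + BUFFER_SIZE - 1) / BUFFER_SIZE);
List<CompletedPart> completedParts = IntStream.rangeClosed(1, partCount)
    .parallel()
    .mapToObj(partNumber -> {
        try {
            long start = (long) (partNumber - 1) * BUFFER_SIZE;
            int toRead = (int) Math.min(BUFFER_SIZE, fileSize - start);
            // Each part maps its own read-only slice of the file, so nothing is shared between threads.
            MappedByteBuffer map = channel.map(MapMode.READ_ONLY, start, toRead);
            UploadPartRequest request = UploadPartRequest.builder()
                .bucket(bucketName)
                .key(key)
                .uploadId(uploadId)
                .partNumber(partNumber)
                .build();
            String etag = s3.uploadPart(request, RequestBody.fromByteBuffer(map)).eTag();
            return CompletedPart.builder().partNumber(partNumber).eTag(etag).build();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    })
    .collect(Collectors.toList()); // encounter order is preserved, so the parts stay in order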
One last piece of advice if you start playing with Multipart Uploads: if your upload fails at some point, the parts already uploaded stay in your S3 bucket, but you cannot see them, and you still pay for them. So do not forget to attach a lifecycle policy that aborts incomplete Multipart Uploads.
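As a rough sketch of what such a policy could look like with the same Java SDK (the rule id and the 7-day delay below are arbitrary examples, and the rule applies to the whole bucket):

// Sketch: lifecycle rule that aborts incomplete multipart uploads after 7 days.
// The rule id, prefix and delay are arbitrary examples.
LifecycleRule abortRule = LifecycleRule.builder()
    .id("abort-incomplete-multipart-uploads")
    .filter(LifecycleRuleFilter.builder().prefix("").build()) // whole bucket
    .status(ExpirationStatus.ENABLED)
    .abortIncompleteMultipartUpload(
        AbortIncompleteMultipartUpload.builder().daysAfterInitiation(7).build())
    .build();

s3.putBucketLifecycleConfiguration(PutBucketLifecycleConfigurationRequest.builder()
    .bucket(bucketName)
    .lifecycleConfiguration(BucketLifecycleConfiguration.builder().rules(abortRule).build())
    .build());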

Wednesday, April 8, 2020

Use Terraform to transform a CSV file to JSON documents

I used this trick to ingest items into DynamoDB. Terraform is maybe not the best tool for that, but we had already created the DynamoDB table with it, and we needed to import a couple of items representing metadata at creation time. So here it is.
I have a semicolon-separated CSV file, where each line contains the data for one JSON document. Here is the Terraform code:

locals {
    content = file("myfile.csv")
    lines = split("\n", local.content)
    json_docs = [for item in local.lines: format(<<EOT
{
    "key1": {
      "S": "%s"
    },
    "key2": {
      "S": "%s"
    },
    "key3": {
      "S": "%s"
    }
}
EOT
    , split(";", item)...)]
}

resource "aws_dynamodb_table_item" "items" {
    count = length(local.json_docs)

    table_name = aws_dynamodb_table.mytable.name
    hash_key   = aws_dynamodb_table.mytable.hash_key

    item = local.json_docs[count.index]
}

I first copy the content of the CSV file into a string, then split it along newline characters (I assume it is a Unix-style file). A for expression then transforms each line into a string containing the JSON document. For this, I use the format function, with the template of the JSON document and the line split along the semicolons as parameters. Notice that to expand my split line, which is a list, into individual arguments of the format function, I have to use the three-periods (...) expansion symbol.
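For example, a made-up input line like

foo;bar;baz

would produce the following item document:

{
    "key1": {
      "S": "foo"
    },
    "key2": {
      "S": "bar"
    },
    "key3": {
      "S": "baz"
    }
}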

Finally, I can import all those documents into my DynamoDB table.