Friday, December 18, 2020

Group files by folders in Python

I sometimes need to display a list of files coming in this format:

folder1/file1
folder2/file1
folder1/file2 ...

And I want to display it in this format:

folder1
  file1
  file2

folder2
  file1

Here is my code. It uses the groupby function from the itertools module:

from itertools import groupby

def format_files_by_folder(folder, filenames):
    return folder + "\n  " + "\n  ".join([f[1] for f in filenames])

def file_by_folder(file_list):
    files_and_folders = [(f.split('/')[0], '/'.join(f.split('/')[1:])) 
        for f in file_list]

    # Group by folder
    files_and_folders.sort(key=lambda f: f[0])
    files_by_folder = groupby(files_and_folders, lambda f: f[0])

    return "\n\n".join(
        [format_files_by_folder(folder, filenames) 
            for folder, filenames in files_by_folder])
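
For example, a quick call to this function (the expected output is shown as comments) looks like this:

print(file_by_folder([
    "folder1/file1",
    "folder2/file1",
    "folder1/file2",
]))
# folder1
#   file1
#   file2
#
# folder2
#   file1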


Thursday, November 19, 2020

Snake Case in Terraform

How do you convert from camel case to snake case in Terraform? How do you go from "MyProjectName" to "my_project_name"? Here is a simple solution:

locals {
  # convert to snake case
  snake_case_name = lower(replace(var.camel_case_name,
    "/(\\w)([A-Z])/", "$${1}_$2"))
}

Basically, you add an underscore before each capital letter. The strange double dollar syntax is a workaround: Terraform would otherwise try to read '$1_' as a single capture group reference, so you delimit the group number as '${1}', and the '${' itself has to be escaped as '$${' inside a Terraform string.

This line works well if you have simple cases like "MyDatabaseName". However, I had to handle some more complex cases, like "ABCProject", or "ProjectABC". In that case, you have to work a bit more.

My solution was to implement two replace functions:

  • One for a series of capital letters anywhere in my word. I insert an underscore only if the capital letter is followed by a lowercase letter.
  • One for a series of capital letters at the end of my word. I insert an underscore before the first letter of the series.
This gives this more complex command:

locals {
  # convert to snake case
  snake_case_name = lower(replace(replace(var.camel_case_name,
      # add underscore before a capital letter followed by a lowercase letter
      "/(\\w)([A-Z][a-z])/", "$${1}_$2"),
      # add underscore before capital letters at the end of the word
      "/([A-Z]+)$/", "_$1"))
}
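
For reference, here is the same two-step transformation sketched in Python (just to illustrate the regexes; it is not part of the Terraform code):

import re

def to_snake_case(name):
    # add underscore before a capital letter followed by a lowercase letter
    name = re.sub(r"(\w)([A-Z][a-z])", r"\1_\2", name)
    # add underscore before a series of capital letters at the end of the word
    name = re.sub(r"([A-Z]+)$", r"_\1", name)
    return name.lower()

print(to_snake_case("MyDatabaseName"))  # my_database_name
print(to_snake_case("ABCProject"))      # abc_project
print(to_snake_case("ProjectABC"))      # project_abc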

Friday, June 26, 2020

Work around EC2 Termination Protection in Jenkins Pipeline

Recently, we enabled Termination Protection on the EC2 instances in our AWS cloud. This means it is not possible to terminate an instance accidentally: you have to switch off the protection flag first. In our Terraform files, it is easy to put in place:
resource "aws_instance" "myinstance" {
  ...
  disable_api_termination = true
}
The problem is that when we run our Terraform scripts and a new instance has to be created to replace an old one (for example when we modify the user data), Terraform won't let us: you would have to remove the flag manually. Since we decided that running a Terraform deployment means we really want to be able to terminate the instance, we allowed the Jenkins pipeline to remove the flag before running the deployment scripts.
In order to do this, we wrote a small Groovy script at the beginning of our pipeline:
def removeTerminationProtection(instanceName) {
    echo "Looking for '${instanceName}'"
    def instanceId = sh (script: "aws ec2 describe-instances --region ${AWS_REGION} --filters Name=tag:Name,Values=${instanceName} --query 'Reservations[0].Instances[0].InstanceId' | xargs echo -n", returnStdout: true)
    if ('null'.equals(instanceId)) {
        echo "Instance ${instanceName} not found."
    }
    else {
        echo "Removing termination protection for '${instanceId}'"
        sh (script: "aws ec2 modify-instance-attribute --instance-id ${instanceId} --region ${AWS_REGION} --disable-api-termination Value=false || exit 1")
    }
}
It uses AWS CLI to first find our instance by its name, and then disable termination protection. The way to call it from a pipeline stage is the following:
pipeline {
    ...
    stages {
        ...
        stage('Remove Termination Protection') {
            environment {
                AWS_PROFILE = "${PROFILE_NAME}"
            }

            steps {
                script {
                    removeTerminationProtection("myinstance")
                }
            }
        }
        ...
    }
}
Since we run the AWS CLI, we must have AWS credentials, so we do it using an AWS Profile. We run our Terraform script in a later stage.

Wednesday, April 15, 2020

S3 Multipart Upload of Memory Mapped File in Java

In AWS, when uploading files bigger than 5 GB, you have no choice but to use a Multipart Upload. In the first versions of the AWS SDK for Java, you had a TransferManager class to handle all the low-level bits for you. Unfortunately, I did not find any trace of it in the latest versions.
The low level operations of the Multipart Upload expect you to divide your file into parts, each time providing a buffer to the part you want to upload. The best way to go through a big file without reading it all in memory is to use a Memory Mapped File. Here is my code:
private void uploadMultiPartFile(String bucketName, Path path, String key) throws IOException {
    S3Client s3 = S3Client.builder().region(REGION).build();
 
    CreateMultipartUploadRequest createMultipartUploadRequest = CreateMultipartUploadRequest.builder()
       .bucket(bucketName)
       .key(key)
       .build();

    CreateMultipartUploadResponse response = s3.createMultipartUpload(createMultipartUploadRequest);
    String uploadId = response.uploadId();

    long position = 0;
    long fileSize = Files.size(path);
    int part = 1;
    List<CompletedPart> completedParts = new ArrayList<>();

    try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
        while (position < fileSize) {
            long remaining = fileSize - position;
            int toRead = (int) Math.min(BUFFER_SIZE, remaining);
            MappedByteBuffer map = channel.map(MapMode.READ_ONLY, position, toRead);

            UploadPartRequest uploadPartRequest = UploadPartRequest.builder()
                .bucket(bucketName)
                .key(key)
                .uploadId(uploadId)
                .partNumber(part)
                .build();
            String etag = s3.uploadPart(uploadPartRequest, RequestBody.fromByteBuffer(map)).eTag();
            CompletedPart completed = CompletedPart.builder().partNumber(part).eTag(etag).build();
            completedParts.add(completed);
   
            position += BUFFER_SIZE;
            part++;
        }
    }  
 
    CompletedMultipartUpload completedMultipartUpload = CompletedMultipartUpload.builder().parts(completedParts).build();
    CompleteMultipartUploadRequest completeMultipartUploadRequest = CompleteMultipartUploadRequest.builder()
        .bucket(bucketName)
        .key(key)
        .uploadId(uploadId)
        .multipartUpload(completedMultipartUpload)
        .build();
    s3.completeMultipartUpload(completeMultipartUploadRequest);
}
For my case, I set the BUFFER_SIZE to 100 MB. The AWS API states that a part can be up to 5 GB, but I had the bad surprise of noticing that the buffer size parameter in the Java SDK is an int instead of a long, so you cannot use anything above 2 GB. When I tried to use 2 GB, the process would just hang, so I guess there must be other limitations in the OS.
However, 100 MB is OK for me because it allows me to upload files of up to 1 TB (10,000 parts is the maximum allowed). But you can play with this parameter, since bigger parts mean fewer requests and a bigger maximum file size. Another optimization would be to use a parallel stream to run several uploads in parallel, but I did not try it.
Another piece of advice if you start playing with Multipart Uploads: if your upload fails at some point, the parts already uploaded are still stored in your S3 bucket, but you cannot see them, and you still pay for them. So do not forget to attach a lifecycle policy that aborts incomplete multipart uploads.
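
As an illustration (a Python/boto3 sketch with a placeholder bucket name, since the code above is in Java), such a rule can be put in place like this:

import boto3

# Sketch: abort incomplete multipart uploads after 7 days ("my-bucket" is a placeholder)
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "abort-incomplete-multipart-uploads",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)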

Wednesday, April 8, 2020

Use Terraform to transform a CSV file to JSON documents

I used this trick to ingest items into DynamoDB. Terraform is maybe not the best tool for that, but we already created the DynamoDB table with the tool, and needed to import a couple of items representing metadata at creation. So here it is.
I have a semi-colon separated CSV file, where each line contains data for a JSON document. Here is the Terraform code:

locals {
    content = file("myfile.csv")
    lines = split("\n", local.content)
    json_docs = [for item in local.lines: format(<<EOT
{
    "key1": {
      "S": "%s"
    },
    "key2": {
      "S": "%s"
    },
    "key3": {
      "S": "%s"
    }
}
EOT
    , split(";", item)...)]
}

resource "aws_dynamodb_table_item" "items" {
    count = length(local.json_docs)

    table_name = aws_dynamodb_table.mytable.name
    hash_key   = aws_dynamodb_table.mytable.hash_key

    item = local.json_docs[count.index]
}

I first copy the content of the CSV file into a string, then split it along newline characters (I assume it is a Unix-style file). A loop then transforms each line into a string containing the JSON document, using the format function with the JSON template and the line split along semi-colons as parameters. Notice that to expand the split line, which is a list, into arguments of the format function, I have to use the three periods (...) symbol.
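
For example (a made-up line), an input line such as

value1;value2;value3

would be turned into the following item:

{
    "key1": {
      "S": "value1"
    },
    "key2": {
      "S": "value2"
    },
    "key3": {
      "S": "value3"
    }
}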

Finally, I can import all those documents into my DynamoDB table.

Tuesday, March 31, 2020

Implement a Whitelist in Terraform

It sometimes happens that you need a Terraform variable that can only take values from an accepted list. In my case, it was a list of DNS names that had to be approved by a security team and stored in a file on S3.
The difficulty is to make Terraform fail if you try to use a bad value. There are several ways to do that; here is mine:

data "aws_s3_bucket_object" "white_list" {
  bucket = "my-bucket"
  key    = "my_white_list"
}

locals {
  value_to_check = "SomeValue"

  white_list = split(
    " ",
    replace(data.aws_s3_bucket_object.white_list.body, "/\\s+/", " "),
  )
  allowed = zipmap(local.white_list, local.white_list)[local.value_to_check]
}

The data block fetches my file from S3, but you could also use a simple file() call, or even a hardcoded list.
Then I am setting the value to check, which is hardcoded here for the example, but it will typically be calculated or retrieved from some other place. I then create a Terraform list from the file, by removing any extra space and splitting the lines.
Finally, here is my way of making Terraform fail. I create a map from the white list, using the zipmap function, and get the value from it. If the value is not in the map, Terraform will just stop with an error.
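
The mechanism is easier to see with a small Python analogy (just an illustration, not part of the Terraform code):

# build a dict whose keys are the allowed values, then index it
white_list = ["ValueA", "ValueB", "SomeValue"]
allowed = dict(zip(white_list, white_list))["SomeValue"]   # fine
# indexing with a value that is not in the white list raises a KeyError,
# just like Terraform stops with an error on a missing map key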

Thursday, March 26, 2020

Mock boto3 services not handled by moto

We have a pattern to mock AWS services in our Python lambdas. First, in our lambda code, we initialize boto3 clients with properties:

import os
import boto3

@property
def REGION_NAME():
    return os.getenv("REGION", "eu-west-3")

@property
def SQS():
    return boto3.client(service_name="sqs", region_name=REGION_NAME.fget())

def lambda_handler(event, context):
    sqs_client = SQS.fget()
As you can see, the region is also a property. The reason is that all clients in moto are declared in the us-east-1 region.
Our unit tests are all in a test sub-folder of our lambda. To write the tests, we usually follow this pattern:
from moto import mock_sqs
from pytest import fixture

from ..mylambda import (
    lambda_handler,
    SQS
)

@fixture(autouse=True)
def prepare_test_env(monkeypatch):
    monkeypatch.setenv("REGION", "us-east-1")   

@mock_sqs
def test_mylambda():
    SQS.fget().create_queue(QueueName='MyQueue')
    
    #do the test...
From the lambda in the parent folder, we import the methods to test as well as the properties. For this to work, we have to create an empty __init__.py file in that directory. We then patch the environment variables, including the region, which needs to be set to us-east-1. In the test function, we mock the boto3 client using the appropriate moto decorator, and then write our test code.

In some cases, the boto3 client has no mock in moto. That is our case for Step Functions, for instance (I know it is in preparation, but it was not yet ready at the time of writing). For those cases, we use the following pattern:

import boto3

@property
def SFN():
    return boto3.client("stepfunctions")
In our lambda, nothing changes. We are still using a property. However, in the unit test, we have to use a patch:

from unittest.mock import (
    PropertyMock,
    patch
)
from ..mylambda import lambda_handler

def test_lambda():
    with patch('mylambda.mylambda.SFN', new_callable=PropertyMock) as mock_stepfunctions:
        lambda_handler(event)

        mock_stepfunctions.fget().start_execution.assert_called_with(
            stateMachineArn="MyMachine", 
            name="State-Machine-0", 
            input="{}"
        )
    
We patch the property SFN with a PropertyMock. By giving it a name, we can then use the mock to assert that it was called with the correct parameters.

Tuesday, March 24, 2020

Use Terraform output in Jenkins file

In a Jenkins pipeline file, you might have several Terraform stacks running in different stages. Making them communicate is usually pretty easy, using the data construct or the remote state. However, making a Terraform stage communicate with another stage that runs shell commands, for instance, requires a bit more work. Of course, you have the terraform output command, but there is a small glitch: storing the output in a variable also stores a trailing newline character.
So here is a command that helps work around that problem:

def BUCKET_NAME = ''

pipeline {
    stages {
        stage('Terraform') {
            steps {
                sh "terraform init"
                sh "terraform apply"
                script {
                    BUCKET_NAME = sh (script: 'terraform output bucket_name | xargs echo -n', returnStdout: true)
                }
            }
        }
        stage('Another') {
            steps {
                sh "echo ${BUCKET_NAME}"
            }
        }
    }
}

Sunday, March 22, 2020

Mock SQL connection in Python

I have Python code that uses the pyodbc library to send SQL queries to an MSSQL database. I would like to unit test it, but I do not want to install the pyodbc library on my testing machine. Fortunately, Python allows us to easily mock things, even modules.
The first hurdle is to avoid looking for the pyodbc library in the import statement. For modules, Python has quite a simple mechanism: it stores them in the sys.modules dict the first time it meets an import. The solution is to insert a Mock in this dict before we import the file to test. For instance, we have this line at the beginning of the file we want to test:

import pyodbc
In our test file, we will add these lines:

import sys
from unittest.mock import MagicMock

mock_pyodbc = MagicMock()
sys.modules['pyodbc'] = mock_pyodbc

import module_to_test
We insert a MagicMock into the modules dict. After that, we simply import our own module. As far as Python is concerned, pyodbc is already imported, so there is no need to look for it again.
The second part is to check that the code under test executes the query we expect. The good thing with Mocks is that we do not have to implement any code; we let the mock handle everything. Here is the code we are testing:

connection = pyodbc.connect(
    connection_string, autocommit=True, timeout=ODBC_CONNECTION_TIMEOUT
)
with connection.cursor() as cursor:
    cursor.execute(query)
Every time a method is called, the Mock will generate another Mock object. The reason we chose a MagicMock is that it also handles the magic methods, such as __enter__(), which is called here because of the with construct.
So how do we test that our query is called? Here is the testing line:
# Check that the correct query is being executed
mock_pyodbc.connect().cursor().__enter__().execute.assert_called_with(expected_query)
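
Putting the pieces together, a minimal test file could look like this (a sketch: module_to_test and run_query are placeholder names for the module and function under test):

import sys
from unittest.mock import MagicMock

# insert the mock before importing the module under test
mock_pyodbc = MagicMock()
sys.modules['pyodbc'] = mock_pyodbc

import module_to_test

def test_query_is_executed():
    expected_query = "SELECT 1"                 # placeholder query
    module_to_test.run_query(expected_query)    # placeholder function under test
    # Check that the correct query is being executed
    mock_pyodbc.connect().cursor().__enter__().execute.assert_called_with(expected_query)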


Saturday, March 21, 2020

Send Batch of Messages to SQS

We have a simple piece of code that can send up to several thousand messages to an SQS queue. Using Python and boto3, the code looks like this:

sqs = boto3.resource('sqs')
queue = sqs.get_queue_by_name(QueueName=SQS_QUEUE_NAME)
for message in messages:
    queue.send_message(MessageBody=message)

When you have really a lot of messages in a list, it is possible to send them in batches of up to 10 messages, using the send_messages method of the queue resource (send_message_batch on the low-level client). When doing this, there are two problems to solve: creating the batches of 10 messages, and generating an ID for each message. AWS enforces this ID so it can send back a response listing the messages that failed or succeeded, identified by their ID.

Here is the new code:

sqs = boto3.resource('sqs')
queue = sqs.get_queue_by_name(QueueName=SQS_QUEUE_NAME)

for i in range(0, len(messages), 10):
    chunk = messages[i:i+10]
    queue.send_messages(Entries=[
        {
            "Id": "MSG" + str(id), 
            "MessageBody": message
        } for id, message in zip(range(10), chunk)
    ])

Our loop now jumps ten messages at a time. Inside the loop, we create a chunk of up to 10 messages and use the send_messages method to send it. You can see that we use a list comprehension that runs over both the chunk and a range to generate the IDs.
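
The batch call reports per-message failures in its response instead of raising an exception, so a natural follow-up (a sketch, not part of the original code) is to check for failed entries:

import boto3

sqs = boto3.resource('sqs')
queue = sqs.get_queue_by_name(QueueName=SQS_QUEUE_NAME)

for i in range(0, len(messages), 10):
    chunk = messages[i:i+10]
    response = queue.send_messages(Entries=[
        {
            "Id": "MSG" + str(id),
            "MessageBody": message
        } for id, message in zip(range(10), chunk)
    ])
    # the response lists the messages that could not be sent
    for failure in response.get("Failed", []):
        print("Message", failure["Id"], "failed:", failure.get("Message"))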

Monday, March 9, 2020

Terraform: move resource between state files

When you have several Terraform stacks to handle, you might realize that a resource was created in the wrong stack. The easiest way to move it is usually to remove it from one stack, apply, then add it to the other stack, and apply again. But for some resources, this solution is difficult to implement.
In my case, it was an S3 bucket containing several big files. It would have been a long process to back up the files, delete them from the bucket, then restore them in the destination bucket. So here is a way to move a resource between stacks without destroying it.

First, you have to pull your destination state file locally. Say you want to move your module my_bucket from a stack in folderA to another stack in folderB:

cd folderB
terraform state pull > folderB.state

Second step, you have to move your resource to its new destination:

cd ../folderA
terraform state mv -state-out ../folderB/folderB.state module.my_bucket module.my_bucket

The mv command takes the source and destination name of your resource as parameters, so it is possible to rename your resource as you move it. As the final step, you push your destination state file to its remote location:

cd ../folderB
terraform state push folderB.state