Scratch Where It's Itching

Monday, April 21, 2025

EDT Freeze Detector

This article was originally posted on JRoller on April 2, 2013

It is always the same scenario: the support team contacts us about a "frozen GUI". They send us the logs so that we can investigate, but of course the logs do not show anything. The user has already restarted the GUI, so by the time we ask support to run jstack, it is too late. Quite often, the cause is not even a deadlock, but a long operation that should not run on the EDT. Since the logs are the only thing we have to investigate, I decided to write an EDT Freeze Detector that logs any operation monopolizing the EDT for more than 10 seconds. Here is the code:

import java.awt.AWTEvent;
import java.awt.EventQueue;
import java.awt.Toolkit;
import java.util.Timer;
import java.util.TimerTask;

public class FreezeDetector extends EventQueue {
    private static final long FREEZE_TIMER_PERIOD = 10000L;

    private volatile AWTEvent currentEvent;
    private volatile Thread eventDispatchThread;

    private FreezeDetector() {
        Timer timer = new Timer("Freeze Detector", true);
        timer.schedule(new FreezeTimerTask(), FREEZE_TIMER_PERIOD, FREEZE_TIMER_PERIOD);
    }

    public static void installFreezeDetector() {
        Toolkit.getDefaultToolkit().getSystemEventQueue().push(new FreezeDetector());
    }

    @Override
    protected void dispatchEvent(AWTEvent event) {
        eventDispatchThread = Thread.currentThread();
        currentEvent = event;

        try {
            super.dispatchEvent(event);
        }
        finally {
            currentEvent = null;
        }
    }

    private class FreezeTimerTask extends TimerTask {
        private AWTEvent lastEvent;

        @Override
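        public void run() {
            // Read the volatile field once so the check and the assignment see the same event.
            AWTEvent current = currentEvent;

            // Same event as on the previous tick: it has been running for at least one full period.
            if (lastEvent != null && lastEvent == current) {
                printStack();
            }

            lastEvent = current;
        }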

        private void printStack() {
            StackTraceElement[] stackTrace = eventDispatchThread.getStackTrace();
            StringBuilder sb = new StringBuilder();
            sb.append("Freeze detected on EDT:\n");

            for (StackTraceElement stackElement : stackTrace) {
                sb.append(stackElement.toString()).append('\n');
            }

            //use your favorite logger
            System.out.println(sb);
        }
    }
}

Wednesday, October 2, 2024

AWS: Step functions can keep only Lambda payloads

When running an AWS Lambda in synchronous mode from a Step Function, you get the Lambda's return value in the Payload part of the output. Unfortunately, you also get some mostly useless information:

{
    "ExecutedVersion": "$LATEST",
    "Payload": {
        "result": "value"
    },
    "SdkHttpMetadata": {},
    "HttpHeaders": {}
}

When browsing your Step Function's output, that's a lot of noise. To keep only the Lambda's payload, you can add this line to your task definition:

    "ResultSelector": { "Payload.$": "$.Payload" }

Keep only the good stuff!
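As a rough illustration, here is a Python sketch of where that line sits in a Task state and what it does to the raw result. The function name my-function and the apply_result_selector helper are made up for this example. The helper is only a toy model that handles top-level "$.Key" paths, not the real Step Functions path evaluation.

```python
import json

# Hypothetical Task state: "my-function" is a placeholder name.
invoke_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {
        "FunctionName": "my-function",
        "Payload.$": "$",
    },
    "ResultSelector": {"Payload.$": "$.Payload"},
    "End": True,
}


def apply_result_selector(selector, raw_result):
    """Toy model of ResultSelector, limited to top-level "$.Key" paths."""
    selected = {}
    for key, value in selector.items():
        if key.endswith(".$"):
            # "Payload.$": "$.Payload" -> copy raw_result["Payload"] to "Payload"
            selected[key[:-2]] = raw_result[value[2:]]
        else:
            # Keys without the ".$" suffix are static values
            selected[key] = value
    return selected


# Raw result of the Lambda invocation, as shown in the article
raw_result = {
    "ExecutedVersion": "$LATEST",
    "Payload": {"result": "value"},
    "SdkHttpMetadata": {},
    "HttpHeaders": {},
}

filtered = apply_result_selector(invoke_state["ResultSelector"], raw_result)
print(json.dumps(filtered))
```

Only the Payload key survives the selection, and the metadata keys are dropped.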

Saturday, July 13, 2024

AWS: Physical Resource ID in Custom Resources

Originally, Custom Resources in CloudFormation were designed to wrap, in a Lambda, AWS resources that are not yet supported by the CloudFormation service. However, we often use them for other purposes:

Retrieve information from other resources (like data sources in Terraform)
Trigger some actions
Implement some logic

In most of those cases, we do not care about resource deletion, yet we are often surprised by calls from CloudFormation to delete the resource. The reason is a misunderstanding or a misuse of the Physical Resource ID. And the origin of this, for me, is a bad design choice on AWS's part.

Let's have a look at the way the cfnresponse module is written.

def send(event, context, responseStatus, responseData, physicalResourceId=None, noEcho=False, reason=None):
    responseUrl = event['ResponseURL']
    responseBody = {
        'Status' : responseStatus,
        'Reason' : reason or "See the details in CloudWatch Log Stream: {}".format(context.log_stream_name),
        'PhysicalResourceId' : physicalResourceId or context.log_stream_name,
        'StackId' : event['StackId'],
        'RequestId' : event['RequestId'],
        'LogicalResourceId' : event['LogicalResourceId'],
        'NoEcho' : noEcho,
        'Data' : responseData
    }

There are 2 bad choices:

The Physical Resource ID parameter is optional. It makes you think that it is not important, and that if you don't set it, some default behavior will handle it correctly for you.
The default value is random. Even worse, it is not consistently random: it is set to the log stream name, which changes on each Lambda cold start.

That means that most of the time, your Physical Resource ID will change on each call, unless you trigger it several times in a row. CloudFormation treats an update that returns a different Physical Resource ID as a replacement, and this is what triggers the call to the delete part of your Lambda for the old ID.

You can imagine that your Resource behaves like an EC2 instance. If you modify a tag, your instance keeps its ID. But if you change its VPC, a new instance is created, with a new ID. In that case, the old instance must be deleted. You can consider the Physical Resource ID to be like the instance ID. You want to decide, based on which parameter was modified, whether the old Resource must be kept or deleted.

Which means that in most cases, you do not want your Physical Resource ID to change. So the default behavior is wrong. It leads to:

Your Lambda being called for deletion for no reason.
Accidental calls that you don't expect. We had a case where an S3 bucket was deleted in production because someone added a parameter to the Custom Resource.
Useless code written to guard against those unexpected calls, like checking that your CloudFormation stack is really deleting the Resource.

So the correct behavior is to always set the Physical Resource ID. And usually to a constant value:

cfnresponse.send(event, context, cfnresponse.SUCCESS, responseData, "ConstantPhysicalID")

What about the legacy code? Those old Custom Resources that already have a Physical Resource ID set to the log stream name? The good thing is that the previously set Physical Resource ID is sent to the Lambda in the event parameter. So you can simply set it back to its previous value:

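# Create requests carry no PhysicalResourceId yet, so fall back to the constant
physicalId = event.get("PhysicalResourceId", "ConstantPhysicalID")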

cfnresponse.send(event, context, cfnresponse.SUCCESS, responseData, physicalId)
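When a constant is not enough, the EC2 analogy above can be sketched as follows: derive the ID only from the properties whose change really means "new resource". This is a minimal sketch, not part of cfnresponse. The physical_id_for helper, the BucketName and Tags properties and the sample events are all hypothetical.

```python
def physical_id_for(event, identity_keys=("BucketName",)):
    """Build the Physical Resource ID from the properties that define identity.

    identity_keys is a hypothetical list of properties whose change should
    replace the resource. Any other property change keeps the same ID.
    """
    if event["RequestType"] == "Delete":
        # Always answer a Delete with the ID CloudFormation asked us to delete
        return event["PhysicalResourceId"]

    props = event.get("ResourceProperties", {})
    identity = "/".join(str(props.get(key, "")) for key in identity_keys)
    return "{}/{}".format(event["LogicalResourceId"], identity)


# Hypothetical events, reduced to the fields used here
create = {
    "RequestType": "Create",
    "LogicalResourceId": "MyResource",
    "ResourceProperties": {"BucketName": "a", "Tags": "v1"},
}
tag_update = {
    "RequestType": "Update",
    "LogicalResourceId": "MyResource",
    "PhysicalResourceId": "MyResource/a",
    "ResourceProperties": {"BucketName": "a", "Tags": "v2"},
}
bucket_update = {
    "RequestType": "Update",
    "LogicalResourceId": "MyResource",
    "PhysicalResourceId": "MyResource/a",
    "ResourceProperties": {"BucketName": "b", "Tags": "v2"},
}

print(physical_id_for(create))
print(physical_id_for(tag_update))
print(physical_id_for(bucket_update))
```

The tag update keeps the ID, so no deletion is requested. Only the bucket name change produces a new ID, and in that case the Delete call for the old ID is exactly what you want. The returned value is then passed as the last argument of cfnresponse.send, as in the calls above.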

Wednesday, July 10, 2024

AWS: Why does the Serverless Macro for CloudFormation always package?

I use the Serverless macro quite a lot in my CloudFormation templates. It is very practical: you point a Lambda's code to a local folder, then CloudFormation packages the whole content into a Zip file and uploads it to an S3 bucket.

However, I sometimes use the Serverless macro for other features, such as generating the Event Rule that triggers my Lambda. In some cases, my Lambda code is already packaged in a container on ECR, or even inlined. In those cases, I don't need any packaging.

What I noticed is that CloudFormation still packages something. I downloaded the packaged Zip and checked its content: it contained the complete folder in which the CloudFormation template is located. For one template that was stored at the root of my source code, it packaged the complete application!

Does anyone know why that is? Is there a reason for packaging when a Lambda is only inlined? Is there a way to tell CloudFormation to avoid packaging? Is it a bug?

Thursday, April 18, 2024

WTF: E-mail Validation

E-mail validation is usually a hard task, but in our case, we had a simple regular expression that allowed us to accept a list of e-mails in a known format. Here is the regular expression that you could find in our code:

^$|^\s*[\w+.-]+@[a-zA-Z_-]+?\.[a-zA-Z]{2,3}(?:,[\w+.-]+@[a-zA-Z_-]+?\.[a-zA-Z]{2,3})*\s*$

Many things here, let's decompose:

^$|: we accept an empty string
^\s*: we ignore leading white space characters
[\w+.-]+: the user name part of the e-mail. We accept all word characters, plus sign, dot and dash.
@: the at sign
[a-zA-Z_-]+?: the domain name, which can contain any letter, dash and underscore. Note here the use of the +? pattern, which is very strange. I had to google it: it is the lazy quantifier, which means take the minimum number of characters needed to fulfill the pattern. Completely useless here since we are looking for a dot character afterward.
\.: the dot character between the domain name and the extension
[a-zA-Z]{2,3}: the extension, which can be 2 or 3 letters (like .fr or .com)
(?:, ... )*: we repeat here the whole pattern to say that we can have any number of other e-mails separated by a comma. Note the strange use of the ?: pattern. I had to google that one too. This is the non-capturing group, which means that it is a group that you cannot retrieve later using group() functions. Useless here since we are not checking for capturing groups.
\s*$: we ignore all trailing space characters

A bit complicated, but still ok. But then, somebody complained that it did not support e-mails from our Japanese branch, whose addresses end in @domain.co.jp. So someone was assigned to the task, and came up with the following regular expression:

^$|^\s*[\w+.-]+@[a-zA-Z_-]+?\.[a-zA-Z]{2,3}(?:,[\w+.-]+@[a-zA-Z_-]+?\.[a-zA-Z]{2,3}\.[a-zA-Z]{2,3})*\s*$

The only difference with the previous one is that there is a new \.[a-zA-Z]{2,3} added within the parentheses. Which means that you can have Japanese-style e-mails, but only after the first e-mail of the list. Worse, from the second e-mail onward, only Japanese-style e-mails are accepted. I notified the person who committed the code, and he said that he would think about the problem. Of course, the code went to prod...
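Both problems can be reproduced with a few sample lists. The sketch below uses Python's re module and made-up addresses on example.com and example.co.jp. It assumes the expressions behave the same way as in the regex engine of the original application.

```python
import re

# The two expressions, copied from the article
ORIGINAL = re.compile(
    r"^$|^\s*[\w+.-]+@[a-zA-Z_-]+?\.[a-zA-Z]{2,3}"
    r"(?:,[\w+.-]+@[a-zA-Z_-]+?\.[a-zA-Z]{2,3})*\s*$"
)
JAPAN_ATTEMPT = re.compile(
    r"^$|^\s*[\w+.-]+@[a-zA-Z_-]+?\.[a-zA-Z]{2,3}"
    r"(?:,[\w+.-]+@[a-zA-Z_-]+?\.[a-zA-Z]{2,3}\.[a-zA-Z]{2,3})*\s*$"
)


def accepts(pattern, text):
    """True when the whole text is accepted by the pattern."""
    return pattern.match(text) is not None


# Hypothetical sample lists
samples = [
    "john@example.com",
    "taro@example.co.jp",
    "john@example.com,jane@example.com",
    "john@example.com,taro@example.co.jp",
]

for sample in samples:
    print(sample, accepts(ORIGINAL, sample), accepts(JAPAN_ATTEMPT, sample))
```

With the new expression, a Japanese-style address alone is still rejected, and a plain list of two .com addresses, which used to be accepted, is now rejected as well.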

So I decided to make a quick fix. I removed all the strange patterns, and set the following regular expression:

^$|^\s*[\w+.-]+@[a-zA-Z_-]+(\.[a-zA-Z]{2,3}){1,2}(,[\w+.-]+@[a-zA-Z_-]+(\.[a-zA-Z]{2,3}){1,2})*\s*$

The fix was made using the {1,2} pattern to say that we can have one or two extensions. Meanwhile, the guy who made the first change also started to work on a fix. Small communication problem here: he didn't notice that I had already assigned the bug to myself. But the funny thing is that he had a fix on a branch that was never merged. It looked like this:

^$|^\s*[\w+.-]+@(?:domain)+?(\.[a-zA-Z]{2,3}|\.[a-zA-Z]{2,3}\.[a-zA-Z]{2,3})(?:,[\w+.-]+@(?:domain)+?(\.[a-zA-Z]{2,3}|\.[a-zA-Z]{2,3}\.[a-zA-Z]{2,3}))(?:,[\w+.-]+@(?:domain)+?(\.[a-zA-Z]{2,3}|\.[a-zA-Z]{2,3}\.[a-zA-Z]{2,3}))*\s*$

I don't even want to know if it is correct...
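For the curious, a few sample lists are enough to check both candidates. This is only a sketch with Python's re module and made-up addresses, assuming the expressions behave the same as in the original regex engine. Both expressions are rebuilt from their repeated parts, and the branch version is taken literally as printed above, with "domain" as the literal domain name.

```python
import re

# Quick fix: one address with one or two extensions, repeated after commas
FIXED_ADDR = r"[\w+.-]+@[a-zA-Z_-]+(\.[a-zA-Z]{2,3}){1,2}"
FIXED = re.compile(r"^$|^\s*" + FIXED_ADDR + "(," + FIXED_ADDR + r")*\s*$")

# Branch version: one address, one mandatory second address, then any number
BRANCH_EXT = r"(\.[a-zA-Z]{2,3}|\.[a-zA-Z]{2,3}\.[a-zA-Z]{2,3})"
BRANCH_ADDR = r"[\w+.-]+@(?:domain)+?" + BRANCH_EXT
BRANCH = re.compile(
    r"^$|^\s*" + BRANCH_ADDR
    + "(?:," + BRANCH_ADDR + ")"
    + "(?:," + BRANCH_ADDR + r")*\s*$"
)


def accepts(pattern, text):
    """True when the whole text is accepted by the pattern."""
    return pattern.match(text) is not None


# Hypothetical sample lists
print(accepts(FIXED, "taro@example.co.jp"))
print(accepts(FIXED, "taro@example.co.jp,john@example.com"))
print(accepts(BRANCH, "john@domain.com"))
print(accepts(BRANCH, "john@domain.com,taro@domain.co.jp"))
```

On these samples, the {1,2} version accepts Japanese-style addresses in any position. The branch version rejects a list made of a single e-mail, because its first comma group has no quantifier and is therefore mandatory.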

Friday, March 15, 2024

AWS: Find Root Cause of Failure for CloudFormation Stacks

When a CloudFormation stack fails, you have to scroll back through the events to find the root cause of the failure. Recently, AWS even added a "Detect Root Cause" button to the Console to immediately scroll to the correct event. But how do you do it from a Python script?

import boto3

def find_root_cause(stack_name):
    cf_client = boto3.client('cloudformation')

    next_values = "First Time"
    params = {
        "StackName": stack_name
    }
    root_cause = None

    while next_values:
        result = cf_client.describe_stack_events(**params)

        next_values = result.get("NextToken")
        params["NextToken"] = next_values

        for event in result["StackEvents"]:
            status = event.get("ResourceStatus", "")
            reason = event.get("ResourceStatusReason")

            # start of deployment
            if reason == "User Initiated":
                return root_cause
            
            if reason and "FAILED" in status:
                root_cause = reason

    return root_cause

You follow the same pattern as in the Console: you walk back through the event history, until you reach the oldest error message before the start of the deployment.
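To try the logic without calling AWS, the same walk can be run on a hand-written list of events. The sketch below is an offline variant of the loop above. The root_cause_from_events helper and the sample events are made up for the illustration, and the events are listed newest first, as describe_stack_events returns them.

```python
def root_cause_from_events(events):
    """Return the oldest failure reason of the latest deployment.

    events must be ordered newest first.
    """
    root_cause = None
    for event in events:
        status = event.get("ResourceStatus", "")
        reason = event.get("ResourceStatusReason")

        # Start of the deployment: anything older belongs to a previous one
        if reason == "User Initiated":
            break

        if reason and "FAILED" in status:
            root_cause = reason

    return root_cause


# Hypothetical event history, newest first
sample_events = [
    {"ResourceStatus": "UPDATE_ROLLBACK_COMPLETE"},
    {"ResourceStatus": "UPDATE_FAILED", "ResourceStatusReason": "Resource update cancelled"},
    {"ResourceStatus": "UPDATE_FAILED", "ResourceStatusReason": "my-bucket already exists"},
    {"ResourceStatus": "UPDATE_IN_PROGRESS", "ResourceStatusReason": "User Initiated"},
    {"ResourceStatus": "CREATE_FAILED", "ResourceStatusReason": "Failure of an older deployment"},
]

print(root_cause_from_events(sample_events))
```

The "Resource update cancelled" message is only a consequence of the first failure, so it is overwritten by the older "my-bucket already exists" reason, and the failure of the previous deployment is never reached.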

Sunday, March 3, 2024

JFileChooser and the Lost Folder Selection

This article was originally posted on JRoller on July 7, 2005

It might sound like an Indiana Jones movie title, but it is an interesting problem we came across. We have a third-party product which at some point displays a JFileChooser, in which you must select a directory. In old Java 1.4, this dialog box was working properly. Now that we have switched to the brand new 5.0, when we select a folder and click Open, it does not return the folder as the selected value, but instead goes into the folder. The main difference in behavior is that, before, when we selected a folder, its name was visible in the selected file text field, and now it is not.

The colleague who had to solve the problem tried to execute the program by copying the 1.4 version of JFileChooser into the bootclasspath. It did not help, so I suggested that he try with the UI class instead. And oh surprise, it worked as in the old days. So he started to compare the source code of both versions, and in the ListSelectionListener, he found an interesting difference. A property which was always true before is now set to false by default. So to solve the problem, he inserted the following line in the main method:

UIManager.put("FileChooser.usesSingleFilePane", Boolean.TRUE);

I wonder if these properties are documented somewhere. There seem to be so many of them...

I checked in my more recent version of Java. This parameter still exists, and still does not seem to be documented.