Thursday, April 18, 2024

WTF: E-mail Validation

 E-mail validation is usually a hard task, but in our case, we had a simple regular expression that allowed us to accept a list of e-mails in a known format. Here is the regular expression that you could find in our code:

^$|^\s*[\w+.-]+@[a-zA-Z_-]+?\.[a-zA-Z]{2,3}(?:,[\w+.-]+@[a-zA-Z_-]+?\.[a-zA-Z]{2,3})*\s*$

Many things here, let's decompose:

  • ^$|: we accept an empty string
  • ^\s*: we ignore leading white space characters
  • [\w+.-]+: the user name part of the e-mail. We accept all words characters, plus sign, dot and dash. 
  • @: the at sign
  • [a-zA-Z_-]+?: the domain name, which can have any letter, dash and underscore. Note here the use of the +? pattern, which is very strange. I had to google it, it is the lazy expansion, which means take the minimum number of character needed to fulfill the pattern. Completely useless here since we are looking for a dot character afterward.
  • \. The dot character between the domain name and the extension
  • [a-zA-Z]{2,3}: the extension, which can be 2 or 3 letters (like .fr or .com)
  • (?:, ... )*: we repeat here the whole pattern to say that we can have any number of other e-mails separated by a comma. Note the strange use of the ?: pattern. I had to google that one too. This is the non capturing group, which means that it is a group that you can not retrieve later using group() functions. Useless here since we are not checking for capturing groups.
  • \s*$: we ignore all trailing space characters
A bit complicated, but still ok. But then, somebody complained that it is not supporting e-mails from our Japanese branch, which have extensions in the form of @domain.co.jp. So someone was set to the task, and came up with the following regular expression:
^$|^\s*[\w+.-]+@[a-zA-Z_-]+?\.[a-zA-Z]{2,3}(?:,[\w+.-]+@[a-zA-Z_-]+?\.[a-zA-Z]{2,3}\.[a-zA-Z]{2,3})*\s*$

The only difference with the previous one is that there is a new \.[a-zA-Z]{2,3} added within the parenthesis. Which mean that you can have japanese style e-mails, but only after the first e-mail of the list. Worse, you can only have japanese style e-mails from the second mail onward. I notified the person that commited the code, and he said that he will think about the problem. Of course, code went to prod...

So I decided to make a quick fix. I removed all the strange patterns, and set the following regular expression:

^$|^\s*[\w+.-]+@[a-zA-Z_-]+(\.[a-zA-Z]{2,3}){1,2}(,[\w+.-]+@[a-zA-Z_-]+(\.[a-zA-Z]{2,3}){1,2})*\s*$

The fix was made using the {1,2} pattern to say that we can have one or two extensions. Meanwhile, the guy who made the first change also started to make a fix. Small communication problem here, he didn't noticed that I already assigned the bug to myself. But the funny thing is that he had a fix on a branch that was never merged. It looked like this:

^$|^\s*[\w+.-]+@(?:domain)+?(\.[a-zA-Z]{2,3}|\.[a-zA-Z]{2,3}\.[a-zA-Z]{2,3})(?:,[\w+.-]+@(?:domain)+?(\.[a-zA-Z]{2,3}|\.[a-zA-Z]{2,3}\.[a-zA-Z]{2,3}))(?:,[\w+.-]+@(?:domain)+?(\.[a-zA-Z]{2,3}|\.[a-zA-Z]{2,3}\.[a-zA-Z]{2,3}))*\s*$

I don't even want to know if it is correct...