RegEx: Sequence Is Important

Regular Expressions, known as RegEx, are cool: they are extremely useful and let us do beautiful things. Sometimes, however, we underestimate how much the order of the characters in a pattern matters. Sometimes... Okay, let's face it: most of us have never given it a thought.

Example

During a code review of a Java project, assisted by Fortify SCA, a Header Manipulation finding came up: one of the typical problems you get when input data is not sanitized.

The code in question looked very similar to the following:

protected void error(HttpServletRequest request, HttpServletResponse response, Error error) throws ServletException {
  try {
    String errorMessage = error.getMessage();
    log(errorMessage);

    response.setContentType(request.getContentType());
    response.getWriter().print(errorMessage);
  } catch (Exception e) {
    throw new ServletException(e);
  }
}

The problem is that the Content-Type is taken from the request and written into the response without any check on its content, which can be dangerous (I may cover the details in a separate article).

The developer accepted the finding and decided to implement "a particular filter using a RegEx, because it is powerful and customizable".

His solution was therefore to create the following method to sanitize that field:

public static String sanitizeContentType(String input) {
  return input
    .replaceAll("[^a-zA-Z0-9;=-\\\\/]", "")
    .replaceAll("\\s{2,}", " ")
    .replaceAll("\\r", "")
    .replaceAll("\\n", "");
}

In detail:

  • [^a-zA-Z0-9;=-\\/] is meant to match every character other than the semicolon, equals sign, minus sign, slash, backslash, digits, and the letters a to z in both lowercase and uppercase.
  • \s{2,} matches any run of two or more whitespace characters.
  • \r matches the carriage return (as in a typewriter's carriage, as Jessica Fletcher would say).
  • \n matches the newline (line feed) character.
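To see how the four calls interact, here is a minimal sketch that runs the first replacement alone and then the whole chain. The class name StepByStep, the helper names, and the sample input are made up for illustration; only the patterns come from the method above. One thing it suggests: because the negated class does not list any whitespace, the first call already strips spaces, \r and \n, so the three calls after it have nothing left to match.

```java
class StepByStep {
    // Only the first replacement from the developer's method.
    static String firstStepOnly(String input) {
        return input.replaceAll("[^a-zA-Z0-9;=-\\\\/]", "");
    }

    // The full chain, exactly as in the developer's method.
    static String allSteps(String input) {
        return firstStepOnly(input)
            .replaceAll("\\s{2,}", " ")
            .replaceAll("\\r", "")
            .replaceAll("\\n", "");
    }

    public static void main(String[] args) {
        // Hypothetical input with repeated spaces, a CRLF and a colon.
        String input = "text/html;   charset=UTF8\r\nX: 1";

        System.out.println(firstStepOnly(input)); // text/html;charset=UTF8X1
        // The remaining three calls change nothing for this input.
        System.out.println(allSteps(input).equals(firstStepOnly(input))); // true
    }
}
```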

Since I tend not to trust much in general, and above all since I don't see why anyone would rewrite something that several more efficient and mature libraries already do, I decided to run a little test.

Test

As usual, I wrote a small program to run the tests:

public class RegExSanitizer {
  public static void main(String[] args) {
    if (args.length == 0) {
      System.out.println("Usage is: java RegExSanitizer input");
      System.exit(0);
    }

    String input2sanitize = args[0];
    System.out.println("String to sanitize: " + input2sanitize);
    System.out.println("Sanitized string: " + sanitize(input2sanitize));
  }

  // Same chain of replacements as the developer's sanitizeContentType method
  public static String sanitize(String input) {
    return input
      .replaceAll("[^a-zA-Z0-9;=-\\\\/]", "")
      .replaceAll("\\s{2,}", " ")
      .replaceAll("\\r", "")
      .replaceAll("\\n", "");
  }
}

Since this is a function meant to sanitize input, the obvious first test was to pass it a rather strange string, though to be honest not all that strange.

C:\RegExSanitizer> javac RegExSanitizer.java
C:\RegExSanitizer> java RegExSanitizer Bob%%0d%00d%0aa<script>alert('document.domain')</script>
String to sanitize: Bob%%0d%00d%0aa<script>alert('document.domain')</script>
Sanitized string: Bob0d00d0aascript>alertdocumentdomain/script>

The first thing that catches the eye is that the closing angle brackets have not been removed. Not a good start.

I ran some tests with the trusty Regex101, starting from the developer's regex and studying the pattern. The nice thing about Regex101 is that hovering the mouse over the pattern highlights every single token and its meaning, and the EXPLANATION box on the right spells everything out point by point.

And that's exactly how I discovered this:

=-\ a single character in the range between = (index 61) and \ (index 92) (case sensitive)

That is, inside a character class a hyphen placed between two characters defines a range, so the sequence =-\ matches any character between index 61 and index 92 instead of the three literal characters the developer had in mind.

Looking at the ASCII table, there are several characters between index 61 and index 92, including the right angle bracket at index 62 (those who work with XSS have probably already guessed, given how often &#60; and &#62;, the HTML numeric entities for the angle brackets, show up in certain payloads).
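The effect of the accidental range can be listed directly. The sketch below is only an illustration: the class name RangeDemo and the method acceptedPunctuation are hypothetical, and the pattern is the developer's character class without the leading ^, so it matches what the filter lets through. It walks the printable ASCII characters and collects every non-alphanumeric one that the class accepts.

```java
import java.util.regex.Pattern;

class RangeDemo {
    // The developer's class without the negation, written as a Java string literal.
    static final Pattern ALLOWED = Pattern.compile("[a-zA-Z0-9;=-\\\\/]");

    // Collects every printable ASCII character that is neither a letter
    // nor a digit and is still accepted by the class.
    static String acceptedPunctuation() {
        StringBuilder accepted = new StringBuilder();
        for (char c = 33; c < 127; c++) {
            boolean alphanumeric = Character.isLetterOrDigit(c);
            if (!alphanumeric && ALLOWED.matcher(String.valueOf(c)).matches()) {
                accepted.append(c);
            }
        }
        return accepted.toString();
    }

    public static void main(String[] args) {
        // Prints: /;=>?@[\
        System.out.println(acceptedPunctuation());
    }
}
```

Besides the expected slash, semicolon, equals sign and backslash, the list contains >, ?, @ and [, which were never meant to be allowed. The hyphen itself is missing from it: used as a range operator, it is no longer a literal, so the filter strips the very character it was supposed to keep.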

To fix all this, I simply reordered the pattern so that the hyphen comes last, where it is taken literally (shown here as it appears in the Java string literal):

[^\\\\/a-zA-Z0-9;=-]

And they all lived happily ever after.
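As a quick check of the reordering, the sketch below runs the original and the reordered class side by side on the test string used earlier. The class name FixedSanitizerCheck and the helper names are made up for illustration; only the two patterns come from the article.

```java
class FixedSanitizerCheck {
    // Original class: "=-\\" is read as a range from '=' (61) to '\' (92).
    static String original(String input) {
        return input.replaceAll("[^a-zA-Z0-9;=-\\\\/]", "");
    }

    // Reordered class: the trailing '-' is a literal hyphen.
    static String fixed(String input) {
        return input.replaceAll("[^\\\\/a-zA-Z0-9;=-]", "");
    }

    public static void main(String[] args) {
        String payload = "Bob%%0d%00d%0aa<script>alert('document.domain')</script>";

        // Bob0d00d0aascript>alertdocumentdomain/script>
        System.out.println(original(payload));
        // Bob0d00d0aascriptalertdocumentdomain/script
        System.out.println(fixed(payload));

        // The hyphen is now kept as well.
        System.out.println(original("charset=utf-8")); // charset=utf8
        System.out.println(fixed("charset=utf-8"));    // charset=utf-8
    }
}
```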

Conclusion

Regular Expressions remain fantastic things and a world to be discovered. There is no doubt that they are immensely useful, and if I could I would use them even to ask someone the time.

Be that as it may, for sanitizing input there are plenty of libraries that are far more reliable than our own hand-rolled code, and there is probably no need to reinvent the wheel every time.
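Where pulling in a library is not an option, one alternative to stripping characters is an allow-list: accept only values that are known in advance and fall back to a default for everything else. This is a different technique from the regex filter discussed above, not what the developer or any particular library does. The class name ContentTypeAllowList, the method safeContentType, the set of media types and the fallback value are all assumptions made for this sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class ContentTypeAllowList {
    // Hypothetical set of media types the application is willing to echo back.
    static final Set<String> ALLOWED = new HashSet<>(
        Arrays.asList("text/plain", "text/html", "application/json"));

    // Hypothetical default used when the requested value is not recognised.
    static final String FALLBACK = "text/plain";

    static String safeContentType(String requested) {
        if (requested == null) {
            return FALLBACK;
        }
        // Keep only the part before the first ';' (drops parameters such
        // as "charset=utf-8") and normalise the case before the lookup.
        String base = requested.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return ALLOWED.contains(base) ? base : FALLBACK;
    }

    public static void main(String[] args) {
        System.out.println(safeContentType("application/json; charset=utf-8")); // application/json
        System.out.println(safeContentType("text/html\r\nSet-Cookie: x=1"));    // text/plain
    }
}
```

Nothing from the request is copied into the response here: the returned value is always one of the strings defined in the code.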

For heaven's sake, nothing is perfect. Maybe, using one of these libraries, you will still find an input that is not properly sanitized.

And then you deserve applause, because you won.

The awesome image used in this article was created by Oscar Moctezuma.