A Python Regular Expression Bypass Technique

Functions in Python's re module are sometimes misused by developers, and when that happens it can be possible to bypass weak input validation functions.

One of the most common ways to check user input is to test it against a regular expression. Python's re module provides simple and very powerful functions for checking whether a particular string matches a given regular expression. Sometimes these functions are either misused or not very well understood by developers, and when that happens it can be possible to bypass weak input validation functions.

TL;DR: using Python's re.match() function to validate user input can lead to a bypass, because it only matches at the beginning of the string and not at the beginning of each line, and because "." does not match a newline by default. So, by turning a payload into a multiline value, everything after the first line is ignored by the function. This means that a weak validation function that blocks special characters in a value (for example id=123) could be bypassed with something like id=123\n'+OR+1=1--.

In this article I'll show you an example of bad usage of the re.match() function. From the "search() vs. match()" section of the Python documentation: Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).

For example:

>>> import re
>>> re.match("c", "abcdef")    # No match
>>> re.search("c", "abcdef")   # Match
<re.Match object; span=(2, 3), match='c'>

Regular expressions beginning with '^' can be used with search() to restrict the match at the beginning of the string:

>>> re.match("c", "abcdef")    # No match
>>> re.search("^c", "abcdef")  # No match
>>> re.search("^a", "abcdef")  # Match
<re.Match object; span=(0, 1), match='a'>

As you can see, the first re.match() didn't match because re.match() is implicitly anchored at the start of the string. Anchors do not match any character at all. Instead, they match a position before, after, or between characters. They can be used to "anchor" the regex match at a certain position (https://www.regular-expressions.info/anchors.html).
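The following is a minimal sketch of how that implicit anchoring behaves on a multiline string, together with the detail that matters for the rest of this article: by default "." does not match a newline, so a pattern wrapped in ".*" and passed to re.match() never gets past the first line. The sample string and variable names are made up for illustration.

```python
import re

text = "abc\ndef"

# re.match() only tries position 0, so the "d" on the second line is never found.
match_result = re.match("d", text)

# re.search() tries every position and finds it.
search_result = re.search("d", text)

# Without re.DOTALL, "." stops at "\n": ".*d" cannot reach the second line.
dotstar_result = re.match(".*d", text)

# With re.DOTALL, "." also matches "\n", so the same pattern now matches.
dotall_result = re.match(".*d", text, re.DOTALL)

print(match_result)          # None
print(search_result.span())  # (4, 5)
print(dotstar_result)        # None
print(dotall_result.span())  # (0, 5)
```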

Input Validation using re.match()

Let's say that I've got a Python Flask web application that is vulnerable to SQL injection. If I send an HTTP request to /news with an article id number in the id argument and a category name in the category argument, it returns the content of that article. For example:

from flask import Flask
from flask import request
import re

app = Flask(__name__)

def is_valid_input(input):
    m = re.match(r'.*(["\';=]|select|union|from|where).*', input, re.IGNORECASE)
    if m is not None:
        return False
    return True

@app.route('/news', methods=['GET', 'POST'])
def news():
    if request.method == 'POST':
        if "id" in request.form:
            if "category" in request.form:
                if is_valid_input(request.form["id"]) and is_valid_input(request.form["category"]):
                    return f"OK: {request.form['category']}/{request.form['id']}"
                else:
                    return f"Invalid value: {request.form['category']}/{request.form['id']}", 403
            else:
                return "No category parameter sent."
        else:
            return "No id parameter sent."
    # Any non-POST request falls through to here; answer explicitly instead of returning None.
    return "Send a POST request with id and category.", 405

When I send a request with id=123 and category=financial, the application replies with a "200 OK" status code and "OK: financial/123" as the response body. As I said, the id argument is vulnerable to SQL injection, so the developer has tried to fix it by creating a function that validates the user's input on both arguments (id and category) and rejects characters like single and double quotes or strings like "select" or "union".

As you can see, this webapp checks the user's input with the is_valid_input function at line 7:

def is_valid_input(input):
    m = re.match(r'.*(["\';=]|select|union|from|where).*', input, re.IGNORECASE)
    if m is not None:
        return False
    return True

The code above means: "if the value of any input contains a double quote, a single quote, a semicolon, an equals sign, or any of the strings "select", "union", "from", "where", then discard it". Let's try it:

When I try to inject SQL syntax into the value of the id argument, the webapp returns a 403 Forbidden status with "Invalid value" as the response body. This is thanks to the validation function, which matches invalid characters in my payload such as the single quote and the equals sign.
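The same check can be reproduced without the web server. This is a small sketch that copies the validator from the article (with the parameter renamed to value to avoid shadowing the input builtin) and feeds it a benign value and a single-line injection attempt; the sample payload is an assumption chosen for illustration.

```python
import re

def is_valid_input(value):
    # Same deny-list regex as the webapp: reject if a forbidden token is found.
    m = re.match(r'.*(["\';=]|select|union|from|where).*', value, re.IGNORECASE)
    return m is None

benign_ok = is_valid_input("123")
category_ok = is_valid_input("financial")
injection_ok = is_valid_input("123' OR 1=1--")

print(benign_ok)     # True
print(category_ok)   # True
print(injection_ok)  # False: the quote and "=" are on the first (only) line
```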

Input Validation Bypass

From the re module documentation, about the re.match() function: "... even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line. If you want to locate a match anywhere in string, use search() instead (see also search() vs. match())."

So, to bypass this kind of input validation we just need to convert the SQL injection payload from a single line to multiple lines by adding a \n between the numeric value and the SQL syntax. For example, id=123 followed by a newline and then ' OR 1=1-- (URL-encoded: id=123%0a'+OR+1=1--).

If the question is "can SQL have a newline inside a SELECT?", the answer is yes, it can. The hypothetical SQL syntax becomes something like the following:
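The article does not show the query the application builds, so the sketch below assumes one: a string-concatenated SELECT against a hypothetical news table with an id column. Those names, and the concatenation itself, are assumptions for illustration only; the point is just that the newline ends up inside the quoted value and the rest is parsed as SQL.

```python
# Multiline payload: numeric id, a newline, then the injected SQL.
payload = "123\n' OR 1=1--"

# Hypothetical vulnerable query (table and column names are assumptions).
query = "SELECT * FROM news WHERE id = '" + payload + "'"

print(query)
# SELECT * FROM news WHERE id = '123
# ' OR 1=1--'
```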

Let's do it on the vulnerable webapp:

As shown in the screenshot, I just put a \n (not CRLF \r\n) after the id value and then started my SQL injection. The validation function only validates the first line, so I bypassed it.

Using curl:

curl -s -d "id=123%0a'+OR+1=1--&category=test" 'http://localhost:5000/news'
OK: test/123%
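To see why this request passes, the form body can be decoded and checked locally. The sketch below uses urllib.parse.parse_qs from the standard library as a stand-in for the framework's form parsing (an assumption: Flask's own decoding is not called here) and then runs the article's validator on the decoded values.

```python
import re
from urllib.parse import parse_qs

def is_valid_input(value):
    # Same deny-list regex as the webapp (parameter renamed from "input").
    m = re.match(r'.*(["\';=]|select|union|from|where).*', value, re.IGNORECASE)
    return m is None

body = "id=123%0a'+OR+1=1--&category=test"

# parse_qs returns a list per field; keep the first value of each.
fields = {name: values[0] for name, values in parse_qs(body).items()}

# %0a decodes to "\n" and "+" to a space, so the quote and "=" sit on line two.
print(fields["id"].splitlines())           # ['123', "' OR 1=1--"]
print(is_valid_input(fields["id"]))        # True: the validator is bypassed
print(is_valid_input(fields["category"]))  # True
```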

Run it in your Lab

First, download the vulnerable flask webapp source code from here:

Then start the Flask web server with:

flask run

Remediation

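The first option is to do positive validation instead of negative validation. Don't create a deny-list of "not allowed words" or "not allowed characters"; check for the expected value format instead. For example, id=123 can be validated with ^[0-9]+$. Note that in Python $ also matches just before a trailing newline, so re.fullmatch() is the stricter choice.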
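A minimal sketch of the allow-list approach for the id parameter follows. The helper name is_valid_id is hypothetical, and re.fullmatch() is used in place of the ^...$ form so that the whole string, including any trailing newline, has to match.

```python
import re

def is_valid_id(value):
    # Hypothetical helper: accept one or more ASCII digits and nothing else.
    return re.fullmatch(r'[0-9]+', value) is not None

print(is_valid_id("123"))              # True
print(is_valid_id("123\n' OR 1=1--"))  # False
print(is_valid_id("123\n"))            # False

# For comparison: with re.match(), "$" accepts a trailing newline.
trailing_newline_passes = re.match(r'^[0-9]+$', "123\n") is not None
print(trailing_newline_passes)         # True
```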

The second option is to use re.search() instead of re.match(), so that the check covers the whole value and not just the first line.
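As a sketch of that change, the validator below keeps the article's deny-list but applies it with search(), which makes the surrounding ".*" unnecessary. The names DENY_PATTERN and is_valid_input_search are hypothetical. This only closes the newline bypass; a deny-list remains a fragile form of validation.

```python
import re

# Same forbidden tokens as the article's validator, without the ".*" wrappers.
DENY_PATTERN = re.compile(r'["\';=]|select|union|from|where', re.IGNORECASE)

def is_valid_input_search(value):
    # search() scans every position, so tokens after a newline are found too.
    return DENY_PATTERN.search(value) is None

print(is_valid_input_search("123"))              # True
print(is_valid_input_search("123\n' OR 1=1--"))  # False
```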

The third option: don't create your own input validation function, but try to find a widely used and maintained library that does it for you.

Follow

If you liked this post, follow me on Twitter to keep in touch! https://twitter.com/AndreaTheMiddle

The awesome image used in this article was created by Ankur Patar.