Readable /Reg(ular )?Ex(pressions)?/ in PHP

Many developers at some point find themselves having to work with regular expressions, also known as RegEx. Some people love working with these quirky little string patterns; others find them intimidating.

One of the issues with regular expressions is that they can look like a foreign language. Encountering them can be like finding 外国語のいくつかの単語 mid-sentence. You can be very familiar with the main code's language, but RegEx is something else. It can pose a bit of a cognitive stumbling block.

Many times, the way we write RegEx could be simplified and made more readable. I'm going to share some tips with you that I have picked up over the years. Many of these are applicable to other coding languages; I am going to focus on PHP here.

An example regular expression

To help demonstrate how we can transform our RegEx, let's consider a relatively simple expression.

/https?:\/\/app\.asana\.com\/0\/(\d+)\/(\d+)(\/f)?/

This pattern matches Asana's task URLs. It's not the most complicated pattern out there; however, it should suffice in illustrating how we can rewrite a pattern to be more readable. You can then use the tips described here for more involved RegEx.

Note: Asana have recently updated their URLs. We'll address this later in this blog post.

Delimiters don't have to be forward-slashes

When writing regular expressions in PHP we have to enclose the pattern in delimiters. The delimiter is a character that occurs at the start and end of the pattern. It can be any non-alphanumeric, non-backslash and non-whitespace character.

Most examples out there use forward slashes (/), but you don't have to. Sometimes, a different character can be more useful. For example, compare the following two regular expressions:

/https:\/\//
~https://~

Both represent the same pattern, but the second expression is a little easier to read.

By not using a forward slash (/) as our delimiter we don't have to escape any forward slashes within the pattern. As a result, we can improve the readability of our pattern.
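As a quick sanity check, here is a minimal sketch that runs both spellings against the same string. The example URL is made up for illustration; the only point is that the two patterns behave identically.

```php
<?php
// Two spellings of the same pattern: only the delimiter differs.
$withSlashes = '/https:\/\//';
$withTildes  = '~https://~';

$url = 'https://example.com/'; // hypothetical input

$slashResult = preg_match($withSlashes, $url, $slashMatches);
$tildeResult = preg_match($withTildes, $url, $tildeMatches);

var_dump($slashResult === $tildeResult);   // both calls agree on whether there is a match
var_dump($slashMatches === $tildeMatches); // and they capture the same text
```

Whichever delimiter you pick, that character still has to be escaped if it appears inside the pattern, so choose one that the pattern does not need.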

Personally, I like to use the following for delimiters depending on the characters needed within the pattern: /, #, | and ~.

With this knowledge, let's rewrite our Asana example using tildes for the delimiters:

~https?://app\.asana\.com/0/(\d+)/(\d+)(/f)?~

`\d` versus `[0-9]`

This next tip is a bit opinionated and you may not agree. Personally, I find the \d notation less readable than using the [0-9] character class. In PHP's default (non-Unicode) mode both match the same characters, but the latter is more clearly defined. \d expects greater familiarity with RegEx syntax, which not everyone reading your code may have.

For me, using [0-9] is more explicit about what we are trying to match. So we will rewrite our Asana example to use the ranged character class instead of \d.

~https?://app\.asana\.com/0/([0-9]+)/([0-9]+)(/f)?~

Later, we will extend this example and need to match just 0 or 1 for part of the URL; we will use [0-1] to do this as \d would be too permissive (it would also match 2, 3, 4, etc.). Therefore, using ranged character classes throughout will make the syntax more consistent. This can help simplify the pattern for readability.
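To make the difference concrete, here is a small sketch comparing the two classes on a single URL segment. The segments and the anchored patterns are assumptions made up for this illustration, not part of the Asana pattern itself.

```php
<?php
// Hypothetical version segments, anchored so the whole string must match.
$strict = '~^/[0-1]$~'; // only "/0" or "/1"
$loose  = '~^/\d$~';    // any single digit

foreach (['/0', '/1', '/2'] as $segment) {
    printf(
        "%s  [0-1]: %s  \\d: %s\n",
        $segment,
        preg_match($strict, $segment) ? 'match' : 'no match',
        preg_match($loose, $segment) ? 'match' : 'no match'
    );
}
```

Only the last segment should be treated differently by the two patterns: the range rejects it, while \d accepts it.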

Know your pattern modifiers

Pattern modifiers are the characters that appear after the closing delimiter. There are many modifiers available to use in PHP.

One of the most commonly used modifiers is i which makes the expression case-insensitive.

I commonly see the i modifier used in examples, and often spot cases where its use is redundant. This suggests that some people don't fully understand what the modifier is for. When modifying a pattern with i we're making the whole expression case-insensitive. For example, take a look at this pattern:

/^[a-zA-Z]+$/i

This pattern checks that a string contains only the letters a to z, in either upper or lower case. The [a-zA-Z] part already spells out both cases, which makes it redundant alongside the i modifier. The pattern could be written more simply as:

/^[a-z]+$/i
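If you want to convince yourself that the two patterns are interchangeable, a short sketch like the following compares them on a few made-up inputs.

```php
<?php
$verbose = '/^[a-zA-Z]+$/i'; // the range and the modifier both cover case
$concise = '/^[a-z]+$/i';    // the modifier alone is enough

foreach (['hello', 'HELLO', 'HeLLo', 'hello1'] as $input) {
    $same = preg_match($verbose, $input) === preg_match($concise, $input);
    echo $input, ': ', $same ? 'same result' : 'different result', "\n";
}
```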

Another modifier I find particularly useful is x. This modifier allows us to use whitespace to break up an expression into more readable chunks. When set, whitespace data characters in the pattern are ignored, unless escaped or inside a character class.

Using the x modifier allows us to rewrite our example expression so that each 'part' is on its own line:

~
	https?://
	app\.asana\.com
	/0
	/([0-9]+)
	/([0-9]+)
	(/f)?
~x
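In PHP code, one way to hold a multi-line pattern like this is a nowdoc string, which keeps backslashes exactly as typed. The following is a minimal sketch rather than the only option (a plain single-quoted string works too), and the URL is the v0 example used later in this post.

```php
<?php
// The nowdoc body starts with the ~ delimiter and ends with ~x,
// so it is a complete pattern including its modifier.
$pattern = <<<'REGEX'
~
    https?://
    app\.asana\.com
    /0
    /([0-9]+)
    /([0-9]+)
    (/f)?
~x
REGEX;

$url = 'https://app.asana.com/0/1206043162733419/12058909747493732';

if (preg_match($pattern, $url, $matches)) {
    echo 'first group: ' . $matches[1] . "\n";
    echo 'second group: ' . $matches[2] . "\n";
}
```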

Annotate your patterns

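With our expression spaced across multiple lines, we can make use of the fact that the x modifier also lets you annotate patterns: an unescaped # outside a character class starts a comment that runs to the end of the line.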

It is often said that 'good code speaks for itself' and therefore doesn't need comments. However, I believe there's a lot of value in annotating RegEx. It can provide context to the different parts of the expression, making it far easier to understand.

~
	https?://
	app\.asana\.com
	/0 # Asana URL version
	/([0-9]+) # Project ID
	/([0-9]+) # Task ID
	(/f)? # Focus mode
~x

Hopefully, you will agree that this expression is much easier to mentally process and understand than our original pattern.
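One thing to watch for with the x modifier is that a literal space or # in the pattern now needs protecting, otherwise it is ignored or treated as the start of a comment. Here is a minimal sketch using a made-up "Task #42" string; the pattern is purely illustrative.

```php
<?php
// Hypothetical pattern for strings such as "Task #42".
$pattern = '~
    Task
    [ ]          # a literal space, kept because it sits in a character class
    \#           # a literal hash, escaped so it does not start a comment
    ([0-9]+)     # task number
~x';

if (preg_match($pattern, 'Task #42', $matches)) {
    echo 'task number: ' . $matches[1] . "\n";
}
```

Also avoid using the delimiter character inside a comment, since PHP will read it as the end of the pattern.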

Use named sub patterns

My final tip is a little different. This one relates more to how we handle the matches generated by the RegEx.

In PHP, we use the preg_* functions to search strings for patterns. Take this simple example of searching a string for a name and number:

$str = 'foobar: 2008';
if (preg_match('/([a-z]+): ([0-9]+)/', $str, $matches)) {
	echo 'name: ' . $matches[1] . "\n";
	echo 'digit: ' . $matches[2] . "\n";
}

The RegEx pattern contains two sub patterns which are returned individually in the $matches array. We reference them by the matched group number.

Personally, I don't like those numeric indices. They make our code less descriptive. They also increase the risk of us using the wrong value, which could introduce a bug to our code. Thankfully, in PHP we can do something about that.

We can name our sub patterns, and then access the matched groups using those names.

$str = 'foobar: 2008';
if (preg_match('/(?<name>[a-z]+): (?<digit>[0-9]+)/', $str, $matches)) {
	echo 'name: ' . $matches['name'] . "\n";
	echo 'digit: ' . $matches['digit'] . "\n";
}

Here, ?<name> and ?<digit> name our sub patterns. We can then reference the captured groups using these names in our $matches array.

With this knowledge, we can now update our Asana RegEx so that our sub patterns for the project ID and task ID are captured with the names projectId and taskId.

~
	https?://
	app\.asana\.com
	/0 # Asana URL version
	/(?<projectId>[0-9]+) # Project ID
	/(?<taskId>[0-9]+) # Task ID
	(/f)? # Focus mode
~x
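Putting the pieces together, a sketch of how this pattern might be used with preg_match looks like the following. The nowdoc string is just one way to hold the pattern, and the URL is the v0 example from later in this post.

```php
<?php
$pattern = <<<'REGEX'
~
    https?://
    app\.asana\.com
    /0                       # Asana URL version
    /(?<projectId>[0-9]+)    # Project ID
    /(?<taskId>[0-9]+)       # Task ID
    (/f)?                    # Focus mode
~x
REGEX;

$url = 'https://app.asana.com/0/1206043162733419/12058909747493732';

if (preg_match($pattern, $url, $matches)) {
    // The named keys sit alongside the usual numeric ones in $matches.
    echo 'project: ' . $matches['projectId'] . "\n";
    echo 'task: ' . $matches['taskId'] . "\n";
}
```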

Updating the RegEx

I mentioned above that Asana have updated their task URLs. They have updated them from:

https://app.asana.com/0/1206043162733419/12058909747493732

To the more human-readable:

https://app.asana.com/1/15793206719/project/1206043162733419/task/12058909747493732

The newer 'v1' format is replacing the older 'v0' format in the browser. However, Asana have committed to supporting both indefinitely. This does mean we need to update our RegEx.

One of the advantages of making our expression more readable is that it is easier to see where we need to modify it.

Accounting for both v0 and v1 formats of the URLs, we can rewrite our expression to become:

~
	https?://
	app\.asana\.com
	/[0-1] # Asana URL version
	(/([0-9]+/project/)?(?<projectId>[0-9]+)) # Project ID
	(/(task/)?(?<taskId>[0-9]+))? # Task ID
	(/f|(\?focus=true))? # Focus mode
~x

The pattern is a little more complex, but hopefully still fairly easy to read.

By persisting with the named sub patterns from before, we shouldn't need to update any further code that makes use of the project and task IDs. Whilst the number of captured groups has changed, and therefore the numeric keys, the named keys haven't.
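To illustrate, here is a sketch that wraps the updated pattern in a small helper and runs it against both URL formats. The function name parseAsanaTaskUrl is hypothetical and not part of any Asana library. It assumes PHP 7.2 or later for the PREG_UNMATCHED_AS_NULL flag, which is used so that a missing task ID comes back as null rather than an empty string.

```php
<?php
// Hypothetical helper: returns the project and task IDs, or null if the
// string contains no Asana task URL.
function parseAsanaTaskUrl(string $url): ?array
{
    $pattern = <<<'REGEX'
~
    https?://
    app\.asana\.com
    /[0-1]                                        # Asana URL version
    (/([0-9]+/project/)?(?<projectId>[0-9]+))     # Project ID
    (/(task/)?(?<taskId>[0-9]+))?                 # Task ID
    (/f|(\?focus=true))?                          # Focus mode
~x
REGEX;

    if (!preg_match($pattern, $url, $matches, PREG_UNMATCHED_AS_NULL)) {
        return null;
    }

    return [
        'projectId' => $matches['projectId'],
        'taskId'    => $matches['taskId'] ?? null, // the task part is optional
    ];
}

$v0 = parseAsanaTaskUrl('https://app.asana.com/0/1206043162733419/12058909747493732');
$v1 = parseAsanaTaskUrl('https://app.asana.com/1/15793206719/project/1206043162733419/task/12058909747493732');

var_dump($v0 === $v1); // both formats should yield the same IDs
```

The calling code only ever reads the named keys, so it would not need to change if the pattern gained or lost unnamed groups again.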

Regular expressions are likely to be a barrier to many people's understanding of a code base. However, by using some of the tips outlined here you should be able to make your patterns easier to interpret. As a result, you will be improving the readability of your code.