Real World PHP


007a More Strings: Regular Expressions

"A regular expression is a sequence of characters that specifies a match pattern in text."

-- Wikipedia, Regular expression

As long as there has been text in computers, there has been a need to be able to search through that text. Over time, sophisticated pattern matching has developed. We call this searching via patterns "regular expressions" or regex. It is a well-understood idea in computer science and is a very quick way to search through text.

There are multiple fundamental approaches to regex, but in modern PHP, there is only one we use: Perl-style regular expressions. The Perl programming language popularized this approach, and many other programming languages have adopted it. Thus, the formal name is Perl Compatible Regular Expressions, or PCRE. PHP has a family of related functions for working with regex; they all begin with the prefix "preg_".

Basic Concepts

A regular expression (also called a pattern) is a string that can be used to find a match or matches within another string. Generally speaking, there are a few types of operations used to construct regular expressions:

Boolean "or"

A vertical bar, or "pipe," separates choices. So the pattern gray|grey can match either "gray" or "grey."

Grouping

A group is made with parentheses "()" and is used to treat multiple tokens together. Using groups, we can achieve the same match as the pattern from before: gr(a|e)y will also match either "gray" or "grey."
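
As a quick check of the two patterns above, the following minimal sketch runs both against a few sample words. The word list and variable names are illustrative assumptions, and preg_match (covered later in this lesson) is used only to report whether each pattern matches.

```php
<?php
// Illustrative word list (an assumption, not from the lesson text).
$words = ['gray', 'grey', 'groy'];

$alternation = '/gray|grey/'; // boolean "or" between two whole words
$grouped     = '/gr(a|e)y/';  // the same choice, narrowed to one letter by a group

$results = [];
foreach ($words as $word) {
    // preg_match returns 1 when the pattern matches and 0 when it does not.
    $results[$word] = [
        preg_match($alternation, $word),
        preg_match($grouped, $word),
    ];
}
print_r($results);
```

Both patterns should agree on every word: "gray" and "grey" match, "groy" does not.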

Quantification

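Quantifiers follow an element and specify how many times that element should repeat. Typical quantifiers are ? (zero or one), * (zero or more), + (one or more), {n} (exactly n times), and {min,max} (between min and max times).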
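
The sketch below tries several common PCRE quantifiers on short strings. The patterns, subjects, and array keys are illustrative assumptions; the comments describe what each quantifier does. The ^ and $ anchors, and \d for a digit, are used so that each pattern has to describe the whole string.

```php
<?php
// Each entry is [pattern, subject]. All values are illustrative assumptions.
$cases = [
    'optional'    => ['/^colou?r$/', 'color'], // ?     : zero or one "u"
    'zero_plus'   => ['/^ab*c$/', 'ac'],       // *     : zero or more "b"
    'one_plus'    => ['/^ab+c$/', 'ac'],       // +     : one or more "b" (no match here)
    'exact_count' => ['/^\d{4}$/', '2024'],    // {4}   : exactly four digits
    'range_count' => ['/^a{2,3}$/', 'aaaa'],   // {2,3} : two or three "a" (no match here)
];

$outcome = [];
foreach ($cases as $name => [$pattern, $subject]) {
    // 1 means the pattern matched, 0 means it did not.
    $outcome[$name] = preg_match($pattern, $subject);
}
print_r($outcome);
```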

Wildcard

The wildcard is a period (.) and it matches any single character (by default, any character except a newline). So the pattern a.e matches any three characters that start with "a" and end with "e", such as in the words "save" or "lace". Note that it matches only the "ave" or "ace" part of those words, not the whole word. You might extend it to a pattern such as a.*e, which matches anything that starts with "a" and ends with "e", no matter how long. It would match both "apple" and "abc the quick brown fox jumps ove".
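
A minimal sketch of the wildcard in action, assuming the sample subjects shown (they are illustrative, not from the lesson). It also shows that the wildcard skips newlines unless the s modifier is added.

```php
<?php
// a.e is exactly three characters: "a", any one character, then "e".
preg_match('/a.e/', 'save', $short);

// a.*e is greedy: it starts at the first "a" and extends to the last "e" it can reach.
preg_match('/a.*e/', 'pineapple pie', $long);

// By default the wildcard does not match a newline; the s modifier changes that.
$withoutS = preg_match('/a.*e/', "a\ne");
$withS    = preg_match('/a.*e/s', "a\ne");

print_r($short);
print_r($long);
var_dump($withoutS, $withS);
```

Here $short[0] holds "ave" and $long[0] holds "apple pie", which is only part of each subject.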

Ranges and sets

A set is made with square brackets []. It matches any single character listed within. So the pattern [abc] would match "a" or "b" or "c". It also supports a range: [a-m] matches any letter between "a" and "m", so it would match "a" and "b" and "c" all the way through "l" and "m".
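
To make the difference between a set and a range concrete, here is a small sketch that collects every matching character from one word. The subject "alphabet" is an illustrative assumption, and preg_match_all is explained later in this lesson.

```php
<?php
$subject = 'alphabet'; // illustrative subject

// [abc] matches one character that is "a", "b", or "c".
preg_match_all('/[abc]/', $subject, $set);

// [a-m] matches one character in the range "a" through "m".
preg_match_all('/[a-m]/', $subject, $range);

print_r($set[0]);
print_r($range[0]);
```

The set finds "a", "a", and "b"; the range also picks up "l", "h", and "e" but skips "p" and "t".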

And more...

There are more ways to match, of course: ways to match whitespace, words, and digits, and ways to negate and escape. Everything can be used together to accomplish complex matching.
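
As a hedged illustration of a few of these, the sketch below uses the PCRE shorthand classes \d, \w, and \s, a negated set, and an escaped period. The subject string and variable names are assumptions made for the example.

```php
<?php
$subject = 'Order 66 shipped to 3.14 Main St.'; // illustrative subject

preg_match_all('/\d+/', $subject, $digits);      // \d     : a digit (here, runs of digits)
preg_match_all('/\w+/', $subject, $words);       // \w     : a letter, digit, or underscore
preg_match_all('/\s/', $subject, $spaces);       // \s     : a whitespace character
preg_match_all('/[^a-z\s]/', $subject, $others); // [^...] : any character NOT in the set
preg_match_all('/\./', $subject, $dots);         // \.     : a literal period, not the wildcard

print_r($digits[0]);
print_r($words[0]);
echo count($spaces[0]), " spaces, ", count($dots[0]), " periods\n";
print_r($others[0]);
```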

preg_match and preg_match_all

The most used regex-related functions in PHP are preg_match and preg_match_all. You use them like this:

<?php
// preg_match($pattern, $subject, $matches);
$pattern = '/PHP/';
$subject = 'Real World PHP';
preg_match($pattern, $subject, $matches);
print_r($matches);
/* Output:
Array
(
    [0] => PHP
)
*/

// preg_match stops looking after the first match
// Look what happens when we search for any vowel
$pattern = '/[aeiou]/';
$subject = 'The quick brown fox jumps over the lazy dog.';
preg_match($pattern, $subject, $matches);
print_r($matches);
/* Output:
Array
(
    [0] => e
)
*/

// Now see how preg_match_all works
// preg_match_all($pattern, $subject, $matches);
$pattern = '/[aeiou]/';
$subject = 'The quick brown fox jumps over the lazy dog.';
preg_match_all($pattern, $subject, $matches);
print_r($matches);
/* Output:
Array
(
    [0] => Array
        (
            [0] => e
            [1] => u
            [2] => i
            [3] => o
            [4] => o
            [5] => u
            [6] => o
            [7] => e
            [8] => e
            [9] => a
            [10] => o
        )
)
*/

Note how we didn't declare the $matches variable before we used it on line 5. That is because the preg_match function creates it for us. This technique is called "pass by reference." It is not ideal, but it is one of the quirks of the language.
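
To see the same mechanism outside of the preg_ functions, here is a minimal sketch of a user-defined function that takes a parameter by reference. The function name findFirstVowel and its behavior are hypothetical, written only for this illustration.

```php
<?php
// Hypothetical helper: the & in the signature makes $found a reference,
// so assigning to it inside the function changes the caller's variable.
function findFirstVowel(string $text, &$found): bool
{
    $found = null;
    for ($i = 0; $i < strlen($text); $i++) {
        if (strpos('aeiou', $text[$i]) !== false) {
            $found = $text[$i];
            return true;
        }
    }
    return false;
}

// $vowel is not declared beforehand; PHP creates it when the call is made.
$ok = findFirstVowel('Real World PHP', $vowel);
var_dump($ok, $vowel);
```

The return value reports success, while the actual result arrives through the reference parameter, which is the same split that preg_match uses.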

preg_match and preg_match_all do return something, but not the results. The results come in the $matches parameter only. The return values, according to the documentation:

preg_match() returns 1 if the pattern matches given subject, 0 if it does not, or false on failure.
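
A short sketch of checking those return values, assuming the sample subject and patterns shown. preg_match_all returns the number of full matches rather than just 1 or 0. The deliberately broken pattern is there only to produce the false case, and the @ operator hides the warning PHP emits for it.

```php
<?php
$subject = 'The quick brown fox'; // illustrative subject

$hit   = preg_match('/quick/', $subject);       // 1: the pattern was found
$miss  = preg_match('/slow/', $subject);        // 0: the pattern was not found
$count = preg_match_all('/[aeiou]/', $subject); // number of full matches
$bad   = @preg_match('/[/', $subject);          // false: the pattern itself is invalid

// Compare with === so that 0 (no match) is not confused with false (failure).
if ($miss === 0) {
    echo "no match, but no error\n";
}
if ($bad === false) {
    echo "the pattern could not be compiled\n";
}
var_dump($hit, $miss, $count, $bad);
```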

Examples

Regex is powerful and fast, and programmers find countless uses for it. It is often used to validate user input. Here are some examples.

Email validation

The rules for what counts as a valid email address have changed since email was created. As those rules have grown more complicated, checking whether an address is valid has become harder. A search online might give you patterns like these:

You might be able to get away with any one of these. If you need a definitive way of checking, then you would probably use the more elegant filter_var($email, FILTER_VALIDATE_EMAIL), but it is good to know you can validate email addresses using regex yourself.
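
As a hedged illustration, the sketch below compares a deliberately loose pattern with filter_var. The pattern is an assumption written for this example, not one of the lesson's own patterns, and the sample addresses are made up.

```php
<?php
// Deliberately loose, illustrative pattern (an assumption for this example):
// something, an @, something, a period, something, with no spaces or extra @ signs.
$loosePattern = '/^[^@\s]+@[^@\s]+\.[^@\s]+$/';

$candidates = ['user@example.com', 'no-at-sign.example.com', 'two@@example.com'];

$report = [];
foreach ($candidates as $email) {
    $report[$email] = [
        'regex'      => preg_match($loosePattern, $email) === 1,
        // filter_var returns the address on success and false on failure.
        'filter_var' => filter_var($email, FILTER_VALIDATE_EMAIL) !== false,
    ];
}
print_r($report);
```

A loose pattern like this accepts many strings that are not real addresses, so it works best as a first sanity check.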

Telephone number validation

Telephone numbers vary wildly across the world. However, for a US telephone number, something like these would likely work well:
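
One possible sketch, assuming a ten-digit US number with an optional parenthesized area code and optional dash, period, or space separators. The pattern and the sample numbers are illustrative assumptions, not the lesson's own patterns.

```php
<?php
// Illustrative pattern (an assumption): 3 digits, 3 digits, 4 digits,
// with optional parentheses around the area code and optional separators.
// It is permissive and does not check that parentheses are balanced.
$usPhone = '/^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$/';

$numbers = ['555-123-4567', '(555) 123-4567', '5551234567', '123-45-6789'];

$valid = [];
foreach ($numbers as $number) {
    $valid[$number] = preg_match($usPhone, $number) === 1;
}
print_r($valid);
```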

Zip Codes

For US zip codes:
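
A minimal sketch, assuming the two common US formats: five digits, optionally followed by a dash and four more digits. The pattern and sample values are illustrative assumptions, not the lesson's own patterns.

```php
<?php
// Illustrative pattern (an assumption): 5 digits, then an optional "-" plus 4 digits.
$usZip = '/^\d{5}(-\d{4})?$/';

$codes = ['90210', '90210-1234', '9021', '90210-12'];

$zipOk = [];
foreach ($codes as $code) {
    $zipOk[$code] = preg_match($usZip, $code) === 1;
}
print_r($zipOk);
```

This checks only the shape of the code, not whether the zip code actually exists.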


Resources


Challenges

Regex validation examples

Try running the examples in PHP. Use the preg_match and the preg_match_all functions.

Regex playground

Go to one of the following websites related to regex: regex101.com or regexr.com. Use their interface to try to match the following:

  • Your name from a list of names
  • The letters "P", "H", "p", and "h" from a random paragraph
  • Any word beginning with the letter "r" and ending with the letter "s"