007a More Strings: Regular Expressions
"A regular expression is a sequence of characters that specifies a match pattern in text."
-- Wikipedia, Regular expression
As long as there has been text in computers, there has been a need to be able to search through that text. Over time, sophisticated pattern matching has developed. We call this searching via patterns "regular expressions" or regex. It is a well-understood idea in computer science and is a very quick way to search through text.
There are multiple fundamental approaches to regex, but in modern PHP, there is only one we use: Perl-style regular expressions. The Perl programming language developed this approach and it is popular among many other programming languages. Thus, the formal name is Perl Compatible Regular Expressions, or PCRE. PHP provides a family of related functions for working with regex, and they all begin with the prefix "preg_".
Basic Concepts
A regular expression (also called a pattern) is a string that can be used to find a match or matches within another string. Generally speaking, there are a few types of operations used to construct regular expressions:
Boolean "or"
A vertical bar or "pipe" separates choices. So the pattern gray|grey can match either "gray" or "grey."
Grouping
A group is made with parentheses "()" and is used to treat multiple tokens as a single unit. Using groups, we can achieve the same match as the pattern from before: gr(a|e)y will also match either "gray" or "grey."
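As a minimal sketch of the two patterns above, the snippet below runs both against a few words using preg_match, which is covered later in this section. The ^ and $ anchors and the sample word "groy" are additions for illustration; they are not part of the patterns as written above.

```php
<?php
// Try three candidate words against both patterns.
$words = ['gray', 'grey', 'groy'];
$results = [];

foreach ($words as $word) {
    // ^ and $ anchor the pattern to the whole string (an addition for this demo).
    // The outer parentheses keep the anchors from binding to one side of the pipe.
    $alternation = preg_match('/^(gray|grey)$/', $word);
    $grouping = preg_match('/^gr(a|e)y$/', $word);

    $results[$word] = [$alternation, $grouping];
    echo $word, ': ', $alternation, ' ', $grouping, PHP_EOL;
}
```

Both patterns should agree on every word: 1 for "gray" and "grey", 0 for "groy".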
Quantification
Quantifiers follow an element and specify how many times that element should repeat. Typical quantifiers are:
? - indicates 0 or 1 occurrences of the element. So patterns? matches both "pattern" and "patterns".
* - indicates 0 or more occurrences of the element. So (th-)*anks matches "th-anks" and "th-th-anks" and "th-th-th-th-th-anks" and so on.
+ - indicates 1 or more occurrences of the element. So PH+P does NOT match "PP" but it does match "PHP" and "PHHP" and "PHHHHHP" and so on.
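The three quantifiers can be checked side by side. This is a sketch only: the ^ and $ anchors are added so that each test string must match in full, and the extra test strings ("patternss", "anks", "PHHHP") are illustrative choices rather than examples from the text.

```php
<?php
// Map each pattern to the subjects we want to try against it.
$cases = [
    '/^patterns?$/'  => ['pattern', 'patterns', 'patternss'],
    '/^(th-)*anks$/' => ['anks', 'th-anks', 'th-th-anks'],
    '/^PH+P$/'       => ['PP', 'PHP', 'PHHHP'],
];

$results = [];
foreach ($cases as $pattern => $subjects) {
    foreach ($subjects as $subject) {
        // preg_match returns 1 on a match, so compare strictly against 1.
        $matched = preg_match($pattern, $subject) === 1;
        $results[$pattern][$subject] = $matched;
        echo $pattern, ' vs ', $subject, ': ', $matched ? 'match' : 'no match', PHP_EOL;
    }
}
```

Note that (th-)*anks also accepts plain "anks", because * allows zero repetitions of the group.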
Wildcard
The wildcard is a period (.) and it will match any single character (by default, any character except a newline). So the pattern a.e will match any three characters that start with "a" and end with "e", such as in the word "save" or "lace", but note that it will only match the "ave" or "ace" part of those words, not the whole word. You might extend it to a pattern such as a.*e which would match anything that starts with "a" and ends with "e" no matter how long. It would match both "apple" and "abc th quick brown fox jumps ove".
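To see exactly which part of a word the wildcard captures, the sketch below prints the matched text for each case. The variable names and the "apple pie" subject are illustrative assumptions.

```php
<?php
// a.e matches exactly three characters: "a", any one character, then "e".
preg_match('/a.e/', 'save', $save);
preg_match('/a.e/', 'lace', $lace);

// a.*e is greedy: it stretches from the first "a" to the last "e" it can reach.
preg_match('/a.*e/', 'apple pie', $long);

echo $save[0], PHP_EOL; // ave
echo $lace[0], PHP_EOL; // ace
echo $long[0], PHP_EOL; // apple pie
```

The last result shows that .* takes as much text as it can, which is worth keeping in mind when a subject contains more than one possible ending.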
Ranges and sets
A set is made with square brackets []. It will match any single character listed within. So the pattern [abc] would match against "a" and "b" and "c". It also supports a range: [a-m] matches any lowercase letter between "a" and "m", so it would match "a" and "b" and "c" all the way through "l" and "m".
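A short sketch of sets and ranges, using preg_match_all (described later in this section) to collect every matching character. The subject strings are made up for illustration.

```php
<?php
// Collect every character that belongs to the set [abc].
preg_match_all('/[abc]/', 'a cab ride', $set);

// Collect every character that falls in the range a through m.
preg_match_all('/[a-m]/', 'lemon', $range);

echo implode(',', $set[0]), PHP_EOL;   // a,c,a,b
echo implode(',', $range[0]), PHP_EOL; // l,e,m
```

In "lemon", the letters "o" and "n" come after "m" in the alphabet, so the range skips them.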
And more...
There are more ways to match, of course. There are ways to match against whitespace, words, and digits. There are ways to negate and escape. And everything can be used together to accomplish complex matching.
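The text does not name the specific tokens, so the following is a sketch using standard PCRE shorthand that PHP supports: \d for a digit, \w for a word character, \s for whitespace, [^...] to negate a set, and a backslash to escape a special character. The subject string is an invented example.

```php
<?php
$subject = 'PHP 8 costs $0.00';

preg_match_all('/\d+/', $subject, $digits);            // runs of digits
preg_match_all('/\w+/', $subject, $words);             // runs of word characters
$spaces = preg_match_all('/\s/', $subject);            // count whitespace characters
preg_match_all('/[^\w\s]/', $subject, $symbols);       // negation: neither word nor space
preg_match('/\$\d+\.\d+/', $subject, $price);          // escaped "$" and "." are literal

echo implode(',', $digits[0]), PHP_EOL;  // 8,0,00
echo implode(',', $words[0]), PHP_EOL;   // PHP,8,costs,0,00
echo $spaces, PHP_EOL;                   // 3
echo implode(',', $symbols[0]), PHP_EOL; // $,.
echo $price[0], PHP_EOL;                 // $0.00
```

Without the backslash, the period in the price pattern would act as the wildcard and the dollar sign as an anchor, so escaping is what makes them match literally.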
preg_match and preg_match_all
The most used regex-related functions in PHP are preg_match and preg_match_all. You use them like this:
<?php
// preg_match($pattern, $subject, $matches);
$pattern = '/PHP/';
$subject = 'Real World PHP';
preg_match($pattern, $subject, $matches);
print_r($matches);
/* Output:
Array
(
    [0] => PHP
)
*/
// preg_match stops looking after the first match
// Look what happens when we search for any vowel
$pattern = '/[aeiou]/';
$subject = 'The quick brown fox jumps over the lazy dog.';
preg_match($pattern, $subject, $matches);
print_r($matches);
/* Output:
Array
(
    [0] => e
)
*/
// Now see how preg_match_all works
// preg_match_all($pattern, $subject, $matches);
$pattern = '/[aeiou]/';
$subject = 'The quick brown fox jumps over the lazy dog.';
preg_match_all($pattern, $subject, $matches);
print_r($matches);
/* Output:
Array
(
    [0] => Array
        (
            [0] => e
            [1] => u
            [2] => i
            [3] => o
            [4] => o
            [5] => u
            [6] => o
            [7] => e
            [8] => e
            [9] => a
            [10] => o
        )

)
*/
Note how we didn't declare the $matches variable before we used it on line 5. That is because the preg_match function creates it for us. This technique is called "pass by reference." It is not ideal, but it is one of the quirks of the language.
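To make "pass by reference" more concrete, here is a small sketch of a user-defined function that fills in a caller's variable the same way. The function findVowels and its parameter names are hypothetical and exist only for this illustration.

```php
<?php
// Hypothetical helper: the & in front of $vowels makes it a reference parameter,
// so assigning to it inside the function changes the caller's variable.
function findVowels(string $text, &$vowels): int
{
    // preg_match_all fills $found by reference in exactly the same way.
    $count = preg_match_all('/[aeiou]/', $text, $found);
    $vowels = $found[0];

    return (int) $count;
}

// $letters does not exist yet; the call creates it.
$total = findVowels('lazy dog', $letters);

print_r($letters);   // an array containing "a" and "o"
echo $total, PHP_EOL; // 2
```

The caller never assigns $letters directly, yet it holds the result after the call, which is the same behavior we saw with $matches.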
preg_match and preg_match_all do return something, but not the results. The results come in the $matches parameter only. The return values, according to the documentation:
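preg_match() returns 1 if the pattern matches given subject, 0 if it does not, or false on failure.
preg_match_all(), by contrast, returns the number of full pattern matches (which may be zero), or false on failure.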
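Because 0 and false are both "falsy" in PHP, it is safer to compare return values strictly. The sketch below shows each case. The failing call uses an unsupported "g" modifier purely to trigger a failure, and the @ operator is used here only to silence the warning PHP would otherwise print.

```php
<?php
$subject = 'Real World PHP';

$found = preg_match('/PHP/', $subject);                 // 1: the pattern matches
$missing = preg_match('/Perl/', $subject);              // 0: valid pattern, no match
$count = preg_match_all('/[aeiou]/', $subject, $all);   // 3: number of matches found
$broken = @preg_match('/PHP/g', $subject);              // false: "g" is not a PCRE modifier in PHP

var_dump($found, $missing, $count, $broken);

// Strict comparison tells "no match" apart from "something went wrong".
if ($broken === false) {
    echo 'The pattern could not be used.', PHP_EOL;
}
```

A loose check such as if (!preg_match(...)) would treat a broken pattern the same as a clean "no match", which can hide mistakes in the pattern itself.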
Examples
Regex is powerful and fast, and there are countless situations where a programmer will want to use it. Often it is used in user input validation. Here are some examples:
Email validation
Since email addresses were first introduced, the rules for what counts as a valid address have changed and grown more complicated, so checking validity has become harder. A search online might give you patterns like these:
^\w*(\-\w)?(\.\w*)?@\w*(-\w*)?\.\w{2,3}(\.\w{2,3})?$
^(([^<>()\[\]\\.,;:\s@"]+(\.[^<>()\[\]\\.,;:\s@"]+)*)|(".+"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$
^(?=[a-z][a-z0-9@._-]{5,40}$)[a-z0-9._-]{1,20}@(?:(?=[a-z0-9-]{1,15}\.)[a-z0-9]+(?:-[a-z0-9]+)*\.){1,2}[a-z]{2,6}$
You might be able to get away with any one of these. If you need a more dependable check, you would probably use the more elegant filter_var($email, FILTER_VALIDATE_EMAIL), but it is good to know you can validate email addresses using regex yourself.
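As a rough comparison, the sketch below runs the first pattern from the list above (wrapped in / delimiters so PHP accepts it) next to filter_var. The sample addresses are invented for illustration.

```php
<?php
// First pattern from the list above, with delimiters added for PHP.
$pattern = '/^\w*(\-\w)?(\.\w*)?@\w*(-\w*)?\.\w{2,3}(\.\w{2,3})?$/';

$addresses = ['user@example.com', 'user@example.info', 'not-an-email'];
$results = [];

foreach ($addresses as $address) {
    $byRegex = preg_match($pattern, $address) === 1;
    // filter_var returns the address itself when valid, or false when not.
    $byFilter = filter_var($address, FILTER_VALIDATE_EMAIL) !== false;

    $results[$address] = [$byRegex, $byFilter];
    echo $address, ': regex=', var_export($byRegex, true),
        ' filter_var=', var_export($byFilter, true), PHP_EOL;
}
```

The pattern only allows two or three characters after the final dot, so it rejects "user@example.info" while filter_var accepts it. Differences like this are why a hand-written pattern deserves testing against real addresses.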
Telephone number validation
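Telephone numbers vary wildly across the world. However, for a US telephone number, something like these would likely work well. Patterns copied from online testers often end in /gm, but PHP does not accept the "g" modifier, so it is dropped here; use preg_match_all when you want every match.
/(\+?(\b1)?[\ .\/-]?((?(2)|(\b))|(\())\d{3}(?(?<=\(\d{3})\)|)[\ .\/-]?)?(?(1)|\b)\d{3}[\ .\/-]?\d{4}[\ ]?([xX][\ ]?\d{1,5})?\b/m
/^(\+?\d{0,2})?[\D]?\(?(\d{3})\)?[\D]?(\d{3})[\D]?(\d{4})$/m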
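Here is a sketch of the second, shorter pattern in use. The trailing flags are left off because a single preg_match call on one string does not need them, and the sample numbers are made up. The groups in the pattern let us pull out the area code, prefix, and line number.

```php
<?php
// The shorter pattern from above, without trailing flags.
$pattern = '/^(\+?\d{0,2})?[\D]?\(?(\d{3})\)?[\D]?(\d{3})[\D]?(\d{4})$/';

$formatted = preg_match($pattern, '(555) 123-4567', $parts);
$withCountry = preg_match($pattern, '+1 555 123 4567', $intl);
$tooShort = preg_match($pattern, '555-1234', $none);

// Groups 2, 3 and 4 hold the area code, prefix and line number.
echo $parts[2], ' ', $parts[3], ' ', $parts[4], PHP_EOL; // 555 123 4567
echo $intl[1], PHP_EOL;                                  // +1
echo $tooShort, PHP_EOL;                                 // 0
```

A seven-digit number fails because the pattern insists on an area code.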
Zip Codes
For US zip codes:
(^\d{5}$)
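A quick sketch of the zip code pattern in use, wrapped in / delimiters. The candidate values are illustrative, and the nine-digit form is included only to show that this pattern rejects it.

```php
<?php
// The zip code pattern from above, with delimiters added for PHP.
$pattern = '/(^\d{5}$)/';

$candidates = ['90210', '9021', '90210-1234', 'ABCDE'];
$results = [];

foreach ($candidates as $candidate) {
    // Exactly five digits between the start and end of the string.
    $results[$candidate] = preg_match($pattern, $candidate) === 1;
    echo $candidate, ': ', $results[$candidate] ? 'valid' : 'invalid', PHP_EOL;
}
```

Only "90210" passes. If you need to accept the longer nine-digit form as well, the pattern would have to be extended.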
Resources
- Wikipedia: Regular expression
- PHP Manual: PCRE Functions
- regex101
- Learn Regex: A Beginner’s Guide
- Learn, build, and test regex
Challenges
Regex validation examples
Try running the examples in PHP. Use the preg_match and the preg_match_all functions.
Regex playground
Go to one of the following websites related to regex: regex101.com or regexr.com. Use their interface to try to match the following:
- Your name from a list of names
- The letters "P", "H", "p", and "h" from a random paragraph
- Any word beginning with the letter "r" and ending with the letter "s"