Regular expressions provide a means for advanced string matching and manipulation. They are often not pretty to look at. For instance:
CODE:
| ^.+@.+\..+$ |
This useful but scary bit of code is enough to give some programmers headaches and enough to make others decide that they don't want to know about regular expressions. Although they take a little time to learn, regular expressions, or regex as they're sometimes known, can be very handy.
Let's start with the basics. A regular expression is essentially a pattern, a set of characters that describes the nature of the string being sought. The pattern can be as simple as a literal string, or it can be extremely complex, using special characters to represent ranges of characters, multiple occurrences, or specific contexts in which to search. Look at the following pattern:
CODE:
| ^stuff |
This pattern includes the special character ^, which indicates that the pattern should only match for strings that begin with the string "stuff", so the string "stuff is good" would match this pattern, but the string "don't got the stuff" would not. Just as the ^ character matches strings that begin with the pattern, the $ character is used to match strings that end with the given pattern.
CODE:
| food$ |
This pattern would match the string "Where is my food", but it would not match "food is good". ^ and $ can be used together to match exact strings (with no leading or trailing characters not in the pattern). For example:
CODE:
| ^food$ |
matches only the string "food". If the pattern does not contain ^ or $, then the match will return true if the pattern is found anywhere in the source string. For the string:
CODE:
| "there once was a man with a red hood he liked food" |
the pattern:
CODE:
| once |
would result in a match. The letters in the pattern ("o", "n", "c", and "e") are literal characters. Letters and numbers all match themselves literally in the source string. For slightly more complex characters, such as punctuation and whitespace characters, we use an escape sequence. Escape sequences all begin with a backslash (\). For a tab character, the sequence is \t. So if we want to detect whether a string begins with a tab, we use the pattern:
CODE:
| ^\t |
This would match the strings:
CODE:
|
	But his daughter, named Nan
	Ran away with a man |
since both of these lines begin with tabs. Similarly, \n represents a newline character, \f represents a form feed, and \r represents a carriage return. For most punctuation marks, you can simply escape them with a \. Therefore, a backslash itself would be represented as \\, a literal . would be represented as \., and so on.
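The anchors and escape sequences described so far can be tried out directly. What follows is a minimal sketch, assuming a current PHP release in which preg_match() (the PCRE function) is available; matches() is a hypothetical helper introduced only for this illustration, and the patterns and strings are the ones used above.

```php
<?php
// Hypothetical helper (not part of the original text): wraps preg_match()
// so that a bare pattern such as ^stuff can be tested directly.
function matches(string $pattern, string $subject): bool
{
    // "/" is used as the PCRE delimiter; none of the patterns below contain it.
    return preg_match('/' . $pattern . '/', $subject) === 1;
}

var_dump(matches('^stuff', 'stuff is good'));        // bool(true)
var_dump(matches('^stuff', "don't got the stuff"));  // bool(false)
var_dump(matches('food$', 'Where is my food'));      // bool(true)
var_dump(matches('^food$', 'food is good'));         // bool(false)
var_dump(matches('once', 'there once was a man'));   // bool(true)

// Single quotes keep the backslash, so the regex engine itself sees \t.
var_dump(matches('^\t', "\tBut his daughter, named Nan"));  // bool(true)
```

The single-quoted pattern strings are a deliberate choice: PHP leaves the backslash untouched in them, so the escape sequence reaches the regular expression engine exactly as written.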
In Internet applications, regular expressions are especially useful for validating user input. You want to make sure that when a user submits a form, his or her phone number, address, e-mail address, credit card number, etc. all make reasonable sense. Obviously, you could not do this by literally matching individual words. (To do that, you would have to test for all possible phone numbers, all possible credit card numbers, and so on.) We need a way to describe more loosely the values that we are trying to match, and character classes provide a way to do that. To create a character class that matches any one vowel, we place all vowels in square brackets:
CODE:
| [AaEeIiOoUu] |
This will return true if the character being considered can be found in this "class", hence the name, character class. We can also use a hyphen to represent a range of characters:
CODE:
|
[a-z] // Match any lowercase letter
[A-Z] // Match any uppercase letter
[a-zA-Z] // Match any letter
[0-9] // Match any digit
[0-9.-] // Match any digit, dot, or minus sign
[ \f\r\t\n] // Match any whitespace character |
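To see which of these classes accepts a given character, each one can be tested in turn. This is only a sketch, assuming a current PHP release with preg_match(); classesMatching() and the sample characters are hypothetical and exist purely for illustration.

```php
<?php
// Hypothetical helper for illustration: returns the names of every class
// from the list above that matches the single character $char.
function classesMatching(string $char): array
{
    $classes = [
        'vowel'      => '[AaEeIiOoUu]',
        'lowercase'  => '[a-z]',
        'uppercase'  => '[A-Z]',
        'letter'     => '[a-zA-Z]',
        'digit'      => '[0-9]',
        'numeric'    => '[0-9.-]',
        'whitespace' => '[ \f\r\t\n]',
    ];

    $hits = [];
    foreach ($classes as $name => $class) {
        // ^ and $ anchor the class to the start and end of the input.
        if (preg_match('/^' . $class . '$/', $char) === 1) {
            $hits[] = $name;
        }
    }
    return $hits;
}

// Print each sample character (JSON-encoded so a tab stays visible)
// together with the classes that accept it.
foreach (['e', 'Q', '7', '-', "\t"] as $char) {
    echo json_encode($char), ' => ', implode(', ', classesMatching($char)), "\n";
}
```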
Be aware that each of these classes is used to match one character. This is an important distinction. If you were attempting to match a string composed of one lowercase letter and one digit only, such as "a2", "t6", or "g7", but not "ab2", "r2d2", or "b52", you could use the following pattern:
CODE:
| ^[a-z][0-9]$ |
Even though [a-z] represents a range of twenty-six characters, the character class itself is used to match only the first character in the string being tested. Remember that ^ tells PHP to look only at the beginning of the string. The next character class, [0-9], will attempt to match the second character of the string, and the $ matches the end of the string, thereby disallowing a third character.
We've learned that the caret (^) matches the beginning of a string, but it can also have a second meaning. When used immediately inside the brackets of a character class, it means "not" or "exclude". This can be used to "forbid" characters. Suppose we wanted to relax the rule above. Instead of requiring only a lowercase letter and a digit, we wish to allow the first character to be any non-digit character:
CODE:
| ^[^0-9][0-9]$ |
The special character "." is used in regular expressions to represent any non-newline character. Therefore the pattern ^.5$ will match any two-character string that ends in five and begins with any character (other than newline). The pattern . by itself will match any string at all, unless it is empty or composed entirely of newline characters.
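The three two-character patterns discussed above can be compared side by side. The sketch below assumes a current PHP release with preg_match(); the labels and the test strings "#7", "42", and "x5" are hypothetical additions chosen to exercise each rule.

```php
<?php
// The patterns from the text, keyed by a descriptive (hypothetical) label.
$patterns = [
    'letter then digit'    => '/^[a-z][0-9]$/',
    'non-digit then digit' => '/^[^0-9][0-9]$/',
    'any character then 5' => '/^.5$/',
];

// Report, for every pattern, whether each sample string matches.
foreach ($patterns as $label => $pattern) {
    foreach (['a2', 'b52', '#7', '42', 'x5'] as $subject) {
        $result = preg_match($pattern, $subject) === 1 ? 'match' : 'no match';
        echo "$label | $subject | $result\n";
    }
}
```

Note how "a2" satisfies both of the first two patterns, while "#7" satisfies only the relaxed one, since "#" is a non-digit but not a lowercase letter.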
Now that we understand regular expressions, it's time to explore how they fit into PHP. PHP has five functions for handling regular expressions. Two are used for simple searching and matching (ereg() and eregi()), two for search-and-replace (ereg_replace() and eregi_replace()), and one for splitting (split()). In addition, sql_regcase() is used to create case-insensitive regular expressions for database products that may be case sensitive.
The basic regular expression function in PHP is ereg(). It returns a positive integer (equivalent to true) if the pattern is found in the source string, or an empty value (equivalent to false) if it is not found or an error has occurred.
CODE:
|
if (ereg("^.+@.+\\..+$", $email)) {
    echo ("E-mail address is valid.");
} else {
    echo ("Invalid e-mail address.");
} |
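The ereg() family was deprecated in PHP 5.3 and removed in PHP 7.0, so on a current installation the same check has to be expressed differently. One possible equivalent, offered as a sketch rather than a definitive replacement, uses preg_match() with the same pattern wrapped in delimiters; looksLikeEmail() and the sample addresses other than the pattern itself are hypothetical.

```php
<?php
// Hypothetical helper name; the pattern is the one used in the text.
function looksLikeEmail(string $email): bool
{
    // preg_match() returns 1 on a match, 0 on no match, and false on error,
    // so comparing with 1 treats an error the same as "no match".
    return preg_match('/^.+@.+\..+$/', $email) === 1;
}

foreach (['scollo@taurix.com', 'no-at-sign.example', 'user@host'] as $candidate) {
    echo $candidate, ': ', looksLikeEmail($candidate) ? 'valid' : 'invalid', "\n";
}
```

As with the ereg() version, this only checks the rough shape of the address (something, an @, something, a dot, something); it does not verify that the address exists.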
ereg() can accept a third argument. This optional argument is an array passed by reference. Recall from the previous section that parentheses can be used to group characters and sequences. With the ereg() function, they can also be used to capture matched substrings of a pattern. For example, suppose that we not only wish to verify whether a string is an email address, but we also would like to individually examine the three principal parts of the email address: the username, domain name, and top-level domain name. We can do this by surrounding each corresponding part of our pattern with parentheses:
CODE:
| ^(.+)@(.+)\.(.+)$ |
Note that we have added three sets of parentheses to the pattern: the first where the username would be, the second where the domain name would be, and the third where the top-level domain name would be. Our next step is to include a variable as the third argument. This is the variable that will hold the array once ereg() has executed:
CODE:
| if (ereg("^(.+)@(.+)\\.(.+)$", $email, $arr)) { |
If the address is valid, the function will still return true. Additionally, the $arr variable will be set. $arr[0] will store the entire string, such as "scollo@taurix.com". Each matched, parenthesized substring will then be stored in an element of the array, so $arr[1] would equal "scollo", $arr[2] would equal "taurix", and $arr[3] would equal "com". If the e-mail address is not valid, the function will return false, and $arr will not be set. Here it is in action:
CODE:
|
if (ereg("^(.+)@(.+)\\.(.+)$", $email, $arr)) {
    echo ("E-mail address is valid. \n" .
          "E-mail address: $arr[0] \n" .
          "Username: $arr[1] \n" .
          "Domain name: $arr[2] \n" .
          "Top-level domain name: $arr[3] \n"
    );
} else {
    echo ("Invalid e-mail address. \n");
} |
eregi() behaves identically to ereg(), except it ignores case distinctions when matching letters.
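The capturing behaviour described above carries over to preg_match(), which also fills an array passed as its third argument. The following sketch assumes a current PHP release; splitEmail() and its array keys are hypothetical names chosen for this illustration.

```php
<?php
// Hypothetical helper: splits an address into the three parts captured by
// the parenthesized pattern from the text, or returns null if it does not match.
function splitEmail(string $email): ?array
{
    if (preg_match('/^(.+)@(.+)\.(.+)$/', $email, $arr) !== 1) {
        return null;
    }
    // $arr[0] holds the whole match; $arr[1] to $arr[3] hold the groups.
    return [
        'username' => $arr[1],
        'domain'   => $arr[2],
        'tld'      => $arr[3],
    ];
}

$parts = splitEmail('scollo@taurix.com');
if ($parts !== null) {
    echo "Username: {$parts['username']}\n";
    echo "Domain name: {$parts['domain']}\n";
    echo "Top-level domain name: {$parts['tld']}\n";
} else {
    echo "Invalid e-mail address.\n";
}
```

Because .+ is greedy, an address with several dots after the @ is split at the last dot. In PCRE, the case-insensitive behaviour of eregi() corresponds to appending the i modifier after the closing delimiter, as in '/pattern/i'.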
ereg_replace() and eregi_replace()
ereg_replace() searches a string for the given pattern and replaces all occurrences with the replacement. If a replacement took place, it returns the modified string; otherwise, it returns the original string:
CODE:
|
$str = "Then the pair followed Pa to Manhasset";
$pat = "followed";
$repl = "FOLLOWED";
echo (ereg_replace($pat, $repl, $str)); |
prints:
CODE:
| Then the pair FOLLOWED Pa to Manhasset |
Like ereg(), ereg_replace() also allows special treatment of parenthesized substrings. For each left parenthesis in the pattern, ereg_replace() will "remember" the value stored in that pair of parentheses, and represent it with a digit (1 to 9). You can then refer to that value in the replacement string by including two backslashes and the digit. For example:
CODE:
|
$str = "Where he still held the cash as an asset";
$pat = "c(as)h";
$repl = "C\\1H";
echo (ereg_replace($pat, $repl, $str)); |
Prints:
CODE:
| Where he still held the CasH as an asset |
The "as" is stored as \\1, and can thus be referenced in the replacement string.
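Both replacements shown above can be reproduced on a current PHP release with preg_replace(). This is a sketch under that assumption, not the original authors' code; the ${1} form of the back-reference is used here because it keeps the digit clearly separated from the letter that follows it.

```php
<?php
// Literal replacement: same strings as in the text, pattern wrapped in delimiters.
$str = 'Then the pair followed Pa to Manhasset';
$literal = preg_replace('/followed/', 'FOLLOWED', $str);
echo $literal, "\n";   // Then the pair FOLLOWED Pa to Manhasset

// Replacement with a remembered substring: ${1} refers to the text matched
// by the first pair of parentheses, here "as".
$str2 = 'Where he still held the cash as an asset';
$withGroup = preg_replace('/c(as)h/', 'C${1}H', $str2);
echo $withGroup, "\n"; // Where he still held the CasH as an asset
```

Only "cash" is rewritten: the standalone "as" and the "as" inside "asset" are not surrounded by "c" and "h", so the pattern does not match them.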






