When working with text in any modern programming language, developers constantly need to check input against a required pattern, search for and replace text fragments, and perform other routine text-processing operations. Writing your own validation algorithm for each case wastes time, produces incompatible code, and makes that code harder to develop and maintain.
The rapid growth of the Internet and of web development languages created a need for universal, compact tools that process text with a minimum of code. PHP is a popular language among both beginners and professional developers. Regular expressions, as a language of text patterns, simplify text-processing tasks and can shorten a program by tens or hundreds of lines. Some tasks are impractical to solve without them.
PHP regular expressions
PHP has provided three mechanisms for working with regular expressions: “ereg”, “mb_ereg”, and “preg”. The most common is the preg interface, whose functions give access to PCRE, a library of Perl-compatible regular expressions that is bundled with PHP. (The ereg functions were deprecated in PHP 5.3 and removed in PHP 7.0.) The preg functions search a given text string for matches to a pattern written in the regular expression language.
Syntax Basics
A short article cannot describe the entire syntax of regular expressions in detail; dedicated literature exists for that. Only the basic elements are covered here, enough to show what is possible and to make the examples understandable.
The formal definition of a regular expression in PHP is rather complicated, so here is a simplified description. A regular expression is a text string. It consists of a pattern enclosed in delimiters, followed by optional modifiers that specify how the pattern is to be processed. Patterns may include alternatives and repetitions.
For example, in the expression /\d{3}-\d{2}-\d{2}/m, the delimiter is “/”, the pattern is everything between the two delimiters, and the character “m” is a modifier.
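As a rough illustration of how these three parts fit together in a call, the sketch below passes the expression from the text to preg_match_all(). The sample text and its numbers are made up for the example.

```php
<?php
// Hypothetical sample text; the numbers are invented for illustration.
$text = "Office: 555-12-34\nFax: 555-98-76";

// "/" is the delimiter, \d{3}-\d{2}-\d{2} is the pattern, "m" is the modifier.
$pattern = '/\d{3}-\d{2}-\d{2}/m';

// preg_match_all() returns the number of full matches
// and stores the matched substrings in $matches[0].
$count = preg_match_all($pattern, $text, $matches);

echo $count, "\n";
print_r($matches[0]);
```

The “m” modifier only changes how ^ and $ behave, so with this particular pattern it has no visible effect; it is kept here to mirror the expression in the text.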
The full power of regular expressions comes from metacharacters. The main metacharacter of the language is the backslash, “\”. It reverses the meaning of the character that follows it: an ordinary character becomes a metacharacter and vice versa. Another important metacharacter is the vertical bar, “|”, which separates alternative versions of a pattern. More examples of metacharacters:
^ | Start of subject or line |
( | Start of a subpattern |
) | End of a subpattern |
{ | Start of a quantifier |
} | End of a quantifier |
\d | Any decimal digit, 0 to 9 |
\D | Any character that is not a decimal digit |
\s | Any whitespace character (space, tab, newline) |
\w | Any word character (letter, digit, or underscore) |
When processing regular expressions, PHP treats a space as a significant character in its own right (unless the x modifier is used), so the expressions ABCDE and ABC DE are different.
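The character classes from the table, and the significance of a space inside a pattern, can be tried out with a few short calls. This is only a sketch; the input string is a made-up example.

```php
<?php
// Hypothetical input string used only for illustration.
$subject = "Room 42, floor_3";

preg_match_all('/\d/', $subject, $digits);        // every decimal digit
preg_match_all('/\w+/', $subject, $words);        // runs of word characters
$noSpaces   = preg_replace('/\s/', '', $subject); // whitespace removed
$digitsOnly = preg_replace('/\D/', '', $subject); // non-digits removed

// A space inside the pattern must match a space in the subject.
$withSpace    = preg_match('/ABC DE/', 'ABCDE'); // no match
$withoutSpace = preg_match('/ABCDE/', 'ABCDE');  // match
```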
Subpatterns
In PHP regular expressions, subpatterns are enclosed in parentheses and are sometimes called “subexpressions”. They serve the following purposes:
Grouping alternatives. For example, the pattern cat(aract|erpillar|) matches the words cat, cataract, and caterpillar. Without the parentheses, it would match only “cataract”, “erpillar”, or the empty string.
“Capturing” a subpattern. When the whole pattern matches, the substrings matched by each subpattern are returned along with the full match. For clarity, here is an example. Take the regular expression the winner receives a ((gold|gilded) (medal|cup)) and the subject string “the winner receives a gold medal”. In addition to the full phrase, the search returns “gold medal”, “gold”, and “medal”.
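The capturing behaviour is easy to observe with preg_match(), which fills an array with the full match followed by one entry per subpattern, numbered by opening parenthesis. The sketch below assumes the article “a” is part of the pattern so that it matches the sample line.

```php
<?php
// Assumption: the pattern includes the article "a" so it matches the subject.
$pattern = '/the winner receives a ((gold|gilded) (medal|cup))/';
$subject = 'the winner receives a gold medal';

// $matches[0] is the full match; $matches[1], [2], [3] hold the
// subpatterns in the order of their opening parentheses.
preg_match($pattern, $subject, $matches);

print_r($matches);
```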
Repetition Operators (Quantifiers)
When writing regular expressions, you often need to match repeated digits or characters. This is not a problem when there are only a few repetitions. But what if the exact number is not known in advance? In that case, you need special metacharacters.
Repetitions are described with quantifiers, metacharacters that specify a quantity. There are two kinds of quantifier:
- general (written in curly braces);
- abbreviated.
The general quantifier sets the minimum and maximum number of allowed repetitions of an element as two numbers in curly braces, for example, x{2,5}. If there is no upper limit, the second number is omitted: x{2,}.
Abbreviated quantifiers are single characters for the most common repetitions, which keep the syntax from becoming cluttered. Three abbreviations are commonly used:
1. * - zero or more repetitions, equivalent to {0,}.
2. + - one or more repetitions, that is, {1,}.
3. ? - zero or one repetition, that is, {0,1}.
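A few quick checks can make the quantifier rules concrete. In this sketch the anchors ^ and $ are added so that the whole string must fit the pattern; the test strings are arbitrary.

```php
<?php
// General quantifier: between 2 and 5 repetitions of "x".
$threeX = preg_match('/^x{2,5}$/', 'xxx');      // matches
$oneX   = preg_match('/^x{2,5}$/', 'x');        // too few
$manyX  = preg_match('/^x{2,}$/', 'xxxxxxxx');  // no upper limit, matches

// Abbreviated quantifiers applied to the letter "b".
$star = preg_match('/^ab*$/', 'a');    // * allows zero "b"
$plus = preg_match('/^ab+$/', 'a');    // + needs at least one "b"
$opt  = preg_match('/^ab?$/', 'abb');  // ? allows at most one "b"
```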
Regular Expression Examples
For anyone learning regular expressions, examples are the best tutorial. Here are a few that show how much can be done with little effort. All of the expressions are compatible with PHP 4.x and later. For a complete treatment of the syntax and of every feature of the language, we recommend J. Friedl’s book “Mastering Regular Expressions”, which covers the syntax in full and gives examples of regular expressions not only in PHP but also for Python, Perl, MySQL, Java, Ruby, and C#.
Validating Email Address
Task. A web page asks the visitor for an email address. A regular expression must check that the address is well formed before any messages are sent to it. The check does not guarantee that the mailbox actually exists and accepts mail, but it can weed out addresses that are obviously wrong.
Solution. As in any programming language, email validation with PHP regular expressions can be implemented in different ways, and the examples in this article are not the only or final option. For each case, therefore, we list the requirements that need to be taken into account; the specific implementation is up to the developer.
So, an email validation expression must check the following conditions:
- The presence of the @ character in the source line and the absence of spaces.
- The domain part of the address, after the @ symbol, contains only characters that are valid in domain names. The same applies to the username.
- The username must not contain special characters such as an apostrophe or a vertical bar. Such characters are potentially dangerous and can be used in attacks such as SQL injection. Reject such addresses.
- The username may contain only one dot, which cannot be its first or last character.
- The top-level domain must contain at least two and no more than six characters.
An example that takes into account all these conditions can be seen in the figure below.
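One way the conditions listed above might be expressed in PHP is sketched below. The function name isPlausibleEmail() and the set of characters allowed in the username (letters, digits, underscore, hyphen) are assumptions made for this example, and the type declarations assume PHP 7 or later.

```php
<?php
// Sketch only: the allowed username characters are an assumption,
// not a rule stated in the article.
function isPlausibleEmail(string $address): bool
{
    $pattern = '/^[a-z0-9_-]+'     // username: no spaces, quotes, or bars
             . '(\.[a-z0-9_-]+)?'  // at most one dot, never first or last
             . '@'                 // exactly one @ separator
             . '([a-z0-9-]+\.)+'   // domain labels, each followed by a dot
             . '[a-z]{2,6}$/iD';   // top-level domain of 2 to 6 letters

    // "i" ignores case; "D" stops $ from matching before a trailing newline.
    return preg_match($pattern, $address) === 1;
}

var_dump(isPlausibleEmail('john.smith@example.com'));
var_dump(isPlausibleEmail("o'brien@example.com"));
```

The sketch is deliberately stricter than real-world addresses allow, in line with the requirement to reject potentially dangerous characters.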
Validating URLs
Task. Check whether a given text string is a valid URL. Here too, a URL validation expression can be written in various ways.
Solution. Our final version is as follows:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
Now we will analyze its components in more detail using the figure.
Item 1 | No characters may come before the URL |
Item 2 | Check for the http prefix |
Item 3 | There should be no characters |
Item 4 | If "s" is present, the URL points to a secure "https" connection |
Item 5 | The "//" fragment |
Item 6 | No characters |
Items 7-9 | Validation of the first-level domain and the presence of dots |
Items 10-13 | Spelling check of the second-level domain and the dot |
Items 14-17 | URL file structure: a set of digits, letters, underscores, hyphens, periods, and a slash at the end |
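The expression can be wrapped in a small helper for use in form handling. The sketch below is one possible wrapper; the function name looksLikeUrl() is hypothetical, the sample URLs are invented, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper around the URL pattern discussed above.
function looksLikeUrl(string $url): bool
{
    $pattern = '/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/';

    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $url) === 1;
}

var_dump(looksLikeUrl('https://www.example.com/path/file.html'));
var_dump(looksLikeUrl('not a url'));
```

Note that the pattern accepts only lowercase letters in the host name unless the i modifier is added. PHP also offers filter_var() with FILTER_VALIDATE_URL as a built-in alternative.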
Checking Credit Card Numbers
Task. Validate a payment card number entered by the user for the most common payment systems. Only Visa and MasterCard are considered here.
Solution. When writing the expression, you must allow for spaces in the entered number. The digits on a card are printed in groups to make them easier to read and dictate, so it is quite natural for a person to type the number the same way, with spaces.
Writing a universal expression that allows for possible spaces and hyphens is harder than simply discarding every character that is not a digit. It is therefore recommended to use the metacharacter \D, which matches any non-digit character, to strip everything except digits from the input.
Now you can go on to check the number itself. Every card company uses its own number format. The example relies on this, so the customer does not need to enter the company name: it is determined from the number. Visa numbers always start with 4 and are 13 or 16 digits long. MasterCard numbers start with 51 through 55 and are 16 digits long. An expression that implements these rules is:
^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14})$
Before processing the order, you can additionally verify the last digit of the number, which is a check digit calculated with the Luhn algorithm.
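The steps described in this section (strip non-digits, recognise the card type from the prefix and length, then verify the check digit) could be sketched as follows. The function names are hypothetical, the card numbers in the usage lines are widely published test numbers, and the type declarations assume PHP 7.1 or later.

```php
<?php
// Remove spaces, hyphens, and any other non-digit characters.
function normalizeCardNumber(string $input): string
{
    return (string) preg_replace('/\D/', '', $input);
}

// Return 'Visa', 'MasterCard', or null, using the rules from the text.
function detectCardType(string $digits): ?string
{
    if (preg_match('/^4[0-9]{12}(?:[0-9]{3})?$/', $digits) === 1) {
        return 'Visa';        // starts with 4, 13 or 16 digits
    }
    if (preg_match('/^5[1-5][0-9]{14}$/', $digits) === 1) {
        return 'MasterCard';  // starts with 51-55, 16 digits
    }
    return null;
}

// Luhn check: double every second digit from the right, sum, test mod 10.
function passesLuhn(string $digits): bool
{
    if ($digits === '') {
        return false;
    }
    $sum = 0;
    $double = false;
    for ($i = strlen($digits) - 1; $i >= 0; $i--) {
        $d = (int) $digits[$i];
        if ($double) {
            $d *= 2;
            if ($d > 9) {
                $d -= 9;  // same as adding the two digits of the product
            }
        }
        $sum += $d;
        $double = !$double;
    }
    return $sum % 10 === 0;
}

$number = normalizeCardNumber('4111 1111-1111 1111');
var_dump(detectCardType($number), passesLuhn($number));
```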
Checking Phone Numbers
Task. Validate a phone number entered by the user.
Solution. The number of digits in landline and mobile phone numbers varies considerably from country to country, so no single regular expression can verify every phone number. International numbers, however, have a strict format that is well suited to pattern checking, and more and more national operators are moving toward a single standard. The structure of the number is as follows:
+CCC.NNNNNNNNNNxEEEE, where:
- C is the country code, 1 to 3 digits long.
- N is the number, up to 14 digits long.
- E is an optional extension.
The plus sign is mandatory, and the x appears only when an extension is present.
As a result, we have the following expression:
^\+[0-9]{1,3}\.[0-9]{4,14}(?:x.+)?$
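A minimal sketch of this check in PHP follows. The function name isInternationalPhone() and the sample numbers are invented for illustration, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper: checks the +CCC.NNNNNNNNNNxEEEE format.
function isInternationalPhone(string $number): bool
{
    // Plus sign, 1-3 digit country code, dot, 4-14 digit number,
    // then an optional "x" followed by an extension.
    $pattern = '/^\+[0-9]{1,3}\.[0-9]{4,14}(?:x.+)?$/';

    return preg_match($pattern, $number) === 1;
}

var_dump(isInternationalPhone('+44.2071234567x123'));
var_dump(isInternationalPhone('1.5551234567'));
```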
Numbers in a Range
Task. Match an integer that lies within a certain range. The expression must not match numbers outside that range.
Solution. Here are expressions for some of the most common cases:
Hour from 1 to 24 | ^(2[0-4]|1[0-9]|[1-9])$ |
Day of the month, 1-31 | ^(3[01]|[12][0-9]|[1-9])$ |
Second or minute, 0-59 | ^[1-5]?[0-9]$ |
Number from 1 to 100 | ^(100|[1-9][0-9]?)$ |
Day of the year, 1-366 | ^(36[0-6]|3[0-5][0-9]|[12][0-9]{2}|[1-9][0-9]?)$ |
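Range expressions are easy to get subtly wrong, so it can help to test one against every integer around its boundaries. The sketch below assumes the ranges named in the table (hour 1 to 24, number 1 to 100, and so on); the helper name matchesExactlyRange() is hypothetical, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper: true if $pattern matches every integer in
// [$min, $max] and no other integer between 0 and 1000.
function matchesExactlyRange(string $pattern, int $min, int $max): bool
{
    for ($n = 0; $n <= 1000; $n++) {
        $expected = ($n >= $min && $n <= $max);
        $actual = preg_match($pattern, (string) $n) === 1;
        if ($expected !== $actual) {
            return false;
        }
    }
    return true;
}

// One way to write each range; the boundaries are assumptions
// taken from the row labels.
$ranges = [
    'hour'        => ['/^(2[0-4]|1[0-9]|[1-9])$/', 1, 24],
    'day'         => ['/^(3[01]|[12][0-9]|[1-9])$/', 1, 31],
    'minute'      => ['/^[1-5]?[0-9]$/', 0, 59],
    'percent'     => ['/^(100|[1-9][0-9]?)$/', 1, 100],
    'day_of_year' => ['/^(36[0-6]|3[0-5][0-9]|[12][0-9]{2}|[1-9][0-9]?)$/', 1, 366],
];

foreach ($ranges as $name => [$pattern, $min, $max]) {
    echo $name, ': ', matchesExactlyRange($pattern, $min, $max) ? 'ok' : 'FAIL', "\n";
}
```

A brute-force check like this is cheap for small ranges and catches off-by-one mistakes such as an expression that also accepts 0.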
IP Address Search
Task. Determine whether a given string is a valid IPv4 address in the range 000.000.000.000 to 255.255.255.255.
Solution. As with any task in PHP, the regular expression can be written in many ways. For example, like this:
^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$
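One possible variation is sketched below: each of the four octets is limited to 0-255, with leading zeros allowed. The function name isIpv4() is hypothetical, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper: one of many ways to check an IPv4 address.
function isIpv4(string $ip): bool
{
    // One octet: 250-255, 200-249, or 0-199 with optional leading zeros.
    $octet = '(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)';

    // Three octets each followed by a dot, then a final octet.
    $pattern = '/^(?:' . $octet . '\.){3}' . $octet . '$/';

    return preg_match($pattern, $ip) === 1;
}

var_dump(isIpv4('192.168.0.1'));
var_dump(isIpv4('256.1.1.1'));
```

PHP also provides filter_var() with FILTER_VALIDATE_IP and FILTER_FLAG_IPV4 as a built-in alternative to a hand-written expression.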
Checking Expressions Online
Beginners can find regular expressions hard to validate because the syntax is complex and unlike that of "regular" programming languages. To help with this, there are many online expression testers that make it easy to check a pattern against real text. The programmer enters the expression and the test data and instantly sees the result. Such tools usually also include a reference section that describes regular expressions, examples, and implementation differences for the most common programming languages in detail.
Even so, PHP developers should not rely entirely on the results of online services. Writing and verifying an expression yourself builds your skills and gives greater confidence that it contains no errors.