Categories of Pattern Matching Characters
Pattern-matching characters can be grouped into various
categories, which will be explained in detail later. By understanding
these characters, you understand the language needed to create a regular
expression pattern. The categories are:
- Position matching: You wish to match a
substring that occurs at a specific location within the larger string,
for example, a substring at the very beginning or end of the
string.
- Special literal character matching: All
alphabetic and numeric characters match themselves literally by
default in regular expressions. However, if you wish to match, say, a
newline, a special syntax is needed:
a backslash (\) followed by a designated character. For
example, "\n" matches a newline, while "\r"
matches a carriage return.
- Character classes matching: Individual
characters can be combined into character classes to form more complex
matches by placing them in designated containers such as square
brackets. For example, /[abc]/ matches "a", "b", or "c", while
/[a-zA-Z0-9]/ matches any single alphanumeric character.
- Repetition matching: You wish to match one or more
characters that repeat a certain number of times. For example, to match
"555", the easy way is to use /5{3}/.
- Alternation and grouping matching: You wish to
group characters to be considered as a single entity, or add "OR"
logic to your pattern matching.
- Back reference matching: You wish to refer
back to a subexpression in the same regular expression to perform
matches where one match is based on the result of an earlier match.
The following are categorized tables explaining the
above:
Position Matching
| Symbol |
Description |
Example |
| ^ |
Only matches the beginning of a string. |
/^The/ matches "The" in "The night" but not in "In The Night"
|
| $ |
Only matches the end of a string. |
/and$/ matches "and" in "Land" but not "landing"
|
| \b |
Matches any word boundary (test characters must
exist at the beginning or end of a word within the string) |
/ly\b/ matches "ly" in "This is really cool."
|
| \B |
Matches any non-word boundary. |
/\Bor/ matches “or” in "normal" but not "origami."
|
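The position symbols above can be tried out with a short sketch like the following. It assumes the patterns are run through the standard RegExp test() method; the variable names are made up for illustration.

```javascript
// Position symbols assert a location in the string; they do not consume characters.
const startsWithThe = /^The/;
console.log(startsWithThe.test("The night"));    // true
console.log(startsWithThe.test("In The Night")); // false

const endsWithAnd = /and$/;
console.log(endsWithAnd.test("Land"));    // true
console.log(endsWithAnd.test("landing")); // false

// \b: the "ly" must sit right before a word boundary.
const lyAtWordEnd = /ly\b/;
console.log(lyAtWordEnd.test("This is really cool.")); // true

// \B: the "or" must NOT sit at a word boundary, so it cannot start a word.
const orInsideWord = /\Bor/;
console.log(orInsideWord.test("normal"));  // true
console.log(orInsideWord.test("origami")); // false
```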
Literals
| Symbol |
Description |
| Alphanumeric |
All alphabetical and numerical characters match themselves literally. So
/2 days/ will match "2 days" inside a string. |
| \n |
Matches a new line character |
| \f |
Matches a form feed character |
| \r |
Matches carriage return character |
| \t |
Matches a horizontal tab character |
| \v |
Matches a vertical tab character |
| \xxx |
Matches the ASCII character expressed by the
octal number xxx.
"\50" matches left parentheses character "(" |
| \xdd |
Matches the ASCII character expressed by the hex
number dd.
"\x28" matches left parentheses character "(" |
| \uxxxx |
Matches the Unicode character expressed by the
four-digit hex number xxxx.
"\u00A3" matches "£". |
The backslash (\) is also used when you wish to match a special
character literally. For example, if you wish to match the symbol "$"
literally instead of having it signal the end of the string, backslash it:
/\$/
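As a small illustration of the escapes listed above, the sketch below checks a few of them against sample strings. The sample strings and variable names are invented for this example.

```javascript
// \xdd and \uxxxx name a character by its hexadecimal code.
const leftParen = /\x28/;
console.log(leftParen.test("f(x)")); // true

const poundSign = /\u00A3/;
console.log(poundSign.test("Price: \u00A3" + "5")); // true

// \t and \n match the tab and newline characters.
console.log(/\t/.test("a\tb"));          // true
console.log(/\n/.test("line1\nline2"));  // true

// Escaped, $ is a literal dollar sign rather than the end-of-string symbol.
const literalDollar = /\$/;
const dollarIndex = "Cost: $20".search(literalDollar);
console.log(dollarIndex); // 6
```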
Character Classes
| Symbol |
Description |
Example |
| [xyz] |
Match any one character enclosed in the character
set. You may use a hyphen to denote a range. For example, /[a-z]/
matches any lowercase letter in the alphabet, /[0-9]/ any single digit. |
/[AN]BC/ matches "ABC" and "NBC" but not "BBC" since the leading
“B” is not in the set.
|
| [^xyz] |
Match any one character not enclosed in the character set. The
caret indicates that none of the listed characters should be
matched. NOTE: the
caret used within a character class is not to be confused with the
caret that denotes the beginning of a string. Negation is only
performed within the square brackets.
|
/[^AN]BC/ matches "BBC" but not "ABC" or "NBC".
|
| . |
(Dot).
Match any character except newline or another Unicode line
terminator. |
/b.t/ matches "bat", "bit", "bet" and so on.
|
| \w |
Match any single alphanumeric character, including the
underscore. Equivalent to [a-zA-Z0-9_]. |
/\w/ matches "2" in "200%"; /\w+/ matches "200"
|
| \W |
Match any single non-word character. Equivalent
to [^a-zA-Z0-9_]. |
/\W/ matches "%" in "200%"
|
| \d |
Match any single digit. Equivalent to [0-9]. |
|
| \D |
Match any non-digit. Equivalent to [^0-9]. |
/\D/ matches "N" in "No 342222"; /\D+/ matches "No " (including the space)
|
| \s |
Match any single whitespace character. Roughly equivalent to
[ \t\r\n\v\f]; in JavaScript, \s also matches other Unicode
whitespace such as the no-break space. |
|
| \S |
Match any single non-space character. Equivalent
to [^ \t\r\n\v\f]. |
|
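The character classes above each match a single character unless a repetition symbol follows them. The sketch below shows this using the standard String match() method, which without the g switch returns the first match at index 0 of its result. The variable names are illustrative only.

```javascript
// A bare class matches ONE character; add + to match a run of them.
const firstWordChar = "200%".match(/\w/)[0];
const wordRun = "200%".match(/\w+/)[0];
const firstNonWord = "200%".match(/\W/)[0];
console.log(firstWordChar); // "2"
console.log(wordRun);       // "200"
console.log(firstNonWord);  // "%"

const firstNonDigit = "No 342222".match(/\D/)[0];
const digitRun = "No 342222".match(/\d+/)[0];
console.log(firstNonDigit); // "N"
console.log(digitRun);      // "342222"

// Sets and negated sets.
console.log(/[AN]BC/.test("NBC"));  // true
console.log(/[AN]BC/.test("BBC"));  // false
console.log(/[^AN]BC/.test("BBC")); // true

// The dot matches any character except a line terminator.
console.log(/b.t/.test("bit"));  // true
console.log(/b.t/.test("b\nt")); // false
```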
Repetition
| Symbol |
Description |
Example |
| {x} |
Match exactly x occurrences of a regular
expression. |
/\d{5}/ matches 5 digits.
|
| {x,} |
Match x or more occurrences of a regular
expression. |
/\s{2,}/ matches at least 2 whitespace characters.
|
| {x,y} |
Match between x and y occurrences (inclusive) of a regular
expression. |
/\d{2,4}/ matches at least 2 but no more than 4 digits.
|
| ? |
Match zero or one occurrences. Equivalent to
{0,1}. |
/a\s?b/ matches "ab" or "a b".
|
| * |
Match zero or more occurrences. Equivalent to
{0,}. |
/we*/ matches "w" in "why" and "wee" in "between", but nothing in
"bad"
|
| + |
Match one or more occurrences. Equivalent to
{1,}. |
/fe+d/ matches both "fed" and "feed"
|
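The repetition symbols can be checked the same way. The following sketch assumes the standard match() and test() methods; the sample strings and variable names are made up. Note that these symbols are greedy by default, so they take as many occurrences as the limits allow.

```javascript
// {x}: exactly x occurrences.
const threeFives = "Call 555-1234".match(/5{3}/)[0];
console.log(threeFives); // "555"

// {x,y}: greedy, so it takes up to 4 digits even though 2 would do.
const twoToFour = "123456".match(/\d{2,4}/)[0];
console.log(twoToFour); // "1234"

// *: zero or more "e" after "w".
const inWhy = "why".match(/we*/)[0];
const inBetween = "between".match(/we*/)[0];
const inBad = "bad".match(/we*/);
console.log(inWhy);     // "w"
console.log(inBetween); // "wee"
console.log(inBad);     // null (there is no "w" at all)

// +: at least one "e" is required.
console.log(/fe+d/.test("fed"));  // true
console.log(/fe+d/.test("feed")); // true
console.log(/fe+d/.test("fd"));   // false

// ?: the whitespace character is optional.
console.log(/a\s?b/.test("ab"));  // true
console.log(/a\s?b/.test("a b")); // true
```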
Alternation & Grouping
| Symbol |
Description |
Example |
| ( ) |
Grouping characters together to create a clause.
May be nested. |
/(abc)+(def)/ matches one or more occurrences of "abc" followed by one
occurrence of "def".
|
| | |
Alternation combines clauses into one regular
expression and then matches any of the individual clauses. Similar
to "OR" statement. |
/(ab)|(cd)|(ef)/ matches "ab" or "cd" or "ef".
|
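A brief sketch of grouping and alternation follows, again using the standard match() and test() methods. The "catalog" strings are an invented illustration of why grouping matters when alternation is combined with other symbols; they do not come from the tables above.

```javascript
// A repeated group remembers only the text of its last repetition.
const groupMatch = "abcabcdef".match(/(abc)+(def)/);
console.log(groupMatch[0]); // "abcabcdef"
console.log(groupMatch[1]); // "abc"
console.log(groupMatch[2]); // "def"

// Alternation matches whichever clause is found first in the string.
const eitherPair = "xxcdxx".match(/(ab)|(cd)|(ef)/)[0];
console.log(eitherPair); // "cd"

// Without parentheses, | splits the WHOLE pattern into "^cat" or "dog$".
console.log(/^cat|dog$/.test("catalog"));   // true, because "^cat" matches
// With parentheses, the anchors apply to both alternatives.
console.log(/^(cat|dog)$/.test("catalog")); // false
console.log(/^(cat|dog)$/.test("dog"));     // true
```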
Backreferences
| Symbol |
Description |
Example |
| ( )\n |
Matches the same text that an earlier parenthesized clause in the
pattern matched. n is the number of that clause, counted from the left of the
backreference. |
/(\w+)\s+\1/ matches any word that occurs twice in a row, such as
"hubba hubba". The \1 denotes that the text after the space
must be identical to the portion of the string that matched the
first set of parentheses. If there is more than one set of
parentheses in the pattern string, use \2 or \3 to refer to
the appropriate grouping to the left of the backreference. \1 through \9
refer to the first nine groupings in a pattern string.
|
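The doubled-word pattern described above can be sketched as follows. The variable names are illustrative, and the second test string is an invented counterexample.

```javascript
// \1 must repeat exactly the text captured by the first set of parentheses.
const doubledWord = /(\w+)\s+\1/;

const doubledHit = "hubba hubba".match(doubledWord);
console.log(doubledHit[0]); // "hubba hubba" (the whole match)
console.log(doubledHit[1]); // "hubba"       (the text \1 refers back to)

// The second word differs, so the backreference cannot be satisfied.
console.log(doubledWord.test("hubba bubba")); // false
```

One design note: because \w+ can also match just the tail of a word, a pattern like this may match across "this is" (the "is" at the end of "this" followed by "is"). Adding \b on both sides, as in /\b(\w+)\s+\1\b/, is a common way to restrict it to whole words.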
Pattern Switches
In addition to the pattern-matching characters, you can
use switches to make the match global, case-insensitive, or both.
Switches are added to the very end of a regular expression.
| Property |
Description |
Example |
| i |
Ignore the case of characters. |
/The/i matches "the" and "The" and "tHe"
|
| g |
Global search for all occurrences of a pattern |
/ain/g matches both "ain"s in "No pain no gain", instead of just
the first.
|
| gi |
Global search, ignore case. |
/it/gi matches all "it"s in "It is our IT department"
|
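The effect of the switches is easiest to see with the standard String match() method, which returns only the first match without g and an array of all matches with it. The sketch below uses the examples from the table; the variable names are illustrative.

```javascript
// i: ignore case.
console.log(/The/i.test("tHe")); // true
console.log(/The/.test("tHe"));  // false

// Without g, match() stops at the first occurrence and reports its index.
const firstAin = "No pain no gain".match(/ain/);
console.log(firstAin[0], firstAin.index); // "ain" 4

// With g, match() returns every occurrence.
const allAin = "No pain no gain".match(/ain/g);
console.log(allAin); // [ "ain", "ain" ]

// gi: every occurrence, regardless of case.
const allIt = "It is our IT department".match(/it/gi);
console.log(allIt); // [ "It", "IT" ]
```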