Categories of Pattern Matching Characters

Pattern-matching characters can be grouped into various categories, which will be explained in detail later. By understanding these characters, you understand the language needed to create a regular expression pattern. The categories are:

  • Position matching- You wish to match a substring that occurs at a specific location within the larger string, for example at the very beginning or end of the string.
  • Special literal character matching- All alphabetic and numeric characters match themselves literally by default. To match a character such as a newline, however, a special syntax is needed: a backslash (\) followed by a designated character. For example, "\n" matches a newline, while "\r" matches a carriage return.
  • Character classes matching- Individual characters can be combined into character classes to form more complex matches, by placing them in designated containers such as square brackets. For example, /[abc]/ matches "a", "b", or "c", while /[a-zA-Z0-9]/ matches any single alphanumeric character.
  • Repetition matching- You wish to match one or more characters that repeat a certain number of times. For example, the easy way to match "555" is to use /5{3}/.
  • Alternation and grouping matching- You wish to group characters to be considered as a single entity or add an "OR" logic to your pattern matching.
  • Back reference matching- You wish to refer back to a subexpression in the same regular expression to perform matches where one match is based on the result of an earlier match.

The following are categorized tables explaining the above:

Position Matching

Symbol Description Example
 ^ Only matches the beginning of a string. /^The/ matches "The" in "The night" but not in "In The Night"
 $ Only matches the end of a string. /and$/ matches "and" in "Land" but not "landing"
 \b Matches any word boundary (test characters must exist at the beginning or end of a word within the string) /ly\b/ matches "ly" in "This is really cool."
 \B Matches any non-word boundary. /\Bor/ matches “or” in "normal" but not "origami."
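
The position symbols above can be tried out with a short sketch like the following. It uses the standard test() method of a regular expression; the variable names and the extra test string "lying" are only illustrative and do not come from the table.

```javascript
// Position matching: ^, $, \b and \B.
const startsWithThe = /^The/;
console.log(startsWithThe.test("The night"));    // true
console.log(startsWithThe.test("In The Night")); // false

const endsWithAnd = /and$/;
console.log(endsWithAnd.test("Land"));    // true
console.log(endsWithAnd.test("landing")); // false

// \b requires a word boundary immediately after "ly".
const lyAtWordEnd = /ly\b/;
console.log(lyAtWordEnd.test("This is really cool.")); // true
console.log(lyAtWordEnd.test("lying"));                // false

// \B requires that "or" does NOT begin at a word boundary.
const orInsideWord = /\Bor/;
console.log(orInsideWord.test("normal"));  // true
console.log(orInsideWord.test("origami")); // false
```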

Literals

Symbol Description
Alphanumeric All alphabetical and numerical characters match themselves literally. So /2 days/ will match "2 days" inside a string.
 \n Matches a new line character
 \f Matches a form feed character
 \r Matches carriage return character
 \t Matches a horizontal tab character
 \v Matches a vertical tab character
 \xxx Matches the ASCII character expressed by the octal number xxx. "\50" matches the left parenthesis character "("
 \xdd Matches the ASCII character expressed by the hex number dd. "\x28" matches the left parenthesis character "("
 \uxxxx Matches the Unicode character expressed by the four-digit hex code xxxx. "\u00A3" matches "£".

The backslash (\) is also used when you wish to match a special character literally. For example, if you wish to match the symbol "$" literally instead of having it signal the end of the string, backslash it: /\$/
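
As a rough illustration of the escapes above, the sketch below checks a few of them against sample strings. The sample strings and variable names are made up for this example. Note that the octal form is a legacy syntax that is rejected when the u flag is used, so the hex or Unicode forms are usually the safer choice.

```javascript
// Escapes for characters that are hard to type or that have a special meaning.
const hexParen = /\x28/;     // hex escape for "("
const octalParen = /\50/;    // legacy octal escape for "(" (not allowed with the u flag)
const poundSign = /\u00A3/;  // Unicode escape for "£"

console.log(hexParen.test("f(x)"));        // true
console.log(octalParen.test("f(x)"));      // true
console.log(poundSign.test("Price: £5"));  // true

// An escaped $ is a literal dollar sign; an unescaped $ still means "end of string".
const literalDollar = /\$5/;
const endAnchorThenFive = /$5/;
console.log(literalDollar.test("Cost: $5"));     // true
console.log(endAnchorThenFive.test("Cost: $5")); // false

// \n and \t work as separators when splitting text.
const splitLines = "line one\nline two".split(/\n/);
const splitFields = "name\tage".split(/\t/);
console.log(splitLines.length, splitFields.length); // 2 2
```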

Character Classes

Symbol Description Example
 [xyz] Match any one character enclosed in the character set. You may use a hyphen to denote a range. For example, /[a-z]/ matches any lowercase letter in the alphabet, and /[0-9]/ matches any single digit. /[AN]BC/ matches "ABC" and "NBC" but not "BBC" since the leading “B” is not in the set.
 [^xyz] Match any one character not enclosed in the character set. The caret indicates that none of the listed characters may match. /[^AN]BC/ matches "BBC" but not "ABC" or "NBC".

NOTE: the caret used within a character class is not to be confused with the caret that denotes the beginning of a string. Negation is only performed within the square brackets.
 . (Dot). Match any character except newline or another Unicode line terminator. /b.t/ matches "bat", "bit", "bet" and so on.
 \w Match any single alphanumeric character, including the underscore. Equivalent to [a-zA-Z0-9_]. /\w/ matches "2" in "200%"; /\w+/ matches "200".
 \W Match any single non-word character. Equivalent to [^a-zA-Z0-9_]. /\W/ matches "%" in "200%"
 \d Match any single digit. Equivalent to [0-9].
 \D Match any single non-digit. Equivalent to [^0-9]. /\D/ matches "N" in "No 342222"; /\D+/ matches "No " (including the trailing space).
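 \s Match any single whitespace character. Roughly equivalent to [ \t\r\n\v\f]; in JavaScript it also matches other Unicode whitespace, such as the non-breaking space.
 \S Match any single non-whitespace character. The complement of \s, roughly equivalent to [^ \t\r\n\v\f].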
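
A minimal sketch of the character classes above, using the standard match() and test() methods. The variable names are only for illustration. Note that without a repetition symbol, a class such as \w or \D consumes exactly one character.

```javascript
// Character sets and negated sets.
console.log(/[AN]BC/.test("NBC"));  // true
console.log(/[AN]BC/.test("BBC"));  // false
console.log(/[^AN]BC/.test("BBC")); // true

// Shorthand classes match ONE character unless a repetition symbol follows.
const firstWordChar = "200%".match(/\w/)[0];      // "2"
const wordRun = "200%".match(/\w+/)[0];           // "200"
const firstNonWord = "200%".match(/\W/)[0];       // "%"
const digitRun = "No 342222".match(/\d+/)[0];     // "342222"
const nonDigitRun = "No 342222".match(/\D+/)[0];  // "No " (with the trailing space)
console.log(firstWordChar, wordRun, firstNonWord, digitRun);

// The dot matches any character except a line terminator.
const dotPattern = /b.t/;
console.log(dotPattern.test("bit"));   // true
console.log(dotPattern.test("b\nt"));  // false
```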
 

Repetition

Symbol Description Example
{x} Match exactly x occurrences of a regular expression. /\d{5}/ matches 5 digits.
{x,} Match x or more occurrences of a regular expression. /\s{2,}/ matches at least 2 whitespace characters.
{x,y} Match at least x and at most y occurrences of a regular expression. /\d{2,4}/ matches at least 2 but no more than 4 digits.
? Match zero or one occurrences. Equivalent to {0,1}. /a\s?b/ matches "ab" or "a b".
* Match zero or more occurrences. Equivalent to {0,}. /we*/ matches "w" in "why" and "wee" in "between", but nothing in "bad"
+ Match one or more occurrences. Equivalent to {1,}. /fe+d/ matches both "fed" and "feed"
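
The repetition symbols can be checked with a sketch along these lines. The sample strings "Call 555 now" and "a1234567" are invented for the example. It also shows that these symbols are greedy by default: {2,4} takes as many digits as it can, up to four.

```javascript
// Exact and bounded repetition.
const tripleFive = "Call 555 now".match(/5{3}/)[0];   // "555"
const twoToFour = "a1234567".match(/\d{2,4}/)[0];     // "1234" (greedy: takes up to 4)
console.log(tripleFive, twoToFour);

// ? means zero or one occurrence.
const optionalSpace = /a\s?b/;
console.log(optionalSpace.test("ab"));   // true
console.log(optionalSpace.test("a b"));  // true

// * means zero or more occurrences.
const weStar = /we*/;
console.log("why".match(weStar)[0]);      // "w"
console.log("between".match(weStar)[0]);  // "wee"
console.log("bad".match(weStar));         // null

// + means one or more occurrences.
const fePlusD = /fe+d/;
console.log(fePlusD.test("fed"));   // true
console.log(fePlusD.test("feed"));  // true
console.log(fePlusD.test("fd"));    // false
```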

Alternation & Grouping

Symbol Description Example
( ) Groups characters together to create a clause. May be nested. /(abc)+(def)/ matches one or more occurrences of "abc" followed by one occurrence of "def".
| Alternation combines clauses into one regular expression and then matches any of the individual clauses. Similar to an "OR" statement. /(ab)|(cd)|(ef)/ matches "ab" or "cd" or "ef".
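
Here is a small sketch of grouping and alternation. The sample strings and the /gr(a|e)y/ pattern are illustrative additions, not part of the table. The array returned by match() holds the whole match at index 0 and the text captured by each set of parentheses after it; a group that did not take part in the match is undefined.

```javascript
// A repeated group followed by a second group.
const repeatedGroup = "abcabcdef".match(/(abc)+(def)/);
console.log(repeatedGroup[0]); // "abcabcdef"
console.log(repeatedGroup[1]); // "abc" (the last repetition of the first group)
console.log(repeatedGroup[2]); // "def"

// Alternation: only the clause that matched fills its group.
const eitherClause = "xxcdxx".match(/(ab)|(cd)|(ef)/);
console.log(eitherClause[0]); // "cd"
console.log(eitherClause[1]); // undefined
console.log(eitherClause[2]); // "cd"

// Alternation inside a group limits the "OR" to part of the pattern.
const grayOrGrey = /gr(a|e)y/;
console.log(grayOrGrey.test("gray")); // true
console.log(grayOrGrey.test("grey")); // true
console.log(grayOrGrey.test("groy")); // false
```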

Backreferences

Symbol Description Example
( )\n Refers back to the text matched by a parenthesized clause earlier in the pattern. n is the number of the clause, counted from the left. /(\w+)\s+\1/ matches any word that occurs twice in a row, such as "hubba hubba." The \1 denotes that the text after the space must be identical to whatever the first set of parentheses matched. If the pattern contains more than one set of parentheses, use \2, \3, and so on to refer to the corresponding grouping to the left of the backreference. \1 through \9 refer to the first nine groupings.
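
The backreference example can be sketched as follows. One caveat worth seeing in practice: without word boundaries, the pattern can also match inside words, so "this is" contains the match "is is". The \b-anchored variant and the sample strings are assumptions added for illustration, not part of the table above.

```javascript
// \1 must repeat exactly what the first group captured.
const doubled = /(\w+)\s+\1/;
const hubba = "hubba hubba".match(doubled);
console.log(hubba[0]); // "hubba hubba"
console.log(hubba[1]); // "hubba"

// Caveat: the match may start in the middle of a word.
const partial = "this is fine".match(doubled);
console.log(partial[0]); // "is is"

// Adding \b on both sides restricts the match to whole words.
const doubledWord = /\b(\w+)\s+\1\b/;
console.log(doubledWord.test("this is fine")); // false
console.log(doubledWord.test("hubba hubba"));  // true

// In a replacement string, $1 inserts the captured text.
const collapsed = "hubba hubba".replace(doubledWord, "$1");
console.log(collapsed); // "hubba"
```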

Pattern Switches

In addition to the pattern-matching characters, you can use switches to make the match global, case-insensitive, or both. Switches are added to the very end of a regular expression.

Property Description Example
 i Ignore the case of characters. /The/i matches "the" and "The" and "tHe"
 g Global search for all occurrences of a pattern /ain/g matches both "ain"s in "No pain no gain", instead of just the first.
 gi Global search, ignore case. /it/gi matches all "it"s in "It is our IT department"
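
The effect of the switches can be seen in a short sketch like this one, which uses the standard match() method of strings. With the g switch, match() returns an array of every match; without it, only the first match is reported, along with its position. The variable names are illustrative.

```javascript
// i: ignore case.
console.log(/The/i.test("tHe")); // true
console.log(/The/.test("tHe"));  // false

// Without g, only the first occurrence is reported.
const firstAin = "No pain no gain".match(/ain/);
console.log(firstAin[0], firstAin.index); // "ain" 4

// g: every occurrence is returned.
const allAin = "No pain no gain".match(/ain/g);
console.log(allAin); // [ 'ain', 'ain' ]

// gi: every occurrence, regardless of case.
const allIt = "It is our IT department".match(/it/gi);
console.log(allIt); // [ 'It', 'IT' ]
```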