Using Regular Expressions

By Kevin Skoglund

Basics

A regular expression is a set of symbols representing a text pattern, which a regex processor interprets
The processor is used for matching, searching, and replacing text
Common flags: g = global, i = case insensitive, m = multiline
Regex engines are eager: they return the first (leftmost) match as soon as they find one. Quantifiers are greedy by default: they match as much as possible before handing control to the next part of the expression. Lazy quantifiers match as little as possible.
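As a rough illustration of flags and eagerness, here is a small sketch using Python's re module. Python is an assumption here, since the notes do not name a language, and it has no g flag: re.findall plays a similar role, while re.IGNORECASE corresponds to i.

```python
import re

text = "Cat catalog CAT"

# Eager: search stops at the first (leftmost) match it finds.
first = re.search(r"cat", text)

# Similar in spirit to the g flag: collect every match, not just the first.
all_matches = re.findall(r"cat", text)

# Similar to the i flag: ignore case while matching.
insensitive = re.findall(r"cat", text, flags=re.IGNORECASE)

print(first.start(), all_matches, insensitive)
```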
Metacharacters have special meaning, some are:
- . any character except new line
- \ escape next character
- \t tab character
- \r, \n, \r\n line returns
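The metacharacters above can be tried out with a short sketch. It assumes Python's re module, and the sample strings are made up for illustration.

```python
import re

sample = "hat hot h\nt h.t"

# . matches any character except a newline, so "h\nt" is skipped.
any_char = re.findall(r"h.t", sample)

# \. escapes the dot so it only matches a literal period.
literal_dot = re.findall(r"h\.t", sample)

# \t matches a tab character.
columns = re.split(r"\t", "name\tage")

# \r\n, \r and \n cover the common line-return styles.
lines = re.split(r"\r\n|\r|\n", "one\r\ntwo\nthree\rfour")

print(any_char, literal_dot, columns, lines)
```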
Character Set matches one of several characters
- Ex. [aeiou] – matches any one vowel
- Metacharacters inside character sets are treated as literals, so they need no escaping
  - Exceptions: ] - ^ \
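Character Ranges represent all characters between two endpoints, e.g. [a-z] (the - is a metacharacter only inside a character set; elsewhere it is literal)
- ^ negates the set when it is the first character inside it, e.g. [^aeiou] matches any one character that is not a vowel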
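A brief sketch of character sets, ranges, and negation, assuming Python's re module. The input strings are illustrative only.

```python
import re

# A set matches exactly one of the listed characters.
vowels = re.findall(r"[aeiou]", "regex")
spellings = re.findall(r"gr[ae]y", "grey gray groy")

# A range matches any one character between its endpoints.
hex_digits = re.findall(r"[0-9a-f]", "x1f!")

# A leading ^ negates the set.
not_vowels = re.findall(r"[^aeiou]", "regex")

# Most metacharacters are literal inside a set; - must be escaped
# (or placed first or last) to be taken literally.
literal_meta = re.findall(r"[.+]", "a.b+c")
literal_hyphen = re.findall(r"[a\-z]", "a-z b")

print(vowels, spellings, hex_digits, not_vowels, literal_meta, literal_hyphen)
```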
Shorthand Character Sets
- \d digit [0-9]
- \w word character [0-9a-zA-Z_]
- \s whitespace [ \t\r\n]
- \D not digit [^0-9]
- \W not word [^0-9a-zA-Z_]
- \S not whitespace [^ \t\r\n]
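The shorthand sets can be sketched as follows, assuming Python's re module. In Python 3, \d, \w and \s also match Unicode characters by default, so the re.ASCII flag is passed here to stay close to the ASCII definitions listed above.

```python
import re

A = re.ASCII  # restrict \d, \w, \s to their ASCII meanings

digits = re.findall(r"\d", "Room 42", flags=A)
words = re.findall(r"\w+", "snake_case, 2 words!", flags=A)
spaces = re.findall(r"\s", "a b\tc\n", flags=A)

# \D removes everything that is not a digit.
only_digits = re.sub(r"\D", "", "(555) 123-4567", flags=A)

non_word = re.findall(r"\W", "a-b c!", flags=A)
non_space = re.findall(r"\S+", "a b\tc\n", flags=A)

print(digits, words, spaces, only_digits, non_word, non_space)
```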
Repetition Metacharacters
- * match preceding item zero or more times
- + match preceding item one or more times
- ? match preceding item zero or one time
Quantified Repetition Metacharacters
- { start quantified repetition of preceding item
- } end quantified repetition of preceding item
- ex. \d{4,8} matches numbers with 4 to 8 digits (no space inside the braces)
- ex. \d{4} matches numbers with exactly 4 digits
- ex. \d{4,} matches numbers with 4 or more digits
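The repetition metacharacters and quantified forms can be compared in a short sketch, assuming Python's re module. The \b anchors are added here only so that each number is tested as a whole; they are not part of the examples in the notes.

```python
import re

numbers = "12 1234 12345678 123456789"

# * zero or more, + one or more, ? zero or one
star = re.findall(r"ab*", "a ab abbb")
plus = re.findall(r"ab+", "a ab abbb")
optional_u = [bool(re.fullmatch(r"colou?r", w)) for w in ("color", "colour", "colouur")]

# {n} exactly n, {m,n} between m and n, {m,} m or more
exactly_four = re.findall(r"\b\d{4}\b", numbers)
four_to_eight = re.findall(r"\b\d{4,8}\b", numbers)
four_or_more = re.findall(r"\b\d{4,}\b", numbers)

print(star, plus, optional_u, exactly_four, four_to_eight, four_or_more)
```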
Lazy Expressions
- ? placed after a quantifier makes that quantifier lazy (*?, +?, ??, {m,n}?)
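The difference between greedy and lazy quantifiers is easiest to see side by side. This is a minimal sketch assuming Python's re module, with a made-up HTML fragment as input.

```python
import re

html = "<b>one</b> <b>two</b>"

# Greedy: .+ grabs as much as it can, spanning both tags.
greedy = re.findall(r"<b>.+</b>", html)

# Lazy: .+? stops at the first point where the rest of the pattern can match.
lazy = re.findall(r"<b>.+?</b>", html)

# The same contrast on a run of digits.
greedy_digits = re.search(r"\d+", "12345").group()
lazy_digits = re.search(r"\d+?", "12345").group()

print(greedy, lazy, greedy_digits, lazy_digits)
```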
Grouping Metacharacters
- ( start grouped expression
- ) end grouped expression
- Ex. (abc)+ matches abc and abcabcabc
- (expression) captures a group for use in matching and replacing (Capturing Group)
- (?:expression) is a Non-Capturing Group
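Grouping, capturing, and non-capturing groups can be sketched like this, assuming Python's re module. Python writes replacement references as \1 and \2, which may differ from other tools, and the sample strings are illustrative.

```python
import re

# A group can be repeated as a unit.
repeated = bool(re.fullmatch(r"(abc)+", "abcabcabc"))

# Capturing groups store what they matched.
m = re.search(r"(\d{4})-(\d{2})", "Date: 2024-05")
year, month = m.group(1), m.group(2)

# Captured text can be reused in a replacement.
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "Skoglund, Kevin")

# (?:...) groups without capturing, so only the digit is stored.
non_capturing = re.search(r"(?:ab)+(\d)", "ababab7").groups()

print(repeated, year, month, swapped, non_capturing)
```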
Alternation Metacharacter
- | match previous or next expression
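A small sketch of alternation, assuming Python's re module. The last line shows how eagerness interacts with alternation: the first alternative that matches wins, even if a later one would match more.

```python
import re

# | matches either the expression before it or the one after it.
fruits = re.findall(r"apple|orange", "apple pie, orange juice")

# A group limits how far the alternation reaches.
spellings = re.findall(r"w(?:ei|ie)rd", "weird wierd")

# Alternatives are tried left to right; "pea" wins before "peanut" is tried.
first_alternative = re.search(r"pea|peanut", "peanut").group()

print(fruits, spellings, first_alternative)
```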
Start and End Anchors
- ^ start of string/line
- $ end of string/line
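Whether ^ and $ mean "string" or "line" depends on the multiline flag. This sketch assumes Python's re module, where re.MULTILINE corresponds to the m flag.

```python
import re

text = "first line\nsecond line"

# Without the multiline flag, ^ and $ anchor to the whole string.
start_of_string = re.findall(r"^\w+", text)
end_of_string = re.findall(r"\w+$", text)

# With the multiline flag, they also anchor at each line break.
start_of_lines = re.findall(r"^\w+", text, flags=re.MULTILINE)
end_of_lines = re.findall(r"\w+$", text, flags=re.MULTILINE)

print(start_of_string, end_of_string, start_of_lines, end_of_lines)
```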
Word Boundaries
- \b word boundary (start/end of word)
- \B not a word boundary
- A word boundary is a zero-width position, not a character: spaces are not word boundaries (the boundaries sit on either side of each word)
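Word boundaries can be made visible with a short sketch, assuming Python's re module (version 3.7 or later for the re.sub call on an empty match). Replacing every \b with a marker shows that the boundaries fall beside the words, not on the space.

```python
import re

# \b on both sides matches "cat" only as a whole word.
whole_words = re.findall(r"\bcat\b", "cat concat catalog cat.")

# \B requires that there is no boundary, so only the "cat" inside "concat" matches.
inside_word = re.findall(r"\Bcat", "cat concat")

# Mark every boundary position with | to see where they sit.
marked = re.sub(r"\b", "|", "hi there")

print(whole_words, inside_word, marked)
```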
Back References
- \1 through \9 backreference for positions 1 to 9 (stored result of (expression))
- Ex. <(i|em)>.+?</\1> matches <i>hello</i> and <em>Hello</em>
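The tag example above can be checked with a minimal sketch, assuming Python's re module. The doubled-word pattern is an extra illustration, not taken from the notes.

```python
import re

pattern = r"<(i|em)>.+?</\1>"

# \1 must repeat whatever group 1 captured, so the tags have to agree.
italic = re.search(pattern, "<i>hello</i>").group()
emphasis = re.search(pattern, "<em>Hello</em>").group()
mismatched = re.search(pattern, "<i>hello</em>")

# Another common use: finding a word that is repeated twice in a row.
doubled = re.findall(r"\b(\w+) \1\b", "it is is the the end")

print(italic, emphasis, mismatched, doubled)
```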
Assertions
- Lookahead checks that an expression matches next but does not include it in the match (lookbehind was not supported in JavaScript before ES2018)
- Positive Lookahead – (?=regex)
  - Ex. sea(?=shore) matches “sea” in “seashore” but not “seaside”
- Negative Lookahead – (?!regex)
  - Ex. sea(?!shore) matches “sea” in “seaside” but not “seashore”
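The seashore examples can be run as a short sketch, assuming Python's re module. The end position of the match shows that the lookahead text is checked but not consumed.

```python
import re

# Positive lookahead: "sea" only when "shore" follows.
pos_hit = re.search(r"sea(?=shore)", "seashore")
pos_miss = re.search(r"sea(?=shore)", "seaside")

# Negative lookahead: "sea" only when "shore" does not follow.
neg_hit = re.search(r"sea(?!shore)", "seaside")
neg_miss = re.search(r"sea(?!shore)", "seashore")

# The match covers "sea" alone; "shore" is not part of it.
print(pos_hit.group(), pos_hit.end(), pos_miss, neg_hit.group(), neg_miss)
```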
Unicode Metacharacter
- \uXXXX matches the character with the given four-digit hexadecimal Unicode code point, e.g. \u0065 matches “e”
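A minimal sketch of the \u escape, assuming Python's re module, which accepts \uXXXX inside a pattern. Other engines may need a flag or a different syntax for code points above four hex digits.

```python
import re

# \u0065 is the code point for the letter "e".
e_matches = re.findall(r"\u0065", "eleven")

# \u00e9 is the precomposed "e with acute accent".
accent = re.findall(r"\u00e9", "caf\u00e9")

print(e_matches, accent)
```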