Using Regular Expressions

By Kevin Skoglund

Basics

  • Symbols representing a text pattern (which are interpreted by a regex processor)
  • The processor is used for matching, searching, and replacing text
  • Common flags: g = global, i = case insensitive, m = multiline
  • Regex engines are eager (they return the earliest match they can find) and greedy (a quantifier matches as much as possible before giving control to the next part of the expression); lazy quantifiers match as little as possible
  • Metacharacters have special meaning, some are:
    • . any character except new line
    • \ escape next character
    • \t tab character
    • \r, \n, \r\n line returns
  • Character Set matches one of several characters
    • Ex. [aeiou] – matches any one vowel
    • Metacharacters inside character sets are already escaped
      • Exceptions: ] - ^ \
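  • Character Ranges represent all characters between two endpoints, ex. [a-z] (the - is a range operator only inside a character set; elsewhere it is a literal hyphen)
    • ^ not any one of several characters (only when it is the first character in a character set), ex. [^aeiou] matches any one character that is not a vowel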
  • Shorthand Character Sets
    • \d digit [0-9]
    • \w word character [0-9a-zA-Z_]
    • \s whitespace [ \t\r\n]
    • \D not digit [^0-9]
    • \W not word [^0-9a-zA-Z_]
    • \S not whitespace [^ \t\r\n]
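
As a quick illustration of character sets and their shorthands, here is a minimal JavaScript sketch. The sample strings and variable names are made up for the example.

```javascript
// Hypothetical sample text, chosen only for illustration.
const sampleText = 'Order 66, aisle_7';

// A character set matches exactly one character from the set.
const vowels = 'regex'.match(/[aeiou]/g);      // ['e', 'e']
// A leading ^ negates the set.
const nonVowels = 'regex'.match(/[^aeiou]/g);  // ['r', 'g', 'x']

// Shorthand sets.
const digits = sampleText.match(/\d/g);        // ['6', '6', '7']
const words = sampleText.match(/\w+/g);        // ['Order', '66', 'aisle_7']
const chunks = sampleText.match(/\S+/g);       // ['Order', '66,', 'aisle_7']
```

Note that the comma is not a word character, so \w+ drops it while \S+ keeps it.
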
  • Repetition Metacharacters
    • * match preceding item zero or more times
    • + match preceding item one or more times
    • ? match preceding item zero or one time
  • Quantified Repetition Metacharacters
    • { start quantified repetition of preceding item
    • } end quantified repetition of preceding item
    • ex. \d{4,8} matches numbers with 4 to 8 digits
    • ex. \d{4} matches numbers with exactly 4 digits
    • ex. \d{4,} matches numbers with 4 or more digits
    • No spaces are allowed inside the braces
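
The quantified forms can be checked with a small JavaScript sketch. The ^ and $ anchors are added here (an assumption of this example) so that the whole string has to satisfy the digit count.

```javascript
// Anchored so the entire string must consist of the quantified digits.
const fourToEight = /^\d{4,8}$/;
const exactlyFour = /^\d{4}$/;
const fourOrMore = /^\d{4,}$/;

console.log(fourToEight.test('12345'));     // true
console.log(fourToEight.test('123'));       // false (too short)
console.log(fourToEight.test('123456789')); // false (too long)
console.log(exactlyFour.test('2024'));      // true
console.log(exactlyFour.test('20245'));     // false
console.log(fourOrMore.test('123456789'));  // true

// With a space inside the braces, JavaScript (without the u flag)
// treats the braces as literal text rather than a quantifier.
const spacedBraces = /^\d{4, 8}$/;
console.log(spacedBraces.test('12345'));    // false
```
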
  • Lazy Expressions
    • ? placed after a quantifier makes that quantifier lazy, ex. *? +? ?? {4,8}?
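
The difference between greedy and lazy quantifiers is easiest to see on the same input. A minimal JavaScript sketch, using a made-up sample string:

```javascript
// Hypothetical input with two quoted words.
const quoted = 'say "one" and "two"';

// Greedy: .+ runs as far as it can, then backs up to the last closing quote.
const greedyMatch = quoted.match(/".+"/)[0];

// Lazy: .+? grows one character at a time and stops at the first closing quote.
const lazyMatch = quoted.match(/".+?"/)[0];

console.log(greedyMatch); // "one" and "two"
console.log(lazyMatch);   // "one"
```
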
  • Grouping Metacharacters
    • ( start grouped expression
    • ) end grouped expression
    • Ex. (abc)+ matches abc and abcabcabc
    • (expression) captures a group for use in matching and replacing (Capturing Group)
    • (?:expression) is a Non-Capturing Group
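
To show what capturing changes in practice, here is a small JavaScript sketch. The date string and the variable names are illustrative assumptions.

```javascript
// Hypothetical ISO-style date used only for the example.
const dateText = '2024-03-15';
const datePattern = /(\d{4})-(\d{2})-(\d{2})/;

// Each pair of parentheses stores its match at index 1, 2, 3, ...
const captured = dateText.match(datePattern);
console.log(captured[1], captured[2], captured[3]); // 2024 03 15

// Captured groups can be reused in a replacement as $1, $2, $3.
const swapped = dateText.replace(datePattern, '$3/$2/$1');
console.log(swapped); // 15/03/2024

// A non-capturing group still groups for repetition but stores nothing.
const nonCapturing = 'abcabc'.match(/(?:abc)+/);
console.log(nonCapturing[0]);     // abcabc
console.log(nonCapturing.length); // 1 (only the full match)
```
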
  • Alternation Metacharacter
    • | match previous or next expression
  • Start and End Anchors
    • ^ start of string/line
    • $ end of string/line
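
The anchors interact with the g, i, and m flags listed earlier. A minimal JavaScript sketch, assuming a made-up three-line string:

```javascript
// Hypothetical multi-line input.
const lines = 'Apple\napple\nBanana';

// g finds every match instead of only the first one.
const caseSensitive = lines.match(/apple/g);    // ['apple']
// i ignores case.
const caseInsensitive = lines.match(/apple/gi); // ['Apple', 'apple']

// Without m, ^ and $ anchor to the whole string, so nothing matches here.
const wholeString = lines.match(/^\w+$/g);      // null
// With m, ^ and $ anchor to each line.
const perLine = lines.match(/^\w+$/gm);         // ['Apple', 'apple', 'Banana']
```
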
  • Word Boundaries
    • \b word boundary (start/end of word)
    • \B not a word boundary
    • Spaces are not word boundaries; a boundary is a zero-width position on either side of a word, not a character
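
Because word boundaries are positions rather than characters, it can help to make them visible. A small JavaScript sketch with made-up sample strings:

```javascript
// Hypothetical input containing "cat" as a word and inside other words.
const animals = 'cat concat category';

const wholeWord = animals.match(/\bcat\b/g);  // ['cat'] (the standalone word only)
const wordStart = animals.match(/\bcat/g);    // ['cat', 'cat'] (cat, category)
const insideWord = animals.match(/\Bcat/g);   // ['cat'] (the one inside concat)

// Mark every boundary position with a pipe: the space itself is not marked,
// the positions on either side of each word are.
const marked = 'one cat'.replace(/\b/g, '|');
console.log(marked); // |one| |cat|
```
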
  • Back References
    • \1 through \9 backreference for positions 1 to 9 (stored result of (expression))
    • Ex. <(i|em)>.+?</\1> matches <i>hello</i> and <em>Hello</em>
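
The tag example can be tried directly in JavaScript. In this sketch the forward slash is escaped because it sits inside a regex literal; the variable name is an assumption of the example.

```javascript
// \1 must repeat exactly what the first group (i or em) captured.
const tagPattern = /<(i|em)>.+?<\/\1>/;

console.log(tagPattern.test('<i>hello</i>'));   // true
console.log(tagPattern.test('<em>Hello</em>')); // true
console.log(tagPattern.test('<i>hello</em>'));  // false (closing tag differs)
```
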
  • Assertions
    • Lookahead checks whether an expression matches at the current position without including it in the match (Lookbehind, (?<=regex) and (?<!regex), is supported in JS only since ES2018)
    • Positive Lookahead (?=regex)
      • Ex. sea(?=shore) matches “sea” in “seashore” but not “seaside”
    • Negative Lookahead (?!regex)
      • Ex. sea(?!shore) matches “sea” in “seaside” but not “seashore”
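
The two lookahead examples can be sketched in JavaScript as follows. The variable names are illustrative.

```javascript
const seaBeforeShore = /sea(?=shore)/;
const seaNotBeforeShore = /sea(?!shore)/;

// The lookahead is checked but not consumed: only "sea" is in the match.
const positiveMatch = 'seashore'.match(seaBeforeShore)[0];
console.log(positiveMatch);                      // sea
console.log(seaBeforeShore.test('seaside'));     // false

console.log(seaNotBeforeShore.test('seaside'));  // true
console.log(seaNotBeforeShore.test('seashore')); // false
```
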
  • Unicode Metacharacter
    • \u matches a Unicode character by its four-digit hexadecimal code point, ex. \u0065 matches "e" (code point U+0065)
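
A minimal JavaScript sketch of the \u escape, with made-up test strings. The string 'caf\u00e9' is written with an escape so the accented character is unambiguous.

```javascript
// \u0065 is the code point for the letter "e".
const letterE = /\u0065/;
console.log(letterE.test('regex')); // true
console.log(letterE.test('RGX'));   // false

// \u00e9 is "e" with an acute accent.
const eAcute = /\u00e9/;
console.log(eAcute.test('caf\u00e9')); // true
console.log(eAcute.test('cafe'));      // false
```
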