 Sayyiku

JS Regular Expression Techniques

What is a regular expression? In one sentence: A regular expression is a matching pattern that matches either characters or positions.

Character Matching#

Fuzzy Matching#

Regular expressions can achieve fuzzy matching in addition to precise matching, which can be divided into horizontal fuzzy matching and vertical fuzzy matching.

Horizontal Fuzzy Matching#

Horizontal fuzziness means that the length of the string a regular expression can match is not fixed. This is implemented using quantifiers, such as {m,n}, which indicates that a character must appear at least m times and at most n times consecutively. The quantifier must be written without a space: in JavaScript, {m, n} is not treated as a quantifier.

const regex = /ab{2,4}c/g
const string = 'abc abbc abbbc abbbbc abbbbbc'
console.log(string.match(regex)) // ["abbc", "abbbc", "abbbbc"]

The g modifier in the regular expression indicates global matching, emphasizing "all" rather than "the first".

// Without global modifier
const regex = /ab{2,4}c/
const string = 'abc abbc abbbc abbbbc abbbbbc'
console.log(string.match(regex))
// ["abbc", index: 4, input: "abc abbc abbbc abbbbc abbbbbc", groups: undefined]

Vertical Fuzzy Matching#

Vertical fuzziness refers to the fact that for a string matched by a regular expression, a specific character position can be any character rather than a specific one. This is implemented using character classes, such as [abc], which indicates that the character can be any one of "a", "b", or "c".

const regex = /a[123]b/g
const string = 'a0b a1b a2b a3b a4b'
console.log(string.match(regex)) // ["a1b", "a2b", "a3b"]

Character Classes#

Although called character classes, they actually match only a single character. For example, the character class [abc] matches only one character. Character classes can be expressed using range notation, exclusion, and shorthand.

Range Notation#

The character class [0-9a-zA-Z] represents any character that is a digit or a letter (uppercase or lowercase). Since the hyphen "-" has a special meaning inside a character class, to match any character among "a", "-", and "z", write [-az], [az-], or [a\-z]: the hyphen must be at the beginning, at the end, or escaped.
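
As a small illustration of the hyphen rule, the sketch below compares the three literal-hyphen forms with an ordinary range. The variable names and the sample string are made up for this example.

```javascript
// Three equivalent ways to put a literal hyphen inside a character class.
const hyphenFirst = /[-az]/g
const hyphenLast = /[az-]/g
const hyphenEscaped = /[a\-z]/g
// For contrast: here the hyphen forms the range "a" through "z".
const range = /[a-z]/g

const sample = 'a-m-z'
console.log(sample.match(hyphenFirst))   // ["a", "-", "-", "z"]
console.log(sample.match(hyphenLast))    // ["a", "-", "-", "z"]
console.log(sample.match(hyphenEscaped)) // ["a", "-", "-", "z"]
console.log(sample.match(range))         // ["a", "m", "z"]
```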

Exclusion Character Classes#

Exclusion character classes (negated character classes) match any character except those listed; for example, [^abc] represents any character except "a", "b", or "c". The ^ (caret) in the first position of the character class indicates negation. The ^ can also be combined with range notation, such as [^a-z].
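
A minimal sketch of negation, using [^abc] and, as an assumed example of negation combined with a range, [^a-z]. The sample string is arbitrary.

```javascript
// A caret in the first position negates the class.
const notAbc = /[^abc]/g
// Negation combined with range notation: anything that is not a lowercase letter.
const notLowercase = /[^a-z]/g

const text = 'abcd-E1'
console.log(text.match(notAbc))       // ["d", "-", "E", "1"]
console.log(text.match(notLowercase)) // ["-", "E", "1"]
```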

Shorthand Notation#

The shorthand notation for regular expressions is as follows:

Character Class   Meaning
\d                [0-9], represents digits
\D                [^0-9], represents non-digits
\w                [0-9a-zA-Z_], represents digits, letters, and underscores
\W                [^0-9a-zA-Z_], represents non-word characters
\s                [ \t\v\n\r\f], represents whitespace characters
\S                [^ \t\v\n\r\f], represents non-whitespace characters
.                 Wildcard

Note: [ \t\v\n\r\f] represents space, horizontal tab, vertical tab, newline, carriage return, and form feed respectively.

The wildcard . can represent almost any character, except for newline, carriage return, line separator, and paragraph separator. If you want to match any character, you can use combinations like: [\d\D], [\w\W], [\s\S], and [^].
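
The following sketch shows the wildcard failing on a newline and two of the "match anything" combinations succeeding. The last line uses the s (dotAll) flag, which is an ES2018 addition not covered in the text above and is included only for comparison.

```javascript
const twoLines = 'a\nb'

// The wildcard does not match the newline between "a" and "b".
const withDot = /a.b/.test(twoLines)
// Combining a shorthand with its complement matches any character at all.
const withComplement = /a[\s\S]b/.test(twoLines)
// An empty negated class also matches any character in JavaScript.
const withEmptyNegation = /a[^]b/.test(twoLines)
// ES2018 only: the s flag lets the dot match line terminators too.
const withDotAll = /a.b/s.test(twoLines)

console.log(withDot, withComplement, withEmptyNegation, withDotAll)
// false true true true
```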

Quantifiers#

Shorthand Notation#

Quantifier   Meaning
{m,n}        Indicates m to n occurrences
{m,}         At least m occurrences
{m}          Equivalent to {m,m}, indicates exactly m occurrences
?            Equivalent to {0,1}, indicates either occurrence or non-occurrence
+            Equivalent to {1,}, indicates at least 1 occurrence
*            Equivalent to {0,}, indicates any number of occurrences

Greedy Matching and Lazy Matching#

Greedy matching tries to match as much as possible, as shown below:

const regex = /\d{2,5}/g
const string = '123 1234 12345 123456'
console.log(string.match(regex))
// ["123", "1234", "12345", "12345"]

By adding ? after the quantifier, lazy matching can be achieved, which tries to match as little as possible, as shown below:

const regex = /\d{2,5}?/g
const string = '123 1234 12345 123456'
console.log(string.match(regex))
// ["12", "12", "34", "12", "34", "12", "34", "56"]

Multiple Choice Branches#

Multiple choice branches can support multiple sub-patterns, any of which can be chosen. The specific form is: (p1|p2|p3), where p1, p2, and p3 are sub-patterns separated by | (pipe), indicating any one of them. Note: Multiple choice branches are lazily matched from left to right; once a match is successful, subsequent patterns are not attempted. The matching result can be changed by altering the order of the sub-patterns.

const regex = /good|goodbye/g
const string = 'goodbye'
console.log(string.match(regex))
// ["good"]
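
To see how the order of sub-patterns changes the result, the sketch below runs both orderings against the same string.

```javascript
const text = 'goodbye'

// "good" is tried first and succeeds, so "goodbye" is never attempted.
const shortFirst = text.match(/good|goodbye/g)
// Putting the longer branch first lets it win.
const longFirst = text.match(/goodbye|good/g)

console.log(shortFirst) // ["good"]
console.log(longFirst)  // ["goodbye"]
```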

Practical Applications#

Matching File Paths#

The file path format is like drive:\folder\folder\folder\. To match the drive, use: [a-zA-Z]:\\. To match file names or folder names, which cannot contain certain special characters, an exclusion character class is needed to represent valid characters, and the file name or folder name cannot be empty, requiring at least one character, thus using the quantifier +. Folders can appear any number of times, and the last part may be a file rather than a folder, so it does not need to end with \\.

const regex = /^[a-zA-Z]:\\([^\\:*<>|"?\r\n/]+\\)*([^\\:*<>|"?\r\n/]+)?$/
console.log(regex.test('F:\\study\\regular expression.pdf')) // true

Matching ID#

const regex = /id=".*"/
const string = '<div id="container" class="main"></div>'
console.log(string.match(regex)[0])
// id="container" class="main"

The wildcard . can match double quotes, and the quantifier * is greedy, so the match runs on to the last double quote in the string. It can be changed to lazy matching:

const regex = /id=".*?"/

However, the lazy version is still inefficient, because matching it involves backtracking. A better solution uses a negated character class:

const regex = /id="[^"]*"/
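
Putting the three variants side by side on the same input makes the difference visible. This only compares results; it does not measure performance.

```javascript
const html = '<div id="container" class="main"></div>'

const greedy = /id=".*"/      // runs on to the last double quote
const lazy = /id=".*?"/       // stops at the first closing quote
const negated = /id="[^"]*"/  // cannot cross a double quote at all

console.log(html.match(greedy)[0])  // id="container" class="main"
console.log(html.match(lazy)[0])    // id="container"
console.log(html.match(negated)[0]) // id="container"
```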

Position Matching#

The Concept of Position#

A position (anchor) is the space between adjacent characters. A position can be understood as an empty string. In ES5, there are six anchors: ^, $, \b, \B, (?=p), and (?!p).

  1. ^ matches the beginning; in multiline matching, it matches the beginning of a line.
  2. $ matches the end; in multiline matching, it matches the end of a line.
  3. \b matches a word boundary, i.e., the position between \w and \W, between \w and ^, or between \w and $.
  4. \B matches a non-word boundary.
  5. (?=p) is a positive lookahead assertion, matching the position before pattern p.
  6. (?!p) is a negative lookahead assertion, matching the position not before p.
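
One way to make positions visible is to replace each matched position with a marker character. The sketch below does this for several of the anchors listed above; the sample strings and the "#" marker are arbitrary choices.

```javascript
const word = 'hello'

// ^ and $ match the start and end of the whole string.
const startEnd = word.replace(/^|$/g, '#')
// With the m flag they match the start and end of every line.
const perLine = 'I\nlove'.replace(/^|$/gm, '#')
// \b marks every boundary between a word character and a non-word character.
const boundaries = '[JS] Lesson_01.mp4'.replace(/\b/g, '#')
// (?=l) matches each position that is followed by "l".
const beforeL = word.replace(/(?=l)/g, '#')
// (?!l) matches each position that is not followed by "l".
const notBeforeL = word.replace(/(?!l)/g, '#')

console.log(startEnd)                // #hello#
console.log(JSON.stringify(perLine)) // "#I#\n#love#"
console.log(boundaries)              // [#JS#] #Lesson_01#.#mp4#
console.log(beforeL)                 // he#l#lo
console.log(notBeforeL)              // #h#ell#o#
```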

Practical Applications#

Inserting Thousand Separators#

Thousand separators are inserted before every group of three digits, counted from the end of the string, but never at the very beginning.

const result = '123456789'
const regex = /(?!^)(?=(\d{3})+$)/g
console.log(result.replace(regex, ','))
// 123,456,789

Password Validation#

The password length is 6-12 characters, consisting of digits and letters (both uppercase and lowercase), but must include at least 2 types of characters. First, consider matching 6-12 characters of digits and letters:

const regex = /^[0-9A-Za-z]{6,12}$/

Then, it needs to check that at least two types of characters are included, which can be done in two ways.

The first method: first check if it contains digits, which can be represented as follows:

const regex = /(?=.*[0-9])^[0-9A-Za-z]{6,12}$/

The key part is (?=.*[0-9])^. The lookahead matches a position that is followed by any number of characters and then a digit, and ^ requires that position to be the beginning of the string; because a position can be treated as an empty string, the position before the beginning is still the beginning. Together they assert that the string contains a digit. Similarly, if it needs to include both digits and uppercase letters, it can be represented as:

const regex = /(?=.*[0-9])(?=.*[A-Z])^[0-9A-Za-z]{6,12}$/

The final regex can be represented as:

const regex = /((?=.*[0-9])(?=.*[A-Z])|(?=.*[0-9])(?=.*[a-z])|(?=.*[A-Z])(?=.*[a-z]))^[0-9A-Za-z]{6,12}$/

const str1 = '123456'
const str2 = '123456a'
const str3 = 'abcdefgA'
console.log(str1, regex.test(str1)) // false
console.log(str2, regex.test(str2)) // true
console.log(str3, regex.test(str3)) // true

The second method: "at least two types of characters" means it cannot be all digits, all uppercase letters, or all lowercase letters. To indicate it cannot be all digits, it can be represented as follows:

const regex = /(?!^[0-9]{6,12}$)^[0-9A-Za-z]{6,12}$/

So the final regex can be represented as:

const regex = /(?!^[0-9]{6,12}$)(?!^[A-Z]{6,12}$)(?!^[a-z]{6,12}$)^[0-9A-Za-z]{6,12}$/
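
As a quick check of the second method, the sketch below runs the final regex against a handful of made-up passwords, including ones that are too short or contain a disallowed character.

```javascript
const passwordRegex = /(?!^[0-9]{6,12}$)(?!^[A-Z]{6,12}$)(?!^[a-z]{6,12}$)^[0-9A-Za-z]{6,12}$/

// Hypothetical sample inputs chosen to exercise each rule.
const samples = [
  '123456',   // digits only
  'ABCDEF',   // uppercase only
  'abcdef',   // lowercase only
  '123456a',  // digits and lowercase
  'abcdefgA', // lowercase and uppercase
  'a1',       // too short
  'abc_123',  // underscore is not an allowed character
]

const results = samples.map((s) => passwordRegex.test(s))
console.log(results)
// [false, false, false, true, true, false, false]
```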

The Role of Parentheses#

Grouping and Branching Structures#

Parentheses provide grouping for referencing. There are two types of references: references in JavaScript and references in regular expressions. Grouping and branching structures are the most direct functions of parentheses, emphasizing that what is inside the parentheses is a whole, i.e., providing a sub-expression.

// Grouping case, emphasizing that ab is a whole
const regex1 = /(ab)+/g

// Branching case, emphasizing that the branching structure is a whole
const regex = /this is (ab|cd)/g

Group References#

Using parentheses for grouping allows for data extraction and replacement operations. For example, to extract dates in the format yyyy-mm-dd:

const regex = /(\d{4})-(\d{2})-(\d{2})/
const date = '2018-01-01'
console.log(regex.exec(date))
// console.log(date.match(regex))
// ["2018-01-01", "2018", "01", "01", index: 0, input: "2018-01-01", groups: undefined]

console.log(RegExp.$1, RegExp.$2, RegExp.$3)
// 2018 01 01

Extension: In JavaScript, the exec and match methods have similar functions, with two main differences:

  1. exec is a method of RegExp instances, while match is a method of String instances.
  2. exec returns one match per call, including its capture groups; with the global g modifier, each call resumes from lastIndex. The behavior of match depends on the g modifier: with g it returns all matched strings without capture groups, and without g it behaves the same as exec.
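
The difference under the g modifier can be sketched as follows: match returns every matched string at once, while repeated exec calls walk through the matches one by one and still expose the groups. The sample text is made up.

```javascript
const dateRegex = /(\d{4})-(\d{2})-(\d{2})/g
const text = '2018-01-01 and 2019-12-31'

// With g, match returns all matched strings but drops the capture groups.
const allMatches = text.match(dateRegex)
console.log(allMatches) // ["2018-01-01", "2019-12-31"]

// exec returns one match per call and resumes from lastIndex,
// so a loop can collect the first group (the year) of every match.
const years = []
let found
while ((found = dateRegex.exec(text)) !== null) {
  years.push(found[1])
}
console.log(years) // ["2018", "2019"]
```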

Additionally, grouped parentheses can facilitate replacement operations, such as replacing yyyy-mm-dd with dd-mm-yyyy:

const date = '2018-01-31'
const regex = /^(\d{4})-(\d{2})-(\d{2})$/
const result = date.replace(regex, '$3-$2-$1')
console.log(result) // 31-01-2018

// Equivalent to
const result2 = date.replace(regex, function () {
  return RegExp.$3 + '-' + RegExp.$2 + '-' + RegExp.$1
})

// Equivalent to
const result3 = date.replace(regex, function (match, year, month, day) {
  return day + '-' + month + '-' + year
})

Backreferences#

In addition to referencing groups in JavaScript, backreferences can also be used in regular expressions. For example, to match dates:

const date1 = '2018-01-31'
const date2 = '2018-01.31'
const regex = /\d{4}(-|\/|\.)\d{2}\1\d{2}/
console.log(regex.test(date1)) // true
console.log(regex.test(date2)) // false

If parentheses are nested, groups are numbered in the order in which their opening parentheses appear. There are three tips for backreferences:

  • Tip 1: \10 refers to the 10th group rather than \1 followed by 0. To express the latter, use non-capturing parentheses: (?:\1)0 or \1(?:0).
  • Tip 2: Referencing a group that does not exist does not throw an error (outside Unicode mode); the escape is matched as a character instead. For example, when there are fewer than two groups, \2 is read as an escaped character (the character with code 2) rather than as a backreference.
  • Tip 3: If a quantifier follows a group, the group keeps only the data captured by its last iteration.
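
The numbering of nested groups and the behavior of a quantified group can be checked with a short sketch. The patterns and strings below are illustrative only.

```javascript
// Groups are numbered by the position of their opening parenthesis:
// group 1 is the outer pair, group 2 the first \d, group 3 the last two digits, group 4 the last digit.
const nested = /^((\d)(\d(\d)))\1\2\3\4$/
const nestedMatch = '1231231233'.match(nested)
console.log(nestedMatch.slice(1)) // ["123", "1", "23", "3"]

// A quantified group keeps only what its last iteration captured.
const repeated = '12345'.match(/(\d)+/)
console.log(repeated[0], repeated[1]) // 12345 5

// So a backreference to that group refers to the last digit, not the first.
const withBackref = /(\d)+ \1/
console.log(withBackref.test('12345 1')) // false
console.log(withBackref.test('12345 5')) // true
```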

Non-Capturing Parentheses#

In previous examples, the groups or captured data within parentheses are for later referencing, referred to as capturing groups and capturing branches. If you only want to use the original function of parentheses, you can use non-capturing parentheses (?:p) and (?:p1|p2|p3).
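
A minimal comparison of the two kinds of parentheses: both match the same text, but only the capturing form adds an entry to the result array.

```javascript
const text = 'ababa'

// Capturing: the result holds the full match plus the group's last capture.
const capturing = text.match(/(ab)+/)
// Non-capturing: same grouping for the quantifier, but nothing is stored.
const nonCapturing = text.match(/(?:ab)+/)

console.log(capturing[0], capturing.length)       // abab 2
console.log(nonCapturing[0], nonCapturing.length) // abab 1
```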

Backtracking Principle#

Backtracking, also known as trial and error, is based on the idea of starting from a certain state of the problem (the initial state) and searching for all "states" that can be reached from this state. When a path reaches a "dead end" (cannot proceed), it steps back one or several steps and continues searching from another possible "state" until all "paths" (states) have been explored. This method of continuously "advancing" and "backtracking" to find a solution is called "backtracking".

"Backtracking" is essentially a depth-first search algorithm. For example, using the regular expression /ab{1,3}c/ to match the string 'abbc', the matching process is as follows:

[Figure: Regular Expression Backtracking]

In the image, step 5 is highlighted in red, indicating a match failure. At this point, b{1,3} has matched 2 characters "b", and when attempting to match the third, it finds that the next character is "c". Therefore, it concludes that b{1,3} has matched completely. The state then returns to the previous state, and the sub-expression c is used to match the character "c". At this point, the entire expression matches successfully. Step 6 in the image is referred to as "backtracking".

The above is backtracking in the case of greedy matching; backtracking also exists in lazy matching. Another example of lazy matching is:

const string = '12345'
const regex = /^(\d{1,3}?)(\d{1,3})$/
console.log(string.match(regex))
// => ["12345", "12", "345", index: 0, input: "12345", groups: undefined]

Even though it is lazy matching, to ensure overall matching success, the first group will still allocate one more character, and its overall matching process is as follows:

[Figure: Lazy Regular Expression Backtracking]

Additionally, branching structures can also be viewed as a form of backtracking; when the current branch does not meet the matching conditions, it switches to another branch.

To illustrate backtracking in various situations:

  • Greedy quantifiers try like a buyer haggling over clothes: the price is too high, so ask for a little less; if that doesn't work, ask for a little less again.
  • Lazy quantifiers try like a seller raising the price: the offer is too low, so ask for a little more; if that's still too low, ask for a little more again.
  • Branching structures try like a shopper comparing stores: if one store doesn't work out, switch to another; if that doesn't work either, switch again.

Breakdown of Regular Expressions#

Structures and Operators#

In JavaScript, regular expressions consist of character literals, character classes, quantifiers, anchors, groups, choice branches, and backreferences.

Structure           Description
Character Literal   Matches a specific character, including escaped and non-escaped
Character Class     Matches one of several possible characters
Quantifier          Matches characters that appear consecutively
Anchor              Matches a position
Group               Matches a whole within parentheses
Choice Branch       Matches one of multiple sub-expressions

The operators involved are:

Operator Description       Operator                                     Priority
Escape Character           \                                            1
Parentheses and Brackets   (...), (?:...), (?=...), (?!...), [...]      2
Quantifier Modifiers       {m}, {m,n}, {m,}, ?, *, +                    3
Positions and Sequences    ^, $, \metacharacter, general characters     4
Pipe Symbol                |                                            5

Metacharacters#

The metacharacters used in JavaScript regular expressions include ^, $, ., *, +, ?, |, \, /, (, ), [, ], {, }, =, !, :, and -. To match any of these characters literally, it can be escaped with a backslash.

Constructing Regular Expressions#

The balancing principles for constructing regular expressions are:

  • Match the expected strings
  • Do not match unexpected strings
  • Readability and maintainability
  • Efficiency

Here are a few ways to improve matching efficiency:

  1. Use specific character classes instead of wildcards to reduce backtracking.
  2. Use non-capturing groups, as capturing groups require memory to store captured data from groups and branches.
  3. Isolate required characters, such as changing a+ to aa*; the latter states explicitly that the match must start with the character a.
  4. Extract common parts of branches, such as changing this|that to th(?:is|at).
  5. Reduce the number of branches, such as changing red|read to rea?d.
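
Before applying rewrites like these, it is worth confirming that the new pattern accepts exactly the same strings as the old one. The sketch below checks three of the rewrites above against a small, made-up set of inputs; it says nothing about speed, only about equivalence on these inputs.

```javascript
// Each pair holds an original pattern and its rewritten form (anchored for whole-string tests).
const rewrites = [
  [/^(?:this|that)$/, /^th(?:is|at)$/],
  [/^(?:red|read)$/, /^rea?d$/],
  [/^a+$/, /^aa*$/],
]

// Hypothetical inputs, including near misses and the empty string.
const inputs = ['this', 'that', 'thus', 'red', 'read', 'rad', 'a', 'aaa', '']

// True only if every rewritten pattern agrees with its original on every input.
const allAgree = rewrites.every(([before, after]) =>
  inputs.every((s) => before.test(s) === after.test(s))
)
console.log(allAgree) // true
```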

Regular Expression Programming#

In JavaScript, there are 6 commonly used APIs related to regular expressions, with 4 for string instances and 2 for regular expression instances:

  • String#search
  • String#split
  • String#match
  • String#replace
  • RegExp#test
  • RegExp#exec

When the match and search methods of string instances receive a string argument, they convert it into a regular expression:

const str = '2018.01.31'
console.log(str.search('.'))
// 0
// Needs to be modified to one of the following forms
console.log(str.search('\\.'))
console.log(str.search(/\./))
// 4

console.log(str.match('.'))
// ["2", index: 0, input: "2018.01.31"]
// Needs to be modified to one of the following forms
console.log(str.match('\\.'))
console.log(str.match(/\./))
// [".", index: 4, input: "2018.01.31"]

The four string methods always start matching from index 0 and do not carry lastIndex over from one call to the next. In contrast, the two regular expression methods, exec and test, update lastIndex after each match when the g modifier is set, so the next call resumes from that position.
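
The effect of lastIndex on test can be sketched by reusing one global regex across several calls. The strings are arbitrary; note that each call starts where the previous match ended, even though the input string is different.

```javascript
const globalRegex = /a/g
const steps = []

// Starts at index 0, matches "a" at index 0, lastIndex becomes 1.
steps.push([globalRegex.test('a'), globalRegex.lastIndex])
// Starts at index 1, matches the "a" at index 2, lastIndex becomes 3.
steps.push([globalRegex.test('aba'), globalRegex.lastIndex])
// Starts at index 3, finds no "a" in "bc", fails and resets lastIndex to 0.
steps.push([globalRegex.test('ababc'), globalRegex.lastIndex])

console.log(steps)
// [[true, 1], [true, 3], [false, 0]]
```

Without the g modifier, lastIndex stays at 0 and every test call starts from the beginning of the string.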
