I have had a love/hate relationship with regular expressions in the past. Reading or writing a regular expression typically made me feel like I was toying with a broken Rubik's Cube. However, after I would fiddle an expression into submission, almost by accident, and it did its job, I would become enamored with its brevity and power.
It wasn't until I re-adjusted my thoughts on the nature of regular expressions that my fear of them turned into pleasure. This happened when I started thinking about regular expressions as an actual language itself, instead of a value contained within a language. I know it is not technically a language, but studying it like a programming language might just help a developer get over the steep learning curve. In this tutorial I am going to give an explanation of JavaScript regular expressions by dissecting it as if it were a language. From this enlightened perspective, I hope to push you past the insurmountable learning curve that stops most developers from knowing regular expression basics.
Two types of developers will benefit from reading this tutorial. The first is the intermediate JavaScript developer who has not found the right resource for learning regular expressions and, thus, has put off a thorough understanding of the details. The second is an intermediate programmer looking to get a handle on the JavaScript regular expression parser, what is supported by the parser (i.e. meta/shorthand characters), and what methods and properties JavaScript provides for working with regular expressions.
What Is A Regular Expression (aka regex or regexp)?
A regular expression is a special set of symbolic characters and literal characters used for matching character patterns in a string. Consider that when you type literal characters into a Google search input you are searching for a set of characters that match what you entered. That is, you want to know if the characters are found in the source you are searching. In a sense, you are defining a pattern (i.e. the characters entered) which is then used to render results that match that pattern. A regular expression pattern in its simplest form is a literal character search, not unlike a Google search. If you search Google for the word "dog" it will look for web pages that have the character "d", followed by the character "o", followed by the character "g" in their markup. However, a simple character search is a grossly simplified analogy for a regular expression pattern. Consider that Google provides an advanced search UI because simple literal character searches lack power. The advanced Google search UI provides the extra inputs required for more robust searches. Regular expressions, in a sense, are very much like the advanced Google search UI except that, instead of extra UI inputs, regular expressions use meta and shorthand characters. In many cases advanced searching utilities, like those found in popular code editors, take regular expressions as input to further the power of a search.
Let's examine a couple of regular expressions in order to get a firm understanding about the nature and purpose of a regular expression. But first, I am going to review some JavaScript regular expression syntax basics so the code examples are not completely foreign to you.
Constructing A JavaScript Regular Expression Object
In this article we are going to examine regular expressions from the perspective of a JavaScript developer. Below I show the literal (i.e. /dog/) and constructor (i.e. RegExp()) syntax for creating regular expressions in JavaScript.
var literalSyntax = /dog/g; /* /expression/flags(g|i|m) */
var constructorSyntax = new RegExp('dog','g'); /* new RegExp('expression','flags(g|i|m)'); */
The JavaScript literal syntax will be used throughout this article. We will dive into more detail surrounding JavaScript and regular expressions later. For now, make note of the makeup and syntax of a JavaScript regular expression. The expression itself is everything between the two forward slashes (e.g. /expression/) and the flags come after the ending forward slash (e.g. /g). Notice this literal value is not a string like the parameter passed to the RegExp() constructor function.
Note:
The characters // start a single-line comment, so they cannot be used as an empty regular expression literal. To specify an empty regular expression, use: /(?:)/
The Nature & Purpose Of A Regular Expression
Imagine we want to search the sentence, "Dogs go to doggy parks!" for the literal sub string "dog". The pattern we want to search for would be /dog/
and the source we are searching is the string "Dogs go to doggy parks!".
Below I created the JavaScript regular expression /dog/
which is used as the pattern for searching the string "Dogs go to doggy parks!". This is often referred to as a literal character match and is the most basic use of a regular expression.
Using the JavaScript match()
method available to instances created from the JavaScript String()
object, we can search the "Dogs go to doggy parks!" string for a match by passing the match()
method a regular expression value. Notice that match()
returns an array containing the match(es) found in the string. As well, notice the yellow highlighting indicating what the pattern matches in the string.
As you can see the pattern /dog/
does, in fact, find the first set of sequential characters that starts with a "d", followed by an "o", followed by a "g". Notice that our pattern is case sensitive ignoring "Dogs" and matches the first 3 characters of the word "doggy". This demonstrates the default nature of matching characters in a string accomplished by a regular expression pattern.
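As a minimal sketch of the search just described (the variable names here are mine, not from the original editor), match() without the g flag returns only the first match, along with the position where it was found:

```javascript
var source = 'Dogs go to doggy parks!';

// No g flag: match() returns an array describing the first match only.
var firstMatch = source.match(/dog/);

console.log(firstMatch[0]);    // the matched text
console.log(firstMatch.index); // where the match starts in the string
```

The capital "D" in "Dogs" is skipped because the pattern is case sensitive, so the match starts inside "doggy".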
The JSFiddle containing our code example is actually a regular expression mini editor. I will be using this mini editor throughout this tutorial. Try changing either the string or pattern above to test your own regular expression matches. For example, change the inputs so that the expression is /(D|d)ogs?/g. When entering the expression in the mini editor you only need to input the characters between the / and /, because these are assumed by the editor. Make sure you input flags like g into the last input. If you changed the inputs correctly, changing the /dog/ expression to /(D|d)ogs?/g produces a new match (i.e. "Dogs" as well as "dog" becomes highlighted).
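If you do not have the mini editor at hand, the same comparison can be sketched in plain JavaScript (variable names are illustrative):

```javascript
var sentence = 'Dogs go to doggy parks!';

// Literal, case-sensitive search with the g flag.
var caseSensitive = sentence.match(/dog/g);

// "D" or "d", then "og", then an optional "s".
var eitherCase = sentence.match(/(D|d)ogs?/g);

console.log(caseSensitive); // ["dog"]
console.log(eitherCase);    // ["Dogs", "dog"]
```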
Finding the /dog/
characters in the string "Dogs go to doggy parks!" is an example of a very simple regular expression in which literal character combinations are matched. This is only the tip of the iceberg. Beyond simple character matches like the one we just looked at, regular expression patterns can make use of operator-like characters that have special meaning when used in a pattern. These special characters are called meta or shorthand characters while everything else can be considered a literal character. In addition to meta and shorthand characters, regular expressions also take flags (e.g. /g
) which act like configuration values for the entire expression.
Let's examine a complex expression which makes use of meta characters and flags. Consider a scenario where you would like to find all of the email addresses contained in a string of characters. Unlike the first example where we match the literal "dog" characters, an email address presents an unknowable and infinite set of character possibilities. In order to match an email address we have to create an ordered pattern made up of common characters and character types found in an email. Using meta characters we can describe the characters or range of characters we are trying to match, without literally matching a unique email address. A pattern like this, loosely stated in words, could be something like:
@
character.
character
Examine the pattern below which mimics the loose logic I just described for matching a valid email.
As you can see above this pattern makes use of several meta characters (e.g. [
]
, \
, ?
, {
, -
and }
) and the /g
flag. You may find that reading this pattern is rather difficult. Don't worry. You are not alone here. Tools are available that can help decipher what the meta characters are doing. My favorite tool for doing this is debuggex. This is by far the best solution I have seen for breaking down a regular expression into an easily comprehensible format. For example, debuggex provides the following visual representation for the above email expression.
This image aids in the comprehension of the sequences of characters that match the email expression. Precisely, it tells us that our expression will match any characters that:
.
, _
, %
, or -
. @
.
, or -
. .
Make sure you stop here for a second and appreciate the brevity and power that the email expression contains. Consider the power provided by the regular expression vs. having to write pure JavaScript logic alone to validate string characters.
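To make the idea concrete, here is a rough sketch of an email-style pattern. It is not the exact expression from the original example; the pattern, the sample string, and the variable names are my own assumptions, and it is far too loose for real validation:

```javascript
// Hypothetical pattern: local part, "@", domain, ".", then 2 to 4 letters.
var emailPattern = /[a-z0-9._%-]+@[a-z0-9.-]+\.[a-z]{2,4}/gi;

var message = 'Contact bob@example.com or sue.smith@mail.example.org today.';
var emails = message.match(emailPattern);

console.log(emails); // ["bob@example.com", "sue.smith@mail.example.org"]
```

Doing the same with only indexOf() and loops would take considerably more code, which is the brevity the paragraph above is pointing at.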
Regular Expression Interpretation & Logic
Thus far in this overview I have described how to interpret a regular expression in a very specific way (i.e. followed by, followed by), but I have not specifically explained how this is done, or how the testing logic works. I think it is worth saying plainly that you should think of a regular expression pattern as a set of rules tested against individual characters in a string.
Consider the regular expression /[a-z]1/ig which matches any pair of characters in a string that starts with a letter (case insensitive) and is followed by the literal character 1. When I run this pattern against the string "A153f2af1532143f2f1" it will test every character in the string for a match. You can see the /[a-z]1/ig pattern used in the editor below.
Notes:
The i on the end of /[a-z]1/ig means case insensitive and the g means search for multiple matches, not just the first match.
Here the g flag is used, but if the flag is not used, once a match is found the expression engine stops parsing the string and returns the first match.
I want you to meditate on the idea that a regular expression is a language which defines a set of rules. The rules are the expression, and they are applied to each character in a string. When the rules are all true for a character (and surrounding characters), we have a match. What is being matched is not the single character alone, but the current character in consideration of characters in front of or behind it based on expression rules. This means that a match could include none, some, or all of the characters around the current character being tested. If what I just wrote still seems muddy, consider again that the expression /[a-z]1/ig tests each character (from left to right) in the string "A153f2af1532143f2f1". Let's break this down as if the expression logic was telling us what it was doing.
Since the two rules above are true, we have a match. The match would be "A1". Normally the engine would be done, but the /g flag was set so the engine will continue to the end of the string regardless of how many matches it finds. The engine resumes at the character immediately after the match (the "5") and repeats the expression logic. The "5" is a digit, so no match because the "5" fails to be an a to z character. We move to the next character, and the next, until every remaining character in the string is tested against the rules defined by the expression. As you can see above, 3 matches are found.
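One way to watch this happen is to call exec() in a loop and record where each match was found and where the next search starts (lastIndex). This is only a sketch; the found array and its property names are my own:

```javascript
var pattern = /[a-z]1/ig;
var subject = 'A153f2af1532143f2f1';
var found = [];
var result;

// With the g flag, each exec() call continues from pattern.lastIndex.
while ((result = pattern.exec(subject)) !== null) {
  found.push({
    text: result[0],
    index: result.index,
    resumesAt: pattern.lastIndex
  });
}

console.log(found);
// [ { text: 'A1', index: 0,  resumesAt: 2 },
//   { text: 'f1', index: 7,  resumesAt: 9 },
//   { text: 'f1', index: 17, resumesAt: 19 } ]
```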
The regular expression interpretation logic discussed in this section is critical to a proper understanding of regular expression meta and shorthand characters. How the rules are interpreted and the testing is applied is a very knowable procedure. Make sure to re-read this section of the tutorial until what I am explaining is clear. Make sure you are clear on the fact that an expression pattern is tested on each individual character. And that, when an attempt fails part-way through the pattern, the engine goes back and starts a new attempt at the next character, continuing until a match is found or, if the global flag is set, until all characters are examined.
Overview Conclusion
This brief dive into the nature of regular expressions will act as the foundation for what is to follow. For the rest of this tutorial we are going to examine flags, meta characters (aka special characters) and the methods provided by JavaScript which make use of regular expression values. After that we will take a look at a few commonly used regular expressions in JavaScript programs.
Overview Of Flags
Flags tell a pattern how to behave. They are used to control the mode in which the expression is parsed by the regular expression engine. We have already seen most of these flags in use, but you should be aware of all of them.
JavaScript supports 3 flags (newer versions of the language add more, which are not covered here):
Flag | aka | Description |
---|---|---|
g | global matching | This flag indicates to the regular expression engine that all matches should be found, not just the first match. |
i | ignore case | This flag indicates to the regular expression engine to ignore case when parsing a string for matches (this applies to character sets and ranges too, e.g. `/[a-z]/i` also matches "A"). |
m | multiline input | This flag indicates to the regular expression engine that the `^` and `$` meta characters are aware of new lines (i.e. `\n`), so they also match at the start and end of each line and not just at the start and end of the entire string. |
Working With Flags
We have already seen the global flag (e.g. /g
) in use when we discussed finding multiple emails in a string of characters. The global and ignore case flag should be self-explanatory, but the multiline flag requires some explanation.
By default, expressions are in single line mode, which means that characters that create new lines are not considered boundaries in an expression. For instance, if you have the string "Dog bite Dog bite Dog bite" with a carriage return in-between each "Dog bite" so that each one is on its own line, the regular expression engine by default does not consider the carriage return a boundary. The start of the string is a boundary and the end of the string is a boundary and there are no boundaries in-between regardless of new line characters. To demonstrate this default behavior consider the expression /^Dog/g
which matches a sequence of characters at the start of a string that is a capital "D", followed by a lower case "o", followed by a lower case "g". In the code example below this expression finds one match at the beginning of the string.
We can add additional new line boundaries to a string by setting the m
flag telling the expression engine to consider new lines a boundary as well. If we were to add the m
flag, which stands for multiline, to the expression "/^Dog/g" (i.e. "/^Dog/gm") it would match each new line that starts with "D","o", and "g". Shown in the example below is the use of the multiline flag, on the previous "Dog bite Dog bite Dog bite" string containing carriage returns.
Notice that by adding the m flag our expression matches new lines that start with a capital "D", followed by a lower case "o", followed by a lower case "g". The m flag told the expression engine to consider new lines a boundary, in addition to the default start and end boundaries.
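The difference can be sketched as follows. I am assuming the lines are separated by \n characters, and the variable names are illustrative:

```javascript
var lines = 'Dog bite\nDog bite\nDog bite';

// Default mode: ^ only matches at the very start of the string.
var withoutM = lines.match(/^Dog/g);

// Multiline mode: ^ also matches right after each new line.
var withM = lines.match(/^Dog/gm);

console.log(withoutM.length); // 1
console.log(withM.length);    // 3
```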
The use of the m flag affects not only the ^ meta character but also the $ meta character. The $ meta character matches a sequence of characters at the end of a boundary. Using the multiline flag and the $ meta character we can match the characters "bite" if they occur before a new line boundary.
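A small sketch of the $ behavior, again assuming \n line separators and illustrative variable names:

```javascript
var biteLines = 'Dog bite\nDog bite\nDog bite';

// Default mode: $ only matches at the very end of the string.
var endOfString = biteLines.match(/bite$/g);

// Multiline mode: $ also matches just before each new line.
var endOfLines = biteLines.match(/bite$/gm);

console.log(endOfString.length); // 1
console.log(endOfLines.length);  // 3
```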
Notes:
^, $ and the . meta characters.
Flags can be combined (e.g. /dog/gi).
The order of flags does not matter (i.e. /dog/gi and /dog/ig are the same).
The m flag changes the behavior of the ^ and $ meta characters.
Overview Of JavaScript Meta & Shorthand Characters (aka special characters)
Below you will find a table covering a majority of the meta characters and shorthand characters used in JavaScript regular expressions (for a complete table check out Mozilla's reference). I am showing this table, not so that you might learn how these characters specifically work in an expression from the table, but so that you are aware of the characters themselves and briefly what they do. We will be examining many of these characters in the "Working With Meta & Shorthand Characters" section. For now, just acquaint yourself with the fact that these characters have a special meaning in a regular expression and that the characters can be combined into powerful patterns such as what I demonstrated with the email pattern. If it helps, you can click on the jsfiddle link to run and edit the pattern example.
| character(s) | Pattern Ex. | String Ex. (Pattern matches in red) | Example Description & Jsfiddle | Type |
|---|---|---|---|---|
| Meta Characters | | | | |
| ^ | /^Cat/ | Cat go fast | string must start with Cat jsfiddle | anchor |
| $ | /slow$/ | Dogs are slow | string must end with slow jsfiddle | anchor |
| * | /bo*/ | boom and boat and bug | matches when the preceding o occurs 0 or more times jsfiddle | quantifier |
| + | /bo+/ | boom and boat and bug | matches when the preceding o occurs 1 or more times jsfiddle | quantifier |
| ? | /bo?/ | boom and boat and bug | matches when the preceding o occurs 0 or one time jsfiddle | quantifier |
| . | /.a/ | Cats fats rats | any character followed by an a jsfiddle | character class |
| \ | /4\.0/ | I have a 4.0. | escape the next character jsfiddle | escape |
| (...) | /(bug) \1's/ | bug bug's and bug bugs | match bug and store it in a variable called \1 which you can use later in the expression to reference back to the first capture parentheses jsfiddle | group |
| (?:...) | /(?:bug) \1's/ | bug bug's and bug bugs | don't capture (bug) into a variable. So no matches because \1 has no value jsfiddle | group |
| (?=...) | /bug(?='s)/ | bug bug's and bug bugs | matches bug only if bug is followed by 's. Called a lookahead jsfiddle | lookaround |
| (?!...) | /bug(?!'s)/ | bug bug's and bug bugs | matches bug only if bug is not followed by 's. Called a negated lookahead jsfiddle | lookaround |
| ...\|... | /foo\|bar/ | foo and bar is foobar | matches either foo or bar jsfiddle | alternation |
| {...} | /fo{2}/ | foo and fooo and foooooo and fo | matches the previous character exactly 2 times jsfiddle | quantifier |
| {...,} | /fo{2,}/ | foo and fooo and foooooo and fo | matches the previous character at least 2 times jsfiddle | quantifier |
| {...,...} | /fo{2,4}/ | foo and fooo and foooooo and fo | matches the previous character at least 2 times, but no more than 4 times jsfiddle | quantifier |
| [...] | /[cde]\|[456]/ | abcdefghijklmnopqrstuvwxyz 0123456789 | matches any c, d, e character or 4, 5, 6 character jsfiddle | character set |
| [...-...] | /[c-u]\|[4-9]/ | abcdefghijklmnopqrstuvwxyz 0123456789 | matches any character in the range c to u or 4 to 9 jsfiddle | character set |
| [^...] | /[^c-u]/ | abcdefghijklmnopqrstuvwxyz 0123456789 | matches any single character not in the range of c to u jsfiddle | character set |
| Character Shorthands | | | | |
| \b | /\bton\b/ | tone wantons ton toon | match ton only if it has a word boundary on both sides, i.e. it is not directly preceded or followed by a word character ([A-Za-z0-9_]) jsfiddle | anchor |
| \B | /ton\B/ | tone wantons ton toon | match ton only if it is NOT followed by a word boundary, i.e. another word character ([A-Za-z0-9_]) comes directly after it jsfiddle | anchor |
| \d | /\d/ | Match digits 0123456789 | matches a digit character. Equivalent to [0-9] jsfiddle | character class |
| \D | /\D/ | Match non-digits 0123456789 | matches any non-digit character. Equivalent to [^0-9] jsfiddle | character class |
| \s | /\s/ | spaces inbetween these words and line breaks | matches a single white space character, including space, tab, form feed, line feed jsfiddle | character class |
| \S | /\S/ | anything but white space | matches a single character other than white space (i.e. not a space, tab, form feed, or line feed) jsfiddle | character class |
| \w | /\w/ | abc or 123 even _ but not much else | matches any alphanumeric character including the underscore. Equivalent to [A-Za-z0-9_] jsfiddle | character class |
| \W | /\W/ | abc or 123 even _ but not much else | matches any non-word character. Equivalent to [^A-Za-z0-9_] jsfiddle | character class |
Escaping Meta & Shorthand Characters
Meta characters are not always placed into a pattern with the intention of using the characters as special meta characters. When a situation arises where you need to match a meta character itself, you will have to escape the character so that the expression reads the character as a literal character and not a special meta character.
For example, let's say that you need to literally match the meta character * in a string. To do this you will need to escape the * character by adding a \ (i.e. a backslash) before the meta character so that the regular expression engine knows that this character should be interpreted literally and not as a meta character. In the expression editor below try removing the \ and see what happens.
Many of the meta characters already include the \
. For example, the \d
meta character can be used to match any digit character in a string. But what if you need to literally find the characters "\d" in a string? To do this you will have to add another backslash to the front of \d
, like so: \\d
to escape the shorthand usage of \d
by the regular expression parser.
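Both escaping cases above can be sketched in a few lines. The sample strings and variable names are my own, and I use the RegExp() constructor for the unescaped case only so the error can be caught at run time:

```javascript
// Escaped *: matches the literal asterisk characters.
var stars = '2 * 3 * 4'.match(/\*/g); // ["*", "*"]

// Unescaped *: the quantifier has nothing to repeat, so the pattern is invalid.
var unescapedError = null;
try {
  new RegExp('*');
} catch (e) {
  unescapedError = e; // a SyntaxError
}

// This string contains the two characters \d followed later by the digit 7.
var docs = 'Use \\d to match a digit like 7';
var literalBackslashD = docs.match(/\\d/g); // the literal characters \d
var shorthandDigit = docs.match(/\d/g);     // ["7"]
```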
Notes:
The meta characters that need to be escaped in order to be matched literally are: ^, ., [, $, (, ), |, *, +, ?, {, and \.
When using the RegExp() constructor, the backslash itself has to be escaped (i.e. new RegExp("\\d+") is equivalent to the literal expression /\d+/). Notice the extra \. This is because the expression is in a string format and not a literal regular expression format. Because it is a JavaScript string you have to escape the \ with an additional \. I typically use JavaScript literal values so that I do not have to add additional character noise to my expressions.
What follows is a conceptual breakdown of how the most common meta and shorthand characters function within an expression. These characters tell the regular expression parser to behave in a special way, similar to operators found in a programming language. You'll need to grok these special characters before you can properly read and write regular expressions.
Be aware that many of these examples are fictitious and are purposely small and simplistic so the concept can be grasped. I have spent a good deal of time organizing this section into concepts that build on top of each other, so I suggest reading it from top to bottom on your first go around.
Creating Sub Expressions with Parentheses (aka Groups)
The parentheses characters alone have one very basic task: they create sub expressions. This becomes valuable when you need to parse an expression before another expression is parsed. Sub expressions function similarly to the use of parentheses and expressions in JavaScript (i.e. (1+(2-2)) === 1). Groupings created with parentheses in a regular expression get evaluated first, then evaluation moves outward, just like in JavaScript. As well, parentheses not only provide order of parsing control but can be used to make expressions more readable, just like in JavaScript.
When you first start using parentheses the value they provide is not immediately obvious unless you combine them with other meta characters, because not much changes with or without parentheses alone. For example, the patterns /(in)side/g and /inside/g match the same thing. But let's say that you want to match either the sequence of characters "inside" or "side". To do this, you can create a sub expression group and then use the ? meta character, defining the "in" sequence of characters as optional.
Examine the expression below:
If you remove the parentheses so the expression is /in?side/ then the ? only applies to the "n". This makes sense, because we are no longer dealing with the sub expression result, "in". With the parentheses gone, the ? operates only on the "n" which occurs before it. If you add a ? after the "i" so the expression is /i?n?side/ we get our match again, but don't you think the use of the parentheses is more readable (i.e. /i?n?side/ vs. /(in)?side/)?
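Here is a short sketch comparing the three variations. The test strings and variable names are mine; the "nside" case is added to show that /i?n?side/ is slightly more permissive than the grouped version:

```javascript
var sides = 'inside side';

var grouped = sides.match(/(in)?side/g); // ["inside", "side"]
var ungrouped = sides.match(/in?side/g); // ["inside"]

// Making each letter optional on its own also accepts "nside".
var eachOptional = 'nside'.match(/i?n?side/g);  // ["nside"]
var groupOptional = 'nside'.match(/(in)?side/g); // ["side"]
```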
Notes:
Matching One Or Several Sub Expressions (aka Alternation)
Using the |
meta character we can create alternative sub expressions. This simply allows us to give several options for a match. The classic example of using the |
meta character is matching words in a string that could be spelled two different ways. For example my name could be spelled, "codi" or "cody" or "kodi". To match any of these spellings the |
character could permit either an "i" or "y" or an "ie". Below I use the |
character to match all spellings.
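As a rough sketch of alternation (the exact pattern from the original editor is not shown here, so the expressions and strings below are my own assumptions):

```javascript
// Alternatives are tried left to right, so the longer "ie" is listed before "i".
var spellings = 'cody codi codie'.match(/cod(ie|i|y)/g);

// Parentheses limit how far the | reaches.
var grays = 'gray grey'.match(/gr(a|e)y/g);
var ungroupedGrays = 'gray grey'.match(/gra|ey/g);

console.log(spellings);      // ["cody", "codi", "codie"]
console.log(grays);          // ["gray", "grey"]
console.log(ungroupedGrays); // ["gra", "ey"]
```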
Notes:
Inside of [ and ] the | is a normal character.
/gr(a|e)y/ and /gra|ey/ are not the same. We need the parentheses to create a sub expression for "a" and "e", otherwise without the parentheses /gra|ey/ would match either "gra" or "ey".
Matching A Set or Range Of Characters (aka Character Sets)
The [
and ]
square brackets are used to define a set of characters or a range of characters which qualify as a match.
To match a set of characters simply place the characters that will qualify as a match inside of the brackets. Order does not matter. Below I am matching any "a" "c" and "g" character or any "1", "2", "3" and "7" character.
To match ranges, simply place a - character between alpha or numeric characters. The "-" character has a special meaning inside of a set. Below I match a range of alpha characters between "a" and "g" (including a and g) and any numeric characters 1 thru 7 (including 1 and 7).
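The set and range behavior can be sketched like this (the sample string and variable names are illustrative):

```javascript
var sample = 'abcdefgh 0123456789';

// Sets: only the listed characters qualify.
var fromSets = sample.match(/[acg]|[1237]/g);

// Ranges: every character from a to g, or from 1 to 7, qualifies.
var fromRanges = sample.match(/[a-g]|[1-7]/g);

console.log(fromSets.join(''));   // "acg1237"
console.log(fromRanges.join('')); // "abcdefg1234567"
```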
Notes:
/[a-z]/ is not the same as /[A-Z]/. The first one matches lowercase characters in the range and the latter matches uppercase characters in the range.
A ^ meta character at the beginning of the characters inside of the set negates the characters or range. The expression /[^a-z]/g means any character that is not a lowercase a to z character.
Sets and ranges can be combined (e.g. [A-Za-z0-3]).
[10-90] is not 10 to 90; it's actually 1, 0 to 9, and 0. Character sets are only made up of individual characters or two characters representing a range.
The characters that need to be escaped inside of a set are: ], -, ^, and \.
Matching A Specified, Minimum, Or Range Of Occurrences (aka Quantifiers)
The { and } curly brackets are used to quantify the number of occurrences, minimum number of occurrences, or range of occurrences that are required or permitted by a previous character.
For example, to match each run of exactly 2 consecutive a's in the string "aaa aaaa aaa aaaaaa aaa" you can use the {2} quantifier.
To match any sequence of a's where at least 2 a's, to an unlimited number occur, we only need to add a comma after the 2 (i.e. /a{2,}/g
)
If we need to cap off the number of occurrences, and create a range, use two numeric values separated by a comma.
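The three forms can be compared side by side on the same string (variable names are mine):

```javascript
var runs = 'aaa aaaa aaa aaaaaa aaa';

var exactlyTwo = runs.match(/a{2}/g);  // every pair of a's
var twoOrMore = runs.match(/a{2,}/g);  // whole runs of 2 or more a's
var twoToFour = runs.match(/a{2,4}/g); // runs of 2 to 4 a's at a time

console.log(exactlyTwo.length); // 8
console.log(twoOrMore);         // ["aaa", "aaaa", "aaa", "aaaaaa", "aaa"]
console.log(twoToFour);         // ["aaa", "aaaa", "aaa", "aaaa", "aa", "aaa"]
```

Notice that with {2,4} the run of six a's is split into a match of four followed by a match of two.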
In addition to the quantifier brackets we also have the special meta characters *, +, and ? which provide the following curly bracket like defaults.
The * meta character is equivalent to {0,}, matches when the preceding character(s) occurs 0 or more times
The + meta character is equivalent to {1,}, matches when the preceding character(s) occurs 1 or more times
The ? meta character is equivalent to {0,1}, matches when the preceding character(s) occurs 0 or one time
The above meta characters are typically used over the more verbose syntax {...,...}. However, I memorize them by always thinking about the equivalent meaning for them using the {...,...} meta characters.
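A quick sketch confirming the equivalences, reusing the "boom and boat and bug" string from the table earlier (variable names are mine):

```javascript
var text = 'boom and boat and bug';

var star = text.match(/bo*/g);            // ["boo", "bo", "b"]
var starLong = text.match(/bo{0,}/g);

var plus = text.match(/bo+/g);            // ["boo", "bo"]
var plusLong = text.match(/bo{1,}/g);

var optional = text.match(/bo?/g);        // ["bo", "bo", "b"]
var optionalLong = text.match(/bo{0,1}/g);
```

Each short form returns the same array as its curly bracket counterpart.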
Matching A Specific Character Type (aka Character Classes)
The following meta and shorthand characters can be used to match a specific type of character.
. used to match any character except line terminators such as \n (i.e. the wildcard meta character)
\w used to match an alpha or digit character and includes _, equivalent to [a-zA-Z0-9_]
\W used to match anything but an alpha, digit, or _ character, equivalent to [^a-zA-Z0-9_]
\d used to match a digit character, equivalent to [0-9]
\D used to match anything but a digit character, equivalent to [^0-9]
\s used to match a white space character, including space, tab, form feed, line feed; roughly equivalent to [ \t\r\n\f\v] (it also matches other Unicode white space)
\S used to match anything but a white space character; roughly equivalent to [^ \t\r\n\f\v]
To demonstrate these character classes, consider an expression that could be used to validate a 3 character password format. Let's say (silly format, I know) that our password format has the following rules
An expression using character classes to validate this password might look something like:
This expression will break fairly easily but I want you to grok that these character classes are simply shorthands for types of characters that can be matched.
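Since the original rules and expression are not reproduced here, the following is only a sketch under assumed rules of my own: a digit, then a word character, then any non-white-space character. The function name is hypothetical:

```javascript
// Hypothetical 3 character format: digit, word character, non-white-space.
var passwordFormat = /^\d\w\S$/;

function isValidPassword(candidate) {
  return passwordFormat.test(candidate);
}

console.log(isValidPassword('1a!'));  // true
console.log(isValidPassword('a1!'));  // false, does not start with a digit
console.log(isValidPassword('1a '));  // false, ends with white space
console.log(isValidPassword('1a!!')); // false, too long
```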
Notes:
Character classes can be used inside of a character set (e.g. /[\w-]/)
Matching Only If The Entire String Starts With A Specific Sequence of Characters (aka Anchors)
By using the ^
meta character you can tell the regular expression parser to only match if the beginning of the entire string contains a match for the expression.
In the example below, the only match found is the "dogs" at the beginning of the string.
Notice the "dogs" at the end of the string is not matched.
Matching Only If The Entire String Ends With A Specific Sequence of Characters (aka Anchors)
By using the $
meta character, you can tell the regular expression parser to only match if the end of the entire string contains a match for the expression.
In the example below, the only match found is the "dogs" at the end of the string.
Notice the "dogs" at the beginning of the string is not matched.
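Both anchors can be sketched against one string that starts and ends with "dogs" (the string and variable names are my own):

```javascript
var kennel = 'dogs chase dogs';

var atStart = /^dogs/.exec(kennel); // only the "dogs" at the beginning
var atEnd = /dogs$/.exec(kennel);   // only the "dogs" at the end
var allDogs = kennel.match(/dogs/g); // no anchor: both occurrences

console.log(atStart.index);  // 0
console.log(atEnd.index);    // 11
console.log(allDogs.length); // 2
```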
Matching Only If The Entire String Starts and Ends With A Specific Sequence of Characters (aka Anchors)
When you combine the ^
and $
meta characters, it tells the regular expression engine to match only if the entire string is an exact match. This means that a match is only found when the expression matches every character in the string from beginning to end. This makes the expression /dogs/ and /^dogs$/ different because /dogs/ will match the first set of characters in a string that have a "d", followed by an "o", followed by a "g", followed by an "s". While, /^dogs$/ will only match if the entire string starts and ends with "dogs". In the example below, we have a match because the entire string is an exact match.
If we were to add any new characters to the beginning of the string or the end of the string, or in-between, a match would not occur. Typically the ^ and $ anchors are used when you are validating a string, using an expression, looking for a single and exact match, like validating form data such as phone numbers, emails, and passwords.
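A minimal validation-style sketch using test(), which returns true or false (the helper name is hypothetical):

```javascript
function isExactlyDogs(value) {
  // ^ and $ together require the whole string to match.
  return /^dogs$/.test(value);
}

console.log(isExactlyDogs('dogs'));     // true
console.log(isExactlyDogs('dogs!'));    // false
console.log(isExactlyDogs('hot dogs')); // false
console.log(/dogs/.test('hot dogs!'));  // true, no anchors so a partial match is enough
```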
Matching Word (i.e. \w) Boundaries (aka Anchors)
The \b
shorthand is used to match boundaries around word characters. Remember a word character (i.e. \w
) is anything A-Z, a-z, 0-9, or a _. In a sentence of words, you might think that the boundaries are produced by spaces, but strictly speaking, a boundary is anything that is not a word character; not just spaces. In the example below, we can match each word in the sentence by matching anything between boundaries that is a word character, followed by 1 or more occurrences (i.e. \w+
) of a word character.
These boundaries can be a bit confusing. If it helps, in the regular expression editor, replace all of the spaces in the string with a dash "-". Notice that a boundary is still created because a dash is not a word character.
Keep in mind that in a string you have a default boundary at the start of the string and the end of the string. If we were to remove the \b from the start or end (i.e. /\b\w+/ or /\w+\b/) of the above expression it would still match each word, because \w+ greedily consumes every consecutive word character and can only stop where a boundary exists anyway. This, again, might be confusing, so make sure you try both of these variations in the editor and think carefully about why all words in the sentence get matched.
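The variations discussed above can be tried outside the editor too. This sketch uses a short phrase of my own choosing:

```javascript
var phrase = 'regexp are powerful';

var words = phrase.match(/\b\w+\b/g);

// A dash is not a word character, so it still creates boundaries.
var dashed = 'regexp-are-powerful'.match(/\b\w+\b/g);

// Dropping one of the \b anchors still yields whole words here.
var noLeading = phrase.match(/\w+\b/g);
var noTrailing = phrase.match(/\b\w+/g);

console.log(words); // ["regexp", "are", "powerful"]
```

All four arrays contain the same three words.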
Notes:
There is also a \B meta character which matches at any position that is not a word boundary. In the string "regexp are powerful tools in the right hands" the expression /\B\w+\B/g would match ["egex","r","owerfu","ool","h","igh","and"].
Matching If Followed By Specified Characters (aka Lookarounds/lookaheads)
To match a sequence of characters that is followed by a specific sequence of characters we can use the (?=) meta characters. Imagine that you wanted to find all words in a paragraph that are followed by a comma. If we take the concept of a boundary just discussed and add (?=,) after the word meta character then we will only match the words that are followed by a comma.
Notice that the comma itself is not part of the actual match.
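A small sketch of the lookahead just described follows. The sample text and the exact expression are assumptions built from the boundary expression discussed earlier.

```javascript
// Hypothetical sample text.
const text = "red, green, blue and yellow, then purple";

// (?=,) only succeeds when the next character is a comma,
// but the comma is not consumed and so is not part of the match.
const beforeComma = text.match(/\b\w+\b(?=,)/g);

console.log(beforeComma); // ["red", "green", "yellow"]
```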
Matching If Not Followed By Specified Characters (aka Lookarounds/lookaheads)
To negate a lookahead, that is, to only match the words that are not followed by a specific sequence of characters, use (?!) instead of (?=). In the code below I match all of the words in the string that are not followed by a comma.
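One way this might look in code is sketched here. The sample text is an assumption, and the second expression is included only to show why the trailing \b matters.

```javascript
// Hypothetical sample text.
const text = "red, green, blue and yellow, then purple";

// (?!,) only succeeds when the next character is NOT a comma.
const notBeforeComma = text.match(/\b\w+\b(?!,)/g);
console.log(notBeforeComma); // ["blue", "and", "then", "purple"]

// Without the trailing \b, \w+ backtracks one character so that the
// lookahead sees a letter instead of the comma, yielding partial words.
const partial = text.match(/\b\w+(?!,)/g);
console.log(partial);
// ["re", "gree", "blue", "and", "yello", "then", "purple"]
```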
Creating Capture Groups and Using Backreferences (aka Groups)
An advanced use of parentheses is that, by default, they create what is known as a capturing group. A capture group stores the text matched inside the parentheses in memory. This capture group can be referred to after its creation, in the same regular expression, using a backreference (i.e. \1).
In the regular expression editor below we place "apples" inside of parentheses, creating a capture group.
To refer to this sub expression result (i.e. "apples"), we use a \1 in the same expression, which matches the same text as /(apples) to (apples)/. The next capture, or second capture group, if we had more than one set of parentheses at work, would be available at \2, and the next at \3, and so on.
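The backreference behavior can be sketched as follows. The test strings and the second, more general expression are assumptions added for illustration.

```javascript
// \1 must match the same text that the first group captured.
const applesPattern = /(apples) to \1/;

const sameFruit = applesPattern.test("apples to apples");   // true
const otherFruit = applesPattern.test("apples to oranges"); // false

// A backreference is more useful when the group is not a fixed literal.
// Hypothetical example: find a word that is immediately repeated.
const repeated = "it is is here".match(/\b(\w+) \1\b/);

console.log(sameFruit, otherFruit);   // true false
console.log(repeated[0], repeated[1]); // "is is" "is"
```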
Notes:
A capture group can be avoided by placing ?: inside the group, telling the sub expression not to create a capture group (i.e. /(?:blah)/) for reference from a backreference.
Creating JavaScript Regular Expression Objects
At the start of this tutorial, I covered the creation of JavaScript regular expression objects using a literal syntax and a constructor syntax. Before reading the remaining part of this tutorial you should review this information again so you are comfortable with the creation of RegExp() objects using the literal syntax.
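As a quick refresher, the two syntaxes might be compared like this. The pattern and sample string are assumptions chosen for illustration.

```javascript
// Literal syntax: the pattern sits between forward slashes.
const literal = /\d+/g;

// Constructor syntax: the pattern is a string, so the backslash
// itself has to be escaped inside the string literal.
const constructed = new RegExp("\\d+", "g");

console.log(literal.source === constructed.source); // true
console.log("a1b22".match(literal));      // ["1", "22"]
console.log("a1b22".match(constructed));  // ["1", "22"]
```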
RegExp Object Methods & Properties
A JavaScript regular expression object has the following methods and properties:
Methods & Properties | Description |
---|---|
//.exec("") | Executes a search for a match in the string passed to the exec() method. It returns an array of information containing the first matched characters (e.g. /\w+/.exec('foo')[0] === 'foo' ), capture groups, the index location of the match in the string (e.g. /\w/.exec('foo').index === 0 ), and the original inputted string (e.g. `/\w/.exec('foo').input === "foo"`). |
//.test("") | Tests for a match in a string that is passed to the test() method (e.g. /\w/.test('foo') === true ). It returns `true` or `false`. |
//.ignoreCase | Read only boolean value indicating whether or not the i flag was set. |
//.global | Read only boolean value indicating whether or not the g flag was set. |
//.lastIndex | A read/write integer property that specifies the index at which to start the next match. Valid only if the g flag is set. |
//.source | A read only string property that contains the pattern, excluding the forward slashes. |
//.multiline | Read only boolean value indicating whether or not the m flag was set. |
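The properties in the table can be inspected directly on an expression. The sketch below uses an expression and input string that are assumptions chosen only to exercise each property.

```javascript
// Hypothetical expression and input.
const re = /(\w)(\d)/gi;
const input = "a1 b2";

console.log(re.source);     // (\w)(\d)  (the pattern without the slashes)
console.log(re.global);     // true
console.log(re.ignoreCase); // true
console.log(re.multiline);  // false

// With the g flag, exec() resumes from lastIndex on each call.
const first = re.exec(input);
const afterFirst = re.lastIndex;  // 2
const second = re.exec(input);
const afterSecond = re.lastIndex; // 5
const third = re.exec(input);     // null, and lastIndex resets to 0

console.log(first[0], first.index, afterFirst);    // "a1" 0 2
console.log(second[0], second.index, afterSecond); // "b2" 3 5
console.log(third, re.lastIndex);                  // null 0
```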
Regular Expression String() Methods That Take RegExp Objects
The following String() methods either require a regular expression or can optionally accept a regular expression object as a parameter, typically in place of a string value.
Methods | Description |
---|---|
"".split(//) | Breaks a string into an array of substrings using either a RegExp object or a String object as the separator. |
"".match(//) | Creates an array containing the matches found by a regular expression, or returns null if no matches were found. |
"".replace(// or "", "" or function(){}) | Produces a new string with some or all matches replaced by a replacement value. The replace() method accepts both String objects and RegExp objects to identify what parts of the string should be replaced. |
"".search(//) | Returns the index of the first match, or -1 if no matches are found. |
Finding The First Match In A String Using exec()
The exec() method, when used on a regular expression, returns an array containing the first match found in the string passed to the exec() method. The value matched is placed at the 0 index of the array. In the example below, the first match, "DE", based on the expression /([dD])([eE])/g can be logged to the console using execArray[0].
Notice that the exec() array also provides the index and input properties. The index value contains the numeric index pertaining to where in the string the match occurred (0 representing the first character, "A", in the string). For example, the "DE" match occurs at index 35. The input property simply returns the string value that was originally passed to the exec() method.
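A minimal sketch of exec() follows. The myString value here is a shorter stand-in and an assumption, so the index it reports differs from the one quoted in the text.

```javascript
// Hypothetical stand-in for the tutorial's myString value.
const myString = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

const execArray = /([dD])([eE])/g.exec(myString);

console.log(execArray[0]);    // "DE" (the full match)
console.log(execArray[1]);    // "D"  (first capture group)
console.log(execArray[2]);    // "E"  (second capture group)
console.log(execArray.index); // 3    (position of "D" in this stand-in string)
console.log(execArray.input === myString); // true

// When nothing matches, exec() returns null instead of an array.
const noMatch = /xyz/.exec(myString);
console.log(noMatch); // null
```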
Notes:
Capture groups are placed in the array after the full match (i.e. ["DE","D","E"]).
exec() returns the null value if no match is found.
test() and String method search() can be faster than exec().
Finding All Matches In A String Using match()
We have been using match() throughout this tutorial in the JSfiddle mini expression editor. I used the match() method to return an array containing all of the matches found by the expression. You can also see the usage of match() in the code example below, where I am searching myString for any word character matches (i.e. /\w/).
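The difference the g flag makes to match() can be sketched as follows. The myString value is a short stand-in and an assumption.

```javascript
// Hypothetical stand-in for the tutorial's myString value.
const myString = "ab_1 !";

// With the g flag, match() returns every match as a plain array.
const allWordChars = myString.match(/\w/g);
console.log(allWordChars); // ["a", "b", "_", "1"]

// Without the g flag, match() returns only the first match,
// along with the same index and input properties exec() provides.
const firstOnly = myString.match(/\w/);
console.log(firstOnly[0], firstOnly.index); // "a" 0

// No match at all returns null rather than an empty array.
const none = myString.match(/z/g);
console.log(none); // null
```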
Notes:
If a non-RegExp object is passed to .match(), it is implicitly converted using the RegExp() constructor.
If the g flag is not used, match() will return the same array of information that is returned from using exec().
test() and String method search() can be faster than match().
Determine (true or false) If A Match Occurred Using test()
The regular expression test() method is used to return a boolean value indicating if a match has been found in the string passed to the test() method. Below I test the myString value for the underscore character.
A match is found and test() returns the boolean value true.
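A short sketch of test() is shown here. The myString value is an assumption, and the last part illustrates how the g flag makes test() remember where it stopped.

```javascript
// Hypothetical stand-in for the tutorial's myString value.
const myString = "abc123_";

const hasUnderscore = /_/.test(myString); // true
const hasDash = /-/.test(myString);       // false
console.log(hasUnderscore, hasDash);

// With the g flag, test() advances lastIndex, so calling it again on
// the same expression object continues after the previous match.
const globalUnderscore = /_/g;
const firstCall = globalUnderscore.test(myString);  // true
const secondCall = globalUnderscore.test(myString); // false
console.log(firstCall, secondCall);
```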
Finding The Index Of First Match Using search()
The search() method, available on JavaScript string values, will search a string for a regular expression match and return the index of the first match found by the expression. In the code example below I search the myString value for the index of the match found by the /_/g expression (i.e. index of the first _ character). The g flag has no effect here, because search() always reports the first match.
Since the underscore is the last character in the string the index should be the same as the .length of the string minus one.
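The search described above might look like this in code. The myString value is a short stand-in and an assumption.

```javascript
// Hypothetical stand-in for the tutorial's myString value.
const myString = "abc123_";

const underscoreIndex = myString.search(/_/g);
console.log(underscoreIndex);                        // 6
console.log(underscoreIndex === myString.length - 1); // true

// When nothing matches, search() returns -1.
const missingIndex = myString.search(/-/);
console.log(missingIndex); // -1
```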
Replacing Expression Matches To Create A New String Using replace()
The replace() string method is a mini search and replace tool that optionally accepts a regular expression as the mechanism to find sub strings to be replaced. For example, let's say you would like to replace the word "ahole" or "jackass" anytime it is found within a string. The code below demonstrates the usage of replace() for cleaning up some undesirable words contained in the string, "He is such an ahole. I can't believe how much of a jackass he can be!".
Notice that while I could have passed the replace() method a string literal as the second parameter, I instead used a callback function. By using a callback function (which is passed the value being replaced) it is possible for me to customize each replacement based on the character length of the match I am replacing. In reality I could have simply passed the string "****" as the second parameter to replace(), but that would assume that all the words I was replacing were four letter words.
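A sketch of the callback approach described above follows. The expression and the choice of "*" as the masking character are assumptions; the original code may differ.

```javascript
const myString =
  "He is such an ahole. I can't believe how much of a jackass he can be!";

// The callback receives each matched word and returns its replacement:
// one asterisk per character in the match.
const cleaned = myString.replace(/ahole|jackass/g, function (match) {
  return "*".repeat(match.length);
});

console.log(cleaned);
// "He is such an *****. I can't believe how much of a ******* he can be!"
```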
Splitting A String Into An Array Based On Matches Using split()
The split() string method, when passed a regular expression, will use matches from the expression to act as the breaking points for dividing a string into sub strings. The sub strings are placed into an array and returned. In the example below I split the myString value into an array of substrings by passing the split() method an expression that matches spaces.
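The snippet below is a minimal sketch of splitting on spaces. The myString value and the use of \s+ are assumptions, and the string separator is shown only for contrast.

```javascript
// Hypothetical string with an uneven run of spaces.
const myString = "regexp are   powerful tools";

// A regular expression can treat a whole run of whitespace as one separator.
const byExpression = myString.split(/\s+/);
console.log(byExpression); // ["regexp", "are", "powerful", "tools"]

// A plain string separator splits on every single space instead.
const byString = myString.split(" ");
console.log(byString); // ["regexp", "are", "", "", "powerful", "tools"]
```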
The following three examples are contrived examples for the purpose of learning to read expressions. Study each expression image and see if you can figure out what the meta and shorthand characters are doing and why. These are not plug and play solutions, considering that most of the time you'll want to match the entire string (i.e. /^expression$/) alone and not several matches in a string.
Creating an expression to match US Postal Codes.
Creating an expression to match HTML Hex Colors.
Creating an expression to match 12hr times.
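For readers who want something to experiment with, here is one possible expression for each of the three cases. These are hypothetical stand-ins written for this sketch, not the expressions from the original images, and they are anchored with ^ and $ so that they validate a whole string.

```javascript
// Five digits, optionally followed by a dash and four more digits.
const usPostalCode = /^\d{5}(?:-\d{4})?$/;

// A hash followed by either three or six hexadecimal digits.
const hexColor = /^#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6})$/;

// Hours 1-12 (optional leading zero), minutes 00-59, then AM or PM.
const twelveHourTime = /^(?:0?[1-9]|1[0-2]):[0-5]\d ?(?:AM|PM)$/i;

console.log(usPostalCode.test("12345"));        // true
console.log(usPostalCode.test("12345-6789"));   // true
console.log(usPostalCode.test("1234"));         // false

console.log(hexColor.test("#fff"));             // true
console.log(hexColor.test("#1a2B3c"));          // true
console.log(hexColor.test("#ffff"));            // false

console.log(twelveHourTime.test("9:05 am"));    // true
console.log(twelveHourTime.test("12:59PM"));    // true
console.log(twelveHourTime.test("13:00 PM"));   // false
```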
A lot of developers ignore regular expressions not necessarily because they are difficult, which they are, but, in my opinion, because there are not a lot of helpful resources available for learning about regular expressions. I wrote this article because it is the article I wanted to read when I was learning. I hope this information can stand as a resource for those JavaScript developers that are seeking knowledge and understanding about JavaScript regular expressions.
Beyond this content I would like to suggest pursuing additional information about regular expressions by consuming one of the videos listed below.