WARNING: This is a local copy of the "lost" page http://developer.netscape.com/viewsource/angus_strings.html. Since DevEdge is dead, I have also removed the contact e-mail addresses. [DKS April 2006]

STRING MATCHING AND REPLACING IN JAVASCRIPT 1.2

By Angus Davis


As developers, we often run into problems where we'd like an easy way to search for patterns in a string. UNIX scripting languages like Perl have long been heralded as great tools for string pattern matching and substitution, through something called regular expressions, or "regexps."

Regexps work like wild cards, as vague or specific as your project demands. You can use them to enable powerful validation tools on the client side, conversion of database output on the server side, and virtually any task imaginable that requires the use of string matching and replacing.

With new features in JavaScript 1.2, developers have flexible string pattern matching capabilities comparable to those of Perl and other UNIX scripting languages. This represents a giant leap forward in the toolset available to crossware developers, and offers a new set of features that UNIX gurus have longed for in JavaScript.

This article will get you up to speed building JavaScript regexps, with parallel tracks for both regexp neophytes and seasoned Perl programmers who understand regexps inside and out. There are two steps to understanding regexps: first, learning how to build patterns; then, understanding how to implement regexp methods. Patterns are a way of describing strings, while methods such as match() and replace() enable the use of patterns in your applications.

In the first section of this article, I'll introduce patterns and their cryptic language of modifiers, metacharacters, and special characters. Next, you can read one of two sections that explain how to implement patterns using JavaScript. One of these sections is for those of you who have a background in the Perl programming language, while the other is aimed at developers who are not familiar with Perl or its regular expression support.

DEMYSTIFYING REGULAR EXPRESSIONS

A regexp is a tool for matching patterns in strings and replacing those matches with new content. Regexps have their own language of special characters that assist in building complicated patterns. JavaScript patterns closely follow those used in Perl.

Patterns are enclosed in forward slashes and define the string pattern you're looking for. You can append special modifiers to specify options such as a global or case-insensitive search. The typical use is /pattern/, where pattern is the text to search for within a string. One or more modifier characters may follow the closing slash. For example:

myPattern = /Netscape/
myOtherPattern = /Netscape/i

These two patterns both match the word "Netscape." The first pattern will match this word exactly, while the i modifier appended to the second pattern specifies case-insensitive matching. Table 1 lists the various modifiers that are available. (The m and s modifiers affect the behavior of other special characters that we'll learn about later.)


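Table 1. Regexp modifiers

Modifier   Description
g          Global pattern matching
i          Case-insensitive pattern matching
m          Allows the special characters ^ and $ to match multiple times within a string
s          Allows the special character . to match newlines
x          Ignores whitespace within a pattern

(Of these, JavaScript 1.2 accepts only g and i. The m, s, and x modifiers come from Perl; JavaScript gained m and s in later versions and has no x modifier.)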
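
As a minimal sketch of how the i and g modifiers change a match, the lines below use the RegExp test() method and string.match(). The sample sentences and the variable names exact, anyCase, and all are illustrative assumptions, not part of the original article.

```javascript
var myPattern = /Netscape/;
var myOtherPattern = /Netscape/i;

// Without i, the capital "N" must match exactly.
var exact = myPattern.test("I use netscape every day");
// With i, letter case is ignored.
var anyCase = myOtherPattern.test("I use netscape every day");

// With g, match() collects every occurrence instead of only the first.
var all = "Netscape, netscape, NETSCAPE".match(/netscape/gi);
```

Here exact is false, anyCase is true, and all holds three entries.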

Metacharacters in Regexp Patterns

Patterns can be more than just text. By using something called a metacharacter, you can match patterns such as "numeric digits only" or "words only," or combinations of such restrictions. Metacharacters are simply normal characters preceded by a backslash that have special meaning. Typically a lowercase metacharacter will match something and its uppercase counterpart will match just the opposite. Table 2 describes the regexp metacharacters.


Table 2. Regexp metacharacters

Metacharacter   Description and examples

\s   Matches a whitespace character (including tabs and newlines)
        /Moz\silla/ matches "Moz illa" (one space) but not "Mozilla" or "Moz   illa" (three spaces), because \s matches exactly one whitespace character
\S   Matches any character that is not whitespace
        /Moz\Silla/ matches "Moz-illa" but not "Moz illa" or "Mozilla", because \S must consume exactly one non-whitespace character
\b   Matches only a word boundary
        /\bMoz/ matches any word beginning with "Moz", like "Mozilla" or "Mozillathon"
        /Moz\b/ matches any word ending with "Moz", like "myMoz" or "bigMoz"
        /\bMoz\b/ matches only the word "Moz" -- not "Mozilla" or "myMoz"
\B   Matches only nonword boundaries
        /\Bour/ matches "four" and "sour" but not "our"
        /ject\B/ matches "rejection" and "injection" but not "reject" or "inject"
        /\Bthe\B/ matches "lathes" but not "the" or "them" or "scathe"
\d   Matches digits 0 through 9
        /Navigator \d/ matches "Navigator 3" but not "Navigator A"
\D   Matches only nonnumeric characters
        /Navigator \D/ matches "Navigator A" but not "Navigator 3"
\w   Matches only letters, numbers, or underscores
        /7\w7/ matches "747", "7_7", and "7A7", but not "7.7" or "7+7"
\W   Matches only characters that are not letters, numbers, or underscores
        /7\W7/ matches "7.7" and "7+7" but not "747"
\A   Matches the beginning of a string only (a Perl metacharacter; in JavaScript, use ^)
        /\ANetscape/ matches "Netscape Communicator" but not "Communicator by Netscape"
\Z   Matches the end of a string only (a Perl metacharacter; in JavaScript, use $)
        /Netscape\Z/ matches "Communicator by Netscape" but not "Netscape Communicator"
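
A few of the metacharacters from Table 2 can be tried directly with the RegExp test() method. This is only a sketch; the variable names and the sample strings are assumptions chosen for illustration.

```javascript
// \b anchors the pattern to word boundaries on both sides.
var wordOnly = /\bMoz\b/;
var standsAlone = wordOnly.test("Moz rocks");   // "Moz" is a whole word
var insideWord = wordOnly.test("Mozilla");      // "Moz" is only a prefix

// \d requires a digit; \D requires anything but a digit.
var digitAfter = /Navigator \d/.test("Navigator 3");
var nonDigitAfter = /Navigator \D/.test("Navigator 3");

// \w accepts letters, digits, and underscores; \W accepts everything else.
var wordChar = /7\w7/.test("7_7");
var nonWordChar = /7\W7/.test("7+7");
```

Only insideWord and nonDigitAfter come out false; the other four tests succeed.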

Other Special Characters in Regexp Patterns

As you can see, metacharacters are formed by placing a backslash in front of a "standard" character. In addition to metacharacters, you can use other special characters that are similar to metacharacters but are not preceded by a backslash. Table 3 lists these special characters.


Table 3. Other special characters

Special character   Description and example

*   Matches zero or more occurrences of the preceding character
       /Ne*tscape/ matches "Ntscape" and "Netscape" and "Neeetscape" but not "Notscape"
+   Matches one or more occurrences of the preceding character
       /Ne+tscape/ matches "Netscape" and "Neeetscape" but not "Ntscape"
?   Matches zero or one occurrence of the preceding character
       /Ne?tscape/ matches "Netscape" and "Ntscape" but not "Neetscape"
.   Matches any one character, except newlines
       /N.tscape/ matches "Netscape" and "N1tscape" but not "Ntscape"
^   Matches the beginning of a string, like Perl's \A metacharacter
       /^Netscape/ matches "Netscape Communicator" but not "Communicator by Netscape"
$   Matches the end of a string, like Perl's \Z metacharacter
       /Netscape$/ matches "Communicator by Netscape" but not "Netscape Communicator"

One of the first questions developers ask when learning about regexps is how to include these special characters literally in their strings. For instance, if you'd like to search for the "$" character, you would expect to have a problem, because $ is a special character. The answer is to place a backslash (\) in front of the character -- for example, /\$/ will match the dollar sign character.
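
The quantifiers, anchors, and backslash escaping described above can be checked with a short sketch like the following. The variable names and test strings are illustrative assumptions.

```javascript
// Quantifiers apply to the character just before them.
var zeroOrMore = /Ne*tscape/.test("Ntscape");     // "e" may be absent
var oneOrMore = /Ne+tscape/.test("Ntscape");      // at least one "e" is required
var zeroOrOne = /Ne?tscape/.test("Neetscape");    // two "e"s are one too many

// Anchors tie the pattern to the start or end of the string.
var atEnd = /Netscape$/.test("Communicator by Netscape");
var atStart = /^Netscape/.test("Communicator by Netscape");

// A backslash turns the special character $ into a literal dollar sign.
var hasDollar = /\$/.test("Total: $25");
var leadingDollars = /^\$+/.test("$$100");
var noDollar = /\$/.test("100 dollars");
```

Note that /\$/ and /$/ mean different things: the first looks for a dollar sign, the second for the end of the string.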

Parentheses, Square Brackets, and Braces

You can build complicated patterns by combining metacharacters with special "container" characters: parentheses, square brackets, and braces (curly brackets). For instance, you could choose to match one or more digits with /\d+/, or one or more dollar signs at the beginning of a string with /^\$+/.

Parentheses let you do two things. First, just as in a mathematical expression where "(2+2)/3" would evaluate the addition within the parentheses before performing the division contained outside, you can use parentheses to group together regexp characters. For instance, /(Ha+)+/ would match "Ha", "HaHaa", and "HaHaaHaaaHaHaa" but not "H" or "aaa".

A more powerful use of parentheses is creating what are called backreferences, special pattern results that allow you to determine exactly what match was found. If your pattern contains two pairs of parentheses, for example, it may store two backreferences. Backreferences are stored in variables named $1, $2, $3, $4, and so on. For instance, the pattern /(\D+)(\d+)\.(\d+)/, when matched against the string "JavaScript 1.2", would store "JavaScript " (including the trailing space, since \D matches any nondigit) in $1, "1" in $2, and "2" in $3.
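
In JavaScript, the captured text is also available as elements of the array returned by the RegExp exec() method, where element 0 is the whole match and elements 1, 2, 3 correspond to $1, $2, $3. The sketch below assumes that approach; the variable names are illustrative.

```javascript
var versionPattern = /(\D+)(\d+)\.(\d+)/;
var result = versionPattern.exec("JavaScript 1.2");

var whole = result[0];   // the entire matched text
var name = result[1];    // first pair of parentheses; \D+ also takes the space
var major = result[2];   // second pair of parentheses
var minor = result[3];   // third pair of parentheses
```

Because \D matches any nondigit, name ends with the space that precedes "1"; a pattern such as /(\w+) (\d+)\.(\d+)/ would leave the space out of the first group.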

Square brackets enclose a choice of characters to be matched, and braces enclose a range of numbers indicating how many occurrences of the preceding character should be considered a match. See Table 4 for some examples.


Table 4. Examples using "container" characters

Example               Result

/[aeiouy]+/           Matches one or more occurrences of a vowel or the letter "y", such as "ee" or "you"
/^(\bh[aiou]t\b)+/    Matches the word "hat", "hit", "hot", or "hut" at the beginning of a string (the \b boundaries keep the group from repeating back-to-back, as in "hathat")
/Ja{1,4}va/           Matches "Java" or "Jaava" or "Jaaava" or "Jaaaava", but not "Jaaaaava" (there can be from 1 to 4 occurrences of the letter "a" following "J")
/Ja{3,}va/            Matches "Jaaava" or "Jaaaaava" but not "Java" (there can be three or more occurrences of the letter "a" following "J")

There are two additional special characters that you would probably use only within these "containers" -- they are the vertical bar, or "pipe" (|), and the hyphen (-). Those of you who are familiar with the pipe know that it typically means "or," and this is the case with regexps. For example, the pattern /(a|z)+/ matches one or more occurrences of either "a" or "z", as in "aa", "z", or "az". Examples of using the hyphen character are /[5-7]+/, which matches one or more of the numbers 5, 6, and 7, and /[a-z]/, which matches any one letter from "a" to "z."
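
Square brackets, braces, the pipe, and the hyphen can be exercised with a sketch like this one. The variable names and sample strings are assumptions for illustration.

```javascript
// Square brackets: any one of the listed characters, repeated by +.
var vowelRun = "you".match(/[aeiouy]+/)[0];

// Braces: between one and four "a" characters after "J".
var fourAs = /Ja{1,4}va/.test("Jaaaava");
var fiveAs = /Ja{1,4}va/.test("Jaaaaava");

// Pipe: either alternative may be chosen on each repetition.
var mixed = /^(a|z)+$/.test("az");

// Hyphen: a range of characters inside square brackets.
var digits = "12567".match(/[5-7]+/)[0];
```

The ^ and $ anchors in the pipe example force the whole string to consist of "a" and "z" characters, which shows that the two alternatives can be mixed freely.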

The Mystery of Regexps Solved

Now you should have a good understanding of basic patterns and their modifiers, metacharacters, and other special characters. Using combinations of these elements along with parentheses, square brackets, and braces, you can build incredibly complicated patterns. Of course, you haven't learned how to do anything with these patterns yet -- that comes next.

Keep in mind that regexps are only as complicated as you need them to be; if you just want to replace one word with another, there's little need for special characters or backreferences.

It's now time to start putting patterns to use. If you already understand how to build regexps in Perl using m/// and s///, you should jump to the section "Migrating from PERL to JavaScript Regexp Methods." JavaScript programmers who are unfamiliar with Perl and its regexp methods should continue now with the next section, "Learning JavaScript Regexp Methods."

LEARNING JAVASCRIPT REGEXP METHODS

In the previous section you learned the art of creating patterns to describe strings. Here I'll explain how you can put patterns to work by using JavaScript 1.2's support for regexps through three new methods for matching, replacing, and splitting strings.

The JavaScript methods I'll explain here are called in the form of string.match(), string.replace(), and string.split(), where string is the name of a variable containing a string.

The string.match() Method

JavaScript's string.match() method allows you to use regular expressions to determine whether a certain pattern is contained in a string. The syntax looks like this:

string.match(/pattern/modifiers)

Typically, you store the result of this statement in a variable, since the string.match() method returns an array of the matches found in the string you search in, or null when nothing matches. Example 1 illustrates the use of this method.


Example 1
var myString = "This is my test string";
var inThere = myString.match(/This/g);
if (inThere) {
   // Yes, a match occurred.
   ...
}
else {
   ...
   // No match occurred.
}
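
To make the return value of string.match() concrete, here is a small sketch. The patterns /is/g and /Perl/g and the variable names are illustrative assumptions; only myString comes from Example 1.

```javascript
var myString = "This is my test string";

// With the g modifier, every occurrence is collected into the array.
var found = myString.match(/is/g);

// When nothing matches, the result is null rather than an empty array.
var missing = myString.match(/Perl/g);

// Guard against null before asking for a length.
var count = found ? found.length : 0;
var missingCount = missing ? missing.length : 0;
```

Here found contains two entries (the "is" inside "This" and the word "is"), and missing is null, which is why the if test in Example 1 works.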

A more practical use of this method is validating form data before it's sent. In the past, developers have often used JavaScript's indexOf() method to check for the presence of an "@" symbol in form fields asking for an e-mail address. This provides some degree of certainty that the e-mail address is valid, but you can improve on this validation significantly by using regular expressions and making a few assumptions about what a valid address looks like. You can use the power of regexps to build a pattern to match these assumptions, as shown in Example 2.


Example 2
<SCRIPT>
function checkEmail(myString) 
{
   // The regexp literal must stay on one line; it cannot span a line break.
   var newString = myString.match(/\b(^(\S+@).+((\.com)|(\.net)|(\.edu)|(\.mil)|(\.gov)|(\.org)|(\..{2,2}))$)\b/gi);
   if (!newString) alert("Invalid e-mail address!");
   else {
      alert(myString+" is a valid e-mail address.")
   }
}
</SCRIPT>
<FORM NAME="ex1" onSubmit="return false;">
<INPUT NAME="foo" TYPE="text" SIZE=45 onChange="checkEmail(this.value);">
</FORM>

To see this in action, load the example in a browser, type an e-mail address into the form, and press Enter. If no match is found, you'll be alerted with an error message signifying an invalid e-mail address; otherwise, you'll receive a validation message.
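
The same check can be tried outside a browser with a function that returns true or false instead of calling alert(). The sketch below is a simplified variant of the Example 2 pattern, not the author's exact code: the function name isPlausibleEmail and the sample addresses are hypothetical. It keeps the article's assumption that an address ends in one of six listed domains or a two-letter country code, so it will reject newer, longer top-level domains.

```javascript
function isPlausibleEmail(address) {
   // Non-whitespace characters, an "@", at least one more character,
   // then one of the listed endings at the very end of the string.
   var pattern = /^\S+@.+(\.com|\.net|\.edu|\.mil|\.gov|\.org|\..{2})$/i;
   return pattern.test(address);
}

var ok = isPlausibleEmail("angus@example.com");
var countryCode = isPlausibleEmail("someone@example.de");
var noAtSign = isPlausibleEmail("no-at-sign.com");
var noDomainEnding = isPlausibleEmail("user@example");
```

A pattern like this can only say that an address looks plausible; it cannot confirm that the mailbox exists.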

Performing pattern matches with the string.match() method is only one way of getting the job done. If you're performing many iterations of a matching statement inside a repeating loop, it's more efficient to craft your code in a different manner using JavaScript 1.2's new RegExp object, as explained later in the section "The JavaScript Regexp Object."

The string.replace() Method

By far the most common and powerful way to use regexps is in "search and replace" operations, where a pattern is matched and then replaced with new text. In JavaScript 1.2, this is accomplished through the string.replace() method.

In this section, I'll examine both a simple and a more complex example of the string.replace() method. For a complete description of this method, check the appropriate section of the JavaScript 1.2 documentation.

In Example 3, I replace all instances of the word "proprietary" with the phrase "Open and Cross-Platform" using the string.replace() method. You'll notice that I use the i and g modifiers to establish a case-insensitive global search and the \b metacharacter to indicate word boundaries.


Example 3
<SCRIPT LANGUAGE="JavaScript1.2">
function replaceMe(myString) 
{
   var pattern = /\bproprietary\b/ig;
   var newString = myString.replace(pattern,"Open and Cross-Platform");
   return newString;
}
</SCRIPT>

Given some text containing the word "proprietary", the replace() call produces a new string in which all instances of the word are replaced with the phrase "Open and Cross-Platform".
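
A runnable sketch of the same replacement follows. The function name replaceProprietary and the sample sentence are assumptions for illustration; the pattern is the one described for Example 3.

```javascript
function replaceProprietary(text) {
   // replace() returns a new string; the original is left untouched.
   return text.replace(/\bproprietary\b/ig, "Open and Cross-Platform");
}

var before = "Proprietary formats lock you in; proprietary APIs do too.";
var after = replaceProprietary(before);

// \b keeps the pattern from matching inside a longer word.
var longerWord = replaceProprietary("nonproprietary");
```

Both occurrences in before are replaced regardless of case, while "nonproprietary" is returned unchanged because no word boundary precedes the "p".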

This simple example of string.replace() is only one way to use regexps in your JavaScript 1.2 application. Building a complicated regexp will truly test your abilities. The pattern used in Example 4, for instance, looks like this:

/\b([^aeiouy]*)(\S+)\s?/gi

If this looks like Latin to you, don't worry -- it is. This regexp pattern will help convert standard English to Pig Latin. "avaScriptJay" (that's Pig Latin for "JavaScript") makes it happen. Does the world really need a Pig Latin converter? Probably not. However, this particular example is a good deal of fun and a favorite among regular expression enthusiasts.


Example 4
<SCRIPT LANGUAGE="JavaScript1.2">
function pigLatin(myString) 
{
   var pattern = /\b([^aeiouy]*)(\S+)\s?/gi;
   // \s? consumes the space after each word; the replacement writes it back.
   var newString = myString.replace(pattern,"$2$1ay ");
   return newString;
}
</SCRIPT>
<FORM onSubmit="return false;">
<INPUT TYPE="text" SIZE=60 onChange="this.value = pigLatin(this.value);">
</FORM>

To see this in action, load the example in a browser, type some text into the form, and press Enter to see JavaScript 1.2 and string.replace() generate Pig Latin.
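
The Pig Latin conversion can also be run without a form. In this sketch the pattern consumes the whitespace after each word with \s? and the replacement writes a single space back, so that one word cannot swallow the space before the next. The function name toPigLatin and the final trimming step are assumptions added for illustration.

```javascript
function toPigLatin(text) {
   // $1 holds the leading consonants, $2 the rest of the word.
   var pattern = /\b([^aeiouy]*)(\S+)\s?/gi;
   var converted = text.replace(pattern, "$2$1ay ");
   // Drop the single space left after the last word.
   return converted.replace(/ $/, "");
}

var oneWord = toPigLatin("JavaScript");
var twoWords = toPigLatin("hello world");
```

With these inputs, oneWord is "avaScriptJay" and twoWords is "ellohay orldway".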

The string.split() Method

JavaScript 1.2 provides a new method, called string.split(), to split up a string based on a regexp pattern and store the resulting substrings in an array. This is especially useful for parsing data that uses a delimiter, for instance. Traditional flat-file databases and other similar content are appropriate for such parsing.

Using string.split() is as simple as defining a pattern to use as a delimiter in splitting any string. Example 5 shows how you might use numeric characters to act as such a delimiter in a function called ax(), which will slice up any string based on the delimiter and produce an ordered list in HTML.


Example 5
<SCRIPT LANGUAGE="JavaScript1.2">
function ax(myString) 
{
   var pattern = /\d+/;
   var logs = myString.split(pattern);
   document.write("<ol>");
   for (var i=0; i<logs.length; i++) {
       document.write("<li>"+logs[i]+"</li>")
   }
   document.write("</ol>")
}
</SCRIPT>

The ax() function takes a variable called myString and uses the string.split() method to "chop up" the string into an array called logs. It then writes the values to the document in the form of an ordered list.

If you were to pass this function the string "Netscape12Communications4Corporation977777Makes123Cool555Stuff", you would get the following ordered list as a result:

  1. Netscape
  2. Communications
  3. Corporation
  4. Makes
  5. Cool
  6. Stuff

In this case, the function matches any set of one or more digits and uses them as a delimiter. In practice, you can use any regexp pattern as a delimiter with string.split(). Adventurous developers should try using more complicated patterns to build regexps with this method.
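
The splitting step on its own, without the document.write() calls, might look like the sketch below. The function name splitOnDigits and the second sample string are illustrative assumptions.

```javascript
function splitOnDigits(text) {
   // Every run of one or more digits acts as a single delimiter.
   return text.split(/\d+/);
}

var logs = splitOnDigits("Netscape12Communications4Corporation977777Makes123Cool555Stuff");

// A delimiter at the very start produces an empty first element.
var leading = splitOnDigits("7up");
```

Here logs holds the six words from the list above, and leading shows that a string beginning with a delimiter yields an empty string as its first element.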

MIGRATING FROM PERL TO JAVASCRIPT REGEXP METHODS

This section is only for those of you who already understand how to build regexps in Perl using m/// and s///; others should skip to the section "The JavaScript Regexp Object."

In earlier versions of Navigator, JavaScript lacked support for powerful, efficient, and lightweight routines to match and replace patterns in strings. Perl, on the other hand, is known for its excellent capabilities in this area. It's only natural then that to answer its need for these routines, JavaScript has borrowed code syntax and functionality from Perl. As a result, Perl developers will experience a straightforward and relatively painless transition to using regexps in JavaScript.

There are several reasons to migrate server applications to JavaScript from Perl. Server-side JavaScript is faster, as explained in a recent View Source article, CGI vs. Server-Side JavaScript for Database Applications. Because JavaScript is a cross-platform scripting language, you can change from a UNIX server to a Windows NT server without having to rewrite any code.

Since JavaScript 1.2 is supported in both the Communicator 4.0 client suite and Enterprise Server 3.0, it's easy to write regexps on either side of the connection.

The Perl s/// Operator Compared to the JavaScript string.replace() Method

Replacing patterns in Perl is carried out through the s/// operator. In JavaScript, string.replace() performs the same function. In Example 6, a script replaces all instances of the word "proprietary" with the phrase "Open and Cross-Platform."


Example 6
Perl:  $myString =~ s/proprietary/Open and Cross\-Platform/ig;
JavaScript:  myString = myString.replace(/proprietary/ig,"Open and Cross-Platform");


When backreferences are used, JavaScript's string.replace() method is very similar to Perl's s/// operator, as shown in Example 7.


Example 7
Perl:  s/(some stuff)(other stuff)/$2$1/ig;
JavaScript: string.replace(/(some stuff)(other stuff)/ig, "$2$1");
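
As a concrete instance of swapping two backreferences, the sketch below reorders a "last, first" name. The sample name format and the variable names are assumptions for illustration.

```javascript
var lastFirst = "Davis, Angus";

// $1 is the text before the comma, $2 the text after it.
var firstLast = lastFirst.replace(/(\w+), (\w+)/, "$2 $1");
```

The result in firstLast is "Angus Davis"; as in Perl, the numbering of $1 and $2 follows the order of the opening parentheses in the pattern.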

The similarity continues in pattern construction: all metacharacters and other special characters that work in Perl regexps are supported in JavaScript. You can also use parentheses, square brackets, and braces as in Perl to gain greater control over regexps in JavaScript. A classic Perl regexp converts standard English text to Pig Latin through the s/// operator. Example 8 shows both the Perl and the JavaScript versions of this popular demonstration regexp.


Example 8
Perl:  $string =~ s/\b([^aeiouy]*)(\S+)\s?/$2$1ay /gi; 
JavaScript:  string = string.replace(/\b([^aeiouy]*)(\S+)\s?/gi,"$2$1ay ");


The Perl m/// Operator Compared to the JavaScript string.match() Method

Perl's m/// operator lets you test strings for a pattern match. If a match occurs, the operation returns a Boolean value of true; otherwise, it returns false. As shown in Example 9, JavaScript's new string.match() method performs this function in much the same way.


Example 9
Perl:  $myString =~ m/[aeiouy]/ig;
JavaScript:  myString.match(/[aeiouy]/ig);

This example checks for the presence of a vowel (or "y") in a string variable called myString. The Perl operation returns true if the pattern can be matched and false otherwise; the JavaScript method returns an array of matches, which counts as true in a condition, or null, which counts as false. As with the string.replace() method, JavaScript handles the same special regexp features shown for Perl.

The Perl split() Method Compared to the JavaScript string.split() Method

Perl lets you split up a string into several other strings, using a regexp pattern as a delimiter. Thanks to the new JavaScript string.split() method, identical functionality is now possible without the need to rely on Perl.

Example 10 shows the code to split up strings in Perl and JavaScript.


Example 10
Perl:  @newStrings = split(/::/,$myString);
JavaScript:  newStrings = myString.split(/::/);

This example breaks up the contents of a string called myString into an array of strings called newStrings. For instance, if I assign a value of "Bert::Ernie::Oscar" to "myString", I can use the split() method to store "Bert", "Ernie", and "Oscar" into newStrings[0], newStrings[1], and newStrings[2], respectively.

Like the Perl operator, JavaScript's string.split() method can take the number of splits as an argument. The following example limits the number of splits to 3:

newStrings = myString.split(/::/,3)
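
The effect of the second argument can be seen in this sketch. The fourth name, "Grover", is a hypothetical value added so that the limit has something to cut off.

```javascript
var myString = "Bert::Ernie::Oscar::Grover";

// Without a limit, every piece is returned.
var everyone = myString.split(/::/);

// With a limit of 3, at most three pieces are returned.
var firstThree = myString.split(/::/, 3);
```

One difference from Perl is worth knowing: JavaScript drops the pieces beyond the limit, so firstThree ends with "Oscar", whereas Perl's split with a limit of 3 would keep the unsplit remainder "Oscar::Grover" as the last field.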

THE JAVASCRIPT REGEXP OBJECT

Performing pattern matches with the string.match() method is only one way of getting the job done. If you're performing many iterations of a matching statement inside a repeating loop, it's more efficient to craft your code in a different manner, using JavaScript 1.2's new RegExp object.

Constructing an instance of the RegExp object allows you to "compile" a regexp pattern in a form that the JavaScript interpreter can use very efficiently. Thus, if you're performing hundreds of match tests against database records or some other intense application, using the RegExp object will make your code execute more quickly.

For an in-depth explanation of the RegExp object and its methods, check the JavaScript 1.2 documentation dealing with the subject. Here I'll show you how to migrate a simple pattern matching application that uses the match() method (Example 11) to new code that uses the RegExp object. This example looks at an array of strings in search of the word "Mozilla."


Example 11
<SCRIPT LANGUAGE="JavaScript1.2">
function oldMatch() 
{
   // Create and populate an array of strings.
   var myStrings = new Array(5);
   myStrings[0] = "Some text in here";
   myStrings[1] = "Even more text in here";
   myStrings[2] = "Mozilla likes to eat cookies";
   myStrings[3] = "Some more text in here";
   myStrings[4] = "We all like Mozilla, our friend";
   // Form a pattern, and check for matches in our array.
   var pattern = /Mozilla/ig;
   for (i=0; i<myStrings.length; i++) {
      if (myStrings[i].match(pattern)) {
         // It's a match.
         ...
      }
      else {
         // It's not a match.
         ...
      }
   }
}
</SCRIPT>

In Example 12, we see the equivalent code using the RegExp object.


Example 12
<SCRIPT LANGUAGE="JavaScript1.2">
function newMatch() 
{
   // Create and populate an array of strings.
   var myStrings = new Array(5);
   myStrings[0] = "Some text in here";
   myStrings[1] = "Even more text in here";
   myStrings[2] = "Mozilla likes to eat cookies";
   myStrings[3] = "Some more text in here";
   myStrings[4] = "We all like Mozilla, our friend";
   // Form a pattern, and check for matches in our array.
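   // Build the pattern once, outside the loop. The g flag is left out
   // because exec() on a global regexp resumes from re.lastIndex, which
   // would carry over from one string to the next.
   var re = new RegExp("Mozilla","i");
   re.compile("Mozilla","i");
   for (var i=0; i<myStrings.length; i++) {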
      if (re.exec(myStrings[i])) {
         // It's a match.
         ...
      }
      else {
         // It's not a match.
         ...
      }
   }
}
</SCRIPT>

Notice that there are three major differences between string.match() and the RegExp object methods shown in this example, all visible in the code: the pattern is built with the RegExp constructor, it is compiled with compile(), and the test is made by calling exec() on the regexp with the string as its argument, rather than calling match() on the string.

Knowing how to use the RegExp object is just as important as knowing when to use it. When you're evaluating a large number of strings for a pattern match, it's best to take the effort to create and compile a regexp pattern using the RegExp object. However, for quick or individual evaluations, keep things simple by sticking with the string.match() method.
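
The reuse of one RegExp object across many strings can be sketched as a small counting function. The function name countMatches and the use of the test() method in place of exec() are assumptions for illustration; the strings are the ones from Examples 11 and 12.

```javascript
function countMatches(strings, re) {
   var count = 0;
   for (var i = 0; i < strings.length; i++) {
      // test() returns true or false without building a result array.
      if (re.test(strings[i])) count++;
   }
   return count;
}

var myStrings = [
   "Some text in here",
   "Even more text in here",
   "Mozilla likes to eat cookies",
   "Some more text in here",
   "We all like Mozilla, our friend"
];

// One regexp object, created once and reused for every string.
var re = new RegExp("Mozilla", "i");
var total = countMatches(myStrings, re);

// Caution: with the g flag, a regexp remembers where its last match ended.
var globalRe = /Mozilla/g;
var firstTry = globalRe.test("Mozilla");
var secondTry = globalRe.test("Mozilla");
```

Here total is 2. The last three lines show why the g modifier is best avoided when one regexp is reused with test() or exec(): firstTry is true, but secondTry is false because the second search resumes from the position where the first match ended.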

THE NEXT STEP

Now that you understand JavaScript 1.2 building blocks for regexps -- string.match(), string.replace(), string.split(), and the RegExp object -- you're ready to begin including regexps in your own JavaScript applications. If you're a Perl developer, you can take comfort in knowing that JavaScript regexp support is based almost entirely on the foundation that Perl has set.

The next step is to look over the "Regular Expressions" section of the JavaScript 1.2 documentation for more details. With its powerful methods for matching and replacing patterns in strings, JavaScript gives developers on both the client and the server new reasons to work with this powerful scripting language.