mScriptBox Tutorial
Regular Expressions - Part II

Written by Sigh_
Published with permission.

Table Of Contents

  1. Introduction
  2. Back Referencing - The $regml Identifier
  3. Some Examples using $regml
  4. The $regsub Identifier
  5. Example using $regsub
  6. Conclusion
  7. Regular Expressions - Part I

1. Introduction

This tutorial goes one step further than the tutorial on regular expressions that already exists on this site. You are presumed to be well versed in the basics of regular expressions, i.e. how to form them and use them to match patterns in strings. If you are new to this and haven't yet read Trystan's tutorial on regular expressions, please do so before continuing (you will only get confused otherwise).


2. Back Referencing - The $regml Identifier

Alright, I'll begin with back references (which also involve $regml). In the first tutorial, you came across parentheses, which were used to separate parts of an expression. For example:
//echo -a $regex(this is a test,/this is a (test|balloon|sausage)/)

The expression looks for "this is a" followed by either "test", "balloon" or "sausage". This is matched in the string, so a value of "1" is returned, which, as you know, tells us that the string matches the pattern expressed.

However, what you weren't told before is that enclosing an item in parentheses makes mIRC "remember" what was matched inside them. This is called a back reference. So in this case, mIRC stores the word "test" because that is what was matched in the string. You can refer to this value by using \N, where N is the number of the back reference, which for this example would be \1 as (test|balloon|sausage) was the first back reference in the expression. This \1, however, can only be used in an expression, for example:

//echo -a $regex(this is a test,/this is a (test|balloon|sausage) \1/)

We know from the first example that the expression matches the "this is a test" part of the string. Adding \1 tells mIRC to look for whatever the first back reference matched. Since that value is "test", the expression now looks for "this is a test test" in the string. This returns 0 because "test" occurs only once in the string. On the other hand, $regex(this is a test test,/this is a (test|balloon|sausage) \1/) returns 1 because the group matches "test" and \1 then matches "test" again, so both words are matched. Here is an example of an appropriate use for this:

Suppose you want to match a string that looks like "I am smarter than you but Sigh is smarter than me", where the word "smarter" can be one of a variety of words, such as braver/cooler/stronger, provided the same word appears in both places. A suitable regular expression is:

/I am (smarter|braver|cooler|stronger) than you but Sigh is \1 than me/

As you can see, the \1 takes the first back referenced value and substitutes it in there, so that the regex will match "I am braver than you but Sigh is braver than me" etc. Similarly, if you have 2 items enclosed in parentheses, you use \1 to refer to the first and \2 to refer to the second. If you want to refer to them outside of the expression, that is where $regml comes in.
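
To see the difference \1 makes, here is a small sketch you could paste into the Remote section of the Script Editor and run by typing /samecheck. The alias name and the %re variable are my own inventions for this illustration, nothing built in to mIRC.

```mirc
; Hypothetical alias for illustration only
alias samecheck {
  var %re = /I am (smarter|braver|cooler|stronger) than you but Sigh is \1 than me/
  ; Same word in both places, so \1 can match what the group captured
  echo -a same word: $regex(I am braver than you but Sigh is braver than me,%re)
  ; Different words, so \1 has nothing to match and the whole expression fails
  echo -a different words: $regex(I am braver than you but Sigh is cooler than me,%re)
}
```

The first line should report 1 and the second 0, since "cooler" is not what the first set of parentheses matched.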

The "ml" in $regml (probably) stands for "matched last" and its function is to remember the back referenced values in a regular expression. It's easy to use: just call $regml(N) to get the Nth back referenced item (which inside the expression would be \N). Try the following:


//echo -a $regex(Hello world I am Sigh,/Hello world I (am|was|will be) Sigh/) - $regml(0) - $regml(1)

You will see "1 - 1 - am" echoed. The first "1" is returned by the call to $regex to say the expression matched the string successfully, the second "1" indicates there is 1 back referenced value (as with most identifiers that deal with an N parameter, $regml(0) returns the total amount) and the "am" is the first back referenced value.

To recap:

  • Items enclosed in parentheses can be referenced within an expression as \N
  • Outside of an expression they can be referenced by $regml(N)
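
As a rough sketch of the second point, the alias below uses $regex as the condition of an if statement and then reads the captured word with $regml(1). The alias name is hypothetical, and the string is just the earlier example with "was" in place of "am".

```mirc
; Hypothetical alias for illustration only
alias tensecheck {
  if ($regex(Hello world I was Sigh,/Hello world I (am|was|will be) Sigh/)) {
    ; $regml(1) holds whatever the first set of parentheses matched
    echo -a matched with tense: $regml(1)
  }
  else {
    echo -a no match
  }
}
```

Typing /tensecheck should echo "matched with tense: was".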


3. Some Examples using $regml

Let's have a couple of examples to further demonstrate their usage:
//.echo -q $regex(Sigh is 123 years of age and drives a red ferrari or so he wishes,/Sigh is (\d+) years of age and drives a (\w+) ferrari/) | echo -a $regml(0) back referenced values, first: $regml(1) - second: $regml(2)

It's simple: the \d+ in the first set of parentheses looks for one or more digits, which grabs my age. The \w+ in the second set looks for a run of word characters (so it stops at the next space), which retrieves the color of my ferrari. These values are then echoed by the next command. Of course, you can use $regml in an if statement, while loop etc., as it is a normal identifier and remembers the back referenced values from the last call to $regex. If you provide a name for a call to $regex (the optional "name" parameter of the identifier), you can refer to its values later as $regml(name,N). mIRC can store 10 of these before they begin to get overwritten.
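
Here is a minimal sketch of the named form, assuming two separate calls to $regex whose results you want to keep apart. The names "age" and "car" and the alias name are made up for this example.

```mirc
; Hypothetical alias for illustration only
alias namedmatch {
  ; The first parameter of each call is the name the values are stored under
  .echo -q $regex(age,Sigh is 123 years of age,/(\d+) years/)
  .echo -q $regex(car,Sigh drives a red ferrari,/a (\w+) ferrari/)
  ; Each named call keeps its own list of back referenced values
  echo -a age: $regml(age,1) - color: $regml(car,1)
}
```

Typing /namedmatch should echo "age: 123 - color: red", even though the second call to $regex happened after the first.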

//.echo -q $regex(I enjoy playing basketball because basketball helps me relax. I also enjoy football,/I (hate|enjoy) playing (basketball|tennis|chess) because \2 helps me relax. I also \1 (football|hockey|IRC)/) | echo -a $regml(0) - $regml(1) - $regml(2) - $regml(3)

This is similar to what we were looking at before. \1 is referred to in there to get the value of the first back reference, it is then referred to outside the identifier as $regml(1), similarly \2 is used inside the expression and $regml(2) is used outside.


4. The $regsub Identifier

Now let's have a look at $regsub. If you understand what you have read so far, you already know how to use it. It works exactly the same as $regex except you are able to replace everything that your expression matches with specific text. At its most basic, it has a similar function to that of $replace and the only pain using it is that you must create a local variable in which to store the result of a substitution since $regsub() returns the number of substitutions made and not the final string with the substitutions. Let's look at the syntax:

$regsub([name], text, re, subtext, %var)

The [name] part is the same as that in $regex, it assigns a name to be used for back references with $regml and is also optional. The "text" part is the initial string containing whatever it is you want substitutions made in.

"re" is the regular expression to use, "subtext" is the text to substitute in place of anything the expression matches and %var is the name of the variable (local or global) to dump the result in to.

To experiment with it the basic method of viewing your substitutions is first a /var command followed by your echo: //var %temp | echo -a $regsub(...,%temp) - %temp. For example:

//var %temp | echo -a $regsub(string,/s/,b,%temp) - %temp

That echoes "1 - btring" where "1" is the number of substitutions made and "btring" is the content of %temp, the result of the substitution. Inside the identifier, "string" is our string to start with, /s/ is our regular expression that matches a single letter "s", and "b" is the text that replaces whatever the regex matched in the string.

%temp of course is the local variable we declared before echoing. This would have had the same effect as $replace(string,s,b) but only in this instance.

$replace, as you know, replaces all occurrences of a substring, but the regular expression we have used in this substitution only replaces the first instance of "s". So if you change "string" to "strings" you will see that %temp is filled with "btrings" and not "btringb".

This is because we have not specified the "g" switch in the expression, which would indicate to mIRC that we want a global match (to match all occurrences of the pattern expressed in the regex and not just one).

Try the same command with $regsub(strings,/s/g,b,%temp) and see what happens.
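
If you would rather see both versions side by side, something like the following sketch should do it. The alias and variable names are hypothetical.

```mirc
; Hypothetical alias for illustration only
alias gswitch {
  var %first, %all
  ; Without g, only the first "s" is replaced
  echo -a without g: $regsub(strings,/s/,b,%first) - %first
  ; With g, every "s" is replaced
  echo -a with g: $regsub(strings,/s/g,b,%all) - %all
}
```

Typing /gswitch should echo "without g: 1 - btrings" followed by "with g: 2 - btringb".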

Now, back references are possible in $regsub just as they are in $regex, and they can even be used in the "subtext" part of the identifier.

Let's go through a couple of examples:

//var %temp | echo -a $regsub(I am Sigh and I am cool,/am/g,was,%temp) - %temp

Because the "g" switch was used, it replaces all parts of the string matching the regular expression which in this case is a word, "am". So the result is "I was Sigh and I was cool". 2 substitutions were made so the $regsub identifier returns 2.
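
For a back reference in the "subtext" part, here is a small sketch. The alias name is hypothetical and the word "really" is only there so you can see where \1 ends up.

```mirc
; Hypothetical alias for illustration only
alias subref {
  var %temp
  ; \1 in the subtext is replaced by whatever (is|was) matched
  echo -a count: $regsub(Sigh is cool,/(is|was)/,really \1,%temp) - result: %temp
}
```

Typing /subref should echo "count: 1 - result: Sigh really is cool", because the matched "is" is put back in after "really".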


5. Example using $regsub

Let's look at something more complicated such as removing HTML tags from text. First I must tell you what ^ in a character class represents. A character class is what you came across in the first tutorial, a group of characters enclosed in square brackets that could include ranges (such as [a-zA-Z0-9]). A class such as [a] matches the letter "a" and is the same as matching "a" outside a class. However, if you use the ^ character after the opening bracket like [^a] that matches any character that is not an "a".

Example:
$regex(b,[^b]) returns 0 since there is no single character in "b" that isn't a "b".
$regex(a,[^b]) returns 1 because the character "a" is matched by the class [^b], as it isn't a "b".
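
The same checks can be run together with a quick sketch like this one. The alias name is hypothetical, and the third line is an extra case of my own to show that a single non-"b" character anywhere in the string is enough.

```mirc
; Hypothetical alias for illustration only
alias negclass {
  ; No character other than "b" exists, so [^b] cannot match
  echo -a only b: $regex(b,/[^b]/)
  ; "a" is not a "b", so [^b] matches it
  echo -a only a: $regex(a,/[^b]/)
  ; The "a" in the middle is enough for a match
  echo -a mixed: $regex(bab,/[^b]/)
}
```

Typing /negclass should report 0, 1 and 1 in that order.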

This is important when thinking of a regular expression to match HTML tags so they may be removed (by removed I mean using $regsub to substitute HTML tags with $null). The first thing to do in this case is to think "What regular expression can I use to match an HTML tag?".

You know every HTML tag begins with a <, so that is the first part of the regex. That < is followed by one or more characters, followed by an ending >. At a first try, your expression may end up looking like this:

/<.+>/g

You can check this quickly like so:

//var %temp | echo -a $regsub(<b>Text</b>,/<.+>/g,-,%temp) - %temp

So we hope to replace every HTML tag with a hyphen for testing purposes. As you can see if you type this, it replaces the whole string with a single hyphen. This is because the expression <.+> matches < followed by one or more characters followed by >, but mIRC tries to match as many characters as possible with .+ so it ends up matching everything from the first < to the last >.

We don't want this. We only want mIRC to match up until the next >, which is the end of the corresponding HTML tag. So instead of .+ to match characters within an HTML tag, it is better to use [^<>]+ which is a character class representing any character except < or >. Our problem is solved:

//var %temp | echo -a $regsub(<b>Text</b>,/<[^<>]+>/g,-,%temp) - %temp

-Text- is echoed showing us that the two tags have successfully been replaced with -. All that needs to be done to remove them is change the subtext part of $regsub to $null or leave it empty.
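
Putting that together, a tag stripper might be sketched as a custom identifier like the one below. The name $striphtml is my own invention, and it assumes the text you pass in contains no commas (mIRC would treat those as parameter separators).

```mirc
; Hypothetical custom identifier for illustration only
alias striphtml {
  var %temp
  ; Replace every tag with nothing and store the result in %temp
  .echo -q $regsub($1-,/<[^<>]+>/g,$null,%temp)
  return %temp
}
```

You could then try it with //echo -a $striphtml(<b>Hello</b> <i>world</i>) which should leave just the text between the tags.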

Let's recap:

  • $regsub works like $regex but substitutes the text you specify into any area that is matched by the regular expression you give it
  • \N can be used both in the expression and in the subtext (e.g. $regsub(Sigh is cool,/(is|was)/,\1,%var)) to put the first back referenced value in the substituted text
  • Using ^ at the beginning of a character class will negate it, i.e. [^A-Z] matches any character that is not a capital letter from A to Z.
  • When building a regex substitution you may find it more comfortable to consider the expression to use first, thinking about what needs to be matched in order to substitute the text in the correct places.

6. Conclusion

Hopefully you are now comfortable with $regex and $regsub and have experimented with both. The key to mastering this is practice and experimentation to see what can and cannot be done. Challenge yourself by thinking of patterns to match and try to express them with a regular expression. Ask your local regex guru any questions you may have. Let's move on to a slightly more advanced component of regex.