mScriptBox Tutorial
Regular Expressions - Part I

Written by Merlin & Tristan
Published with permission.

Table Of Contents

  1. Introduction
  2. What is a Regular Expression?
  3. Regular Expression Basics
  4. mIRC and Regular Expression
  5. Basic Operators
  6. Grouping of items
  7. Special Characters
  8. Switches
  9. Pattern Matching
  10. Length Pattern Matching
  11. Some Examples

1. Introduction

This tutorial explains what Regular Expressions are and how to use them. It is made of two parts:

Part I - Basic Regular Expressions
Part II - Advanced Regular Expressions

In Part I, the first three sections (written by Merlin) give a general overview and are not bound to mIRC, while sections 4-11 (written by Tristan) explain how to use mIRC's regular expression identifiers.

Part II was written by Sigh_

mIRC uses the PCRE Library to support Regular Expressions.
If you want to know more about it, read the Extract of the PCRE Manual.


2. What is a Regular Expression?

A regular expression is a formula for matching strings that follow some pattern. Many people are afraid to use them because they can look confusing and complicated. Unfortunately, nothing in this write up can change that.

However, I have found that with a bit of practice, it's pretty easy to write these complicated expressions. Plus, once you get the hang of them, you can reduce hours of laborious and error-prone text editing down to minutes or seconds.

Regular expressions are supported by many text editors, by class libraries such as Rogue Wave's Tools.h++, by scripting tools such as awk, grep, and sed, and increasingly by interactive development environments such as Microsoft's Visual C++.


3. Regular Expression Basics

Regular expressions are made up of normal characters and metacharacters. Normal characters include upper and lower case letters and digits. The metacharacters have special meanings and are described in detail below.

In the simplest case, a regular expression looks like a standard search string. For example, the regular expression "testing" contains no metacharacters. It will match "testing" and "123testing" but it will not match "Testing".

To really make good use of regular expressions it is critical to understand metacharacters. The table below lists metacharacters and a short explanation of their meaning.
 
Metacharacter   Description

.
Matches any single character. For example, the regular expression r.t would match the strings rat, rut, and r t, but not root.

$
Matches the end of a line. For example, the regular expression weasel$ would match the end of the string "He's a weasel" but not the string "They are a bunch of weasels."

^
Matches the beginning of a line. For example, the regular expression ^When in would match the beginning of the string "When in the course of human events" but would not match "What and When in the".

*
Matches zero or more occurrences of the character immediately preceding. For example, the regular expression .* means match any number of any characters.

\
This is the quoting character; use it to treat the following character as an ordinary character. For example, \$ is used to match the dollar sign character ($) rather than the end of a line. Similarly, the expression \. is used to match the period character rather than any single character.

[ ]
[c1-c2]
[^c1-c2]
Matches any one of the characters between the brackets. For example, the regular expression r[aou]t matches rat, rot, and rut, but not ret. Ranges of characters can be specified by using a hyphen. For example, the regular expression [0-9] means match any digit. Multiple ranges can be specified as well. The regular expression [A-Za-z] means match any upper or lower case letter. To match any character except those in the range (the complement range), use the caret as the first character after the opening bracket. For example, the expression [^269A-Z] will match any character except 2, 6, 9, and upper case letters.

\< \>
Matches the beginning (\<) or end (\>) of a word. For example, \<the matches "the" in the string "for the wise" but does not match "the" in "otherwise". NOTE: this metacharacter is not supported by all applications.

\( \)
Treats the expression between \( and \) as a group. Also saves the characters matched by the expression into temporary holding areas. Up to nine pattern matches can be saved in a single regular expression. They can be referenced as \1 through \9.

|
Matches either of two alternatives. For example, (him|her) matches the line "it belongs to him" and matches the line "it belongs to her" but does not match the line "it belongs to them." NOTE: this metacharacter is not supported by all applications.

+
Matches one or more occurrences of the character or regular expression immediately preceding. For example, the regular expression 9+ matches 9, 99, 999. NOTE: this metacharacter is not supported by all applications.

?
Matches 0 or 1 occurrence of the character or regular expression immediately preceding. NOTE: this metacharacter is not supported by all applications.

\{i\}
\{i,j\}
Matches a specific number of instances, or a number of instances within a range, of the preceding character. For example, the expression A[0-9]\{3\} will match "A" followed by exactly 3 digits. That is, it matches all of A123 but only the first four characters of A1234. The expression [0-9]\{4,6\} matches any sequence of 4, 5, or 6 digits. NOTE: this metacharacter is not supported by all applications.

The simplest metacharacter is the dot. It matches any one character (excluding the newline character). Consider a file named test.txt consisting of the following lines:

    he is a rat
    he is in a rut
    the food is Rotten
    I like root beer
We can use grep to test our regular expressions. Grep uses the regular expression we supply and tries to match it to every line of the file. It prints all lines where the regular expression matches at least one sequence of characters on a line. The command
    grep r.t test.txt
searches for the regular expression r.t in each line of test.txt and prints the matching lines. The regular expression r.t matches an r followed by any character followed by a t. It will match rat and rut. It does not match the Rot in Rotten because regular expressions are case sensitive. To match both upper and lower case, the square brackets (character range metacharacters) can be used. The regular expression [Rr] matches either R or r. So, to match an upper or lower case r followed by any character followed by the character t, the regular expression [Rr].t will do the trick.
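The grep run above can also be imitated without grep. The following is a minimal sketch in Python, which is an assumption of this example rather than something the tutorial itself uses; the helper name grep_lines is likewise made up for illustration. It applies r.t and [Rr].t to the four lines of test.txt and keeps the lines in which the expression matches somewhere.

```python
import re

# The four lines of test.txt from the text above.
TEST_TXT = [
    "he is a rat",
    "he is in a rut",
    "the food is Rotten",
    "I like root beer",
]

def grep_lines(pattern, lines=TEST_TXT):
    """Return every line in which the pattern matches at least once."""
    return [line for line in lines if re.search(pattern, line)]

print(grep_lines(r"r.t"))     # case sensitive: only a lower case r is accepted
print(grep_lines(r"[Rr].t"))  # accepts R or r as the first character
```

The first call keeps the rat and rut lines; the second also keeps the Rotten line. The root beer line is never kept, because two characters separate the r and the t.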

To match characters at the beginning of a line use the circumflex character (sometimes called a caret). For example, to find the lines in the file test.txt that begin with the word "he", you might first think to use the simple expression he. However, this would also match the "he" inside "the" in the third line. The regular expression ^he only matches "he" at the beginning of a line.

Sometimes it is easier to indicate what should not be matched rather than all the cases that should be matched. When the circumflex is the first character between the square brackets it means to match any character which is not in the range. For example, to match he when it is not preceded by t or s, the following regular expression can be used: [^st]he. Note that [^st] still has to match one character, so this expression does not match a he that has nothing in front of it.
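The two uses of the circumflex described above can be compared side by side. This is a small sketch in Python (an assumed choice for illustration; the word list is invented for the example) using the same test.txt lines.

```python
import re

lines = ["he is a rat", "he is in a rut", "the food is Rotten", "I like root beer"]

# Unanchored "he" also finds the "he" inside "the" on the third line.
plain = [line for line in lines if re.search(r"he", line)]

# "^he" only accepts "he" at the very start of a line.
anchored = [line for line in lines if re.search(r"^he", line)]

# "[^st]he" needs one character other than s or t in front of "he",
# so a bare "he" with nothing before it does not match either.
words = ["the", "she", "ache", "he"]
negated = [word for word in words if re.search(r"[^st]he", word)]

print(plain)
print(anchored)
print(negated)
```

Only "ache" survives the last test: the c in front of he is neither s nor t.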

Several character ranges can be specified between the square brackets. For example, the regular expression [A-Za-z] matches any letter in the alphabet, upper or lower case. The regular expression [A-Za-z][A-Za-z]* matches a letter followed by zero or more letters. We can use the + metacharacter to do the same thing. That is, the regular expression [A-Za-z]+ means the same thing as [A-Za-z][A-Za-z]*. Note that the + metacharacter is not supported by all programs that have regular expressions.
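The claim that [A-Za-z]+ and [A-Za-z][A-Za-z]* mean the same thing can be checked on a few sample strings. The sketch below assumes Python's re module, which supports the + metacharacter; the sample strings and the helper whole_match are made up for the example.

```python
import re

samples = ["", "a", "Hello", "abc123", "42"]

def whole_match(pattern, text):
    """True when the pattern accounts for the entire text."""
    return re.fullmatch(pattern, text) is not None

with_plus = [s for s in samples if whole_match(r"[A-Za-z]+", s)]
with_star = [s for s in samples if whole_match(r"[A-Za-z][A-Za-z]*", s)]

print(with_plus == with_star)
print(with_plus)
```

Both expressions accept exactly the strings made of one or more letters, so the empty string and the strings containing digits are rejected by both.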

To specify the number of occurrences matched, use the braces (they must be escaped with a backslash). As an example, to match 100 and 1000 but not 10, use the following: 10\{2,3\}. This regular expression matches the digit 1 followed by either 2 or 3 0's. On its own it will still find the 1000 inside 10000; to exclude longer numbers the expression has to be anchored, for example with ^ and $. A useful variation is to omit the second number. For example, the regular expression 0\{3,\} will match 3 or more successive 0's.
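The brace counts can be tried out as follows. This sketch assumes Python's re module, where the braces are written without backslashes; the list of numbers is taken from the paragraph above.

```python
import re

numbers = ["10", "100", "1000", "10000"]

# Requiring the whole string to match gives exactly 100 and 1000.
whole = [n for n in numbers if re.fullmatch(r"10{2,3}", n)]

# Searching anywhere in the string also accepts 10000, because it contains 1000.
anywhere = [n for n in numbers if re.search(r"10{2,3}", n)]

# Omitting the second number: runs of 3 or more zeros.
long_runs = re.findall(r"0{3,}", "10 100 1000 100000")

print(whole)
print(anywhere)
print(long_runs)
```

The difference between the first two lists shows why anchoring matters when a longer number must be excluded.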


4. mIRC and Regular Expression

The mIRC help file officially contains little documentation on the $regex() identifier, so I took some time to look at how mIRC supports it and to find a way of explaining it in simple terms, so that new scripters, or those who have never worked with regular expressions, can get an idea of how best to use it in their scripts.

I will assume that if you are reading this, you have at least looked at the mIRC help file and are now wondering what good this new identifier is if there is no real documentation of how it works. You have probably gone onto the Internet and looked up regular expressions, and are now wondering where you went wrong in even looking them up, as they look so complex. Yes, regular expressions are complex, but I hope this document will take some of the complexity out of them and help you understand them better.

So let's take a few moments to talk about how the regular expression language itself determines whether your string matches the expression. I have done my best to document all the switches and characters that I could locate. If there are others, and they work with mIRC, I will be glad to hear about them. Also, a word of warning: do not assume from the first few examples that $regex() is a boolean operator. Please read on.

Terms used
First, let's cover some of the terms that will be used in this tutorial so that there is no confusion. The mIRC help file defines the usage for $regex() as
$regex([name], text, re)
For this tutorial I am going to drop the [name] parameter; its usage is described well enough in the mIRC help file. I will refer to the text as the string, and to the re as the expression or substring. The expression is what $regex() evaluates against the string to determine the matches. Some prefer the term substring when talking about the expression part of $regex(), as it describes a substring of the primary string we are attempting to match.

5. Basic Operators

The table below lists the operators with a description and examples of how each can be used in mIRC.

Operator   Description   Example

^
Matches the substring at the start of the string. The first character or characters of the string must match the substring for it to return a value.
    $regex(abc,^a)
    $regex(this is a string,^this)
The first example returns 1, because the substring a appears at the start of the string abc. If we changed the string to bcd, it would return 0. Do not think of $regex() as a true/false identifier; although it can be used that way, that is not its nature.
$
Matches a string that ends with the substring. This time it is the last character or characters of the string that must match.
    $regex(abc,c$)
    $regex(this is a string,string$)
These return 1, because the substring appears at the end of the string. If we changed the end of the string to something else, they would return 0.

Note: using ^ and $ together requires the whole string to be an exact match of the substring.
    $regex(abc,^abc$)
This returns 1, because the string both starts and ends with the substring. Any change to either the substring or the string would make it return 0.
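The anchor examples above can be reproduced outside mIRC. The sketch below uses Python's re module as a stand-in, which is an assumption of this example (mIRC itself uses PCRE, although the two behave the same for these anchors). The helper regex is hypothetical and only imitates the 1 or 0 result of $regex(text, re) for these simple cases.

```python
import re

def regex(text, pattern):
    """Hypothetical stand-in for $regex(text, re): 1 if the pattern matches, else 0."""
    return 1 if re.search(pattern, text) else 0

print(regex("abc", r"^a"))      # a at the start of the string
print(regex("bcd", r"^a"))      # no a at the start
print(regex("abc", r"c$"))      # c at the end of the string
print(regex("abc", r"^abc$"))   # the whole string is an exact match
print(regex("abcd", r"^abc$"))  # one extra character breaks the exact match
```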

*
Zero or more: the preceding character may appear any number of times, but does not have to appear at all.
    $regex(ab,ab*)
    $regex(a,ab*)
    $regex(abbbbb,ab*)
All of the above return 1. The expression says there must be at least one 'a' in the string, optionally followed by 'b', and if there is a 'b' there can be any number of them. This operator is not a good choice when you need an exact phrase to appear.
+
One or more: similar to the * operator, but this time the preceding character must occur at least once and can occur more than once.
    $regex(abbbb,ab+)
    $regex(ab,ab+)
These return 1. The substring ab+ means the character a followed by one or more of the character b.
?
Zero or one: in the same family as the * and + operators. The ? is the zero or one operator. The preceding character does not have to appear in the string, but if it does it is matched at most once.
    $regex(ab,a?b)
    $regex(cb,a?b)
    $regex(abbbbb,ab?)
Each of these returns 1. In a?b the a is optional: the expression matches whether or not an a comes before the b, but the b must be present. In ab? the a must be present and may be followed by one b; in abbbbb it matches the leading ab.
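To see what the *, + and ? operators actually consume, it helps to look at the matched text rather than just the 1 or 0 result. The following sketch assumes Python's re module as a stand-in for mIRC's engine; the helper matched is made up for the example.

```python
import re

def matched(pattern, text):
    """Return the text the pattern matched, or None when there is no match."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

star = [matched(r"ab*", s) for s in ["ab", "a", "abbbbb"]]
plus = [matched(r"ab+", s) for s in ["abbbb", "ab", "a"]]
optional = [matched(r"a?b", s) for s in ["ab", "cb"]] + [matched(r"ab?", "abbbbb")]

print(star)
print(plus)
print(optional)
```

Note that ab+ finds nothing in a plain "a", while ab* is satisfied by it, and that ab? stops after the first b even when more follow.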
| (Pipe)
Logical OR operator: as in other languages, this is used for a substring that can have one value or another.
    $regex(hello there,hi|hello)
It compares the first part of the substring against the string; if that is not found, it tries to match the second part of the substring. It returns 0 only if neither of the substrings matches the string.
. (Period)
Character operator: the period stands for "one character". This is helpful when you want to find two characters in a string that are joined by a character you are not concerned about.
    $regex(abc,a.c)
    $regex(axe,.x.)
    $regex(oxe,.x.)
All of these return 1, as they match the pattern of the substring.
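The pipe and period examples can be checked the same way. This is a sketch under the assumption that Python's re module treats | and . as described here; the helper matched and the extra "xe" case are additions for illustration.

```python
import re

def matched(pattern, text):
    """Return the text the pattern matched, or None when there is no match."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

either = matched(r"hi|hello", "hello there")
dots = [matched(r"a.c", "abc"), matched(r".x.", "axe"), matched(r".x.", "oxe")]
too_short = matched(r".x.", "xe")  # no character in front of the x

print(either)
print(dots)
print(too_short)
```

The last case shows that each period must consume exactly one character: "xe" has nothing before the x, so .x. cannot match it.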


6. Grouping of items

Regular expressions allow you to gather related parts of the substring into groups, which makes the expressions easier to write.

( ) - Grouped substrings:
This allows you to evaluate a group of characters in a substring separately from the rest of the substring.
$regex(abc,a(bc)*) : returns 1
$regex(ac,a(b|c)) : returns 1

In the above examples the characters inside the ( ) are treated as a unit, and the operator that follows is applied to the whole group. In the first example there must be an 'a', followed by zero or more repetitions of 'bc'. In the second there must be an 'a' followed by either 'b' or 'c'.
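The effect of grouping is easier to see when the text captured by the group is printed next to the whole match. The sketch below assumes Python's re module, where group(0) is the whole match and group(1) is the text matched by the first ( ) group.

```python
import re

m1 = re.search(r"a(bc)*", "abc")  # the * applies to the whole group bc
m2 = re.search(r"a(bc)*", "a")    # zero repetitions of the group is allowed
m3 = re.search(r"a(b|c)", "ac")   # the group holds whichever alternative matched

print(m1.group(0), m1.group(1))
print(m2.group(0), m2.group(1))
print(m3.group(0), m3.group(1))
```

In the second case the group took part in no match at all, so Python reports it as None.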

[ ] - Blocked groups:
This allows you to search for any one character from the set or range that appears between the [ ]. The common ranges are:

a-z : lower case letters
A-Z : upper case letters
0-9 : digits

Now for an example of usage.

$regex(abcdef,[ab])
This is much like a|b with the or operator: it looks at the string and attempts to find a match for 'a' or 'b', which could also be written as the range 'a' to 'b', [a-b].

By adding the - you can find a range of characters, which saves you from having to type abcdefgh and so on. If you want to search for the - itself, it must appear as the first character inside the [ ]. (Yes, there is another way, but it is more complex.)

$regex(the cat,[g-i])
$regex(the-cat,[-w-x])
The first example looks for a character between the start point 'g' and the end point 'i'; it finds the 'h' and returns 1. The second looks for a '-', a 'w' or an 'x'; it finds the '-' and returns 1.

If you were to use G-I in the first example you would get a returned value of 0, as there is no 'H' in the string that you are searching. If you want to find 'h' or 'H', combine the ranges between the [ ] like so: [g-iG-I]. This searches for a letter in that range in either upper or lower case.

When searching for a number in the string you can search for a digit between 0 and 9.

$regex(1%,[0-9]%)
Here the digit must be between 0 and 9 and must be followed by the character %. The string contains 1 followed by %, so it matches. So how does one match numbers that are over 9? Simply use two [ ] searches in a row: [2-3][0-9]
$regex(31,[2-3][0-9])
This searches for a number between 20 and 39.

You can search for both digits and letters in the same [ ]. The order in which the ranges appear does not matter, so [0-9a-z] and [a-z0-9] are both acceptable.

Finally, the [ ] have a not operator, ^, which matches a character that is not between the [ ].

$regex(the cat,[^w-z])
This searches for a character that is not in the given range. It returns 1, as the first character 't' is not between 'w' and 'z'.
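The bracket examples from this section can be collected into one table of results. This sketch assumes Python's re module, which handles these simple [ ] groups the same way; the helper found and the dictionary layout are made up for the example. Each entry shows the first piece of text the expression finds, or None.

```python
import re

def found(pattern, text):
    """Return the first text the pattern matches, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

results = {
    "[ab]": found(r"[ab]", "abcdef"),
    "[g-i]": found(r"[g-i]", "the cat"),
    "[G-I]": found(r"[G-I]", "the cat"),     # wrong case: nothing found
    "[-w-x]": found(r"[-w-x]", "the-cat"),   # leading - is a literal hyphen
    "[0-9]%": found(r"[0-9]%", "1%"),
    "[2-3][0-9]": found(r"[2-3][0-9]", "31"),
    "[^w-z]": found(r"[^w-z]", "the cat"),   # first character outside w-z
}

for expression, text in results.items():
    print(expression, "->", text)
```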

[: :] - Character Classes:
These are named groups of characters of the same kind, such as alphanumeric characters, the space and the return character. A class name is written between [: and :] and is only recognized inside a [ ] group, as in the example at the end of this list. These are the types that work so far:

[:alnum:]
Any alphanumeric character (letters and digits)
[:alpha:]
Any alphabetical character
[:blank:]
A space or horizontal tab
[:space:]
Any white space character, including newline and return
[:digit:]
Any digit
[:cntrl:]
Any control character: useful for finding ctrl+k
[:lower:]
Any lower case character
[:upper:]
Any upper case character
[:punct:]
Any printing character that is neither alphanumeric nor a space

$regex(123,[[:digit:]])


7. Special Characters

Regular expressions give a special meaning to many characters that you might want to search for. To search for one of those characters literally, escape it with the \ operator. Here is a list of the escape codes:

Character   Escape Code
? \?
* \*
+ \+
. \.
| \|
{ \{
} \}
\ \\
[ \[
] \]
( \(
) \)
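A backslash in front of any other punctuation character is harmless and simply matches that character. In front of a letter or digit, however, PCRE often gives the sequence a meaning of its own: for example, \d matches any digit and \n matches a newline, not the letter n.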
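The difference an escape makes can be seen by running the same search with and without the backslash. The sketch below assumes Python's re module, which uses the same escape codes for + and . as the list above; the sample strings are invented for the example.

```python
import re

text = "1+1=2"
escaped = re.search(r"1\+1", text)    # \+ is a literal plus sign
unescaped = re.search(r"1+1", text)   # + means "one or more 1", then another 1

literal_dots = re.findall(r"\.", "a.b")  # only the period itself
any_chars = re.findall(r".", "a.b")      # every character, one at a time

print(escaped.group(0))
print(unescaped)
print(literal_dots)
print(any_chars)
```

Without the backslash the expression 1+1 looks for at least two 1s in a row, which the string "1+1=2" does not contain.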


8. Switches

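Regular expressions have a few switches which change how the pattern is matched. To use a switch, enclose the expression between two / characters and write the switch letter after the closing /, as in /pattern/g.

//x: Extended pattern. White space inside the pattern is ignored, so the pattern can be spaced out for readability.
//g: Global. Match every occurrence of the pattern in the string, not just the first.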


9. Pattern Matching

Regular expressions can be used to find out whether a character or a phrase appears in the string we are searching, and how many times it appears.

Let's take a look at searching a string for a character or phrase first, before we get into counting the number of times something occurs.

$regex(we want to match something,a)

This will return 1, as the character 'a' appears in the string. This can be helpful if you are looking for a special character in the phrase.

$regex(regular expressions tutorial by trystan,tutorial by)
We can also search to see whether the phrase we are looking for has been used. This is helpful when dealing with bad language kickers: you search the string and get a 1 or a 0 returned based on what was said. Now let's look at how we can count the number of times a character or phrase appears.
For this, regular expressions require the //g switch. An example:
$regex(test,/t/g)

This returns the number 2, as the letter 't' appears 2 times in the given string. You can also count a full word in a string, as long as the text appears between the two / characters of the //g switch.

$regex(I ran a test of the test,/test/g)

This will return 2 as the phrase test appears 2 times in the given string
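The counting behaviour of the //g switch can be imitated by collecting every match and counting them. The sketch below assumes Python's re module, which has no g switch but offers findall for the same purpose; the helper count_matches is hypothetical and only mirrors the two examples above.

```python
import re

def count_matches(text, pattern):
    """Hypothetical stand-in for $regex(text, /pattern/g): count non-overlapping matches."""
    return len(re.findall(pattern, text))

print(count_matches("test", r"t"))
print(count_matches("I ran a test of the test", r"test"))
```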


10. Length Pattern Matching

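First, a note of warning: when using {n,m} with mIRC, be aware that mIRC treats the comma as the separator before the next parameter of $regex(). For this reason it is recommended that you store the expression in a %variable and pass the variable to $regex(), as shown further down. An expression with a single number in the braces contains no comma and can be written inline: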

$regex(abbbb,ab{2}) : returns 1
$regex(abbbb,ab{4}) : returns 1

The expression ab{2} asks: is there an 'a' followed by a 'b' that is repeated exactly 2 times? Because the expression is not anchored, a string with more than 2 b's still matches, as the first 2 satisfy it. You can also check for a length within a given range, for example {2,4}:

SET %reg ab{2,4}
$regex(abbbb,%reg)

This will return 1, as the string contains an 'a' followed by a 'b' which appears between 2 and 4 times.
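What each length expression actually consumes can be seen by printing the matched text. This sketch assumes Python's re module, where the comma inside the braces needs no special handling, unlike in mIRC; the helper matched is made up for the example.

```python
import re

def matched(pattern, text):
    """Return the text the pattern matched, or None when there is no match."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

exact_two = matched(r"ab{2}", "abbbb")   # stops after two b's
exact_four = matched(r"ab{4}", "abbbb")  # needs all four b's
ranged = matched(r"ab{2,4}", "abbbb")    # takes as many as allowed, up to four
too_few = matched(r"ab{2,4}", "ab")      # only one b: no match

print(exact_two, exact_four, ranged, too_few)
```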


11. Some Examples

Matching correct format for a channel name:

Previously you would have had to check the string three times to determine whether it was in the correct format. Now we can look at it once and decide.

alias eval_chan_name {
  IF ($regex($1,^(#|\+|&))) { echo -a Validated }
  ELSE { echo -a Invalid }
}
The expression ^(#|\+|&) means that the first character must be a #, a + or an &.
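The channel name expression can be tried on a few sample names. The sketch below assumes Python's re module rather than mIRC, and both the function name is_channel_name and the sample names are invented for illustration. Like the alias above, it only checks the first character.

```python
import re

def is_channel_name(name):
    """True when the name starts with #, + or &, as the expression requires."""
    return re.search(r"^(#|\+|&)", name) is not None

for name in ["#mirc", "+help", "&local", "mirc"]:
    print(name, "Validated" if is_channel_name(name) else "Invalid")
```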

Matching the validity of an email address:

This allows you to ensure that the email address consists of letters and numbers, with an @ and a . in it.

alias eval_email {
  VAR %reg ^[_\.0-9a-zA-Z]+@([0-9a-zA-Z][0-9a-zA-Z]+\.)+[a-zA-Z]{2,3}$
  echo -a $regex($1,%reg)
}

The expression ^[_\.0-9a-zA-Z]+@([0-9a-zA-Z][0-9a-zA-Z]+\.)+[a-zA-Z]{2,3}$ means that the string must start with a series of one or more characters, each of which can be _ or . or a digit from 0 to 9 or a letter from a to z or A to Z. This is followed by an @, and then by the provider's name: one or more groups of at least two letters or digits, each group ending in a period. Finally we look at the .com part, which must be 2 or 3 letters from a-z or A-Z and must end the string.
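To get a feel for what the email expression accepts and rejects, it can be run against a few made-up addresses. This sketch assumes Python's re module, which reads this particular expression the same way; the function name and the sample addresses are hypothetical.

```python
import re

EMAIL = r"^[_\.0-9a-zA-Z]+@([0-9a-zA-Z][0-9a-zA-Z]+\.)+[a-zA-Z]{2,3}$"

def looks_like_email(address):
    """True when the address fits the tutorial's expression."""
    return re.search(EMAIL, address) is not None

samples = [
    "user@example.com",             # fits the pattern
    "first.last@mail.example.org",  # several groups before the ending
    "user@x.com",                   # provider group has only one character
    "user@example.info",            # ending is longer than 3 letters
    "user-name@example.com",        # - is not allowed before the @
    "user@example",                 # no period after the provider name
]

for address in samples:
    print(address, looks_like_email(address))
```

The rejected samples show that the expression is a format check with fairly strict assumptions, so some addresses that are valid in practice will not pass it.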