Regular expressions are a powerful way to define patterns for searching and matching. Beyond Compare allows you to use regular expressions when searching through text and when specifying rules for classifying text. The regular expression support in Beyond Compare is a subset of the Perl Compatible Regular Expressions (PCRE) syntax.
Regular expressions can be a complex topic, but there are several excellent resources about them. One is the book Mastering Regular Expressions. Another is Steve Mansour's A Tao of Regular Expressions, a copy of which can be found at:
www.scootersoftware.com/RegEx.html
A regular expression is composed of two types of characters: normal characters and metacharacters. Normal characters match only themselves. Metacharacters take on special meanings during a match, controlling how the match is made and serving as wildcards. To match a metacharacter literally, escape it by prefixing it with a backslash "\". There are several types of metacharacters, each detailed below.
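As a rough illustration of escaping, the sketch below uses Python's `re` module. This is an assumption made for demonstration only: `re` is a separate engine from the one in Beyond Compare, but it treats "." and "\." the same way.

```python
import re

# Unescaped, "." is a metacharacter and matches any character.
loose = re.search(r"v1.0", "v1x0")

# Escaped, "\." matches only a literal period.
strict_miss = re.search(r"v1\.0", "v1x0")
strict_hit = re.search(r"v1\.0", "release v1.0 notes")

print(loose is not None)        # True
print(strict_miss is not None)  # False
print(strict_hit.group(0))      # v1.0
```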
Metacharacters - Escape sequences
Escape sequence | Meaning |
---|---|
\xnn | character with the hex code nn |
\x{nnnn} | character with the hex code nnnn |
\x{F000} | character with a null value |
\t | tab (0x09) |
\f | form feed (0x0C) |
\a | bell (0x07) |
\e | escape (0x1B) |
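A minimal sketch of two of these escape sequences, using Python's `re` module as a stand-in for illustration. Python's `re` does not accept the \x{nnnn} or \e forms, so only \t and \xnn are shown, and behavior in Beyond Compare itself should be checked there.

```python
import re

line = "name\tvalue"

# \t in the pattern matches a literal tab character (0x09).
tab_match = re.search(r"name\tvalue", line)

# \x41 is the character with hex code 41, which is "A".
hex_match = re.search(r"\x41BC", "xxABCxx")

print(tab_match is not None)  # True
print(hex_match.group(0))     # ABC
```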
Metacharacters - Predefined classes
Predefined character classes match any of a certain subset of characters. The following classes are already defined for you.
Class | Meaning |
---|---|
. | any single character |
\w | any alphanumeric character or _ |
\W | any character other than an alphanumeric character or _ |
\d | any numeric character (0-9) |
\D | any non-numeric character |
\s | any whitespace (space, tab) |
You can also construct your own character classes by surrounding a group of characters in brackets "[]". The predefined classes (except ".") can be used inside the brackets, and a dash "-" between two characters represents a range. Thus [a-z] represents all lowercase letters, and [a-zA-Z] represents both lowercase and uppercase letters. To include a literal "-" in the class, place it at the beginning or end of the class.
If the first character within the brackets is a caret "^", then the class represents everything except the specified characters. [^a-z] matches on any character that isn't a lowercase alphabetic character.
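The custom classes described above can be tried out with a short sketch. It uses Python's `re` module, which is assumed here only for illustration and is not Beyond Compare's engine; the sample text is made up.

```python
import re

text = "Beyond-Compare 4"

# Runs of lowercase letters only.
lower = re.findall(r"[a-z]+", text)

# Runs of letters in either case.
letters = re.findall(r"[a-zA-Z]+", text)

# A "-" placed at the end of the class is a literal dash, not a range.
with_dash = re.findall(r"[a-zA-Z-]+", text)

# A leading "^" negates the class: every character that is not a-z.
not_lower = re.findall(r"[^a-z]", text)

print(lower)      # ['eyond', 'ompare']
print(letters)    # ['Beyond', 'Compare']
print(with_dash)  # ['Beyond-Compare']
print(not_lower)  # ['B', '-', 'C', ' ', '4']
```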
Metacharacters - Alternatives
Placing a "|" between two groups of items represents alternative matches. a|b will match either a or b. ab|cd will match "ab" or "cd", but not "ac". Each alternative extends from the nearest pattern delimiter ("(", "[", or the start of the pattern) up to the "|", and from the "|" to the end of the pattern. Alternatives can be placed within parentheses "()" to make it obvious what is being matched, as in a(bc|de)f. Alternatives are tried left to right, and the first one that matches is used. bey|beyond will match on bey, even if the string is "beyond".
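The alternation examples above can be checked with a small sketch. Python's `re` module is assumed here for illustration; like the behavior described, it tries alternatives left to right.

```python
import re

either = re.search(r"ab|cd", "xxcdxx")
no_mix = re.search(r"ab|cd", "ac")
grouped = re.search(r"a(bc|de)f", "adef")
first_wins = re.search(r"bey|beyond", "beyond")

print(either.group(0))      # cd
print(no_mix)               # None
print(grouped.group(0))     # adef
print(first_wins.group(0))  # bey
```

Writing the longer alternative first (beyond|bey) is one way to make the whole word match when it is present.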
Metacharacters - Position
The following metacharacters control where the match can occur on a line. Note: \A and \Z match the start and end of text respectively, but since Beyond Compare performs the search on a line by line basis, these have the same effect as ^ and $.
Metacharacter | Meaning |
---|---|
^ | match only at start of line |
$ | match only at end of line |
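To see the position metacharacters in action, the sketch below imitates a line-by-line search by splitting sample text into lines first. Python's `re` module and the sample text are assumptions made for illustration; this is not how Beyond Compare is implemented.

```python
import re

text = "alpha beta\nbeta alpha"

results = []
for line in text.splitlines():
    # Each line is searched on its own, so ^ and $ anchor to that line.
    starts = re.search(r"^alpha", line) is not None
    ends = re.search(r"alpha$", line) is not None
    results.append((line, starts, ends))

for line, starts, ends in results:
    print(line, starts, ends)
# alpha beta True False
# beta alpha False True
```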
Metacharacters - Iterators
Any item in a regular expression can be followed by an iterator metacharacter, which applies to the item before it. There are two kinds of iterators: greedy and nongreedy. Greedy iterators match as many times as they can; nongreedy iterators match as few times as they can.
Greedy:
Metacharacter | Meaning |
---|---|
* | matches zero or more times (equivalent to {0,}) |
+ | matches one or more times (equivalent to {1,}) |
? | matches zero or one times (equivalent to {0,1}) |
{n} | matches exactly n times (equivalent to {n,n}) |
{n,} | matches n or more times |
{n,m} | matches at least n but no more than m times |
Nongreedy:
Metacharacter | Meaning |
---|---|
*? | matches zero or more times |
+? | matches one or more times |
?? | matches zero or one time |
{n}? | matches exactly n times |
{n,}? | matches at least n times |
{n,m}? | matches at least n but no more than m times |
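The difference between greedy and nongreedy iterators is easiest to see side by side. The sketch below assumes Python's `re` module for illustration; the sample strings are made up.

```python
import re

text = "<b>bold</b>"

# Greedy: .+ grabs as much as it can, so the match runs to the last ">".
greedy = re.search(r"<.+>", text).group(0)

# Nongreedy: .+? grabs as little as it can, so the match stops at the first ">".
lazy = re.search(r"<.+?>", text).group(0)

digits = "12345"
greedy_range = re.search(r"\d{2,4}", digits).group(0)
lazy_range = re.search(r"\d{2,4}?", digits).group(0)

print(greedy)        # <b>bold</b>
print(lazy)          # <b>
print(greedy_range)  # 1234
print(lazy_range)    # 12
```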
Metacharacters - Subexpressions
Parentheses "()" can also be used to group characters for use with iterators and backreferences (discussed below). (bey){4,5} will match 4 or 5 consecutive instances of "bey". (abc|[0-9])* will match any combination of "abc" and the digits 0 to 9 (e.g. "abc5", "679abc" and "abc77abc").
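A short sketch of grouping with iterators, assuming Python's `re` module for illustration. It uses `re.fullmatch` so that each result reflects the whole string rather than a partial match.

```python
import re

# (bey){4,5} needs 4 or 5 consecutive copies of "bey".
four_ok = re.fullmatch(r"(bey){4,5}", "bey" * 4) is not None
three_ok = re.fullmatch(r"(bey){4,5}", "bey" * 3) is not None

# (abc|[0-9])* accepts any sequence built from "abc" and single digits.
samples = ["abc5", "679abc", "abc77abc", "ab5"]
combos = [s for s in samples if re.fullmatch(r"(abc|[0-9])*", s)]

print(four_ok)   # True
print(three_ok)  # False
print(combos)    # ['abc5', '679abc', 'abc77abc']
```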
Metacharacters - Back references
Each sequence of characters matched within "()" is saved as a subexpression, which you can refer to later with \1 to \9. The subexpressions are numbered from left to right. b(.)\1n will match "been" and "boon", but not "bean", "ben" or "beeen".
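The back reference example above can be verified with a small sketch. Python's `re` module is assumed for illustration; it uses the same \1 notation.

```python
import re

# (.) captures one character; \1 requires that same character again.
pattern = re.compile(r"b(.)\1n")

words = ["been", "boon", "bean", "ben", "beeen"]
matched = [w for w in words if pattern.search(w)]

print(matched)  # ['been', 'boon']
```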
Modifiers
Modifiers change the matching behavior from that point in the pattern onward. If the modifier is contained within a subexpression, it affects only that subexpression. Use (?i) and (?-i) to control the case sensitivity of matching.
Examples:
(?i)Beyond Compare | matches both "Beyond Compare" and "beyond compare" |
(?i)Beyond (?-i)Compare | matches "Beyond Compare" and "bEyOnD Compare", but not "beyond compare" |
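The same idea can be sketched in Python's `re` module, assumed here for illustration only. One difference to note: Python does not accept a bare (?-i) in the middle of a pattern, so the sketch uses the scoped form (?-i:...) to turn case sensitivity back on for "Compare".

```python
import re

# (?i) at the start makes the whole pattern case insensitive.
insensitive = re.compile(r"(?i)Beyond Compare")

# (?-i:...) restores case sensitivity for the enclosed part only.
mixed = re.compile(r"(?i)Beyond (?-i:Compare)")

print(insensitive.search("beyond compare") is not None)  # True
print(mixed.search("bEyOnD Compare") is not None)        # True
print(mixed.search("beyond compare") is not None)        # False
```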