C O N T E N T S
What Are Regular Expressions
Examples
Simple
Medium (Strange Incantations)
Hard (Magical Hieroglyphics)
Regular Expressions In Various Tools
Regular expression usage is explained by examples in the sections that follow. Most examples are presented as vi substitution commands or as grep file search commands, but they are representative, and the concepts apply equally to sed, awk, perl and other programs that support regular expressions. Have a look at Regular Expressions In Various Tools for examples of regular expression usage in other tools. A short explanation of vi's substitution command and syntax is provided at the end of this document.
In the simplest case, a regular expression looks like a standard search string. For example, the regular expression "testing" contains no metacharacters. It will match "testing" and "123testing" but it will not match "Testing".
To really make good use of regular expressions it is critical to understand
metacharacters. The table below lists metacharacters and a short explanation
of their meaning.
Metacharacter | Description |
---|---|
. | Matches any single character. For example, the regular expression r.t would match the strings rat, rut, and r t, but not root. |
$ | Matches the end of a line. For example, the regular expression weasel$ would match the end of the string "He's a weasel" but not the string "They are a bunch of weasels." |
^ | Matches the beginning of a line. For example, the regular expression ^When in would match the beginning of the string "When in the course of human events" but would not match "What and When in the". |
* | Matches zero or more occurrences of the character immediately preceding. For example, the regular expression .* means match any number of any characters. |
\ | This is the quoting character; use it to treat the following character as an ordinary character. For example, \$ is used to match the dollar sign character ($) rather than the end of a line. Similarly, the expression \. is used to match the period character rather than any single character. |
[c1-c2] [^c1-c2] | Matches any one of the characters between the brackets. For example, the regular expression r[aou]t matches rat, rot, and rut, but not ret. Ranges of characters can be specified by using a hyphen. For example, the regular expression [0-9] means match any digit. Multiple ranges can be specified as well. The regular expression [A-Za-z] means match any upper or lower case letter. To match any character except those in the range (the complement range), use the caret as the first character after the opening bracket. For example, the expression [^269A-Z] will match any character except 2, 6, 9, and upper case letters. |
\< \> | Matches the beginning (\<) or end (\>) of a word. For example, \<the matches on "the" in the string "for the wise" but does not match "the" in "otherwise". NOTE: this metacharacter is not supported by all applications. |
\( \) | Treat the expression between \( and \) as a group. Also, saves the characters matched by the expression into temporary holding areas. Up to nine pattern matches can be saved in a single regular expression. They can be referenced as \1 through \9. |
vertical bar (|) | Or two conditions together. For example, (him|her) matches the line "it belongs to him" and matches the line "it belongs to her" but does not match the line "it belongs to them." NOTE: this metacharacter is not supported by all applications. |
+ | Matches one or more occurrences of the character or regular expression immediately preceding. For example, the regular expression 9+ matches 9, 99, 999. NOTE: this metacharacter is not supported by all applications. |
? | Matches 0 or 1 occurrence of the character or regular expression immediately preceding. NOTE: this metacharacter is not supported by all applications. |
\{i,j\} | Match a specific number of instances, or a number of instances within a range, of the preceding character. For example, the expression A[0-9]\{3\} will match "A" followed by exactly 3 digits. That is, it matches A123; in A1234 it matches only the first four characters, so anchor the pattern or follow it with a non-digit if longer numbers must be rejected. The expression [0-9]\{4,6\} matches any sequence of 4, 5, or 6 digits. NOTE: this metacharacter is not supported by all applications. |
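The examples in the table can be checked mechanically. The sketch below uses Python's re module as a stand-in, which is an assumption on our part: the document's own tools are vi and grep, and Python writes some metacharacters differently (grouping is ( ) rather than \( \), for instance). The helper name matches is made up for illustration.

```python
import re

def matches(pattern, text):
    """Return True if the pattern is found anywhere in text."""
    return re.search(pattern, text) is not None

# Each entry: (pattern, text that should match, text that should not).
checks = [
    (r"r.t", "r t", "root"),                      # . is any single character
    (r"weasel$", "He's a weasel", "They are a bunch of weasels."),
    (r"^When in", "When in the course of human events", "What and When in the"),
    (r"r[aou]t", "rot", "ret"),                   # one character from the set
    (r"\$", "cost: $5", "cost: 5"),               # quoted dollar sign
]

for pattern, good, bad in checks:
    print(pattern, matches(pattern, good), matches(pattern, bad))
```

Every row should report a match for the first string and no match for the second.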
The simplest metacharacter is the dot. It matches any one character (excluding the newline character). Consider a file named test.txt consisting of the following lines:
To match characters at the beginning of a line use the circumflex character (sometimes called a caret). For example, to find the lines in the file test.txt that begin with "he", you might first think to use the simple expression he. However, this would also match the "he" inside "the" in the third line. The regular expression ^he matches only an "he" at the beginning of a line.
Sometimes it is easier to indicate what should not be matched rather than all the cases that should be matched. When the circumflex is the first character between the square brackets, it means to match any character which is not in the range. For example, to match he when it is not preceded by t or s, the following regular expression can be used: [^st]he.
Several character ranges can be specified between the square brackets. For example, the regular expression [A-Za-z] matches any letter in the alphabet, upper or lower case. The regular expression [A-Za-z][A-Za-z]* matches a letter followed by zero or more letters. We can use the + metacharacter to do the same thing. That is, the regular expression [A-Za-z]+ means the same thing as [A-Za-z][A-Za-z]*. Note that the + metacharacter is not supported by all programs that have regular expressions. See Regular Expressions Syntax Support for more details.
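The equivalence of the two forms can be confirmed with a quick experiment. This is a minimal sketch in Python (an assumed stand-in for the tools discussed here); the sample strings are invented.

```python
import re

plus_form = re.compile(r"[A-Za-z]+")
star_form = re.compile(r"[A-Za-z][A-Za-z]*")

# Invented sample strings: mixed letters and digits, digits only, two words, empty.
samples = ["abc123", "42", "Hello World", ""]

for s in samples:
    # Both patterns should find exactly the same runs of letters.
    print(repr(s), plus_form.findall(s), star_form.findall(s))
```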
To specify the number of occurrences matched, use the braces (they must be escaped with a backslash). As an example, to match instances of 100 and 1000 but not 10, use the following: 10\{2,3\}. This regular expression matches the digit 1 followed by either 2 or 3 0's. Note that, unanchored, the pattern also matches the leading 1000 inside 10000; where word anchors are supported, \<10\{2,3\}\> excludes such longer numbers. A useful variation is to omit the second number. For example, the regular expression 0\{3,\} will match 3 or more successive 0's.
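The behavior of interval expressions is easy to try out. The sketch below again assumes Python as a stand-in; note that Python writes the interval as {2,3} without backslashes and uses \b for a word boundary rather than \< and \>.

```python
import re

text = "10 100 1000 10000"

# Unanchored: also picks up the first four characters of 10000.
unanchored = re.findall(r"10{2,3}", text)

# Word boundaries on both sides reject 10 and 10000.
bounded = re.findall(r"\b10{2,3}\b", text)

# Open-ended interval: three or more successive zeros.
three_or_more = re.findall(r"0{3,}", text)

print(unanchored)
print(bounded)
print(three_or_more)
```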
vi command | What it does |
---|---|
:%s/  */ /g | Change 1 or more spaces into a single space. (The pattern is two spaces followed by *, that is, a space followed by zero or more spaces.) |
:%s/ *$// | Remove all spaces from the end of the line. |
:%s/^/ / | Insert a space at the beginning of every line. |
:%s/^[0-9][0-9]* // | Remove the number, and the space that follows it, from the beginning of a line. |
:%s/b[aeio]g/bug/g | Change all occurrences of bag, beg, big, and bog to bug. |
:%s/t\([aou]\)g/h\1t/g | Change all occurrences of tag, tog, and tug to hat, hot, and hut respectively. |
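For readers without vi at hand, the same substitutions can be tried in Python. This is only a rough equivalent under the assumption that re.sub stands in for :%s; Python groups with ( ) instead of \( \), and re.sub replaces every match by default, like the g option. The input strings are invented.

```python
import re

# A space followed by zero or more spaces becomes a single space.
squeezed = re.sub(r"  *", " ", "a   b  c")

# Trailing spaces are removed.
trimmed = re.sub(r" *$", "", "trailing   ")

# A leading number and the space after it are removed.
unnumbered = re.sub(r"^[0-9][0-9]* ", "", "42 apples")

# bag, beg, big, bog all become bug.
bugs = re.sub(r"b[aeio]g", "bug", "bag beg big bog")

# The vowel is captured as group 1 and reused in the replacement.
hats = re.sub(r"t([aou])g", r"h\1t", "tag tog tug")

print(squeezed, trimmed, unnumbered, bugs, hats, sep="\n")
```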
Before | After |
---|---|
foo(10,7,2) | foo(7,10,2) |
foo(x+13,y-2,10) | foo(y-2,x+13,10) |
foo( bar(8), x+y+z, 5) | foo( x+y+z, bar(8), 5) |
The following substitution command will do the trick:
:%s/foo(\([^,]*\),\([^,]*\),\([^)]*\))/foo(\2,\1,\3)/g
[^,] | means any character which is not a comma | |
[^,]* | means 0 or more characters which are not commas | |
\([^,]*\) | tags the non-comma characters as \1 for use in the replacement part of the command | |
\([^,]*\), | means that we must match 0 or more non-comma characters which are followed by a comma. The non-comma characters are tagged. |
This is a good time to point out one of the most common problems people have with regular expressions. Why would we use an expression like [^,]*, instead of something more straightforward like .*, to match the first parameter? Consider applying the pattern .*, to the string "10,7,2". Should it match "10," or "10,7," ? To resolve this ambiguity, regular expressions will always match the longest string possible. In this case "10,7," which covers two parameters instead of one parameter like we want. So, by using the expression [^,]*, we force the pattern to match all characters up to the first comma.
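The longest-match rule can be seen directly. A minimal sketch, assuming Python's re module behaves like the tools discussed here in this respect (its * is likewise greedy):

```python
import re

params = "10,7,2"

# .*, runs to the LAST comma it can reach.
greedy = re.match(r".*,", params).group(0)

# [^,]*, cannot cross a comma, so it stops at the first one.
first_only = re.match(r"[^,]*,", params).group(0)

print(repr(greedy))
print(repr(first_only))
```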
The expression up to this point is: foo(\([^,]*\), and can be roughly translated as "after you find foo( tag all characters up to the next comma as \1". We tag the second parameter just like the first and it can be referenced as \2. The tag used on the third parameter is exactly like the others except that we search for all characters up to the right parenthesis. It may be superfluous to search for the last parameter since we don't have to move it. But this pattern guarantees that we update only those instances of foo() where 3 parameters are specified. In these times of function and method overloading, being explicit often proves to be useful. In the substitution portion of the command, we explicitly enter the invocation of foo() as we want it, referencing the matched patterns in the new order where the first and second parameter have been switched.
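The parameter swap described above can be sketched in Python as well. This is an illustration under assumptions, not the original vi command: Python reverses vi's convention, so \( and \) are literal parentheses while ( ) tags a group. The names FOO_CALL and swap_first_two are hypothetical.

```python
import re

# foo( <non-commas> , <non-commas> , <anything up to the closing paren> )
FOO_CALL = re.compile(r"foo\(([^,]*),([^,]*),([^)]*)\)")

def swap_first_two(line):
    """Rewrite three-parameter foo() calls with parameters 1 and 2 exchanged."""
    return FOO_CALL.sub(r"foo(\2,\1,\3)", line)

for before in ["foo(10,7,2)", "foo(x+13,y-2,10)", "foo( bar(8), x+y+z, 5)"]:
    print(before, "->", swap_first_two(before))
```

A call with only two parameters, such as foo(1,2), is left untouched because the pattern requires two commas.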
Here are a few lines from the data we have:
Here is the first pass at a substitution command that will solve the problem:
The following substitution command will remove the excess spaces:
Billy tried really hard
Sally tried really really hard
Timmy tried really really really hard
Johnny tried really really really really hard
Now suppose you want to change "really", "really really", and any number of consecutive "really" strings to a single word: "very". The command
:%s/\(really \)\(really \)*/very /
changes the text above to:
Billy tried very hard
Sally tried very hard
Timmy tried very hard
Johnny tried very hard
The expression \(really \)* matches 0 or more sequences of "really ". The sequence \(really \)\(really \)* matches one or more instances of the sequence "really ".
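The same replacement can be reproduced outside vi. A minimal sketch in Python (an assumed stand-in); count=1 mirrors the vi command above, which has no g option and so replaces only the first match on each line.

```python
import re

lines = [
    "Billy tried really hard",
    "Sally tried really really hard",
    "Timmy tried really really really hard",
    "Johnny tried really really really really hard",
]

# One "really " followed by zero or more further "really " groups.
pattern = re.compile(r"(really )(really )*")

result = [pattern.sub("very ", line, count=1) for line in lines]

for line in result:
    print(line)
```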
You can use regular expressions in the Visual C++ editor. Select Edit->Replace, then be sure to check the checkbox labeled "Regular expression". For vi expressions of the form :%s/pat1/pat2/g, set the Find What field to pat1 and the Replace with field to pat2. To simulate the range (% in this case) and the g option, use the Replace All button or appropriate combinations of Find Next and Replace.
Here are a few interesting sed scripts. Assume that we're processing
a file called price.txt. Note that the edits don't actually happen to the
input file; sed simply processes each line of the file with the command
you supply and echoes the result to its standard output.
sed script | Description |
---|---|
sed '/^$/d' price.txt | removes all empty lines |
sed '/^[ \t]*$/d' price.txt | removes all lines containing only whitespace (\t is understood by GNU sed; other versions may need a literal tab character) |
sed 's/"//g' price.txt | removes all quotation marks |
There are many good awk examples in the book The AWK Programming
Language (written by Aho, Weinberger, and Kernighan). Please don't
form any broad opinions about awk's capabilities based on the following
trivial sample scripts. For purposes of these examples, assume that we're
working with a file called price.txt. As with sed, awk simply echoes its
output to its standard output.
awk script | Description |
---|---|
awk '$0 !~ /^$/' price.txt | removes all empty lines |
awk 'NF > 0' price.txt | a better way to remove all empty lines in awk (it also removes lines that contain only whitespace) |
awk '$2 ~ /^[JT]/ {print $3}' price.txt | print the third field of all lines whose second field begins with 'J' or 'T' |
awk '$2 !~ /[Mm]isc/ {print $3 + $4}' price.txt | for all lines whose second field does not contain 'Misc' or 'misc', print the sum of columns 3 and 4 (assumed to be numbers) |
awk '$3 !~ /^[0-9]+\.[0-9]*$/ {print $0}' price.txt | print all lines where field 3 is not a number. The number must be of the form d.d or d. where d is any number of digits from 0 to 9. |
awk '$2 ~ /John|Fred/ {print $0}' price.txt | print the entire line if the second field contains 'John' or 'Fred' |
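The number test used in the awk table is stricter than it may look, since it requires a decimal point. The sketch below tries the same pattern in Python (an assumed stand-in for awk); the field values are invented.

```python
import re

# Same pattern as the awk example: digits, a mandatory dot, optional digits.
number = re.compile(r"^[0-9]+\.[0-9]*$")

# Invented field values.
fields = ["12.50", "7.", "7", ".5", "abc", "1.2.3"]

# Mirrors awk's !~ operator: keep the fields that do NOT match.
non_numbers = [f for f in fields if not number.search(f)]

print(non_numbers)
```

Plain integers such as 7 and values such as .5 are reported as non-numbers, which may or may not be what a given price file needs.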
For the examples below, assume we have the text below in a file named phone.txt. Its format is last name followed by a comma, first name followed by a tab, then a phone number.
grep command | Description |
---|---|
grep '\t5-...1' phone.txt | print all the lines in phone.txt where the phone number begins with 5 and ends with 1. Note that the tab character is represented by \t here; many grep versions do not interpret \t, in which case type a literal tab character instead. |
grep '^S[^ ]* R' phone.txt | print lines where the last name begins with S and first name begins with R. |
grep '^[JW]' phone.txt | print lines where the last name begins with J or W. |
grep ', ....\t' phone.txt | print lines where the first name is 4 characters. The tab character is represented by \t. |
grep -v '^[JW]' phone.txt | print lines that do not begin with J or W. |
grep '^[M-Z]' phone.txt | print lines where the last name begins with any letter from M to Z. |
grep '^[M-Z].*[12]$' phone.txt | print lines where the last name begins with a letter from M to Z and where the phone number ends with a 1 or 2. |
egrep command | Description |
---|---|
egrep '(John|Fred)' phone.txt | print all lines that contain the name John or Fred. |
egrep 'John|22$|^W' phone.txt | print lines that contain John or that end with 22 or that begin with W. |
egrep 'net(work)?s' report.txt | print lines in report.txt that contain networks or nets. |
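The egrep patterns above use the same extended syntax that Python's re module accepts, so they can be tried unchanged. This is a minimal sketch; the sample lines are invented and do not come from phone.txt or report.txt.

```python
import re

# Invented sample lines.
lines = [
    "the networks are down",
    "fishing nets",
    "network outage",
    "John Smith",
    "ends with 22",
    "Wilson",
]

# Optional group: "nets" or "networks", but not "network" on its own.
nets = [l for l in lines if re.search(r"net(work)?s", l)]

# Alternation: contains John, or ends with 22, or begins with W.
alt = [l for l in lines if re.search(r"John|22$|^W", l)]

print(nets)
print(alt)
```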
Command or Environment | . | [ ] | ^ | $ | \( \) | \{ \} | ? | + | | | ( ) |
vi | X | X | X | X | X | |||||
Visual C++ | X | X | X | X | X | |||||
awk | X | X | X | X | X | X | X | X | ||
sed | X | X | X | X | X | X | ||||
Tcl | X | X | X | X | X | X | X | X | X | |
ex | X | X | X | X | X | X | ||||
grep | X | X | X | X | X | X | ||||
egrep | X | X | X | X | X | X | X | X | X | |
fgrep | X | X | X | X | X | |||||
perl | X | X | X | X | X | X | X | X | X |
s is the substitution command.
pat1 is the regular expression to be searched for. This paper is full of examples.
g is optional. When present the substitution is made to all matches on the line. When it is not present, the substitution is applied only to the first match on the line.