OPS102 - Regular Expressions
Why Use Regular Expressions?
Regular Expressions can be a little daunting to learn: they often look like someone was just bashing their head against the keyboard (or, like a cat was lying on the keyboard). But they are very powerful - a well-written regular expression can replace many pages of code in a programming language such as C or C++ - and so it is worth investing some time to understand them.
The Seven Basic Elements of Regular Expressions
Characters
In a regular expression (regexp), any character that doesn't otherwise have a special meaning matches that character. So the digit "5", for example, matches the digit "5"; similarly, "cat" matches the letters "c", "a", and "t" in sequence.
A backslash can be used to remove any special meaning which a character has. The period character "." is a type of wildcard (see below), so to search for a literal period, we place a backslash in front of it: "\."
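The difference between a literal character and an escaped special character can be sketched in Python, whose `re` module follows the same rules for these two cases. The sample lines are invented for illustration.

```python
import re

lines = ["cat", "concatenate", "dog", "version 3.5", "version 345"]

# Ordinary characters match themselves, anywhere in the line.
with_cat = [s for s in lines if re.search(r"cat", s)]

# An unescaped period is a wildcard, so "3.5" also matches "345".
unescaped = [s for s in lines if re.search(r"3.5", s)]

# "\." matches only a literal period.
escaped = [s for s in lines if re.search(r"3\.5", s)]

print(with_cat)   # ['cat', 'concatenate']
print(unescaped)  # ['version 3.5', 'version 345']
print(escaped)    # ['version 3.5']
```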
Wildcards
A period "." will match any single character. Similarly, three periods "..." will match any three characters.
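A small Python sketch of the wildcard, using made-up sample words. One difference worth knowing: by default Python's "." does not match a newline character, which does not matter when each string is a single line.

```python
import re

words = ["cat", "cot", "ct", "coat", "c t"]

# "c.t": a "c", then any one character, then a "t".
one_between = [w for w in words if re.search(r"c.t", w)]

# "...": any three characters in a row, so "ct" is too short.
three_chars = [w for w in words if re.search(r"...", w)]

print(one_between)  # ['cat', 'cot', 'c t']
print(three_chars)  # ['cat', 'cot', 'coat', 'c t']
```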
Bracket Expressions / Character Classes
Bracket Expressions or Character Classes are contained in square brackets "[ ]":
- A list of characters in square brackets will match any one character from the list: "[abc]" will match "a", "b", or "c".
- A range of characters in square brackets, written as a starting character, a dash, and an ending character, will match any one character in that range: "[0-9]" will match any one digit.
- There are some pre-defined named character classes. These are selected by specifying the name of the character class surrounded by colons and square brackets, placed within outer square brackets, like "[[:digit:]]". The available names are:
  - alnum - alphanumeric characters
  - alpha - alphabetic characters
  - blank - horizontal whitespace (space, tab)
  - cntrl - control characters
  - digit - digits
  - graph - letters, digits, and punctuation
  - print - letters, digits, punctuation, and space
  - punct - punctuation marks
  - space - horizontal and vertical whitespace (space, tab, newline, vertical tab, form feed, carriage return)
  - upper - UPPERCASE letters
  - lower - lowercase letters
  - xdigit - hexadecimal digits (digits plus a-f and A-F)
- Ranges, lists, and named character classes may be combined, e.g.:
  - "[[:digit:]+.,-]"
  - "[[:digit:][:punct:]]"
  - "[0-9_*]"
- To invert a character class, add a caret ^ character as the first character after the opening square bracket: "[^[:digit:]]" matches any non-digit character, and "[^:]" matches any character that is not a colon.
- To include a literal caret, place it anywhere except first in the character class (for example, at the end). To include a literal closing square bracket, place it at the start of the character class. To include a literal dash, place it at the start or the end, so that it cannot be read as part of a range.
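The list, range, and inversion rules can be tried out in Python. Note that Python's `re` module does not support the POSIX named classes such as "[[:digit:]]", so this sketch writes the explicit range "[0-9]" instead; with a POSIX tool such as `grep -E` the named form would be used as shown above. The sample characters and the `matching` helper are invented for illustration.

```python
import re

samples = ["a", "7", "_", ":", "-", "Q"]

def matching(pattern):
    """Return the sample characters that the bracket expression matches."""
    return [s for s in samples if re.fullmatch(pattern, s)]

in_list = matching(r"[abc]")       # a list of characters
in_range = matching(r"[0-9]")      # a range
combined = matching(r"[0-9_*]")    # a range plus a list
not_digit = matching(r"[^0-9]")    # inverted class
not_colon = matching(r"[^:]")      # inverted single character
with_dash = matching(r"[-0-9]")    # literal dash placed first

print(in_list)    # ['a']
print(in_range)   # ['7']
print(combined)   # ['7', '_']
print(not_digit)  # ['a', '_', ':', '-', 'Q']
print(not_colon)  # ['a', '7', '_', '-', 'Q']
print(with_dash)  # ['7', '-']
```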
Repetition
- A repeat count can be placed in curly brackets. It applies to the previous element: "x{3}" matches "xxx".
- A repeat can be a range, written as min,max in curly brackets: "x{2,5}" will match "xx", "xxx", "xxxx", or "xxxxx".
- The maximum value in a range can be omitted: "x{2,}" will match two or more "x" characters in a row.
- There are short forms for some commonly-used ranges:
  - "*" is the same as "{0,}" (zero or more)
  - "+" is the same as "{1,}" (one or more)
  - "?" is the same as "{0,1}" (zero or one)
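The repetition forms above can be compared side by side in Python. This sketch uses `re.fullmatch` so that each pattern must account for the whole string; the candidate strings and the `whole` helper are made up for the example.

```python
import re

candidates = ["", "x", "xx", "xxx", "xxxxx", "xxxxxx"]

def whole(pattern):
    """Return the candidates that the pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

exactly_three = whole(r"x{3}")
two_to_five = whole(r"x{2,5}")
two_or_more = whole(r"x{2,}")
zero_or_more = whole(r"x*")
one_or_more = whole(r"x+")
zero_or_one = whole(r"x?")

print(exactly_three)  # ['xxx']
print(two_to_five)    # ['xx', 'xxx', 'xxxxx']
print(two_or_more)    # ['xx', 'xxx', 'xxxxx', 'xxxxxx']
print(zero_or_one)    # ['', 'x']
```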
Alternation
- The vertical bar indicates alternation - either the expression on the left or the right can be matched: "hot|cold" will match "hot" or "cold".
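A brief Python sketch of alternation, with invented sample lines. Because the pattern is not anchored, it also finds "cold" inside a longer word.

```python
import re

lines = ["hot tea", "cold brew", "lukewarm water", "scolded"]

# Either alternative may appear anywhere in the line.
hot_or_cold = [s for s in lines if re.search(r"hot|cold", s)]

print(hot_or_cold)  # ['hot tea', 'cold brew', 'scolded']
```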
Grouping
- Elements placed in parentheses are treated as a group, and can be repeated: "(na)* batman" will match "nananana batman" and "nananananananana batman".
- Grouping may also be used to limit alternation: "(fire|green)house" will match "firehouse" and "greenhouse".
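The effect of grouping can be seen by comparing a grouped and an ungrouped alternation in Python. This is a sketch with made-up test words; `re.fullmatch` is used so the whole string must match.

```python
import re

# A group can be repeated as a unit; zero repetitions are also allowed by "*".
repeated = bool(re.fullmatch(r"(na)* batman", "nananana batman"))
no_na = bool(re.fullmatch(r"(na)* batman", " batman"))
half_group = bool(re.fullmatch(r"(na)* batman", "nan batman"))

words = ["firehouse", "greenhouse", "house", "fire"]

# Without parentheses, the alternatives are "fire" and "greenhouse".
ungrouped = [w for w in words if re.fullmatch(r"fire|greenhouse", w)]

# With parentheses, the alternation is limited to the first part of the word.
grouped = [w for w in words if re.fullmatch(r"(fire|green)house", w)]

print(repeated, no_na, half_group)  # True True False
print(ungrouped)  # ['greenhouse', 'fire']
print(grouped)    # ['firehouse', 'greenhouse']
```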
Anchors
- Anchors match locations, not characters.
- A caret symbol will match the start of a line: "^[[:upper:]]" will match lines that start with an uppercase letter.
- A dollar sign will match the end of a line: "[[:punct:]]$" will match lines that end with a punctuation mark.
- The two characters may be used together: "cat" will match the word "cat" anywhere on a line, but "^cat$" will only match lines that contain only the word "cat". Likewise, "^[0-9.]+$" will match lines that are made up of only digits and dot characters.
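The anchors can be sketched in Python with a small grep-like helper (the `grep` function and the sample lines are invented here). Python's `re` has no POSIX named classes, so "[A-Z]" stands in for "[[:upper:]]" and a short explicit list stands in for "[[:punct:]]". Note the "+" in the last pattern: a bracket expression on its own matches exactly one character, so without a repeat the anchored pattern would only match one-character lines.

```python
import re

lines = ["cat", "concatenate", "Cat nap", "3.14", "3.14 approx", "What?"]

def grep(pattern):
    """Return the lines in which the pattern is found."""
    return [s for s in lines if re.search(pattern, s)]

anywhere = grep(r"cat")               # unanchored
whole_line = grep(r"^cat$")           # anchored at both ends
starts_upper = grep(r"^[A-Z]")        # start-of-line anchor
ends_punct = grep(r"[?!.]$")          # end-of-line anchor
digits_and_dots = grep(r"^[0-9.]+$")  # only digits and dots on the line

print(anywhere)         # ['cat', 'concatenate']
print(whole_line)       # ['cat']
print(starts_upper)     # ['Cat nap', 'What?']
print(ends_punct)       # ['What?']
print(digits_and_dots)  # ['3.14']
```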
Examples
Description | Regexp | Matches | Does not match | Comments |
---|---|---|---|---|
A specific word | `Hello` | Hello<br>Hello there!<br>Hello, World!<br>He said, "Hello James", in a very threatening tone | Hi there<br>Hell of a Day<br>h el lo | |
A specific word with nothing else on the line | `^Hello$` | Hello | Hello there!<br>Hello, World!<br>He said, "Hello James", in a very threatening tone<br>Hi there<br>Hell of a Day<br>h el lo | |
5-character line | `^.....$` | rouge<br>green<br>Ho-ho | Yellow<br>long line<br>tiny<br>12-45-78 | |
Lines that start with a vowel | `^[AEIOUYaeiouy]` | Allo<br>Everything<br>Energy<br>Under<br>Yellow | Hello<br>White<br>4164915050<br>Grinch | |
Lines that end in a punctuation mark | `[[:punct:]]$` | Hello there!<br>Thanks.<br>What do you think? | Hello there<br>416-491-5050<br>New Year greetings | |
An integer | `^[-+]?[[:digit:]]+$` | +15<br>-2<br>720<br>1440<br>1280<br>1920<br>000<br>012 | + 4<br>3.14<br>0x47<br>$1.13<br>$4 | |
A decimal number | `^[-+]?[[:digit:]]+(\.[[:digit:]]*)?$` | +3.14<br>42<br>-1000.0<br>+212<br>+36.7<br>42.00<br>3.333333333<br>0.976 | .976<br>+-200<br>1.1.1.1<br>13.4.7 | |
A Canadian Postal Code | `^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ] ?[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$` | H0H 0H0<br>M3C 1L2<br>K1A 0A2<br>T2G 0P3<br>V8W 9W2<br>R3B 0N2<br>M2J2X5<br>M5S 2C6 | POB 1L0<br>90210<br>MN4 2R6 | A Canadian postal code alternates between letters and digits: A9A 9A9. The first letter must be one of ABCEGHJKLMNPRSTVXY and the remaining letters must be one of ABCEGHJKLMNPRSTVWXYZ. |
Phone Numbers | `^[^+[:digit:]]*(\+?1)?[^+[:digit:]]*[2-9]([^+[:digit:]]*[0-9]){9}[^+[:digit:]]*$` | (416) 967-1111<br>+1 416-736-3636<br>416-439-0000 | +65 6896 2391<br>555-1212 | A Canadian phone number consists of a 3-digit Area Code (which may not start with 0 or 1) and a 7-digit local number consisting of an exchange (3 digits) and a line (4 digits). The country code for Canada (and the US) is 1, so the number may be preceded by +1 or 1. Area codes are sometimes contained in parentheses, and dashes or spaces are sometimes used as separators. |
IP Address (IPv4 dotted quad) | `^((25[0-5]\|2[0-4][0-9]\|1[0-9][0-9]\|[1-9]?[0-9])\.){3}(25[0-5]\|2[0-4][0-9]\|1[0-9][0-9]\|[1-9]?[0-9])$` | 1.1.1.1<br>4.4.8.8<br>8.8.8.8<br>7.12.9.43<br>10.106.32.109<br>172.16.97.1<br>192.168.0.1<br>255.255.255.0 | IP=67.69.105.143<br>1.10.100.1000<br>IP=100.150.200.250<br>103.271.92.16<br>1O.10.10.10 | An IPv4 address in "dotted quad" notation consists of four numbers in the range 0-255 separated by periods. The numbers are called "octets" (which means a collection of eight bits, an alternate way of saying "byte"). |
Private IP Address | `(10\.(25[0-5]\|2[0-4][0-9]\|1[0-9][0-9]\|[1-9]?[0-9])\|192\.168\|172\.(1[6-9]\|2[0-9]\|3[0-1]))\.(25[0-5]\|2[0-4][0-9]\|1[0-9][0-9]\|[1-9]?[0-9])\.(25[0-5]\|2[0-4][0-9]\|1[0-9][0-9]\|[1-9]?[0-9])` | 10.4.72.13<br>172.16.97.1<br>192.168.0.1<br>IP=192.168.113.42 | 1.1.1.1<br>4.4.8.8<br>192.169.12.6<br>192.168.400.37<br>Address is 1 . 2 . 3 . 4 | Valid IPv4 dotted quad address with a first octet of 10; or first two octets of 192.168; or first octet of 172 followed by a second octet in the range 16-31. |
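A few of the table's patterns can be checked mechanically. The following is a minimal Python sketch, not a definitive implementation: the names `OCTET`, `PATTERNS`, and `line_matches` are invented for this example, "[0-9]" replaces "[[:digit:]]" because Python's `re` module lacks POSIX named classes, and the octet sub-pattern is one common way (an assumption here) to accept exactly 0-255 without leading zeros. Each test string is assumed to be a single line with no trailing newline.

```python
import re

# Assumed octet pattern: 250-255, 200-249, 100-199, or 0-99 without leading zeros.
OCTET = r"(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"

PATTERNS = {
    "integer": r"^[-+]?[0-9]+$",
    "decimal": r"^[-+]?[0-9]+(\.[0-9]*)?$",
    "postal": (r"^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ] ?"
               r"[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$"),
    "ipv4": r"^(" + OCTET + r"\.){3}" + OCTET + r"$",
}

def line_matches(name, line):
    """Return True if the named pattern is found in the line."""
    return re.search(PATTERNS[name], line) is not None

# Print the result for a handful of strings taken from the table.
checks = [
    ("integer", "+15"), ("integer", "3.14"),
    ("decimal", "-1000.0"), ("decimal", ".976"),
    ("postal", "M2J2X5"), ("postal", "90210"),
    ("ipv4", "192.168.0.1"), ("ipv4", "103.271.92.16"),
]
for name, line in checks:
    print(name, repr(line), line_matches(name, line))
```

Building the longer patterns from a named piece such as `OCTET` keeps the repeated sub-expression in one place, which makes mistakes in it easier to spot.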