OPS102 - Regular Expressions
Chris Tyler
Revision as of 16:32, 5 December 2023
Why Use Regular Expressions?
Regular Expressions can be a little daunting to learn: they often look like someone was just bashing their head against the keyboard (or like a cat was lying on it). But they are very powerful - a well-written regular expression can replace many pages of code in a programming language such as C or C++ - and so it is worth investing some time to understand them.
The Seven Basic Elements of Regular Expressions
Characters
In a regular expression (regexp), any character that doesn't otherwise have a special meaning matches that character. So the digit "5", for example, matches the digit "5"; similarly, "cat" matches the letters "c", "a", and "t" in sequence.
A backslash can be used to remove any special meaning which a character has. The period character "." is a type of wildcard (see below), so to search for a literal period, we place a backslash in front of it: "\."
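As a small illustration, here is a sketch in Python (an assumption: this page does not prescribe a tool, and the helper name `found` is made up for the example). It shows ordinary characters matching themselves, and a backslash turning the period into a literal character.

```python
import re

def found(pattern, text):
    """Return True if the pattern matches anywhere in the text (hypothetical helper)."""
    return re.search(pattern, text) is not None

# Ordinary characters match themselves, in sequence.
print(found(r"cat", "concatenate"))    # True
# An unescaped period is a wildcard, so it also accepts the "x" here.
print(found(r"3.14", "3x14"))          # True
# With a backslash in front, only a literal period will do.
print(found(r"3\.14", "3x14"))         # False
print(found(r"3\.14", "pi is 3.14"))   # True
```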
Wildcards
A period "." will match any single character. Similarly, three periods "..." will match any three characters.
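The wildcard can be tried out with a short Python sketch (Python and the helper name `found` are assumptions made for illustration). Each period has to consume exactly one character, so a string that is too short does not match.

```python
import re

def found(pattern, text):
    """Return True if the pattern matches anywhere in the text (hypothetical helper)."""
    return re.search(pattern, text) is not None

print(found(r"c.t", "cat"))   # True
print(found(r"c.t", "c t"))   # True: a space is a character too
print(found(r"c.t", "ct"))    # False: the period needs one character to consume
print(found(r"...", "ab"))    # False: only two characters are available
print(found(r"...", "abc"))   # True
```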
Bracket Expressions / Character Classes
Bracket Expressions or Character Classes are contained in square brackets "[ ]":
- A list of characters in square brackets will match any one character from the list of characters: "[abc]" will match "a", "b", or "c"
- A range of characters in square brackets, written as a starting character, a dash, and an ending character, will match any one character in that range: "[0-9]" will match any one digit.
- There are some pre-defined named character classes. These are selected by specifying the name of the character class surrounded by colons and square brackets, placed within outer square brackets, like "[[:digit:]]". The available names are:
  - alnum - alphanumeric characters (letters and digits)
  - alpha - alphabetic characters
  - blank - horizontal whitespace (space, tab)
  - cntrl - control characters
  - digit - digits
  - graph - letters, digits, and punctuation
  - print - letters, digits, punctuation, and space
  - punct - punctuation marks
  - space - horizontal and vertical whitespace (space, tab, newline, carriage return, vertical tab, form feed)
  - upper - UPPERCASE letters
  - lower - lowercase letters
  - xdigit - hexadecimal digits (digits plus a-f and A-F)
- Ranges, lists, and named character classes may be combined - e.g., "[[:digit:]+.,-]", "[[:digit:][:punct:]]", "[0-9_*]"
- To invert a character class, add a caret ^ character as the first character after the opening square bracket: "[^[:digit:]]" matches any non-digit character, and "[^:]" matches any character that is not a colon.
- To include a literal caret, place it anywhere except the first position (for example, at the end of the character class). To include a literal closing square bracket, place it at the start of the character class. To include a literal dash, place it at the start or the end, so that it cannot be read as part of a range.
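The following Python sketch exercises these rules. It rests on some assumptions: Python is used only for illustration, the helper `first_match` is a made-up name, and because Python's `re` module does not support the named classes such as "[[:digit:]]", the example writes the explicit range "[0-9]" instead.

```python
import re

def first_match(pattern, text):
    """Return the first substring the pattern matches, or None (hypothetical helper)."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

print(first_match(r"[abc]", "xyzb"))      # b
print(first_match(r"[0-9]", "room 42"))   # 4
print(first_match(r"[^0-9]", "42nd"))     # n

# A combined class: digits plus a few punctuation marks, with the dash last
# so that it is taken literally rather than as a range.
print(first_match(r"[0-9+.,-]+", "total: -1,234.50 CAD"))   # -1,234.50

# Literal "]" placed first, literal "^" not first, literal "-" last.
print(first_match(r"[]^-]", "a]"))        # ]
print(first_match(r"[]^-]", "2^8"))       # ^
```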
Repetition
- A repeat count can be placed in curly brackets. It applies to the previous element: "x{3}" matches "xxx"
- A repeat can be a range, written as min,max in curly brackets: "x{2,5}" will match "xx", "xxx", "xxxx", or "xxxxx"
- The maximum value in a range can be omitted: "x{2,}" will match two or more "x" characters in a row
- There are short forms for some commonly-used ranges:
  - "*" is the same as "{0,}" (zero or more)
  - "+" is the same as "{1,}" (one or more)
  - "?" is the same as "{0,1}" (zero or one)
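The repeat counts can be checked with a short Python sketch (Python and the helper name `whole` are assumptions for illustration). It uses `re.fullmatch`, which requires the pattern to cover the entire string, so that a count that is too high or too low shows up as a failed match.

```python
import re

def whole(pattern, text):
    """Return True if the pattern matches the entire text (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

print(whole(r"x{3}", "xxx"))         # True
print(whole(r"x{2,5}", "x"))         # False: too few
print(whole(r"x{2,5}", "xxxxx"))     # True
print(whole(r"x{2,5}", "xxxxxx"))    # False: too many
print(whole(r"x{2,}", "xxxxxxxx"))   # True

# Each short form should agree with its curly-bracket equivalent.
PAIRS = [(r"ab*c", r"ab{0,}c"), (r"ab+c", r"ab{1,}c"), (r"ab?c", r"ab{0,1}c")]
for short, long in PAIRS:
    for text in ["ac", "abc", "abbc"]:
        assert whole(short, text) == whole(long, text)
```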
Alternation
- The vertical bar indicates alternation - either the expression on the left or the right can be matched: "hot|cold" will match "hot" or "cold"
Grouping
- Elements placed in parentheses are treated as a group, and can be repeated: "(na)* batman" will match "nananana batman" and "nananananananana batman"
- Grouping may also be used to limit alternation: "(fire|green)house" will match "firehouse" and "greenhouse"
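Alternation and grouping can be seen working together in the Python sketch below (Python and the helper name `whole` are assumptions for illustration). The last two lines show why the group matters: without parentheses, the vertical bar splits the whole expression into "fire" or "greenhouse".

```python
import re

def whole(pattern, text):
    """Return True if the pattern matches the entire text (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

print(whole(r"hot|cold", "cold"))                  # True
print(whole(r"(na)* batman", "nananana batman"))   # True
print(whole(r"(na)* batman", " batman"))           # True: zero repeats of the group
print(whole(r"(fire|green)house", "firehouse"))    # True
print(whole(r"(fire|green)house", "greenhouse"))   # True

# Without the group, the alternatives are "fire" and "greenhouse".
print(whole(r"fire|greenhouse", "firehouse"))      # False
print(whole(r"fire|greenhouse", "fire"))           # True
```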
Anchors
- Anchors match locations, not characters.
- A caret symbol will match the start of a line: "^[[:upper:]]" will match lines that start with an uppercase letter.
- A dollar sign will match the end of a line: "[[:punct:]]$" will match lines that end with a punctuation mark.
- The two characters may be used together: "cat" will match the word "cat" anywhere on a line, but "^cat$" will only match lines that contain only the word "cat". Likewise, "^[0-9.]+$" will match lines that are made up of only digits and dot characters.
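A minimal Python sketch of anchored versus unanchored matching follows. The sample lines and the helper `select` are invented for illustration, and "[A-Z]" stands in for "[[:upper:]]" because Python's `re` module has no named classes. The last pattern uses "+" so that lines of any length qualify, as long as every character is a digit or a dot.

```python
import re

lines = ["cat", "concatenate", "Cat nap", "3.14", "v3.14"]

def select(pattern):
    """Return the sample lines on which the pattern matches somewhere (hypothetical helper)."""
    return [line for line in lines if re.search(pattern, line)]

print(select(r"cat"))         # ['cat', 'concatenate']
print(select(r"^cat$"))       # ['cat']
print(select(r"^[A-Z]"))      # ['Cat nap']
print(select(r"^[0-9.]+$"))   # ['3.14']
```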
Examples
Description | Regexp | Matches | Does not match | Comments |
---|---|---|---|---|
A specific word | Hello | Hello<br>Hello there!<br>Hello, World!<br>He said, "Hello James", in a very threatening tone | Hi there<br>Hell of a Day<br>h el lo | |
A specific word with nothing else on the line | ^Hello$ | Hello | Hello there!<br>Hello, World!<br>He said, "Hello James", in a very threatening tone<br>Hi there<br>Hell of a Day<br>h el lo | |
An integer | ^[-+]?[[:digit:]]+$ | +15<br>-2<br>720<br>1440<br>1280<br>1920<br>000<br>012 | + 4<br>3.14<br>0x47<br>$1.13 | |
A decimal number | ^[-+]?[[:digit:]]+(\.[[:digit:]]*)?$ | +3.14<br>42<br>-1000.0<br>+212<br>+36.7<br>42.00<br>3.333333333<br>0.976 | .976<br>+-200<br>1.1.1.1<br>13.4.7 | |
A Canadian Postal Code | ^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ] ?[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$ | H0H 0H0<br>M3C 1L2<br>K1A 0A2<br>T2G 0P3<br>V8W 9W2<br>R3B 0N2<br>M2J2X5<br>M5S 2C6 | POB 1L0<br>90210<br>MN4 2R6 | A Canadian postal code alternates between letters and digits: A9A 9A9. The first letter must be one of ABCEGHJKLMNPRSTVXY and the remaining letters must be one of ABCEGHJKLMNPRSTVWXYZ. |
Phone Numbers | ^[^+[:digit:]]*(\+?1)?[^+[:digit:]]*[2-9]([^+[:digit:]]*[0-9]){9}[^+[:digit:]]*$ | (416) 967-1111<br>+1 416-736-3636<br>416-439-0000 | +65 6896 2391<br>555-1212 | A Canadian phone number consists of a 3-digit Area Code (which may not start with 0 or 1) and a 7-digit local number consisting of an exchange (3 digits) and a line (4 digits). The country code for Canada (and the US) is 1, so the number may be preceded by +1 or 1. Area codes are sometimes contained in parentheses, and dashes or spaces are sometimes used as separators. |
IP Address (IPv4 dotted quad) | ((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]) | 1.1.1.1<br>4.4.8.8<br>8.8.8.8<br>7.12.9.43<br>10.106.32.109<br>IP=100.150.200.250<br>172.16.97.1<br>192.168.0.1<br>IP=67.69.105.143<br>255.255.255.0 | 103.271.92.16<br>1O.10.10.10 | An IPv4 address in "dotted quad" notation consists of four numbers in the range 0-255 separated by periods. The numbers are called "octets" (which means a collection of eight bits, a more precise definition of a "byte"). The expression is not anchored, so it matches an address anywhere on a line; a line such as 1.10.100.1000 therefore still matches, on its leading 1.10.100.100. |
Private IP Address | (10\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])|192\.168|172\.(1[6-9]|2[0-9]|3[0-1]))\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]) | 10.4.72.13<br>172.16.97.1<br>192.168.0.1<br>IP=192.168.113.42 | 1.1.1.1<br>4.4.8.8<br>192.169.12.6<br>192.168.400.37<br>Address is 1 . 2 . 3 . 4 | Valid IPv4 dotted quad address with a first octet of 10; or first two octets of 192.168; or first octet of 172 followed by a second octet in the range 16-31. |
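A few of the table's rows can be checked mechanically. The sketch below is one way to do that in Python, under stated assumptions: Python is used only for illustration, the names `PATTERNS`, `SAMPLES`, and `check` are made up, the patterns are written fully anchored with ^ and $, and "[[:digit:]]" is rewritten as "[0-9]" because Python's `re` module does not support named classes.

```python
import re

# Patterns from the table, with [[:digit:]] rewritten as [0-9] for Python.
PATTERNS = {
    "integer": r"^[-+]?[0-9]+$",
    "postal_code": (r"^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ] ?"
                    r"[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$"),
    "phone": r"^[^+0-9]*(\+?1)?[^+0-9]*[2-9]([^+0-9]*[0-9]){9}[^+0-9]*$",
}

# For each pattern: (strings that should match, strings that should not).
SAMPLES = {
    "integer": (["+15", "-2", "012"], ["+ 4", "3.14", "0x47", "$1.13"]),
    "postal_code": (["H0H 0H0", "M2J2X5"], ["POB 1L0", "90210", "MN4 2R6"]),
    "phone": (["(416) 967-1111", "+1 416-736-3636", "416-439-0000"],
              ["+65 6896 2391", "555-1212"]),
}

def check(name):
    """Return True if every sample for this pattern behaves as listed."""
    regex = re.compile(PATTERNS[name])
    should_match, should_not_match = SAMPLES[name]
    return (all(regex.search(s) for s in should_match)
            and not any(regex.search(s) for s in should_not_match))

for name in PATTERNS:
    print(name, check(name))
```

Keeping the sample strings next to the patterns makes it easy to re-run the check whenever an expression is edited.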