OPS102 - Regular Expressions
Regular Expressions are search patterns for "Regular Text". They are used by many different tools and languages, including the Linux grep command, the Windows findstr command, less, vi/vim, sed, awk, perl, python, and many others.
Why Use Regular Expressions?
Regular Expressions can be a little daunting to learn: they often look like someone was just bashing their head against the keyboard (or, like a cat was lying on the keyboard). But they are very powerful - a well-written regular expression can replace many pages of code in a programming language such as C or C++ - and so it is worth investing some time to understand them.
The Seven Basic Elements of Regular Expressions
Characters
In a regular expression (regexp), any character that doesn't otherwise have a special meaning matches that character. So the digit "5", for example, matches the digit "5"; similarly, "cat" matches the letters "c", "a", and "t" in sequence.
A backslash can be used to remove any special meaning which a character has. The period character "." is a type of wildcard (see below), so to search for a literal period, we place a backslash in front of it: "\."
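As a rough illustration of literal characters and escaping, here is a minimal sketch using Python's built-in re module (Python is one of the tools listed above; the sample strings are invented for this example). re.search reports a match anywhere in the text, much as grep does within a line.

```python
import re

# "cat" matches the letters c, a, t in sequence, anywhere in the text.
has_cat = re.search(r"cat", "concatenate") is not None

# An unescaped period is a wildcard, so "3.14" also matches "3514".
loose = re.search(r"3.14", "3514") is not None

# Escaping the period makes it literal: "3\.14" needs a real "." there.
strict_on_digits = re.search(r"3\.14", "3514") is not None
strict_on_pi = re.search(r"3\.14", "pi is 3.14") is not None

print(has_cat, loose, strict_on_digits, strict_on_pi)
```

The raw-string prefix (r"...") keeps Python itself from interpreting the backslash, so the regexp engine receives it unchanged.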
Wildcards
A period "." will match any single character. Similarly, three periods "..." will match any three characters.
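The wildcard behaviour can be tried out with a short sketch, again assuming Python's re module and made-up sample text. Note that in Python the period does not match a newline character unless a flag is set.

```python
import re

# "c.t" needs exactly one character between "c" and "t".
one_between = re.findall(r"c.t", "cat cot c9t ct")   # ['cat', 'cot', 'c9t']

# "..." needs exactly three characters; fullmatch checks the whole string.
three_chars = re.fullmatch(r"...", "abc") is not None
two_chars = re.fullmatch(r"...", "ab") is not None

print(one_between, three_chars, two_chars)
```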
Bracket Expressions / Character Classes
Bracket Expressions or Character Classes are contained in square brackets "[ ]":
- A list of characters in square brackets will match any one character from the list of characters: "[abc]" will match "a", "b", or "c"
- A range of characters in square brackets, written as a starting character, a dash, and an ending character, will match any character in that range: "[0-9]" will match any one digit.
- There are some pre-defined named character classes. These are selected by specifying the name of the character class surrounded by colons and square brackets, placed within outer square brackets, like "[[:digit:]]". The available names are:
  - alnum - alphanumeric
  - alpha - alphabetic characters
  - blank - horizontal whitespace (space, tab)
  - cntrl - control characters
  - digit - digits
  - graph - letters, digits, and punctuation
  - print - letters, digits, punctuation, and space
  - punct - punctuation marks
  - space - horizontal and vertical whitespace (space, tab, newline, vertical tab, form feed, carriage return)
  - upper - UPPERCASE letters
  - lower - lowercase letters
  - xdigit - hexadecimal digits (digits plus a-f and A-F)
- Ranges, lists, and named character classes may be combined - e.g., "[-+.,[:digit:]]", "[[:digit:][:punct:]]", "[0-9_*]"
- To invert a character class, add a caret "^" as the first character after the opening square bracket: "[^[:digit:]]" matches any non-digit character, and "[^:]" matches any character that is not a colon.
- To include a literal caret, place it anywhere except the first position, such as at the end of the character class. To include a literal dash or closing square bracket, place it at the start of the character class.
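Lists, ranges, and inverted classes can be sketched as follows, assuming Python's re module and invented sample text. One caveat: Python's re module does not support the POSIX named classes such as "[[:digit:]]" (tools like grep do), so this sketch uses the equivalent range "[0-9]" instead.

```python
import re

text = "Order #42: 3 apples, 7 pears"

# A list: any one of the listed characters.
vowels = re.findall(r"[aeiou]", "regexp")        # ['e', 'e']

# A range: any one digit.
digits = re.findall(r"[0-9]", text)              # ['4', '2', '3', '7']

# An inverted class: delete everything that is not a digit.
digits_only = re.sub(r"[^0-9]", "", text)        # '4237'

# A combined class; the dash comes first so that it is taken literally.
number_chars = re.findall(r"[-+0-9.]", "x = -1.5")   # ['-', '1', '.', '5']

print(vowels, digits, digits_only, number_chars)
```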
Repetition
- A repeat count can be placed in curly brackets. It applies to the previous element: "x{3}" matches "xxx"
- A repeat can be a range, written as min,max in curly brackets: "x{2,5}" will match "xx", "xxx", "xxxx", or "xxxxx"
- The maximum value in a range can be omitted: "x{2,}" will match two or more "x" characters in a row
- There are short forms for some commonly-used ranges:
  - "*" is the same as "{0,}" (zero or more)
  - "+" is the same as "{1,}" (one or more)
  - "?" is the same as "{0,1}" (zero or one)
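The repeat counts and their short forms can be checked with a small sketch, assuming Python's re module. The helper whole_matches and the sample lists are hypothetical names introduced only for this example; re.fullmatch is used so that a pattern must account for the entire string.

```python
import re

def whole_matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

samples = ["x", "xx", "xxx", "xxxxx", "xxxxxx"]
exactly_three = whole_matches(r"x{3}", samples)    # ['xxx']
two_to_five = whole_matches(r"x{2,5}", samples)    # ['xx', 'xxx', 'xxxxx']
two_or_more = whole_matches(r"x{2,}", samples)     # ['xx', 'xxx', 'xxxxx', 'xxxxxx']

words = ["ct", "cat", "caat"]
zero_or_more = whole_matches(r"ca*t", words)       # ['ct', 'cat', 'caat']
one_or_more = whole_matches(r"ca+t", words)        # ['cat', 'caat']
zero_or_one = whole_matches(r"ca?t", words)        # ['ct', 'cat']

print(exactly_three, two_to_five, two_or_more)
print(zero_or_more, one_or_more, zero_or_one)
```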
Alternation
- The vertical bar indicates alternation - either the expression on the left or the right can be matched: "hot|cold" will match "hot" or "cold"
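A brief sketch of alternation, assuming Python's re module and made-up sample sentences:

```python
import re

# Either alternative may match; findall lists each match in order.
found = re.findall(r"hot|cold", "hot tea, cold milk, warm soup")   # ['hot', 'cold']

matches_cold = re.search(r"hot|cold", "a cold day") is not None
matches_warm = re.search(r"hot|cold", "a warm day") is not None

print(found, matches_cold, matches_warm)
```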
Grouping
- Elements placed in parentheses are treated as a group, and can be repeated: "(na)* batman" will match "nananana batman" and "nananananananana batman"
- Grouping may also be used to limit alternation: "(fire|green)house" will match "firehouse" and "greenhouse"
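The effect of grouping on repetition and on alternation can be seen in a short sketch, assuming Python's re module. The variable names are invented for this example, and re.fullmatch is used so that the whole string must be matched.

```python
import re

batman = r"(na)* batman"
repeated = re.fullmatch(batman, "nananana batman") is not None
zero_repeats = re.fullmatch(batman, " batman") is not None    # "*" allows none
half_group = re.fullmatch(batman, "nan batman") is not None   # "n" alone is not "na"

# With the group, the alternation covers only "fire" or "green".
grouped = r"(fire|green)house"
# Without it, the alternatives are the whole of "fire" or "greenhouse".
ungrouped = r"fire|greenhouse"

grouped_firehouse = re.fullmatch(grouped, "firehouse") is not None
ungrouped_firehouse = re.fullmatch(ungrouped, "firehouse") is not None
ungrouped_fire = re.fullmatch(ungrouped, "fire") is not None

print(repeated, zero_repeats, half_group)
print(grouped_firehouse, ungrouped_firehouse, ungrouped_fire)
```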
Anchors
- Anchors match locations, not characters.
- A caret symbol will match the start of a line: "^[[:upper:]]" will match lines that start with an uppercase letter.
- A dollar sign will match the end of a line: "[[:punct:]]$" will match lines that end with a punctuation mark.
- The two anchors may be used together: "cat" will match the text "cat" anywhere on a line, but "^cat$" will only match lines that contain only the word "cat". Likewise, "^[0-9.]+$" will match lines that are made up of only digits and dot characters.
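Anchors can be tried against a handful of sample lines in a minimal sketch, assuming Python's re module. The lines are invented, and because Python's re lacks the POSIX named classes, "[A-Z]" and "[.!?]" stand in here for "[[:upper:]]" and a small subset of "[[:punct:]]".

```python
import re

lines = ["cat", "concatenate", "Cat food", "the end."]

starts_upper = [l for l in lines if re.search(r"^[A-Z]", l)]    # ['Cat food']
ends_punct = [l for l in lines if re.search(r"[.!?]$", l)]      # ['the end.']
contains_cat = [l for l in lines if re.search(r"cat", l)]       # ['cat', 'concatenate']
only_cat = [l for l in lines if re.search(r"^cat$", l)]         # ['cat']

# The "+" lets the class repeat, so the whole line may be digits and dots.
numeric = [l for l in ["3.14", "127.0.0.1", "v1.2"]
           if re.search(r"^[0-9.]+$", l)]                        # ['3.14', '127.0.0.1']

print(starts_upper, ends_punct, contains_cat, only_cat, numeric)
```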
Examples
Description | Regexp | Matches | Does not match | Comments |
---|---|---|---|---|
Word | Hello | hello there! Hello, World! He said, "Hello James", in a very threatening tone | Hi there Hell Of a Day h el lo | |
IP Address (IPv4 dotted quad) | ((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]) | | | Each octet is limited to the range 0-255. |
Private IP Address | (10\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])|192\.168|172\.(1[6-9]|2[0-9]|3[0-1]))\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]) | | | Valid IPv4 address with a first octet of "10." or first two octets of "192.168." or first octet of "172." followed by a second octet in the range 16-31. |
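One way to sanity-check address patterns like these is to run them against known-good and known-bad inputs. The following is a minimal sketch, assuming Python's re module. The names OCTET, IPV4, PRIVATE, is_ipv4, and is_private_ipv4 are hypothetical; the octet sub-pattern is written to accept exactly 0-255, and "^" and "$" anchors are added so that the whole string must be an address.

```python
import re

# One octet, 0-255: 250-255, 200-249, 100-199, or 0-99.
OCTET = r"(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"

# Three "octet." groups followed by a final octet.
IPV4 = re.compile(rf"^({OCTET}\.){{3}}{OCTET}$")

# 10.x.x.x, 192.168.x.x, or 172.16-31.x.x
PRIVATE = re.compile(
    rf"^(10\.{OCTET}|192\.168|172\.(1[6-9]|2[0-9]|3[0-1]))\.{OCTET}\.{OCTET}$"
)

def is_ipv4(text):
    """Return True if text is a dotted quad with every octet in 0-255."""
    return IPV4.search(text) is not None

def is_private_ipv4(text):
    """Return True if text is an address in one of the three private ranges."""
    return PRIVATE.search(text) is not None

for address in ["192.168.0.1", "256.1.1.1", "172.32.0.1", "10.0.0.1"]:
    print(address, is_ipv4(address), is_private_ipv4(address))
```

Building the long pattern from a named OCTET piece is a design choice for readability; the resulting regexp is the same as writing the octet alternation out in full each time.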