OPS102 - Regular Expressions
Regular Expressions are search patterns for "Regular Text". They are used by many different tools and languages, including the Linux grep command, the Windows findstr command, less, vi/vim, sed, awk, perl, python, and many others.
Why Use Regular Expressions?
Regular Expressions can be a little daunting to learn: they often look like someone was bashing their head against the keyboard (or like a cat was lying on it). But they are very powerful - a well-written regular expression can replace many pages of code in a programming language such as C or C++ - so it is worth investing some time to understand them.
The Seven Basic Elements of Regular Expressions
Characters
In a regular expression (regexp), any character that doesn't otherwise have a special meaning matches that character. So the digit "5", for example, matches the digit "5"; similarly, "cat" matches the letters "c", "a", and "t" in sequence.
A backslash can be used to remove any special meaning which a character has. The period character "." is a type of wildcard (see below), so to search for a literal period, we place a backslash in front of it: "\."
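As a minimal sketch of literal matching and escaping, the same idea can be tried in Python's re module. The sample strings are invented for illustration.

```python
import re

# Ordinary characters match themselves, in sequence.
literal = re.search(r"cat", "concatenate")

# An unescaped period is a wildcard, so "3.14" also matches "3x14".
unescaped = re.search(r"3.14", "3x14")

# A backslash removes the special meaning: "\." matches only a literal period.
escaped_miss = re.search(r"3\.14", "3x14")
escaped_hit = re.search(r"3\.14", "pi is 3.14")

print(literal.group())      # cat
print(unescaped.group())    # 3x14
print(escaped_miss)         # None
print(escaped_hit.group())  # 3.14
```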
Wildcards
A period "." will match any single character. Similarly, three periods "..." will match any three characters.
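A short sketch of the wildcard in Python's re module, with made-up sample text. Note that in Python's default mode the period matches any character except a newline.

```python
import re

# One period stands for exactly one character, so "ct" and "coat" do not match.
one = re.findall(r"c.t", "cat cot cut ct coat")

# Three periods stand for any three characters in a row.
three = re.search(r"a...e", "xx abcde yy")

print(one)            # ['cat', 'cot', 'cut']
print(three.group())  # abcde
```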
Bracket Expressions / Character Classes
Bracket Expressions or Character Classes are contained in square brackets "[ ]":
- A list of characters in square brackets will match any one character from the list: "[abc]" will match "a", "b", or "c"
- A range of characters in square brackets, written as a starting character, a dash, and an ending character, will match any one character in that range: "[0-9]" will match any one digit.
- There are some pre-defined named character classes. These are selected by specifying the name of the character class surrounded by colons and square brackets, placed within outer square brackets, like "[[:digit:]]". The available names are:
  - alnum - alphanumeric characters (letters and digits)
  - alpha - alphabetic characters
  - blank - horizontal whitespace (space, tab)
  - cntrl - control characters
  - digit - digits
  - graph - letters, digits, and punctuation
  - print - letters, digits, punctuation, and space
  - punct - punctuation marks
  - space - horizontal and vertical whitespace (space, tab, newline, carriage return, vertical tab, form feed)
  - upper - UPPERCASE letters
  - lower - lowercase letters
  - xdigit - hexadecimal digits (digits plus a-f and A-F)
- Ranges, lists, and named character classes may be combined - e.g., "[[:digit:]+.,-]", "[[:digit:][:punct:]]", "[0-9_*]"
- To invert a character class, add a caret ^ character as the first character after the opening square bracket: "[^[:digit:]]" matches any non-digit character, and "[^:]" matches any character that is not a colon.
- To include a literal caret, place it anywhere other than the first position, for example at the end of the character class. To include a literal closing square bracket, place it at the start of the character class (directly after the caret if the class is inverted). To include a literal dash, place it at the start or the end, so that it cannot be read as part of a range.
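The list, range, and inversion rules above can be sketched in Python's re module. One caveat: Python's re module does not support the named classes such as "[[:digit:]]", so this sketch uses explicit ranges instead. Tools such as grep accept the named form. The sample strings are invented for illustration.

```python
import re

# A list matches any one of the listed characters.
vowels = re.findall(r"[aeiou]", "regexp")

# A range matches any one character between the two endpoints.
digits = re.findall(r"[0-9]", "Room 42, $3.50")

# A leading caret inverts the class.
non_digits = re.findall(r"[^0-9]", "a1b2")

# Ranges and lists can be combined; the dash is last so it is taken literally.
combined = re.findall(r"[0-9+.,-]", "+1,234.5-x")

# Literal "]" goes first, literal "^" anywhere but first, literal "-" last.
specials = re.findall(r"[]^-]", "[x^y]-z")

print(vowels)      # ['e', 'e']
print(digits)      # ['4', '2', '3', '5', '0']
print(non_digits)  # ['a', 'b']
print(combined)    # ['+', '1', ',', '2', '3', '4', '.', '5', '-']
print(specials)    # ['^', ']', '-']
```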
Repetition
- A repeat count can be placed in curly brackets. It applies to the previous element: "x{3}" matches "xxx"
- A repeat can be a range, written as min,max in curly brackets: "x{2,5}" will match "xx", "xxx", "xxxx", or "xxxxx"
- The maximum value in a range can be omitted: "x{2,}" will match two or more "x" characters in a row
- There are short forms for some commonly-used ranges:
  - "*" is the same as "{0,}" (zero or more)
  - "+" is the same as "{1,}" (one or more)
  - "?" is the same as "{0,1}" (zero or one)
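A minimal sketch of the repeat counts and their short forms, using Python's re module. re.fullmatch is used here so that each pattern must account for the entire sample string. The sample strings are invented for illustration.

```python
import re

# An exact count: three "x" characters.
exactly_three = re.fullmatch(r"x{3}", "xxx")

# A min,max range: only the strings with 2 to 5 "x" characters pass.
samples = ["x", "xx", "xxxxx", "xxxxxx"]
between = [s for s in samples if re.fullmatch(r"x{2,5}", s)]

# An open-ended range: two or more.
at_least_two = re.fullmatch(r"x{2,}", "xxxxxxxx")

# Short forms: "*" allows zero "b" characters, "+" requires at least one.
star = re.fullmatch(r"ab*", "a")
plus = re.fullmatch(r"ab+", "a")

# "?" makes the previous element optional.
words = ["color", "colour", "colouur"]
optional = [w for w in words if re.fullmatch(r"colou?r", w)]

print(bool(exactly_three))  # True
print(between)              # ['xx', 'xxxxx']
print(bool(at_least_two))   # True
print(bool(star), plus)     # True None
print(optional)             # ['color', 'colour']
```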
Alternation
- The vertical bar indicates alternation - either the expression on the left or the right can be matched: "hot|cold" will match "hot" or "cold"
Grouping
- Elements placed in parentheses are treated as a group, and can be repeated: "(na)* batman" will match "nananana batman" and "nananananananana batman"
- Grouping may also be used to limit alternation: "(fire|green)house" will match "firehouse" and "greenhouse"
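Alternation and grouping can be sketched together in Python's re module. The ungrouped variants are included only to show what the parentheses change. The sample strings are invented for illustration.

```python
import re

# Alternation: either side of the bar may match.
either = re.findall(r"hot|cold", "hot tea, cold milk, warm soup")

# A group can be repeated as a unit.
batman = re.fullmatch(r"(na)* batman", "nananana batman")

# Without the group, "*" applies only to the single "a" before it.
no_group = re.fullmatch(r"na* batman", "nananana batman")

# A group limits how far the alternation reaches.
words = ["firehouse", "greenhouse", "doghouse"]
houses = [w for w in words if re.fullmatch(r"(fire|green)house", w)]

# Without the group, the choice is between "fire" and "greenhouse".
ungrouped = re.fullmatch(r"fire|greenhouse", "firehouse")

print(either)        # ['hot', 'cold']
print(bool(batman))  # True
print(no_group)      # None
print(houses)        # ['firehouse', 'greenhouse']
print(ungrouped)     # None
```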
Anchors
- Anchors match locations, not characters.
- A caret symbol will match the start of a line: "^[[:upper:]]" will match lines that start with an uppercase letter.
- A dollar sign will match the end of a line: "[[:punct:]]$" will match lines that end with a punctuation mark.
- The two characters may be used together: "cat" will match the sequence "cat" anywhere on a line, but "^cat$" will only match lines that contain nothing besides the word "cat". Likewise, "^[0-9.]+$" will match lines that are made up of only digits and dot characters.
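A minimal sketch of anchors in Python's re module, filtering a list of lines roughly the way grep would. The helper name matching and the sample lines are hypothetical. Because Python's re module does not support named classes such as "[[:upper:]]", the sketch uses "[A-Z]" and a literal "\." instead.

```python
import re

lines = ["cat", "concatenate", "Hello there", "192.168.0.1", "v1.2", "done."]

def matching(pattern):
    """Return the lines that contain a match for the pattern."""
    return [line for line in lines if re.search(pattern, line)]

anywhere = matching(r"cat")         # the sequence "cat" anywhere on the line
whole = matching(r"^cat$")          # the line is exactly "cat"
starts_upper = matching(r"^[A-Z]")  # the line starts with an uppercase letter
ends_dot = matching(r"\.$")         # the line ends with a period
numeric = matching(r"^[0-9.]+$")    # the line is only digits and dots
one_char = matching(r"^[0-9.]$")    # no "+": only one-character lines qualify

print(anywhere)      # ['cat', 'concatenate']
print(whole)         # ['cat']
print(starts_upper)  # ['Hello there']
print(ends_dot)      # ['done.']
print(numeric)       # ['192.168.0.1']
print(one_char)      # []
```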