OPS102 - Regular Expressions

Regular Expressions are search patterns for "Regular Text". They are used by many different tools and languages, including the Linux grep command, the Windows findstr command, less, vi/vim, sed, awk, perl, python, and many others.

Why Use Regular Expressions?

Regular Expressions can be a little daunting to learn: they often look like someone was just bashing their head against the keyboard (or, like a cat was lying on the keyboard). But they are very powerful - a well-written regular expression can replace many pages of code in a programming language such as C or C++ - and so it is worth investing some time to understand them.

The Seven Basic Elements of Regular Expressions

Characters

In a regular expression (regexp), any character that doesn't otherwise have a special meaning matches that character. So the digit "5", for example, matches the digit "5"; similarly "cat" matches the letters "c", "a", and "t" in sequence.

A backslash can be used to remove any special meaning which a character has. The period character "." is a type of wildcard (see below), so to search for a literal period, we place a backslash in front of it: "\."
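
As a small illustration of literal characters and escaping, here is a sketch using Python's standard `re` module (Python is only one of the tools listed above; the sample strings are made up for this example):

```python
import re

# Ordinary characters match themselves, in sequence.
literal_hit = re.search(r"cat", "concatenate")

# An unescaped period is a wildcard, so "3.14" also matches "3x14".
unescaped_hit = re.search(r"3.14", "3x14")

# Escaping the period makes it match only a literal "." character.
escaped_miss = re.search(r"3\.14", "3x14")
escaped_hit = re.search(r"3\.14", "pi is 3.14")

print(literal_hit.group())    # cat
print(unescaped_hit.group())  # 3x14
print(escaped_miss)           # None
print(escaped_hit.group())    # 3.14
```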

Wildcards

A period "." will match any single character. Similarly, three periods "..." will match any three characters.
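
The wildcard behaviour can be tried out with a short sketch in Python's `re` module; the test strings are invented for illustration:

```python
import re

# Each period matches exactly one character, whatever it is.
one_wild = re.findall(r"c.t", "cat cot c-t ct")
print(one_wild)  # ['cat', 'cot', 'c-t']  ("ct" has nothing between c and t)

# Three periods match exactly three characters.
three_ok = re.fullmatch(r"...", "a1!")
three_short = re.fullmatch(r"...", "ab")
print(three_ok is not None)  # True
print(three_short)           # None
```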

Bracket Expressions / Character Classes

Bracket Expressions or Character Classes are contained in square brackets "[ ]":

A list of characters in square brackets will match any one character from the list of characters: "[abc]" will match "a", "b", or "c"
A range of characters in square brackets, written as a starting character, a dash, and an ending character, will match any character in that range: "[0-9]" will match any one digit.
There are some pre-defined named character classes. These are selected by specifying the name of the character class surrounded by colons and square brackets, placed within outer square brackets, like "[[:digit:]]". The available names are:
- alnum - alphanumeric
- alpha - alphabetic characters
- blank - horizontal whitespace (space, tab)
- cntrl - control characters
- digit - digits
- graph - letters, digits, and punctuation
- print - letters, digits, punctuation, and space
- punct - punctuation marks
- space - horizontal and vertical whitespace (space, tab, newline, carriage return, vertical tab, form feed)
- upper - UPPERCASE letters
- lower - lowercase letters
- xdigit - hexadecimal digits (digits plus a-f and A-F)
Ranges, lists, and named character classes may be combined - e.g., "[[:digit:]+-.,]" "[[:digit:][:punct:]]" "[0-9_*]"
To invert a character class, add a caret ^ character as the first character after the opening square bracket: "[^[:digit:]]" matches any non-digit character, and "[^:]" matches any character that is not a colon.
To include a literal caret, place it anywhere except the first position (for example, at the end of the character class). To include a literal closing square bracket, place it first in the character class (immediately after the caret, if the class is inverted). To include a literal dash, place it first or last.
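
The list, range, and inversion rules above can be sketched in Python's `re` module. One caveat: Python's `re` does not support the "[[:name:]]" named classes (those are available in tools such as grep), so this sketch sticks to explicit lists and ranges. The sample strings are invented for illustration.

```python
import re

text = "Order #42: 3 apples, 17 pears."

# A list: any one of the characters a, e, i, o, u.
vowels = re.findall(r"[aeiou]", "regexp")

# A range: any one digit.
digits = re.findall(r"[0-9]", text)

# An inverted class: any one character that is not a digit.
non_digits = re.findall(r"[^0-9]", "a1b2")

# A range combined with a list of extra characters.
combined = re.findall(r"[0-9_*]", "a_1*b")

# Literal dash placed first, literal caret placed last.
symbols = re.findall(r"[-+^]", "2^8 - 1 + x")

print(vowels)      # ['e', 'e']
print(digits)      # ['4', '2', '3', '1', '7']
print(non_digits)  # ['a', 'b']
print(combined)    # ['_', '1', '*']
print(symbols)     # ['^', '-', '+']
```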

Repetition

A repeat count can be placed in curly brackets. It applies to the previous element: "x{3}" matches "xxx"
A repeat can be a range, written as min,max in curly brackets: "x{2,5}" will match "xx", "xxx", "xxxx", or "xxxxx"
The maximum value in a range can be omitted: "x{2,}" will match two or more "x" characters in a row
There are short forms for some commonly-used ranges:
- "*" is the same as "{0,}" (zero or more)
- "+" is the same as "{1,}" (one or more)
- "?" is the same as "{0,1}" (zero or one)
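
The repeat counts and their short forms can be checked with a brief sketch in Python's `re` module. `re.fullmatch` is used so that the whole string has to match the pattern; the "colou?r" example is an illustration added here, not part of the list above.

```python
import re

# Exact count: "x{3}" matches exactly three x characters.
exact = re.fullmatch(r"x{3}", "xxx")

# Range: "x{2,5}" accepts two to five x characters, no fewer and no more.
candidates = ["x", "xx", "xxxxx", "xxxxxx"]
ranged = [s for s in candidates if re.fullmatch(r"x{2,5}", s)]

# Open-ended range: "x{2,}" accepts two or more.
open_ended = re.fullmatch(r"x{2,}", "x" * 10)

# Short forms: * is {0,}, + is {1,}, ? is {0,1}.
star = re.fullmatch(r"ab*", "a")   # zero "b" characters is acceptable
plus = re.fullmatch(r"ab+", "a")   # at least one "b" is required
optional = [s for s in ["color", "colour", "colouur"]
            if re.fullmatch(r"colou?r", s)]

print(exact is not None)       # True
print(ranged)                  # ['xx', 'xxxxx']
print(open_ended is not None)  # True
print(star is not None)        # True
print(plus)                    # None
print(optional)                # ['color', 'colour']
```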

Alternation

The vertical bar indicates alternation - either the expression on the left or the right can be matched: "hot|cold" will match "hot" or "cold"
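
A minimal sketch of alternation in Python's `re` module, using a made-up sentence:

```python
import re

# Either alternative may match; "warm" matches neither, so it is skipped.
found = re.findall(r"hot|cold", "hot tea, cold milk, warm soup")
print(found)  # ['hot', 'cold']
```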

Grouping

Elements placed in parentheses are treated as a group, and can be repeated: "(na)* batman" will match "nananana batman" and "nananananananana batman"
Grouping may also be used to limit alternation: "(fire|green)house" will match "firehouse" and "greenhouse"
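
The two uses of grouping can be sketched in Python's `re` module as follows. Note that because "*" means zero or more, "(na)* batman" also matches " batman" with no "na" at all. The ungrouped pattern at the end is included only for comparison.

```python
import re

# Repeating a group: "(na)*" repeats the two-character sequence "na".
repeated = re.fullmatch(r"(na)* batman", "nananana batman")
zero_times = re.fullmatch(r"(na)* batman", " batman")

# Limiting alternation: the choice applies only inside the parentheses.
grouped = [w for w in ["firehouse", "greenhouse", "house"]
           if re.fullmatch(r"(fire|green)house", w)]

# Without the group, the alternatives are "fire" and "greenhouse".
ungrouped = [w for w in ["fire", "firehouse", "greenhouse"]
             if re.fullmatch(r"fire|greenhouse", w)]

print(repeated is not None)    # True
print(zero_times is not None)  # True
print(grouped)                 # ['firehouse', 'greenhouse']
print(ungrouped)               # ['fire', 'greenhouse']
```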

Anchors

Anchors match locations, not characters.
A caret symbol will match the start of a line: "^[[:upper:]]" will match lines that start with an uppercase letter.
A dollar sign will match the end of a line: "[[:punct:]]$" will match lines that end with a punctuation mark.
The two characters may be used together: "cat" will match the word "cat" anywhere on a line, but "^cat$" will only match lines that contain nothing besides the word "cat". Likewise, "^[0-9.]+$" will match lines that are made up of only digits and dot characters.
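
To see the effect of anchors, here is a sketch in Python's `re` module that tests a few invented lines one at a time. Since Python's `re` has no "[[:upper:]]" class, the range "[A-Z]" stands in for it here, and the digits-and-dots pattern includes a "+" so that it accepts lines of any length of one or more characters.

```python
import re

lines = ["cat", "concatenate", "the cat", "3.14", "v3.14", "Hello"]

# Unanchored: "cat" may appear anywhere on the line.
anywhere = [l for l in lines if re.search(r"cat", l)]

# Anchored at both ends: the line must be exactly "cat".
whole_line = [l for l in lines if re.search(r"^cat$", l)]

# Only digits and dots from start to end of the line.
numeric = [l for l in lines if re.search(r"^[0-9.]+$", l)]

# Start anchor only: the line must begin with an uppercase ASCII letter.
starts_upper = [l for l in lines if re.search(r"^[A-Z]", l)]

print(anywhere)      # ['cat', 'concatenate', 'the cat']
print(whole_line)    # ['cat']
print(numeric)       # ['3.14']
print(starts_upper)  # ['Hello']
```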
