Difference between revisions of "OPS102 - Regular Expressions"
Chris Tyler (talk | contribs) (Created page with "'''Regular Expressions''' are search patterns for "Regular Text". They are used by many different tools and languages, including the Linux grep command, the Windows findstr co...") |
Chris Tyler (talk | contribs) |
||
Line 9: | Line 9: | ||
=== Characters === | === Characters === | ||
− | In a regular expression (regexp), any character that doesn't otherwise have a special meaning matches that character. So the digit <code>"5"</code>, for example, matches the digit <code>"5"</code>; similarly <code>"cat"</code> matches the letters <code>"c"</code>, <code>"a"</code>, and <code>"t"</code> in sequence. | + | In a regular expression (regexp), any character that doesn't otherwise have a special meaning matches that character. So the digit <code><nowiki>"5"</nowiki></code>, for example, matches the digit <code><nowiki>"5"</nowiki></code>; similarly <code><nowiki>"cat"</nowiki></code> matches the letters <code><nowiki>"c"</nowiki></code>, <code><nowiki>"a"</nowiki></code>, and <code><nowiki>"t"</nowiki></code> in sequence. |
− | A backslash can be used to remove any special meaning which a character has. The period character <code>"."</code> is a type of wildcard (see below), so to search for a literal period, we place a backslash in front of it: <code>"\."</code> | + | A backslash can be used to remove any special meaning which a character has. The period character <code><nowiki>"."</nowiki></code> is a type of wildcard (see below), so to search for a literal period, we place a backslash in front of it: <code><nowiki>"\."</nowiki></code> |
=== Wildcards === | === Wildcards === | ||
− | A period <code>"."</code> will match '''any''' single character. Similarly, three periods <code>"..."</code> will match any three characters. | + | A period <code><nowiki>"."</nowiki></code> will match '''any''' single character. Similarly, three periods <code><nowiki>"..."</nowiki></code> will match any three characters. |
=== Bracket Expressions / Character Classes === | === Bracket Expressions / Character Classes === | ||
− | Bracket Expressions or Character Classes are contained in square brackets <code>"[ ]"</code>: | + | Bracket Expressions or Character Classes are contained in square brackets <code><nowiki>"[ ]"</nowiki></code>: |
− | * A list of characters in square brackets will match any ''one'' character from the list of characters: <code>"[abc]"</code> will match <code>"a"</code>, <code>"b"</code>, or <code>"c"</code> | + | * A list of characters in square brackets will match any ''one'' character from the list of characters: <code><nowiki>"[abc]"</nowiki></code> will match <code><nowiki>"a"</nowiki></code>, <code><nowiki>"b"</nowiki></code>, or <code><nowiki>"c"</nowiki></code> |
− | * A range of characters in square brackets, written as a starting character, a dash, and an ending character, will match any character in that range: <code>"[0-9]"</code> will match any one digit. | + | * A range of characters in square brackets, written as a starting character, a dash, and an ending character, will match any character in that range: <code><nowiki>"[0-9]"</nowiki></code> will match any one digit. |
− | * There are some pre-defined named character classes. These are selected by specifying the name of the character class surrounded by colons and square brackets, placed within outer square brackets, like <code>"[[:digits:]]"</code>. The available names are: | + | * There are some pre-defined named character classes. These are selected by specifying the name of the character class surrounded by colons and square brackets, placed within outer square brackets, like <code><nowiki>"[[:digits:]]"</nowiki></code>. The available names are: |
** alnum - alphanumeric | ** alnum - alphanumeric | ||
** alpha - alphabetic characters | ** alpha - alphabetic characters | ||
Line 35: | Line 35: | ||
** lower - lowercase letters | ** lower - lowercase letters | ||
** xdigit - hexidecimal digits (digits plus a-f and A-F) | ** xdigit - hexidecimal digits (digits plus a-f and A-F) | ||
− | * Ranges, lists, and named character classes may be combined - e.g., "[[:digit:]+-.,]" "[[:digit:][:punct:]]" "[0-9_*]" | + | * Ranges, lists, and named character classes may be combined - e.g., <code><nowiki>"[[:digit:]+-.,]"</nowiki></code> <code><nowiki>"[[:digit:][:punct:]]"</nowiki></code> <code><nowiki>"[0-9_*]"</nowiki></code> |
− | * To invert a character class, add a carat ^ character as the first character after the opening square bracket: "[^[:digit:]]" matches any non-digit character, and "[^:]" matches any character that is not a colon. | + | * To invert a character class, add a carat ^ character as the first character after the opening square bracket: <code><nowiki>"[^[:digit:]]"</nowiki></code> matches any non-digit character, and <code><nowiki>"[^:]"</nowiki></code> matches any character that is not a colon. |
* To include a literal carat, place it at the end of the character class. To include a literal dash or closing square bracket, place it at the start of the character class. | * To include a literal carat, place it at the end of the character class. To include a literal dash or closing square bracket, place it at the start of the character class. | ||
− | == Repetition == | + | === Repetition === |
− | * A repeat count can be placed in curly brackets. It applies to the previous element: "x{3}" matches "xxx" | + | * A repeat count can be placed in curly brackets. It applies to the previous element: <code><nowiki>"x{3}"</nowiki></code> matches <code><nowiki>"xxx"</nowiki></code> |
− | * A repeat can be a range, written as min,max in curly brackets: "x{2,5}" will match "xx", "xxx", "xxxx", or "xxxxx" | + | * A repeat can be a range, written as min,max in curly brackets: <code><nowiki>"x{2,5}"</nowiki></code> will match <code><nowiki>"xx"</nowiki></code>, <code><nowiki>"xxx"</nowiki></code>, <code><nowiki>"xxxx"</nowiki></code>, or <code><nowiki>"xxxxx"</nowiki></code> |
− | * The maximum value in a range can be omitted: "x{2,}" will two or more "x" characters in a row | + | * The maximum value in a range can be omitted: <code><nowiki>"x{2,}"</nowiki></code> will two or more <code><nowiki>"x"</nowiki></code> characters in a row |
* There are short forms for some commonly-used ranges: | * There are short forms for some commonly-used ranges: | ||
− | ** "*" is the same as "{0,}" (zero or more) | + | ** <code><nowiki>"*"</nowiki></code> is the same as <code><nowiki>"{0,}"</nowiki></code> (zero or more) |
− | ** "+" is the same as "{1,}" (one or more) | + | ** <code><nowiki>"+"</nowiki></code> is the same as <code><nowiki>"{1,}"</nowiki></code> (one or more) |
− | ** "?" is the same as "{0,1}" (zero or one) | + | ** <code><nowiki>"?"</nowiki></code> is the same as <code><nowiki>"{0,1}"</nowiki></code> (zero or one) |
− | == Alternation == | + | === Alternation === |
− | * The vertical bar indicates alternation - either the expression on the left or the right can be matched: "hot|cold" will match "hot" or "cold" | + | * The vertical bar indicates alternation - either the expression on the left or the right can be matched: <code><nowiki>"hot|cold"</nowiki></code> will match <code><nowiki>"hot"</nowiki></code> or <code><nowiki>"cold"</nowiki></code> |
− | == Grouping == | + | === Grouping === |
− | * Elements placed in parenthesis are treated as a group, and can be repeated: "(na)* batman" will match "nananana batman" and "nananananananana batman" | + | * Elements placed in parenthesis are treated as a group, and can be repeated: <code><nowiki>"(na)* batman"</nowiki></code> will match <code><nowiki>"nananana batman"</nowiki></code> and <code><nowiki>"nananananananana batman"</nowiki></code> |
− | * Grouping may also be used to limit alternation: "(fire|green)house" will match "firehouse" and "greenhouse" | + | * Grouping may also be used to limit alternation: <code><nowiki>"(fire|green)house"</nowiki></code> will match <code><nowiki>"firehouse"</nowiki></code> and <code><nowiki>"greenhouse"</nowiki></code> |
− | == Anchors == | + | === Anchors === |
* Anchors match '''locations''', not characters. | * Anchors match '''locations''', not characters. | ||
− | * A carat symbol will match the start of a line: "^[[:upper:]]" wil match lines that start with an uppercase letter. | + | * A carat symbol will match the start of a line: <code><nowiki>"^[[:upper:]]"</nowiki></code> wil match lines that start with an uppercase letter. |
− | * A dollar sign will match the end of a line: "[[:punct:]]$" will match lines that end with a punctuation mark. | + | * A dollar sign will match the end of a line: <code><nowiki>"[[:punct:]]$"</nowiki></code> will match lines that end with a punctuation mark. |
− | * The two characters may be used together: "cat" will match the word "cat" anywhere on a line, but "^cat$" will only match lines that contain | + | * The two characters may be used together: <code><nowiki>"cat"</nowiki></code> will match the word <code><nowiki>"cat"</nowiki></code> anywhere on a line, but <code><nowiki>"^cat$"</nowiki></code> will only match lines that contain ''only'' the word <code><nowiki>"cat"</nowiki></code>. Likewise, <code><nowiki>"^[0-9.]$"</nowiki></code> will match lines that are made up of only digits and dot characters. |
+ | |||
+ | == Examples == | ||
+ | |||
+ | {|cellspacing="0" width="100%" cellpadding="5" border="1" | ||
+ | |- | ||
+ | !Description!!Regexp!!Matches!!Does not match!!Comments | ||
+ | |- | ||
+ | |Word||Hello||hello there!<br>Hello, World!<br>He said, "Hello James", in a very threatening tone||Hi there<br>Hell Of a Day<br>h el lo|| | ||
+ | |- | ||
+ | |IP Address (IPv4 dotted quad)||<code><nowiki>((2[0-5][0-9]|[1-2][0-9][0-9]|[1-9][0-9]|[1-9])\.){3}(2[0-5][0-9]|[1-2][0-9][0-9]|[1-9][0-9]|[1-9])</nowiki></code>|| | ||
+ | |- | ||
+ | |Private IP Address||<code><nowiki>(10\.((2[0-5][0-9]|[1-2][0-9][0-9]|[1-9][0-9]|[1-9]))|192\.168|172\.(1[6-9]|2[0-9]|3[0-1]))\.(2[0-5][0-9]|[1-2][0-9][0-9]|[1-9][0-9]|[1-9])\.(2[0-5][0-9]|[1-2][0-9][0-9]|[1-9][0-9]|[1-9])</nowiki></code>|| || ||Valid IPv4 address with a first octet of "10." or first two octets of "192.168." or first octet of "172." followed by a second octet in the range 16-31. | ||
+ | |} |
Revision as of 10:40, 5 December 2023
Regular Expressions are search patterns for "Regular Text". They are used by many different tools and languages, including the Linux grep command, the Windows findstr command, less, vi/vim, sed, awk, perl, python, and many others.
Why Use Regular Expressions?
Regular Expressions can be a little daunting to learn: they often look like someone was just bashing their head against the keyboard (or, like a cat was lying on the keyboard). But they are very powerful - a well-written regular expression can replace many pages of code in a programming language such as C or C++ - and so it is worth investing some time to understand them.
The Seven Basic Elements of Regular Expressions
Characters
In a regular expression (regexp), any character that doesn't otherwise have a special meaning matches that character. So the digit "5", for example, matches the digit "5"; similarly "cat" matches the letters "c", "a", and "t" in sequence.
A backslash can be used to remove any special meaning which a character has. The period character "." is a type of wildcard (see below), so to search for a literal period, we place a backslash in front of it: "\."
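The two rules above can be tried out with a minimal sketch. It assumes Python's re module as the regexp engine; the helper name found and the sample strings are made up for this illustration.

```python
import re


def found(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None


# Ordinary characters match themselves, in sequence.
print(found(r"cat", "concatenate"))     # True
print(found(r"cat", "dog"))             # False

# A backslash removes the special meaning of the period.
print(found(r"\.", "example.com"))      # True
print(found(r"\.", "no period here"))   # False
```

The patterns are written as raw strings (r"...") so that Python passes the backslash through to the regexp engine unchanged.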
Wildcards
A period "." will match any single character. Similarly, three periods "..." will match any three characters.
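A short sketch of the wildcard, again assuming Python's re module; the sample text and variable names are invented for the example.

```python
import re

# "c.t" means: a "c", then any single character, then a "t".
three_letter = re.findall(r"c.t", "cat cot c t cart")
print(three_letter)  # ['cat', 'cot', 'c t']

# "..." must match exactly three characters when the whole string is tested.
exactly_three = re.fullmatch(r"...", "abc") is not None
too_short = re.fullmatch(r"...", "ab") is not None
print(exactly_three, too_short)  # True False

# Python-specific caveat: by default "." does not match a newline character.
newline_matches = re.fullmatch(r".", "\n") is not None
print(newline_matches)  # False
```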
Bracket Expressions / Character Classes
Bracket Expressions or Character Classes are contained in square brackets "[ ]":
- A list of characters in square brackets will match any one character from the list of characters: "[abc]" will match "a", "b", or "c"
- A range of characters in square brackets, written as a starting character, a dash, and an ending character, will match any character in that range: "[0-9]" will match any one digit.
- There are some pre-defined named character classes. These are selected by specifying the name of the character class surrounded by colons and square brackets, placed within outer square brackets, like "[[:digit:]]". The available names are:
  - alnum - alphanumeric characters
  - alpha - alphabetic characters
  - blank - horizontal whitespace (space, tab)
  - cntrl - control characters
  - digit - digits
  - graph - letters, digits, and punctuation
  - print - letters, digits, punctuation, and space
  - punct - punctuation marks
  - space - horizontal and vertical whitespace (space, tab, newline, carriage return, vertical tab, form feed)
  - upper - UPPERCASE letters
  - lower - lowercase letters
  - xdigit - hexadecimal digits (digits plus a-f and A-F)
- Ranges, lists, and named character classes may be combined - e.g., "[[:digit:]+-.,]" "[[:digit:][:punct:]]" "[0-9_*]"
- To invert a character class, add a caret ^ character as the first character after the opening square bracket: "[^[:digit:]]" matches any non-digit character, and "[^:]" matches any character that is not a colon.
- To include a literal caret, place it at the end of the character class. To include a literal dash or closing square bracket, place it at the start of the character class.
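The bracket expressions above can be sketched as follows, assuming Python's re module. One caveat: Python's re does not support the named classes such as "[[:digit:]]", so this sketch writes the equivalent explicit range "[0-9]" instead; tools such as grep accept the named form. The helper name first_match is hypothetical.

```python
import re


def first_match(pattern: str, text: str):
    """Return the first substring matched by the pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


print(first_match(r"[abc]", "xyzb"))     # b  (any one character from the list)
print(first_match(r"[0-9]", "room 42"))  # 4  (any one digit)
print(first_match(r"[0-9_*]", "a*b"))    # *  (range and list combined)
print(first_match(r"[^0-9]", "42nd"))    # n  (inverted: first non-digit)
print(first_match(r"[^:]", "::x"))       # x  (anything that is not a colon)
print(first_match(r"[-^]", "a^b-c"))     # ^  (literal dash first, literal caret last)
```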
Repetition
- A repeat count can be placed in curly brackets. It applies to the previous element: "x{3}" matches "xxx"
- A repeat can be a range, written as min,max in curly brackets: "x{2,5}" will match "xx", "xxx", "xxxx", or "xxxxx"
- The maximum value in a range can be omitted: "x{2,}" will match two or more "x" characters in a row
- There are short forms for some commonly-used ranges:
  - "*" is the same as "{0,}" (zero or more)
  - "+" is the same as "{1,}" (one or more)
  - "?" is the same as "{0,1}" (zero or one)
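The repetition rules can be checked with a small sketch, assuming Python's re module. The helper name whole and the "colou?r" pattern are made up for this illustration; re.fullmatch is used so that the entire string has to fit the pattern.

```python
import re


def whole(pattern: str, text: str) -> bool:
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


print(whole(r"x{3}", "xxx"))  # True
print(whole(r"x{3}", "xx"))   # False

# Which run lengths from 0 to 7 does "x{2,5}" accept?
accepted = [n for n in range(8) if whole(r"x{2,5}", "x" * n)]
print(accepted)  # [2, 3, 4, 5]

print(whole(r"x{2,}", "x" * 10))  # True  (no upper limit)
print(whole(r"x*", ""))           # True  (zero or more)
print(whole(r"x+", ""))           # False (one or more)
print(whole(r"colou?r", "color"), whole(r"colou?r", "colour"))  # True True
```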
Alternation
- The vertical bar indicates alternation - either the expression on the left or the right can be matched: "hot|cold" will match "hot" or "cold"
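A brief sketch of alternation, assuming Python's re module; the sample sentence is invented.

```python
import re

# Either side of the vertical bar may match.
temperatures = re.findall(r"hot|cold", "hot tea, cold milk, warm soup")
print(temperatures)  # ['hot', 'cold']
```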
Grouping
- Elements placed in parentheses are treated as a group, and can be repeated: "(na)* batman" will match "nananana batman" and "nananananananana batman"
- Grouping may also be used to limit alternation: "(fire|green)house" will match "firehouse" and "greenhouse"
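Both uses of grouping can be sketched as follows, assuming Python's re module; the helper name whole is hypothetical. The last two lines show what happens when the parentheses are left out: the alternation then covers everything on each side of the bar.

```python
import re


def whole(pattern: str, text: str) -> bool:
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


# The group "na" is repeated as a unit.
print(whole(r"(na)* batman", "nananana batman"))  # True
print(whole(r"(na)* batman", " batman"))          # True  ("*" allows zero repeats)
print(whole(r"(na)* batman", "nan batman"))       # False

# Parentheses limit the alternation to "fire" or "green".
print(whole(r"(fire|green)house", "firehouse"))   # True
print(whole(r"(fire|green)house", "greenhouse"))  # True

# Without the group, the choices are "fire" or "greenhouse".
print(whole(r"fire|greenhouse", "firehouse"))     # False
print(whole(r"fire|greenhouse", "fire"))          # True
```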
Anchors
- Anchors match locations, not characters.
- A caret symbol will match the start of a line: "^[[:upper:]]" will match lines that start with an uppercase letter.
- A dollar sign will match the end of a line: "[[:punct:]]$" will match lines that end with a punctuation mark.
- The two characters may be used together: "cat" will match the word "cat" anywhere on a line, but "^cat$" will only match lines that contain only the word "cat". Likewise, "^[0-9.]+$" will match lines that are made up of only digits and dot characters.
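The anchors can be sketched by filtering a list of lines, much as grep does, assuming Python's re module. The sample lines are invented, and because Python's re has no named classes the sketch substitutes "[A-Z]" for "[[:upper:]]" and a literal period for "[[:punct:]]". Note the "+" in the last pattern: a bracket expression on its own matches exactly one character, so a repetition is needed for lines longer than one character.

```python
import re

lines = ["cat", "concatenate", "Cat nap", "3.14", "v1.2", "Done."]


def select(pattern: str) -> list:
    """Return the lines in which the pattern is found."""
    return [line for line in lines if re.search(pattern, line)]


print(select(r"cat"))        # ['cat', 'concatenate']
print(select(r"^cat$"))      # ['cat']
print(select(r"^[A-Z]"))     # ['Cat nap', 'Done.']
print(select(r"\.$"))        # ['Done.']
print(select(r"^[0-9.]+$"))  # ['3.14']
```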
Examples
Description | Regexp | Matches | Does not match | Comments |
---|---|---|---|---|
Word | Hello | hello there!<br>Hello, World!<br>He said, "Hello James", in a very threatening tone | Hi there<br>Hell Of a Day<br>h el lo | The lowercase "hello there!" matches only in a case-insensitive search (for example, grep -i). |
IP Address (IPv4 dotted quad) | ((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9]) | | | Four octets in the range 0-255, separated by periods. |
Private IP Address | (10\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])|192\.168|172\.(1[6-9]|2[0-9]|3[0-1]))\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9]) | | | Valid IPv4 address with a first octet of "10." or first two octets of "192.168." or first octet of "172." followed by a second octet in the range 16-31. |
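The IP address examples can be exercised with a rough sketch, assuming Python's re module. The octet sub-pattern used here is an assumption of this sketch: it accepts the values 0 through 255 and nothing larger. The names OCTET, IPV4, PRIVATE, is_ipv4, and is_private are made up for the illustration, and re.fullmatch is used so that the whole string must be an address.

```python
import re

# One octet: 250-255, 200-249, 100-199, 10-99, or 0-9 (assumed 0-255 range).
OCTET = r"(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])"

# Three "octet plus period" groups followed by a final octet.
IPV4 = re.compile(rf"({OCTET}\.){{3}}{OCTET}")

# 10.x.x.x, 192.168.x.x, or 172.16.x.x through 172.31.x.x.
PRIVATE = re.compile(
    rf"(10\.{OCTET}|192\.168|172\.(1[6-9]|2[0-9]|3[0-1]))\.{OCTET}\.{OCTET}"
)


def is_ipv4(text: str) -> bool:
    """Return True if the whole text is a dotted-quad IPv4 address."""
    return IPV4.fullmatch(text) is not None


def is_private(text: str) -> bool:
    """Return True if the whole text is a private IPv4 address."""
    return PRIVATE.fullmatch(text) is not None


print(is_ipv4("8.8.8.8"), is_private("8.8.8.8"))    # True False
print(is_ipv4("256.1.1.1"))                         # False
print(is_private("10.0.0.1"))                       # True
print(is_private("192.168.1.10"))                   # True
print(is_private("172.16.5.4"), is_private("172.32.0.1"))  # True False
```

Building the long patterns from a named octet fragment keeps each alternative readable and avoids repeating the same sub-pattern by hand.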