Difference between revisions of "OPS102 - Regular Expressions"

From CDOT Wiki

Revision as of 10:40, 5 December 2023

Regular Expressions are search patterns for "Regular Text". They are used by many different tools and languages, including the Linux grep command, the Windows findstr command, less, vi/vim, sed, awk, perl, python, and many others.

Why Use Regular Expressions?

Regular Expressions can be a little daunting to learn: they often look like someone was just bashing their head against the keyboard (or, like a cat was lying on the keyboard). But they are very powerful - a well-written regular expression can replace many pages of code in a programming language such as C or C++ - and so it is worth investing some time to understand them.

The Seven Basic Elements of Regular Expressions

Characters

In a regular expression (regexp), any character that doesn't otherwise have a special meaning matches that character. So the digit "5", for example, matches the digit "5"; similarly "cat" matches the letters "c", "a", and "t" in sequence.

A backslash can be used to remove any special meaning which a character has. The period character "." is a type of wildcard (see below), so to search for a literal period, we place a backslash in front of it: "\."

Wildcards

A period "." will match any single character. Similarly, three periods "..." will match any three characters.
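
The rules for literal characters, the wildcard, and the backslash can be tried out in a short sketch. Python's re module is assumed here only because Python is one of the tools listed above; the helper name found is made up for this illustration.

```python
import re

def found(pattern, text):
    """Return True if pattern matches anywhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

# Ordinary characters match themselves, in sequence.
print(found(r"cat", "concatenate"))    # True
print(found(r"5", "route 66"))         # False

# An unescaped period matches any single character.
print(found(r"3.14", "3x14"))          # True

# A backslash makes the period literal.
print(found(r"3\.14", "3x14"))         # False
print(found(r"3\.14", "pi is 3.14"))   # True

# Three periods match any three characters.
print(found(r"c...r", "caper"))        # True
print(found(r"c...r", "car"))          # False
```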

Bracket Expressions / Character Classes

Bracket Expressions or Character Classes are contained in square brackets "[ ]":

  • A list of characters in square brackets will match any one character from the list of characters: "[abc]" will match "a", "b", or "c"
  • A range of characters in square brackets, written as a starting character, a dash, and an ending character, will match any character in that range: "[0-9]" will match any one digit.
  • There are some pre-defined named character classes. These are selected by specifying the name of the character class surrounded by colons and square brackets, placed within outer square brackets, like "[[:digit:]]". The available names are:
    • alnum - alphanumeric
    • alpha - alphabetic characters
    • blank - horizontal whitespace (space, tab)
    • cntrl - control characters
    • digit - digits
    • graph - letters, digits, and punctuation
    • print - letters, digits, punctuation, and space
    • punct - punctuation marks
    • space - horizontal and vertical whitespace (space, tab, newline, carriage return, vertical tab, form feed)
    • upper - UPPERCASE letters
    • lower - lowercase letters
    • xdigit - hexadecimal digits (digits plus a-f and A-F)
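  • Ranges, lists, and named character classes may be combined - e.g., "[[:digit:]+.,-]" "[[:digit:][:punct:]]" "[0-9_*]" (in the first example the dash is placed last so that it is taken literally rather than as part of a range)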
  • To invert a character class, add a caret ^ character as the first character after the opening square bracket: "[^[:digit:]]" matches any non-digit character, and "[^:]" matches any character that is not a colon.
  • To include a literal caret, place it anywhere except first in the character class (for example, at the end). To include a literal closing square bracket, place it at the start of the character class; a literal dash may be placed at the start or at the end.
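
As a rough illustration of lists, ranges, inverted classes, and literal placement, here is a small sketch using Python's re module. Note the assumption: Python's re does not implement the POSIX named classes such as "[[:digit:]]", so this sketch uses explicit ranges instead; with grep the named classes can be used directly. The helper name first_match is made up for this example.

```python
import re

def first_match(pattern, text):
    """Return the first substring matching pattern, or None (hypothetical helper)."""
    m = re.search(pattern, text)
    return m.group() if m else None

# A list matches any one character from the list.
print(first_match(r"[abc]", "xyzb"))      # b

# A range matches any one character in the range.
print(first_match(r"[0-9]", "room 42"))   # 4

# Lists and ranges can be combined.
print(first_match(r"[0-9_*]", "a_b"))     # _

# A leading caret inverts the class.
print(first_match(r"[^0-9]", "42nd"))     # n
print(first_match(r"[^:]", "::x"))        # x

# Literal dash first, literal caret last.
print(first_match(r"[-+^]", "2^8"))       # ^
print(first_match(r"[-+^]", "a-b"))       # -

# Literal closing bracket first.
print(first_match(r"[]}]", "a]"))         # ]
```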

Repetition

  • A repeat count can be placed in curly brackets. It applies to the previous element: "x{3}" matches "xxx"
  • A repeat can be a range, written as min,max in curly brackets: "x{2,5}" will match "xx", "xxx", "xxxx", or "xxxxx"
  • The maximum value in a range can be omitted: "x{2,}" will match two or more "x" characters in a row
  • There are short forms for some commonly-used ranges:
    • "*" is the same as "{0,}" (zero or more)
    • "+" is the same as "{1,}" (one or more)
    • "?" is the same as "{0,1}" (zero or one)
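
The repeat counts and their short forms can be checked with a sketch like the following. It assumes Python's re module and uses re.fullmatch so that the whole string must match; with a plain search, "x{2,5}" would also find a match inside a longer run of "x" characters. The helper name full is made up for this example.

```python
import re

def full(pattern, text):
    """Return True if the entire text matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# Exact count
print(full(r"x{3}", "xxx"))         # True
print(full(r"x{3}", "xx"))          # False

# Range of counts
print(full(r"x{2,5}", "xxxxx"))     # True
print(full(r"x{2,5}", "x"))         # False
print(full(r"x{2,5}", "xxxxxx"))    # False

# Open-ended range
print(full(r"x{2,}", "xxxxxxxx"))   # True

# Short forms: * (zero or more), + (one or more), ? (zero or one)
print(full(r"ab*c", "ac"))          # True
print(full(r"ab+c", "ac"))          # False
print(full(r"ab+c", "abbbc"))       # True
print(full(r"colou?r", "color"))    # True
print(full(r"colou?r", "colour"))   # True
```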

Alternation

  • The vertical bar indicates alternation - either the expression on the left or the right can be matched: "hot|cold" will match "hot" or "cold"

Grouping

  • Elements placed in parentheses are treated as a group, and can be repeated: "(na)* batman" will match "nananana batman" and "nananananananana batman"
  • Grouping may also be used to limit alternation: "(fire|green)house" will match "firehouse" and "greenhouse"
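
A short sketch may help show how alternation and grouping interact. It assumes Python's re module and whole-string matching with re.fullmatch; the helper name full is made up for this example. The last two lines show what happens when the parentheses are left out: the vertical bar then splits the entire pattern.

```python
import re

def full(pattern, text):
    """Return True if the entire text matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# Alternation: either side may match.
print(full(r"hot|cold", "hot"))                     # True
print(full(r"hot|cold", "cold"))                    # True
print(full(r"hot|cold", "warm"))                    # False

# A group can be repeated as a unit (zero repeats also count for *).
print(full(r"(na)* batman", "nananana batman"))     # True
print(full(r"(na)* batman", " batman"))             # True

# A group limits how far the alternation reaches.
print(full(r"(fire|green)house", "firehouse"))      # True
print(full(r"(fire|green)house", "greenhouse"))     # True

# Without the group, the choices are "fire" or "greenhouse".
print(full(r"fire|greenhouse", "firehouse"))        # False
print(full(r"fire|greenhouse", "fire"))             # True
```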

Anchors

  • Anchors match locations, not characters.
  • A caret symbol will match the start of a line: "^[[:upper:]]" will match lines that start with an uppercase letter.
  • A dollar sign will match the end of a line: "[[:punct:]]$" will match lines that end with a punctuation mark.
  • The two characters may be used together: "cat" will match the word "cat" anywhere on a line, but "^cat$" will only match lines that contain only the word "cat". Likewise, "^[0-9.]+$" will match lines that are made up of only digits and dot characters (without the "+", the pattern would match only lines consisting of a single digit or dot).
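
The effect of anchors can be seen by filtering a few sample lines, roughly the way grep would. This is a sketch assuming Python's re module; the sample lines and the helper name matching_lines are made up for the illustration, and "[A-Z]" stands in for "[[:upper:]]" because Python's re does not implement the POSIX named classes.

```python
import re

def matching_lines(pattern, lines):
    """Return the lines in which pattern matches somewhere (hypothetical helper)."""
    return [line for line in lines if re.search(pattern, line)]

lines = ["cat", "concatenate", "3.14", "7", "v1.2"]

# Unanchored: matches anywhere on the line.
print(matching_lines(r"cat", lines))         # ['cat', 'concatenate']

# Anchored at both ends: the line must contain nothing else.
print(matching_lines(r"^cat$", lines))       # ['cat']

# A bracket expression between anchors matches exactly one character...
print(matching_lines(r"^[0-9.]$", lines))    # ['7']

# ...so a repetition is needed for longer lines.
print(matching_lines(r"^[0-9.]+$", lines))   # ['3.14', '7']

# Start-of-line anchor with an uppercase range.
print(matching_lines(r"^[A-Z]", ["Hello", "hello"]))   # ['Hello']
```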

Examples

Description: Word
  • Regexp: Hello
  • Matches:
    • Hello there!
    • Hello, World!
    • He said, "Hello James", in a very threatening tone
  • Does not match:
    • Hi there
    • Hell Of a Day
    • h el lo

Description: IP Address (IPv4 dotted quad)
  • Regexp: ((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])

Description: Private IP Address
  • Regexp: (10\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])|192\.168|172\.(1[6-9]|2[0-9]|3[0-1]))\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])
  • Comments: Valid IPv4 address with a first octet of "10." or first two octets of "192.168." or first octet of "172." followed by a second octet in the range 16-31.
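
Long patterns like the IP address examples are easier to read when built from named pieces. The following is a sketch under stated assumptions rather than a definitive validator: it uses Python's re module, an octet sub-pattern limited to 0-255, and added "^" and "$" anchors so that the whole string must be an address. The names OCTET, IPV4, PRIVATE, is_ipv4, and is_private are made up for this example.

```python
import re

# One octet, 0-255 (assumed sub-pattern for this sketch).
OCTET = r"(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"

# Three "octet." groups followed by a final octet, anchored at both ends.
IPV4 = re.compile(r"^(" + OCTET + r"\.){3}" + OCTET + r"$")

# 10.x.x.x, 192.168.x.x, or 172.16-31.x.x
PRIVATE = re.compile(
    r"^(10\." + OCTET + r"|192\.168|172\.(1[6-9]|2[0-9]|3[0-1]))"
    + r"\." + OCTET + r"\." + OCTET + r"$"
)

def is_ipv4(text):
    """Return True if text is a dotted-quad IPv4 address."""
    return IPV4.search(text) is not None

def is_private(text):
    """Return True if text is an address in one of the private ranges."""
    return PRIVATE.search(text) is not None

print(is_ipv4("192.168.0.1"))      # True
print(is_ipv4("256.1.1.1"))        # False
print(is_ipv4("1.2.3"))            # False
print(is_private("10.0.0.1"))      # True
print(is_private("172.16.5.4"))    # True
print(is_private("172.32.0.1"))    # False
print(is_private("8.8.8.8"))       # False
```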