Difference between revisions of "OPS102 - Regular Expressions"
Chris Tyler (talk | contribs) (Created page with "'''Regular Expressions''' are search patterns for "Regular Text". They are used by many different tools and languages, including the Linux grep command, the Windows findstr co...") |
Chris Tyler (talk | contribs) (→Video Lecture) |
Latest revision as of 11:35, 27 March 2024
Regular Expressions are search patterns for "Regular Text". They are used by many different tools and languages, including the Linux grep command, the Windows findstr command, less, vi/vim, sed, awk, perl, python, and many others.
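Video Lecture
- Video Lecture on Regular Expressions: https://seneca-my.sharepoint.com/:v:/g/personal/chris_tyler_senecapolytechnic_ca/EUGN0BHIlzlCmrjXwZgYdSQBoJvWjX9wwfDZKFKS9sGXVg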
Recommendations:
- Take notes
- Use the lecture speedup control if desired (1.2x is a good setting)
- Use the lecture along with the notes on this page
Why Use Regular Expressions?
Regular Expressions can be a little daunting to learn: they often look like someone was just bashing their head against the keyboard (or, like a cat was lying on the keyboard). But they are very powerful - a well-written regular expression can replace many pages of code in a programming language such as C or C++ - and so it is worth investing some time to understand them.
The Seven Basic Elements of Regular Expressions
Characters
In a regular expression (regexp), any character that doesn't otherwise have a special meaning matches that character. So the digit "5", for example, matches the digit "5"; similarly "cat" matches the letters "c", "a", and "t" in sequence.
A backslash can be used to remove any special meaning which a character has. The period character "." is a type of wildcard (see below), so to search for a literal period, we place a backslash in front of it: "\."
Wildcards
A period "." will match any single character. Similarly, three periods "..." will match any three characters.
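As a minimal sketch of literal characters, the backslash escape, and the period wildcard, the following uses Python's re module (Python is one of the languages mentioned at the top of this page). The helper name matches is made up for this illustration.

```python
import re

def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None

# Ordinary characters match themselves, in sequence.
print(matches(r"cat", "concatenate"))      # True

# An unescaped period is a wildcard, so "3.14" also matches "3x14".
print(matches(r"3.14", "3x14"))            # True

# A backslash makes the period literal, so only a real "." matches.
print(matches(r"3\.14", "3x14"))           # False
print(matches(r"3\.14", "pi is 3.14"))     # True

# Three periods match any three characters.
print(matches(r"c...r", "a cedar tree"))   # True
```

Forgetting the backslash is a common source of patterns that match more than intended, which is why the "3.14" case is worth trying by hand.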
Bracket Expressions / Character Classes
Bracket Expressions or Character Classes are contained in square brackets "[ ]":
- A list of characters in square brackets will match any one character from the list of characters: "[abc]" will match "a", "b", or "c"
- A range of characters in square brackets, written as a starting character, a dash, and an ending character, will match any one character in that range: "[0-9]" will match any one digit.
- There are some pre-defined named character classes. These are selected by specifying the name of the character class surrounded by colons and square brackets, placed within outer square brackets, like "[[:digit:]]". The available names are:
  - alnum - alphanumeric characters (letters and digits)
  - alpha - alphabetic characters
  - blank - horizontal whitespace (space, tab)
  - cntrl - control characters
  - digit - digits
  - graph - letters, digits, and punctuation
  - print - letters, digits, punctuation, and space
  - punct - punctuation marks
  - space - horizontal and vertical whitespace (space, tab, newline, carriage return, vertical tab, form feed)
  - upper - UPPERCASE letters
  - lower - lowercase letters
  - xdigit - hexadecimal digits (digits plus a-f and A-F)
- Ranges, lists, and named character classes may be combined - e.g., "[[:digit:]+.,-]" "[[:digit:][:punct:]]" "[0-9_*]"
- To invert a character class, add a caret "^" as the first character after the opening square bracket: "[^[:digit:]]" matches any non-digit character, and "[^:]" matches any character that is not a colon.
- To include a literal caret, place it anywhere except first (for example, at the end of the character class). To include a literal closing square bracket, place it first in the character class; to include a literal dash, place it first or last, so that it cannot be read as part of a range.
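The bracket expression rules above can be tried out with a short Python sketch. One assumption to note: Python's re module does not understand the POSIX named classes such as [[:digit:]], so this example writes the equivalent ASCII range (0-9) instead. The helper name find_all is hypothetical.

```python
import re

def find_all(pattern, text):
    """Return every non-overlapping match of the pattern in the text."""
    return re.findall(pattern, text)

# A list: any one of the characters a, b, or c.
print(find_all(r"[abc]", "a big cab"))    # ['a', 'b', 'c', 'a', 'b']

# A range: any one digit (stands in for [[:digit:]] in grep).
print(find_all(r"[0-9]", "Room 42B"))     # ['4', '2']

# A range combined with a list of extra characters.
print(find_all(r"[0-9_*]", "a_1*b"))      # ['_', '1', '*']

# An inverted class: any one character that is NOT a digit.
print(find_all(r"[^0-9]", "4a2"))         # ['a']

# Literal dash placed first and literal caret placed last.
print(find_all(r"[-^]", "2^8-1"))         # ['^', '-']
```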
Repetition
- A repeat count can be placed in curly brackets. It applies to the previous element: "x{3}" matches "xxx"
- A repeat can be a range, written as min,max in curly brackets: "x{2,5}" will match "xx", "xxx", "xxxx", or "xxxxx"
- The maximum value in a range can be omitted: "x{2,}" will match two or more "x" characters in a row
- There are short forms for some commonly-used ranges:
  - "*" is the same as "{0,}" (zero or more)
  - "+" is the same as "{1,}" (one or more)
  - "?" is the same as "{0,1}" (zero or one)
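The repetition rules can be checked with a small Python sketch. It uses re.fullmatch so that the whole string must be consumed by the pattern, which makes the counts easy to see. The helper name whole is made up for this example.

```python
import re

def whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# An exact count applies to the previous element.
print(whole(r"x{3}", "xxx"))        # True
print(whole(r"x{3}", "xx"))         # False

# A min,max range.
print(whole(r"x{2,5}", "x"))        # False
print(whole(r"x{2,5}", "xxxxx"))    # True
print(whole(r"x{2,5}", "xxxxxx"))   # False

# An open-ended range: two or more.
print(whole(r"x{2,}", "xxxxxxxx"))  # True

# Short forms: * is {0,}, + is {1,}, ? is {0,1}.
print(whole(r"ab*c", "ac"))         # True  (zero "b" characters)
print(whole(r"ab+c", "ac"))         # False (at least one "b" needed)
print(whole(r"colou?r", "color"))   # True  (the "u" is optional)
```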
Alternation
- The vertical bar indicates alternation - either the expression on the left or the right can be matched: "hot|cold" will match "hot" or "cold"
Grouping
- Elements placed in parentheses are treated as a group, and can be repeated: "(na)* batman" will match "nananana batman" and "nananananananana batman"
- Grouping may also be used to limit alternation: "(fire|green)house" will match "firehouse" and "greenhouse"
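Alternation and grouping interact, and a short Python sketch may make the difference easier to see. It again uses re.fullmatch so that the whole string has to match; the helper name whole is hypothetical.

```python
import re

def whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# Alternation: either side of the vertical bar may match.
print(whole(r"hot|cold", "hot"))                     # True
print(whole(r"hot|cold", "cold"))                    # True

# A group can be repeated as a unit.
print(whole(r"(na)* batman", "nananana batman"))     # True

# Grouping limits how far the alternation reaches.
print(whole(r"(fire|green)house", "firehouse"))      # True
print(whole(r"(fire|green)house", "greenhouse"))     # True

# Without the group, the choices are "fire" or "greenhouse".
print(whole(r"fire|greenhouse", "firehouse"))        # False
```

The last case shows why the parentheses matter: the vertical bar otherwise splits the entire expression in two.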
Anchors
- Anchors match locations, not characters.
- A caret symbol "^" will match the start of a line: "^[[:upper:]]" will match lines that start with an uppercase letter.
- A dollar sign "$" will match the end of a line: "[[:punct:]]$" will match lines that end with a punctuation mark.
- The two characters may be used together: "cat" will match the word "cat" anywhere on a line, but "^cat$" will only match lines that contain only the word "cat". Likewise, "^[0-9.]+$" will match lines that are made up of only digits and dot characters (the "+" is needed so that the character class can repeat across the whole line).
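Here is a minimal Python sketch of the anchors. As before, Python's re module has no POSIX named classes, so [A-Z] stands in for [[:upper:]] and a short list of marks stands in for [[:punct:]]; both substitutions are assumptions made for this example, as is the helper name found.

```python
import re

def found(pattern, line):
    """Return True if the pattern is found anywhere in the line."""
    return re.search(pattern, line) is not None

# Unanchored: "cat" may appear anywhere on the line.
print(found(r"cat", "concatenate"))         # True

# Anchored at both ends: the line must be exactly "cat".
print(found(r"^cat$", "concatenate"))       # False
print(found(r"^cat$", "cat"))               # True

# Start of line followed by an uppercase letter.
print(found(r"^[A-Z]", "Hello there"))      # True
print(found(r"^[A-Z]", "hello there"))      # False

# A punctuation mark followed by the end of the line.
print(found(r"[.!?]$", "Thanks."))          # True

# Only digits and dots on the line; the + lets the class repeat.
print(found(r"^[0-9.]+$", "192.168.0.1"))   # True
print(found(r"^[0-9.]+$", "IP=1.2.3.4"))    # False
```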
Examples
Description | Regexp (GNU Extended Grep dialect - "grep -E") | Matches | Does not match | Comments
---|---|---|---|---
A specific word | Hello | Hello<br>Hello there!<br>Hello, World!<br>He said, "Hello James", in a very threatening tone | Hi there<br>Hell of a Day<br>h el lo |
A specific word with nothing else on the line | ^Hello$ | Hello | Hello there!<br>Hello, World!<br>He said, "Hello James", in a very threatening tone<br>Hi there<br>Hell of a Day<br>h el lo |
5-character line | ^.....$ | rouge<br>green<br>Ho-ho | Yellow<br>long line<br>tiny<br>12-45-78 |
Lines that start with a vowel | ^[AEIOUYaeiouy] | Allo<br>Everything<br>Energy<br>Under<br>Yellow | Hello<br>White<br>4164915050<br>Grinch | The character class includes both UPPERCASE and lowercase letters. You could instead use the option (specific to the tool you're using) to ignore case; for example, -i for grep or /I for findstr.
Lines that end in a punctuation mark | [[:punct:]]$ | Hello there!<br>Thanks.<br>What do you think? | Hello there<br>416-491-5050<br>New Year greetings |
An integer | ^[-+]?[[:digit:]]+$ | +15<br>-2<br>720<br>1440<br>1280<br>1920<br>000<br>012 | + 4<br>3.14<br>0x47<br>$1.13<br>$4 |
A decimal number | ^[-+]?[[:digit:]]+(\.[[:digit:]]*)?$ | +3.14<br>42<br>-1000.0<br>+212<br>+36.7<br>42.00<br>3.333333333<br>0.976 | .976<br>+-200<br>1.1.1.1<br>13.4.7 | The decimal point and the digits after it are grouped and made optional with "?", so whole numbers such as 42 also match.
A Canadian Postal Code | ^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ] ?[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$ | H0H 0H0<br>M3C 1L2<br>K1A 0A2<br>T2G 0P3<br>V8W 9W2<br>R3B 0N2<br>M2J2X5<br>M5S 2C6 | POB 1L0<br>90210<br>MN4 2R6 | A Canadian postal code alternates between letters and digits: A9A 9A9. The first letter must be one of ABCEGHJKLMNPRSTVXY and the remaining letters must be one of ABCEGHJKLMNPRSTVWXYZ.
Phone Numbers (Canada/US) | ^[^+[:digit:]]*(\+?1)?[^+[:digit:]]*[2-9]([^+[:digit:]]*[0-9]){9}[^+[:digit:]]*$ | (416) 967-1111<br>+1 416-736-3636<br>416-439-0000 | +65 6896 2391<br>555-1212 | A Canadian/US phone number consists of a 3-digit Area Code (which may not start with 0 or 1) and a 7-digit local number consisting of an exchange (3 digits) and a line (4 digits). The country code for Canada and the US is 1, so the number may be preceded by +1 or 1. Area codes are sometimes contained in parentheses, and dashes or spaces are sometimes used as separators.
IP Address (IPv4 dotted quad) | ^(((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9]))\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])$ | 1.1.1.1<br>4.4.8.8<br>8.8.8.8<br>7.12.9.43<br>10.106.32.109<br>172.16.97.1<br>192.168.0.1<br>255.255.255.0 | IP=67.69.105.143<br>1.10.100.1000<br>IP=100.150.200.250<br>103.271.92.16<br>1O.10.10.10 | An IPv4 address in "dotted quad" notation consists of four numbers in the range 0-255 separated by periods. The numbers are called "octets" (which means a collection of eight bits, an alternate way of saying "byte").
Private IP Address | ^(10\.((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9]))|192\.168|172\.(1[6-9]|2[0-9]|3[0-1]))\.((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9]))\.((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9]))$ | 10.4.72.13<br>172.16.97.1<br>192.168.0.1 | IP=192.168.113.42<br>1.1.1.1<br>4.4.8.8<br>192.169.12.6<br>192.168.400.37<br>Address is 1 . 2 . 3 . 4 | Valid IPv4 dotted quad address with a first octet of 10; or first two octets of 192.168; or first octet of 172 followed by a second octet in the range 16-31.
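A few of the table's patterns can be tried against its sample lines with a short Python sketch. This is only an approximation of the grep -E behaviour: Python's re module does not support POSIX named classes, so [[:digit:]] is rewritten as [0-9] here. The names OCTET, PATTERNS, and is_match are invented for this example.

```python
import re

# One octet (0-255), reused to build the dotted-quad pattern.
OCTET = r"(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])"

PATTERNS = {
    "integer": r"^[-+]?[0-9]+$",
    "postal_code": (
        r"^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ]"
        r" ?[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$"
    ),
    # Three "octet plus dot" groups, then a final octet.
    "ipv4": rf"^({OCTET}\.){{3}}{OCTET}$",
}

def is_match(name, line):
    """Return True if the named pattern matches the line."""
    return re.search(PATTERNS[name], line) is not None

for name, line in [
    ("integer", "+15"),
    ("integer", "3.14"),
    ("postal_code", "M2J2X5"),
    ("postal_code", "POB 1L0"),
    ("ipv4", "192.168.0.1"),
    ("ipv4", "103.271.92.16"),
]:
    print(f"{name:12} {line:16} {is_match(name, line)}")
```

Building the long pattern from a named piece such as OCTET is one way to keep a lengthy regular expression readable; the resulting pattern is the same as writing it out in full.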
Regular Expression Dialects
Regular expressions have evolved over the years, and the various tools that handle regular expressions have different capabilities and slightly different syntax.
In particular, the original Unix search tool grep came in three varieties:
- fgrep, which could search only for fixed text patterns
- grep, which handled basic regular expressions
- egrep, which handled an extended form of regular expressions
The GNU project originally shipped all three commands, but fgrep and egrep were never fully standardized, so they were removed from the POSIX standard in 2001. GNU grep has long treated them as deprecated as well, and recent releases print a warning recommending grep -F and grep -E instead.
Unlike the original Unix grep, the GNU grep can handle the full extended regular expression syntax, in either of two ways:
- To use the special characters (called "meta-characters") ?, +, {, |, (, and ), precede them with a backslash. In other words, while a backslash makes special characters like . or * ordinary, it also makes the ordinary characters ? + { | ( ) into special characters.
- Alternatively, use the -E option to make grep understand extended regular expressions, which causes ? + { | ( ) to become special characters without a backslash.
Other tools, such as sed, similarly require backslashes in front of some of the extended regexp meta-characters (or, if you're using a GNU version of sed, you can use the -E option to enable extended regular expressions, just like GNU grep).
The Perl language introduced one of the most powerful and consistent versions of the regular expression language. There has been increasing consensus around "Perl-Compatible Regular Expressions" (aka PCRE) and that dialect is available in many tools (including GNU grep via the -P option, as well as the PCRE/PCRE2 library for C and C++ programs, which is used in many software packages including Safari and Apache httpd).
Using Regular Expressions
Regular expressions can be used in many places:
- Linux
  - GNU grep
  - The bash test command [[ "string" =~ regexp ]]
  - The less command, using the / and ? keystrokes for searching forward and backward
  - The vi/vim editor, also using the / and ? keystrokes for searching forward and backward
  - The sed and awk utilities
- Windows
  - findstr /R
- Languages
  - PowerShell
  - Python
  - JavaScript
  - Perl
  - ...and many others!
Windows findstr and Regular Expressions
The Windows findstr command accepts regular expressions or literal expressions. It will guess what you're using, and may guess incorrectly, so it's best to use the /R and /L options to directly specify whether your search pattern is a regexp or a literal.
Findstr permits multiple search patterns in a quoted string, separated by a space; this acts like a type of alternation. However, this makes it impossible to use a literal space in a search pattern. If you wish to include a space in your search pattern, prepend /C: to your search string. You can use multiple /C: search strings.
For example, FINDSTR /R /C:"red" /C:"blue" INPUTFILE is roughly equivalent to grep -E "red|blue" INPUTFILE
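The "any one of several patterns" behaviour described above can be modelled in a few lines of Python. This is only a sketch of the idea: it uses Python's own regular expression dialect rather than findstr's, and the function name any_pattern_matches is invented for this illustration.

```python
import re

def any_pattern_matches(patterns, line):
    """Return True if at least one of the patterns is found in the line."""
    return any(re.search(p, line) is not None for p in patterns)

# Space-separated patterns in one quoted string: each word is its own pattern.
space_separated = "red blue".split()
print(any_pattern_matches(space_separated, "a blue car"))   # True

# The same result as a single pattern using alternation.
print(re.search(r"red|blue", "a blue car") is not None)     # True

# Separately supplied patterns (like repeated /C: options) may contain spaces.
explicit = ["dark red", "blue"]
print(any_pattern_matches(explicit, "a dark red car"))      # True
print(any_pattern_matches(explicit, "a red car"))           # False
```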
Findstr is also limited to (approximately) 127 characters in the regular expression.
For information on findstr's regular expression dialect, see help findstr