OPS102 - Regular Expressions

1,557 bytes added, 27 March
[[Category:OPS102]]<!-- {{Chris Tyler Draft}} -->'''Regular Expressions''' are search patterns for "Regular Text". They are used by many different tools and languages, including the Linux grep command, the Windows findstr command, less, vi/vim, sed, awk, perl, python, and many others.
 
== Video Lecture ==
 
* [https://seneca-my.sharepoint.com/:v:/g/personal/chris_tyler_senecapolytechnic_ca/EUGN0BHIlzlCmrjXwZgYdSQBoJvWjX9wwfDZKFKS9sGXVg Video Lecture on Regular Expressions]
 
Recommendations:
* Take notes
* Use the lecture speedup control if desired (1.2x is a good setting)
* Use the lecture along with the notes on this page
== Why Use Regular Expressions? ==
{|cellspacing="0" width="100%" cellpadding="5" border="1"
|-
!width="*"|Description!!width="*"|Regexp (GNU Extended Grep dialect - "grep -E")!!width="*"|Matches!!width="*"|Does not match!!width="*"|Comments
|-
|A specific word||<code><nowiki>Hello</nowiki></code>||Hello<br>Hello there!<br>Hello, World!<br>He said, "Hello James", in a very threatening tone||Hi there<br>Hell of a Day<br>h el lo||
|-
|An integer||<code><nowiki>^[-+]?[[:digit:]]+$</nowiki></code>||+15<br>-2<br>720<br>1440<br>1280<br>1920<br>000<br>012||+ 4<br>3.14<br>0x47<br>$1.13<br>$4||
|-
|A decimal number||<code><nowiki>^[-+]?[[:digit:]]+(\.[[:digit:]]*)?$</nowiki></code>||+3.14<br>42<br>-1000.0<br>+212<br>+36.7<br>42.00<br>3.333333333<br>0.976||.976<br>+-200<br>1.1.1.1<br>13.4.7||
|-
|A Canadian Postal Code||<code><nowiki>^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ] ?[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$</nowiki></code>||H0H 0H0<br>M3C 1L2<br>K1A 0A2<br>T2G 0P3<br>V8W 9W2<br>R3B 0N2<br>M2J2X5<br>M5S 2C6||POB 1L0<br>90210<br>MN4 2R6||A Canadian postal code alternates between letters and digits: A9A 9A9. The first letter must be one of ABCEGHJKLMNPRSTVXY and the remaining letters must be one of ABCEGHJKLMNPRSTVWXYZ.
|-
|Phone Numbers (Canada/US)||<code><nowiki>^[^+[:digit:]]*(\+?1)?[^+[:digit:]]*[2-9]([^+[:digit:]]*[0-9]){9}[^+[:digit:]]*$</nowiki></code>||(416) 967-1111<br>+1 416-736-3636<br>416-439-0000||+65 6896 2391<br>555-1212||A Canadian/US phone number consists of a 3-digit area code (which may not start with 0 or 1) and a 7-digit local number consisting of an exchange (3 digits) and a line (4 digits). The country code for Canada and the US is 1, so the number may be preceded by +1 or 1. Area codes are sometimes enclosed in parentheses, and dashes or spaces are sometimes used as separators.
|-
|IP Address (IPv4 dotted quad)||<code><nowiki>^(((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9]))\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])$</nowiki></code>||1.1.1.1<br>4.4.8.8<br>8.8.8.8<br>7.12.9.43<br>10.106.32.109<br>172.16.97.1<br>192.168.0.1<br>255.255.255.0||IP=67.69.105.143<br>1.10.100.1000<br>IP=100.150.200.250<br>103.271.92.16<br>1O.10.10.10||An IPv4 address in "dotted quad" notation consists of four numbers in the range 0-255 separated by periods. The numbers are called "octets" (which means a collection of eight bits, an alternate way of saying "byte").
|-
|Private IP Address||<code><nowiki>^(10\.((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9]))|192\.168|172\.(1[6-9]|2[0-9]|3[0-1]))\.((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9]))\.((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9]))$</nowiki></code>||10.4.72.13<br>172.16.97.1<br>192.168.0.1||IP=192.168.113.42<br>1.1.1.1<br>4.4.8.8<br>192.169.12.6<br>192.168.400.37<br>Address is 1 . 2 . 3 . 4||Valid IPv4 dotted quad address with a first octet of 10; or first two octets of 192.168; or first octet of 172 followed by a second octet in the range 16-31.
|}
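The patterns in the table can be tried out interactively. The following is a minimal sketch in Python, not the <code>grep -E</code> syntax itself: Python's <code>re</code> module does not support POSIX bracket classes such as <code>[[:digit:]]</code>, so <code>[0-9]</code> is used instead, and the IPv4 pattern here spells out one octet (0-255) as a reusable piece. The names <code>INTEGER</code>, <code>DECIMAL</code>, <code>OCTET</code>, <code>IPV4</code>, and <code>matching</code> are illustrative only.

```python
import re

# POSIX classes such as [[:digit:]] are not supported by Python's re module,
# so [0-9] is used instead (an adaptation of the grep -E patterns).
INTEGER = re.compile(r"^[-+]?[0-9]+$")
DECIMAL = re.compile(r"^[-+]?[0-9]+(\.[0-9]*)?$")

# One octet in the range 0-255, written out by number of digits.
OCTET = r"(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])"
IPV4 = re.compile(r"^(" + OCTET + r"\.){3}" + OCTET + r"$")


def matching(pattern, candidates):
    """Return the candidates that the compiled pattern matches."""
    return [text for text in candidates if pattern.search(text)]


if __name__ == "__main__":
    print(matching(INTEGER, ["+15", "-2", "012", "+ 4", "3.14", "0x47"]))
    print(matching(DECIMAL, ["+3.14", "42", ".976", "+-200", "1.1.1.1"]))
    print(matching(IPV4, ["8.8.8.8", "192.168.0.1", "103.271.92.16", "1O.10.10.10"]))
```

Building the IPv4 pattern from a single octet sub-pattern keeps the four octets consistent and makes the range check easier to review than one long expression.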
* Alternatively, use the <code>-E</code> option to make grep understand extended regular expressions, which causes ? + { ( | ) to become special characters.
Other tools, such as sed, similarly require backslashes in front of some of the extended regexp meta-characters (or, if you're using a GNU version of sed, you can use the <code>-E</code> option to enable extended regular expressions, just like GNU grep).
The Perl language introduced one of the most powerful and consistent versions of the regular expression language. There has been increasing consensus around "Perl-Compatible Regular Expressions" (aka PCRE) and that dialect is available in many tools (including GNU grep via the <code>-P</code> option, as well as the [https://www.pcre.org/ PCRE/PCRE2 library] for C and C++ programs, which is used in many software packages including Safari and Apache httpd).
== Using Regular Expressions ==
* Languages
** Powershell
** Python
** JavaScript
** Perl
** ...and many others!
 
== Windows findstr and Regular Expressions ==
 
The Windows <code>findstr</code> command accepts regular expressions or literal expressions. It will guess what you're using, and may guess incorrectly, so it's best to use the <code>/R</code> and <code>/L</code> options to directly specify whether your search pattern is a regexp or a literal.
 
Findstr permits multiple search patterns in a quoted string, separated by a space; this acts like a type of alternation. However, this makes it impossible to use a literal space in a search pattern. If you wish to include a space in your search pattern, prepend <code>/C:</code> to your search string. You can use multiple <code>/C:</code> search strings.
 
For example, <code>FINDSTR /R /C:"red" /C:"blue" INPUTFILE</code> is roughly equivalent to <code>grep -E "red|blue" INPUTFILE</code>
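The "multiple patterns act like alternation" behaviour described above can be sketched in Python. This only models the selection logic (a line is selected if any one pattern matches); it does not reproduce findstr's own regular expression dialect, and the function names and sample lines are made up for illustration.

```python
import re


def matches_any(line, patterns):
    """True if at least one pattern matches somewhere in the line."""
    return any(re.search(p, line) for p in patterns)


def matches_alternation(line, patterns):
    """Same test, expressed as a single alternation such as red|blue."""
    combined = "|".join("(?:" + p + ")" for p in patterns)
    return re.search(combined, line) is not None


lines = ["a red door", "blue sky", "the door is green", "dark red and blue"]

# Two separate patterns, like /C:"red" /C:"blue".
multi = [l for l in lines if matches_any(l, ["red", "blue"])]
alt = [l for l in lines if matches_alternation(l, ["red", "blue"])]

# A space-separated string is treated as two patterns ("red", "door") ...
split_patterns = [l for l in lines if matches_any(l, "red door".split())]
# ... whereas a single /C:-style pattern keeps the space as a literal.
single_pattern = [l for l in lines if matches_any(l, ["red door"])]

if __name__ == "__main__":
    print(multi)
    print(alt)
    print(split_patterns)
    print(single_pattern)
```

The last two lists show why the space matters: splitting on the space selects any line containing either word, while the single pattern selects only lines containing the exact phrase.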
 
Findstr is also limited to (approximately) 127 characters in the regular expression.
 
For information on findstr's regular expression dialect, see <code>help findstr</code>