Changes

Jump to: navigation, search

Tutorial 9 - Regular Expressions

13,065 bytes added, 20:28, 25 October 2021
INVESTIGATION 1: SIMPLE & COMPLEX REGULAR EXPRESSIONS
<br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>
= INVESTIGATION 1: SIMPLE & COMPLEX REGULAR EXPRESSIONS =
<span style="color:red;">'''ATTENTION''': The due date for successfully completing this tutorial (i.e. tutorial 8) is by Friday, December 15 @ 11:59 PM (Week 14).</span><br>
 
In this investigation, you will learn how to use the '''grep''' command with '''simple and complex regular expressions'''<br>to help search for ''patterns'' contained in text files.
 
 
'''Perform the Following Steps:'''
 
# '''Login''' to your matrix account.<br><br>
# Issue a Linux command to '''confirm''' you are located in your '''home''' directory.<br><br>The '''wget''' command is used to download files from the Internet to your shell.<br>This will be useful to download '''text files''' and '''data files''' that we will be using in this tutorial.<br><br>
# Issue the following linux Linux command to '''download''' a text file to your '''home''' directory:<br><span style="color:blue;font-weight:bold;font-family:courier;">wget <nowiki>https://ict.senecacollege.ca/~murray.saul/uli101/textfile1.txt</nowiki></span><br><br>
# View the contents of the '''textfile1.txt''' file using the '''more''' command see what data is contained in this file.<br><br>Although there are several Linux commands that use regular expressions,<br>we will be using the '''grep''' command for this investigation.<br><br>[[Image:regexps-1.png|thumb|right|250px|Output of '''grep''' command matching simple regular expression "'''the'''" (only lowercase). Notice the pattern matches larger words like "'''their'''" or "'''them'''".]]
#Issue the following Linux command to match the pattern '''the''' within '''textfile1.txt''':<br><span style="color:blue;font-weight:bold;font-family:courier;">grep "the" textfile1.txt</span><br><br>Take a few moments to view the output and observe the matched patterns.<br><br>
# Issue the grep Linux command with the <span style="font-weight:bold;font-family:courier;">-i</span> option to ignore case sensitively:<br><span style="color:blue;font-weight:bold;font-family:courier;">grep -i "the" textfile1.txt</span><br><br>What do you notice is different when issuing this command?<br><br>You will notice that the pattern "'''the'''" is matched including larger words like "'''them'''" and "'''their'''".<br>You can issue the '''grep''' command with the <span style="font-weight:bold;font-family:courier;">-w</span> option to only match the pattern as a '''word'''.<br><br>
# Issue the following Linux command:<br><span style="color:blue;font-weight:bold;font-family:courier;">grep -w -i "the" textfile1.txt</span><br><br>You should now see only strings of text that match the word '''the''' (upper or lower case).<br><br>Matching literal or simple regular expressions can be useful, but are '''limited'''<br>in what pattens they can match. For example, you may want to<br>search for a pattern located at the '''beginning''' or '''end''' of the string.<br><br>There are other regular expression symbols that provide more '''precise''' search pattern matching.<br>These special characters are known as '''complex''' and '''extended''' regular expressions symbols.<br><br>For the remainder of this investigation, we will focus on '''complex regular expressions''' and then<br>focus on ''extended regular expressions'' in INVESTIGATION 2.<br><br><table align="right"><tr valign="top"><td>[[Image:regexps-2.png|thumb|right|280px|Anchoring regular expressions at the '''beginning''' of text.]]</td><td>[[Image:regexps-3.png|thumb|right|250px|Anchoring regular expressions at the '''ending''' of text.]]</td></tr></table>
# Issue the following Linux command:<br><span style="color:blue;font-weight:bold;font-family:courier;">grep -w -i "^the" textfile1.txt</span><br><br>The '''^''' symbol is referred to as an '''anchor'''.<br>In this case, it only matches<br>the word "'''the'''" (both upper or lowercase) at the <u>beginning</u> of the string.<br><br>
# Issue the following Linux command:<br><span style="color:blue;font-weight:bold;font-family:courier;">grep -w -i "the$" textfile1.txt</span><br><br>The '''$''' symbol is used to anchor patterns at the <u>end</u> of the string.<br><br>
# Issue the following Linux command to anchor the <u>word</u> "'''the'''"<br>'''simultaneously''' at the <u>beginning</u> and <u>end</u> of the string:<br><span style="color:blue;font-weight:bold;font-family:courier;">grep -w -i "^the$" textfile1.txt </span><br><br>What do you notice?<br><br>Anchoring patterns at both the <u>beginning</u> and <u>ending</u> of strings can greatly assist<br>for more '''precise''' search pattern matching.<br><br>We will now be demonstrate the '''effectiveness''' of <u>combining</u><br> '''anchors''' with <u>other</u> complex regular expressions symbols.<br><br><table align="right"><tr valign="top"><td>[[Image:regexps-4.png|thumb|right|280px|Anchoring regular expressions using '''period''' symbols at the '''beginning''' of text.]]</td><td>[[Image:regexps-5.png|thumb|right|250px|Anchoring regular expressions using '''period''' symbols simultaneously at the '''beginning''' and '''ending''' of text.]]</td></tr></table>
# Issue the following Linux command to match strings that '''begin with 3 characters''':<br><span style="color:blue;font-weight:bold;font-family:courier;">grep "^..." textfile1.txt</span><br><br>What do you notice? Can lines that contain '''less than 3 characters''' be displayed?<br><br>
# Issue the following Linux command to match strings that '''begin <u>and</u> end with 3 characters''':<br><span style="color:blue;font-weight:bold;font-family:courier;">grep "^...$" textfile1.txt</span><br><br>What do you notice compared to the previous command?<br><br>
# Issue the following Linux command to match strings that '''begin with 3 digits''':<br><span style="color:blue;font-weight:bold;font-family:courier;">grep "^[0-9][0-9][0-9]" textfile1.txt</span><br><br>What did you notice?<br><br>
# Issue the following Linux command to match strings that '''end with 3 uppercase letters''':<br><span style="color:blue;font-weight:bold;font-family:courier;">grep "[A-Z][A-Z][A-Z]$" textfile1.txt</span><br><br><table align="right"><tr valign="top"><td>[[Image:regexps-6.png|thumb|right|220px|Anchoring '''3 digits''' at the '''beginning''' and '''ending''' of text.]]</td><td>[[Image:regexps-7.png|thumb|right|250px|Anchoring '''3 alpha-numeric characters''' at the '''beginning''' and '''ending''' of text.]]</td></tr></table>What type of strings match this pattern?<br><br>
# Issue the following Linux command to match strings that '''consist of only 3 digits''':<br><span style="color:blue;font-weight:bold;font-family:courier;">grep "^[0-9][0-9][0-9]$" textfile1.txt</span><br><br>What did you notice?<br><br>
# Issue the following Linux command to match strings that '''consist of only 3 alphanumeric digits''':<br><span style="color:blue;font-weight:bold;font-family:courier;">grep "^[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]$" textfile1.txt</span><br><br>What did you notice?<br><br>The <span style="font-weight:bold;font-family:courier;">"*"</span> complex regular expression symbol is often confused with the "*" '''filename expansion''' symbol.<br>In other words, it does NOT represent zero or more of '''any character''', but zero or more '''occurrences'''<br>of the character that comes '''before''' the <span style="font-weight:bold;font-family:courier;">"*"</span> symbol.<br><br>
# To demonstrate, issue the following Linux command to display '''zero or more occurrences''' of the letter "'''x'''":<br><span style="color:blue;font-weight:bold;font-family:courier;">grep "x*" textfile1.txt</span><br><br>You will most likely notice most lines of the file is displayed.<br><br>
# Let's issue a Linux command to display strings that contain '''more than one occurrence''' of the letter "'''x'''":<br><span style="color:blue;font-weight:bold;font-family:courier;">grep "xx*" textfile1.txt</span><br><br>Why did this work? because the pattern indicates one occurrence of the letter "x",<br>followed by '''zero or MORE occurrences''' of the <u>next</u> letter "x".<br><br>If you combine the complex regular expression symbols <span style="font-weight:bold;font-family:courier;">".*"</span> it will act like<br>zero or more occurrences of <u>any</u> character (i.e. like <span style="font-weight:bold;font-family:courier;">"*"</span> did in filename expansion).<br><br>
# Issue the following Linux command to match strings begin and end with a number with nothing or anything inbetween:<br><span style="color:blue;font-weight:bold;font-family:courier;">grep "^[0-9].*[0-9]$" textfile1.txt</span><br><br>Using '''simultaneous anchors''' combined with the <span style="font-weight:bold;font-family:courier;">".*"</span> symbol(s) can help you to refine your search patterns of strings.<br><br>
# Issue the following Linux command to display strings that begin with a capital letter,<br>end with a number, and contains a capital X somewhere inbetween:<br><span style="color:blue;font-weight:bold;font-family:courier;">grep "^[A-Z].*X.*[0-9]$" textfile1.txt</span><br><br>Let's look at another series of examples involving searching for strings that only contain '''valid numbers'''.<br>We will use '''pipeline commands''' to both display stdout to the screen and save to files<br>for confirmation of running these pipeline commands when run a '''checking-script''' later in this investigation.<br><br>
# Issue the following Linux command to create the '''regexps''' directory: <span style="color:blue;font-weight:bold;font-family:courier;">mkdir ~/regexps</span><br><br>
# Change to the '''regexps''' directory and confirm that you have moved to this directory.<br><br>
# First, issue the following Linux command to download another data file called '''numbers1.dat''':<br><span style="color:blue;font-weight:bold;font-family:courier;">wget <nowiki>https://ict.senecacollege.ca/~murray.saul/uli101/numbers1.dat</nowiki></span><br><br>
# View the contents of the '''numbers.dat''' file using the '''more''' command and quickly view the contents of this file.<br>You should notice '''valid''' and '''invalid''' numbers contained in this file. When finished, exit the more command.<br><br>
# Issue the following linux pipeline command to display only '''whole''' numbers (i.e. no '''+''' or '''-''' sign):<br><span style="color:blue;font-weight:bold;font-family:courier;">grep "^[0-9]*$" numbers1.dat | tee faulty.txt</span><br><br>You may have noticed that the command '''does not entirely work'''. You may notice an '''empty line'''<br>(which is NOT a whole number). This occurs since the * regular expression symbol represents<br>ZERO or MORE occurrences of a number. You can use an additional numeric character class<br>with the * regular expression symbol to search for one or more occurrences of a number.<br><br>
# Issue the following Linux pipeline command to display only whole numbers:<br><span style="color:blue;font-weight:bold;font-family:courier;">grep "^[0-9][0-9]*$" numbers1.dat | tee whole.txt</span><br><br>You should see that this now works.<br><br>
# Issue the following Linux pipeline command to display <u>only</u> '''signed''' integers:<br><span style="color:blue;font-weight:bold;font-family:courier;">grep "^[+-][0-9][0-9]*$" numbers1.dat | tee signed.txt</span><br><br>What did you notice? Positive and negative numbers display, not '''unsigned''' numbers.<br><br>[[Image:regexps-8.png|thumb|right|300px|Simultaneous '''anchoring''' of regular expressions using '''character class''' and '''zero or more occurrences''' to display '''signed''' and '''unsigned''' integers.]]
# Issue the following Linux pipeline command to display '''signed''' or '''unsigned integers''':<br><span style="color:blue;font-weight:bold;font-family:courier;">grep "^[+-]*[0-9][0-9]*$" numbers1.dat | tee all.txt</span><br><br>Did this command work?<br><br>
# Issue the following command to check that you created those hard links: <br><span style="color:blue;font-weight:bold;font-family:courier;">~uli101/week9-check-1</span><br><br>If you encounter errors, then view the feedback to make corrections, and then re-run the checking script. If you receive a congratulation message that there are no errors, then proceed with this tutorial.<br><br>You can also use the '''grep''' command using ''regular expression'' as a '''filter''' in pipeline commands.<br><br>
# Issue the following Linux pipeline command:<br><span style="color:blue;font-weight:bold;font-family:courier;">ls | grep "[0-9].*dat$"</span><br><br>What did this pipeline display?<br><br>
# Issue the following Linux pipeline command:<br><span style="color:blue;font-weight:bold;font-family:courier;">ls | grep "[a-z].*txt$"</span><br><br>What did this pipeline display?<br><br>
 
: Although very useful, '''complex''' regular expressions do NOT <u>entirely</u> solve our problem of displaying<br> '''valid''' unsigned and signed numbers (not to mention displaying decimal numbers).<br><br>In the next investigation, you will learn how to use '''extended''' regular expressions that will completely solve this issue.<br>
 
: You can proceed to INVESTIGATION 2.
<br>
= INVESTIGATION 2: EXTENDED REGULAR EXPRESSIONS =

Navigation menu