
Tutorial9: Regular Expressions

36 bytes added, 13:56, 2 September 2020
INVESTIGATION 1: SIMPLE & COMPLEX REGULAR EXPRESSIONS
<br>
In this section, you will learn how to use the '''grep''' command with '''simple and complex regular expressions''' to search for patterns contained in text files.
# Issue the following Linux command ('''copy and paste''' to save time):<br><span style="color:blue;font-weight:bold;font-family:courier;">wget <nowiki>https://ict.senecacollege.ca/~murray.saul/uli101/textfile1.txt</nowiki></span><br><br>
# Issue the '''ls''' command to confirm that the text file was downloaded.<br><br>
# View the contents of the '''textfile1.txt''' file using the '''more''' command. When finished, '''exit''' the more command.<br><br>Although several Linux commands use regular expressions, we will only use the '''grep''' command in this section.<br><br>
# Issue the following Linux pipeline command to match the pattern '''the''' within '''textfile1.txt''':<br><span style="color:blue;font-weight:bold;font-family:courier;">grep "the" textfile1.txt | more</span><br><br>
# Now, issue the grep Linux pipeline command with the '''-i''' option to ignore case:<br><span style="color:blue;font-weight:bold;font-family:courier;">grep -i "the" textfile1.txt | more</span><br><br>What do you notice is different with this pipeline command?<br><br>You will notice that the pattern "the" is also matched inside larger words that contain it. You can use the '''-w''' option with the grep command to match the pattern only as a whole word.<br><br>
# Issue the following Linux pipeline command:<br><span style="color:blue;font-weight:bold;font-family:courier;">grep -w -i "the" textfile1.txt | more</span><br><br>You should now see only strings of text that contain the word '''"the"'''.<br><br>Matching literal or simple regular expressions can be useful, but they are limited in the pattern matching they can perform.<br>For example, you may want to search for a pattern at a specific location within the string of text (such as the beginning or end of the string).<br><br>There are other regular expression tools that provide more precise matches: '''complex''' and '''extended''' regular expressions. We will look at complex regular expression symbols now, and we will discuss ''extended regular expressions'' in the next section of this tutorial.<br><br>
# Issue the following Linux pipeline command:<br><span style="color:blue;font-weight:bold;font-family:courier;">grep -w -i "^the" textfile1.txt | more</span><br><br>The '''^''' symbol is an anchor. In this case, it matches the <u>word</u> "the" (in upper or lowercase) only at the beginning of a string.<br>The '''$''' symbol is used to anchor patterns at the end of strings.<br><br>
# Issue the following Linux pipeline command:<br><span style="color:blue;font-weight:bold;font-family:courier;">grep -w -i "the$" textfile1.txt | more</span><br><br>What do you notice?<br><br>
# Issue the following Linux pipeline command to anchor the word "the" simultaneously at the beginning and the end of the string:<br><span style="color:blue;font-weight:bold;font-family:courier;">grep -w -i "^the$" textfile1.txt | more</span><br><br>What do you notice?<br><br>Anchoring patterns at both the <u>beginning</u> and <u>end</u> of strings helps you build more precise and robust search patterns.<br>We will now demonstrate '''simultaneous anchoring''' with other complex regular expression symbols.<br><br>
# Issue the following command to match strings that begin with any 3 characters (that is, strings containing at least 3 characters):<br><span style="color:blue;font-weight:bold;font-family:courier;">grep "^..." textfile1.txt | more</span><br><br>What do you notice?<br><br>
# Issue the following command to match strings that contain exactly 3 characters:<br><span style="color:blue;font-weight:bold;font-family:courier;">grep "^...$" textfile1.txt | more</span><br><br>What do you notice?<br><br>
# To demonstrate the '''*''' symbol, issue the following command to display zero or more occurrences of the letter x:<br><span style="color:blue;font-weight:bold;font-family:courier;">grep "x*" textfile1.txt | more</span><br><br>You will notice that every line of the file is displayed, because every line contains zero or more occurrences of the letter x.<br><br>
# Let's issue a command to display strings that contain one or more occurrences of the letter x:<br><span style="color:blue;font-weight:bold;font-family:courier;">grep "xx*" textfile1.txt | more</span><br><br>Why did this work? Because the pattern indicates one occurrence of the letter x, followed by zero or MORE occurrences of the letter x.<br><br>If you combine the complex regular expression symbols .* they act like zero or more occurrences of any character (like * did in filename expansion).<br><br>
# Issue the following command to match strings that begin and end with a digit, with nothing or anything in between:<br><span style="color:blue;font-weight:bold;font-family:courier;">grep "^[0-9].*[0-9]$" textfile1.txt | more</span><br><br>Using '''simultaneous anchors''' combined with the .* symbols can help you refine your search patterns.<br><br>
# Issue the following Linux pipeline command to display strings that begin with a capital letter, end with a digit, and contain a capital X somewhere in between:<br><span style="color:blue;font-weight:bold;font-family:courier;">grep "^[A-Z].*X.*[0-9]$" textfile1.txt | more</span><br><br>Let's look at another series of examples involving '''filtering''' with numbers so that only strings containing valid numbers are displayed.<br><br>
# First, issue the following Linux command to download another data file called '''numbers1.dat''':<br><span style="color:blue;font-weight:bold;font-family:courier;">wget <nowiki>https://ict.senecacollege.ca/~murray.saul/uli101/numbers1.dat</nowiki></span><br><br>
# View the contents of the '''numbers1.dat''' file using the '''more''' command. You should notice valid and invalid numbers contained in this file. When finished, exit the more command.<br><br>
# Issue the following Linux command to display only whole numbers:<br><span style="color:blue;font-weight:bold;font-family:courier;">grep "^[0-9]*$" numbers1.dat | more</span><br><br>You may have noticed that the command does not entirely work: the output includes an empty line (which is NOT a whole number). This occurs because the * regular expression symbol represents ZERO or MORE occurrences of a digit. You can place an additional numeric character class before the one followed by * to search for one or more occurrences of a digit.<br><br>
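The difference between zero or more and one or more digits can be checked with a small sketch. It does not use the real '''numbers1.dat''' file: the sample lines below are made up for illustration, and the variable names (sample, zero_or_more, one_or_more) are assumptions of this example only.

```shell
#!/bin/sh
# Hypothetical sample data standing in for numbers1.dat (NOT the real file contents).
# The second entry is an empty line.
sample=$(mktemp)
printf '%s\n' '123' '' '45a' '7' '+12' '3.14' > "$sample"

# "^[0-9]*$" also matches the empty line, because * allows zero digits.
zero_or_more=$(grep -c '^[0-9]*$' "$sample")

# "^[0-9][0-9]*$" requires one digit, followed by zero or more further digits,
# so the empty line no longer matches.
one_or_more=$(grep -c '^[0-9][0-9]*$' "$sample")

echo "zero-or-more pattern matched $zero_or_more line(s)"
echo "one-or-more pattern matched $one_or_more line(s)"

# Clean up the temporary file.
rm -f "$sample"
```

The '''-c''' option makes grep print a count of matching lines instead of the lines themselves, which makes the two patterns easy to compare. To try the second pattern on the downloaded file, the same regular expression can be used with numbers1.dat in place of the temporary file.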