Changes

Tutorial9: Regular Expressions

42 bytes removed, 09:30, 28 February 2021

→‎INVESTIGATION 1: SIMPLE & COMPLEX REGULAR EXPRESSIONS

# Issue the '''ls''' Linux command to confirm that the text file was downloaded.

# View the contents of the '''textfile1.txt''' file using the '''more''' command and quickly view the contents of this file. Although there are several Linux commands that use regular expressions, we will only be using the '''grep''' command for this investigation.

# Issue the following Linux command to match the pattern "'''the'''" within '''textfile1.txt''': grep "the" textfile1.txt Take a few moments to view the output and observe the matched pattern.

# Now, issue the grep Linux command with the -i option to ignore case: grep -i "the" textfile1.txt What do you notice is different when issuing this command? You will notice that the pattern "'''the'''" is also matched inside larger words like "'''them'''" and "'''their'''". You can issue the '''grep''' command with the -w option to match the pattern only as a whole '''word'''.

# Issue the following Linux command: grep -w -i "the" textfile1.txt You should now see only strings of text that match the word '''"the"''' (upper or lower case).

Matching literal or simple regular expressions can be useful, but they are '''limited''' in the pattern matching they can perform. For example, you may want to search for a pattern located at the '''beginning''' or '''end''' of the string. There are other regular expression symbols that provide more '''precise''' pattern matches. These special characters are known as '''complex''' and '''extended''' regular expression symbols. In this section, we will focus on complex ''regular expressions'' and then discuss ''extended regular expressions'' in INVESTIGATION 2.

# Issue the following Linux command: grep -w -i "^the" textfile1.txt The '''^''' symbol is referred to as an '''anchor'''. In this case, it only matches the word "'''the'''" (upper or lower case) at the beginning of the string.

# Issue the following Linux command: grep -w -i "the$" textfile1.txt The '''$''' symbol is used to anchor patterns at the end of the string.

# Issue the following Linux command to anchor the word "'''the'''" simultaneously at the beginning and end of the string: grep -w -i "^the$" textfile1.txt What do you notice?

Anchoring patterns at both the beginning and end of strings can greatly assist in building more '''precise''' search patterns. We will now demonstrate the power of anchoring combined with other complex regular expression symbols.
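The effect of the -i and -w options and of the two anchors can be checked on a small file whose contents you already know. The following is a minimal sketch: the file name '''sample_the.txt''' and its five lines are made up for illustration and are not the contents of textfile1.txt.

```shell
# Hypothetical sample file (NOT the contents of textfile1.txt).
printf '%s\n' 'The cat sat' 'give it to them' 'over the' 'the' 'nothing here' > sample_the.txt

grep "the" sample_the.txt          # lowercase "the" anywhere: 3 lines (includes "them")
grep -i "the" sample_the.txt       # ignore case: 4 lines (adds "The cat sat")
grep -w -i "the" sample_the.txt    # whole word only: 3 lines ("them" is dropped)
grep -w -i "^the" sample_the.txt   # word at the beginning of the line: 2 lines
grep -w -i "the$" sample_the.txt   # word at the end of the line: 2 lines
grep -w -i "^the$" sample_the.txt  # line consisting only of the word: 1 line
```

Comparing the number of lines printed by each command shows how every added option or anchor narrows the match.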

# Issue the following Linux command to match strings that begin with 3 characters: grep "^..." textfile1.txt What do you notice? Can lines that contain '''fewer than 3 characters''' be displayed?

# Issue the following Linux command to match strings that begin and end with the same 3 characters (i.e. strings that contain exactly 3 characters): grep "^...$" textfile1.txt What do you notice compared to the previous command?
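To see the difference between one anchor and two, it can help to run both patterns against lines of known length. This is a small sketch using a made-up file named '''sample_len.txt''' (an assumption for illustration) that holds lines of 2, 3, 4 and 0 characters.

```shell
# Hypothetical sample file: lines with 2, 3, 4 and 0 characters.
printf '%s\n' 'ab' 'abc' 'abcd' '' > sample_len.txt

grep "^..." sample_len.txt    # at least 3 characters: prints abc and abcd
grep "^...$" sample_len.txt   # exactly 3 characters: prints abc only
```

Each "." must match one character, so lines shorter than 3 characters can never match either pattern.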

# Let's issue a Linux command to display strings that contain one or more occurrences of the letter "x": grep "xx*" textfile1.txt Why did this work? Because the pattern indicates one occurrence of the letter "x", followed by zero or MORE occurrences of the letter "x". If you combine the complex regular expression symbols "." and "*" as ".*", the result will act like zero or more occurrences of any character (i.e. like "*" did in filename expansion).
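The behaviour of "*" is easier to see on a few known lines. The sketch below uses a made-up file named '''sample_x.txt''' (an assumption, not part of the tutorial files). Note that "xx*" selects the same lines as a plain "x", because the trailing "x*" is allowed to match nothing.

```shell
# Hypothetical sample file for experimenting with "*" and ".*".
printf '%s\n' 'box' 'xxl size' 'plain' 'extra wax' > sample_x.txt

grep "xx*" sample_x.txt    # one or more x: box, xxl size, extra wax
grep "x" sample_x.txt      # selects the same three lines as "xx*"
grep "xxx*" sample_x.txt   # two or more x in a row: xxl size
grep "x.*x" sample_x.txt   # two x with anything (or nothing) between: xxl size, extra wax
```

If the goal is lines with at least two "x" characters anywhere in the line, the last form "x.*x" is one way to express it.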

# Issue the following Linux command to match strings that begin and end with a number, with nothing or anything in between: grep "^[0-9].*[0-9]$" textfile1.txt Using '''simultaneous anchors''' combined with the ".*" symbol(s) can help you to refine your search patterns of strings.
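As a rough illustration of anchors combined with ".*", the sketch below runs two such patterns against a made-up file named '''sample_anchor.txt''' (an assumption for illustration only). One detail worth noticing: "^[0-9].*[0-9]$" needs two separate digits, so a line holding a single digit does not match.

```shell
# Hypothetical sample file for anchored patterns.
printf '%s\n' '1 apple 2' '42' '7' 'A big X marks 9' 'Xray 3' 'no digits' > sample_anchor.txt

# Begins with a digit and ends with a digit: prints "1 apple 2" and "42".
# The line "7" is skipped because the pattern requires at least two digits.
grep "^[0-9].*[0-9]$" sample_anchor.txt

# Begins with a capital letter, contains a capital X later on, ends with a digit:
# prints "A big X marks 9". "Xray 3" is skipped because its only X is the first character.
grep "^[A-Z].*X.*[0-9]$" sample_anchor.txt
```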

# Issue the following Linux command to display strings that begin with a capital letter, end with a number, and contain a capital X somewhere in between: grep "^[A-Z].*X.*[0-9]$" textfile1.txt Let's look at another series of examples involving searching for strings that only contain '''valid numbers'''. We will use '''pipeline commands''' to both display stdout to the screen and save it to files, so that a '''checking script''' can confirm later in this investigation that you ran these pipeline commands.

# Issue the following Linux command to create the '''regexps''' directory: mkdir ~/regexps

# Change to the '''regexps''' directory and confirm that you have moved to this directory.

# First, issue the following Linux command to download another data file called '''numbers1.dat''': wget <nowiki>https://ict.senecacollege.ca/~murray.saul/uli101/numbers1.dat</nowiki>

# View the contents of the '''numbers1.dat''' file using the '''more''' command. You should notice valid and invalid numbers contained in this file. When finished, exit the more command.

# Issue the following Linux pipeline command to display only whole numbers: grep "^[0-9]*$" numbers1.dat | tee faulty.txt You may have noticed that the command does not entirely work. You may notice an empty line (which is NOT a whole number). This occurs since the * regular expression symbol represents ZERO or MORE occurrences of a number. You can use an additional numeric character class with the * regular expression symbol to search for one or more occurrences of a number.

# Issue the following Linux pipeline command to display only whole numbers: grep "^[0-9][0-9]*$" numbers1.dat | tee whole.txt You should see that this now works.

# Issue the following Linux pipeline command to display whole numbers that have a positive or negative sign: grep "^[+-][0-9][0-9]*$" numbers1.dat | tee signed.txt What did you notice?

# Issue the following Linux pipeline command to display only whole numbers (with or without a positive or negative sign): grep "^[+-]*[0-9]*$" numbers1.dat | tee all.txt Did this command work?

# Issue the following command to run a checking script that confirms you correctly created the files from the pipeline commands above: bash /home/murray.saul/scripts/week9-check-1 If you encounter errors, then view the feedback to make corrections, and then re-run the checking script. If you receive a congratulation message that there are no errors, then proceed with this tutorial.
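The four number patterns can be compared side by side on a small file of known values. The sketch below is only an illustration: the file '''sample_numbers.txt''' and its contents are made up (they are not the contents of numbers1.dat), and the final pattern, which uses the basic regular expression interval \{0,1\}, is one possible stricter alternative rather than a command from this tutorial.

```shell
# Hypothetical sample file (NOT numbers1.dat); the second line is empty.
printf '%s\n' '123' '' '+45' '-6' '7.5' '+-8' 'abc' '+' > sample_numbers.txt

grep "^[0-9]*$" sample_numbers.txt             # 123 and the empty line
grep "^[0-9][0-9]*$" sample_numbers.txt        # 123 only
grep "^[+-][0-9][0-9]*$" sample_numbers.txt    # +45 and -6 (a sign is required)
grep "^[+-]*[0-9]*$" sample_numbers.txt        # also lets through the empty line, "+-8" and "+"

# Assumed alternative: at most one sign, then at least one digit.
grep "^[+-]\{0,1\}[0-9][0-9]*$" sample_numbers.txt   # 123, +45 and -6
```

Because both "[+-]*" and "[0-9]*" may match nothing, the fourth pattern accepts lines that are not valid numbers, which is worth keeping in mind when answering the question about whether that command worked.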

:Proceed to INVESTIGATION 2.

Msaul

Administrators

13,420

edits

CDOT Wiki β
