Changes

Tutorial9: Regular Expressions

36 bytes added, 13:56, 2 September 2020

→‎INVESTIGATION 1: SIMPLE & COMPLEX REGULAR EXPRESSIONS

In this section, you will learn how to use the '''grep''' command with '''simple and complex regular expressions''' to search for patterns contained in text files.

# Issue the following Linux command ('''copy and paste''' to save time): wget <nowiki>https://ict.senecacollege.ca/~murray.saul/uli101/textfile1.txt</nowiki>

# Issue the '''ls''' command to confirm that the text file was downloaded.

# View the contents of the '''textfile1.txt''' file using the '''more''' command. When finished, '''exit''' the more command. Although several Linux commands use regular expressions, we will only use the '''grep''' command in this section.

# Issue the following Linux pipeline command to match the pattern "the" within '''textfile1.txt''': grep "the" textfile1.txt | more

# Now, issue the grep Linux pipeline command with the '''-i''' option to ignore case: grep -i "the" textfile1.txt | more What do you notice is different with this pipeline command? You will notice that the pattern "the" is also matched inside larger words that contain it. You can use the '''-w''' option with the grep command to match the pattern only as a whole word.

# Issue the following Linux pipeline command: grep -w -i "the" textfile1.txt | more You should now see only strings of text that contain the word '''"the"'''. Matching literal or simple regular expressions can be useful, but it is limited in how precisely it can match patterns. For example, you may want to search for a pattern at a specific location within the string of text (like at the beginning or end of the string). There are other regular expression tools that provide more precise matches: '''complex''' and '''extended''' regular expressions. We will look at complex regular expression symbols now, and we will discuss ''extended regular expressions'' in the next section of this tutorial.
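The difference between the plain, '''-i''', and '''-w''' searches above can be tried on a tiny file first. The following is a minimal sketch that assumes a made-up four-line sample (it is not the contents of textfile1.txt); the variable names are illustrative only.

```shell
# Hypothetical sample data; the real textfile1.txt will differ.
sample=$(mktemp)
printf '%s\n' 'The cat sat' 'other things' 'Bathe the dog' 'no match here' > "$sample"

# Case-sensitive: matches "the" anywhere, including inside "other" and "Bathe".
plain_count=$(grep -c "the" "$sample")

# -i ignores case, so the line starting with "The" now matches as well.
nocase_count=$(grep -c -i "the" "$sample")

# -w keeps only lines where "the" appears as a whole word.
word_lines=$(grep -w -i "the" "$sample")

echo "plain: $plain_count, ignore case: $nocase_count"
echo "$word_lines"
rm -f "$sample"
```

In this sample, "other things" matches the first two searches but is dropped once '''-w''' is added, because "the" only occurs inside a larger word.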

# Issue the following Linux pipeline command: grep -w -i "^the" textfile1.txt | more The '''^''' symbol is an anchor. In this case, it only matches the word "the" (in uppercase or lowercase) at the beginning of strings. The '''$''' symbol is used to anchor patterns at the end of strings.

# Issue the following Linux pipeline command: grep -w -i "the$" textfile1.txt | more What do you notice?

# Issue the following Linux pipeline command to anchor the word "the" simultaneously at the beginning and the end of the string: grep -w -i "^the$" textfile1.txt | more What do you notice? Anchoring patterns at both the beginning and the end of strings can greatly assist with more robust search patterns. We will now demonstrate '''simultaneous anchoring''' with other complex regular expression symbols.
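The three anchored searches above can be compared side by side on a small file. This is a sketch under the assumption of a made-up five-line sample; the file contents and variable names are hypothetical and are not taken from textfile1.txt.

```shell
# Hypothetical sample data used only to illustrate the anchors.
sample=$(mktemp)
printf '%s\n' 'the start of a line' 'at the end comes the' 'the' \
    'in the middle only' 'The' > "$sample"

# ^ anchors the word at the beginning of the line.
begin_count=$(grep -c -w -i "^the" "$sample")

# $ anchors the word at the end of the line.
end_count=$(grep -c -w -i "the$" "$sample")

# Both anchors together: the line must consist of the word and nothing else.
both_lines=$(grep -w -i "^the$" "$sample")

echo "begin: $begin_count, end: $end_count"
echo "$both_lines"
rm -f "$sample"
```

The line "in the middle only" is matched by none of the three commands, since its "the" is at neither end of the line.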

# Issue the following command to match strings that begin with 3 characters: grep "^..." textfile1.txt | more What do you notice?

# Issue the following command to match strings that contain exactly 3 characters (the 3 characters are anchored at both the beginning and the end): grep "^...$" textfile1.txt | more What do you notice?

# To demonstrate the '''*''' symbol, issue the following command to display strings containing zero or more occurrences of the letter x: grep "x*" textfile1.txt | more You will notice that every line of the file is displayed, since even a line with no x contains zero occurrences of it.

# Let's issue a command to display strings that contain one or more occurrences of the letter x: grep "xx*" textfile1.txt | more Why did this work? Because the pattern indicates one occurrence of the letter x, followed by zero or MORE occurrences of the letter x. If you combine the complex regular expression symbols .* they will act like zero or more occurrences of any character (like * did in filename expansion).
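To see how the '''*''' symbol changes what is matched, the counts can be compared on a short file. This is a minimal sketch assuming a made-up three-line sample; the "xxx*" pattern is an extra illustration, not one of the tutorial's steps.

```shell
# Hypothetical sample data: one x, three x's, and no x at all.
sample=$(mktemp)
printf '%s\n' 'box' 'xxx' 'plain' > "$sample"

# An empty pattern matches every line, giving the total line count.
total=$(grep -c "" "$sample")

# x* allows zero occurrences, so it also matches the line without any x.
zero_or_more=$(grep -c "x*" "$sample")

# xx* requires one x, then zero or more further x's.
one_or_more=$(grep -c "xx*" "$sample")

# xxx* requires two x's in a row, then zero or more further x's.
two_or_more=$(grep -c "xxx*" "$sample")

echo "total=$total x*=$zero_or_more xx*=$one_or_more xxx*=$two_or_more"
rm -f "$sample"
```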

# Issue the following command to match strings that begin and end with a number, with nothing or anything in between: grep "^[0-9].*[0-9]$" textfile1.txt | more Using '''simultaneous anchors''' combined with the .* symbols can help you refine your search patterns.

# Issue the following Linux pipeline command to display strings that begin with a capital letter, end with a number, and contain a capital X somewhere in between: grep "^[A-Z].*X.*[0-9]$" textfile1.txt | more Let's look at another series of examples involving '''filtering''' with numbers so that only strings containing valid numbers are displayed.
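The two anchored patterns that use .* can be checked against a small file. The sketch below assumes a made-up six-line sample and sets LC_ALL=C, which is an assumption added here so that [A-Z] means only the uppercase ASCII letters regardless of the system locale.

```shell
# Hypothetical sample data; not taken from textfile1.txt.
sample=$(mktemp)
printf '%s\n' '1 apple 2' '7' 'Box X marks 9' 'Alpha 5' '42' \
    'xylophone X 3' > "$sample"

# First and last characters must be digits, with anything (or nothing) between.
# The single-character line "7" does not match: the pattern needs two digits.
digit_lines=$(LC_ALL=C grep "^[0-9].*[0-9]$" "$sample")

# Starts with a capital letter, contains a capital X, ends with a digit.
capital_lines=$(LC_ALL=C grep "^[A-Z].*X.*[0-9]$" "$sample")

echo "$digit_lines"
echo "$capital_lines"
rm -f "$sample"
```

Here "Alpha 5" is rejected by the second pattern because it has no capital X, and "xylophone X 3" is rejected because it begins with a lowercase letter.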

# First, issue the following Linux command to download another data file called '''numbers1.dat''': wget <nowiki>https://ict.senecacollege.ca/~murray.saul/uli101/numbers1.dat</nowiki>

# View the contents of the '''numbers1.dat''' file using the '''more''' command. You should notice valid and invalid numbers contained in this file. When finished, exit the more command.

# Issue the following Linux command to display only whole numbers: grep "^[0-9]*$" numbers1.dat | more You may have noticed that the command does not entirely work: it also displays an empty line (which is NOT a whole number). This occurs because the * regular expression symbol represents ZERO or MORE occurrences of a number. You can add an additional numeric character class in front of the one followed by the * symbol to search for one or more occurrences of a number.
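The empty-line problem, and the fix of adding a second numeric character class, can be sketched as follows. The sample values are made up for illustration and are not the contents of numbers1.dat.

```shell
# Hypothetical sample data, including an empty line and some invalid numbers.
sample=$(mktemp)
printf '%s\n' '123' '' '-45' '6.7' '8' '12a' > "$sample"

# Zero or more digits between the anchors: the empty line matches too.
loose_count=$(grep -c "^[0-9]*$" "$sample")

# One digit, then zero or more digits: at least one digit is now required.
strict_lines=$(grep "^[0-9][0-9]*$" "$sample")

echo "lines matched by ^[0-9]*\$: $loose_count"
echo "$strict_lines"
rm -f "$sample"
```

With this sample, the first pattern counts three lines (123, the empty line, and 8), while the second prints only 123 and 8.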

Msaul

