Day 43 - Learning Regular Expressions

There is a command called grep that I wrote about in Day 36. It finds every match of a given text or regex in a text file or in the output of another command. If a single word like email is given as the argument to grep, it will look only for that word and nothing more, which is not helpful when you want to search for text whose exact content you don't know in advance. For example, suppose you want to find all the emails in a large text, and some end in .edu or another top-level domain. Extracting all of them by searching for fixed words, without losing some, would be very difficult.

That's where Regular Expressions, or regex, come to help. Regular expressions are used to look for patterns in a large text and extract them. grep uses a regex to find all the text that matches the pattern. Searching for emails with a regex makes them easy to extract because all emails follow the same pattern: emailname@provider.topleveldomain.
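
As a rough sketch of that idea, the emailname@provider.topleveldomain shape can be written as a pattern and handed to grep. The sample text and addresses below are made up for illustration, and the pattern is a simplified assumption of what an email looks like, not a complete definition.

```shell
# Hypothetical sample text; the addresses are invented for this example.
text='Contact alice@example.com or bob.smith@uni.edu for details'

# -o prints only the matched part, -E enables extended regex (+ and {2,}).
# name part, then @, then provider, then a dot and a top-level domain.
emails=$(printf '%s\n' "$text" | grep -oE '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')

printf '%s\n' "$emails"
# alice@example.com
# bob.smith@uni.edu
```

Both addresses are found even though they use different top-level domains, which a search for one fixed word could not do.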

What Are Regular Expressions?

Regular Expressions, or regex for short, are sequences of characters that define a search pattern for finding text in large bodies of text. The concept of regex is as old as the FORTRAN programming language, or maybe older.

Regex is mostly used to validate user data and to search through a large body of text. A simple use case is validating emails when signing up or logging in to a website: the website checks whether the entered email follows the expected pattern.
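
A minimal sketch of such a check in the shell could look like this. The function name is_valid_email is hypothetical, and the pattern is a deliberately simple assumption; real websites usually apply stricter rules.

```shell
# is_valid_email is a hypothetical helper name used only for this example.
# ^ and $ anchor the pattern so the whole string has to look like an email,
# and -q keeps grep silent so that only its exit status is used.
is_valid_email() {
  printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'
}

for candidate in 'user@example.com' 'not-an-email' 'user@example'; do
  if is_valid_email "$candidate"; then
    echo "$candidate: looks valid"
  else
    echo "$candidate: rejected"
  fi
done
# user@example.com: looks valid
# not-an-email: rejected
# user@example: rejected
```

The difference from searching is the anchoring: validation asks whether the entire input matches, not whether a match exists somewhere inside it.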

Regex Syntax

I will use the grep command to look for patterns in the text of my Day 36 post and explain regex with examples.

Matching Words

First I will look for all the words that end in "th" in the text file called day36.txt using grep, and then explain the pattern that I wrote to do so. To pass a regex to grep, write the expression inside double quotes. By default grep prints every whole line that contains a match, so I add the -o option to print only the matched text.

root@User:~$ grep -o "[a-zA-Z]*th " day36.txt
with
path
fifth

This reads as: zero or more characters (*) in the ranges a to z and A to Z ([a-zA-Z]), followed by th and a single space.

  • [] Matches a single character from the characters inside the brackets. If I put only a inside the brackets, like [a], it will match only a single a.
  • * An asterisk after the brackets matches zero or more of the characters inside the brackets.
  • a-z Matches any character from a to z.
  • A-Z Matches any character from A to Z.
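
To see what the asterisk really allows, here is a small sketch on a made-up line where "th" also stands on its own. The trailing space from my pattern is left out here so the matches are easier to compare, and the + operator is shown only for contrast.

```shell
# Hypothetical sample line, chosen so that "th" also appears by itself.
line='th with path'

# * allows zero letters before "th", so the bare "th" is matched too.
star=$(printf '%s\n' "$line" | grep -o '[a-zA-Z]*th')

# + (extended syntax, hence -E) needs at least one letter before "th".
plus=$(printf '%s\n' "$line" | grep -oE '[a-zA-Z]+th')

printf 'with *:\n%s\n' "$star"
# with *:
# th
# with
# path

printf 'with +:\n%s\n' "$plus"
# with +:
# with
# path
```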

Now if I replace th with l (still using -o to print only the matched text), the following will be the output.

root@User:~$ grep -o "[a-zA-Z]*l " day36.txt
will
all
useful
peaceful
terminal

All the words ending with l.
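
One limitation of ending the pattern with a space is that a word at the very end of a line has no space after it and is skipped. The sketch below, on a made-up line, compares that with grep's -w option, which only accepts matches that form whole words. This is one possible alternative, not the approach used above.

```shell
# Hypothetical sample line; "terminal" is the last word and has no space after it.
line='I will call all terminal'

# Pattern with a trailing space: the final word is missed.
with_space=$(printf '%s\n' "$line" | grep -o '[a-zA-Z]*l ')

# -w keeps only matches that are whole words, so no trailing space is needed.
whole_word=$(printf '%s\n' "$line" | grep -ow '[a-zA-Z]*l')

printf '%s\n' "$whole_word"
# will
# call
# all
# terminal
```

With the trailing space the result stops at "all", and each match also carries the space itself.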

Searching For Domains

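This expression will look for all the lines in the text file that start with a domain name. grep prints each matching line in full, which is why paths like /blog show up in the output.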

root@User:~$ grep "^[a-zA-Z0-9]*\.[a-zA-Z0-9]*" day36.txt
wikipedia.com
google.com
wikimedia.org
github.com
zainsci.github.io/blog
another.blah
newfile.txt
some.com.au.aus.australia
wikipedia.com
google.com
github.com
some.com.au.aus.australia
zainsci.github.io/blog
zainsci.github.io/blog

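You can see that it will not include fake-site.com in the results. The brackets [a-zA-Z0-9] contain only letters and digits, so the - in fake-site stops the match before the dot is reached. To match it, add - as the last character inside the brackets, like [a-zA-Z0-9-], where it is treated as a literal hyphen rather than as part of a range. Outside the brackets, special characters have to be escaped with a backslash, like I did for . in the expression.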

  • ^ Matches the starting position of the line.
  • \. A \ (backslash) escapes the . (dot), because an unescaped dot in regex means any character.

So if I hadn't escaped the dot, the result would have been nearly the full file, since every non-empty line would match.
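
The three cases can be compared side by side on a tiny made-up input. The sample lines are assumptions for illustration and are not taken from day36.txt.

```shell
# Hypothetical three-line sample.
sample='google.com
fake-site.com
plain words'

# Escaped dot, letters and digits only: the hyphenated name is skipped.
escaped=$(printf '%s\n' "$sample" | grep '^[a-zA-Z0-9]*\.[a-zA-Z0-9]*')

# Hyphen added as the last character inside the brackets: both domains match.
with_hyphen=$(printf '%s\n' "$sample" | grep '^[a-zA-Z0-9-]*\.[a-zA-Z0-9]*')

# Unescaped dot matches any character, so every non-empty line matches.
unescaped=$(printf '%s\n' "$sample" | grep '^[a-zA-Z0-9]*.[a-zA-Z0-9]*')

printf 'escaped:\n%s\n' "$escaped"
# escaped:
# google.com

printf 'with hyphen:\n%s\n' "$with_hyphen"
# with hyphen:
# google.com
# fake-site.com

printf 'unescaped:\n%s\n' "$unescaped"
# unescaped:
# google.com
# fake-site.com
# plain words
```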

Not Matching Any Letter

Adding ^ as the first character inside the brackets negates them: the expression then matches any character that is not listed inside the brackets. The -o option prints each matched character on its own line.

root@User:~$ grep -o "[^a-zA-Z0-9]" day36.txt
~
.
\
/
(
)
-

These are all the symbols that are written inside the text file.
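
A space is also a character that is not a letter or digit, so with -o it would be printed as a blank-looking line. A small sketch on a made-up line, where a space is added to the negated brackets to leave spaces out:

```shell
# Hypothetical sample line containing letters, spaces and a few symbols.
line='a-b (c) ~/d'

# Everything that is not a letter, digit or space, one match per line.
symbols=$(printf '%s\n' "$line" | grep -o '[^a-zA-Z0-9 ]')

printf '%s\n' "$symbols"
# -
# (
# )
# ~
# /
```

Piping the result through sort -u would remove repeated symbols when the input is larger.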

Matching Words of a Given Length

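To match words of a specified length in the text file I will use the following expression. Curly brackets used this way belong to the extended regex syntax, so grep needs the -E option (in the basic syntax they would be written as \{5\}), and -o prints only the matched text.

root@User:~$ grep -oE " [a-z]{5} " day36.txt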
fifth
other
about
needs
...
...

This is like "find all words of length five made of letters between a and z".

  • {min,max} Curly brackets define how many times the preceding item may repeat
  • {5} = exactly five characters
  • {5,9} = five to nine characters (no space after the comma)
  • {5,} = five or more characters
  • {,5} = five or fewer characters (a GNU extension)
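
The forms above can be tried on a small made-up line. This sketch uses -w for whole words instead of the surrounding spaces from my command, which is a substitution made here so that words at the start or end of the line are also found.

```shell
# Hypothetical sample line with words of different lengths.
line='the fifth other day about terminal it'

# Exactly five letters (extended syntax, so -E is needed).
five=$(printf '%s\n' "$line" | grep -owE '[a-z]{5}')

# Two to three letters.
short=$(printf '%s\n' "$line" | grep -owE '[a-z]{2,3}')

# Six or more letters.
long=$(printf '%s\n' "$line" | grep -owE '[a-z]{6,}')

# The same five-letter search in basic syntax, with escaped braces.
basic=$(printf '%s\n' "$line" | grep -ow '[a-z]\{5\}')

printf '%s\n' "$five"
# fifth
# other
# about

printf '%s\n' "$short"
# the
# day
# it

printf '%s\n' "$long"
# terminal
```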

And that's it for now, because regex is too much to learn in one day, so more of it tomorrow.


zainsci