Mastering Regular Expressions (RE) in Python

Regular expressions (RE) are a great way to search for patterns in large text files and then perform operations such as substitution and deletion. In this post, you will learn about Python’s re module and its functions such as compile, findall, match and search, and some advanced topics such as meta-characters, quantifiers, groups and pattern matching.

re.compile()

The re.compile method compiles a pattern and stores it in a variable, so that we can reuse that variable whenever we want to perform a search. We pass the pattern we want to search for to re.compile, and it returns the compiled pattern as an object of the re.Pattern class.

import re
text="Abcdefghijklmnopqrstuvwxyzabc"

pattern=re.compile("abc")
matches=pattern.finditer(text)

for m in matches:
    print(m)

####OUTPUT
# <re.Match object; span=(26, 29), match='abc'>

finditer returns a match object for each match, showing its span and the matching string. We can slice the text with that span to get the matching string.

text[26:29] ###abc
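The span does not have to be copied by hand from the printed output. As a minimal sketch (the variable names are only for illustration), a match object exposes its span and matched text through span(), start(), end() and group():

```python
import re

text = "Abcdefghijklmnopqrstuvwxyzabc"
pattern = re.compile("abc")

# Take the first (and here only) match produced by finditer
first = next(pattern.finditer(text))
start, end = first.span()

print(first.group())               # the matched text
print(text[start:end])             # slicing with the span gives the same text
print(first.start(), first.end())  # the two ends of the span
```

Reading the positions from the match object keeps the slice correct even if the text changes later.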

Notice that our text contains “abc” twice (Abc and abc), but the pattern matched only the lowercase “abc”. To make the search case insensitive, we pass the re.I flag to re.compile.

pattern=re.compile("abc",re.I)
matches=pattern.finditer(text)
for m in matches:
    print(m)

####OUTPUT
# <re.Match object; span=(0, 3), match='Abc'>
# <re.Match object; span=(26, 29), match='abc'>
text[0:3] ###Abc
text[26:29] ###abc

re.match()

re.match checks for the pattern only at the start of the string. If the pattern is not at the start of the string, it returns None.

pattern=re.compile("abc",re.I)
print(pattern.match(text))

####OUTPUT
# <re.Match object; span=(0, 3), match='Abc'>

You can see that our string contains “abc” twice, but match only finds the “Abc” at the start of the string. If we try to match a pattern which is not at the start of the string, it returns None.

pattern=re.compile("abc")
print(pattern.match(text))

##Output
## None

Now we are matching “abc” without re.I, so the search is case sensitive and will not match the Abc at the start of the string. Although “abc” is present later in the string, match returns None because “abc” is not at the beginning of the string.
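Because match can return None, calling a method such as group() directly on the result can raise an error. The sketch below shows one way to guard against that. The helper name starts_with is hypothetical and not part of the re module.

```python
import re

text = "Abcdefghijklmnopqrstuvwxyzabc"

def starts_with(pattern_text, string, flags=0):
    """Hypothetical helper: return the matched prefix, or None if there is no match at the start."""
    m = re.match(pattern_text, string, flags)
    # Only call group() when a match object was actually returned
    return m.group() if m is not None else None

print(starts_with("abc", text))        # None
print(starts_with("abc", text, re.I))  # Abc
```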

re.search()

re.search is used to search for a pattern anywhere in the string. It scans the string from the beginning and returns the first match.

pattern=re.compile("abc",re.I)
pattern.search(text)

####OUTPUT
# <re.Match object; span=(0, 3), match='Abc'>

pattern=re.compile("abc")
pattern.search(text)

####OUTPUT
# <re.Match object; span=(26, 29), match='abc'>
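To make the difference between match and search concrete, here is a small side-by-side sketch on the same text, using the case-sensitive pattern:

```python
import re

text = "Abcdefghijklmnopqrstuvwxyzabc"
pattern = re.compile("abc")

at_start = pattern.match(text)   # only tries the start of the string
anywhere = pattern.search(text)  # scans forward until the first hit

print(at_start)         # None
print(anywhere.span())  # (26, 29)
```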

re.findall()

The findall method returns all the matched strings as a list.

pattern=re.compile("abc",re.I)
pattern.findall(text)

####OUTPUT
# ['Abc' , 'abc']
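findall gives only the matched text. If the positions are needed as well, finditer can be used on the same pattern, as in this short sketch:

```python
import re

text = "Abcdefghijklmnopqrstuvwxyzabc"
pattern = re.compile("abc", re.I)

strings = pattern.findall(text)                     # just the matched text
spans = [m.span() for m in pattern.finditer(text)]  # the positions as well

print(strings)  # ['Abc', 'abc']
print(spans)    # [(0, 3), (26, 29)]
```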

Groups in regular expressions

Groups allow us to combine more than one pattern in a regular expression. Groups are created using ().

import re
sites="""
site1=wikepedia.com
site2=wikepedia.org
"""
pattern=re.compile(r'(\.com|\.org)')
pattern.findall(sites)

####OUTPUT
# ['.com' , '.org']

In the above code, we want to match either .com or .org, so we put both alternatives inside a group, written with ( ). The “|” symbol means “or”, so the group matches either .com or .org. “\.” escapes the “.”, which is a special character in regular expressions (it matches any character). Escaping a special character removes its special meaning, so it is treated as a normal character. The “r” prefix tells Python to treat the string as a raw string, so backslashes reach the re module unchanged instead of being interpreted as Python string escapes. You will learn about special characters and advanced regular expressions in the next section.
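The effect of escaping the dot is easy to check. The sketch below uses a made-up sample string and compares an unescaped “.” with an escaped one, then shows that a raw string is only a different way of writing the same pattern text:

```python
import re

# Made-up sample: the second word has an "x" where the dot would be
sample = "example.com examplexcom"

unescaped = re.findall(r'.com', sample)  # "." matches any character
escaped = re.findall(r'\.com', sample)   # "\." matches a literal dot only

print(unescaped)  # ['.com', 'xcom']
print(escaped)    # ['.com']

# The raw string keeps the backslash as typed, so both spellings are equal
print(r'\.com' == '\\.com')  # True
```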

Advanced RE

[Chart: Python regex meta-characters and quantifiers]

The chart above lists the meta-characters and quantifiers. Now I will use them to show how they work.

Note: If a meta-character or quantifier is part of the text you want to match and you want to suppress its special meaning, you need to escape it with a backslash “\”.
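When the text to match contains several special characters, the re module can add the backslashes for you with re.escape. A minimal sketch, using a made-up price string:

```python
import re

price_text = "Total: $5.00 (approx.)"
literal = "$5.00"  # "$" and "." are both special in a pattern

# re.escape returns the literal with its special characters escaped
pattern = re.compile(re.escape(literal))
m = pattern.search(price_text)

print(m.group())  # $5.00
print(m.span())   # (7, 12)
```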

import re
text="This is first line_!"

##### Using ^ to find match at beginning of string ####

pattern=re.compile(r'^This')
match=pattern.finditer(text)
for m in match:
    print(m)

##Output##
# <re.Match object; span=(0, 4), match='This'>

##### Using $ to find match at end of string ####

pattern=re.compile(r'!$')
match=pattern.finditer(text)
for m in match:
    print(m)

##Output##
#<re.Match object; span=(19, 20), match='!'>
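By default, ^ and $ refer to the start and end of the whole string. With the re.MULTILINE flag they also match at the start and end of each line. The two-line string below is made up for this sketch:

```python
import re

lines = "This is first line_!\nThis is second line_!"

default = re.findall(r'^This', lines)                 # start of the whole string only
per_line = re.findall(r'^This', lines, re.MULTILINE)  # start of every line

print(len(default))   # 1
print(len(per_line))  # 2
```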

Match digit and non-digits with \d and \D

import re
text="One _ 2"
#### Matching digit ####

pattern=re.compile(r'\d')
match=pattern.finditer(text)
for m in match:
    print(m)

####OUTPUT####
# <re.Match object; span=(6, 7), match='2'>

####Matching Non-digits####

pattern=re.compile(r'\D')
match=pattern.finditer(text)
for m in match:
    print(m)

####OUTPUT####

#<re.Match object; span=(0, 1), match='O'>
#<re.Match object; span=(1, 2), match='n'>
#<re.Match object; span=(2, 3), match='e'>
#<re.Match object; span=(3, 4), match=' '>
#<re.Match object; span=(4, 5), match='_'>
#<re.Match object; span=(5, 6), match=' '>
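\d matches one digit at a time. Combined with the “+” quantifier (one or more), it picks up whole numbers. A short sketch on a made-up sentence:

```python
import re

text = "Order 66 shipped 3 items in 2024"

digits = re.findall(r'\d', text)    # one entry per digit
numbers = re.findall(r'\d+', text)  # consecutive digits grouped together

print(digits)   # ['6', '6', '3', '2', '0', '2', '4']
print(numbers)  # ['66', '3', '2024']
```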

Matching word characters and non-word characters with \w and \W

import re
text="One _ []2"
#### \w to match word characters ####

pattern=re.compile(r'\w')
match=pattern.finditer(text)
for m in match:
    print(m)

####OUTPUT####
#<re.Match object; span=(0, 1), match='O'>
#<re.Match object; span=(1, 2), match='n'>
#<re.Match object; span=(2, 3), match='e'>
#<re.Match object; span=(4, 5), match='_'>
#<re.Match object; span=(8, 9), match='2'>

#### \W to match non-word characters

pattern=re.compile(r'\W')
match=pattern.finditer(text) 
for m in match:
    print(m)

##OUTPUT##

#<re.Match object; span=(3, 4), match=' '>
#<re.Match object; span=(5, 6), match=' '>
#<re.Match object; span=(6, 7), match='['>
#<re.Match object; span=(7, 8), match=']'>
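The same idea works for word characters: \w+ collects runs of word characters instead of single ones. A small sketch on the same kind of text:

```python
import re

text = "One _ []2"

single = re.findall(r'\w', text)  # one entry per word character
runs = re.findall(r'\w+', text)   # consecutive word characters grouped

print(single)  # ['O', 'n', 'e', '_', '2']
print(runs)    # ['One', '_', '2']
```

Note that the underscore counts as a word character, so it appears in both results.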

Matching whitespaces and non-whitespaces with \s and \S

import re
##Second line starts with a tab; three extra spaces after 2
text="First 1\n\tsecond 2   "

####Finding whitespaces with \s ####

pattern=re.compile(r'\s')
match=pattern.finditer(text)
for m in match:
    print(m)
##OUTPUT##

# <re.Match object; span=(5, 6), match=' '>
# <re.Match object; span=(7, 8), match='\n'>
# <re.Match object; span=(8, 9), match='\t'>
# <re.Match object; span=(15, 16), match=' '>
# <re.Match object; span=(17, 18), match=' '>
# <re.Match object; span=(18, 19), match=' '>
# <re.Match object; span=(19, 20), match=' '>

####Finding non-whitespaces with \S ####

pattern=re.compile(r'\S')
match=pattern.finditer(text)
for m in match:
    print(m)

####OUTPUT####

# <re.Match object; span=(0, 1), match='F'>
# <re.Match object; span=(1, 2), match='i'>
# <re.Match object; span=(2, 3), match='r'>
# <re.Match object; span=(3, 4), match='s'>
# <re.Match object; span=(4, 5), match='t'>
# <re.Match object; span=(6, 7), match='1'>
# <re.Match object; span=(9, 10), match='s'>
# <re.Match object; span=(10, 11), match='e'>
# <re.Match object; span=(11, 12), match='c'>
# <re.Match object; span=(12, 13), match='o'>
# <re.Match object; span=(13, 14), match='n'>
# <re.Match object; span=(14, 15), match='d'>
# <re.Match object; span=(16, 17), match='2'>
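A common use of \s is splitting text on any run of whitespace, whether it is spaces, tabs or newlines. A minimal sketch with re.split, assuming the text is a line break followed by a tab as in the output above:

```python
import re

text = "First 1\n\tsecond 2   "

# strip() drops the trailing spaces so the result has no empty last item
tokens = re.split(r'\s+', text.strip())

print(tokens)  # ['First', '1', 'second', '2']
```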

Matching word boundary and non-word boundary with \b and \B

import re
text="Blah BlahBlah"

####Matching word boundary with \b

##Here we will be matching "Blah"
##which has a word boundary on its left

pattern=re.compile(r'\bBlah')
match=pattern.finditer(text)
for m in match:
    print(m)

####OUTPUT####
# <re.Match object; span=(0, 4), match='Blah'>
# <re.Match object; span=(5, 9), match='Blah'>

####Matching non-word boundary with \B

##We will be matching "Blah"
##which doesn't have a word boundary on its left

pattern=re.compile(r'\BBlah')
match=pattern.finditer(text)
for m in match:
    print(m)

##OUTPUT
# <re.Match object; span=(9, 13), match='Blah'>
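Putting \b on both sides of a word matches it only when it stands alone. A short sketch on the same text:

```python
import re

text = "Blah BlahBlah"

# \b on both sides: "Blah" must not touch another word character
whole_words = [m.span() for m in re.finditer(r'\bBlah\b', text)]

print(whole_words)  # [(0, 4)]
```

The two copies inside "BlahBlah" are skipped because each of them touches another word character on one side.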

Pattern Matching and Complex Regex

Example No.1

import re
number="123-456-789"
###Matching 3 digits in a row###

pattern=re.compile(r'\d\d\d')
match=pattern.finditer(number)
for m in match:
    print(m)

##OUTPUT

# <re.Match object; span=(0, 3), match='123'>
# <re.Match object; span=(4, 7), match='456'>
# <re.Match object; span=(8, 11), match='789'> 

####SAME THING USING QUANTIFIERS####
pattern=re.compile(r'\d{3}') ## match exactly 3 digits
match=pattern.finditer(number)
for m in match:
    print(m)

##OUTPUT

# <re.Match object; span=(0, 3), match='123'>
# <re.Match object; span=(4, 7), match='456'>
# <re.Match object; span=(8, 11), match='789'> 
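Quantifiers can also express a range. The sketch below compares {3}, {2,3} and + on a made-up number string, which shows how a longer run of digits is consumed:

```python
import re

number = "1-22-333-4444"

exactly_three = re.findall(r'\d{3}', number)   # exactly 3 digits
two_or_three = re.findall(r'\d{2,3}', number)  # 2 to 3 digits, as many as possible
one_or_more = re.findall(r'\d+', number)       # 1 or more digits

print(exactly_three)  # ['333', '444']
print(two_or_three)   # ['22', '333', '444']
print(one_or_more)    # ['1', '22', '333', '4444']
```

In "4444" the first three digits are matched and the single leftover digit is too short for {3} or {2,3}, so it is dropped.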

Example no.2

import re
number="123-456-7890"

##You can use \d\d\d-\d\d\d-\d\d\d\d also as pattern
##Output will be same

pattern=re.compile(r'\d{3}-\d{3}-\d{4}')  
match=pattern.finditer(number)
for m in match:
    print(m)

##OUTPUT
# <re.Match object; span=(0, 12), match='123-456-7890'>

Example no.3

import re
text="""A.B
A B"""

##We want to match A with or without a "." after it ##

pattern=re.compile(r'A\.')
match=pattern.finditer(text)
for m in match:
    print(m)

##OUTPUT##
# <re.Match object; span=(0, 2), match='A.'>

##You can see it doesn't match the A in "A B" because it doesn't
##have "." after A. We need to make "." optional

pattern=re.compile(r'A\.?')
match=pattern.finditer(text)
for m in match:
    print(m)

##OUTPUT##
#Now it has matched both the A's
# <re.Match object; span=(0, 2), match='A.'>
# <re.Match object; span=(4, 5), match='A'>

Example no. 4

import re
##Matching string starting with capital letter
String="""
Apple
Nokia
samsung"""

pattern=re.compile(r'[A-Z]\w+')
match=pattern.finditer(String)
for m in match:
    print(m)

##OUTPUT## samsung is not matched
# <re.Match object; span=(1, 6), match='Apple'>
# <re.Match object; span=(7, 12), match='Nokia'>

Example no.5

import re
##Matching String of variable length and range

String="""Abc-de-fghIj
ab-c-De"""

## "*" means 0 or more of the previous character or pattern

pattern=re.compile(r'[A-Za-z]*-[a-z]*-[a-zA-Z]*')
match=pattern.finditer(String)
for m in match:
    print(m)

##OUTPUT##
# <re.Match object; span=(0, 12), match='Abc-de-fghIj'>
# <re.Match object; span=(13, 20), match='ab-c-De'>
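A character range such as [A-z] is easy to type by mistake and is wider than it looks: in ASCII, a few punctuation characters sit between "Z" and "a", so they are included too. A small sketch with a made-up sample string:

```python
import re

sample = "a_Z^1"

loose = re.findall(r'[A-z]', sample)      # also picks up "_" and "^"
strict = re.findall(r'[A-Za-z]', sample)  # letters only

print(loose)   # ['a', '_', 'Z', '^']
print(strict)  # ['a', 'Z']
```

Writing the two ranges separately, as [A-Za-z], keeps the class limited to letters.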

Example no. 6

import re
##Matching String of exact length and range

String="""Abc-de-fghIj
ab-c-De
ab-c-defgh"""

pattern=re.compile(r'[A-Za-z]{2}-[a-z]{1}-[a-zA-Z]{3,6}')
match=pattern.finditer(String)
for m in match:
    print(m)

##OUTPUT##
# <re.Match object; span=(21, 31), match='ab-c-defgh'>

In the above example, we want to match a string that starts with exactly 2 letters (capital or small), followed by “-”, followed by exactly 1 small letter, followed by “-”, followed by 3 to 6 letters (capital or small).

You can check that only ab-c-defgh satisfies these conditions.
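finditer and search look for the pattern anywhere inside the text. To check that a whole string has this shape, the fullmatch method can be used instead. The candidate strings below are made up for this sketch:

```python
import re

pattern = re.compile(r'[A-Za-z]{2}-[a-z]-[A-Za-z]{3,6}')

candidates = ["ab-c-defgh", "ab-c-De", "xab-c-defgh", "ab-c-defghijk"]

results = {}
for s in candidates:
    found = pattern.search(s) is not None     # pattern occurs somewhere in s
    whole = pattern.fullmatch(s) is not None  # all of s fits the pattern
    results[s] = (found, whole)
    print(s, found, whole)
```

Only "ab-c-defgh" passes both checks. "xab-c-defgh" and "ab-c-defghijk" contain a matching part, but the whole string does not fit.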

Example no. 7

import re
websites="""https://www.google.com
http://www.microsoft.com
https://youtube.com
https://www.wikipedia.org"""

##Finding website names with inconsistent naming conventions

pattern=re.compile(r'https?://(www\.)?\w+\.\w+')
match=pattern.finditer(websites)
for m in match:
    print(m)

# <re.Match object; span=(0, 22), match='https://www.google.com'>
# <re.Match object; span=(23, 47), match='http://www.microsoft.com'>
# <re.Match object; span=(48, 67), match='https://youtube.com'>
# <re.Match object; span=(68, 93), match='https://www.wikipedia.org'>

In the above example, you can see that the website names do not follow the same pattern. Some use http and some use https. Some have www and some don’t. Some websites end in .org and some in .com. We need to write a pattern that matches all the website names while handling all these differences.

“?” in https? makes the “s” optional, because “?” matches 0 or 1 of the previous character, so it matches both https and http. “://” is common to all the website names, so we put it directly after “https?”. In (www\.)?, the whole www. is inside ( ), so it is a group, and the “?” after the group makes www. optional, so the pattern also matches websites which don’t have www. in them. Next, we want to match one or more word characters, followed by a “.”, followed by one or more word characters. This is achieved by \w+\.\w+ .
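One thing to watch for: when a pattern contains a group, findall returns the text of the group instead of the whole match. If the group is only there to make www. optional, it can be written as a non-capturing group, (?:...). A sketch on the same list of websites:

```python
import re

websites = """https://www.google.com
http://www.microsoft.com
https://youtube.com
https://www.wikipedia.org"""

# With a capturing group, findall returns only what the group matched
capturing = re.findall(r'https?://(www\.)?\w+\.\w+', websites)

# (?:...) groups without capturing, so findall returns the whole match
non_capturing = re.findall(r'https?://(?:www\.)?\w+\.\w+', websites)

print(capturing)      # ['www.', 'www.', '', 'www.']
print(non_capturing)  # the four full website names
```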

Grouping with RE

Suppose we want to extract only the domain name and the top-level domain (.com, .gov). We can use groups for this. Groups can be used to get a selected part of the matching pattern. Let’s understand with an example.

import re
websites="""https://www.google.com
http://www.microsoft.com
https://youtube.com
https://www.wikipedia.org"""

##Finding website names with inconsistent naming conventions

pattern=re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
match=pattern.finditer(websites)
for m in match:
    print(m)

If you run the above code, you will notice that we still match all the website names, even though we have changed the pattern to use groups. There are 3 groups in the pattern, plus group 0, a default group which captures the whole match. So in total there are 4 groups.

Accessing the information from groups

pattern=re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
##finditer returns a one-shot iterator, so store the matches
##in a list to loop over them more than once
matches=list(pattern.finditer(websites))
for m in matches:
    print(m.group(0))

##OUTPUT##
# https://www.google.com
# http://www.microsoft.com
# https://youtube.com
# https://www.wikipedia.org

for m in matches:
    print(m.group(1))
##OUTPUT##
# www.
# www.
# None
# www.

for m in matches:
    print(m.group(2))
##OUTPUT##
# google
# microsoft
# youtube
# wikipedia

for m in matches:
    print(m.group(3))
##OUTPUT##
# .com
# .com
# .com
# .org
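Group numbers can be hard to remember in longer patterns. As an optional variation, groups can be given names with (?P<name>...) and read back by name. The group names "name" and "tld" below are chosen only for this sketch:

```python
import re

websites = """https://www.google.com
http://www.microsoft.com
https://youtube.com
https://www.wikipedia.org"""

# Same pattern as before, with names given to groups 2 and 3
pattern = re.compile(r'https?://(www\.)?(?P<name>\w+)(?P<tld>\.\w+)')

names = []
tlds = []
for m in pattern.finditer(websites):
    names.append(m.group('name'))  # same as m.group(2)
    tlds.append(m.group('tld'))    # same as m.group(3)
    print(m.groups())              # all numbered groups as a tuple

print(names)  # ['google', 'microsoft', 'youtube', 'wikipedia']
print(tlds)   # ['.com', '.com', '.com', '.org']
```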

That’s all for this tutorial. I hope I have explained regular expressions in detail and covered the important topics with examples. If you have any doubts or suggestions, drop a comment.

Thank You.

Amarjeet

About Amarjeet

Amarjeet, BE in CS, loves to code in Python and is passionate about Machine Learning and Data Science. Expertsteaching.com is just a medium to share what I have learned so far with the world.