Python regular expression | Python Regex

Pybeginner



In this tutorial, you will learn about regular expressions and the regular expression operations defined in Python's re module. re is the Python standard library module that supports regular expression matching operations.

A regular expression (often abbreviated to regex, regexp or RE) is, as the name suggests, an expression: a string of characters that defines a search pattern.

Regular expressions are a tool for matching patterns in text.

They can perform in a single search what would require multiple passes using simple string searches. As such, they can be a very fast string processing technique.

Here is an example of a simple regular expression:

\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b

 

This expression can be used to find all possible emails in a large body of text. This is useful because otherwise you will have to manually go through the entire text document and find every email id in it.
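As a minimal sketch of that idea, the snippet below runs the email pattern (with the quantifier written as {2,}, no space) over a short piece of text using re.findall(). The sample text and the addresses in it are invented for illustration.

```python
import re

# Hypothetical sample text; the addresses are made up for this illustration.
sample_text = "Contact alice@example.com or bob.smith@mail.example.org for details."

# The email pattern from above, written as a raw string.
email_pattern = r"\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b"

# findall() returns every non-overlapping match as a list of strings.
found_emails = re.findall(email_pattern, sample_text)
print(found_emails)
# ['alice@example.com', 'bob.smith@mail.example.org']
```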

Python regular expression methods

The re module provides a set of functions that allow us to search a string for a match:

findall() - Returns a list containing all matches
compile() - Returns a compiled regex (Pattern) object
search() - Returns a Match object if there is a match anywhere in the string
split() - Returns a list where the string has been split at each match
sub() - Replaces one or more matches with a string
subn() - Similar to sub(), except it returns a 2-item tuple containing the new string and the number of replacements made
match() - Similar to search(), but only matches at the beginning of the string
group() - A method of the Match object (not a function of re) that returns the entire match, or a specific subgroup when given its number. The related groups() method returns a tuple containing all subgroups of the match, from 1 up to however many groups are in the pattern

How to use regular expressions in Python?

While there are several steps to using regular expressions in Python, each step is quite simple.

1. Import the regex module with import re.
2. Create a Regex object with the re.compile() function. (Remember to use a raw string.)
3. Pass the string you want to search to the Regex object's search() method. This returns a Match object.
4. Call the Match object's group() method to return a string of the actual matched text.
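The four steps above can be sketched as follows. The phone-number-like pattern and the sample sentence are assumptions chosen only to make the steps concrete.

```python
# Step 1: import the regex module.
import re

# Step 2: compile the pattern, written as a raw string.
# The "three digits, hyphen, four digits" format is a made-up example.
phone_regex = re.compile(r"\d{3}-\d{4}")

# Step 3: search() returns a Match object, or None when nothing matches.
match = phone_regex.search("Call 555-0142 after noon")

# Step 4: group() returns the text that actually matched.
if match is not None:
    matched_text = match.group()
    print(matched_text)  # 555-0142
```

Checking for None before calling group() avoids an AttributeError when the pattern is not found.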

To learn how to write RE, let's first clarify some of the basics. In RE, we use literals or metacharacters. Literals are the characters themselves and have no special meaning. 
Here is an example where I use literals to find a specific string in text using the findall() method of module re.

import re
string="Hello my name is John"
print(re.findall(r"John",string))

RESULT:
['John']

 

As you can see, we used the word 'John' itself to find it in the text. This is not a practical approach when we have to extract thousands of names from a corpus of text. To do that, we need to describe a general pattern, and for that we use metacharacters.

Metacharacters

Metacharacters are characters with a special meaning and are not interpreted as they are, as literals are. We can further classify metacharacters into identifiers and modifiers.

Identifiers are used to recognize a certain type of character. For example, to find all numeric characters in a string, we can use the identifier '\d'.

import re
string="Hello I live on 9th street which is near 23rd street"
print(re.findall(r"\d" ,string))

OUTPUT:
['9', '2', '3']


But there seems to be a problem with that. It returns only single digits and, even worse, splits the number 23 into two digits. So how can we solve this problem? Can using two \d identifiers help?

import re
string="Hello I live on 9th street which is near 23rd street"
print(re.findall(r"\d\d",string))

OUTPUT:
['23']


Using two identifiers helped, but now it can only find two-digit numbers, which is not what we wanted.

One way to solve this problem is modifiers, but first, here are some identifiers we can use in Python. We'll use some of them in the examples we'll do in the next section.


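\d - any digit (0-9)
\D - anything but a digit
\s - any whitespace character (space, tab, newline)
\S - anything but whitespace
\w - any word character (letter, digit or underscore)
\W - anything but a word character
. - any character except for a newline
\b - a word boundary (the position at the edge of a whole word)
\. - a literal period. You must use a backslash because '.' usually means any character.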
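To make the list above more concrete, here is a small sketch that applies a few of these identifiers to one string. The sample sentence is invented for illustration.

```python
import re

# Made-up sample sentence for this illustration.
sample = "Room 42, floor 3."

digits = re.findall(r"\d", sample)       # each digit on its own
chunks = re.findall(r"\S+", sample)      # runs of non-whitespace characters
words = re.findall(r"\b\w+\b", sample)   # whole words, punctuation excluded
periods = re.findall(r"\.", sample)      # literal periods only

print(digits)   # ['4', '2', '3']
print(chunks)   # ['Room', '42,', 'floor', '3.']
print(words)    # ['Room', '42', 'floor', '3']
print(periods)  # ['.']
```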


Modifiers are a set of metacharacters that add more functionality to identifiers. Returning to the example above, we'll see how we can use the "+" modifier to get numbers of any length in the string. This modifier matches 1 or more repetitions of the preceding element, so \d+ matches a whole run of consecutive digits.

import re
string = "Hello, I live on 9th street which is near 23rd street"
print(re.findall(r"\d+", string))


OUTPUT:
['9', '23']

 

Excellent! Finally, we got the desired results. Using the '+' modifier with the \d identifier, I can extract numbers of any length. Here are some of the modifiers we'll also use in the examples section below.


+ = matches 1 or more repetitions
? = matches 0 or 1 repetitions
* = matches 0 or more repetitions
$ = matches the end of the string
^ = matches the start of the string
| = either/or. Example: x|y matches x or y
[] = a character set, where we define a range or a set of allowed characters
{x} = exactly x repetitions of the preceding element
{x,y} = between x and y repetitions of the preceding element
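The following sketch tries several of these modifiers on one string. The sample text is an assumption picked so that each modifier has something to match.

```python
import re

# Made-up sample text for this illustration.
text = "color colour colouur cat hat 2024-01-15"

optional_u = re.findall(r"\bcolou?r\b", text)   # ? : the u may appear 0 or 1 times
either = re.findall(r"cat|hat", text)           # | : one alternative or the other
four_digits = re.findall(r"\b\d{4}\b", text)    # {x} : exactly 4 digits
short_nums = re.findall(r"\b\d{1,2}\b", text)   # {x,y} : 1 to 2 digits
at_start = re.findall(r"^color", text)          # ^ : only at the start of the string
at_end = re.findall(r"\d+$", text)              # $ : only at the end of the string

print(optional_u)   # ['color', 'colour']
print(either)       # ['cat', 'hat']
print(four_digits)  # ['2024']
print(short_nums)   # ['01', '15']
print(at_start)     # ['color']
print(at_end)       # ['15']
```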

You may have noticed that we put the character r at the beginning of every RE. This r prefix makes the string a raw string literal. It changes the way the string literal is interpreted: a raw string is stored exactly as it appears.

For example, \ normally starts an escape sequence in a Python string, but it is just a backslash when the string is prefixed with r. Regular expression syntax uses backslashes heavily (\d, \b, and so on), so we use raw string literals to prevent these from being interpreted as Python escape sequences.
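Here is a small sketch of the difference, using \b as the example because it is a valid escape sequence in ordinary Python strings (a backspace character) as well as a regex metacharacter (a word boundary). The variable names are only illustrative.

```python
import re

normal = "\bcat\b"   # Python turns each \b into a backspace character
raw = r"\bcat\b"     # backslashes are kept, so the regex engine sees \b

print(len(normal))  # 5 (backspace, c, a, t, backspace)
print(len(raw))     # 7 (\, b, c, a, t, \, b)

without_raw = re.findall(normal, "a cat sat")  # looks for literal backspaces
with_raw = re.findall(raw, "a cat sat")        # looks for the whole word "cat"

print(without_raw)  # []
print(with_raw)     # ['cat']
```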

In the next section we will look at some regular expression examples in Python.

 

Python Regular Expression Examples

Let's explore some of the examples related to metacharacters. Here, we'll see how we use different metacharacters and what effect they have on the output:

 

import re
string="visit the page using python !!!"
print(re.findall(r"\w+",string))

OUTPUT:
['visit', 'the', 'page', 'using', 'python']

 

import re
string="get out Of my house !!!"
print(re.findall(r"\w{2}",string))

OUTPUT:
['ge', 'ou', 'Of', 'my', 'ho', 'us']

 

import re
string="abc abcccc abbbc ac def"
print(re.findall(r"\bab*c\b",string))

OUTPUT:
['abc', 'abbbc', 'ac'] 

 

import re
string="abc abcccc abbbc ac def"
print(re.findall(r"\bab+c\b",string))

OUTPUT:
['abc', 'abbbc']  

 

import re
string="get out Of my house !!!"
print(re.findall(r"\b\w{2}\b",string))

OUTPUT:
['Of', 'my'] 

 

import re
string="name and names are 23 blah blah"
print(re.findall(r"\b\w+es?\b",string))

OUTPUT:
['name', 'names', 'are']  

 

import re

# each . stands for exactly one character, so M.....a needs 5 characters between M and a
string='''I am Hussain Mujtaba and M12 !!a'''
print(re.findall(r"M.....a",string))

OUTPUT:
['Mujtaba', 'M12 !!a']

 

import re
string='''123345678
'''

print(re.findall(r"[123]",string))

OUTPUT: 
['1', '2', '3', '3'] 

 

import re
string='''123345678
'''
print(re.findall(r"[^123]",string))

OUTPUT:
['4', '5', '6', '7', '8', '\n']

 

import re
string='''hello I am a student from India'''
print(re.findall(r"[A-Z][a-z]+",string))

OUTPUT:
['India'] 

 

import re
string='''hello I am a student from India'''
print(re.findall(r"\b[A-Ia-i][a-z]+\b",string))

OUTPUT:
['hello', 'am', 'from', 'India'] 

 

import re
string='''hello I am a student from India. Hello again'''
# inside [] the | is a literal character, so [hH] is enough to mean "h or H"
print(re.findall(r"\b[hH]\w+\b",string))

OUTPUT:
['hello', 'Hello']  


Python regular expression methods

Now that we've seen enough metacharacters, let's see how some of the RE methods work. First, let's start with re.compile().

 

re.compile()

re.compile() compiles a regular expression pattern into a pattern object, which can then be used for pattern matching. It also lets us search for the same pattern again without rewriting it. Here is an example:

import re

pattern = re.compile('[A-Z][a-z]+')
result=pattern.findall('Using Python is all about Python')

print(result)

Result
   ['Using', 'Python', 'Python']

 

re.search()

The re.search function searches the string for a match and returns a Match object if there is one, or None otherwise. If there is more than one match, only the first occurrence is returned. Here is an example:

import re

text = 'Using Python is all about Python'
result=re.search("about", text)

print(result)

Result
    <re.Match object; span=(20, 25), match='about'>
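Printing the Match object is fine for a quick look, but in a real program you usually want its parts. The sketch below reads them with the group(), start() and span() methods, and checks for None first, since re.search() returns None when nothing matches. The second search term is an assumption chosen because it does not occur in the text.

```python
import re

text = 'Using Python is all about Python'

result = re.search(r"Python", text)
if result is not None:
    print(result.group())  # Python  (only the first occurrence is reported)
    print(result.start())  # 6
    print(result.span())   # (6, 12)

# A pattern that does not occur in the text gives None instead of a Match object.
missing = re.search(r"Java", text)
print(missing)  # None
```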

 

re.split()

The re.split function returns a list where the string has been split on each match:

import re

text = 'one111two22three333four'
result=re.split(r'\d+', text)

print(result)

Result
   ['one', 'two', 'three', 'four']

 

re.sub()

The re.sub function replaces matches with the text of your choice.

import re

text = 'aaa@xxx.com bbb@yyy.com ccc@zzz.com'
result=re.sub('[a-z]*@', 'usandopython@', text)

print(result)

Result
    usandopython@xxx.com usandopython@yyy.com usandopython@zzz.com
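Two variations on re.sub() are worth knowing. This sketch reuses the same text and shows the optional count argument, which limits how many matches are replaced, and backreferences such as \1 and \2 in the replacement, which reuse the text captured by parenthesized groups. The swapping of name and domain is only an illustration.

```python
import re

text = 'aaa@xxx.com bbb@yyy.com ccc@zzz.com'

# count=1 replaces only the first match.
first_only = re.sub(r'[a-z]*@', 'usandopython@', text, count=1)
print(first_only)
# usandopython@xxx.com bbb@yyy.com ccc@zzz.com

# \1 and \2 in the replacement refer to the first and second groups.
swapped = re.sub(r'([a-z]+)@([a-z]+)', r'\2@\1', text)
print(swapped)
# xxx@aaa.com yyy@bbb.com zzz@ccc.com
```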

 

re.subn()

As mentioned earlier, the re.subn function is similar to the re.sub function, but returns a 2-item tuple containing the new string and the number of substitutions made.

import re

text = 'aaa@xxx.com bbb@yyy.com ccc@zzz.com'
result=re.subn('[a-z]*@', 'usandopython@', text)

print(result)

Result
    ('usandopython@xxx.com usandopython@yyy.com usandopython@zzz.com', 3)

 

re.match()

The re.match function checks for a match only at the beginning of the string. If the pattern matches there, it returns the Match object. If the pattern occurs only later in the string (even at the start of a later line), it returns None. Matching is also case-sensitive by default, as the third call shows. Here is an example:

import re

text = 'The Programming Road page offers Python tutorial and programming tips'
result_1=re.match('The Programming', text)
result_2=re.match('The Programming Road', text)
result_3=re.match('the programming Road', text)

print(result_1)
print(result_2)
print(result_3)

Result
    <re.Match object; span=(0, 15), match='The Programming'>
     <re.Match object; span=(0, 20), match='The Programming Road'>
     None
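To see how re.match() differs from its neighbours, here is a short sketch comparing it with re.search() and re.fullmatch() on the same text. The two-line string is an assumption added to show that match() looks only at the very start of the string, not at the start of each line.

```python
import re

text = 'The Programming Road page offers Python tutorial and programming tips'

# 'Python' occurs in the text, but not at position 0.
at_start = re.match(r'Python', text)    # None
anywhere = re.search(r'Python', text)   # Match object

# match() does not try the start of later lines.
two_lines = 'first line\nsecond line'
second_with_match = re.match(r'second', two_lines)                 # None
second_with_flag = re.search(r'^second', two_lines, re.MULTILINE)  # Match object

# fullmatch() requires the pattern to cover the entire string.
whole = re.fullmatch(r'The Programming', text)                     # None

print(at_start)                  # None
print(anywhere.span())           # (33, 39)
print(second_with_match)         # None
print(second_with_flag.group())  # second
print(whole)                     # None
```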

 

group()

The group() method of a Match object returns the entire match (or a specific subgroup, when given its number). We can define subgroups in a regular expression by putting them in parentheses. group(0) is the entire match, and the numbered groups start at 1. Here is an example to clarify:

import re

text = "Python is friendlier than Java"
combine = re.match(r'(.*) is (.*?) .*', text)

print("entire match : ", combine.group(0))
print("first group : ", combine.group(1))
print("second group : ", combine.group(2))

Result
    entire match :  Python is friendlier than Java
    first group :  Python
    second group :  friendlier
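If you want all subgroups at once, the Match object also has a groups() method, and groups can be given names with the (?P<name>...) syntax. The sketch below uses the same sentence; the group names subject and quality are hypothetical labels chosen for this illustration.

```python
import re

text = "Python is friendlier than Java"

# Same pattern as above, with the two groups given (hypothetical) names.
named = re.match(r'(?P<subject>.*) is (?P<quality>.*?) .*', text)

all_groups = named.groups()        # tuple of all subgroups, starting at group 1
subject = named.group('subject')   # look a group up by name
as_dict = named.groupdict()        # mapping of group names to matched text

print(all_groups)  # ('Python', 'friendlier')
print(subject)     # Python
print(as_dict)     # {'subject': 'Python', 'quality': 'friendlier'}
```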

 

Python regular expression Practical example

Now that we've seen how to use different regular expressions, let's use them in practice and create some programs.

The first program will check whether an email address is valid, and the second will check whether a phone number is valid.

Python email validation using regular expression.

This program will receive several emails from a csv document, will check whether the email address provided is valid or not, and will separate them into a list that will be presented in a table.

First, we will find patterns in different emails and, from that, we will draw a regular expression that will be able to identify emails.

Here is an example of valid emails:

joao23@gmail.com
joao.futi@gmail.co
python@gamil.co.ao
python@gmail.co.br

Now, here is an example of invalid emails:

joao123gmail.com [@ is not present]
joao futi@gmail.com [space cannot be present]
@python@gamil.co.ao [email cannot start with @]
.python@gamil.co.ao [email cannot start with a period (.)]

 

From the examples above, we can look at email patterns and create a regular expression that will allow us to evaluate emails as valid and invalid.

Then we will have:

1 - The first part (before the @) contains the following ASCII characters.

  • English uppercase (A-Z) and lowercase (a-z) letters.
  • Digits (0-9).
  • The characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
  • The character . (period, dot, or full stop), as long as it is not the first or the last character and does not appear twice in a row.

2 - The domain name part [e.g., com, org, net, ao, br] contains letters, digits, hyphens and periods.

Thus, we will have the following expression:

result = re.search(r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b',email)

Okay, now that we've started our expression, let's develop our program.

We will start by creating a new Python file. Name it whatever you like (I will name mine regex_email.py) and save the file.

Then import the re library into your script and create a function called email_checker that takes one parameter, just like in the example below.

#importing re 
import re

def email_checker(email):
    print(email)

email_checker('Using Python')



If you run the code snippet, the phrase 'Using Python', which was passed as a parameter, will be printed.

Now let's create a condition that receives the email and checks whether it satisfies the expression. If it does, we'll print that the email is valid; otherwise we'll print that it is invalid.

#importing re
import re

def email_checker(email):
    #verifying the received email
    if re.search(r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b', email):
        print("Valid email")
    else:
        print("Invalid email")

email_checker('Using Python')

Now, if you run the program, the message 'Invalid email' will be printed because the parameter we passed to our function does not meet the condition within it.

Now, from the email examples, copy a valid and an invalid one, so we can use that as an example.

 

joao23@gmail.com
joao123gmail.com

 

Let's use these two examples and pass them as parameters to our function. Then we will have:

#importing re
import re

def email_checker(email):
    #verifying the received email
    if re.search(r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b', email):
        print("Valid email")
    else:
        print("Invalid email")

email_1 = 'joao23@gmail.com'
email_2 = 'joao123gmail.com'

email_checker(email_1)  # prints: Valid email
email_checker(email_2)  # prints: Invalid email

 

With that, we have finished our program that evaluates whether an email is valid or not. Now try using different emails as examples, or create your own conditions from the expression to be evaluated.
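One thing to watch out for when you experiment: re.search() accepts a match anywhere in the string, so an input such as 'joao futi@gmail.com' is reported as valid because the part 'futi@gmail.com' matches. The sketch below is one possible stricter variant, not the only way to do it. It uses re.fullmatch(), so the whole string must fit the pattern, and it rewrites the first part so that a period can only appear between other characters. The function name is_valid_email and the adjusted pattern are assumptions made for this illustration, and the pattern is still a simplification of real email rules.

```python
import re

# Hypothetical stricter pattern: periods in the first part must sit between
# other allowed characters, and the whole string has to match.
STRICT_EMAIL = re.compile(
    r"[a-zA-Z0-9_%+-]+(?:\.[a-zA-Z0-9_%+-]+)*"       # first part, no leading, trailing or double periods
    r"@"
    r"[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\.[a-zA-Z]{2,}"  # domain ending in letters
)

def is_valid_email(email):
    """Return True when the entire string looks like an email address."""
    return STRICT_EMAIL.fullmatch(email) is not None

# The valid and invalid examples listed earlier in this tutorial.
examples = [
    'joao23@gmail.com', 'joao.futi@gmail.co', 'python@gamil.co.ao',
    'joao123gmail.com', 'joao futi@gmail.com',
    '@python@gamil.co.ao', '.python@gamil.co.ao',
]

for email in examples:
    print(email, '->', 'Valid email' if is_valid_email(email) else 'Invalid email')
```

With this variant the first three examples are reported as valid and the last four as invalid.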

 
