Regular expression in Python

Regular expressions are most useful when text arrives in an irregular format. They are one of the most important tools for effectively handling strings that follow patterns.

Regular Expression:

A regular expression is used to search for a particular pattern within a large amount of text. In other words, regular expressions let us pull well-formed pieces of text out of larger strings.

Initially, regular expressions were handled very well by the Unix command “sed” (stream editor), which is widely used for operations such as string replacement, file editing, and extracting the text between one particular line of a file and another.

vinoth@vinoth:~$ echo "Men is intelligent" | sed 's/Men/Women/'
Women is intelligent   -- the string 'Men' is replaced with the string 'Women'
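
For comparison, the same substitution could be sketched in Python with the standard re module (covered later in this article). The variable names here are only illustrative:

```python
import re

sentence = "Men is intelligent"

# re.sub replaces matches of the pattern; count=1 mimics sed's default
# behaviour of replacing only the first match on a line.
result = re.sub(r"Men", "Women", sentence, count=1)
print(result)
```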

Later, awk was added to some flavors of Unix as a text-processing command. awk works very well for text processing, and some applications have been developed entirely in AWK.

vinoth@vinoth:~$ echo "Awk is very good language, Yes" | awk '$1 ~ /wk/ {print $NF}'
Yes  -- the first word contains "wk", so the last word is printed
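
A rough Python counterpart of this awk one-liner, as a sketch only: the sentence is split on whitespace the way awk splits fields, and the names fields and last_field are made up for the illustration.

```python
import re

sentence = "Awk is very good language, Yes"
fields = sentence.split()  # whitespace splitting, similar to awk's fields

# Equivalent of `$1 ~ /wk/`: does the first field contain "wk"?
last_field = fields[-1] if re.search(r"wk", fields[0]) else None
print(last_field)
```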

What makes regular expressions interesting:

Regular expressions are a real treat for the operations or production support engineer who writes scripts. I started my career as a production support agent, and my main responsibility was to monitor the transaction logs and raise an alert if any anomalies were identified in the log.

Logs hold running information about an application, server, or service in either a structured or an unstructured format.

For example, a web server log (the Catalina.out log) stores whatever the developer wrote to standard output or standard error (System.out.println or System.err) during Java application development. After the application is deployed on a production server, the production/application support team has to identify the patterns of possible errors in the server/application logs and configure them as monitoring parameters for continuous application monitoring.

Identifying the error/exception patterns in logs using a scripting language is a very interesting job. Below is a TeamViewer application log that I wanted to keep monitoring for the pattern “error” followed by a 3-digit number.

2017/11/08 22:38:38.130 32725 4116704064 S CTcpProcessConnector::CloseConnection(): Shutdown socket returned error 107: Transport endpoint is not connected
2017/11/08 22:38:38.601 32725 4125096768 S TVRouterClock Schedule next request in 0 seconds
2017/11/08 22:38:38.640 32725 4125096768 S! KeepAliveSessionOutgoing::ConnectEndedHandler(): KeepAliveConnection with server50705.teamviewer.com ended

As we can see in the log lines above, “error 107” represents a transport endpoint connection issue. The word “error” is commonly followed by a 3-digit number that identifies the specific error. In this case, I wanted to monitor my TeamViewer application log for “error”, a space, and 3 digits, so I first tried the grep command with the pattern error, as below.

grep 'error' Teamview.log

Oops: in that output I can see both “No error” lines and real error lines, but my requirement is to extract only the pattern in which the word “error” is followed by a space and a 3-digit number.

grep 'error [0-9][0-9][0-9]' Teamview.log   -- force grep to match the word "error", a space, and three digits, each between 0 and 9
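
The same filter could be sketched in Python. The three sample lines are copied from the log excerpt above; the names log_lines, error_pattern, and matching_lines are hypothetical, and a real script would read the lines from the log file instead of a list.

```python
import re

log_lines = [
    "2017/11/08 22:38:38.130 32725 4116704064 S CTcpProcessConnector::CloseConnection(): Shutdown socket returned error 107: Transport endpoint is not connected",
    "2017/11/08 22:38:38.601 32725 4125096768 S TVRouterClock Schedule next request in 0 seconds",
    "2017/11/08 22:38:38.640 32725 4125096768 S! KeepAliveSessionOutgoing::ConnectEndedHandler(): KeepAliveConnection with server50705.teamviewer.com ended",
]

# Same pattern as the grep command: "error", a space, then three digits.
error_pattern = re.compile(r"error [0-9]{3}")

# Keep only the lines in which the pattern occurs somewhere.
matching_lines = [line for line in log_lines if error_pattern.search(line)]
for line in matching_lines:
    print(line)
```

Compiling the pattern once with re.compile is a small design choice that pays off when the same pattern is applied to many lines.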

There are better ways to meet this kind of requirement, but explaining them is out of scope for this article.

Using the well-known grep command, I obtained the error pattern, which I can configure in a monitoring tool such as Nagios or Splunk for further monitoring and alerting.

Regular expressions made Perl:

Perl is a general-purpose programming language that was developed mainly for text manipulation, driven by interest in sed and awk. The history of this language, which Larry Wall developed in 1987, encouraged me to learn all the scripting languages. Many programming languages have been created, but the history of Perl's development and the man behind it are especially interesting. I encourage you to read about that history and about Larry Wall.

Power of Perl regular expressions:

I would like to run through one more example, which can be solved with a Unix command as well as with Perl, but in which you can feel the power of Perl.

Requirement: I have a variable that holds the sentence “This is regular expression example”. I want to check whether the word “regular” occurs in this sentence and, if it does, print “Regular expression found”.

Unix Style:

To meet this requirement in Unix style, I used AWK and tested the third column of the sentence.

In that command, I had to pass the column number as $3: column $1 refers to the word “This”, column $2 refers to the word “is”, column $3 refers to the expected word “regular”, and so on. This meets the requirement, but if the word “regular” is not in column $3, the command fails. We could do more work with an awk for loop to handle that case, but the command becomes a bit lengthy.

Perl Style: 

The same requirement can be met in Perl with a very simple regular expression.

I declared the sentence in the variable $a and passed it directly to an if condition using the binding operator =~, which tests whether the string matches a pattern, and then printed the other sentence.
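
For readers who think in Python, the same position-independent check might look like the sketch below. It uses re.search from the re module introduced in the next section; the variable names are illustrative only.

```python
import re

sentence = "This is regular expression example"

# re.search looks for the pattern anywhere in the string,
# so the position of the word does not matter.
found = re.search(r"regular", sentence) is not None
if found:
    print("Regular expression found")
```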

Regular expressions in Python:

Before Python version 1.5, regex was the module used to work with regular expressions in Python. Around that time, many Python developers started turning to Perl for regular expression work. Hence, the re module was introduced in Python 1.5 to provide Perl-style regular expression patterns in Python.

Below are the names defined in the re module, as listed by dir(re):

['DEBUG', 'DOTALL', 'I', 'IGNORECASE', 'L', 'LOCALE', 'M', 'MULTILINE', 'S', 'Scanner', 'T', 'TEMPLATE', 'U', 'UNICODE', 'VERBOSE', 'X', '_MAXCACHE', '__all__', '__builtins__', '__doc__', '__file__', '__name__', '__package__', '__version__', '_alphanum', '_cache', '_cache_repl', '_compile', '_compile_repl', '_expand', '_locale', '_pattern_type', '_pickle', '_subx', 'compile', 'copy_reg', 'error', 'escape', 'findall', 'finditer', 'match', 'purge', 'search', 'split', 'sre_compile', 'sre_parse', 'sub', 'subn', 'sys', 'template']

The common symbols, characters, and syntax are explained very well on this site, since they follow a standard format shared by scripting/programming languages.

For example:

“.”  -> The dot matches any character except the newline ‘\n’.

>>> sent = "This is an example in learninone.com"  # variable assignment
>>> import re  # import the re module
>>> reg = re.search('exam...', sent)  # search for "exam" followed by any three characters (...)
>>> reg.group()  # group() returns only the text that matched
'example'
>>> reg.string  # the whole string that was searched
'This is an example in learninone.com'

Using the same example with 4 dots and a newline character inserted after the word “example”, the regular expression no longer fetches the word, because the dot “.” cannot match the newline ‘\n’.

>>> sent = "This is an example\n in learninone.com"   # introduced a \n (newline) character
>>> reg = re.search('exam....', sent)  # search for exam.... in the sent variable
>>> reg.group()                     # re.search returned None, so this call fails
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> reg.string
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'string'
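
Two habits help avoid this traceback, sketched below: check the result for None before calling group(), and pass the re.DOTALL flag (one of the names in the module listing above) when the dot should also match a newline. The names plain and dotall are made up for this example.

```python
import re

sent = "This is an example\n in learninone.com"

# Without flags, "." does not match "\n", so the search returns None.
plain = re.search(r"exam....", sent)
print(plain)

# With re.DOTALL, "." also matches the newline character.
dotall = re.search(r"exam....", sent, re.DOTALL)
if dotall is not None:  # guard against None before calling group()
    print(repr(dotall.group()))
```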

The Python re module documentation explains all of this well, with more examples. So, without further ado, I would like to give a real-world example.

An often-used example:

Very often, the admin team has to fetch the list of IPs used in all the files and folders on a server. As you know, an IP address has the format 255.255.255.255. This can be achieved simply with the well-known modules os, socket, and re.

os: To walk through all the files and folders on the server using the os.walk() function.

re: To extract the IP, as explained below.

socket: To validate whether the values fetched by the regular expression are really IPs.

>>> line = 'This is simple regular expression example to extract the IP 255.255.255.255'
>>> ips = re.findall(r'(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})$', line)
>>> print(ips)
['255.255.255.255']

In the above example,

(?:[\d]{1,3})\. :- This portion of the regular expression matches a sequence of 1 to 3 digits (\d with {1,3}) followed by a literal dot, which is escaped as “\.”. The ?: marks the parentheses as a non-capturing group, so findall returns the whole matched text rather than the individual groups. The same portion is repeated for all 4 parts of the IP address (255.255.255.255), with no dot after the last part, and $ refers to the end of the line.

The re.findall method scans the string from left to right and returns every non-overlapping match of the pattern.
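
The effect of ?: is easiest to see by running findall with and without it. This is a small sketch with the character class [\d] shortened to \d; the names whole and parts are illustrative.

```python
import re

line = "This is simple regular expression example to extract the IP 255.255.255.255"

# Non-capturing groups: findall returns each whole match as a string.
whole = re.findall(r"(?:\d{1,3})\.(?:\d{1,3})\.(?:\d{1,3})\.(?:\d{1,3})", line)

# Capturing groups: findall returns a tuple of the groups for each match.
parts = re.findall(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})", line)

print(whole)
print(parts)
```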

>>> line = ('This is simple regular expression example to extract the IP 255.255.255.255 and '
...         'there is another dummy ip as 101.21.122.20 final')
>>> ips = re.findall(r'(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})', line)
>>> print(ips)
['255.255.255.255', '101.21.122.20']

This is the same example extended to include two IPs. If you look at the argument to the findall method, you will notice that I removed the “$” end-of-line character, because the sentence no longer ends with an IP. Moreover, since there are two IPs in the sentence, findall returns both of them in a list.
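
Putting the three modules together might look like the following sketch. It assumes Python 3. The names IP_PATTERN, is_valid_ipv4, and find_ips are hypothetical, and socket.inet_pton is just one possible way to do the validation step mentioned earlier.

```python
import os
import re
import socket

# Same idea as the pattern above: four groups of 1 to 3 digits separated by dots.
IP_PATTERN = re.compile(r"(?:\d{1,3})\.(?:\d{1,3})\.(?:\d{1,3})\.(?:\d{1,3})")


def is_valid_ipv4(candidate):
    """Return True if the socket module accepts the text as an IPv4 address."""
    try:
        socket.inet_pton(socket.AF_INET, candidate)
        return True
    except (OSError, ValueError):
        return False


def find_ips(root_dir):
    """Walk root_dir and return a sorted list of the valid IPv4 addresses found."""
    found = set()
    for folder, _subfolders, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(folder, name)
            try:
                with open(path, "r", errors="ignore") as handle:
                    for text in handle:  # read line by line to keep memory use low
                        for candidate in IP_PATTERN.findall(text):
                            if is_valid_ipv4(candidate):
                                found.add(candidate)
            except OSError:
                continue  # unreadable file: skip it
    return sorted(found)
```

The regular expression alone would also accept a value such as 999.1.1.1, which is why the extra validation step is worth having. Calling find_ips on a folder returns the distinct addresses found in the files beneath it.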

Besides findall, regular expression objects and match objects in Python offer many other methods and attributes, such as search, match, finditer, split, sub, subn, flags, groups, groupindex, pattern, and expand, which really ease your work when you play with regular expressions.
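
A brief, non-exhaustive sketch of a few of these methods on a compiled pattern follows. The sample sentence is a shortened version of the one above, and the compact pattern (?:\d{1,3}\.){3}\d{1,3} is an equivalent rewrite chosen only to keep the example short.

```python
import re

line = "IP 255.255.255.255 and another dummy ip as 101.21.122.20 final"
ip_pattern = re.compile(r"(?:\d{1,3}\.){3}\d{1,3}")

# finditer yields match objects, which carry positions as well as text.
positions = [(m.group(), m.start()) for m in ip_pattern.finditer(line)]

# sub replaces every match; subn also reports how many replacements were made.
masked = ip_pattern.sub("x.x.x.x", line)
masked_again, count = ip_pattern.subn("x.x.x.x", line)

# split cuts the string at every match of the pattern.
pieces = ip_pattern.split(line)

print(positions)
print(masked)
print(count)
print(pieces)
```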

I encourage you to go through the links in this article, as they will help you practice regular expressions in Python and in other languages.
