Mystic Regex

You are a dev and have been one for a year or so (maybe more, maybe less). Your editor of choice has always been emacs or vim. Your day starts with a sip of ‘darker than coal’ coffee and ends with your keyboard as your pillow. You are an expert at typing and programming, but the one weapon missing from your arsenal is the ability to write and understand something like this:

$out =~ s/(^[a-zA-Z0-9]+)\.([a-z]+)/<a href=\"\1\.\2\">\1<\/A>/g;

What does it do? I hope you will be able to tell me after reading through this.
This post adds a new weapon to your arsenal: the intercontinental ballistic missile of programming, or what others call regex!

The Basics

Literals

This is just plain text. If I need to match cat in Bell the cat, I would just use cat as a regex!

Regex Special Characters

The characters []{}().+*?\|^$ are special in regex. If you need to use one of them as a literal, escape it by preceding it with \, for example \{. Here is what they do:

Regex character, what it is, and examples:
[] Character class: [abcd] matches any single character that is one of a, b, c or d. [^abcd] matches any single character that is none of a, b, c or d.
. Dot: Matches any single character except the newline \n.
* Star: Repeats the token preceding it zero or more times. So .*cat matches pussycat and also cat.
+ Plus: Repeats the token preceding it one or more times. So .+cat matches pussycat but not cat on its own.
| Alternation: Works like an ‘or’. Say you want to match My dogs name is Tiger and also My cats name is puff. The strings are almost the same, so your regex would be My (cats|dogs) name is .* (the parentheses matter: without them, My cats|dogs name is .* means either My cats or dogs name is .*, because | splits the whole regex).
{} Limited repetition: Use this when you know how many times a pattern must repeat, or the minimum and the maximum. For example, [0-9]{2,5} matches any number of 2, 3, 4 or 5 digits (leading zeros included). If you want only 2 digit numbers, use [0-9]{2}; if you want numbers of at least 2 digits, use [0-9]{2,} (note the comma).
$ End of line anchor: If your regex ends with this character, you are saying ‘the pattern must occur at the end of the line’. For example, in I am What I am, searching for am matches both occurrences, but searching for am$ matches the last one only.
^ Start of line anchor: If your regex starts with this character, you are saying ‘the pattern must occur at the start of the line’. For example, in I am What I am, searching for I matches both occurrences, but searching for ^I matches the first one only.
^$ Caret and dollar together: You can use ^ and $ in the same regex, in which case the line must contain exactly the pattern and nothing else. For example, if your input is a large file with text on every line and you are trying to pull out a key of length 10 made of letters and digits, you would use ^[0-9A-Za-z]{10}$
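
To try the special characters above, here is a minimal sketch using Python's standard re module (Python is my choice for illustration; the patterns themselves are the ones from the table). The alternation is wrapped in parentheses so that it only chooses between cats and dogs, and all variable names are made up for this example.

```python
import re

# Character class and its negation.
in_class = re.findall(r"[abcd]", "decaf")        # ['d', 'c', 'a']
not_in_class = re.findall(r"[^abcd]", "decaf")   # ['e', 'f']

# Star allows zero repetitions, plus needs at least one.
star_cat = re.fullmatch(r".*cat", "cat") is not None            # True
plus_cat = re.fullmatch(r".+cat", "cat") is not None            # False
plus_pussycat = re.fullmatch(r".+cat", "pussycat") is not None  # True

# Alternation: the parentheses limit the choice to cats or dogs.
pets = re.compile(r"My (cats|dogs) name is .*")
dog_line = pets.fullmatch("My dogs name is Tiger") is not None  # True
# Without parentheses the regex means "My cats" OR "dogs name is .*".
unscoped = re.fullmatch(r"My cats|dogs name is .*",
                        "My dogs name is Tiger") is not None    # False

# Limited repetition: between 2 and 5 digits.
digits = re.compile(r"[0-9]{2,5}")
three_digits = digits.fullmatch("007") is not None              # True
one_digit = digits.fullmatch("7") is not None                   # False

# Anchors: positions where each pattern matches in the sample string.
sample = "I am What I am"
am_anywhere = [m.start() for m in re.finditer(r"am", sample)]   # [2, 12]
am_at_end = [m.start() for m in re.finditer(r"am$", sample)]    # [12]
i_at_start = [m.start() for m in re.finditer(r"^I", sample)]    # [0]

# Both anchors: the whole line must be a 10 character key.
key = re.compile(r"^[0-9A-Za-z]{10}$")
good_key = key.search("A1b2C3d4E5") is not None                 # True
long_key = key.search("A1b2C3d4E5x") is not None                # False

# Escaping: \{ matches a literal brace.
literal_brace = re.search(r"\{", "f{x}").start()                # 1
```

re.fullmatch is used where the table talks about a pattern matching a whole word, and re.finditer where it talks about finding occurrences inside a longer string.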

The Advanced

Now that you have a basic grasp of regex writing, it is time to learn some more advanced stuff.

Grouping & Back Referencing

Parentheses group a pattern so that another operator can be applied to the whole group: (ash)+ matches ash and ashash in full, but only the leading ash of ashas. That is a fairly primitive use, though. The more powerful use is backreferencing. When you put a pattern inside (), you tell the regex engine to store the text it matched so that you can access it later. To match that same text again further along in the regex, use \ followed by a number. The number is the position of the group, counting opening parentheses from the left: \1 means the text matched by the first pair of parentheses. For example, to write a regex that matches an HTML opening tag together with its closing tag, you can use

<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>

A language like Perl also lets you read the captured groups after the match. In the above example, the tag name is in $1 and the content inside the tag is in $2 (since the second () wraps the HTML inside the tag).
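
Here is a small sketch of grouping and backreferencing in Python's re module, where group(1) and group(2) play the role of Perl's $1 and $2. It assumes the inner group is written lazily as (.*?), and the sample strings and variable names are invented for illustration.

```python
import re

# \1 must repeat whatever the first group captured (the tag name).
tag_pair = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>")

m = tag_pair.search("This is my <TAG>first</TAG> test")
tag_name = m.group(1)   # 'TAG'
inner = m.group(2)      # 'first'

# The closing tag differs from the opening tag, so \1 cannot match.
mismatched = tag_pair.search("<B>bold</I>")   # None

# Plain grouping: the + applies to the whole group "ash".
whole = re.fullmatch(r"(ash)+", "ashash") is not None   # True
partial = re.fullmatch(r"(ash)+", "ashas") is not None  # False
```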

Optional Items and Regex Greediness

Suppose you want to use a regex to match an HTML tag, assuming your input is a well formed HTML file.

You might think that <.+> will solve this easily, but you will be surprised when you test it on a string like This is my <TAG>first</TAG> test. You might expect the regex to match <TAG> and, when continuing after that match, </TAG>.

But it does not. The regex matches <TAG>first</TAG>, which is not what we wanted. The reason is that the plus is greedy: it makes the regex engine repeat the preceding token as often as possible. Only if that causes the rest of the regex to fail will the engine backtrack, that is, go back to the plus, make it give up its last iteration, and try the remainder of the regex again. To avoid this pitfall, put a ? after the quantifier to make it lazy, so that it repeats as few times as possible: <.+?> matches <TAG> and then </TAG>. The ? in the tag-matching regex of the previous section is there for the same reason.
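
The difference is easy to see side by side. This is a minimal sketch in Python's re module using the test string from above; the negated character class version is an extra alternative I added, not something the discussion above depends on.

```python
import re

text = "This is my <TAG>first</TAG> test"

# Greedy: .+ runs to the end of the line, then backtracks to the last ">".
greedy = re.findall(r"<.+>", text)      # ['<TAG>first</TAG>']

# Lazy: .+? stops at the first ">" it can reach.
lazy = re.findall(r"<.+?>", text)       # ['<TAG>', '</TAG>']

# Alternative: forbid ">" inside the tag, so no backtracking is needed.
negated = re.findall(r"<[^>]+>", text)  # ['<TAG>', '</TAG>']
```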

The ? can also make an item optional when it follows a token or group rather than a quantifier. If you want to match February but also Feb, you can use Feb(ruary)?
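
A quick sketch of the optional group in Python's re module; the sample sentence and variable names are made up.

```python
import re

month = re.compile(r"Feb(ruary)?")

# The group is matched when it is present and skipped when it is not.
found = [m.group(0) for m in month.finditer("Feb 23 and February 24")]
# found is ['Feb', 'February']

# The group is all or nothing: "Febr" is not accepted as a whole match.
half_word = month.fullmatch("Febr") is not None   # False
```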

Performance of Regex

In one word: good. A mature regex engine is heavily optimised, and for pattern matching it will usually perform better than hand-written matching code you or I would throw together. Also, the simpler your regex, the faster it tends to run. Backreferences in particular can slow matching down, because the engine has to fall back on backtracking to handle them. A simple illustration is grep: by default it accepts only basic regex syntax, while egrep (grep -E) accepts the extended syntax. The advanced features are convenient, but patterns that lean on them can come at a price!

Now, after going through all this, I hope you can tell what the first regex I showed you does! It is in Perl, where s/<PATTERN>/<REPLACE>/g replaces <PATTERN> with <REPLACE>, globally.
I hope you can now add regex to your programming arsenal and that this has helped you understand it. For more info, you can always google 😉
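
If you want to check your answer, here is one way to read that substitution, sketched in Python rather than Perl. The function name linkify and the sample lines are hypothetical, and I am assuming the replacement is meant to wrap the leading name in a link whose target is name.extension.

```python
import re

def linkify(line):
    """Turn a leading name.ext into a link; hypothetical helper."""
    # Group 1: letters and digits at the start of the line.
    # Group 2: the lowercase extension after the dot.
    return re.sub(r"(^[a-zA-Z0-9]+)\.([a-z]+)",
                  r'<a href="\1.\2">\1</A>',
                  line)

linked = linkify("example.com is down")
# linked is '<a href="example.com">example</A> is down'

# The ^ anchor means a name later in the line is left alone.
untouched = linkify("visit example.com")
# untouched is 'visit example.com'
```

Because the pattern is anchored with ^, the global flag can find at most one match per line here.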
