Category Archives: Programming

Mystic Regex

You are a dev and have been one for a year or so (maybe more, maybe less). Your editor of choice has always been emacs/vim. Your day starts with a sip of ‘darker than coal’ coffee and ends with your keyboard as your pillow. You are an expert at typing and programming, but the one weapon missing from your arsenal is writing and understanding something like this -

$out =~ s/(^[a-zA-Z0-9]+)\.([a-z]+)/<A HREF="\1\.\2">\1<\/A>/g;

What does it do? I hope you will be able to tell me after reading through this.
This post is to add to your arsenal – an Intercontinental ballistic missile of programming or what others call – Regex!

The Basics

Literals

This is just plain text. If I need to match cat in Bell the cat, I would just use cat as a regex!

Regex Special Characters

The following characters – []{}().+*?\|^$ are special in regex. To use one of them as a literal, escape it by preceding it with \, e.g. \{. Now, what do these do -

Regex Character – What it is – Examples
[] – Character class – [abcd] matches any one of a, b, c or d. [^abcd] matches any single character that is none of a, b, c or d.
. – Dot – Matches any single character except \n.
* – Star – Repeats the item before it zero or more times. So .*cat matches pussycat and also cat.
+ – Plus – Repeats the item before it one or more times. So .+cat matches pussycat but not cat.
| – Alternation – Works like an ‘or’. Say you want to match My dogs name is Tiger and also My cats name is puff. The strings are almost identical, so group the part that varies: My (cats|dogs) name is .* (without the parentheses, My cats|dogs name is .* means either ‘My cats’ or ‘dogs name is …’, because | splits the whole pattern).
{} – Limited repetition – Use this when you know how many times a pattern must repeat, or the minimum and the maximum. For example, [0-9]{2,5} matches any run of 2, 3, 4 or 5 digits (leading zeros included). If you want exactly 2 digits, use [0-9]{2}; if you want at least 2 digits, use [0-9]{2,} (note the comma).
$ – End of line anchor – If your regex ends with this character, you are saying ‘the pattern must occur at the end of the line’. For example, in I am What I am, searching for am matches both occurrences, but searching for am$ matches the last one only.
^ – Start of line anchor – If your regex starts with this character, you are saying ‘the pattern must occur at the start of the line’. For example, in I am What I am, searching for I matches both occurrences, but searching for ^I matches the first one only.
^$ – Caret and dollar together – You can use ^ and $ in the same regex, in which case the line must contain exactly the pattern and nothing else. For example, if your input is a large file with text on every line and you are trying to pull out a key of length 10 made of letters and digits, you would say ^[0-9A-Za-z]{10}$
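To try the table entries out, here is a minimal sketch using std::regex from the C++ standard library (its default ECMAScript syntax agrees with everything in the table). The helper names found_in and matches_whole are made up for this example; they are not part of any library.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `pattern` matches somewhere inside `text`.
bool found_in(const std::string& text, const std::string& pattern) {
    return std::regex_search(text, std::regex(pattern));
}

// Hypothetical helper: true if `pattern` matches the whole of `text`.
bool matches_whole(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}
```

For instance, matches_whole("cat", ".*cat") is true while matches_whole("cat", ".+cat") is false, and found_in("I am What I am", "am$") is true only because the string ends in am.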

The Advanced

Now that you have a basic grasp of regex writing, it is time to learn some more advanced stuff.

Grouping & Back Referencing

Parentheses group part of a pattern so that another operator can be applied to the whole group: (ash)+ matches ash and ashash, but not the trailing as in ashas. That is the primitive use. The more powerful use is backreferencing. When you put a pattern inside (), the regex engine stores whatever that group matched so that you can refer to it later. To require the same text again further along in the pattern, use \ followed by a number, which is the index of the group. So \1 means the text matched by the first set of parentheses. For example, a regex that matches an HTML opening tag together with its closing tag could be

<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>

A language like Perl also exposes the captured groups after a match. In the above example, $1 holds the tag name and $2 holds what is inside the tag (since the second pair of () wraps the HTML between the tags).
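The same idea can be sketched in C++ with std::regex, where the match object plays the role of Perl's $1 and $2. The function name inner_html is hypothetical, and the pattern is the lazy (.*?) form of the tag regex, kept uppercase-only as in the text.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the text between a matching pair of tags,
// or an empty string if no opening tag has a closing tag of the same name.
std::string inner_html(const std::string& text) {
    // \1 must repeat whatever the first group captured (the tag name).
    static const std::regex tag_pair(R"(<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>)");
    std::smatch m;
    if (std::regex_search(text, m, tag_pair)) {
        return m[2].str();  // m[1] is the tag name, m[2] the contents
    }
    return "";
}
```

Because of the backreference, a string like <B>bold</I> gives no match at all: the closing tag has to repeat the name captured by the first group.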

Optional Items and Regex Greediness

Suppose you want to use a regex to match an HTML tag, assuming your input is a well formed HTML file.

You would think that <.+> solves this easily. But you will be surprised when you test it on a string like This is my <TAG>first</TAG> test. You might expect the regex to match <TAG> and, when continuing after that match, </TAG>.

But it does not. The regex matches <TAG>first</TAG>, which is not what we wanted. The reason is that the plus is greedy: it makes the regex engine repeat the preceding token as often as possible. Only if that causes the rest of the regex to fail does the engine backtrack, that is, go back to the plus, make it give up its last iteration, and try the remainder of the regex again. To avoid this pitfall, put a ? after the quantifier to make it lazy: <.+?> matches <TAG> first and then </TAG>. You can see a ? at work in the tag-matching regex above.

The ? also has a second job: on its own it makes the preceding item optional. If you want to match February but also Feb, you can use Feb(ruary)?
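Greedy versus lazy is easiest to believe when you see both results side by side. A small sketch with std::regex follows; first_match is a made-up helper name that returns the first substring a pattern matches.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: the first substring of `text` matched by `pattern`,
// or an empty string when nothing matches.
std::string first_match(const std::string& text, const std::string& pattern) {
    std::smatch m;
    if (std::regex_search(text, m, std::regex(pattern))) return m.str();
    return "";
}
```

On This is my <TAG>first</TAG> test, the pattern <.+> yields <TAG>first</TAG> while <.+?> yields just <TAG>. With Feb(ruary)? the result is Feb for "Feb 14" and February for "February 14".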

Performance of Regex

In one word – Better. A mature regex engine will usually beat any pattern matching code you or I would write by hand, unless you are prepared to write your own regex engine. And even in that case, the standard one will probably beat you to it! Also, the simpler your regex, the faster it runs. Backreferences in particular slow things down, because they rule out the fastest matching strategies. A very simple example is grep. With -F (the old fgrep) it searches for literal strings only and tends to be faster than a full regex search, which allows much more advanced stuff but at a price!

Now, after going through all this, I hope you can tell what the first regex I introduced you to does! It is in Perl, where s/<PATTERN>/<REPLACE>/g replaces every match of <PATTERN> with <REPLACE> (the g stands for globally).
I hope you can now add regex to your programming arsenal and that this has helped you understand it. For more info, you can always google ;)
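If you want to check your answer, here is one reading of the opening one-liner, sketched in C++: a name.ext token at the very start of the input is wrapped in a link whose target is name.ext and whose label is name. The function name linkify is made up, and std::regex_replace writes the groups as $1 and $2 where the Perl line writes \1 and \2.

```cpp
#include <regex>
#include <string>

// Hypothetical helper mirroring the Perl substitution from the top of the post.
// The ^ inside the first group anchors the match to the start of the input.
std::string linkify(const std::string& line) {
    static const std::regex name_dot_ext(R"((^[a-zA-Z0-9]+)\.([a-z]+))");
    return std::regex_replace(line, name_dot_ext, "<A HREF=\"$1.$2\">$1</A>");
}
```

Under that reading, linkify("index.html") gives <A HREF="index.html">index</A>, while a line such as see index.html comes back unchanged because the name does not sit at the start.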


Linear Recurrences

How often has it happened to you in a programming contest (or elsewhere) that you thought a problem could not be solved faster than O(N), and yet the limits suggested that it had to be? Well, if not all, at least a good many of these problems have a solution based on the idea of linear recurrences. In this blog post, I intend to help you out with this!

In this post, we are going to follow a solve-and-learn strategy: you will be given a question and I will show you how to apply the concepts to it.

TYPE 1 :: The Simplest :

If a post mentions recurrences, then it has to mention Fibonacci (Gosh, if only I had a penny for every mention of Fibo in tutorials. )

The recurrence is of type : F(n) = F(n-1) + F(n-2).

I am pretty sure you know how to code the linear version, which runs in O(N), but can you do it in O(log N)? If you put Google to good use, you will come up with a solution which says there is a matrix M which, when raised to the power N, gives you the N-th Fibonacci number. And since you can always exponentiate in O(log N) multiplications by repeated squaring, you have your answer. But for those who wondered whether this matrix is magical, read on!

Firstly, the answer: no, it’s not magical. How? Let’s do a little algebra (yumm… my favourite!)
F(n)=F(n-1)+F(n-2)
F(n+1)=F(n)+F(n-1)
F(n+2)=F(n+1)+F(n)

Clearly, the value of the N-th term depends only on the two previous terms (or states). This implies that every value is determined by just the first two states in the sequence, as you can see here -

\begin{pmatrix}F(n+2)\\ F(n+1)\end{pmatrix}=\begin{pmatrix}1&1\\ 1&0\end{pmatrix}\times\begin{pmatrix}F(n+1)\\ F(n)\end{pmatrix}

and

\begin{pmatrix}F(n+1)\\ F(n)\end{pmatrix}=\begin{pmatrix}1&1\\ 1&0\end{pmatrix}\times\begin{pmatrix}F(n)\\ F(n-1)\end{pmatrix}

Hence

\begin{pmatrix}F(n+2)\\ F(n+1)\end{pmatrix}=\begin{pmatrix}1&1\\ 1&0\end{pmatrix}^2\times\begin{pmatrix}F(n)\\ F(n-1)\end{pmatrix}

\begin{pmatrix}F(n+2)\\ F(n+1)\end{pmatrix}=\begin{pmatrix}1&1\\ 1&0\end{pmatrix}^3\times\begin{pmatrix}F(n-1)\\ F(n-2)\end{pmatrix}

Hence, in general, with F(1) = 1 and F(0) = 0, we may write:
\begin{pmatrix}F(n)\\ F(n-1)\end{pmatrix}=\begin{pmatrix}1&1\\ 1&0\\ \end{pmatrix}^{n-1} \times \begin{pmatrix}1\\ 0\end{pmatrix}
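Here is a minimal sketch of that formula in code, using repeated squaring on the 2x2 matrix. The names Mat2, multiply and fib are mine, results are taken modulo a caller-supplied mod (an assumption, since contest problems usually ask for that), and mod is assumed to be below 2^32 so the 64-bit products cannot overflow.

```cpp
#include <array>
#include <cstdint>

using Mat2 = std::array<std::array<std::uint64_t, 2>, 2>;

// Multiply two 2x2 matrices, reducing every entry modulo `mod`.
Mat2 multiply(const Mat2& x, const Mat2& y, std::uint64_t mod) {
    Mat2 r{};  // zero-initialised
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            for (int k = 0; k < 2; ++k)
                r[i][j] = (r[i][j] + x[i][k] * y[k][j]) % mod;
    return r;
}

// F(n) mod `mod`, with F(0) = 0 and F(1) = 1, in O(log n) matrix products.
std::uint64_t fib(std::uint64_t n, std::uint64_t mod) {
    if (n == 0) return 0;
    Mat2 result = {{{1, 0}, {0, 1}}};  // identity
    Mat2 base = {{{1, 1}, {1, 0}}};    // the matrix M from the text
    for (std::uint64_t e = n - 1; e > 0; e >>= 1) {
        if (e & 1) result = multiply(result, base, mod);
        base = multiply(base, base, mod);
    }
    // (F(n), F(n-1)) = M^(n-1) * (1, 0), so F(n) is the top-left entry.
    return result[0][0] % mod;
}
```

For example, fib(10, 1000000007) is 55. Multiplying by the column (1, 0) just picks out the first column of the matrix power, which is why the code reads the top-left entry directly.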

I hope that has helped you understand how to frame such equations and solve them with a matrix.

TYPE 2 : Simplest ++

Now that we have a basic understanding, try the following recurrence:

F(n) = F(n-1) + F(n-2) + F(n-3).

It is the same as the previous recurrence but with one additional state. I won’t go on explaining the hows (again!). I am just going to share the solution, which takes F(0) = 1, F(1) = 1 and F(2) = 2 as the starting values.
\begin{pmatrix}F(n)\\ F(n-1)\\ F(n-2) \end{pmatrix}=\begin{pmatrix}1&1&1\\ 1&0&0\\ 0&1&0 \end{pmatrix}^{n-2} \times \begin{pmatrix}2\\ 1\\ 1\end{pmatrix}
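For a quick sanity check of the 3x3 version, here is a sketch with a matrix type of any size. Matrix, multiply, power and trib are names I picked; the base values F(0) = 1, F(1) = 1, F(2) = 2 are read off the column (2, 1, 1) in the formula, and mod is again assumed to be below 2^32.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Matrix = std::vector<std::vector<std::uint64_t>>;

// Product of two square matrices of equal size, entries reduced modulo `mod`.
Matrix multiply(const Matrix& x, const Matrix& y, std::uint64_t mod) {
    const std::size_t k = x.size();
    Matrix r(k, std::vector<std::uint64_t>(k, 0));
    for (std::size_t i = 0; i < k; ++i)
        for (std::size_t j = 0; j < k; ++j)
            for (std::size_t t = 0; t < k; ++t)
                r[i][j] = (r[i][j] + x[i][t] * y[t][j]) % mod;
    return r;
}

// m raised to the power e by repeated squaring.
Matrix power(Matrix m, std::uint64_t e, std::uint64_t mod) {
    const std::size_t k = m.size();
    Matrix result(k, std::vector<std::uint64_t>(k, 0));
    for (std::size_t i = 0; i < k; ++i) result[i][i] = 1;  // identity
    for (; e > 0; e >>= 1) {
        if (e & 1) result = multiply(result, m, mod);
        m = multiply(m, m, mod);
    }
    return result;
}

// F(n) mod `mod` for F(n) = F(n-1) + F(n-2) + F(n-3),
// assuming the base values F(0) = 1, F(1) = 1, F(2) = 2.
std::uint64_t trib(std::uint64_t n, std::uint64_t mod) {
    if (n < 2) return 1 % mod;
    const Matrix m = {{1, 1, 1}, {1, 0, 0}, {0, 1, 0}};
    const Matrix p = power(m, n - 2, mod);
    // First row of M^(n-2) times the column (F(2), F(1), F(0)) = (2, 1, 1).
    return (p[0][0] * 2 + p[0][1] + p[0][2]) % mod;
}
```

With those base values the sequence runs 1, 1, 2, 4, 7, 13, 24, 44, 81, 149, 274, so trib(10, 1000000007) should come out as 274.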

TYPE 3: Simplest << 1

Consider the following scenario ::

G(n) = a . G(n-1) + b . G(n-2) + c . H(n)\\ \\ and \\ \\ H(n)= d . H(n-1) + e . H(n-2)

This one is a lot trickier. First thing to notice is that we will need 4 states in a matrix to fully define the next state. The reason for using 4 and not 3 is that H(n) depends on 2 states and thus we need 2 states (and not just 1) to represent it.

If you carefully write down the LHS matrix and the RHS matrix, then we can frame the solution as . . .

\begin{pmatrix}G(n)\\ G(n-1)\\ H(n+1)\\ H(n) \end{pmatrix}=\begin{pmatrix}a&b&c&0\\ 1&0&0&0\\ 0&0&d&e\\ 0&0&1&0 \end{pmatrix}^{n-1} \times \begin{pmatrix}G(1)\\ G(0)\\ H(2)\\ H(1)\end{pmatrix}

TYPE 4 : Ohhh !

The final hurdle can come in the name of a constant. If we add a constant C to the above recurrence we get -

G(n) = a . G(n-1) + b . G(n-2) + c . H(n) + C\\ \\ and \\ \\ H(n)= d . H(n-1) + e . H(n-2)

But to tell you the truth, it’s not that difficult if your concepts are clear. There is now one additional state to hold the information about C. The solution will look like -

\begin{pmatrix}G(n)\\ G(n-1)\\ H(n+1)\\ H(n)\\ C \end{pmatrix}=\begin{pmatrix}a&b&c&0&1\\ 1&0&0&0&0\\ 0&0&d&e&0\\ 0&0&1&0&0\\ 0&0&0&0&1 \end{pmatrix}^{n-1} \times \begin{pmatrix}G(1)\\ G(0)\\ H(2)\\ H(1)\\ C\end{pmatrix}
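A matrix like this is easy to get wrong by one row, so it is worth checking against the plain O(n) loop. The sketch below does both. The struct Recurrence and the function names are hypothetical, all coefficients and base values are assumed to be non-negative and smaller than mod, and mod is assumed to be around 10^9 so that the 64-bit sums cannot overflow.

```cpp
#include <array>
#include <cstdint>

using u64 = std::uint64_t;
using Mat5 = std::array<std::array<u64, 5>, 5>;

// Hypothetical container for the coefficients and base values.
struct Recurrence {
    u64 a, b, c, d, e, C;  // coefficients and the additive constant
    u64 g0, g1, h1, h2;    // G(0), G(1), H(1), H(2)
};

Mat5 multiply(const Mat5& x, const Mat5& y, u64 mod) {
    Mat5 r{};
    for (int i = 0; i < 5; ++i)
        for (int j = 0; j < 5; ++j)
            for (int k = 0; k < 5; ++k)
                r[i][j] = (r[i][j] + x[i][k] * y[k][j]) % mod;
    return r;
}

// G(n) mod `mod` for n >= 1, using the 5x5 matrix from the text.
u64 g_matrix(const Recurrence& p, u64 n, u64 mod) {
    Mat5 base = {{{p.a, p.b, p.c, 0, 1},
                  {1, 0, 0, 0, 0},
                  {0, 0, p.d, p.e, 0},
                  {0, 0, 1, 0, 0},
                  {0, 0, 0, 0, 1}}};
    Mat5 result{};
    for (int i = 0; i < 5; ++i) result[i][i] = 1;  // identity
    for (u64 e = n - 1; e > 0; e >>= 1) {
        if (e & 1) result = multiply(result, base, mod);
        base = multiply(base, base, mod);
    }
    // First row of M^(n-1) times the start column (G(1), G(0), H(2), H(1), C).
    const std::array<u64, 5> start = {p.g1, p.g0, p.h2, p.h1, p.C};
    u64 g = 0;
    for (int j = 0; j < 5; ++j) g = (g + result[0][j] * (start[j] % mod)) % mod;
    return g;
}

// The same value by stepping both recurrences one term at a time, O(n).
u64 g_direct(const Recurrence& p, u64 n, u64 mod) {
    u64 g_prev = p.g0 % mod, g_cur = p.g1 % mod;  // G(0), G(1)
    u64 h_prev = p.h1 % mod, h_cur = p.h2 % mod;  // H(1), H(2)
    for (u64 i = 2; i <= n; ++i) {
        // Here h_cur holds H(i) and h_prev holds H(i-1).
        u64 g_next = (p.a * g_cur + p.b * g_prev + p.c * h_cur + p.C) % mod;
        u64 h_next = (p.d * h_cur + p.e * h_prev) % mod;
        g_prev = g_cur; g_cur = g_next;
        h_prev = h_cur; h_cur = h_next;
    }
    return g_cur;
}
```

With a=2, b=3, c=1, d=1, e=1, C=5 and all four base values equal to 1, both functions give G(2) = 11 and G(3) = 32. Setting C to 0 turns this into a check for the TYPE 3 matrix as well.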

I hope this post lived up to your expectations and I hope it was worth the wait :P . Please feel free to post comments/corrections/improvements to this post to make it really useful.


Return to Roots: Tree 101

What is a Tree :

A tree is a hierarchical arrangement of nodes. From the literal meaning of tree we know that it has a root, branches, fruits and leaves. Well, in algorithms too, we have a root, which is the origin of the tree. We have branches, which connect to smaller trees, and we have leaves, which do not have outgoing branches. And as far as the fruits are concerned – depending on the complexity of the operations that can be performed, we may label the fruits as sweet or sour!

The simplest tree would be one where every node branches to exactly one other node, or in other words – a singly linked list. If every node links to its child and also to its parent, we have a doubly linked list. But in this post, we are not going to discuss these.

The next level of trees would be where a single node may branch out to a maximum of two other nodes. Such a tree is called a binary tree. Binary trees are some of the most widely used data structures in computing and we are going to discuss them in a series of posts. So let’s begin.

One of the most important things to do is: create a tree.
So what do we need to create one? We will need to represent the nodes and the links between nodes. And since a node connects to a maximum of two other nodes, it will have two branches. We shall call these branches left and right. Also, each node will store some data. Our tree will be used to store just integers.

We will use the following structure to create it. FYI, everything here is in C++ and not C.

struct NODE {
    int data;
    NODE *left;
    NODE *right;
};

Now whenever we need to insert a node, we need to make sure that there is a fixed position at which the node will be inserted, given its value (the data in the node). Let us follow a simple strategy.
We will insert a node to the left of a ‘parent node’ if its value is less than the value of the parent, otherwise to the right. Binary trees which use such a strategy are called Binary Search Trees.

The obvious advantage of such a strategy is that we can search for elements in the tree in O(h) time, where h is the height of the tree. Do note that, in general, h does not equal log N: in the worst case (say, values inserted in sorted order) the tree degenerates into a list and h is N. If we could actually have a tree where the height is indeed kept at O(log N), we would call it a Balanced Binary Search Tree.

Alright then, let’s get our hands dirty with some code that will create the tree for us. The function insert takes as input the root of the tree and the value to be inserted, and returns the newly created node which contains the data. For an empty tree, that node is the root, so the caller should keep it.

NODE * insert(NODE *root, int data) {
    if(root==NULL) {
        // empty tree: the new node becomes the root
        root=new NODE;   // new rather than malloc, to match the allocation below
        root->left=root->right=NULL;
        root->data=data;
        return root;
    }
    else {
        // walk down to the node that will become the parent
        while(root!=NULL) {
            if(root->data>data) {
                if(root->left!=NULL) root=root->left;
                else break;
            }
            else {
                if(root->right!=NULL) root=root->right;
                else break;
            }
        }
        NODE *new_node=new NODE;
        new_node->data=data;
        new_node->left=new_node->right=NULL;
        // smaller values hang to the left, everything else to the right
        if(root->data > data) {
            root->left=new_node;
        }
        else root->right=new_node;
        return new_node;
    }
}
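The O(h) search mentioned earlier walks the same path that insert does. Here is a small self-contained sketch: search is a hypothetical name, and insert_rec is a compact recursive variant of the insert rule that returns the subtree root, used here only to build a tree to search in.

```cpp
#include <cstddef>

struct NODE {
    int data;
    NODE *left;
    NODE *right;
};

// Compact recursive insert following the same rule as the text:
// smaller values go left, everything else goes right. Returns the subtree root.
NODE *insert_rec(NODE *root, int data) {
    if (root == nullptr) return new NODE{data, nullptr, nullptr};
    if (data < root->data) root->left = insert_rec(root->left, data);
    else root->right = insert_rec(root->right, data);
    return root;
}

// Hypothetical lookup: returns the node holding `data`, or nullptr if absent.
// It follows a single root-to-leaf path, hence O(h).
NODE *search(NODE *root, int data) {
    while (root != nullptr && root->data != data)
        root = (data < root->data) ? root->left : root->right;
    return root;
}
```

Inserting 20, 10, 30, 5, 15 in that order builds the five-node tree used later in this post; search then finds 15 in three steps and returns nullptr for a value such as 42.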

Another very useful and important property of the above strategy is that the InOrder traversal comes out sorted!

Let’s back up a bit. What are traversals? A traversal is like visiting many homes using the roads which connect them. Only here, the homes are the NODEs and the roads are the links between the nodes.

There are many traversals but the three used most often are – PreOrder, InOrder and PostOrder.

In PreOrder, you print the current node and then visit its left and then its right children, recursively.
In InOrder, you first visit the left child; once you have returned, you print the current value and then visit the right child.
In PostOrder, you visit both your children and then print the current value.

Here is the code snippet for the InOrder traversal (recursive version).

void inorder(NODE *root) {
    if(root!=NULL) {
        inorder(root->left);
        printf("%d ",root->data);
        inorder(root->right);
    }
}
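The other two orders differ from InOrder only in where the visit happens. A sketch is below; to keep it checkable, these versions append to a vector instead of printing, which is my choice and not how the post's own code does it.

```cpp
#include <vector>

struct NODE {
    int data;
    NODE *left;
    NODE *right;
};

// Pre-order: the node first, then its left subtree, then its right subtree.
void preorder(const NODE *root, std::vector<int> &out) {
    if (root == nullptr) return;
    out.push_back(root->data);
    preorder(root->left, out);
    preorder(root->right, out);
}

// Post-order: both subtrees first, the node itself last.
void postorder(const NODE *root, std::vector<int> &out) {
    if (root == nullptr) return;
    postorder(root->left, out);
    postorder(root->right, out);
    out.push_back(root->data);
}
```

On a tree with root 20, children 10 and 30, and 5 and 15 under the 10, PreOrder gives 20 10 5 15 30 and PostOrder gives 5 15 10 30 20.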

You could write an iterative version, where you simulate the system stack with a stack of your own. The advantage is that you save space: you push only a node pointer per level, not everything the system stores for a function call.
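The explicit-stack idea can be sketched like this with std::stack. The function name inorder_iterative is made up, and it collects the values into a vector rather than printing them so that the result is easy to check.

```cpp
#include <stack>
#include <vector>

struct NODE {
    int data;
    NODE *left;
    NODE *right;
};

// Iterative in-order traversal with an explicit stack of node pointers.
std::vector<int> inorder_iterative(NODE *root) {
    std::vector<int> out;
    std::stack<NODE *> pending;  // nodes whose left subtree is being visited
    NODE *current = root;
    while (current != nullptr || !pending.empty()) {
        while (current != nullptr) {  // dive left, remembering the way back
            pending.push(current);
            current = current->left;
        }
        current = pending.top();      // the "return" to the parent
        pending.pop();
        out.push_back(current->data);
        current = current->right;
    }
    return out;
}
```

The stack never holds more than one pointer per level of the tree, so its size is bounded by the height h.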

However, there exists a really beautiful iterative version which does not use a stack. It assumes only that two pointers can be checked for equality. It is based on threaded trees and it was first written in 1979 by Morris, hence the name!

How does it work?

The only reason we need a stack is so that we can do the “RETURN” from child nodes to parent nodes. And this return is really needed from only one node in each left subtree. Consider a 5 node tree.

                                      20
                                    /     \
                                   /       \
                                 10        30
                                /   \     
                               /     \
                             5       15

Now our stack would work like this.

1. Push 20.
2. Push 10.
3. Push 5.
4. Pop 5 and print 5.
5. Pop 10 and print 10.
6. Push 15.
7. Pop 15 and print 15.
8. Pop 20 and print 20.
9. Push 30.
10. Pop 30 and print 30.

If I write a non-recursive and non-stack version, my greatest headache would be getting from 15 back to 20 (steps 7-8). So we need to link 15 to 20 so that we can go to 20 without problems. But that would mean that we are modifying the tree. Well, we can do it in two steps. First we link the two, and later, once we have come back to 20 and printed it, we destroy that link.

                                        20
                                      / | \
                                     /  |  \
                                  10    |   30
                                  /   \ |   
                                 /     \|
                               5       15

And thus we have the following -

1. Set current to root.
2. While current is not NULL, do -
2.a. If current has no left child, print current, set current to its right child and repeat 2.
2.b. Else find the rightmost node in current’s left subtree (stop early if its right pointer already points back to current).
2.b.a. If that node’s right pointer is NULL, link it to current, set current to its left child and repeat 2.
2.b.b. Else (the link already exists) set that right pointer back to NULL, print current, set current to its right child and repeat 2.

As pseudocode, we may write it as -

Morris-InOrder ( root )
current = root
while current != NULL do
	if LEFT(current) == NULL then
		print current
		current = RIGHT(current)
	else do
		// set pre to left child of current
		pre = LEFT(current)
		// find the rightmost node in the left subtree of current
		while (RIGHT(pre) != NULL and RIGHT(pre) != current) do
			pre = RIGHT(pre)
		// if its right pointer is NULL, link it to current and move current to its left child
		if RIGHT(pre) == NULL then
			RIGHT(pre) = current
			current = LEFT(current)
		// else unlink it, print current and move current to its right child
		else do
			RIGHT(pre) = NULL
			print current
			current = RIGHT(current)

Looks nice, eh? Let’s just write the code.

void MorrisInorder(NODE *root) {
    NODE* current,*pre;
    current=root;
    while(current!=NULL) {
        if(current->left==NULL) {
            printf("%d ",current->data);
            current=current->right;
        }
        else {
            pre=current->left;
            while(pre->right != NULL && pre->right !=current) 
                pre=pre->right;
            if(pre->right==NULL) {
                pre->right=current;
                current=current->left;
            }
            else {
                pre->right=NULL;
                printf("%d ",current->data);
                current=current->right;
            }
        }
    }
}

Now, let’s talk about the fruits!

Insert happens in O(h) time. Each of the traversals (recursive, or iterative with a stack) runs in O(N) time and O(h) space (system stack or your own stack), which is O(N) for a badly unbalanced tree.

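Morris Inorder runs in O(N) time and O(1) extra space. The time stays linear because every edge of the tree is walked only a constant number of times, while creating and later removing the temporary links. One could say that it is slower in practice because of that extra pointer chasing, which is true, but the fact that it does not use additional space can be a huge boost in situations where you are low on system memory!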

The entire code is available on :PASTEBIN
I hope you gathered all that info well! I will post a Tree 102, in which I shall discuss the delete operation and talk more about balanced trees!

