Category Archives: Random

Mystic Regex

You are a Dev and have been so for more than a year now ( or maybe more or less). Your editor of choice has always been emacs/vim. Your mode of operation starts with a sip of ‘darker than coal’ coffee and ends with making your keyboard – your pillow. You are an expert at typing and programming but the one weapon you have been missing from your arsenal is writing/understanding something like this –

$out =~ s/(^[a-zA-Z0-9]+)\.([a-z]+)/<a href="\&quot;\1\.\2&quot;">\1<\/A>/g;</a>

What does it do? I hope you will be able to tell me after reading through this.
This post is to add to your arsenal – an Intercontinental ballistic missile of programming or what others call – Regex!

The Basics

Literals

This is just plain text. If I need to match cat in Bell the cat, I would just use cat as a regex!

Regex Special Characters

The following characters – []{}().+*\|^$ are native to regex. If you need to use them as literals you need to escape them by preceding it with \, for eg – \{. Now what do these do –

Regex Character
What it is
Examples
[] Character class [abcd] – Anything that is either one of a,b,c or d.
[^abcd] Match anything which is neither of a,b,c or d
. Dot character class Matches any single character except \n
* Star Matches any character class preceding it; of any length including 0 length. So, if you use .*cat, it will match pussycat and also cat
+ Plus Matches anything of length >=1. So, if you use .+cat, it will match pussycat and but not cat
| Alternation This works similar to a ‘or’ in a regex. If you want to match dog in the string My dogs name is Tiger, but also match cat in My cats name is puff. These are almost similar string and so your regex would be My cats|dogs name is .*
{} Limited Repitition Let say you want to ensure the number of times a pattern is to be matched. Or even better, you know the minimum and the maximum. In such a case you would use {}. For eg – [0-9]{2,5} means match it to any 2 digit, 3digit, 4 digit or 5 digit number ( with leading zeros). If you want only 2 digit numbers – [0-9]{2}, or if you want atleast 2 digit numbers [0-9]{2,} (note the comma ,)
$ End line Anchor This is a regex end line anchor. If your regex ends with this character, you are trying to say that ‘The pattern must occur at end of line’. For eg, If you want to ensure that the match ends with your pattern like I am What I am, if you search using am, it will match both but if you search am$, it matches the last one only.
^ Start line anchor This is a regex start line anchor. If your regex starts with this character, you are trying to say that ‘The pattern must occur at the start of line’. For eg, If you want to ensure that the match starts with your pattern like I am What I am, if you search using I, it will match both but if you search ^I, it matches the first one only.
^$ Caret Dollar Remember, you can also use ^$ in the same regex and in this case it would mean that the line must contain exaclty the pattern. For eg, your input is a large file with text on every line and you are trying to pull out a key of length 10 which can contain characters and numbers you would say – ^[0-9A-Za-z]{10}$

The Advanced

Now that you have a basic grasp of regex writing, it is time to learn some more advanced stuff.

Grouping & Back Referencing

If you are looking to group a pattern so that another operation can be applied to it, like (ash)+ will match ash, ashash but not ashas. But this would be a very primitive usage of this character. The more powerful usage is in backreferencing. When you put a pattern into a (), you tell the regex engine to store the match internally so that you can access it later. To use a matched pattern as a pattern again, you can use \ followed by a number. This number is the sequence of back reference. If you say \1, then it means the pattern matched with the first set of parenthesis. For eg, if you want to write a regex, which will match html starting and closing tags, you can use

<([A-Z][A-Z0-9]*)\b[^>]*>(.*)?</\1>

A language like Perl allows you to return backreferences. In the above example, to get the tag, you would use $1 and to see inside this tag you would use $2 (since the second time () used contains the html inside this tag).

Optional Items and Regex Greediness

Suppose you want to use a regex to match an HTML tag, assuming your input is a well formed HTML file.

You would think that  <.+> will solve easily. But be surprised when they test it on a string like This is my <TAG>first</TAG> test. You might expect the regex to match <TAG> and when continuing after that match, </TAG>.

But it does not. The regex will match <TAG>first</TAG>; not what we wanted. The reason is that the plus is greedy. That is, the plus causes the regex engine to repeat the preceding token as often as possible. Only if that causes the entire regex to fail, will the regex engine backtrack. That is, it will go back to the plus, make it give up the last iteration, and proceed with the remainder of the regex. To avoid such pitfalls, use ?. You can see this in the above regex.

It can also be used like optionals. If you want to match February but also Feb, you can use Feb(ruary)?

Performance of Regex

In one word – Better. The regex engine will perform better than anything you or I can write to match a pattern, unless you write your own regex engine. And even in that case, the standard Regex will beat you to it! Also, the more simpler your regex, the faster it run (Obviously). Using back-referencing will slow down your regex. A very simple example is grep. This utility only allows simple regex characters and tends to be faster than egrep which allows much more advanced stuff but at a price!

Now, after going through all this, I hope you can answer what the first regex I introduced you to, did! Its in perl and s/<PATTERN>/<REPLACE>/g replaces <PATTERN> with <REPLACE>, globally.
I hope you are able to now add regex to your programming arsenal and hope this has helped you understand it. For more info, you can always google 😉

Advertisements

Thou art Debugger

I have been a big fan of Visual Studio. It is an amazing IDE. It can be used to develop anything from a CLI to GUI and from Mobile Apps to Web Apps. But I have never really tried all that and that isn’t the reason why I liked it so much. As a starter, it can be very difficult to discover a bug in your program. You safely assume you have written what you wanted to write. But in reality that happens very rarely. Often, we miss a little small thing here and there and that creates havoc. The one thing that caught my eye as a young programmer was the Debugging features of VS. Gosh, its amazing.
In this article, I will explain how Debuggers debug!.

Firstly, we will need to learn a bit about CPU registers. We will concentrate on the x86 architecture. If you have ever written code in x86 Assembly, then you would have heard about them. But let me just walk you through the functions of these registers.

CPU registers are just memory that the CPU can use to store data. But, it is the fastest accessible memory for the CPU. Ideally you would like to keep everything in it. Unfortunately, it is damn expensive and so a trade-off is done on the pricing and performance. Though there are 32 registers, the most commonly used ones(9 to be precise) for the purpose of executing instuctions are –

EAX : Accumulator : used to store. Used during add/sub/multiplication. Mult/Div cannot be done elsewehere except in EAX. Also used to store return values.
EDX : storage register Used in conjuction to EAX. It is like a side-kick.
ECX: count register: Used in looping. However, there is an interesting thing about it. It always counts downwards and not upwards.
for ex:
int a=100;
int b=0;
while(b<a)b++;

Then ECX will begin from 100 and not 0 and move to 99,98 … !
ESI : source index for data operations and holds location of input stream.(READING)
EDI : points to location where the result is stored of a data operation destination index. (WRITING)
ESP : Stack pointer
EBP: Base pointer
EBX: not designed for anything specific . can be used for extra storage.
EIP : Current instruction being executed.

Now that we understand this, lets see how a debugger works!
Depending on the type of breakpoint, one of the following happens :

Soft Breakpoint :
Let us assume we need to execute this instruction :
mov %eax,%ebx

And this is at location 0x44332211, and the 2 byte opcode for this is 0x8BC3. Hence what we will see is :
0x44332211 0x8BC3 mov %eax,%ebx

Now when we create a soft breakpoint at this instruction what the debugger does is – it takes the first byte (8B in this case) and replaces it with 0xCC.
So now, the opcode actually looks like – 0xCCC3. This is the opcode for INT 3 interrupt. INT 3 is used to halt the execution. Now when the CPU is happily executing everything till this one and suddenly sees the 0xCC it knows it has to stop(it may not really like it but – Rules are rules 😉 ). It then raises the INT 3 interrupt which the debugger will trap. It then checks the EIP register and sees if this intruction is actually in its list of breakpoints (just in case the program itself has a INT 3 inside it). If it is present, it will replace the first byte with the correct value (8B in this case) and then program execution can continue.

There are two kinds of soft breakpoint: One shot – Where the breakpoint occurs only once and after that the debugger removes the instruction from its list; and Persistent – where it keeps recurring. The debugger would replace the first byte with the correct byte but when execution is resumed, it will once again replace it with 0xCC and not remove it from its list.

Soft breakpoints have one caveat though.When we make a soft breakpoint it changes the software/program’s CRC (cyclic redundancy check) check sum. A CRC is a type of function which tells whether there has been any change or not. It is like a hash function and can be applied to memory/files etc.,. It compares against a known value and if the checksum fails, the CRC fails. This can be used to prevent Soft breakpoints like in malwares where if the CRC fails, the malware kills itself. To get around this we use Hardware breakpoints!

Hardware Breakpoints –
Though hardware breakpoints cannot be applied at everything, it is still very useful. Recall, the I said there are 32 registers. I introduced 9 in the previous section, let me add another 8 to that list. These eight – DR0 to DR7 are debug registers. DR0 – DR3 store the address of the breakpoints and thus we can have only 4 hardware breakpoints (Ouch!). DR4 and DR5 are reserved. DR6 is the status register which determines the type of debugging event once it is hit and DR7 is ON/OFF switch for hardware breakpoints and also stores different conditions like –
1. Break when an instruction is executed at a particular address.
2. Break when data is written to an address.
3. Break on reads or writes to an address but not execution.

As you can guess, you can only break 4 bytes of memory with hardware breakpoints but nonetheless they are very useful tools for reverse engineers. For creating hardware breakpoints, you can use hbreak command inside gdb.( More info ).

Memory breakpoints.
These arent really breakpoints. It is more like setting permissions on a section of memory or on an entire page, something similar to file permissions. When we set a particular permission, if any instruction tries to do something outside of these permissions on that memory, then a break occurs. The permissions available are – Page Execution (enables exec but throws access violation on read/write) , Page Read ( allows only read), Page Write (only write), and Guard Page (one time exception after which the page returns to its original status). Like in files, we can use combination of these to set permissions. In gdb, to break on write use watch, rwatch will break on read and awatch will break on read/write.

I hope you now have a better understanding of how debuggers work. You may feel this as being unnecessary info, but trust me it helps to know how things work for better usage.

I would like to thank all Source of knowledge – the World Wide Web. Please free to send me corrections/suggestions/criticism.
Adios!


BrainFUCK-ed

If you are offended by the title, then please SKIP! So, you were caught in the BrainFuck bubble as well. Isn’t it one of the coolest esolangs out there! You might argue – Whats the use of this language. It can’t be used for any REAL software development. Maybe. But as a programmer, it can really strech your boundaries. How? Lets say in C, you had to write a program to print the first five fibonacci numbers (using the logic and not just printing it out), you would code it with your toes. I would too. But think of it in BrainFUCK and you will understand why Urban Muller named it so appropriately.

Here is the Wikilink for it so that you can see the basics of the languange.

In a C version of it, you will need 3 variables – to store the previous value and current value, and one temporary. One more to make sure you only print 5 and not any more , making our count 4 !
Lets look at the code now :

int main()
{
	int prev=1,curr=1,temp,count;
	for(count=0;count<5;count++)
	{
		printf("%d ",curr);
		temp=curr;
		curr=prev+curr;
		prev=temp;

	}
}

That was quick.
But even a simple program like this could make you think very hard in BrainFuck. Its fun to know that you have to think to write simpler programs. It makes my blood rush with just the thought of it.

So here goes !
We only have 8 instructions, out of which we will use 7 (since we are not reading anything from the stdin).
You have to think through absolutely every single thing before you so even put a ‘.’ (dot) on the code editor (quite literally).
We need to print SPACE, so we need 32(ASCII of SPACE) in the memory somewhere!.
We need to print number (for simplicities sake, ONE digit numbers) so we need to make sure we have the ASCII values of it, in memory (ASCII of 0 is 48, 1 is 49 and so on..).
If you look at the C code, we will need keep count of the number of items we have remaining to print. And then we need two locations to store previous and current values of the sequence. A very important thing to remember is that all memory is initialized to 0 at the start and we are relying on this for it to work. Also, the looping statements compare the memory location currently being referred (where your current pointer is) to against 0 and if it is true it breaks otherwise loops.

Writing a simple pseudocode (in terms of only Brainf*k) :

Initialize a location A to 32
Initialize a location B to 48
Initialize a location C to 5
Initialize a location D and E to 1.
Set current Location to C.
Start a Loop
	Add B to E and F.
	Print E.
	Subtract B from E
	Add D to E and F
	Copy E to D 
	Copy F to E
	Reduce C by 1 and Set it as the current location
End Loop

Basically, it was very easy to write this pseudocode. The issue with Braninf*k is that it does not have a COPY instruction for us to use, and so we need to use a looping statement to copy data. But the problem is that to ensure that the loop is executed the correct number of times, we will need to destroy the original value. To get around this, we always update two locations (one where we need it to be copied and the other a temporary!) and then we copy back from the temporary which also destroys it!

Also the A,B,C,D,E and F are all contiguous in memory and so I am using a[0],a[1]…a[6] to refer to them in the code !

Here is the final version.

++++++++[>++++>++++++<<-]>>>++++++>+>+<<-[<[>>>+>+<<<<-]>>>.<<<<.>>>>>[<-<<<+>>>>-]<[>+>+<<-]<[>>>+<<<-]>>[<<+>>-]>[<<+>>-]<<<<-]  

To make sense of it I have divided it into parts. Some parts have not been commented and are left as an exercise for you to figure out ( I know I am mean 😉 ).

++++++++[>++++>++++++<<-] //32 48
>>>++++++ //6
>+>+<<- //Set first two values to 1 and reduce the number of terms to be printed to 5.
[
	<[>>>+>+<<<<-]
	>>>.//print the number
	<<<<.//print space
	>>>>>[<-<<<+>>>>-]
	<[>+>+<<-]
	<[>>>+<<<-]
	>>[<<+>>-]//copy from a[6] to a[4]
	>[<<+>>-]//copy from a[7] to a[5]
	<<<<-//decrement the value
]  

All right. Pretty cool huh! Well as a real mind-boggler you can try making it print N numbers (dont worry about printing it right, just print out the actual value encoded by ascii ).

You could use it to even encrypt messages in a really wierd way. Oh almost forgot. To run your Brainf*k programs use the following website. They have an online interpreter and a debugger of sorts too : www.brainfk.tk

I hope you liked your Christmas Present ;-). Merry Christmas and a Happy New Year… Ho oh ho 😛


Naughty Null

Hi, I am sorry it took me this long for the post. But to my fans (:P) I will be posting (hopefully) 2 more as promised in the next 2 days (actually nights).
This Post is based entirely on Sir Anthony Hoare’s Lecture on – NULL REFERENCES – A BILLION DOLLAR MISTAKE

If you have a null reference – every bachelor in the world would seem married, polyandrously!
                                                                                                                                                  -Edsger Djikstra

What is a Null Reference : A Null reference is a pointer that points to a NULL (in C++). For example,

#include<stdio.h>
int main()
{
    char *cp=NULL;
    cp=(char*)(malloc(sizeof(char)*12);
    cp="HELLO THERE";
    printf("%s\n"cp);
}

Here, in Line 4 you have a Null Reference. Now you would wonder why that is a bad thing. In the context of this program, it isnt all that bad since it executes smoothly. But if you begin using complex source codes then you will have to be careful that you do no use/pass a NULL REFERENCE in/to a function. As this can lead to potential issues – sometimes almost impossible to detect !

A HISTORY LESSON :
Hoare’s first job was in the 60’s, as a programmer for a British Manufacturer – Elliots. After about 9 months, he was asked to design a new programming language ( Imagine this as your assignment at your first job !) While working on that, he found a book – Report on the International Algorithmic Language – ALGOL-60 (around 23 pages). He mostly wrote the new language around Algol 60 and left out complicated parts like if statements but later added them into it.

In those days, code was written in Machine Language and it was easier to debug in machine language ( I never thought I would ever hear that!) than in high level languages because of now knowing the exact values in the 4096 bytes of memory. The principle involved was that ” One should be able to tell the error by looking at the high level code alone”. And this was a real problem and people were a little nervous about the high level languages. But as the complexity went up, it was accepted eventually.

Now, Lets examine another piece of commonly used code –

#include<stdio.h>
#define N 100
int main()
{
	int arr[N];
	int s=0;
	int k
	/*** SOME CODE TO DO SOME OPERATIONS ON ARRAYS ***/
	/*** SOME CODE THAT MODIFIES k***/
	
	while(s<N)
        {
              printf("%d\n",arr[s]);
              s++;
        }
	if(k>=0 && k<N)printf("K-th value is %d\n",arr[k]);
}

If we look at what we are doing in the code at Line 10 and 11 – We are making sure that the subscript in within bounds. Thus we see that two checks(Upper Bound and Lower bound) are required to make sure this condition is satisfied. But this added to the time to run code (In those days the processors were really slow and things like if-s could take up a large amount of time !) and also to the size of the code.

A Question may come up – Is this a good thing to do? The Answer is YES. Languages like JAVA already implement it!

Hoare later on went on to become part of the team to design the successor of ALGOL – 60. He suggested that an Object/Record could be referenced by a pointer. But pointers in indirect memory can cause absolute havoc, because if you used a float as a pointer you could end up overwriting your own code! Thus, he took it for granted that the programmer must declare what data the pointer points to and it could be checked at compile time. He asked the customers – Whether they would like the Type checking removed once the code was tested ? They said No. The reason being that there are more chances of errors and issues popping up in a Live environment than in a test environment.

At that point, many people also used FORTRAN instead of ALGOL and Elliot wanted to sell it to them also. So they wrote a program which converted FORTRAN to ALGOL. It was a disaster, not because it compromised the speed but because – after converting it, it would come up with an entire essay of subscript errors.

Hoare invented the NULL REFERENCE but his friend Edsger Djikstra thought that it was a bad idea – If you have a null reference – every bachelor in the world would seem married, polyandrously – since every bachelor has the same wife – NULL but is still a bachelor !

Hoare came up with a solution based on Discrimination of Objects belonging to the disjoint Union Class – A union between two sets with nothing in common! For eg, if we have a class Vehicle and subclasses – Cars and Bus. They both have an attribute Capacity but for a bus it refers to the passenger Capacity whereas for a car it refers to boot capacity. Thus it would discriminate based on bus or car and give you the corresponding capacity (by making a discriminating class.) Thus a pointer could point to a vehicle or to NULL.

Also, having NULL made it easy to initialize. If you didn’t do it this way, you could end up doing it in a really complex manner which could require a sub-language of its own. For eg, it is easy to design a data-structure without the NULL but almost impossible to design a circular structure without it! Hoare did not want to deal with that and hence NULL became a possible value for every pointer.

Thus in a way, it also influenced the way C was written. One of the greatest issues was the gets() function in C, which allowed buffer – overflow. If it had not been for it, the World would have been free of malwares !

If you asked me whether it was a Good choice or a bad choice to have a NULL reference – I would say that it was a good choice. Consider the amount of time I have saved thanks to the NULL pointer and the structures also owe some stability to the NULL Pointer. However, one must be very careful around it because it is very dificult to see that the pointer points to a NULL and it can be a cause for many a sleepless night!

See it yourself


%d bloggers like this: