Technical Views

Friday, 13 March 2015

OAuth 2.0

Most of you may have used Facebook for logging into some websites for example, www.gaana.com. As user is authenticated by FB, other websites don’t think of authenticating the user again. Instead, gets user’s information from FB. A consent screen is shown to the user, where he can accept/reject the permissions asked by the third party(www.gaana.com) Application using say FB. These permissions are for accessing the information from FB in read only, write only or both modes. This set of permissions depend upon the third party Application requirements.

If you are a new to the terminologies like OAuth or OpenId, then many questions may be going on in your mind. What is OAuth? Why OAuth?

What is OAuth?
The OAuth 2.0 is an authorization protocol that enables the third party application to obtain access to the resources on resource owner’s behalf. The resources can be anything like services, files or some information.

Why OAuth?Consider a scenario where third-party applications are required to store the resource owner's credentials for future in its database. As users have to store their credentials, Third-party applications gain overly broad access to the resource owner's protected resources, leaving resource owners without any ability to restrict duration or access to a limited subset of resources. So, instead of storing these credentials, can we have something that can be used for giving limited access to the resources of the users, so that no one can touch the information that user never intends to share?
And the answer is Yes. OAuth is the one of the best protocol for this purpose.

OAuth 2.0 Grant Types

Authorization Code: Auth Code grant type is used to obtain the Access and Refresh Tokens. Refer below figure and description for more information.

Implicit Grant: It does not support issuance of Refresh Tokens.

Resource Owner Password Credentials: It is suitable in some scenarios where resource owner has a trust relationship with the third party.

Authorization Request and Response: The client authentication is used as the authorization grant, no additional authorization request is needed.

let’s see in depth about the messages that are shared between a client and the different authentication/authorization servers for Authorization code grant type.
In OAuth, the client requests access to resources controlled by the resource owner and hosted by the resource server, and is issued a different set of credentials than those of the resource owner. Instead of using resource owner’s credentials , third party makes use of Access Tokens for accessing the protected resources. This Access Token is nothing but a string denoting a specific scope, lifetime and other access attributes. Access Tokens are issued to third party clients by an authorization server with approval of resource owner. Now, lets see how this protocol works in details and consider the following block diagram.

Fig 1: OAuth 2.0 Flow

The flow illustrated in the diagram includes the following the steps:

The Authorization and Authentication process of OAuth consists of exchange of two messages between user and the server.

(A) User authenticates itself by sending the Client Id to the Authorization Server and after successful authentication gets the Auth Code or Authorization Grant. This Auth code is expires within some minutes say 5 mins. This expiry time depends upon the authorization server.

(B) The Authorization Grant obtained in the first step along with Client Secret is used in order to obtain Access and Refresh Tokens. Out of these two tokens, Access token has validity of one hour(again expiry time is server dependent) and Refresh token has unlimited validity. For example, in case of Google Analytics, client can generate infinite no. of Refresh Tokens but when 21st token is generated 1st expires and similarly, for 22nd, 2nd expires and so on. Refresh token is used in order to obtain new access token, if expired or client wants to generated the newer one.

(C) Client makes request for accessing the protected resource to the resource server by presenting the Access token.

(D) The resource server validates the request and if found to be valid, serves the request.

(E) The steps (C) & (D) repeat until the Access token expires.

(F) Resource server serves the requests for the protected resource till the access token remains valid. If token found out to be invalid, it sends the error message.

(G) If Access token expires or client knows that his/her token is going to expire, he/she can request for the newer Access token by presenting refresh token.

(H) In this step, client gets the requested access token after authorization server authenticates the client and validates the refresh token. In the same response, he can get the new Refresh token also, if requested. As mentioned, client can generate infinite no. of refresh tokens but only last ‘N’ tokens will be valid and the value of this ‘N’ depends upon the authorization server.

I am sure most of you have some doubts:

What is the need of two tokens i.e. Access and Refresh?

Why two step process?

Need of Two Tokens

One of the reason of increasing popularity of OAuth 2.0 over OpenId is two tokens. In case of OpenId, each user can access the protected resources by having login with OpenId credentials. If some interceptor gets these credentials, then it can access the protected resources. To overcome this problem, OAuth 2.0 has two tokens. Most of the resource requests contain only access token, so if some interceptor gets this token, it can use for maximum of one hour(say) and after that access token gets expired. So, for providing limited time access to the resources, access token is introduced and for generating the new access token, refresh token can be utilized. Hence, OAuth 2.0 is more secure than OpenId.

Need of Two Steps

Consider a scenario where clients are entirely built on JavaScript and no Java or server side code and running in resource owner’s browser. Now, any application developer must not want his client Secret along with the tokens to be visible to the user. But in this scenario, its not possible to hide these secrets from the resource owner. So, only the first step known as Implicit Grant is used in this scenario and second one is ditched. So, only to achieve simplicity two steps are needed.

I think this post may be helpful for the developers working on third party authorization/authentication.

Please, do write the comments of appreciations or doubts.
Stay tuned for the next post till then eNjoy coding. Smile with tongue out

For more information:

Salesforce OAuth 2.0

Google Analytics OAuth 2.0

Monday, 26 January 2015

External Sort

I guess most of the people have implemented sorting algorithms like merge sort, quick sort, etc but they may not be aware of External Sort. So, What does this external sort mean?

In order to understand this concept, we will introduce a scenario. Suppose, we have to sort a file of size far larger than that of physical memory (RAM). To tackle this problem, firstly, increase the physical memory or secondly, try to introduce some other algorithm. Yes, second option is the better one but how?

This problem can be solved with this external merge sort algorithm. Let’s have a look of this algorithm.

This algorithm consists of two phases:

1) Creation of sorted sublists

2) Merging sublists to form a single file

Let’s start developing the foundation of the solution. Let the size of the main memory be ‘m’ buffers and that of file be ‘n’ such that n >> m. We can sort the file that has size less than or equal to the size of physical memory by making use of any known sorting algorithm. Here, sorting algorithms like merge sort cannot be used because such kind of sorting algorithm requires more memory for storing intermediate results.

Phase I:

Get the first m buffers (size of RAM) of the file into the main memory and sort using any internal sorting algorithm. Now, after sorting dump this onto the disk in a file called as sublist. Repeat this process until the whole file gets sorted into sublists. Obviously, we will get n/m no. of sublists. All the sublists will be nothing but sorted files of size equal to that of main memory.

Phase II:

So, we will get n/m = t (say) no. of sorted sublists.

Now, we have to merge these sorted sublists in order to create a single sorted file. One thing should be noted here, these sublists are not mutually sorted, don’t worry; you will understand the whole concept from the example.

Divide the main memory (logically) into (t+1) buffers as we have t sorted sublists and one extra buffer for storing the results of the merge operation. Now, the size of the buffer (s) will be equal to m/ (t+1). Read the s bytes from each sublists into the main memory and compare the top values from each buffer and put the lesser value into the output buffer and remove that value from its respective buffer. Fill the buffer when it becomes empty from its respective sublists. As soon as the output buffer gets full, dump it into an output file and this file will be the final sorted file that we wanted. All the dumps will be performed on the same output file in phase II.

For example,

Suppose we have a file containing the following nos.

1, 2, 3, 5, 4, 3, 5, 1, 3, 4, 5, 8, 3, 5, 1, 8, 3, 7, 4, 5 and we can store max. of 5 nos. at any instant of time in the main memory i.e. size of the main memory = 5 nos.

Now, we have to sort these nos. using External Sort mechanism.

Phase I:

Take first 5 nos. and put them in main memory, sort them and put them in a sublist.

Sublist 1: 1 2 3 4 5

Similarly, other sublists will be:

Sublist 2: 1 3 3 4 5

Sublist 3: 1 3 3 5 8

Sublist 4: 3 4 5 7 8

Phase II:

Now, we have 4 sublists, so 4 input plus 1 output buffers.

The working of this phase can be understood very easily from the above diagrams.

Step 1:

Initially, one no. from each sublist is loaded into physical memory (as each buffer can store only one no.) and the smallest no. is stored into the output buffer. This buffer is emptied/dumped into a file (output file).

Similarly,

Step 2:

Step 3:

Step 4:

Step 5:

In the similar fashion, we can go ahead until all the sublists become empty and at last the output file will contain sorted nos.

I hope, you can sort the files of huge sizes using external sorting algorithm that we have seen.

Please don’t forget to write the comments of appreciation or doubts.

Tuesday, 1 July 2014

Comment Oriented Blog Summarizer

This was the part of my course subject Information Retrieval And Extraction during my Masters. It was selected as one of the best projects for the same course. So, I want to share some information about the Implementation and Research Perspective(may be some people can expand the idea). I have done some kind of studies and embodied some new features into Summarization of the blogs on the basis of Comments.

Word of Caution: The above Proposed Method can be useful only for the blogs that have some proper Informative comments, for example, TechCrunch.com blogs.

Now lets start with understanding the Title itself,
What is Summarization of a Text?

Lets say, we are having a Document(Text) consisting some Sentences. Now, the task of Summarization is to extract a subset of sentences from D, denoted by Sc (Sc ⊂ D), that best represents the topic(s) presented in D.

But in Comment Oriented Text/Blog summarization, we make additional use of Comments that ultimately, increases the efficiency of the output.

I hope, you may got the abstract of what we have to implement.

Scenario Formation

Lets consider a scenario where we have a text of 100 sentences on Java and now we have summarize it. But one may think about how many Sentences or Words are expected in the output summarized text ?

Usually, 20% ( 20 sentences ) of the input text sentences are expected in the summary. This percentage is not standard but it is a subjective matter. It would be better to have lesser sentences in the output so as to evaluate your algorithm in a more precise manner. One thing should be noted here that we are not going to generate new sentences or information but just extracting the sentences from the input text that gives the central Idea about it. I genuinely suggest you not to go by No. of Words in the output summary concept as it may be the situation where you have to end up with incomplete meanings. Yes, this concept may be useful where you are going to generate new information keeping in mind the comments. But this approach involves checking of grammar, fluency of the generated language, concept representation, etc. Okay, does not matter we are going to generate summary on the basis of No. of sentences.

What’s the use of Comments ?

Considerable amount of research has been done till now on Text Summarization without considering Comments. Results were pretty good but not up to the mark as they fail to extract those sentences that may represent the central theme of the text. The main reason of including the comments into summarization is that most of the times they highly discuss the central Idea about the subject of the Text. But blog or text must have considerable no. of comments for better results. Consider the following figure ( Ref: Meishan Hu, Aixin Sun, and Ee-Peng Lim ) that gives the overview about the same.

I guess, you may have got the overview of what is Expected and how comments can be used to achieve that.

The Algorithm is as follows:

Phase I: Finding Dominant Comments that may represent Central Theme using

Topic Graph
Mention Graph
Named Entity Recognition

Phase II: Extracting Sentences on the basis of Dominant Comments from Phase I

Phase III: Finding the Quality of the Summary

Now, we will see each phase in details:

Phase I:

We know that all the comments are not useful enough to be used in the process of extraction of central theme about a blog. Let each comment has unique IDs from 1 to N.

Now, lets see some Graphs that can be further utilised in our algorithm.

i) Topic Graph

It shows how the comments are topically/contextually related with each other. That there is an edge between the two nodes iff they are topically related to each other. We can represent this graph using various Data Structures for example, 2D Array(Matrix), Array of Linked List, etc.

But how can we find whether two comments are topically related with each other or not ?
For that purpose, we can make use of Cosine Similarity. We are not going into deep about Cosine Similarity but you can refer Wikipedia page which explained the concept very well. It gives the angle between the lines drawn from two comments taken into consideration on the basis of common words. Lesser the angle more the similarity between the comments.

ii) Mention Graph

We may have reply from some other user to a comment. So, we can consider that such kind of comments are more related with each other. In Mention Graph, we consider only those comments that got one or more reply/ies from other users. There is an edge between X & a in the Mention Graph iff comment X got reply from comment a.

We can assign weightage to each graph. It would be better if we assign more weightage to Mention Graph than Topic, so that more dominant comments may get more scores. We have to see the generated output summary by varying these weightages. Make the weightages as constant after getting peak results.

For example, consider the second row from both the figures (a) & (b) for comment ID 2.

Topic: 0 0 1 1 = 0 + 0 + 1 + 1 = 2 * weightage(0.4) = 0.8

Mention: 1 0 0 0 = 1 + 0 + 0 + 0 = 1 * weightage(0.6) = 0.6

Therefore, Intermediate Score of comment 2 = 0.8 + 0.6 = 1.4

In this way, we calculate the Intermediate Scores of all the comments.

iii) Named Entity Recognition (NER)

I have read three research papers about Summarization using comments. But this feature is new that I added so as to produce better results. Consider our previous scenario about Java. As blog is about Java, it is obvious that Java may get compared with Python within the comments but there is nothing about Python in the blog text. Now, in this scenario, we may get dominant comments that may not be more related with the contents of the blog/text. For avoiding such issues, we consider Named Entities. Firstly, we will find out all the Nouns from the content of the blog known as BlogNounSet. After that, we have to find NERScore each comment on the basis of no. of nouns it contains from BlogNounSet. Then we merge the scores i.e. Intermediate Score and NERScore to get FinalScore for each comment by assigning proper weightages. For finding Named Entities, we can make use of Stanford POS Tagger.

After obtaining FinalScore for each comment, we have to select certain no. of dominant comments. This is the end of Phase I.

Phase II:

Now, we are having a set of all Dominant Comments (DC). In this phase, we have to extract sentences that are similar to the DC and hence we make use of Cosine Similarity. This phase is as follows:

For each sentence s from set of sentences S in the blog text
          total[s] = 0
For each sentence s from S
          For each comment c from DC
                  total[s] += Cosine_similarity( s, c )
         end of inner for loop
end for outer for loop
sort( total )                                // Sort in non-decreasing order
Print first n Sentences where n = 20% of S

Phase III:

Now, we got the summary in Phase II, but we must have something to evaluate whether our system generated summary is better or not. For that purpose, we use Kavita Ganeshan Rouge. Its a Perl script to evaluate our system generated summary by taking into consideration various manual generated summaries as benchmark. It makes use of n-grams for evaluation. In this way, we can check the performance of our Algorithm.

I hope you all EnJoY this post and please don’t forget to write comments or mails of appreciation. Your remarks are welcome.

Thanks a lot !!!!!!

References:

Comments-Oriented Document Summarization: Understanding Documents with Readers’ Feedback By Meishan Hu, Aixin Sun, and Ee-Peng Lim

Comments-Oriented Blog Summarization by Sentence Extraction By Meishan Hu, Aixin Sun and Ee-Peng Lim

Summarizing User-Contributed Comments By Elham Khabiri and James Caverlee and Chiao-Fang Hsu

Friday, 6 June 2014

TagMe

The current boom of the Web is associated with the revenues originated from online advertising. So, more emphasis nowadays, is being given to proper presentation of the Ads which then embodied into relevant pages. Almost all searching Giants like Google, Yahoo, Bing, etc. use Ads as the main source of revenues. Nowadays, advertisers put more efforts so as to make attractive Ads along with proper descriptions that would be quite helpful for making more profits both for advertiser as well as hosting websites.

So, it would be better if we can provide some more useful information about the words in the description. This is useful as far as users’ perspective is taken into consideration. Now, lets move to one of the most trending topic in the world of advertisement – “TagMe”.

What is TagMe ?

I guess, we all are familiar with tagging in FaceBook. In case of FB, we tag one of our friend’s name to a person’s face in the image. That is, we are adding more information about something. The same thing is related with advertisements. Given an Ad along with some description, we add some more relevant information about dominant words present in its description.

Consider, an Ad with some description as given below,

Auction for Bicycles
Description Auction of second hand Bicycles & Two Wheelers in the campus of International Institute of Information Technology, Hyderabad on the basis of English Auction.

Suppose, I am interested in such kind of Ads. From the above description, I surely point out three points/things.
i) Bicycles & Two Wheelers
ii) English Auction
iii) International Institute of Information Technology, Hyderabad.

Out of these three points, I can understand the first point but what about the remaining points?
I don’t have any kind of information about what is English Auction?
and Where is International Institute of Information Technology, Hyderabad?
Till now, this was about the perspective of Ad interested user/s.

Now, we will see from the Advertiser’s perspective.

Every advertiser would want that his/her Ad should be visible to more and more users and the depth/meaning of the Ad must be conveyed to them clearly and easily.
Now, what steps should the advertisers take in order to give more clarity for points (ii) & (iii) ?

We can add more informative information about these points. That is when user hovers his/her cursor on the text, it should provide more precise information about the text into consideration. Yes, this approach is effective and more feasible to implement.

To understand this concept more precisely, just google tagme, open the first link and put some words into the textbox and then click Tagme. Now, observe the results carefully to understand the concept.

As this blog is for the developers and the people with strong technical skills in the field of computer science, we are going to develop some more efficient algorithm for TagMe concept with heavy discussions.

Implementation Perspective

Certain important things must be pointed out in order to get clearer picture of what we have to implement.

i) From where we will get information about some words.
ii) Selection of particular words (keywords) for which more information has to be shown ( when hovered ).
Lets discuss these points one by one.

Point (i)

To resolve this problem, we must have certain DB from where we can get required information. But, the nature of this DB should be dynamic rather than static as information keeps changing every time. But the question is, which DB?

It would be better if we can get information from Wikipedia, which is stable, dynamic and more importantly it contains more Authentic Information. Yes, this matter is subjective and developer has full right to select its own repository.

Another important thing is that Wikipedia has provided very flexible API calls to fetch required data. Our aim is to show only overview that is the introductory part of Wikipedia page for the keyword.

Point (ii)

This is the most important phase of TagMe. In this phase, we have to make decision about selecting the keywords from the text. But how do we get to know that a particular word is a keyword and whether Wikipedia contains page for it or not ?

One way, I initially thought is that assign Parts of Speech (POS) tag to each of the words in the description of the Ad and extract all the noun words present in it. For these nouns, form the URL accordingly and get the information. This is very easy solution and it has some drawbacks. To understand this problem, consider our previous example of Auction Ad.

We tag all the words present in the description. Lets take into consideration the POS tags of International Institute of Information Technology, Hyderabad. Out of these words, Hyderabad is a noun. But the thing is that we must consider all these words as a single entity rather than individual. Now, looking into reverse way to this problem would give us the way to solve this ambiguity. Anyways, we have to get information from Wikipedia, so take into consideration those phrases (i.e. entities) that have a page ( as a title ) present in Wikipedia database. For doing so, create a file containing the title of all Wikipedia pages and note down the timestamp at which API call is made to fetch the titles. This timestamp will be useful when we want to update the titles of newly introduced pages into Wikipedia database through another API call to the File of Titles(FT).
Now, the whole problem is as follows:
i) Given an Ad description, search by taking into consideration ‘t’ words each time in FT for a match (entity).
ii) Form a URL for each of entity and get the required data from Wikipedia.

Data Structures

Since point(ii) form the base of TagMe, we will start with developing the same. The data structure for this phase should be Space and Search Efficient. So, I recommend Trie Data Structure. Each node consists of a char space along with 26 pointers equals to no. of alphabets in English. If you don’t have any knowledge about Trie, search for it on the web, as we are not going into depth of this Data Structure. Trie contains all the titles present in FT. Two things we are concerned about using Trie are: i) The height and ii) The size of Main Memory.

The height of Trie will be equal to the no. of chars in the longest title in FT. Obviously, we require more Main memory for accommodating the whole Trie. But this Main Memory investment in one time as, it would be rare to have page title more than 30 chars. When you go into the depth of Trie construction for titles, you will get to know that it is also memory efficient Our concentration is about reducing search time for getting proper entities in the description of Ad. Trie is more efficient for searching. Also, we can make use of some compression technique in order to reduce the size of the titles. Now, take one or more words ( depending upon your algorithm design) from Ad description and search into Trie to check whether the group of selected words forms a title of Wikipedia page. As soon as we got a match, extract required introductory information using Wiki API.

Consider an example to have a broader sense of understanding about what we have explained till now.

PG on sharing basis for Rs. 8000
Description Paying Guests required on sharing basis near IIIT area Hyderabad for Rs. 8000/-.

Also Wikipedia contains pages for IIIT, Hyderabad, India, IIT, Delhi, etc. We have all these titles in our FT. From FT, we are going to create Trie for all these titles. The Trie structure will like this:

Now, we have consider the Ad description. Starting with the word ‘Paying’, check it is there in the above structure or not. We will not find any words matched into this structure till IIIT. Note that this traversing not as easy but the programmer has to take care of this. As soon as we find a match for IIIT, form a URL accordingly and extract the interested information. Continue this procedure till all the words in the description gets over. This was about what I thought of solving this problem, may be someone can have better solution than this.

I request all of you who are having better solution than this to share your valuable ideas here.

I hope you all eNjOy this post.

Please do write the comments/mails of appreciation.
Thanks a lot !!!!

Thursday, 27 March 2014

Duplicate Elimination

Last week, I was brushing up some of the concepts of Database Management Systems and I came across the idea of implementing the keyword Distinct. Distinct keyword is used in order to find the tuples with distinct values of the desired attribute present in the table.

For example.

Select distinct(emp_fname) from employee;

This query will give the unique name of the employees present in the table employee.

Now, the question arises, what data structures to be used in order to implement the keyword distinct?

Data Structures that can be used are:

1) Hashing

2) B Trees

Let’s discuss, Duplicate Elimination with Hashing.

1) Hashing:

Hashing is one of the best data structure to handle such kind of queries. Following points play an important role during implementation:

i. Hash Function

ii. Size of the Hash Table

iii. Collision Avoidance

The kind of Hash Function to be used depends upon the type of the concerned attribute.

The Size of the Hash Table is usually kept high and it’s a Prime no. The reason of size being a prime no. is that it evenly distributes the attribute values in the table.

When the Hash Function gives same slot for two or more input value, we have to have some alternative in order to avoid the collisions. The collision avoidance scheme to be used depends upon the developer who decides on the basis of the nature of the given input data. One may use some other Hash Function when the collision is observed.

Please have a look of the following piece of code that I have written:

#include
#include
#include

struct hashNode
{
	long long int data;
	hashNode *link;		
};

struct list
{
	int noOfKeys;
 	hashNode *next;
};


struct listInfo
{
	list *listPointer;
	int tableSize;
	int collisionFactor;
};


list *createHash( int tableSize )
{
	list *hash;
	hash = new list[ tableSize ];
	
	for( int i = 0; i < tableSize; i++ )
	{
		hash[i].noOfKeys = 0;
		hash[i].next = NULL;
	}
	return hash;
}

hashNode *newNode()
{
	hashNode *temp = ( hashNode * )malloc( sizeof( hashNode ) );
	temp -> link = NULL;
	return temp;
}

int rehash( listInfo *listInfoObj, list *hash )
{
	long long int index;
	hashNode *trav, *temp;
	
	list *p = listInfoObj -> listPointer;
	int len = listInfoObj -> tableSize/2;
	
	for( int i = 0; i < len; i++ )
	{
		trav = p[i].next;
		while(trav)
		{
			index = trav -> data % listInfoObj -> tableSize;
			temp = newNode();
			temp -> data = trav -> data;
			temp -> link = hash[index].next;
			hash[index].next = temp;
			hash[index].noOfKeys++;	
			temp = trav;			
			trav = trav -> link;
			free(temp);
		}
	}
	return 0;
}

int getNext( long long int value, listInfo *listInfoObj )
{
	hashNode *trav;
	list *p, *hash;
	long long int index = value % listInfoObj -> tableSize;
	
	p = listInfoObj -> listPointer;
	
	if( p[index].noOfKeys >= listInfoObj -> collisionFactor )
	{
		listInfoObj -> tableSize *= 2;
		hash = createHash( listInfoObj -> tableSize );
		rehash( listInfoObj, hash );
		free(p);
		listInfoObj -> listPointer = p = hash;
		index = value % listInfoObj -> tableSize;
	}
	if( p[index].next == NULL ) 
	{
		p[index].next = newNode();
		p[index].next -> data = value;
		p[index].noOfKeys++;
	}
	else
	{
		trav = p[index].next;
		while( trav->link != NULL )
		{
			if( trav -> data == value )
				return 0;
			trav = trav -> link;
		}
		if( trav -> data == value )
			return 0;
		trav -> link = newNode();
		trav -> link -> data = value;
		p[index].noOfKeys++;
	}	
	return 1;
}

int displayHashTable( listInfo *listInfoObj )
{
	list *p;
	hashNode *trav;
	int count = 0;
	p = listInfoObj -> listPointer;
	
	for( int i = 0; i < listInfoObj -> tableSize; i++ )
	{
		trav = p[i].next;
		while( trav )
		{
			count++;
			printf("%lld ", trav -> data );
			trav = trav -> link;
		}
		printf("\n");
	}
	printf("No.of Unique Tuples: %d\n", count );
	return 0;
}

int main()
{
	char line[20];
	listInfo *listInfoObj;
	
	listInfoObj = ( listInfo * )malloc( sizeof( listInfo ) );
	listInfoObj -> tableSize = 93;
	listInfoObj -> collisionFactor = 11;
	
/*	printf("Enter The Size of The Hash Table and Collision Factor:\n");
	scanf("%d%d", &listInfoObj.tableSize, &listInfoObj.collisionFactor );
*/
	listInfoObj -> listPointer = createHash( listInfoObj -> tableSize );
	
	printf("Enter The Tuples:\n");	
	while( true )
	{
		scanf("%s", line );
		if( !strcmp( line, "-1" ) )
		{
			break;
		}
		long long int value = atoll(line);
		getNext( value, listInfoObj );
	}
	displayHashTable( listInfoObj );
	return 0;
}

Implementation Details:

The basic aim is to allocate memory dynamically and Linked List is the best example of that. Define an array of pointers of size ‘N’ equal to that of table that you desired of. In order to avoid collision, we can have linked list for each collision value. The size of this linked list can be adjusted to the variable value i.e. collisionFactor.

Let’s have a look of the Hash Table structure.

Fig. 1 Hash Table Structure

At a particular instant of time, we can have maximum of collisionFactor no. of keys having same slot given by the Hash Function.

In a database system, we may have to work with huge no. of records. So, whenever the no. of nodes in a particular linked list exceeds the limit (i.e. CollisionFactor), we have to go for Rehashing. In Rehashing, we create a hash table (2N Size) of double than the size of the previous one and all the keys that we stored in previous hash table are again hashed into the newer one.

So, this part is about Duplicate Elimination Using Hashing. In the next post, we will see implementation aspects of B Tree and how it can be used for eliminating duplicate tuples.

Saturday, 8 February 2014

FAT Chaining 2

In the previous post, we have seen the various Data Structures that the operating System uses in order to manage the files stored in the drives. Since, we are dealing with the system programming, it is important to know about Interrupts, Assembly Language, etc.

Anyways, if you don’t know that much about Interrupts, Assembly Language, the steps for developing the code are simpler to understand.

Let’s start with the practical implementation details.

Now, we want to read the file stored in Floppy Disk (FD). User will provide the name of the file through Command Line Interface (CLI).

Steps Involved:

1) Read the no. of sectors for:

i) FAT

ii) Root Directory (RD)

2) Find the RD entry for the given input file.

3) If RD entry found: Get the first cluster no. (Stored at 1AH offset in RD Entry) and read it. Also read FAT.

Else File Does not exist.

4) Now, read the remaining file by making use of FAT.

Now, we will elaborate these 4 steps in order to get into the rhythm.

Step 1:

Standard FD has predefined size of the Data Structures.

For instance,

Boot sector – 1 Sector

FAT – 9 Sectors

Copy of FAT – 9 Sectors

Root Directory – 14 Sectors

And then file area starts.

As these sizes are known previously, we are not going to retrieve them Boot Sector. Though, we can get the size of FAT from Number of sectors per FAT i.e. 16H offset & size of RD from Number of root-directory entries i.e. 11H offset of Boot Sector.

Note that: Size of RD = No. of RD Entries * 32

As size of each entry is 32 Bytes (See Fig. 3 of Previous Post).

Step 2:

User is going to provide the file name, which we have to search in the RD. For that purpose, we have to read the whole RD and then check first 11 Bytes (8 Bytes file name + 3 Bytes Extension) of each 32 Byte entries in it.

For reading RD, we are going to use INT 25H.

INT 25H (Absolute disk read)

AL = Drive Number (0 = A, 1 = B, etc.)

CX = Number of sectors to read

DX = Starting sector number

DS: BX = Segment: offset of buffer

Returns:

If function successful

Carry flag = clear

If function unsuccessful

Carry flag = set

AX = error code

MOV AL, 00

MOV BX, OFFSET BUFF

MOV CX, 14

MOV DX, 19

INT 25H

How does starting sector no. equal to 19?

RD comes after Boot Sector (1 Sector), FAT (9 Sectors), Copy of FAT (9 Sectors). So, 1+9+9 = 19 and from 19^th sector onwards RD starts (as first sector is sector 0).

After the execution of INT 25H, we will get whole RD into buffer BUFF and we start checking first 11 Bytes of each RD entry.

A legal 32 Byte entry does not start with E5H or 00H. So, during file name check, we have to take care of this thing.

Step 3:

As soon as we find the RD entry for which we were looking, read the first cluster no. at an offset 1AH in that entry. Read and display the first cluster. Now FAT comes into the picture for displaying the remaining clusters of the file.

For reading 9 sectors of FAT, we make use of INT 25H as described in step 2. Only change will be in the values of CX, DX and use some other buffer.

Note:

MOV CX, 09

MOV DX, 01

MOV BX, BUFF1

One this should be noted here that we are going to make use of INT 21H and INT 25H only.

Let’s have a look about the 09H and 4CH functions of INT 21H

Function: 09H

It writes a $-terminated string to standard output.

DS must point to the string's segment, and DX must contain the string's offset:

For example:

.DATA

STRING BYTE “THIS IS A STRING$”

.CODE

MOV AH, 09H

MOV DX, OFFSET STRING

INT 21H

Function: 4CH

Ends the current process and returns an optional 8-bit code to the calling process.

A return code of 0 usually indicates successful completion. It is same as EXIT 0.

MOV AX, 4C00H

INT 21H

;Here AL = 00, After the execution register AL (AX = AH || AL) will contain Error Code if present.

Step 4:

This step is more interesting as compared to other steps. We got first cluster of the required file and now we have to find the next cluster no. in order to read the remaining file. If you have observed the nomenclature of the FAT at the time formatting the Pen-Drive, you may have seen FAT12, FAT16 or something like nowadays NTFS (New Technology File System).

Now, What does this FATX represents? What is the significance of X, here?

X represents X-bit wide cluster entries. Now let’s convert X into Bytes.

If FAT12 then X = 1.5 Bytes or If FAT16 then X = 2 Bytes, etc. One thing should be noted that Floppy Disk is FAT12 (i.e. 1.5 Byte).

Let F be the first cluster no. obtained from RD entry of the given input file. Now, we multiply F by 1.5 in order to get the next cluster no. Here, comes the tricky point:

Let after multiplication i.e. 1.5 * F, the four nibbles be wxyz (i.e. 1.5F = wxyz). If 1.5F is even, consider last three nibbles otherwise first three nibbles.

Suppose, 1.5F is even, consider xyz and add first nibble as 0. So, the next cluster no. becomes 0xyz. Now, read the location value from FAT present at 0xyz. From this location, we have to read the next sector of the file. In the similar fashion, we can read remaining file. At any instant of point, the next cluster no. equals to or greater than 0FFFH, stop i.e. file is completely read. Here, we are getting the location of the next sectors to be read from FAT in continuation. This continuation is referred as “Chaining”.

Note: 1 Cluster = 2^i Sector, i = 0,1,2……

Now, try to understand the below ASM code, which is written on the basis of the above discussion. I strongly suggest you people to read “The Microsoft(R) Guide for Assembly Language and C Programmers By Ray Duncan” for better understanding of System Programming.

print macro msg
    mov ah,09h
    lea dx,msg
    int 21h
    endm

.model tiny
.code
    org 100h
begin:
    jmp start
    msg1 db 10d,13d,"Error in reading the root directory:$"
    msg2 db 10d,13d,"Error in reading the sector:$"
    msg3 db 10d,13d,"Found:$"
    msg4 db 10d,13d,"Not Found:$"
    msg5 db 10d,13d,"Error in reading FAT:$"
    buff db 7168 dup('$')
    buff1 db 4608 dup('$')
    buff2 db 513 dup('$')
    filename db 12 dup('$')
    str_cls dw ?
    temp dw ?
    cnt db 224
    count db ?
    num dw ?

start:
    mov count,08
    mov si,80h
    mov cl,[si]
    dec cl
    mov si,82h
    lea di,filename
l1:
    mov al,[si]
    cmp al,'.'
    je l2
    mov [di],al
    inc si
    inc di
    dec count
    dec cl
    jmp l1

l2:
    cmp count,00
    jz l4
    inc si
    dec cl

l3:
    mov [di],20h
    inc di
    dec count
    jnz l3

l4:
    mov count,03h

l5:
    mov al,[si]
    mov [di],al
    inc si
    inc di
    dec count
    dec cl
    jnz l5

    cmp count,00
    jnz l6
    jmp l7            ;filename completed

l6:
    mov [di],20h
    inc di
    dec count
    jnz l6

l7:
    mov al,00
    mov bx,offset buff
    mov cx,14
    mov dx,19
    int 25h
    jc error1
    add sp,02h

    lea si,buff
l8:
    mov al,[si]
    cmp al,0E5h
    je loop1
    cmp al,00
    je loop1
    jmp loop2

loop1:
    add si,0032
    dec cnt
    jnz l8

loop2:
lea di,filename
mov cx,11

l9:
    mov al,[si]
    cmp al,[di]
    je loop3
    add si,cx
    add si,21
    dec cnt
    jnz loop2
    jmp n_found
loop3:
    inc si
    inc di
    dec cl
    jnz l9
    jmp found

n_found:
    print msg4
    jmp end1

found:
    print msg3

    sub si,11
    add si,1Ah
    mov bl,[si]
    inc si
    mov bh,[si]
    mov str_cls,bx        ;first cluster

    mov al,00
    mov bx,offset buff1        ;buff1 contains FAT
    mov cx,09
    mov dx,01
    int 25h
    add sp,02
    jc error2
    mov ax,str_cls
    jmp en2

;FAT cluster reading

error1: jmp error11

en2:
    add ax,31
    mov dx,ax
    mov al,00
    mov cx,01
    mov bx,offset buff2
    int 25h
    jc error3
    add sp,02

    mov ah,09h
    lea dx,buff2        ;displaying cluster contents
    int 21h

    mov ax,str_cls
    mov bx,03
    mul bx
    mov bx,02    ;1.5 multiplication
    div bx

    lea si,buff1
    add si,ax
    mov bl,[si]
    inc si
    mov bh,[si]
    mov num,bx

    mov ax,str_cls
    and ax,01
    cmp ax,00
    je even1
    jmp odd1

even1:
    mov ax,num
    and ax,0fffh
    cmp ax,0fffh
    je end1
    mov str_cls,ax
    jmp en2

odd1:
    mov ax,num
    and ax,0fff0h
    mov cl,04
    shr ax,cl
    and ax,0fffh
    cmp ax,0fffh
    je end1
    mov str_cls,ax
    jmp en2

error11:
    print msg1
    jmp end1
error2:
    print msg5
    jmp end1
error3:
    print msg2
end1:
    mov ah,4ch
    int 21h
    end begin

I hope you people like this post.

Please do write comments !!!!

Wednesday, 5 February 2014

FAT Chaining

Most of the people will not understand the term ”FAT Chaining”. There are two things:
1) FAT
2) Chain
Yes, FAT is File Allocation Table and you will understand chaining at the end of this post. Basic aim of this post is to understand how a file gets retrieved when we click the icon or enter the name of the file on command prompt.
Suppose, we want to access a file present in a Floppy Disk. Now the question arises, Why are we retrieving the file from Floppy Disk and not from Hard Disk?
The reason is that we know the details of the Standard Data Structures of the Floppy Disk (FD). Here, data structures constitute to the following:
1) Boot Sector
2) File Allocation Table (FAT)
3) Root Directory
4) File Area
Now, we will not let ourselves only getting theoretical knowledge, instead we will develop Assembly Language Code for this. So, we divide this topic into two posts. This post will contain information about the above mentioned Data Structures and in the next post we will play with the Interrupts in order to develop the Assembly Level Code.
Now, let’s see how these things going to be used during file retrieval:
The whole area of the FD is divided among the above 4 Data Structures as shown below in Fig.1. Each MS-DOS logical volume is divided into several fixed-size control areas and a files area. The size of each control area depends on several factors like the size of the volume and the version of FORMAT used to initialize the volume, for example, but all of the information needed to interpret the structure of a particular logical volume can be found on the volume itself in the boot sector.

1) Boot Sector:

Boot sector is known as the logical sector 0, contains all of the critical information regarding the disk medium's characteristics (Fig.2).

We are not going into much detail about these fields as they are self explanatory. From Boot Sector, we get all the physical information about the Drive. Let say, we have to find the size of the logical space of the drive. This can be calculated by making use of:
1) Total sectors in logical volume and
2) Bytes per sector
2) File Allocation Table:

Each file's entry in a directory contains the number of the first cluster assigned to that file, which is used as an entry point into the FAT. From the entry point on, each FAT slot contains the cluster number of the next cluster in the file, until a last-cluster mark is encountered.
At the computer manufacturer's option, MS-DOS can maintain two or more identical copies of the FAT on each volume. MS-DOS updates all copies simultaneously whenever files are extended or the directory is modified.
If access to a sector in a FAT fails due to a read error, MS-DOS tries the other copies until a successful disk read is obtained or all copies are exhausted. Thus, if one copy of the FAT becomes unreadable due to wear or a software accident, the other copies may still make it possible to salvage the files on the disk. As part of its procedure for checking the integrity of a disk, the CHKDSK program compares the multiple copies (usually two) of the FAT to make sure they are all readable and consistent.
3) Root Directory:
Following the file allocation tables is an area known as the root directory. The root directory contains 32-byte entries that describe files, other directories, and the optional volume label (Fig. 3).
An entry beginning with the byte value E5H is available for reuse; it represents a file or directory that has been erased. An entry beginning with a null (zero) byte is the logical end-of-directory; that entry and all subsequent entries have never been used.

4) Files Area:

The remainder of the volume after the root directory is known as the files area. MS-DOS views the sectors in this area as a pool of clusters, each containing one or more logical sectors, depending on the disk format. Each cluster has a corresponding entry in the FAT that describes its current use: available, reserved, assigned to a file, or unusable (because of defects in the medium). Because the first two fields of the FAT are reserved, the first cluster in the files area is assigned the number 2.

Try to follow the next post which will be more interesting.

Thanks a lot !!!

Tuesday, 18 June 2013

Aria Subscription Billing System

Aria is a system founded by five Billing Experts, who have worked for LaserLink.net an internet service provider. Aria is basically a billing system which generates bill based upon the usage and the charges. For instance, mobile service providers may use Aria for generating bills of their subscribers based upon the usage of the services. Service Provider registers each subscriber of a service into Aria system with some information like Name, Address, Plans, Bill payment type, etc. Also the information like cost per unit of usage, required for generating the bill has to be provided at the time of user registration.

Aria information can be accessed through the website() or we can access it making http connections or REST calls. Some companies use Aria as a third party bill generating system use it by making REST calls.

For making these calls we have to provide some authentication information like Auth Key and Client No. in the url itself. Also with this information, we have to embed parameters, corresponding to the function that is called. Aria system provided some useful rest calls to access its information. There is proper documentation of these rest calls here .

For example,

URL Formation:

https://secure.future.stage.ariasystems.net/api/AriaQuery/objects.php?rest_call=get_acct_details_all&client_no=Client_No&auth_key=Auth_key&acct_no=Account_No&output_format=json

Here, https://secure.future.stage.ariasystems.net/api/AriaQuery/objects.php? Is the Base Url and we are making use of the function get_acct_details_all which requires three parameters:

client_no
auth_key
acct_no

Out of these parameters first two are necessary for making any Aria call. Each key-value pair in the url is separated by the delimiter '&'. Aria returns output in the form of Xml or as json.

ARIA Functions:

There are no. of various API functions provided by Aria. These functions are divided into some categories depending upon the functionality they perform. The important thing here is to notice that all functions except the functions with query as a parameter require, same base url.

Functions without query as a parameter:

Base Url:

https://secure.future.stage.ariasystems.net/api/ws/api_ws_class_dispatcher.php?

Functions with Query as a parameter:

The above functions take query string as a parameter-

get_account_details( username, password, limit, offset, query_string)
get_account_status_history(username, password, limit,offset, query_string)
get_account_plan _history( username, password, limit, offset, query_string)
get_payment_details( username, password, limit, offset, query_string)
get_order_details( username, password, limit, offset, query_string)
get_invoice_information( username, password, limit, offset, query_string)
get_transaction_information( username, password, limit, offset, query_string)
get_transaction_information( username, password, limit, offset, query_string)
get_coupon_history (client_no, auth_key, limit, offset, query_string)

where,

limit - The maximum number of objects that should be returned by this call.

offset - The number of records to skip. Note that both "0" and NULL will cause the interface not to skip any records.

query_string - The criteria which all returned objects must match. Different objects have a different .

Base Url: https://secure.future.stage.ariasystems.net/api/AriaQuery/objects.php?

Some information cannot be directly accessed by these calls, instead we have to write a query along with the function calls. Suppose we have to get information about all the accounts present under a client. For doing this no function is provided by Aria. So, we have to write a query “acct_no != 0” in encoded format like acct_no%20!%3D%200%3E.

For example,

https://secure.future.stage.ariasystems.net/api/AriaQuery/objects.php?rest_call=get_account_details_all&client_no=Client_No&auth_key=Auth_key&acct_no=Account_No&output_format=json&query_string=acct_no%20!%3D%200%3E

Here we have used acct_no as a condition parameter. Aria API Documentation has apparently given which parameters we can specify in the query for a particular function.

Instead of writing the code for accessing information, we can use Postman – Rest Client from the google chrome store which can provide the data in Xml or in Json format.

Based on the previous discussion, we have written the code to get data from Aria. Where we are getting data in the form Json string and then converting it into Json Object as shown.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;


public JsonObject getAriaResponse(String ariaUri,String ariaFuncString,String queryString)
{
 String responseStr;
 String clientNo = "Enter Client No";
 String authKey = "Enter Auth Key";
 JsonParser jParser = new JsonParser();
 JsonObject obj = new JsonObject();
 String UrlStr;
 
 try
 {         
  if(queryString != null)
   UrlStr = ariaUri + "rest_call=" + ariaFuncString+"&client_no="+clientNo+"&auth_key="+authKey+"&query_string="+queryString+"&output_format=json";
  else
   UrlStr = ariaUri + "rest_call=" + ariaFuncString+"&client_no="+clientNo+"&auth_key="+authKey+"&output_format=json";   
  
  System.out.println("URI: "+UrlStr);
  URL url = new URL(UrlStr);
  URLConnection urlc = url.openConnection();
  urlc.setDoOutput(true);
  urlc.setAllowUserInteraction(false);
  PrintStream ps = new PrintStream(urlc.getOutputStream());
  ps.close();
  BufferedReader br = new BufferedReader(new InputStreamReader(urlc.getInputStream()));  
  responseStr = br.readLine();
  obj = jParser.parse(responseStr).getAsJsonObject(); 
 }
 catch(Exception e)
 {
  e.printStackTrace();
 }
 return obj;
}