Friday 6 June 2014

TagMe

The current boom of the Web is associated with the revenues originated from online advertising. So, more emphasis nowadays, is being given to proper presentation of the Ads which then embodied into relevant pages. Almost all searching Giants like Google, Yahoo, Bing, etc. use Ads as the main source of revenues. Nowadays, advertisers put more efforts so as to make attractive Ads along with proper descriptions that would be quite helpful for making more profits both for advertiser as well as hosting websites.

So, it would be better if we can provide some more useful information about the words in the description. This is useful as far as users’ perspective is taken into consideration. Now, lets move to one of the most trending topic in the world of advertisement – “TagMe”.

What is TagMe ?

I guess, we all are familiar with tagging in FaceBook. In case of FB, we tag one of our friend’s name to a person’s face in the image. That is, we are adding more information about something. The same thing is related with advertisements. Given an Ad along with some description, we add some more relevant information about dominant words present in its description.

Consider, an Ad with some description as given below,

Auction for Bicycles
Description Auction of second hand Bicycles & Two Wheelers in the campus of International Institute of Information Technology, Hyderabad on the basis of English Auction.

Suppose, I am interested in such kind of Ads. From the above description, I surely point out three points/things.
i) Bicycles & Two Wheelers
ii) English Auction
iii) International Institute of Information Technology, Hyderabad.

Out of these three points, I can understand the first point but what about the remaining points?
I don’t have any kind of information about what is English Auction?
and Where is International Institute of Information Technology, Hyderabad?
Till now, this was about the perspective of Ad interested user/s.

Now, we will see from the Advertiser’s perspective.

Every advertiser would want that his/her Ad should be visible to more and more users and the depth/meaning of the Ad must be conveyed to them clearly and easily.
Now, what steps should the advertisers take in order to give more clarity for points (ii) & (iii) ?

We can add more informative information about these points. That is when user hovers his/her cursor on the text, it should provide more precise information about the text into consideration. Yes, this approach is effective and more feasible to implement.

To understand this concept more precisely, just google tagme, open the first link and put some words into the textbox and then click Tagme. Now, observe the results carefully to understand the concept.

As this blog is for the developers and the people with strong technical skills in the field of computer science, we are going to develop some more efficient algorithm for TagMe concept with heavy discussions.

Implementation Perspective 

Certain important things must be pointed out in order to get clearer picture of what we have to implement.

i) From where we will get information about some words.
ii) Selection of particular words (keywords) for which more information has to be shown ( when hovered ).
Lets discuss these points one by one.

Point (i)

To resolve this problem, we must have certain DB from where we can get required information. But, the nature of this DB should be dynamic rather than static as information keeps changing every time. But the question is, which DB?

It would be better if we can get information from Wikipedia, which is stable, dynamic and more importantly it contains more Authentic Information. Yes, this matter is subjective and developer has full right to select its own repository.

Another important thing is that Wikipedia has provided very flexible API calls to fetch required data. Our aim is to show only overview that is the introductory part of Wikipedia page for the keyword.

Point (ii)

This is the most important phase of TagMe. In this phase, we have to make decision about selecting the keywords from the text. But how do we get to know that a particular word is a keyword and whether Wikipedia contains page for it or not ?

One way, I initially thought is that assign Parts of Speech (POS) tag to each of the words in the description of the Ad and extract all the noun words present in it. For these nouns, form the URL accordingly and get the information. This is very easy solution and it has some drawbacks. To understand this problem, consider our previous example of Auction Ad.

We tag all the words present in the description. Lets take into consideration the POS tags of International Institute of Information Technology, Hyderabad. Out of these words, Hyderabad is a noun. But the thing is that we must consider all these words as a single entity rather than individual. Now, looking into reverse way to this problem would give us the way to solve this ambiguity. Anyways, we have to get information from Wikipedia, so take into consideration those phrases (i.e. entities) that have a page ( as a title ) present in Wikipedia database. For doing so, create a file containing the title of all Wikipedia pages and note down the timestamp at which API call is made to fetch the titles. This timestamp will be useful when we want to update the titles of newly introduced pages into Wikipedia database through another API call to the File of Titles(FT).
Now, the whole problem is as follows:
i) Given an Ad description, search by taking into consideration ‘t’ words each time in FT for a match (entity).
ii) Form a URL for each of entity and get the required data from Wikipedia.

Data Structures

Since point(ii) form the base of TagMe, we will start with developing the same. The data structure for this phase should be Space and Search Efficient. So, I recommend Trie Data  Structure. Each node consists of a char space along with 26 pointers equals to no. of alphabets in English. If you don’t have any knowledge about Trie, search for it on the web, as we are not going into depth of this Data Structure. Trie contains all the titles present in FT. Two things we are concerned about using Trie are: i) The height and ii) The size of Main Memory.

The height of Trie will be equal to the no. of chars in the longest title in FT. Obviously, we require more Main memory for accommodating the whole Trie. But this Main Memory investment in one time as, it would be rare to have page title more than 30 chars. When you go into the depth of Trie construction for titles, you will get to know that it is also memory efficient Our concentration is about reducing search time for getting proper entities in the description of Ad. Trie is more efficient for searching. Also, we can make use of some compression technique in order to reduce the size of the titles. Now, take one or more words ( depending upon your algorithm design) from Ad description and search into Trie to check whether the group of selected words forms a title of Wikipedia page. As soon as we got a match, extract required introductory information using Wiki API.

Consider an example to have a broader sense of understanding about what we have explained till now.

PG on sharing basis for Rs. 8000
Description Paying Guests required on sharing basis near IIIT area Hyderabad for Rs. 8000/-.

Also Wikipedia contains pages for IIIT, Hyderabad, India, IIT, Delhi, etc. We have all these titles in our FT. From FT, we are going to create Trie for all these titles. The Trie structure will like this:

 

image

Now, we have consider the Ad description. Starting  with the word ‘Paying’, check it is there in the above structure or not. We will not find any words matched into this structure till IIIT. Note that this traversing not as easy  but the programmer has to take care of this. As soon as we find a match for IIIT, form a URL accordingly and extract the interested information. Continue this procedure till all the words in the description gets over. This was about what I thought of solving this problem, may be someone can have better solution than this.

I request all of you who are having better solution than this to share your valuable ideas here.

I hope you all eNjOy this post.

Please do write the comments/mails of appreciation.
Thanks a lot !!!!