Tuesday, 1 July 2014

Comment Oriented Blog Summarizer

 

This was the part of my course subject Information Retrieval And Extraction during my Masters. It was selected as one of the best projects for the same course. So, I want to share some information about the Implementation and Research Perspective(may be some people can expand the idea). I have done some kind of studies and embodied some new features into Summarization of the blogs on the basis of Comments.

Word of Caution: The above Proposed Method can be useful only for the blogs that have some proper Informative comments, for example, TechCrunch.com blogs.

Now lets start with understanding the Title itself,
What is Summarization of a Text?

Lets say, we are having a Document(Text)  consisting some Sentences. Now, the task of Summarization is to extract a subset of sentences from D, denoted by Sc (Sc ⊂ D), that best represents the topic(s) presented in D.

But in Comment Oriented Text/Blog summarization, we make additional use of Comments that ultimately, increases the efficiency of the output.

I hope, you may got the abstract of what we have to implement.

Scenario Formation

Lets consider a scenario where we have a text of 100 sentences on Java and now we have summarize it. But one may think about how many Sentences or Words are expected in the output summarized text ?

Usually, 20% (  20 sentences ) of the input text sentences are expected in the summary. This percentage is not standard but it is a subjective matter. It would be better to have lesser sentences in the output so as to evaluate your algorithm in a more precise manner. One thing should be noted here that we are not going to generate new sentences or information but just extracting the sentences from the input text that gives the central Idea about it. I genuinely suggest you not to go by No. of Words in the output summary concept as it may be the situation where you have to end up with incomplete meanings. Yes, this concept may be useful where you are going to generate new information keeping in mind the comments. But this approach involves checking of grammar, fluency of the generated language, concept representation, etc. Okay, does not matter we are going to generate summary on the basis of No. of sentences.

What’s the use of Comments ?

Considerable amount of research has been done till now on Text Summarization without considering Comments. Results were pretty good but not up to the mark as they fail to extract those sentences that may represent the central theme of the text. The main reason of including the comments into summarization is that most of the times they highly discuss the central Idea about the subject of the Text. But blog or text must have considerable no. of comments for better results. Consider the following figure ( Ref: Meishan Hu, Aixin Sun, and Ee-Peng Lim ) that gives the overview about the same.

image

I guess, you may have got the overview of what is Expected and how comments can be used to achieve that.

The Algorithm is as follows:

Phase I: Finding Dominant Comments that may represent Central Theme using

  • Topic Graph
  • Mention Graph
  • Named Entity Recognition

Phase II: Extracting Sentences on the basis of Dominant Comments from Phase I

Phase III: Finding the Quality of the Summary

Now, we will see each phase in details:

Phase I:

We know that all the comments are not useful enough to be used in the process of extraction of central theme about a blog. Let each comment has unique IDs from 1 to N.

Now, lets see some Graphs that can be further utilised in our algorithm.

i) Topic Graph

It shows how the comments are topically/contextually related with each other. That there is an edge between the two nodes iff they are topically related to each other. We can represent this graph using various Data Structures for example, 2D Array(Matrix), Array of Linked List, etc.

But how can we find whether two comments are topically related with each other or not ?
For that purpose, we can make use of Cosine Similarity. We are not going into deep about Cosine Similarity but you can refer Wikipedia page which explained the concept very well. It gives the angle between the lines drawn from two comments taken into consideration on the basis of common words. Lesser the angle more the similarity between the comments.

 

image

ii) Mention Graph

We may have reply from some other user to a comment. So, we can consider that such kind of comments are more related with each other. In Mention Graph, we consider only those comments that got one or more reply/ies from other users. There is an edge between X & a in the Mention Graph iff comment X got reply from comment a.

We can assign weightage to each graph. It would be better if we assign more weightage to Mention Graph than Topic, so that more dominant comments may get more scores. We have to see the generated output summary by varying these weightages. Make the weightages as constant after getting peak results.

For example, consider the second row from both the figures (a) & (b) for comment ID 2.

Topic:     0 0 1 1 = 0 + 0 + 1 + 1 = 2 * weightage(0.4) = 0.8

Mention: 1 0 0 0 = 1 + 0 + 0 + 0 = 1 * weightage(0.6) = 0.6

Therefore, Intermediate Score of comment 2 = 0.8 + 0.6 = 1.4

In this way, we calculate the Intermediate Scores of all the comments.

iii) Named Entity Recognition (NER)

I have read three research papers about Summarization using comments. But this feature is new that I added so as to produce better results. Consider our previous scenario about Java. As blog is about Java, it is obvious that Java may get compared with Python within the comments but there is nothing about Python in the blog text. Now, in this scenario, we may get dominant comments that may not be more related with the contents of the blog/text. For avoiding such issues, we consider Named Entities. Firstly, we will find out all the Nouns from the content of the blog known as BlogNounSet. After that, we have to find NERScore each comment on the basis of no. of nouns it contains from BlogNounSet. Then we merge the scores i.e. Intermediate Score and NERScore to get FinalScore for each comment by assigning proper weightages. For finding Named Entities, we can make use of Stanford POS Tagger.

After obtaining FinalScore for each comment, we have to select certain no. of dominant comments. This is the end of Phase I.

Phase II:

Now, we are having a set of all Dominant Comments (DC). In this phase, we have to extract sentences that are similar to the DC and hence we make use of Cosine Similarity. This phase is as follows:

For each sentence s from set of sentences S in the blog text
          total[s] = 0
For each sentence s from S

          For each comment c  from DC
                  total[s] += Cosine_similarity( s, c )
         end of inner for loop
end for outer for loop
sort( total )                                // Sort in non-decreasing order
Print first n Sentences where n = 20% of S

Phase III:

Now, we got the summary in Phase II, but we must have something to evaluate whether our system generated summary is better or not. For that purpose, we use Kavita Ganeshan Rouge. Its a Perl script to evaluate our system generated summary by taking into consideration various manual generated summaries as benchmark. It makes use of n-grams for evaluation. In this way, we can check the performance of our Algorithm.

I hope you all EnJoY this post and please don’t forget to write comments or mails of appreciation. Your remarks are welcome.

Thanks a lot !!!!!!

References:

Comments-Oriented Document Summarization: Understanding Documents with Readers’ Feedback By Meishan Hu, Aixin Sun, and Ee-Peng Lim

Comments-Oriented Blog Summarization by Sentence Extraction By Meishan Hu, Aixin Sun and Ee-Peng Lim

Summarizing User-Contributed Comments By Elham Khabiri and James Caverlee and Chiao-Fang Hsu

Friday, 6 June 2014

TagMe

The current boom of the Web is associated with the revenues originated from online advertising. So, more emphasis nowadays, is being given to proper presentation of the Ads which then embodied into relevant pages. Almost all searching Giants like Google, Yahoo, Bing, etc. use Ads as the main source of revenues. Nowadays, advertisers put more efforts so as to make attractive Ads along with proper descriptions that would be quite helpful for making more profits both for advertiser as well as hosting websites.

So, it would be better if we can provide some more useful information about the words in the description. This is useful as far as users’ perspective is taken into consideration. Now, lets move to one of the most trending topic in the world of advertisement – “TagMe”.

What is TagMe ?

I guess, we all are familiar with tagging in FaceBook. In case of FB, we tag one of our friend’s name to a person’s face in the image. That is, we are adding more information about something. The same thing is related with advertisements. Given an Ad along with some description, we add some more relevant information about dominant words present in its description.

Consider, an Ad with some description as given below,

Auction for Bicycles
Description Auction of second hand Bicycles & Two Wheelers in the campus of International Institute of Information Technology, Hyderabad on the basis of English Auction.

Suppose, I am interested in such kind of Ads. From the above description, I surely point out three points/things.
i) Bicycles & Two Wheelers
ii) English Auction
iii) International Institute of Information Technology, Hyderabad.

Out of these three points, I can understand the first point but what about the remaining points?
I don’t have any kind of information about what is English Auction?
and Where is International Institute of Information Technology, Hyderabad?
Till now, this was about the perspective of Ad interested user/s.

Now, we will see from the Advertiser’s perspective.

Every advertiser would want that his/her Ad should be visible to more and more users and the depth/meaning of the Ad must be conveyed to them clearly and easily.
Now, what steps should the advertisers take in order to give more clarity for points (ii) & (iii) ?

We can add more informative information about these points. That is when user hovers his/her cursor on the text, it should provide more precise information about the text into consideration. Yes, this approach is effective and more feasible to implement.

To understand this concept more precisely, just google tagme, open the first link and put some words into the textbox and then click Tagme. Now, observe the results carefully to understand the concept.

As this blog is for the developers and the people with strong technical skills in the field of computer science, we are going to develop some more efficient algorithm for TagMe concept with heavy discussions.

Implementation Perspective 

Certain important things must be pointed out in order to get clearer picture of what we have to implement.

i) From where we will get information about some words.
ii) Selection of particular words (keywords) for which more information has to be shown ( when hovered ).
Lets discuss these points one by one.

Point (i)

To resolve this problem, we must have certain DB from where we can get required information. But, the nature of this DB should be dynamic rather than static as information keeps changing every time. But the question is, which DB?

It would be better if we can get information from Wikipedia, which is stable, dynamic and more importantly it contains more Authentic Information. Yes, this matter is subjective and developer has full right to select its own repository.

Another important thing is that Wikipedia has provided very flexible API calls to fetch required data. Our aim is to show only overview that is the introductory part of Wikipedia page for the keyword.

Point (ii)

This is the most important phase of TagMe. In this phase, we have to make decision about selecting the keywords from the text. But how do we get to know that a particular word is a keyword and whether Wikipedia contains page for it or not ?

One way, I initially thought is that assign Parts of Speech (POS) tag to each of the words in the description of the Ad and extract all the noun words present in it. For these nouns, form the URL accordingly and get the information. This is very easy solution and it has some drawbacks. To understand this problem, consider our previous example of Auction Ad.

We tag all the words present in the description. Lets take into consideration the POS tags of International Institute of Information Technology, Hyderabad. Out of these words, Hyderabad is a noun. But the thing is that we must consider all these words as a single entity rather than individual. Now, looking into reverse way to this problem would give us the way to solve this ambiguity. Anyways, we have to get information from Wikipedia, so take into consideration those phrases (i.e. entities) that have a page ( as a title ) present in Wikipedia database. For doing so, create a file containing the title of all Wikipedia pages and note down the timestamp at which API call is made to fetch the titles. This timestamp will be useful when we want to update the titles of newly introduced pages into Wikipedia database through another API call to the File of Titles(FT).
Now, the whole problem is as follows:
i) Given an Ad description, search by taking into consideration ‘t’ words each time in FT for a match (entity).
ii) Form a URL for each of entity and get the required data from Wikipedia.

Data Structures

Since point(ii) form the base of TagMe, we will start with developing the same. The data structure for this phase should be Space and Search Efficient. So, I recommend Trie Data  Structure. Each node consists of a char space along with 26 pointers equals to no. of alphabets in English. If you don’t have any knowledge about Trie, search for it on the web, as we are not going into depth of this Data Structure. Trie contains all the titles present in FT. Two things we are concerned about using Trie are: i) The height and ii) The size of Main Memory.

The height of Trie will be equal to the no. of chars in the longest title in FT. Obviously, we require more Main memory for accommodating the whole Trie. But this Main Memory investment in one time as, it would be rare to have page title more than 30 chars. When you go into the depth of Trie construction for titles, you will get to know that it is also memory efficient Our concentration is about reducing search time for getting proper entities in the description of Ad. Trie is more efficient for searching. Also, we can make use of some compression technique in order to reduce the size of the titles. Now, take one or more words ( depending upon your algorithm design) from Ad description and search into Trie to check whether the group of selected words forms a title of Wikipedia page. As soon as we got a match, extract required introductory information using Wiki API.

Consider an example to have a broader sense of understanding about what we have explained till now.

PG on sharing basis for Rs. 8000
Description Paying Guests required on sharing basis near IIIT area Hyderabad for Rs. 8000/-.

Also Wikipedia contains pages for IIIT, Hyderabad, India, IIT, Delhi, etc. We have all these titles in our FT. From FT, we are going to create Trie for all these titles. The Trie structure will like this:

 

image

Now, we have consider the Ad description. Starting  with the word ‘Paying’, check it is there in the above structure or not. We will not find any words matched into this structure till IIIT. Note that this traversing not as easy  but the programmer has to take care of this. As soon as we find a match for IIIT, form a URL accordingly and extract the interested information. Continue this procedure till all the words in the description gets over. This was about what I thought of solving this problem, may be someone can have better solution than this.

I request all of you who are having better solution than this to share your valuable ideas here.

I hope you all eNjOy this post.

Please do write the comments/mails of appreciation.
Thanks a lot !!!!

Thursday, 27 March 2014

Duplicate Elimination

Last week, I was brushing up some of the concepts of Database Management Systems and I came across the idea of implementing the keyword Distinct. Distinct keyword is used in order to find the tuples with distinct values of the desired attribute present in the table.

For example.

Select distinct(emp_fname) from employee;

This query will give the unique name of the employees present in the table employee.

Now, the question arises, what data structures to be used in order to implement the keyword distinct?

Data Structures that can be used are:

1) Hashing

2) B Trees

Let’s discuss, Duplicate Elimination with Hashing.

1) Hashing:

Hashing is one of the best data structure to handle such kind of queries. Following points play an important role during implementation:

i. Hash Function

ii. Size of the Hash Table

iii. Collision Avoidance

The kind of Hash Function to be used depends upon the type of the concerned attribute.

The Size of the Hash Table is usually kept high and it’s a Prime no. The reason of size being a prime no. is that it evenly distributes the attribute values in the table.

When the Hash Function gives same slot for two or more input value, we have to have some alternative in order to avoid the collisions. The collision avoidance scheme to be used depends upon the developer who decides on the basis of the nature of the given input data. One may use some other Hash Function when the collision is observed.

Please have a look of the following piece of code that I have written:

#include
#include
#include

struct hashNode
{
long long int data;
hashNode *link;
};

struct list
{
int noOfKeys;
hashNode *next;
};


struct listInfo
{
list *listPointer;
int tableSize;
int collisionFactor;
};


list *createHash( int tableSize )
{
list *hash;
hash = new list[ tableSize ];

for( int i = 0; i < tableSize; i++ )
{
hash[i].noOfKeys = 0;
hash[i].next = NULL;
}
return hash;
}

hashNode *newNode()
{
hashNode *temp = ( hashNode * )malloc( sizeof( hashNode ) );
temp -> link = NULL;
return temp;
}

int rehash( listInfo *listInfoObj, list *hash )
{
long long int index;
hashNode *trav, *temp;

list *p = listInfoObj -> listPointer;
int len = listInfoObj -> tableSize/2;

for( int i = 0; i < len; i++ )
{
trav = p[i].next;
while(trav)
{
index = trav -> data % listInfoObj -> tableSize;
temp = newNode();
temp -> data = trav -> data;
temp -> link = hash[index].next;
hash[index].next = temp;
hash[index].noOfKeys++;
temp = trav;
trav = trav -> link;
free(temp);
}
}
return 0;
}

int getNext( long long int value, listInfo *listInfoObj )
{
hashNode *trav;
list *p, *hash;
long long int index = value % listInfoObj -> tableSize;

p = listInfoObj -> listPointer;

if( p[index].noOfKeys >= listInfoObj -> collisionFactor )
{
listInfoObj -> tableSize *= 2;
hash = createHash( listInfoObj -> tableSize );
rehash( listInfoObj, hash );
free(p);
listInfoObj -> listPointer = p = hash;
index = value % listInfoObj -> tableSize;
}
if( p[index].next == NULL )
{
p[index].next = newNode();
p[index].next -> data = value;
p[index].noOfKeys++;
}
else
{
trav = p[index].next;
while( trav->link != NULL )
{
if( trav -> data == value )
return 0;
trav = trav -> link;
}
if( trav -> data == value )
return 0;
trav -> link = newNode();
trav -> link -> data = value;
p[index].noOfKeys++;
}
return 1;
}

int displayHashTable( listInfo *listInfoObj )
{
list *p;
hashNode *trav;
int count = 0;
p = listInfoObj -> listPointer;

for( int i = 0; i < listInfoObj -> tableSize; i++ )
{
trav = p[i].next;
while( trav )
{
count++;
printf("%lld ", trav -> data );
trav = trav -> link;
}
printf("\n");
}
printf("No.of Unique Tuples: %d\n", count );
return 0;
}

int main()
{
char line[20];
listInfo *listInfoObj;

listInfoObj = ( listInfo * )malloc( sizeof( listInfo ) );
listInfoObj -> tableSize = 93;
listInfoObj -> collisionFactor = 11;

/* printf("Enter The Size of The Hash Table and Collision Factor:\n");
scanf("%d%d", &listInfoObj.tableSize, &listInfoObj.collisionFactor );
*/
listInfoObj -> listPointer = createHash( listInfoObj -> tableSize );

printf("Enter The Tuples:\n");
while( true )
{
scanf("%s", line );
if( !strcmp( line, "-1" ) )
{
break;
}
long long int value = atoll(line);
getNext( value, listInfoObj );
}
displayHashTable( listInfoObj );
return 0;
}

Implementation Details:

The basic aim is to allocate memory dynamically and Linked List is the best example of that. Define an array of pointers of size ‘N’ equal to that of table that you desired of. In order to avoid collision, we can have linked list for each collision value. The size of this linked list can be adjusted to the variable value i.e. collisionFactor.

Let’s have a look of the Hash Table structure.

clip_image002


Fig. 1 Hash Table Structure

At a particular instant of time, we can have maximum of collisionFactor no. of keys having same slot given by the Hash Function.

In a database system, we may have to work with huge no. of records. So, whenever the no. of nodes in a particular linked list exceeds the limit (i.e. CollisionFactor), we have to go for Rehashing. In Rehashing, we create a hash table (2N Size) of double than the size of the previous one and all the keys that we stored in previous hash table are again hashed into the newer one.

So, this part is about Duplicate Elimination Using Hashing. In the next post, we will see implementation aspects of B Tree and how it can be used for eliminating duplicate tuples.

Saturday, 8 February 2014

FAT Chaining 2

In the previous post, we have seen the various Data Structures that the operating System uses in order to manage the files stored in the drives. Since, we are dealing with the system programming, it is important to know about Interrupts, Assembly Language, etc.

Anyways, if you don’t know that much about Interrupts, Assembly Language, the steps for developing the code are simpler to understand.

Let’s start with the practical implementation details.

Now, we want to read the file stored in Floppy Disk (FD). User will provide the name of the file through Command Line Interface (CLI).

Steps Involved:

1) Read the no. of sectors for:

i) FAT

ii) Root Directory (RD)

2) Find the RD entry for the given input file.

3) If RD entry found: Get the first cluster no. (Stored at 1AH offset in RD Entry) and read it. Also read FAT.

Else File Does not exist.

4) Now, read the remaining file by making use of FAT.

Now, we will elaborate these 4 steps in order to get into the rhythm.

Step 1:

Standard FD has predefined size of the Data Structures.

For instance,

Boot sector – 1 Sector

FAT – 9 Sectors

Copy of FAT – 9 Sectors

Root Directory – 14 Sectors

And then file area starts.

As these sizes are known previously, we are not going to retrieve them Boot Sector. Though, we can get the size of FAT from Number of sectors per FAT i.e. 16H offset & size of RD from Number of root-directory entries i.e. 11H offset of Boot Sector.

Note that: Size of RD = No. of RD Entries * 32

As size of each entry is 32 Bytes (See Fig. 3 of Previous Post).

Step 2:

User is going to provide the file name, which we have to search in the RD. For that purpose, we have to read the whole RD and then check first 11 Bytes (8 Bytes file name + 3 Bytes Extension) of each 32 Byte entries in it.

For reading RD, we are going to use INT 25H.

INT 25H (Absolute disk read)

AL = Drive Number (0 = A, 1 = B, etc.)

CX = Number of sectors to read

DX = Starting sector number

DS: BX = Segment: offset of buffer

Returns:

If function successful

Carry flag = clear

If function unsuccessful

Carry flag = set

AX = error code

MOV AL, 00

MOV BX, OFFSET BUFF

MOV CX, 14

MOV DX, 19

INT 25H

How does starting sector no. equal to 19?

RD comes after Boot Sector (1 Sector), FAT (9 Sectors), Copy of FAT (9 Sectors). So, 1+9+9 = 19 and from 19th sector onwards RD starts (as first sector is sector 0).

After the execution of INT 25H, we will get whole RD into buffer BUFF and we start checking first 11 Bytes of each RD entry.

A legal 32 Byte entry does not start with E5H or 00H. So, during file name check, we have to take care of this thing.

Step 3:

As soon as we find the RD entry for which we were looking, read the first cluster no. at an offset 1AH in that entry. Read and display the first cluster. Now FAT comes into the picture for displaying the remaining clusters of the file.

For reading 9 sectors of FAT, we make use of INT 25H as described in step 2. Only change will be in the values of CX, DX and use some other buffer.

Note:

MOV CX, 09

MOV DX, 01

MOV BX, BUFF1

One this should be noted here that we are going to make use of INT 21H and INT 25H only.

Let’s have a look about the 09H and 4CH functions of INT 21H

Function: 09H

It writes a $-terminated string to standard output.

DS must point to the string's segment, and DX must contain the string's offset:

For example:

.DATA

STRING BYTE “THIS IS A STRING$”

.CODE

MOV AH, 09H

MOV DX, OFFSET STRING

INT 21H

Function: 4CH

Ends the current process and returns an optional 8-bit code to the calling process.

A return code of 0 usually indicates successful completion. It is same as EXIT 0.

MOV AX, 4C00H

INT 21H

;Here AL = 00, After the execution register AL (AX = AH || AL) will contain Error Code if present.

Step 4:

This step is more interesting as compared to other steps. We got first cluster of the required file and now we have to find the next cluster no. in order to read the remaining file. If you have observed the nomenclature of the FAT at the time formatting the Pen-Drive, you may have seen FAT12, FAT16 or something like nowadays NTFS (New Technology File System).

Now, What does this FATX represents? What is the significance of X, here?

X represents X-bit wide cluster entries. Now let’s convert X into Bytes.

If FAT12 then X = 1.5 Bytes or If FAT16 then X = 2 Bytes, etc. One thing should be noted that Floppy Disk is FAT12 (i.e. 1.5 Byte).

Let F be the first cluster no. obtained from RD entry of the given input file. Now, we multiply F by 1.5 in order to get the next cluster no. Here, comes the tricky point:

Let after multiplication i.e. 1.5 * F, the four nibbles be wxyz (i.e. 1.5F = wxyz). If 1.5F is even, consider last three nibbles otherwise first three nibbles.

Suppose, 1.5F is even, consider xyz and add first nibble as 0. So, the next cluster no. becomes 0xyz. Now, read the location value from FAT present at 0xyz. From this location, we have to read the next sector of the file. In the similar fashion, we can read remaining file. At any instant of point, the next cluster no. equals to or greater than 0FFFH, stop i.e. file is completely read. Here, we are getting the location of the next sectors to be read from FAT in continuation. This continuation is referred as “Chaining”.

Note: 1 Cluster = 2^i Sector, i = 0,1,2……

Now, try to understand the below ASM code, which is written on the basis of the above discussion. I strongly suggest you people to read “The Microsoft(R) Guide for Assembly Language and C Programmers By Ray Duncan” for better understanding of System Programming.

print macro msg
    mov ah,09h
    lea dx,msg
    int 21h
    endm

.model tiny
.code
    org 100h
begin:
    jmp start
    msg1 db 10d,13d,"Error in reading the root directory:$"
    msg2 db 10d,13d,"Error in reading the sector:$"
    msg3 db 10d,13d,"Found:$"
    msg4 db 10d,13d,"Not Found:$"
    msg5 db 10d,13d,"Error in reading FAT:$"
    buff db 7168 dup('$')
    buff1 db 4608 dup('$')
    buff2 db 513 dup('$')
    filename db 12 dup('$')
    str_cls dw ?
    temp dw ?
    cnt db 224
    count db ?
    num dw ?

start:
    mov count,08
    mov si,80h
    mov cl,[si]
    dec cl
    mov si,82h
    lea di,filename
l1:
    mov al,[si]
    cmp al,'.'
    je l2
    mov [di],al
    inc si
    inc di
    dec count
    dec cl
    jmp l1

l2:
    cmp count,00
    jz l4
    inc si
    dec cl

l3:
    mov [di],20h
    inc di
    dec count
    jnz l3

l4:
    mov count,03h
   
l5:
    mov al,[si]
    mov [di],al
    inc si
    inc di
    dec count
    dec cl
    jnz l5

    cmp count,00
    jnz l6
    jmp l7            ;filename completed

l6:
    mov [di],20h
    inc di
    dec count
    jnz l6   
   
l7:
    mov al,00
    mov bx,offset buff
    mov cx,14
    mov dx,19
    int 25h
    jc error1
    add sp,02h

    lea si,buff
l8:
    mov al,[si]
    cmp al,0E5h
    je loop1
    cmp al,00
    je loop1
    jmp loop2

loop1:
    add si,0032
    dec cnt
    jnz l8

loop2:
    lea di,filename
    mov cx,11

l9:
    mov al,[si]
    cmp al,[di]
    je loop3
    add si,cx
    add si,21
    dec cnt
    jnz loop2
    jmp n_found
loop3:
    inc si
    inc di
    dec cl
    jnz l9
    jmp found

n_found:   
    print msg4
    jmp end1

found:
    print msg3
   
    sub si,11
    add si,1Ah
    mov bl,[si]
    inc si
    mov bh,[si]
    mov str_cls,bx        ;first cluster   

    mov al,00
    mov bx,offset buff1        ;buff1 contains FAT
    mov cx,09
    mov dx,01
    int 25h
    add sp,02
    jc error2
    mov ax,str_cls
    jmp en2

;FAT cluster reading

error1: jmp error11

en2:
    add ax,31
    mov dx,ax
    mov al,00
    mov cx,01
    mov bx,offset buff2
    int 25h
    jc error3
    add sp,02
   

    mov ah,09h
    lea dx,buff2        ;displaying cluster contents
    int 21h


    mov ax,str_cls
    mov bx,03
    mul bx
    mov bx,02    ;1.5 multiplication
    div bx
   
    lea si,buff1
    add si,ax
    mov bl,[si]
    inc si
    mov bh,[si]
    mov num,bx

    mov ax,str_cls
    and ax,01
    cmp ax,00
    je even1
    jmp odd1

even1:
    mov ax,num
    and ax,0fffh
    cmp ax,0fffh
    je end1
    mov str_cls,ax
    jmp en2


odd1:
    mov ax,num
    and ax,0fff0h
    mov cl,04
    shr ax,cl
    and ax,0fffh
    cmp ax,0fffh
    je end1
    mov str_cls,ax
    jmp en2


error11:
    print msg1
    jmp end1
error2:
    print msg5
    jmp end1
error3:
    print msg2
end1:
    mov ah,4ch
    int 21h
    end begin   

I hope you people like this post.

Please do write comments !!!!

Wednesday, 5 February 2014

FAT Chaining

 

Most of the people will not understand the term ”FAT Chaining”. There are two things:

1) FAT

2) Chain

Yes, FAT is File Allocation Table and you will understand chaining at the end of this post. Basic aim of this post is to understand how a file gets retrieved when we click the icon or enter the name of the file on command prompt.

Suppose, we want to access a file present in a Floppy Disk. Now the question arises, Why are we retrieving the file from Floppy Disk and not from Hard Disk?

The reason is that we know the details of the Standard Data Structures of the Floppy Disk (FD). Here, data structures constitute to the following:

1) Boot Sector

2) File Allocation Table (FAT)

3) Root Directory

4) File Area

Now, we will not let ourselves only getting theoretical knowledge, instead we will develop Assembly Language Code for this. So, we divide this topic into two posts. This post will contain information about the above mentioned Data Structures and in the next post we will play with the Interrupts in order to develop the Assembly Level Code.

Now, let’s see how these things going to be used during file retrieval:

The whole area of the FD is divided among the above 4 Data Structures as shown below in Fig.1. Each MS-DOS logical volume is divided into several fixed-size control areas and a files area. The size of each control area depends on several factors like the size of the volume and the version of FORMAT used to initialize the volume, for example, but all of the information needed to interpret the structure of a particular logical volume can be found on the volume itself in the boot sector.  

image

 

1) Boot Sector:

Boot sector is known as the logical sector 0, contains all of the critical information regarding the disk medium's characteristics (Fig.2).

image

We are not going into much detail about these fields as they are self explanatory. From Boot Sector, we get all the physical information about the Drive. Let say, we have to find the size of the logical space of the drive. This can be calculated by making use of:

1) Total sectors in logical volume and

2) Bytes per sector

2) File Allocation Table:

Each file's entry in a directory contains the number of the first cluster assigned to that file, which is used as an entry point into the FAT. From the entry point on, each FAT slot contains the cluster number of the next cluster in the file, until a last-cluster mark is encountered.

At the computer manufacturer's option, MS-DOS can maintain two or more identical copies of the FAT on each volume. MS-DOS updates all copies simultaneously whenever files are extended or the directory is modified.

If access to a sector in a FAT fails due to a read error, MS-DOS tries the other copies until a successful disk read is obtained or all copies are exhausted. Thus, if one copy of the FAT becomes unreadable due to wear or a software accident, the other copies may still make it possible to salvage the files on the disk. As part of its procedure for checking the integrity of a disk, the CHKDSK program compares the multiple copies (usually two) of the FAT to make sure they are all readable and consistent.

3) Root Directory:

Following the file allocation tables is an area known as the root directory. The root directory contains 32-byte entries that describe files, other directories, and the optional volume label (Fig. 3).

An entry beginning with the byte value E5H is available for reuse; it represents a file or directory that has been erased. An entry beginning with a null (zero) byte is the logical end-of-directory; that entry and all subsequent entries have never been used.

image

 

4) Files Area:

The remainder of the volume after the root directory is known as the files area. MS-DOS views the sectors in this area as a pool of clusters, each containing one or more logical sectors, depending on the disk format. Each cluster has a corresponding entry in the FAT that describes its current use: available, reserved, assigned to a file, or unusable (because of defects in the medium). Because the first two fields of the FAT are reserved, the first cluster in the files area is assigned the number 2.

Try to follow the next post which will be more interesting.

Thanks a lot !!!