Thursday, 27 March 2014

Duplicate Elimination

Last week, I was brushing up some of the concepts of Database Management Systems and I came across the idea of implementing the keyword Distinct. Distinct keyword is used in order to find the tuples with distinct values of the desired attribute present in the table.

For example.

Select distinct(emp_fname) from employee;

This query will give the unique name of the employees present in the table employee.

Now, the question arises, what data structures to be used in order to implement the keyword distinct?

Data Structures that can be used are:

1) Hashing

2) B Trees

Let’s discuss, Duplicate Elimination with Hashing.

1) Hashing:

Hashing is one of the best data structure to handle such kind of queries. Following points play an important role during implementation:

i. Hash Function

ii. Size of the Hash Table

iii. Collision Avoidance

The kind of Hash Function to be used depends upon the type of the concerned attribute.

The Size of the Hash Table is usually kept high and it’s a Prime no. The reason of size being a prime no. is that it evenly distributes the attribute values in the table.

When the Hash Function gives same slot for two or more input value, we have to have some alternative in order to avoid the collisions. The collision avoidance scheme to be used depends upon the developer who decides on the basis of the nature of the given input data. One may use some other Hash Function when the collision is observed.

Please have a look of the following piece of code that I have written:

#include
#include
#include

struct hashNode
{
long long int data;
hashNode *link;
};

struct list
{
int noOfKeys;
hashNode *next;
};


struct listInfo
{
list *listPointer;
int tableSize;
int collisionFactor;
};


list *createHash( int tableSize )
{
list *hash;
hash = new list[ tableSize ];

for( int i = 0; i < tableSize; i++ )
{
hash[i].noOfKeys = 0;
hash[i].next = NULL;
}
return hash;
}

hashNode *newNode()
{
hashNode *temp = ( hashNode * )malloc( sizeof( hashNode ) );
temp -> link = NULL;
return temp;
}

int rehash( listInfo *listInfoObj, list *hash )
{
long long int index;
hashNode *trav, *temp;

list *p = listInfoObj -> listPointer;
int len = listInfoObj -> tableSize/2;

for( int i = 0; i < len; i++ )
{
trav = p[i].next;
while(trav)
{
index = trav -> data % listInfoObj -> tableSize;
temp = newNode();
temp -> data = trav -> data;
temp -> link = hash[index].next;
hash[index].next = temp;
hash[index].noOfKeys++;
temp = trav;
trav = trav -> link;
free(temp);
}
}
return 0;
}

int getNext( long long int value, listInfo *listInfoObj )
{
hashNode *trav;
list *p, *hash;
long long int index = value % listInfoObj -> tableSize;

p = listInfoObj -> listPointer;

if( p[index].noOfKeys >= listInfoObj -> collisionFactor )
{
listInfoObj -> tableSize *= 2;
hash = createHash( listInfoObj -> tableSize );
rehash( listInfoObj, hash );
free(p);
listInfoObj -> listPointer = p = hash;
index = value % listInfoObj -> tableSize;
}
if( p[index].next == NULL )
{
p[index].next = newNode();
p[index].next -> data = value;
p[index].noOfKeys++;
}
else
{
trav = p[index].next;
while( trav->link != NULL )
{
if( trav -> data == value )
return 0;
trav = trav -> link;
}
if( trav -> data == value )
return 0;
trav -> link = newNode();
trav -> link -> data = value;
p[index].noOfKeys++;
}
return 1;
}

int displayHashTable( listInfo *listInfoObj )
{
list *p;
hashNode *trav;
int count = 0;
p = listInfoObj -> listPointer;

for( int i = 0; i < listInfoObj -> tableSize; i++ )
{
trav = p[i].next;
while( trav )
{
count++;
printf("%lld ", trav -> data );
trav = trav -> link;
}
printf("\n");
}
printf("No.of Unique Tuples: %d\n", count );
return 0;
}

int main()
{
char line[20];
listInfo *listInfoObj;

listInfoObj = ( listInfo * )malloc( sizeof( listInfo ) );
listInfoObj -> tableSize = 93;
listInfoObj -> collisionFactor = 11;

/* printf("Enter The Size of The Hash Table and Collision Factor:\n");
scanf("%d%d", &listInfoObj.tableSize, &listInfoObj.collisionFactor );
*/
listInfoObj -> listPointer = createHash( listInfoObj -> tableSize );

printf("Enter The Tuples:\n");
while( true )
{
scanf("%s", line );
if( !strcmp( line, "-1" ) )
{
break;
}
long long int value = atoll(line);
getNext( value, listInfoObj );
}
displayHashTable( listInfoObj );
return 0;
}

Implementation Details:

The basic aim is to allocate memory dynamically and Linked List is the best example of that. Define an array of pointers of size ‘N’ equal to that of table that you desired of. In order to avoid collision, we can have linked list for each collision value. The size of this linked list can be adjusted to the variable value i.e. collisionFactor.

Let’s have a look of the Hash Table structure.

clip_image002


Fig. 1 Hash Table Structure

At a particular instant of time, we can have maximum of collisionFactor no. of keys having same slot given by the Hash Function.

In a database system, we may have to work with huge no. of records. So, whenever the no. of nodes in a particular linked list exceeds the limit (i.e. CollisionFactor), we have to go for Rehashing. In Rehashing, we create a hash table (2N Size) of double than the size of the previous one and all the keys that we stored in previous hash table are again hashed into the newer one.

So, this part is about Duplicate Elimination Using Hashing. In the next post, we will see implementation aspects of B Tree and how it can be used for eliminating duplicate tuples.

No comments:

Post a Comment