Monday 26 January 2015

External Sort

 

I guess most of the people have implemented sorting algorithms like merge sort, quick sort, etc but they may not be aware of External Sort. So, What does this external sort mean?

In order to understand this concept, we will introduce a scenario. Suppose, we have to sort a file of size far larger than that of physical memory (RAM). To tackle this problem, firstly, increase the physical memory or secondly, try to introduce some other algorithm. Yes, second option is the better one but how?

This problem can be solved with this external merge sort algorithm. Let’s have a look of this algorithm.

This algorithm consists of two phases:

1) Creation of sorted sublists

2) Merging sublists to form a single file

Let’s start developing the foundation of the solution. Let the size of the main memory be ‘m’ buffers and that of file be ‘n’ such that n >> m. We can sort the file that has size less than or equal to the size of physical memory by making use of any known sorting algorithm. Here, sorting algorithms like merge sort cannot be used because such kind of sorting algorithm requires more memory for storing intermediate results.

Phase I:

Get the first m buffers (size of RAM) of the file into the main memory and sort using any internal sorting algorithm. Now, after sorting dump this onto the disk in a file called as sublist. Repeat this process until the whole file gets sorted into sublists. Obviously, we will get n/m no. of sublists. All the sublists will be nothing but sorted files of size equal to that of main memory.

Phase II:

So, we will get n/m = t (say) no. of sorted sublists.

Now, we have to merge these sorted sublists in order to create a single sorted file. One thing should be noted here, these sublists are not mutually sorted, don’t worry; you will understand the whole concept from the example.

Divide the main memory (logically) into (t+1) buffers as we have t sorted sublists and one extra buffer for storing the results of the merge operation. Now, the size of the buffer (s) will be equal to m/ (t+1). Read the s bytes from each sublists into the main memory and compare the top values from each buffer and put the lesser value into the output buffer and remove that value from its respective buffer. Fill the buffer when it becomes empty from its respective sublists. As soon as the output buffer gets full, dump it into an output file and this file will be the final sorted file that we wanted. All the dumps will be performed on the same output file in phase II.

For example,

Suppose we have a file containing the following nos.

1, 2, 3, 5, 4, 3, 5, 1, 3, 4, 5, 8, 3, 5, 1, 8, 3, 7, 4, 5 and we can store max. of 5 nos. at any instant of time in the main memory i.e. size of the main memory = 5 nos.

Now, we have to sort these nos. using External Sort mechanism.

Phase I:

Take first 5 nos. and put them in main memory, sort them and put them in a sublist.

Sublist 1: 1 2 3 4 5

Similarly, other sublists will be:

Sublist 2: 1 3 3 4 5

Sublist 3: 1 3 3 5 8

Sublist 4: 3 4 5 7 8

Phase II:

Now, we have 4 sublists, so 4 input plus 1 output buffers.

The working of this phase can be understood very easily from the above diagrams.

Step 1:

clip_image002_thumb5

Initially, one no. from each sublist is loaded into physical memory (as each buffer can store only one no.) and the smallest no. is stored into the output buffer. This buffer is emptied/dumped into a file (output file).

Similarly,

Step 2:

clip_image004_thumb6

Step 3:

 clip_image006_thumb5

Step 4:

clip_image008_thumb5

Step 5:

clip_image010_thumb5

In the similar fashion, we can go ahead until all the sublists become empty and at last the output file will contain sorted nos.

I hope, you can sort the files of huge sizes using external sorting algorithm that we have seen.

Please don’t forget to write the comments of appreciation or doubts.