Sunday 30 December 2007

Indexed vs. Hashed Files

Compare the two non-sequential file structure models: the random (hashed) file and the index file. What advantages does the first one have over the second and what advantages does the second have over the first? What would be your criteria for choosing one over the other for different applications?

"An index for a file contains a list of the keys stored in the file along with entries indicating where the record containing each key is stored" (Brookshear, 2007). Without the index, searching for a particular value in the file would require a full scan of the file. Using an index greatly speeds up the search process. In a non-indexed file with 1000 records an average of 500 comparisons would be needed to find a record. But with an evenly spaced index 100 records in size the number of operations is reduced by a factor of 5 (Callari, 1996). There are several disadvantages to using an index - the need to provide storage space for the index, the fact that the index must be updated as soon as the content of the file is updated, and the fact that an index only improves search speed once the file gets past a certain size.

In a hashed file, data storage space is divided into several buckets and the records in the file are distributed among the buckets according to an algorithm (Brookshear, 2007). The speed of searching for a record is improved by applying the reverse of the algorithm to determine which bucket the record is held in, then only having to search through that bucket for the record rather than the whole file. Problems occur in a hashed file because the way that records are distributed to buckets can result in more search keys ending up in one bucket than others (clustering). Records can be hashed with the same value - collision - and it is possible for buckets to overflow if some means of dynamically increasing their size is not implemented.

The fact that indexes are one of the fundamental concepts behind database design leads me to believe that indexes are best used when there are no limits on storage space - a large database will always have space available for indexes because of the extra efficiency in searching. Also, I think that indexes are less prone to errors than hashing files - no chance of collision or clustering. However hashing isn't as wasteful in terms of needing extra resources to implement, and is quicker when done properly because there aren't two files to maintain.

Refs:

Brookshear, J. G. (2007) Computer Science, An Overview (9th Ed.) Pearson Education Inc., Boston, MA

Callari, F (2006) Indexed Sequential [Online] Quebec: McGill University, Montreal, Quebec, Canada
Available from http://www.cim.mcgill.ca/~franco/OpSys-304-427/lecture-notes/node56.html (Accessed 30th Dec. 2007)

No comments: