Sunday, 30 December 2007

Indexed vs. Hashed Files

Compare the two non-sequential file structure models: the random (hashed) file and the index file. What advantages does the first one have over the second and what advantages does the second have over the first? What would be your criteria for choosing one over the other for different applications?

"An index for a file contains a list of the keys stored in the file along with entries indicating where the record containing each key is stored" (Brookshear, 2007). Without the index, searching for a particular value would require a full scan of the file; using an index greatly speeds up the search. In a non-indexed file of 1000 records an average of 500 comparisons is needed to find a record, but with an evenly spaced index of 100 entries the number of operations is reduced by a factor of about 5 (Callari, 1996). There are several disadvantages to using an index: it needs its own storage space, it must be updated whenever the content of the file changes, and it only improves search speed once the file grows past a certain size.
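
A quick sketch of that arithmetic (the record layout and numbers are my own, chosen to mirror the 1000-record example): a full scan touches records one by one, while an indexed search first scans a small index to find the right segment of the file, then scans only that segment.

```python
# Hypothetical file of 1000 sorted records, keyed 0, 10, 20, ...
RECORDS = [(key, f"record-{key}") for key in range(0, 10000, 10)]

# An index entry for every 10th record: (key, position in file) - 100 entries.
INDEX = [(RECORDS[i][0], i) for i in range(0, len(RECORDS), 10)]

def full_scan(key):
    """Linear search of the whole file, counting comparisons."""
    comparisons = 0
    for k, value in RECORDS:
        comparisons += 1
        if k == key:
            return value, comparisons
    return None, comparisons

def indexed_search(key):
    """Scan the index to find the right segment, then scan that segment only."""
    comparisons = 0
    start = 0
    for k, pos in INDEX:
        comparisons += 1
        if k > key:
            break
        start = pos
    for k, value in RECORDS[start:start + 10]:
        comparisons += 1
        if k == key:
            return value, comparisons
    return None, comparisons

value, scan_cost = full_scan(7770)
_, index_cost = indexed_search(7770)
print(value, scan_cost, index_cost)  # record-7770 778 87
```

Even with a naive linear scan of the index, the comparison count drops from hundreds to under a hundred; a binary search of the index would do better still.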

In a hashed file, data storage space is divided into several buckets and the records in the file are distributed among the buckets according to a hash algorithm (Brookshear, 2007). Searching for a record is faster because applying the same algorithm to the search key identifies which bucket the record is held in; only that bucket then needs to be searched, rather than the whole file. Problems occur in a hashed file because the way records are distributed can leave some buckets holding far more search keys than others (clustering). Two distinct keys can hash to the same value (a collision), and buckets can overflow if no means of dynamically increasing their size is implemented.
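
A minimal sketch of that idea (bucket count and keys are my own): the same hash function is used to store a record and to find it again, so a lookup searches one bucket instead of the whole file. Note how two different keys can land in the same bucket - a collision.

```python
# A hashed "file" with a fixed number of buckets.
NUM_BUCKETS = 8
buckets = [[] for _ in range(NUM_BUCKETS)]

def bucket_for(key):
    # The hash algorithm: maps any key to one of the buckets.
    return hash(key) % NUM_BUCKETS

def store(key, record):
    buckets[bucket_for(key)].append((key, record))

def lookup(key):
    # Apply the same hash to the search key, then search only that bucket.
    for k, record in buckets[bucket_for(key)]:
        if k == key:
            return record
    return None

for i in range(100):
    store(i, f"customer-{i}")

print(lookup(42))                      # customer-42
print(bucket_for(3) == bucket_for(11)) # True - keys 3 and 11 collide into one bucket
```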

The fact that indexes are one of the fundamental concepts behind database design leads me to believe that indexes are best used when storage space is not tightly constrained - a large database will always have space available for indexes because of the extra search efficiency they buy. I also think that indexes are less prone to errors than hashed files - there is no chance of collision or clustering. However, hashing is less wasteful in terms of the extra resources needed to implement it, and is quicker when done properly because there are not two files to maintain.


Brookshear, J. G. (2007) Computer Science, An Overview (9th Ed.) Pearson Education Inc., Boston, MA

Callari, F (1996) Indexed Sequential [Online] Montreal, Quebec, Canada: McGill University
Available from (Accessed 30th Dec. 2007)

Friday, 28 December 2007

Distributed Databases

When would you choose to implement distributed databases, rather than local databases? Would you go for this design if all your computers were localized in one building? What if they are spread out in your country? What would you recommend when they are spread worldwide? When would you mirror databases and when would you use databases tailored for the local environment?

I would move from local to distributed databases (DD) as soon as some or all of the data held required synchronisation between users; as soon as the data required more security than a local database management system (DBMS) such as Microsoft Access can offer; or as soon as the data became 'mission critical'. Of course a DD requires the application(s) submitting data to it to be compliant; either that, or there must be the capacity to implement applications that can interface with a DD. The decision would also be underpinned by financial considerations - implementing a DD requires an investment in both hardware and expertise.

In the case of all computers localised in one building the case for a DD is harder to make. Imagine the building is home to a sales and marketing company with a small Customer Relationship Management (CRM) system. The staff could easily all hold a copy of the CRM data on their local machines. Provided they were happy for the data to be in a consistent state only once a day, after a synchronisation job had run, there would be no problem, and any faults in the synchronisation could easily be tracked to the machine responsible. Of course, doing business this way is inefficient and not very secure. A simple distributed database - perhaps comprising a single SQL Server 2005 instance, a separate server to store data and log files, a tape drive for backups and a simple web app to provide the CRM functionality - would be relatively inexpensive in this case.

What if the computers are spread out across the country? A distributed database is a must in this case because the logistics involved in maintaining local copies of data across multiple sites are overwhelming. It is perfectly feasible to have the distributed database located at one site in the country and all the computers communicating with that site. When the computers are spread worldwide a different approach is required. One such approach is the OceanStore system. "The core of the system is composed of a multitude of highly connected “pools”, among which data is allowed to “flow” freely. Clients connect to one or more pools, perhaps intermittently" (Kubiatowicz, Bindel, Chen et al, 2000). These pools - located around the globe - allow for much greater resilience in the face of disaster, because database objects are replicated across multiple pools and the failure of one pool does not affect the availability of data.

"Fast failover with minimal data loss has traditionally involved higher hardware cost and greater software complexity. Database mirroring, however, can fail over quickly with no loss of committed data, does not require proprietary hardware, and is easy to set up and manage" (Talmage, 2005). Despite what Ron Talmage says, implementing advanced database techniques such as mirroring requires expertise to be constantly available for maintenance; if an organisation cannot obtain that availability, or would not benefit from mirroring because unavailability of data is not high on its risk assessment, then I would suggest that tailoring databases for local use is the cheaper and quicker option - especially for smaller businesses employing only a few people. In a large organisation I think the custody of data in databases is as important as filing tax returns, and I can't see why such organisations wouldn't do everything possible to secure their data and keep it available - especially in the post-Sarbanes-Oxley world.


Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Weimer, W., Wells, C. & Zhao, B. (2000)
OceanStore: An Architecture for Global-Scale Persistent Storage [Online]
Proceedings of the Ninth international Conference on Architectural Support for Programming Languages and Operating Systems, 2000
Available from (Accessed 28th Dec. 2007)

Talmage, R (2005) Database Mirroring in SQL Server 2005 [Online] Microsoft Corporation
Available from (Accessed 28th Dec. 2007)

Sunday, 23 December 2007

Storing Data in a Sparse Matrix

One of the problems of storing data in a matrix (a two-dimensional Cartesian structure) is that if not all of the elements are used, there might be quite a waste of space. In order to handle this, we can use a construct called a "sparse matrix", where only the active elements appear. Each such element is accompanied by its two indexes (the row and the column). Discuss in what ways such a structure is similar to and/or different than a list.

A Linked List and a Sparse Matrix are two different data storage formats. "Many numerical problems in real life applications such as engineering, scientific computing and economics use huge matrices with very few non-zero elements, referred to as sparse matrices" (Smailbegovic, Gaydadjiev & Vassiliadis, n.d.). The most obvious way to store a sparse matrix would be in a multi-dimensional array. However this is extremely wasteful, because the majority of elements are zero and will never be accessed, yet must still have memory allocated to them. "With a little extra effort, the size of the array can be deferred until the array is created at runtime, but after that it remains fixed" (Parlante, 2001). When an array is created, all of its memory is allocated in one contiguous chunk.

In contrast, a Linked List consists of a series of nodes, each with its own independent memory allocation. These nodes are 'linked' through pointers - each node usually contains two fields, one holding the data for the node and the other holding a pointer to the next node in the chain.

The Linked List holds a number of advantages over a Sparse Matrix implemented as a multi-dimensional array. The first is that the Linked List is much more dynamic: changing the size of an array at runtime is hard work for the programmer, whereas a Linked List can be grown as required simply by chaining more nodes together. Also, because an array allocates all its memory in one block, resizing it means allocating a new block and copying every element across. A Linked List is not constrained in this way.

An array does outperform a linked list when considering the time taken to access an element - this is constant in an array of any size, whereas reaching a node further down the chain of a linked list takes increasingly long. But implementing a sparse matrix as a Linked List rather than a multi-dimensional array still benefits the programmer in terms of memory usage and ease of implementation.


Smailbegovic, F., Gaydadjiev, N. G., Vassiliadis, S. (n.d.) Sparse Matrix Storage Format [Online] Delft, Netherlands: Delft University of Technology
Available from (Accessed 23rd Dec 2007)

Parlante, N (2001) Linked List Basics [Online] Stanford, USA: Stanford University
Available from (Accessed 23rd Dec 2007)

Friday, 21 December 2007

“What we have is a total failure to communicate”

It is said that communication in the analysis step often breaks down. Could you comment on this? Have you experienced this phenomenon? If not, would you accept this statement, and why?

"The requirements analysis and design phase is believed to be the most critical part of the software development lifecycle" (Catherine, 2007). In a presentation on the topic, John Petlick of DePaul University in Chicago describes one of the deficiencies of the waterfall model as "all requirements must be known upfront" (n.d.) - which is never the case when communication breaks down, as I have experienced.

I work as part of a team of four developers. We are expected to deliver business requirements across a number of platforms, but we are constantly involved in development projects where the analysis phase never happens or is over far too quickly. In the former case the project becomes technology-led: design and execution are done by the development team, who fully understand how to make the application in question output a message or file but fail to understand what the application actually means to the business, its users and its customers. In the latter case the analysis phase often starts and ends in a single meeting, the output being a very rough specification document that lacks the low-level detail necessary when architecting a system. An exercise of 'fill in the blanks' then begins for the developers, while the business customers believe their part in the process is done or are too busy to devote any more time to the analysis. The attitude quickly becomes 'it's now the responsibility of the IT people to deliver', which is ludicrous. Getting developers and business analysts to communicate with more quality and quantity is one of the fundamentals of the agile / extreme programming model, where a business analyst is part of the development team full time and on hand to answer questions straight away. This is light years away from the current process, where progress is reviewed weekly or fortnightly, and it's only then that the developers have their mistakes pointed out.

I also have experience of the analysis phase being done properly. This has tended to happen on projects where I am the sole developer and where I have a good enough relationship with the customer to go to them face-to-face every day and ask questions. Communications in the analysis phase of local government IT projects often break down because groups of key people have a 'history' that stops them from working together. But until a level of understanding is reached that gives the business as much responsibility in the software engineering realm as the developers, software projects will continue to fail.

In the UK, central government is trying to roll out ContactPoint, an index of child information. It was recently announced that the deadline of April 2008 would be put back to October 2008 - the second time a deadline has been extended. At meetings it is always obvious that the developers of the system - one of the largest IT consultancy firms in the country - and its user base (people like me and my colleagues) are consistently on different wavelengths.

Catherine, A (2007) Software Development Life Cycle: Basic Steps to Strong Product [Online]
Available from (Accessed 21 Dec 2007)

Petlick, J (n.d.) Software Development Life Cycle (SDLC) [Online] DePaul University, Chicago, IL
Available from (Accessed 21 Dec 2007)

Sunday, 16 December 2007

Object Oriented Programming vs. Structured Programming Paradigm

Discuss why Object Oriented Programming is considered a better choice than the structured programming paradigm. Does it reflect our "natural" way of thinking or do you find it artificial?

The most obvious reason that Object Oriented Programming (OOP) is considered a better choice is that it is supposed to represent the way humans think about the world better than the structured programming paradigm does. We don't really view life as a series of instructions waiting to be processed - that is how a computer operates. Rather, we see ourselves as one of millions of objects that interact with each other, and OOP should make writing software easier because it embraces this view. At a more operational level the advantage of OOP over procedural programming is nicely summarised by Matt Weisfeld: "Access to data is uncontrolled and unpredictable [in procedural programming] and because you have no control over who has access to the data, testing and debugging are much more difficult. Objects address this problem by combining data and behaviour into a nice complete package" (2007).

I do find the OOP paradigm a rather artificial and forced way of thinking. Don't get me wrong - as someone who spends a lot of time developing code I can fully grasp its advantages. For example, in a current project I have a class called ToolBox that contains a method called AuditLog. I can insert a line into any other method that logs whatever information I desire to a database log, without constantly re-writing (or cutting and pasting) the code that performs the database insert. And by exposing this method as a web service, anyone else with access to the same network can use it too. But I don't really think in terms of objects - I just think about what actions I need to perform inside my system and write methods accordingly. Of course, if I take into account the fact that I use objects defined by other programmers - for example, .NET has a class called XmlDocument, and Java has its equivalent DOM classes - and pass these in and out of my own methods, then I am probably using OOP more than I give myself credit for, but that is not intentional.
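
The ToolBox/AuditLog idea can be sketched like this. The real version is C# writing to a database; here an in-memory list stands in for the audit table, so only the shape of the idea is shown (all implementation details are my own):

```python
class ToolBox:
    """Shared utility object: callers reuse one logging routine instead of
    pasting database-insert code into every method."""

    def __init__(self):
        self._log_table = []  # stands in for the audit table in the database

    def audit_log(self, source, message):
        # One place to change if the logging destination ever changes.
        self._log_table.append({"source": source, "message": message})

tools = ToolBox()

def process_order(order_id):
    tools.audit_log("process_order", f"started order {order_id}")
    # ... actual business logic would go here ...
    tools.audit_log("process_order", f"finished order {order_id}")

process_order(17)
print(len(tools._log_table))  # 2
```

The data (the log table) and the behaviour (the insert) live together in one object - exactly the "nice complete package" Weisfeld describes.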

I just think that if OOP really reflected how we think about the world then it wouldn't be such a struggle to come to terms with. "Novice designs are littered with regressions to global thinking: gratuitous global variables, unnecessary pointers, and inappropriate reliance on the implementation of other objects" (Beck & Cunningham, 1989). However, I wonder if this comes from the fact that today's industry-standard OOP languages (C# / Java) still have their roots in procedural programming. I also think that because I learned to program procedurally first and came to OOP later, I find it harder to make the transition. "Indeed many people who have no training in computer science and no idea how a computer works find the object-oriented model of problem solving quite natural" (Budd, 1987).

Weisfeld, M. (2007) Moving from Procedural to Object-Oriented Development [Online] JupiterMedia Corporation, USA
Available from (Accessed 16th Dec. 2007)

Beck, K. & Cunningham, W. (1989) A Laboratory for Teaching Object Oriented Thinking [Online] Sigplan Notices Vol 24 No 10
Available from (Accessed 16th Dec. 2007)

Budd, T (1987) A Little Smalltalk [Online] Pearson Education Inc.
Available from (Accessed 16th Dec. 2007)

Interpreted vs Compiled Code

Interpreted code executes much more slowly than compiled code, yet several systems use them extensively. The most well known are Visual Basic (it has two modes: interpreter mode and compiler mode) and JavaScript. Discuss the merits and weaknesses of interpreting, as opposed to compiling, and explain when and why they are used.

The major advantage of interpreted languages is the platform independence that they offer. The best example of this is JavaScript. Rather than forcing the user of a computer to install every single application they want to use, applications written using JavaScript can be delivered through a web browser, which fetches all the necessary files (in reality downloading them to a temporary folder) without the user knowing anything about it. This means applications can easily be made available to any operating system (such as Windows, Linux or Mac OS) that supports a web browser. This platform independence is one of the main reasons for the rapidly increasing popularity of thin-client web-based applications over thick-client desktop applications. Without interpreted JavaScript I would have to install one application to use Google Mail, another to use Expedia, another to use (of course these websites probably wouldn't exist at all, but you get the idea).

A second advantage is the speed with which interpreted languages can be developed. When working with JavaScript, for example, a programmer can embed some code inside a tag within some HTML, open the web page in a browser - whose built-in JavaScript interpreter runs it - and see the results straight away. The developer can quickly change some aspect of the code, refresh the browser, and the change is instantly reflected. Contrast this with development in a compiled language. The coder must change the code and then run it through a compiler with the necessary parameters to produce a compiled assembly such as a Dynamic Link Library (DLL). The DLL must then be deployed to the correct location (perhaps another development server - and in a Windows environment this can require a further step, because shared assemblies must be deployed to the Global Assembly Cache) and finally the assembly must be executed by a program.

One of the most commonly cited reasons for using a compiled language rather than an interpreted language is speed - because an interpreted language has to be translated each and every time it is run, it carries an extra overhead. But in reality most modern web applications use a combination of interpreted and compiled code. For example, an e-commerce website is likely to use some embedded JavaScript to perform simple functionality on the client side, such as validating the data inside a TextBox. This creates a small overhead in terms of translating the code at every run, but the tiny amount of code involved makes the overhead pretty negligible. The more error-prone parts of the website - such as interacting with a database to store data or perform complicated searches - are likely to be handled by pre-compiled web services or by stored procedures / functions on the database server itself. The IBM Information Centre states: "we can see that it would make sense to use a compiled language for the intensive parts of an application (heavy resource usage), whereas interfaces (invoking the application) and less-intensive parts could be written in an interpreted language" (2007).
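
That per-run translation overhead can be made visible using Python's own compile step (a sketch, not a benchmark of JavaScript itself): source text must be turned into an executable form before it can run, and a compiled language pays that cost once, ahead of time.

```python
import timeit

source = "total = sum(n * n for n in range(100))"

# Translate + execute on every run - what re-interpreting source costs.
interpret_each_time = timeit.timeit(lambda: exec(source), number=1000)

# Translate once, then reuse the compiled form on every run.
code = compile(source, "<demo>", "exec")
compiled_once = timeit.timeit(lambda: exec(code), number=1000)

# Recompiling on every run costs extra; for a tiny snippet the absolute
# overhead is small, which is why embedded client-side scripts get away with it.
print(interpret_each_time > compiled_once)
```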


IBM Corporation (2007) Compiled versus interpreted languages [Online] IBM Corporation, NY, USA
Available from (Accessed 16th Dec 2007)

Thursday, 6 December 2007

Program verification in the future

How do you envision the ways program verification and performance tuning will be accomplished in the future? Will it still be a work of art? Will it always be the result of one's experience? Or do you forecast more and more automatic ways (algorithms?) to accomplish these tasks?

One vision of the future of program verification was presented by J. Strother Moore in 2005. Moore stated that "the verification community seeks to build a system that can follow the reasoning of an average human engaged in a creative, computational problem". My reading around this subject indicates that the holy grail defined by Moore is a long way off yet. On his weblog, Wesner Moise (who worked on Excel for six years at Microsoft) states "it seems that just a few people are driving the field of software verification forward".

The verification of programs is much more complicated than actually writing the programs themselves. Much of the literature on the World Wide Web comes from international conferences on the subject, attended by professors from university computer science departments. The field is incredibly theoretical and requires an excellent understanding of mathematics to fully comprehend. While I can type 'C# hello world tutorial' into Google and see a list of about a million websites that will teach me how to implement that simple program, I am sure a similar exercise for program verification would be fruitless.

It is because of this inherent complexity that I think program verification will continue to rely on personal skill and experience rather than automatic means. Algorithms do contribute - for example, Microsoft's Visual Studio development package pre-compiles code as you write it and identifies potential problems before letting you build the code into a library. I maintain, though, that if the methods of formal program verification were easy they would be taught at undergraduate level along with database design and software engineering, and so verification will continue to come from testing and good design.

Testing only goes so far, though. A partner company recently released a new version of their indexing database, having fired about six million different test cases at it as part of the verification. All this means is that case six million and one - which could still occur for a customer - may cause the software to fail. The best form of verification, in my opinion, is to adopt sound design methodologies - for example a bottom-up methodology, where software is built from smaller components that are themselves easily tested and easy to identify when they cause an error. This robust design comes from the experience of system architects and from the discipline of system developers in making sure the code they write works properly when they write it, rather than from automatic means.
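
The bottom-up idea looks something like this in miniature (the functions are invented for illustration): each small component is verified on its own before the composite is trusted, so a failure points straight at the piece responsible.

```python
def normalise_postcode(raw):
    """Small, easily verified building block."""
    return raw.replace(" ", "").upper()

def build_index_key(surname, postcode):
    """Composed from verified parts, so errors localise to one function."""
    return f"{surname.strip().upper()}:{normalise_postcode(postcode)}"

# Check each component in isolation before trusting the whole.
assert normalise_postcode("sw1a 1aa") == "SW1A1AA"
assert build_index_key(" Jones ", "sw1a 1aa") == "JONES:SW1A1AA"
print("all component checks passed")
```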


Moore, J. S. (2005) A Mechanized Program Verifier [Online] Austin, University of Texas
Available from (Accessed 6th Dec 2007)

Moise, W (2006) Program Verification [Online] Seattle, USA
Available from (Accessed 6th Dec 2007)

Sunday, 2 December 2007

The Operating System is the Network...

Discuss the statement: "Global communication has developed to such a degree that the true operating system is the net itself, where the individual operating systems are just its nodes". Do you agree? What if the operating systems are heterogeneous (not the same brand)? How is such a "super operating system" managed? If you agree with this statement, can you list concepts and examples? If you do not agree, how do you regard operating systems and networks?

John Gage, an early employee of Sun Microsystems, coined the phrase "the network is the computer", and I agree; the true operating system is the net itself and the individual operating systems (OS) are just its nodes.

At first glance it would seem that the heterogeneity of operating systems within the network does not matter - if I sign up for a free email account on the web I don't care which OS the provider uses; I am more interested in storage space and availability. However, Kropf, Plaice and Unger identify "a need for a web operating system which would make available to all sites on a network the resources available on that network" (1997). This moves us away from connecting to the network through the same node (i.e. the PC in my home) every time; instead we would just connect and let the network decide (through its 'super' OS) which node we connect to, and then have all the resources of the network - storage, communication - at our disposal. Of course Linux and Windows Server nodes will always exist, so "should a given version not be capable of dealing with a particular request for a service then it can pass it on to another version" (Kropf, Plaice & Unger, 1997). The super OS must be able to recognise that a particular application is incompatible with the node I am working on and make another node available. Heterogeneous operating systems matter very much in this vision of the future.

How is such a super OS managed? Web operating systems already exist; one excellent example is YouOS. The YouOS website defines the web operating system as the "liberation of software from hardware" (WebShaka Inc., 2006). Of course the need for hardware still exists, but it will come from massive server farms rather than individual personal computers. The web OS will automatically take care of storage and archiving of data, leaving users free to concentrate on their computer-related work or leisure pursuits. One of the keys to management will be APIs. Currently I have to go to one website to check my personal email, another to check my work email, and a third to manage correspondence for my MSc. Through the web OS I will be able to configure connections to all these resources and only ever have to log into the network once to access them all. The super OS is managed through collaboration, open standards and the personal desire to design and create a utility or tool that solves a problem, rather than large software companies dictating what we need. Of course Google, Microsoft and the rest would like to be the company that controls our access to the network - but that starts to get a little Orwellian.


Kropf, P. G., Plaice, J. & Unger, H. (1997) Towards a Web Operating System [Online] Association for the Advancement of Computing in Education, Toronto
Available from
(Accessed 2nd December 2007)

WebShaka Inc. (2006) What is YouOS? [Online]
Available from (Accessed 2nd December 2007)

Saturday, 1 December 2007

Dual Definition of an Operating System

It is said that there is a dual definition of what an operating system must do. The first one is to present a "virtual machine" to the user which isolates him/her from the bare-bones hardware and is user friendly and effective. The other one is that it has to manage in the most efficient way the (always) limited resources of the computer system. It is also stated that these two goals often conflict. Do you agree? If yes, explain why and what you would do to resolve this. If you do not agree, give an example of some mutual cooperation. Does the concept of the process play a meaningful part in either or both of these tasks?

I agree that the dual operating system goals of presenting a user interface and managing hardware resources are in conflict with one another. The concept of a process, which Brookshear defines as "a dynamic activity whose properties change as time progresses" (2007), is meaningful in both these tasks and in the conflict between them. The architecture of today's widely used operating systems is inherently flawed because of cross-cutting concerns. This is where wider, non-functional concerns such as system security or fault tolerance dictate how the different modules are accessed. In their paper A Layered Approach to Building Open Aspect-Oriented Systems, Netinant, Elrad and Fayad explain that "a typical operating system will need sections of code responding to each of these requirements. That code is not usually segregated into neat modules, but instead interacts and tangles with the underlying functionality" (2001). The conflict between interface provision and management of resources arises because "in real operating systems architectural properties, like locking and component interaction schemes, tend to be hard-coded into almost every component of the system" (Lohmann & Spinczyk, n.d.). From a personal point of view, the integration systems I build as part of my job are designed using the object-oriented software methodology; I don't believe today's operating systems fully adopt even that approach, because they still carry a design legacy from previous incarnations.
The solution is to change the architecture of the operating system. In 1978 Lauer and Needham identified two rough categories of operating system - message-based and process-based. They argued that the two architectures were interchangeable and that neither had a performance advantage over the other. However, in 1982 Stroustrup performed an experiment based on this argument and found that "a degradation in response time occurred in a process-based system when the job load and the low level scheduling policy were poorly matched". Because a process-based operating system cannot localise requirements to specific concerns, such systems are difficult to understand, maintain and expand. Contrast this with a message-based system: I use Microsoft BizTalk Server to develop integration workflows, where a number of orchestrations pass messages between each other, and it is extremely easy to document and follow this interchange. I am sure a similar design technique would reduce the conflict that process-based systems face. Another well-publicised approach to changing the architecture is Aspect-Oriented Programming. This can be summarised as keeping services - such as writing data to a file - separate from aspects - the less tangible concerns within an operating system, such as security - with different policies applied to aspects at different stages of execution. This architecture eliminates cross-cutting and the conflict between the layers of the operating system.
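
The aspect-oriented separation can be sketched in a few lines (a decorator stands in for a real AOP weaver, and all the names are illustrative): the service knows nothing about the aspect, which is woven in separately.

```python
audit_trail = []

def audited(func):
    """The 'aspect': auditing, applied to any service without editing it."""
    def wrapper(*args, **kwargs):
        audit_trail.append(f"call: {func.__name__}{args}")
        return func(*args, **kwargs)
    return wrapper

@audited
def write_record(store, key, value):
    """The 'service': plain functionality, free of cross-cutting code."""
    store[key] = value
    return True

store = {}
write_record(store, "a", 1)
print(store)        # the service did its job...
print(audit_trail)  # ...and the aspect ran without the service mentioning it
```

Swapping in a different policy (security checks, fault tolerance) means changing the decorator, not every service it wraps - the cross-cutting code lives in one place.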


Brookshear, J. G. (2007) Computer Science: An Overview (9th Ed.) Pearson Education Inc. Boston, MA, USA.

Netinant, P., Elrad, T. & Fayad, M. E. (2001) A layered approach to building open aspect-oriented systems [Online] ACM Inc.
Available from (Accessed 1st Dec 2007)

Lohmann, D. & Spinczyk, O (n.d.) Architecture-Neutral Operating System Components [Online] Friedrich-Alexander University Erlangen-Nuremberg, Germany
Available from (Accessed 1st Dec 2007)

Lauer, H. C. & Needham, R. M. (1978) On the duality of Operating System structures [Online]
Proc. Second International Symposium on Operating Systems, IRIA, Oct. 1978, reprinted in Operating Systems Review, 13,2 April 1979, pp. 3-19. Available from (Accessed 1st Dec 2007)

Stroustrup, B. (1982) An experiment with the interchangeability of processes and monitors [Online] Bell Laboratories, NJ, USA
Available from (Accessed 1st Dec 2007)