Henri E. Bal, Jason Maassen, Rob V. van Nieuwpoort, Niels Drost,
Roelof Kemp, Timo van Kessel, Nick Palmer, Gosia Wrzesińska, Thilo Kielmann,
Kees van Reeuwijk, Frank J. Seinstra, Ceriel J.H. Jacobs, and Kees Verstoep
The use of parallel and distributed computing systems is essential to
meet the ever-increasing computational demands of many scientific and
industrial applications. Ibis allows easy programming and deployment
of compute-intensive distributed applications, even for dynamic, faulty,
and heterogeneous environments.
The past two decades have seen tremendous progress in the application of high-performance and distributed computer systems in science and industry. Among the most widely used systems are commodity compute clusters, large-scale grid systems, and, more recently, economically driven computational clouds and mobile systems. In the last few years, researchers have intensively studied such systems with the goal of providing transparent and efficient computing, even on a worldwide scale.
Unfortunately, current practice shows that this goal remains out of reach. For example, today’s grid systems are mostly exploited to run coarse-grained parameter-sweep or master-worker programs. For more complex applications, grid usage is generally limited to straightforward scheduling systems that select a single site for execution. This is unfortunate, as many scientific and industrial applications, including astronomy, multimedia, medical imaging, and biobanking, would benefit from distributed compute resources. Advances in optical networking also enable a much larger class of applications to run efficiently on such distributed systems. In addition, research hasn’t adequately addressed the problems that can arise from combining multiple unrelated systems to perform a single distributed computation. This is a likely scenario, as many scientific users have access to a wide variety of systems.
In itself, each cluster, grid, and cloud provides well-defined access policies, full connectivity, and middleware that allows easy access to all its resources. Such systems are often largely homogeneous, offering the same software configuration or even the same hardware on every node. Combining several systems, however, is apt to result in a distributed system that is heterogeneous in software, hardware, and performance. This may lead to interoperability problems. Communication problems are also probable due to firewalls or network address translation (NAT), or simply because the geographic distance between the resources makes efficient communication difficult. Moreover, a combination of systems is often dynamic and faulty, as compute resources can be added or removed or even crash at runtime. The use of inherently heterogeneous and unreliable resources such as desktop grids, stand-alone machines, and mobile devices exacerbates these issues.