How to Recognize a Cluster Application When You See One: High-Throughput Computing
For many high-throughput computing problems, it is possible to take advantage of the scalable performance of a cluster without changing the application. In this document we examine these types of applications and show how to recognize whether your application fits into this category.
Problem Statement
A cluster is only as useful as the applications you can run on it. While the multiple nodes on a cluster mean that you can support many simultaneous users, to truly unlock the power of your cluster you need to be able to deliver better performance for a single user. The challenge in doing this is that the user’s applications must be able to take advantage of more than one cluster node at a time. In the ideal case, where the work divides evenly and the overhead is negligible, an application that can use ten cluster nodes at once can run up to ten times faster.
The difficulty lies in identifying applications that can utilize more than one cluster node at a time. Certain applications have already been modified to recognize and run on clusters, and if you are fortunate enough to be able to use one of them, your problem is solved. However, what do you do if you don’t have an existing cluster-ready application? Does this mean you must have access to the source code and modify your application to use a cluster?
Fortunately, there is a broad and important class of applications that can easily be adapted to use a cluster with little or no modification.
Solution Statement
A high-throughput computing (HTC) application is one in which the same basic calculation must be performed over many independent input data elements and the results collected. Because each calculation is independent, it is extremely easy to spread calculations out over multiple cluster nodes. For this reason, high-throughput applications are sometimes called “embarrassingly parallel.” HTC applications occur much more frequently than one might think, showing up in areas such as parameter studies, search applications, data analytics, and what-if calculations.
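The pattern described above can be illustrated on a single machine. The following is a minimal sketch, not a cluster implementation: the function `calculate` is a hypothetical stand-in for the "same basic calculation," and a local thread pool stands in for the cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def calculate(x: float) -> float:
    """Hypothetical stand-in for the application's basic calculation."""
    return x * x + 1.0


def run_htc(inputs, workers=4):
    """Apply calculate() to every input and collect the results.

    Each call depends only on its own input, so the calls can run in
    any order and on any worker. The pool here is only a local
    stand-in for cluster nodes.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(calculate, inputs))


if __name__ == "__main__":
    print(run_htc([0.0, 1.0, 2.0, 3.0]))
```

Because no call needs the result of another, the number of workers can be raised or lowered without changing the results.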
Identifying an HTC Application
There are a number of identifiers you can use to determine if your specific computing problem fits into the category of a high-throughput application:
- Do you need to run many instances of the same application with different arguments or parameters?
- Do you need to run the same application many times with different input files?
- Do you have an application that can select subsets of the input data and whose results can be combined by a simple merge process such as concatenating them, placing them into a single database, or adding them together?
If the answer to any of these questions is “yes,” then it is quite likely that you have an HTC application.
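The third question, about subsets and a simple merge, may be the least obvious, so here is a small sketch of what it can look like. The example assumes a hypothetical text-search task in which each subset of lines is counted separately and the partial counts are merged by adding them together; the function names are illustrative only.

```python
def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous slices of near-equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def count_matches(lines, needle):
    """Hypothetical per-subset calculation: count lines containing needle."""
    return sum(1 for line in lines if needle in line)


if __name__ == "__main__":
    log = ["ok", "error: disk", "ok", "error: net", "ok", "ok", "error: cpu"]
    # Each chunk could be handled by a separate cluster job.
    partial_counts = [count_matches(c, "error") for c in split_into_chunks(log, 3)]
    # The merge step is a simple addition of the partial results.
    print(sum(partial_counts))
```

The merged total is the same as counting over the whole input at once, which is the property that makes this kind of problem a good HTC candidate.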
Making Your Application Cluster-Ready
Once you have determined that the HTC paradigm will work for your use case, it is often simply a matter of organizing your input data and submission scripts properly to be able to exploit the parallel processing capabilities of a cluster.
The simplest approach is to submit an individual cluster job for each combination of input data and parameter setting. The cluster scheduler then ensures that as many application instances as possible run at the same time.
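As a rough sketch of this one-job-per-combination approach, the script below builds one Grid Engine `qsub` command for each combination of parameter and input file. The application `./my_app`, its `--alpha` and `--input` options, and the parameter values are all hypothetical; the script only prints the commands rather than submitting them.

```python
import itertools
import shlex

ALPHAS = [0.1, 0.5, 1.0]          # hypothetical parameter values
INPUT_FILES = ["a.dat", "b.dat"]  # hypothetical input files


def build_submit_commands(alphas, input_files):
    """Return one qsub command (as an argument list) per combination."""
    commands = []
    combos = itertools.product(alphas, input_files)
    for index, (alpha, path) in enumerate(combos, start=1):
        commands.append([
            "qsub",
            "-N", f"htc_{index}",  # job name
            "-b", "y",             # treat the command as a binary, not a script
            "./my_app", "--alpha", str(alpha), "--input", path,
        ])
    return commands


if __name__ == "__main__":
    for cmd in build_submit_commands(ALPHAS, INPUT_FILES):
        # Print instead of submitting; passing cmd to subprocess.run
        # would submit the job on a system where qsub is installed.
        print(shlex.join(cmd))
```

With three parameter values and two input files this yields six separate submissions, each of which the user has to track individually.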
However, for applications that require large numbers of input alternatives, this simple approach can quickly become unwieldy, because it leaves the task of managing each individual submission up to the user. Fortunately, cluster scheduling software such as Grid Engine provides features such as job arrays that let you create parameterized task descriptions that can turn into hundreds or thousands of independent jobs. The resulting collection of jobs can be managed as a single unit, where the software monitors progress, notifies the user when the entire set is complete, and even kills the jobs if necessary.
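One way a job array can be parameterized is sketched below. In Grid Engine, an array job is submitted with `qsub -t` (for example, `qsub -t 1-6` followed by a wrapper script), and each task receives its index in the `SGE_TASK_ID` environment variable. The parameter lists and the mapping from task index to parameters are assumptions made for illustration; the wrapper script that would call this code is not shown.

```python
import itertools
import os

ALPHAS = [0.1, 0.5, 1.0]          # hypothetical parameter values
INPUT_FILES = ["a.dat", "b.dat"]  # hypothetical input files


def params_for_task(task_id, alphas=ALPHAS, input_files=INPUT_FILES):
    """Map a 1-based array task index to one (alpha, input file) pair."""
    combos = list(itertools.product(alphas, input_files))
    if not 1 <= task_id <= len(combos):
        raise ValueError(f"task id {task_id} outside 1..{len(combos)}")
    return combos[task_id - 1]


def task_id_from_env(environ):
    """Read the task index, falling back to 1 outside an array job."""
    raw = environ.get("SGE_TASK_ID", "")
    # Outside an array job the variable may be unset or non-numeric.
    return int(raw) if raw.isdigit() else 1


if __name__ == "__main__":
    alpha, path = params_for_task(task_id_from_env(os.environ))
    print(f"this task would run with alpha={alpha} on {path}")
```

A single submission covering tasks 1 through 6 then replaces six separate submissions, and the scheduler can treat the whole set as one unit.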
In addition, tools such as the monitoring interface found in Cluster Express can simplify the task of keeping track of the progress of the large number of jobs that can make up an HTC application. By providing additional real-time monitoring capabilities and remote access, the user’s job of managing the application processing on the cluster is simplified.




