r1 - 01 Feb 2008 - 07:23:41 - RolandoRaqueno

The History of CONDOR at RIT

Before CONDOR...

The use of CONDOR actually started several years before it was installed on the CIS systems. In 1998, while working with Lee Sanders, an Imaging Science Ph.D. candidate, we encountered what seemed to be a simple process: run a numerical model several times in order to find a physics-based solution to a remote sensing problem. The idea of his research was to use physics-based models to predict what an airborne sensor would detect when imaging a target on the ground. By quantifying the effects of the atmosphere, we can, in theory, compensate for them to reveal the spectral/optical properties of the target. Since we often do not know (or cannot accurately measure) the atmospheric conditions when the images were acquired, we can use atmospheric propagation models (MODTRAN) to generate candidate "guesses" of these atmospheric conditions. The number of guesses depended on how well you knew the imaging conditions. The less you knew, the more guesses you had to generate. Our first attempt at running this model to generate our database of guesses was simultaneously enlightening and disheartening. We quickly realized that we could not simply run all these cases serially on a single machine because it would have taken months to complete the process. We also realized that these months of processing did not include the false starts and repeated attempts needed to identify the problems with our inputs or our algorithm workflow.
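To make the scale of such a database of guesses concrete, here is a minimal sketch that counts the runs in a grid of atmospheric conditions and estimates the wall-clock time. The parameter names, the grid values, and the 30 minutes per run are all illustrative assumptions, not figures from the original study, and the estimate ignores scheduling overhead and machine availability.

```python
from itertools import product
from math import ceil

# Hypothetical grid of atmospheric "guesses" (illustrative values only,
# not the parameters used in the original research).
grid = {
    "visibility_km": [5, 10, 15, 23, 30, 50],
    "water_vapor_scale": [0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0],
    "target_elevation_km": [0.0, 0.5, 1.0, 1.5, 2.0],
    "aerosol_model": ["rural", "urban", "maritime", "desert"],
    "day_of_year": [15, 75, 135, 195, 255, 315],
}


def enumerate_cases(grid):
    """Return one dict per combination of parameter values."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


def wall_clock_days(n_cases, minutes_per_run, machines=1):
    """Idealized run time when the cases are spread evenly over the machines."""
    runs_per_machine = ceil(n_cases / machines)
    return runs_per_machine * minutes_per_run / (60 * 24)


MINUTES_PER_RUN = 30  # assumed cost of a single model run

cases = enumerate_cases(grid)
print("cases:", len(cases))
print("serial days:", wall_clock_days(len(cases), MINUTES_PER_RUN))
print("days on 100 machines:", wall_clock_days(len(cases), MINUTES_PER_RUN, 100))
```

Under these assumed numbers the grid holds 5,760 cases, which is about 120 days of serial run time. Each additional unknown multiplies the case count, which is why knowing less about the imaging conditions quickly makes a single machine impractical.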

We eventually came up with an ad hoc way of breaking down the problem into smaller, manageable chunks. Using what are now considered insecure methods, we logged into several machines at CIS and submitted these jobs semi-automatically using scripts. Monitoring the progress of the jobs and determining their successful completion was not always straightforward. Even though the jobs ran at low priority, their presence was still noticeable to users of the workstations.

Initial Forays into CONDOR

The ad hoc workflow, albeit inelegant, allowed us to move forward with the research and make significant gains in using numerical models to develop successful algorithms. Then in 1999, Bob Krzaczek had the foresight to install the CONDOR system on the CIS UNIX workstations and asked us to evaluate it as a possible replacement for our scripting process. We immediately appreciated the elegance and utility of the system because it addressed many of the shortcomings that we had found with the ad hoc scripting systems we had implemented. Although we did not know it at the time, we were addressing these remote sensing problems with a technique known as High Throughput Computing (HTC), a paradigm that is similar to, but distinct from, the more commonly known High Performance Computing (HPC). High Performance Computing is often associated with a system consisting of supercomputer hardware running specialized software aimed at maximizing the number of calculations per second to solve a problem. These types of computation are often tightly coupled and benefit from parallel computation techniques. High Throughput Computing, on the other hand, takes the approach of maximizing the number of jobs processed in a given period of time (jobs per day). These jobs are loosely coupled, which allows ordinary workstations to be used for the problem. CONDOR subscribes to the latter approach. It has been under development since 198? and has had a steady development history that has kept abreast of the rapidly changing nature of computer architecture and software. Initially considered an experimental system within CIS, it has now become an indispensable tool that continues to supply compute cycles from our existing hardware resources. The key features that make CONDOR effective and readily accepted are its cycle-scavenging strategy and its policy of minimal impact on the interactive user/owner of a workstation.
By directing processes only to systems that are idle, and by immediately suspending CONDOR jobs (and in some cases migrating them to other idle systems) when a user interacts directly with the workstation, CONDOR processes these jobs quietly and inconspicuously. Compared to other compute distribution systems, the effort to port an application lies mainly in disabling any interactive interface, since the configuration of jobs is a retro paradigm shift back to the batch processing days, when an organization of input and output files needed to be established.
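The batch-style organization of input and output files described above might look like the following minimal sketch. It gives each job its own directory holding its input file, then builds the text of a submit description with one job per directory. The directory naming, the file names (model.in, model.out, model.err, sweep.log), and the run_model.sh wrapper are hypothetical and not taken from the original workflow. The submit commands used (universe, executable, initialdir, input, output, error, log, queue) are standard CONDOR submit commands, but the manual for the installed version should be checked for how it resolves paths.

```python
import tempfile
from pathlib import Path


def prepare_jobs(root, inputs):
    """Create run_0000, run_0001, ... under root, each with its own input file."""
    job_dirs = []
    for index, text in enumerate(inputs):
        job_dir = Path(root) / f"run_{index:04d}"
        job_dir.mkdir(parents=True, exist_ok=True)
        (job_dir / "model.in").write_text(text)  # hypothetical input file name
        job_dirs.append(job_dir.name)
    return job_dirs


def submit_description(executable, job_dirs):
    """Build submit description text that queues one job per directory."""
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        "log = sweep.log",
    ]
    for job_dir in job_dirs:
        lines += [
            "",
            f"initialdir = {job_dir}",  # the job's files live in this directory
            "input = model.in",         # fed to the job on standard input
            "output = model.out",       # captured standard output
            "error = model.err",        # captured standard error
            "queue",
        ]
    return "\n".join(lines) + "\n"


# Demonstration in a temporary directory with three placeholder inputs.
with tempfile.TemporaryDirectory() as root:
    job_dirs = prepare_jobs(root, ["case A\n", "case B\n", "case C\n"])
    text = submit_description("run_model.sh", job_dirs)  # hypothetical wrapper
    (Path(root) / "sweep.submit").write_text(text)
    print(text)
```

Because nothing in this layout is interactive, any job can be suspended, migrated, or rerun without a user at the keyboard, which is what lets the work ride on idle workstations.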

Over the next several years, CONDOR proved itself not only in our studies of atmospheric compensation, but also in physics-based algorithms for target detection, hyperspectral synthetic image generation (DIRSIG) animation, water remote sensing, sparse aperture imaging system trade studies, and improved fusion techniques for lidar and hyperspectral data sets. With the help of Bob Krzaczek (CIS), Bill Hoagland (CIS) and James Craig (Computer Science), we were able to utilize their idle workstations to increase the throughput of our jobs. Many of these research problems could not have been solved in a timely manner (for our students to graduate) without the use of CONDOR. In nearly all the cases studied, computation under CONDOR yielded run time gains of an order of magnitude. All this was achieved with a small set of workstations, on the order of 100-200 machines depending on the time of day and the week in the academic quarter. Another attractive feature of CONDOR that we have demonstrated is its ability to distribute Commercial Off-the-Shelf (COTS) software across many systems at RIT for processing. As part of a demonstration study, we (Carl Salvaggio, Don McKeown) requested a grant from ITTVis for several hundred licenses of the ENVI/IDL image analysis package. This software package is one of the standard tools for analyzing an array of remote sensing imagery. We had a good working understanding of how to distribute the computations for numerical models with available source code. We did not, however, have any practical experience in distributing commercial software packages under CONDOR. This partnering arrangement paid off greatly for one of our recent Ph.D. graduates, Captain Michael Foster (USAF). His research, a proof of concept for improving target detection using hyperspectral imagery fused with simulated airborne LIDAR data, was a prime example of a problem that could not have been solved without CONDOR.
His processing was implemented in IDL, and the initial runs of his simpler techniques on a subsampled test case were taking a day to process on his workstation. Estimating run times for his more complex techniques was more difficult because those runs were extending into the range of weeks. While some effort to migrate his processing to CONDOR was started in March of 2007, the turning point that cemented the decision to fully utilize CONDOR came when his workstation, in the middle of one of these complex runs, was rebooted because of an automatic operating system update.

With a hard deadline of August 2007 for graduation, we searched campus for additional resources to run his processes under a different operating system. We had been aware of the existence of an IBM cluster (cluster.rit.edu), a 96-node SMP (Symmetric Multiprocessor) machine, which had been targeted at high performance parallel processing jobs. Because of the specialized modifications that are often required for software to run on these types of machines, utilization of these machines is often characterized by intermittent bursts of heavy usage followed by extended periods of idle time. With the proper combination of software configuration and tools, and the help of Rick Bohn, these multiple-CPU machines behaved like any ordinary workstation in the cluster, greatly increasing their utilization. This facilitated the distribution of COTS software throughout this system. The IBM cluster favored MPI (Message Passing Interface) parallel processing jobs over CONDOR jobs, which meant that these specialized jobs would preempt any CONDOR processes, which were then serviced as CPUs became idle. In one case, we had a two-month IDL job running that was preempted by an MPI job for a couple of weeks. What impressed us was the seamless way in which the IDL jobs were reintroduced into the processing queue after the MPI jobs were completed. With the availability of the IBM cluster, along with the resources placed online by Jeremy Szeminksi (COS), Brent Strong, Dave Snyder (CIS), and Paul Mezzannini (RC) under the administration of Dan Rosica and Gurcharan Khanna, Mike Foster successfully defended his dissertation on time.

CONDOR 1000 Project

This success translated into the CONDOR 1000 project, an initiative to place 1000 workstations online in the CONDOR cluster by the end of 2007. That goal was achieved before the December break of the winter quarter.

Efforts are currently underway to establish a usage policy and to configure CONDOR equitably, so that it favors jobs submitted from the departments that directly administer the machines while also integrating the servicing of jobs from the World Community Grid (need a reference to a UNS article), which has been in place on campus since ????. This ensures that RIT-related studies and research have priority in using these campus-wide computational resources, while unused RIT CPU cycles are donated to humanitarian and public service research on AIDS, cancer, global climate, etc.

Keeping these workstations as busy as possible also has the added benefit of monitoring the health of each system, since an idle workstation is easy to spot and can be a symptom of a hardware or system configuration problem. In terms of RIT's focus on sustainability, the CONDOR system adds life to what would traditionally be classified as obsolete workstations. CIS has demonstrated that even systems nearly a decade old can still make significant contributions to research by virtue of the number of CPUs simultaneously working on any given problem. Systems have been saved from electronic recycling to become part of the campus cluster.

This is not to say that CONDOR is without its difficulties. Each problem is unique, and there are invariably nuances that need to be identified and addressed as the process is implemented. With each iteration, documenting our process moves us closer to making it turnkey for students and faculty so that these resources are readily available. This system is in itself a research project, but its availability also opens doors to research endeavors that a decade ago could not have been addressed at RIT. We would like to see the RIT community take advantage of this infrastructure so that we can test the limits of the system and fine-tune the process. As we grow and refine this infrastructure, we hope to offer our collective experience beyond RIT, contribute both knowledge and infrastructure to the different scientific grids (SURAGrid, Open Science Grid), and establish RIT as a significant contributor to scientific computing.

CONDOR has demonstrated what can be achieved when the right combination of technology and cooperative arrangements are in place.

-- RolandoRaqueno - 23 Jan 2008

-- RolandoRaqueno - 01 Feb 2008
