References to online documents about process migration, checkpointing and load balancing


(permanently under construction)

The references are (so far) listed mostly in no particular order.

My own publications .
Bettina Schnor's publications about load balancing strategies, mechanisms, and some more.
The P-Beam home page has just begun to be constructed.
A BibTeX file with references
We have collected/mirrored and put up for anonymous FTP several GB of tech reports, papers, etc. about distributed systems in general, with emphasis on load balancing, scheduling, mapping, and checkpointing.
Slightly outdated information about our research in load balancing and process migration (P-Beam) as well as the list of published papers can be found in our load balancing project home page.
The ``classical'' Condor project. The Condor software is available on ftp.cs.wisc.edu:condor/
And here is the Condor Home Page .
- Matt W. Mutka and Miron Livny: Profiling workstations' available capacity for remote execution . In: Proceedings of the 12-th IFIP WG 7.3 Symposium on Computer Performance , 1987.
- Michael J. Litzkow and Miron Livny and Matt W. Mutka: Condor - A Hunter of Idle Workstations . In Proceedings of the 8th International Conference on Distributed Computing Systems , pages 104--111, IEEE, June 1988.
- Michael Litzkow and Marvin Solomon: Supporting Checkpointing and Process Migration Outside the UNIX Kernel . In Usenix Conference Proceedings , San Francisco, CA, January 1992, pages 283--290. Available on ftp.cs.wisc.edu (228158 bytes) or here on ftp.ibr.cs.tu-bs.de (1195075 bytes).
  Please note that these versions are hopelessly outdated; the Condor team currently distributes version 6.x .
The Rio (RAM I/O) project at Michigan is pleased to release an implementation of the Rio file cache for FreeBSD. The basic idea of Rio is to make memory as safe as disk from operating system crashes. Such "reliable main memory" is useful in a variety of contexts:
* checkpointing: Discount Checking is a checkpointing library ...
Papers describing Rio, Vista, and Discount Checking are available on the Rio web page http://www.eecs.umich.edu/Rio .
K. I. Mandelberg and V. S. Sunderam: Process Migration in UNIX Networks . In Usenix Conference Proceedings , Dallas, TX, February 1988, pages 357--363.
Rafael Alonso and Kriton Kyrimis: A Process Migration Implementation for a Unix System . In Usenix Conference Proceedings , Dallas, TX, February 1988, pages 365--372.
Chad Hunter: Process Cloning: A System for Duplicating UNIX Processes . In Usenix Conference Proceedings , Dallas, TX, February 1988, pages 373--379.
Dan Freedman: Experience Building a Process Migration Subsystem for UNIX . In Usenix Conference Proceedings , Dallas, TX, January 1991, pages 349--356.
D. Eager and E. Lazowska and J. Zahorjan: The Limited Performance Benefits of Migrating Active Processes for Load Sharing . In Conf. on Measurement & Modelling of Comp. Syst., (ACM SIGMETRICS) , May 1988, pages 63--72.
"It is not worth the effort of migrating *active* processes if the sole intention is to increase performance, but it may be worthwhile if there are other objectives eg: free a particular workstation."
Mor Harchol-Balter's Papers , notably A note on 'The Limited Performance Benefits of Migrating Active Processes for Load Sharing' and Exploiting Process Lifetime Distributions for Dynamic Load Balancing .
These papers contradict the above article by Eager et al., using trace-driven simulation based on measured data.
The Sprite operating system project at Berkeley . Only some of the papers about that project are listed here:
- F. Douglis and J. Ousterhout: Process Migration in the Sprite Operating System . In Proceedings of the 7th International Conference on Distributed Computing Systems , 1987, pages 18--25. Available on sprite.berkeley.edu or here on ftp.ibr.cs.tu-bs.de
- Fred Douglis: Experience with Process Migration in Sprite . In Distributed and Multiprocessor Systems Workshop Proceedings , pages 59--72, Fort Lauderdale, FL, October 1989. Available on sprite.berkeley.edu or here on ftp.ibr.cs.tu-bs.de
- F. Douglis and J. Ousterhout: Transparent Process Migration: Design Alternatives and the Sprite Implementation . In Software -- Practice and Experience , volume 21, number 8, pages 757--785, August 1991. Available on sprite.berkeley.edu or here on ftp.ibr.cs.tu-bs.de
Y. Artsy and R. Finkel: Designing a Process Migration Facility: The Charlotte Experience . In Computer , IEEE, September 1989, pages 47--56.
The MOSIX operating system project:
- Amnon Barak and Richard Wheeler: MOSIX: An Integrated Multiprocessor UNIX . In USENIX Conference Proceedings , San Diego, CA, January 1989, pages 101--112.
- Amnon Barak and Shai Guday and Richard G. Wheeler: The MOSIX Distributed Operating System . LNCS 672, Springer, Berlin, 1993.
- The MOSIX project WWW site
The MACH micro-kernel project. Only a few of the many publications about MACH are listed here.
- Dejan S. Milojičić: Load Distribution -- Implementation for the Mach Microkernel . Vieweg, Braunschweig, 1994.
- Michael Blair Jones: Transparently Interposing User Code at the System Interface . Technical Report, CMU, September 1992. on CMU MACH FTP site
- R. Sansom and D. Julin and R. Rashid: Extending a Capability Based System into a Network Environment . Technical Report, Carnegie-Mellon-University, number CMU-CS-86-115, April 1986. on CMU MACH FTP site
GatoStar by Bertil Folliot et al.
Combination of Gato and Star to avoid redundancy between load sharing and fault tolerance. Implemented in library, transparent for applications. On SunOS. Application modelled as tasks with precedence graph. Independent checkpointing with pessimistic message logging. Replication of files via reliable broadcast protocol (ISIS?). Migration through checkpoint/remote restart.
more project info ... and even more
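The GatoStar notes above mention pessimistic message logging. As a rough illustration of that general idea only (not GatoStar's implementation), the toy process below records every incoming message in a log before acting on it, so that after a crash it can restore its last checkpoint and replay the log. The class and method names are hypothetical, and a plain Python list stands in for stable storage.

```python
class LoggedProcess:
    """Toy process whose state is the running sum of received integers."""

    def __init__(self):
        self.state = 0
        self.checkpoint = 0   # last saved state (assumed to survive a crash)
        self.log = []         # stand-in for a message log on stable storage

    def take_checkpoint(self):
        # Save the state; messages received earlier no longer need replaying.
        self.checkpoint = self.state
        self.log.clear()

    def receive(self, msg):
        # Pessimistic logging: record the message BEFORE delivering it.
        self.log.append(msg)
        self.state += msg

    def crash(self):
        # Volatile state is lost; checkpoint and log are kept.
        self.state = None

    def recover(self):
        # Roll back to the checkpoint, then replay logged messages in order.
        self.state = self.checkpoint
        for msg in self.log:
            self.state += msg
```

Because each message is logged before it changes the state, recovery of one process does not force other processes to roll back, which is what makes independent checkpointing workable in this scheme.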
Stardust: An Environment for Parallel Programming on Networks of Heterogeneous Workstations by Gilbert Cabillic and Isabelle Puaut. Provides communication via distributed shared memory and via message passing. Can checkpoint and migrate between heterogeneous machines. Requires applications to be written for its programming library, and the programmer must specify the data types of shared memory areas in the source code. Non-shared data and stack contents are not included in checkpoints. Checkpointing and migration are only possible at synchronization points where the application is in a globally consistent state of its own accord, e.g. when performing a global barrier. Since stack contents cannot be checkpointed, such synchronization points must be in the application's main function.
Nixdorf Targon/32
Fault Tolerance under Unix Paper

From: shap@cobra.cis.upenn.edu (Jonathan Shapiro)
Newsgroups: comp.os.research
Subject: Re: Checkpointing
Date: 16 Dec 1995 20:15:02 GMT
Organization: University of Pennsylvania
Message-ID: <4av9c6$mi9@darkstar.UCSC.EDU>
References: <4anocr$3dl@darkstar.ucsc.edu> <4aq11q$dmo@darkstar.UCSC.EDU>

You might also want to check out the KeyKOS home page:

http://www.cis.upenn.edu/~KeyKOS

They have an extremely light-weight global checkpoint mechanism.
A paper describing it and various others on the system can be
found from that home page.

To sum up in technical terms, Tunes is a project to replace existing Operating Systems, Languages, and User Interfaces by a completely rethought Computing System, based on a correctness-proof-secure higher-order reflective self-extensible fine-grained distributed persistent fault-tolerant version-aware decentralized (no-kernel) object system. [..] Tunes is a recursive acronym for: "Tunes is a Useful, Not Expedient, System".

From: pstephan+@RUBIX.MC.CS.CMU.EDU (Peter Stephan)
Newsgroups: comp.parallel.pvm
Subject: ANNOUNCE: Release of Dome (version 1.0)
Date: 23 May 1996 17:08:58 GMT
Organization: Carnegie Mellon University
Message-ID: <4o263a$foj@cantaloupe.srv.cs.cmu.edu>

-------------------------------------------------------------------
                     Announcing the release of 

                              Dome 
                           version 1.0

               (Distributed object migration environment)
-------------------------------------------------------------------

Overview
--------
Dome, the Distributed object migration environment, provides a C++ 
library of distributed objects for parallel programming.  These 
objects perform dynamic load balancing and support fault tolerance.  
Programmers using Dome can, with modest effort, write parallel
programs that are automatically distributed over a heterogeneous
network, dynamically load balanced as the program runs, and able to
survive compute node and network failures.  Thus, Dome provides a
means for writing simple and efficient distributed programs.

The focus of the Dome system is to support parallel programming over
networks of workstations.  Dome's load balancing and fault tolerance
play an integral role in producing efficient and survivable parallel
programs in such an environment.  Dome uses a single program multiple
data (SPMD) model to perform the parallelization of programs which
use the Dome library, and Dome uses PVM to provide its underlying 
process control and message passing.

The Dome system is available in a package via anonymous ftp.  The
package includes the Dome source code, makefiles, related build
scripts, documentation, and example programs.  To obtain the Dome
package login via anonymous ftp to ftp.cs.cmu.edu.  The directory 
project/dome will contain the file dome1.0.tar.Z and a README file.  
The dome1.0.tar.Z file contains the Dome system in compressed, tar 
format.

More information on the Dome project is available at 

http://www.cs.cmu.edu/~Dome

The authors of Dome can be contacted at dome-help@cs.cmu.edu.

-------------------------------------------------------------------
*   Dome version 1.0:  Distributed object migration environment   *
*                   Carnegie Mellon University                    *
*   Authors:  J. Arabe, A. Beguelin, B. Lowekamp, E. Seligman,    *
*         S. Simon, M. Starkey, P. Stephan, and K. Walker         *
*                 (C) 1996 All Rights Reserved                    *
-------------------------------------------------------------------

S.J. Leffler and M.K. McKusick and M.J. Karels and J.S. Quarterman: The Design and Implementation of the 4.3 BSD UNIX Operating System . Addison-Wesley, 1988.
M.K. McKusick and K. Bostic and M.J. Karels and J.S. Quarterman: The Design and Implementation of the 4.4 BSD UNIX Operating System . Addison-Wesley, 1996, ISBN 0-201-54979-4.
Pankaj Mehra: Automated Learning of Load-Balancing Strategies for a Distributed Computer System . PhD Thesis, University of Illinois at Urbana-Champaign, 1993. Available here on ftp.ibr.cs.tu-bs.de ( Directory with online thesis and papers)
Thomas Ludwig: Automatische Lastverteilung für Parallelrechner . BI-Wissenschaftsverlag, Reihe Informatik, 1993 (in German).
"Detailed presentation of load measurement and load evaluation. Experimental environment implemented on the iPSC/2. Process migration implemented as in Accent (copy-on-reference). Presents measurements using the Mandelbrot set as an example."
SunOS Network Programming Guide . Revision A, Sun Microsystems, March 1990.
SunOS Reference Manual . Revision A, Sun Microsystems, 1990.
Roman Zajcew and Paul Roy and David Black and Chris Peak and Paulo Guedes and Bradford Kemp and John LoVerso and Michael Leibensperger and Michael Barnett and Faramarz Rabii and Durriya Netterwala: An OSF/1 UNIX for Massively Parallel Multicomputers . In Usenix Conference Proceedings , pages 449--468, San Diego, CA, January 1993.
David K. Gifford: Weighted voting for replicated data . In Proceedings of the 7th ACM Symposium on Operating System Principles , pages 150--162, December 1979.
Nitin H. Vaidya: Another Two-Level Failure Recovery Scheme: Performance Impact of Checkpoint Placement and Checkpoint Latency . Technical Report, Texas A&M University, College Station, TX 77843-3112, December 1994, number 94-068. vaidya@cs.tamu.edu , available on ftp.cs.tamu.edu or here on ftp.ibr.cs.tu-bs.de
Juan León and Allan L. Fisher and Peter Steenkiste: Fail-Safe PVM: A Portable Package for Distributed Programming with Transparent Recovery . Technical Report, Carnegie Mellon University, Pittsburgh, PA, February 1993, number CMU-CS-93-124. Juan.Leon@cs.cmu.edu , available on reports.adm.cs.cmu.edu or here on ftp.ibr.cs.tu-bs.de
Geert Deconinck and Johan Vounckx and Rudi Cuyvers and Rudy Lauwereins: Survey of Checkpointing and Rollback Techniques . Technical Report, Katholieke Universiteit Leuven, Belgium, June 1993. geert.deconinck@esat.kuleuven.ac.be , 93-04.ps
H. Langendörfer and B. Schnor: Verteilte Systeme . Hanser, München, 1994.
Stefan Stille: Lastbalancierung in verteilten Systemen . Master's Thesis, Institut für Betriebssysteme und Rechnerverbund, TU Braunschweig, 1993 (in German). (yalb-stille93.ps.gz)
Henrik Carlsson: Konfiguration von Lastbalancierungssystemen . Master's Thesis, Institut für Betriebssysteme und Rechnerverbund, TU Braunschweig, 1994 (in German). (yalb-carlsson94.ps.gz)
Jens Steinborn: Globale konsistente Checkpoints für verteilte Anwendungen in Workstation Clustern . Master's Thesis, Institut für Betriebssysteme und Rechnerverbund, TU Braunschweig, 1996 (in German). (pbeam-steinborn96.ps.gz) (180935 Bytes)
Sabine Denecke: Entwurf und Implementierung einer adaptiven Komponente in Objektmigrationssystemen mit Hilfe Neuronaler Netze . Master's Thesis, Institut für Betriebssysteme und Rechnerverbund, Uni Hildesheim, February 1994 (in German).
Y. Miyata: A User's Guide to PlaNet . University of Colorado, Boulder, 1991.
Songnian Zhou and Jingwen Wang and Xiaohu Zheng and Pierre Delisle: UTOPIA: A Load Sharing Facility for Large, Heterogeneous Distributed Computer Systems . Technical Report, CSRI, University of Toronto, number CSRI-257, April 1992. ??
Adam de Boor: Customs -- A Load Balancing System , Term paper for CS262, Fall 1987. Query archie for customs
D. Nauck and F. Klawonn and R. Kruse: Neuronale Netze und Fuzzy-Systeme . Vieweg, 1994. a collection of corresponding papers
B. Schnor and H. Langendörfer and S. Petri: Einsatz neuronaler Netze zur Lastbalancierung in Workstationclustern . In Praxisorientierte Parallelverarbeitung , Ed. H. Langendörfer, Hanser, München, pages 154--165, October 1994. bs3neuro
James S. Plank and the Checkpointing Research Group at the University of Tennessee, Knoxville. Here are their Papers .
One of the Papers is:
James S. Plank and Micah Beck and Gerry Kingsley and Kai Li: Libckpt: Transparent Checkpointing under Unix . In Usenix Conference Proceedings , New Orleans, January 1995. plank.html
"Checkpointing is a simple technique for rollback recovery: the state of an executing program is periodically saved to a disk file from which it can be recovered after a failure. While recent research has developed a collection of powerful techniques for minimizing the overhead of writing checkpoint files, checkpointing remains unavailable to most application developers. In this paper we describe libckpt, a portable checkpointing tool for Unix that implements all applicable performance optimizations which are reported in the literature. While libckpt can be used in a mode which is almost totally transparent to the programmer, it also supports the incorporation of user directives into the creation of checkpoints. This ``user-directed'' checkpointing is an innovation which is unique to our work."
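The basic save-and-recover cycle described in the abstract above can be sketched in a few lines. This is only an application-level illustration with hypothetical function names; it is not libckpt's interface, and it does not attempt the transparent, whole-process checkpointing that libckpt provides. The program state here is just a small dictionary that the application saves explicitly.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to path atomically so a crash never leaves a torn file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run(path, n_steps, interval, crash_at=None):
    """Sum 0..n_steps-1, saving a checkpoint every `interval` steps."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < n_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` again with the same path after a failure resumes from the last saved step rather than from zero; work done between the last checkpoint and the failure is repeated.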
Codine from Genias Software GmbH, Regensburg, Germany.
Georg Stellner: Consistent Checkpoints of PVM Applications . In Proceedings of the First European PVM User Group Meeting , 1994. EPVMUG94.ps
You might also have a look at the CoCheck WWW homepage
Georg Stellner's Bookmarks
Resource Management

Batch Queueing Systems
Condor Homepage
DQS Distributed Queueing System
Dr. Samuel H. Russ (Home Page) --- Process Migration
FWI Parallel Scientific Computing and Simulation (DynamicPVM)
Hibernator II Data Sheet
HOPE: Hopefully Optimistic Programming Environment
IBM LoadLeveler (IBM)
LoadLeveler 1.2 Batch Queuing System (LRZ)
MIST Scheduler
Papers about fault tolerance and checkpoint/restart (ZDV-Parallel)
References on Job Scheduling
References to online documents about load balancing and process migration
Queueing and Scheduling Page
SFB 342: Querschnittsthema Q4 (ALV)
The MOSIX Multicomputer System
The Scalable I/O Project
Warp web on checkpointing
FADI : A Fault-Tolerant Environment for Distributed Processing Systems
DynamicPVM

DynamicPVM & DynamicMPI
MIST
Jonathan M. Smith: A Survey of Process Migration Mechanisms . In Operating Systems Review , volume 22, number 3, July 1988, pages 28--40, ACM.
G. Popek and B. Walker and J. Chow and D. Edwards and C. Kline and G. Rudisin and G. Thiel: LOCUS: A Network Transparent, High Reliability Distributed System . In Operating Systems Review , volume 15, number 5, December 1981, pages 169--177.
J. Ju and G. Xu and K. Yang: An Intelligent Dynamic Load Balancer for Workstation Clusters . In Operating Systems Review , volume 29, number 1, pages 7--16, January 1995.
M. Cena and M. L. Crespo and R. Gallard: Transparent Remote Execution in LAHNOS by Means of a Neural Networking Device . In Operating Systems Review , volume 29, number 1, pages 17--28, January 1995.
The TACOMA project, Department of Computer Science, University of Tromsø, Norway:
- Dag Johansen, Robbert van Renesse and Fred B. Schneider: Operating system support for mobile agents . In: Proceedings of the 5th IEEE Workshop on Hot Topics in Operating Systems , Orcas Island, WA, USA (4th-5th May, 1995). Published by: IEEE Computer Society, NY, USA, May 1995.
  Also available as Technical Report TR94-1468 , Department of Computer Science, Cornell University, USA, November 1994.
- Dag Johansen and Robbert van Renesse and Fred B. Schneider: An Introduction to the TACOMA Distributed System Version 1.0 . Also available as Technical Report 95-23 . Department of Computer Science, University of Tromsø, Norway, June 1995.
Internet Parallel Computing Archive, Funded by JISC NTSC, Hosted at HENSA Unix (JISC funded) .

Some references about migration in heterogeneous environments:

FLASH: Flexible Agent System for Heterogeneous Cluster
FLASH (Flexible Agent System for Heterogeneous Cluster) is an agent-based framework for the creation of load-balanced distributed applications running on heterogeneous cluster systems. It offers the possibility to transfer subtasks of a parallel application to mobile agents, which travel autonomously through a network searching for free resources.

Implemented in Java (with support utilities in several other languages).
The Tui Heterogeneous Process Migration System
Tui is a general-purpose migration package that allows processes to be migrated between machines of different architectures. That is, we can start a program executing on a 386-based machine, and during its execution, move it to a SPARC machine. Very few systems that I [the Tui author, Peter Smith] know of allow a process to be moved between machines of differing architecture, but since these generally work only for type-safe languages, this is not too hard to achieve. Tui is able to migrate programs that are written in more common languages (such as C), which have horrible type-unsafe features that make it hard to locate data within a process.
Porch - The Portable Checkpoint Compiler
Porch is a source-to-source compiler that translates C programs into semantically equivalent C programs which are capable of saving and recovering from portable checkpoints. These portable checkpoints can be transferred in heterogeneous computer networks such as the Internet, and may be restarted on binary incompatible machines. Porch provides a simple way to enhance programs by adding functionalities for portability and fault-tolerance.

Version 1.0 can be downloaded from the Porch home page.
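One ingredient of a checkpoint that can be restarted on a binary-incompatible machine is a data encoding that does not depend on the host's byte order or word size. The sketch below illustrates only that ingredient, using a hypothetical two-field state and Python's `struct` module; Porch itself works by source-to-source compilation of C programs, which is not shown here.

```python
import struct

# Hypothetical fixed layout: a 64-bit signed step counter followed by a
# 64-bit IEEE double. "!" selects network (big-endian) byte order and
# standard field sizes, independent of the machine running the code.
LAYOUT = struct.Struct("!qd")


def encode_checkpoint(step, value):
    """Encode the state into a machine-independent byte string."""
    return LAYOUT.pack(step, value)


def decode_checkpoint(blob):
    """Decode a byte string produced by encode_checkpoint."""
    step, value = LAYOUT.unpack(blob)
    return step, value
```

A blob written this way has the same 16 bytes on every platform, so it can be copied to a machine with a different architecture and decoded there.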

And some references that I have just collected, but not (yet) checked out:

Sabina Rips has done load balancing in a parallel Prolog implementation on top of PVM.
The online DSM bibliography of M. Rasit Eskicioglu.
SAM - Distributed Shared Memory System
```
     Newsgroups: comp.parallel
     From: itf@mcs.anl.gov (Ian Foster)
     Subject: Mirror Sites for DESIGNING & BUILDING PARALLEL PROGRAMS
     Message-ID: <81687650327264@dalek.mcs.anl.gov>
     Organization: Math and Computer Science, Argonne National Laboratory
     Date: Mon, 20 Nov 1995 14:08:23 GMT
     
```
Many of you have seen the text, "Designing and Building Parallel Programs", available both from Addison-Wesley and (thanks to A-W's enlightened publishing policies) on the Web. I'm glad to announce that the online version is now also available at two mirror sites:

http://www.cs.rdg.ac.uk/dbpp/
http://www.qpsf.edu.au/mirrors/dbpp/

Thanks a lot to Jonathan Chin and Paul Pritchard for making these available. Additional mirror sites will probably be added in the future; these will be listed at:

http://www.mcs.anl.gov/dbpp/mirror_sites.html

Happy reading!

Ian Foster.

Designing and Building Parallel Programs
Ian Foster
Addison-Wesley, 1995
ISBN 0-201-57594-9

(last edited: Thu Jun 24 11:14:28 MET DST 1999)

petri@iti.mu-luebeck.de

Stefan's Homepage
