References to online documents about process migration, checkpointing and load balancing

(permanently under construction)
The references are (so far) listed in (mostly) no particular order.
  • My own publications .
  • Bettina Schnor's publications about load balancing strategies, mechanisms, and some more.
  • The P-Beam home page has just begun to be constructed.

  • A BibTeX file with references

    We have collected/mirrored and put up for anon FTP some GB of tech reports, papers, etc. about distributed systems in general, with emphasis on load balancing, scheduling, mapping, checkpointing.

    Slightly outdated information about our research in load balancing and process migration (P-Beam) as well as the list of published papers can be found in our load balancing project home page.

  • The ``classical'' Condor project. The Condor software is available on ftp.cs.wisc.edu:condor/
    And here is the Condor Home Page .

    • Matt W. Mutka and Miron Livny: Profiling workstations' available capacity for remote execution . In: Proceedings of the 12th IFIP WG 7.3 Symposium on Computer Performance , 1987.

    • Michael J. Litzkow and Miron Livny and Matt W. Mutka: Condor - A Hunter of Idle Workstations . In Proceedings of the 8th International Conference on Distributed Computing Systems , pages 104--111, IEEE, June 1988.
    • Michael Litzkow and Marvin Solomon: Supporting Checkpointing and Process Migration Outside the UNIX Kernel . In Usenix Conference Proceedings , San Francisco, CA, January 1992, pages 283--290. Available on ftp.cs.wisc.edu (228158 bytes) or here on ftp.ibr.cs.tu-bs.de (1195075 bytes).
      Please note that these versions are hopelessly outdated; the Condor team currently distributes version 6.x .

  • The Rio (RAM I/O) project at Michigan is pleased to release an implementation of the Rio file cache for FreeBSD. The basic idea of Rio is to make memory as safe as disk from operating system crashes. Such "reliable main memory" is useful in a variety of contexts:
    * checkpointing: Discount Checking is a checkpointing library ...
    Papers describing Rio, Vista, and Discount Checking are available on the Rio web page http://www.eecs.umich.edu/Rio .
  • K. I. Mandelberg and V. S. Sunderam: Process Migration in UNIX Networks . In Usenix Conference Proceedings , Dallas, TX, February 1988, pages 357--363.

  • Rafael Alonso and Kriton Kyrimis: A Process Migration Implementation for a Unix System . In Usenix Conference Proceedings , Dallas, TX, February 1988, pages 365--372.

  • Chad Hunter: Process Cloning: A System for Duplicating UNIX Processes . In Usenix Conference Proceedings , Dallas, TX, February 1988, pages 373--379.

  • Dan Freedman: Experience Building a Process Migration Subsystem for UNIX . In Usenix Conference Proceedings , Dallas, TX, January 1991, pages 349--356.

  • D. Eager and E. Lazowska and J. Zahorjan: The Limited Performance Benefits of Migrating Active Processes for Load Sharing . In Conf. on Measurement & Modelling of Comp. Syst., (ACM SIGMETRICS) , May 1988, pages 63--72.
    "It is not worth the effort of migrating *active* processes if the sole intention is to increase performance, but it may be worthwhile if there are other objectives eg: free a particular workstation."

  • Mor Harchol-Balter's Papers , notably A note on 'The Limited Performance Benefits of Migrating Active Processes for Load Sharing' and Exploiting Process Lifetime Distributions for Dynamic Load Balancing .
    These papers contradict the above article by Eager et al., using trace-driven simulation based on measured data.

  • The Sprite operating system project at Berkeley . Only some of the papers about that project are listed here:

    • F. Douglis and J. Ousterhout: Process Migration in the Sprite Operating System . In Proceedings of the 7th International Conference on Distributed Computing Systems , 1987, pages 18--25. Available on sprite.berkeley.edu or here on ftp.ibr.cs.tu-bs.de

    • Fred Douglis: Experience with Process Migration in Sprite . In Distributed and Multiprocessor Systems Workshop Proceedings , pages 59--72, Fort Lauderdale, FL, October 1989. Available on sprite.berkeley.edu or here on ftp.ibr.cs.tu-bs.de

    • F. Douglis and J. Ousterhout: Transparent Process Migration: Design Alternatives and the Sprite Implementation . In Software -- Practice and Experience , volume 21, number 8, pages 757--785, August 1991. Available on sprite.berkeley.edu or here on ftp.ibr.cs.tu-bs.de

  • Y. Artsy and R. Finkel: Designing a Process Migration Facility: The Charlotte Experience . In Computer , IEEE, September 1989, pages 47--56.

  • The MOSIX operating system project:

    • Amnon Barak and Richard Wheeler: MOSIX: An Integrated Multiprocessor UNIX . In USENIX Conference Proceedings , San Diego, CA, January 1989, pages 101--112.

    • Amnon Barak and Shai Guday and Richard G. Wheeler: The MOSIX Distributed Operating System . LNCS 672, Springer, Berlin, 1993.

    • The MOSIX project WWW site

  • The MACH micro-kernel project. Only a few of the many publications about MACH are listed here.

    • Dejan S. Milojičić: Load Distribution -- Implementation for the Mach Microkernel . Vieweg, Braunschweig, 1994.

    • Michael Blair Jones: Transparently Interposing User Code at the System Interface . Technical Report, CMU, September 1992. on CMU MACH FTP site

    • R. Sansom and D. Julin and R. Rashid: Extending a Capability Based System into a Network Environment . Technical Report, Carnegie-Mellon-University, number CMU-CS-86-115, April 1986. on CMU MACH FTP site

  • GatoStar by Bertil Folliot et al.
    Combination of Gato and Star to avoid redundancy between load sharing and fault tolerance. Implemented in library, transparent for applications. On SunOS. Application modelled as tasks with precedence graph. Independent checkpointing with pessimistic message logging. Replication of files via reliable broadcast protocol (ISIS?). Migration through checkpoint/remote restart.
    more project info ... and even more
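
The annotation above mentions independent checkpointing with pessimistic message logging. As a rough illustration of that general technique (this is not GatoStar's code; the class and method names are invented for this sketch), each incoming message is appended to a log before it is processed, so that after a failure the process can be restored from its last checkpoint and the logged messages replayed. The sketch assumes a deterministic message handler and uses an in-memory list in place of stable storage.

```python
import copy


class LoggedProcess:
    """Toy process with a local checkpoint and a pessimistic message log."""

    def __init__(self, handler, initial_state):
        self.handler = handler                  # deterministic: (state, msg) -> new state
        self.state = initial_state
        self.checkpoint = copy.deepcopy(initial_state)
        self.log = []                           # stands in for stable storage

    def deliver(self, msg):
        self.log.append(msg)                    # log first (pessimistic) ...
        self.state = self.handler(self.state, msg)  # ... then process

    def take_checkpoint(self):
        # Independent checkpoint: no coordination with other processes.
        self.checkpoint = copy.deepcopy(self.state)
        self.log.clear()                        # older messages are now covered

    def recover(self):
        # Restore the checkpoint, then replay every message logged since.
        self.state = copy.deepcopy(self.checkpoint)
        for msg in self.log:
            self.state = self.handler(self.state, msg)
```

Because every message is logged before it can influence the state, recovery of one process does not force other processes to roll back; the price is a log write on every delivery.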

  • Stardust: An Environment for Parallel Programming on Networks of Heterogeneous Workstations by Gilbert Cabillic and Isabelle Puaut. Provides communication via Distributed Shared Memory and via Message Passing. Can checkpoint and migrate between heterogeneous machines. Requires the applications to be written for the programming library, and the data types of shared memory areas must be specified by the programmer in the source code. Non-shared data and stack contents are not contained in checkpoints. Checkpointing/migration is only possible at synchronization points where the application is in a globally consistent state of its own accord, e.g. when performing a global barrier. Since stack contents cannot be checkpointed, such synchronization points must be in the application's main function.
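
One way to see why declared data types matter for heterogeneous checkpoints: if the layout of a shared region is known, it can be written in a machine-independent encoding and decoded on a host with a different byte order or alignment. The following is only a sketch of that idea, not Stardust's actual format; the `LAYOUT` table and the `encode`/`decode` names are hypothetical.

```python
import struct

# Hypothetical type declaration for one shared region: (field name, struct code).
LAYOUT = [("iteration", "i"), ("residual", "d"), ("flags", "H")]


def _format(layout):
    # "!" selects network byte order with standard sizes and no padding,
    # so the encoding does not depend on the host architecture.
    return "!" + "".join(code for _, code in layout)


def encode(values, layout=LAYOUT):
    """Serialize the declared fields into an architecture-neutral byte string."""
    return struct.pack(_format(layout), *(values[name] for name, _ in layout))


def decode(blob, layout=LAYOUT):
    """Rebuild the field dictionary from bytes produced by encode()."""
    fields = struct.unpack(_format(layout), blob)
    return dict(zip((name for name, _ in layout), fields))
```

For example, `decode(encode({"iteration": 7, "residual": 0.5, "flags": 3}))` returns the same dictionary regardless of the byte order of the machine that wrote the bytes.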

  • Nixdorf Targon/32
    Fault Tolerance under Unix Paper

  • From: shap@cobra.cis.upenn.edu (Jonathan Shapiro)
    Newsgroups: comp.os.research
    Subject: Re: Checkpointing
    Date: 16 Dec 1995 20:15:02 GMT
    Organization: University of Pennsylvania
    Message-ID: <4av9c6$mi9@darkstar.UCSC.EDU>
    References: <4anocr$3dl@darkstar.ucsc.edu> <4aq11q$dmo@darkstar.UCSC.EDU>
    
    You might also want to check out the KeyKOS home page:
    
            
    http://www.cis.upenn.edu/~KeyKOS
    
    
    They have an extremely light-weight global checkpoint mechanism.
    A paper describing it and various others on the system can be
    found from that home page.
    
  • To sum up in technical terms, Tunes is a project to replace existing Operating Systems, Languages, and User Interfaces by a completely rethought Computing System, based on a correctness-proof-secure higher-order reflective self-extensible fine-grained distributed persistent fault-tolerant version-aware decentralized (no-kernel) object system. [..] Tunes is a recursive acronym for: "Tunes is a Useful, Not Expedient, System".

  • From: pstephan+@RUBIX.MC.CS.CMU.EDU (Peter Stephan)
    Newsgroups: comp.parallel.pvm
    Subject: ANNOUNCE: Release of Dome (version 1.0)
    Date: 23 May 1996 17:08:58 GMT
    Organization: Carnegie Mellon University
    Message-ID: <4o263a$foj@cantaloupe.srv.cs.cmu.edu>
    
    -------------------------------------------------------------------
                         Announcing the release of 
    
                                  Dome 
                               version 1.0
    
                   (Distributed object migration environment)
    -------------------------------------------------------------------
    
    Overview
    --------
    Dome, the Distributed object migration environment, provides a C++ 
    library of distributed objects for parallel programming.  These 
    objects perform dynamic load balancing and support fault tolerance.  
    Programmers using Dome can, with modest effort, write parallel
    programs that are automatically distributed over a heterogeneous
    network, dynamically load balanced as the program runs, and able to
    survive compute node and network failures.  Thus, Dome provides a
    means for writing simple and efficient distributed programs.
    
    The focus of the Dome system is to support parallel programming over
    networks of workstations.  Dome's load balancing and fault tolerance
    play an integral role in producing efficient and survivable parallel
    programs in such an environment.  Dome uses a single program multiple
    data (SPMD) model to perform the parallelization of programs which
    use the Dome library, and Dome uses PVM to provide its underlying 
    process control and message passing.
    
    The Dome system is available in a package via anonymous ftp.  The
    package includes the Dome source code, makefiles, related build
    scripts, documentation, and example programs.  To obtain the Dome
    package login via anonymous ftp to ftp.cs.cmu.edu.  The directory 
    project/dome will contain the file dome1.0.tar.Z and a README file.  
    The dome1.0.tar.Z file contains the Dome system in compressed, tar 
    format.
    
    More information on the Dome project is available at 
    
    http://www.cs.cmu.edu/~Dome

    The authors of Dome can be contacted at dome-help@cs.cmu.edu .

    -------------------------------------------------------------------
    * Dome version 1.0: Distributed object migration environment      *
    * Carnegie Mellon University                                      *
    * Authors: J. Arabe, A. Beguelin, B. Lowekamp, E. Seligman,       *
    *          S. Simon, M. Starkey, P. Stephan, and K. Walker        *
    * (C) 1996 All Rights Reserved                                    *
    -------------------------------------------------------------------

  • S.J. Leffler and M.K. McKusick and M.J. Karels and J.S. Quarterman: The Design and Implementation of the 4.3 BSD UNIX Operating System . Addison-Wesley, 1988.

  • M.K. McKusick and K. Bostic and M.J. Karels and J.S. Quarterman: The Design and Implementation of the 4.4 BSD UNIX Operating System . Addison-Wesley, 1996, ISBN 0-201-54979-4.

  • Pankaj Mehra: Automated Learning of Load-Balancing Strategies for a Distributed Computer System . PhD Thesis, University of Illinois at Urbana-Champaign, 1993. Available here on ftp.ibr.cs.tu-bs.de ( Directory with online thesis and papers)

  • Thomas Ludwig: Automatische Lastverteilung für Parallelrechner . BI-Wissenschaftsverlag, Reihe Informatik, 1993 (in German).
    "Detailed treatment of load measurement and load evaluation. Experimental environment implemented on the iPSC/2. Process migration implemented as in Accent (copy-on-reference). Measurements presented using the Mandelbrot set as an example."

  • SunOS Network Programming Guide . Revision A, Sun Microsystems, March 1990.

  • SunOS Reference Manual . Revision A, Sun Microsystems, 1990.

  • Roman Zajcew and Paul Roy and David Black and Chris Peak and Paulo Guedes and Bradford Kemp and John LoVerso and Michael Leibensperger and Michael Barnett and Faramarz Rabii and Durriya Netterwala: An OSF/1 UNIX for Massively Parallel Multicomputers . In Usenix Conference Proceedings , pages 449--468, San Diego, CA, January 1993.

  • David K. Gifford: Weighted voting for replicated data . In Proceedings of the 7th ACM Symposium on Operating System Principles , pages 150--162, December 1979.

  • Nitin H. Vaidya: Another Two-Level Failure Recovery Scheme: Performance Impact of Checkpoint Placement and Checkpoint Latency . Technical Report, Texas A&M University, College Station, TX 77843-3112, December 1994, number 94-068. vaidya@cs.tamu.edu , available on ftp.cs.tamu.edu or here on ftp.ibr.cs.tu-bs.de

  • Juan León and Allan L. Fisher and Peter Steenkiste: Fail-safe PVM: A portable package for distributed programming with Transparent Recovery . Technical Report, Carnegie Mellon University, Pittsburgh, PA, February 1993, number CMU-CS-93-124. Juan.Leon@cs.cmu.edu , available on reports.adm.cs.cmu.edu or here on ftp.ibr.cs.tu-bs.de

  • Geert Deconinck and Johan Vounckx and Rudi Cuyvers and Rudy Lauwereins: Survey of Checkpointing and Rollback Techniques . Technical Report, Katholieke Universiteit Leuven, Belgium, June 1993. geert.deconinck@esat.kuleuven.ac.be , 93-04.ps

  • H. Langendörfer and B. Schnor: Verteilte Systeme . Hanser, München, 1994.

  • Stefan Stille: Lastbalancierung in verteilten Systemen . Master's Thesis, Institut für Betriebssysteme und Rechnerverbund, TU Braunschweig, 1993 (in German). (yalb-stille93.ps.gz)

  • Henrik Carlsson: Konfiguration von Lastbalancierungssystemen . Master's Thesis, Institut für Betriebssysteme und Rechnerverbund, TU Braunschweig, 1994 (in German). (yalb-carlsson94.ps.gz)

  • Jens Steinborn: Globale konsistente Checkpoints für verteilte Anwendungen in Workstation Clustern . Master's Thesis, Institut für Betriebssysteme und Rechnerverbund, TU Braunschweig, 1996 (in German). (pbeam-steinborn96.ps.gz) (180935 Bytes)

  • Sabine Denecke: Entwurf und Implementierung einer adaptiven Komponente in Objektmigrationssystemen mit Hilfe Neuronaler Netze . Master's Thesis, Institut für Betriebssysteme und Rechnerverbund, Uni Hildesheim, February 1994 (in German).

  • Y. Miyata: A User's Guide to PlaNet . University of Colorado, Boulder, 1991.

  • Songnian Zhou and Jingwen Wang and Xiaohu Zheng and Pierre Delisle: UTOPIA: A Load Sharing Facility for Large, Heterogeneous Distributed Computer Systems . Technical Report, CSRI, University of Toronto, number CSRI-257, April 1992. ??

  • Adam de Boor: Customs -- A Load Balancing System , Term paper for CS262, Fall 1987. Query archie for customs

  • D. Nauck and F. Klawonn and R. Kruse: Neuronale Netze und Fuzzy-Systeme . Vieweg, 1994. a collection of corresponding papers

  • B. Schnor and H. Langendörfer and S. Petri: Einsatz neuronaler Netze zur Lastbalancierung in Workstationclustern . In Praxisorientierte Parallelverarbeitung , Ed. H. Langendörfer, Hanser, München, pages 154--165, October 1994. bs3neuro

  • James S. Plank and the Checkpointing Research Group at the University of Tennessee, Knoxville. Here are their Papers .

    One of the Papers is:
    James S. Plank and Micah Beck and Gerry Kingsley and Kai Li: Libckpt: Transparent Checkpointing under Unix . In Usenix Conference Proceedings , New Orleans, January 1995. plank.html
    "Checkpointing is a simple technique for rollback recovery: the state of an executing program is periodically saved to a disk file from which it can be recovered after a failure. While recent research has developed a collection of powerful techniques for minimizing the overhead of writing checkpoint files, checkpointing remains unavailable to most application developers. In this paper we describe libckpt, a portable checkpointing tool for Unix that implements all applicable performance optimizations which are reported in the literature. While libckpt can be used in a mode which is almost totally transparent to the programmer, it also supports the incorporation of user directives into the creation of checkpoints. This ``user-directed'' checkpointing is an innovation which is unique to our work."

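The abstract above describes checkpointing as periodically saving program state to a disk file so that execution can resume from it after a failure. A minimal application-level sketch of that loop follows; it is not libckpt (which checkpoints the process image itself), and the function names, the state layout, and the `fail_at` test hook are assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: a crash during the write leaves the old file intact."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return None


def run(total_steps, path, interval, fail_at=None):
    """Sum 0..total_steps-1, saving a checkpoint every `interval` steps.

    `fail_at` simulates a crash just before that step (hypothetical test hook).
    """
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Calling `run` again with the same checkpoint path after a failure resumes from the last saved step instead of starting over; work done since that checkpoint is repeated, which is the trade-off the checkpoint interval controls.
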
  • Codine from Genias Software GmbH, Regensburg, Germany.
  • Georg Stellner: Consistent Checkpoints of PVM Applications . In Proceedings of the First European PVM User Group Meeting , 1994. EPVMUG94.ps
    You might also have a look at the CoCheck WWW homepage

  • Georg Stellner's Bookmarks

    Resource Management

    Batch Queueing Systems
    Condor Homepage
    DQS Distributed Queueing System
    Dr. Samuel H. Russ (Home Page) --- Process Migration
    FWI Parallel Scientific Computing and Simulation (DynamicPVM)
    Hibernator II Data Sheet
    HOPE: Hopefully Optimistic Programming Environment
    IBM LoadLeveler (IBM)
    LoadLeveler 1.2 Batch Queuing System (LRZ)
    MIST Scheduler
    Papers about fault tolerance and checkpoint/restart (ZDV-Parallel)
    References on Job Scheduling
    References to online documents about load balancing and process migration
    Queueing and Scheduling Page
    SFB 342: Querschnittsthema Q4 (ALV)
    The MOSIX Multicomputer System
    The Scalable I/O Project
    Warp web on checkpointing

  • FADI : A Fault-Tolerant Environment for Distributed Processing Systems

  • DynamicPVM

    DynamicPVM & DynamicMPI

  • MIST

  • Jonathan M. Smith: A Survey of Process Migration Mechanisms . In Operating Systems Review , volume 22, number 3, July 1988, pages 28--40, ACM.

  • G. Popek and B. Walker and J. Chow and D. Edwards and C. Kline and G. Rudisin and G. Thiel: LOCUS: A Network Transparent, High Reliability Distributed System . In Operating Systems Review , volume 15, number 5, December 1981, pages 169--177.

  • J. Ju and G. Xu and K. Yang: An Intelligent Dynamic Load Balancer for Workstation Clusters . In Operating Systems Review , volume 29, number 1, pages 7--16, January 1995.

  • M. Cena and M. L. Crespo and R. Gallard: Transparent Remote Execution in LAHNOS by Means of a Neural Networking Device . In Operating Systems Review , volume 29, number 1, pages 17--28, January 1995.

  • The TACOMA project, Department of Computer Science, University of Tromsø, Norway:

    • Dag Johansen, Robbert van Renesse and Fred B. Schneider: Operating system support for mobile agents . In: Proceedings of the 5th IEEE Workshop on Hot Topics in Operating Systems , Orcas Island, WA, USA (4th-5th May, 1995). Published by: IEEE Computer Society, NY, USA, May 1995.

      Also available as Technical Report TR94-1468 , Department of Computer Science, Cornell University, USA, November 1994.

    • Dag Johansen and Robbert van Renesse and Fred B. Schneider: An Introduction to the TACOMA Distributed System Version 1.0 . Also available as Technical Report 95-23 . Department of Computer Science, University of Tromsø, Norway, June 1995.

  • Internet Parallel Computing Archive, Funded by JISC NTSC, Hosted at HENSA Unix (JISC funded) .
Some references about migration in heterogeneous environments:
  • FLASH: Flexible Agent System for Heterogeneous Cluster
    FLASH (Flexible Agent System for Heterogeneous Cluster) is an agent-based framework for the creation of load-balanced distributed applications running on heterogeneous cluster systems. It offers the possibility to transfer subtasks of a parallel application to mobile agents, which travel autonomously through a network searching for free resources.

    Implemented in Java (with support utilities in several other languages).
  • The Tui Heterogeneous Process Migration System
    Tui is a general purpose migration package that allows processes to be migrated between machines of different architectures. That is, we can start a program executing on an 386 based machine, and during its execution, move it to a SPARC machine. Very few systems that I [the TUI author, Peter Smith] know of allow a process to be moved between machines of differing architecture, but since these generally work for only type-safe languages, this is not too hard to achieve. Tui is able to migrate programs that are written in more common languages (such as C), that have horrible type-unsafe features that make it hard to locate data within a process.
  • Porch - The Portable Checkpoint Compiler
    Porch is a source-to-source compiler that translates C programs into semantically equivalent C programs which are capable of saving and recovering from portable checkpoints. These portable checkpoints can be transferred in heterogeneous computer networks such as the Internet, and may be restarted on binary incompatible machines. Porch provides a simple way to enhance programs by adding functionalities for portability and fault-tolerance.

    Version 1.0 can be downloaded from the Porch home page.
And some references that have just been collected, but not (yet) checked out:
(last edited: Thu Jun 24 11:14:28 MET DST 1999)
petri@iti.mu-luebeck.de

Stefan's Homepage