Beowulf Cluster Computing with Linux, Second Edition

William Gropp; Ewing Lusk; Thomas Sterling

15.3 Condor Architecture

A Condor pool comprises a single machine that serves as the central manager and an arbitrary number of other machines that have joined the pool. Conceptually, the pool is a collection of resources (machines) and resource requests (jobs). The role of Condor is to match waiting requests with available resources. Every part of Condor sends periodic updates to the central manager, the centralized repository of information about the state of the pool. The central manager periodically assesses the current state of the pool and tries to match pending requests with the appropriate resources.


15.3.1 The Condor Daemons


In this subsection we describe all the daemons (background server processes) in Condor and the role each plays in the system.



  • condor_master: This daemon's role is to simplify system administration. It is responsible for keeping the rest of the Condor daemons running on each machine in a pool. The master spawns the other daemons and periodically checks the timestamps on the binaries of the daemons it is managing. If it finds new binaries, the master will restart the affected daemons. This allows Condor to be upgraded easily. In addition, if any other Condor daemon on the machine exits abnormally, the condor_master will send e-mail to the system administrator with information about the problem and then automatically restart the affected daemon. The condor_master also supports various administrative commands to start, stop, or reconfigure daemons remotely (a sketch of these commands appears after this list). The condor_master runs on every machine in your Condor pool.



  • condor_startd: This daemon represents a machine to the Condor pool. It advertises a machine ClassAd that contains attributes about the machine's capabilities and policies. Running the startd enables a machine to execute jobs. The condor_startd is responsible for enforcing the policy under which remote jobs will be started, suspended, resumed, vacated, or killed (a sketch of such a policy appears after this list). When the startd is ready to execute a Condor job, it spawns the condor_starter, described below.



  • condor_starter: This program spawns the remote Condor job on a given machine. It sets up the execution environment and monitors the job once it is running. The starter detects job completion, sends status information back to the submitting machine, and exits.



  • condor_schedd: This daemon represents jobs to the Condor pool. Any machine that allows users to submit jobs needs to have a condor_schedd running. Users submit jobs to the condor_schedd, where they are stored in the job queue. The various tools to view and manipulate the job queue (such as condor_submit, condor_q, or condor_rm) connect to the condor_schedd to do their work; a sketch of these tools in use appears after this list.



  • condor_shadow: This program runs on the machine where a job was submitted whenever that job is executing. The shadow serves requests for files to transfer, logs the job's progress, and reports statistics when the job completes. Jobs that are linked for Condor's Standard Universe, which perform remote system calls, do so via the condor_shadow (a relinking sketch appears after this list). Any system call performed on the remote execute machine is sent over the network to the condor_shadow, which performs the system call (such as file I/O) on the submit machine and sends the result back over the network to the remote job.



  • condor_collector: This daemon is responsible for collecting all the information about the status of a Condor pool. All other daemons periodically send ClassAd updates to the collector. These ClassAds contain all the information about the state of the daemons, the resources they represent, or resource requests in the pool (such as jobs that have been submitted to a given condor_schedd). The condor_collector can be thought of as a dynamic database of ClassAds. The condor_status command can be used to query the collector for specific information about various parts of Condor (see the sketch after this list). The Condor daemons also query the collector for important information, such as what address to use for sending commands to a remote machine. The condor_collector runs on the machine designated as the central manager.



  • condor_negotiator: This daemon is responsible for all the matchmaking within the Condor system. The negotiator also enforces user priorities in the system.
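
The remote administration commands mentioned for the condor_master include condor_reconfig, condor_off, and condor_restart. A minimal sketch of their use follows; the host name node01.example.com is illustrative.

    # Tell the Condor daemons on a machine to re-read their configuration
    condor_reconfig node01.example.com

    # Shut down the Condor daemons (other than the master) on a machine
    condor_off node01.example.com

    # Restart the daemons, for instance after installing new binaries
    condor_restart node01.example.com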
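
The policy the condor_startd enforces is written as expressions in the Condor configuration file. The following is only a sketch: START, SUSPEND, and CONTINUE are standard configuration attributes, but the thresholds shown are illustrative, not recommended values.

    # Start jobs only after 15 minutes of keyboard idleness on a
    # lightly loaded machine
    START    = KeyboardIdle > 15 * $(MINUTE) && LoadAvg < 0.3

    # Suspend the job as soon as the machine's owner returns
    SUSPEND  = KeyboardIdle < $(MINUTE)

    # Resume the job after five minutes of renewed idleness
    CONTINUE = KeyboardIdle > 5 * $(MINUTE)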
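
The job-queue tools connect to the condor_schedd as ordinary commands. A sketch of a typical session follows; the file name myjob.submit and the job identifier 42.0 are illustrative.

    condor_submit myjob.submit    # place a new job in the local queue
    condor_q                      # view the state of the job queue
    condor_rm 42.0                # remove job 42.0 from the queue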
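
Linking a job for the Standard Universe, so that its system calls are routed through the condor_shadow, is done by prefixing the normal link command with condor_compile. The program and source names below are illustrative.

    # Relink an application against Condor's remote system call library
    condor_compile gcc -o myapp myapp.c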
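
The condor_status command queries the condor_collector from the command line. A sketch of some typical queries; the constraint expression is illustrative.

    condor_status             # summarize every machine ClassAd in the pool
    condor_status -schedd     # show submit points rather than machines

    # list only machines advertising more than 1 GB of memory
    condor_status -constraint 'Memory > 1024'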




15.3.2 The Condor Daemons in Action


Within a given Condor installation, one machine will serve as the pool's central manager. In addition to the condor_master daemon that runs on every machine in a Condor pool, the central manager runs the condor_collector and the condor_negotiator daemons. Any machine in the installation that should be capable of running jobs should run the condor_startd, and any machine that should maintain a job queue, and therefore allow users on that machine to submit jobs, should run a condor_schedd.

Condor allows any machine simultaneously to execute jobs and serve as a submission point by running both a condor_startd and a condor_schedd. Figure 15.6 displays a Condor pool in which every machine in the pool can both submit and run jobs, including the central manager.


Figure 15.6: Daemon layout of an idle Condor pool.


The interface for adding a job to the Condor system is condor_submit, which reads a job description file, creates a job ClassAd, and gives that ClassAd to the condor_schedd managing the local job queue. This triggers a negotiation cycle. During a negotiation cycle, the condor_negotiator queries the condor_collector to discover all machines that are willing to perform work and all users with idle jobs. The condor_negotiator communicates in user priority order with each condor_schedd that has idle jobs in its queue, and performs matchmaking to match jobs with machines such that both job and machine ClassAd requirements are satisfied and preferences (rank) are honored.
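
A minimal job description file for condor_submit might look like the following sketch; the executable, argument, and file names are all illustrative.

    # myjob.submit -- describes the job that condor_submit turns
    # into a job ClassAd for the local condor_schedd
    universe   = vanilla
    executable = my_analysis
    arguments  = input.dat
    output     = my_analysis.out
    error      = my_analysis.err
    log        = my_analysis.log
    queue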

Once the condor_negotiator makes a match, the condor_schedd claims the corresponding machine and is allowed to make subsequent scheduling decisions about the order in which jobs run. This hierarchical, distributed scheduling architecture enhances Condor's scalability and flexibility.

When the condor_schedd starts a job, it spawns a condor_shadow process on the submit machine, and the condor_startd spawns a condor_starter process on the corresponding execute machine (see Figure 15.7). The shadow transfers the job ClassAd and any required data files to the starter, which spawns the user's application.


Figure 15.7: Daemon layout when a job submitted from Machine 2 is running.

If the job is a Standard Universe job, the shadow will begin to service remote system calls originating from the user job, allowing the job to transparently access data files on the submitting host.

When the job completes or is aborted, the condor_starter removes every process spawned by the user job and frees any temporary scratch disk space used by the job. This ensures that the execute machine is left in a clean state and that resources (such as processes or disk space) are not being leaked.
