High-throughput computing framework HTCondor (VI) - Supplements

1. Main text

1.1 Some problems

If you really want to use HTCondor for high-throughput computing in production, there is still a lot of work to do. HTCondor has no GUI; its more complete and more convenient functionality lives in the command window on Linux systems.

How to split a task is also worth the user's consideration. Many compute-intensive tasks are in fact not easy to split, and after splitting there is a high probability that the partial results must be merged; that merge step can be time-consuming and cannot itself be distributed, so it has to run on a single machine. Splitting a task well also takes experience, namely ensuring load balance so that all subtasks finish at roughly the same time. A common way to express the split is sketched below.
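
As an illustration only (the executable name and argument scheme are hypothetical), a single submit description file can queue several subtasks and hand each one its index through the $(Process) macro:

    # split.sub - a minimal sketch of splitting one computation into 8 subtasks
    universe   = vanilla
    executable = mytask.exe
    # $(Process) runs from 0 to 7, so each subtask works on its own slice
    arguments  = --part $(Process) --parts 8
    output     = part_$(Process).out
    error      = part_$(Process).err
    log        = split.log
    queue 8

The merge of the eight partial results still has to happen afterwards, in a separate single-machine step.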

File access is also a problem worth studying. If you use the default file transfer mechanism on Windows, the input data is sent to the task machine together with the task program; this approach often leads to heavy IO congestion. After the run completes, the transferred data is deleted again, which likewise wastes IO performance. So, if conditions allow, it is best to use a distributed file system instead, though that is of course another topic. The transfer behavior is driven by a few submit-file settings, sketched below.
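
For reference, a minimal sketch of the submit-file knobs behind the default transfer mechanism (file names hypothetical); every queued job copies the listed inputs to the task machine and copies outputs back on exit:

    universe                = vanilla
    executable              = mytask.exe
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    # each job re-transfers these inputs, which is where the IO pressure comes from
    transfer_input_files    = input.dat, params.cfg
    queue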

The use of some features is limited in vanilla mode under Windows:

  1. The task program sent to a task machine cannot access network resources from that machine; this is due to the security policy of the low-privilege account the job runs under.
  2. The task program is further wrapped when it is sent out, which changes its default parameters.
  3. When the computing resource running a task fails, the task is not automatically migrated and resumed from a checkpoint. This capability matters for HTC, since the issue here is the stability of large-scale HTC runs; it needs to be investigated further. A partial submit-file workaround is sketched after this list.
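
The automatic checkpoint/migration just mentioned belongs to HTCondor's standard universe, which is not available for Windows vanilla jobs. As a hedged sketch of a partial workaround: if the program writes its own periodic checkpoint file (the name here is hypothetical), asking HTCondor to save intermediate files on eviction lets a restarted job pick up where it left off:

    universe                = vanilla
    executable              = mytask.exe
    should_transfer_files   = YES
    # ON_EXIT_OR_EVICT also transfers files back when the job is evicted,
    # so an application-level checkpoint file survives the interruption
    when_to_transfer_output = ON_EXIT_OR_EVICT
    transfer_input_files    = checkpoint.dat
    queue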

HTCondor divides its computing resources by the number of CPU cores, which is also debatable. If you submit tasks to an 8-core machine, that machine will run 8 tasks; if those tasks happen to be IO-intensive, this wastes IO performance. After all, a hard disk has only one set of heads, and moving the heads back and forth on a single disk wears it out. And while CPUs can be divided by core count, what about GPU resources? How should GPU-based task programs be divided? In many practical cases it may be more reasonable to treat a whole machine as a single node, which the slot configuration sketched below allows.
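
A sketch of the startd configuration (placed in condor_config.local on the execute machine) that advertises the whole machine as one slot, so only one task runs per machine regardless of core count:

    # one slot type that owns 100% of the machine's resources
    SLOT_TYPE_1      = 100%
    NUM_SLOTS_TYPE_1 = 1
    NUM_SLOTS        = 1

Slot layout changes generally need the startd to be restarted (for example with condor_restart) before they take effect.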

To get better performance, I used a simple file-sharing approach. Although an HTCondor task program cannot access network resources itself, you can use file sharing before the computation to push the needed data to the task machines ahead of time, so the job program only has to access local resources. Data transferred this way can be reused repeatedly, which helps the efficiency of subsequent tasks. How should I put it: unless you are very familiar with network file sharing, I recommend not doing this. The idea is sketched below.
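
A minimal sketch of the idea, with a hypothetical Windows share and staging directory: stage the data once per machine, then submit jobs that read only local paths, so nothing but the executable travels with each task.

    REM run once on each task machine to pre-stage the shared data locally
    xcopy \\fileserver\share\dataset C:\condor_data\ /E /Y

    # submit description file: the job reads C:\condor_data directly
    universe              = vanilla
    executable            = mytask.exe
    arguments             = --data C:\condor_data
    should_transfer_files = YES
    queue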

1.2 Usage recommendations

  1. If condor_q shows a task in the H (held) state, the submitted task program probably cannot run correctly; usually the task machine lacks part of the necessary operating environment, such as some DLL. (See the command sketch after this list for inspecting the hold reason.)
  2. A stable network environment must be maintained. Some security software, firewalls, and network tools may change the network environment and leave tasks unable to execute. The examples in this series are based on a local LAN.
  3. HTC puts more emphasis on stability than on raw performance alone; all changes should be made with this principle in mind.
  4. HTCondor provides the condor_prio command for adjusting the priority of tasks in the queue; see the manual for instructions (and the sketch after this list).
  5. Section 7.2.4 "Executing Jobs as the Submitting User" of the HTCondor help documentation addresses the problem of task programs accessing network resources:

    By default, HTCondor executes jobs on Windows using dedicated run accounts that have minimal access rights and privileges, and which are recreated for each new job. As an alternative, HTCondor can be configured to allow users to run jobs using their Windows login accounts. This may be useful if jobs need access to files on a network share, or to other resources that are not available to the low-privilege run account.
    This feature requires use of a condor_credd daemon for secure password storage and retrieval. With the condor_credd daemon running, the user’s password must be stored, using the condor_store_cred tool. Then, a user that wants a job to run using their own account places into the job’s submit description file
    run_as_owner = True

The gist of this passage is that run_as_owner relies on the condor_credd background process, whose environment must be configured first. I did not succeed when configuring it according to Section 7.2.5 "The condor_credd Daemon"; interested readers can try it for themselves.
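
To round out recommendations 1, 4 and 5 above, a hedged sketch of the relevant commands (the job ID 12.0 is hypothetical; condor_store_cred prompts for the password interactively):

    REM 1. list held tasks together with the reason HTCondor holds them
    condor_q -hold

    REM 4. raise the priority of job 12.0 in the queue
    condor_prio -p 10 12.0

    REM 5. store the submitting user's Windows password for condor_credd,
    REM    then put "run_as_owner = True" in the job's submit description file
    condor_store_cred add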

2. Related

Previous
Contents
