Computing on Slurm with Julia
This post describes how to submit parallel Julia jobs through Slurm.
Avoid sharing a single .julia directory under the same account
After each login, run
export JULIA_DEPOT_PATH="/home/usrname/WORK/your_work_dir/.julia"
Then start the Julia REPL; any packages you install will be placed under the directory specified by JULIA_DEPOT_PATH.
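To confirm that the setting is active, you can inspect the depot from the REPL. A minimal sketch, assuming JULIA_DEPOT_PATH has been exported as above:
println(first(DEPOT_PATH))      # should print /home/usrname/WORK/your_work_dir/.julia
println(Base.active_project())  # the default environment also sits inside that depot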
The Slurm batch script should likewise export this environment variable before launching Julia:
#!/bin/bash
#SBATCH -J test_cluster_manager
#SBATCH -p cnall
#SBATCH -o cluster2.out
#SBATCH -e cluster2.err
#SBATCH --nodes=2
#SBATCH --ntasks=10
#SBATCH --cpus-per-task=2
module load soft/julia/
export JULIA_DEPOT_PATH="/home/usrname/WORK/your_work_dir/.julia"
julia ./scripts/slurm/cluster_manager.jl
Reference: skipping this step may lead to errors such as the one described in https://stackoverflow.com/questions/67794194
Submitting multi-process jobs
Julia manages multi-process jobs with Distributed.jl and ClusterManagers.jl.
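Distributed is part of the standard library, while ClusterManagers.jl (and, for the sketch further below, DistributedArrays.jl) must be installed once into the depot. A minimal sketch, run in the REPL after exporting JULIA_DEPOT_PATH:
using Pkg
Pkg.add("ClusterManagers")    # installed into the depot selected by JULIA_DEPOT_PATH
Pkg.add("DistributedArrays")  # optional, used in the DistributedArrays sketch later in this post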
Single-node job submission
#!/bin/bash
#SBATCH -J dos_qc_8
#SBATCH -p cnall
#SBATCH -o dos.out
#SBATCH -e dos.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
module load soft/julia/
export JULIA_DEPOT_PATH="/home/usrname/WORK/your_work_dir/.julia"
julia ./scripts/slurm/hello_world.jl
Contents of hello_world.jl:
using Distributed
# launch worker processes
num_cores = parse(Int, ENV["SLURM_CPUS_PER_TASK"])
addprocs(num_cores)
println("Number of cores: ", nprocs())
println("Number of workers: ", nworkers())
# each worker gets its id, process id and hostname
for i in workers()
    id, pid, host = fetch(@spawnat i (myid(), getpid(), gethostname()))
    println(id, " ", pid, " ", host)
end
# remove the workers
for i in workers()
    rmprocs(i)
end
Output:
Number of cores: 57
Number of workers: 56
2 15705 c11b03n20
3 15707 c11b03n20
4 15708 c11b03n20
5 15709 c11b03n20
6 15711 c11b03n20
7 15713 c11b03n20
8 15715 c11b03n20
9 15717 c11b03n20
10 15719 c11b03n20
You can think of addprocs(num_cores) as splitting a single Slurm task into num_cores worker processes, each of which occupies one physical core.
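Once the workers exist they can be put to work, for example with pmap. The snippet below is a minimal sketch; heavy is a hypothetical placeholder standing in for a real workload:
using Distributed
addprocs(parse(Int, ENV["SLURM_CPUS_PER_TASK"]))

# hypothetical placeholder workload, defined on every worker
@everywhere heavy(x) = sum(abs2, rand(10^6)) + x

results = pmap(heavy, 1:4 * nworkers())   # calls are spread over the workers
println(length(results), " results computed on ", nworkers(), " workers")
rmprocs(workers())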
Multi-node job submission
#!/bin/bash
#SBATCH -J test_cluster_manager
#SBATCH -p cnall
#SBATCH -o cluster2.out
#SBATCH -e cluster2.err
#SBATCH --nodes=2
#SBATCH --ntasks=10
#SBATCH --cpus-per-task=2
module load soft/julia/
export JULIA_DEPOT_PATH="/home/usrname/WORK/your_work_dir/.julia"
julia ./scripts/slurm/cluster_manager.jl
Contents of cluster_manager.jl:
using Distributed
using ClusterManagers
num_tasks = parse(Int, ENV["SLURM_NTASKS"])
# launch worker processes
cpus_per_task = parse(Int, ENV["SLURM_CPUS_PER_TASK"])
addprocs(SlurmManager(num_tasks))
println("Number of cores: ", nprocs())
println("Number of workers: ", nworkers())
println("cpus_per_task: $cpus_per_task")
println(Threads.nthreads())
# each worker gets its id, process id and hostname
for i in workers()
    id, pid, host = fetch(@spawnat i (myid(), getpid(), gethostname()))
    println(id, " ", pid, " ", host)
end
println(Threads.nthreads())
# remove the workers
for i in workers()
    rmprocs(i)
end
Output:
connecting to worker 1 out of 10
connecting to worker 2 out of 10
connecting to worker 3 out of 10
connecting to worker 4 out of 10
connecting to worker 5 out of 10
connecting to worker 6 out of 10
connecting to worker 7 out of 10
connecting to worker 8 out of 10
connecting to worker 9 out of 10
connecting to worker 10 out of 10
Number of cores: 11
Number of workers: 10
cpus_per_task: 2
1
2 110588 c11b02n01
3 110589 c11b02n01
4 60730 c11b02n08
5 60731 c11b02n08
6 60732 c11b02n08
7 60733 c11b02n08
8 60734 c11b02n08
9 60735 c11b02n08
10 60736 c11b02n08
11 60737 c11b02n08
1
ClusterManagers.jl launches the workers through Slurm itself (SlurmManager starts them with srun) rather than forking local processes, which is what allows the computation to span multiple nodes.
Notes:
- addprocs(SlurmManager(num_tasks)) should be given the total number of tasks. For example, if the job is submitted with #SBATCH --nodes=4 and #SBATCH --ntasks-per-node=2, use addprocs(SlurmManager(8)).
- If you need an Array shared among the workers in this setting, use DistributedArrays.jl rather than SharedArrays.jl: workers on different nodes do not share memory (see the sketch after this list).
- Even though two physical cores are allocated per task, the default thread count in Julia, Threads.nthreads(), is still 1 unless JULIA_NUM_THREADS is set or Julia is started with -t.
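A minimal DistributedArrays.jl sketch, assuming the package is installed in the depot and the script runs inside a multi-task allocation like the one above:
using Distributed, ClusterManagers
addprocs(SlurmManager(parse(Int, ENV["SLURM_NTASKS"])))

@everywhere using DistributedArrays

A = drand(1000, 1000)        # the blocks of A live on different workers, possibly on different nodes
println("total = ", sum(A))  # reductions run across all blocks

# each worker reports the size of the block it owns
for p in workers()
    sz = fetch(@spawnat p size(localpart(A)))
    println("worker $p holds a block of size $sz")
end

rmprocs(workers())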
Incorrect example
If you use only Distributed.jl, a multi-node job still ends up with the resources of a single task.
#!/bin/bash
#SBATCH -J dos_qc_8
#SBATCH -p cnall
#SBATCH -o dos.out
#SBATCH -e dos.err
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=56
module load soft/julia/
export JULIA_DEPOT_PATH="/home/usrname/WORK/your_work_dir/.julia"
julia ./scripts/slurm/hello_world.jl
Contents of hello_world.jl (same as above):
using Distributed
# launch worker processes
num_cores = parse(Int, ENV["SLURM_CPUS_PER_TASK"])
addprocs(num_cores)
println("Number of cores: ", nprocs())
println("Number of workers: ", nworkers())
# each worker gets its id, process id and hostname
for i in workers()
    id, pid, host = fetch(@spawnat i (myid(), getpid(), gethostname()))
    println(id, " ", pid, " ", host)
end
# remove the workers
for i in workers()
    rmprocs(i)
end
Output (note that all 56 workers land on the single node c11b03n20; the allocation on the second node is never used):
Number of cores: 57
Number of workers: 56
2 15705 c11b03n20
3 15707 c11b03n20
4 15708 c11b03n20
5 15709 c11b03n20
6 15711 c11b03n20
7 15713 c11b03n20
8 15715 c11b03n20
9 15717 c11b03n20
10 15719 c11b03n20
11 15721 c11b03n20
12 15723 c11b03n20
13 15725 c11b03n20
14 15727 c11b03n20
15 15729 c11b03n20
16 15731 c11b03n20
17 15733 c11b03n20
18 15735 c11b03n20
19 15737 c11b03n20
20 15739 c11b03n20
21 15742 c11b03n20
22 15744 c11b03n20
23 15746 c11b03n20
24 15748 c11b03n20
25 15750 c11b03n20
26 15752 c11b03n20
27 15754 c11b03n20
28 15756 c11b03n20
29 15758 c11b03n20
30 15760 c11b03n20
31 15762 c11b03n20
32 15764 c11b03n20
33 15766 c11b03n20
34 15771 c11b03n20
35 15774 c11b03n20
36 15777 c11b03n20
37 15780 c11b03n20
38 15783 c11b03n20
39 15785 c11b03n20
40 15789 c11b03n20
41 15792 c11b03n20
42 15796 c11b03n20
43 15799 c11b03n20
44 15801 c11b03n20
45 15803 c11b03n20
46 15807 c11b03n20
47 15809 c11b03n20
48 15814 c11b03n20
49 15817 c11b03n20
50 15820 c11b03n20
51 15823 c11b03n20
52 15827 c11b03n20
53 15829 c11b03n20
54 15831 c11b03n20
55 15837 c11b03n20
56 15851 c11b03n20
57 15852 c11b03n20
Further reading
- Princeton's Introduction
- ClusterManagers.jl
- An example using ClusterManagers.jl
- Sengupta, Avik. Julia High Performance: Optimizations, Distributed Computing, Multithreading, and GPU Programming with Julia 1.0 and Beyond. Second edition. Birmingham Mumbai: Packt Publishing, 2019.