📈 Parallelizing within a job

When the software you are using does not scale well enough to occupy a full node, you can launch several processes at the same time within a single Slurm job.

Danger

Use this approach only when there is really no other way to make the software scale.

Important

Use the final GNU Parallel example if each step of the job can use MPI with CPU pinning. Otherwise, use the xargs example.

GNU Parallel

GNU Parallel runs several processes in parallel. Try running the following examples:

Example 1

Parallel example 1
$ parallel -u -P4 \
  'echo running task {} && sleep 3 && echo task {} done' \
  ::: 01 02 03 04 05 06 07 08

The order may vary, since everything runs in parallel.
-u: Prints output as soon as it is produced instead of holding back each job's output until that job finishes; it is only useful for this example.
-P4: Sets how many processes run at the same time.
::: 01 02 03 04 05 06 07 08: The set of parameters to run.

Sample output
running task 01
running task 02
running task 03
running task 04
task 01 done
task 02 done
task 03 done
task 04 done
running task 05
running task 06
running task 07
running task 08
task 05 done
task 06 done
task 07 done
task 08 done
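
As a rough sanity check on the output above, and assuming every task takes about the same time, N tasks with at most P running at once finish in ceil(N / P) rounds. The sketch below computes that number. The function name `waves` is made up for illustration and is not part of GNU Parallel.

```shell
#!/bin/bash
# waves N P: number of rounds needed to run N equally long tasks
# when at most P of them run at the same time (ceiling of N / P).
# "waves" is a hypothetical helper, not part of GNU Parallel.
waves() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# 8 tasks with -P4, as in Example 1:
waves 8 4
```

For Example 1 this prints 2, which matches the two groups of four tasks in the sample output. With 3-second tasks, the whole run should take roughly 6 seconds.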

Example 2

The following example is an intermediate step toward understanding the final version:

Parallel example 2
$ export NUM_JOBS=16
$ export TASKS_PER_JOB=4
$ parallel -P$NUM_JOBS \
  'MAP=$(seq -s, $(( ({%} - 1) * $TASKS_PER_JOB )) $(( {%} * $TASKS_PER_JOB - 1 ))); \
  echo task {} on slot {%} with cpus $MAP && sleep 1' \
  ::: 1 2 3 4 ::: A B C D E

The order may vary, since everything runs in parallel.
{%}: Replaced by the job slot number, from 1 to NUM_JOBS.
::: 1 2 3 4 ::: A B C D E: Sets of parameters combined as a Cartesian product (4x5=20 combinations in total).

Sample output
task 1 A on slot 1 with cpus 0,1,2,3
task 1 B on slot 2 with cpus 4,5,6,7
task 1 C on slot 3 with cpus 8,9,10,11
task 1 D on slot 4 with cpus 12,13,14,15
task 1 E on slot 5 with cpus 16,17,18,19
task 2 A on slot 6 with cpus 20,21,22,23
task 2 B on slot 7 with cpus 24,25,26,27
task 2 C on slot 8 with cpus 28,29,30,31
task 2 D on slot 9 with cpus 32,33,34,35
task 2 E on slot 10 with cpus 36,37,38,39
task 3 A on slot 11 with cpus 40,41,42,43
task 3 B on slot 12 with cpus 44,45,46,47
task 3 C on slot 13 with cpus 48,49,50,51
task 3 D on slot 14 with cpus 52,53,54,55
task 3 E on slot 15 with cpus 56,57,58,59
task 4 A on slot 16 with cpus 60,61,62,63
task 4 B on slot 1 with cpus 0,1,2,3
task 4 C on slot 2 with cpus 4,5,6,7
task 4 D on slot 3 with cpus 8,9,10,11
task 4 E on slot 4 with cpus 12,13,14,15
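
The CPU lists in the output above follow from simple arithmetic: slot s with T tasks per job gets CPUs (s - 1) * T through s * T - 1. The sketch below isolates that calculation in a function so it can be checked by hand. The name `cpu_map` is hypothetical, and `paste` is used to join the numbers instead of `seq -s,` only to keep the sketch portable across `seq` implementations.

```shell
#!/bin/bash
# cpu_map SLOT TASKS_PER_JOB: comma-separated CPU ids for a 1-based slot.
# Hypothetical helper using the same arithmetic as the MAP expression
# in Example 2.
cpu_map() {
  local slot=$1
  local per_job=$2
  local first=$(( (slot - 1) * per_job ))
  local last=$(( slot * per_job - 1 ))
  # One number per line, then joined with commas.
  seq "$first" "$last" | paste -sd, -
}

cpu_map 1 4    # CPUs for slot 1
cpu_map 16 4   # CPUs for slot 16
```

The two calls print 0,1,2,3 and 60,61,62,63, the same lists shown for slots 1 and 16 in the sample output.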

Final example

Finally, here is an example of the submit script:

Important

This example assumes that each step supports MPI and CPU pinning.

Example executable: work.sh
#!/bin/bash
# Prints the current time, the Slurm task ID, the input, and which CPU it is running on.
echo $(date +"%H:%M:%S"): Running task $SLURM_PROCID with input $@. $(taskset -cp $$)

sleep 4

echo $(date +"%H:%M:%S"): End task $SLURM_PROCID with input $@

Parallel example for Serafín
#!/bin/bash
#SBATCH --nodes=1

. /etc/profile

module load parallel

export NUM_JOBS=16
export TASKS_PER_JOB=4

parallel -P$NUM_JOBS ' \
  MAP=$(seq -s, $(( ({%} - 1) * $TASKS_PER_JOB )) $(( {%} * $TASKS_PER_JOB - 1 ))); \
  srun -n $TASKS_PER_JOB --cpu_bind=map_cpu:$MAP --overlap \
    work.sh {}
  ' \
  ::: 1 2 3 4 ::: A B C D E

-n $TASKS_PER_JOB: Runs 4 tasks for each job.
--cpu_bind=map_cpu:$MAP: Binds the tasks to the CPUs listed in the map.
--overlap: Depending on the Slurm version, this flag may be needed, or alternatively -s, --exclusive, or -c 1.

Sample output
$ cat slurm.out
15:06:03: Running task 0 with input 1 A. pid 917838's current affinity list: 0
15:06:03: Running task 1 with input 1 A. pid 917839's current affinity list: 1
15:06:03: Running task 2 with input 1 A. pid 917840's current affinity list: 2
15:06:03: Running task 3 with input 1 A. pid 917841's current affinity list: 3
15:06:03: Running task 1 with input 1 E. pid 917848's current affinity list: 17
15:06:03: Running task 3 with input 4 A. pid 917881's current affinity list: 63
15:06:03: Running task 2 with input 1 C. pid 917858's current affinity list: 10
15:06:03: Running task 3 with input 3 E. pid 917889's current affinity list: 59
15:06:03: Running task 2 with input 1 D. pid 917873's current affinity list: 14
15:06:03: Running task 0 with input 3 C. pid 917892's current affinity list: 48
15:06:03: Running task 0 with input 1 E. pid 917847's current affinity list: 16
15:06:03: Running task 2 with input 1 E. pid 917849's current affinity list: 18
15:06:03: Running task 3 with input 1 E. pid 917850's current affinity list: 19
15:06:03: Running task 0 with input 1 C. pid 917853's current affinity list: 8
15:06:03: Running task 1 with input 1 C. pid 917856's current affinity list: 9
15:06:03: Running task 0 with input 2 D. pid 917902's current affinity list: 32
15:06:03: Running task 3 with input 1 C. pid 917861's current affinity list: 11
15:06:03: Running task 1 with input 3 E. pid 917887's current affinity list: 57
15:06:03: Running task 0 with input 1 D. pid 917871's current affinity list: 12
15:06:03: Running task 3 with input 3 C. pid 917895's current affinity list: 51
15:06:03: Running task 2 with input 3 E. pid 917888's current affinity list: 58
15:06:03: Running task 3 with input 1 D. pid 917874's current affinity list: 15
15:06:03: Running task 1 with input 1 D. pid 917872's current affinity list: 13
15:06:03: Running task 0 with input 4 A. pid 917878's current affinity list: 60
15:06:03: Running task 1 with input 3 D. pid 917898's current affinity list: 53
15:06:03: Running task 1 with input 4 A. pid 917879's current affinity list: 61
15:06:03: Running task 2 with input 3 C. pid 917894's current affinity list: 50
15:06:03: Running task 0 with input 3 E. pid 917886's current affinity list: 56
15:06:03: Running task 2 with input 4 A. pid 917880's current affinity list: 62
15:06:03: Running task 1 with input 3 C. pid 917893's current affinity list: 49
15:06:03: Running task 3 with input 1 B. pid 917921's current affinity list: 7
15:06:03: Running task 3 with input 2 E. pid 917930's current affinity list: 39
15:06:03: Running task 2 with input 2 A. pid 917909's current affinity list: 22
15:06:03: Running task 0 with input 2 C. pid 917912's current affinity list: 28
15:06:03: Running task 0 with input 3 B. pid 917935's current affinity list: 44
15:06:03: Running task 3 with input 3 A. pid 917943's current affinity list: 43
15:06:03: Running task 1 with input 2 B. pid 917924's current affinity list: 25
15:06:03: Running task 1 with input 2 D. pid 917903's current affinity list: 33
15:06:03: Running task 2 with input 2 D. pid 917904's current affinity list: 34
15:06:03: Running task 3 with input 2 D. pid 917905's current affinity list: 35
15:06:03: Running task 2 with input 3 D. pid 917899's current affinity list: 54
15:06:03: Running task 3 with input 3 D. pid 917900's current affinity list: 55
15:06:03: Running task 0 with input 3 D. pid 917897's current affinity list: 52
15:06:03: Running task 2 with input 1 B. pid 917920's current affinity list: 6
15:06:03: Running task 0 with input 1 B. pid 917917's current affinity list: 4
15:06:03: Running task 0 with input 2 E. pid 917926's current affinity list: 36
15:06:03: Running task 1 with input 1 B. pid 917918's current affinity list: 5
15:06:03: Running task 3 with input 2 A. pid 917910's current affinity list: 23
15:06:03: Running task 1 with input 2 A. pid 917908's current affinity list: 21
15:06:03: Running task 1 with input 2 E. pid 917928's current affinity list: 37
15:06:03: Running task 2 with input 2 C. pid 917914's current affinity list: 30
15:06:03: Running task 1 with input 2 C. pid 917913's current affinity list: 29
15:06:03: Running task 0 with input 2 A. pid 917907's current affinity list: 20
15:06:03: Running task 2 with input 2 E. pid 917929's current affinity list: 38
15:06:03: Running task 3 with input 2 C. pid 917915's current affinity list: 31
15:06:03: Running task 2 with input 3 A. pid 917940's current affinity list: 42
15:06:03: Running task 1 with input 3 A. pid 917938's current affinity list: 41
15:06:03: Running task 2 with input 3 B. pid 917939's current affinity list: 46
15:06:03: Running task 2 with input 2 B. pid 917925's current affinity list: 26
15:06:03: Running task 0 with input 3 A. pid 917936's current affinity list: 40
15:06:03: Running task 1 with input 3 B. pid 917937's current affinity list: 45
15:06:03: Running task 3 with input 3 B. pid 917941's current affinity list: 47
15:06:03: Running task 3 with input 2 B. pid 917927's current affinity list: 27
15:06:03: Running task 0 with input 2 B. pid 917922's current affinity list: 24
15:06:07: End task 2 with input 1 A
15:06:07: End task 3 with input 1 A
15:06:07: End task 0 with input 1 A
15:06:07: End task 3 with input 4 A
15:06:07: End task 1 with input 1 A
15:06:07: End task 1 with input 1 E
15:06:07: End task 2 with input 1 C
15:06:07: End task 0 with input 3 C
15:06:07: End task 3 with input 3 C
15:06:07: End task 0 with input 1 C
15:06:07: End task 2 with input 1 E
15:06:07: End task 3 with input 3 E
15:06:07: End task 0 with input 1 E
15:06:07: End task 2 with input 1 D
15:06:07: End task 3 with input 1 D
15:06:07: End task 1 with input 3 E
15:06:07: End task 2 with input 3 E
15:06:07: End task 3 with input 1 C
15:06:07: End task 1 with input 1 C
15:06:07: End task 3 with input 1 E
15:06:07: End task 0 with input 2 D
15:06:07: End task 0 with input 1 D
15:06:07: End task 0 with input 4 A
15:06:07: End task 1 with input 3 D
15:06:07: End task 0 with input 3 E
15:06:07: End task 2 with input 3 C
15:06:07: End task 1 with input 1 D
15:06:07: End task 1 with input 4 A
15:06:07: End task 1 with input 2 D
15:06:07: End task 2 with input 3 D
15:06:07: End task 3 with input 3 D
15:06:07: End task 1 with input 3 C
15:06:07: End task 3 with input 1 B
15:06:07: End task 2 with input 4 A
15:06:07: End task 2 with input 2 D
15:06:07: End task 0 with input 3 D
15:06:07: End task 3 with input 2 E
15:06:07: End task 2 with input 1 B
15:06:07: End task 2 with input 2 A
15:06:07: End task 0 with input 2 C
15:06:07: End task 2 with input 2 C
15:06:07: End task 0 with input 2 A
15:06:07: End task 2 with input 3 A
15:06:07: End task 3 with input 3 A
15:06:07: End task 3 with input 2 D
15:06:07: End task 2 with input 3 B
15:06:07: End task 0 with input 3 B
15:06:07: End task 0 with input 2 E
15:06:07: End task 1 with input 2 E
15:06:07: End task 1 with input 2 B
15:06:07: End task 3 with input 2 A
15:06:07: End task 3 with input 2 C
15:06:07: End task 2 with input 2 E
15:06:07: End task 1 with input 1 B
15:06:07: End task 1 with input 2 A
15:06:07: End task 0 with input 2 B
15:06:07: End task 1 with input 2 C
15:06:07: End task 0 with input 1 B
15:06:07: End task 3 with input 3 B
15:06:07: End task 0 with input 3 A
15:06:07: End task 3 with input 2 B
15:06:07: End task 2 with input 2 B
15:06:07: End task 1 with input 3 B
15:06:07: End task 1 with input 3 A
15:06:07: Running task 2 with input 4 C. pid 918328's current affinity list: 18
15:06:08: Running task 0 with input 4 B. pid 918338's current affinity list: 0
15:06:08: Running task 0 with input 4 D. pid 918323's current affinity list: 60
15:06:08: Running task 0 with input 4 E. pid 918334's current affinity list: 8
15:06:07: Running task 3 with input 4 C. pid 918330's current affinity list: 19
15:06:07: Running task 0 with input 4 C. pid 918324's current affinity list: 16
15:06:07: Running task 1 with input 4 C. pid 918326's current affinity list: 17
15:06:08: Running task 1 with input 4 B. pid 918339's current affinity list: 1
15:06:08: Running task 2 with input 4 B. pid 918340's current affinity list: 2
15:06:08: Running task 3 with input 4 B. pid 918341's current affinity list: 3
15:06:08: Running task 1 with input 4 D. pid 918325's current affinity list: 61
15:06:08: Running task 3 with input 4 D. pid 918329's current affinity list: 63
15:06:08: Running task 2 with input 4 D. pid 918327's current affinity list: 62
15:06:08: Running task 1 with input 4 E. pid 918335's current affinity list: 9
15:06:08: Running task 2 with input 4 E. pid 918336's current affinity list: 10
15:06:08: Running task 3 with input 4 E. pid 918337's current affinity list: 11
15:06:11: End task 2 with input 4 C
15:06:11: End task 3 with input 4 C
15:06:11: End task 0 with input 4 C
15:06:11: End task 1 with input 4 C
15:06:12: End task 2 with input 4 B
15:06:12: End task 3 with input 4 B
15:06:12: End task 0 with input 4 B
15:06:12: End task 1 with input 4 B
15:06:12: End task 1 with input 4 D
15:06:12: End task 3 with input 4 D
15:06:12: End task 0 with input 4 D
15:06:12: End task 2 with input 4 D
15:06:12: End task 2 with input 4 E
15:06:12: End task 0 with input 4 E
15:06:12: End task 1 with input 4 E
15:06:12: End task 3 with input 4 E
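
The CPU map in the submit script only makes sense if NUM_JOBS * TASKS_PER_JOB does not exceed the number of CPUs on the node. Otherwise the map would refer to CPU ids that do not exist. A minimal sketch of such a check follows. The function `check_layout` is hypothetical, and the value 64 is an assumption based on the CPU ids 0-63 seen in the sample output.

```shell
#!/bin/bash
# check_layout NUM_JOBS TASKS_PER_JOB TOTAL_CPUS
# Succeeds if every slot can get its own CPUs, that is, if
# NUM_JOBS * TASKS_PER_JOB <= TOTAL_CPUS. Hypothetical helper.
check_layout() {
  local needed=$(( $1 * $2 ))
  if [ "$needed" -gt "$3" ]; then
    echo "layout needs $needed CPUs but only $3 are available" >&2
    return 1
  fi
  echo "layout ok: $needed of $3 CPUs used"
}

# Values from the submit script; 64 CPUs is an assumption based on the
# CPU ids 0-63 seen in the sample output.
check_layout 16 4 64
```

Inside a batch job, the third argument could be taken from the SLURM_CPUS_ON_NODE environment variable, for example `check_layout "$NUM_JOBS" "$TASKS_PER_JOB" "$SLURM_CPUS_ON_NODE" || exit 1` before calling parallel.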

xargs

Important

Use this example when each step is independent and has no MPI support.

 1  #SBATCH --nodes=1
 2  #SBATCH --ntasks-per-node=1
 3  #SBATCH --cpus-per-task=32
 4  ###SBATCH  time/partition/etc...
 5
 6  find ligandos -name 'active*_*' -printf "%f\n" | \
 7    xargs -P ${SLURM_CPUS_PER_TASK} --max-args 1 -I{} \
 8    vina --cpu 1 \
 9      --config conf.txt \
10      --ligand "ligandos/{}" \
11      --out out/out_{} \
12      --log log/log_{}.txt

Explanation:
Line 6: Lists all files under the ligandos directory whose names match the pattern active*_*, printing each file name followed by a newline.
Line 7: Runs up to 32 (SLURM_CPUS_PER_TASK) processes at a time, passing a single argument to each one, which is substituted for every “{}”.
Lines 8-12: Runs vina, replacing “{}” with the corresponding line.
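
To try the same xargs pattern without vina or a Slurm allocation, the sketch below replaces the real program with a simple `echo`. The function `run_all` and the ligand names are made up for illustration. Only the `-P` and `-I{}` options are taken from the script above.

```shell
#!/bin/bash
# run_all MAX_PROCS ITEM...: process each ITEM with at most MAX_PROCS
# commands running at once. The inner "echo" is a stand-in for a real
# program such as vina; run_all is a hypothetical helper.
run_all() {
  local max_procs=$1
  shift
  # One item per line on stdin; xargs substitutes each line for {}.
  printf '%s\n' "$@" |
    xargs -P "$max_procs" -I{} sh -c 'echo "processed $1"' _ {}
}

# Three made-up ligand names, two at a time; output order may vary.
run_all 2 active_1 active_2 active_3
```

Passing the item as a positional argument to `sh -c`, rather than pasting `{}` into the command string, keeps file names with spaces or special characters from being reinterpreted by the shell.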