I want to run a script on a massive number of inputs, but with no more than a specified number of instances running at any given time. GNU parallel makes this easy.

First, make sure you have GNU parallel installed. The package in most major repositories is simply called “parallel”.
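A quick way to confirm the install is to check whether a ‘parallel’ binary is on the PATH. This is only a sketch: ‘have_cmd’ is a helper name made up for this example, built on the standard ‘command -v’.

```bash
#!/bin/bash

# have_cmd is a hypothetical helper: succeeds if the given command is on the PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd parallel; then
    # Print only the first line of the version banner.
    parallel --version | head -n 1
else
    echo "parallel not found; install the 'parallel' package first" >&2
fi
```

If the first line does not mention GNU parallel, a different tool with the same name (for example, the one shipped in moreutils) may be installed instead.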

Writing A Basic Script

I’m going to write a bash script that echoes a timestamp and its input, and then waits two seconds before exiting. The script looks like this:

#!/bin/bash
 
VAL="$1"
TIME="$(date +%s)"
 
echo "${TIME} -- ${VAL}"
 
sleep 2

Just to make sure it works, we save it as echo_sleep, chmod it to 0755, and then call it with the input “hi”.

$ ./echo_sleep hi
1425690292 -- hi

It worked just as expected: After it echoed the time and input, it slept for two seconds and then exited and my prompt returned.

Running the Script in Parallel

I want to run this script on 20 inputs, but I only ever want to have 5 instances running at any given time. Here’s how we do that with GNU parallel (the input arguments for the script follow the ‘:::’). I’m just using the numbers 1..20 as the inputs.

$ parallel -j5 ./echo_sleep ::: {1..20}
1425690566 -- 1
1425690566 -- 2
1425690566 -- 3
1425690566 -- 4
1425690566 -- 5
1425690569 -- 6
1425690569 -- 7
1425690569 -- 8
1425690569 -- 9
1425690569 -- 10
1425690571 -- 11
1425690571 -- 12
1425690571 -- 13
1425690571 -- 14
1425690571 -- 15
1425690573 -- 16
1425690573 -- 17
1425690573 -- 18
1425690573 -- 19
1425690573 -- 20
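As a rough sanity check on the run above, the ideal wall-clock time can be worked out with a little shell arithmetic. This is a sketch that ignores startup overhead; the variable names are chosen just for this example.

```bash
#!/bin/bash

INPUTS=20     # number of arguments after :::
SLOTS=5       # value passed to -j
JOB_SECS=2    # how long each echo_sleep instance sleeps

# Ceiling division: how many "waves" of up to SLOTS instances are needed.
BATCHES=$(( (INPUTS + SLOTS - 1) / SLOTS ))

# Lower bound on the total run time, assuming instances start instantly.
TOTAL=$(( BATCHES * JOB_SECS ))

echo "${BATCHES} batches, at least ${TOTAL} seconds"
```

This prints “4 batches, at least 8 seconds”. The run above was slightly slower than that ideal: its four batches started at 1425690566, 1425690569, 1425690571 and 1425690573.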

Note: You can use any whitespace-separated list of words as the input. The shell expands ‘{1..10}’ into separate arguments before parallel ever sees it, so something like ‘::: 1 2 3 4 5 6 7 8 9 10’ works just as well as the sequence ‘{1..10}’.
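To see what parallel actually receives, we can let bash expand the sequence on its own. This is a minimal sketch (the variable COUNT is just for illustration) and does not call parallel at all.

```bash
#!/bin/bash

# Let bash expand the sequence into the positional parameters.
set -- {1..20}

# Each number has become its own argument.
COUNT=$#
echo "${COUNT} arguments: first=$1 last=${20}"
```

This prints “20 arguments: first=1 last=20”. GNU parallel also reads one input per line from stdin when no ‘:::’ is given, so ‘seq 1 20 | parallel -j5 ./echo_sleep’ should behave the same way.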

As you can see, each batch of five starts roughly two seconds after the previous one, because a new instance can only start once a running one has finished sleeping. What if we change the script to sleep for a random amount of time? We’ll have each script instance wait between 0 and 9 seconds by changing the script as follows, printing output both when the script starts and when it finishes.

#!/bin/bash
 
VAL="$1"
 
echo "$(date +%s) -- starting -- ${VAL}"
 
sleep "$(($RANDOM % 10))"
 
echo "$(date +%s) -- finishing -- ${VAL}"
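The sleep time comes from bash’s RANDOM variable, which yields an integer from 0 to 32767; taking it modulo 10 leaves a value from 0 to 9. The following sketch (variable names are just for this example) draws 1000 samples and reports the smallest and largest values seen.

```bash
#!/bin/bash

MIN=9   # start high so any sample can lower it
MAX=0   # start low so any sample can raise it

for _ in {1..1000}; do
    N=$((RANDOM % 10))
    if (( N < MIN )); then MIN=$N; fi
    if (( N > MAX )); then MAX=$N; fi
done

echo "smallest=${MIN} largest=${MAX}"
```

Because 32768 is not a multiple of 10, the values 0 through 7 are very slightly more likely than 8 and 9, which does not matter for a demonstration like this one.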

Now, when we run the script, we expect each instance to take a different amount of time to finish. Notice that GNU parallel groups output by default: the lines from each instance are held back and written to stdout together once that instance finishes, so the timestamps below are not in chronological order. (There are output control mechanisms in GNU parallel, but we’re not using them here.)

$ parallel -j5 ./echo_sleep ::: {1..20}
1425690952 -- starting -- 2
1425690954 -- finishing -- 2
1425690952 -- starting -- 1
1425690955 -- finishing -- 1
1425690952 -- starting -- 5
1425690955 -- finishing -- 5
1425690952 -- starting -- 3
1425690957 -- finishing -- 3
1425690952 -- starting -- 4
1425690959 -- finishing -- 4
1425690955 -- starting -- 7
1425690959 -- finishing -- 7
1425690955 -- starting -- 8
1425690961 -- finishing -- 8
1425690954 -- starting -- 6
1425690963 -- finishing -- 6
1425690959 -- starting -- 10
1425690963 -- finishing -- 10
1425690959 -- starting -- 11
1425690965 -- finishing -- 11
1425690961 -- starting -- 12
1425690965 -- finishing -- 12
1425690957 -- starting -- 9
1425690966 -- finishing -- 9
1425690966 -- starting -- 17
1425690967 -- finishing -- 17
1425690963 -- starting -- 13
1425690967 -- finishing -- 13
1425690963 -- starting -- 14
1425690967 -- finishing -- 14
1425690967 -- starting -- 19
1425690967 -- finishing -- 19
1425690967 -- starting -- 18
1425690969 -- finishing -- 18
1425690965 -- starting -- 15
1425690973 -- finishing -- 15
1425690965 -- starting -- 16
1425690973 -- finishing -- 16
1425690967 -- starting -- 20
1425690973 -- finishing -- 20

To get each line of output as soon as it is written, we can use the --linebuffer option. (There is also an --ungroup option, but it can mix the simultaneous output of two script instances together on the same line.)

$ parallel -j5 --linebuffer ./echo_sleep ::: {1..20}
1425691272 -- starting -- 1
1425691272 -- starting -- 2
1425691272 -- starting -- 3
1425691272 -- starting -- 4
1425691272 -- starting -- 5
1425691273 -- finishing -- 1
1425691273 -- finishing -- 2
1425691273 -- starting -- 7
1425691273 -- starting -- 6
1425691275 -- finishing -- 4
1425691275 -- finishing -- 6
1425691275 -- starting -- 8
1425691275 -- starting -- 9
1425691276 -- finishing -- 3
1425691276 -- starting -- 10
1425691277 -- finishing -- 9
1425691277 -- starting -- 11
1425691278 -- finishing -- 8
1425691278 -- finishing -- 10
1425691278 -- starting -- 12
1425691278 -- starting -- 13
1425691280 -- finishing -- 5
1425691280 -- finishing -- 7
1425691280 -- starting -- 15
1425691280 -- starting -- 14
1425691282 -- finishing -- 12
1425691282 -- starting -- 16
1425691283 -- finishing -- 13
1425691283 -- starting -- 17
1425691284 -- finishing -- 17
1425691284 -- finishing -- 11
1425691284 -- starting -- 19
1425691284 -- starting -- 18
1425691285 -- finishing -- 14
1425691285 -- starting -- 20
1425691286 -- finishing -- 15
1425691286 -- finishing -- 18
1425691289 -- finishing -- 19
1425691289 -- finishing -- 16
1425691293 -- finishing -- 20

Now, all of the timestamps are in order.
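Rather than checking the ordering by eye, we could let sort do it. The sketch below defines ‘in_order’, a hypothetical helper made up for this example, and runs it on a few lines copied from the two runs above.

```bash
#!/bin/bash

# in_order is a hypothetical helper: it reads echo_sleep output on stdin and
# succeeds only if the first column (the timestamp) never decreases.
in_order() {
    awk '{ print $1 }' | sort -n -c 2>/dev/null
}

# Lines copied from the default (grouped) run.
GROUPED='1425690952 -- starting -- 2
1425690954 -- finishing -- 2
1425690952 -- starting -- 1'

# Lines copied from the --linebuffer run.
LINEBUFFERED='1425691272 -- starting -- 1
1425691273 -- finishing -- 1
1425691273 -- starting -- 7'

if printf '%s\n' "$GROUPED" | in_order; then
    echo "grouped: in order"
else
    echo "grouped: out of order"
fi

if printf '%s\n' "$LINEBUFFERED" | in_order; then
    echo "linebuffered: in order"
else
    echo "linebuffered: out of order"
fi
```

This prints “grouped: out of order” followed by “linebuffered: in order”. The same helper could be placed at the end of a real run, for example ‘parallel -j5 --linebuffer ./echo_sleep ::: {1..20} | in_order’.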

