I wrote an article about running scripts in parallel using GNU parallel, and then I realized that GNU parallel isn’t in the standard CentOS repositories. Since the code that I’m writing can only depend on packages from the standard repos, I need to find a different solution.

If we want to perform the same action as in the referenced article, using xargs instead of GNU parallel, we’d run the following command.

$ echo {1..20} | xargs -n1 -P5 ./echo_sleep
1426008382 -- starting -- 2
1426008382 -- starting -- 5
1426008382 -- starting -- 1
1426008382 -- starting -- 3
1426008382 -- starting -- 4
1426008382 -- finishing -- 4
1426008382 -- starting -- 6
1426008383 -- finishing -- 1
1426008383 -- starting -- 7
1426008385 -- finishing -- 3
1426008385 -- starting -- 8
1426008386 -- finishing -- 7
1426008386 -- starting -- 9
1426008389 -- finishing -- 9
1426008389 -- starting -- 10
1426008390 -- finishing -- 2
1426008390 -- finishing -- 5
1426008390 -- starting -- 11
1426008390 -- starting -- 12
1426008391 -- finishing -- 6
1426008391 -- starting -- 13
1426008392 -- finishing -- 10
1426008392 -- starting -- 14
1426008394 -- finishing -- 8
1426008394 -- starting -- 15
1426008396 -- finishing -- 15
1426008396 -- starting -- 16
1426008397 -- finishing -- 16
1426008397 -- starting -- 17
1426008398 -- finishing -- 11
1426008398 -- starting -- 18
1426008399 -- finishing -- 12
1426008399 -- starting -- 19
1426008399 -- finishing -- 13
1426008399 -- starting -- 20
1426008399 -- finishing -- 20
1426008399 -- finishing -- 17
1426008400 -- finishing -- 14
1426008402 -- finishing -- 18
1426008408 -- finishing -- 19

Some things to note here. First, the “-n1” (or “-n 1”) option is critical: it tells xargs how many arguments from the echo output to pass to each instance of the script echo_sleep. The “-P5” option caps the number of simultaneously running instances at five. Also, the output controls for xargs aren’t as well developed as those in GNU parallel. In fact, it’s entirely possible for stdout from the invoked scripts to collide. For this reason, you may want the invoked scripts to manage their own output (in Python with better file handling, for instance) instead of having every instance simply write to the shared stdout.
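One way to sidestep colliding output is to have each job write to its own file and to collect the files afterwards. The following is a minimal sketch rather than a definitive approach: the scratch directory, the job_N.log file names, and the inline stand-in for echo_sleep are all assumptions made for illustration.

```shell
#!/bin/bash
# Sketch: each xargs job writes to its own log file, so lines cannot interleave.
# The scratch directory and the job_N.log names are hypothetical.
outdir="$(mktemp -d)"
export outdir

# The single-quoted snippet stands in for ./echo_sleep.
# xargs appends one input after "_", so inside sh it is available as $1.
echo {1..20} | xargs -n1 -P5 sh -c '
    {
        echo "$(date +%s) -- starting -- $1"
        sleep 1
        echo "$(date +%s) -- finishing -- $1"
    } > "${outdir}/job_$1.log"
' _

# xargs returns only after every job has exited; print the logs in input order
cat "${outdir}"/job_{1..20}.log
```

The trade-off is that nothing appears on the terminal until all of the jobs are done, so this suits batch runs better than jobs you want to watch live.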


What I want to be able to do is to run a script on a massive number of inputs. But, I only want a specified maximum number of them to be running at any given time. GNU parallel can accomplish this very easily.

First, make sure you have GNU parallel installed. The package in most major repositories is simply called “parallel”.

Writing A Basic Script

I’m going to write a bash script that echoes a timestamp and its input, and then waits two seconds before exiting. The script looks like this.

#!/bin/bash
 
VAL="$1"
TIME="$(date +%s)"
 
echo "${TIME} -- ${VAL}"
 
sleep 2

Just to make sure it is working, we chmod it to 0755 and then we call it with input “hi”.

$ ./echo_sleep hi
1425690292 -- hi

It worked just as expected: After it echoed the time and input, it slept for two seconds and then exited and my prompt returned.

Running the Script in Parallel

I want to run this script on 20 inputs, but I only ever want to have 5 instances running at any given time. Here’s how we do that with GNU parallel (where the input arguments for the script are denoted by ‘:::’). I’m just using the numbers 1..20 as the inputs.

$ parallel -j5 ./echo_sleep ::: {1..20}
1425690566 -- 1
1425690566 -- 2
1425690566 -- 3
1425690566 -- 4
1425690566 -- 5
1425690569 -- 6
1425690569 -- 7
1425690569 -- 8
1425690569 -- 9
1425690569 -- 10
1425690571 -- 11
1425690571 -- 12
1425690571 -- 13
1425690571 -- 14
1425690571 -- 15
1425690573 -- 16
1425690573 -- 17
1425690573 -- 18
1425690573 -- 19
1425690573 -- 20

Note: You can use any bash IFS-separated sequence as the input. For instance, something like ‘::: 1 2 3 4 5 6 7 8 9 10’ works just as well as the sequence ‘{1..10}’.

As you can see, each batch of five starts two to three seconds after the previous one, which matches the two-second sleep plus a little startup overhead. What if we change the script to sleep for a random amount of time? We’ll have each script instance wait between 0 and 9 seconds by changing the script as follows, asking for output both when the script starts and when it finishes.

#!/bin/bash
 
VAL="$1"
 
echo "$(date +%s) -- starting -- ${VAL}"
 
sleep "$(($RANDOM % 10))"
 
echo "$(date +%s) -- finishing -- ${VAL}"

Now, when we run the script, we’ll expect each instance to take a different amount of time to finish. We’ll notice that GNU parallel groups the output by default: the lines from a given instance are held back and sent to stdout together, once that instance finishes. (There are output control mechanisms in GNU parallel, but we’re not using them here.)

$ parallel -j5 ./echo_sleep ::: {1..20}
1425690952 -- starting -- 2
1425690954 -- finishing -- 2
1425690952 -- starting -- 1
1425690955 -- finishing -- 1
1425690952 -- starting -- 5
1425690955 -- finishing -- 5
1425690952 -- starting -- 3
1425690957 -- finishing -- 3
1425690952 -- starting -- 4
1425690959 -- finishing -- 4
1425690955 -- starting -- 7
1425690959 -- finishing -- 7
1425690955 -- starting -- 8
1425690961 -- finishing -- 8
1425690954 -- starting -- 6
1425690963 -- finishing -- 6
1425690959 -- starting -- 10
1425690963 -- finishing -- 10
1425690959 -- starting -- 11
1425690965 -- finishing -- 11
1425690961 -- starting -- 12
1425690965 -- finishing -- 12
1425690957 -- starting -- 9
1425690966 -- finishing -- 9
1425690966 -- starting -- 17
1425690967 -- finishing -- 17
1425690963 -- starting -- 13
1425690967 -- finishing -- 13
1425690963 -- starting -- 14
1425690967 -- finishing -- 14
1425690967 -- starting -- 19
1425690967 -- finishing -- 19
1425690967 -- starting -- 18
1425690969 -- finishing -- 18
1425690965 -- starting -- 15
1425690973 -- finishing -- 15
1425690965 -- starting -- 16
1425690973 -- finishing -- 16
1425690967 -- starting -- 20
1425690973 -- finishing -- 20

To get the output immediately, we can use the --linebuffer option. (There is also an --ungroup option, but it suffers from the problem of potentially mashing the simultaneous output of two script instances.)

$ parallel -j5 --linebuffer ./echo_sleep ::: {1..20}
1425691272 -- starting -- 1
1425691272 -- starting -- 2
1425691272 -- starting -- 3
1425691272 -- starting -- 4
1425691272 -- starting -- 5
1425691273 -- finishing -- 1
1425691273 -- finishing -- 2
1425691273 -- starting -- 7
1425691273 -- starting -- 6
1425691275 -- finishing -- 4
1425691275 -- finishing -- 6
1425691275 -- starting -- 8
1425691275 -- starting -- 9
1425691276 -- finishing -- 3
1425691276 -- starting -- 10
1425691277 -- finishing -- 9
1425691277 -- starting -- 11
1425691278 -- finishing -- 8
1425691278 -- finishing -- 10
1425691278 -- starting -- 12
1425691278 -- starting -- 13
1425691280 -- finishing -- 5
1425691280 -- finishing -- 7
1425691280 -- starting -- 15
1425691280 -- starting -- 14
1425691282 -- finishing -- 12
1425691282 -- starting -- 16
1425691283 -- finishing -- 13
1425691283 -- starting -- 17
1425691284 -- finishing -- 17
1425691284 -- finishing -- 11
1425691284 -- starting -- 19
1425691284 -- starting -- 18
1425691285 -- finishing -- 14
1425691285 -- starting -- 20
1425691286 -- finishing -- 15
1425691286 -- finishing -- 18
1425691289 -- finishing -- 19
1425691289 -- finishing -- 16
1425691293 -- finishing -- 20

Now, all of the timestamps are in order.


I’ve come across a situation where I have improperly encoded UTF-8 in a text file. I need to remove lines with improper encoding from the file.

Caveat: Keep in mind that Unicode and UTF-8 are different things. Unicode is best described as a character set while UTF-8 is a digital encoding, analogous respectively to the English alphabet and cursive writing. Just as you can write the alphabet in different ways (cursive, print, shorthand, etc.), you can write Unicode characters in various ways … UTF-8 has simply become the most popular.

Finding Multi-Byte Characters

One of the oddities with UTF-8 is that it uses a variable-length byte encoding. Many characters only require a single byte, but some require up to four bytes. You can grep a file and find anything encoded with more than a single byte with the following shell command.

grep -P '[^\x00-\x7f]' filename
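To see that command in action, here is a small sketch that builds a throwaway sample file and lists the lines containing non-ASCII bytes. The sample file and its contents are made up for illustration. The extra LC_ALL=C and -a settings are my own additions so that grep compares raw bytes and treats the file as text.

```shell
#!/bin/bash
# Sketch: list the lines of a sample file that contain non-ASCII bytes.
sample="$(mktemp)"   # hypothetical sample file

# Octal escapes: \303\251 is "é" and \303\257 is "ï" when encoded as UTF-8
printf 'plain ascii\ncaf\303\251\nna\303\257ve\n' > "${sample}"

# -n prefixes each matching line with its line number
# -a treats the file as text even if grep would guess that it is binary
LC_ALL=C grep -anP '[^\x00-\x7f]' "${sample}"

# -c counts the matching lines instead of printing them
multibyte_lines="$(LC_ALL=C grep -acP '[^\x00-\x7f]' "${sample}")"
echo "lines with multi-byte characters: ${multibyte_lines}"
```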

Removing Improperly Encoded UTF-8

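If a file contains improperly encoded UTF-8, the invalid byte sequences can be stripped with the following command. (The -c option tells iconv to omit anything it cannot convert, so only the offending bytes are dropped, not the whole line.)

iconv -c -f UTF-8 -t UTF-8 -o outputfile inputfile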

If you diff the input and output files, you’ll see any difference. Hence, when the diff is empty, you know that the input only contains valid UTF-8.
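If you really do want to drop whole lines rather than just the bad bytes, one option is to test each line separately. This is a rough sketch and not something from the original workflow: it assumes an iconv that exits non-zero on invalid input when -c is omitted (glibc's iconv behaves this way), and the file names are temporary placeholders. It starts one iconv process per line, so it will be slow on large files.

```shell
#!/bin/bash
# Sketch: keep only the lines that are valid UTF-8, dropping invalid lines whole.
export LC_ALL=C      # have bash handle each line as raw bytes

input="$(mktemp)"    # hypothetical input file
output="$(mktemp)"   # hypothetical output file

# Line 2 contains the byte 0xFF (octal \377), which never occurs in valid UTF-8
printf 'good line\nbad \377 line\ncaf\303\251\n' > "${input}"

while IFS= read -r line; do
    # Without -c, iconv reports failure when it meets an invalid sequence
    if printf '%s\n' "${line}" | iconv -f UTF-8 -t UTF-8 > /dev/null 2>&1; then
        printf '%s\n' "${line}"
    fi
done < "${input}" > "${output}"

kept="$(wc -l < "${output}")"
echo "kept ${kept} of $(wc -l < "${input}") lines"
```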


The ‘paste’ command will merge multiple files line by line, and you can declare a delimiter between the files’ contents.

This would be really useful for, say, creating a CSV using the contents of multiple files. You could run the following command and instantly create a CSV.

$ paste -d',' column1.txt column2.txt
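Here is a self-contained sketch of that command with two small sample files. The file contents are invented for illustration. Keep in mind that paste does no CSV quoting, so this only works cleanly when the fields themselves contain no commas.

```shell
#!/bin/bash
# Sketch: merge two single-column files into comma-separated rows.
workdir="$(mktemp -d)"   # hypothetical scratch directory

printf '%s\n' width length name > "${workdir}/column1.txt"
printf '%s\n' 3 7 widget > "${workdir}/column2.txt"

# Line N of the output is line N of each file, joined by the delimiter
csv="$(paste -d',' "${workdir}/column1.txt" "${workdir}/column2.txt")"
echo "${csv}"
```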

Another use I have found recently is to help generate large SQL create table statements. For instance, say that I have two files. File ‘file_a.txt’ has the following contents (SQL types).

INT
INT
STRING
DOUBLE
STRING
STRING

File ‘file_b.txt’ has the corresponding SQL column names.

width
length
name
cost
comment1
comment2

I can form a create table SQL statement in Bash very quickly using paste:

$ echo -e "CREATE TABLE my_table (\n$(paste -d' ' file_a.txt file_b.txt))"
CREATE TABLE my_table (
INT width
INT length
STRING name
DOUBLE cost
STRING comment1
STRING comment2)
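The output above still needs a little hand editing before a SQL engine will accept it, since column definitions are normally written as a name followed by a type, with commas between them. A possible variation, sketched here with the same two files recreated in a scratch directory (the directory is an assumption), swaps the file order and lets sed add the commas.

```shell
#!/bin/bash
# Sketch: build "name TYPE," column definitions from the two files.
workdir="$(mktemp -d)"   # hypothetical scratch directory

printf '%s\n' INT INT STRING DOUBLE STRING STRING > "${workdir}/file_a.txt"
printf '%s\n' width length name cost comment1 comment2 > "${workdir}/file_b.txt"

# Names first, then types; sed appends a comma to every line except the last
columns="$(paste -d' ' "${workdir}/file_b.txt" "${workdir}/file_a.txt" | sed '$!s/$/,/')"

statement="$(printf 'CREATE TABLE my_table (\n%s\n);' "${columns}")"
echo "${statement}"
```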

A lot of times, I’ll create an externally managed Hive table as a step toward constructing something better (e.g., a Parquet columnar snappy-compressed table created by Hive for use in Impala or Spark). The data for that table is often broken down by day. Instead of typing an interactive Bash command to iterate over dates and create the nested directory structure each time, I wrote the script below.

For instance, I want a root directory in HDFS (say, “/user/jason/my_root_dir”) to have date directories for all days in 2014, such as:
- /user/jason/my_root_dir/2014
- /user/jason/my_root_dir/2014/01
- /user/jason/my_root_dir/2014/01/01
- /user/jason/my_root_dir/2014/01/02
- /user/jason/my_root_dir/2014/01/03
- ...
- /user/jason/my_root_dir/2014/12/31

Running “./make_partitions /user/jason/my_root_dir 2014-01-01 2014-12-31” accomplishes this. Keep in mind that this takes a while, as the directories are checked and created across the cluster.

#!/bin/bash
 
# Usage: ./make_partitions HDFS_root_dir start_date end_date
# Example: ./make_partitions /user/root/mydir 2014-01-01 2014-12-31
# Creates nested year, month, day partitions for a sequence of dates (inclusive).
# Jason B. Hill - jason@jasonbhill.com
 
# Parse input options
HDFSWD=$1
START_DATE="$(date -d "$2" +%Y-%m-%d)"
END_DATE="$(date -d "$3 +1 days" +%Y-%m-%d)"
 
# Function to form directories based on a date
function mkdir_partition {
    # Input: $1 = date to form partition
    local year month day dir

    # Get date parameters
    year=$(date -d "$1" +%Y)
    month=$(date -d "$1" +%m)
    day=$(date -d "$1" +%d)

    # Create the year, month, and day directories, in that order, if missing
    for dir in "${year}" "${year}/${month}" "${year}/${month}/${day}"; do
        # 'hdfs dfs -test -e' exits non-zero when the path does not exist
        if ! hdfs dfs -test -e "${HDFSWD}/${dir}"; then
            echo "-- creating HDFS directory: ${HDFSWD}/${dir}"
            hdfs dfs -mkdir "${HDFSWD}/${dir}"
        fi
    done
}
 
# Iterate over dates and make partitions
ITER_DATE="${START_DATE}"
until [[ "${ITER_DATE}" == "${END_DATE}" ]]; do
    mkdir_partition ${ITER_DATE}
    ITER_DATE=$(date -d "${ITER_DATE} +1 days" +%Y-%m-%d)
done
 
exit 0
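Since every call to hdfs is slow, it can help to preview the work before touching the cluster. The sketch below is a dry run under a few assumptions: list_partitions is a hypothetical helper name, the date handling relies on GNU date as in the script above, and the commands are only printed. It uses "hdfs dfs -mkdir -p", which creates missing parent directories, so one call per day would be enough.

```shell
#!/bin/bash
# Sketch: print the mkdir commands for an inclusive date range (dry run).

# list_partitions is a hypothetical helper; it only prints day-level paths.
# Usage: list_partitions HDFS_root_dir start_date end_date
list_partitions() {
    local root="$1" iter end
    iter="$(date -d "$2" +%Y-%m-%d)"
    end="$(date -d "$3 +1 day" +%Y-%m-%d)"
    until [[ "${iter}" == "${end}" ]]; do
        echo "${root}/$(date -d "${iter}" +%Y/%m/%d)"
        iter="$(date -d "${iter} +1 day" +%Y-%m-%d)"
    done
}

# Drop the leading 'echo' to actually create the directories
list_partitions /user/jason/my_root_dir 2014-01-30 2014-02-02 |
    while IFS= read -r path; do
        echo hdfs dfs -mkdir -p "${path}"
    done
```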

The following bash script iterates over dates in a range. The start date is included and the end date is excluded.

#!/bin/bash
 
# $1 = start date (e.g.: yyyy-mm-dd)
# $2 = end date (exclusive: the loop stops before printing it)
 
# make sure the end date is formatted correctly
end_date=$(date -d "$2" +%Y-%m-%d)
 
# set iteration date to start date and format
iter_date=$(date -d "$1" +%Y-%m-%d)
 
# quote both sides so the right-hand side is compared literally, not as a pattern
until [[ "${iter_date}" == "${end_date}" ]]; do
    # print the date
    echo "${iter_date}"
    # advance the date
    iter_date=$(date -d "${iter_date} +1 day" +%Y-%m-%d)
done

I’ve saved that in a file called “loopdates.sh” and set its mode to 0755 with chmod. An example usage follows.

$ ./loopdates.sh 2014-11-27 2014-12-02
2014-11-27
2014-11-28
2014-11-29
2014-11-30
2014-12-01
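Two things are worth knowing about that loop: the end date is never printed, and a start date later than the end date would make it run forever. A possible variation, sketched here with a hypothetical function name (date_range), includes the end date and stops immediately when the range is empty. It relies on the fact that yyyy-mm-dd dates sort the same way as strings and as dates.

```shell
#!/bin/bash
# Sketch: inclusive date loop that also terminates when start > end.

# date_range is a hypothetical helper name.
# Usage: date_range start_date end_date
date_range() {
    local iter end
    iter="$(date -d "$1" +%Y-%m-%d)"
    end="$(date -d "$2" +%Y-%m-%d)"
    # Inside [[ ]], ">" compares strings, which is enough for yyyy-mm-dd dates
    until [[ "${iter}" > "${end}" ]]; do
        echo "${iter}"
        iter="$(date -d "${iter} +1 day" +%Y-%m-%d)"
    done
}

date_range 2014-11-27 2014-12-02
```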

Say you have a bunch of subdirectories of your current working directory, all including variously named files. You want to iterate over those files and apply some Bash command.

For instance, I have folders named 01, 02, 03, …, 31 (representing days in a month), and inside each of those folders sit various files. I wish to gzip each of those files individually. Here’s how I do that with a single line in Bash:

$ for d in */; do for f in "$d"*; do echo "gzip ${f}"; gzip "${f}"; done; done
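The same job can also be handed to find, which copes with odd file names without any extra quoting. This is only a sketch of an alternative: the scratch tree below stands in for the day folders, and the -mindepth and -maxdepth options assume GNU (or BSD) find.

```shell
#!/bin/bash
# Sketch: compress every file that sits exactly one directory level down.
workdir="$(mktemp -d)"   # hypothetical scratch tree standing in for 01, 02, ...
mkdir -p "${workdir}/01" "${workdir}/02"
echo "alpha" > "${workdir}/01/a.txt"
echo "beta" > "${workdir}/02/name with spaces.txt"

# -mindepth 2 -maxdepth 2 limits matches to files directly inside the subdirectories
# ! -name '*.gz' skips files that are already compressed
# -print shows each path before gzip replaces the file with file.gz
(cd "${workdir}" && find . -mindepth 2 -maxdepth 2 -type f ! -name '*.gz' -print -exec gzip {} \;)

gz_count="$(find "${workdir}" -type f -name '*.gz' | wc -l)"
echo "compressed files: ${gz_count}"
```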

The date program in Linux is incredibly powerful, and can be used to modify dates very quickly. Here are some examples.

The current time in my present locale:

$ date
Thu Nov 13 15:41:12 MST 2014

The current time in UTC: (Use UTC, not GMT. One is an international standard of keeping time based on atomic clocks while the other is a local old-fashioned timezone based on when the sun is highest in the sky … which isn’t exactly accurate enough for international business. Plus, GMT isn’t even used when daylight saving time is in effect. Anyway…)

$ date --utc
Thu Nov 13 22:44:06 UTC 2014

The time one day ago in UTC:

$ date --utc -d "now -1 day"
Wed Nov 12 22:44:51 UTC 2014

A specific date minus one day, formatted as we wish:

$ date -d "2014-10-01 -1 day" +%Y-%m-%d
2014-09-30
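Two more conversions in the same spirit, sketched with GNU date syntax. The epoch value is just a sample timestamp of the kind that "date +%s" prints, and the variable names are made up for the example.

```shell
#!/bin/bash
# Sketch: two more date conversions, assuming GNU date.

# Turn an epoch timestamp (as printed by 'date +%s') into readable UTC
epoch_utc="$(date --utc -d @1425690292 +%Y-%m-%dT%H:%M:%SZ)"
echo "${epoch_utc}"

# Last day of a month: first of the month, plus one month, minus one day
month_end="$(date --utc -d "2014-11-01 +1 month -1 day" +%Y-%m-%d)"
echo "${month_end}"
```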

See the manual page for date for more options.