I often create an externally managed Hive table as a step toward constructing something better (e.g., a Snappy-compressed Parquet table created by Hive for use in Impala or Spark). The data for that table is often broken down by day. Rather than typing an interactive Bash loop each time to iterate over dates and create the nested directory structure, I wrote the following script.

For instance, I want a root directory in HDFS (say, “/user/jason/my_root_dir”) to have date directories for all days in 2014, such as:
- /user/jason/my_root_dir/2014
- /user/jason/my_root_dir/2014/01
- /user/jason/my_root_dir/2014/01/01
- /user/jason/my_root_dir/2014/01/02
- /user/jason/my_root_dir/2014/01/03

- /user/jason/my_root_dir/2014/12/31

Running “./make_partitions /user/jason/my_root_dir 2014-01-01 2014-12-31” accomplishes this. Keep in mind that this takes a while, since every directory is checked and created with its own call to the cluster.

#!/bin/bash
 
# Usage: ./make_partitions HDFS_root_dir start_date end_date
# Example: ./make_partitions /user/root/mydir 2014-01-01 2014-12-31
# Creates nested year, month, day partitions for a sequence of dates (inclusive).
# Jason B. Hill - jason@jasonbhill.com
 
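# Parse input options
if [[ $# -ne 3 ]]; then
    echo "Usage: $0 HDFS_root_dir start_date end_date" >&2
    exit 1
fi
HDFSWD="$1"
START_DATE="$(date -d "$2" +%Y-%m-%d)" || exit 1
# END_DATE is one day past the last requested date, so the range is inclusive
END_DATE="$(date -d "$3 +1 days" +%Y-%m-%d)" || exit 1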
 
# Function to form directories based on a date
function mkdir_partition {
    # Input: $1 = date to form partition (YYYY-MM-DD)
    local year month day dir

    # Get date parameters
    year=$(date -d "$1" +%Y)
    month=$(date -d "$1" +%m)
    day=$(date -d "$1" +%d)

    # Create the year, month, and day directories in turn,
    # skipping any that already exist
    for dir in "${HDFSWD}/${year}" \
               "${HDFSWD}/${year}/${month}" \
               "${HDFSWD}/${year}/${month}/${day}"; do
        # "hdfs dfs -test -e" exits with 0 when the path exists
        if ! hdfs dfs -test -e "${dir}"; then
            echo "-- creating HDFS directory: ${dir}"
            hdfs dfs -mkdir "${dir}"
        fi
    done
}
 
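# Iterate over dates and make partitions.
# ISO dates compare correctly as strings, so the loop also ends cleanly
# when the start date falls after the end date.
ITER_DATE="${START_DATE}"
while [[ "${ITER_DATE}" < "${END_DATE}" ]]; do
    mkdir_partition "${ITER_DATE}"
    ITER_DATE=$(date -d "${ITER_DATE} +1 days" +%Y-%m-%d)
done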
 
exit 0
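
The script above starts a separate “hdfs dfs” process for every existence check and every mkdir, which is a large part of why a full year takes a while. One possible shortcut, sketched here rather than taken from the script, relies on “hdfs dfs -mkdir -p”, which creates missing parent directories, leaves existing ones alone, and accepts several paths per call. The names make_partitions_batched and MKDIR_CMD are hypothetical, the batch size of 100 is an arbitrary choice, and GNU date is assumed as before.

```shell
# Sketch of a batched variant: build day-level paths, then create many per call.
# make_partitions_batched and MKDIR_CMD are hypothetical names, not part of the
# original script. Assumes GNU date and an "hdfs" client on the PATH.
make_partitions_batched() {
    local root="$1" cur end
    local -a batch=() mkdir_cmd=(hdfs dfs -mkdir -p)

    # Optional override, e.g. MKDIR_CMD=echo for a dry run without a cluster
    if [[ -n "${MKDIR_CMD:-}" ]]; then
        read -r -a mkdir_cmd <<< "${MKDIR_CMD}"
    fi

    cur=$(date -d "$2" +%Y-%m-%d) || return 1
    end=$(date -d "$3" +%Y-%m-%d) || return 1

    # ISO dates sort lexicographically, so a string comparison is enough
    while [[ ! "${cur}" > "${end}" ]]; do
        batch+=("${root}/${cur//-//}")
        # Flush every 100 paths to keep the argument list short
        if (( ${#batch[@]} == 100 )); then
            "${mkdir_cmd[@]}" "${batch[@]}" || return 1
            batch=()
        fi
        cur=$(date -d "${cur} +1 day" +%Y-%m-%d)
    done

    # Create whatever is left in the final, partial batch
    if (( ${#batch[@]} > 0 )); then
        "${mkdir_cmd[@]}" "${batch[@]}" || return 1
    fi
}
```

Running “MKDIR_CMD=echo make_partitions_batched /user/jason/my_root_dir 2014-01-01 2014-12-31” prints the batches instead of creating anything. Dropping the MKDIR_CMD override issues the real calls. The trade-off is that the per-directory “-- creating HDFS directory” messages are lost, since “-mkdir -p” does not report which directories were new.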
