This one isn’t code-related, but I come across it often enough in Linux Mint VMs that I’m saving it here for myself and anyone else who may find it useful.

Problem

I currently use Linux Mint as my development environment. When I’m using Firefox (39.0+build5-0build0.14.04.1) and I close a browser window containing multiple tabs, Firefox always prompts me: “Do you want Firefox to save your tabs for the next time it starts?” There is a checkbox telling Firefox not to raise this warning in the future, but no matter how many times you check it, the warning keeps coming back.

Solution

Open the ‘about:config’ page.

Set browser.tabs.warnOnClose to false (if it isn’t already).
Set browser.tabs.warnOnCloseOtherTabs to false.
Set browser.warnOnQuit to false.
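
If you prefer, the same three preferences can be set in a user.js file in your Firefox profile directory, which Firefox re-applies on every start. A minimal sketch using the preference names above:

// user.js (in your Firefox profile directory)
user_pref("browser.tabs.warnOnClose", false);
user_pref("browser.tabs.warnOnCloseOtherTabs", false);
user_pref("browser.warnOnQuit", false);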


There’s a problem that a coworker recently brought up. The problem statement can be found here.

This is a re-hash of an existing problem known as the one hundred prisoners problem. It seems impossible at first, and many naive approaches yield failing results. But, given the correct approach, it is solvable with a surprisingly high probability.

A Solution

It’s really just a permutation problem. The basic idea is that the boxes are numbered 1..100 and they all contain a bill numbered 1..100. To construct a working solution, you use the bills as pointers to box numbers in the following way. You start at the box labelled with your number and go to the box whose number is on the bill it contains. Keep doing that, looking in each box and going to the box named by the bill you find inside. Eventually, you reach the box containing the bill that points back to where you started, and that’s the bill you’re looking for. So, the question really becomes: how often does a random permutation of degree $n$ contain a cycle of length greater than $n/2$? If there is such a long cycle, then anyone whose number lies on it is out of luck, because it takes them more than $n/2$ steps to find their bill. If every cycle has length at most $n/2$, then everyone can find their bill within the allowed number of steps.
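
To make the pointer-following concrete, here is a small sketch (my own illustration, not part of the problem statement) that counts how many boxes a single player opens, with the boxes represented as a Python list perm where perm[i] is the number of the bill inside box i (0-indexed for convenience). The count it returns is exactly the length of the cycle containing the player’s number.

def steps_to_find(perm, start):
    """ Follow bills as pointers from box 'start' until the bill numbered
    'start' turns up; return the number of boxes opened. """
    box, steps = start, 1
    while perm[box] != start:
        box = perm[box]
        steps += 1
    return steps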

So, what fraction of the permutations of degree $n=100$ contain only cycles of length at most 50? One could count these permutations with Stirling-number machinery, but there is a shortcut: a permutation of degree $n$ can contain at most one cycle of length $k > n/2$, and the probability that a uniformly random permutation contains a cycle of a fixed length $k > n/2$ is exactly $1/k$. So the failure probability is $\sum_{k=n/2+1}^{n} 1/k$, which tends to $\ln(2)$ as $n$ grows, and the success probability is asymptotic to $1-\ln(2) \approx 0.3069$. That’s a bit surprising, because it means that the strategy above is expected to win roughly 30% of the time, no matter how large the number of boxes gets.
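
As a quick sanity check on that number (a sketch only; the simulation below doesn’t depend on it), the failure probability for $n=100$ is just a tail of the harmonic series and can be computed directly:

# A cycle of length k > n/2 occurs with probability exactly 1/k, and at most
# one such cycle can exist, so P(failure) = sum_{k=n/2+1}^{n} 1/k.
n = 100
p_fail = sum(1.0 / k for k in range(n // 2 + 1, n + 1))
print(1 - p_fail)  # about 0.3118 for n = 100; tends to 1 - ln(2) as n grows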

Testing with Sage

How well does the strategy actually perform when $n=100$? We can test this out in Sage.

def max_length_orbit_random_perm(n):
    """ Generate a random permutation on 1..n and compute the longest orbit."""
    return max([len(i) for i in Permutations(n).random_element().cycle_tuples()]) 
 
won = 0
tries = 10000
for i in range(tries):
    if max_length_orbit_random_perm(100) <= 50:
        won += 1
print "won %s out of %s" % (won, tries)

And we get the following.

won 3072 out of 10000

So, we win roughly 3 out of 10 times. If we were charged $1 to play the game but won $100 whenever everyone found their bill, our expected profit would be roughly 0.31 x 100 - 1, or about $30 per game, so we’d be doing very well to play.


The Problem

Using HDFS commands (e.g., ‘hdfs dfs -put filename path’) can be frustratingly slow, especially when you’re trying to move many files to distinct locations. Each command can take three or so seconds simply to spin up a JVM process.

Solution: Use HTTPFS

A solution is to use httpfs. Of course, instead of forming relatively simple HDFS commands, we now need to form HTTP requests and submit them to an HDFS node running httpfs. We can do this with curl, and it works perfectly fine, but it’s tedious. So, we’ll instead bury the requests inside a script of our own and call its methods when needed. I’m using Python with the requests package. One could also use pycurl if you desire greater control at the expense of pain and suffering.
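
To give a feel for what these requests look like before wrapping them in helper functions, here is a minimal sketch of a single httpfs call with requests; the host name, user, and path are placeholders matching the examples further down.

import requests
 
# httpfs listens on port 14000 by default and serves the WebHDFS REST API.
url = 'http://hadoop01:14000/webhdfs/v1/user/root?user.name=root&op=LISTSTATUS'
resp = requests.get(url)
print(resp.json())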

Plus, we can do awesome things that aren’t easy with the HDFS command-line tools, such as appending to files (from other files or from strings) and reading files directly into memory. So, for instance, we can load a JSON file from HDFS directly into Python as a dict. *head explodes*

A caveat: I haven’t included an ‘rm’ command yet. In a lot of Hadoop environments where safety is an issue (such as where I’m testing this), I don’t want an easy way to blow data away. (You can figure it out from the other commands and an httpfs reference.)

Here’s the script, which is a bit long, followed by some usage examples.

github: httpfs_utils

#!/usr/bin/python2
 
# httpfs_utils.py
#
# Provides HDFS access via httpfs using Python's requests package.
 
 
import datetime
import requests
try:
    import simplejson as json
except ImportError:
    import json
 
 
###################################################################################################
# Helper functions                                                                                #
###################################################################################################
 
def _get_max_str_len_(filestatuses, key):
    """ Returns the max string value length for a list of dictionaries with 'field' as key.
 
    This is used to pretty print directory listings.
 
    INPUT
    -----
    filestatus : list of dicts
        The FileStatuses dictionary returned by the liststatus method.
    key : str
        The key for which we wish to find the maximum length value.
 
    OUTPUT
    ------
    int : The length of the longest value.
    """
    return max([len(str(B[key])) for B in filestatuses['FileStatuses']['FileStatus']])
 
 
def _perm_long_str_(type_str, perm_str):
    """ Forms the long string version of the permission string.
 
    INPUT
    -----
    type_str : str
        The type of object as given by list, e.g., 'FILE' or 'DIRECTORY'.
    perm_str : str
        The short form (numeric) version of the permission string.
 
    OUTPUT
    ------
    str : The long form version of the permission string.
    """
    # Determine if a directory is represented.
    if type_str == 'DIRECTORY':
        perm_str_long = 'd'
    else:
        perm_str_long = '-'
    # Convert the permission string to long letter form.
    for n in perm_str:
        L = [int(i) for i in list(bin(int(n)).split('0b')[1].zfill(3))]
        if L[0]:
            perm_str_long += 'r'
        else:
            perm_str_long += '-'
        if L[1]:
            perm_str_long += 'w'
        else:
            perm_str_long += '-'
        if L[2]:
            perm_str_long += 'x'
        else:
            perm_str_long += '-'
 
    return perm_str_long
 
 
def make_httpfs_url(host, user, hdfs_path, op, port=14000):
    """ Forms the URL for httpfs requests.
 
    INPUT
    -----
    host : str
        The host to connect to for httpfs access to HDFS. (Can be 'localhost'.)
    user : str
        The user to use for httpfs connections.
    hdfs_path : str
        The full path of the file or directory being checked.
    op : str
        The httpfs operation string. E.g., 'GETFILESTATUS'.
    port : int
        The port to use for httpfs connections.
 
    OUTPUT
    ------
    str : The string to use for an HTTP request to httpfs.
    """
    url = 'http://' + user + '@' + host + ':' + str(port) + '/webhdfs/v1'
    url += hdfs_path + '?user.name=' + user + '&op=' + op
 
    return url
 
 
###################################################################################################
# Functions                                                                                       #
###################################################################################################
 
def append(host, user, hdfs_path, filename, port=14000):
    """ Appends contents of 'filename' to 'hdfs_path' on 'user'@'host':'port'.
 
    INPUT
    -----
    host : str
        The host to connect to for httpfs access to HDFS. (Can be 'localhost'.)
    user : str
        The user to use for httpfs connections.
    hdfs_path : str
        The full path of the file to be appended to in HDFS.
    filename : str
        The file with contents being appended to hdfs_path. Can be a local file or a full path.
    port : int : default=14000
        The port to use for httpfs connections.
    """
    # Form the URL.
    url = make_httpfs_url(
        host=host,
        user=user,
        hdfs_path=hdfs_path,
        op='APPEND&data=true',
        port=port
    )
    headers = {
        'Content-Type':'application/octet-stream'
    }
 
    resp = requests.post(url, data=open(filename,'rb'), headers=headers)
    if resp.status_code != 200:
        resp.raise_for_status()
 
 
def appends(host, user, hdfs_path, content, port=14000):
    """ Appends 'content' to 'hdfs_path' on 'user'@'host':'port'.
 
    This method is like 'append', but takes a string as input instead of a file name.
 
    INPUT
    -----
    host : str
        The host to connect to for httpfs access to HDFS. (Can be 'localhost'.)
    user : str
        The user to use for httpfs connections.
    hdfs_path : str
        The full path of the file to be appended to in HDFS.
    content : str
        The contents being appended to hdfs_path.
    port : int : default=14000
        The port to use for httpfs connections.
    """
    # Form the URL.
    url = make_httpfs_url(
        host=host,
        user=user,
        hdfs_path=hdfs_path,
        op='APPEND&data=true',
        port=port
    )
    headers = {
        'Content-Type':'application/octet-stream'
    }
 
    resp = requests.post(url, data=content, headers=headers)
    if resp.status_code != 200:
        resp.raise_for_status()
 
 
def copy_to_local(host, user, hdfs_path, filename, port=14000):
    """ Copies the file at 'hdfs_path' on 'user'@'host':'port' to 'filename' locally.
 
    INPUT
    -----
    host : str
        The host to connect to for httpfs access to HDFS. (Can be 'localhost'.)
    user : str
        The user to use for httpfs connections.
    hdfs_path : str
        The full path of the file in HDFS.
    filename : str
        The local file to copy the contents of the HDFS file into.
    port : int : default=14000
        The port to use for httpfs connections.
    """
    # Form the URL.
    url = make_httpfs_url(host=host, user=user, hdfs_path=hdfs_path, op='OPEN', port=port)
 
    # Form and issue the request.
    resp = requests.get(url, stream=True)
 
    if resp.status_code == 200:
        with open(filename, 'wb') as f_p:
            for chunk in resp:
                f_p.write(chunk)
    else:
        resp.raise_for_status()
 
 
def exists(host, user, hdfs_path, port=14000):
    """ Returns True if 'hdfs_path' (full path) exists in HDFS at user@host:port via httpfs.
 
    INPUT
    -----
    host : str
        The host to connect to for httpfs access to HDFS. (Can be 'localhost'.)
    user : str
        The user to use for httpfs connections.
    hdfs_path : str
        The full path of the file or directory being checked.
    port : int
        The port to use for httpfs connections.
 
    OUTPUT
    ------
    Boolean : True if 'hdfs_path' exists and can be accessed by 'user'; False otherwise.
    """
    op = 'GETFILESTATUS'
    url = make_httpfs_url(host=host, user=user, hdfs_path=hdfs_path, op=op, port=port)
    # Get the JSON response using httpfs; stores as a Python dict
    resp = requests.get(url)
    # If a 404 was returned, the file/path does not exist
    if resp.status_code == 404:
        return False
    # If a 200 was returned, the file/path does exist
    elif resp.status_code == 200:
        return True
    # Something else - raise status, or if all else fails return None
    else:
        resp.raise_for_status()
        return None
 
 
def get_blocksize(host, user, hdfs_path, port=14000):
    """ Returns the HDFS block size (bytes) of 'hdfs_path' in HDFS at user@host:port via httpfs.
 
    The returned block size is in bytes. For MiB, divide this value by 2**20=1048576.
 
    INPUT
    -----
    host : str
        The host to connect to for httpfs access to HDFS. (Can be 'localhost'.)
    user : str
        The user to use for httpfs connections.
    hdfs_path : str
        The full path of the file or directory being checked.
    port : int
        The port to use for httpfs connections.
 
    OUTPUT
    ------
    int/long : The block size in bytes.
    """
    op = 'GETFILESTATUS'
    url = make_httpfs_url(host=host, user=user, hdfs_path=hdfs_path, op=op, port=port)
    # Get the JSON response using httpfs; stores as a Python dict
    resp = requests.get(url)
    # If a 200 was returned, the file/path exists
    if resp.status_code == 200:
        return resp.json()['FileStatus']['blockSize']
    # Something else - raise status, or if all else fails return None
    else:
        resp.raise_for_status()
 
 
def get_size(host, user, hdfs_path, port=14000):
    """ Returns the size (bytes) of 'hdfs_path' in HDFS at user@host:port via httpfs.
 
    The returned size is in bytes. For MiB, divide this value by 2**20=1048576.
 
    INPUT
    -----
    host : str
        The host to connect to for httpfs access to HDFS. (Can be 'localhost'.)
    user : str
        The user to use for httpfs connections.
    hdfs_path : str
        The full path of the file or directory being checked.
    port : int
        The port to use for httpfs connections.
 
    OUTPUT
    ------
    int/long : The size in bytes.
    """
    op = 'GETFILESTATUS'
    url = make_httpfs_url(host=host, user=user, hdfs_path=hdfs_path, op=op, port=port)
    # Get the JSON response using httpfs; stores as a Python dict
    resp = requests.get(url)
    # If a 200 was returned, the file/path exists
    if resp.status_code == 200:
        return resp.json()['FileStatus']['length']
    # Something else - raise status, or if all else fails return None
    else:
        resp.raise_for_status()
 
 
def info(host, user, hdfs_path, port=14000):
    """ Returns a dictionary of info for 'hdfs_path' in HDFS at user@host:port via httpfs.
 
    This method is similar to 'liststatus', but only displays top-level information. If you need
    info about all of the files and subdirectories of a directory, use 'liststatus'.
 
    The returned dictionary contains keys: group, permission, blockSize, accessTime, pathSuffix,
    modificationTime, replication, length, owner, type.
 
    INPUT
    -----
    host : str
        The host to connect to for httpfs access to HDFS. (Can be 'localhost'.)
    user : str
        The user to use for httpfs connections.
    hdfs_path : str
        The full path of the file or directory being checked.
    port : int
        The port to use for httpfs connections.
 
    OUTPUT
    ------
    Dictionary : Information about 'hdfs_path'
    """
    op = 'GETFILESTATUS'
    url = make_httpfs_url(host=host, user=user, hdfs_path=hdfs_path, op=op, port=port)
    # Get the JSON response using httpfs; stores as a Python dict
    resp = requests.get(url)
    # If a 200 was returned, the file/path exists
    if resp.status_code == 200:
        return resp.json()
    # Something else - raise status, or if all else fails return None
    else:
        resp.raise_for_status()
 
 
def liststatus(host, user, hdfs_path, port=14000):
    """ Returns a dictionary of info for 'hdfs_path' in HDFS at user@host:port via httpfs.
 
    Returns a dictionary of information. When used on a file, the returned dictionary contains a
    copy of the dictionary returned by 'info.' When used on a directory, the returned dictionary
    contains a list of such dictionaries.
 
    INPUT
    -----
    host : str
        The host to connect to for httpfs access to HDFS. (Can be 'localhost'.)
    user : str
        The user to use for httpfs connections.
    hdfs_path : str
        The full path of the file or directory being checked.
    port : int
        The port to use for httpfs connections.
 
    OUTPUT
    ------
    Dictionary : Information about 'hdfs_path'
    """
    op = 'LISTSTATUS'
    url = make_httpfs_url(host=host, user=user, hdfs_path=hdfs_path, op=op, port=port)
    # Get the JSON response using httpfs; stores as a Python dict
    resp = requests.get(url)
    # If a 200 was returned, the file/path exists
    if resp.status_code == 200:
        return resp.json()
    # Something else - raise status, or if all else fails return None
    else:
        resp.raise_for_status()
 
 
def ls(host, user, hdfs_path, port=14000):
    """ Print info for 'hdfs_path' in HDFS at user@host:port via httpfs.
 
    A print function intended for interactive usage. Similar to 'ls -l' or 'hdfs dfs -ls'.
 
    INPUT
    -----
    host : str
        The host to connect to for httpfs access to HDFS. (Can be 'localhost'.)
    user : str
        The user to use for httpfs connections.
    hdfs_path : str
        The full path of the file or directory being checked.
    port : int
        The port to use for httpfs connections.
    """
    op = 'LISTSTATUS'
    url = make_httpfs_url(host=host, user=user, hdfs_path=hdfs_path, op=op, port=port)
    # Get the JSON response using httpfs; stores as a Python dict
    resp = requests.get(url)
    # If a 200 was returned, the file/path exists. Otherwise, raise error or exit.
    if resp.status_code != 200:
        resp.raise_for_status()
    else:
        filestatuses = resp.json()
        for obj in filestatuses['FileStatuses']['FileStatus']:
            obj_str = _perm_long_str_(type_str=obj['type'],perm_str=obj['permission'])
            obj_str += '%*s' % (
                _get_max_str_len_(filestatuses, 'replication')+3,
                obj['replication']
            )
            obj_str += '%*s' % (
                _get_max_str_len_(filestatuses, 'owner')+3,
                obj['owner']
            )
            obj_str += '%*s' % (
                _get_max_str_len_(filestatuses, 'group')+2,
                obj['group']
            )
            obj_str += '%*s' % (
                _get_max_str_len_(filestatuses, 'length')+4,
                obj['length']
            )
            obj_str += '%21s' % (
                datetime.datetime.utcfromtimestamp(
                    obj['modificationTime']/1000
                ).isoformat().replace('T',' ')
            )
            obj_str += ' ' + hdfs_path + '/' + obj['pathSuffix']
 
            print "%s" % obj_str
 
 
def mkdir(host, user, hdfs_path, port=14000):
    """ Creates the directory 'hdfs_path' on 'user'@'host':'port'.
 
    Directories are created recursively.
 
    INPUT
    -----
    host : str
        The host to connect to for httpfs access to HDFS. (Can be 'localhost'.)
    user : str
        The user to use for httpfs connections.
    hdfs_path : str
        The path of the directory to create in HDFS.
    port : int : default=14000
        The port to use for httpfs connections.
    """
    op = 'MKDIRS'
    url = make_httpfs_url(host=host, user=user, hdfs_path=hdfs_path, op=op, port=port)
 
    # Make the request
    resp = requests.put(url)
    # If a 200 was returned, the file/path exists
    if resp.status_code == 200:
        return resp.json()
    # Something else - raise status, or if all else fails return None
    else:
        resp.raise_for_status()
 
 
def put(host, user, hdfs_path, filename, port=14000, perms=775):
    """ Puts 'filename' into 'hdfs_path' on 'user'@'host':'port'.
 
    INPUT
    -----
    host : str
        The host to connect to for httpfs access to HDFS. (Can be 'localhost'.)
    user : str
        The user to use for httpfs connections.
    hdfs_path : str
        The full path of the location to place the file in HDFS.
    filename : str
        The file to upload. Can be a local file or a full path.
    port : int : default=14000
        The port to use for httpfs connections.
    perms : str or int : default=775
        The permissions to use for the uploaded file in HDFS.
    """
    # Get the file name without base path.
    filename_short = filename.split('/')[-1]
    # Form the URL.
    url = make_httpfs_url(
        host=host,
        user=user,
        hdfs_path=hdfs_path + '/' + filename_short,
        op='CREATE&data=true&overwrite=true&permission=' + str(perms),
        port=port
    )
    headers = {
        'Content-Type':'application/octet-stream'
    }
    #files = {'file': open(filename,'rb')}
 
    resp = requests.put(url, data=open(filename,'rb'), headers=headers)
    if resp.status_code != 200:
        resp.raise_for_status()
 
 
def read(host, user, hdfs_path, port=14000):
    """ Reads file at 'hdfs_path' on 'user'@'host':'port'.
 
    This method allows the contents of a file in HDFS to be read into memory in Python.
 
    INPUT
    -----
    host : str
        The host to connect to for httpfs access to HDFS. (Can be 'localhost'.)
    user : str
        The user to use for httpfs connections.
    hdfs_path : str
        The full path of the file in HDFS.
    port : int : default=14000
        The port to use for httpfs connections.
 
    OUTPUT
    ------
    Text of the file.
    """
    # Form the URL.
    url = make_httpfs_url(host=host, user=user, hdfs_path=hdfs_path, op='OPEN', port=port)
 
    # Form and issue the request.
    resp = requests.get(url)
 
    if resp.status_code != 200:
        resp.raise_for_status()
 
    return resp.text
 
 
def read_json(host, user, hdfs_path, port=14000):
    """ Reads JSON file at 'hdfs_path' on 'user'@'host':'port' and returns a Python dict.
 
    This method reads the contents of a JSON file in HDFS into Python as a dictionary.
 
    INPUT
    -----
    host : str
        The host to connect to for httpfs access to HDFS. (Can be 'localhost'.)
    user : str
        The user to use for httpfs connections.
    hdfs_path : str
        The full path of the file in HDFS.
    port : int : default=14000
        The port to use for httpfs connections.
 
    OUTPUT
    ------
    Text of the file interpreted in JSON as a Python dict.
    """
    # Form the URL.
    url = make_httpfs_url(host=host, user=user, hdfs_path=hdfs_path, op='OPEN', port=port)
 
    # Form and issue the request.
    resp = requests.get(url)
 
    if resp.status_code != 200:
        resp.raise_for_status()
 
    return json.loads(resp.text)

Here’s an example of printing the contents of a directory in IPython. I should point out that this is across a network connection, but still only takes 0.08 seconds.

In [1]: import httpfs_utils as httpfs
 
In [2]: time httpfs.ls(host='hadoop01',user='root',hdfs_path='/user/root')
drwx------   0     root  supergroup        0  2015-01-29 20:21:17 /user/root/.Trash
drwx------   0     root  supergroup        0  2015-01-22 20:49:49 /user/root/.staging
-rwxrwxr-x   3     root  supergroup    74726  2015-07-06 16:50:08 /user/root/alice.txt
drwxr-xr-x   0     root  supergroup        0  2015-07-06 22:17:52 /user/root/config
drwxr-xr-x   0     root  supergroup        0  2015-03-03 19:38:21 /user/root/hdfs_bin
drwxr-xrwx   0     root  supergroup        0  2015-04-09 00:56:40 /user/root/referrer_host
drwxr-xr-x   0     root  supergroup        0  2015-02-26 22:58:04 /user/root/stage
-rwxrwxr-x   3     root  supergroup       49  2015-07-06 19:09:20 /user/root/test_append.txt
drwxr-xr-x   0     root  supergroup        0  2015-07-06 19:38:14 /user/root/testdir
drwxr-xr-x   0     root  supergroup        0  2015-03-04 22:04:37 /user/root/tmp
CPU times: user 0.02 s, sys: 0.02 s, total: 0.03 s
Wall time: 0.08 s

That was simply a print function. We can do much more useful things, such as uploading a file. Here, I’ll upload a smallish (1.6MiB) CSV into HDFS. Then, we’ll verify that the file exists in HDFS.

In [3]: time httpfs.put(host='hadoop01',user='root',hdfs_path='/user/root',filename='/export/data-share/test.csv')
CPU times: user 0.00 s, sys: 0.02 s, total: 0.02 s
Wall time: 0.09 s
 
In [4]: time httpfs.exists(host='hadoop01',user='root',hdfs_path='/user/root/test.csv',)
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.02 s
Out[4]: True

Let’s read a JSON file directly into Python as a dictionary. (I’ve tracked all tweets mentioning the company for which I work, SpotXchange, for a year now. Here’s a random one.)

In [5]: time tweet = httpfs.read_json(host='hadoop01',user='root',hdfs_path='/user/root/spotxtweet.json')
CPU times: user 0.01 s, sys: 0.01 s, total: 0.02 s
Wall time: 0.66 s
 
In [8]: tweet
Out[8]: 
{u'contributors': None,
 u'coordinates': None,
 u'created_at': u'Mon Jul 14 22:25:14 +0000 2014',
 u'entities': {u'hashtags': [],
  u'symbols': [],
  u'urls': [{u'display_url': u'bit.ly/1qoqcDB',
    u'expanded_url': u'http://bit.ly/1qoqcDB',
    u'indices': [102, 124],
    u'url': u'http://t.co/yyMiXenMTJ'}],
  u'user_mentions': []},
 u'favorite_count': 0,
 u'favorited': False,
 u'filter_level': u'medium',
 u'geo': None,
 u'id': 488811485703835648,
 u'id_str': u'488811485703835648',
 u'in_reply_to_screen_name': None,
 u'in_reply_to_status_id': None,
 u'in_reply_to_status_id_str': None,
 u'in_reply_to_user_id': None,
 u'in_reply_to_user_id_str': None,
 u'lang': u'en',
 u'place': None,
 u'possibly_sensitive': False,
 u'retweet_count': 0,
 u'retweeted': False,
 u'source': u'<a href="http://www.hootsuite.com" rel="nofollow">Hootsuite</a>',
 u'text': u'Great insights from the CEO. Publishers Rapidly Adopting Programmatic Ad Sales: SpotXchange\u2019s Shehan: http://t.co/yyMiXenMTJ',
 u'truncated': False,
 u'user': {u'contributors_enabled': False,
  u'created_at': u'Thu Oct 31 17:50:48 +0000 2013',
  u'default_profile': False,
  u'default_profile_image': False,
  u'description': u'The Center of the Native Mobile Advertising Community',
  u'favourites_count': 1,
  u'follow_request_sent': None,
  u'followers_count': 152,
  u'following': None,
  u'friends_count': 83,
  u'geo_enabled': False,
  u'id': 2166999860,
  u'id_str': u'2166999860',
  u'is_translation_enabled': False,
  u'is_translator': False,
  u'lang': u'en',
  u'listed_count': 10,
  u'location': u'Anywhere and Everywhere',
  u'name': u'NativeMobile.com',
  u'notifications': None,
  u'profile_background_color': u'000000',
  u'profile_background_image_url': u'http://pbs.twimg.com/profile_background_images/378800000139148775/8OxxwKh3.jpeg',
  u'profile_background_image_url_https': u'https://pbs.twimg.com/profile_background_images/378800000139148775/8OxxwKh3.jpeg',
  u'profile_background_tile': False,
  u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/2166999860/1386291602',
  u'profile_image_url': u'http://pbs.twimg.com/profile_images/378800000834565058/0948f6f3c16916fbc99746370d067d71_normal.png',
  u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/378800000834565058/0948f6f3c16916fbc99746370d067d71_normal.png',
  u'profile_link_color': u'0084B4',
  u'profile_sidebar_border_color': u'FFFFFF',
  u'profile_sidebar_fill_color': u'DDEEF6',
  u'profile_text_color': u'333333',
  u'profile_use_background_image': True,
  u'protected': False,
  u'screen_name': u'nativemobile',
  u'statuses_count': 1841,
  u'time_zone': u'Pacific Time (US & Canada)',
  u'url': u'http://nativemobile.com',
  u'utc_offset': -25200,
  u'verified': False}}

Carriage Return -vs- Line Feed

Every now and then, I end up with a text-parsing issue that simply comes down to carriage returns versus line feeds. For instance, with Twitter’s streaming API, you can receive multiple responses that together form a single JSON entity. You can recognize this situation because the end of an entity is terminated by a carriage return, while the individual chunks before the end only have line feed (new line) terminators.

What’s really a bit messed up is that Windows is a big fan of CRLF (‘\r\n’), while most of Unix prefers a lone LF (‘\n’) as the line terminator. Classic Mac OS, meanwhile, had a thing for using a single CR (‘\r’) because it liked to be special.

This all started when computers were supposed to mimic typewriters. At the end of a line, you needed to do two things at once: (1) Return the character carriage to the left so you could type more and (2) advance the line feed so you wouldn’t simply write over what you just wrote.

Recognizing CRLF -vs- LF

Both the carriage return (CR) and line feed (LF) are represented by non-printable characters. Your terminal or text editor simply knows how to interpret them. When you tell a script to write ‘\n’ for a line feed, you’re actually referencing the non-printable ASCII character 0x0A (decimal 10). When you write ‘\r’, you’re referencing 0x0D (decimal 13).
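
You can convince yourself of those byte values from a Python prompt:

>>> ord('\n'), ord('\r')
(10, 13)
>>> hex(ord('\n')), hex(ord('\r'))
('0xa', '0xd')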

But those characters don’t actually print. They only instruct a terminal or text editor to display the text around the characters in specific ways. So, how do you recognize them?

The Linux ‘file’ command will tell you what sort of line terminators a file has, so that’s pretty quick. Here’s an example:

$ file my_file_created_on_Windows.txt
my_file_created_on_Windows.txt: ASCII text, with very long lines, with CRLF line terminators
 
$ file my_file_created_on_Linux
my_file_created_on_Linux: ASCII text, with very long lines

If the file uses only LF terminators, this is considered the default and you won’t be informed.

Removing CR Terminators

You have several options for getting rid of those ‘\r’ CR characters in text. One option is to simply ‘tr’ the text in the terminal:

$ tr -d '\r' < my_file_created_on_Windows.txt > my_new_file.txt

Another option is to use a utility such as ‘dos2unix.’ Yet another option would be to use a more advanced text parsing language, such as Python, and replace the characters manually:

import codecs
 
f_p = codecs.open('my_file_created_on_Windows.txt','r','utf-8')
g_p = codecs.open('my_new_file.txt','w','utf-8')
 
for line in f_p:
    # Each line read still ends with '\n'; we only strip the '\r' before it.
    g_p.write(line.replace('\r',''))
f_p.close()
g_p.close()

A few notes on that Python code. First, we use the codecs module to read the text because it may contain non-ASCII (Unicode) characters; here we’re reading and writing in the UTF-8 encoding. Also, we only strip the CR characters: each line read from the file still ends with its LF, and write() outputs the line exactly as given, so we keep a single LF per line and don’t end up writing CRLF pairs.


I had written an article about running scripts in parallel using GNU Parallel, and then I realized that GNU parallel isn’t in the CentOS repositories. Since the code I’m writing has to work with only the standard repositories, I need a different solution.

If we want to perform the same action as in the referenced article, using xargs instead of GNU parallel, we’d run the following command.

$ echo {1..20} | xargs -n1 -P5 ./echo_sleep
1426008382 -- starting -- 2
1426008382 -- starting -- 5
1426008382 -- starting -- 1
1426008382 -- starting -- 3
1426008382 -- starting -- 4
1426008382 -- finishing -- 4
1426008382 -- starting -- 6
1426008383 -- finishing -- 1
1426008383 -- starting -- 7
1426008385 -- finishing -- 3
1426008385 -- starting -- 8
1426008386 -- finishing -- 7
1426008386 -- starting -- 9
1426008389 -- finishing -- 9
1426008389 -- starting -- 10
1426008390 -- finishing -- 2
1426008390 -- finishing -- 5
1426008390 -- starting -- 11
1426008390 -- starting -- 12
1426008391 -- finishing -- 6
1426008391 -- starting -- 13
1426008392 -- finishing -- 10
1426008392 -- starting -- 14
1426008394 -- finishing -- 8
1426008394 -- starting -- 15
1426008396 -- finishing -- 15
1426008396 -- starting -- 16
1426008397 -- finishing -- 16
1426008397 -- starting -- 17
1426008398 -- finishing -- 11
1426008398 -- starting -- 18
1426008399 -- finishing -- 12
1426008399 -- starting -- 19
1426008399 -- finishing -- 13
1426008399 -- starting -- 20
1426008399 -- finishing -- 20
1426008399 -- finishing -- 17
1426008400 -- finishing -- 14
1426008402 -- finishing -- 18
1426008408 -- finishing -- 19

Some things to note here: First, the "-n1" (or "-n 1") option is critical, as it tells xargs to pass exactly one argument from the echoed list to each invocation of echo_sleep. Also, the output controls for xargs aren’t as well developed as GNU parallel’s; in fact, it’s entirely possible for stdout from concurrently running scripts to collide. For this reason, you may want the invoked scripts to be more advanced and handle their own output (in Python with better file handling, for instance) instead of simply redirecting bash output; a sketch of that idea follows.
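
Here’s a hypothetical Python stand-in for echo_sleep along those lines: each instance writes its two lines to its own file under an output directory (the directory name is my own choice), so concurrent instances never share stdout. It would be invoked the same way, e.g. ‘echo {1..20} | xargs -n1 -P5 ./echo_sleep_safe.py’, and the per-instance logs can be concatenated afterwards.

#!/usr/bin/python2
 
# echo_sleep_safe.py (hypothetical): like echo_sleep, but each instance writes
# to its own file under OUT_DIR instead of writing to a shared stdout.
import os
import sys
import time
 
OUT_DIR = '/tmp/echo_sleep_out'
 
val = sys.argv[1]
if not os.path.isdir(OUT_DIR):
    try:
        os.makedirs(OUT_DIR)
    except OSError:
        pass  # another instance may have created it first
 
with open(os.path.join(OUT_DIR, '%s.log' % val), 'w') as f_p:
    f_p.write('%d -- starting -- %s\n' % (int(time.time()), val))
    time.sleep(2)
    f_p.write('%d -- finishing -- %s\n' % (int(time.time()), val))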


What I want to be able to do is run a script on a massive number of inputs, but with only a specified maximum number of instances running at any given time. GNU parallel can accomplish this very easily.

First, make sure you have GNU parallel installed. The package in most major repositories is simply called “parallel”.

Writing A Basic Script

I’m going to write a bash script that echoes a timestamp and its input, and then waits two seconds before exiting. The script looks like this.

#!/bin/bash                                                                                                                                              
 
VAL="$1"
TIME="$(date +%s)"
 
echo "${TIME} -- ${VAL}"
 
sleep 2

Just to make sure it is working, we chmod it to 0755 and then we call it with input “hi”.

$ ./echo_sleep hi
1425690292 -- hi

It worked just as expected: after it echoed the time and input, it slept for two seconds, exited, and my prompt returned.

Running the Script in Parallel

I want to run this script on 20 inputs, but I only ever want 5 instances running at any given time. Here’s how we do that with GNU parallel (where the input arguments for the script follow the ‘:::’ marker). I’m just using the numbers 1..20 as the inputs.

$ parallel -j5 ./echo_sleep ::: {1..20}
1425690566 -- 1
1425690566 -- 2
1425690566 -- 3
1425690566 -- 4
1425690566 -- 5
1425690569 -- 6
1425690569 -- 7
1425690569 -- 8
1425690569 -- 9
1425690569 -- 10
1425690571 -- 11
1425690571 -- 12
1425690571 -- 13
1425690571 -- 14
1425690571 -- 15
1425690573 -- 16
1425690573 -- 17
1425690573 -- 18
1425690573 -- 19
1425690573 -- 20

Note: You can use any bash IFS-separated sequence as the input. For instance, something like ‘::: 1 2 3 4 5 6 7 8 9 10’ works just as well as the sequence ‘{1..10}’.

As you can see, the times are two seconds apart. What if we change the script to sleep for a random amount of time? We’ll have each script instance wait between 0 and 9 seconds by changing the script as follows, asking for output both when the script starts and when it finishes.

#!/bin/bash                                                                                                                                            
 
VAL="$1"
 
echo "$(date +%s) -- starting -- ${VAL}"
 
sleep "$(($RANDOM % 10))"
 
echo "$(date +%s) -- finishing -- ${VAL}"

Now, when we run the script, we expect each instance to take a different amount of time to finish. Notice that GNU parallel groups the output: both lines from a given instance are printed together when that instance finishes, rather than as they are produced. (There are output control mechanisms in GNU parallel, but we’re not using them here.)

$ parallel -j5 ./echo_sleep ::: {1..20}
1425690952 -- starting -- 2
1425690954 -- finishing -- 2
1425690952 -- starting -- 1
1425690955 -- finishing -- 1
1425690952 -- starting -- 5
1425690955 -- finishing -- 5
1425690952 -- starting -- 3
1425690957 -- finishing -- 3
1425690952 -- starting -- 4
1425690959 -- finishing -- 4
1425690955 -- starting -- 7
1425690959 -- finishing -- 7
1425690955 -- starting -- 8
1425690961 -- finishing -- 8
1425690954 -- starting -- 6
1425690963 -- finishing -- 6
1425690959 -- starting -- 10
1425690963 -- finishing -- 10
1425690959 -- starting -- 11
1425690965 -- finishing -- 11
1425690961 -- starting -- 12
1425690965 -- finishing -- 12
1425690957 -- starting -- 9
1425690966 -- finishing -- 9
1425690966 -- starting -- 17
1425690967 -- finishing -- 17
1425690963 -- starting -- 13
1425690967 -- finishing -- 13
1425690963 -- starting -- 14
1425690967 -- finishing -- 14
1425690967 -- starting -- 19
1425690967 -- finishing -- 19
1425690967 -- starting -- 18
1425690969 -- finishing -- 18
1425690965 -- starting -- 15
1425690973 -- finishing -- 15
1425690965 -- starting -- 16
1425690973 -- finishing -- 16
1425690967 -- starting -- 20
1425690973 -- finishing -- 20

To get the output immediately, we can use the --linebuffer option. (There is also an --ungroup option, but it suffers from the problem of potentially mashing together the simultaneous output of two script instances.)

$ parallel -j5 --linebuffer ./echo_sleep ::: {1..20}
1425691272 -- starting -- 1
1425691272 -- starting -- 2
1425691272 -- starting -- 3
1425691272 -- starting -- 4
1425691272 -- starting -- 5
1425691273 -- finishing -- 1
1425691273 -- finishing -- 2
1425691273 -- starting -- 7
1425691273 -- starting -- 6
1425691275 -- finishing -- 4
1425691275 -- finishing -- 6
1425691275 -- starting -- 8
1425691275 -- starting -- 9
1425691276 -- finishing -- 3
1425691276 -- starting -- 10
1425691277 -- finishing -- 9
1425691277 -- starting -- 11
1425691278 -- finishing -- 8
1425691278 -- finishing -- 10
1425691278 -- starting -- 12
1425691278 -- starting -- 13
1425691280 -- finishing -- 5
1425691280 -- finishing -- 7
1425691280 -- starting -- 15
1425691280 -- starting -- 14
1425691282 -- finishing -- 12
1425691282 -- starting -- 16
1425691283 -- finishing -- 13
1425691283 -- starting -- 17
1425691284 -- finishing -- 17
1425691284 -- finishing -- 11
1425691284 -- starting -- 19
1425691284 -- starting -- 18
1425691285 -- finishing -- 14
1425691285 -- starting -- 20
1425691286 -- finishing -- 15
1425691286 -- finishing -- 18
1425691289 -- finishing -- 19
1425691289 -- finishing -- 16
1425691293 -- finishing -- 20

Now, all of the timestamps are in order.


I’ve come across a situation where I have improperly encoded UTF-8 in a text file. I need to remove lines with improper encoding from the file.

Caveat: Keep in mind that Unicode and UTF-8 are different things. Unicode is best described as a character set while UTF-8 is a digital encoding, analogous respectively to the English alphabet and cursive writing. Just as you can write the alphabet in different ways (cursive, print, shorthand, etc.), you can write Unicode characters in various ways … UTF-8 has simply become the most popular.
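
A quick illustration of that distinction in Python 2: the single Unicode character é can be written as different byte sequences depending on the encoding chosen.

>>> u'\xe9'.encode('utf-8')      # é as UTF-8
'\xc3\xa9'
>>> u'\xe9'.encode('utf-16-be')  # the same character as UTF-16 (big-endian)
'\x00\xe9'
>>> u'\xe9'.encode('latin-1')    # and as Latin-1
'\xe9'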

Finding Multi-Byte Characters

One of the oddities of UTF-8 is that it uses a variable-length byte encoding. Many characters require only a single byte, but some require up to four. Since every byte of a multi-byte sequence lies outside the ASCII range, you can grep a file for anything encoded with more than a single byte using the following shell command.

grep -P '[^\x00-\x7f]' filename

Removing Improperly Encoded UTF-8

If a file contains improperly encoded UTF-8, the invalid bytes can be found and stripped with the following command.

iconv -c -f UTF-8 -t UTF-8 -o outputfile inputfile

If you diff the input and output files, you’ll see exactly what was removed. Hence, when the diff is empty, you know that the input contains only valid UTF-8.
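
Note that iconv with -c drops the offending bytes rather than whole lines. If, as stated above, you want to drop every line that fails to decode, here is a sketch in Python (the file names are placeholders):

# Keep only the lines that decode cleanly as UTF-8; drop the rest.
with open('inputfile', 'rb') as f_in, open('outputfile', 'wb') as f_out:
    for line in f_in:
        try:
            line.decode('utf-8')
        except UnicodeDecodeError:
            continue
        f_out.write(line)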


Taking the results of an Impala query in the impala-shell and saving them as a TSV is easy. In my experience, this works better through the shell than through a service such as Hue. When I’ve done this in Hue, there have been issues with the name node running out of memory because the result set was so large. Dumping to TSV from the shell doesn’t seem to cause the same problem.

Here’s a brief explanation of the different options:
-i: As usual, this connects the shell to an impala daemon.
-o: Output to the following file.
-B: Turn off pretty printing. Use tab delimiters by default.
-f: Run the query in the following file.

The delimiter used can be changed using the --output_delimiter option. In the following example, I’m connecting to the data node at data_node_01.

$ impala-shell -i data_node_01 -o output_file.tsv -B -f impala_query.sql

The ‘paste’ command will merge multiple files line by line, and you can declare a delimiter between the files’ contents.

This would be really useful for, say, creating a CSV using the contents of multiple files. You could run the following command and instantly create a CSV.

$ paste -d',' column1.txt column2.txt

Another use I have found recently is to help generate large SQL create table statements. For instance, say that I have two files. File ‘file_a.txt’ has the following contents (SQL types).

INT
INT
STRING
DOUBLE
STRING
STRING

File ‘file_b.txt’ has the corresponding SQL column names.

width
length
name
cost
comment1
comment2

I can rough out a CREATE TABLE statement in Bash very quickly using paste (the commas between column definitions still need to be added by hand):

$ echo -e "CREATE TABLE my_table (\n$(paste -d' ' file_a.txt file_b.txt))"
CREATE TABLE my_table (
INT width
INT length
STRING name
DOUBLE cost
STRING comment1
STRING comment2)

A lot of times, I’ll create an externally managed Hive table as a step toward constructing something better (e.g., a Parquet, snappy-compressed columnar table created by Hive for use in Impala or Spark). The data for such a table is often broken down by day. Instead of writing an interactive Bash one-liner to iterate over dates and create the nested directory structure, I wrote the following script.

For instance, I want a root directory in HDFS (say, “/user/jason/my_root_dir”) to have date directories for all days in 2014, such as:
- /user/jason/my_root_dir/2014
- /user/jason/my_root_dir/2014/01
- /user/jason/my_root_dir/2014/01/01
- /user/jason/my_root_dir/2014/01/02
- /user/jason/my_root_dir/2014/01/03
- ...
- /user/jason/my_root_dir/2014/12/31

Running “./make_partitions /user/jason/my_root_dir 2014-01-01 2014-12-31” accomplishes this. Keep in mind that this takes a while, as the directories are checked and created across the cluster.

#!/bin/bash
 
# Usage: ./make_partitions HDFS_root_dir start_date end_date
# Example: ./make_partitions /user/root/mydir 2014-01-01 2014-12-31
# Creates nested year, month, day partitions for a sequence of dates (inclusive).
# Jason B. Hill - jason@jasonbhill.com
 
# Parse input options
HDFSWD=$1
START_DATE="$(date -d "$2" +%Y-%m-%d)"
END_DATE="$(date -d "$3 +1 days" +%Y-%m-%d)"
 
# Function to form directories based on a date
function mkdir_partition {
    # Input: $1 = date to form partition
 
    # Get date parameters
    YEAR=$(date -d "$1" +%Y)
    MONTH=$(date -d "$1" +%m)
    DAY=$(date -d "$1" +%d)
 
    # If the year doesn't exist, create it
    $(hdfs dfs -test -e ${HDFSWD}/${YEAR})
    if [[ "$?" -eq "1" ]]; then
        echo "-- creating HDFS directory: ${HDFSWD}/${YEAR}"
        $(hdfs dfs -mkdir ${HDFSWD}/${YEAR})
    fi
    # If the month doesn't exist, create it
    $(hdfs dfs -test -e ${HDFSWD}/${YEAR}/${MONTH})
    if [[ "$?" -eq "1" ]]; then
        echo "-- creating HDFS directory: ${HDFSWD}/${YEAR}/${MONTH}"
        $(hdfs dfs -mkdir ${HDFSWD}/${YEAR}/${MONTH})
    fi
    # If the day doesn't exist (it shouldn't), create it
    $(hdfs dfs -test -e ${HDFSWD}/${YEAR}/${MONTH}/${DAY})
    if [[ "$?" -eq "1" ]]; then
        echo "-- creating HDFS directory: ${HDFSWD}/${YEAR}/${MONTH}/${DAY}"
        $(hdfs dfs -mkdir ${HDFSWD}/${YEAR}/${MONTH}/${DAY})
    fi
}
 
# Iterate over dates and make partitions
ITER_DATE="${START_DATE}"
until [[ "${ITER_DATE}" == "${END_DATE}" ]]; do
    mkdir_partition ${ITER_DATE}
    ITER_DATE=$(date -d "${ITER_DATE} +1 days" +%Y-%m-%d)
done
 
exit 0