The Problem

Using HDFS commands (e.g., ‘hdfs dfs -put filename path’) can be frustratingly slow, especially when you’re trying to move many files to distinct locations. Each command can take three or so seconds simply to spin-up a JVM process.

Solution: Use HTTPFS

A solution is to use httpfs. Of course, instead of forming relatively simple HDFS commands, we now need to form HTML requests and submit them to an HDFS node running httpfs. We can do this with Curl, and it works perfectly fine, but it’s tedious. So, we’ll instead bury the requests inside a script of our own and reference the methods when needed. I’m using Python with the requests package. One could also use pycurl if you desire greater control at the expense of pain and suffering.

Plus, we can do awesome things previously not possible with HDFS, such as appending files (with other files or with strings) and reading directly into memory. So, for instance, we can load a JSON file from HDFS directly into Python as a dict. *head explodes*

A caveat: I haven’t included the ‘rm’ command yet. In a lot of Hadoop environments where safety is an issue (as where I’m testing this), I don’t want easy access to blowing data away. (You can figure it out from the other commands and an httpfs reference.)

Here’s the script, which is a bit long, followed by some usage examples.

github: httpfs_utils

#!/usr/bin/python2
 
# httpfs_utils.py
#
# Provides HDFS access via httpfs using Python's requests package.
 
 
import datetime
import requests
try:
    import simplejson as json
except ImportError:
    import json
 
 
###################################################################################################
# Helper functions                                                                                #
###################################################################################################
 
def _get_max_str_len_(filestatuses, key):
    """ Returns the max string value length for a list of dictionaries with 'field' as key.
 
    This is used to pretty print directory listings.
 
    INPUT
    -----
    filestatus : list of dicts
        The FileStatuses dictionary returned by the liststatus method.
    key : str
        The key for which we wish to find the maximum length value.
 
    OUTPUT
    ------
    int : The length of the longest value.
    """
    return max([len(str(B[key])) for B in filestatuses['FileStatuses']['FileStatus']])
 
 
def _perm_long_str_(type_str, perm_str):
    """ Forms the long string version of the permission string.
 
    INPUT
    -----
    type_str : str
        The type of object as given by list, e.g., 'FILE' or 'DIRECTORY'.
    perm_str : str
        The short form (numeric) version of the permission string.
 
    OUTPUT
    ------
    str : The long form version of the permission string.
    """
    # Determine if a directory is represented.
    if type_str == 'DIRECTORY':
        perm_str_long = 'd'
    else:
        perm_str_long = '-'
    # Convert the permission string to long letter form.
    for n in perm_str:
        L = [int(i) for i in list(bin(int(n)).split('0b')[1].zfill(3))]
        if L[0]:
            perm_str_long += 'r'
        else:
            perm_str_long += '-'
        if L[1]:
            perm_str_long += 'w'
        else:
            perm_str_long += '-'
        if L[2]:
            perm_str_long += 'x'
        else:
            perm_str_long += '-'
 
    return perm_str_long
 
 
def make_httpfs_url(host, user, hdfs_path, op, port=14000):
    """ Forms the URL for httpfs requests.
 
    INPUT
    -----
    host : str
        The host to connect to for httpfs access to HDFS. (Can be 'localhost'.)
    user : str
        The user to use for httpfs connections.
    hdfs_path : str
        The full path of the file or directory being checked.
    op : str
        The httpfs operation string. E.g., 'GETFILESTATUS'.
    port : int
        The port to use for httpfs connections.
 
    OUTPUT
    ------
    str : The string to use for an HTTP request to httpfs.
    """
    url = 'http://' + user + '@' + host + ':' + str(port) + '/webhdfs/v1'
    url += hdfs_path + '?user.name=' + user + '&op=' + op
 
    return url
 
 
###################################################################################################
# Functions                                                                                       #
###################################################################################################
 
def append(host, user, hdfs_path, filename, port=14000):
    """ Appends contents of 'filename' to 'hdfs_path' on 'user'@'host':'port'.
 
    INPUT
    -----
    host : str
        The host to connect to for httpfs access to HDFS. (Can be 'localhost'.)
    user : str
        The user to use for httpfs connections.
    hdfs_path : str
        The full path of the file to be appended to in HDFS.
    filename : str
        The file with contents being appended to hdfs_path. Can be a local file or a full path.
    port : int : default=14000
        The port to use for httpfs connections.
    """
    # Form the URL.
    url = make_httpfs_url(
        host=host,
        user=user,
        hdfs_path=hdfs_path,
        op='APPEND&data=true',
        port=port
    )
    headers = {
        'Content-Type':'application/octet-stream'
    }
 
    resp = requests.post(url, data=open(filename,'rb'), headers=headers)
    if resp.status_code != 200:
        resp.raise_for_status
 
 
def appends(host, user, hdfs_path, content, port=14000):
    """ Appends 'content' to 'hdfs_path' on 'user'@'host':'port'.
 
    This method is like 'append', but takes a string as input instead of a file name.
 
    INPUT
    -----
    host : str
        The host to connect to for httpfs access to HDFS. (Can be 'localhost'.)
    user : str
        The user to use for httpfs connections.
    hdfs_path : str
        The full path of the file to be appended to in HDFS.
    content : str
        The contents being appended to hdfs_path.
    port : int : default=14000
        The port to use for httpfs connections.
    """
    # Form the URL.
    url = make_httpfs_url(
        host=host,
        user=user,
        hdfs_path=hdfs_path,
        op='APPEND&data=true',
        port=port
    )
    headers = {
        'Content-Type':'application/octet-stream'
    }
 
    resp = requests.post(url, data=content, headers=headers)
    if resp.status_code != 200:
        resp.raise_for_status
 
 
def copy_to_local(host, user, hdfs_path, filename, port=14000):
    """ Copies the file at 'hdfs_path' on 'user'@'host':'port' to 'filename' locally.
 
    INPUT
    -----
    host : str
        The host to connect to for httpfs access to HDFS. (Can be 'localhost'.)
    user : str
        The user to use for httpfs connections.
    hdfs_path : str
        The full path of the file in HDFS.
    port : int : default=14000
        The port to use for httpfs connections.
    perms : str or int : default=775
        The permissions to use for the uploaded file in HDFS.
    """
    # Form the URL.
    url = make_httpfs_url(host=host, user=user, hdfs_path=hdfs_path, op='OPEN', port=port)
 
    # Form and issue the request.
    resp = requests.get(url, stream=True)
 
    if resp.status_code == 200:
        with open(filename, 'wb') as f_p:
            for chunk in resp:
                f_p.write(chunk)
    else:
        resp.raise_for_status
 
 
def exists(host, user, hdfs_path, port=14000):
    """ Returns True if 'hdfs_path' (full path) exists in HDFS at user@host:port via httpfs.
 
    INPUT
    -----
    host : str
        The host to connect to for httpfs access to HDFS. (Can be 'localhost'.)
    user : str
        The user to use for httpfs connections.
    hdfs_path : str
        The full path of the file or directory being checked.
    port : int
        The port to use for httpfs connections.
 
    OUTPUT
    ------
    Boolean : True if 'hdfs_path' exists and can be accessed by 'user'; False otherwise.
    """
    op = 'GETFILESTATUS'
    url = make_httpfs_url(host=host, user=user, hdfs_path=hdfs_path, op=op, port=port)
    # Get the JSON response using httpfs; stores as a Python dict
    resp = requests.get(url)
    # If a 404 was returned, the file/path does not exist
    if resp.status_code == 404:
        return False
    # If a 200 was returned, the file/path does exist
    elif resp.status_code == 200:
        return True
    # Something else - raise status, or if all else fails return None
    else:
        resp.raise_for_status()
        return None
 
 
def get_blocksize(host, user, hdfs_path, port=14000):
    """ Returns the HDFS block size (bytes) of 'hdfs_path' in HDFS at user@host:port via httpfs.
 
    The returned block size is in bytes. For MiB, divide this value by 2**20=1048576.
 
    INPUT
    -----
    host : str
        The host to connect to for httpfs access to HDFS. (Can be 'localhost'.)
    user : str
        The user to use for httpfs connections.
    hdfs_path : str
        The full path of the file or directory being checked.
    port : int
        The port to use for httpfs connections.
 
    OUTPUT
    ------
    int/long : The block size in bytes.
    """
    op = 'GETFILESTATUS'
    url = make_httpfs_url(host=host, user=user, hdfs_path=hdfs_path, op=op, port=port)
    # Get the JSON response using httpfs; stores as a Python dict
    resp = requests.get(url)
    # If a 200 was returned, the file/path exists
    if resp.status_code == 200:
        return resp.json()['FileStatus']['blockSize']
    # Something else - raise status, or if all else fails return None
    else:
        resp.raise_for_status()
 
 
def get_size(host, user, hdfs_path, port=14000):
    """ Returns the size (bytes) of 'hdfs_path' in HDFS at user@host:port via httpfs.
 
    The returned block size is in bytes. For MiB, divide this value by 2**20=1048576.
 
    INPUT
    -----
    host : str
        The host to connect to for httpfs access to HDFS. (Can be 'localhost'.)
    user : str
        The user to use for httpfs connections.
    hdfs_path : str
        The full path of the file or directory being checked.
    port : int
        The port to use for httpfs connections.
 
    OUTPUT
    ------
    int/long : The size in bytes.
    """
    op = 'GETFILESTATUS'
    url = make_httpfs_url(host=host, user=user, hdfs_path=hdfs_path, op=op, port=port)
    # Get the JSON response using httpfs; stores as a Python dict
    resp = requests.get(url)
    # If a 200 was returned, the file/path exists
    if resp.status_code == 200:
        return resp.json()['FileStatus']['length']
    # Something else - raise status, or if all else fails return None
    else:
        resp.raise_for_status()
 
 
def info(host, user, hdfs_path, port=14000):
    """ Returns a dictionary of info for 'hdfs_path' in HDFS at user@host:port via httpfs.
 
    This method is similar to 'liststatus', but only displays top-level information. If you need
    info about all of the files and subdirectories of a directory, use 'liststatus'.
 
    The returned dictionary contains keys: group, permission, blockSize, accessTime, pathSuffix,
    modificationTime, replication, length, ownder, type.
 
    INPUT
    -----
    host : str
        The host to connect to for httpfs access to HDFS. (Can be 'localhost'.)
    user : str
        The user to use for httpfs connections.
    hdfs_path : str
        The full path of the file or directory being checked.
    port : int
        The port to use for httpfs connections.
 
    OUTPUT
    ------
    Dictionary : Information about 'hdfs_path'
    """
    op = 'GETFILESTATUS'
    url = make_httpfs_url(host=host, user=user, hdfs_path=hdfs_path, op=op, port=port)
    # Get the JSON response using httpfs; stores as a Python dict
    resp = requests.get(url)
    # If a 200 was returned, the file/path exists
    if resp.status_code == 200:
        return resp.json()
    # Something else - raise status, or if all else fails return None
    else:
        resp.raise_for_status()
 
 
def liststatus(host, user, hdfs_path, port=14000):
    """ Returns a dictionary of info for 'hdfs_path' in HDFS at user@host:port via httpfs.
 
    Returns a dictionary of information. When used on a file, the returned dictionary contains a
    copy of the dictionary returned by 'info.' When used on a directory, the returned dictionary
    contains a list of such dictionaries.
 
    INPUT
    -----
    host : str
        The host to connect to for httpfs access to HDFS. (Can be 'localhost'.)
    user : str
        The user to use for httpfs connections.
    hdfs_path : str
        The full path of the file or directory being checked.
    port : int
        The port to use for httpfs connections.
 
    OUTPUT
    ------
    Dictionary : Information about 'hdfs_path'
    """
    op = 'LISTSTATUS'
    url = make_httpfs_url(host=host, user=user, hdfs_path=hdfs_path, op=op, port=port)
    # Get the JSON response using httpfs; stores as a Python dict
    resp = requests.get(url)
    # If a 200 was returned, the file/path exists
    if resp.status_code == 200:
        return resp.json()
    # Something else - raise status, or if all else fails return None
    else:
        resp.raise_for_status()
 
 
def ls(host, user, hdfs_path, port=14000):
    """ Print info for 'hdfs_path' in HDFS at user@host:port via httpfs.
 
    A print function intended for interactive usage. Similar to 'ls -l' or 'hdfs dfs -ls'.
 
    INPUT
    -----
    host : str
        The host to connect to for httpfs access to HDFS. (Can be 'localhost'.)
    user : str
        The user to use for httpfs connections.
    hdfs_path : str
        The full path of the file or directory being checked.
    port : int
        The port to use for httpfs connections.
    """
    op = 'LISTSTATUS'
    url = make_httpfs_url(host=host, user=user, hdfs_path=hdfs_path, op=op, port=port)
    # Get the JSON response using httpfs; stores as a Python dict
    resp = requests.get(url)
    # If a 200 was returned, the file/path exists. Otherwise, raise error or exit.
    if resp.status_code != 200:
        resp.raise_for_status()
    else:
        filestatuses = resp.json()
        for obj in filestatuses['FileStatuses']['FileStatus']:
            obj_str = _perm_long_str_(type_str=obj['type'],perm_str=obj['permission'])
            obj_str += '%*s' % (
                _get_max_str_len_(filestatuses, 'replication')+3,
                obj['replication']
            )
            obj_str += '%*s' % (
                _get_max_str_len_(filestatuses, 'owner')+3,
                obj['owner']
            )
            obj_str += '%*s' % (
                _get_max_str_len_(filestatuses, 'group')+2,
                obj['group']
            )
            obj_str += '%*s' % (
                _get_max_str_len_(filestatuses, 'length')+4,
                obj['length']
            )
            obj_str += '%21s' % (
                datetime.datetime.utcfromtimestamp(
                    obj['modificationTime']/1000
                ).isoformat().replace('T',' ')
            )
            obj_str += ' ' + hdfs_path + '/' + obj['pathSuffix']
 
            print "%s" % obj_str
 
 
def mkdir(host, user, hdfs_path, port=14000):
    """ Creates the directory 'hdfs_path' on 'user'@'host':'port'.
 
    Directories are created recursively.
 
    INPUT
    -----
    host : str
        The host to connect to for httpfs access to HDFS. (Can be 'localhost'.)
    user : str
        The user to use for httpfs connections.
    hdfs_path : str
        The path of the directory to create in HDFS.
    port : int : default=14000
        The port to use for httpfs connections.
    """
    op = 'MKDIRS'
    url = make_httpfs_url(host=host, user=user, hdfs_path=hdfs_path, op=op, port=port)
 
    # Make the request
    resp = requests.put(url)
    # If a 200 was returned, the file/path exists
    if resp.status_code == 200:
        return resp.json()
    # Something else - raise status, or if all else fails return None
    else:
        resp.raise_for_status()
 
 
def put(host, user, hdfs_path, filename, port=14000, perms=775):
    """ Puts 'filename' into 'hdfs_path' on 'user'@'host':'port'.
 
    INPUT
    -----
    host : str
        The host to connect to for httpfs access to HDFS. (Can be 'localhost'.)
    user : str
        The user to use for httpfs connections.
    hdfs_path : str
        The full path of the location to place the file in HDFS.
    filename : str
        The file to upload. Can be a local file or a full path.
    port : int : default=14000
        The port to use for httpfs connections.
    perms : str or int : default=775
        The permissions to use for the uploaded file in HDFS.
    """
    # Get the file name without base path.
    filename_short = filename.split('/')[-1]
    # Form the URL.
    url = make_httpfs_url(
        host=host,
        user=user,
        hdfs_path=hdfs_path + '/' + filename_short,
        op='CREATE&data=true&overwrite=true&permission=' + str(perms),
        port=port
    )
    headers = {
        'Content-Type':'application/octet-stream'
    }
    #files = {'file': open(filename,'rb')}
 
    resp = requests.put(url, data=open(filename,'rb'), headers=headers)
    if resp.status_code != 200:
        resp.raise_for_status()
 
 
def read(host, user, hdfs_path, port=14000):
    """ Reads file at 'hdfs_path' on 'user'@'host':'port'.
 
    This method allows the contents of a file in HDFS to be read into memory in Python.
 
    INPUT
    -----
    host : str
        The host to connect to for httpfs access to HDFS. (Can be 'localhost'.)
    user : str
        The user to use for httpfs connections.
    hdfs_path : str
        The full path of the file in HDFS.
    port : int : default=14000
        The port to use for httpfs connections.
    perms : str or int : default=775
        The permissions to use for the uploaded file in HDFS.
 
    OUTPUT
    ------
    Text of the file.
    """
    # Form the URL.
    url = make_httpfs_url(host=host, user=user, hdfs_path=hdfs_path, op='OPEN', port=port)
 
    # Form and issue the request.
    resp = requests.get(url)
 
    if resp.status_code != 200:
        resp.raise_for_status
 
    return resp.text
 
 
def read_json(host, user, hdfs_path, port=14000):
    """ Reads JSON file at 'hdfs_path' on 'user'@'host':'port' and returns a Python dict.
 
    This method reads the contents of a JSON file in HDFS into Python as a dictionary.
 
    INPUT
    -----
    host : str
        The host to connect to for httpfs access to HDFS. (Can be 'localhost'.)
    user : str
        The user to use for httpfs connections.
    hdfs_path : str
        The full path of the file in HDFS.
    port : int : default=14000
        The port to use for httpfs connections.
    perms : str or int : default=775
        The permissions to use for the uploaded file in HDFS.
 
    OUTPUT
    ------
    Text of the file interpreted in JSON as a Python dict.
    """
    # Form the URL.
    url = make_httpfs_url(host=host, user=user, hdfs_path=hdfs_path, op='OPEN', port=port)
 
    # Form and issue the request.
    resp = requests.get(url)
 
    if resp.status_code != 200:
        resp.raise_for_status
 
    return json.loads(requests.get(url).text)

Here’s an example of printing the contents of a directory in IPython. I should point out that this is across a network connection, but still only takes 0.08 seconds.

In [1]: import httpfs_utils as httpfs
 
In [2]: time httpfs.ls(host='hadoop01',user='root',hdfs_path='/user/root')
drwx------   0     root  supergroup        0  2015-01-29 20:21:17 /user/root/.Trash
drwx------   0     root  supergroup        0  2015-01-22 20:49:49 /user/root/.staging
-rwxrwxr-x   3     root  supergroup    74726  2015-07-06 16:50:08 /user/root/alice.txt
drwxr-xr-x   0     root  supergroup        0  2015-07-06 22:17:52 /user/root/config
drwxr-xr-x   0     root  supergroup        0  2015-03-03 19:38:21 /user/root/hdfs_bin
drwxr-xrwx   0     root  supergroup        0  2015-04-09 00:56:40 /user/root/referrer_host
drwxr-xr-x   0     root  supergroup        0  2015-02-26 22:58:04 /user/root/stage
-rwxrwxr-x   3     root  supergroup       49  2015-07-06 19:09:20 /user/root/test_append.txt
drwxr-xr-x   0     root  supergroup        0  2015-07-06 19:38:14 /user/root/testdir
drwxr-xr-x   0     root  supergroup        0  2015-03-04 22:04:37 /user/root/tmp
CPU times: user 0.02 s, sys: 0.02 s, total: 0.03 s
Wall time: 0.08 s

That was simply a print function. We can do much more useful things, such as uploading a file. Here, I’ll upload a smallish (1.6MiB) CSV into HDFS. Then, we’ll verify that the file exists in HDFS.

In [3]: time httpfs.put(host='hadoop01',user='root',hdfs_path='/user/root',filename='/export/data-share/test.csv')
CPU times: user 0.00 s, sys: 0.02 s, total: 0.02 s
Wall time: 0.09 s
 
In [4]: time httpfs.exists(host='hadoop01',user='root',hdfs_path='/user/root/test.csv',)
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.02 s
Out[4]: True

Let’s read a JSON file directly into Python as a dictionary. (I’ve tracked all tweets mentioning the company for which I work, SpotXchange, for a year now. Here’s a random one.)

In [5]: time tweet = httpfs.read_json(host='hadoop01',user='root',hdfs_path='/user/root/spotxtweet.json')
CPU times: user 0.01 s, sys: 0.01 s, total: 0.02 s
Wall time: 0.66 s
 
In [8]: tweet
Out[8]: 
{u'contributors': None,
 u'coordinates': None,
 u'created_at': u'Mon Jul 14 22:25:14 +0000 2014',
 u'entities': {u'hashtags': [],
  u'symbols': [],
  u'urls': [{u'display_url': u'bit.ly/1qoqcDB',
    u'expanded_url': u'http://bit.ly/1qoqcDB',
    u'indices': [102, 124],
    u'url': u'http://t.co/yyMiXenMTJ'}],
  u'user_mentions': []},
 u'favorite_count': 0,
 u'favorited': False,
 u'filter_level': u'medium',
 u'geo': None,
 u'id': 488811485703835648,
 u'id_str': u'488811485703835648',
 u'in_reply_to_screen_name': None,
 u'in_reply_to_status_id': None,
 u'in_reply_to_status_id_str': None,
 u'in_reply_to_user_id': None,
 u'in_reply_to_user_id_str': None,
 u'lang': u'en',
 u'place': None,
 u'possibly_sensitive': False,
 u'retweet_count': 0,
 u'retweeted': False,
 u'source': u'<a href="http://www.hootsuite.com" rel="nofollow">Hootsuite</a>',
 u'text': u'Great insights from the CEO. Publishers Rapidly Adopting Programmatic Ad Sales: SpotXchange\u2019s Shehan: http://t.co/yyMiXenMTJ',
 u'truncated': False,
 u'user': {u'contributors_enabled': False,
  u'created_at': u'Thu Oct 31 17:50:48 +0000 2013',
  u'default_profile': False,
  u'default_profile_image': False,
  u'description': u'The Center of the Native Mobile Advertising Community',
  u'favourites_count': 1,
  u'follow_request_sent': None,
  u'followers_count': 152,
  u'following': None,
  u'friends_count': 83,
  u'geo_enabled': False,
  u'id': 2166999860,
  u'id_str': u'2166999860',
  u'is_translation_enabled': False,
  u'is_translator': False,
  u'lang': u'en',
  u'listed_count': 10,
  u'location': u'Anywhere and Everywhere',
  u'name': u'NativeMobile.com',
  u'notifications': None,
  u'profile_background_color': u'000000',
  u'profile_background_image_url': u'http://pbs.twimg.com/profile_background_images/378800000139148775/8OxxwKh3.jpeg',
  u'profile_background_image_url_https': u'https://pbs.twimg.com/profile_background_images/378800000139148775/8OxxwKh3.jpeg',
  u'profile_background_tile': False,
  u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/2166999860/1386291602',
  u'profile_image_url': u'http://pbs.twimg.com/profile_images/378800000834565058/0948f6f3c16916fbc99746370d067d71_normal.png',
  u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/378800000834565058/0948f6f3c16916fbc99746370d067d71_normal.png',
  u'profile_link_color': u'0084B4',
  u'profile_sidebar_border_color': u'FFFFFF',
  u'profile_sidebar_fill_color': u'DDEEF6',
  u'profile_text_color': u'333333',
  u'profile_use_background_image': True,
  u'protected': False,
  u'screen_name': u'nativemobile',
  u'statuses_count': 1841,
  u'time_zone': u'Pacific Time (US & Canada)',
  u'url': u'http://nativemobile.com',
  u'utc_offset': -25200,
  u'verified': False}}

In the context of a program, an Easter egg is an undocumented response that results from a specific input. The classic example is the Konami code (up up down down left right left right B A), used as a cheat code in various video games. (The Konami code has been used in many other places you wouldn’t expect, including a lens flare effect that used to be an Easter egg on Facebook.) You can Google that, so do that if you haven’t already. Here, I want to list a few fun Easter eggs.

I started this by simply looking at Python, but then expanded the list to include some others. So, it’s still a bit heavier on Python with randomness coming afterwards.

Python

import braces

Python has some good ones. My favorite is the one that is built in to the ‘__future__’ module. This module is intended for future Python additions, or things that are going to be in Python but aren’t there just yet. One thing that keeps getting asked for by some Python users is a C-style braces-wrapped scoping system, contrary to Python’s white-space scope delimiters. So, when ‘braces’ showed up in the ‘__future__’ module, some may have been excited … until they tried it out. Here’s what happens.

>>> from __future__ import braces
  File "<stdin>", line 1
SyntaxError: not a chance

import antigravity

This references an XKCD comic. It actually opens a browser and loads the XKCD comic when you run it.

>>> import antigravity

Hello World

There’s a hello world Easter egg in Python.

>>> import __hello__
Hello world!

The Zen of Python

This is probably, at this point, the most well-known Python Easter egg. “The Zen of Python,” now officially PEP-20, was originally sent by Tim Peters (now functioning as “occasional contributor of great knowledge” to Python) to the python-list back in 1999 in this post. In 2001 while organizing IPC10 (the precursor to PyCon), Barry Warsaw and Tim filtered through a massive list of crowd-sourced potential slogans for an IPC t-shirt and came up with “import this”. The combination of “The Zen of Python” and “import this” was merged secretly in Python 2.2.1, and if you ‘cat’ the contents of this.py (e.g., in /usr/lib/python2.x) you’ll see that there is some encryption so the message isn’t immediately obvious in a commit. Anyway, it does the following when called.

>>> import this
The Zen of Python, by Tim Peters
 
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

 

Firefox / Iceweasel

If you’re using Firefox (or its Debian brother with properly licensed logo… Iceweasel), then you can try looking at the following.

Gort! Klaatu barada nikto!

About:robots is a nice page: Gort! Klaatu barada nikto!.

The Book of Mozilla

About:mozilla is oddly Biblical and has been changing in new versions along with Mozilla development since early 90s Netscape. It’s just really weird. The current version, as I write this, reads: “The twins of Mammon quarrelled. Their warring plunged the world into a new darkness, and the beast abhorred the darkness. So it began to move swiftly, and grew more powerful, and went forth and multiplied. And the beasts brought fire and light to the darkness.” The Book of Mozilla.

YouTube

YouTube Snake

Go to any video on YouTube (it has to be on youtube; not embedded from youtube onto another site). Start playing a video (beyond the ads) and pause the video. With the video paused, hold down the left arrow button on your keyboard and (while holding it down) press the up button.

CLISP

Menorah Candles

The menorah has long been the logo for CLISP. This page explains why. When you run CLISP, you get the menorah, as in the following.

$ clisp
  i i i i i i i       ooooo    o        ooooooo   ooooo   ooooo
  I I I I I I I      8     8   8           8     8     o  8    8
  I  \ `+' /  I      8         8           8     8        8    8
   \  `-+-'  /       8         8           8      ooooo   8oooo
    `-__|__-'        8         8           8           8  8
        |            8     o   8           8     o     8  8
  ------+------       ooooo    8oooooo  ooo8ooo   ooooo   8
 
Welcome to GNU CLISP 2.49 (2010-07-07) <http://clisp.cons.org/>
 
Copyright (c) Bruno Haible, Michael Stoll 1992, 1993
Copyright (c) Bruno Haible, Marcus Daniels 1994-1997
Copyright (c) Bruno Haible, Pierpaolo Bernardi, Sam Steingold 1998
Copyright (c) Bruno Haible, Sam Steingold 1999-2000
Copyright (c) Sam Steingold, Bruno Haible 2001-2010
 
Type :h and hit Enter for context help.
 
[1]>

What many CLISP users haven’t noticed is that the candles on the menorah are lit properly during Hanukkah.

Vim

Help with the meaning of life…

In Vim, if you issue the command ‘:help 42′ you get the following output.

What is the meaning of life, the universe and everything?
*42*
Douglas Adams, the only person who knew what this question really was about is
now dead, unfortunately.  So now you might wonder what the meaning of death
is...
 
==============================================================================
 
Next chapter: |usr_43.txt|  Using filetypes
 
Copyright: see |manual-copyright|  vim:tw=78:ts=8:ft=help:norl:

 

Apt-get and Aptitude

The apt package management system (used by Debian-based Linux systems) has several Easter eggs, mostly related to cows.

Cow Powers of Apt and Aptitude

If you ask apt for help documentation, the last line that it returns is a bit weird.

$ apt-get help
apt 0.9.7.9 for amd64 compiled on Nov 16 2013 12:41:41
Usage: apt-get [options] command
       apt-get [options] install|remove pkg1 [pkg2 ...]
...
...
...
  -o=? Set an arbitrary configuration option, eg -o dir::cache=/tmp
See the apt-get(8), sources.list(5) and apt.conf(5) manual
pages for more information and options.
                       This APT has Super Cow Powers.

So, we know that apt does have super cow powers. What does that mean? Well, it should probably be able to moo.

$ apt-get moo
         (__) 
         (oo) 
   /------\/ 
  / |    ||   
 *  /\---/\ 
    ~~   ~~   
...."Have you mooed today?"...

So, that works. What about aptitude? Since aptitude is supposed to be an improved apt interface, we’d expect something more. But, when we ask aptitude for its help string, we get the following.

$ aptitude help
aptitude 0.6.8.2
Usage: aptitude [-S fname] [-u|-i]
       aptitude [options] <action> ...
  Actions (if none is specified, aptitude will enter interactive mode):
...
...
...
                  (terminal interface only)
 -i             Perform an install run on startup.
                  (terminal interface only)
 
                  This aptitude does not have Super Cow Powers.

Hmmm… so aptitude claims to not have super cow powers. What happens if we ask it to moo?

$ aptitude moo
There are no Easter Eggs in this program.

OK. Maybe we just need to ask correctly…

$ sudo aptitude moo
[sudo] password for jason: 
There are no Easter Eggs in this program.
$ aptitude -v moo
There really are no Easter Eggs in this program.

Now, most people would just stop here. But, a real Easter egg isn’t found by stopping. What if we keep asking for more verbosity?

$ aptitude -vv moo
Didn't I already tell you that there are no Easter Eggs in this program?
$ aptitude -vvv moo
Stop it!
$ aptitude -vvvv moo
Okay, okay, if I give you an Easter Egg, will you go away?
$ aptitude -vvvvv moo
All right, you win.
 
                               /----\
                       -------/      \
                      /               \
                     /                |
   -----------------/                  --------\
   ----------------------------------------------
$ aptitude -vvvvvv moo
What is it?  It's an elephant being eaten by a snake, of course.
$ aptitude -vvvvvvv moo
What is it?  It's an elephant being eaten by a snake, of course.

That’s a good Easter egg.

sl instead of ls

For those that misspell commands in a hurry

Some Linux distros have a command named ‘sl’ which is designed to mock those that misspell ‘ls’ at the command-line. (E.g., in Ubuntu/Mint/Debian you can use ‘sudo apt-get install sl’.) This command simply chugs a train across the screen, forcing you to wait and watch the train before you can attempt to enter ‘ls’ again.

Irix

Release note recipes

If you have an old SGI machine running Irix, type the following.

> relnotes dmedia_eoe 29

What should give you release notes … actually gives you a kung pao chicken recipe.


Introduction

A while ago in this post, I proved that the sum of divisors of an integer can be computed as a multiplicative function on primes. Thus, knowing the prime factorization of an integer results in a direct algorithmic method for computing the sum of divisors of the integer without any combinatorial hassle.

This little trick, as well as many other factorization techniques (e.g., greatest common divisor of two integers), pops up quite a bit in the Project Euler problems. They pop up so often that I eventually found myself wanting to collect the routines in a Python module instead of copying and pasting the code every time I needed to do factorizations in Python.

So, today I pushed this little module. (There’s no setup script yet. Just place it in your working directory and ‘import primes’ for now.)

There are other ways we could go about this (using Sage to coordinate factorization methods, for instance). And, of course, for the more complicated problems (e.g., higher numbered Project Euler problems), such an approach in Python won’t do. But, for a surprising number of Project Euler problems below 150 or so, this module should help a ton.

Factoring a Range of Integers

First, you should understand that there’s no elliptic curve method and no quadratic sieve. This module is designed to provide factorization tools (sum of divisors, gcd, lcm, etc.) for a large range or list of consecutive integers. While certain fancy methods are fast at factoring large random integers, they aren’t the best approach when factoring a big list of consecutive integers. A typical sieve and some careful planning works better in such a situation, as strange as that may feel.

Examples: Prime Sieve Class

Using IPython, we’ll explore this module a bit and see what we can do with it. Then, we’ll show how it can be used to solve a Project Euler problem. For now, let’s start by creating a prime sieve. We’ll ask Python to compute all primes below one million.

In [1]: import primes
 
In [2]: time S = primes.Sieve(1000000)
CPU times: user 0.12 s, sys: 0.02 s, total: 0.13 s
Wall time: 0.12 s

How many primes are now included in this sieve? That is, how many primes are there below one million?

In [3]: S.numPrimes
Out[3]: 78498

We can use the sieve to test any integer in the given range to see if it is a prime.

In [4]: S.isPrime(873)
Out[4]: False

We can make a list of primes from the given sieve, in case we wish to use an iterable over primes.

In [5]: time L = S.listPrimes()
CPU times: user 0.06 s, sys: 0.00 s, total: 0.06 s
Wall time: 0.05 s

Now, L is a list containing all prime values less than one million. We can also make a set. The set will be slightly more expensive to create, but will provide faster membership access once created.

In [6]: time M = S.setPrimes()
CPU times: user 0.09 s, sys: 0.01 s, total: 0.10 s
Wall time: 0.07 s

If you forget what the limit point of the sieve is, it is contained as a variable inside the Sieve class.

In [7]: S.limit
Out[7]: 1000000

Examples: RangeFactor Class

The RangeFactor class uses a prime sieve to factor a list of integers. It then makes various methods available that use this collection of factorization records. Let’s start by factoring all natural numbers less than one million.

In [8]: time F = primes.RangeFactor(1000000)
CPU times: user 2.13 s, sys: 0.06 s, total: 2.19 s
Wall time: 2.18 s

We can get a more detailed time analysis by asking for verbose output.

In [9]: time F = primes.RangeFactor(1000000, verbose=True)
primes : creating sieve
primes : sieve created in 0.140506982803 seconds
primes : creating factor records
primes : factor records created in 0.232191085815 seconds
primes : populating factor records
primes : factor records filled in 1.8158788681 seconds
CPU times: user 2.22 s, sys: 0.08 s, total: 2.29 s
Wall time: 2.27 s

One of the most simple and obvious things to do now is to ask for the factorization of an integer in the given range. Currently, the factors are in a list with each integer having a dictionary of prime keys and exponent value pairs. This example should make that relationship a bit more clear.

In [10]: F.factors[5]
Out[10]: {5: 1}
 
In [11]: F.factors[50]
Out[11]: {2: 1, 5: 2}

The next most obvious thing we could do is to examine the greatest common divisor and least common multiple of two integers. We could also simply ask if two integers are coprime (which is slightly more efficient than computing the gcd or lcm).

In [12]: F.gcd(100,29754)
Out[12]: 2
 
In [13]: F.gcd(100,6634)
Out[13]: 2
 
In [14]: F.gcd(100,500)
Out[14]: 100
 
In [15]: F.lcm(774,384)
Out[15]: 49536
 
In [16]: F.areCoprime(774,384)
Out[16]: False

Nothing here so far is too special. Really, we’re just making standard prime tools available more easily. But, now we can use the multiplicative function on primes to compute the sum of divisors of any integers in the given range. (Recall that a perfect number is a number that is the sum of all of its proper divisors.)

In [17]: F.sumDivisors(496)
Out[17]: 992
 
In [18]: F.sumProperDivisors(496)
Out[18]: 496
 
In [19]: F.isPerfect(496)
Out[19]: True

Now, we can do some slightly more complicated computations. Let’s sum all the perfect numbers under one million. First, let’s see what they are. Then, we’ll take the sum.

In [20]: for i in range(2,1000000):
   ....:     if F.isPerfect(i): print i
   ....:     
6
28
496
8128

What’s a bit crazy here is that we’re computing the sum of divisors for ALL integers less than one million to check if each integer is perfect or not. All things considered, that’s going on pretty quickly due to our use of that nice multiplicative function.

In [28]: time sum([i for i in range(2,F.sieve.limit) if F.isPerfect(i)])
CPU times: user 1.60 s, sys: 0.03 s, total: 1.62 s
Wall time: 1.58 s
Out[28]: 8658

A Sample Project Euler Problem

Problem 21 is stated as follows:

Let d(n) be defined as the sum of proper divisors of n (numbers less than n which divide evenly into n).
If d(a) = b and d(b) = a, where a ā‰  b, then a and b are an amicable pair and each of a and b are called amicable numbers.

For example, the proper divisors of 220 are 1, 2, 4, 5, 10, 11, 20, 22, 44, 55 and 110; therefore d(220) = 284. The proper divisors of 284 are 1, 2, 4, 71 and 142; so d(284) = 220.

Evaluate the sum of all the amicable numbers under 10000.

Python Solution

This is solved using the just discussed ‘primes’ module with the following code.

#!/usr/bin/python2
 
import time
import primes
 
start = time.time()
 
# factor all numbers from 1 to 10000
F = primes.RangeFactor(10000)
 
# compute the sum of all proper divisors for integers from 1 to 10000
S = [F.sumProperDivisors(i) for i in range(10000)]
 
# find amicable pairs and add to a set
pairs = set()
for i in range(2,10000):
    if i in pairs:
        pass
    else:
        if S[i] < 10000 and S[i]!=i and S[S[i]]==i:
            pairs.add(i)
            pairs.add(S[i])
 
elapsed = time.time() - start
 
print "Found result %s in %s seconds" % (sum(pairs), elapsed)

When executed, we get the following output.

Found result 31626 in 0.0326828956604 seconds

Introduction

The Monty Hall problem is a relatively well-known puzzle based loosely on the game show “Let’s Make a Deal.” It’s known for its ability to confuse, and the winning strategy to the puzzle may not seem intuitive at first. Here, I want to describe and prove the winning strategy (with Bayes’ Theorem) and model the puzzle with Python to show that the described strategy actually wins the puzzle with the described probability.

This blog post was featured in Webucator’s Python training. Specifically, here is the video.

Problem Description

You are on a game show and there are three doors in front of you, as in the illustration below. Behind one of the doors sits a prize, but the door has been chosen at random. You’re asked to select a door, and if you select correctly you will win the prize. So, it should seem obvious at this point that, all things being equal and random, there is a 1/3 probability of you selecting the correct door.

But, there’s a catch: When you announce which door you have chosen, the game show host opens one of the other doors. That is, he opens a door that is neither the door you selected nor the door containing the prize. He then asks if you’d like to change your choice of door. There are only two doors remaining – the one you selected and another closed door. Should you stick with your original choice of door, or should you switch your choice to the only other closed door? Does it matter?

A Winning Strategy

Yes, it matters. Always change your choice of door. Always. Simply put, if you stick with your original door selection, you will only win one out of every three times you play the game. If you change your selection, you will win two out of every three times you play.

Proving The Strategy

Intuitively, it is pretty straightforward that your original choice (when there are three doors to choose from) has a 1/3 probability of winning the game. Where things become complicated is when a door is removed. Many would argue that you now have a 1/2 probability of winning by selecting either door, but this isn’t the case.

One critical aspect that you must realize is the following: The game show host can select a door to remove because he knows (a) which door you selected and (b) which door has the prize. The act of removing a door from the game is not random. In fact, in many cases, the host must remove a specific door. For instance, if you select door 1 and the prize is behind door 3, the host has no other option than to remove door 2. You must realize that the choice of which door to remove is conditional on both the prize door and the door you initially select. The only “new” information that a removed door gives you is that it tells you where the prize isn’t. How can you use that information?

Think of the situation this way: When you made your original door selection, there was an equal and random 1/3 probability that the prize was behind any particular door. That situation looks like the following picture, where we’ve selected door 1 as an example.

Put another way, there is a 1/3 probability that door 1 contains the prize and a 2/3 probability that the prize is behind one of the two other doors. That looks like this picture.

So, what happens when we open door 2 or 3? Actually, nothing. The probability doesn’t magically change – it doesn’t change at all. There’s still a 1/3 probability that the prize is behind door 1 and a 2/3 probability that the prize is not behind door 1. The only new information we have is about where the prize is not – as removing a door is completely dependent on the host knowing where the prize isn’t. So, now we have a situation that looks like this.

Of course, the fact that door 2 was opened in this example shows us that there is a 0% probability that the prize is behind door 2. The implication, as strange as it feels, is that there is a 2/3 probability that the prize is behind door 3… because the combined probability of the prize being behind doors 2 and 3 is 2/3 and we know that the prize isn’t behind door 2. Clearly, you should change your selected door to door 3 in this situation.

Rigorous Proof Using Bayes’ Theorem

Bayes’ Theorem allows us to compute conditional probabilities. That is, Bayes’ Theorem will allow us to compute a probability of the form $P(A|B)$, read as “the probability of event A given that event B holds.” We will use Bayes’ Theorem here to compute the probability of winning the game based on switching or not switching our choice in door once a door is removed.

Without loss of generality, we will select door 1 for our initial door choice. Also without loss of generality, we will say that door 2 is opened (as in the above illustrations), leaving us with doors 1 and 3 as possible prize doors. We define the following events: $D_1$, $D_2$, and $D_3$ are the events that the prize is behind doors 1, 2, and 3, respectively. $O_2$ is the event of the host opening door 2, which is dependent on two things: (a) our initial choice of door 1 and (b) the host’s knowledge that the prize isn’t behind door 2.

We wish to compute two things. The first is formally written as $P(D_1|O_2)$. That is, we wish to compute the probability of the prize being behind door 1 given that the host has removed door 2. This represents the probability of winning the game when we do not change door selection. By Bayes’ Theorem, we have
\[
P(D_1|O_2)=\frac{P(O_2|D_1)P(D_1)}{P(O_2)}.
\]
Two of the three components on the right hand side are relatively straightforward. Namely, $P(D_1)$ is the probability that the prize is behind door 1, which is 1/3. The conditional probability $P(O_2|D_1)$ is literally read as “the probability that door 2 is removed when the prize is located behind door 1.” In such a situation, the host may remove either door 2 or 3, and so this is equal to 1/2. The denominator is more challenging here. $P(O_2)$ is the probability that door 2 is removed from selection without the condition that the prize is in any particular location, and we must calculate that directly.

If we consider all of the possible locations where the prize may be (behind each of the three doors) along with their corresponding probabilities, we see that
\[
P(O_2)=P(D_1)P(O_2|D_1)+P(D_2)P(O_2|D_2)+P(D_3)P(O_2|D_3).
\]
We know that $P(D_1)=P(D_2)=P(D_3)=1/3$, and so we consider the other values in question. The first of these values, $P(O_2|D_1)$ is the probability that door 2 is removed given that the prize is behind door 1. We already know that this is 1/2. The second value, $P(O_2|D_2)$ is zero since we’d never have door 2 opened if the prize is behind door 2. The third value, $P(O_2|D_3)$ is 1/1 (or 100%) because the prize being behind door 3 and our selection of door 1 only leaves door 2 to be opened. Thus, we have computed:
\[
P(D_1|O_2)=\frac{P(O_2|D_1)P(D_1)}{P(O_2)}=\frac{\frac{1}{2}\cdot\frac{1}{3}}{\frac{1}{3}\left(\frac{1}{2}+0+1\right)}=\frac{1}{3}.
\]
If I just did your probability or stats homework, you’re welcome. (Please understand the result if this is the case.) What we have just shown is that keeping our selection of door at door 1 results in a 1/3 chance of us winning the prize. What about the probability of winning when we switch choices to door 3? We have
\[
P(D_3|O_2)=\frac{P(O_2|D_3)P(D_3)}{P(O_2)}.
\]
The denominator here is exactly the same (thank goodness for that). The first value in the numerator, $P(O_2|D_3)$ is 1 (or 100%) since it is the probability of opening door 2 given that the prize is behind door 3, in which case door 2 is the only possible door to open (remember that we have initially selected door 1). The second value, $P(D_3)$ is 1/3 since is represents the unconditional probability of the prize being behind door 3. Thus, the probability of winning if we change our choice to door 3 is
\[
P(D_3|O_2)=\frac{P(O_2|D_3)P(D_3)}{P(O_2)}=\frac{1\cdot\frac{1}{3}}{\frac{1}{2}}=\frac{2}{3}
\]
and the result has been proved. QED.

A Python Model

We’ll create a Python class that will model the Monty Hall problem. Each time the class is initialized, it will select a door at random (from the 3 available) representing the contestant’s selection. It will then have the host remove a door and present the contestant with the option to change doors. We’ll run the game one million times with both situations: one million times where the contestant switches doors and one million times where the contestant sticks with the original door choice. We’ll present winning percentages in both cases. The code is a bit verbose. (I’ve used this recently as an introductory example to object oriented programming in Python.) This is saved in a file named MontyHall.py.

#!/usr/bin/python2
 
# MontyHall.py - Jason B. Hill 2014
 
# This script models the Monty Hall problem: You are on a gameshow and are
# presented with three doors. Behind one door (chosen at random) sits a prize.
# You must select (guess) a door. If you guess correctly, you get the prize.
# But, when you select your door, the show's host throws in a bit of a twist,
# opening one of the other two doors ... specifically a door that doesn't
# contain the prize. He asks you if you'd like to change your door selection,
# to the only other remaining door (not the one you picked originally and not
# the now open door). Do you change your selection? Why?
 
# Answer: Yes. ALWAYS change your selection. By sticking with your original
# door selection, you have a 1/3 chance of getting the prize. By switching, you
# end up with a 2/3 chance of getting the prize.
 
# I explain why on my code blog at: code.jasonbhill.com
 
# I'm going to write this as a tutorial example for object oriented programming
# in Python. The idea is that we'll have a class object representing the game.
# We'll have variables inside that class that hold information about the game
# state, and methods (functions) that advance the game state.
 
 
 
import random # We need the random module to select integers in [1,3] randomly
 
 
def pick_door():
    """
    Function to pick a door. Returns an integer 1, 2, or 3 at random.
    """
    return random.randint(1,3)
 
 
class MontyHall:
    """
    Class to model the Monty Hall problem.
    """
    def __init__(self):
        """
        Creates an instance of the Monty Hall problem. This will be called
        automatically when the class is formed.
        """
        # Pick a prize door randomly and store as a variable.
        self.prize_door = pick_door()
        # We'll create variables for the selected door and removed door as
        # well, but we won't set them just yet.
        self.selected_door = None
        self.removed_door = None
 
    def select_door(self):
        """
        Randomly selects a door for the contestant.
        """
        self.selected_door = pick_door()
 
    def remove_door(self):
        """
        This is how the host removes a (non-prize/non-selected) door..
        """
        # Pick a door at random.
        d = pick_door()
        # If that door is the prize door or the contestant's door, re-pick.
        while d == self.selected_door or d == self.prize_door:
            d = pick_door()
        # set the removed door to d
        self.removed_door = d
 
    def switch_choice(self):
        """
        Switches the selected door once a non-prize door is removed.
        """
        # 1+2+3=6. There's only one choice of door if we switch our selection.
        self.selected_door = 6 - self.selected_door - self.removed_door
 
    def user_wins(self):
        """
        Determine if the user wins. Return true on win, false on lose.
        """
        if self.selected_door == self.prize_door:
            return True
        else:
            return False
 
    def run_game(self, switch=True):
        """
        Once a problem is initialized, run the game.
 
        'switch' determines if the user changes selection during the game.
        """
        # The user selects a door.
        self.select_door()
        # The host removes a door.
        self.remove_door()
        # The user can switch selection of doors.
        if switch:
            self.switch_choice()
        # Determine if the user wins.
        return self.user_wins()
 
 
 
# Now, we'll run the game. When asked if we want to switch door selection when
# the door is removed by the host, we'll say 'yes' and switch our choice of
# door. We'll run that experiment one million times, always switching choice
# when given the chance. Here's what that looks like:
 
wins, losses = 0, 0
for i in range(1000000):
    # make an instance of the game, call it 'm'
    m = MontyHall()
    # run the game and switch choice of door.
    if m.run_game(switch=True):
        # a return value of True means we've won
        wins += 1
    else:
        # a return value of False means we've lost
        losses += 1
 
# Now that the game has been run one million times, compute/display stats.
perc_win = 100.0*wins / (wins+losses)
 
print "\nOne million Monty Hall games (with switching):"
print "  won:", wins, "games"
print "  lost:", losses, "games"
print "  odds: %.2f%% winning percentage" % perc_win
 
 
# Now, we'll run the game one million times and always stick to our original
# door selection every single time.
 
wins, losses = 0, 0
for i in range(1000000):
    # make an instance of the game, call it 'm'
    m = MontyHall()
    # run the game and switch choice of door.
    if m.run_game(switch=False):
        # a return value of True means we've won
        wins += 1
    else:
        # a return value of False means we've lost
        losses += 1
 
# Now that the game has been run one million times, compute/display stats.
perc_win = 100.0*wins / (wins+losses)
 
print "\nOne million Monty Hall games (staying with original choice):"
print "  won:", wins, "games"
print "  lost:", losses, "games"
print "  odds: %.2f%% winning percentage\n" % perc_win

Running the program, we get the following output.

$ python MontyHall.py 
 
One million Monty Hall games (with switching):
  won: 666675 games
  lost: 333325 games
  odds: 66.67% winning percentage
 
One million Monty Hall games (staying with original choice):
  won: 333779 games
  lost: 666221 games
  odds: 33.38% winning percentage

Problem

The fraction $\frac{49}{98}$ is a curious fraction, as an inexperienced mathematician in attempting to simplify it may incorrectly believe that $\frac{49}{98}=\frac{4}{8}$, which is correct, is obtained by cancelling the 9s.

We shall consider fractions like $\frac{30}{50}=\frac{3}{5}$ to be trivial examples.

There are exactly four non-trivial examples of this type of fraction, less than one in value, and containing two digits in the numerator and denominator.

If the product of these four fractions is given in its lowest common terms, find the value of the denominator.

Solution in Python

We really only need to iterate from 10 to 99 for the numberator and from the numerator to 99 for the denominator, since we require that the fraction be less than one. Also, we can ignore any denominators that have zeros (since the only other possible integer to match in the numerator is nonzero). We form a list from the numerator and denominator, check for redundancy, remove the redundancies when found, then cross multiply to find if the fractions are equivalent to their redundant-removed siblings.

import time
import fractions
 
start = time.time()
 
p = fractions.Fraction(1,1)
 
for a in range(10, 100, 1):
    for b in range(a+1, 100, 1):
        if b % 10 == 0 or a == b: continue
        La, Lb = [a/10, a%10], [b/10, b%10]
        if any(i in Lb for i in La) and not all(i in Lb for i in La):
            if La[0] in Lb: x = La[0]
            else: x = La[1]
            La.remove(x)
            Lb.remove(x)
            if a*Lb[0] == b*La[0]: p *= fractions.Fraction(La[0],Lb[0])
 
elapsed = time.time() - start
 
print "result %s found in %s seconds" % (p, elapsed)

We’ve used Python’s fractions module to make things a bit easier. When executed, we get the following.

result 1/100 found in 0.00531482696533 seconds

Problem

Consider all integer combinations of $ab$ for $2\le a\le 5$ and $2\le b\le 5$:

22=4, 23=8, 24=16, 25=32
32=9, 33=27, 34=81, 35=243
42=16, 43=64, 44=256, 45=1024
52=25, 53=125, 54=625, 55=3125

If they are then placed in numerical order, with any repeats removed, we get the following sequence of 15 distinct terms:

4, 8, 9, 16, 25, 27, 32, 64, 81, 125, 243, 256, 625, 1024, 3125

How many distinct terms are in the sequence generated by $ab$ for $2\le a\le 100$ and $2\le b\le 100$?

Solution in Python

Update: After a conversation with my friend and co-worker Larry, I’ve changed just two lines of this code – using a Python set instead of a list. That has taken the runtime down considerably. The set version runs in about 2% the time of the list version. This is because checking if an element is in a set is a constant time operation, while checking membership in a list is a linear time operation.

This is one of those problems that can be over-engineered. If you simply code the basic happy-path Python version, you find that it runs very quickly and returns the result. You could consider unique factorizations and so on … and that was my initial idea, but coding the “brute-force” Python version is all you really need here.

#!/usr/bin/python
 
import time
 
# L = [] # old version with a list
L = set() # new version with a set
 
limit = 101
 
start = time.time()
 
for a in range(2,limit):
    for b in range(2,limit):
        c = a**b
        if c in L: pass
        # else: L.append(c) # old list version
        L.add(c) # new version with set
 
elapsed = time.time() - start
 
print "found %s ints in %s seconds" % (len(L), elapsed)

When executed, we get the correct result in under a second.

$ python 29.py 
# found 9183 ints in 0.66696190834 seconds # old list version
found 9183 ints in 0.0150249004364 seconds

Problem

The $n^\text{th}$ term of the sequence of triangle numbers is given by, $t_n=\frac{1}{2}n(n+1)$; so the first ten triangle numbers are: 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, …

By converting each letter in a word to a number corresponding to its alphabetical position and adding these values we form a word value. For example, the word value for SKY is 19 + 11 + 25 = 55 = $t_{10}$. If the word value is a triangle number then we shall call the word a triangle word.

Using words.txt (right click and ‘Save Link/Target As…’), a 16K text file containing nearly two-thousand common English words, how many are triangle words?

Solution in Python

#!/usr/bin/python
 
import time
 
# start the clock
start = time.time()
 
# turn the string of words into a list of python strings
with open('words.txt','r') as f:
    words = f.read().split(',')
    words = [list(word.strip('\"')) for word in words]
    f.close()
 
# we should have an idea of how long the longest word is,
# giving us an idea of the magnitude of the triangle words
m = max([len(word) for word in words])
# form triangle numbers up to the given range
triangles = [n*(n + 1)/2 for n in range(1,2*m)]
 
# make a dictionary map for character values
vals = {}
s = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
for c in s:
    vals[c] = s.index(c) + 1
 
# count triangle words
triangle_words = 0
for word in words:
    if sum([vals[c] for c in word]) in triangles:
        triangle_words += 1
 
# terminate the clock
elapsed = time.time() - start
 
print "found %s triangle words in %s seconds" % (triangle_words, elapsed)

When executed, this returns as follows.

found 162 triangle words in 0.00315809249878 seconds

Problem

An irrational decimal fraction is created by concatenating the positive integers:

\[0.123456789101112131415161718192021...\]

It can be seen that the 12th digit of the fractional part is 1.

If $d_n$ represents the nth digit of the fractional part, find the value of the following expression.

\[d_1\times d_{10}\times d_{100}\times d_{1000}\times d_{10000}\times d_{100000}\times d_{1000000}\]

Python Solution

This was coded in Python using the Sage Notebook (for speed). The idea is to concatenate the string representations of increasing integers into a single string, turn that string into a list with the appropriate index range, and take the appropriate product from within that list. I use the range 400000 to form the string, guessing (correctly) that this would provide enough digits in the resulting string/list.

import time
 
start = time.time()
 
s = ""
for i in range(400000):
    s += str(i)
d = list(s)
n = prod([ZZ(d[10**i]) for i in range(7)])
 
elapsed = time.time() - start
 
print "result %s found in %s seconds" % (n, elapsed)

When executed, we get the following result.

result 210 found in 0.369875907898 seconds

This is just another example of a problem I decided to attempt as a break at work, trying to form the solution before the cookie on the Project Euler site ran out.

Problem

It can be seen that the number, 125874, and its double, 251748, contain exactly the same digits, but in a different order.

Find the smallest positive integer, x, such that 2x, 3x, 4x, 5x, and 6x, contain the same digits.

Python Solution

This could be written in C or Cython to perform much more quickly, and I’m sure there are other optimizations that could be made to this code, but this was written and executed in a total of about three minutes using the Sage Notebook.

import time
 
start = time.time()
 
def numDigits(n):
    r"""
    return a sorted list of digits in a number
    """
    L = []
    while True:
        L.append(n % 10)
        n /= 10
        if n == 0: break
    L.sort()
    return L
 
i = 1
while True:
    a = numDigits(i)
    for j in range(1,7):
        b = numDigits(i * j)
        if a != b: break
        if j == 6:
            print i
            i = 0
    if i == 0: break
    i += 1
 
elapsed = time.time() - start
 
print "result found in %s seconds" % elapsed

When executed, we find the following result.

142857
result found in 1.92401385307 seconds

While browsing the Project Euler problems, I found problem 97, which is an incredibly straightforward problem that is buried within a sea of much more challenging problems. It’s called “Large non-Mersenne Prime” and it states the following.

The first known prime found to exceed one million digits was discovered in 1999, and is a Mersenne prime of the form $2^{6972593}āˆ’1$; it contains exactly 2,098,960 digits. Subsequently other Mersenne primes, of the form $2^pāˆ’1$, have been found which contain more digits.

However, in 2004 there was found a massive non-Mersenne prime which contains 2,357,207 digits: $28433\times 2^{7830457}+1$.

Find the last ten digits of this prime number.

Solution in Python

Since we’re only looking for the final ten digits of the number, we can form a solution modulo a sufficiently large number, effectively forgetting about any larger power digits. We’ll do all of the powers of two first, then multiply by the constant, then add one. We’ll take the first ten digits of the result and that should be the answer we’re looking for. I’d say that this one is pretty quick and painless.

#!/usr/bin/env python
 
import time
 
start = time.time()
 
n = 2
for i in range(7830456):
    n = (2 * n) % 10000000000
 
n *= 28433
n += 1
 
n = n % 10000000000
 
elapsed = time.time() - start
 
print "Result %s found in %s seconds" % (n, elapsed)

Then, executing, we have:

$ python problem-97.py 
Result 8739992577 found in 1.08037400246 seconds

Solution in C

This is a situation where the C solution is nearly as fast to write as the Python solution, and the code really follows the same structure.

#include <stdio.h>
#include <stdlib.h>
 
int main(int argc, char **argv) {
    unsigned long       i;
    unsigned long long  n = 2;
 
    for (i=0;i<7830456;i++) {
        n = (n << 1) % 10000000000;
    }
 
    n *= 28433;
    n += 1;
 
    n = n % 10000000000;
 
    printf("Result %lld\n", n);
 
    return 0;
}

When we execute, the result is returned very quickly. (I’ve optimized the machine code using the -O3 flag as you can see below.)

$ gcc -O3 problem-97.c -o problem-97
$ time ./problem-97
Result 8739992577
 
real	0m0.030s
user	0m0.028s
sys	0m0.004s

That feels pretty fast, but I think a bottleneck is popping up when we bitshift (multiply by 2) AND reduce the result modolo 10000000000 at every single iteration of the for loop. The bitshifting is fast, but the modular arithmetic isn’t so much. Let’s rewrite the code so that we’re bitshifting at every iteration, but only using modular arithmetic every several iterations.

#include <stdio.h>
#include <stdlib.h>
 
 
int main(int argc, char **argv) {
    unsigned long       i,j;
    unsigned long long  n = 2;
 
    j = 0;
    for (i=0;i<7830456;i++) {
        //n = (n << 1) % 10000000000;
        n <<= 1;
        j++;
        if (j%10==0) {
            j = 0;
            n = n % 10000000000;
        }
    }
 
    n *= 28433;
    n += 1;
 
    n = n % 10000000000;
 
    printf("Result %lld\n", n);
 
    return 0;
}

That runs in roughly 1/3 the time of the original C code.

$ gcc -O3 problem-97.c -o problem-97
$ time ./problem-97
Result 8739992577
 
real	0m0.009s
user	0m0.008s
sys	0m0.000s