Carriage Return -vs- Line Feed

Every now and then, I run into a text parsing issue that simply comes down to carriage returns versus line feeds. For instance, with Twitter’s streaming API, a single JSON entity can arrive split across multiple responses. You can detect this situation because the end of an entity is marked with a carriage return, while the individual chunks before the end have only line feed (newline) terminators.
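To make that concrete, here is a minimal Python sketch of reassembling entities from chunks. It only encodes the behavior described above (a carriage return closes an entity, a bare line feed does not); the function name assemble_entities and the sample chunks are made up for illustration, and this is not taken from any Twitter client library.

```python
import json


def assemble_entities(chunks):
    """Yield parsed JSON objects from an iterable of text chunks.

    Assumption (from the description above, not from API docs): an entity
    is complete once a carriage return arrives; a line feed on its own
    only terminates a partial chunk.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while "\r" in buffer:
            raw, buffer = buffer.split("\r", 1)
            # Drop the LF that follows the CR in a CRLF pair.
            buffer = buffer.lstrip("\n")
            # Raw LFs are never valid inside JSON strings, so removing
            # the chunk terminators cannot change the parsed value.
            raw = raw.replace("\n", "")
            if raw.strip():
                yield json.loads(raw)


# Hypothetical chunks: the first entity arrives in two pieces.
sample_chunks = ['{"id": 1,\n', '"text": "hi"}\r\n', '{"id": 2}\r\n']
for entity in assemble_entities(sample_chunks):
    print(entity)
```

Buffering until the terminator shows up means the caller never has to guess whether a chunk is a complete entity.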

What’s really a bit messed up is that Windows is a big fan of CRLF (‘\r\n’), while most of Unix prefers LF (‘\n’) for line terminators. Apple used a single CR (‘\r’) in classic Mac OS (before OS X), apparently because it likes to be special.
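If you just need to split text that may come from any of these platforms, Python’s built-in str.splitlines() treats all three conventions as line boundaries. A small sketch, with a made-up sample string:

```python
# One sample string that mixes all three conventions.
mixed = "windows\r\nunix\nclassic mac\rlast"

# splitlines() treats CRLF, LF, and a lone CR each as one line boundary.
lines = mixed.splitlines()
print(lines)
```

Each terminator style yields exactly one break, so the CRLF pair does not produce an empty line between ‘windows’ and ‘unix’.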

This all started when computers were supposed to mimic typewriters. At the end of a line, you needed to do two things at once: (1) Return the character carriage to the left so you could type more and (2) advance the line feed so you wouldn’t simply write over what you just wrote.

Recognizing CRLF -vs- LF

Both the carriage return (CR) and line feed (LF) are represented by non-printable characters. Your terminal or text editor simply knows how to interpret them. When you tell a script to write ‘\n’ for a line feed, you’re actually referencing the non-printable ASCII character 0x0A (decimal 10). When you write ‘\r’, you’re referencing 0x0D (decimal 13).
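A quick way to check those values yourself is to ask Python for the code points. The variable names and the sample string here are only for illustration:

```python
LF = "\n"
CR = "\r"

# ord() gives the code point; hex() shows it in the 0x.. form used above.
print(hex(ord(LF)), ord(LF))
print(hex(ord(CR)), ord(CR))

# repr() and the encoded bytes make the otherwise invisible characters visible.
sample = "first\r\nsecond\n"
print(repr(sample))
print(sample.encode("ascii"))
```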

But those characters don’t actually print. They only instruct a terminal or text editor to display the text around the characters in specific ways. So, how do you recognize them?

The Linux ‘file’ command will tell you what sort of line terminators a file has, so that’s pretty quick. Here’s an example:

$ file my_file_created_on_Windows.txt
my_file_created_on_Windows.txt: ASCII text, with very long lines, with CRLF line terminators
$ file my_file_created_on_Linux
my_file_created_on_Linux: ASCII text, with very long lines

If the file uses only LF terminators, this is considered the default and you won’t be informed.
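If ‘file’ isn’t available, the same check can be approximated by counting terminators in the raw bytes. This is a rough sketch, not a reimplementation of ‘file’; the function name count_line_terminators is made up for this example.

```python
def count_line_terminators(data):
    """Count CRLF pairs, lone LFs, and lone CRs in a bytes object."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        # Every CRLF pair also contains one LF and one CR, so subtract them.
        "LF": data.count(b"\n") - crlf,
        "CR": data.count(b"\r") - crlf,
    }


# Inline sample; for a real file, read it in binary mode so that Python
# does not translate the terminators first, e.g. open(path, "rb").read().
print(count_line_terminators(b"a\r\nb\nc\rd\r\n"))
```

Reading in binary mode matters here: in text mode, Python’s default newline handling would turn every terminator into ‘\n’ before you could count anything.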

Removing CR Terminators

You have several options for getting rid of those ‘\r’ CR characters in text. One option is to simply ‘tr’ the text in the terminal:

$ tr -d '\r' < my_file_created_on_Windows.txt > my_new_file.txt

Another option is to use a utility such as ‘dos2unix’.

Yet another option would be to use a more advanced text parsing language, such as Python, and replace the characters manually:

import codecs

f_p = codecs.open('my_file_created_on_Windows.txt', 'r', 'utf-8')
g_p = codecs.open('my_new_file.txt', 'w', 'utf-8')
for line in f_p:
    # Strip the trailing CR and/or LF, then write back a single LF.
    g_p.write(line.rstrip('\r\n') + '\n')
f_p.close()
g_p.close()

A few notes on that Python code. First, we use the codecs module to read the text because it may contain non-ASCII characters. In this case, we’re reading the characters in UTF-8 encoding. Also, we strip both the CR and the LF from the end of each line and then write a single LF ourselves, because write() does not add a line terminator on its own. Without that explicit LF, every line would run together.

I had written an article about running scripts in parallel using GNU Parallel, and then I realized that GNU parallel isn’t in the CentOS repositories. Since the code that I’m writing requires standard repo support, I need to find a different solution.

If we want to perform the same action as in the referenced article, using xargs instead of GNU parallel, we’d run the following command.
$ echo {1..20} | xargs -n1 -P5 ./echo_sleep
1426008382 -- starting -- 2
1426008382 -- starting -- 5
1426008382 -- starting -- 1
1426008382 -- starting -- 3
1426008382 -- starting -- 4
1426008382 -- finishing -- 4
1426008382 -- starting -- 6
1426008383 -- finishing -- 1
1426008383 -- starting -- 7
1426008385 -- finishing -- 3
1426008385 -- starting -- 8
1426008386 -- finishing -- 7
1426008386 -- starting -- 9
1426008389 -- finishing -- 9
1426008389 -- starting -- 10
1426008390 -- finishing -- 2
1426008390 -- finishing -- 5
1426008390 -- starting -- 11
1426008390 -- starting -- 12
1426008391 -- finishing -- 6
1426008391 -- starting -- 13
1426008392 -- finishing -- 10
1426008392 -- starting -- 14
1426008394 -- finishing -- 8
1426008394 -- starting -- 15
1426008396 -- finishing -- 15
1426008396 -- starting -- 16
1426008397 -- finishing -- 16
1426008397 -- starting -- 17
1426008398 -- finishing -- 11
1426008398 -- starting -- 18
1426008399 -- finishing -- 12
1426008399 -- starting -- 19
1426008399 -- finishing -- 13
1426008399 -- starting -- 20
1426008399 -- finishing -- 20
1426008399 -- finishing -- 17
1426008400 -- finishing -- 14
1426008402 -- finishing -- 18
1426008408 -- finishing -- 19

Some things to note here. First, the “-n1” (or “-n 1”) option is critical: it tells xargs how many arguments from the echo string each invocation of echo_sleep takes as input, while “-P5” caps the number of simultaneous processes at five. Also, the output controls for xargs aren’t as well developed. In fact, it’s entirely possible for stdout from the invoked scripts to collide. For this reason, you may want the invoked scripts to handle their own output more carefully (in Python with better file handling, for instance) instead of simply redirecting bash output.
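One way around colliding output is to let a parent process capture each child’s stdout and print it in one piece. The sketch below does this in Python rather than with xargs, so treat it as a swapped-in technique and not as a description of how xargs behaves. The names run_one and run_all are made up, and the child process is a tiny stand-in for ./echo_sleep so that the example is self-contained.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Stand-in for ./echo_sleep: prints a start line and a finish line.
CHILD_CODE = (
    "import sys; "
    "print('starting', sys.argv[1]); "
    "print('finishing', sys.argv[1])"
)


def run_one(value):
    """Run one child process and return everything it wrote to stdout."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, str(value)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def run_all(values, max_workers=5):
    """Run run_one over values with at most max_workers children at a time.

    Results come back in input order, one captured string per child.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, values))
```

Calling run_all(range(1, 21)) and writing each returned string to sys.stdout plays the role of ‘-P5’ here, and because each child’s output is written as one captured block, lines from different children cannot interleave.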

What I want to be able to do is to run a script on a massive number of inputs. But, I only want a specified maximum number of them to be running at any given time. GNU parallel can accomplish this very easily.

First, make sure you have GNU parallel installed. The package in most major repositories is simply called “parallel”.

Writing A Basic Script

I’m going to write a bash script that echoes a timestamp and its input, and then waits two seconds before exiting. The script looks like this.

#!/bin/bash

VAL="$1"
TIME="$(date +%s)"

echo "${TIME} -- ${VAL}"

sleep 2

Just to make sure it is working, we chmod it to 0755 and then we call it with input “hi”.

$ ./echo_sleep hi
1425690292 -- hi

It worked just as expected: after it echoed the time and input, it slept for two seconds and then exited, and my prompt returned.

Running the Script in Parallel

I want to run this script on 20 inputs, but I only ever want to have 5 instances running at any given time. Here’s how we do that with GNU parallel (where the input arguments for the script are introduced by ‘:::’). I’m just using the numbers 1..20 as the inputs.

$ parallel -j5 ./echo_sleep ::: {1..20}
1425690566 -- 1
1425690566 -- 2
1425690566 -- 3
1425690566 -- 4
1425690566 -- 5
1425690569 -- 6
1425690569 -- 7
1425690569 -- 8
1425690569 -- 9
1425690569 -- 10
1425690571 -- 11
1425690571 -- 12
1425690571 -- 13
1425690571 -- 14
1425690571 -- 15
1425690573 -- 16
1425690573 -- 17
1425690573 -- 18
1425690573 -- 19
1425690573 -- 20

Note: Bash expands the braces before parallel ever runs, so any whitespace-separated list works as the input. For instance, something like ‘::: 1 2 3 4 5 6 7 8 9 10’ works just as well as the sequence ‘{1..10}’.

As you can see, each batch of five starts two to three seconds after the previous one. What if we change the script to sleep for a random amount of time? We’ll have each script instance wait between 0 and 9 seconds by changing the script as follows, asking for output both when the script starts and when it finishes.

#!/bin/bash

VAL="$1"
echo "$(date +%s) -- starting -- ${VAL}"
sleep "$(($RANDOM % 10))"
echo "$(date +%s) -- finishing -- ${VAL}"

Now, when we run the script, we’ll expect each instance to take a different amount of time to finish. Notice that the output of each instance is held back and sent to stdout all at once, when that instance finishes. (There are output control mechanisms in GNU parallel, but we’re not using them here.)

$ parallel -j5 ./echo_sleep ::: {1..20}
1425690952 -- starting -- 2
1425690954 -- finishing -- 2
1425690952 -- starting -- 1
1425690955 -- finishing -- 1
1425690952 -- starting -- 5
1425690955 -- finishing -- 5
1425690952 -- starting -- 3
1425690957 -- finishing -- 3
1425690952 -- starting -- 4
1425690959 -- finishing -- 4
1425690955 -- starting -- 7
1425690959 -- finishing -- 7
1425690955 -- starting -- 8
1425690961 -- finishing -- 8
1425690954 -- starting -- 6
1425690963 -- finishing -- 6
1425690959 -- starting -- 10
1425690963 -- finishing -- 10
1425690959 -- starting -- 11
1425690965 -- finishing -- 11
1425690961 -- starting -- 12
1425690965 -- finishing -- 12
1425690957 -- starting -- 9
1425690966 -- finishing -- 9
1425690966 -- starting -- 17
1425690967 -- finishing -- 17
1425690963 -- starting -- 13
1425690967 -- finishing -- 13
1425690963 -- starting -- 14
1425690967 -- finishing -- 14
1425690967 -- starting -- 19
1425690967 -- finishing -- 19
1425690967 -- starting -- 18
1425690969 -- finishing -- 18
1425690965 -- starting -- 15
1425690973 -- finishing -- 15
1425690965 -- starting -- 16
1425690973 -- finishing -- 16
1425690967 -- starting -- 20
1425690973 -- finishing -- 20

To get the output immediately, we can use the --linebuffer option. (There is also an --ungroup option, but it suffers from the problem of potentially mashing the simultaneous output of two script instances.)

$ parallel -j5 --linebuffer ./echo_sleep ::: {1..20}
1425691272 -- starting -- 1
1425691272 -- starting -- 2
1425691272 -- starting -- 3
1425691272 -- starting -- 4
1425691272 -- starting -- 5
1425691273 -- finishing -- 1
1425691273 -- finishing -- 2
1425691273 -- starting -- 7
1425691273 -- starting -- 6
1425691275 -- finishing -- 4
1425691275 -- finishing -- 6
1425691275 -- starting -- 8
1425691275 -- starting -- 9
1425691276 -- finishing -- 3
1425691276 -- starting -- 10
1425691277 -- finishing -- 9
1425691277 -- starting -- 11
1425691278 -- finishing -- 8
1425691278 -- finishing -- 10
1425691278 -- starting -- 12
1425691278 -- starting -- 13
1425691280 -- finishing -- 5
1425691280 -- finishing -- 7
1425691280 -- starting -- 15
1425691280 -- starting -- 14
1425691282 -- finishing -- 12
1425691282 -- starting -- 16
1425691283 -- finishing -- 13
1425691283 -- starting -- 17
1425691284 -- finishing -- 17
1425691284 -- finishing -- 11
1425691284 -- starting -- 19
1425691284 -- starting -- 18
1425691285 -- finishing -- 14
1425691285 -- starting -- 20
1425691286 -- finishing -- 15
1425691286 -- finishing -- 18
1425691289 -- finishing -- 19
1425691289 -- finishing -- 16
1425691293 -- finishing -- 20

Now, all of the timestamps are in order.
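The difference between the two runs can be pictured with a toy model. This is only an illustration of the orderings seen above, not how GNU parallel is implemented; the function names grouped and line_buffered and the sample events are made up.

```python
def grouped(events):
    """Order lines the way the default run looks: each job's lines stay
    together, and jobs appear in the order they finish."""
    finish = {}
    for timestamp, job, _ in events:
        finish[job] = max(finish.get(job, timestamp), timestamp)
    return sorted(events, key=lambda e: (finish[e[1]], e[1], e[0]))


def line_buffered(events):
    """Order lines the way the line-buffered run looks: each line is
    shown as soon as it is produced, so timestamps only increase."""
    return sorted(events, key=lambda e: e[0])


# Hypothetical events: (timestamp, job, text). Job 2 finishes before job 1.
events = [
    (0, 1, "starting"),
    (0, 2, "starting"),
    (1, 2, "finishing"),
    (3, 1, "finishing"),
]
print(grouped(events))
print(line_buffered(events))
```

In the grouped ordering, job 1’s ‘starting’ line from time 0 only appears after job 2 has finished, which is the same out-of-order effect seen in the run without the extra option.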