Carriage Return -vs- Line Feed
Every now and then, I end up having a text parsing issue that simply comes down to carriage returns versus line feeds. For instance, with Twitter’s streaming API, you can receive multiple chunks that together form a single JSON entity. You can recognize this situation because the end of an entity will have a carriage return, while the individual chunks before the end will only have line feed (new line) terminators.
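Here is a rough sketch of how that kind of stream could be reassembled in Python 3. It assumes the behavior described above (an entity ends at a carriage return, and bare line feeds can show up inside an entity); the function name split_entities and the sample chunks are made up for illustration and are not part of any Twitter library.

```python
import json


def split_entities(chunks):
    """Yield complete entities from an iterable of text chunks.

    Assumption: a CR ends an entity, while a bare LF may appear inside one.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while "\r" in buffer:
            entity, _, buffer = buffer.partition("\r")
            # A leading LF is the tail of the previous CRLF terminator.
            entity = entity.lstrip("\n")
            if entity:  # skip blank keep-alive lines
                yield entity


# Hypothetical chunks: two JSON entities spread over four reads.
sample_chunks = ['{"a":\n', '1}\r\n{"b"', ':2}\r', '\n']
entities = list(split_entities(sample_chunks))
parsed = [json.loads(entity) for entity in entities]
print(parsed)
```

Splitting on the CR rather than the LF is the whole point here: a line feed in the middle of an entity never triggers a premature parse.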
What’s really a bit messed up is that Windows is a big fan of CRLF (‘\r\n’), while Unix and Linux prefer LF (‘\n’) for line terminators. Apple used a single CR (‘\r’) in classic Mac OS, before OS X, because it likes to be special.
This all started when computers were supposed to mimic typewriters. At the end of a line, you needed to do two things at once: (1) Return the character carriage to the left so you could type more and (2) advance the line feed so you wouldn’t simply write over what you just wrote.
Recognizing CRLF -vs- LF
Both the carriage return (CR) and line feed (LF) are represented by non-printable characters. Your terminal or text editor simply knows how to interpret them. When you tell a script to write ‘\n’ for a line feed, you’re actually referencing the non-printable ASCII character 0x0A (decimal 10). When you write ‘\r’, you’re referencing 0x0D (decimal 13).
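If you want to check those codes yourself, a quick Python 3 snippet will do it. The helper name describe is just something picked for this example.

```python
def describe(char):
    """Return the decimal and hexadecimal code of a single character."""
    code = ord(char)
    return code, f"0x{code:02X}"


for name, char in (("LF", "\n"), ("CR", "\r")):
    decimal, hexadecimal = describe(char)
    print(f"{name} is written {char!r}: decimal {decimal}, hex {hexadecimal}")
```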
But those characters don’t actually print. They only instruct a terminal or text editor to display the text around the characters in specific ways. So, how do you recognize them?
The Linux ‘file’ command will tell you what sort of line terminators a file has, so that’s pretty quick. Here’s an example:
$ file my_file_created_on_Windows.txt
my_file_created_on_Windows.txt: ASCII text, with very long lines, with CRLF line terminators
$ file my_file_created_on_Linux
my_file_created_on_Linux: ASCII text, with very long lines
If the file uses only LF terminators, ‘file’ treats that as the default and doesn’t mention line terminators at all.
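If the ‘file’ command isn’t handy, you can get a similar answer by counting the terminator bytes directly. This is only a sketch in Python 3: the function name count_line_endings is invented for this example, and it reads the whole file into memory, so it suits small files best.

```python
def count_line_endings(data):
    """Count CRLF, lone CR and lone LF terminators in a bytes object."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        # Every CRLF pair also contains one CR and one LF, so subtract them.
        "CR": data.count(b"\r") - crlf,
        "LF": data.count(b"\n") - crlf,
    }


# Made-up sample containing all three conventions.
sample = b"one\r\ntwo\nthree\rfour\r\n"
print(count_line_endings(sample))
```

For a real file, open it in binary mode so nothing translates the terminators before you count them, for example: count_line_endings(open('my_file_created_on_Windows.txt', 'rb').read()).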
Removing CR Terminators
You have several options for getting rid of those ‘\r’ CR characters in text. One option is to simply ‘tr’ the text in the terminal:
$ tr -d '\r' < my_file_created_on_Windows.txt > my_new_file.txt
Another option is to use a utility such as ‘dos2unix.’ Yet another option would be to use a more advanced text parsing language, such as Python, and replace the characters manually:
import codecs

with codecs.open('my_file_created_on_Windows.txt', 'r', 'utf-8') as f_p, \
        codecs.open('my_new_file.txt', 'w', 'utf-8') as g_p:
    for line in f_p:
        # Drop the CR but keep the LF so each line stays on its own line.
        g_p.write(line.replace('\r', ''))

A few notes on that Python code. First, we use the codecs module to read the text because it may contain non-ASCII characters; in this case, we’re decoding it as UTF-8. Also, we strip only the CR and leave the LF in place. The write() method does not add a line terminator of its own, so removing the LF as well would run every line together into one.
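One caveat: the approaches above delete every CR, which is fine for CRLF files, but in a file that uses a lone CR as its terminator it would join all the lines into one. A small sketch that converts both styles to LF instead is shown here (Python 3; the name normalize_newlines is made up for this example, and it works on text that has already been read into a string).

```python
def normalize_newlines(text):
    """Convert CRLF and lone CR terminators to LF."""
    # Handle CRLF first so its CR is not turned into a second line feed.
    return text.replace("\r\n", "\n").replace("\r", "\n")


mixed = "windows\r\nold mac\runix\n"
print(repr(normalize_newlines(mixed)))
```

The order of the two replace calls matters: swapping them would turn each CRLF into two line feeds.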