The adventure of the Old Mac line breaker in the Python world

There are many names for the thing that ends a line: new line, End Of Line indicator, or line break. You have probably heard the terms Line Feed (LF) and Carriage Return (CR). Technically they are characters, just like the capital letter “A” and the small letter “a”, but instead of printing a letter they tell the system that the line has ended. Different computer systems use these 2 characters in different ways; let’s narrow it down to the 2 most common conventions, namely the Unix version “LF” and the Windows version “CR+LF”. But wait a minute, there is also the Old Mac version, which uses a lone CR character to mark the end of a line.
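To make the three conventions concrete, here is a small sketch in Python 3. The variable names are made up for illustration; the only library call used is the built-in str.splitlines(), which happens to recognise all three styles.

```python
# The two control characters, written as Python escape sequences.
LF = "\n"  # Line Feed
CR = "\r"  # Carriage Return

print(ord(LF), ord(CR))  # 10 13

# The same two lines of text under each convention.
unix_text = "one" + LF + "two" + LF
windows_text = "one" + CR + LF + "two" + CR + LF
old_mac_text = "one" + CR + "two" + CR

# str.splitlines() splits on LF, CR+LF and a lone CR alike.
for text in (unix_text, windows_text, old_mac_text):
    print(text.splitlines())  # ['one', 'two'] each time
```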

Interestingly, in Python 2’s universe (and probably in some other, even pickier universes), the Old Mac convention is not a line break by default. If you read a file whose lines end only with “CR” using the standard open() function, the whole file comes out as a single line of text.

As slightly more tolerant developers, we should build applications that support as many formats as possible. Here are 2 tricks to make sure the file you are reading is read properly the next time you meet one.

When reading a file

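# Python 2: use the 'rU' mode so that Old Mac (CR) line endings are understood.
# Python 3 text mode does this by default, and the 'U' flag was removed in Python 3.11.
f = open('filename', 'rU')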
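For readers on Python 3, the following sketch shows both the problem and the fix using the newline argument of open(). The sample data and the temporary file are assumptions made purely for the demonstration. Passing newline='\n' makes Python split on LF only, which imitates the single-line result described earlier, while newline=None (the Python 3 default) turns on universal newlines.

```python
import os
import tempfile

# Hypothetical sample data: three lines terminated by a lone CR (Old Mac style).
data = b"first\rsecond\rthird\r"

# Write the bytes to a temporary file so that open() has something to read.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    path = tmp.name

try:
    # newline='\n': only LF ends a line, so the CR-only file is one long line.
    with open(path, "r", newline="\n") as f:
        lf_only_lines = f.readlines()

    # newline=None: universal newlines, so CR, LF and CR+LF all become '\n'.
    with open(path, "r", newline=None) as f:
        universal_lines = f.readlines()
finally:
    os.remove(path)

print(len(lf_only_lines))  # 1
print(universal_lines)     # ['first\n', 'second\n', 'third\n']
```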

If you happen to be working with file uploads in Django, this might be useful:

# http://stackoverflow.com/questions/1875956/how-can-i-access-an-uploaded-file-in-universal-newline-mode

# First, read the uploaded file and convert it to unicode using the unicode() function (Python 2)
# Second, wrap the text in io.StringIO with universal-newline mode turned on by setting newline=None
import io
stream = io.StringIO(unicode(request.FILES['foo'].read()), newline=None)
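The snippet above is Python 2 (it relies on unicode()). A rough Python 3 counterpart might look like the sketch below. It is not taken from the linked answer: the raw bytes stand in for what request.FILES['foo'].read() would return, and UTF-8 is an assumed encoding, so adjust both to your own upload.

```python
import io

# Stand-in for request.FILES['foo'].read(): raw bytes with CR-only line endings.
raw = b"name,age\rAda,36\rAlan,41\r"

# Assumption: the upload is UTF-8 text; decode() plays the role of unicode().
# newline=None turns on universal newlines, so each lone CR is read back as '\n'.
stream = io.StringIO(raw.decode("utf-8"), newline=None)

lines = list(stream)
print(lines)  # ['name,age\n', 'Ada,36\n', 'Alan,41\n']
```

Because the stream now yields proper lines, it can be handed to anything that iterates over a file line by line.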