A quick tale about FEFF, an invisible UTF-8 character that wrecked our CSV files

Today, we encountered an error while trying to create some database seeds from a CSV. This CSV was originally generated by me using a Ruby script which piped the output to a file and saved as a CSV.

The CSV was checked in to Git and had been used for awhile until we had to update some parts of it by adding a new column and fixing some values.

While we don’t know the exact reason yet, my theory is that somehow, Excel for Mac (we are all using Macs) added some additional metadata to it even after saving the file as a CSV.

This in turn made anyone using the seed receive the following error:

CSV::MalformedCSVError: Illegal quoting in line 1.

I opened the CSV file and nothing looked suspicious. My first thought was some left/right quotation marks were somehow mixed into the file instead of just the ‘normal’ double quotes: ". But upon further investigation, there was nothing out of the ordinary. This led me to just wipe out the whole file, and actually type out the first row again.

I saved that file again and ran the migration:

CSV::MalformedCSVError: Illegal quoting in line 1.

What?!

Okay, this was driving me nuts. I opened up a new file, typed the exact single line again, and ran the migration. It worked. So what was in that file?!

Only one way to find out:

cat companies.csv | pbcopy | pbpaste > temp.csv
rm companies.csv
mv temp.csv companies.csv
git diff

So OSX has these two functions that are very useful: pbcopy and pbpaste. Basically anything piped to pbcopy gets into your clipboard and pbpaste puts what you have on your clipboard to standard output (stdout). But it removes all formatting.

Very useful when you want to just copy some text from somewhere and you want to paste it into a WYSIWYG editor without all the formatting. Like when writing an email from Gmail, for example.

I then removed the original file and saved the new ‘unformatted’ file with the same file name so I could see the difference.

And we finally saw the invisible man:

xkI0eL8SP-nbAV3RKrpq8GdMG09j9HjsgHga — The invisible character showing in Atlassian’s Bitbucket.

iS2g-cMpndutvi-N0lw6dCfwuiFSHbtpeW0o — The invisible character’s actual name!

A quick Google search told us that our friend U+FEFF was called a ZERO WIDTH NO-BREAK SPACE. Also, a quick trip to Wikipedia told us about the actual uses for U+FEFF, more commonly known as Byte order mark or BOM.

Our friend FEFF means different things, but it’s basically a signal for a program on how to read the text. It can be UTF-8 (more common), UTF-16, or even UTF-32.

FEFF itself is for UTF-16 — in UTF-8 it is more commonly known as 0xEF,0xBB, or 0xBF.

From my understanding, when the CSV file was opened in Excel and saved, Excel created a space for our invisible stowaway, U+FEFF. And in front of the file to boot!

Excel did some magic, and it was probably saved in UTF-16 instead of UTF-8. UTF-8 does not understand BOM and just treats it as a non-character so visually, the file was okay. But Ruby’s CSV thought that there was something wrong because it assumed the file it was reading was UTF-8 and it couldn’t ignore Mr. U+FEFF.

So lesson learned: don’t open (and save!) a CSV file in Excel if you want to feed it to Ruby’s CSV parser.

If you do ever encounter an error like that, be sure to look for hidden characters not shown by your editor. If you still can’t see it and are using OSX, then pbcopy and pbpaste will help you out — they strip out any formatting or hidden characters from text in addition to copying and pasting it.