Bubble Foundry


On CSV

by Peter.

CSV, or comma-separated values, is a non-standard which due to its apparent simplicity people think is a good data interchange format. It’s not. In this blog post I will attempt to tell you more than you ever wanted to know about CSV.

tl;dr: CSV is a a HORRIBLE ‘format’ and should NEVER be used. You will regret it. If you need a simple data exchange format that is easy to for lay people to read and write and widely supported by programming languages, use JSON (my first choice) or something like YAML.

CSV sounds like the simplest format in the world: you have rows and columns, just like in Excel, with one row per line and each column separated by commas:

USA, 1776
France, 1789

And there we go!

If only it were so simple…

How do we know what column is what?

What a good question! The simple answer is that you don’t. Or rather, you make the decision once in a project and make sure everyone else knows. They do know, right? Then you can all have code like this:

for (var k = 0; k < revolutions.length; k++) {
  var country = revolutions[k][0];
  var year = revolutions[k][1];
  console.log("In the year " + year + " " + country + " had a revolution.");
}

But then what happens when the new guy decides he’s going to store his information in something like a Python dictionary, which doesn’t guarantee the creation order will be maintained, before he creates a CSV using it. We might get messages like “In the year Russia 1917 had a revolution.”

But wait, you say! CSV files can have headers! All you have to do is have the first row consist of descriptions of the columns, like so:

Country, Year
USA, 1776
France, 1789
Russia, 1917

And then we might going through our revolutions like so:

for (var k = 0; k < revolutions.length; k++) {
  var country = revolutions[k]['country'];
  var year = revolutions[k]['year'];
  console.log("In the year " + year + " " + country + " had a revolution.");
}

And this is an improvement. Even if New Guy messes up and gives us the following file we’re good to go:

Year, Country
1776, USA
1789, France
1917, Russia

But how did everyone know that your CSV data has a header? Sure, it’s obvious to you, but you’re a human: you’re really great at pattern recognition1 and computers are horrible at it. For all the average CSV processing code knows, “Country” is a country and “Year” is the year its revolution happened. So, you’re back to telling everyone using your file that there are headers and that they should use them. This isn’t terribly hard but neither is it automatic.

What about accents?

In all the examples before we’ve used characters that can be represented in ASCII. But what if we now have a CSV of famous revolutionaries?

Name in English, Name in Native Language, Date of Birth
Vladimir Ilyich Lenin, Владимир Ильич Ленин, 1870
Mao Zedong, 毛泽东, 1893
Fidel Castro, Fidel Castro, 1926

What character encoding was used? How will you know? If you want to avoid problems you’ll either need to agree on an encoding format2 or someone reading your file may have to try to detect the proper format. Other, real formats specify the character encoding. For instance, JSON uses UTF-8.

What about commas?

What if we want to list our revolutionaries like “surname, given name” to make sorting easier?

Name in English, Name in Native Language, Date of Birth
Castro, Fidel, Fidel Castro, 1926
Lenin, Vladimir Ilyich, Владимир Ильич Ленин, 1870
Mao, Zedong, 毛泽东, 1893

Castro comes before Lenin who comes before Mao. Simple, no? Except you probably noticed a problem, yes? We know have a header row of 3 columns and data rows of 4. This means that Fidel Castro’s date of birth is ‘Fidel Castro’! Beyond being wrong, an unforgiving CVS parser (frankly, a good one) may refuse to process the file. Or it may simply drop the fourth columns, since it has no named column to assign the values to. And what if you application tries to treat ‘Fidel Castro’ as a number? It might crash!

But can’t we avoid commas?

One common solution to the comma problem above is to use a different field separator character. Tabs and semi-colons are popular:

Name in English; Name in Native Language; Date of Birth
Castro, Fidel; Fidel Castro; 1926
Lenin, Vladimir Ilyich; Владимир Ильич Ленин; 1870
Mao, Zedong; 毛泽东; 1893

Problem solved, right? Well, not so fast. The file is no longer contains comma-separated values, so you did change the filename to revolutionaries.ssv, right? Of course not, no-one talks about SSV files.3 And even if you did, most people would just be confused: .csv they know, not .ssv. So in the interest of expediency you call your file revoluntionaries.csv. So this is another piece of information you’ll need to document and tell everyone who may ever create or read your “CSV” files.

Microsoft Excel is horrible with field separators. Depending on your locale a .csv file MUST use semi-colons, not commas, or Excel won’t import it. It’s probably because Europeans often use a comma as a decimal separator, while the Americans use periods. However, Americans often use commas as thousands separators. Unfortunately, either number format could lead to errors:

Item, Cost
Baseball, 19,99
Tesla Roadster, 120,000.95

So yes, I understand why European versions of Excel (but not, in my experience, American versions) prefer semi-colons. But still, don’t completely prohibit me from opening a comma separated CSV file! CSV is a non-format, I should be able to hurt myself as much as possible – that’s what it’s all about! I hope the person at Microsoft responsible regrets this a thousand times. And yes I’m bitter, that jerk was responsible for staying at a client’s office till midnight once when everything else was working perfectly hours earlier.

What about just quoting everything?

Well, all those different field separators are a pain, you intelligently say. Let’s just stick with commas to separate fields and use double quotation marks to indicate the start and end of the field contents:

"Name in English", "Name in Native Language", "Date of Birth"
"Castro, Fidel", "Fidel Castro", "1926"
"Lenin, Vladimir Ilyich", "Владимир Ильич Ленин", "1870"
"Mao, Zedong", "毛泽东", "1893"

Well, that’s the most logical and consistent approach. But of course most programs reading and writing CSV are anything but logical. Many are much more parsimonious with their field separators:

Name in English, Name in Native Language, Date of Birth
"Castro, Fidel", Fidel Castro, 1926
"Lenin, Vladimir Ilyich", Владимир Ильич Ленин, 1870
"Mao, Zedong", 毛泽东, 1893

Sure, there are no unnecessary quotation marks. On the hand, that means that a CVS parser needs to very flexible, able to handle columns both with and without quotation marks.

How do you handle newlines inside of fields?

I’m sure you immediately see the problem with this, right?

country, cities by population, capital
Netherlands, 1. Amsterdam
2. Rotterdam
3. Den Haag, Amsterdam
France, 1. Paris
2. Marseille
3. Lyon, Paris

Luckily there’s a simple solution:

"country", "cities by population", "capital"
"Netherlands", "1. Amsterdam
2. Rotterdam
3. Den Haag", "Amsterdam"
"France", "1. Paris
2. Marseille
3. Lyon", "Paris"

Obvious, of course. But it just reinforces that fact that quotation marks are really necessary any time you use CSV, especially with user-supplied data.

We probably need to be able to deal with quotation marks inside of quotation marks, right?

Just like we needed to deal with commas as part of our data, we’ll probably encounter quotation marks inside our data from time to time. Accordingly a sort of convention has developed. Unfortunately it is completely ridiculous: put a second quotation mark to indicate it is not the field separator:

"language", "body"
"en", "He said, ""What happened?"""
"fr", "Il disait, ""Qu'est ce qu'il est passé?"""

Lovely, isn’t it? In just about every other programming language and data format out there special characters are escaped with a backslash:

"language", "body"
"en", "He said, \"What happened?\""
"fr", "Il disait, \"Qu'est ce qu'il est passé?\""

Some CSV libraries will let you manually specify how quotation marks are escaped and accordingly you can use those format. However, not all and even if you can it’s yet another convention that you must make sure is communicated to everyone how may ever read or write a CSV as part of your system.

A column is not the same as a field

You may have noticed that I referred to commas as column separators and quotation marks as field separators. Why is that? Well, because they do subtly different things. First, there are the obvious differences. If you have N columns, you only put commas after the 1 to N-1 columns. In contrast, you put quotation marks after all N fields. This would be invalid:

"header1", "header2", 
"field1", "field2

Ok, you say, that makes sense but is the a difference between a column and a field? Again, unfortunately, yes. Most CSV parsers I’ve seen (specifically OpenOffice and the standard Python CSV library) make a distinction. While a column is expected to consist of one field, it may not only contain a field.

This is a very subtle point that is best illustrated by some examples:

header1, header2, header3
field1, "I'm here," André said., field3

First, you can see that all the fields should have been surrounded by quotation marks. Agreed, but what is interesting is how many CSV parsers will go wrong on these bad lines. What both OpenOffice4 and Python will do with the first row is assign field1 to the header1 column of row 1.

They will then move to the second column. Because most CSV writers use quotation marks selectively and do not treat them as an all-or-nothing affair, as discussed above, both OpenOffice and the Python CSV library will recognize the " character as the start of column header2‘s field, despite no quotation marks being used for the row’s first column or for the header row.

This sounds reasonable enough except that I'm here,, not "I'm here, will get assigned to the header2 column by the time it reaches the second ". It is here that the problems really start to accumulate.

Because the second unescaped quotation mark is reached, the reader knows that the field is done. However, it is not immediately followed by a column separator. What is the reader to do? An aggressive, strict one might throw out all characters between the second field separator and the next column separator. Neither of the programs I’ve been discussing do that.

Instead what the programs do is try to continue appending the characters they encounter until the next column separator to the header2 column. However, since the field has ended the characters are treated as raw data and none of the rules about processing fields are followed. This means that, for instance, the Python library will throw an exemption because é is not an ASCII character and only plain ASCII characters are expected when dealing with raw data. And no, the exemption will not be caught at all by the CSV parser, meaning that your entire CSV parsing, and perhaps your entire program, will stop and crash. And why should you be expecting this exception?

It is for this reason that I have been talking about both columns and fields. And perhaps you have a partner who has taken your admonishments a bit too much to heart and is trying to give you reasonably escaped data:

"family"; "members"; "country"
"Doe"; "John", "Jane"; "USA"

Even if we don’t have a problem with unescaped column separators, like we have avoided here, you can still have the problem of fields ending before their columns. For instance, in this case the members column will contain John, "Jane" or John", "Jane, depending on your parser.

Instead you should do something like this:

"family", "members", "country"
"Doe", "\"John\", \"Jane\"", "USA"

As usual with CSVs, regardless of what escaping strategy you pick, lists of items in single columns is a convention that you’ll need to communicate to everyone involved.

Conclusion

All the types of issues I’ve outlined above I have encountered with a real-world project over the last year. These issues have required an incredible amount of time and coordination among all parts and, frankly, it has not always been pleasant. Hopefully by pointing out all the problems with CSV I’ve convinced you never to use this abomination of a ‘format’.

And if you still think that JSON is too chatty, just gzip it for goodness sake!

  1. Cultural aside: you have read William Gibson’s brilliant Pattern Recognition, right? Its sequels, Spook Country and Zero History, are also fantastic. []
  2. Just use Unicode and UTF-8. []
  3. Though I have seen TSV and .tsv once in a blue moon. []
  4. Though this will only happen for the latter if you have a setting unchecked. And it is unchecked by default. []