Why I hate spreadsheets (part one of many)

CSV should be defined somewhere, right?  You should be able to tell if CSV is well formed? WRONG.  This is going to be the first post of many on CSV, the devils own file format.

Here’s the closest you can get to a specification of the CSV format, the RFC for the CSV MIME type.  It  “documents the format that seems to be followed by most implementations”, which while admirable is so full of weasel words it smells of ferrets.

My own summary of CSV is this:  DELIMITATED DATA IN A FILE THAT’S SORT OF ORGANISED ON A LINE BY LINE BASIS.  Judging by the thousands of random CSV files I’m trawling through at the moment, that’s about all they’ve got in common.  And don’t get me started on address formats.

Here’s a handy list of what you’ll encounter:

  • Headers
  • No headers
  • More headers than columns of data
  • Less headers than columns of data
  • Different headers placed somewhere randomly in the document
  • Double quotes
  • Single quotes
  • No quotes
  • Randomly unclosed quotes
  • Both types of quotes
  • Some lines with more columns than others
  • Some lines that swap data around
  • Empty lines
  • Tabs
  • Commas
  • Both
  • Cp1252
  • UTF-8
  • UTF-16
  • ISO-8859-1
  • Character sets you’ve never heard of
  • Character sets you can’t identify
  • .XLS files that are CSV
  • .XLS files that are TSV
  • .XLS files that are OOXML
  • .XLS files that are OLE2
  • .XLS files that are something else entirely
  • .XLS files that may as well be PDFs given the terrible way they’re organised

And the winner is…

My favourite.  This is the best thing ever.

  • PDFs that contain printouts of CSV

Great.  Thanks for that.

Written by Tom

5 Comments

Tom

Are you using Java? Apache POI might be handy for you. This will identify them from their bytes:

InputStream i = new ByteArrayInputStream(somebytes);
try {
return (POIFSFileSystem.hasPOIFSHeader(i) || POIXMLDocument.hasOOXMLHeader(i));
} catch (IOException e) {
e.printStackTrace();
return false;
}

Reply

Leave a Reply