Cocoa for Scientists (Part XXVI): Parsing CSV Data

Author: Drew McCormack
Web Site: www.mentalfaculty.com

On quite a few occasions, MacResearch readers have posted questions asking how you parse CSV (comma-separated values) data in Cocoa. CSV is a simple format for representing tables; it is used in widely varying fields, from science to finance, basically anywhere a table needs to be stored in a text file.

I’ve recently added CSV import to my flash card application, Mental Case. Before I began, I thought it would be a trivial matter of searching for some Objective-C sample code or an open source library with Google. I found solutions in scripting languages like Python, but nothing Cocoa based. After an hour or two of searching, I realized that if I wanted a Cocoa-native solution, I was going to have to roll my own. In this short tutorial, I will show you what I came up with, and hopefully save you the trouble of doing it yourself.

Simple CSV

Parsing CSV can actually be quite simple, if you know the structure of the data beforehand, and you don’t have to deal with quoted strings. In fact, I addressed this in an earlier tutorial that stored spectra in CSV format.

- (BOOL)readFromURL:(NSURL *)absoluteURL ofType:(NSString *)typeName 
    error:(NSError **)outError 
{
    NSString *fileString = [NSString stringWithContentsOfURL:absoluteURL 
        encoding:NSUTF8StringEncoding error:outError];
    if ( nil == fileString ) return NO;
    NSScanner *scanner = [NSScanner scannerWithString:fileString];
    [scanner setCharactersToBeSkipped:
        [NSCharacterSet characterSetWithCharactersInString:@"\n, "]];
    NSMutableArray *newPoints = [NSMutableArray array];
    float energy, intensity;
    while ( [scanner scanFloat:&energy] && [scanner scanFloat:&intensity] ) {
        [newPoints addObject:
            [NSMutableDictionary dictionaryWithObjectsAndKeys:
                [NSNumber numberWithFloat:energy], @"energy",
                [NSNumber numberWithFloat:intensity], @"intensity",
                nil]];
    }
    [self setPoints:newPoints];
    return YES;
}

The NSScanner class is what you use to do most of your string parsing in Cocoa. In the example above, it has been assumed that the CSV file is in a particular form, namely, that it has exactly two columns, each containing a decimal number. By telling the scanner to skip newlines, commas, and spaces

    [scanner setCharactersToBeSkipped:
        [NSCharacterSet characterSetWithCharactersInString:@"\n, "]];

the parsing of each line is reduced to a single line

    while ( [scanner scanFloat:&energy] && [scanner scanFloat:&intensity] ) {

The scanFloat: method will try to read a floating-point number, returning NO upon failure. So the while loop will continue until the format does not meet expectations.
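
To make the stopping behaviour concrete, here is a minimal sketch of the same loop applied to an in-memory string. The function name ScanPairs is hypothetical, introduced only for illustration; the NSScanner calls are the ones used above.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the tutorial): scans "energy,intensity"
// pairs from a string and returns an array of two-element NSNumber arrays.
static NSArray *ScanPairs(NSString *text)
{
    NSScanner *scanner = [NSScanner scannerWithString:text];
    [scanner setCharactersToBeSkipped:
        [NSCharacterSet characterSetWithCharactersInString:@"\n, "]];
    NSMutableArray *pairs = [NSMutableArray array];
    float energy, intensity;
    // The loop ends at the first position where two floats cannot be read,
    // whether that is the end of the string or a malformed entry.
    while ( [scanner scanFloat:&energy] && [scanner scanFloat:&intensity] ) {
        [pairs addObject:[NSArray arrayWithObjects:
            [NSNumber numberWithFloat:energy],
            [NSNumber numberWithFloat:intensity], nil]];
    }
    return pairs;
}
```

Note that a malformed entry does not raise an error in this sketch: scanning simply stops there, and any rows after it are never read. If that matters, one option is to check [scanner isAtEnd] after the loop.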

General CSV

As you can see, parsing CSV data can be very easy, but that is not always the case. When you have to deal with general CSV data, things can get quite complicated, because you have to take account of the possibility that strings are quoted, contain quotation marks and commas, and even extend over multiple lines. For example, the following is a valid row of CSV data, containing two columns:

"The quick, brown fox", "jumped over the ""lazy"",
dog"

In case you haven’t figured it out, a doubled quotation mark inside a quoted string stands for a single literal quotation mark, giving the two strings 'The quick, brown fox' and 'jumped over the "lazy",<new line>dog'.

Parsing this general form of CSV is considerably more difficult than the simple form, and it took me quite a while to come up with some clean code to do it. But I think I succeeded in the end. Here it is: (Update: I have changed this code to properly handle all newline varieties.)

@implementation NSString (ParsingExtensions)

-(NSArray *)csvRows {
    NSMutableArray *rows = [NSMutableArray array];

    // Get newline character set
    NSMutableCharacterSet *newlineCharacterSet = (id)[NSMutableCharacterSet whitespaceAndNewlineCharacterSet];
    [newlineCharacterSet formIntersectionWithCharacterSet:[[NSCharacterSet whitespaceCharacterSet] invertedSet]];

    // Characters that are important to the parser
    NSMutableCharacterSet *importantCharactersSet = (id)[NSMutableCharacterSet characterSetWithCharactersInString:@",\""];
    [importantCharactersSet formUnionWithCharacterSet:newlineCharacterSet];

    // Create scanner, and scan string
    NSScanner *scanner = [NSScanner scannerWithString:self];
    [scanner setCharactersToBeSkipped:nil];
    while ( ![scanner isAtEnd] ) {        
        BOOL insideQuotes = NO;
        BOOL finishedRow = NO;
        NSMutableArray *columns = [NSMutableArray arrayWithCapacity:10];
        NSMutableString *currentColumn = [NSMutableString string];
        while ( !finishedRow ) {
            NSString *tempString;
            if ( [scanner scanUpToCharactersFromSet:importantCharactersSet intoString:&tempString] ) {
                [currentColumn appendString:tempString];
            }

            if ( [scanner isAtEnd] ) {
                if ( ![currentColumn isEqualToString:@""] ) [columns addObject:currentColumn];
                finishedRow = YES;
            }
            else if ( [scanner scanCharactersFromSet:newlineCharacterSet intoString:&tempString] ) {
                if ( insideQuotes ) {
                    // Add line break to column text
                    [currentColumn appendString:tempString];
                }
                else {
                    // End of row
                    if ( ![currentColumn isEqualToString:@""] ) [columns addObject:currentColumn];
                    finishedRow = YES;
                }
            }
            else if ( [scanner scanString:@"\"" intoString:NULL] ) {
                if ( insideQuotes && [scanner scanString:@"\"" intoString:NULL] ) {
                    // Replace double quotes with a single quote in the column string.
                    [currentColumn appendString:@"\""]; 
                }
                else {
                    // Start or end of a quoted string.
                    insideQuotes = !insideQuotes;
                }
            }
            else if ( [scanner scanString:@"," intoString:NULL] ) {  
                if ( insideQuotes ) {
                    [currentColumn appendString:@","];
                }
                else {
                    // This is a column separating comma
                    [columns addObject:currentColumn];
                    currentColumn = [NSMutableString string];
                    [scanner scanCharactersFromSet:[NSCharacterSet whitespaceCharacterSet] intoString:NULL];
                }
            }
        }
        if ( [columns count] > 0 ) [rows addObject:columns];
    }

    return rows;
}

@end

(I’m releasing this code into the public domain, so use it as you please.)

This code is designed to be a category on NSString, so you call it directly on the string holding the file contents, e.g. [fileString csvRows]. The idea is that it will parse a string into rows and columns, under the assumption that it is in CSV format. The result is an array of arrays; entries in the containing array represent the rows, and those in the contained arrays represent columns in each row.
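
As an illustration of working with that array-of-arrays result, here is a small sketch that treats the first row as column headers and turns each later row into a dictionary. The helper name DictionariesFromRows and the header-row convention are assumptions for the example; they are not part of the category.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, not part of the category above: treats the first row
// as column headers and turns every later row into a dictionary.
static NSArray *DictionariesFromRows(NSArray *rows)
{
    NSMutableArray *records = [NSMutableArray array];
    if ( [rows count] == 0 ) return records;
    NSArray *headers = [rows objectAtIndex:0];
    for ( NSUInteger r = 1; r < [rows count]; r++ ) {
        NSArray *columns = [rows objectAtIndex:r];
        NSMutableDictionary *record = [NSMutableDictionary dictionary];
        // Rows may be shorter or longer than the header row, so only pair up
        // the columns that both of them have.
        NSUInteger count = MIN([headers count], [columns count]);
        for ( NSUInteger c = 0; c < count; c++ ) {
            [record setObject:[columns objectAtIndex:c]
                       forKey:[headers objectAtIndex:c]];
        }
        [records addObject:record];
    }
    return records;
}
```

With the category in place, this would be called as DictionariesFromRows([fileString csvRows]), where fileString holds the contents of the CSV file.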

The code itself is fairly straightforward: It consists of a big while loop which continues until the whole string is parsed. An inner while loop looks through each row of CSV data, looking for significant landmarks, like an end of line, an opening or closing quotation mark, or a comma. By keeping track of opening and closing quotation marks, it is able to properly deal with commas and newlines embedded in quoted strings.

Conclusions

NSScanner is a useful class for parsing strings. It may not be quite as powerful as the regular expressions found in scripting languages like Perl and Python, but with just a few methods (e.g. scanString:intoString:, scanUpToCharactersFromSet:intoString:, and scanFloat:) you can achieve an awful lot. If you need to do any basic string parsing in one of your Cocoa projects, give it a look.

Comments

Huge text files

Have you tried that with huge text files? NSString's stringWithContentsOfURL: reads the entire file, does it not? For large files I think you'd have to memory-map parts of the file, which would be more complicated.

Huge text files

True. For very big text files, you would have a problem.
I don't know if memory mapping would help much: if a large file is read in, it is effectively memory-mapped by the system, because it goes into virtual memory.
If you had enormous CSV files, you would probably need to do something more low-level, using NSFileHandle, or maybe the NSStream classes.
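
A rough sketch of the NSFileHandle idea follows. It is only an illustration under several assumptions: the file is UTF-8, lines end in "\n" (a trailing "\r" is left on the line), the code is compiled with ARC, and fields never contain quoted newlines. The function names are hypothetical.

```objc
#import <Foundation/Foundation.h>

// Decodes one line's bytes as UTF-8 and passes it on; undecodable lines are skipped.
static void SendLine(NSData *buffer, NSRange range, void (^lineHandler)(NSString *))
{
    NSString *line = [[NSString alloc] initWithData:[buffer subdataWithRange:range]
                                           encoding:NSUTF8StringEncoding];
    if ( line ) lineHandler(line);
}

// Hypothetical helper: reads the file in chunks of chunkSize bytes and calls
// lineHandler once per line, so the whole file is never held in memory.
static BOOL EnumerateLinesInFile(NSString *path, NSUInteger chunkSize,
                                 void (^lineHandler)(NSString *line))
{
    NSFileHandle *handle = [NSFileHandle fileHandleForReadingAtPath:path];
    if ( nil == handle || 0 == chunkSize ) return NO;

    NSData *newline = [NSData dataWithBytes:"\n" length:1];
    NSMutableData *pending = [NSMutableData data];
    BOOL atEnd = NO;
    while ( !atEnd ) {
        @autoreleasepool {
            NSData *chunk = [handle readDataOfLength:chunkSize];
            atEnd = ( [chunk length] == 0 );
            [pending appendData:chunk];

            // Hand over every complete line currently in the buffer.
            NSUInteger start = 0;
            while ( start < [pending length] ) {
                NSRange rest = NSMakeRange(start, [pending length] - start);
                NSRange hit = [pending rangeOfData:newline options:0 range:rest];
                if ( hit.location == NSNotFound ) break;
                SendLine(pending, NSMakeRange(start, hit.location - start), lineHandler);
                start = NSMaxRange(hit);
            }
            // At the end of the file, whatever is left is the last line.
            if ( atEnd && start < [pending length] ) {
                SendLine(pending, NSMakeRange(start, [pending length] - start), lineHandler);
                start = [pending length];
            }
            // Drop the bytes that have been handed over.
            [pending replaceBytesInRange:NSMakeRange(0, start) withBytes:NULL length:0];
        }
    }
    [handle closeFile];
    return YES;
}
```

Splitting on the "\n" byte before decoding avoids cutting a multi-byte UTF-8 character in half at a chunk boundary. Handling quoted fields that span lines would need the quote state to be carried across lines, which this sketch does not attempt.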

Drew

---------------------------
Drew McCormack
http://www.maccoremac.com
http://www.macanics.net
http://www.macresearch.org

iwork numbers

Could you send this to the good folks at Apple who forgot to include this feature in Numbers?

RE: Huge text files

It probably wouldn't matter if you're running on a server, but on a desktop you'd probably consume all the physical RAM, and the data would then get paged out as needed.

File Encodings

I was also surprised, when I needed to add CSV parsing to my app, that there was nothing 'off the shelf'. I eventually rolled my own (similar to what you did), but one thing that really bit me was encoding (my app has a lot of international users).

In your example you assume UTF8 encoding, which is reasonable and is the native encoding on the Mac, but the most popular CSV creator in the world (Excel) uses (I believe) UTF16, and there are other apps that write MacRoman (I never could figure out which ones were doing that).

Ultimately I probably spent more time writing the code to import the file properly regardless of its encoding than I did on the actual parsing.
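
One possible shape for such encoding-tolerant import code is sketched below. This is not the commenter's code; the helper name and the order of the fallback encodings are assumptions. It first lets Foundation detect the encoding, then tries a fixed list.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: tries to read a text file without knowing its
// encoding in advance. Returns nil if the file cannot be read at all.
static NSString *StringWithContentsOfFileGuessingEncoding(NSString *path)
{
    // First let Foundation work out the encoding itself (for example from
    // a byte order mark, as found in typical UTF16 files).
    NSStringEncoding used = 0;
    NSString *text = [NSString stringWithContentsOfFile:path
                                           usedEncoding:&used
                                                  error:NULL];
    if ( text ) return text;

    // Otherwise fall back to a fixed list of candidates. The order is an
    // assumption; MacRoman goes last because it rejects very little, so it
    // would hide the other candidates if tried earlier.
    NSStringEncoding candidates[] = {
        NSUTF8StringEncoding,
        NSWindowsCP1252StringEncoding,
        NSMacOSRomanStringEncoding
    };
    NSUInteger count = sizeof(candidates) / sizeof(candidates[0]);
    for ( NSUInteger i = 0; i < count; i++ ) {
        text = [NSString stringWithContentsOfFile:path
                                         encoding:candidates[i]
                                            error:NULL];
        if ( text ) return text;
    }
    return nil;
}
```

A successful decode is not proof that the guess was right: single-byte encodings such as MacRoman and Windows Latin 1 will happily decode each other's files into the wrong characters, so letting the user override the choice is probably still worthwhile.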

-jm

Cocoadev

Hi Drew,

Nice code, just wondered whether you had seen this one given your introduction:
http://www.cocoadev.com/index.pl?ReadWriteCSVAndTSV

Cocoadev

Nice link. I don't think I saw that, although I definitely looked on CocoaDev.
I had a bit of a look around just now, but the CocoaDev server is having trouble, so I stopped.
There is sample code there, but like most of the other code I found, it doesn't treat all cases. In particular, a lot of the code you can find does not deal with new-lines in quoted strings, which are allowed (at least by Excel).

If there is a sample that treats all cases, please post the link so that people have a choice.

Drew


New lines

I've looked into the potential new-line problem when using non-UTF8 strings, and came across a few useful tips. First, there is a newlineCharacterSet factory method on NSCharacterSet in Leopard, which could replace the @"\n\r" line in my code.

For pre-leopard, I found a public domain category that achieves the same thing: http://codebeach.org/code/show/1

I might work that into the code above if I get time.

Drew


Update for newlines

I have modified the code slightly to treat different varieties of newlines properly. I also tested it for a number of string types and line endings, including UTF8, UTF16, MacRoman, and Windows, and it seems to work in all cases.

Drew


Big hammer

There is always the high overhead/short version

NSString *fileContent = [NSString stringWithContentsOfFile:path encoding:NSUTF8StringEncoding error:NULL];
NSArray *lines = [fileContent componentsSeparatedByString:@"\n"];
// In 10.5 can use
// [fileContent componentsSeparatedByCharactersInSet:[NSCharacterSet newlineCharacterSet]];

// Then for each line (index i), use
NSArray *entriesInLine = [[lines objectAtIndex:i] componentsSeparatedByString:@","];

// In case you have quotes on the outside, trim them from each entry (index j)
NSCharacterSet *quotes = [NSCharacterSet characterSetWithCharactersInString:@"\""];
NSString *entry = [[entriesInLine objectAtIndex:j] stringByTrimmingCharactersInSet:quotes];

This method is fairly slow. But for moderate sizes (up to a megabyte or two) you really won't feel it that much. You are using fairly general purpose objects when you use a character set, and a lot of objects are added to the autorelease pool. The main issue is memory use, since you will have multiple copies of the data file + object overhead. To avoid that, create NSAutoreleasePool objects inside the loops, and put about 100 objects in it before you remove it and create another pool. Use "po [NSAutoreleasePool showPools]" in the debugger to monitor how many objects are in the pool.

David

newlines

David, one issue with your code is that you won't be able to have newline characters (or commas) as part of the cell content of your table. For example, if column 1 is the title of a song, and column 2 is the lyrics, then you want the cells of column 2 to include newline characters.

Otherwise, it is true that the NSString method 'componentsSeparatedByString:' is a very convenient method, worth remembering when performance is not an issue.

Great Series!

I'm new to the Mac, a mathematician retired from the aerospace industry. I decided a couple of weeks ago to use Cocoa for some projects. If I had found your series sooner, I could have saved an ink cartridge, a whole bunch of paper, and the price of a book.

I was just starting to struggle with CSV import and export for my current app when I found your earlier article 17, and this one. Even though I've read a good deal of Apple documentation, other tutorials, and a fair part of the aforementioned book, each of your articles answers a question I've had, or clarifies something I've only partially understood.

I hope to be able to make a contribution someday.

Thanks

Just want to say thanks for a great series of tutorials. I have studied and been put off by Obj-C for several years, but this set of lessons has finally brought it all together to make sense.

Kudos!

[NSCFArray length]: unrecognized selector sent to instance

Hi Drew

Thanks for this great tutorial.
When I try to apply it in my program I run into the following problem:

Your code applied to a CSV file with one row and two columns:
e.g Borrower's Name,bubba

generates output like:

(
"Borrower's Name",
bubba
)

When I try to access the individual array elements by applying a scanner to separate "Borrower's Name" and bubba, I get the error [NSCFArray length]:unrecognized selector sent to instance...

Below is my code:


NSString *fileString = [NSString stringWithContentsOfFile:[sheet filename] encoding:NSUTF8StringEncoding error:NULL];

NSArray* tmpArray = [self csvRows:fileString]; // Here the CSV string is input into your parser...

NSMutableString *lineString = [[NSMutableString alloc] init];

...

lineString = [fieldsArray objectAtIndex: 0];

NSScanner *scanner = [NSScanner scannerWithString:lineString];

...

Any idea what I am doing wrong?

Thanks for your help.
Dominik

Re: [NSCFArray length]: unrecognized selector sent to instance

Hi Dominik,
The error is caused by sending the length message to an array, rather than a string.
The csvRows: method returns an NSArray of NSArrays. Each row is not a string, but an NSArray of entries representing the columns in that row.

I would need to see more of your code, because I can't see how you are getting the fieldsArray. Here is an example of how you could do it:

NSString *fileString = [NSString stringWithContentsOfFile:[sheet filename] encoding:NSUTF8StringEncoding error:NULL];
NSArray* tmpArray = [self csvRows:fileString]; // Here the CSV string is input into your parser...
NSArray* fieldsArray = [tmpArray objectAtIndex:5]; // Gives 6th row array
NSString* firstFieldString = [fieldsArray objectAtIndex:0]; // Gives first column in 6th row

Make sense?

Drew


Re: [NSCFArray length]: unrecognized selector sent to instance

Thanks a lot Drew, this solved my problem.

FYI: I did define fieldsArray = [[NSMutableArray alloc] init]; in the -(id) init procedure as global variable.

Dominik

how big are ...

Drew, when you talk about "very big text files", how big are those "very big files"?

regards,

Re: Big Files

Well, I usually just relate it to the RAM in a standard computer. So a few hundred megabytes should be possible with this algorithm. If you are heading to gigabytes, best to use a streaming parser.

Drew


ohhh. I think my data wont

Ohhh. I think my data won't go beyond 100MB. Gigabytes is too far for me....

Empty last column bug

Dear Drew,

Thank you for the code. I am using it in an Xcode project for card game design and very much appreciate it. However, I noticed that if the last column in a row is blank (@"") it will be omitted by your algorithm. This is a bug because an empty column at the end of a row should be allowed. I corrected the code by making the requirements for omitting an empty column stricter. Namely, it must be at the end of the string, not just the end of a row, and it must be the only column in the row. This will omit an empty one-column row at the end of the string if it is not followed by a newline, but that is OK with me.

Here is my altered code for csvRows.

            if ( [scanner isAtEnd] ) {
                // CHANGED OMISSION REQUIREMENTS
                if ( (![currentColumn isEqualToString:@""]) || ([columns count] > 0) ) [columns addObject:currentColumn];
                finishedRow = YES;
            }
            else if ( [scanner scanCharactersFromSet:newlineCharacterSet intoString:&tempString] ) {
                if ( insideQuotes ) {
                    // Add line break to column text
                    [currentColumn appendString:tempString];
                }
                else {
                    // End of row
                    // RETAIN AN EMPTY COLUMN BEFORE A NEWLINE
                    [columns addObject:currentColumn];
                    finishedRow = YES;
                }
            }

Leif
friedmayonnaise.com

Thank you

I can't even begin to tell you how much time this little snippet saved me tonight. It's not perfect, but it's a great start.

Thanks!
Eric

Re: Imperfection

Mind telling us what is 'imperfect'?

Drew

---------------------------
Drew McCormack
http://www.mentalfaculty.com
http://www.macanics.net
http://www.macresearch.org

QuickLook Plugin

I used this approach in my own little CSV app some time ago, and over the weekend decided that I needed a CSV QuickLook Plugin. So I created one, and since there's only a slightly altered version of Drew's code in there, I thought I should mention that. Find the mini project here: http://code.google.com/p/quicklook-csv/
Maybe someone else finds this useful as well. :)

NSMutableCharacterSet Performance

Thanks for the great code. It's been quite useful.

I was having a performance issue when using this code (e.g., taking several seconds to parse a ~150K string), and I've found that creating an immutable copy of the two character sets seems to have resolved the problem.

I haven't done too much testing to figure out why, but using Shark found that a lot of time was being spent in the invertedSet method, so I'm guessing that NSMutableCharacterSet does something fancy about the way it stores changes that's avoided when it's made into a non-mutable set.
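
For reference, the change described here might look something like the following sketch. The function names are hypothetical, and the code assumes ARC; under manual reference counting the copies would need a matching release or autorelease.

```objc
#import <Foundation/Foundation.h>

// Builds the parser's newline set the same way as the category in the
// article, then returns an immutable copy of it.
static NSCharacterSet *ImmutableNewlineSet(void)
{
    NSMutableCharacterSet *set =
        [[NSCharacterSet whitespaceAndNewlineCharacterSet] mutableCopy];
    [set formIntersectionWithCharacterSet:
        [[NSCharacterSet whitespaceCharacterSet] invertedSet]];
    return [set copy];
}

// Same idea for the set of characters the parser stops at:
// comma, quotation mark, and any newline character.
static NSCharacterSet *ImmutableImportantSet(void)
{
    NSMutableCharacterSet *set =
        [[NSCharacterSet characterSetWithCharactersInString:@",\""] mutableCopy];
    [set formUnionWithCharacterSet:ImmutableNewlineSet()];
    return [set copy];
}
```

Inside csvRows, the two sets would be created once before the scanning loop and passed to the scanner calls in place of the mutable ones. Whether this gives the same speed-up on other system versions is something to measure rather than assume.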

Adam

I've recently used this code

I've recently used this code in an iPad application that downloads large CSV files and parses them into objects stored in Core Data. While the author's code doesn't leak memory, it relies on -autoreleased objects quite a bit, and as a result used up all of the memory on the iPad and caused the app to crash.

The fix is pretty simple, and it's equally as valuable for iPad/iPhone apps as it would be for desktop apps - Create and drain your own NSAutoreleasePool. Two lines of code are required to make this happen:


while ( ![scanner isAtEnd] ) {
NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init]; // Add this line to create your own pool

Then, at the end of the while loop, call:

[pool drain];

And that will clean it up! Hope it helps...
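
With newer compilers the same idea is usually written with an @autoreleasepool block, which works both with and without ARC (NSAutoreleasePool itself cannot be used under ARC). A minimal sketch of the pattern, on a hypothetical helper unrelated to the parser:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: counts comma-separated fields over many lines.
static NSUInteger CountFields(NSArray *lines)
{
    NSUInteger total = 0;
    for ( NSString *line in lines ) {
        // Temporary objects created in this iteration are released when the
        // block ends, instead of piling up until the caller's pool drains.
        @autoreleasepool {
            NSArray *fields = [line componentsSeparatedByString:@","];
            total += [fields count];
        }
    }
    return total;
}
```

In csvRows, the body of the outer while loop would go inside the @autoreleasepool block. Anything that has to outlive an iteration, such as the rows array, must be created outside the block.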

Suggestion for a tweak to

Suggestion for a tweak to the simple CSV reader code (first example): Change [NSCharacterSet characterSetWithCharactersInString:@"\n, "] to [NSCharacterSet characterSetWithCharactersInString:@"\n\r, "]. For some odd reason, a csv file that I whipped together (I forget exactly where it originated, but I think Numbers) ended up having \r for the end of line characters, instead of \n. It threw me off for a while before I figured it out, and I can imagine that some people might not remember/know that there are multiple escape sequences for getting a new line, so it would be nice to have that taken care of in the example (especially since some of us [me in particular] don't read the comments until we've already been pulling our hair out for a few hours. ;) )
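
A variation on this tweak, sketched under the assumption that you are on 10.5 or later, is to build the skipped set from NSCharacterSet's newlineCharacterSet rather than listing the escape sequences by hand. The function name ScanPairsAnyNewline is hypothetical, and the code assumes ARC.

```objc
#import <Foundation/Foundation.h>

// Hypothetical variant of the simple reader: skips commas, spaces, and
// every newline character Foundation knows about ("\n", "\r", and others).
static NSArray *ScanPairsAnyNewline(NSString *text)
{
    NSMutableCharacterSet *skipped =
        [[NSCharacterSet characterSetWithCharactersInString:@", "] mutableCopy];
    [skipped formUnionWithCharacterSet:[NSCharacterSet newlineCharacterSet]];

    NSScanner *scanner = [NSScanner scannerWithString:text];
    [scanner setCharactersToBeSkipped:skipped];
    NSMutableArray *pairs = [NSMutableArray array];
    float first, second;
    while ( [scanner scanFloat:&first] && [scanner scanFloat:&second] ) {
        [pairs addObject:[NSArray arrayWithObjects:
            [NSNumber numberWithFloat:first],
            [NSNumber numberWithFloat:second], nil]];
    }
    return pairs;
}
```

This reads files with "\n", "\r", or "\r\n" line endings the same way, since all of those characters are simply skipped between numbers.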

PAul