On quite a few occasions, MacResearch readers have asked how to parse CSV (comma-separated values) data in Cocoa. CSV is a simple format for representing tables; it is used in widely varying fields, from science to finance, basically anywhere a table needs to be stored in a text file.

I’ve recently added CSV import to my flash card application, Mental Case. Before I began, I thought it would be a trivial matter of searching Google for some Objective-C sample code or an open source library. I found solutions in scripting languages like Python, but nothing Cocoa-based. After an hour or two of searching, I realized that if I wanted a Cocoa-native solution, I was going to have to roll my own. In this short tutorial, I will show you what I came up with, and hopefully save you the trouble of doing it yourself.

Simple CSV

Parsing CSV can actually be quite simple, if you know the structure of the data beforehand, and you don’t have to deal with quoted strings. In fact, I addressed this in an earlier tutorial that stored spectra in CSV format.

- (BOOL)readFromURL:(NSURL *)absoluteURL ofType:(NSString *)typeName 
    error:(NSError **)outError 
{
    NSString *fileString = [NSString stringWithContentsOfURL:absoluteURL 
        encoding:NSUTF8StringEncoding error:outError];
    if ( nil == fileString ) return NO;
    NSScanner *scanner = [NSScanner scannerWithString:fileString];
    [scanner setCharactersToBeSkipped:
        [NSCharacterSet characterSetWithCharactersInString:@"\n, "]];
    NSMutableArray *newPoints = [NSMutableArray array];
    float energy, intensity;
    while ( [scanner scanFloat:&energy] && [scanner scanFloat:&intensity] ) {
        [newPoints addObject:
            [NSMutableDictionary dictionaryWithObjectsAndKeys:
                [NSNumber numberWithFloat:energy], @"energy",
                [NSNumber numberWithFloat:intensity], @"intensity",
                nil]];
    }
    [self setPoints:newPoints];
    return YES;
}

The NSScanner class is what you use to do most of your string parsing in Cocoa. In the example above, it has been assumed that the CSV file is in a particular form, namely, that it has exactly two columns, each containing a decimal number. By telling the scanner to skip newlines, commas, and spaces

    [scanner setCharactersToBeSkipped:
        [NSCharacterSet characterSetWithCharactersInString:@"\n, "]];

the parsing of each row is reduced to a single line of code

    while ( [scanner scanFloat:&energy] && [scanner scanFloat:&intensity] ) {

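The scanFloat: method tries to read a floating-point number, returning NO upon failure. The while loop therefore continues until the data runs out or the format stops meeting expectations. Note that in the latter case the method above still returns YES, keeping whatever points were read before the problem.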
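Because the loop simply stops at the first thing that isn't a number, a malformed file is silently truncated. If that matters to you, one option is to keep scanning until isAtEnd and treat any failed scanFloat: as an error. Here is a minimal sketch of that idea. The function name ParseEnergyIntensityPairs is made up for illustration, and I have added "\r" to the skipped characters on the assumption that some files will have Windows line endings.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of the earlier tutorial): returns an array of
// dictionaries, or nil if the text is not made up purely of number pairs.
static NSArray *ParseEnergyIntensityPairs(NSString *csv)
{
    NSScanner *scanner = [NSScanner scannerWithString:csv];
    // Assumption: "\r" is added so that CRLF line endings are skipped too.
    [scanner setCharactersToBeSkipped:
        [NSCharacterSet characterSetWithCharactersInString:@"\r\n, "]];
    NSMutableArray *points = [NSMutableArray array];
    float energy, intensity;
    // isAtEnd ignores the skipped characters, so trailing newlines are fine.
    while ( ![scanner isAtEnd] ) {
        if ( ![scanner scanFloat:&energy] || ![scanner scanFloat:&intensity] ) {
            return nil; // Non-numeric text, or a value without a partner.
        }
        [points addObject:
            [NSDictionary dictionaryWithObjectsAndKeys:
                [NSNumber numberWithFloat:energy], @"energy",
                [NSNumber numberWithFloat:intensity], @"intensity",
                nil]];
    }
    return points;
}
```

This still treats commas and newlines alike, so it checks that the values come in pairs, not that each line holds exactly two of them.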

General CSV

As you can see, parsing CSV data can be very easy, but that is not always the case. When you have to deal with general CSV data, things can get quite complicated, because you have to account for the possibility that fields are quoted, that quoted fields contain commas and quotation marks, and that they can even extend over multiple lines. For example, the following is a valid row of CSV data, containing two columns:

"The quick, brown fox", "jumped over the ""lazy"",  
dog"

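In case you haven’t figured it out, each doubled quotation mark inside a quoted string stands for a single literal quotation mark, and the comma and the line break inside the quotes belong to the string (as does any whitespace before the line break). That gives the two strings 'The quick, brown fox' and 'jumped over the "lazy",<new line>dog'.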

Parsing this general form of CSV is considerably more difficult than the simple form, and it took me quite a while to come up with some clean code to do it. But I think I succeeded in the end. Here it is: (Update: I have changed this code to properly handle all newline varieties.)

@implementation NSString (ParsingExtensions)

-(NSArray *)csvRows {
    NSMutableArray *rows = [NSMutableArray array];

    // Get newline character set
    NSMutableCharacterSet *newlineCharacterSet = (id)[NSMutableCharacterSet whitespaceAndNewlineCharacterSet];
    [newlineCharacterSet formIntersectionWithCharacterSet:[[NSCharacterSet whitespaceCharacterSet] invertedSet]];

    // Characters that are important to the parser
    NSMutableCharacterSet *importantCharactersSet = (id)[NSMutableCharacterSet characterSetWithCharactersInString:@",\""];
    [importantCharactersSet formUnionWithCharacterSet:newlineCharacterSet];

    // Create scanner, and scan string
    NSScanner *scanner = [NSScanner scannerWithString:self];
    [scanner setCharactersToBeSkipped:nil];
    while ( ![scanner isAtEnd] ) {        
        BOOL insideQuotes = NO;
        BOOL finishedRow = NO;
        NSMutableArray *columns = [NSMutableArray arrayWithCapacity:10];
        NSMutableString *currentColumn = [NSMutableString string];
        while ( !finishedRow ) {
            NSString *tempString;
            if ( [scanner scanUpToCharactersFromSet:importantCharactersSet intoString:&tempString] ) {
                [currentColumn appendString:tempString];
            }

            if ( [scanner isAtEnd] ) {
                if ( ![currentColumn isEqualToString:@""] ) [columns addObject:currentColumn];
                finishedRow = YES;
            }
            else if ( [scanner scanCharactersFromSet:newlineCharacterSet intoString:&tempString] ) {
                if ( insideQuotes ) {
                    // Add line break to column text
                    [currentColumn appendString:tempString];
                }
                else {
                    // End of row
                    if ( ![currentColumn isEqualToString:@""] ) [columns addObject:currentColumn];
                    finishedRow = YES;
                }
            }
            else if ( [scanner scanString:@"\"" intoString:NULL] ) {
                if ( insideQuotes && [scanner scanString:@"\"" intoString:NULL] ) {
                    // Replace double quotes with a single quote in the column string.
                    [currentColumn appendString:@"\""]; 
                }
                else {
                    // Start or end of a quoted string.
                    insideQuotes = !insideQuotes;
                }
            }
            else if ( [scanner scanString:@"," intoString:NULL] ) {  
                if ( insideQuotes ) {
                    [currentColumn appendString:@","];
                }
                else {
                    // This is a column separating comma
                    [columns addObject:currentColumn];
                    currentColumn = [NSMutableString string];
                    [scanner scanCharactersFromSet:[NSCharacterSet whitespaceCharacterSet] intoString:NULL];
                }
            }
        }
        if ( [columns count] > 0 ) [rows addObject:columns];
    }

    return rows;
}

@end

(I’m releasing this code into the public domain, so use it as you please.)

This code is designed to be a category on NSString. The idea is that it will parse a string into rows and columns, under the assumption that it is in CSV format. The result is an array of arrays; entries in the containing array represent the rows, and those in the contained arrays represent the columns in each row.

The code itself is fairly straightforward: It consists of a big while loop which continues until the whole string is parsed. An inner while loop looks through each row of CSV data, looking for significant landmarks, like an end of line, an opening or closing quotation mark, or a comma. By keeping track of opening and closing quotation marks, it is able to properly deal with commas and newlines embedded in quoted strings.
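To make the array-of-arrays result concrete, here is a small usage sketch. It assumes the category above is compiled into the same target; the header name NSString+ParsingExtensions.h and the function ParseSampleCSV are hypothetical names chosen for this example.

```objc
#import <Foundation/Foundation.h>
// Assumption: this (hypothetical) header declares the csvRows category method.
#import "NSString+ParsingExtensions.h"

// Hypothetical helper that parses a small sample and logs each row.
static NSArray *ParseSampleCSV(void)
{
    // Two rows; the second has a quoted comma and doubled quotation marks.
    NSString *csv = @"name,quote\n\"Fox, Q.\",\"said \"\"hi\"\"\"\n";
    NSArray *rows = [csv csvRows];
    for ( NSArray *columns in rows ) {
        NSLog(@"%lu columns: %@", (unsigned long)[columns count], columns);
    }
    return rows;
}
```

The sample parses into two rows of two columns each, with the second row holding the strings 'Fox, Q.' and 'said "hi"'. One caveat follows from the isEqualToString:@"" checks: an empty field at the very end of a row is not added, so a row like 'a,b,' yields two columns rather than three.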

Conclusions

NSScanner is a useful class for parsing strings. It may not be quite as powerful as the regular expressions found in scripting languages like Perl and Python, but with just a few methods (e.g., scanString:intoString:, scanUpToCharactersFromSet:intoString:, and scanFloat:) you can achieve an awful lot. If you need to do any basic string parsing in one of your Cocoa projects, give it a look.