Dealing with common and uncommon problems with parsing CSV with Ruby

Dealing with common and uncommon problems with parsing CSV with Ruby

Effectively parse CSV with Ruby

At first glance, parsing CSV files with Ruby seems really straightforward and non-problematic. The same for the CSV format itself, it’s just a set of data separated mainly by commas. However, the practice is usually more problematic and complex, especially when you have to parse CSV files uploaded by users.

You can assume that the file will be complete in terms of structure and appropriately encoded. The harsh truth is that files uploaded by users are often malformed, poorly formatted, or encoded. Some of them may even contain duplicated headers or empty rows. This article is a first-aid kit that you can use when you will spot any problems with parsing CSV.

All my notes are based on years of hands-on experience at iRonin.IT - a top software development company, where we provide custom software development and IT staff augmentation services for a wide array of technologies.

High memory usage

Yes, you can harm your application’s performance by parsing CSV files. This can happen when you will parse a very large file and load all contents into a memory to process rows. Every time you see code like this:

CSV.read("./users.csv").each do |row|
  # …
end

or

CSV.parse(File.read("./users.csv")).each do |row|
  # …
end

Keep in mind that if the file users.csv is very big, this line of code can cause memory issues as all records will be loaded into memory. To avoid such a situation, use foreach method:

CSV.foreach("./users.csv") do |row|
  # …
end

Duplicated headers

Based on my experience, it does not happen that often as malformed encoding or missing information but it is still the case. Given we have the following CSV data with the duplicated first_name header:

first_name,last_name,email,first_name,age\n
John,Doe,doe@gmail.com,Johnatan,30\n
Tom,Doe,tom@gmail.com,Thomas,40\n

If we are interested only in the first occurrence of the header’s values, we don’t have to do anything as the standard library will handle this automatically:

data = CSV.read("./duplications.csv", headers: true)
data.first['first_name'] # => “John”

If you are interested in accessing the second value of the duplicated header, you can convert the row into array and then to the hash. Thanks to this trick you will get a hash where the last of duplicated values is present:

data = CSV.read("./duplications.csv", headers: true)
transformed_data = Hash[data.first.to_a]
transformed_data['first_name'] # => “Johnatan”

Collecting values from duplicated headers

If you don’t want to ignore values for duplicated headers, you can easily collect all values with the inject method that is available on the CSV row with the standard library:

rows = []
csv = "first_name,email,first_name\nTom,tom@gmail.com,Jim"
CSV.parse(csv, headers: true).each do |row|
 attributes = row.inject({}) do |h, r|
   h[r.first] = Array.wrap(h[r.first]) << r.last; h
 end

 rows << attributes.transform_values { |v| v.size == 1 ? v.first : v }
end

rows # => [{"first_name"=>["Tom", "Jim"], "email"=>"tom@gmail.com"}]

Above solution will work well if there are no duplications or there are more than 2 columns duplicated as well.

Encoding issues

Let’s consider the following case: we are processing the CSV file with the custom encoding where one of the values contains some special characters:

uid,first_name,last_name,age
1,John,Non spécifié,19
2,Tim,Doe,20
3,Marcel,Doe,30

John’s last name is not specified, and the value is in french. The file encoding is ISO 8859-1. Opening the file in a standard way will throw an error indicating that the file is malformed:

CSV.read("./users.csv") # => Invalid byte sequence in UTF-8 in line 2. (CSV::MalformedCSVError)

We can fix the code by defining the encoding explicitly:

CSV.read("./users.csv", encoding: "ISO-8859-1", headers: true).each do |row|
 puts row['last_name']
end

# => Non spécifié
# => Doe
# => Doe

Non-standard separator

Usually, values in the CSV file are separated by commas or semicolons. By the default, the standard Ruby library expects a comma as the separator but you can overwrite this configuration with the col_sep setting:

csv = "first_name|last_name\nJohn|Doe"
CSV.parse(csv, col_sep: "|", headers: true).first.to_h
# => {"first_name"=>"John", "last_name"=>"Doe"}

Without the explicit configuration, parsing won’t work:

csv = "first_name|last_name\nJohn|Doe"
CSV.parse(csv, headers: true).first.to_h
# => {"first_name|last_name"=>"John|Doe"}

Detecting separator

If you are unsure what is the separator, you can write a code that will attempt to detect it. The below code won’t load the whole file into memory, it will only fetch headers and based on this value it will try to guess which separator was used:

separators = %w[, ; |]
headers = File.open('./users.csv') { |file| file.gets }
stats = separators.inject({}) { |stat, sep| stat.merge(sep => headers.count(sep)) }
separator = stats.sort_by { |_,v| v }.last.first

CSV.foreach('./users.csv', col_sep: separator) do |row|
 # ...
end

The good news is that if the CSV file was created by some software as a result of the data export, it should contain one of the standard separators.

Duplicated rows

The standard library allows duplicated rows so you should be aware of such behavior. There is no built-in option to remove duplicates so you have to implement the code from scratch.

In most cases, it’s enough to collect all rows in an array and then call uniq on the array to get distinct values. Another approach is to remove duplicated lines from the file before you will parse the CSV.

Being aware that the rows might be duplicated is part of the success.

Empty rows

By default, the standard Ruby library for parsing CSV format does not care about the empty rows and return them while parsing so it’s your responsibility to catch them and handle them properly.

However, the good news is that you can ignore empty rows on the parsing level by passing the skip_blanks: true flag:

CSV.read("./users.csv", headers: true, skip_blanks: true)

Formatting issues

By default, all values parsed by the CSV library will be returned as strings:

csv = "first_name,age,birth_date\nJohn,30,2000-01-15"
row = CSV.parse(csv, headers: true).first
row.to_h
# => {"first_name"=>"John", "age"=>"30", "birth_date"=>"2000-01-15"}

We can either implement our own code to transform values to the right format or take advantage of preprocessors. A preprocessor is an anonymous function that is invoked against each value in the CSV file.

Default preprocessor

The standard library provides some default preprocessors that we can use to make some additional parsing of the CSV data:

  • Integer

  • Float

  • Date

  • DateTime

If you would like to list all default preprocessors in the console, you can simply run CSV::Converters. You can also play with each preprocessor invoking the call method on it:

CSV::Converters[:integer].call("2") # => 2

By default, preprocessors are not used so you have to explicitly define which of them you would like to use:

csv = "first_name,age,birth_date\nJohn,30,2000-01-15"
row = CSV.parse(csv, headers: true, converters: [:integer, :date]).first
row.to_h
# => {"first_name"=>"John", "age"=>30, "birth_date"=>Sat, 15 Jan 2000}

Custom preprocessor

You can easily define your own preprocessor and apply it while parsing the contents of the CSV file. Simply define it as an anonymous function that accepts one argument, the string value of the CSV column:

money_converter = ->(value) {
 if value.match?(/^[0-9]* (USD|EUR)$/)
   captures = value.match(/^(?<amount>[0-9]*) (?<currency>USD|EUR)$/)
   Money.from_amount(captures[:amount].to_f, captures[:currency])
 else
   value
 end
}

The above function checks if the value is an amount with currency. If there is a match, the value is converted to the Money object (from Money gem). Otherwise, it returns the original value so other preprocessors can process it as well.

We can use our custom preprocessor almost the same way we are using the default ones:

csv = "first_name,age,salary\nJohn,30,100000 USD"
row = CSV.parse(csv, headers: true, converters: [money_converter]).first
row.to_h
# => {"first_name"=>"John", "age"=>"30", "salary"=>#<Money fractional:10000000 currency:USD>}

Of course, if you plan to use your custom converter in many places, you can add it to the standard library and reference it via symbol:

CSV::Converters[:money] = money_converter

CSV.parse(csv, headers: true, converters: [:money])

Didn't get enough of Ruby?

Check out our free books about Ruby to level up your skills and become a better software developer.