Today I Learned

TIL, 2018-05-30: Speeding up CSV Import into a Database

Musings, Ruby

On Processing Large CSV Files

Reference

  • Helpers for printing memory used and time spent (saved as helpers.rb, which the script below requires):
require 'benchmark'

# Prints how much the process's resident set size (RSS) grew while the
# block ran. `ps -o rss=` reports RSS in kilobytes.
def print_memory_usage
  memory_before = `ps -o rss= -p #{Process.pid}`.to_i
  yield
  memory_after = `ps -o rss= -p #{Process.pid}`.to_i

  puts "Memory: #{((memory_after - memory_before) / 1024.0).round(2)} MB"
end

# Prints the wall-clock time the block took.
def print_time_spent
  time = Benchmark.realtime do
    yield
  end

  puts "Time: #{time.round(2)}"
end
  • Generating a test CSV file with one million rows:
require 'csv'
require_relative './helpers'

headers = ['id', 'name', 'email', 'city', 'street', 'country']

name    = "Pink Panther"
email   = "pink.panther@example.com" # placeholder address for the dummy rows
city    = "Pink City"
street  = "Pink Road"
country = "Pink Country"

print_memory_usage do
  print_time_spent do
    CSV.open('data.csv', 'w', write_headers: true, headers: headers) do |csv|
      1_000_000.times do |i|
        csv << [i, name, email, city, street, country]
      end
    end
  end
end
  • When building the CSV file, the Ruby process's memory usage did not spike, because the GC kept reclaiming memory as rows were written.
  • Reading the entire file into memory at once: really high memory usage — 19 seconds, 920 MB.
  • Parsing CSV from an in-memory string with CSV.parse: 21 seconds, 1 GB of memory.
  • Parsing line by line with CSV.new: 9.73 seconds, 74 MB.
  • Fastest way: CSV.foreach — 9 seconds, 0.5 MB. (Sketches of the parsing approaches follow this list.)
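
For reference, a minimal sketch of the three parsing approaches, assuming the data.csv generated above and the helpers from helpers.rb; run each variant in a separate process so one variant's memory use doesn't skew the next measurement:

require 'csv'
require_relative './helpers'

# Variant 1: slurp the whole file, then parse the string.
# The raw string and every parsed row sit in memory at the same time.
print_memory_usage do
  print_time_spent do
    content = File.read('data.csv')
    rows = CSV.parse(content, headers: true)
    puts rows.count
  end
end

# Variant 2: wrap the open file in CSV.new and pull rows one at a time.
print_memory_usage do
  print_time_spent do
    File.open('data.csv') do |file|
      csv = CSV.new(file, headers: true)
      count = 0
      count += 1 while csv.shift
      puts count
    end
  end
end

# Variant 3: CSV.foreach streams rows without loading the file at all.
print_memory_usage do
  print_time_spent do
    count = 0
    CSV.foreach('data.csv', headers: true) { |_row| count += 1 }
    puts count
  end
end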

Sealed Classes

Reference

  • Like enums, but each variant is an actual class that can carry its own typed data.
  • For a Result class, we have this:
sealed class Result {
  class Success(val items: List<String>) : Result()
  class Failure(val error: Throwable) : Result()
}
  • The when expression is a bit like pattern matching: result is smart-cast inside each branch, and because Result is sealed the compiler knows all subclasses and can check the when for exhaustiveness.
when (result) {
  is Result.Success -> showItems(result.items)
  is Result.Failure -> result.error.printStackTrace()
}
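
A minimal runnable sketch putting the two snippets together; showItems and the sample values are hypothetical, just for illustration:

sealed class Result {
  class Success(val items: List<String>) : Result()
  class Failure(val error: Throwable) : Result()
}

// Hypothetical helper, only for this sketch.
fun showItems(items: List<String>) = items.forEach(::println)

fun handle(result: Result) {
  // result is smart-cast inside each branch; since Result is sealed,
  // these two cases cover every possible subclass.
  when (result) {
    is Result.Success -> showItems(result.items)
    is Result.Failure -> result.error.printStackTrace()
  }
}

fun main() {
  handle(Result.Success(listOf("pink", "panther")))
  handle(Result.Failure(IllegalStateException("boom")))
}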
