Sunday, February 8, 2009

A "hackish" data loader

If I'm going to be dealing with whacks of data, I want a fairly clean way of loading the data into the database in the format I specified.

So, while the first order of business would be the creation of the PaperTrade class (which will be the first gem I release out of this work), I wanted a quick and dirty way to load the files so I had to do a few things first.

0. Create the table on the database.
  • So, I created a sqlite3 file called "db.db" (because I like palindromes)
  • I used the table definition below (at some point I'll come back and create a migration for this
  • CREATE TABLE stocks(
    id integer primary key,
    name varchar(20),
    date integer,
    open float,
    high float,
    low float,
    close float,
    volume integer,
    adj_close float);
1. Creation of the Data Loader class.
There are a few notes to this:
  • I use a couple of modules to abstract out the ActiveRecord calls so that the loader script which creates the DataLoad class will not have to deal with anything - the only part that is a bit unwieldy is that there is an "include Tables" to pull in the creation of the abstracted Stocks class. On the whole, I believe there should be a more elegant way of dealing with this but for now, it works. The only time I think it will bite me will be if I build more on to this so that multiple tables are created. That could get overly messy
  • The date format that the csv data I receive back from Yahoo! is DD/MM/YY which apparently is what it thinks all Canadian localised date formats should be. So, I had to create a twist date method to deal with that and it's highly idiosyncratic to my uses - you may not get the same mileage if your date data does not comply with that format. Also, there is a bit of a "Y19K" thing going on so don't import any data prior to 1919 or use this class after 2019 ;-)
  • Since I felt so "clever" with my Tables include, I decided to do one for the require statements.
  • module Requires
    require 'activerecord'
    end

    module Tables
    class Stock < ActiveRecord::Base
    end
    end

    class DataLoad
    include Requires
    def initialize(data_file_load, stock_name, db_file_name)
    @file_load = data_file_load
    @name = stock_name
    @db_file_name = db_file_name
    @adapter = 'sqlite3'
    end

    def connect_to_db
    # Connect to a database
    ActiveRecord::Base.establish_connection(
    {:adapter => @adapter,
    :database => @db_file_name})
    end

    def twist_date(stringer)
    months = ["", "jan", "feb", "mar", "apr", "may", "jun", "jul",
    "aug", "sep", "oct", "nov", "dec"]
    split_string = stringer.split('/')
    short_date = split_string.last
    if short_date.to_i > 19 then
    return_date = '19' + short_date
    else
    return_date = '20' + short_date
    end
    return [return_date.to_i, months[split_string[1].to_i], split_string[0].to_i]
    end

    def data_file_load
    @data_load = []
    File.open(@file_load) do |file|
    while line = file.gets
    @data_load << line.chomp!
    end
    @headers = @data_load.shift
    @headers.gsub!(/\"/,'')
    end
    end

    def csv_to_db
    @data_load.each do |line|
    nter = Stock.new
    nter.name = @name
    date_twisted = twist_date(line.split(',').first)
    nter.date = Time.local(date_twisted.first, date_twisted[1], date_twisted.last).to_i
    nter.open = line.split(',')[1].to_f
    nter.high = line.split(',')[2].to_f
    nter.low = line.split(',')[3].to_f
    nter.close = line.split(',')[4].to_f
    nter.volume = line.split(',')[5].to_i
    nter.adj_close = line.split(',')[6].to_f
    nter.save
    end
    end
    end #class End
At this point in time, I haven't written any unit tests but that should be in an upcoming post.

2. The data loader script:
  • # This is a script to load tables to the database

    require 'data_load_class'

    puts "enter the name of the csv file to load"
    data_file_load = gets.chomp!

    puts "enter the name of the identifer for the stock ex: USO"
    stock_name = gets.chomp!

    puts "enter the name of the database file"
    db_file_name = gets.chomp!

    a_data_loader = DataLoad.new(data_file_load, stock_name, db_file_name)

    a_data_loader.connect_to_db

    # We can only create the tables after we've connected
    # to the database
    include Tables

    # Get the data
    a_data_loader.data_file_load

    # Load 'em up
    a_data_loader.csv_to_db
Since I'm blogging the code as it exists today, there are a bunch of unpolished bits.

The parts I'm most unhappy about are:
  • The whole creating the ActiveRecord Stock class. I did quite a bit of reading on how to do this and wasn't pleased with any of the current solutions. I think Rails has it's own way of doing this so that's a study task for me.
  • Creating my own Y19K bug wasn't too pleasing but, I wanted to deal with the data as I was receiving it and not have ot 'normalise' the data through OOCalc or Excel if at all possible.
  • Lack of unit tests - I recognize this is a bit askew from my last post but, the fact of the matter is that I have not yet internalised the whole TDD because I have not made up my mind as to whether or not BDD (a la RSpec) makes more sense to me.
See you soon.

No comments:

Post a Comment