
Joining tables from multiple files

I am trying to join data from multiple tables that I have in different tab-separated files. As an example, I have these tables:

    file1.txt
    a   3
    b   4
    c   8
    d   22
    e   4


    file2.txt
    a   10.3 -2
    b   4.7 -1
    c   8.9 -2
    e   22.1    -1

    file3.txt
    b   T
    c   F
    d   T
    f   F
    g   T

I would like to join them on their common key, which is the first column, to produce the following table:

    a   3   10.3 -2
    b   4   4.7 -1      T
    c   8   8.9 -2      F
    d   22              T
    e   4   22.1    -1  
    f                   F
    g                   T

How could I easily achieve this with Ruby?

Ted


I don't know of another way but this will create a hash with everything in it:

files = ['file1.txt', 'file2.txt', 'file3.txt']
result = Hash.new

files.each_with_index do |file, i|
    File.foreach(file) do |line|
        # Split each line into the key and the rest of the row.
        # (\w+ rather than \w so keys longer than one character also work.)
        key, value = /(\w+)\s+(.*)/.match(line).captures
        result[key] = Array.new(files.size) unless result.has_key?(key)
        result[key][i] = value
    end
end

The hash result looks like this:

{"a" => ["3", "10.3 -2", nil],
 "b" => ["4", "4.7 -1", "T"],
 "c" => ["8", "8.9 -2", "F"],
 "d" => ["22", nil, "T"],
 "e" => ["4", "22.1    -1", nil],
 "f" => [nil, nil, "F"],
 "g" => [nil, nil, "T"]}
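
To turn that hash back into the tab-separated table Ted asked for, pad the `nil` entries into empty fields and join with tabs. A quick sketch, using the `result` hash shown above as a literal so it stands alone:

```ruby
# The hash produced by the loop above (written out as a literal here
# so the snippet is self-contained).
result = {
  "a" => ["3", "10.3 -2", nil],
  "b" => ["4", "4.7 -1", "T"],
  "c" => ["8", "8.9 -2", "F"],
  "d" => ["22", nil, "T"],
  "e" => ["4", "22.1    -1", nil],
  "f" => [nil, nil, "F"],
  "g" => [nil, nil, "T"],
}

# One tab-separated line per key, sorted for stable output.
# nil values become empty fields via to_s.
lines = result.keys.sort.map do |key|
  ([key] + result[key].map(&:to_s)).join("\t")
end

puts lines
```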


You could do something like this:

require 'csv'

def load(file)
    CSV.open(file, :col_sep => "\t").
        each_with_object({ }) { |r, h| h[r.shift] = r }
end

# Load it all into hashes with a convenient format.
# The PK will be the key, the rest of the row will be the value as an array.
file1 = load('file1.txt')
file2 = load('file2.txt')
file3 = load('file3.txt')

# Figure out the row keys in the final table.
rows = (file1.keys | file2.keys | file3.keys).each_with_object({}) { |k,h| h[k] = [] }

# Use the block form of Hash#merge to join.
# Running totals of the value-column counts per file: [1, 3, 4] for these inputs.
cols   = [file1, file2, file3].inject([]) {|a, f| a.push(a.last.to_i + f.first.last.length)}
joined = rows.merge(file1).
              merge(file2) { |k, o, n| (o + [nil] * (cols[0] - o.length)) + n }.
              merge(file3) { |k, o, n| (o + [nil] * (cols[1] - o.length)) + n }

# Patch any missing values in the last column.
joined.each { |k, v| v.concat([nil] * (cols[2] - v.length)) }

The result is a hash like this:

{"a"=>["3",  "10.3", "-2", nil],
 "b"=>["4",  "4.7",  "-1", "T"],
 "c"=>["8",  "8.9",  "-2", "F"],
 "d"=>["22", nil,    nil,  "T"],
 "e"=>["4",  "22.1", "-1", nil],
 "f"=>[nil,  nil,    nil,  "F"],
 "g"=>[nil,  nil,    nil,  "T"]}

You can easily convert that to an array of arrays if desired. Generalizing to more files should also be fairly straightforward. There are, of course, other ways to implement the various steps, but I'll leave those refinements as an exercise.
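
For instance, the conversion to an array of arrays (or straight to tab-separated text) is a one-liner each. A sketch, taking the `joined` hash above as a literal so the snippet stands alone:

```ruby
# The joined hash from above, as a literal for a self-contained example.
joined = {
  "a"=>["3",  "10.3", "-2", nil],
  "b"=>["4",  "4.7",  "-1", "T"],
  "c"=>["8",  "8.9",  "-2", "F"],
  "d"=>["22", nil,    nil,  "T"],
  "e"=>["4",  "22.1", "-1", nil],
  "f"=>[nil,  nil,    nil,  "F"],
  "g"=>[nil,  nil,    nil,  "T"]
}

# Array of arrays: sort by key and prepend the key to each row.
table = joined.sort.map { |key, row| [key] + row }

# Tab-separated text, with nils rendered as empty fields.
tsv = table.map { |row| row.map(&:to_s).join("\t") }.join("\n")
puts tsv
```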

If the files were large then you'd be better off slurping them into an SQLite database and doing the join in SQL.


You could try using Sequel to load the files via ODBC, or ADO if you are on Windows, then do the join there, and output the dataset array to a text file in the same format.

Or import the files into sqlite (see basic instructions here, you may need to convert tabs to a different delimiter, not sure), then do the same thing in Sequel using the sqlite adapter instead.

