Processing Chromosomal Data in Ruby
Say I have a file of chromosomal data I'm processing with Ruby,
#Base_ID Segment_ID Read_Depth
1 100
2 800
3 seg1 1900
4 seg1 2700
5 1600
6 2400
7 200
8 15000
9 seg2 300
10 seg2 400
11 seg2 900
12 1000
13 600
...
I'm sticking each row into a hash of arrays, with my keys taken from column 2, Segment_ID, and my values from column 3, Read_Depth, giving me
mr_hashy = {
"seg1" => [1900, 2700],
"" => [100, 800, 1600, 2400, 200, 15000, 1000, 600],
"seg2" => [300, 400, 900],
}
A primer, which is a small segment that consists of two consecutive rows in the above data, prepends and follows each regular segment. Regular segments have a non-empty-string value for Segment_ID, and vary in length, while rows with an empty string in the second column are parts of primers. Primer segments always have the same length, 2. Seen above, Base_ID's 1, 2, 5, 6, 7, 8, 12, 13 are parts of primers. In total, there are four primer s开发者_运维知识库egments present in the above data.
What I'd like to do is, upon encountering a line with an empty string in column 2, Segment_ID, add the READ_DEPTH to the appropriate element in my hash. For instance, my desired result from above would look like
mr_hashy = {
"seg1" => [100, 800, 1900, 2700, 1600, 2400],
"seg2" => [200, 15000, 300, 400, 900, 1000, 600],
}
hash = Hash.new{|h,k| h[k]=[] }
# Throw away the first (header) row
rows = DATA.read.scan(/.+/)[1..-1].map do |row|
# Throw away the first (entire row) match
row.match(/(\d+)\s+(\w+)?\s+(\d+)/).to_a[1..-1]
end
last_segment = nil
last_valid_segment = nil
rows.each do |base,segment,depth|
if segment && !last_segment
# Put the last two values onto the front of this segment
hash[segment].unshift( *hash[nil][-2..-1] )
# Put the first two values onto the end of the last segment
hash[last_valid_segment].concat(hash[nil][0,2]) if last_valid_segment
hash[nil] = []
end
hash[segment] << depth
last_segment = segment
last_valid_segment = segment if segment
end
# Put the first two values onto the end of the last segment
hash[last_valid_segment].concat(hash[nil][0,2]) if last_valid_segment
hash.delete(nil)
require 'pp'
pp hash
#=> {"seg1"=>["100", "800", "1900", "2700", "1600", "2400"],
#=> "seg2"=>["200", "15000", "300", "400", "900", "1000", "600"]}
__END__
#Base_ID Segment_ID Read_Depth
1 100
2 800
3 seg1 1900
4 seg1 2700
5 1600
6 2400
7 200
8 15000
9 seg2 300
10 seg2 400
11 seg2 900
12 1000
13 600
Second-ish refactor. I think this is clean, elegant, and most of all complete. It's easy to read with no hardcoded field lengths or ugly RegEx. I vote mine as the best! Yay! I'm the best, yay! ;)
def parse_chromo(file_name)
last_segment = ""
segments = Hash.new {|segments, key| segments[key] = []}
IO.foreach(file_name) do |line|
next if !line || line[0] == "#"
values = line.split
if values.length == 3 && last_segment != (segment_id = values[1])
segments[segment_id] += segments[last_segment].pop(2)
last_segment = segment_id
end
segments[last_segment] << values.last
end
segments.delete("")
segments
end
puts parse_chromo("./chromo.data")
I used this as my data file:
#Base_ID Segment_ID Read_Depth
1 101
2 102
3 seg1 103
4 seg1 104
5 105
6 106
7 201
8 202
9 seg2 203
10 seg2 204
11 205
12 206
13 207
14 208
15 209
16 210
17 211
18 212
19 301
20 302
21 seg3 303
21 seg3 304
21 305
21 306
21 307
Which outputs:
{
"seg1"=>["101", "102", "103", "104", "105", "106"],
"seg2"=>["201", "202", "203", "204", "205", "206", "207", "208", "209", "210", "211", "212"],
"seg3"=>["301", "302", "303", "304", "305", "306", "307"]
}
Here's some Ruby code (nice practice example :P). I'm assuming fixed-width columns, which appears to be the case with your input data. The code keeps track of which depth values are primer values until it finds 4 of them, after which it will know the segment id.
require 'pp'
mr_hashy = {}
primer_segment = nil
primer_values = []
while true
line = gets
if not line
break
end
base, segment, depth = line[0..11].rstrip, line[12..27].rstrip, line[28..-1].rstrip
primer_values.push(depth)
if segment.chomp == ''
if primer_values.length == 6
for value in primer_values
(mr_hashy[primer_segment] ||= []).push(value)
end
primer_values = []
primer_segment = nil
end
else
primer_segment = segment
end
end
PP::pp(mr_hashy)
Output on input provided:
{"seg1"=>["100", "800", "1900", "2700", "1600", "2400"],
"seg2"=>["200", "15000", "300", "400", "900", "1000"]}
精彩评论