Processing Chromosomal Data in Ruby

2023-02-02 12:24 问答作者：

Say I have a file of chromosomal data I'm processing with Ruby,

#Base_ID    Segment_ID      Read_Depth  
1                           100            
2                           800         
3           seg1            1900            
4           seg1            2700           
5                           1600            
6                           2400            
7                           200
8                           15000
9           seg2            300
10          seg2            400
11          seg2            900
12                          1000
13                          600
...

I'm sticking each row into a hash of arrays, with my keys taken from column 2, Segment_ID, and my values from column 3, Read_Depth, giving me

mr_hashy = { 
  "seg1"  => [1900, 2700],
  ""      => [100, 800, 1600, 2400, 200, 15000, 1000, 600],
  "seg2"  => [300, 400, 900],
}

A primer, which is a small segment that consists of two consecutive rows in the above data, prepends and follows each regular segment. Regular segments have a non-empty-string value for Segment_ID, and vary in length, while rows with an empty string in the second column are parts of primers. Primer segments always have the same length, 2. Seen above, Base_ID's 1, 2, 5, 6, 7, 8, 12, 13 are parts of primers. In total, there are four primer s开发者_运维知识库egments present in the above data.

What I'd like to do is, upon encountering a line with an empty string in column 2, Segment_ID, add the READ_DEPTH to the appropriate element in my hash. For instance, my desired result from above would look like

mr_hashy = {
  "seg1"   => [100, 800, 1900, 2700, 1600, 2400],
  "seg2"   => [200, 15000, 300, 400, 900, 1000, 600],
}

hash = Hash.new{|h,k| h[k]=[] }

# Throw away the first (header) row
rows = DATA.read.scan(/.+/)[1..-1].map do |row|
  # Throw away the first (entire row) match
  row.match(/(\d+)\s+(\w+)?\s+(\d+)/).to_a[1..-1]
end

last_segment       = nil
last_valid_segment = nil
rows.each do |base,segment,depth|
  if segment && !last_segment
    # Put the last two values onto the front of this segment
    hash[segment].unshift( *hash[nil][-2..-1] )
    # Put the first two values onto the end of the last segment
    hash[last_valid_segment].concat(hash[nil][0,2]) if last_valid_segment
    hash[nil] = []
  end
  hash[segment] << depth
  last_segment = segment
  last_valid_segment = segment if segment
end
# Put the first two values onto the end of the last segment
hash[last_valid_segment].concat(hash[nil][0,2]) if last_valid_segment
hash.delete(nil)

require 'pp'
pp hash
#=> {"seg1"=>["100", "800", "1900", "2700", "1600", "2400"],
#=>  "seg2"=>["200", "15000", "300", "400", "900", "1000", "600"]}

__END__
#Base_ID    Segment_ID      Read_Depth  
1                           100            
2                           800         
3           seg1            1900            
4           seg1            2700           
5                           1600            
6                           2400            
7                           200
8                           15000
9           seg2            300
10          seg2            400
11          seg2            900
12                          1000
13                          600

Second-ish refactor. I think this is clean, elegant, and most of all complete. It's easy to read with no hardcoded field lengths or ugly RegEx. I vote mine as the best! Yay! I'm the best, yay! ;)

def parse_chromo(file_name)

  last_segment = ""
  segments = Hash.new {|segments, key| segments[key] = []}

  IO.foreach(file_name) do |line|
    next if !line || line[0] == "#"

    values = line.split
    if values.length == 3 && last_segment != (segment_id = values[1])
      segments[segment_id] += segments[last_segment].pop(2)
      last_segment = segment_id
    end

    segments[last_segment] << values.last
  end

  segments.delete("")
  segments
end

puts parse_chromo("./chromo.data")

I used this as my data file:

#Base_ID    Segment_ID      Read_Depth  
1                           101            
2                           102            
3           seg1            103            
4           seg1            104           
5                           105            
6                           106            
7                           201
8                           202
9           seg2            203
10          seg2            204            
11                          205            
12                          206
13                          207
14                          208            
15                          209            
16                          210
17                          211
18                          212
19                          301
20                          302
21         seg3             303
21         seg3             304
21                          305
21                          306
21                          307

Which outputs:

{
  "seg1"=>["101", "102", "103", "104", "105", "106"], 
  "seg2"=>["201", "202", "203", "204", "205", "206", "207", "208", "209", "210", "211", "212"], 
  "seg3"=>["301", "302", "303", "304", "305", "306", "307"]
}

Here's some Ruby code (nice practice example :P). I'm assuming fixed-width columns, which appears to be the case with your input data. The code keeps track of which depth values are primer values until it finds 4 of them, after which it will know the segment id.

require 'pp'
mr_hashy = {}
primer_segment = nil
primer_values = []
while true
  line = gets
  if not line
    break
  end
  base, segment, depth = line[0..11].rstrip, line[12..27].rstrip, line[28..-1].rstrip
  primer_values.push(depth)
  if segment.chomp == ''
    if primer_values.length == 6
      for value in primer_values
        (mr_hashy[primer_segment] ||= []).push(value)
      end
      primer_values = []
      primer_segment = nil
    end
  else
    primer_segment = segment
  end
end
PP::pp(mr_hashy)

Output on input provided:

{"seg1"=>["100", "800", "1900", "2700", "1600", "2400"],
 "seg2"=>["200", "15000", "300", "400", "900", "1000"]}

继续阅读：bioinformatics hash ruby string

Processing Chromosomal Data in Ruby

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？