Avoiding/Removing nil when using a case statement while parsing a string

sample data:

DNA : 
This is a string

BaseQuality :
4 4 4 4 4 4 6 7 7 7 

Metadata : 
Is_read

DNA : 
yet another string

BaseQuality : 
4 4 4 4 7 7 4 8 4 4 4 4 4

Metadata :
Is_read
SCF_File 
.
.
.

I have a method that uses a case statement, as follows, to separate parts of a longer text file into records using the delimiter "\n\n", and a class that models a data object:

def parse_file(myfile)
  $/ = "\n\n"
  records = []
  File.open(myfile) do |f|
    f.each_line do |line|
      read = Read.new
      case line
      when /^DNA/
        read.dna_data = line.strip
      when /^BaseQuality/
        read.quality_data = line.strip
      when /^Metadata/
        read.metadata = line.strip
      else
        puts "Unrecognized line: #{line}"
      end
      records.push read
    end
  end
  records
end

class Read
  attr_accessor :dna_data, :quality_data, :metadata
end

records.each do |r|
 puts r.dna_data
end

dna_data contains the 'rightful' string as well as two irritating nils:

"This is a string"
nil
nil

My problem is the nil values shown above, which end up in dna_data when I use read.dna_data = line.strip.

How do you get rid of them, please? How do you avoid them in the first place? What am I missing? Is my approach 'smelly'? Thank you.


The problem is that the code creates a new instance of Read for each line. Instead, it should create an instance for each section. It appears that a section starts with the DNA header, so:

def parse_file(myfile)
  $/ = "\n\n"
  records = []
  File.open(myfile) do |f|
    read = nil                              # <- NEW
    f.each_line do |line|
      #read = Read.new                      # <- DELETED
      case line
      when /^DNA/
        read = Read.new                     # <- NEW
        read.dna_data = line.strip
      when /^BaseQuality/
        read.quality_data = line.strip
      when /^Metadata/
        read.metadata = line.strip
        records.push read                   # <= ADDED
      else
        puts "Unrecognized line: #{line}"
      end
      #records.push read                    # <= DELETED
    end
  end
  records
end

Having the parsed record pushed onto the records array after reading metadata works, but only if each record always contains metadata and the metadata is always last. We can make the program more forgiving of changes in the data layout by pushing the read onto records when it is first created:

def parse_file(myfile)
  $/ = "\n\n"
  records = []
  File.open(myfile) do |f|
    f.each_line do |line|
      case line
      when /^DNA/
        records << Read.new
        records.last.dna_data = line.strip
      when /^BaseQuality/
        records.last.quality_data = line.strip
      when /^Metadata/
        records.last.metadata = line.strip
      else
        puts "Unrecognized line: #{line}"
      end
    end
  end
  records
end
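
One caveat with the version above: records.last is nil if a BaseQuality or Metadata section turns up before any DNA section, so the assignment would raise NoMethodError. The following is a minimal sketch of one way to guard against that, assuming Ruby 2.3 or later for the &. operator. The method name parse_records and the choice to take an IO object instead of a filename are illustrative assumptions, not part of the original code. It also passes the separator to each_line directly, which avoids changing the global $/.

```ruby
require "stringio"

# Same Read class as in the question.
class Read
  attr_accessor :dna_data, :quality_data, :metadata
end

# Hypothetical helper: parses blank-line-separated sections from any IO.
# A DNA section starts a new record; other sections attach to the latest one.
def parse_records(io)
  records = []
  # The separator argument replaces the global assignment $/ = "\n\n".
  io.each_line("\n\n") do |section|
    section = section.strip
    next if section.empty?
    case section
    when /\ADNA/
      records << Read.new
      records.last.dna_data = section
    when /\ABaseQuality/
      # &. skips the assignment when no DNA section has been seen yet.
      records.last&.quality_data = section
    when /\AMetadata/
      records.last&.metadata = section
    else
      warn "Unrecognized section: #{section}"
    end
  end
  records
end

# Usage with an in-memory string; a file works the same way:
#   File.open(myfile) { |f| parse_records(f) }
sample = StringIO.new("DNA : \nThis is a string\n\nBaseQuality :\n4 4 4\n\nMetadata : \nIs_read\n")
parse_records(sample).each { |r| puts r.dna_data }
```

Silently skipping an orphaned section is a design choice; raising an error there instead may be preferable if the input is expected to be well formed.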


You may wish to see if BioRuby is appropriate to your needs. I use it to handle quality sequences as well as nucleotide sequences.


First off, I would avoid using Ruby for bioinformatics; it's not fast enough for certain sets of problems. Sooner or later you will hit issues and your program will crawl to a stop.

From what I gathered, you are trying to remove nils from an array. Here are two ways of doing so:

  1. use the compact method.

    [nil, nil, 'asdfa'].compact # >> ['asdfa']

  2. don't add nil when you are adding elements.

    records.push read unless read.nil?

    records.push read if read # nil gets evaluated to false.
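
Note that in the question's code the array elements themselves are never nil: each one is a Read object, and it is the dna_data attribute that is nil. Calling compact on records directly would therefore remove nothing. A minimal sketch of how the two ideas above might be applied to attribute values instead (the variable names a, b, c, dna_values and with_dna are made up for illustration):

```ruby
# Same Read class as in the question.
class Read
  attr_accessor :dna_data, :quality_data, :metadata
end

# Mimic what the question's loop produces: one Read per section,
# each with only a single attribute set.
a = Read.new
a.dna_data = "This is a string"
b = Read.new
b.quality_data = "4 4 4 4"
c = Read.new
c.metadata = "Is_read"
records = [a, b, c]

# compact leaves all three objects in place, since none of them is nil.
still_three = records.compact.size

# Extract the attribute first, then drop the nils.
dna_values = records.map(&:dna_data).compact

# Or keep only the records that actually carry DNA data.
with_dna = records.reject { |r| r.dna_data.nil? }

puts dna_values
```

This only hides the symptom; creating one Read per section group, as in the first answer, avoids the nil attributes in the first place.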
