Avoiding/Removing nil when using a case statement while parsing a string
sample data:
DNA :
This is a string
BaseQuality :
4 4 4 4 4 4 6 7 7 7
Metadata :
Is_read
DNA :
yet another string
BaseQuality :
4 4 4 4 7 7 4 8 4 4 4 4 4
Metadata :
Is_read
SCF_File
.
.
.
I have a method that uses a case statement, as follows, to separate parts of a longer text file into records using the delimiter "\n\n", and a class that models a data object:
def parse_file(myfile)
$/ = "\n\n"
records = []
File.open(myfile) do |f|
f.each_line do |line|
read = Read.new
case line
when /^DNA/
read.dna_data = line.strip
when /^BaseQuality/
read.quality_data =line.strip
when /^Metadata/
read.metadata =line.strip
else
puts "Unrecognized line: #{line}"
end
records.push read
end
end
records
end
class Read
attr_accessor :dna_data,:quality_data,:metadata
end
records.each do |r|
puts r.dna_data
end
dna_data contains the 'rightful' string as well as two irritating nils:
"This is a string"
nil
nil
My problem is the nils shown above, which are assigned to dna_data when using read.dna_data = line.
How do you get rid of them? How do you avoid them in the first place? What am I missing? Is my approach 'smelly'? Thank you.
The problem is that the code creates a new instance of Read for each line. Instead, it should create an instance for each section. It appears that a section starts with the DNA header, so:
def parse_file(myfile)
$/ = "\n\n"
records = []
File.open(myfile) do |f|
read = nil # <- NEW
f.each_line do |line|
#read = Read.new # <- DELETED
case line
when /^DNA/
read = Read.new # <- NEW
read.dna_data = line.strip
when /^BaseQuality/
read.quality_data = line.strip
when /^Metadata/
read.metadata = line.strip
records.push read # <= ADDED
else
puts "Unrecognized line: #{line}"
end
#records.push read # <= DELETED
end
end
records
end
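For reference, the symptom from the question can be reproduced in a few lines. This is only a sketch: it reads from an in-memory string instead of a file, and it assumes the sections are separated by a blank line, as the "\n\n" separator in the question implies. The sample text is shortened from the question's data.

```ruby
require "stringio"

# Same shape as the Read class in the question.
class Read
  attr_accessor :dna_data, :quality_data, :metadata
end

# Assumed layout: three sections separated by blank lines.
sample = "DNA :\nThis is a string\n\nBaseQuality :\n4 4 4\n\nMetadata :\nIs_read\n"

# One Read per section, which is what the original code does.
per_section = StringIO.new(sample).each_line("\n\n").map do |section|
  read = Read.new
  read.dna_data = section.strip if section =~ /^DNA/
  read
end

# Only the first object ever had dna_data set; the other two keep the
# default value of an unset attribute, which is nil.
dna_values = per_section.map(&:dna_data)
p dna_values
```

Creating the Read only when a DNA header is seen, as in the code above, leaves one object per record instead of one per section.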
Having the parsed record pushed onto the records array after reading metadata works, but only if each record always contains metadata and the metadata is always last. We can make the program more forgiving of changes in the data layout by pushing the read onto records when it is first created:
def parse_file(myfile)
$/ = "\n\n"
records = []
File.open(myfile) do |f|
f.each_line do |line|
case line
when /^DNA/
records << Read.new
records.last.dna_data = line.strip
when /^BaseQuality/
records.last.quality_data = line.strip
when /^Metadata/
records.last.metadata = line.strip
else
puts "Unrecognized line: #{line}"
end
end
end
records
end
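A possible variation, sketched here under a few assumptions: the separator can be passed straight to each_line, so the global $/ does not have to be changed for the rest of the program. The name parse_io is hypothetical; it takes any IO-like object (a File or a StringIO) rather than a filename so that it can be tried without a file on disk. The sketch also assumes that a section appearing before the first DNA header should be reported and skipped.

```ruby
require "stringio"

class Read
  attr_accessor :dna_data, :quality_data, :metadata
end

# Hypothetical variant of parse_file that does not touch the global $/.
def parse_io(io)
  records = []
  io.each_line("\n\n") do |section|
    section = section.strip
    next if section.empty?

    # Assumption: a record always starts with a DNA section.
    if records.empty? && section !~ /\ADNA/
      warn "Section before first DNA header: #{section}"
      next
    end

    case section
    when /\ADNA/
      records << Read.new
      records.last.dna_data = section
    when /\ABaseQuality/
      records.last.quality_data = section
    when /\AMetadata/
      records.last.metadata = section
    else
      warn "Unrecognized section: #{section}"
    end
  end
  records
end

sample = <<~DATA
  DNA :
  This is a string

  BaseQuality :
  4 4 4 4 4 4 6 7 7 7

  Metadata :
  Is_read

  DNA :
  yet another string

  BaseQuality :
  4 4 4 4 7 7 4 8 4 4 4 4 4

  Metadata :
  Is_read
  SCF_File
DATA

records = parse_io(StringIO.new(sample))
records.each { |r| puts r.dna_data }
```

With a real file the call would look like File.open(myfile) { |f| parse_io(f) }. The patterns use \A rather than ^ because each section spans several lines, and ^ would also match a header word at the start of any later line in the section.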
You may wish to see if BioRuby is appropriate to your needs. I use it to handle quality sequences as well as nucleotide sequences.
First off, I would avoid using Ruby for bioinformatics: it's not fast enough for certain sets of problems. Sooner or later you will hit issues and your program will crawl to a stop.
From what I gathered, you are trying to remove nils from an array. Here are two ways of doing so:
use the compact method.
[nil, nil, 'asdfa'].compact # >> ['asdfa']
don't add nil when you are adding elements.
records.push read unless read.nil?
records.push read if read # nil gets evaluated to false.
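Applied to the situation in the question, the two ideas might look like the sketch below. It builds three Read objects by hand (an assumption standing in for the output of the original parse_file) and then filters either the attribute values or the objects themselves.

```ruby
class Read
  attr_accessor :dna_data, :quality_data, :metadata
end

# Three Read objects where only the first has dna_data set,
# mirroring what the original parse_file returns for one record.
reads = Array.new(3) { Read.new }
reads.first.dna_data = "This is a string"

# Collect the attribute, then drop the nils with compact.
dna_values = reads.map(&:dna_data)
cleaned = dna_values.compact

# Or keep only the objects whose dna_data was actually set.
with_dna = reads.reject { |r| r.dna_data.nil? }

p cleaned
p with_dna.size
```

Note that in the question the Read objects themselves are never nil; only their attributes are. A check such as "unless read.nil?" would therefore not filter anything there, and the test has to look at the attribute instead. Filtering also only hides the symptom: the quality and metadata values still sit on separate objects.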