Improving Clojure lazy-seq usage for iterative text parsing
I'm writing a Clojure implementation of this coding challenge: finding the average length of the sequence records in a file in FASTA format:
>1
GATCGA
GTC
>2
GCA
>3
AAAAA
For more background see this related StackOverflow post about an Erlang solution.
My beginner Clojure attempt uses lazy-seq to read the file one record at a time, so that it will scale to large files. However, it is fairly memory hungry and slow, so I suspect it's not implemented optimally. Here is a solution using the BioJava library to abstract out the parsing of the records:
(import '(org.biojava.bio.seq.io SeqIOTools))
(use '[clojure.contrib.duck-streams :only (reader)])

(defn seq-lengths
  "Produce a lazy collection of sequence lengths given a BioJava StreamReader."
  [seq-iter]
  (lazy-seq
    (when (.hasNext seq-iter)
      (cons (.length (.nextSequence seq-iter)) (seq-lengths seq-iter)))))

(defn fasta-to-lengths
  "Use BioJava to read a Fasta input file as a StreamReader of sequences."
  [in-file seq-type]
  (seq-lengths (SeqIOTools/fileToBiojava "fasta" seq-type (reader in-file))))

(defn average [coll]
  (/ (reduce + coll) (count coll)))

(when *command-line-args*
  (println
    (average (apply fasta-to-lengths *command-line-args*))))
and an equivalent approach without external libraries:
(use '[clojure.contrib.duck-streams :only (read-lines)])

(defn seq-lengths
  "Retrieve lengths of sequences in the file using line lengths."
  [lines cur-length]
  (lazy-seq
    (let [cur-line (first lines)
          remain-lines (rest lines)]
      (cond
        (nil? cur-line) [cur-length]
        (= \> (first cur-line)) (cons cur-length (seq-lengths remain-lines 0))
        :else (seq-lengths remain-lines (+ cur-length (.length cur-line)))))))

(defn fasta-to-lengths-bland [in-file seq-type]
  ; pop off the first item, since it will be everything up to the first >
  (rest (seq-lengths (read-lines in-file) 0)))

(defn average [coll]
  (/ (reduce + coll) (count coll)))

(when *command-line-args*
  (println
    (average (apply fasta-to-lengths-bland *command-line-args*))))
The current implementation takes 44 seconds on a large file compared to 7 seconds for a Python implementation. Can you offer any suggestions on speeding the code up and making it more intuitive? Is the usage of lazy-seq correctly parsing the file record by record as intended?
It probably doesn't matter, but average is holding onto the head of the seq of lengths.
The following is a wholly untested, but lazier way to do what I think you want.
(use 'clojure.java.io) ; since 1.2

(defn lazy-avg [coll]
  (let [f (fn [[v c] val] [(+ v val) (inc c)])
        [sum cnt] (reduce f [0 0] coll)]
    (if (zero? cnt) 0 (/ sum cnt))))
(defn fasta-avg [f]
  (->> (reader f)
       line-seq
       ; group runs of header vs. body lines, drop the header groups,
       ; then sum each record's body lines so multi-line records (like
       ; record 1 above) count as one length, not one per line
       (partition-by #(.startsWith ^String % ">"))
       (remove #(.startsWith ^String (first %) ">"))
       (map (fn [lines] (reduce + (map #(.length ^String %) lines))))
       lazy-avg))
Your average function is non-lazy -- it needs to realise the entire coll argument while holding onto its head. Update: Just realised that my original answer included a nonsensical suggestion as to how to solve the above problem... argh. Fortunately ataggart has since posted a correct solution.
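To see the fix in miniature (this is essentially the shape of ataggart's lazy-avg): folding the sum and the count together in a single reduce means the lengths can be consumed and garbage-collected as they stream by, whereas (/ (reduce + coll) (count coll)) traverses coll twice while a local still points at its head.

```clojure
;; Single-pass average: one reduce carries [sum count] as its
;; accumulator, so the input seq is never retained as a whole.
(defn one-pass-avg [coll]
  (let [[sum cnt] (reduce (fn [[s c] x] [(+ s x) (inc c)]) [0 0] coll)]
    (when (pos? cnt)
      (/ sum cnt))))
```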
Other than that, your code does seem lazy at first glance, though the use of read-lines is currently discouraged (use line-seq instead).
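For reference, a minimal line-seq sketch (the function name is hypothetical) using clojure.java.io; note that with-open closes the reader deterministically, so the whole lazy seq must be consumed before the body exits:

```clojure
(require '[clojure.java.io :as io])

;; Sum the lengths of all non-header lines in a FASTA file.
;; line-seq is lazy, and the reduce forces it fully inside with-open,
;; so the file handle is released as soon as the sum is computed.
(defn total-residues [path]
  (with-open [r (io/reader path)]
    (->> (line-seq r)
         (remove #(.startsWith ^String % ">"))
         (map #(.length ^String %))
         (reduce +))))
```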
If the file is really large and your functions will be called a large number of times, type-hinting seq-iter in the argument vector of seq-lengths -- ^NameOfBiojavaSeqIterClass seq-iter; use #^ in place of ^ if you're on Clojure 1.1 -- might make a significant difference. In fact, (set! *warn-on-reflection* true), then compile your code and add type hints to remove all reflection warnings.
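A small self-contained illustration of the idea (the actual hint for seq-iter would be whatever class SeqIOTools/fileToBiojava returns; String here is just a stand-in):

```clojure
;; With *warn-on-reflection* on, an unhinted (.length s) prints a
;; "reference to field length can't be resolved" warning and is
;; dispatched by reflection at runtime; the ^String hint lets the
;; compiler emit a direct method call instead.
(set! *warn-on-reflection* true)

(defn line-length [^String s]
  (.length s))
```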