Parsing data with Clojure, interval problem
I'm writing a little parser in Clojure for learning purposes. It is basically a TSV file parser whose output needs to be put in a database, but I added a complication: the same file contains several intervals. The file looks like this:
###andreadipersio 2010-03-19 16:10:00###
USER COMM PID PPID %CPU %MEM TIME
root launchd 1 0 0.0 0.0 2:46.97
root DirectoryService 11 1 0.0 0.2 0:34.59
root notifyd 12 1 0.0 0.0 0:20.83
root diskarbitrationd 13 1 0.0 0.0 0:02.84
....
###andreadipersio 2010-03-19 16:20:00###
USER COMM PID PPID %CPU %MEM TIME
root launchd 1 0 0.0 0.0 2:46.97
root DirectoryService 11 1 0.0 0.2 0:34.59
root notifyd 12 1 0.0 0.0 0:20.83
root diskarbitrationd 13 1 0.0 0.0 0:02.84
I ended up with this code:
(defn is-header?
  "Return true if a line is header"
  [line]
  (> (count (re-find #"^\#{3}" line)) 0))

(defn extract-fields
  "Return regex matches"
  [line pattern]
  (rest (re-find pattern line)))

(defn process-lines
  [lines]
  (map process-line lines))

(defn process-line
  [line]
  (if (is-header? line)
    (extract-fields line header-pattern))
  (extract-fields line data-pattern))
My idea is that in 'process-line' the interval needs to be merged with the data, so that I get something like this:
('andreadipersio', '2010-03-19', '16:10:00', 'root', 'launchd', 1, 0, 0.0, 0.0, '2:46.97')
for every row until the next interval, but I can't figure out how to make this happen.
I tried with something like this:
(def process-line
[line]
(if is-header? line)
(def header-data (extract-fields line header-pattern)))
(cons header-data (extract-fields line data-pattern)))
But this doesn't work as expected.
Any hints?
Thanks!
A possible approach:
1. Split the input into lines with line-seq. (If you want to test this on a string, you can obtain a line-seq on it by doing (line-seq (java.io.BufferedReader. (java.io.StringReader. test-string))).)
2. Partition it into sub-sequences, each of which contains either a single header line or some number of "process lines", with (clojure.contrib.seq/partition-by is-header? your-seq-of-lines).
3. Assuming there's at least one process line after each header, (partition 2 *2) (where *2 is the sequence obtained in step 2 above) will return a sequence of a form resembling the following: (((header-1) (process-line-1 process-line-2)) ((header-2) (process-line-3 process-line-4))). If the input might contain some header lines not followed by any data lines, then the above could look like (((header-1a header-1b) (process-line-1 process-line-2)) ...).
4. Finally, transform the output of step 3 (*3) with the following function:
(defn extract-fields-add-headers
  [[headers process-lines]]
  (let [header-fields (extract-fields (last headers) header-pattern)]
    (map #(concat header-fields (extract-fields % data-pattern))
         process-lines)))
(To explain the (last headers) bit: the only case where we'll get multiple headers here is when some of them have no data lines of their own; the one actually attached to the data lines is the last one.)
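To make the intermediate shapes from steps 2 and 3 concrete, here is a minimal sketch on a shortened copy of the sample input. The names sample-lines, header-line?, runs and groups are made up for this illustration, and it assumes Clojure 1.2 or later, where partition-by lives in clojure.core and no contrib library is needed.

```clojure
;; Hypothetical sample lines, shortened from the input in the question.
(def sample-lines
  ["###andreadipersio 2010-03-19 16:10:00###"
   "root launchd 1 0 0.0 0.0 2:46.97"
   "root notifyd 12 1 0.0 0.0 0:20.83"
   "###andreadipersio 2010-03-19 16:20:00###"
   "root launchd 1 0 0.0 0.0 2:46.97"])

(defn header-line?
  "True when the line starts with ###. Stand-in for is-header? above."
  [line]
  (boolean (re-find #"^###" line)))

;; Step 2: split into runs of consecutive lines that agree on header-line?.
(def runs (partition-by header-line? sample-lines))

;; Step 3: pair each run of headers with the run of data lines after it.
(def groups (partition 2 runs))
```

Each element of groups is a pair of (header lines, data lines), which is the argument shape that extract-fields-add-headers destructures.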
With these example patterns:
(def data-pattern #"(\w+)\s+(\w+)\s+(\d+)\s+(\d+)\s+([0-9.]+)\s+([0-9.]+)\s+([0-9:.]+)")
(def header-pattern #"###(\w+)\s+([0-9-]+)\s+([0-9:]+)###")
;; we'll need to throw out the "USER COMM ..." lines,
;; empty lines and the "..." line which I haven't bothered
;; to remove from your sample input
(def discard-pattern #"^USER\s+COMM|^$|^\.\.\.")
the whole 'pipe' might look like this:
;; just a reminder, normally you'd put this in an ns form:
(use '[clojure.contrib.seq :only (partition-by)])

(->> (line-seq (java.io.BufferedReader. (java.io.StringReader. test-data)))
     (remove #(re-find discard-pattern %)) ; throw out "USER COMM ..."
     (partition-by is-header?)
     (partition 2)
     ;; mapcat performs a map, then concatenates results
     (mapcat extract-fields-add-headers))
(With the line-seq presumably taking input from a different source in your final programme.)
With your example input, the above produces output like this (line breaks added for clarity):
(("andreadipersio" "2010-03-19" "16:10:00" "root" "launchd" "1" "0" "0.0" "0.0" "2:46.97")
("andreadipersio" "2010-03-19" "16:10:00" "root" "DirectoryService" "11" "1" "0.0" "0.2" "0:34.59")
("andreadipersio" "2010-03-19" "16:10:00" "root" "notifyd" "12" "1" "0.0" "0.0" "0:20.83")
("andreadipersio" "2010-03-19" "16:10:00" "root" "diskarbitrationd" "13" "1" "0.0" "0.0" "0:02.84")
("andreadipersio" "2010-03-19" "16:20:00" "root" "launchd" "1" "0" "0.0" "0.0" "2:46.97")
("andreadipersio" "2010-03-19" "16:20:00" "root" "DirectoryService" "11" "1" "0.0" "0.2" "0:34.59")
("andreadipersio" "2010-03-19" "16:20:00" "root" "notifyd" "12" "1" "0.0" "0.0" "0:20.83")
("andreadipersio" "2010-03-19" "16:20:00" "root" "diskarbitrationd" "13" "1" "0.0" "0.0" "0:02.84"))
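Every field in that output is still a string, whereas the row sketched in the question has numeric PID, PPID, %CPU and %MEM values. If that matters for the database insert, one way to coerce them is sketched below. coerce-row is a hypothetical helper, and it assumes every row has exactly the ten fields shown above.

```clojure
(defn coerce-row
  "Hypothetical helper: parses PID and PPID as longs and %CPU and %MEM as
  doubles, leaving the remaining fields as strings. Assumes the ten-field
  row layout produced by the pipeline above."
  [[login date clock user comm pid ppid cpu mem cpu-time]]
  [login date clock user comm
   (Long/parseLong pid)
   (Long/parseLong ppid)
   (Double/parseDouble cpu)
   (Double/parseDouble mem)
   cpu-time])
```

It could be appended to the pipeline as one more step, for example (map coerce-row).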
You're doing (> (count (re-find #"^\#{3}" line)) 0), but you can just do (re-find #"^\#{3}" line) and use the result as a boolean. re-find returns nil if the match fails.
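A short sketch of that simplification, reusing the is-header? name from the question. The line-kind helper is made up here only to show the result being used directly in a conditional.

```clojure
(defn is-header?
  "Returns the matched prefix (truthy) for a header line, nil otherwise."
  [line]
  (re-find #"^#{3}" line))

(defn line-kind
  "Hypothetical helper: nil is falsey, so the re-find result works as the test."
  [line]
  (if (is-header? line) :header :data))

(is-header? "###andreadipersio 2010-03-19 16:10:00###") ; => "###"
(is-header? "root launchd 1 0 0.0 0.0 2:46.97")          ; => nil
```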
If you're iterating over the items in a collection, and you want to skip some items or combine two or more items in the original into one item in the result, then 99% of the time you want reduce. This usually ends up being very straightforward.
;; These two libs are called "io" and "string" in bleeding-edge clojure-contrib
;; and some of the function names are different.
(require '(clojure.contrib [str-utils :as s]
                           [duck-streams :as io])) ; SO's syntax-highlighter still sucks

(defn clean [line]
  (s/re-gsub #"^###|###\s*$" "" line))

(defn interval? [line]
  (re-find #"^#{3}" line))

(defn skip? [line]
  (or (empty? line)
      (re-find #"^USER" line)))

(defn parse-line [line]
  (s/re-split #"\s+" (clean line)))

(defn parse [file]
  (first
    (reduce
      (fn [[data interval] line]
        (cond
          (interval? line) [data (parse-line line)]
          (skip? line)     [data interval]
          :else            [(conj data (concat interval (parse-line line))) interval]))
      [[] nil]
      (io/read-lines file))))
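The monolithic clojure-contrib used above has since been retired, so here is a sketch of the same reduce using only namespaces that ship with Clojure (clojure.string and clojure.java.io, both available since 1.2). The split into parse-lines and parse is an assumption of this sketch, made so the reduce can be tried on a plain vector of strings without touching a file.

```clojure
(require '[clojure.string :as str]
         '[clojure.java.io :as io])

(defn clean [line]
  ;; Strip the leading and trailing ### from an interval line.
  (str/replace line #"^###|###\s*$" ""))

(defn interval? [line]
  (re-find #"^#{3}" line))

(defn skip? [line]
  ;; Blank lines and the "USER COMM ..." column header carry no data.
  (or (str/blank? line)
      (re-find #"^USER" line)))

(defn parse-line [line]
  (str/split (clean line) #"\s+"))

(defn parse-lines
  "Accumulator is [rows current-interval]; only the rows are returned."
  [lines]
  (first
    (reduce
      (fn [[data interval] line]
        (cond
          (interval? line) [data (parse-line line)]
          (skip? line)     [data interval]
          :else            [(conj data (vec (concat interval (parse-line line))))
                            interval]))
      [[] nil]
      lines)))

(defn parse [file]
  ;; reduce is eager, so all lines are consumed before the reader closes.
  (with-open [rdr (io/reader file)]
    (parse-lines (line-seq rdr))))
```

Like the version above, this does not skip the "...." placeholder line from the sample input; it would need its own branch in skip?.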
I'm not totally sure based on your description, but perhaps you're just slipping up on the syntax. Is this what you want to do?
(defn process-line [line]
  (if (is-header? line)                   ; extra parens here over your version
    (extract-fields line header-pattern)  ; returning this result
    (extract-fields line data-pattern)))  ; implicit "else"
If the intent of your "cons" is to group together headers with their associated detail data, you'll need some more code to accomplish that, but if it's just an attempt at "coalescing" and returning either a header or detail line depending on which it is, then this should be correct.