Processing (too) many XML files (with TagSoup)
I have a directory with about 4500 XML (HTML5) files, and I want to create a "manifest" of their data (essentially title and base/@href).
To this end, I've been using a function to collect all the relevant file paths, opening them with readFile, sending the contents into a TagSoup-based parser, and then outputting/formatting the resulting list.
This works for a subset of the files, but eventually runs into an openFile: resource exhausted (Too many open files) error. After doing some reading, this isn't so surprising: I'm using mapM parseMetaDataFile files, which opens all the handles straight away.
What I can't figure out is how to work around the problem. I've tried reading a bit about iteratees; can I hook that up with TagSoup easily? Strict IO, the way I used it anyway (heh), froze my computer even though the files aren't very big (28 KB on average).
Any pointers would be greatly appreciated. I realize the approach of creating a big list might fail as well, but 4.5k elements isn't that long... Also, there should probably be less String and more ByteString everywhere.
Here's some code. I apologize for the naivety:
    import System.FilePath
    import Text.HTML.TagSoup

    data MetaData = MetaData String String deriving (Show, Eq)

    -- | Given HTML input, produces a MetaData structure of its essentials.
    -- Should obviously account for errors, but simplified here.
    readMetaData :: String -> MetaData
    readMetaData input = MetaData title base
      where
        title =
          innerText $
          (takeWhile (~/= TagClose "title") . dropWhile (~/= TagOpen "title" []))
          tags
        base = fromAttrib "href" $ head $ dropWhile (~/= TagOpen "base" []) tags
        tags = parseTags input

    -- | Parses MetaData from a file.
    parseMetaDataFile :: FilePath -> IO MetaData
    parseMetaDataFile path = fmap readMetaData $ readFile path

    -- | From a given root, gets the FilePaths of the files we are interested in.
    -- Not implemented here.
    getHtmlFilePaths :: FilePath -> IO [FilePath]
    getHtmlFilePaths root = undefined

    main :: IO ()
    main = do
      -- Will call openFile for every file, which gives too many open files.
      metas <- mapM parseMetaDataFile =<< getHtmlFilePaths "."  -- "." is a placeholder root
      -- Do stuff with metas, which will cause files to actually be read.
      mapM_ print metas
The quick and dirty solution:
    import Control.Exception (evaluate)
    import System.IO

    parseMetaDataFile path = withFile path ReadMode $ \h -> do
      res@(MetaData x y) <- fmap readMetaData (hGetContents h)
      -- Force both fields so the whole file is read before the handle closes.
      _ <- evaluate (length (x ++ y))
      return res
A slightly nicer solution is to write a proper NFData instance for MetaData, instead of just using evaluate.
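For instance (a minimal sketch, assuming the deepseq package; NFData, rnf, and force come from Control.DeepSeq, everything else from base or the question's code):

    import Control.DeepSeq (NFData (..), force)
    import Control.Exception (evaluate)
    import System.IO

    instance NFData MetaData where
      rnf (MetaData title base) = rnf title `seq` rnf base

    -- Fully evaluate the parsed MetaData while the handle is still open,
    -- so withFile can safely close it afterwards.
    parseMetaDataFile :: FilePath -> IO MetaData
    parseMetaDataFile path = withFile path ReadMode $ \h ->
      evaluate . force . readMetaData =<< hGetContents h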
If you want to keep the current design, you must make sure parseMetaDataFile has consumed the entire string from readFile before returning. When readFile reaches end-of-file, the file descriptor is closed.
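Another option, in line with the question's own ByteString remark, is to sidestep lazy IO entirely and read each file strictly: the strict Data.ByteString.Char8.readFile closes the handle before it returns, so mapM over thousands of files never accumulates open descriptors. A minimal sketch (the unpack keeps readMetaData unchanged; for non-ASCII content you would want a proper decoding step):

    import qualified Data.ByteString.Char8 as BS

    -- Strict read: the handle is already closed when BS.readFile returns.
    parseMetaDataFile :: FilePath -> IO MetaData
    parseMetaDataFile path = fmap (readMetaData . BS.unpack) (BS.readFile path)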