Serialising and counting a list of values

I need to serialise a large list of values using a custom encoding function (which I have). I've done this and it works, but I'd also like to have it count how many values are being serialised and written to disk whilst still using a relatively constant amount of memory (i.e. it shouldn't need to keep the entire input list around, as it gets very large).

Without the requirement of keeping a count, binary, cereal and blaze-builder all work fine (using the equivalent of B.writeFile "foo" . runPut . mapM_ encodeValue). But as soon as I try to thread a count through, no matter what I do with any of these libraries, the resulting ByteString seems to be kept in memory until serialisation is finished, rather than being written to disk as soon as a chunk is available (even when using toByteStringIO from blaze-builder).
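
For reference, the counting-free version that behaves well looks roughly like this (a minimal sketch; encodeValue here is just a stand-in, using put, for the custom encoding function):

import Data.Binary
import Data.Binary.Put
import qualified Data.ByteString.Lazy as B

-- Stand-in for the custom encoding function mentioned above.
encodeValue :: Int -> Put
encodeValue = put

-- runPut produces a lazy ByteString, so writeFile can stream it to disk
-- chunk by chunk without holding the whole output in memory.
main :: IO ()
main = B.writeFile "foo" . runPut . mapM_ encodeValue $ ([1..10000000] :: [Int])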

This is a minimal example demonstrating what I've been trying to do:

import Data.Binary
import Data.Binary.Put
import Control.Monad(foldM)
import qualified Data.ByteString.Lazy as B

main :: IO ()
main = do let ns = [1..10000000] :: [Int]
              (count,b) = runPutM $ foldM (\ c n -> c `seq` (put n >> return (c+1))) (0 :: Int) ns
          B.writeFile "testOut" b
          print count

When compiled and run with +RTS -hy, the result is an almost triangular graph dominated by ByteString values.

The only solution I've found so far (and one I'm not a big fan of) is to do the looping in IO (either directly or with foldM), using B.appendFile rather than working within Put or directly constructing a Builder value, roughly as sketched below. It doesn't seem very elegant. Is there a better way?
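
For concreteness, that workaround might look something like this (a sketch, not the exact code from the question; note that appendFile reopens the file on every call, which is part of why it feels clunky):

import Data.Binary
import Data.Binary.Put
import Control.Monad (foldM)
import qualified Data.ByteString.Lazy as B

main :: IO ()
main = do
    let ns = [1..10000000] :: [Int]
    -- appendFile only appends, so make sure the file starts out empty
    B.writeFile "testOut" B.empty
    -- fold in IO: write each encoded value as it is produced, counting as we go
    count <- foldM step (0 :: Int) ns
    print count
  where
    step c n = do
      B.appendFile "testOut" (runPut (put n))
      return $! c + 1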


I'm a bit surprised that toByteStringIO doesn't work; hopefully someone more familiar with that library will provide an answer.

That being said, whenever I want to intermix stream processing with IO actions, I usually find iteratees to be the most elegant solution. This is because they allow for precise control over how much data is processed and retained, and for combining the streaming aspects with other arbitrary IO actions. There are several iteratee implementations on hackage; this example is with "iteratee" because it's the one I'm most familiar with.

{-# LANGUAGE BangPatterns #-}

import Data.Binary          -- for the Binary class's put
import Data.Binary.Put
import Control.Monad
import Control.Monad.IO.Class
import qualified Data.ByteString.Lazy as B
import Data.ByteString.Lazy.Internal (defaultChunkSize)
import Data.Iteratee hiding (foldM)
import qualified Data.Iteratee as I

main :: IO ()
main = do 
  let ns = [1..80000000] :: [Int]
  iter <- enumPureNChunk ns (defaultChunkSize `div` 8)
                            (joinI $ serializer $ writer "testOut")
  count <- run iter
  print count

-- Enumeratee: turn each [Int] chunk into a single (element count, encoded bytes) pair.
serializer = mapChunks ((:[]) . runPutM . foldM
   (\ !cnt n -> put n >> return (cnt+1)) 0)

-- Iteratee: append each encoded chunk to the file and accumulate the total count.
writer fp = I.foldM
   (\ !cnt (len,ck) -> liftIO (B.appendFile fp ck) >> return (cnt+len))
   0

There are three parts to this. writer is the "iteratee", i.e. a data consumer. It writes each chunk of data as it's received and keeps a running count of the length. serializer is a stream transformer, a.k.a. "enumeratee". It takes an input chunk of type [Int] and serializes it into a stream of type [(Int, B.ByteString)] (number of elements, bytestring). Finally, enumPureNChunk is the "enumerator", which produces a stream, in this case from the input list. It takes enough elements from the input to fill a single lazy bytestring chunk (I'm on 64bit; divide by 4 for 32bit systems), sends them down the pipeline to be written to disk, and then they can be GC'd.
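
For orientation, the types of the two pieces above work out roughly as follows (a sketch only; the class constraints the iteratee package actually requires are elided):

-- Approximate shapes, constraints elided:
--
--   serializer :: Monad m   => Enumeratee [Int] [(Int, B.ByteString)] m a
--   writer     :: MonadIO m => FilePath -> Iteratee [(Int, B.ByteString)] m Int
--
-- joinI fuses the enumeratee onto the iteratee to give an Iteratee [Int] m Int,
-- enumPureNChunk feeds it from the list in fixed-size chunks, and run extracts
-- the final Int count.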
