Haskell ByteStrings - ending up with large file loaded into memory

2023-01-20 20:34 问答作者：

Greetings,

I'm trying to understand why I'm seeing the entire file loaded into memory with the following program, yet if you comment out the line below "(***)" then the program runs in constant (about 1.5M) space.

EDIT: The file is about 660MB, the field in column 26 is a date string like '2009-10-01', and there are one million lines. The process uses about 810MB by the time it hits the 'getLine'

Am I right in thinking it's related to the splitting of the string using 'split', and that somehow the underlying ByteString that has been read from the file can't be garbage-collected because it's still referenced? But if so, then I thought BS.copy would work around that. Any ideas how to force the computation - I can't seem to get 'seq' into the right place to have an effect.

(NB the source file is tab-separated lines)

Thanks in advance,

Kevin

module Main where

import System.IO
import qualified Data.ByteString.Lazy.Char8 as BS
import Control.Monad


type Record = BS.ByteString

importRecords :: String -> IO [Record]
importRecords filename = do
    liftM (map importRecord.BS.lines) (BS.readFile filename)

importRecord :: BS.ByteString -> Record
importRecord txt = r
  where 
    r = getField 26
    getField f = BS.copy 开发者_JAVA百科$ ((BS.split '\t' txt) !! f)

loopInput :: [Record] -> IO ()
loopInput jrs = do
    putStrLn $ "Done" ++ (show $ last jrs)
    hFlush stdout
    x <- getLine
    return ()

    -- (***)
    loopInput jrs

main = do 
    jrs <- importRecords "c:\\downloads\\lcg1m.txt"
    loopInput jrs

Your call to last forces the list, jrs. To figure that out it must run through the entire file building up thunks for each entry in jrs. Because you aren't evaluating each element in jrs (except the last one) these thunks hang out with references to the bytestring, so that must stay in memory.

The solution is to force the evaluation of those thunks. Because we're talking about space the first thing I did was actually to store your info in a smaller format:

type Year   = Word16
type Month  = Word8
type Day    = Word8
data Record = Rec {-# UNPACK #-} !Year {-# UNPACK #-} !Month {-# UNPACK #-} !Day 
        deriving (Eq, Ord, Show, Read)

This reduces that ugly 10 byte Bytestring (+ overhead of ~16 bytes of structure information) to around 8 bytes.

importRecord now has to call toRecord r to get the right type:

toRecord :: BS.ByteString -> Record
toRecord bs =
    case BS.splitWith (== '-') bs of
            (y:m:d:[]) -> Rec (rup y) (rup m) (rup d)
            _ -> Rec 0 0 0

rup :: (Read a) => BS.ByteString -> a
rup = read . BS.unpack

We'll need to evalute data when we convert from ByteString to Record, so lets use the parallel package and define an NFData instance from DeepSeq.

instance NFData Record where
    rnf (Rec y m d) = y `seq` m `seq` d `seq` ()

Now we're ready to go, I modified main to use evalList, thus forcing the whole list before your function that wants the last one:

main = do
    jrs <- importRecords "./tabLines"
    let jrs' = using jrs (evalList rdeepseq)
    loopInput jrs'

And we can see the heap profile looks beautiful (and top agrees, the program uses very little memory).

Haskell ByteStrings - ending up with large file loaded into memory

Sorry about that other ~~misleading~~ wrong answer - I was hooked on the fact that incremental processing fixes it and didn't really realize the thunks really were hanging around, not sure why my brain glided over that. Though I do stand by the gist, you should incrementally process this information making all of this answer moot.

FYI the huge bytestring didn't show up in those previous heap profiles I posted because foreign allocations (which includes ByteString) aren't tracked by the heap profiler.

There seem to be two questions here:

why does the memory usage depend on the presence or absence of the line (***);
why is the memory usage with (***) present about 800MB, rather than, say, 40MB.

I don't really know what to say about the first one that TomMD didn't already say; inside the loopInput loop, jrs can never be freed, because it's needed as an argument to the recursive call of loopInput. (You know that return () doesn't do anything when (***) is present, right?)

As for the second question, I think you are right that the input ByteString isn't being garbage collected. The reason is that you never evaluate the elements of your list jrs besides the last one, so they still contain references to the original ByteString (even though they are of the form BS.copy ...). I would think that replacing show $ last jrs with show jrs would reduce your memory usage; does it? Alternatively, you could try a stricter map, like

map' f []     = []
map' f (x:xs) = ((:) $! (f $! x)) (map' f xs)

Replace the map in importRecords with map' and see whether that reduces your memory usage.

继续阅读：bytestring haskell lazy-evaluation

Haskell ByteStrings - ending up with large file loaded into memory

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？