IO over big files in haskell: Performance issue

2023-02-19 05:15 问答作者：

I'm trying to work over big files using Haskell. I'd like to browse an input file byte after byte, and to generate an output byte after byte. Of course I need the IO to be buffered with blocks of reasonable size (a few KB). I can't do it, and I need your help please.

import System 
import qualified Data.ByteString.Lazy as BL 
import Data.Word  
import Data.List

main :: IO () 
main =     
    do         
        args <- System.getArgs         
        let filename = head args         
        byteString <- BL.readFile filename         
        let wordsList = BL.unpack byteString         
        let foldFun acc word = doSomeStuff word : acc
        let wordsListCopy = foldl' foldFun [] wordsList
        let byteStringCopy = BL.pack (reverse wordsListCopy)
        BL.writeFile (filename ++ ".cpy") byteStringCopy
    where
        doSomeStuff = id

I name 开发者_C百科this file TestCopy.hs, then do the following:

$ ls -l *MB
-rwxrwxrwx 1 root root 10000000 2011-03-24 13:11 10MB
-rwxrwxrwx 1 root root  5000000 2011-03-24 13:31 5MB
$ ghc --make -O TestCopy.hs 
[1 of 1] Compiling Main             ( TestCopy.hs, TestCopy.o )
Linking TestCopy ...
$ time ./TestCopy 5MB

real    0m5.631s
user    0m1.972s
sys 0m2.488s
$ diff 5MB 5MB.cpy
$ time ./TestCopy 10MB 

real    3m6.671s
user    0m3.404s
sys 1m21.649s
$ diff 10MB 10MB.cpy 
$ time ./TestCopy 10MB +RTS -K500M -RTS

real    2m50.261s
user    0m3.808s
sys 1m13.849s
$ diff 10MB 10MB.cpy 
$

My problem: There is a huge difference between a 5MB and a 10 MB file. I'd like the performances to be linear in the size of the input file. Please what am i doing wrong, and how can I achieve this? I don't mind using lazy bytestrings or anything else as long as it works, but it has to be a standard ghc library.

Precision: It's for a university project. And I'm not trying to copy files. The doSomeStuff function shall perform compression/decompression actions that I have to customize.

For chunked input processing I would use the enumerator package.

import Data.Enumerator
import Data.Enumerator.Binary (enumFile)

We use bytestrings

import Data.ByteString as BS

and IO

import Control.Monad.Trans (liftIO)
import Control.Monad (mapM_)
import System (getArgs)

Your main function could look like following:

main =
  do (filepath:_) <- getArgs
     let destination
     run_ $ enumFile filepath $$ writeFile (filepath ++ ".cpy")

enumFile reads 4096 bytes per chunk and passes these to writeFile, which writes it down.

enumWrite is defined as:

enumWrite :: FilePath -> Iteratee BS.ByteString IO ()
enumWrite filepath =
  do liftIO (BS.writeFile filepath BS.empty)   -- ensure the destination is empty
     continue step
  where
  step (Chunks xs) =
    do liftIO (mapM_ (BS.appendFile filepath) xs)
       continue step
  step EOF         = yield () EOF

As you can see, the step function takes chunks of bytestrings and appends them to the destination file. These chunks have the type Stream BS.Bytestring, where Stream is defined as:

data Stream a = Chunks [a] | EOF

On an EOF step terminates, yielding ().

To have a much more elaborate read on this I personally recommend Michael Snoymans tutorial

The numbers

$ time ./TestCopy 5MB
./TestCopy 5MB  2,91s user 0,32s system 96% cpu 3,356 total

$ time ./TestCopy2 5MB
./TestCopy2 5MB  0,04s user 0,03s system 93% cpu 0,075 total

That's quite an improvement. Now in order to implement your fold you probably want to write an Enumeratee, which is used to transform a input stream. Fortunately there is already a map function defined in the enumerator package, which can be modified for your need, i.e. it can be modified to carry over state.

On the construction of the intermediate result

You construct wordsList in reverse order and reverse it afterwards. I think difference lists do a better job, because appends take only O(1) time due to the fact that appending is only a function composition. I'm not sure whether they takes more space though. Here's a rough sketch of difference lists:

type DList a = [a] -> [a]

emptyList :: DList a
emptyList = id

snoc :: DList a -> a -> DList a
snoc dlist a = dlist . (a:)

toList :: DList a -> [a]
toList dlist = dlist []

This answer is probably not needed anymore, but I added it for completeness.

I take it this is a follow on to Reading large file in haskell? from yesterday.

Try compiling with "-rtsopts -O2" instead of just "-O".

You claim "I'd like to browse an input file byte after byte, and to generate an output byte after byte." but your code reads the entire input before trying to create any output. This is just not very representative of the goal.

With my system I see "ghc -rtsopts --make -O2 b.hs" giving

(! 741)-> time ./b five
real 0m2.789s user 0m2.235s sys 0m0.518s
(! 742)-> time ./b ten
real 0m5.830s user 0m4.623s sys 0m1.027s

Which now looks linear to me.

继续阅读：bytestring haskell io lazy-evaluation

IO over big files in haskell: Performance issue

The numbers

On the construction of the intermediate result

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

The numbers

On the construction of the intermediate result

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？