开发者

How to control laziness on example of Double each element of list

I tried to solve Duplicate each element in a Haskell List task and make it as a full program, that write the list to standard output

Here is my solution

main :: IO ()
main = getList >>= return . doubleEachItem >>= putStrLn . show

getList = return [1,3,5]

doubleEachItem :: [Int] -> [Int]
doubleEachItem = foldr (++) [] . map (take 2 . repeat)

But when I try to process really long list

getList = return . take 10000000000 $ repeat 15

the program terminated with out of memory error.

The question is: how I can improve the program, so it would be able to process lists of any size?

Edit:

I think program crashed because I run it with runghc command. In that case ghc consume about 3Gbytes of memory and was killed by operating system. The output file after crash was only about 0.6开发者_开发知识库GBytes. The reason of such behavior is unclear for me.

When I compile the program to native executable ghc haskell03.hs -o haskell03 and execute it with redirection to a file ./haskell03 >out03.txt it perfectly runs in constant memory space and produces output file with speed about 20MBytes/sec. When the program finished the output file takes 57GBytes. Total execution time was 47 minutes.


Oh! I think I know what's going on. This is subtle.

runghc runs in interpreted mode, as if you had loaded an run the file through ghci. getList is a top-level binding, so its value (and all values it references) are cached. That means that GHC was trying to cache your entire ten billion element array.

You might think that if you inlined the list it would work:

main = putStrLn . show . doubleEachElement $ [1..10^10]

But not so, because main is a top-level binding as well, and it references [1..10^10], so the giant unfolded version of that list will be cached there as well.

The general rule is, when something doesn't depend on the input, it can be cached. So you can fix this by making it appear to depend on the input. In interpreted mode, GHC is not very smart about this, so it's fairly easy to trick it:

main = do
    n <- return (10^10)
    putStrLn . show . doubleEachElement $ [1..n]

This would work with your getList as well, if getList took a parameter instead of hard-coding 10^10.

Fortunately, in compiled mode, GHC looks harder at the uses of these top-level symbols, so it can see that the list is only used once, and thus can begin garbage collecting its heads as it is output.

This problem does not come up in the real world very often, because programs often depend heavily on their input, so there aren't any top level big constants like this. But yeah, it can be a pain when GHC tries to cache something you don't want it to.


Your code is perfectly lazy, it doesn't need any optimization. What is probably going on is that all the output is on a single line, which is being buffered. You can fix this by turning off buffering:

import System.IO
main = do
    hSetBuffering stdout NoBuffering
    getList >>= ...

Or by, say, printing each number on its own line

main = getList >>= return . doubleEachItem >>= mapM_ print


This the application of answer of luqui to original program from the question with minimum modification

main :: IO ()
main = return 1 >>= getList >>= return . doubleEachElement >>= putStrLn . show

getList doNotCache = return . take (10^10) $ repeat 15

doubleEachElement :: [Int] -> [Int]
doubleEachElement = foldr (++) [] . map (take 2 . repeat)

The changes are emphasized with bold.

As luqui has shown, getList was cached because it was constant expression. But we can make haskell to believe that it is not constant simply by providing dummy argument, dependent on input.


Your doubleEachItem implementation via foldr require whole list constructed in memory, because foldr beginning work from end of list. To avoid this, replace foldr to foldl' or rewrite function as:

doubleEachItem = concatMap (replicate 2)
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜