Reasoning laziness
I have the following snippet:
import qualified Data.Vector as V
import qualified Data.ByteString.Lazy as BL
import System.Environment
import Data.Word
import qualified Data.List.Stream as S
histogram :: [W开发者_运维百科ord8] -> V.Vector Int
histogram c = V.accum (+) (V.replicate 256 0) $ S.zip (map fromIntegral c) (S.repeat 1)
mkHistogram file = do
hist <- (histogram . BL.unpack) `fmap` BL.readFile file
print hist
I see it like this: Nothing is done until printing. When printing the thunks are unwinded by first unpacking, then mapping fromIntegral one Word8 at a time. Each of these word8's are zipped with 1, again one value at a time. This tuples are then taken by the accumulator function which updates the array, one tuple/Word8 at a time. Then we move to the next thunk and repeat until no more content left.
This would allow for creating histograms in constant memory, but alas this is not happening, but instead it crashes with stack overflow. If I try to profile it, I see it running to the end, but taking memory a lot (300-500 Mb for a 2.5 Mb file). Memory is obtained linearly until the end until it can be released, forming a "nice" triangular graph.
Where did my reasoning go wrong and what steps should I take to make this run in constant memory?
I believe the problem is that Data.Vector
is not strict in its elements. So although your reasoning is right, when accumulating the histogram your thunks looks like:
<1+(1+(1+0)) (1+(1+0)) 0 0 (1+(1+(1+(1+0)))) ... >
Rather than
<3 2 0 0 4 ...>
And only when you print are those sums computed. I don't see a strict accum
function in the docs (shame), and there isn't any place to hook in a seq
. One way out of this predicament may be to use Data.Vector.Unboxed
instead, since unboxed types are unlifted aka strict. Maybe you could request a strict accum
function with your example as a use case.
精彩评论