Correct ReadP usage in Haskell

I wrote a very simple parser for lists of numbers in a file, using ReadP in Haskell. It works, but it is very slow. Is this normal behavior for this type of parser, or am I doing something wrong?

import Text.ParserCombinators.ReadP
import qualified Data.IntSet as IntSet
import Data.Char

setsReader :: ReadP [ IntSet.IntSet ]
setsReader = 
    setReader `sepBy` ( char '\n' )

innocentWhitespace :: ReadP ()
innocentWhitespace = 
    skipMany $ (char ' ') <++ (char '\t' )

setReader :: ReadP IntSet.IntSet
setReader =  do 
    innocentWhitespace
    int_list <- integerReader `sepBy1`  innocentWhitespace
    innocentWhitespace 
    return $ IntSet.fromList int_list

integerReader :: ReadP Int
integerReader = do
    digits <- many1 $ satisfy isDigit 
    return $ read digits

readClusters:: String -> IO [ IntSet.IntSet ]
readClusters filename = do 
    whole_file <- readFile filename 
    return $ ( fst . last ) $ readP_to_S setsReader whole_file 


setReader has exponential behavior, because it is allowing the whitespace between the numbers to be optional. So for the line:

12 34 56

It is seeing these parses:

[1,2,3,4,5,6]
[12,3,4,5,6]
[1,2,34,5,6]
[12,34,5,6]
[1,2,3,4,56]
[12,3,4,56]
[1,2,34,56]
[12,34,56]
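
The ambiguity can be checked directly. The following is a minimal sketch that reuses the question's optional-whitespace separator and non-greedy integer parser. The helper name ambiguousParses is hypothetical and not part of the original code.

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP

-- Optional whitespace, as in the question.
innocentWhitespace :: ReadP ()
innocentWhitespace = skipMany (char ' ' <++ char '\t')

-- Non-greedy integer parser, as in the question: on "12" it
-- succeeds with both 1 and 12.
integerReader :: ReadP Int
integerReader = read <$> many1 (satisfy isDigit)

-- Hypothetical helper: every complete parse of one line under
-- the ambiguous grammar.
ambiguousParses :: String -> [[Int]]
ambiguousParses line =
    [ xs
    | (xs, "") <- readP_to_S (integerReader `sepBy1` innocentWhitespace) line
    ]
```

Evaluating length (ambiguousParses "12 34 56") in GHCi should give 8, one complete parse for each of the lists above.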

You can see how this gets out of hand for long lines: the number of parses doubles for each pair of adjacent digits. ReadP returns all valid parses in order of increasing input consumed, so to reach the last parse you have to traverse all of the intermediate ones. Change:

int_list <- integerReader `sepBy1` innocentWhitespace

To:

int_list <- integerReader `sepBy1` mandatoryWhitespace

With a suitable definition of mandatoryWhitespace, this squashes the exponential behavior. The parsing strategy used by Parsec is more resistant to this kind of error because it is greedy: once it consumes input in a given branch, it is committed to that branch and never goes back (unless you explicitly ask it to). So once it has correctly parsed 12, it would never go back to parse 1 2. Of course, that means the order in which you state your choices matters, which I always find a bit of a pain to think about.
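
One possible definition of mandatoryWhitespace is sketched below, assuming that "whitespace" means spaces and tabs only, as in the question. It uses munch1, which consumes greedily and fails if nothing matches. The greedy integerReader and the name numbersOnLine are additions for illustration, not part of the original code.

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP

-- One or more spaces or tabs. munch1 takes the longest possible
-- run and fails on an empty one, so the separator is never optional.
mandatoryWhitespace :: ReadP ()
mandatoryWhitespace = () <$ munch1 (`elem` " \t")

-- Assumption: a greedy variant of the question's integerReader.
-- munch1 never stops in the middle of a run of digits.
integerReader :: ReadP Int
integerReader = read <$> munch1 isDigit

-- Hypothetical name: the numbers on one line, separated by at
-- least one space or tab.
numbersOnLine :: ReadP [Int]
numbersOnLine = integerReader `sepBy1` mandatoryWhitespace
```

With these definitions, readP_to_S numbersOnLine "12 34 56" should produce only three results, one per prefix of the list, and only [12,34,56] consumes the whole line.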

Also I would use:

head [ x | (x,"") <- readP_to_S setsReader whole_file ]

to extract a parse that consumes the whole file. This helps when the parser reaches the end of the input quickly but there are a huge number of ways to interpret that input. If you don't care about the ambiguity, you probably want the first complete parse rather than the last one, because the first one arrives sooner.
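
Note that head fails with an exception when no parse consumes the entire input. A minimal sketch of a safer variant follows. The helper name parseAll is hypothetical, and eof is added so that parses which stop early are rejected inside the parser itself.

```haskell
import Text.ParserCombinators.ReadP

-- Hypothetical helper: the first parse that consumes all of the
-- input, or Nothing if there is no such parse.
parseAll :: ReadP a -> String -> Maybe a
parseAll p input =
    case [ x | (x, "") <- readP_to_S (p <* eof) input ] of
        (x : _) -> Just x
        []      -> Nothing
```

In readClusters this would be used as parseAll setsReader whole_file, with the Nothing case handled by the caller.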
