开发者

How can I approximate Python's or operator for set comparison in Scala?

After hearing the latest Stack Overflow podcast, Peter Norvig's compact Python spell-checker intrigued me, so I decided to implement it in Scala if I could express it well in the functional Scala idiom, and also to see how many lines of code it would take.

Here's the whole problem. (Let's not compare lines of code yet.)

(Two notes: You can run this in the Scala interpreter, if you wish. If you need a copy of big.txt, or the whole project, it's on GitHub.)

import scala.io.Source

val alphabet = "abcdefghijklmnopqrstuvwxyz"

def train(text:String) = {
  "[a-z]+".r.findAllIn(text).foldLeft(Map[String, Int]() withDefaultValue 1)
    {(a, b) => a(b) = a(b) + 1}
}

val NWORDS = train(Source.fromFile("big.txt").getLines.mkString.toLowerCase)

def known(words:Set[String]) =
  {Set.empty ++ (for(w <- words if NWORDS contains w) yield w)}

def edits1(word:String) = {
  Set.em开发者_如何学JAVApty ++
  (for (i <- 0 until word.length) // Deletes
    yield (word take i) + (word drop (i + 1))) ++ 
  (for (i <- 0 until word.length - 1) // Transposes
    yield (word take i) + word(i + 1) + word(i) + (word drop (i + 2))) ++ 
  (for (i <- 0 until word.length; j <- alphabet) // Replaces
    yield (word take i) + j + (word drop (i+1))) ++ 
  (for (i <- 0 until word.length; j <- alphabet) // Inserts
    yield (word take i) + j + (word drop i))
}

def known_edits2(word:String) = {Set.empty ++ (for (e1 <- edits1(word);
  e2 <- edits1(e1) if NWORDS contains e2) yield e2)}

def correct(word:String) = {
  val options = Seq(() => known(Set(word)), () => known(edits1(word)),
    () => known_edits2(word), () => Set(word))
  val candidates = options.foldLeft(Set[String]())
    {(a, b) => if (a.isEmpty) b() else a}
  candidates.foldLeft("") {(a, b) => if (NWORDS(a) > NWORDS(b)) a else b}
}

Specifically, I'm wondering if there's anything cleaner I can do with the correct function. In the original Python, the implementation is a bit cleaner:

def correct(word):
    candidates = known([word]) or known(edits1(word)) or
      known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

Apparently in Python, an empty set will evaluate to Boolean False, so only the first of the candidates to return a non-empty set will be evaluated, saving potentially expensive calls to edits1 and known_edits2.

The only solution I would come up with is the version you see here, where the Seq of anonymous functions are called until one returns a non-empty Set, which the last one is guaranteed to do.

So experienced Scala-heads, is there a more syntactically concise or better way to do this? Thanks in advance!


I'm not sure why you're attempting to use lazy evaluation for known rather than simply using a stream as oxbow_lakes illustrated. A better way of doing what he did:

def correct(word: String) = {
  import Stream._

  val str = cons(known(Set(word)), 
            cons(known(edits1(word)),
            cons(known_edits2(word),
            cons(Set(word), empty))))

  str find { !_.isEmpty } match {
    case Some(candidates) =>
      candidates.foldLeft(Set[String]()) { (res, n) =>
        if (NWORDS(res) > NWORDS(n)) res else n
      }

    case None => Set()
  }
}

The exploits the fact that Stream.cons is lazy already and so we don't need to wrap everything up in a thunk.

If you're really in the mood for nice syntax though, we can add some syntactic sugar to all of those conses:

implicit def streamSyntax[A](tail: =>Stream[A]) = new {
  def #::(hd: A) = Stream.cons(hd, tail)
}

Now our previously-ugly str definition falls into the following:

def correct(word: String) = {
  val str = known(Set(word)) #:: known(edits1(word)) #::
            known_edits2(word) #:: Set(word) #:: Stream.empty

  ...
}


Would this work? The _ syntax is a partially applied function and by using a (lazy) Stream, I ensure that the evaluations in the reduceLeft (which I think is more appropriate than foldLeft here) only happen as required!

def correct(word:String) = {
  Stream(known(Set(word)) _, 
         known(edits1(word)) _, 
         known_edits2(word) _, 
         Set(word) _
         ).find( !_().isEmpty ) match {
    case Some(candidates) =>
      candidates.reduceLeft {(res, n) => if (NWORDS(res) > NWORDS(n)) res else n}
    case _ => "" //or some other value

}

I've probably made some syntax errors here, but I think the Stream approach is a valid one


Scala 2.7-ready (including implicit performance work-around):

class Or[A](one: Set[A]) {
  def or(other: => Set[A]): Set[A] = if (one.isEmpty) other else one 
}

implicit def toOr[A](one: Set[A]) = new Or(one)

def correct(word: String) = {
  candidates = known(Set(word)) or known(edits1(word)) or known_edits2(word) or Set(word)
  candidates.foldLeft("") {(a, b) => if (NWORDS(a) > NWORDS(b)) a else b}
}

Scala 2.8-goodness:

implicit def toOr[A](one: Set[A]) = new AnyRef { 
  def or(other: => Set[A]): Set[A] = if (one.isEmpty) other else one 
}

def correct(word: String) = {
  candidates = known(Set(word)) or known(edits1(word)) or known_edits2(word) or Set(word)
  candidates.max(Ordering.fromLessThan[String](NWORDS(_) < NWORDS(_)))
}

That said, I pretty much upvoted everyone else's. I hadn't consider a Stream.

EDIT

It seems Ordering.fromLessThan can lead to twice the necessary comparisons. Here is an alternate version for that line:

  candidates.max(new Ordering[String] { def compare(x: String, y: String) = NWORDS(x) - NWORDS(y) })


Iterators are also lazy (although not very functional since you can only iterate over them once.) So, you could do it like this:

  def correct(word: String) = {
    val sets = List[String => Set[String]](
      x => known(Set(x)), x => known(edits1(x)), known_edits2
    ).elements.map(_(word))

    sets find { !_.isEmpty } match {
      case Some(candidates: Set[String]) => candidates.reduceLeft { (res, n) => if (NWORDS(res) > NWORDS(n)) res else n }
      case None => word
    }
  }

As a bonus, Iterator's find() method doesn't force evaluation of the next element.


I've tried to implement a short Scala implementation of the spelling corrector. It's 15 lines without imports. The shortest replacement for Python's or is simple call by name parameter:

def or[T](candidates : Seq[T], other : => Seq[T]) = if(candidates.isEmpty) other else candidates
def candidates(word: String) = or(known(List(word)), or(known(edits1(word)), known(edits2(word))))

In a real world scenario I would use the implicit conversion Daniel proposed.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜