How to do fast prefix string matching in Scala
I'm using some Java code to do fast prefix lookups, using java.util.TreeSet, could I be using scala's TreeSet instead? Or a different solution?
/** A class that uses a TreeSet to d开发者_开发知识库o fast prefix matching
*/
class PrefixMatcher {
private val _set = new java.util.TreeSet[String]
def add(s: String) = _set.add(s)
def findMatches(prefix: String): List[String] = {
val matches = new ListBuffer[String]
val tailSet = _set.tailSet(prefix)
for ( tail <- tailSet.toArray ) {
val tailString = tail.asInstanceOf[String]
if ( tailString.startsWith(prefix) )
matches += tailString
else
return matches.toList
}
matches.toList
}
}
Use a Trie. Nobody's actually posted a Trie here yet, despite the fact that some people have posted sorted TreeMap data structures that they have misnamed as tries. Here is a fairly representative sample of a Trie implementation in Java. I don't know of any in Scala. See also an explanation of Tries on Wikipedia.
The from & takeWhile approach:
class PrefixMatcher {
private val _set = new TreeSet[String]
def add(s: String) = _set.add(s)
def findMatches(prefix: String): Iterable[String] =
_set from prefix takeWhile(_ startsWith prefix)
}
An alternative is to select a subset from prefix to prefix++ (the smallest string after the prefix). This selects only the range of the tree that actually starts with the given prefix. Filtering of entries is not necessary. The subSet method will create a view of the underlying set.
There's still some work (overflow and empty strings won't work) in the increment method but the intent should be clear.
class PrefixMatcher {
private val _set = new java.util.TreeSet[String]
def add(s: String) = _set.add(s)
def findMatches(prefix: String) : Set[String] = {
def inc(x : String) = { //ignores overflow
assert(x.length > 0)
val last = x.length - 1
(x take last) + (x(last) + 1).asInstanceOf[Char]
}
_set.subSet(prefix, inc(prefix))
}
}
The same works with the scala jcl.TreeSet#range implementation.
As I understand it, the Scala TreeSet is backed by the Java TreeSet, but using the Scala variant would allow you to shorten up the code using a sequence comprehension (http://www.scala-lang.org/node/111) giving you an implementation that looked something like (for Scala 2.7):
import scala.collection.jcl.TreeSet;
class PrefixMatcher
{
private val _set = new TreeSet[String]
def add(s: String) = _set.add(s)
def findMatches(prefix: String): Iterable[String] =
for (s <- _set.from(prefix) if s.startsWith(prefix)) yield s
}
object Main
{
def main(args: Array[String]): Unit =
{
val pm = new PrefixMatcher()
pm.add("fooBar")
pm.add("fooCow")
pm.add("barFoo")
pm.findMatches("foo").foreach(println)
}
}
Apologies for any bad Scala style on my part, I'm just getting used to the language myself.
I blogged about finding matches for a combination of prefixes a while ago. It's a harder problem, as you don't know when one prefix ends and the other begins. It might interest you. I'll even post below the code that I did not blog (yet, hopefully :), though it is stripped of all comments, none of which were made in English:
package com.blogspot.dcsobral.matcher.DFA
object DFA {
type Matched = List[(String, String)]
def words(s : String) = s.split("\\W").filter(! _.isEmpty).toList
}
import DFA._
import scala.runtime.RichString
class DFA {
private val initialState : State = new State(None, "")
private var currState : State = initialState
private var _input : RichString = ""
private var _badInput : RichString = ""
private var _accepted : Boolean = true
def accepted : Boolean = _accepted
def input : String = _input.reverse + _badInput.reverse
def transition(c : Char) : List[(String, Matched)] = {
if (c == '\b') backtrack
else {
if (accepted) {
val newState = currState(c)
newState match {
case Some(s) => _input = c + _input; currState = s
case None => _badInput = c + _badInput; _accepted = false
}
} else {
_badInput = c + _badInput
}
optionList
}
}
def transition(s : String) : List[(String, Matched)] = {
s foreach (c => transition(c))
optionList
}
def apply(c : Char) : List[(String, Matched)] = transition(c)
def apply(s : String) : List[(String,Matched)] = transition(s)
def backtrack : List[(String, Matched)] = {
if(_badInput isEmpty) {
_input = _input drop 1
currState.backtrack match {
case Some(s) => currState = s
case None =>
}
} else {
_badInput = _badInput drop 1
if (_badInput isEmpty) _accepted = true
}
optionList
}
def optionList : List[(String, Matched)] = if (accepted) currState.optionList else Nil
def possibleTransitions : Set[Char] = if (accepted) (currState possibleTransitions) else Set.empty
def reset : Unit = {
currState = initialState
_input = ""
_badInput = ""
_accepted = true
}
def addOption(s : String) : Unit = {
initialState addOption s
val saveInput = input
reset
transition(saveInput)
}
def removeOption(s : String) : Unit = {
initialState removeOption s
val saveInput = input
reset
transition(saveInput)
}
}
class State (val backtrack : Option[State],
val input : String) {
private var _options : List[PossibleMatch] = Nil
private val transitions : scala.collection.mutable.Map[Char, State] = scala.collection.mutable.Map.empty
private var _possibleTransitions : Set[Char] = Set.empty
private def computePossibleTransitions = {
if (! options.isEmpty)
_possibleTransitions = options map (_.possibleTransitions) reduceLeft (_++_)
else
_possibleTransitions = Set.empty
}
private def computeTransition(c : Char) : State = {
val newState = new State(Some(this), input + c)
options foreach (o => if (o.possibleTransitions contains c) (o computeTransition (newState, c)))
newState
}
def options : List[PossibleMatch] = _options
def optionList : List[(String, Matched)] = options map (pm => (pm.option, pm.bestMatch))
def possibleTransitions : Set[Char] = _possibleTransitions
def transition(c : Char) : Option[State] = {
val t = c.toLowerCase
if (possibleTransitions contains t) Some(transitions getOrElseUpdate (t, computeTransition(t))) else None
}
def apply(c : Char) : Option[State] = transition(c)
def addOption(option : String) : Unit = {
val w = words(option)
addOption(option, w.size, List(("", w.head)), w)
}
def addOption(option : String, priority : Int, matched : Matched, remaining : List[String]) : Unit = {
options find (_.option == option) match {
case Some(pM) =>
if (!pM.hasMatchOption(matched)) {
pM.addMatchOption(priority, matched, remaining)
if (priority < pM.priority) {
val (before, _ :: after) = options span (_ != pM)
val (highPriority, lowPriority) = before span (p => p.priority < priority ||
(p.priority == priority && p.option < option))
_options = highPriority ::: (pM :: lowPriority) ::: after
}
transitions foreach (t => pM computeTransition (t._2, t._1))
}
case None =>
val (highPriority, lowPriority) = options span (p => p.priority < priority ||
(p.priority == priority && p.option < option))
val newPM = new PossibleMatch(option, priority, matched, remaining)
_options = highPriority ::: (newPM :: lowPriority)
transitions foreach (t => newPM computeTransition (t._2, t._1))
}
computePossibleTransitions
}
def removeOption(option : String) : Unit = {
options find (_.option == option) match {
case Some(possibleMatch) =>
val (before, _ :: after) = options span (_ != possibleMatch)
(possibleMatch.possibleTransitions ** Set(transitions.keys.toList : _*)).toList foreach (t =>
transition(t).get.removeOption(option))
_options = before ::: after
computePossibleTransitions
case None =>
}
}
}
class PossibleMatch (val option : String,
thisPriority : Int,
matched : Matched,
remaining : List[String]) {
private var _priority = thisPriority
private var matchOptions = List(new MatchOption(priority, matched, remaining))
private var _possibleTransitions = matchOptions map (_.possibleTransitions) reduceLeft (_++_)
private def computePossibleTransitions = {
_possibleTransitions = matchOptions map (_.possibleTransitions) reduceLeft (_++_)
}
def priority : Int = _priority
def hasMatchOption(matched : Matched) : Boolean = matchOptions exists (_.matched == matched)
def addMatchOption(priority : Int, matched : Matched, remaining : List[String]) : Unit = {
if (priority < _priority) _priority = priority
val (highPriority, lowPriority) = matchOptions span (_.priority < priority)
val newMO = new MatchOption(priority, matched, remaining)
matchOptions = highPriority ::: (newMO :: lowPriority)
computePossibleTransitions
}
def bestMatch : Matched = matchOptions.head.matched.reverse.map(p => (p._1.reverse.toString, p._2)) :::
remaining.tail.map(w => ("", w))
def possibleTransitions : Set[Char] = _possibleTransitions
def computeTransition(s: State, c : Char) : Unit = {
def computeOptions(state : State,
c : Char,
priority : Int,
matched : Matched,
remaining : List[String]) : Unit = {
remaining match {
case w :: ws =>
if (!w.isEmpty && w(0).toLowerCase == c.toLowerCase) {
val newMatched = (w(0) + matched.head._1, matched.head._2.substring(1)) :: matched.tail
val newPriority = if (matched.head._1 isEmpty) (priority - 1) else priority
if (w.drop(1) isEmpty)
s.addOption(option, newPriority - 1, ("", ws.head) :: newMatched , ws)
else
s.addOption(option, newPriority, newMatched, w.substring(1) :: ws)
}
if (ws != Nil) computeOptions(s, c, priority, ("", ws.head) :: matched, ws)
case Nil =>
}
}
if(possibleTransitions contains c)
matchOptions foreach (mO => computeOptions(s, c, mO.priority, mO.matched, mO.remaining))
}
}
class MatchOption (val priority : Int,
val matched : Matched,
val remaining : List[String]) {
lazy val possibleTransitions : Set[Char] = Set( remaining map (_(0) toLowerCase) : _* )
}
It really needs some refactoring, though. I always do it when I'm start to explain it for the blog.
Ok, I just realized what you want is pretty much what a friend of mine suggested for another problem. So, here is his answer, simplified for your needs.
class PrefixMatcher {
// import scala.collection.Set // Scala 2.7 needs this -- and returns a gimped Set
private var set = new scala.collection.immutable.TreeSet[String]()
private def succ(s : String) = s.take(s.length - 1) + ((s.charAt(s.length - 1) + 1)).toChar
def add(s: String) = set += s
def findMatches(prefix: String): Set[String] =
if (prefix.isEmpty) set else set.range(prefix, succ(prefix))
}
精彩评论